
Saturday, November 9, 2013

file open library

“Table 2.13. Common file formats

File format: Text
MIME type: text/plain
Open source library: Built-in

File format: Microsoft Office (Word, PowerPoint, Excel)
MIME type: application/msword, application/vnd.ms-excel
Open source library: Apache POI, OpenOffice, textmining.org
Remarks: textmining.org is for MS Word only.

File format: Adobe Portable Document Format (PDF)
MIME type: application/pdf
Open source library: PDFBox
Remarks: Text can’t be extracted from image-based PDFs without first using optical character recognition.

File format: Rich Text Format (RTF)
MIME type: application/rtf
Open source library: Built-in to Java using RTFEditorKit

File format: HTML
MIME type: text/html
Open source library: JTidy, CyberNeko, many others

File format: XML
MIME type: text/xml
Open source library: Many XML libraries available (Apache Xerces is popular)
Remarks: Most applications should use SAX-based parsing instead of DOM-based to avoid creating duplicate data structures.”

This excerpt is from Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris, “Taming Text” (iBooks).
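
Two rows of the table need nothing beyond the JDK, so they are easy to try out. The following is only a minimal sketch, not code from the book: it pulls plain text out of RTF with RTFEditorKit and out of XML with a SAX handler, which streams character data instead of building a DOM tree. The class name TextExtractors and the method names rtfToText and xmlToText are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class for illustration; the names are not from the book.
public class TextExtractors {

    // RTF to plain text using the RTFEditorKit that ships with the JDK.
    public static String rtfToText(String rtf) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        // RTF is a 7-bit/8-bit format, so a single-byte charset is enough here.
        byte[] bytes = rtf.getBytes(StandardCharsets.ISO_8859_1);
        kit.read(new ByteArrayInputStream(bytes), doc, 0);
        return doc.getText(0, doc.getLength());
    }

    // XML to plain text using SAX: character data is appended as it streams
    // past, so no second in-memory copy of the document (a DOM tree) is built.
    public static String xmlToText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rtfToText("{\\rtf1\\ansi Hello, RTF.}").trim());
        System.out.println(xmlToText("<doc><title>Taming</title> <body>Text</body></doc>"));
    }
}
```

The other rows (Microsoft Office, PDF, HTML) need the third-party libraries named in the table, so they are left out of this sketch.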
