“Table 2.13. Common file formats

| File format | MIME type | Open source library | Remarks |
|---|---|---|---|
| Text | text/plain | Built-in | |
| Microsoft Office (Word, PowerPoint, Excel) | application/msword, application/vnd.ms-excel | Apache POI, Open Office, textmining.org | textmining.org is for MS Word only. |
| Adobe Portable Document Format (PDF) | application/pdf | PDFBox | Text can’t be extracted from image-based PDFs without first using optical character recognition. |
| Rich Text Format (RTF) | application/rtf | Built-in to Java using RTFEditorKit | |
| HTML | text/html | JTidy, CyberNeko, many others | |
| XML | text/xml | Many XML libraries available (Apache Xerces is popular) | Most applications should use SAX-based parsing instead of DOM-based to avoid creating duplicate data structures. |
”
This table is from Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris, “Taming Text,” iBooks.
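
The table's remark on XML recommends SAX-based parsing over DOM-based parsing. As a rough illustration of what that can look like, here is a minimal sketch that pulls the character data out of an XML string using the SAX parser bundled with the JDK. The class name `SaxTextExtractor`, the method `extractText`, and the choice to separate element contents with a single space are assumptions made for this example; they are not taken from the book.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, named for this example only.
public class SaxTextExtractor {

    /** Collects character data as the parser streams through the document. */
    static class TextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so always append.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Assumption: a space between elements is an acceptable separator.
            text.append(' ');
        }

        String getText() {
            // Trim the ends and collapse runs of whitespace into single spaces.
            return text.toString().trim().replaceAll("\\s+", " ");
        }
    }

    /** Returns the text content of the given XML, without building a DOM tree. */
    public static String extractText(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TextHandler handler = new TextHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><title>Taming Text</title><body>Extract me.</body></doc>";
        System.out.println(extractText(xml));
    }
}
```

The handler keeps only a `StringBuilder` of the text seen so far, so memory use is tied to the extracted text and not to a full in-memory copy of the document tree, which is the duplication the remark warns about.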