Apache Tika 1.14
The most notable changes in Tika 1.14 over the previous release are:
- Extract all headers from MSG/RFC822 (TIKA-2122).
- 9.1 (TIKA-2113).
- Extract PDF DocInfo metadata into separate keys to prevent overwriting by XMP metadata (TIKA-2057).
- Re-enable fileUrl for tika-server (TIKA-2081). If you choose to use this feature, beware of the security vulnerabilities! See: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3271
- Add Tesseract's hOCR output format as an option, via Eric Pugh (TIKA-2093).
- Extract macros from MSOffice files (TIKA-2069).
- Maintain passed-in mime in TXTParser (TIKA-2047).
- Upgrade to POI 3.15 (TIKA-2013).
- 0.3 (TIKA-2051).
- Fix hyperlinks with formatting in DOC and DOCX (TIKA-1255 and TIKA-2078).
- Tika is now integrated with the TensorFlow library from Google and can use its Inception v3 image classification model to identify objects in images (TIKA-1993).
- Parser configuration is now type-safe, and parser parameters can have assigned types (TIKA-1508, TIKA-1986).
- Prevent OOM/permanent hang on some corrupt CHM files (TIKA-2040).
- Upgrade ICU4J charset detection components to fix multithreading bug (TIKA-2041).
- 1.4 (TIKA-2039).
- Maintain more significant digits in cells of "General" format in XLS and XLSX (TIKA-2025).
- Avoid mark/reset issues when extracting or detecting embedded resources in RFC822 emails (TIKA-2037).
- Improve accuracy of Tesseract for better extraction of numeric and alphanumeric text from images (TIKA-2021, TIKA-2031).
- Improve extraction of embedded documents from PPT, PPTX and XLSX (TIKA-2026).
- Add parser for applefile (AppleSingle) (TIKA-2022).
- Add mime types, mime magic and/or globs for:
  - Endnote Import File (TIKA-2011)
  - DJVU files (TIKA-2009)
  - MS Owner File (TIKA-2008)
  - Windows Media Metafile (TIKA-2004)
  - iCal and vCalendar (TIKA-2006)
  - MBOX (TIKA-2042)
  - Stata DTA (TIKA-2064)
- Add configurable maximum threshold for number of events extracted from the XMP Media Management Schema in JempboxExtractor (TIKA-1999).
- Integrate TesseractOCR with full page image rendering for PDFs (TIKA-1994).
- Add mime detection (via Nick C) and a parser for DBF files (TIKA-1513).
- Add mime detection and parsers for MSOffice 2003 XML Word and Excel formats (TIKA-1958).
- Extract hyperlinks from PPT, PPTX, XLSX (TIKA-1454).
The following people have contributed to Tika 1.14 by submitting or commenting on the issues resolved in this release:
- Aeham Abushwashi
- Alan Hunter
- Alexander Kazakov
- Chris A. Mattmann
- Chris Knott
- Egbert
- Eli Trucco
- Eric Pugh
- Jean Coudon
- Jeff Swindle
- John Dougrez-Lewis
- John Haynes
- Joseph Naegele
- Josh Cummings
- Ken Krugler
- Kukushkin Alexander
- Lewis John McGibbney
- Luis Filipe Nassif
- Matthias Pigulla
- Nam-Quang Tran
- Nilay Chheda
- Philipp Steinkrueger
- Sara Miller
- Sebastian Iturra
- Thamme Gowda
- Tilman Hausherr
- Tim Allison
- Tim Barrett
- Vjeran Marcinko
- Yahav Amsalem
- Zarana Parekh
See https://s.apache.org/TRWa for more details on these contributions.