Apache Tika

Apache Tika 1.23

The most notable changes in Tika 1.23 over the previous release are:

NOTE: The PDFParser now relies on OCRDPI to render page images when users configure OCR on rendered page images. This increases the rendered image size (TIKA-2624).
NOTE: tika-server no longer returns 415 for file types for which there is no parser.
NOTE: tika-server's /rmeta endpoint now returns 200 if there is a parse exception to align its behavior with tika-app in batch mode. The stacktrace is stored as a metadata value.
Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002).
Fix incorrect height and width metadata extraction from JPEG images (TIKA-2630).
Upgrade to POI 4.1.1 (TIKA-2851).
Upgrade to PDFBox 2.0.17 (TIKA-2951).
Ensure that the PDFParser respects custom configuration of Tesseract from tika-config.xml via Eric Pugh (TIKA-2970).
Add parser for XLIFF v1.2 files (TIKA-2975).
Add mime type detection support for WebAssembly (TIKA-2894), HEIF / HEIC images (TIKA-2942), Digilite FDF (TIKA-2988); and xml-root detection for XFDF (TIKA-2990) and XDP (TIKA-2989).
Add an XLZ Parser (TIKA-2976).
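
Because tika-server's /rmeta endpoint now returns 200 even when parsing fails, clients can no longer rely on the HTTP status alone and may need to inspect the returned metadata. The following is a minimal sketch of such a check, not an official client. It assumes the response body is a JSON array of metadata objects and that exception stacktraces are stored under keys beginning with "X-TIKA:EXCEPTION:"; that prefix and the function name find_parse_exceptions are assumptions for illustration and should be verified against your server's actual output.

```python
import json

# Assumption: stacktraces appear under metadata keys starting with this prefix
# (for example "X-TIKA:EXCEPTION:runtime"). Verify against your server.
EXCEPTION_KEY_PREFIX = "X-TIKA:EXCEPTION:"


def find_parse_exceptions(rmeta_body):
    """Return (document_index, key, stacktrace) for each exception entry.

    rmeta_body is the raw JSON text of an /rmeta response, assumed to be
    a list with one metadata object per container or embedded document.
    """
    documents = json.loads(rmeta_body)
    found = []
    for index, metadata in enumerate(documents):
        for key, value in metadata.items():
            if not key.startswith(EXCEPTION_KEY_PREFIX):
                continue
            # A metadata value may be a single string or a list of strings.
            values = value if isinstance(value, list) else [value]
            for stacktrace in values:
                found.append((index, key, stacktrace))
    return found
```

A caller would pass the response text to find_parse_exceptions and treat a non-empty result as a failed or partially failed parse, even though the HTTP status was 200.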

The following people have contributed to Tika 1.23 by submitting or commenting on the issues resolved in this release:

See https://s.apache.org/asrx3 for more details on these contributions.