Apache Tika 2.5.0
The most notable changes in Tika 2.5.0 over the previous release are:
- Improved extraction of PDF subset info for PDF/UA, PDF/VT, and PDF/X. NOTE: we no longer append PDF/A information, e.g. 'version="A-1b"', to the 'dc:format'. Users must now get that information from the 'pdfa:PDFVersion' key or from 'pdfaid:conformance' and 'pdfaid:part' (TIKA-3844).
- Avoid infinite loop in bookmark extraction from PDFs (TIKA-3832).
- Update to slf4j 2.0.1 (TIKA-3842).
- Added upsert option for the OpenSearch emitter (TIKA-3855).
- Extract PDF signature information at the document level into the metadata (TIKA-3852).
- Enable configuration of digests via AutoDetectParserConfig (TIKA-3853).
- Use commons-io byte array streams via PJ Fanning (TIKA-3843).
- Upgrade to PDFBox 2.0.27 (TIKA-3866).
- Upgrade to JempBox 1.8.17 (TIKA-3856).
- Add extraction of ODF version from ODF files (TIKA-3840).
- tika-parser-html-commons (BoilerPipeHandler) is no longer a dependency of tika-parser-html-module. tika-app and tika-server-standard have added a dependency on tika-parser-html-commons. However, users who manage custom dependencies and who want the BoilerPipeHandler must now include the tika-parser-html-commons dependency themselves (TIKA-1484).
- Add unrar as an optional parser (TIKA-3800).
- Refactor FuzzingCLI to use PipesParser (TIKA-3799).
- ServiceLoader's loadServiceProviders() now guarantees unique classes (TIKA-3797).
- Fix bug that prevented setting of includeHeadersAndFooters for xls, xlsx, doc and docx via tika-config (TIKA-3796).
- Fix bug that prevented specification of rendered image type via http header in the PDFParser (TIKA-3794).
- Fix bug causing some Exif dates to be decoded incorrectly in time zones other than UTC (TIKA-3815).
- Numerous dependency upgrades (TIKA-3795).
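
For users affected by the PDF/A change (TIKA-3844), the lookup can be sketched roughly as follows. This is a minimal illustration, not Tika code: the class and method names are hypothetical, a plain Map stands in for Tika's Metadata object, and rebuilding a value such as "A-1b" from 'pdfaid:part' and 'pdfaid:conformance' is an assumption about how the two keys combine.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: resolves a PDF/A version string from extracted metadata.
// The Map stands in for Tika's Metadata object (an assumption for illustration).
final class PdfaVersionLookup {

    private PdfaVersionLookup() {
    }

    static Optional<String> pdfaVersion(Map<String, String> metadata) {
        // Prefer the dedicated key named in the release note.
        String version = metadata.get("pdfa:PDFVersion");
        if (version != null && !version.isBlank()) {
            return Optional.of(version.trim());
        }
        // Otherwise fall back to the part/conformance pair.
        String part = metadata.get("pdfaid:part");
        if (part == null || part.isBlank()) {
            return Optional.empty();
        }
        String conformance = metadata.get("pdfaid:conformance");
        // Assumption: part "1" plus conformance "B" corresponds to "A-1b".
        String suffix = conformance == null
                ? ""
                : conformance.trim().toLowerCase(Locale.ROOT);
        return Optional.of("A-" + part.trim() + suffix);
    }
}
```

In a real application the map could be filled by iterating over Metadata.names() and calling Metadata.get(name) after parsing. Code that previously parsed the version out of 'dc:format' would call a helper like this instead.
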
The following people have contributed to Tika 2.5.0 by submitting or commenting on the issues resolved in this release:
- Aurélien Marocco
- Ben Gilbert
- Eduardas Kazakas
- Eugen Caruntu
- Giorgiana Ciobanu
- Lakatos Gyula
- Luís Filipe Nassif
- Nicholas DiPiazza
- PJ Fanning
- Robin Schimpf
- Tilman Hausherr
- Tim Allison
- Yurii
See https://s.apache.org/j2sms for more details on these contributions.