Apache Tika 2.3.0
The most notable changes in Tika 2.3.0 over the previous release are:
- Upgrade to Apache POI 5.2.0. This is the first upgrade to POI 5.x and represents a major refactoring. Users will experience significantly more logging from the POI parsers (TIKA-3164).
- Upgrade to log4j2 2.17.1 (TIKA-3638).
- Improve consistency in reporting package-entry divs across all parsers for embedded files (TIKA-3644). As a result, files with many embedded attachments yield somewhat more extracted text (the embedded file names).
- Improve configuration of maps as params for parsers in TikaConfig (TIKA-3645).
- Improve identification of iWorks 13 files and add parsing for thumbnails, some metadata and attachments (TIKA-3634). Skip handling of .iwa files, which are not yet supported.
- Limit the default in-memory processing (maxMainMemoryBytes) in the PDFParser to 512MB as in the 1.x branch (TIKA-3642).
- Add the IDML parser from the 1.x series to the 2.x series (TIKA-3188).
- Extract annotation types and subtypes for PDFs into metadata (TIKA-3653).
- Add metadata value for PDFs that contain 3D annotations (TIKA-3653).
- Add parser for Translation Memory eXchange (TMX) files (TIKA-3660).
- Add Bill of Materials (Maven BOM) for centralized module version management (TIKA-3667).
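To give a feel for what the 512MB maxMainMemoryBytes default (TIKA-3642) means in practice, here is a minimal, self-contained sketch of a memory cap: bytes up to the limit stay in main memory and the remainder would go to temporary storage. This is a simplified illustration only, not Tika's or PDFBox's actual buffering code; the class name MemoryLimitSketch and the method split are hypothetical and exist only for this example.

```java
// Hypothetical illustration of a main-memory cap; not part of Tika or PDFBox.
class MemoryLimitSketch {

    // 512 MB expressed in bytes, matching the default named in the notes above.
    static final long DEFAULT_MAX_MAIN_MEMORY_BYTES = 512L * 1024 * 1024;

    /**
     * Splits a payload size into the part kept in main memory and the
     * part that would spill to temporary storage once the cap is reached.
     * Returns a two-element array: {bytesInMemory, bytesSpilled}.
     */
    static long[] split(long totalBytes, long maxMainMemoryBytes) {
        if (totalBytes < 0 || maxMainMemoryBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        long inMemory = Math.min(totalBytes, maxMainMemoryBytes);
        long spilled = totalBytes - inMemory;
        return new long[] {inMemory, spilled};
    }

    public static void main(String[] args) {
        long[] small = split(1024, DEFAULT_MAX_MAIN_MEMORY_BYTES);
        long[] large = split(DEFAULT_MAX_MAIN_MEMORY_BYTES + 10, DEFAULT_MAX_MAIN_MEMORY_BYTES);
        System.out.println("small: inMemory=" + small[0] + " spilled=" + small[1]);
        System.out.println("large: inMemory=" + large[0] + " spilled=" + large[1]);
    }
}
```

Users who need a different cap should consult the PDFParser configuration documentation for how to set maxMainMemoryBytes in their own deployment.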
The following people have contributed to Tika 2.3.0 by submitting or commenting on the issues resolved in this release:
- Bernhard Geisberger
- Carina Antunes
- Aman Mishra
- Aravinth
- Dave Meikle
- Dmitrii Kriukov
- Josh Burchard
- Kaka Lee
- Lewis John McGibbney
- Sergen Bağ
- Subhajit Das
- Tim Allison
See https://s.apache.org/syxl5 for more details on these contributions.