Configuring Tika

Out of the box, Apache Tika will attempt to start with all available Detectors and Parsers, running with sensible defaults. For most users, this default configuration will work well.

This page gives you information on how to configure the various components of Apache Tika, such as Parsers and Detectors, if you need fine-grained control over ordering, exclusions and the like.

Configuring Parsers

Through the Tika Config XML, you have a high degree of control over which parsers are or aren't used, and in what order of preference. It is also possible to override just certain parts, for example to have "default except for PDF".

Currently, only a single parser can run against a document. There is ongoing discussion around fallback parsers and combining the output of multiple parsers running on a document, but none of this is available yet.

To override only some of the default parser behaviour, include the DefaultParser in your configuration, with excludes, then add your other parser definitions. To prevent the DefaultParser (with its auto-discovery) from being used, simply omit it from your config and list all the parsers you want instead.

To override just some default behaviour, you can use a Tika Config something like this:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for 2 mime types, and never
         use the Executable Parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
    <!-- Use a different parser for PDF -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
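To make the structure of the config above easier to see, here is a minimal sketch that reads such a file with the JDK's own DOM parser and prints, for each parser entry, its mime, mime-exclude and parser-exclude children. ParserConfigSummary and its methods are hypothetical names used only for illustration. They are not part of Tika, and this is not how Tika itself loads its configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: summarises the <parser> entries of a config.
class ParserConfigSummary {

    /** Returns one line per <parser> element, listing its includes and excludes. */
    static List<String> summarize(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> lines = new ArrayList<>();
            NodeList parsers = doc.getElementsByTagName("parser");
            for (int i = 0; i < parsers.getLength(); i++) {
                Element parser = (Element) parsers.item(i);
                lines.add(parser.getAttribute("class")
                        + " mime=" + texts(parser, "mime")
                        + " mime-exclude=" + texts(parser, "mime-exclude")
                        + " parser-exclude=" + classes(parser, "parser-exclude"));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable Tika Config XML", e);
        }
    }

    /** Trimmed text content of every descendant element with the given tag name. */
    private static List<String> texts(Element parent, String tag) {
        List<String> values = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    /** The class attribute of every descendant element with the given tag name. */
    private static List<String> classes(Element parent, String tag) {
        List<String> values = new ArrayList<>();
        NodeList nodes = parent.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(((Element) nodes.item(i)).getAttribute("class"));
        }
        return values;
    }

    public static void main(String[] args) {
        // The same configuration as in the example above, inlined as a string.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
                + "<mime-exclude>image/jpeg</mime-exclude>"
                + "<mime-exclude>application/pdf</mime-exclude>"
                + "<parser-exclude class=\"org.apache.tika.parser.executable.ExecutableParser\"/>"
                + "</parser>"
                + "<parser class=\"org.apache.tika.parser.EmptyParser\">"
                + "<mime>application/pdf</mime>"
                + "</parser>"
                + "</parsers></properties>";
        for (String line : summarize(xml)) {
            System.out.println(line);
        }
    }
}
```

Running the summary over a config before deploying it can be a quick way to check that each exclusion sits under the parser entry you intended.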

To configure things in code, the key classes to use to build up your own custom parser hierarchy are org.apache.tika.parser.DefaultParser, org.apache.tika.parser.CompositeParser and org.apache.tika.parser.ParserDecorator.

Configuring Detectors

Through the Tika Config XML, you have a high degree of control over which detectors are or aren't used, and in what order of preference. It is also possible to override just certain parts, for example to have "default except for no POIFS Container Detection".

To override only some of the default detector behaviour, include the DefaultDetector in your configuration, with any detector-exclude entries you need, then add your other detector definitions. To prevent the DefaultDetector (with its auto-discovery) from being used, simply omit it from your config and list all the detectors you want instead.

To override just some default behaviour, you can use a Tika Config something like this:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <!-- All detectors except built-in container ones -->
    <detector class="org.apache.tika.detect.DefaultDetector">
      <detector-exclude class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
      <detector-exclude class="org.apache.tika.parser.microsoft.POIFSContainerDetector"/>
    </detector>
  </detectors>
</properties>

Or, to use only certain detectors, you can use a Tika Config something like this:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <detectors>
    <!-- Only use these two detectors, and ignore all others -->
    <detector class="org.apache.tika.parser.pkg.ZipContainerDetector"/>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
</properties>
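The difference between the two detector configurations above can be expressed as a small model: with the DefaultDetector present you get everything auto-discovery found minus the excluded classes, and without it you get only what you list. The sketch below is a toy illustration under stated assumptions. DetectorSelection, effectiveDetectors and the "discovered" list are hypothetical, the real discovery result depends on the JARs available, and the ordering used here (default detectors first, then explicit ones) is an assumption rather than documented Tika behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not part of Tika: which detector class names end up active.
class DetectorSelection {

    /**
     * @param discovered     class names found by auto-discovery (assumed input)
     * @param includeDefault whether the config lists the DefaultDetector
     * @param excluded       detector-exclude entries under the DefaultDetector
     * @param explicit       detectors listed directly in the config
     */
    static List<String> effectiveDetectors(List<String> discovered, boolean includeDefault,
                                           Set<String> excluded, List<String> explicit) {
        List<String> result = new ArrayList<>();
        if (includeDefault) {
            for (String name : discovered) {
                if (!excluded.contains(name)) {
                    result.add(name);
                }
            }
        }
        for (String name : explicit) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String zip = "org.apache.tika.parser.pkg.ZipContainerDetector";
        String poifs = "org.apache.tika.parser.microsoft.POIFSContainerDetector";
        String mime = "org.apache.tika.mime.MimeTypes";
        // Assumed discovery result, for illustration only.
        List<String> discovered = Arrays.asList(zip, poifs, mime);

        // First config: DefaultDetector with two detector-exclude entries.
        Set<String> excluded = new HashSet<>(Arrays.asList(zip, poifs));
        System.out.println(effectiveDetectors(
                discovered, true, excluded, Collections.<String>emptyList()));

        // Second config: no DefaultDetector, only two explicitly listed detectors.
        System.out.println(effectiveDetectors(
                discovered, false, Collections.<String>emptySet(), Arrays.asList(zip, mime)));
    }
}
```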

In code, the key classes to use to build up your own custom detector hierarchy are org.apache.tika.detect.DefaultDetector and org.apache.tika.detect.CompositeDetector.

Configuring Mime Types

TODO Mention non-standard paths, and custom mime type files

Configuring Language Identifiers

At this time, there is no unified way to configure language identifiers. While the work on that is ongoing, for now you will need to review the Tika Javadocs to see how individual identifiers are configured.

Configuring Translators

At this time, there is no unified way to configure Translators. While the work on that is ongoing, for now you will need to review the Tika Javadocs to see how individual Translators are configured.

Configuring the Service Loader

Tika has a number of service provider types such as parsers, detectors, and translators. The org.apache.tika.config.ServiceLoader class provides a registry of each type of provider. This allows Tika to create implementations such as org.apache.tika.parser.DefaultParser, org.apache.tika.language.translate.DefaultTranslator, and org.apache.tika.detect.DefaultDetector that can match the appropriate provider to an incoming piece of content.

The ServiceLoader's registry can be populated either statically or dynamically.

Static

Static loading is the default and requires no configuration. This option is used in Tika deployments where the Tika JAR files reside together in the same classloader hierarchy. The service providers are loaded from provider configuration files located within the tika-parsers JAR file at META-INF/services.
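For readers unfamiliar with provider configuration files, the sketch below reads one using the conventions of the JDK's java.util.ServiceLoader format: one fully qualified class name per line, with "#" starting a comment. It assumes the files under META-INF/services follow that format. ProviderFileReader is a hypothetical name, the sample content is made up, and this is not Tika's own loading code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: extracts class names from a
// provider configuration file in the java.util.ServiceLoader format.
class ProviderFileReader {

    /** Returns the provider class names, skipping comments and blank lines. */
    static List<String> providerClassNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop everything after '#'
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up sample content, for illustration only.
        String sample = "# Parsers\n"
                + "org.apache.tika.parser.pdf.PDFParser\n"
                + "\n"
                + "org.apache.tika.parser.txt.TXTParser  # plain text\n";
        for (String name : providerClassNames(sample)) {
            System.out.println(name);
        }
    }
}
```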

Dynamic

Dynamic loading may be required if the Tika service providers reside in different classloaders, such as in OSGi. To allow a provider created in tika-config.xml to use dynamically loaded services, configure the ServiceLoader to be dynamic with the following configuration:

<properties>
  <service-loader dynamic="true"/>
  ....
</properties>

Load Error Handling

The ServiceLoader can contain a handler to deal with errors that occur during provider initialization. For example, if a class fails to initialize, the LoadErrorHandler deals with the exception that is thrown. This handler can be configured to:

  • IGNORE - (Default) Do nothing when providers fail to initialize.
  • WARN - Log a warning when providers fail to initialize.
  • THROW - Throw an exception when providers fail to initialize.

For example, to set the LoadErrorHandler to WARN, use the following configuration:

<properties>
  <service-loader loadErrorHandler="WARN"/>
  ....
</properties>
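The three handler settings can be pictured with a small model. The sketch below is a simplified stand-in written for illustration, not Tika's LoadErrorHandler. The names LoadErrorPolicyDemo, Policy and handle are hypothetical, a plain list stands in for a logger, and the choice of IllegalStateException for THROW is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the three settings; not Tika's LoadErrorHandler.
class LoadErrorPolicyDemo {

    enum Policy { IGNORE, WARN, THROW }

    /**
     * Reacts to a provider that failed to initialize.
     * The "log" list stands in for a real logger.
     */
    static void handle(Policy policy, String className, Throwable cause, List<String> log) {
        switch (policy) {
            case IGNORE:
                // Do nothing: the failed provider is skipped silently.
                break;
            case WARN:
                log.add("Unable to load " + className + ": " + cause);
                break;
            case THROW:
                throw new IllegalStateException("Unable to load " + className, cause);
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Throwable failure = new NoClassDefFoundError("missing dependency");
        // com.example.BrokenParser is a made-up provider name.
        handle(Policy.IGNORE, "com.example.BrokenParser", failure, log);
        handle(Policy.WARN, "com.example.BrokenParser", failure, log);
        System.out.println("Warnings logged: " + log.size());
        try {
            handle(Policy.THROW, "com.example.BrokenParser", failure, log);
        } catch (IllegalStateException e) {
            System.out.println("THROW surfaced: " + e.getMessage());
        }
    }
}
```

As a rule of thumb, WARN or THROW may be worth enabling while debugging a deployment in which an expected parser is silently missing.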

Using a Tika Configuration XML file

However you call Tika, the System Property of tika.config is checked first, and the Environment Variable of TIKA_CONFIG is tried next. Setting one of those will cause Tika to use your given Tika Config XML file.
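The lookup order described above can be sketched as follows. TikaConfigLocator and locate are hypothetical names, not Tika API, and treating a blank value as "not set" is an assumption made for this sketch. The property and variable names are the ones given in the text.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

// Hypothetical helper, not part of Tika: models the documented lookup order.
class TikaConfigLocator {

    /** Returns the config location: the system property first, then the environment variable. */
    static Optional<String> locate(Properties systemProperties, Map<String, String> environment) {
        String fromProperty = systemProperties.getProperty("tika.config");
        if (fromProperty != null && !fromProperty.trim().isEmpty()) {
            return Optional.of(fromProperty);
        }
        String fromEnvironment = environment.get("TIKA_CONFIG");
        if (fromEnvironment != null && !fromEnvironment.trim().isEmpty()) {
            return Optional.of(fromEnvironment);
        }
        return Optional.empty(); // neither is set, so the defaults apply
    }

    public static void main(String[] args) {
        Optional<String> location = locate(System.getProperties(), System.getenv());
        System.out.println(location.orElse("(no config set, using defaults)"));
    }
}
```

For example, starting the JVM with -Dtika.config=/path/to/tika-config.xml would take precedence over a TIKA_CONFIG value exported in the shell.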

If you are calling Tika from your own code, then you can pass in the location of your Tika Config XML file when you construct your TikaConfig instance. From that, you can fetch your configured parsers, detectors, etc.

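// Load the configuration from an explicit file path
TikaConfig config = new TikaConfig("/path/to/tika-config.xml");
// Fetch the detector built from that configuration
Detector detector = config.getDetector();
// Build an auto-detecting parser that uses the same configuration
Parser autoDetectParser = new AutoDetectParser(config);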

For users of the Tika App, in addition to the system property and the environment variable, you can also use the --config=[tika-config.xml] option to select a different Tika Config XML file to use.

For users of the Tika Server, in addition to the system property and the environment variable, you can also use the -c [tika-config.xml] or --config [tika-config.xml] options to select a different Tika Config XML file to use.