The language models are built with the latest versions of the redirects, disambiguations, and instance-types artifacts, downloaded from the DBpedia Databus. Catalan, Finnish, Lithuanian, and Romanian were added to the list of languages for which models can be created.
This tool now uses the wikistatsextractor by the great folks over at DiffBot. This means: no more Hadoop and Pig! Building the biggest model (English) takes around 2h on a single machine with around 32GB of RAM. We recommend running this script on an SSD with around 100GB of free space.
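Since the build is recommended to run with around 100GB of free space, a small pre-flight check can save a failed run. The following is only a sketch: `has_free_gb` is a hypothetical helper that is not part of this repository, and it assumes a POSIX `df` and `awk`. It checks disk space only, because there is no equally portable way to check RAM.

```shell
#!/bin/sh
# Hypothetical helper (not part of model-quickstarter):
# succeeds if the filesystem holding DIR has at least MIN_GB gigabytes available.
has_free_gb() {
    dir=$1
    min_gb=$2
    # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # The fourth column of the second line is the available space in KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Check the current directory against the recommended 100GB.
if has_free_gb . 100; then
    echo "enough free space for the build"
else
    echo "less than 100GB free in $(pwd); the English model may not fit" >&2
fi
```

Run it from the directory that will hold `wdir`, `data` and `models`, since that is where the space is consumed.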
- Git
- Maven 3
You can use this tool for creating models of DBpedia Spotlight in your language.
- `docker run -it dbpediaspotlight/model-quickstarter bash`
Generate the models outside the container
- If you want to generate the models outside the container, just map volumes for the folders `/model-quickstarter/wdir`, `/model-quickstarter/data` and `/model-quickstarter/models`, e.g. `docker run -v /home/user/data/model/wdir:/model-quickstarter/wdir -v /home/user/data/model/data:/model-quickstarter/data -v /home/user/data/model/models:/model-quickstarter/models -it dbpediaspotlight/model-quickstarter bash`
- `cd model-quickstarter/`
- Copy & paste one of the following commands to begin creating the model for the corresponding language.
Language | Language code | Locator code | Analyzer+Stemmer language prefix | Command |
---|---|---|---|---|
Catalan | ca | ES | Catalan | ./index_db.sh wdir ca_ES ca/stopwords.list Catalan models/ca |
Danish | da | DK | Danish | ./index_db.sh wdir da_DK da/stopwords.list Danish models/da |
German | de | DE | German | ./index_db.sh -b de/ignore.list wdir de_DE de/stopwords.list German models/de |
English | en | US | English | ./index_db.sh -b en/ignore.list wdir en_US en/stopwords.list English models/en |
Spanish | es | ES | Spanish | ./index_db.sh -b es/ignore.list wdir es_ES es/stopwords.list Spanish models/es |
Finnish | fi | FI | Finnish | ./index_db.sh wdir fi_FI fi/stopwords.list Finnish models/fi |
French | fr | FR | French | ./index_db.sh -b fr/ignore.list wdir fr_FR fr/stopwords.list French models/fr |
Hungarian | hu | HU | Hungarian | ./index_db.sh wdir hu_HU hu/stopwords.list Hungarian models/hu |
Italian | it | IT | Italian | ./index_db.sh wdir it_IT it/stopwords.list Italian models/it |
Lithuanian | lt | LT | Lithuanian | ./index_db.sh wdir lt_LT lt/stopwords.list Lithuanian models/lt |
Dutch | nl | NL | Dutch | ./index_db.sh -b nl/ignore.list wdir nl_NL nl/stopwords.list Dutch models/nl |
Norwegian | no | NO | Norwegian | ./index_db.sh -b no/ignore.list wdir no_NO no/stopwords.list Norwegian models/no |
Portuguese | pt | BR | Portuguese | ./index_db.sh -b pt/ignore.list wdir pt_BR pt/stopwords.list Portuguese models/pt |
Romanian | ro | RO | Romanian | ./index_db.sh wdir ro_RO ro/stopwords.list Romanian models/ro |
Russian | ru | RU | Russian | ./index_db.sh wdir ru_RU ru/stopwords.list Russian models/ru |
Swedish | sv | SE | Swedish | ./index_db.sh -b sv/ignore.list wdir sv_SE sv/stopwords.list Swedish models/sv |
Turkish | tr | TR | Turkish | ./index_db.sh -b tr/ignore.list wdir tr_TR tr/stopwords.list Turkish models/tr |
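Every command in the table follows the same pattern: the working directory `wdir`, the language and locator codes joined as `xx_YY`, the stopword list `xx/stopwords.list`, the analyzer name, and the output folder `models/xx`, with `-b xx/ignore.list` added for some languages. As a rough sketch of that pattern, the hypothetical helper `index_cmd` below (not part of this repository) prints the command for a given row without running it. It only mirrors the rows above; it does not check that the stopword or ignore lists exist.

```shell
#!/bin/sh
# Hypothetical helper (not part of model-quickstarter):
# print the index_db.sh command for one language, following the table's pattern.
# Arguments: language code, locator code, analyzer name, and optionally "yes"
# when the row passes -b <code>/ignore.list.
index_cmd() {
    lang=$1
    locator=$2
    analyzer=$3
    use_ignore=${4:-no}
    ignore_opt=""
    if [ "$use_ignore" = "yes" ]; then
        ignore_opt="-b $lang/ignore.list "
    fi
    printf './index_db.sh %swdir %s_%s %s/stopwords.list %s models/%s\n' \
        "$ignore_opt" "$lang" "$locator" "$lang" "$analyzer" "$lang"
}

# Reproduce two rows of the table: one with an ignore list, one without.
index_cmd de DE German yes
index_cmd fi FI Finnish
```

Printing the command instead of executing it makes it easy to compare against the table before starting a build that can take hours.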
You can find pre-built datasets created using the model-quickstarter here:
If you use the current (statistical) version of DBpedia Spotlight, or the data/models created using this repository, please cite the following paper.
@inproceedings{isem2013daiber,
title = {Improving Efficiency and Accuracy in Multilingual Entity Extraction},
author = {Joachim Daiber and Max Jakob and Chris Hokamp and Pablo N. Mendes},
year = {2013},
booktitle = {Proceedings of the 9th International Conference on Semantic Systems (I-Semantics)}
}