Preprocessing Common Crawl

This code downloads data from Common Crawl, preprocesses it, and splits it by language.

This script uses the scripts and language identifier of [1].

This code inherits its requirements from fastText.

Set the variable WET_PATHS_URL to the crawl you want to process. Also set the variables NUM_LANGID and NUM_DEDUP in download_crawl.sh according to the capacity of your machine. Langid processes are mostly limited by CPU usage, while dedup processes are likely to be limited by RAM usage (each uses 2GB of RAM).
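As a rough illustration of how these three variables might be chosen, the sketch below derives NUM_LANGID from a core count and NUM_DEDUP from the 2GB-per-process figure above. The helper functions `suggest_num_langid` and `suggest_num_dedup` are hypothetical and not part of this repository, the one-langid-process-per-core rule is an assumption, and the URL is a placeholder pattern (replace `CC-MAIN-YYYY-WW` with the identifier of the crawl you want); check the values already present in download_crawl.sh before changing them.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the repository: they only suggest
# values for NUM_LANGID and NUM_DEDUP from the capacity of the machine.

# Assumption: one langid process per CPU core, since langid is CPU-bound.
suggest_num_langid() {
  local cores="$1"
  echo "$cores"
}

# One dedup process per 2 GB of RAM (figure from the README), minimum 1.
suggest_num_dedup() {
  local ram_gb="$1"
  local n=$(( ram_gb / 2 ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# Placeholder URL pattern (assumption): substitute a real crawl identifier.
WET_PATHS_URL="https://data.commoncrawl.org/crawl-data/CC-MAIN-YYYY-WW/wet.paths.gz"

# Example machine (illustrative numbers): 8 cores, 16 GB of RAM.
NUM_LANGID="$(suggest_num_langid 8)"
NUM_DEDUP="$(suggest_num_dedup 16)"

echo "WET_PATHS_URL=$WET_PATHS_URL"
echo "NUM_LANGID=$NUM_LANGID NUM_DEDUP=$NUM_DEDUP"
```

On a shared machine you may want to pass a smaller RAM figure than the total, so that the dedup processes leave memory for the rest of the pipeline.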

Reference

If you use this code, please cite:

[1] E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

