A package for working with files containing word embeddings (aka word vectors). Written for:
- providing a common interface for different file formats;
- providing a flexible function for building "embedding matrices" that you can use for initializing the Embedding layer of your deep learning model;
- using as little RAM as possible: no need to load 3M vectors like with gensim.load_word2vec_format when you only need 20K;
- satisfying my (inexplicable) urge to write a Python package.
- Supports textual and Google's binary format plus a custom convenient format (.vvm) supporting constant-time access of word vectors (by word).
- Makes it easy to implement, test and integrate new file formats.
- Supports virtually any text encoding and vector data type (though you should probably use only UTF-8 as encoding).
- Well-documented and type-annotated (meaning great IDE support).
- Extensively tested.
- Progress bars (by default) for every time-consuming operation.
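To make the low-memory idea concrete, here is a minimal sketch of streaming a textual embedding file and keeping only the words you asked for. It is not embfile's implementation: `find_vectors` is a hypothetical helper, and the file is assumed to hold one space-separated `word v1 v2 ...` entry per line.

```python
import io
from typing import Dict, Iterable, List, Set, TextIO, Tuple


def find_vectors(
    file: TextIO, wanted: Iterable[str]
) -> Tuple[Dict[str, List[float]], Set[str]]:
    """Stream a textual embedding file and keep only the requested words.

    Hypothetical helper for illustration; assumes "word v1 v2 ..." lines.
    """
    remaining = set(wanted)
    found: Dict[str, List[float]] = {}
    for line in file:
        if not remaining:
            break  # every requested word was found: stop reading early
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in remaining:
            found[word] = [float(v) for v in values]
            remaining.discard(word)
    # Whatever is left in `remaining` was never seen in the file.
    return found, remaining


# Tiny in-memory stand-in for a real file on disk.
sample = io.StringIO("ciao 0.1 0.2\nhello 0.3 0.4\nworld 0.5 0.6\n")
vectors, missing = find_vectors(sample, ["hello", "someMissingWord"])
print(vectors)
print(missing)
```

Only the requested vectors are ever held in memory, so the cost depends on how many words you need rather than on the size of the file.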
pip install embfile
import embfile

words = ['ciao', 'hello', 'someMissingWord']  # a list of words or a dictionary {word: index}

with embfile.open("path/to/file.bin") as f:  # infer file format from file extension
    print(f.vocab_size, f.vector_size)

    # Load some word vectors in a dictionary (raises KeyError if any word is missing)
    word2vec = f.load(['ciao', 'hello'])

    # Like f.load() but allows missing words (and returns them in a set)
    word2vec, missing_words = f.find(['ciao', 'hello', 'someMissingWord'])

    # Build a matrix for initializing an Embedding layer either from
    # a list of words or from a dictionary {word: index}. Handles the
    # initialization of any missing word vectors (see "oov_initializer")
    matrix, word2index, missing_words = embfile.build_matrix(f, words)
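For readers unfamiliar with the term, the sketch below shows what building an "embedding matrix" amounts to conceptually. It is not embfile's code: `build_matrix_sketch` is a hypothetical function, rows are plain Python lists rather than arrays, words are assumed to be unique, and missing words get small random Gaussian vectors as a stand-in for whatever `oov_initializer` you would actually choose.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple


def build_matrix_sketch(
    word2vec: Dict[str, List[float]],
    words: Sequence[str],
    vector_size: int,
    seed: int = 0,
) -> Tuple[List[List[float]], Dict[str, int], Set[str]]:
    """Return (matrix, word2index, missing) where row i is the vector of words[i].

    Hypothetical illustration; assumes `words` contains no duplicates.
    """
    rng = random.Random(seed)
    word2index = {word: index for index, word in enumerate(words)}
    matrix: List[List[float]] = []
    missing: Set[str] = set()
    for word in words:
        vector = word2vec.get(word)
        if vector is None:
            # Out-of-vocabulary word: assumed initializer, small random values.
            missing.add(word)
            vector = [rng.gauss(0.0, 0.1) for _ in range(vector_size)]
        matrix.append(list(vector))
    return matrix, word2index, missing


pretrained = {"ciao": [0.1, 0.2], "hello": [0.3, 0.4]}
matrix, word2index, missing = build_matrix_sketch(
    pretrained, ["ciao", "hello", "someMissingWord"], vector_size=2
)
print(word2index)
print(missing)
```

Row `word2index[w]` of the matrix holds the vector for `w`, which is the layout an Embedding layer expects for its initial weights.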
The examples show how to use embfile to initialize the Embedding layer of a deep learning model. They are only illustrative, so don't skip the documentation.
- Keras using Tokenizer
- Keras using TextVectorization (tensorflow >= 2.1)
Read the full documentation at https://embfile.readthedocs.io/.