Overview

A package for working with files containing word embeddings (aka word vectors). Written for:

providing a common interface for different file formats;
providing a flexible function for building "embedding matrices" that you can use for initializing the Embedding layer of your deep learning model;
using as little RAM as possible: no need to load 3M vectors, as with gensim's KeyedVectors.load_word2vec_format, when you only need 20K;
satisfying my (inexplicable) urge to write a Python package.
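
To illustrate the memory point above, here is a minimal sketch of streaming a textual embedding file and keeping only the requested words. It is not embfile's implementation: the function name `load_selected_vectors` is hypothetical, and the sketch assumes a headerless file with one `word v1 v2 ...` entry per line, separated by single spaces.

```python
import io


def load_selected_vectors(lines, wanted):
    """Stream 'word v1 v2 ...' lines and keep only the wanted words.

    Hypothetical helper for illustration; assumes no header line.
    """
    remaining = set(wanted)
    found = {}
    for line in lines:
        if not remaining:
            break  # every requested word was found: stop reading early
        word, *values = line.rstrip().split(" ")
        if word in remaining:
            found[word] = [float(v) for v in values]
            remaining.discard(word)
    return found, remaining


# A tiny in-memory stand-in for a textual embedding file
sample = io.StringIO(
    "hello 0.1 0.2 0.3\n"
    "ciao 0.4 0.5 0.6\n"
    "world 0.7 0.8 0.9\n"
)
vectors, missing = load_selected_vectors(sample, ["ciao", "someMissingWord"])
```

Only the vectors that were asked for are ever held in memory, so peak usage depends on the size of the requested vocabulary rather than on the size of the file.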

Features

Supports the textual format and Google's binary format, plus a custom, convenient format (.vvm) that supports constant-time access to word vectors (by word).
Makes it easy to implement, test and integrate new file formats.
Supports virtually any text encoding and vector data type (though you should probably use only UTF-8 as encoding).
Well-documented and type-annotated (meaning great IDE support).
Extensively tested.
Progress bars (by default) for every time-consuming operation.
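
As a rough idea of how constant-time access by word can work in general, the sketch below stores fixed-size binary records and keeps a word-to-offset index, so a lookup is one dictionary access plus one seek. This is an assumption-laden illustration and does not describe the actual .vvm layout; `write_vectors`, `read_vector`, and the float32 record layout are all hypothetical.

```python
import io
import struct

VECTOR_SIZE = 3
# One record = VECTOR_SIZE little-endian float32 values (assumed layout)
RECORD = struct.Struct(f"<{VECTOR_SIZE}f")


def write_vectors(buffer, word2vec):
    """Write each vector as a fixed-size record; return {word: byte offset}."""
    index = {}
    for word, vector in word2vec.items():
        index[word] = buffer.tell()
        buffer.write(RECORD.pack(*vector))
    return index


def read_vector(buffer, index, word):
    """Jump straight to the record of `word` without scanning the file."""
    buffer.seek(index[word])
    return list(RECORD.unpack(buffer.read(RECORD.size)))


storage = io.BytesIO()  # stands in for a file opened in binary mode
offsets = write_vectors(
    storage, {"hello": [1.0, 2.0, 3.0], "ciao": [4.0, 5.0, 6.0]}
)
ciao = read_vector(storage, offsets, "ciao")
```

By contrast, a plain textual file has no such index, so finding a word there generally means scanning lines until it appears.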

Installation

pip install embfile

Quick start

import embfile

words = ['ciao', 'hello', 'someMissingWord']    # example vocabulary

with embfile.open("path/to/file.bin") as f:     # the file format is inferred from the file extension

    print(f.vocab_size, f.vector_size)

    # Load some word vectors into a dictionary (raises KeyError if any word is missing)
    word2vec = f.load(['ciao', 'hello'])

    # Like f.load() but allows missing words (and returns them in a set)
    word2vec, missing_words = f.find(['ciao', 'hello', 'someMissingWord'])

    # Build a matrix for initializing an Embedding layer either from
    # a list of words or from a dictionary {word: index}. Handles the
    # initialization of any missing word vectors (see "oov_initializer")
    matrix, word2index, missing_words = embfile.build_matrix(f, words)
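
To make the idea of an "embedding matrix" concrete, here is a minimal sketch of what such a builder does conceptually. It is not embfile's implementation: `build_matrix_sketch` and `normal_init` are hypothetical names, the input is an in-memory dictionary instead of an embedding file, and plain lists stand in for the arrays a real model would use.

```python
import random


def build_matrix_sketch(word2vec, words, vector_size, oov_initializer):
    """Return (matrix, word2index, missing) for the given words.

    Row i of the matrix holds the vector of the word with index i;
    words without a pretrained vector get one from oov_initializer.
    """
    # dict.fromkeys drops duplicates while preserving order
    word2index = {w: i for i, w in enumerate(dict.fromkeys(words))}
    matrix = [None] * len(word2index)
    missing = set()
    for word, i in word2index.items():
        if word in word2vec:
            matrix[i] = list(word2vec[word])
        else:
            matrix[i] = oov_initializer(vector_size)
            missing.add(word)
    return matrix, word2index, missing


rng = random.Random(0)


def normal_init(size):
    """Hypothetical out-of-vocabulary initializer: small Gaussian noise."""
    return [rng.gauss(0.0, 0.1) for _ in range(size)]


pretrained = {"hello": [1.0, 2.0], "ciao": [3.0, 4.0]}
matrix, word2index, missing = build_matrix_sketch(
    pretrained, ["ciao", "hello", "someMissingWord"], 2, normal_init
)
```

The row order follows word2index, which is what lets the matrix line up with the integer ids produced by a tokenizer.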

Examples

The examples show how to use embfile to initialize the Embedding layer of a deep learning model. They are only illustrative, so don't skip the documentation.

Keras using Tokenizer
Keras using TextVectorization (tensorflow >= 2.1)

Documentation

Read the full documentation at https://embfile.readthedocs.io/.

