EETQ

Chinese README: README_zh.md

Easy & Efficient Quantization for Transformers

Features

- New🔥: Implemented GEMV for W8A16, improving performance by 10~30%.
- INT8 weight-only PTQ (post-training quantization)
  - High-performance GEMM kernels from FasterTransformer (original code)
  - No quantization training required
- Optimized attention layer using Flash-Attention V2
- Easy to use: adapt your PyTorch model with one line of code
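
To make the INT8 weight-only (W8A16) idea concrete, here is a minimal pure-Python sketch. It assumes symmetric quantization with one scale per weight row; the function names are illustrative only, and EETQ's actual CUDA kernels may store and scale the weights differently.

```python
# Sketch of symmetric per-row INT8 weight-only quantization.
# Assumption: one float scale per output row; this is NOT EETQ's actual kernel code.

def quantize_rows(weight):
    """Quantize each row of a 2-D list of floats to int8 values plus one scale per row."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric int8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def matvec_dequant(q_rows, scales, x):
    """Compute y = W @ x, dequantizing each int8 row on the fly; activations stay float."""
    return [scale * sum(q * xi for q, xi in zip(row, x))
            for row, scale in zip(q_rows, scales)]


weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
x = [1.0, 2.0, 3.0]
q_rows, scales = quantize_rows(weight)
approx = matvec_dequant(q_rows, scales, x)
exact = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
print("quantized:", approx, "exact:", exact)
```

Only the weights are stored in 8 bits; activations keep their original precision, which is why no quantization training or calibration data is needed.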

Getting started

Environment

cuda: >=11.4
python: >=3.8
gcc: >=7.4.0
torch: >=1.14.0
transformers: >=4.27.0

The versions above are the minimum requirements; newer versions are recommended.
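
A quick way to compare installed Python packages against these minimums is sketched below. The helper names are hypothetical, the check covers only Python and the two pip packages (not CUDA or gcc), and the version parsing is deliberately simple.

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the environment list in this README.
MINIMUMS = {"torch": "1.14.0", "transformers": "4.27.0"}


def version_tuple(text):
    """Turn '2.1.0+cu118' into (2, 1, 0); local build tags and suffixes are ignored."""
    parts = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(p) for p in parts[:3])


def check_packages(minimums=MINIMUMS):
    """Return {package: status string} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = f"{installed} ({'ok' if ok else 'below ' + minimum})"
    return report


print("python:", "ok" if sys.version_info >= (3, 8) else "below 3.8")
for name, status in check_packages().items():
    print(f"{name}: {status}")
```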

Installation

Using the provided Dockerfile is recommended.

$ git clone https://github.com/NetEase-FuXi/EETQ.git
$ cd EETQ/
$ git submodule update --init --recursive
$ pip install .

If your machine has less than 96GB of RAM and many CPU cores, ninja may launch too many parallel compilation jobs and exhaust the available memory. To limit the number of parallel jobs, set the environment variable MAX_JOBS:

$ MAX_JOBS=4 pip install .
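
If you are unsure which value to pick, the sketch below derives one from the machine's RAM and core count. The 4 GB-per-job budget is an assumption for illustration, not a measured figure for EETQ's build, and both function names are hypothetical.

```python
import os


def suggest_max_jobs(total_ram_gb, cpu_cores, gb_per_job=4.0):
    """Pick a job count that fits both RAM and cores.

    gb_per_job is an assumed memory budget per compiler process (hypothetical).
    """
    by_ram = int(total_ram_gb // gb_per_job)
    return max(1, min(cpu_cores, by_ram))


def total_ram_gb():
    """Total physical RAM in GB, read via sysconf (POSIX systems only)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


if __name__ == "__main__":
    jobs = suggest_max_jobs(total_ram_gb(), os.cpu_count() or 1)
    print(f"MAX_JOBS={jobs} pip install .")
```

Lower the result further if the build still runs out of memory.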

Usage

Use EETQ in transformers.

from transformers import AutoModelForCausalLM, EetqConfig
path = "/path/to/model"
quantization_config = EetqConfig("int8")
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto", quantization_config=quantization_config)

A quantized model can be saved with "save_pretrained" and reloaded with "from_pretrained".

quant_path = "/path/to/save/quantized/model"
model.save_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
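
Reloading works without passing a quantization_config because the settings are saved alongside the model. The sketch below inspects the saved directory; it assumes transformers writes them under a "quantization_config" key in config.json, and the helper name is hypothetical.

```python
import json
from pathlib import Path


def read_quantization_config(model_dir):
    """Return the 'quantization_config' dict from config.json, or None if absent."""
    config_file = Path(model_dir) / "config.json"
    with config_file.open(encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("quantization_config")


# Usage (path is a placeholder):
# print(read_quantization_config("/path/to/save/quantized/model"))
```

A result of None suggests the directory holds an unquantized checkpoint.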

Quantize a torch model

from eetq.utils import eet_quantize
# torch_model: an already-loaded PyTorch model
eet_quantize(torch_model)

Quantize a torch model and optimize it with flash attention

import torch
from transformers import AutoModelForCausalLM
from eetq.utils import eet_accelerator

...
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16)
# quantize the weights and enable fused (flash) attention
eet_accelerator(model, quantize=True, fused_attn=True, dev="cuda:0")
model.to("cuda:0")

# inference
res = model.generate(...)

Use EETQ in TGI. See this PR.

text-generation-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...
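
Once the launcher is running, the server can be queried over HTTP. The sketch below uses TGI's /generate route with only the standard library; the host and port are assumptions and must match how you started the launcher, and the function names are illustrative.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=50, base_url="http://127.0.0.1:8080"):
    """Build a POST request for TGI's /generate route (base_url is an assumption)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as response:
        return json.loads(response.read())["generated_text"]


# Usage, with the server already running:
# print(generate("What is weight-only quantization?", max_new_tokens=20))
```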

Use EETQ in LoRAX. See docs here.

lorax-launcher --model-id mistralai/Mistral-7B-v0.1 --quantize eetq ...

Load a quantized model in vLLM (work in progress; see "Support vllm")

python -m vllm.entrypoints.openai.api_server --model /path/to/quantized/model  --quantization eetq --trust-remote-code
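
The command above starts an OpenAI-compatible server, so a client can call its /v1/completions route. This is a standard-library sketch; it assumes the server started successfully, listens on port 8000, and serves the model under the same name passed to --model. The function names are illustrative.

```python
import json
import urllib.request


def build_completion_request(model, prompt, max_tokens=50, base_url="http://127.0.0.1:8000"):
    """Build a POST request for the OpenAI-style completions route (base_url assumed)."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(model, prompt, **kwargs):
    """Send the request and return the first completion (needs a running server)."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["choices"][0]["text"]


# Usage, with the server already running:
# print(complete("/path/to/quantized/model", "Hello, my name is", max_tokens=20))
```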

Examples

Model:

examples/models/llama_transformers_example.py

Performance

llama-13b (tested on a 3090), prompt=1024, max_new_tokens=50
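
To take a comparable measurement on your own hardware, a small timing helper like the one below can wrap any generation call. This is a sketch, not the script used for the numbers in this README; the helper name and its warmup and repeat defaults are assumptions.

```python
import time


def time_generation(generate_fn, new_tokens, warmup=1, repeats=3):
    """Return (mean seconds per call, generated tokens per second) for generate_fn().

    For GPU models, generate_fn should block until the result is ready
    (for example by calling torch.cuda.synchronize() before returning).
    """
    for _ in range(warmup):
        generate_fn()  # warm-up calls are not timed
    start = time.perf_counter()
    for _ in range(repeats):
        generate_fn()
    mean_seconds = (time.perf_counter() - start) / repeats
    return mean_seconds, new_tokens / mean_seconds


# Usage with a loaded model and a tokenized 1024-token prompt (names are placeholders):
# mean_s, tps = time_generation(lambda: model.generate(**inputs, max_new_tokens=50), new_tokens=50)
# print(f"{mean_s:.2f} s per call, {tps:.1f} tokens/s")
```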
