| Rust Documentation | Python Documentation | Discord |
Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices, quantization, and easy-to-use applications with an OpenAI-API-compatible HTTP server and Python bindings.
- More models: please submit requests here.
- X-LoRA: Scalings `topk` and softmax `topk` (#48).
- Parallel linear layers (sharding) (#50).
- Vision models: Idefics 2 (#309).
**Running the new Llama 3 model**

```bash
cargo run --release --features ... -- -i plain -m meta-llama/Meta-Llama-3-8B-Instruct -a llama
```

**Running the new Phi 3 model with 128K context window**

```bash
cargo run --release --features ... -- -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
```
Fast:
- Quantized model support: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit for faster inference and optimized memory usage.
- Continuous batching.
- Prefix caching.
- Device mapping: load and run some layers on the device and the rest on the CPU.
Accelerator support:
- Apple silicon support with the Metal framework.
- CPU inference with `mkl` and `accelerate` support and an optimized backend.
- CUDA support with flash attention and cuDNN.
Easy:
- Lightweight OpenAI API compatible HTTP server.
- Python API.
- Grammar support with Regex and Yacc.
- ISQ (In situ quantization): run `.safetensors` models directly from the Hugging Face Hub by quantizing them after loading instead of creating a GGUF file. This loads the ISQ-able weights on the CPU before quantizing with ISQ and then moving them back to the device to avoid memory spikes. A hedged Python sketch follows this list.
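A minimal sketch of ISQ from the Python API. Both the `in_situ_quant` keyword and the string-valued `arch` field are assumptions here; check the Python documentation for the exact names and accepted values:

```python
from mistralrs import Runner, Which

# Assumptions: `in_situ_quant` selects the ISQ target quantization, and
# Which.Plain takes the architecture as a string like the CLI's -a flag.
runner = Runner(
    which=Which.Plain(
        model_id="mistralai/Mistral-7B-Instruct-v0.1",
        arch="mistral",
        tokenizer_json=None,
        repeat_last_n=64,
    ),
    in_situ_quant="Q4K",
)
```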
Powerful:
- Fast LoRA support with weight merging.
- First X-LoRA inference platform with first class support.
- Speculative Decoding: mix supported models as the draft model or the target model.
- Dynamic LoRA adapter swapping at runtime with adapter preloading: examples and docs
This is a demo of interactive mode with streaming running Mistral GGUF:
demo_new.mp4
Supported models:
- Mistral 7B (v0.1 and v0.2)
- Gemma
- Llama, including Llama 3
- Mixtral 8x7B
- Phi 2
- Phi 3
- Qwen 2
Please see this section for details on quantization and LoRA support.
Rust Library API
Rust multithreaded API for easy integration into any application.
- Docs
- Examples
- To install: add `mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }` to your `Cargo.toml` dependencies.
Python API
Python API for mistral.rs.
```python
from mistralrs import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
        tokenizer_json=None,
        repeat_last_n=64,
    )
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",
        messages=[
            {"role": "user", "content": "Tell me a story about the Rust type system."}
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
```
HTTP Server
OpenAI API compatible API server
Llama Index integration
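Since the server speaks the OpenAI API, any OpenAI-compatible client can talk to it. Here is a minimal sketch using the `openai` Python package (v1+), assuming a server started on port 1234 as in the examples below; the API key is a placeholder, assuming the server does not validate it, and the model name follows the Python example above:

```python
import openai

# Point the client at the local mistral.rs server.
client = openai.OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "What is graphene?"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)
```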
- CUDA:
  - Enable with the `cuda` feature: `--features cuda`
  - Flash attention support with the `flash-attn` feature, only applicable to non-quantized models: `--features flash-attn`
  - cuDNN support with the `cudnn` feature: `--features cudnn`
- Metal:
  - Enable with the `metal` feature: `--features metal`
- CPU:
  - Intel MKL with the `mkl` feature: `--features mkl`
  - Apple Accelerate with the `accelerate` feature: `--features accelerate`
Enabling features is done by passing `--features ...` to the build system. When using `cargo run` or `maturin develop`, pass the `--features` flag before the `--` separating build flags from runtime flags.

- To enable a single feature like `metal`: `cargo build --release --features metal`.
- To enable multiple features, specify them in quotes: `cargo build --release --features "cuda flash-attn cudnn"`.
| Device | Mistral.rs Completion T/s | Llama.cpp Completion T/s | Model | Quant |
|---|---|---|---|---|
| A10 GPU, CUDA | 78 | 78 | mistral-7b | 4_K_M |
| Intel Xeon 8358 CPU, AVX | 6 | 19 | mistral-7b | 4_K_M |
| Raspberry Pi 5 (8GB), Neon | 2 | 3 | mistral-7b | 2_K |
| A100 GPU, CUDA | 110 | 119 | mistral-7b | 4_K_M |
Please submit more benchmarks by raising an issue!
To install mistral.rs, ensure that Rust is installed by following this link. Additionally, the Hugging Face token should be provided in `~/.cache/huggingface/token` when using the server, to enable automatic download of gated models.
1. Install required packages:

   - `openssl` (e.g., `sudo apt install libssl-dev`)
   - `pkg-config` (e.g., `sudo apt install pkg-config`)
2. Install Rust: https://rustup.rs/

   ```bash
   curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
   source $HOME/.cargo/env
   ```
3. Set the HF token correctly (skip if already set, if your model is not gated, or if you want to use the `token_source` parameters in Python or the command line):

   ```bash
   mkdir ~/.cache/huggingface
   touch ~/.cache/huggingface/token
   echo <HF_TOKEN_HERE> > ~/.cache/huggingface/token
   ```
4. Download the code:

   ```bash
   git clone https://github.com/EricLBuehler/mistral.rs.git
   cd mistral.rs
   ```
5. Build or install:

   - Base build command:

     ```bash
     cargo build --release
     ```

   - Build with CUDA support:

     ```bash
     cargo build --release --features cuda
     ```

   - Build with CUDA and Flash Attention V2 support:

     ```bash
     cargo build --release --features "cuda flash-attn"
     ```

   - Build with Metal support:

     ```bash
     cargo build --release --features metal
     ```

   - Build with Accelerate support:

     ```bash
     cargo build --release --features accelerate
     ```

   - Build with MKL support:

     ```bash
     cargo build --release --features mkl
     ```

   - Install with `cargo install` for easy command line usage. Pass the same values to `--features` as you would for `cargo build`:

     ```bash
     cargo install --path mistralrs-server --features cuda
     ```
6. The build process will output a binary `mistralrs-server` at `./target/release/mistralrs-server`, which may be copied into the working directory with the following command:

   ```bash
   cp ./target/release/mistralrs-server ./mistralrs-server
   ```
7. Install Python support by following the guide here.
Mistral.rs can automatically download models from the HF Hub. To access gated models, you should provide a token source. It may be one of:

- `literal:<value>`: Load from a specified literal
- `env:<value>`: Load from a specified environment variable
- `path:<value>`: Load from a specified file
- `cache` (default): Load from the HF token at `~/.cache/huggingface/token` or equivalent
- `none`: Use no HF token

This is passed in the following ways:

- Command line:

  ```bash
  ./mistralrs-server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
  ```
- Python: here is an example of setting the token source.
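  A minimal sketch, assuming the `Runner` constructor accepts the `token_source` keyword referenced in the installation notes above; the value formats match the list above:

  ```python
  from mistralrs import Runner, Which

  # Assumption: `token_source` is a Runner keyword mirroring the CLI's
  # --token-source flag; here the token is read from the HF_TOKEN env var.
  runner = Runner(
      which=Which.GGUF(
          tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
          quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
          quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
          tokenizer_json=None,
          repeat_last_n=64,
      ),
      token_source="env:HF_TOKEN",
  )
  ```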
If the token cannot be loaded, no token will be used (i.e. effectively using `none`).
You can also instruct mistral.rs to load models locally by modifying the `*_model_id` arguments or options:

```bash
./mistralrs-server --port 1234 plain -m . -a mistral
```

or

```bash
./mistralrs-server gguf -m . -t . -f Phi-3-mini-128k-instruct-q4_K_M.gguf
```
Throughout mistral.rs, any model ID argument or option may be a local path and should contain the following files for each model ID option:

- `--model-id` (server) or `model_id` (python/rust) or `--tok-model-id` (server) or `tok_model_id` (python/rust):
  - `config.json`
  - `tokenizer_config.json`
  - `tokenizer.json` (if not specified separately)
  - `.safetensors` files
- `--quantized-model-id` (server) or `quantized_model_id` (python/rust):
  - Specified `.gguf` or `.ggml` file
- `--x-lora-model-id` (server) or `xlora_model_id` (python/rust):
  - `xlora_classifier.safetensors`
  - `xlora_config.json`
  - Adapters' `.safetensors` and `adapter_config.json` files in their respective directories
- `--adapters-model-id` (server) or `adapters_model_id` (python/rust):
  - Adapters' `.safetensors` and `adapter_config.json` files in their respective directories
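For reference, here is a sketch of pointing the Python API at a local directory laid out as above. The `"."` paths and the GGUF filename are placeholders; the signature follows the Python API example earlier:

```python
from mistralrs import Runner, Which

# Local loading: any *_model_id may be a directory path. Here "." is
# assumed to contain the tokenizer files and the .gguf file listed above.
runner = Runner(
    which=Which.GGUF(
        tok_model_id=".",
        quantized_model_id=".",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
        tokenizer_json=None,
        repeat_last_n=64,
    )
)
```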
To start a server serving Mistral GGUF on `localhost:1234`:

```bash
./mistralrs-server --port 1234 --log output.log gguf -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -t mistralai/Mistral-7B-Instruct-v0.1 -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
```
Mistral.rs uses subcommands to control the model type. They are generally of the format `<XLORA/LORA>-<QUANTIZATION>`. Please run `./mistralrs-server --help` to see the subcommands.
Additionally, for models without quantization, the model architecture should be provided as the `--arch` or `-a` argument, in contrast to GGUF models, which encode the architecture in the file. It should be one of the following:

- `mistral`
- `gemma`
- `mixtral`
- `llama`
- `phi2`
- `phi3`
- `qwen2`
Interactive mode:

You can launch interactive mode, a simple chat application running in the terminal, by passing `-i`:

```bash
./mistralrs-server -i gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
```
- X-LoRA with no quantization

  To start an X-LoRA server with the adapters exactly as presented in the paper:

  ```bash
  ./mistralrs-server --port 1234 x-lora-plain -o orderings/xlora-paper-ordering.json -x lamm-mit/x-lora
  ```
- LoRA with a model from GGUF

  To start a LoRA server with adapters from the X-LoRA paper (you should modify the ordering file to use only one adapter, as the adapter static scalings are all 1 and so the signal will become distorted):

  ```bash
  ./mistralrs-server --port 1234 lora-gguf -o orderings/xlora-paper-ordering.json -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q8_0.gguf -a lamm-mit/x-lora
  ```

  Normally with a LoRA model you would use a custom ordering file. However, for this example we use the ordering from the X-LoRA paper because we are using the adapters from the X-LoRA paper.
- With a model from GGUF

  To start a server running Mistral from GGUF:

  ```bash
  ./mistralrs-server --port 1234 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf
  ```
- With a model from GGML

  To start a server running Llama from GGML:

  ```bash
  ./mistralrs-server --port 1234 ggml -t meta-llama/Llama-2-13b-chat-hf -m TheBloke/Llama-2-13B-chat-GGML -f llama-2-13b-chat.ggmlv3.q4_K_M.bin
  ```
- Plain model from safetensors

  To start a server running Mistral from safetensors:

  ```bash
  ./mistralrs-server --port 1234 plain -m mistralai/Mistral-7B-Instruct-v0.1 -a mistral
  ```
We provide a method to select models with a `.toml` file. The keys are the same as the command line, with `no_kv_cache` and `tokenizer_json` being "global" keys.

Example:

```bash
./mistralrs-server --port 1234 toml -f toml-selectors/gguf.toml
```
Command line docs
Command line docs here
Quantization support
| Model | GGUF | GGML |
|---|---|---|
| Mistral 7B | ✅ | |
| Gemma | | |
| Llama | ✅ | ✅ |
| Mixtral 8x7B | ✅ | |
| Phi 2 | ✅ | |
| Phi 3 | ✅ | |
| Qwen 2 | | |
Device mapping support
| Model | Supported |
|---|---|
| Normal | ✅ |
| GGUF | ✅ |
| GGML | |
X-LoRA and LoRA support
| Model | X-LoRA | X-LoRA+GGUF | X-LoRA+GGML |
|---|---|---|---|
| Mistral 7B | ✅ | ✅ | |
| Gemma | ✅ | | |
| Llama | ✅ | ✅ | ✅ |
| Mixtral 8x7B | ✅ | ✅ | |
| Phi 2 | ✅ | | |
| Phi 3 | ✅ | ✅ | |
| Qwen 2 | | | |
Using derivative models
To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass `--help` after the subcommand. For example, when using a different model than the default, specify the following for the following types of models (a Python sketch for the quantized case follows this list):
- Normal: Model id
- Quantized: Quantized model id, quantized filename, and tokenizer id
- X-LoRA: Model id, X-LoRA ordering
- X-LoRA quantized: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
- LoRA: Model id, LoRA ordering
- LoRA quantized: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
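For the quantized case, here is a sketch of the equivalent Python selection for the Zephyr model used in the CLI example below; the `Which.GGUF` signature follows the Python API example above:

```python
from mistralrs import Runner, Which

# Derivative quantized model: quantized model id, quantized filename,
# and tokenizer id, mirroring the Zephyr CLI example below.
runner = Runner(
    which=Which.GGUF(
        tok_model_id="HuggingFaceH4/zephyr-7b-beta",
        quantized_model_id="TheBloke/zephyr-7B-beta-GGUF",
        quantized_filename="zephyr-7b-beta.Q5_0.gguf",
        tokenizer_json=None,
        repeat_last_n=64,
    )
)
```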
See this section to determine if it is necessary to prepare an X-LoRA/LoRA ordering file; it is always necessary if the target modules or architecture changed, or if the adapter order changed.
It is also important to check the chat template style of the model. If the HF Hub repo has a `tokenizer_config.json` file, it is not necessary to specify the chat template. Otherwise, templates can be found in `chat_templates` and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, not messages.
For example, when using a Zephyr model:

```bash
./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
```
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the `x-lora-*` architecture, and LoRA support by selecting the `lora-*` architecture. Please find docs for adapter models here.
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.
If you have any problems or want to contribute something, please raise an issue or pull request!
Consider setting the `RUST_LOG=debug` environment variable.
If you want to add a new model, please see our guide.
- Setting the compiler path:
  - Set the `NVCC_CCBIN` environment variable during the build.
- Error: `recompile with -fPIE`:
  - Some Linux distributions require compiling with `-fPIE`.
  - Set the `CUDA_NVCC_FLAGS` environment variable to `-fPIE` during the build: `CUDA_NVCC_FLAGS=-fPIE`
This project would not be possible without the excellent work at `candle`. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.