Starred repositories
NanoGPT (124M) quality in 7.8 8xH100-minutes
A programming framework for agentic AI 🤖
Inference code for the paper "Spirit-LM: Interleaved Spoken and Written Language Model".
🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
🤖 Build voice-based LLM agents. Modular + open source.
PaddleSlim is an open-source library for deep model compression and architecture search.
A PyTorch Knowledge Distillation library for benchmarking and extending works in the domains of Knowledge Distillation, Pruning, and Quantization.
A pair of tiny foundational models trained in Brazilian Portuguese. 🦙🦙
[NeurIPS 2023 spotlight] Official implementation of HGRN in our NeurIPS 2023 paper - Hierarchically Gated Recurrent Neural Network for Sequence Modeling
An unnecessarily tiny implementation of GPT-2 in NumPy.
A safetensors extension to efficiently store sparse quantized tensors on disk
Official implementation of Half-Quadratic Quantization (HQQ)
Reorder-based post-training quantization for large language models
A high-throughput and memory-efficient inference and serving engine for LLMs
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Awesome LLM compression research papers and tools.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf and PV-Tuning: Beyond Straight-Through Estimation for Ext…
Code, dataset, and analysis samples that utilize the OpenFEMA API.
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
[ICLR 2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baichuan, TinyLlama, etc.
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models