| A survey of quantization methods for efficient neural network inference A Gholami, S Kim, Z Dong, Z Yao, MW Mahoney, K Keutzer Low-Power Computer Vision, 291-326, 2022 | 2092 | 2022 |
| I-BERT: Integer-only BERT quantization S Kim, A Gholami, Z Yao, MW Mahoney, K Keutzer International conference on machine learning, 5506-5518, 2021 | 517 | 2021 |
| AI and memory wall A Gholami, Z Yao, S Kim, C Hooper, MW Mahoney, K Keutzer IEEE Micro, 2024 | 443 | 2024 |
| KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization C Hooper, S Kim, H Mohammadzadeh, MW Mahoney, YS Shao, ... NeurIPS 2024, 2024 | 384 | 2024 |
| SqueezeLLM: Dense-and-Sparse Quantization S Kim, C Hooper, A Gholami, Z Dong, X Li, S Shen, MW Mahoney, ... ICML 2024, 2023 | 348 | 2023 |
| Learned Token Pruning for Transformers S Kim, S Shen, D Thorsley, A Gholami, W Kwon, J Hassoun, K Keutzer Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and …, 2022 | 252 | 2022 |
| A Fast Post-Training Pruning Framework for Transformers W Kwon, S Kim, MW Mahoney, J Hassoun, K Keutzer, A Gholami Advances in Neural Information Processing Systems 35, 2022 | 244 | 2022 |
| Squeezeformer: An efficient transformer for automatic speech recognition S Kim, A Gholami, A Shaw, N Lee, K Mangalam, J Malik, MW Mahoney, ... Advances in Neural Information Processing Systems 35, 2022 | 218 | 2022 |
| Full Stack Optimization of Transformer Inference: a Survey S Kim, C Hooper, T Wattanawong, M Kang, R Yan, H Genc, G Dinh, ... arXiv preprint arXiv:2302.14017, 2023 | 182* | 2023 |
| Speculative decoding with big little decoder S Kim, K Mangalam, S Moon, J Malik, MW Mahoney, A Gholami, ... Advances in Neural Information Processing Systems 36, 2024 | 179* | 2024 |
| An LLM Compiler for Parallel Function Calling S Kim, S Moon, R Tabrizi, N Lee, MW Mahoney, K Keutzer, A Gholami ICML 2024, 2023 | 136 | 2023 |
| Applications and techniques for fast machine learning in science AMC Deiana, N Tran, J Agar, M Blott, G Di Guglielmo, J Duarte, P Harris, ... Frontiers in Big Data 5, 787421, 2022 | 96 | 2022 |
| LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement N Lee, T Wattanawong, S Kim, K Mangalam, S Shen, G Anumanchipalli, ... ACL 2024, 2024 | 88 | 2024 |
| Hessian-aware pruning and optimal neural implant S Yu, Z Yao, A Gholami, Z Dong, S Kim, MW Mahoney, K Keutzer Proceedings of the IEEE/CVF Winter Conference on Applications of Computer …, 2022 | 83 | 2022 |
| Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks LE Erdogan, N Lee, S Kim, S Moon, H Furuta, G Anumanchipalli, ... arXiv preprint arXiv:2503.09572, 2025 | 77 | 2025 |
| TinyAgent: Function Calling at the Edge LE Erdogan, N Lee, S Jha, S Kim, R Tabrizi, S Moon, C Hooper, ... EMNLP 2024 (Demo), 2024 | 44 | 2024 |
| SPEED: Speculative Pipelined Execution for Efficient Decoding C Hooper, S Kim, H Mohammadzadeh, H Genc, K Keutzer, A Gholami, ... arXiv preprint arXiv:2310.12072, 2023 | 40 | 2023 |
| Integer-Only Zero-Shot Quantization for Efficient Speech Recognition S Kim, A Gholami, Z Yao, N Lee, P Wang, A Nrusimha, B Zhai, T Gao, ... ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and …, 2022 | 37 | 2022 |
| Squeezed Attention: Accelerating Long Context Length LLM Inference C Hooper, S Kim, H Mohammadzadeh, M Maheswaran, J Paik, ... ACL 2025, 2024 | 30 | 2024 |
| WindTunnel: towards differentiable ML pipelines beyond a single model GI Yu, S Amizadeh, S Kim, A Pagnoni, C Zhang, BG Chun, M Weimer, ... Proceedings of the VLDB Endowment 15 (1), 11-20, 2021 | 20* | 2021 |