| Measuring Massive Multitask Language Understanding D Hendrycks, C Burns, S Basart, A Zou, M Mazeika, D Song, J Steinhardt ICLR, 2020 | 7399 | 2020 |
| Universal and Transferable Adversarial Attacks on Aligned Language Models A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter, M Fredrikson arXiv preprint arXiv:2307.15043, 2023 | 2586 | 2023 |
| Beyond the imitation game: Quantifying and extrapolating the capabilities of language models A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ... TMLR, 2022 | 2210 | 2022 |
| Lessons from the Trenches on Reproducible Evaluation of Language Models S Biderman, H Schoelkopf, L Sutawika, L Gao, J Tow, B Abbasi, AF Aji, ... arXiv preprint arXiv:2405.14782, 2024 | 1528* | 2024 |
| HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ... ICML, 2024 | 792* | 2024 |
| Representation Engineering: A Top-Down Approach to AI Transparency A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ... arXiv preprint arXiv:2310.01405, 2023 | 743* | 2023 |
| Scaling Out-of-Distribution Detection for Real-World Settings D Hendrycks, S Basart, M Mazeika, A Zou, J Kwon, M Mostajabi, ... ICML, 2021 | 739 | 2021 |
| The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ... ICML, 2024 | 337 | 2024 |
| Humanity's Last Exam L Phan, A Gatti, Z Han, N Li, J Hu, H Zhang, CBC Zhang, M Shaaban, ... arXiv preprint arXiv:2501.14249, 2025 | 305 | 2025 |
| Improving Alignment and Robustness with Circuit Breakers A Zou, L Phan, J Wang, D Duenas, M Lin, M Andriushchenko, R Wang, ... NeurIPS, 2024 | 218* | 2024 |
| Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark A Pan, CJ Shern, A Zou, N Li, S Basart, T Woodside, J Ng, H Zhang, ... ICML, 2023 | 217 | 2023 |
| PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures D Hendrycks, A Zou, M Mazeika, L Tang, D Song, J Steinhardt CVPR, 2021 | 210 | 2021 |
| AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ... ICLR, 2024 | 144 | 2024 |
| What Would Jiminy Cricket Do? Towards Agents That Behave Morally M Mazeika, A Zou, S Patel, C Zhu, J Navarro, D Song, B Li, J Steinhardt, ... NeurIPS, 2021 | 88* | 2021 |
| Tamper-Resistant Safeguards for Open-Weight LLMs R Tamirisa, B Bharathi, L Phan, A Zhou, A Gatti, T Suresh, M Lin, J Wang, ... ICLR, 2024 | 86 | 2024 |
| The Trojan Detection Challenge M Mazeika, D Hendrycks, H Li, X Xu, S Hough, A Zou, A Rajabi, Q Yao, ... NeurIPS, 2022 | 66 | 2022 |
| Forecasting Future World Events with Neural Networks A Zou, T Xiao, R Jia, J Kwon, M Mazeika, R Li, D Song, J Steinhardt, ... NeurIPS, 2022 | 52 | 2022 |
| Representation Engineering: A Top-Down Approach to AI Transparency A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ... arXiv preprint arXiv:2310.01405, 2023 | 52 | 2023 |
| On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective Y Huang, C Gao, S Wu, H Wang, X Wang, Y Zhou, Y Wang, J Ye, J Shi, ... arXiv preprint arXiv:2502.14296, 2025 | 44 | 2025 |
| Unlocking Deterministic Robustness Certification on ImageNet K Hu, A Zou, Z Wang, K Leino, M Fredrikson NeurIPS, 2023 | 30* | 2023 |