Andy Zou

Cited by

	All	Since 2021
Citations	17948	17914
h-index	21	21
i10-index	25	25

Citations per year
Year	2021	2022	2023	2024	2025	2026
Citations	53	230	1479	5545	10210	356

Public access (based on funding mandates)

Available	2 articles
Not available	0 articles

Co-authors

Dan Hendrycks	Director of the Center for AI Safety (advisor for xAI and Scale)	Verified email at berkeley.edu
Matt Fredrikson	Carnegie Mellon University	Verified email at cs.cmu.edu
Zico Kolter	Carnegie Mellon University	Verified email at cs.cmu.edu

Andy Zou

PhD Student, Carnegie Mellon University

Verified email at andrew.cmu.edu

ML Safety, AI Safety


Title	Authors	Venue	Cited by	Year
Measuring Massive Multitask Language Understanding	D Hendrycks, C Burns, S Basart, A Zou, M Mazeika, D Song, J Steinhardt	ICLR	7399	2020
Universal and Transferable Adversarial Attacks on Aligned Language Models	A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter, M Fredrikson	arXiv preprint arXiv:2307.15043	2586	2023
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models	A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ...	TMLR	2210	2022
Lessons from the Trenches on Reproducible Evaluation of Language Models	S Biderman, H Schoelkopf, L Sutawika, L Gao, J Tow, B Abbasi, AF Aji, ...	arXiv preprint arXiv:2405.14782	1528*	2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal	M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...	ICML	792*	2024
Representation Engineering: A Top-Down Approach to AI Transparency	A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...	arXiv preprint arXiv:2310.01405	743*	2023
Scaling Out-of-Distribution Detection for Real-World Settings	D Hendrycks, S Basart, M Mazeika, A Zou, J Kwon, M Mostajabi, ...	ICML	739	2021
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning	N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...	ICML	337	2024
Humanity's last exam	L Phan, A Gatti, Z Han, N Li, J Hu, H Zhang, CBC Zhang, M Shaaban, ...	arXiv preprint arXiv:2501.14249	305	2025
Improving Alignment and Robustness with Circuit Breakers	A Zou, L Phan, J Wang, D Duenas, M Lin, M Andriushchenko, R Wang, ...	NeurIPS	218*	2024
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark	A Pan, CJ Shern, A Zou, N Li, S Basart, T Woodside, J Ng, H Zhang, ...	ICML	217	2023
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures	D Hendrycks, A Zou, M Mazeika, L Tang, D Song, J Steinhardt	CVPR	210	2021
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents	M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ...	ICLR	144	2024
What Would Jiminy Cricket Do? Towards Agents That Behave Morally	M Mazeika, A Zou, S Patel, C Zhu, J Navarro, D Song, B Li, J Steinhardt, ...	NeurIPS	88*	2021
Tamper-Resistant Safeguards for Open-Weight LLMs	R Tamirisa, B Bharathi, L Phan, A Zhou, A Gatti, T Suresh, M Lin, J Wang, ...	ICLR	86	2024
The Trojan Detection Challenge	M Mazeika, D Hendrycks, H Li, X Xu, S Hough, A Zou, A Rajabi, Q Yao, ...	NeurIPS	66	2022
Forecasting Future World Events with Neural Networks	A Zou, T Xiao, R Jia, J Kwon, M Mazeika, R Li, D Song, J Steinhardt, ...	NeurIPS	52	2022
others. 2023. Representation engineering: A top-down approach to ai transparency	A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...	arXiv preprint arXiv:2310.01405	52	1
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective	Y Huang, C Gao, S Wu, H Wang, X Wang, Y Zhou, Y Wang, J Ye, J Shi, ...	arXiv preprint arXiv:2502.14296	44	2025
Unlocking Deterministic Robustness Certification on ImageNet	K Hu, A Zou, Z Wang, K Leino, M Fredrikson	NeurIPS	30*	2023
