Andy Zou
Verified email at andrew.cmu.edu
Title
Cited by
Year
Measuring Massive Multitask Language Understanding
D Hendrycks, C Burns, S Basart, A Zou, M Mazeika, D Song, J Steinhardt
ICLR, 2020
7399 2020
Universal and Transferable Adversarial Attacks on Aligned Language Models
A Zou, Z Wang, N Carlini, M Nasr, JZ Kolter, M Fredrikson
arXiv preprint arXiv:2307.15043, 2023
2586 2023
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models
A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ...
TMLR, 2022
2210 2022
Lessons from the Trenches on Reproducible Evaluation of Language Models
S Biderman, H Schoelkopf, L Sutawika, L Gao, J Tow, B Abbasi, AF Aji, ...
arXiv preprint arXiv:2405.14782, 2024
1528* 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...
ICML, 2024
792* 2024
Representation Engineering: A Top-Down Approach to AI Transparency
A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...
arXiv preprint arXiv:2310.01405, 2023
743* 2023
Scaling Out-of-Distribution Detection for Real-World Settings
D Hendrycks, S Basart, M Mazeika, A Zou, J Kwon, M Mostajabi, ...
ICML, 2021
739 2021
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...
ICML, 2024
337 2024
Humanity's last exam
L Phan, A Gatti, Z Han, N Li, J Hu, H Zhang, CBC Zhang, M Shaaban, ...
arXiv preprint arXiv:2501.14249, 2025
305 2025
Improving Alignment and Robustness with Circuit Breakers
A Zou, L Phan, J Wang, D Duenas, M Lin, M Andriushchenko, R Wang, ...
NeurIPS, 2024
218* 2024
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
A Pan, CJ Shern, A Zou, N Li, S Basart, T Woodside, J Ng, H Zhang, ...
ICML, 2023
217 2023
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
D Hendrycks, A Zou, M Mazeika, L Tang, D Song, J Steinhardt
CVPR, 2021
210 2021
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ...
ICLR, 2024
144 2024
What Would Jiminy Cricket Do? Towards Agents That Behave Morally
M Mazeika, A Zou, S Patel, C Zhu, J Navarro, D Song, B Li, J Steinhardt, ...
NeurIPS, 2021
88* 2021
Tamper-Resistant Safeguards for Open-Weight LLMs
R Tamirisa, B Bharathi, L Phan, A Zhou, A Gatti, T Suresh, M Lin, J Wang, ...
ICLR, 2024
86 2024
The Trojan Detection Challenge
M Mazeika, D Hendrycks, H Li, X Xu, S Hough, A Zou, A Rajabi, Q Yao, ...
NeurIPS, 2022
66 2022
Forecasting Future World Events with Neural Networks
A Zou, T Xiao, R Jia, J Kwon, M Mazeika, R Li, D Song, J Steinhardt, ...
NeurIPS, 2022
52 2022
Representation Engineering: A Top-Down Approach to AI Transparency
A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...
arXiv preprint arXiv:2310.01405, 2023
52 2023
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
Y Huang, C Gao, S Wu, H Wang, X Wang, Y Zhou, Y Wang, J Ye, J Shi, ...
arXiv preprint arXiv:2502.14296, 2025
44 2025
Unlocking Deterministic Robustness Certification on ImageNet
K Hu, A Zou, Z Wang, K Leino, M Fredrikson
NeurIPS, 2023
30* 2023