Stephen Casper
PhD student, MIT
Verified email at mit.edu - Homepage
Title
Cited by
Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
TMLR, 2023
Cited by 891
Rethinking machine unlearning for large language models
S Liu, Y Yao, J Jia, S Casper, N Baracaldo, P Hase, Y Yao, CY Liu, X Xu, ...
Nature Machine Intelligence, 1-14, 2025
Cited by 331
Toward transparent AI: A survey on interpreting the inner structures of deep neural networks
T Räuker, A Ho, S Casper, D Hadfield-Menell
2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 464-483, 2023
Cited by 311
Foundational challenges in assuring alignment and safety of large language models
U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ...
TMLR, 2024
Cited by 296
Scalable and transferable black-box jailbreaks for language models via persona modulation
R Shah, S Pour, A Tagade, S Casper, J Rando
arXiv preprint arXiv:2311.03348, 2023
Cited by 182
Black-box access is insufficient for rigorous AI audits
S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ...
Proceedings of the 2024 ACM Conference on Fairness, Accountability, and …, 2024
Cited by 180
Explore, establish, exploit: Red teaming language models from scratch
S Casper, J Lin, J Kwon, G Culp, D Hadfield-Menell
arXiv preprint arXiv:2306.09442, 2023
Cited by 139
The AI risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence
P Slattery, AK Saeri, EAC Grundy, J Graham, M Noetel, R Uuk, J Dao, ...
arXiv preprint arXiv:2408.12622, 2024
Cited by 132
International AI safety report
Y Bengio, S Mindermann, D Privitera, T Besiroglu, R Bommasani, ...
arXiv preprint arXiv:2501.17805, 2025
Cited by 122
Open problems in mechanistic interpretability
L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...
TMLR, 2025
Cited by 122
Eight methods to evaluate robust unlearning in LLMs
A Lynch, P Guo, A Ewart, S Casper, D Hadfield-Menell
arXiv preprint arXiv:2402.16835, 2024
Cited by 122
Latent adversarial training improves robustness to persistent harmful behaviors in LLMs
A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ...
TMLR, 2024
Cited by 92*
Open problems in technical AI governance
A Reuel, B Bucknall, S Casper, T Fist, L Soder, O Aarne, L Hammond, ...
TMLR, 2024
Cited by 65
Defending against unforeseen failure modes with latent adversarial training
S Casper, L Schulze, O Patel, D Hadfield-Menell
TMLR, 2024
Cited by 60
International scientific report on the safety of advanced AI (interim report)
Y Bengio, S Mindermann, D Privitera, T Besiroglu, R Bommasani, ...
arXiv preprint arXiv:2412.05282, 2024
Cited by 49
Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness?
K Liu, S Casper, D Hadfield-Menell, J Andreas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language …, 2023
Cited by 49
Open problems in machine unlearning for AI safety
F Barez, T Fu, A Prabhu, S Casper, A Sanyal, A Bibi, A O'Gara, R Kirk, ...
arXiv preprint arXiv:2501.04952, 2025
Cited by 44
Red teaming deep neural networks with feature synthesis tools
S Casper, T Bu, Y Li, J Li, K Zhang, K Hariharan, D Hadfield-Menell
Advances in Neural Information Processing Systems 36, 80470-80516, 2023
Cited by 44*
Clusterability in neural networks
D Filan, S Casper, S Hod, C Wild, A Critch, S Russell
arXiv preprint arXiv:2103.03386, 2021
Cited by 44
Robust feature-level adversaries are interpretability tools
S Casper, M Nadeau, D Hadfield-Menell, G Kreiman
Advances in Neural Information Processing Systems 35, 33093-33106, 2022
Cited by 40
Articles 1–20