Stephen Casper
PhD student, MIT
Verified email at mit.edu - Homepage
Title
Cited by
Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
TMLR, 2023
Cited by 891
Rethinking machine unlearning for large language models
S Liu, Y Yao, J Jia, S Casper, N Baracaldo, P Hase, Y Yao, CY Liu, X Xu, ...
Nature Machine Intelligence, 1-14, 2025
Cited by 331
Toward transparent AI: A survey on interpreting the inner structures of deep neural networks
T Räuker, A Ho, S Casper, D Hadfield-Menell
2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 464-483, 2023
Cited by 311
Foundational challenges in assuring alignment and safety of large language models
U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ...
TMLR, 2024
Cited by 296
Scalable and transferable black-box jailbreaks for language models via persona modulation
R Shah, S Pour, A Tagade, S Casper, J Rando
arXiv preprint arXiv:2311.03348, 2023
Cited by 182
Black-box access is insufficient for rigorous AI audits
S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ...
Proceedings of the 2024 ACM Conference on Fairness, Accountability, and …, 2024
Cited by 180
Explore, establish, exploit: Red teaming language models from scratch
S Casper, J Lin, J Kwon, G Culp, D Hadfield-Menell
arXiv preprint arXiv:2306.09442, 2023
Cited by 139
The AI risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence
P Slattery, AK Saeri, EAC Grundy, J Graham, M Noetel, R Uuk, J Dao, ...
arXiv preprint arXiv:2408.12622, 2024
Cited by 132
International AI safety report
Y Bengio, S Mindermann, D Privitera, T Besiroglu, R Bommasani, ...
arXiv preprint arXiv:2501.17805, 2025
Cited by 122
Open problems in mechanistic interpretability
L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...
TMLR, 2025
Cited by 122
Eight methods to evaluate robust unlearning in LLMs
A Lynch, P Guo, A Ewart, S Casper, D Hadfield-Menell
arXiv preprint arXiv:2402.16835, 2024
Cited by 122
Latent adversarial training improves robustness to persistent harmful behaviors in LLMs
A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ...
TMLR, 2024
Cited by 92*
Open problems in technical AI governance
A Reuel, B Bucknall, S Casper, T Fist, L Soder, O Aarne, L Hammond, ...
TMLR, 2024
Cited by 65
Defending against unforeseen failure modes with latent adversarial training
S Casper, L Schulze, O Patel, D Hadfield-Menell
TMLR, 2024
Cited by 60
International scientific report on the safety of advanced AI (interim report)
Y Bengio, S Mindermann, D Privitera, T Besiroglu, R Bommasani, ...
arXiv preprint arXiv:2412.05282, 2024
Cited by 49
Cognitive dissonance: Why do language model outputs disagree with internal representations of truthfulness?
K Liu, S Casper, D Hadfield-Menell, J Andreas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language …, 2023
Cited by 49
Open problems in machine unlearning for AI safety
F Barez, T Fu, A Prabhu, S Casper, A Sanyal, A Bibi, A O'Gara, R Kirk, ...
arXiv preprint arXiv:2501.04952, 2025
Cited by 44
Red teaming deep neural networks with feature synthesis tools
S Casper, T Bu, Y Li, J Li, K Zhang, K Hariharan, D Hadfield-Menell
Advances in Neural Information Processing Systems 36, 80470-80516, 2023
Cited by 44*
Clusterability in neural networks
D Filan, S Casper, S Hod, C Wild, A Critch, S Russell
arXiv preprint arXiv:2103.03386, 2021
Cited by 44
Robust feature-level adversaries are interpretability tools
S Casper, M Nadeau, D Hadfield-Menell, G Kreiman
Advances in Neural Information Processing Systems 35, 33093-33106, 2022
Cited by 40
Articles 1–20