| Open problems and fundamental limitations of reinforcement learning from human feedback S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint arXiv:2307.15217, 2023 | 960 | 2023 |
| Frontier models are capable of in-context scheming A Meinke, B Schoen, J Scheurer, M Balesni, R Shah, M Hobbhahn arXiv preprint arXiv:2412.04984, 2024 | 188* | 2024 |
| Black-box access is insufficient for rigorous AI audits S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ... Proceedings of the 2024 ACM Conference on Fairness, Accountability, and …, 2024 | 187 | 2024 |
| Training language models with language feedback at scale J Scheurer, JA Campos, T Korbak, JS Chan, A Chen, K Cho, E Perez arXiv preprint arXiv:2303.16755, 2023 | 141 | 2023 |
| Large language models can strategically deceive their users when put under pressure J Scheurer, M Balesni, M Hobbhahn arXiv preprint arXiv:2311.07590, 2023 | 138* | 2023 |
| Training Language Models with Language Feedback J Scheurer, JA Campos, JS Chan, A Chen, K Cho, E Perez arXiv preprint arXiv:2204.14146, 2022 | 116* | 2022 |
| Improving code generation by training with natural language feedback A Chen, J Scheurer, T Korbak, JA Campos, JS Chan, SR Bowman, K Cho, ... arXiv preprint arXiv:2303.16749, 2023 | 98 | 2023 |
| Me, myself, and AI: The situational awareness dataset (SAD) for LLMs R Laine, B Chughtai, J Betley, K Hariharan, M Balesni, J Scheurer, ... Advances in Neural Information Processing Systems 37, 64010-64118, 2024 | 73 | 2024 |
| Towards evaluations-based safety cases for AI scheming M Balesni, M Hobbhahn, D Lindner, A Meinke, T Korbak, J Clymer, ... arXiv preprint arXiv:2411.03336, 2024 | 38* | 2024 |
| A causal framework for AI regulation and auditing L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, ... Publisher: Preprints, 2024 | 28* | 2024 |
| Stress testing deliberative alignment for anti-scheming training B Schoen, E Nitishinskaya, M Balesni, A Højmark, F Hofstätter, J Scheurer, ... arXiv preprint arXiv:2509.15541, 2025 | 27* | 2025 |
| Semantic Segmentation of Histopathological Slides for the Classification of Cutaneous Lymphoma and Eczema J Scheurer, C Ferrari, LBT Bom, M Beer, W Kempf, L Haug Annual Conference on Medical Image Understanding and Analysis, 26-42, 2020 | 22 | 2020 |
| Instance-wise algorithm configuration with graph neural networks R Valentin, C Ferrari, J Scheurer, A Amrollahi, C Wendler, MB Paulus arXiv preprint arXiv:2202.04910, 2022 | 13* | 2022 |
| TracrBench: Generating interpretability testbeds with large language models H Thurnherr, J Scheurer arXiv preprint arXiv:2409.13714, 2024 | 6* | 2024 |
| Few-shot adaptation works with unpredictable data JS Chan, M Pieler, J Jao, J Scheurer, E Perez Proceedings of the 61st Annual Meeting of the Association for Computational …, 2023 | 6 | 2023 |
| Forecasting Frontier Language Model Agent Capabilities G Pimpale, A Højmark, J Scheurer, M Hobbhahn arXiv preprint arXiv:2502.15850, 2025 | 5* | 2025 |
| Analyzing Probabilistic Methods for Evaluating Agent Capabilities A Højmark, G Pimpale, A Panickssery, M Hobbhahn, J Scheurer arXiv preprint arXiv:2409.16125, 2024 | 4 | 2024 |
| Practical Pitfalls of Causal Scrubbing J Scheurer, H Philipp, M Tony, T Jacques, L David https://www.lesswrong.com/posts/DFarDnQjMnjsKvW8s/practical-pitfalls-of …, 2023 | | 2023 |
| Meta Reward Learning for Recommender Systems: Towards Value Alignment J Scheurer | | 2021 |
| Meta-Learning an Image Editing Style J Scheurer | | 2019 |