| Publication | Cited by | Year |
| --- | --- | --- |
| Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. S Casper*, X Davies*, C Shi, TK Gilbert, J Scheurer, J Rando, ... TMLR (outstanding paper finalist), 2023 | 892 | 2023 |
| Defining and Characterizing Reward Hacking. J Skalse*, NHR Howe, D Krasheninnikov, D Krueger*. Advances in Neural Information Processing Systems 35, 2022 | 526 | 2022 |
| Harms from Increasingly Agentic Algorithmic Systems. A Chan, R Salganik, A Markelius, C Pang, N Rajkumar, D Krasheninnikov, ... Proceedings of the 2023 ACM Conference on Fairness, Accountability, and …, 2023 | 266* | 2023 |
| Preferences Implicit in the State of the World. R Shah*, D Krasheninnikov*, J Alexander, P Abbeel, A Dragan. International Conference on Learning Representations, 2019 | 99* | 2019 |
| Benefits of Assistance over Reward Learning. R Shah, P Freire, N Alex, R Freedman, D Krasheninnikov, L Chan, ... NeurIPS Workshop on Cooperative AI (best paper), 2020 | 43 | 2020 |
| Stress-Testing Capability Elicitation With Password-Locked Models. R Greenblatt*, F Roger*, D Krasheninnikov, D Krueger. Advances in Neural Information Processing Systems 37, 2024 | 33 | 2024 |
| Implicit meta-learning may lead language models to trust more reliable sources (out-of-context meta-learning). D Krasheninnikov*, E Krasheninnikov*, B Mlodozeniec, T Maharaj, ... ICML 2024; arXiv:2310.15047, 2023 | 27* | 2023 |
| Assistance with large language models. D Krasheninnikov*, E Krasheninnikov*, D Krueger. NeurIPS ML Safety Workshop, 2022 | 17 | 2022 |
| Detecting High-Stakes Interactions with Activation Probes. A McKenzie, U Pawar, P Blandfort, W Bankes, D Krueger, ES Lubana, ... NeurIPS 2025; Applied Interpretability Workshop at ICML 2025 (outstanding paper), 2025 | 7 | 2025 |
| Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks. M Brumley, J Kwon, D Krueger, D Krasheninnikov, U Anwar. NeurIPS Workshop on Foundation Model Interventions (MINT), 2024 | 6 | 2024 |
| Combining reward information from multiple sources. D Krasheninnikov, R Shah, H van Hoof. NeurIPS Workshop on Learning with Rich Experience, 2019 | 6 | 2019 |
| A Sober Look at Steering Vectors for LLMs. J Braun, D Krasheninnikov, U Anwar, R Kirk, D Tan, DS Krueger. LessWrong, November 23, 2024 | 5 | 2024 |
| Understanding (Un)Reliability of Steering Vectors in Language Models. J Braun, C Eickhoff, D Krueger, SA Bahrainian, D Krasheninnikov. ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025 | 3 | 2025 |
| Fresh in memory: Training-order recency is linearly encoded in language model activations. D Krasheninnikov, RE Turner, D Krueger. MemFM Workshop at ICML 2025 (best paper runner-up), 2025 | 2* | 2025 |
| Steering Clear: A Systematic Study of Activation Steering in a Toy Setup. D Krasheninnikov, D Krueger. NeurIPS Workshop on Foundation Model Interventions (MINT), 2024 | 2 | 2024 |
| The Impact of Off-Policy Training Data on Probe Generalisation. N Kirch, S Dower, A Skapars, ES Lubana, D Krasheninnikov. EurIPS 2025 PAIG Workshop (spotlight), 2025 | | 2025 |