Dmitrii Krasheninnikov
Verified email at cam.ac.uk
Title
Cited by
Year
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
S Casper*, X Davies*, C Shi, TK Gilbert, J Scheurer, J Rando, ...
TMLR (outstanding paper finalist), 2023
Cited by: 892 | Year: 2023
Defining and Characterizing Reward Hacking
J Skalse*, NHR Howe, D Krasheninnikov, D Krueger*
Advances in Neural Information Processing Systems 36, 2022
Cited by: 526 | Year: 2022
Harms from Increasingly Agentic Algorithmic Systems
A Chan, R Salganik, A Markelius, C Pang, N Rajkumar, D Krasheninnikov, ...
Proceedings of the 2023 ACM Conference on Fairness, Accountability, and …, 2023
Cited by: 266* | Year: 2023
Preferences Implicit in the State of the World
R Shah*, D Krasheninnikov*, J Alexander, P Abbeel, A Dragan
International Conference on Learning Representations, 2019
Cited by: 99* | Year: 2019
Benefits of Assistance over Reward Learning
R Shah, P Freire, N Alex, R Freedman, D Krasheninnikov, L Chan, ...
NeurIPS Workshop on Cooperative AI (best paper), 2020
Cited by: 43 | Year: 2020
Stress-Testing Capability Elicitation With Password-Locked Models
R Greenblatt*, F Roger*, D Krasheninnikov, D Krueger
Advances in Neural Information Processing Systems 38, 2024
Cited by: 33 | Year: 2024
Implicit meta-learning may lead language models to trust more reliable sources (out-of-context meta-learning)
D Krasheninnikov*, E Krasheninnikov*, B Mlodozeniec, T Maharaj, ...
ICML 2024, arXiv:2310.15047, 2023
Cited by: 27* | Year: 2023
Assistance with large language models
D Krasheninnikov*, E Krasheninnikov*, D Krueger
NeurIPS ML Safety Workshop, 2022
Cited by: 17 | Year: 2022
Detecting High-Stakes Interactions with Activation Probes
A McKenzie, U Pawar, P Blandfort, W Bankes, D Krueger, ES Lubana, ...
NeurIPS 2025; Applied Interpretability Workshop at ICML 2025 (outstanding paper), 2025
Cited by: 7 | Year: 2025
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
M Brumley, J Kwon, D Krueger, D Krasheninnikov, U Anwar
NeurIPS Workshop on Foundation Model Interventions (MINT), 2024
Cited by: 6 | Year: 2024
Combining reward information from multiple sources
D Krasheninnikov, R Shah, H van Hoof
NeurIPS Workshop on Learning with Rich Experience, 2019
Cited by: 6 | Year: 2019
A sober look at steering vectors for LLMs
J Braun, D Krasheninnikov, U Anwar, R Kirk, D Tan, DS Krueger
LessWrong, November 23, 2024
Cited by: 5 | Year: 2024
Understanding (Un)Reliability of Steering Vectors in Language Models
J Braun, C Eickhoff, D Krueger, SA Bahrainian, D Krasheninnikov
ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025
Cited by: 3 | Year: 2025
Fresh in memory: Training-order recency is linearly encoded in language model activations
D Krasheninnikov, RE Turner, D Krueger
MemFM workshop at ICML 2025 (best paper runner-up), 2025
Cited by: 2* | Year: 2025
Steering Clear: A Systematic Study of Activation Steering in a Toy Setup
D Krasheninnikov, D Krueger
NeurIPS Workshop on Foundation Model Interventions (MINT), 2024
Cited by: 2 | Year: 2024
The Impact of Off-Policy Training Data on Probe Generalisation
N Kirch, S Dower, A Skapars, ES Lubana, D Krasheninnikov
EurIPS 2025 PAIG workshop (spotlight), 2025
Year: 2025