Neel Nanda
Mechanistic Interpretability Team Lead, Google DeepMind
Verified email at deepmind.com
Title | Cited by | Year
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ...
arXiv preprint arXiv:2204.05862, 2022
3443 | 2022
A Mathematical Framework for Transformer Circuits
N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ...
Transformer Circuits Thread, 2021
1216* | 2021
In-context Learning and Induction Heads
C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ...
Transformer Circuits Thread, 2022
1202* | 2022
Progress Measures For Grokking Via Mechanistic Interpretability
N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt
ICLR 2023 Spotlight, 2023
756* | 2023
Predictability and surprise in large generative models
D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ...
Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022
464 | 2022
Refusal in Language Models Is Mediated by a Single Direction
A Arditi, O Obeso, A Syed, D Paleka, N Rimsky, W Gurnee, N Nanda
NeurIPS 2024, 2024
431* | 2024
Finding Neurons in a Haystack: Case Studies with Sparse Probing
W Gurnee, N Nanda, M Pauly, K Harvey, D Troitskii, D Bertsimas
Transactions on Machine Learning Research, 2023
299* | 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
N Nanda, A Lee, M Wattenberg
BlackboxNLP at EMNLP 2023, Honourable Mention for Best Paper, 2023
289* | 2023
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ...
Oral at BlackboxNLP 2024, 2024
271 | 2024
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
F Zhang, N Nanda
ICLR 2024, 2023
172 | 2023
Open Problems in Mechanistic Interpretability
L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...
arXiv preprint arXiv:2501.16496, 2025
169* | 2025
TransformerLens: A Library for Mechanistic Interpretability of Language Models
N Nanda, J Bloom
https://github.com/neelnanda-io/TransformerLens, 2023
159* | 2023
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ...
arXiv preprint arXiv:2407.14435, 2024
156 | 2024
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
B Chughtai, L Chan, N Nanda
ICML 2023, 2023
148* | 2023
Improving Dictionary Learning with Gated Sparse Autoencoders
S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...
NeurIPS 2024, 2024
146* | 2024
Linear Representations of Sentiment in Large Language Models
C Tigges, OJ Hollinsworth, A Geiger, N Nanda
BlackboxNLP 2024, 2023
141* | 2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik
arXiv preprint arXiv:2307.09458, 2023
132 | 2023
Transcoders Find Interpretable LLM Feature Circuits
J Dunefsky, P Chlenski, N Nanda
NeurIPS 2024, 2024
106* | 2024
Attribution Patching: Activation Patching At Industrial Scale
N Nanda
https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023
106* | 2023
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
I Arcuschin, J Janiak, R Krzyzanowski, S Rajamanoharan, N Nanda, ...
arXiv preprint arXiv:2503.08679, 2025
98* | 2025
Articles 1–20