Neel Nanda
Mechanistic Interpretability Team Lead, Google DeepMind
Verified email at deepmind.com
Title | Cited by | Year
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Y Bai, A Jones, K Ndousse, A Askell, A Chen, N DasSarma, D Drain, ...
arXiv preprint arXiv:2204.05862, 2022
3443 | 2022
A Mathematical Framework for Transformer Circuits
N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, ...
Transformer Circuits Thread, 2021
1216* | 2021
In-context Learning and Induction Heads
C Olsson, N Elhage, N Nanda, N Joseph, N DasSarma, T Henighan, ...
Transformer Circuits Thread, 2022
1202* | 2022
Progress Measures For Grokking Via Mechanistic Interpretability
N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt
ICLR 2023 Spotlight, 2023
756* | 2023
Predictability and surprise in large generative models
D Ganguli, D Hernandez, L Lovitt, A Askell, Y Bai, A Chen, T Conerly, ...
Proceedings of the 2022 ACM Conference on Fairness, Accountability, and …, 2022
464 | 2022
Refusal in Language Models Is Mediated by a Single Direction
A Arditi, O Obeso, A Syed, D Paleka, N Rimsky, W Gurnee, N Nanda
NeurIPS 2024, 2024
431* | 2024
Finding Neurons in a Haystack: Case Studies with Sparse Probing
W Gurnee, N Nanda, M Pauly, K Harvey, D Troitskii, D Bertsimas
Transactions on Machine Learning Research, 2023
299* | 2023
Emergent Linear Representations in World Models of Self-Supervised Sequence Models
N Nanda, A Lee, M Wattenberg
BlackboxNLP at EMNLP 2023, Honourable Mention for Best Paper, 2023
289* | 2023
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ...
Oral at BlackboxNLP 2024, 2024
271 | 2024
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
F Zhang, N Nanda
ICLR 2024, 2023
172 | 2023
Open Problems in Mechanistic Interpretability
L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...
arXiv preprint arXiv:2501.16496, 2025
169* | 2025
TransformerLens: A Library for Mechanistic Interpretability of Language Models
N Nanda, J Bloom
https://github.com/neelnanda-io/TransformerLens, 2023
159* | 2023
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ...
arXiv preprint arXiv:2407.14435, 2024
156 | 2024
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
B Chughtai, L Chan, N Nanda
ICML 2023, 2023
148* | 2023
Improving Dictionary Learning with Gated Sparse Autoencoders
S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...
NeurIPS 2024, 2024
146* | 2024
Linear Representations of Sentiment in Large Language Models
C Tigges, OJ Hollinsworth, A Geiger, N Nanda
BlackboxNLP 2024, 2023
141* | 2023
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik
arXiv preprint arXiv:2307.09458, 2023
132 | 2023
Transcoders Find Interpretable LLM Feature Circuits
J Dunefsky, P Chlenski, N Nanda
NeurIPS 2024, 2024
106* | 2024
Attribution Patching: Activation Patching At Industrial Scale
N Nanda
https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023
106* | 2023
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
I Arcuschin, J Janiak, R Krzyzanowski, S Rajamanoharan, N Nanda, ...
arXiv preprint arXiv:2503.08679, 2025
98* | 2025
Articles 1–20