| Sparse autoencoders find highly interpretable features in language models H Cunningham, A Ewart, L Riggs, R Huben, L Sharkey arXiv preprint arXiv:2309.08600, 2023 | 659 | 2023 |
| Goal misgeneralization in deep reinforcement learning LL Di Langosco, J Koch, LD Sharkey, J Pfau, D Krueger International Conference on Machine Learning, 12004-12019, 2022 | 201 | 2022 |
| Black-box access is insufficient for rigorous AI audits S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ... Proceedings of the 2024 ACM Conference on Fairness, Accountability, and …, 2024 | 180 | 2024 |
| Open problems in mechanistic interpretability L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ... arXiv preprint arXiv:2501.16496, 2025 | 122 | 2025 |
| Sparse autoencoders find highly interpretable features in language models R Huben, H Cunningham, LR Smith, A Ewart, L Sharkey The Twelfth International Conference on Learning Representations, 2023 | 112 | 2023 |
| National palliative care capacities around the world: results from the World Health Organization Noncommunicable Disease Country Capacity Survey L Sharkey, B Loring, M Cowan, L Riley, EL Krakauer Palliative medicine 32 (1), 106-113, 2018 | 93 | 2018 |
| Sparse autoencoders find highly interpretable features in language models, 2023 H Cunningham, A Ewart, L Riggs, R Huben, L Sharkey URL https://arxiv.org/abs/2309.08600, 2023 | 76 | 2023 |
| Taking features out of superposition with sparse autoencoders L Sharkey, D Braun, B Millidge AI Alignment Forum 6, 12-13, 2022 | 43 | 2022 |
| Identifying functionally important features with end-to-end sparse dictionary learning D Braun, J Taylor, N Goldowsky-Dill, L Sharkey Advances in Neural Information Processing Systems 37, 107286-107325, 2024 | 39 | 2024 |
| Interpreting neural networks through the polytope lens S Black, L Sharkey, L Grinsztajn, E Winsor, D Braun, J Merizian, K Parker, ... arXiv preprint arXiv:2211.12312, 2022 | 32 | 2022 |
| Open problems in mechanistic interpretability, 2025 L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ... URL https://arxiv.org/abs/2501.16496, 2025 | 31 | 2025 |
| Sparse autoencoders do not find canonical units of analysis P Leask, B Bussmann, M Pearce, J Bloom, C Tigges, NA Moubayed, ... arXiv preprint arXiv:2502.04878, 2025 | 24 | 2025 |
| Addressing feature suppression in SAEs B Wright, L Sharkey AI Alignment Forum 6, 2024 | 21 | 2024 |
| Taking features out of superposition with sparse autoencoders, 2022 L Sharkey, D Braun, B Millidge URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim …, 2023 | 21 | 2023 |
| A causal framework for AI regulation and auditing L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, C Stix, M Hobbhahn, 2024 | 17 | 2024 |
| Objective robustness in deep reinforcement learning J Koch, L Langosco, J Pfau, J Le, L Sharkey arXiv preprint arXiv:2105.14111, 2021 | 15 | 2021 |
| Interpretability in parameter space: Minimizing mechanistic description length with attribution-based parameter decomposition D Braun, L Bushnaq, S Heimersheim, J Mendel, L Sharkey arXiv preprint arXiv:2501.14926, 2025 | 13 | 2025 |
| Bilinear MLPs enable weight-based mechanistic interpretability MT Pearce, T Dooms, A Rigg, JM Oramas, L Sharkey arXiv preprint arXiv:2410.08417, 2024 | 12 | 2024 |
| AI Behind Closed Doors: a Primer on The Governance of Internal Deployment C Stix, M Pistillo, G Sastry, M Hobbhahn, A Ortega, M Balesni, ... arXiv preprint arXiv:2504.12170, 2025 | 11 | 2025 |
| Interpretability as compression: Reconsidering SAE explanations of neural activations with MDL-SAEs K Ayonrinde, MT Pearce, L Sharkey arXiv preprint arXiv:2410.11179, 2024 | 11 | 2024 |