Lee D Sharkey
Goodfire
Verified email at goodfire.ai
Title · Cited by · Year
Sparse autoencoders find highly interpretable features in language models
H Cunningham, A Ewart, L Riggs, R Huben, L Sharkey
arXiv preprint arXiv:2309.08600, 2023
Cited by 659 · 2023
Goal misgeneralization in deep reinforcement learning
LL Di Langosco, J Koch, LD Sharkey, J Pfau, D Krueger
International Conference on Machine Learning, 12004-12019, 2022
Cited by 201 · 2022
Black-box access is insufficient for rigorous AI audits
S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ...
Proceedings of the 2024 ACM Conference on Fairness, Accountability, and …, 2024
Cited by 180 · 2024
Open problems in mechanistic interpretability
L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...
arXiv preprint arXiv:2501.16496, 2025
Cited by 122 · 2025
Sparse autoencoders find highly interpretable features in language models
R Huben, H Cunningham, LR Smith, A Ewart, L Sharkey
The Twelfth International Conference on Learning Representations, 2023
Cited by 112 · 2023
National palliative care capacities around the world: results from the World Health Organization Noncommunicable Disease Country Capacity Survey
L Sharkey, B Loring, M Cowan, L Riley, EL Krakauer
Palliative medicine 32 (1), 106-113, 2018
Cited by 93 · 2018
Sparse autoencoders find highly interpretable features in language models, 2023
H Cunningham, A Ewart, L Riggs, R Huben, L Sharkey
URL https://arxiv.org/abs/2309.08600, 2023
Cited by 76 · 2023
Taking features out of superposition with sparse autoencoders
L Sharkey, D Braun, B Millidge
AI Alignment Forum 6, 12-13, 2022
Cited by 43 · 2022
Identifying functionally important features with end-to-end sparse dictionary learning
D Braun, J Taylor, N Goldowsky-Dill, L Sharkey
Advances in Neural Information Processing Systems 37, 107286-107325, 2024
Cited by 39 · 2024
Interpreting neural networks through the polytope lens
S Black, L Sharkey, L Grinsztajn, E Winsor, D Braun, J Merizian, K Parker, ...
arXiv preprint arXiv:2211.12312, 2022
Cited by 32 · 2022
Open problems in mechanistic interpretability, 2025
L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...
URL https://arxiv.org/abs/2501.16496, 2025
Cited by 31 · 2025
Sparse autoencoders do not find canonical units of analysis
P Leask, B Bussmann, M Pearce, J Bloom, C Tigges, NA Moubayed, ...
arXiv preprint arXiv:2502.04878, 2025
Cited by 24 · 2025
Addressing feature suppression in SAEs
B Wright, L Sharkey
AI Alignment Forum 6, 2024
Cited by 21 · 2024
Taking features out of superposition with sparse autoencoders. 2022
L Sharkey, D Braun, B Millidge
URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim …, 2023
Cited by 21 · 2023
A causal framework for AI regulation and auditing
L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, C Stix, M Hobbhahn
2024
Cited by 17 · 2024
Objective robustness in deep reinforcement learning
J Koch, L Langosco, J Pfau, J Le, L Sharkey
arXiv preprint arXiv:2105.14111, 2021
Cited by 15 · 2021
Interpretability in parameter space: Minimizing mechanistic description length with attribution-based parameter decomposition
D Braun, L Bushnaq, S Heimersheim, J Mendel, L Sharkey
arXiv preprint arXiv:2501.14926, 2025
Cited by 13 · 2025
Bilinear MLPs enable weight-based mechanistic interpretability
MT Pearce, T Dooms, A Rigg, JM Oramas, L Sharkey
arXiv preprint arXiv:2410.08417, 2024
Cited by 12 · 2024
AI Behind Closed Doors: a Primer on The Governance of Internal Deployment
C Stix, M Pistillo, G Sastry, M Hobbhahn, A Ortega, M Balesni, ...
arXiv preprint arXiv:2504.12170, 2025
Cited by 11 · 2025
Interpretability as compression: Reconsidering SAE explanations of neural activations with MDL-SAEs
K Ayonrinde, MT Pearce, L Sharkey
arXiv preprint arXiv:2410.11179, 2024
Cited by 11 · 2024
Articles 1–20