| Open problems and fundamental limitations of reinforcement learning from human feedback S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint arXiv:2307.15217, 2023 | 875 | 2023 |
| Artificial intelligence in dermatology: a primer AT Young, M Xiong, J Pfau, MJ Keiser, ML Wei Journal of Investigative Dermatology 140 (8), 1504-1512, 2020 | 283 | 2020 |
| Goal misgeneralization in deep reinforcement learning LL Di Langosco, J Koch, LD Sharkey, J Pfau, D Krueger International Conference on Machine Learning, 12004-12019, 2022 | 201 | 2022 |
| Let's think dot by dot: Hidden computation in transformer language models J Pfau, W Merrill, SR Bowman arXiv preprint arXiv:2404.15758, 2024 | 135 | 2024 |
| Taking AI welfare seriously R Long, J Sebo, P Butlin, K Finlinson, K Fish, J Harding, J Pfau, T Sims, ... arXiv preprint arXiv:2411.00986, 2024 | 51 | 2024 |
| Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models AT Young, K Fernandez, J Pfau, R Reddy, NA Cao, MY von Franque, ... npj Digital Medicine 4 (1), 10, 2021 | 50 | 2021 |
| Steering without side effects: Improving post-deployment control of language models AC Stickland, A Lyzhov, J Pfau, S Mahdi, SR Bowman arXiv preprint arXiv:2406.15518, 2024 | 27 | 2024 |
| Robust Semantic Interpretability: Revisiting Concept Activation Vectors J Pfau, A Young, J Wei, M Wei, M Keiser arXiv preprint arXiv:2104.02768, 2020 | 26 | 2020 |
| Artificial intelligence in teledermatology M Xiong, J Pfau, AT Young, ML Wei Current Dermatology Reports 8 (3), 85-90, 2019 | 23 | 2019 |
| Self-consistency of large language models under ambiguity H Bartsch, O Jorgensen, D Rosati, J Hoelscher-Obermaier, J Pfau arXiv preprint arXiv:2310.13439, 2023 | 20 | 2023 |
| Objective robustness in deep reinforcement learning J Koch, L Langosco, J Pfau, J Le, L Sharkey arXiv preprint arXiv:2105.14111, 2021 | 15 | 2021 |
| Global saliency: aggregating saliency maps to assess dataset artefact bias J Pfau, AT Young, ML Wei, MJ Keiser arXiv preprint arXiv:1910.07604, 2019 | 15 | 2019 |
| Eliciting language model behaviors using reverse language models J Pfau, A Infanger, A Sheshadri, A Panda, J Michael, C Huebner Socially Responsible Language Modelling Research, 2023 | 13 | 2023 |
| An alignment safety case sketch based on debate MD Buhl, J Pfau, B Hilton, G Irving arXiv preprint arXiv:2505.03989, 2025 | 9 | 2025 |
| When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback J Pan, R Shar, J Pfau, A Talwalkar, H He, V Chen arXiv preprint arXiv:2502.18413, 2025 | 7 | 2025 |