
Jacob Pfau
NYU, UK AISI
Verified email at nyu.edu - Homepage
Title · Cited by · Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
Cited by 875 · 2023
Artificial intelligence in dermatology: a primer
AT Young, M Xiong, J Pfau, MJ Keiser, ML Wei
Journal of Investigative Dermatology 140 (8), 1504-1512, 2020
Cited by 283 · 2020
Goal misgeneralization in deep reinforcement learning
LL Di Langosco, J Koch, LD Sharkey, J Pfau, D Krueger
International Conference on Machine Learning, 12004-12019, 2022
Cited by 201 · 2022
Let's think dot by dot: Hidden computation in transformer language models
J Pfau, W Merrill, SR Bowman
arXiv preprint arXiv:2404.15758, 2024
Cited by 135 · 2024
Taking AI welfare seriously
R Long, J Sebo, P Butlin, K Finlinson, K Fish, J Harding, J Pfau, T Sims, ...
arXiv preprint arXiv:2411.00986, 2024
Cited by 51 · 2024
Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models
AT Young, K Fernandez, J Pfau, R Reddy, NA Cao, MY von Franque, ...
NPJ digital medicine 4 (1), 10, 2021
Cited by 50 · 2021
Steering without side effects: Improving post-deployment control of language models
AC Stickland, A Lyzhov, J Pfau, S Mahdi, SR Bowman
arXiv preprint arXiv:2406.15518, 2024
Cited by 27 · 2024
Robust Semantic Interpretability: Revisiting Concept Activation Vectors
J Pfau, A Young, J Wei, M Wei, M Keiser
arXiv preprint arXiv:2104.02768, 2020
Cited by 26 · 2020
Artificial intelligence in teledermatology
M Xiong, J Pfau, AT Young, ML Wei
Current Dermatology Reports 8 (3), 85-90, 2019
Cited by 23 · 2019
Self-consistency of large language models under ambiguity
H Bartsch, O Jorgensen, D Rosati, J Hoelscher-Obermaier, J Pfau
arXiv preprint arXiv:2310.13439, 2023
Cited by 20 · 2023
Let’s think dot by dot: Hidden computation in transformer language models, 2024
J Pfau, W Merrill, SR Bowman
URL https://arxiv.org/abs/2404.15758
Cited by 17
Objective robustness in deep reinforcement learning
J Koch, L Langosco, J Pfau, J Le, L Sharkey
arXiv preprint arXiv:2105.14111, 2021
Cited by 15 · 2021
Global saliency: aggregating saliency maps to assess dataset artefact bias
J Pfau, AT Young, ML Wei, MJ Keiser
arXiv preprint arXiv:1910.07604, 2019
Cited by 15 · 2019
Eliciting language model behaviors using reverse language models
J Pfau, A Infanger, A Sheshadri, A Panda, J Michael, C Huebner
Socially Responsible Language Modelling Research, 2023
Cited by 13 · 2023
An alignment safety case sketch based on debate
MD Buhl, J Pfau, B Hilton, G Irving
arXiv preprint arXiv:2505.03989, 2025
Cited by 9 · 2025
When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback
J Pan, R Shar, J Pfau, A Talwalkar, H He, V Chen
arXiv preprint arXiv:2502.18413, 2025
Cited by 7 · 2025
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
CoRR abs/2307.15217, 2023
Cited by 2
An alignment safety case sketch based on debate
M Davidsen Buhl, J Pfau, B Hilton, G Irving
arXiv preprint arXiv:2505.03989, 2025
2025
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
A Cooper Stickland, A Lyzhov, J Pfau, S Mahdi, SR Bowman
arXiv preprint arXiv:2406.15518, 2024
2024
Articles 1–19