
Paul Röttger
Title
Cited by
Year
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
P Röttger, HR Kirk, B Vidgen, G Attanasio, F Bianchi, D Hovy
NAACL 2024 (Main), 2023
Cited by 433
HateCheck: Functional Tests for Hate Speech Detection Models
P Röttger, B Vidgen, D Nguyen, Z Waseem, H Margetts, J Pierrehumbert
ACL 2021 (Main) - 🏆 Stanford HAI AI Audit Challenge, 2021
Cited by 391
The Benefits, Risks and Bounds of Personalizing the Alignment of Large Language Models to Individuals
HR Kirk, B Vidgen, P Röttger, SA Hale
Nature Machine Intelligence, 2024
Cited by 358*
Safety-Tuned Llamas: Lessons from Improving the Safety of Large Language Models that Follow Instructions
F Bianchi, M Suzgun, G Attanasio, P Röttger, D Jurafsky, T Hashimoto, ...
ICLR 2024 (Poster), 2023
Cited by 332
The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals about the Subjective and Multicultural Alignment of Large Language Models
HR Kirk, A Whitefield, P Röttger, AM Bean, K Margatina, R Mosquera, ...
NeurIPS 2024 (Oral) - 🏆 Best Paper (Datasets & Benchmarks), 2024
Cited by 237*
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
P Röttger, B Vidgen, D Hovy, JB Pierrehumbert
NAACL 2022 (Main), 2022
Cited by 230
SemEval-2023 Task 10: Explainable Detection of Online Sexism
HR Kirk, W Yin, B Vidgen, P Röttger
ACL 2023 (Main) - 🏆 Best Task Paper, 2023
Cited by 192
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
P Röttger, V Hofmann, V Pyatkin, M Hinck, HR Kirk, H Schütze, D Hovy
ACL 2024 (Main) - 🏆 Outstanding Paper, 2024
Cited by 175
"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
X Wang, B Ma, C Hu, L Weber-Genzel, P Röttger, F Kreuter, D Hovy, ...
ACL 2024 (Findings), 2024
Cited by 112
Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media
P Röttger, JB Pierrehumbert
EMNLP 2021 (Findings), 2021
Cited by 99
Hatemoji: A test suite and adversarially-generated dataset for benchmarking and detecting emoji-based hate
HR Kirk, B Vidgen, P Röttger, T Thrush, SA Hale
NAACL 2022 (Main), 2021
Cited by 86
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models
P Röttger, H Seelawi, D Nozza, Z Talat, B Vidgen
WOAH at NAACL 2022, 2022
Cited by 79
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
B Vidgen, HR Kirk, R Qian, N Scherrer, A Kannappan, SA Hale, P Röttger
arXiv, 2023
Cited by 75
Introducing v0.5 of the AI Safety Benchmark from MLCommons
B Vidgen, A Agrawal, AM Ahmed, V Akinwande, N Al-Nuaimi, N Alfaraj, ...
arXiv, 2024
Cited by 72
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values
HR Kirk, AM Bean, B Vidgen, P Röttger, SA Hale
EMNLP 2023 (Main), 2023
Cited by 71
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
P Röttger, F Pernisi, B Vidgen, D Hovy
AAAI 2025, 2024
Cited by 68
The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics
M Orlikowski, P Röttger, P Cimiano, D Hovy
ACL 2023 (Main), 2023
Cited by 50
Scaling Language Model Size Yields Diminishing Returns for Single-Message Political Persuasion
K Hackenburg, B Tappin, P Röttger, S Hale, J Bright, H Margetts
PNAS, 2025
Cited by 48*
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think
X Wang, C Hu, B Ma, P Röttger, B Plank
COLM 2024, 2024
Cited by 38
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
C Holtermann*, P Röttger*, T Dill, A Lauscher
ACL 2024 (Findings), 2024
Cited by 35
Articles 1–20