Mantas Mazeika
Center for AI Safety
Verified email at illinois.edu
Title
Cited by
Year
Measuring massive multitask language understanding
D Hendrycks, C Burns, S Basart, A Zou, M Mazeika, D Song, J Steinhardt
arXiv preprint arXiv:2009.03300, 2020
6929 · 2020
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ...
arXiv preprint arXiv:2206.04615, 2022
2210 · 2022
Deep anomaly detection with outlier exposure
D Hendrycks, M Mazeika, T Dietterich
arXiv preprint arXiv:1812.04606, 2018
2192 · 2018
Using self-supervised learning can improve model robustness and uncertainty
D Hendrycks, M Mazeika, S Kadavath, D Song
Advances in neural information processing systems 32, 2019
1317 · 2019
Using pre-training can improve model robustness and uncertainty
D Hendrycks, K Lee, M Mazeika
International Conference on Machine Learning, 2712-2721, 2019
1061 · 2019
Measuring coding challenge competence with APPS
D Hendrycks, S Basart, S Kadavath, M Mazeika, A Arora, E Guo, C Burns, ...
arXiv preprint arXiv:2105.09938, 2021
974 · 2021
Using trusted data to train deep networks on labels corrupted by severe noise
D Hendrycks, M Mazeika, D Wilson, K Gimpel
Advances in neural information processing systems 31, 2018
751 · 2018
Scaling out-of-distribution detection for real-world settings
D Hendrycks, S Basart, M Mazeika, A Zou, J Kwon, M Mostajabi, ...
arXiv preprint arXiv:1911.11132, 2019
733 · 2019
HarmBench: A standardized evaluation framework for automated red teaming and robust refusal
M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...
arXiv preprint arXiv:2402.04249, 2024
713 · 2024
DecodingTrust: A comprehensive assessment of trustworthiness in GPT models
B Wang, W Chen, H Pei, C Xie, M Kang, C Zhang, C Xu, Z Xiong, R Dutta, ...
NeurIPS, 2023
686 · 2023
Representation engineering: A top-down approach to AI transparency
A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...
arXiv preprint arXiv:2310.01405, 2023
674 · 2023
An overview of catastrophic AI risks
D Hendrycks, M Mazeika, T Woodside
arXiv preprint arXiv:2306.12001, 2023
371 · 2023
The WMDP benchmark: Measuring and reducing malicious use with unlearning
N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...
arXiv preprint arXiv:2403.03218, 2024
337 · 2024
Humanity's Last Exam
L Phan, A Gatti, Z Han, N Li, J Hu, H Zhang, CBC Zhang, M Shaaban, ...
arXiv preprint arXiv:2501.14249, 2025
305 · 2025
PixMix: Dreamlike pictures comprehensively improve safety measures
D Hendrycks, A Zou, M Mazeika, L Tang, B Li, D Song, J Steinhardt
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern …, 2022
203 · 2022
International AI Safety Report
Y Bengio, S Mindermann, D Privitera, T Besiroglu, R Bommasani, ...
arXiv preprint arXiv:2501.17805, 2025
122 · 2025
X-risk analysis for AI research
D Hendrycks, M Mazeika
arXiv preprint arXiv:2206.05862, 2022
118 · 2022
Tamper-Resistant Safeguards for Open-Weight LLMs
R Tamirisa, B Bharathi, L Phan, A Zhou, A Gatti, T Suresh, M Lin, J Wang, ...
arXiv preprint arXiv:2408.00761, 2024
86 · 2024
What would Jiminy Cricket do? Towards agents that behave morally
D Hendrycks, M Mazeika, A Zou, S Patel, C Zhu, J Navarro, D Song, B Li, ...
arXiv preprint arXiv:2110.13136, 2021
85 · 2021
A benchmark for anomaly segmentation
D Hendrycks, S Basart, M Mazeika, M Mostajabi, J Steinhardt, D Song
81 · 2019
Articles 1–20