| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Measuring Massive Multitask Language Understanding | D Hendrycks, C Burns, S Basart, A Zou, M Mazeika, D Song, J Steinhardt | arXiv preprint arXiv:2009.03300 | 6929 | 2020 |
| Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models | A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ... | arXiv preprint arXiv:2206.04615 | 2210 | 2022 |
| Deep Anomaly Detection with Outlier Exposure | D Hendrycks, M Mazeika, T Dietterich | arXiv preprint arXiv:1812.04606 | 2192 | 2018 |
| Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty | D Hendrycks, M Mazeika, S Kadavath, D Song | Advances in Neural Information Processing Systems 32 | 1317 | 2019 |
| Using Pre-Training Can Improve Model Robustness and Uncertainty | D Hendrycks, K Lee, M Mazeika | International Conference on Machine Learning, 2712-2721 | 1061 | 2019 |
| Measuring Coding Challenge Competence with APPS | D Hendrycks, S Basart, S Kadavath, M Mazeika, A Arora, E Guo, C Burns, ... | arXiv preprint arXiv:2105.09938 | 974 | 2021 |
| Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise | D Hendrycks, M Mazeika, D Wilson, K Gimpel | Advances in Neural Information Processing Systems 31 | 751 | 2018 |
| Scaling Out-of-Distribution Detection for Real-World Settings | D Hendrycks, S Basart, M Mazeika, A Zou, J Kwon, M Mostajabi, ... | arXiv preprint arXiv:1911.11132 | 733 | 2019 |
| HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ... | arXiv preprint arXiv:2402.04249 | 713 | 2024 |
| DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models | B Wang, W Chen, H Pei, C Xie, M Kang, C Zhang, C Xu, Z Xiong, R Dutta, ... | NeurIPS | 686 | 2023 |
| Representation Engineering: A Top-Down Approach to AI Transparency | A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ... | arXiv preprint arXiv:2310.01405 | 674 | 2023 |
| An Overview of Catastrophic AI Risks | D Hendrycks, M Mazeika, T Woodside | arXiv preprint arXiv:2306.12001 | 371 | 2023 |
| The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning | N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ... | arXiv preprint arXiv:2403.03218 | 337 | 2024 |
| Humanity's Last Exam | L Phan, A Gatti, Z Han, N Li, J Hu, H Zhang, CBC Zhang, M Shaaban, ... | arXiv preprint arXiv:2501.14249 | 305 | 2025 |
| PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures | D Hendrycks, A Zou, M Mazeika, L Tang, B Li, D Song, J Steinhardt | Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern … | 203 | 2022 |
| International AI Safety Report | Y Bengio, S Mindermann, D Privitera, T Besiroglu, R Bommasani, ... | arXiv preprint arXiv:2501.17805 | 122 | 2025 |
| X-Risk Analysis for AI Research | D Hendrycks, M Mazeika | arXiv preprint arXiv:2206.05862 | 118 | 2022 |
| Tamper-Resistant Safeguards for Open-Weight LLMs | R Tamirisa, B Bharathi, L Phan, A Zhou, A Gatti, T Suresh, M Lin, J Wang, ... | arXiv preprint arXiv:2408.00761 | 86 | 2024 |
| What Would Jiminy Cricket Do? Towards Agents That Behave Morally | D Hendrycks, M Mazeika, A Zou, S Patel, C Zhu, J Navarro, D Song, B Li, ... | arXiv preprint arXiv:2110.13136 | 85 | 2021 |
| A Benchmark for Anomaly Segmentation | D Hendrycks, S Basart, M Mazeika, M Mostajabi, J Steinhardt, D Song | | 81 | 2019 |