| Open problems and fundamental limitations of reinforcement learning from human feedback S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... Transactions on Machine Learning Research (TMLR). Outstanding Finalist 🏆, 2023 | 914 | 2023 |
| Scalable Extraction of Training Data from Aligned, Production Language Models M Nasr*, J Rando*, N Carlini, J Hayase, M Jagielski, AF Cooper, ... International Conference on Learning Representations (ICLR), 2025 | 649* | 2025 |
| Red-Teaming the Stable Diffusion Safety Filter J Rando, D Paleka, D Lindner, L Heim, F Tramèr ML Safety Workshop at NeurIPS. Best Paper Award 🏆, 2022 | 304 | 2022 |
| Foundational challenges in assuring alignment and safety of large language models U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... Transactions on Machine Learning Research (TMLR), 2024 | 296 | 2024 |
| Scalable and transferable black-box jailbreaks for language models via persona modulation R Shah, S Pour, A Tagade, S Casper, J Rando SoLaR Workshop at NeurIPS, 2023 | 198 | 2023 |
| Universal Jailbreak Backdoors from Poisoned Human Feedback J Rando, F Tramèr International Conference on Learning Representations (ICLR), 2023 | 136 | 2023 |
| An Adversarial Perspective on Machine Unlearning for AI Safety J Łucki, B Wei, Y Huang, P Henderson, F Tramèr, J Rando SoLaR Workshop at NeurIPS. Best Technical Paper 🏆, 2024 | 99 | 2024 |
| Llama Guard 3 Vision: Safeguarding human-AI image understanding conversations J Chi, U Karn, H Zhan, E Smith, J Rando, Y Zhang, K Plawiak, ZD Coudert, ... arXiv preprint arXiv:2411.10414, 2024 | 85 | 2024 |
| Attributions toward artificial agents in a modified Moral Turing Test E Aharoni, S Fernandes, DJ Brady, C Alexander, M Criner, K Queen, ... Scientific Reports 14 (1), 8458, 2024 | 62 | 2024 |
| PassGPT: Password modeling and (guided) generation with large language models J Rando, F Perez-Cruz, B Hitaj European Symposium on Research in Computer Security, 164-183, 2023 | 47 | 2023 |
| "That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks E Mosca, S Agarwal, J Rando-Ramirez, G Groh Annual Meeting of the Association for Computational Linguistics (ACL), 2022 | 47 | 2022 |
| Personas as a Way to Model Truthfulness in Language Models N Joshi*, J Rando*, A Saparov, N Kim, H He Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023 | 45 | 2023 |
| Persistent Pre-Training Poisoning of LLMs Y Zhang*, J Rando*, I Evtimov, J Chi, EM Smith, N Carlini, F Tramèr, ... International Conference on Learning Representations (ICLR), 2024 | 36 | 2024 |
| Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI R Hönig, J Rando, N Carlini, F Tramèr International Conference on Learning Representations (ICLR). Spotlight 🏆, 2024 | 33 | 2024 |
| Competition report: Finding universal jailbreak backdoors in aligned LLMs J Rando, F Croce, K Mitka, S Shabalin, M Andriushchenko, N Flammarion, ... arXiv preprint arXiv:2404.14461, 2024 | 29* | 2024 |
| Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition E Debenedetti*, J Rando*, D Paleka*, SF Florin, D Albastroiu, N Cohen, ... NeurIPS Datasets and Benchmarks. Spotlight 🏆, 2024 | 23 | 2024 |
| Adversarial ML problems are getting harder to solve and to evaluate J Rando, J Zhang, N Carlini, F Tramèr arXiv preprint arXiv:2502.02260, 2025 | 19 | 2025 |
| Measuring Non-Adversarial Reproduction of Training Data in Large Language Models M Aerni*, J Rando*, E Debenedetti, N Carlini, D Ippolito, F Tramèr International Conference on Learning Representations (ICLR), 2024 | 15 | 2024 |
| Poisoning attacks on LLMs require a near-constant number of poison samples A Souly, J Rando, E Chapman, X Davies, B Hasircioglu, E Shereen, ... arXiv preprint arXiv:2510.07192, 2025 | 14 | 2025 |
| Uneven coverage of natural disasters in Wikipedia: The case of floods V Lorini, J Rando, D Saez-Trumper, C Castillo ISCRAM 2020, 2020 | 14 | 2020 |