Javier Rando

Cited by

	All	Since 2021
Citations	3092	3087
h-index	17	17
i10-index	20	20

Citations per year: 2022 (10), 2023 (147), 2024 (1084), 2025 (1758), 2026 (64)

Public access

2 articles

0 articles

available

not available

Based on funding mandates

Co-authors

Florian Tramèr, Assistant Professor of Computer Science, ETH Zurich. Verified email at inf.ethz.ch
Nicholas Carlini, Anthropic. Verified email at anthropic.com
Daniel Paleka, ETH Zurich. Verified email at inf.ethz.ch
Stephen Casper, PhD student, MIT. Verified email at mit.edu
Fernando Perez-Cruz, Sr Adviser, Innovation at Bank for International Settlements. Verified email at bis.org
He He, New York University. Verified email at cs.nyu.edu

Javier Rando

Other names: Javier Rando Ramirez

Anthropic

Verified email at anthropic.com

Artificial Intelligence, Language Models, Safety, Security, Privacy


Title	Cited by	Year
Open problems and fundamental limitations of reinforcement learning from human feedback; S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...; Transactions on Machine Learning Research (TMLR). Outstanding Finalist 🏆, 2023	914	2023
Scalable Extraction of Training Data from Aligned, Production Language Models; M Nasr, J Rando, N Carlini, J Hayase, M Jagielski, AF Cooper, ...; International Conference on Learning Representations (ICLR), 2025	649*	2025
Red-Teaming the Stable Diffusion Safety Filter; J Rando, D Paleka, D Lindner, L Heim, F Tramèr; ML Safety Workshop at NeurIPS. Best Paper Award 🏆, 2022	304	2022
Foundational challenges in assuring alignment and safety of large language models; U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ...; Transactions on Machine Learning Research (TMLR), 2024	296	2024
Scalable and transferable black-box jailbreaks for language models via persona modulation; R Shah, S Pour, A Tagade, S Casper, J Rando; SoLaR Workshop at NeurIPS, 2023	198	2023
Universal Jailbreak Backdoors from Poisoned Human Feedback; J Rando, F Tramèr; International Conference on Learning Representations (ICLR), 2023	136	2023
An Adversarial Perspective on Machine Unlearning for AI Safety; J Łucki, B Wei, Y Huang, P Henderson, F Tramèr, J Rando; SoLaR Workshop at NeurIPS. Best Technical Paper 🏆, 2024	99	2024
Llama guard 3 vision: Safeguarding human-ai image understanding conversations; J Chi, U Karn, H Zhan, E Smith, J Rando, Y Zhang, K Plawiak, ZD Coudert, ...; arXiv preprint arXiv:2411.10414, 2024	85	2024
Attributions toward artificial agents in a modified Moral Turing Test; E Aharoni, S Fernandes, DJ Brady, C Alexander, M Criner, K Queen, ...; Scientific reports 14 (1), 8458, 2024	62	2024
Passgpt: Password modeling and (guided) generation with large language models; J Rando, F Perez-Cruz, B Hitaj; European Symposium on Research in Computer Security, 164-183, 2023	47	2023
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks; E Mosca, S Agarwal, J Rando-Ramirez, G Groh; Annual Meeting of the Association for Computational Linguistics (ACL), 2022	47	2022
Personas as a Way to Model Truthfulness in Language Models; N Joshi, J Rando, A Saparov, N Kim, H He; Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023	45	2023
Persistent Pre-Training Poisoning of LLMs; Y Zhang, J Rando, I Evtimov, J Chi, EM Smith, N Carlini, F Tramèr, ...; International Conference on Learning Representations (ICLR), 2024	36	2024
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI; R Hönig, J Rando, N Carlini, F Tramèr; International Conference on Learning Representations (ICLR). Spotlight 🏆, 2024	33	2024
Competition report: Finding universal jailbreak backdoors in aligned llms; J Rando, F Croce, K Mitka, S Shabalin, M Andriushchenko, N Flammarion, ...; arXiv preprint arXiv:2404.14461, 2024	29*	2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition; E Debenedetti, J Rando, D Paleka*, SF Florin, D Albastroiu, N Cohen, ...; NeurIPS Datasets and Benchmarks. Spotlight 🏆, 2024	23	2024
Adversarial ml problems are getting harder to solve and to evaluate; J Rando, J Zhang, N Carlini, F Tramèr; arXiv preprint arXiv:2502.02260, 2025	19	2025
Measuring Non-Adversarial Reproduction of Training Data in Large Language Models; M Aerni, J Rando, E Debenedetti, N Carlini, D Ippolito, F Tramèr; International Conference on Learning Representations (ICLR), 2024	15	2024
Poisoning attacks on LLMs require a near-constant number of poison samples; A Souly, J Rando, E Chapman, X Davies, B Hasircioglu, E Shereen, ...; arXiv preprint arXiv:2510.07192, 2025	14	2025
Uneven coverage of natural disasters in Wikipedia: The case of floods; V Lorini, J Rando, D Saez-Trumper, C Castillo; ISCRAM 2020, 2020	14	2020
