| LoRA: Low-rank adaptation of large language models. EJ Hu, Y Shen, P Wallis, Z Allen-Zhu, Y Li, S Wang, L Wang, W Chen. ICLR, 2022 | 25622* | 2022 |
| Sparks of artificial general intelligence: Early experiments with GPT-4. S Bubeck, V Chandrasekaran, R Eldan, J Gehrke, E Horvitz, E Kamar, ... arXiv preprint arXiv:2303.12712, 2023 | 5489 | 2023 |
| Phi-4 technical report. M Abdin, J Aneja, H Behl, S Bubeck, R Eldan, S Gunasekar, M Harrison, ... arXiv preprint arXiv:2412.08905, 2024 | 2761 | 2024 |
| A convergence theory for deep learning via over-parameterization. Z Allen-Zhu, Y Li, Z Song. International Conference on Machine Learning, 242-252, 2019 | 1985 | 2019 |
| Learning and generalization in overparameterized neural networks, going beyond two layers. Z Allen-Zhu, Y Li, Y Liang. Advances in Neural Information Processing Systems 32, 2019 | 1039 | 2019 |
| A theoretical analysis of NDCG type ranking measures. Y Wang, L Wang, Y Li, D He, TY Liu. Conference on Learning Theory, 25-54, 2013 | 953 | 2013 |
| Textbooks are all you need. S Gunasekar, Y Zhang, J Aneja, CCT Mendes, A Del Giorno, S Gopi, ... arXiv preprint arXiv:2306.11644, 2023 | 952 | 2023 |
| Convergence analysis of two-layer neural networks with ReLU activation. Y Li, Y Yuan. Advances in Neural Information Processing Systems 30, 2017 | 889 | 2017 |
| Learning overparameterized neural networks via stochastic gradient descent on structured data. Y Li, Y Liang. Advances in Neural Information Processing Systems 31, 2018 | 838 | 2018 |
| Textbooks are all you need II: phi-1.5 technical report. Y Li, S Bubeck, R Eldan, A Del Giorno, S Gunasekar, YT Lee. arXiv preprint arXiv:2309.05463, 2023 | 695 | 2023 |
| A latent variable model approach to PMI-based word embeddings. S Arora, Y Li, Y Liang, T Ma, A Risteski. Transactions of the Association for Computational Linguistics 4, 385-399, 2016 | 694* | 2016 |
| Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. Z Allen-Zhu, Y Li. arXiv preprint arXiv:2012.09816, 2020 | 619 | 2020 |
| Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. H Nori, YT Lee, S Zhang, D Carignan, R Edgar, N Fusi, N King, J Larson, ... arXiv preprint arXiv:2311.16452, 2023 | 535 | 2023 |
| Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. S Chen, S Chewi, J Li, Y Li, A Salim, AR Zhang. arXiv preprint arXiv:2209.11215, 2022 | 481 | 2022 |
| Towards explaining the regularization effect of initial large learning rate in training neural networks. Y Li, C Wei, T Ma. Advances in Neural Information Processing Systems 32, 2019 | 432 | 2019 |
| An alternative view: When does SGD escape local minima? B Kleinberg, Y Li, Y Yuan. International Conference on Machine Learning, 2698-2707, 2018 | 430 | 2018 |
| Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. Y Li, T Ma, H Zhang. Conference on Learning Theory, 2-47, 2018 | 425 | 2018 |
| Gradient descent on neural networks typically occurs at the edge of stability. JM Cohen, S Kaur, Y Li, JZ Kolter, A Talwalkar. arXiv preprint arXiv:2103.00065, 2021 | 419 | 2021 |
| Phi-2: The surprising power of small language models. M Javaheripi, S Bubeck, M Abdin, J Aneja, CCT Mendes, ... Microsoft Research Blog, 2023 | 401 | 2023 |
| TinyStories: How small can language models be and still speak coherent English? R Eldan, Y Li. arXiv preprint arXiv:2305.07759, 2023 | 380 | 2023 |