| ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools T GLM, A Zeng, B Xu, B Wang, C Zhang, D Yin, D Zhang, D Rojas, G Feng, ... arXiv preprint arXiv:2406.12793, 2024 | 1397* | 2024 |
| Survey on factuality in large language models C Wang, X Liu, Y Yue, Q Guo, X Hu, X Tang, T Zhang, C Jiayang, Y Yao, ... ACM Computing Surveys 58 (1), 1-37, 2025 | 351* | 2025 |
| Knowledge conflicts for LLMs: A survey R Xu, Z Qi, Z Guo, C Wang, H Wang, Y Zhang, W Xu EMNLP 2024, 2024 | 229 | 2024 |
| WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning Z Qi, X Liu, IL Iong, H Lai, X Sun, W Zhao, Y Yang, X Yang, J Sun, S Yao, ... ICLR 2025, 2024 | 117* | 2024 |
| VisualAgentBench: Towards large multimodal models as visual foundation agents X Liu, T Zhang, Y Gu, IL Iong, Y Xu, X Song, S Zhang, H Lai, X Liu, H Zhao, ... ICLR 2025, 2024 | 74* | 2024 |
| AutoGLM: Autonomous foundation agents for GUIs X Liu, B Qin, D Liang, G Dong, H Lai, H Zhang, H Zhao, IL Iong, J Sun, ... arXiv preprint arXiv:2411.00820, 2024 | 70* | 2024 |
| MR-Ben: A meta-reasoning benchmark for evaluating system-2 thinking in LLMs Z Zeng, Y Liu, Y Wan, J Li, P Chen, J Dai, Y Yao, R Xu, Z Qi, W Zhao, ... NeurIPS 2024, 2024 | 43* | 2024 |
| NaturalCodeBench: Examining coding performance mismatch on HumanEval and natural user prompts S Zhang, H Zhao, X Liu, Q Zheng, Z Qi, X Gu, X Zhang, Y Dong, J Tang arXiv preprint arXiv:2405.04520, 2024 | 36* | 2024 |
| LongRAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall Z Qi, R Xu, Z Guo, C Wang, H Zhang, W Xu arXiv preprint arXiv:2410.23000, 2024 | 26 | 2024 |
| Walking in Others' Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias R Xu, Z Zhou, T Zhang, Z Qi, S Yao, K Xu, W Xu, H Qiu arXiv preprint arXiv:2407.15366, 2024 | 24 | 2024 |
| Preemptive answer "attacks" on chain-of-thought reasoning R Xu, Z Qi, W Xu arXiv preprint arXiv:2405.20902, 2024 | 21 | 2024 |
| A survey of post-training scaling in large language models H Lai, X Liu, J Gao, J Cheng, Z Qi, Y Xu, S Yao, D Zhang, J Du, Z Hou, ... Proceedings of the 63rd Annual Meeting of the Association for Computational …, 2025 | 16 | 2025 |
| DebateQA: Evaluating question answering on debatable knowledge R Xu, X Qi, Z Qi, W Xu, Z Guo arXiv preprint arXiv:2408.01419, 2024 | 14 | 2024 |
| Bias and volatility: A statistical framework for evaluating large language model stereotypes and generation inconsistency Y Liu, K Yang, Z Qi, X Liu, Y Yu, CX Zhai Advances in Neural Information Processing Systems (NeurIPS), 2024 | 11* | 2024 |
| AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework H Zhang, X Liu, B Lv, X Sun, B Jing, IL Iong, Z Hou, Z Qi, H Lai, Y Xu, R Lu, ... arXiv preprint arXiv:2510.04206, 2025 | 1 | 2025 |