
Wei Xiong
OpenAI, UIUC
Verified email at illinois.edu
Title · Cited by · Year
RAFT: Reward ranked finetuning for generative foundation model alignment
(α-β), H Dong*, W Xiong*, D Goyal, Y Zhang, W Chow, R Pan, S Diao, ...
TMLR (Invited Presentation @ ICLR 2025), 2023
648* · 2023
Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint
W Xiong*, H Dong*, C Ye*, Z Wang, H Zhong, H Ji, N Jiang, T Zhang
ICML 2024, 2023
353* · 2023
Interpretable preferences via multi-objective reward modeling and mixture-of-experts
H Wang*, W Xiong*, T Xie, H Zhao, T Zhang
ACL 2024, 2024
302 · 2024
RLHF workflow: From reward modeling to online RLHF
(α-β), H Dong*, W Xiong*, B Pang*, H Wang*, H Zhao, Y Zhou, N Jiang, ...
TMLR, 2024
297* · 2024
Mitigating the alignment tax of RLHF
(α-β), Y Lin*, H Lin*, W Xiong*, S Diao*, J Liu, J Zhang, R Pan, H Wang, ...
ACL 2024, 2023
225* · 2023
Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards
H Wang*, Y Lin*, W Xiong*, R Yang, S Diao, S Qiu, H Zhao, T Zhang
ACL 2024, 2024
137 · 2024
DPO meets PPO: Reinforced token optimization for RLHF
H Zhong, Z Shan, G Feng, W Xiong, X Cheng, L Zhao, D He, J Bian, ...
ICML 2025, 2024
99 · 2024
A theoretical analysis of Nash learning from human feedback under general KL-regularized preference
(α-β), C Ye*, W Xiong*, Y Zhang*, N Jiang, T Zhang
NeurIPS 2024, 2024
86* · 2024
A posterior sampling framework for interactive decision making
(α-β), H Zhong*, W Xiong*, S Zheng, L Wang, Z Wang, Z Yang, T Zhang
Mathematics of Operations Research (MOR), 2022
83* · 2022
A minimalist approach to LLM reasoning: From rejection sampling to Reinforce
W Xiong, J Yao, Y Xu, B Pang, L Wang, D Sahoo, J Li, N Jiang, T Zhang, ...
Technical Report, 2025
78 · 2025
Strengthening multimodal large language model with bootstrapped preference optimization
R Pi, T Han, W Xiong, J Zhang, R Liu, R Pan, T Zhang
ECCV 2024, 382-398, 2024
77 · 2024
LMFlow: An extensible toolkit for finetuning and inference of large foundation models
S Diao, R Pan, H Dong, KS Shum, J Zhang, W Xiong, T Zhang
NAACL 2024 (Best Paper Award, Demo Track), 2023
75 · 2023
Building math agents with multi-turn iterative preference learning
W Xiong, C Shi, J Shen, A Rosenberg, Z Qin, D Calandriello, M Khalman, ...
ICLR 2025, 2024
61 · 2024
Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent MDP and Markov game
W Xiong, H Zhong, C Shi, C Shen, L Wang, T Zhang
ICLR 2023, 2022
60 · 2022
Pessimistic minimax value iteration: Provably efficient equilibrium learning from offline datasets
H Zhong, W Xiong, J Tan, L Wang, T Zhang, Z Wang, Z Yang
ICML 2022, 2022
59 · 2022
Maximize to explore: One objective function fusing estimation, planning, and exploration
Z Liu, M Lu, W Xiong, H Zhong, H Hu, S Zhang, S Zheng, Z Yang, Z Wang
NeurIPS 2024, 2024
51* · 2024
Decentralized multi-player multi-armed bandits with no collision information
C Shi, W Xiong, C Shen, J Yang
AISTATS 2020, 2020
45 · 2020
Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes
C Ye, W Xiong, Q Gu, T Zhang
ICML 2023, 2022
42 · 2022
Self-rewarding correction for mathematical reasoning
W Xiong, H Zhang, C Ye, L Chen, N Jiang, T Zhang
arXiv preprint arXiv:2502.19613, 2025
41 · 2025
RRM: Robust reward model training mitigates reward hacking
T Liu, W Xiong, J Ren, L Chen, J Wu, R Joshi, Y Gao, J Shen, Z Qin, T Yu, ...
ICLR 2025, 2024
41 · 2024
Articles 1–20