Experience
Shanghai AI Laboratory · OpenDILab · Research Engineer
2023.04 — Present
SenseTime · Research Institute · RL Algorithm Researcher
2021.07 — 2023.04

Projects & Research
LightRFT: Lightweight, All-Modality & Reward-Model-Driven RL Fine-Tuning Framework
2025.03 — Present
  • As a core member, built the underlying training infrastructure for SafeWork-R1, a 100+ GPU multimodal LLM safety-alignment project. Designed a unified Strategy abstraction layer enabling flexible switching and efficient co-location of heterogeneous tasks (training via DeepSpeed/FSDP, inference via vLLM/SGLang, verification via rules/models) under a single control plane, significantly improving resource utilization and ease of use when co-training multimodal foundation models with reward models.
  • Leading research on algorithmic mechanisms in the off-policy setting, tackling the sample-efficiency and training-stability challenges of RL fine-tuning; also exploring reward-model-driven RFT algorithms to improve model alignment.
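The Strategy abstraction described above can be sketched minimally as follows. This is an illustrative reduction, not the actual LightRFT API: the class and method names (`Strategy`, `ControlPlane`, `dispatch`) are hypothetical, and the real backends (DeepSpeed/FSDP, vLLM/SGLang, verifiers) are stubbed with string-tagging placeholders.

```python
from abc import ABC, abstractmethod

class Strategy(ABC):
    """A heterogeneous backend (train / infer / verify) behind one interface."""

    @abstractmethod
    def execute(self, batch: list) -> list:
        ...

class TrainStrategy(Strategy):
    # Stand-in for a DeepSpeed/FSDP training step.
    def execute(self, batch):
        return [f"trained:{x}" for x in batch]

class InferStrategy(Strategy):
    # Stand-in for a vLLM/SGLang generation call.
    def execute(self, batch):
        return [f"generated:{x}" for x in batch]

class VerifyStrategy(Strategy):
    # Stand-in for rule- or model-based verification.
    def execute(self, batch):
        return [f"verified:{x}" for x in batch]

class ControlPlane:
    """Routes work to registered strategies. Co-location means one process
    owns all backends and switches between them phase by phase."""

    def __init__(self):
        self._strategies: dict[str, Strategy] = {}

    def register(self, name: str, strategy: Strategy) -> None:
        self._strategies[name] = strategy

    def dispatch(self, name: str, batch: list) -> list:
        return self._strategies[name].execute(batch)

plane = ControlPlane()
plane.register("train", TrainStrategy())
plane.register("infer", InferStrategy())
plane.register("verify", VerifyStrategy())
print(plane.dispatch("infer", ["prompt-0"]))  # ['generated:prompt-0']
```

The design point is that the RL fine-tuning loop only talks to `dispatch`, so swapping a training or inference engine never touches the loop itself.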
PriorZero: LLM/VLM & MCTS Synergistic Decision Architecture
2025.06 — Present
  • Exploring a dual-model synergistic architecture combining LLM/VLM with MCTS: the large model's commonsense reasoning supplies high-quality priors for search, while MCTS planning results in latent space feed back into LLM training. The goal is a closed-loop "reasoning-planning" co-evolution system for complex decision-making under long-horizon dependencies and sparse rewards.
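Where an LLM/VLM prior enters the search can be shown with the standard AlphaZero-style PUCT rule (a generic sketch under that assumption, not PriorZero's actual selection code): the prior weights the exploration bonus at each tree node.

```python
import math

def puct_score(prior, q_value, child_visits, parent_visits, c_puct=1.25):
    """PUCT score: exploitation (Q) plus an exploration bonus weighted by
    the action prior — here assumed to come from an LLM/VLM rather than a
    small learned policy head."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration

def select_action(priors, q_values, visit_counts):
    """Pick the action maximizing the PUCT score at one tree node."""
    parent_visits = max(1, sum(visit_counts))
    scores = [
        puct_score(p, q, n, parent_visits)
        for p, q, n in zip(priors, q_values, visit_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# With no visits and uniform Q, the prior alone drives the first selection.
print(select_action([0.1, 0.8, 0.1], [0.0, 0.0, 0.0], [0, 0, 0]))  # 1
```

As visit counts grow, the `1 + child_visits` denominator discounts the prior's influence, so the search can overrule a misleading LLM suggestion.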
LightZero: A Unified MCTS+RL Benchmark for General Sequential Decision-Making
2023.04 — Present
  • Project lead; NeurIPS 2023 Spotlight. The framework standardizes the MCTS+RL algorithm family, integrating 10+ mainstream algorithms including AlphaZero and MuZero, and is currently the most comprehensive open-source MCTS algorithm library.
  • Designed a lightweight, modular architecture in hybrid Python/C++ (core MCTS in C++). Multi-process parallel data collection and DDP distributed training significantly improved training efficiency and scalability across complex environments such as Atari, DMC, and board games.
ScaleZero: Scalable Heterogeneous Multi-Task Learning
2024.05 — 2026.03
  • ICLR 2026 ScaleZero: Systematically optimized world-model design to address the plasticity and efficiency bottlenecks of large-scale heterogeneous multi-task learning. Relying solely on online latent-space planning, surpassed dedicated single-task models on benchmarks spanning Atari, DMC, and Jericho. Designed a LoRA-based online dynamic parameter-expansion strategy that adaptively allocates compute as each task's training progresses, matching single-task baseline performance on DMC with only ~75% of the interaction steps.
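One way LoRA-based online parameter expansion can work is sketched below (illustrative only; the class name and expansion policy are hypothetical, not the ScaleZero implementation): the base weight stays frozen, and growing the adapter rank adds trainable capacity while zero-initialized new B-columns keep the layer's function unchanged at the moment of expansion.

```python
import torch
import torch.nn as nn

class ExpandableLoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank adapter whose rank can be
    grown online as a task's training demands more capacity."""

    def __init__(self, in_features: int, out_features: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # A is small-random, B is zero, so the adapter starts as a no-op.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

    @torch.no_grad()
    def expand_rank(self, extra: int):
        """Append `extra` rank dimensions; zero B-columns preserve the
        current input-output mapping exactly."""
        a_new = torch.randn(extra, self.lora_a.shape[1]) * 0.01
        b_new = torch.zeros(self.lora_b.shape[0], extra)
        self.lora_a = nn.Parameter(torch.cat([self.lora_a.data, a_new], dim=0))
        self.lora_b = nn.Parameter(torch.cat([self.lora_b.data, b_new], dim=1))
```

A scheduler could call `expand_rank` when a task's loss plateaus, which is one plausible reading of "adaptively allocates compute as training progresses."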
General & Efficient MCTS+RL Algorithm Research
2023.09 — 2024.09
  • TMLR 2025 UniZero: Independently proposed a modular Transformer-based world-model architecture. By constructing a shared latent space and jointly optimizing long-horizon dynamics prediction with decision objectives, it resolves the generalization difficulties of MCTS algorithms in heterogeneous state-action spaces and long-horizon tasks, achieving strong results on both single- and multi-task benchmarks.
  • CoRL 2025 RemembeRL Workshop ReZero: Co-proposed a backward-view reanalyze mechanism that improves the wall-clock training efficiency of MCTS-style algorithms by 1.5-2x while preserving sample efficiency.
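The latent world-model interface that MuZero-style methods such as UniZero build on can be reduced to three functions (a generic sketch of the paradigm, not UniZero's architecture; the dataclass and field names are hypothetical): planning never touches raw observations after the initial encoding.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class LatentWorldModel:
    """Minimal MuZero/UniZero-style interface: encode once, then unroll
    dynamics and make predictions entirely in latent space."""
    represent: Callable[[Any], Any]        # observation -> latent state
    dynamics: Callable[[Any, Any], tuple]  # (latent, action) -> (next latent, reward)
    predict: Callable[[Any], tuple]        # latent -> (policy, value)

    def rollout(self, obs, actions: Sequence):
        """Imagine a trajectory in latent space and accumulate predicted reward."""
        state = self.represent(obs)
        total_reward = 0.0
        for action in actions:
            state, reward = self.dynamics(state, action)
            total_reward += reward
        return state, total_reward
```

With toy scalar functions in place of networks, `LatentWorldModel(lambda o: o + 1, lambda s, a: (s + a, 1.0), lambda s: ([1.0], 0.0)).rollout(0, [1, 2])` imagines two steps without ever re-observing the environment.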
DI-engine RL Framework & PPO x Family Course
2021.07 — 2023.04
  • Top-2 contributor to DI-engine (3.6k+ Stars). Through co-designed algorithm-system abstractions, the framework flexibly supports a full stack of algorithms from DQN and PPO to DreamerV3 and NGU. Led the architecture design and implementation of the exploration mechanisms, POMDP support, and model-based RL modules.
  • Deeply involved in producing PPO x Family, an introductory course on decision intelligence (2.5k+ Stars). Designed experiments and wrote tutorials for core chapters on multimodal observations, complex action spaces, and multi-agent cooperation, helping community newcomers master the PPO algorithm family from theory to practice.
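The objective at the center of the PPO family can be stated in a few lines. This is the standard per-sample clipped surrogate from the original PPO paper, written as a plain function for clarity, not an excerpt from DI-engine or the course materials.

```python
def ppo_clip_loss(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate, returned as a loss (negated
    objective). ratio = pi_new(a|s) / pi_old(a|s); clipping removes the
    incentive to push the ratio outside [1 - eps, 1 + eps]."""
    unclipped = ratio * advantage
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps)) * advantage
    return -min(unclipped, clipped)

# A ratio of 1.5 with positive advantage is clipped to 1.2, capping the update.
print(ppo_clip_loss(1.5, 1.0))  # -1.2
```

Taking the min of the clipped and unclipped terms makes the bound pessimistic: large policy steps are only ever penalized, never rewarded, which is the source of PPO's training stability.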

Publications
ICLR 2026
Pu, Y., et al. "One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning." arXiv preprint arXiv:2509.07945 (2025).
SafeWork-R1
Bao, Y., ..., Pu, Y., et al. "SafeWork-R1: Coevolving Safety and Intelligence under the AI-45° Law." arXiv preprint.
TMLR 2024
Pu, Y., Niu, Y., et al. "UniZero: Generalized and efficient planning with scalable latent world models." Transactions on Machine Learning Research.
NeurIPS 2023 Spotlight
Niu, Y., Pu, Y., et al. "LightZero: A unified benchmark for Monte Carlo tree search in general sequential decision scenarios." Advances in Neural Information Processing Systems, 36.
CoRL 2025 RemembeRL Workshop
Xuan, C., Niu, Y., Pu, Y., et al. "ReZero: Boosting MCTS-based algorithms by backward-view and entire-buffer reanalyze." OpenReview.
Preprint 2024
Niu, Y., Pu, Y., et al. "Unifying diverse decision-making scenarios with learned discrete actions." arXiv preprint.
Preprint 2021
Pu, Y., Wang, S., et al. "Decomposed soft actor-critic method for cooperative multi-agent reinforcement learning." arXiv preprint.

Education
University of Science and Technology of China · M.S. in Electronic Engineering & Information Science
2018 — 2021
Harbin Institute of Technology (Weihai) · B.E. in Communication Engineering, Rank: 3/91
2014 — 2018

Skills
Reinforcement Learning (RL)
Proficient in Python/PyTorch, with extensive experience in RL algorithm design, development, and performance optimization. In-depth research and top-venue publications in world-model-based RL, Monte Carlo Tree Search (MCTS), efficient exploration, and multi-task learning.
Large Language Models (LLM)
Hands-on post-training experience with 10B+ parameter multimodal LLMs. Proficient in PPO, GRPO, and other RL fine-tuning techniques, with independent research on sample efficiency and training stability. Closely tracking frontier work that integrates RLVR/RLHF with world models.
Community Impact
Long-term maintainer of Zhihu columns on Decision Intelligence & RL, MCTS+RL, and Foundation Models & Multimodal Interaction. Created and maintain curated resource repositories on frontier topics, including awesome-exploration-rl and awesome-RLVR/RLHF.

Misc
Academic Service: Reviewer for ICLR 2026 and a NeurIPS 2025 Workshop. Invited lecturer on RL & MCTS Fundamentals in the Big Data Analytics course at Tsinghua University's Shenzhen graduate school for 3 consecutive years. Core contributor to a professional book on World Models (work in progress).
Honors: National Encouragement Scholarship for 3 consecutive years as an undergraduate; Outstanding Graduate of Shandong Province.
Misc: National Computer Rank Examination Level 4; CET-6. Curious about frontier technology with a drive for hard technical problems; enjoys badminton and hiking.