Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

1The University of Tokyo, 2RIKEN

🎉 ICLR 2026 (Poster) 🎉

TL;DR

Appropriately controlled policy diversity improves the learning efficiency of ensemble RL in large-scale environments.

Teaser figure

(a) The leader-follower approach is an agent ensemble method that aggregates samples from multiple followers into a leader policy. (b) Misalignment between policies may cause a decline in sample efficiency and training stability. (c) Our method introduces KL divergence constraints to keep followers distributed around the leader, as well as an adversarial reward to prevent policy over-concentration.

Abstract

Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability.

In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization (CPO), which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple dexterous manipulation tasks in both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods.
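As a rough illustration of the idea of regulating inter-policy diversity with a KL term, the sketch below penalizes a follower's Gaussian policy only when its KL divergence from the leader exceeds a budget. The function names, the penalty form, and the hyperparameters `beta` and `kl_max` are illustrative assumptions, not the paper's CPO implementation (which may use hard constraints rather than a soft penalty).

```python
import numpy as np

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) between diagonal Gaussian policies, summed over action dims."""
    return np.sum(
        np.log(sigma_q / sigma_p)
        + (sigma_p**2 + (mu_p - mu_q)**2) / (2.0 * sigma_q**2)
        - 0.5
    )

def follower_loss(surrogate, mu_f, sigma_f, mu_l, sigma_l,
                  beta=0.1, kl_max=0.5):
    """Follower objective: maximize the policy-gradient surrogate while
    staying near the leader.

    `beta` (penalty weight) and `kl_max` (divergence budget) are
    hypothetical hyperparameters for illustration only.
    """
    kl = gaussian_kl(mu_f, sigma_f, mu_l, sigma_l)
    # Penalize only the excess divergence above the allowed budget,
    # so followers can still spread out around the leader.
    return -(surrogate - beta * max(kl - kl_max, 0.0))
```

In this toy form, followers inside the KL budget optimize the surrogate freely, which matches the paper's observation that followers naturally distribute around the leader rather than collapsing onto it.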

Experiment Results

Training Performance

Our proposed method significantly improves sample efficiency and final performance over strong baselines including SAPG, DexPBT, and PPO on multiple object manipulation tasks.

Performance figure

Inter-policy KL Divergence Analysis

By visualizing the KL divergence among ensemble policies during training, we show that SAPG exhibits severe leader-follower divergence on several tasks, whereas our method appropriately suppresses this divergence and improves learning capability.

Shadow Hand

Allegro-Kuka Reorientation

Allegro-Kuka Regrasping

BibTeX

@inproceedings{shitanda2026rethinking,
  author    = {Naoki Shitanda and Motoki Omura and Tatsuya Harada and Takayuki Osa},
  title     = {Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  address   = {Rio de Janeiro, Brazil},
  note      = {Poster}
}