Shanghai, China – In a significant leap forward for artificial intelligence, a collaborative effort led by the Shanghai AI Laboratory, in conjunction with Tsinghua University, the University of Illinois Urbana-Champaign, and other international institutions, has yielded a novel approach to combat the pervasive issue of entropy collapse in reinforcement learning (RL) applied to large language models (LLMs). The breakthrough, detailed in a recent paper, promises to unlock new potential in AI reasoning and problem-solving.

Large language models have demonstrated remarkable advancements in reasoning capabilities in recent years, extending the application of reinforcement learning from simple tasks to a much broader range of scenarios. This evolution empowers models with enhanced generalization and logical reasoning skills. However, unlike traditional imitation learning, RL demands significantly more computational resources to facilitate learning from experience. A central challenge lies in the decline of policy entropy, which reflects the balance between a model’s exploitation of known strategies and its exploration of new ones.

A low entropy value leads to over-reliance on existing strategies, stifling the model’s ability to discover novel solutions. This exploitation-exploration trade-off is fundamental to reinforcement learning, making the control of policy entropy a critical obstacle in training.
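To make the quantity concrete: policy entropy here is the Shannon entropy of the model's next-token distribution, H = -Σ p(a)·log p(a). The following minimal sketch (illustrative, not from the paper) shows why a peaked, exploitative distribution has low entropy while a flat, exploratory one has high entropy:

```python
import numpy as np

def policy_entropy(logits: np.ndarray) -> float:
    """Shannon entropy H = -sum_a p(a) log p(a) of the softmax policy."""
    z = logits - logits.max()          # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# A peaked (exploitative) distribution has low entropy;
# a uniform (exploratory) one has the maximum, log(n).
peaked = policy_entropy(np.array([10.0, 0.0, 0.0, 0.0]))
flat = policy_entropy(np.array([0.0, 0.0, 0.0, 0.0]))
print(peaked < flat)  # True
```

When entropy collapses toward zero, the policy concentrates almost all probability mass on a few continuations, which is exactly the over-reliance on existing strategies described above.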

Unveiling the Entropy-Performance Relationship

To address this challenge, the research team formulated an empirical equation: R = -a·exp(H) + b, where H is the policy entropy, R is downstream task performance, and a and b are fitted coefficients. The equation makes the performance-entropy trade-off explicit: as entropy is depleted (H → 0), performance saturates at b - a, so entropy exhaustion itself becomes the performance bottleneck.
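Because R is linear in exp(H), the coefficients a and b can be recovered with ordinary least squares. The sketch below uses synthetic (H, R) pairs purely for illustration; in the paper the fit is made against measured training runs:

```python
import numpy as np

# Hypothetical (entropy, performance) pairs; the values are synthetic,
# generated from known coefficients so the fit can be checked.
H = np.array([0.1, 0.3, 0.5, 0.8, 1.2])
a_true, b_true = 0.2, 0.9
R = -a_true * np.exp(H) + b_true

# R = -a*exp(H) + b is linear in x = exp(H), so a degree-1 fit suffices.
x = np.exp(H)
slope, intercept = np.polyfit(x, R, 1)
a, b = -slope, intercept
print(round(a, 3), round(b, 3))  # 0.2 0.9
```

The recovered coefficients then predict the performance ceiling b - a that the policy would hit if its entropy were fully depleted.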

Further analysis of entropy dynamics revealed that the change in entropy is driven by the covariance between an action's log-probability and the change in its logit. To counteract entropy collapse, the team introduced two techniques, Clip-Cov and KL-Cov: Clip-Cov mitigates the issue by clipping the update for tokens with unusually high covariance, while KL-Cov applies a Kullback-Leibler (KL) divergence penalty to those tokens to maintain entropy levels.
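The two techniques can be sketched at the token level. The snippet below is an illustrative reconstruction from the description above, not the authors' code: per-token covariance is approximated by centered products of log-probabilities and advantages, `frac` and `beta` are hypothetical hyperparameters, and the resulting mask or penalty would be applied inside the policy-gradient loss:

```python
import numpy as np

def token_covariance(logp: np.ndarray, adv: np.ndarray) -> np.ndarray:
    """Per-token contribution to Cov(log pi, A): centered products."""
    return (logp - logp.mean()) * (adv - adv.mean())

def clip_cov_mask(logp, adv, frac=0.002):
    """Clip-Cov (sketch): mask out the top `frac` fraction of
    highest-covariance tokens so their gradient is dropped."""
    cov = token_covariance(logp, adv)
    k = max(1, int(frac * len(cov)))
    cutoff = np.sort(cov)[-k]
    return cov < cutoff  # False where the gradient is detached

def kl_cov_penalty(logp, ref_logp, adv, frac=0.002, beta=1.0):
    """KL-Cov (sketch): add a KL-to-reference penalty only on the
    highest-covariance tokens, leaving the rest untouched."""
    cov = token_covariance(logp, adv)
    k = max(1, int(frac * len(cov)))
    top = np.argsort(cov)[-k:]
    penalty = np.zeros_like(logp)
    penalty[top] = beta * (logp[top] - ref_logp[top])
    return penalty
```

Both sketches act on the same small set of tokens: Clip-Cov simply removes their contribution to the update, while KL-Cov keeps them tethered to a reference policy instead of silencing them outright.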

Experimental Validation and Results

The efficacy of these techniques was rigorously tested using the Qwen2.5 model on the DAPOMATH dataset, focusing on mathematical problem-solving. Results demonstrated significant performance gains. On 7B and 32B parameter models, performance improved by 2.0% and 6.4%, respectively. Notably, the 32B model exhibited a remarkable 15.0% performance increase on challenging benchmarks such as AIME24 and AIME25.

The research team extended their evaluation to a comprehensive set of 11 open-source models, including Qwen2.5, Mistral, LLaMA, and DeepSeek, with parameter sizes ranging from 0.5B to 32B. These models were assessed across eight publicly available benchmarks encompassing mathematical and programming tasks. Training was conducted using the veRL framework and a zero-shot setting, incorporating algorithms such as GRPO and REINFORCE++ to optimize policy performance.

The results consistently showed that Clip-Cov and KL-Cov techniques effectively maintained higher entropy levels. For instance, the KL-Cov method sustained entropy values more than ten times higher than the baseline, even when the baseline entropy plateaued. This not only addresses the policy entropy collapse problem but also provides a theoretical foundation for expanding reinforcement learning in language models.

Implications and Future Directions

This research underscores the critical role of entropy dynamics as a key bottleneck in performance enhancement. The team emphasizes the need for further exploration of entropy management strategies to drive the development of more intelligent language models. By effectively controlling and maintaining entropy, AI researchers can unlock new levels of reasoning and problem-solving capabilities in LLMs, paving the way for more advanced and versatile AI systems.

The findings from this collaborative effort represent a significant step towards overcoming a major hurdle in the application of reinforcement learning to large language models. As AI continues to evolve, strategies for managing entropy will undoubtedly play a crucial role in shaping the future of intelligent systems.

Reference:

The Entropy Mechanism of Reinforcement Learning.

