Diffusion Models Get Smart: New Framework Uses RL for LLM Reasoning

Los Angeles, CA – October 26, 2023 – Reasoning in large language models (LLMs) is no longer the exclusive domain of autoregressive (AR) models. A new framework, d1, developed by researchers at UCLA and Meta, demonstrates that diffusion language models can be trained to reason through mathematics and logic problems, thanks to a novel application of reinforcement learning.

Autoregressive LLMs, which generate text sequentially from left to right, have traditionally dominated the field. Recent advances in reinforcement learning (RL) have markedly improved the reasoning capabilities of these models: DeepSeek-R1 and Kimi K1.5 reach performance comparable to OpenAI’s o1 through RL-based post-training.

So far, this RL-driven progress has been largely confined to autoregressive LLMs. Discrete diffusion large language models (dLLMs) are emerging as a promising non-autoregressive alternative for language modeling. Unlike autoregressive models, which generate text token by token in a causal manner, dLLMs generate text through an iterative denoising process. This allows them to refine a sequence over multiple steps and to draw on both preceding and following context through bidirectional attention.
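The iterative denoising idea can be illustrated with a toy sketch. It is not the decoding procedure of LLaDA or any other dLLM: `toy_denoiser` is a hypothetical stand-in that looks up a fixed target and attaches a random confidence, and the even unmasking schedule is an assumption made for illustration. It shows only the general shape of the loop: start from a fully masked sequence and commit the most confident predictions at each step, in any order rather than left to right.

```python
import random

MASK = "[MASK]"

def toy_denoiser(seq, target, rng):
    """Hypothetical stand-in for a dLLM forward pass.

    Returns {position: (predicted_token, confidence)} for every masked
    position. A real model would predict from bidirectional context; this
    toy version looks up a fixed target and draws a random confidence.
    """
    return {i: (target[i], rng.random()) for i, tok in enumerate(seq) if tok == MASK}

def denoise(target, num_steps, seed=0):
    """Fill in a fully masked sequence over at most num_steps steps."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    history = [list(seq)]  # snapshot of the sequence after each step
    for step in range(num_steps):
        preds = toy_denoiser(seq, target, rng)
        if not preds:
            break  # nothing left to unmask
        remaining_steps = num_steps - step
        # Unmask an even share of the still-masked positions (ceil division).
        k = -(-len(preds) // remaining_steps)
        # Commit the k most confident predictions, wherever they sit.
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return seq, history

if __name__ == "__main__":
    final, history = denoise(["2", "+", "2", "=", "4"], num_steps=3)
    for snapshot in history:
        print(" ".join(snapshot))
```

Each printed line has fewer masked positions than the one before it, and the positions are filled in confidence order rather than strictly from left to right.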

Open-source masked dLLMs like LLaDA have already achieved performance comparable to autoregressive models of similar size, and closed-source dLLMs like Mercury have demonstrated low inference latency. Despite this potential, leading open-source dLLMs have not yet been subjected to RL post-training, leaving a significant avenue for exploration.

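This raises a crucial question: how can RL post-training be implemented effectively in a non-autoregressive setting? Policy-gradient algorithms need the log-probability of each generated sequence. An autoregressive model obtains it by multiplying per-token conditional probabilities, but a masked dLLM does not factor its output this way, so the quantity has to be estimated. The d1 framework directly addresses this challenge.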

d1: Bridging the Gap with GRPO-like Reinforcement Learning

The d1 framework takes an approach inspired by Group Relative Policy Optimization (GRPO), adapting it to the characteristics of masked diffusion models. With this RL stage, d1 surpasses the performance achieved by Supervised Fine-Tuning (SFT) alone.
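To make the "group relative" part of GRPO concrete, here is a minimal sketch of how advantages are commonly computed in GRPO-style training: several completions are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This is an illustration of the general technique, not d1's implementation. The function name, the example rewards, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of sampled completions.

    rewards: scores for several completions of the SAME prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

if __name__ == "__main__":
    # Hypothetical rewards: 1.0 for a correct final answer, 0.0 otherwise.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
```

Completions that score above their group's average receive a positive advantage and are reinforced, while those below it are discouraged. Because the baseline comes from the group itself, no separate value network is needed.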

The framework is now open-source, paving the way for further research and development in this exciting area. By enabling diffusion models to leverage the power of reinforcement learning, d1 opens up new possibilities for building more intelligent and capable language models.

Implications and Future Directions

The development of d1 represents a significant step forward in the evolution of LLMs. By extending the benefits of reinforcement learning to diffusion models, researchers are unlocking new potential for these architectures. This could lead to:

Improved Reasoning Capabilities: Diffusion models can now be trained to solve complex problems and perform logical reasoning tasks more effectively.
Enhanced Efficiency: The non-autoregressive nature of diffusion models offers the potential for faster inference speeds.
Greater Flexibility: Diffusion models can leverage bidirectional context, potentially leading to more nuanced and contextually aware text generation.

The open-source release of d1 is likely to spur further research and innovation in this area. As researchers continue to explore the intersection of diffusion models and reinforcement learning, further gains in the reasoning capabilities of these language models can be expected.

References:

(Original research paper and code repository for d1 will be linked here upon publication)
DeepSeek-R1: (Link to DeepSeek-R1 information)
Kimi K1.5: (Link to Kimi K1.5 information)
LLaDA: (Link to LLaDA information)
Mercury: (Link to Mercury information)
