
Beijing, China – As the AI community buzzes with excitement over OpenAI’s o1 and DeepSeek R1, the focus on enhancing the reasoning capabilities and test-time scaling (TTS) of Large Language Models (LLMs) has intensified. However, a critical challenge remains: how to accurately evaluate the quality of each step in a model’s response during complex reasoning tasks.

Traditional Process Reward Models (PRMs), while capable of verifying reasoning steps, are limited by their scalar scoring mechanism, which makes it difficult to catch deep logical errors. Furthermore, their discriminative modeling approach prevents them from scaling at test time.
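To make the limitation concrete, here is a minimal illustrative sketch (not the paper's implementation; all names are hypothetical) of what a discriminative PRM interface looks like: each reasoning step is mapped to a single scalar, with no rationale attached and no way to spend more compute on a harder step.

```python
# Hypothetical sketch of a traditional discriminative PRM: one scalar
# probability per reasoning step, nothing more.

def scalar_prm_score(question: str, steps: list[str]) -> list[float]:
    """Stand-in for a discriminative PRM head: one scalar per step.

    A real PRM would run a trained classifier over the prefix
    (question, steps[:i+1]); here we return a fixed placeholder value
    purely to show the shape of the interface.
    """
    return [0.5 for _ in steps]

scores = scalar_prm_score("2 + 3 * 4 = ?", ["3 * 4 = 12", "2 + 12 = 14"])
print(scores)  # [0.5, 0.5]
```

Each step gets only one number: there is no explanation of *why* a step is judged wrong, and no natural way to "think longer" about a borderline step at test time, which is exactly the gap GenPRM targets.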

Now, a collaborative effort between Tsinghua University and Shanghai AI Lab has yielded a promising solution: the Generative Process Reward Model, or GenPRM. This innovative approach combines generative chain-of-thought (CoT) reasoning with code verification and introduces a test-time scaling mechanism, offering a fresh perspective on process-supervised reasoning.
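The code-verification idea can be sketched as follows. This is a hedged, simplified illustration of the style of check described above, not GenPRM's actual implementation: the verifier model emits a CoT critique plus a small Python check for each step, and executing the check (rather than reading a classifier head) yields the step reward.

```python
# Illustrative sketch of generative code verification (hypothetical
# helper names): the step reward comes from executing a generated
# Python check, not from a scalar classification head.

def run_check(code: str) -> bool:
    """Execute generated verification code; True if it raises nothing."""
    try:
        exec(code, {})
        return True
    except Exception:
        return False

# Example: verifying the reasoning step "3 * 4 = 12" with a check the
# verifier model might generate alongside its written critique.
generated_check = "assert 3 * 4 == 12"
reward = 1.0 if run_check(generated_check) else 0.0
print(reward)  # 1.0
```

Because the judgment is itself generated text plus executable code, it can surface *which* quantity is wrong in a faulty step, something a bare scalar score cannot express.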

The research team, led by Liu Runze, a second-year Master’s student at Tsinghua University under Professor Li Xiu, and Zhao Jian, a third-year undergraduate student at Beijing University of Posts and Telecommunications, focused on enhancing the reasoning capabilities and test-time scaling of LLMs. Their work addresses a significant gap in the field.

The GenPRM Advantage

Similar to DeepSeek’s recently released Generative Reward Model (GRM), GenPRM leverages generative modeling and test-time scaling to enhance the process reward model. This allows for a more nuanced and comprehensive evaluation of each step in the reasoning process.
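Test-time scaling of the reward model itself can be sketched simply. In this illustrative example (a simulation under assumed accuracy, not the paper's setup), the PRM samples N independent generative verifications of the same step and averages their verdicts, trading extra inference compute for a more reliable reward estimate.

```python
# Illustrative test-time scaling for a generative PRM: average N
# independently sampled judgments of the same reasoning step.

import random

def sample_judgment(step: str, rng: random.Random) -> float:
    """Hypothetical stand-in for one sampled generative verification.

    A real rollout would produce a full CoT critique plus a code check;
    here we simulate a noisy binary verdict that is right 80% of the
    time for a correct step.
    """
    return 1.0 if rng.random() < 0.8 else 0.0

def scaled_reward(step: str, n: int, seed: int = 0) -> float:
    """Average n sampled verdicts into one step reward in [0, 1]."""
    rng = random.Random(seed)
    return sum(sample_judgment(step, rng) for _ in range(n)) / n

r1 = scaled_reward("2 + 12 = 14", n=1)    # single noisy verdict
r32 = scaled_reward("2 + 12 = 14", n=32)  # aggregated estimate
```

With one sample the reward is an all-or-nothing guess; averaging many samples concentrates the estimate around the step's true quality, which is the basic mechanism that lets more test-time compute buy a better reward signal.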

The Significance

The implications of this research are significant. By enabling test-time scaling in process reward models, GenPRM paves the way for more robust and reliable AI systems capable of tackling complex reasoning tasks. This advancement could have a profound impact on various applications, including:

  • Problem Solving: Enhancing the ability of AI models to solve complex problems by accurately evaluating each step of the solution process.
  • Code Generation: Improving the quality and reliability of AI-generated code through rigorous process supervision.
  • Decision Making: Supporting better decision-making in critical domains by ensuring the accuracy and validity of the reasoning behind AI-driven recommendations.

Looking Ahead

The development of GenPRM represents a significant step forward in the pursuit of more intelligent and reliable AI systems. As research in this area continues, we can expect to see even more innovative approaches emerge, further pushing the boundaries of what is possible with LLMs. The team’s work, built on a training set of 23K process-supervision examples, has demonstrated that even smaller models (1.5B parameters) can achieve impressive results, rivaling the performance of much larger models like GPT-4o. This is a testament to the power of innovative algorithms and targeted training data.


About the Researchers:

  • Liu Runze: Master’s student at Tsinghua University, specializing in Large Language Models and Reinforcement Learning. (ryanliu112.github.io)
  • Zhao Jian: Undergraduate student at Beijing University of Posts and Telecommunications, focusing on Large Language Models.



