Introduction

In the rapidly evolving field of artificial intelligence, the combination of Long Chain of Thought Supervised Fine-Tuning (Long-CoT SFT) and Reinforcement Learning (RL) has been heralded as a golden duo for improving the performance of language models. The approach, which first trains models to emulate human-like reasoning patterns and then optimizes outputs using reward mechanisms, has consistently delivered remarkable results in text-only language tasks. However, new research from Huawei and The Hong Kong University of Science and Technology (HKUST) has revealed a surprising and counterintuitive phenomenon: when applied to multimodal visual language models (VLMs), the pairing not only fails to deliver the expected synergy, but the two techniques can even hinder each other.

This unexpected outcome raises important questions about the underlying mechanics of reasoning in multimodal models and the differences between text-only and multimodal evaluations. In this article, we delve into the details of this groundbreaking research, explore the reasons behind this synergy dilemma, and discuss the implications for the future of AI model development.

The Golden Duo: Long-CoT SFT and RL in Language Models

What is Long-CoT SFT?

Before diving into the research findings, let’s first understand the two key techniques involved.

Long Chain of Thought Supervised Fine-Tuning (Long-CoT SFT) is a training approach where models are fine-tuned on tasks that require complex reasoning. This method focuses on teaching the model to follow extended and coherent reasoning paths, akin to how humans think through problems. The idea is that by training models on these long reasoning chains, they will be better equipped to solve intricate tasks that demand multi-step logical deductions.
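As a rough illustration of what this kind of fine-tuning looks like in practice, the sketch below trains a causal language model on examples whose targets include an extended reasoning trace. The model name, the toy example, and the hyperparameters are placeholders for illustration, not details taken from the paper.

```python
# Minimal Long-CoT SFT sketch (illustrative only): fine-tune a causal LM
# on examples whose targets include a long reasoning trace plus the answer.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical data: each example pairs a question with a multi-step
# reasoning chain followed by the final answer, so the model learns
# to produce the full trace, not just the answer.
examples = [
    {
        "prompt": "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?\n",
        "cot_and_answer": "Step 1: 45 minutes is 0.75 hours.\n"
                          "Step 2: speed = 60 / 0.75 = 80.\n"
                          "Answer: 80 km/h",
    },
]

def collate(batch):
    texts = [ex["prompt"] + ex["cot_and_answer"] + tokenizer.eos_token for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    enc["labels"] = enc["input_ids"].clone()  # real runs would mask padding with -100
    return enc

loader = DataLoader(examples, batch_size=1, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    out = model(**batch)   # cross-entropy over the reasoning trace and answer
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```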

What is Reinforcement Learning (RL)?

Reinforcement Learning (RL), on the other hand, optimizes the model’s outputs through a system of rewards and penalties. The model learns to generate better responses by receiving feedback based on how close or far it is from the desired outcome. In the context of language models, RL is often used to fine-tune the model’s responses to be more accurate, contextually appropriate, or aligned with human preferences.
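To make the idea concrete, here is a hedged sketch of a single reward-driven update using a plain REINFORCE-style policy gradient, which is only one of many RL algorithms and not necessarily the one used in the paper. The checkpoint name, the prompt, and the exact-match reward function are all placeholders.

```python
# Toy RL (REINFORCE-style) sketch: sample an answer, score it with a reward,
# and increase the log-probability of the sampled tokens in proportion to it.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

prompt = "Q: What is 7 * 8?\nA:"
target = "56"  # toy ground truth used by the reward below

def reward_fn(text: str) -> float:
    # Hypothetical verifiable reward: 1.0 if the answer contains the target.
    return 1.0 if target in text else 0.0

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a completion; no gradients are needed for the sampling step itself.
with torch.no_grad():
    sampled = model.generate(prompt_ids, do_sample=True, max_new_tokens=16,
                             pad_token_id=tokenizer.eos_token_id)

completion_ids = sampled[:, prompt_ids.shape[1]:]
reward = reward_fn(tokenizer.decode(completion_ids[0], skip_special_tokens=True))

# Re-score the sampled tokens with gradients to obtain their log-probabilities.
logits = model(sampled).logits[:, prompt_ids.shape[1] - 1:-1, :]
log_probs = F.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, completion_ids.unsqueeze(-1)).squeeze(-1)

loss = -(reward * token_log_probs.sum())  # REINFORCE objective, no baseline
loss.backward()
optimizer.step()
optimizer.zero_grad()
```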

When combined, Long-CoT SFT and RL work in tandem to enhance model performance. The former teaches the model to reason better, while the latter refines the outputs based on feedback. This complementary relationship has led to substantial performance improvements in language models, particularly in tasks requiring logical reasoning and problem-solving.
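In outline, the usual recipe simply chains the two stages: supervised fine-tuning first to instill the long reasoning format, then RL to refine the resulting policy against a reward. The skeleton below is a schematic of that ordering, with hypothetical stage functions standing in for training loops like the ones sketched above.

```python
# Schematic two-stage post-training recipe (hypothetical function names).

def run_long_cot_sft(model, cot_dataset):
    """Supervised fine-tuning on (prompt, long reasoning trace, answer) data."""
    ...  # cross-entropy training loop, as in the SFT sketch above
    return model

def run_rl(model, prompts, reward_fn):
    """Reward-driven optimization (e.g. a policy-gradient loop) on top of SFT."""
    ...  # sample, score with reward_fn, update, as in the RL sketch above
    return model

def post_train(base_model, cot_dataset, prompts, reward_fn):
    sft_model = run_long_cot_sft(base_model, cot_dataset)
    return run_rl(sft_model, prompts, reward_fn)
```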

However, as the Huawei and HKUST research team discovered, things don’t always go as planned when these techniques are applied to multimodal models.

The Unexpected Findings: A Synergy Dilemma in Multimodal VLMs

What Are Multimodal Visual Language Models (VLMs)?

Multimodal Visual Language Models (VLMs) are AI models designed to process and understand multiple types of data simultaneously, such as text, images, and sometimes even audio. These models aim to bridge the gap between different forms of information, allowing for more holistic and human-like understanding. For example, a VLM could be tasked with answering questions about an image or generating descriptions based on a combination of visual and textual inputs.
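As a small illustration of the input-output shape such models deal with, the sketch below runs a visual question-answering call through the Hugging Face transformers pipeline; the checkpoint and the image path are placeholders, and the paper's VLMs are not tied to this particular model.

```python
# Illustrative only: a visual question-answering call, where the model must
# combine image perception with language understanding to answer a question.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")  # placeholder model

result = vqa(image="photo_of_street.jpg",   # hypothetical local image path
             question="How many traffic lights are visible?")
print(result[0]["answer"], result[0]["score"])
```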

The Research: A Surprising Discovery

The research paper, titled "The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs" and available on arXiv, presents an unexpected outcome when applying the Long-CoT SFT and RL combination to multimodal VLMs.

In language-only models, combining these techniques typically leads to significant performance improvements. In multimodal models, however, the researchers found that Long-CoT SFT and RL not only failed to produce the expected synergistic effect but, in some cases, actually degraded performance. This phenomenon was especially pronounced when the models were evaluated on tasks spanning both simple perceptual questions and complex cognitive reasoning challenges.

Why Does This Happen?

The researchers hypothesized that the root cause of this synergy dilemma lies in the heterogeneity of multimodal reasoning tasks. Unlike purely text-based tasks, which tend to focus on high-level logical reasoning, multimodal tasks often require a combination of perceptual processing and complex cognitive reasoning.

