

Title: "Open-Source VITA Opens a New Chapter in Multimodal Large Language Models"

Keywords: Open-source, Multimodal, Large Language Models

News Content:
The open-source community has seen a significant breakthrough: researchers from Tencent Youtu Lab and other institutions have released VITA, the first open-source multimodal large language model of its kind. The launch marks a solid step forward in the open-source community's exploration of multimodal artificial intelligence.

VITA can process multiple modalities, including video, images, text, and audio, demonstrating strong multimodal understanding and interaction. Built on Mixtral 8x7B, it was enhanced by expanding the model's Chinese vocabulary and applying bilingual instruction fine-tuning, giving it strong Chinese-language understanding. In addition, multi-task learning spanning multimodal alignment and instruction fine-tuning endows the language model with visual and audio processing abilities.

In both unimodal and multimodal benchmarks, VITA performs strongly, demonstrating robust multilingual, visual, and audio understanding. The work also makes notable progress toward natural multimodal human-computer interaction, and is the first MLLM study to support non-wake-word interaction and audio interruption.

VITA is deployed in a duplex scheme: one model generates responses to user queries while a second continuously monitors environmental input, giving VITA notably responsive human-computer interaction. Although much work remains relative to closed-source counterparts, VITA, as a pioneer, lays the groundwork for follow-up research.
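The duplex idea described above can be illustrated with a minimal concurrency sketch: a generation worker streams response tokens while a separate monitor worker watches for new environmental input and can interrupt it. The model calls below are stand-ins (`fake_generate` is invented for illustration); VITA's actual inference API is not shown here.

```python
import threading
import queue

def fake_generate(prompt):
    """Stand-in for the generation model: yields response tokens."""
    for token in f"Answering: {prompt}".split():
        yield token

def run_generation(prompt, interrupt_event, out):
    """Stream tokens until done, or stop early if the monitor interrupts."""
    for token in fake_generate(prompt):
        if interrupt_event.is_set():
            out.append("<interrupted>")
            return
        out.append(token)

def monitor(input_queue, interrupt_event):
    """Stand-in for the monitoring model: interrupts on new user speech."""
    event = input_queue.get()  # blocks until an environment event arrives
    if event == "user_speech":
        interrupt_event.set()

interrupt = threading.Event()
inputs = queue.Queue()
tokens = []

gen_thread = threading.Thread(target=run_generation, args=("hello", interrupt, tokens))
mon_thread = threading.Thread(target=monitor, args=(inputs, interrupt))
mon_thread.start()
gen_thread.start()
gen_thread.join()      # generation completes; no interruption occurred
inputs.put("silence")  # release the monitor thread
mon_thread.join()
print(tokens)
```

In a real duplex deployment the monitor would run continuously against an audio stream rather than a single queued event, but the division of labor, one process answering while another listens, is the same.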

The release of VITA not only brings new opportunities to the open-source community but also injects fresh momentum into the field of artificial intelligence. As the model continues to improve and see wider use, more innovative multimodal applications and services can be expected.

Source: https://www.jiqizhixin.com/articles/2024-08-14-5
