## LMMs-Eval Released: A Multimodal Model Evaluation Framework with Comprehensive Coverage, Low Cost, and Zero Contamination
As research on large models deepens, extending them to more modalities has become a hot topic in both academia and industry. Recently released closed-source models such as GPT-4o and Claude 3.5 already show strong image understanding, and open-source models such as LLaVA-NeXT, MiniCPM, and InternVL are delivering performance ever closer to their closed-source counterparts. In an era of inflated benchmark claims and "a new SoTA every 10 days," an easy-to-use, standardized, transparent, and reproducible multimodal evaluation framework is becoming increasingly important, and building one is no small task.
To address these problems, researchers from LMMs-Lab at Nanyang Technological University have jointly open-sourced LMMs-Eval, an evaluation framework designed specifically for large multimodal models (LMMs) that provides a one-stop, efficient solution for evaluating them.
Since its release in March 2024, LMMs-Eval has received contributions from the open-source community, companies, and universities. It has earned 1.1K stars on GitHub and has more than 30 contributors; it currently covers more than 80 datasets and more than 10 models, with more being added.
**A Standardized Evaluation Framework**
To provide a standardized evaluation platform, LMMs-Eval offers the following features:
* **Unified interface:** LMMs-Eval improves and extends the text evaluation framework lm-evaluation-harness. By defining unified interfaces for models, datasets, and evaluation metrics, it makes it straightforward for users to add new multimodal models and datasets.
* **One-command launch:** LMMs-Eval hosts more than 80 datasets (and counting) on HuggingFace, carefully converted from their original sources, including all variants, versions, and splits. No preparation is required: with a single command, multiple datasets and models are downloaded and evaluated automatically, and results are ready within minutes (see the example command after this list).
* **Transparency and reproducibility:** LMMs-Eval ships with a unified logging tool that records every question a model answers and whether the answer is correct, guaranteeing reproducibility and transparency and making it easy to compare the strengths and weaknesses of different models.
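As a concrete illustration of the one-command launch and the built-in logging, the sketch below shows what a typical run can look like. It assumes the lm-evaluation-harness-style CLI that LMMs-Eval extends; the model name, checkpoint, task list, and output path are illustrative placeholders and should be checked against the project's README.

```bash
# Sketch of a single-command evaluation run (illustrative; flag names follow
# the lm-evaluation-harness-style CLI that LMMs-Eval extends, and the model,
# checkpoint, task list, and output path are placeholders).
python -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme,mmbench_en \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```

If the flags behave as in lm-evaluation-harness, `--log_samples` writes each question, the model's answer, and its score to the output directory, which is what makes runs reproducible and directly comparable across models.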
LMMs-Eval's vision is that future multimodal models will no longer need to write their own data-processing, inference, and submission code. In today's environment of highly concentrated multimodal test sets, that approach is impractical, and the resulting scores are hard to compare directly against other models. By plugging into LMMs-Eval, model developers can focus on improving and optimizing the models themselves rather than spending time on evaluation and on aligning results.
**The "Impossible Triangle" of Evaluation**
The ultimate goal of LMMs-Eval is to find a way to evaluate LMMs that (1) has broad coverage, (2) is low cost, and (3) has zero data leakage. However, even with LMMs-Eval, the team found that achieving all three at once is extremely difficult, if not impossible.
To tackle this, LMMs-Eval offers two approaches: LMMs-Eval-Lite and LiveBench.
* **LMMs-Eval-Lite: lightweight evaluation with broad coverage**
To balance evaluation coverage and cost, LMMs-Eval introduces LMMs-Eval-Lite, which helps users quickly assess model performance across multiple datasets while keeping costs low (a low-cost run is sketched after this list).
* **LiveBench: low cost and zero data leakage**
LiveBench focuses on providing a low-cost, zero-data-leakage evaluation method. It leverages an online evaluation platform, which effectively avoids data-leakage problems while keeping evaluation costs low.
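One rough way to keep evaluation cost down, independent of the curated Lite task lists, is to cap the number of samples evaluated per task. The sketch below assumes LMMs-Eval keeps lm-evaluation-harness's `--limit` flag (intended for quick checks rather than reportable scores); the model and task names are again illustrative placeholders, not the official Lite task IDs.

```bash
# Illustrative low-cost sanity run: cap the number of evaluated samples per task.
# Assumes LMMs-Eval inherits the --limit flag from lm-evaluation-harness;
# model and task names are placeholders, and capped runs are for quick checks,
# not for reporting official benchmark numbers.
python -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme,seedbench \
    --limit 100 \
    --output_path ./logs/quick_check/
```

The curated LMMs-Eval-Lite task lists are the intended way to do this more systematically; consult the repository for their exact names.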
The release of LMMs-Eval provides a standardized, efficient solution for evaluating multimodal models and should greatly advance research on large multimodal models. As the framework continues to mature, it is well placed to become an essential tool for multimodal model evaluation and to drive progress in multimodal AI.
Source: https://www.jiqizhixin.com/articles/2024-08-21-3
