Beijing, China – The Beijing Academy of Artificial Intelligence (BAAI), in collaboration with several universities, has announced the release of BGE-VL, a new multimodal embedding model poised to advance multimodal retrieval technology. Building on the success of the BGE series, the model promises significant gains in tasks such as image-text retrieval and composed image retrieval, where a query combines an image with a text instruction.

The key to BGE-VL’s performance lies in its training on MegaPairs, a massive synthetic dataset. This innovative approach offers two crucial advantages: exceptional scalability and superior data quality.

Scalability Through Synthetic Data: MegaPairs combines multimodal representation models, large multimodal models, and large language models to efficiently mine multimodal triplet data from vast repositories of images and text. This automated pipeline continuously generates diverse, high-quality multimodal triplets at a fraction of the cost of traditional annotation. The initial release comprises 26 million samples, providing substantial data support for training multimodal retrieval models.
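The announcement does not include pipeline code, but the mining loop it describes can be sketched at a high level. In the sketch below, `find_related_images`, `describe_relation`, and `rewrite_as_instruction` are hypothetical placeholders standing in for the multimodal representation model, the large multimodal model, and the large language model, respectively; this is an illustrative outline under those assumptions, not BAAI's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    query_image: str   # path or URL of the query image
    instruction: str   # text describing how the target relates to the query
    target_image: str  # path or URL of the target image


def find_related_images(image: str, corpus: List[str], k: int = 3) -> List[str]:
    """Placeholder: a multimodal representation model (e.g. a CLIP-style encoder)
    would retrieve the k nearest neighbours of `image` from `corpus`."""
    raise NotImplementedError


def describe_relation(image_a: str, image_b: str) -> str:
    """Placeholder: a large multimodal model would describe how image_b
    differs from or relates to image_a."""
    raise NotImplementedError


def rewrite_as_instruction(relation: str) -> str:
    """Placeholder: a large language model would turn the raw relation
    description into a concise retrieval instruction."""
    raise NotImplementedError


def mine_triplets(corpus: List[str]) -> List[Triplet]:
    """Mine (query image, instruction, target image) triplets from an image
    corpus, in the spirit of the MegaPairs description."""
    triplets = []
    for query_image in corpus:
        for target_image in find_related_images(query_image, corpus):
            relation = describe_relation(query_image, target_image)
            instruction = rewrite_as_instruction(relation)
            triplets.append(Triplet(query_image, instruction, target_image))
    return triplets
```

Because every stage is automated, the same loop can keep producing new triplets as the underlying corpora grow, which is the scalability advantage the announcement emphasizes.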

Data Quality Outperforms Human Annotation: Remarkably, models trained on MegaPairs achieve better results than those trained on traditional human-annotated datasets while using only 1/70th of the data volume. Using this synthetic data, BAAI trained the BGE-VL model, significantly boosting performance across several leading multimodal retrieval benchmarks.

The BGE-VL technical report has been published, and the associated data, models, and code resources will be progressively released to the community.

Addressing the Growing Demand for Multimodal Information Retrieval

In the era of large models, information retrieval needs to cater to increasingly diverse user demands. This includes not only multimodal query inputs but also the need for information spanning multiple modalities. Consider a scenario where a user photographs a car exterior and seeks specific information about that model. A multimodal retrieval system must comprehensively understand the user’s image and text instructions and retrieve the most relevant content from various information modalities.
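To make the scenario concrete, the following sketch shows how such a combined image-plus-text query could be scored against candidate content in a shared embedding space. The `encode` function is a hypothetical stand-in for a multimodal embedding model such as BGE-VL; the file name, question, and candidate corpus are illustrative, and cosine similarity is used as a generic scoring choice rather than anything specified in the announcement.

```python
import numpy as np


def encode(image_path=None, text=None) -> np.ndarray:
    """Hypothetical placeholder: a multimodal embedding model (e.g. BGE-VL)
    would map an image, a text, or an image+text combination into a single
    shared vector space. The signature here is illustrative only."""
    raise NotImplementedError


def retrieve(query_image, query_text, candidates, top_k=5):
    """Rank candidate passages (or images) against a combined image+text
    query using cosine similarity in the shared embedding space."""
    q = encode(image_path=query_image, text=query_text)
    scored = []
    for cand in candidates:
        d = encode(text=cand)  # candidates could equally be images
        sim = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
        scored.append((sim, cand))
    return sorted(scored, key=lambda x: x[0], reverse=True)[:top_k]


# Illustrative usage: a photo of a car plus a follow-up question, matched
# against text snippets drawn from product pages, reviews, and manuals.
# results = retrieve("car_photo.jpg",
#                    "What engine options are available for this model?",
#                    corpus_of_text_snippets)
```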

Existing multimodal retrieval models are typically trained on single-format cross-modal pairs (e.g., image-text pairs) and therefore struggle with queries that combine modalities. Instruction fine-tuning has proven effective at strengthening multi-task capabilities in text retrieval and large language models, but previous multimodal retrieval instruction datasets have relied largely on manual annotation, limiting both their scale and their diversity.
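The announcement does not detail the training objective, but retrieval embedding models of this kind are commonly trained with an in-batch contrastive (InfoNCE-style) loss over query-target pairs. The sketch below shows what that looks like for instruction-augmented triplets, purely as background; it is an assumption about the general technique, not a description of BGE-VL's actual training recipe.

```python
import torch
import torch.nn.functional as F


def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              target_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over a batch of embedding pairs.

    query_emb:  [batch, dim] embeddings of (query image + instruction)
    target_emb: [batch, dim] embeddings of the matching target items
    Non-matching targets within the same batch serve as negatives.
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature                       # [batch, batch] similarities
    labels = torch.arange(q.size(0), device=q.device)    # diagonal entries are positives
    return F.cross_entropy(logits, labels)
```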

To overcome this limitation, the BGE team at BAAI innovatively proposed the MegaPairs data synthesis method. This approach leverages existing large models to automatically generate high-quality training data, paving the way for more robust and versatile multimodal retrieval systems.

Conclusion

The release of BGE-VL marks a significant step forward in multimodal information retrieval. By leveraging the power of synthetic data through the MegaPairs method, BAAI has created a model that surpasses the performance of traditional approaches while significantly reducing the reliance on costly and time-consuming manual annotation. This advancement promises to unlock new possibilities for accessing and understanding information in a world increasingly defined by diverse and interconnected data modalities. The open-source release of the model and associated resources will undoubtedly foster further innovation and collaboration within the research community, accelerating the development of even more powerful and versatile multimodal retrieval systems. Future research will likely focus on expanding the MegaPairs dataset, exploring new model architectures, and applying BGE-VL to a wider range of real-world applications.

