Alibaba International Unveils Ovis2, a New Multimodal AI Powerhouse

Hangzhou, China – Alibaba International has launched Ovis2, a new series of multimodal large language models (MLLMs) designed to bridge the gap between visual and textual understanding. This release marks a significant step forward in the development of AI capable of processing and reasoning across diverse data formats, from text and images to videos.

Ovis2 builds upon the architecture of its predecessor, Ovis, with a focus on structural embedding alignment, which reconciles the differences between how visual and textual inputs are embedded. This allows the model to integrate information from both modalities and generate coherent, contextually relevant outputs.
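For readers curious what embedding alignment can look like in code, here is a minimal sketch. It assumes, following the published description of the original Ovis architecture, that each image patch is mapped to a probability distribution over a learned visual vocabulary and that its embedding is the probability-weighted average of rows in a visual embedding table, mirroring how text tokens index a text embedding table. The function names, the table values, and the three-entry vocabulary are hypothetical, and this is not Alibaba's implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def visual_embedding(patch_scores, visual_table):
    """Probability-weighted average of the rows of visual_table.

    patch_scores: one raw score per visual-vocabulary entry (hypothetical
                  output of a vision encoder head for a single patch).
    visual_table: list of embedding rows, one per vocabulary entry.
    """
    probs = softmax(patch_scores)
    dim = len(visual_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, visual_table))
        for d in range(dim)
    ]


# Toy visual vocabulary: 3 entries, embedding dimension 2 (made-up values).
VISUAL_TABLE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give equal weights, so the result is the plain average of the rows.
embedding = visual_embedding([0.0, 0.0, 0.0], VISUAL_TABLE)
```

Under this assumption, a text token picks exactly one row of its table while a visual patch blends several rows, so both modalities reach the language model as table-based embeddings.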

One of the key advancements in Ovis2 is its enhanced reasoning capabilities. Through instruction fine-tuning and preference learning, the model exhibits a significant improvement in Chain-of-Thought (CoT) reasoning. This allows Ovis2 to tackle complex logical and mathematical problems by breaking them down into smaller, more manageable steps, providing a clear and traceable path to the solution.

“The ability to reason step-by-step is crucial for complex tasks,” explains a source familiar with the project. “Ovis2’s enhanced CoT reasoning allows it to not only provide answers but also explain the reasoning behind them, making it a more transparent and reliable AI system.”

Beyond text and images, Ovis2 introduces video and multi-image processing capabilities. This allows the model to understand and interpret complex visual information across multiple frames, opening up possibilities for applications in areas such as video analysis, surveillance, and autonomous driving. The model is capable of selecting keyframes from videos and processing multiple images simultaneously to understand the context and relationships between them.
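The announcement does not say how Ovis2 chooses keyframes. As an illustration of the general idea only, the sketch below uses a simple, commonly used heuristic: keep a frame when it differs enough from the last kept frame. The frame representation (flat lists of pixel intensities), the function names, and the threshold are all assumptions made for this example.

```python
def mean_abs_diff(frame_a, frame_b):
    """Average absolute per-pixel difference between two equal-sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_keyframes(frames, threshold):
    """Return indices of frames that differ from the previous keyframe
    by more than `threshold`. The first frame is always kept."""
    if not frames:
        return []
    kept = [0]
    for index in range(1, len(frames)):
        if mean_abs_diff(frames[index], frames[kept[-1]]) > threshold:
            kept.append(index)
    return kept


# Five tiny 4-pixel "frames" with two clear scene changes (made-up data).
frames = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],      # nearly identical to frame 0
    [10, 10, 10, 10],  # scene change
    [10, 11, 10, 10],  # nearly identical to frame 2
    [50, 50, 50, 50],  # scene change
]
keyframes = select_keyframes(frames, threshold=5.0)
```

Only the selected frames would then be passed to the model as a multi-image input, which keeps the visual token count manageable for long videos.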

Furthermore, Ovis2 boasts improved multilingual support and Optical Character Recognition (OCR) capabilities. This enables the model to process text in various languages and extract structured data from complex visual elements such as tables and charts. This is particularly valuable for applications requiring multilingual document processing and data extraction from visual sources.

The Ovis2 series encompasses six different model sizes, ranging from 1 billion to 34 billion parameters. This range allows developers to choose the model that best suits their specific needs and computational resources. According to Alibaba International, all models in the Ovis2 series have demonstrated exceptional performance on the OpenCompass multimodal benchmark, excelling in areas such as mathematical reasoning and video understanding.
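To make the trade-off between model size and hardware concrete, here is a back-of-the-envelope sketch. It assumes weights are stored at 2 bytes per parameter (16-bit precision) and counts only the weights, ignoring activations, caches, and framework overhead, so real requirements will be higher. The helper names and the 24 GB budget are hypothetical; only the 1 billion and 34 billion endpoints come from the announcement.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


def largest_fitting(param_counts, budget_gb, bytes_per_param=2):
    """Largest parameter count whose weights fit in budget_gb, or None."""
    fitting = [
        n for n in param_counts
        if weight_memory_gb(n, bytes_per_param) <= budget_gb
    ]
    return max(fitting) if fitting else None


SMALLEST = 1_000_000_000   # 1 billion parameters
LARGEST = 34_000_000_000   # 34 billion parameters

smallest_gb = weight_memory_gb(SMALLEST)  # about 2 GB of weights
largest_gb = weight_memory_gb(LARGEST)    # about 68 GB of weights
choice = largest_fitting([SMALLEST, LARGEST], budget_gb=24)
```

By this estimate the smallest model fits comfortably on a single consumer GPU, while the largest needs multiple accelerators or reduced-precision weights.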

The open-source release of Ovis2 is expected to accelerate research and development in the field of multimodal large language models. By providing researchers and developers with access to a powerful and versatile MLLM, Alibaba International aims to foster innovation and collaboration in the AI community.

“We believe that open-source models are essential for driving progress in AI,” says a representative from Alibaba International. “By sharing Ovis2 with the community, we hope to encourage further research and development in this exciting field.”

Ovis2’s capabilities have significant implications for a wide range of applications, including:

Education: Providing personalized learning experiences and automated tutoring.
Healthcare: Assisting in medical diagnosis and treatment planning.
Finance: Automating financial analysis and risk assessment.
Retail: Enhancing customer service and personalizing product recommendations.

The launch of Ovis2 underscores Alibaba International’s commitment to pushing the boundaries of AI and developing innovative solutions that address real-world challenges. As MLLMs continue to evolve, models like Ovis2 will play an increasingly important role in shaping the future of AI.
