ByteDance Unveils Dolphin, an Open-Source Model for Document Parsing

Beijing, China – ByteDance, the parent company of TikTok, has released Dolphin, a new open-source model designed for efficient and accurate parsing of document images. The release gives developers a tool for extracting information from a wide range of document types.

Dolphin distinguishes itself through its lightweight architecture and two-stage parsing approach. Unlike larger, more resource-intensive models, Dolphin has just 322 million parameters, making it suitable for deployment in resource-constrained environments.
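To put the parameter count in perspective, the short calculation below estimates how much memory the weights alone would occupy at common numeric precisions. It is a back-of-the-envelope sketch, not a measurement: the helper name `weight_memory_mib` is invented for illustration, and the figures ignore activations, the image input, and framework overhead.

```python
def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


# Parameter count as reported in the article.
DOLPHIN_PARAMS = 322_000_000

# Bytes per parameter for a few common storage formats.
PRECISIONS = [("float32", 4), ("float16", 2), ("int8", 1)]

for name, width in PRECISIONS:
    print(f"{name}: {weight_memory_mib(DOLPHIN_PARAMS, width):.0f} MiB")
```

At 16-bit precision the weights come to roughly 0.6 GiB, which is the kind of footprint that fits on modest GPUs; actual runtime memory use will be higher.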

The model’s two-stage process involves:

Layout Analysis: A Swin Transformer encodes the input document image, and the model identifies the elements on the page, such as titles, figures, tables, and footnotes. It then arranges these elements into a sequence that reflects the natural reading order.
Content Extraction: Using the identified elements as anchors, Dolphin parses the content of each element in parallel. The result is structured output, available in JSON or Markdown format, that is easy to process further or present.
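The two stages described above can be sketched as a small pipeline. This is an illustration of the control flow only, not Dolphin's actual code or API: `Element`, `analyze_layout`, `parse_element`, and `parse_page` are hypothetical names, both stages are replaced by stubs that return fixed data, and a thread pool stands in for whatever parallel mechanism the real model uses.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Element:
    kind: str    # e.g. "title", "paragraph", "table", "footnote"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels
    order: int   # position in the reading order


def analyze_layout(page_image) -> list:
    """Stage 1 stand-in (hypothetical): return page elements in reading order."""
    return [
        Element("title", (40, 30, 560, 70), 0),
        Element("paragraph", (40, 90, 560, 300), 1),
        Element("table", (40, 320, 560, 520), 2),
    ]


def parse_element(page_image, element: Element) -> dict:
    """Stage 2 stand-in (hypothetical): parse the content of one element."""
    return {
        "type": element.kind,
        "bbox": element.bbox,
        "content": f"<{element.kind} content>",
    }


def parse_page(page_image) -> list:
    # Stage 1: find the elements and their reading order.
    elements = analyze_layout(page_image)
    # Stage 2: parse every element independently, using it as an anchor.
    with ThreadPoolExecutor() as pool:
        # pool.map yields results in input order, so reading order is kept.
        return list(pool.map(lambda e: parse_element(page_image, e), elements))


if __name__ == "__main__":
    for item in parse_page(page_image=None):
        print(item["type"], item["bbox"])
```

The point of the split is that stage 2 calls do not depend on one another, so they can run concurrently once the layout is known.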

According to ByteDance, Dolphin excels in a variety of document parsing tasks, outperforming models such as GPT-4.1 and Mistral-OCR on certain benchmarks. Its capabilities include:

Comprehensive Layout Analysis: Accurately identifies and sequences document elements.
Structured Content Extraction: Converts documents into structured JSON or Markdown formats.
Precise Text Parsing: Accurately extracts text content in multiple languages, including Chinese and English.
Formula Recognition: Supports the identification of complex mathematical formulas, outputting them in LaTeX format.
Table Parsing: Extracts data from complex tables and generates HTML-formatted tables.
Versatile Input and Output: Supports various document image formats and outputs data in JSON, Markdown, and HTML.
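As an illustration of how the output formats listed above relate to one another, the sketch below converts a JSON list of parsed elements into Markdown, wrapping LaTeX formulas in display-math delimiters and passing HTML tables through unchanged. The JSON schema (a list of objects with `type` and `content` fields) and the function `to_markdown` are assumptions made for this example; the article does not specify Dolphin's actual output schema.

```python
import json


def to_markdown(elements: list) -> str:
    """Render parsed elements (assumed schema) as a Markdown document."""
    parts = []
    for el in elements:
        kind, content = el["type"], el["content"]
        if kind == "title":
            parts.append(f"# {content}")
        elif kind == "formula":
            # LaTeX goes inside display-math delimiters.
            parts.append(f"$$\n{content}\n$$")
        else:
            # Paragraphs and HTML tables are emitted as-is.
            parts.append(content)
    return "\n\n".join(parts)


# Hypothetical structured output for a three-element page.
sample_json = """
[
  {"type": "title", "content": "Results"},
  {"type": "formula", "content": "E = mc^2"},
  {"type": "table", "content": "<table><tr><td>1</td></tr></table>"}
]
"""

if __name__ == "__main__":
    print(to_markdown(json.loads(sample_json)))
```

Keeping tables as HTML inside Markdown is a common choice because most Markdown renderers accept inline HTML, while Markdown's own table syntax cannot express merged cells.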

The release of Dolphin’s code and pre-trained models gives developers a starting point for building applications that require automated document processing. Possible application areas include:

Academic Research: Automating the extraction of data from research papers.
Business Intelligence: Processing and analyzing business reports and contracts.
Technical Documentation: Extracting information from technical manuals and specifications.

By open-sourcing Dolphin, ByteDance gives researchers and developers a base for further work on AI-powered document understanding. The release is expected to spur research and development in document parsing and could lead to more efficient and accessible tools for information extraction.
