Guangzhou, China – In a significant step towards bridging the gap between ancient wisdom and modern technology, South China University of Technology (SCUT) has launched Tonggu, a large language model (LLM) specifically designed for processing classical Chinese texts. This innovative AI tool promises to revolutionize the way scholars, students, and enthusiasts engage with China’s rich literary heritage.
The Tonggu model, developed by SCUT’s Deep Learning and Visual Computing Laboratory (SCUT-DLVCLab), is built on the Baichuan2-7B-Base model. It underwent continued pre-training on a corpus of 2.41 billion characters of classical Chinese texts, followed by instruction fine-tuning on 4 million classical Chinese dialogue samples.
“Our aim was to create an AI model that not only understands classical Chinese but also facilitates its accessibility and appreciation,” explained a lead researcher from SCUT-DLVCLab. Tonggu is designed to address challenges inherent in processing ancient texts, such as ambiguous sentence structures and a lack of punctuation.
The model employs redundancy-aware tuning (RAT) to enhance performance on classical Chinese tasks. It also incorporates a retrieval-augmented generation mechanism for classical Chinese understanding (CCU-RAG) to reduce hallucinations in knowledge-intensive tasks, improving the accuracy and reliability of the generated content.
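To make the retrieval-augmented idea described above concrete, here is a minimal sketch of how a question can be grounded in a retrieved passage before it reaches a language model. It is not Tonggu's CCU-RAG implementation: the three-passage corpus, the character-bigram scoring, and the function names `retrieve` and `build_prompt` are all assumptions chosen for illustration, and no model is actually called.

```python
from collections import Counter

# Toy in-memory corpus (an assumption for illustration); a real system
# would index a large library of classical texts.
CORPUS = [
    {"source": "论语·学而", "text": "学而时习之，不亦说乎？有朋自远方来，不亦乐乎？"},
    {"source": "道德经·第一章", "text": "道可道，非常道。名可名，非常名。"},
    {"source": "孟子·公孙丑下", "text": "得道者多助，失道者寡助。"},
]


def bigrams(text):
    """Count character bigrams, ignoring punctuation and whitespace."""
    chars = [c for c in text if c.isalnum()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def score(query, passage):
    """Number of character bigrams shared by the query and the passage."""
    return sum((bigrams(query) & bigrams(passage)).values())


def retrieve(query, corpus=CORPUS, top_k=1):
    """Return up to top_k passages that share at least one bigram with the query."""
    ranked = sorted(corpus, key=lambda d: score(query, d["text"]), reverse=True)
    return [d for d in ranked[:top_k] if score(query, d["text"]) > 0]


def build_prompt(question, passages):
    """Prepend retrieved passages so the model can answer from cited text."""
    if not passages:
        return question  # nothing relevant found; fall back to the bare question
    context = "\n".join(f"[{d['source']}] {d['text']}" for d in passages)
    return f"参考资料：\n{context}\n\n问题：{question}"


question = "“有朋自远方来”出自哪里？"
prompt = build_prompt(question, retrieve(question))
print(prompt)
```

The resulting prompt carries the source passage alongside the question, so a model answering it can quote the text rather than rely on memory alone. A production system would likely swap the bigram overlap for a learned retriever, but the overall flow of retrieving first and generating second stays the same.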
Key functionalities of the Tonggu model include:
- Classical Chinese Punctuation: Tonggu can automatically add punctuation to classical texts, addressing the common issue of missing or inconsistent punctuation in ancient documents. This feature significantly improves readability and comprehension.
- Translation between Classical and Modern Chinese: The model supports bidirectional translation, converting complex classical Chinese into modern vernacular and vice versa. This enables users to easily access and interpret ancient texts while also facilitating the creation of new works in the classical style.
- Poetry Generation: Tonggu can generate poems that adhere to the rules and styles of classical Chinese poetry. Users can provide themes or keywords, and the model will create corresponding poetic compositions.
- Classical Text Appreciation: The model offers in-depth analyses of classic texts, interpreting their literary value, historical context, and cultural significance. This feature aids users in gaining a deeper understanding of the nuances and complexities of ancient literature.
- Classical Text Retrieval and Question Answering: By integrating retrieval-augmented technology, Tonggu can quickly search classical texts and provide accurate answers to user queries, streamlining the process of information retrieval and research.
The launch of Tonggu marks a notable advance in the application of AI to the humanities. By providing an accessible tool for processing classical Chinese texts, SCUT aims to help a new generation of scholars and enthusiasts explore and appreciate China’s rich literary heritage. The model is expected to be a valuable resource for researchers, students, and anyone interested in classical Chinese literature. As AI continues to evolve, tools like Tonggu are likely to play an increasingly important role in preserving and promoting cultural heritage around the world.