**News Title:** Shanghai AI Lab Releases Open-Source Corpus WanJuan-CC: 100 Billion Characters Shaping the Future of Intelligence
**Keywords:** Shanghai AI Lab, WanJuan-CC, open-source corpus
**News Content:** The Shanghai Artificial Intelligence Laboratory recently open-sourced WanJuan-CC (万卷CC), a high-quality corpus for pre-training large language models. Drawn from publicly available internet content of the past decade, the corpus contains 100 billion characters, or about 400 GB of English data. As this year's first open-source release from the Large Model Corpus Data Alliance, WanJuan-CC is intended to give academia and industry strong data support for developing more intelligent and more reliable large AI models.
The release is presented as a milestone for the field of artificial intelligence. Researchers and developers can now obtain the carefully filtered corpus free of charge and use it to train and optimize their models, with the aim of improving accuracy and performance. The move reflects the Shanghai AI Lab's standing in the field and its commitment to sharing technology and promoting AI development worldwide.
The release of WanJuan-CC is expected to lower barriers to data access, encourage more experimentation, and accelerate the application of AI across sectors. As more developers and research teams put the data to use, we look forward to seeing more advanced AI solutions that drive technological progress and serve society.
Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg
