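Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab) has recently released WanJuan-CC (万卷CC), a high-quality pre-training corpus for large models and the first open-source corpus published this year by the "Large Model Corpus Data Alliance". Drawn from publicly available Internet content of the past ten years, the corpus contains 100 billion tokens (100B), or roughly 400 GB of high-quality English data. The release is intended to give academia and industry large-scale, high-quality data for building more intelligent and reliable large AI models. Shanghai AI Lab says the resource will greatly advance the development and application of AI technology.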

Title: Shanghai AI Lab Releases Open Source WanJuan-CC Corpus
Keywords: Shanghai AI Lab, WanJuan-CC, Open Source Corpus

News content:
Shanghai AI Lab recently announced the release of WanJuan-CC (万卷CC), a high-quality corpus for pre-training large models. The open-source corpus is the first release this year from the "Large Model Corpus Data Alliance". It covers publicly available Internet content from the past decade and comprises 100 billion tokens (100B), equivalent to approximately 400 GB of high-quality English data. The release aims to provide academia and industry with large-scale, high-quality data to support the development of smarter and more reliable large AI models. Shanghai AI Lab states that the resource will significantly drive the advancement and application of AI technology.

Source: https://mp.weixin.qq.com/s/Pt02LXlh2Uu_hgM0ZL5GGg

