**News Title:** “OpenAI Breaks Through Challenges, Trains GPT-4 with 1 Million Hours of YouTube Videos, Redefining AI Learning Boundaries”

**Keywords:** OpenAI, GPT-4, YouTube Data

**News Content:** According to recent reports in *The Wall Street Journal* and *The New York Times*, OpenAI, a leading artificial intelligence (AI) company, has struggled to obtain enough high-quality training data and has turned to unconventional sources to fill the gap. To train its language model GPT-4, OpenAI used more than 1 million hours of YouTube video content. The move highlights how complicated data collection becomes when AI companies operate in the gray areas of copyright law.

*The New York Times* reports that OpenAI developed an audio transcription model called Whisper to transcribe large volumes of video efficiently and turn the speech into text for training. YouTube videos contain a wide range of conversations and language samples from around the world, but copyright concerns make collecting that data difficult. According to the report, Whisper may help the company obtain the large-scale data needed to train large language models while respecting copyright.

Nonetheless, training AI models on such a large video dataset means OpenAI must handle potential privacy and copyright issues carefully. As AI technology develops rapidly, how to acquire and use data legally and ethically has become a pressing question for the industry. OpenAI's approach may set a new benchmark for data acquisition in the AI sector, and it could also prompt a broader debate about data ownership and the limits of data use.

Source: https://www.ithome.com/0/760/305.htm
