
i"肥料"不足,Openai被曝疯狂转录 YouTube视频

Author: New List

In the large-scale model race, the crisis of data shortage is becoming more and more severe.

A recent New York Times survey revealed that tech companies, including OpenAI, Google, and Meta, have taken shortcuts to obtain large amounts of high-quality training data, ignoring platform policies, and frantically testing on the verge of breaking the law.

Among them, OpenAI has collected more than 1 million hours of YouTube video text through Whisper, a voice transcription tool, as GPT-4 training data.

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

The cover of the New York Times report

AI companies are frantically obtaining all kinds of data from the Internet to train large AI models, but is this legal and in line with platform policies?

A battle for rights over data resources has been waged between creators, content platforms, and AI companies.

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

AI "fertilizer" shortage,

OpenAI疯狂转录YouTube视频

According to the New York Times, OpenAI has been collecting data, cleaning it, and feeding it into a vast pool of text for years to train large language models.

This data includes computer code from GitHub, chess databases, high school exam questions and assignments from Quizlet, and more.

By the end of 2021, OpenAI had exhausted all reliable English text resources on the internet and urgently needed more data to train its next-generation model, GPT-4.

To that end, OpenAI discussed several options internally: transcribing podcasts, audiobooks, and YouTube videos; creating data from scratch with AI systems; and acquiring startups that had already collected large amounts of digital data.

OpenAI's research team then built Whisper, a speech recognition tool, to transcribe YouTube videos and podcasts and generate new conversational text that could further improve the AI's intelligence.

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

Whisper blog: https://openai.com/research/whisper
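For context, the sketch below shows roughly what transcription with the open-source Whisper model looks like. It is a minimal sketch assuming the openai-whisper Python package and an already-downloaded audio file; the file name is a placeholder, and, as noted below, fetching YouTube videos by automated means is against the platform's terms.

```python
# Minimal sketch: transcribing a locally saved audio file with the
# open-source Whisper model (pip install openai-whisper).
# "podcast_episode.mp3" is a placeholder file name.
import whisper

model = whisper.load_model("base")  # larger checkpoints ("medium", "large") are more accurate but slower
result = model.transcribe("podcast_episode.mp3")

print(result["text"])  # the full transcript as plain text
for segment in result["segments"]:  # time-stamped segments
    print(f"{segment['start']:.1f}s -> {segment['end']:.1f}s: {segment['text']}")
```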

OpenAI employees knew that doing so would be a legal gray area and could violate YouTube's rules, three people familiar with the matter said. YouTube, which is owned by Google, prohibits the use of its videos in "standalone" applications and also prohibits access to its videos through "any automated means, such as bots or crawlers."

But the OpenAI team decided it was fair use to train AI with video, and ended up transcribing more than 1 million hours of YouTube videos.

OpenAI President Greg Brockman, who led the team that developed GPT-4, was personally involved in collecting the YouTube videos and then feeding them into GPT-4, according to people familiar with the matter.

In addition to OpenAI, tech companies such as Meta and Google have taken similar measures.

According to a recording of an internal Meta meeting early last year, Meta's vice president of generative AI, Ahmad Al-Dahle, said that the team had used almost every English-language book, paper, poem, and news article available on the internet to develop its model, and that it would not be able to match ChatGPT unless Meta got more data.

In March and April 2023, the Meta team considered acquiring the publisher Simon & Schuster to license its full-length works, and discussed how to collect copyrighted data from the internet without permission, even if doing so led to litigation.

They mentioned that it would take too long to negotiate licensing with publishers, artists, musicians and the news industry.

Meta has said it has taken billions of publicly shared images and videos from Instagram and Facebook to train its models.

Google has also transcribed YouTube videos to train its own AI models and expanded its terms of service last year, according to people familiar with the matter.

The previous privacy policy stated that Google could use publicly available information only to "help train Google's language models and build features such as Google Translate," while the revised terms broaden the scope so that Google can use the data to "train AI models and build products and features such as Google Translate, Bard, and Cloud AI."

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

Changes to Google's Privacy Policy

According to Google's internal sources, one of the goals of this change is to allow Google to refine its AI product with other online data such as publicly available Google Docs and restaurant reviews on Google Maps.

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

Creators are suing AI companies for infringement

Developing bigger and stronger AI requires a seemingly endless supply of data. From news reports and published works to online comments, blog posts, photos, and videos on social platforms, all kinds of data on the internet are becoming an important cornerstone of the AI industry's development.

For creators, AI companies' use of their work to train models raises copyright infringement and ethical concerns.

The New York Times sued OpenAI and Microsoft late last year for using copyrighted news articles to train AI chatbots without permission. OpenAI and Microsoft responded that this was "fair use" permitted under copyright law.

Last year's Hollywood strikes also involved a dispute over AI-related rights. Filmmaker and actress Justine Bateman, an AI advisor to the actors' union SAG-AFTRA, believes that AI models taking content, including her work and films, without permission or payment is "the biggest theft in the United States."

Recently, more than 200 artists, including the well-known singers Billie Eilish and Nicki Minaj, signed an open letter asking tech companies to commit not to develop AI tools that destroy or replace human creativity, writing that "we must prevent AI from being predatorily used to steal the voices and likenesses of professional creators, violate the rights of creators, and disrupt the music ecosystem."

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

In the face of creators' protests, content platforms have also made their positions clear.

In a recent interview with Bloomberg, YouTube CEO Neal Mohan stressed that downloading YouTube videos and then using them to train AI models such as Sora is a clear violation of YouTube's current terms of service.

He acknowledged that Google "used some of the content on YouTube" when training its Gemini model, but said it had obtained creators' authorization beforehand, as allowed under the agreements between YouTube and its creators.

Google spokesman Matt Bryant responded to the changes to the privacy policy, saying that Google did not use information from Google Docs or related apps to train AI without the user's "explicit permission," referring to a voluntary program that allows users to test experimental features.

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

Is synthetic data a feasible solution?

Looking back at the development of large AI models, before 2020 most models were trained on far smaller datasets than today's.

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

Growth in training data volume for large AI models. Source: The New York Times

That changed when Jared Kaplan, a theoretical physicist at Johns Hopkins, published a seminal AI paper finding that the more data a large language model is trained on, the better it performs.

Since then, "Scale Is All You Need" quickly became a slogan of AI research.

i"肥料"不足,Openai被曝疯狂转录 YouTube视频

Paper: https://arxiv.org/pdf/2001.08361.pdf
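For reference, the data-scaling result in that paper takes a simple power-law form: test loss falls polynomially as the training set grows. A rough rendering, with constants approximately as reported in the paper for English web text:

$$
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad \alpha_D \approx 0.095,
\qquad D_c \approx 5.4 \times 10^{13} \text{ tokens}
$$

Because the exponent is small, each meaningful drop in loss demands an order-of-magnitude increase in training tokens, which is why the token counts below keep climbing.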

OpenAI launched GPT-3 in November 2020, and it was the model with the largest amount of training data at the time — about 300 billion tokens. Google's AI lab, DeepMind, went one step further and tested 400 AI models in 2022, one of which was trained on 1.4 trillion tokens.

However, this record did not last long. Last year, researchers in China released Skywork, an AI model trained on 3.2 trillion tokens of Chinese and English text. Google's PaLM 2 was trained on more than 3.6 trillion tokens.

According to the research firm Epoch, AI companies are consuming data faster than it is being produced, and high-quality data on the internet could be exhausted as early as 2026.

How to resolve the "data shortage" and the industry problems that come with it has become a central concern in AI development.

Faced with the data shortage crisis, tech companies are developing "synthetic data": AI-generated text, images, and code that lets models learn from content they themselves produce.

OpenAI spokesperson Lindsay Held told The Verge that each of OpenAI's models has its own unique dataset, that the company draws on a wide range of sources, including partners with public and non-public data, and that it is considering generating its own synthetic data.

Sam Altman has said that in the future all data will be synthetic. Now that AI models can produce human-like text, they can also create additional data for developing better AI, which would reduce the team's reliance on copyrighted data.

Many industry insiders speculate that Sora may have been trained in part on large amounts of synthetic data generated with the game engine Unreal Engine 5.

But building an AI system that can train itself is easier said than done. AI models that learn from their own output can get stuck in an endless loop, reinforcing their own quirks, mistakes, and limitations.

"The data these AI systems need is like finding a way out of the jungle," said Jeff Clune, a former OpenAI researcher, "and if they are only trained on synthetic data, they are likely to get lost in the jungle." ”

To combat this, OpenAI and other companies are looking at how two different AI models can work together to generate more useful and reliable synthetic data. One AI generates the data, and the other evaluates the information to separate the good data from the bad. However, the effectiveness of this approach has not been confirmed.
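As a rough illustration of that generate-and-filter idea (a sketch, not OpenAI's actual pipeline), the loop below pairs a generator model with a second model acting as a judge. It assumes the openai Python SDK (v1.x); the model names, prompts, and the 0.7 score threshold are illustrative placeholders.

```python
# Rough sketch of a two-model synthetic-data loop: one model generates
# candidate training examples, a second model scores them, and only
# high-scoring examples are kept. Model names, prompts, and the 0.7
# threshold are illustrative placeholders, not any company's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_candidate(topic: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write one question-and-answer pair about {topic}."}],
    )
    return resp.choices[0].message.content

def score_candidate(text: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Rate the factual accuracy and usefulness of this training "
                              f"example from 0 to 1. Reply with a number only.\n\n{text}"}],
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # unparseable score: treat as bad data

synthetic_dataset = []
for _ in range(100):
    candidate = generate_candidate("basic chemistry")
    if score_candidate(candidate) >= 0.7:  # keep only data the judge rates highly
        synthetic_dataset.append(candidate)
```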

In addition, overseas companies such as Scale AI and Gretel.ai have already begun offering synthetic data services.

Domestically, Xue Lan, Dean of Schwarzman College at Tsinghua University and Dean of the Institute of International Governance of Artificial Intelligence, said in a recent public speech that China has a very large volume of data but has not really industrialized it. There are relatively few standardized data service providers: big data services are not very profitable, companies holding public data are unwilling to clean it, and customized services generally charge high fees. How to build a data market is therefore also a problem that needs to be solved.
