
AI Companies Scramble for Data: OpenAI "Picks" Video Content While Google "Covets" Office Data

Author: Data Ape

In an April 4 interview with Bloomberg Originals, YouTube CEO Neal Mohan warned OpenAI that training Sora on YouTube videos would violate the platform's terms of service, because creators do not want their content exploited in that way.

But interestingly, when host Emily Chang asked whether Google had also trained its Gemini AI on YouTube data, and whether it paid creators for it, Mohan's answer became somewhat ambiguous. He admitted that Google does use YouTube data to train Gemini, but claimed it does so "in accordance with the terms of service," and did not say whether creators were paid.

This response clearly failed to convince netizens, who began mocking it in every way they could:

"Creators, see? YouTube now says it owns the content you produce."

"Don't say things you shouldn't say!"


"Google doesn't pay creators for their data either, right? Well, yes, the terms of service say you don't have to pay."


While there is no evidence yet that Sora was actually trained on YouTube videos, Mohan's warning was likely prompted by a recent Wall Street Journal report. According to the report, OpenAI developed Whisper, a speech recognition tool, to transcribe YouTube videos into text, providing fresh training data for its large language models.

On the surface, YouTube appears to side with creators. In reality, both Google and OpenAI are doing everything they can, through compliant means or gray areas, to obtain large amounts of training data and stay ahead in artificial intelligence; creators' interests are clearly not their primary concern.

The available data on the internet will be quickly exhausted

In January 2020, Johns Hopkins University theoretical physicist Jared Kaplan, together with nine OpenAI researchers, published a landmark paper on artificial intelligence that reached a clear conclusion: the more data a large language model is trained on, the better it performs.

Since then, "scale is everything" has become a tenet of the field. The stunning performance of OpenAI's GPT-3.5-powered ChatGPT ignited a frenzy across the generative AI space and detonated demand for data.

Nick Grudin, Meta's vice president of global partnerships and content, once said in a meeting: "The only thing that is holding us back from getting to the level of ChatGPT is the amount of data."

AI giants then began a fierce race for data resources: GPT-3, launched in 2020, was trained on 300 billion tokens; GPT-4, launched last year, reportedly used 12 trillion tokens; and GPT-5 could need between 60 trillion and 100 trillion tokens if the current growth trajectory continues. Google's PaLM 2, launched last year, used 3.6 trillion tokens, while the original PaLM, which went live in 2022, used only 780 billion tokens.
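Kaplan's finding is usually summarized as a power law: holding other factors fixed, model loss falls as a power of dataset size, which is why each jump in token counts above buys a smaller improvement. A minimal sketch of that relationship (the constant `d_c` and exponent `alpha` below are illustrative placeholders, not the paper's fitted values):

```python
def loss_from_data(d_tokens, d_c=5e13, alpha=0.095):
    """Kaplan-style power law: loss ~ (D_c / D)^alpha.
    d_c and alpha here are made-up illustrative values."""
    return (d_c / d_tokens) ** alpha

# Each 10x increase in data yields a smaller absolute loss reduction.
for d in (3e11, 3e12, 3e13):
    print(f"{d:.0e} tokens -> loss {loss_from_data(d):.3f}")
```

The point of the power law is diminishing returns: every tenfold jump in tokens shaves off a constant multiplicative factor, so the absolute gains shrink, and raw data volume becomes the bottleneck.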


(The amount of training data required by different large language models. Credit: The New York Times)

Because these large language models consume data faster than it can be produced, data resources, especially high-quality ones, are being mined and used up in large quantities.

According to projections by the AI research institute Epoch, all high-quality available data could be exhausted by 2026. Last May, OpenAI CEO Sam Altman also publicly acknowledged at a tech conference that AI companies will exhaust all available data on the internet in the near future.
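Epoch's forecast is, at heart, a stock-versus-flow extrapolation: a roughly fixed stock of usable tokens set against consumption that compounds every year. A toy version of that calculation, with made-up numbers (the stock size, first-year usage, and growth rate below are illustrative assumptions, not Epoch's figures):

```python
def exhaustion_year(stock_tokens, first_year_use, growth, start_year=2024):
    """Return the year in which cumulative consumption exceeds the stock."""
    used, year, annual = 0.0, start_year, first_year_use
    while used < stock_tokens:
        used += annual
        annual *= growth   # consumption compounds each year
        year += 1
    return year - 1

# e.g. a 1e14-token stock, 1e13 tokens used in year one, doubling yearly
print(exhaustion_year(1e14, 1e13, 2.0))
```

With consumption doubling annually, even a tenfold larger stock only pushes the exhaustion date back a few years, which is what makes these projections so stark.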


(Low-quality language data is expected to be used up by 2050, high-quality language data by 2026, and visual data by 2060. Credit: Epoch)

Without new data sources, or ways to use data more efficiently, the development of machine learning models that rely on huge datasets will slow down. To maintain their technological lead, AI companies therefore have to fight fiercely for data and constantly hunt for new sources.

A New Arms Race in AI: Getting More "Data"

OpenAI was already feeling the pressure of "data hunger" at the end of 2021, and began searching for new data to train larger models. Led by OpenAI president Greg Brockman, the Whisper project was born, injecting new blood into the GPT-4 model by transcribing more than one million hours of YouTube videos. Despite the legal risks of this approach, OpenAI's team believed it was worth it.

Google, for its part, has gone no less far: it also transcribes YouTube videos to extract text for its large language models, and has even targeted user-generated content in services like Google Docs, Google Sheets, Google Slides, and Google Maps.

Billions of tokens are estimated to be embedded in these applications. To be able to use this data in the future, Google asked its privacy team to revise its policy last June, and deliberately released the new policy on July 1, during the US Independence Day holiday, to distract the public. Google currently claims it does not use this data outside of experimental programs.

In this "data gold rush", platforms with large amounts of user data have attracted special attention:

Shortly after ChatGPT's launch, galvanized tech giants such as Meta, Google, Amazon, and Apple struck deals with stock-image providers like Shutterstock to acquire hundreds of millions of its images, videos, and music files for AI training. According to Shutterstock, the initial deals were worth between $25 million and $50 million, and the figure is rising as demand for data grows.

Photobucket, the image-hosting site that once served Myspace and Friendster, has also become a focal point for tech companies competing for data. Several tech giants are reportedly in talks with Photobucket to acquire its 13 billion photos and videos for training their generative AI models. The material is priced at 5 cents to $1 per photo, with videos worth more than $1 each. Even though Photobucket now has only 2 million users, far below its peak of 70 million, the sheer volume of data it holds is still valuable.

Shutterstock's rival Freepik has also announced that it has struck a deal with two big tech companies to license most of the 200 million images in its archives for 2 to 4 cents per image. The company also said there were five similar deals underway, but declined to disclose the identity of the buyers.
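The per-item prices quoted in these deals make the totals easy to bound. For example, licensing most of Freepik's 200 million images at 2 to 4 cents each works out to single-digit millions of dollars:

```python
def deal_range(n_items, low_price, high_price):
    """Bound the total value of a per-item licensing deal (USD)."""
    return n_items * low_price, n_items * high_price

low, high = deal_range(200_000_000, 0.02, 0.04)
print(f"${low:,.0f} to ${high:,.0f}")  # $4,000,000 to $8,000,000
```

By the same arithmetic, Photobucket's 13 billion items at 5 cents to $1 apiece would run from hundreds of millions into the billions of dollars, which helps explain the intense interest in its archive.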

Google has signed a $60-million-a-year licensing deal with Reddit for high-quality, long-form content to train its large language models.

Even with massive social platforms like Facebook and Instagram, Meta still faces a shortage of high-quality data. Because those platforms accumulate little long-form, in-depth content, Meta explored acquiring the publisher Simon & Schuster to obtain long-form works. In addition, to obtain training data quickly, the company scraped almost every English-language book, essay, poem, and news article available on the internet, including some copyrighted content.

For creators, this is plainly unfair: their content is used for training without their knowledge, tech companies use that data to optimize monetized products, and the creators don't see a penny.

The New York Times sued OpenAI and Microsoft last year for training AI chatbots on its copyrighted news articles without permission. OpenAI and Microsoft, however, argue that this is "fair use" permitted by copyright law, because they transformed the works for a different purpose.

Is "synthetic data" the way out?

As the "natural resources" available on the internet become increasingly scarce, the AI industry is exploring new sources of data to meet the needs of future large model training. Synthetic data is a potential avenue.

As the name suggests, synthetic data is not collected from the real world; it is text, images, and code generated by algorithms to mimic the characteristics and behavior of real-world data. In other words, systems learn from content they produce themselves.

There are success stories. Anthropic's Claude 3, launched last month, used some synthetic data in training and outperformed GPT-4 across the board on benchmark scores.

Sam Altman also proposed such a path last May: a model can produce human-like text, which can then be used to train the model itself, helping developers build ever more powerful technology while reducing reliance on copyrighted data.

Theoretically, this approach can form a perfect closed loop, which not only satisfies the huge demand for data for large-scale AI models, but also avoids the controversy and risk of collecting sensitive information directly from users.

But we shouldn't be overly optimistic. In recent months, researchers have found that training AI models on AI-generated data amounts to a digital form of "inbreeding" that eventually leads to "model collapse," or "Habsburg AI."

As model collapse progresses, the model's output becomes low-quality and lacks diversity. This not only reduces the model's generalization ability and practical value, but also increases the difficulty and cost of training and debugging, erodes users' trust in the model and the system behind it, and ultimately hinders research progress and technological innovation.
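The "inbreeding" effect is easy to demonstrate with a toy generative model: fit a Gaussian to data, sample from the fit, refit on the samples, and repeat. The fitted spread is systematically underestimated at each generation, so the distribution narrows until it collapses (a minimal sketch of the phenomenon, not the cited researchers' actual experiment):

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(10)]  # small "real" dataset

stds = []
for generation in range(100):
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)      # fit a Gaussian to the current data
    stds.append(sigma)
    # the next generation is trained purely on the previous model's output
    data = [random.gauss(mu, sigma) for _ in range(10)]

print(f"std of generation 0: {stds[0]:.3f}, generation 99: {stds[-1]:.4f}")
```

Each refit slightly underestimates the true standard deviation, and those underestimates compound across generations, which is the mechanism behind the loss of diversity described above.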

Whether it's capturing natural data or producing synthetic data, smaller companies face serious challenges in the AI race. They have neither the funds to buy copyrighted data nor access to the user data hosted on platforms owned by tech giants.

On Reddit, one entrepreneur lamented: "Yes, it's a violation of [YouTube's] user agreement, but honestly, we're in a bind because Big Tech has a monopoly on the market. My company collapsed because it couldn't crawl content from the open web, and that was because of the anti-competitive behavior of Twitter, Facebook, and Google."


"This only raises a series of questions. All of these companies are constantly encroaching on each other, but only in order to crowd out smaller players. These big companies are guilty; otherwise they couldn't function the way they do."


In this era where data is king, the behavior of AI companies has revealed a profound truth: the acquisition and use of data has become an inevitable battlefield in the pursuit of technological leadership. As data resources become increasingly strained, companies are looking for new sources of data at all costs, even if it means wading into legal and ethical gray areas. This approach has not only sparked widespread debate on data privacy, copyright, and creators' rights, but also exposed the loopholes and inadequacies of existing data exploitation mechanisms.

In this data-driven race for technology, there are both exciting developments and worrying pitfalls. The development of technology should not come at the expense of personal privacy and creators' rights. Using data rationally and legally, protecting data sources, and developing more efficient and fair data-utilization mechanisms will be key to the AI industry's future. As technology and society advance, we look forward to a more transparent and fair data ecosystem that promotes the healthy, sustainable development of AI.
