
Sora training data source: YouTube? OpenAI cast as a "thief"

Author: Three easy life

Not long ago, OpenAI flexed its muscles with Sora, a text-to-video model billed as a "world simulator", once again showing the outside world that it still leads the AI race. Recently, however, Sora appears to have landed OpenAI in trouble. A few days ago, the Wall Street Journal and the New York Times each reported on the difficulties AI companies face in collecting high-quality training data. The New York Times, which is in a lawsuit with OpenAI, stated directly that OpenAI collected more than one million hours of YouTube videos to train GPT-4.


Immediately afterward, Google spokesman Matt Bryant said, "Our robots.txt files and terms of service prohibit unauthorized scraping or downloading of YouTube content." YouTube CEO Neal Mohan went further in an interview with Bloomberg: although there is no direct evidence that OpenAI used YouTube videos to train Sora, he warned that doing so would violate YouTube's current terms of service. In fact, Mohan's remarks were aimed not only at the New York Times report but also at OpenAI CTO Mira Murati's failure to identify the source of Sora's training data.

When a Wall Street Journal reporter asked about the source of Sora's training data, Mira Murati said, "I'm actually not sure about that." Even when asked whether OpenAI used data from the stock media platform Shutterstock, she dodged the question. Yet Shutterstock and OpenAI reached a partnership as early as 2021 that allows OpenAI to use the platform's images, videos, and music to train AI models, and to appease artists, Shutterstock even paid compensation to those whose works OpenAI used.


Mira Murati's disastrous answer immediately set off questions about OpenAI's lack of transparency and its non-compliant data scraping practices. In fact, OpenAI does currently face a shortage of data that is both public and licensed for use.

According to a related report in the New York Times, OpenAI ran out of useful data in 2021 and, having exhausted its other resources, discussed the feasibility of transcribing YouTube videos, podcasts, and audiobooks. OpenAI was aware that using content from YouTube raised legal issues but considered it fair use, and OpenAI President Greg Brockman was personally involved in collecting the videos that were used.

However, "public data" is not the same as "open data". Although a considerable amount of data is indeed published openly on the Internet, that does not mean its owners are willing to share it for free. OpenAI's training of ChatGPT is a positive example of the compliant use of public Internet data. OpenAI reportedly used Common Crawl, Wikipedia, and the U.S. patent document database. Wikipedia is one of the best-known open-content projects, and Common Crawl is likewise an open repository that crawls the web and makes the resulting data freely available for download.


The people who maintain these open data sources are almost all believers in the Internet spirit of openness, equality, collaboration, and sharing. But as the Internet industry becomes more and more commercialized, that spirit is gradually withering, and today only a handful of projects like Wikipedia remain. When organizations willing to share data for free can no longer satisfy OpenAI's appetite, paying for data is one way out. The problem is that OpenAI's offers fail to impress copyright owners, and few of them are willing to sell it their data.

Copyright owners, with the media as their most visible representatives, usually want to sell their data at a high price, because judging by the capabilities of current large models such as ChatGPT, GPT-4, and Sora, the first people these models replace may be liberal arts graduates rather than science graduates. Copyright owners are not necessarily unwilling to sell the rope that could hang them, but OpenAI's offer of $1 million to $5 million a year is clearly not sincere enough. Yet OpenAI cannot actually offer much more, because it needs so much data: its data procurement budget may be large in total, but spread across suppliers it comes to less than $5 million per company.


Seen this way, it is not surprising that OpenAI would use fair use as an excuse to scrape YouTube video content. In fact, data crawling has sat in a gray area ever since the Internet industry took off. As the saying goes, "all crows under heaven are equally black": almost no Internet company is beyond reproach when it comes to data collection. For example, it has long been an unspoken rule that search engine crawlers scrape data from one another, yet copyright owners treat search engines and AI models in completely different ways.

Why has a "gentleman's agreement" like the robots protocol survived for so long, and why do websites invest in SEO, optimizing their content and structure so that search engine crawlers can access them? Precisely because search engines benefit websites: they bring traffic, and with traffic a site can sell advertising or monetize in other ways.
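To make the "gentleman's agreement" concrete, here is a minimal sketch of how a well-behaved crawler might consult a site's robots rules before fetching a page, using Python's standard `urllib.robotparser`. The rules, the bot names `ExampleSearchBot` and `ExampleTrainingBot`, and the URL are all hypothetical and invented for illustration; they are not taken from any real site. The check is purely advisory: nothing technically stops a crawler that chooses to ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only:
# one named search crawler is welcome, everyone else is asked to stay out.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/watch?v=abc"  # hypothetical URL

# The named search crawler matches its own rule group and is allowed.
search_allowed = may_fetch(parser, "ExampleSearchBot", page)

# Any other crawler falls through to the wildcard group and is disallowed.
other_allowed = may_fetch(parser, "ExampleTrainingBot", page)

print(search_allowed, other_allowed)
```

In this sketch the site owner invites the crawler that sends traffic back and turns away everyone else, which mirrors the asymmetry described above. Compliance still depends entirely on the crawler's good faith.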

This win-win arrangement is why search engines can crawl data tirelessly without infuriating copyright owners. AI models, by contrast, give almost nothing back: the data OpenAI takes only pushes its valuation higher, and YouTubers have not seen a cent of the money ChatGPT Plus earns.


In a sense, the fact that OpenAI has been pushed into the spotlight this time also shows that this AI unicorn has a weakness: its data depends heavily on outside suppliers. As major companies build their own AI models one after another, OpenAI faces an unavoidable problem: it has no content platform of its own, and the content platforms all belong to its competitors.

Even if Microsoft wanted to supply data to OpenAI, it would not be easy. Users now pay more and more attention to personal privacy, so almost every user agreement states something like, "We obtain your information to better serve you, and we promise not to share this information with third parties."

Before it released ChatGPT, OpenAI could still "grow quietly" out of sight. Now that it stands at center stage, its room to maneuver is naturally getting smaller and smaller.
