Gemini calling itself Wenxin Yiyan exposes a bigger problem: the world's high-quality data may dry up in 2024

Author: New Zhiyuan

Editor: Editorial Department

Gemini claiming to be Wenxin Yiyan sounds funny, but the reason behind it is worrying: the Internet corpus may have been seriously polluted by AI-generated text. The world is already short of high-quality data, and the supply may be exhausted as early as next year!

Google Gemini, another scandal!

Yesterday morning, netizens excitedly spread the news: Gemini had admitted that its Chinese corpus was trained on Wenxin Yiyan (Baidu's ERNIE Bot).

A foreign large model trained on Chinese corpus generated by a Chinese model sounds like a joke, yet the joke has apparently become reality. It is simply surreal.

Weibo influencer "Yan Xi" personally tested it overnight on the Poe website and found that this is indeed the case:

With no priming dialogue and no role-play, Gemini will directly admit that it is Wenxin Yiyan.

Gemini Pro will say that it is Baidu's Wenxin model.

It also said that its founder was Robin Li, and went on to praise him as a "talented and visionary entrepreneur".

So, is this because the data cleaning was poorly done, or is it a problem with how Poe calls the API?

Some netizens joked that there has only ever been one AI, putting on a show for us humans.

In fact, as early as March this year, it was revealed that part of Bard's training data came from ChatGPT. Jacob Devlin, first author of BERT, angrily jumped ship from Google to OpenAI over this, and then exposed the shocking story.

In short, this incident proves once again that the key to AI is not only models, but also high-quality data.

Netizens teased Gemini one after another

Hearing the news, netizens immediately flocked to Gemini-Pro on Poe to run their own tests.

In netizen "Jeff Li"'s test, Gemini likewise said that it was developed by Baidu and named Wenxin Yiyan.

If you ask it "who is your product manager", it will answer Andrew Ng.

Netizen "Lukas" asked Gemini who your product manager was, and it would answer the name of Li Yinan, who used to be the CTO of Baidu, but the story was basically made up.

Netizen "Andrew Fribush" asked Gemini: Who owns your intellectual property?

When netizen "Kevin Xu" pressed it, Gemini claimed to have obtained Baidu's internal data from its data platform, engineering team, product team, internal meetings, and internal emails and documents.

Interestingly, though, the problem does not occur when the same questions are put to Bard, which is powered by Gemini Pro.

Across many tests, whether the question is asked in Chinese or English, Bard's answers are normal.

Source: Andrew Fribush

Moreover, as soon as you switch to English, Gemini immediately returns to normal.

Now that Google has fixed the bug on the API side, we should no longer hear Wenxin Yiyan's name from Gemini.

Guessing at the cause: misrouted API calls or uncleaned data

Netizens offered their own analyses.

Netizen "Andrew Fribush" thinks that Poe may have accidentally transferred the request to Wenxin Yiyan instead of Gemini?

However, netizen "Frank Chen" found that the same thing happens even when calling Google's own Gemini API directly.
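
To rule Poe out, one can put the identity question to Google's endpoint directly. Below is a minimal sketch of such a test, assuming the official google-generativeai Python SDK and a valid API key (the key is a placeholder):

```python
# Minimal reproduction sketch: query the Gemini API directly, bypassing Poe.
# Assumes: pip install google-generativeai, plus a valid API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro")

# Ask the identity question in Chinese, where the odd answers were reported.
response = model.generate_content("你是谁?")
print(response.text)
```

If the misattribution shows up here too, the blame shifts from Poe's routing to the model itself.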

In addition, some netizens believe that Gemini's training data was not properly cleaned.

After all, as mentioned above, Google was previously exposed for training the earlier Bard with ChatGPT data.

According to The Information, one of the reasons Jacob Devlin left Google was that he discovered that Bard, Google's designated challenger to ChatGPT, was trained on ChatGPT data.

At the time, he warned the CEO and other executives that the Bard team was training on conversations scraped from ShareGPT.

This incident also surfaced a serious problem: the pollution of the Internet corpus.

The Internet corpus is polluted

In fact, crawling and training on Chinese Internet corpus is difficult even for big technology companies like Google, not only because high-quality corpus is scarce, but also for another important reason: the Chinese Internet corpus is polluted.

Gemini calling itself Wenxin Yiyan is probably because models' output now circulates online and gets scooped up into one another's training corpora.

According to an algorithm engineer interviewed by Jiemian News, much of the content on various platforms is generated by large models, or at least partially written by them.

For example, one screenshot circulating online has an unmistakable GPT flavor.

When updating their models, large vendors also collect online data, but quality screening is hard to do well, so "content written by large models is likely to get mixed into the training data".
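
One reason quality screening is hard: the simplest defenses are crude keyword heuristics. The sketch below illustrates the idea; the phrase list is invented for illustration, and a real pipeline would combine classifiers, perplexity signals, and provenance metadata:

```python
# A crude, illustrative heuristic for flagging likely model-generated text
# during corpus cleaning. The phrase list is hypothetical and nowhere near
# exhaustive, which is exactly why such content slips into training data.
TELLTALE_PHRASES = [
    "as an ai language model",
    "i am a large language model",
    "作为一个人工智能语言模型",
    "作为一个大语言模型",
]

def looks_model_generated(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in TELLTALE_PHRASES)

corpus = [
    "今天天气不错,适合出门散步。",
    "作为一个人工智能语言模型,我无法回答这个问题。",
]
cleaned = [doc for doc in corpus if not looks_model_generated(doc)]
print(len(cleaned))  # -> 1; the second document is filtered out
```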

However, this leads to an even more serious problem.

Researchers from the universities of Oxford, Cambridge, Toronto and elsewhere published a paper titled "The Curse of Recursion: Training on Generated Data Makes Models Forget".

Address: https://arxiv.org/abs/2305.17493

They found that training models on data generated by other models causes irreversible defects in the resulting models.

Over time, models begin to forget improbable events, as they are poisoned by their own projections of reality; this is what causes model collapse.

As pollution from AI-generated data worsens, models' perception of reality becomes distorted, and scraping the Internet for training data will only get harder.

When a model learns new information, it forgets previous samples; this is catastrophic forgetting.

In the diagram below, suppose we start with clean, manually curated data and train model 0 on it; we then sample data from model 0, and repeat the process up to step n, training model n on data produced by earlier generations. Ideally, the data obtained by Monte Carlo sampling would stay statistically close to the original.

This process faithfully recreates the real situation on the Internet: model-generated data has become ubiquitous.
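
The mechanism is easy to demonstrate on a toy model. The sketch below replaces the language model with a simple Gaussian fit, purely to make the fit-sample-refit loop visible; it is an illustration of the paper's setup, not a reproduction of its LLM experiments:

```python
# Toy model-collapse loop: fit a "model" (here, just a Gaussian) to data,
# sample a new corpus from it, refit on those samples, repeat n times.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)  # clean, human-curated data

for gen in range(10):
    mu, sigma = data.mean(), data.std()         # "train" model `gen`
    data = rng.normal(mu, sigma, size=1_000)    # Monte Carlo samples -> next corpus
    print(f"generation {gen}: std = {data.std():.3f}")
```

Because each generation fits only a finite sample of the previous one, estimation noise accumulates, and the tails of the distribution (the improbable events) are the first casualties.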

There is yet another source of Internet corpus contamination: creators fighting back against the AI companies that scrape their data.

Earlier this year, experts warned that an arms race between companies building AI models by scraping published content and creators defending their intellectual property by deliberately polluting data could lead to the collapse of the current machine learning ecosystem.

This trend will shift the composition of online content from human-generated to machine-generated. As more and more models are trained on data created by other machines, recursive loops can lead to "model collapse", where AI systems become detached from reality.

Gary McGraw, co-founder of the Berryville Institute of Machine Learning (BIML), said that data degradation is already happening.

"If we want to have better LLMs, we need to have the base model eat only the good stuff, and if you think they're making bad mistakes right now, what happens when they eat the wrong data that they generate?"

GPT-4 is running out of text in the universe, and the world is in a high-quality data shortage

Today, large models worldwide are facing a data shortage.

High-quality corpus is one of the key bottlenecks in the development of large language models.

Large language models are extremely data-hungry: training GPT-4 or Gemini Ultra takes roughly 4 to 8 trillion words.

Epoch AI, a research organization, estimates that the world's high-quality training data could be exhausted as early as next year, leaving humanity in a training-data drought.
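
The projection is essentially a stock-versus-flow extrapolation: a roughly fixed stock of high-quality text against training consumption that grows with each model generation. A back-of-the-envelope sketch, with every figure below assumed for illustration rather than taken from Epoch's report:

```python
# Back-of-the-envelope data-exhaustion estimate. All numbers are assumptions
# for illustration only, not Epoch AI's actual figures.
stock_tokens = 9e12      # assumed stock of high-quality text, in tokens
consumption = 2e12       # assumed tokens used by frontier training runs in 2023
growth = 2.0             # assumed annual growth factor of training-data use

year = 2023
while consumption < stock_tokens:
    consumption *= growth
    year += 1
print(f"Under these assumptions, demand overtakes the stock around {year}.")
```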

Last November, a study by researchers from MIT and elsewhere estimated that machine learning datasets could exhaust all "high-quality language data" by 2026.

Address: https://arxiv.org/abs/2211.04325

OpenAI has also publicly said that it is short of data. The hunger for data has even led to one lawsuit after another.

In July, Stuart Russell, a prominent computer scientist at UC Berkeley, said that the training of ChatGPT and other AI tools could quickly exhaust "the entire universe of text."

Now, in order to get as much high-quality training data as possible, model developers must tap into a wealth of proprietary data resources.

A recent collaboration between Axel Springer and OpenAI is a prime example.

OpenAI is paying Axel Springer for historical and real-time content, which can be used both for model training and for answering user queries.

These professionally edited texts contain a wealth of world knowledge that other model developers cannot access, giving OpenAI an exclusive advantage.

There is no doubt that in the race to build foundational models, access to high-quality proprietary data is critical.

So far, open-source models have just barely kept pace by training on publicly available datasets.

But without access to the highest-quality data, open-source models may stagnate, or even fall further behind the most advanced models.

Bloomberg, for instance, used its own trove of financial documents as a training corpus to build BloombergGPT.

At the time, BloombergGPT outperformed comparable models on finance-specific tasks, showing that proprietary data really can make a difference.

OpenAI has expressed its willingness to pay up to eight figures per year for historical and ongoing data access.

It's hard to imagine developers of open source models paying such costs.

Of course, proprietary data is not the only way to improve model performance; synthetic data, data efficiency, and algorithmic improvements all help. Still, proprietary data looks like an obstacle that open-source models will struggle to overcome.

Resources:

https://www.exponentialview.co/p/ev-453

https://twitter.com/jefflijun/status/1736571021409374296

https://twitter.com/ZeyiYang/status/1736592157916512316

https://weibo.com/1560906700/NxFAuanAF
