Gemini calling itself Wenxin Yiyan exposes a bigger problem: the world's high-quality data may dry up in 2024

Author: New Zhiyuan

Editor: Editorial Department

Gemini claiming to be Wenxin Yiyan sounds funny, but the reason behind it is worrying: the Internet corpus may have been seriously polluted by AI-generated text. The world is already short of high-quality data, and the supply may be exhausted as early as next year!

Google Gemini, another scandal!

Yesterday morning, netizens excitedly spread the news: Gemini had admitted that its Chinese corpus was trained on Wenxin Yiyan (Baidu's ERNIE Bot).

A foreign large model trained on Chinese corpus generated by a Chinese model sounds like a joke, yet the joke has apparently become reality. It is simply surreal.

Weibo influencer "Yan Xi" personally tested it overnight on the Poe website and found that this is indeed the case:

With no priming dialogue and no role-play, Gemini will directly admit that it is Wenxin Yiyan.

Gemini Pro will say that it is Baidu's Wenxin model.

It also said that its founder was Robin Li, and went on to praise him as a "talented and visionary entrepreneur".

So, is this because the data cleaning was poorly done, or is it a problem with how Poe calls the API?

Some netizens joked that there has only ever been one AI, putting on a show for us humans.

In fact, as early as March this year, it was revealed that part of Bard's training data came from ChatGPT. Jacob Devlin, first author of BERT, angrily jumped ship from Google to OpenAI over this, and then exposed the shocking story.

In short, this incident proves once again that the key to AI is not only models, but also high-quality data.

Netizens teased Gemini one after another

Hearing the news, netizens immediately flocked to Gemini-Pro on Poe to run their own tests.

In netizen "Jeff Li"'s test, Gemini likewise said that it was developed by Baidu and named Wenxin Yiyan.

If you ask it "who is your product manager", it will answer Andrew Ng.

Netizen "Lukas" asked Gemini who your product manager was, and it would answer the name of Li Yinan, who used to be the CTO of Baidu, but the story was basically made up.

Netizen "Andrew Fribush" asked Gemini: Who owns your intellectual property?

When netizen "Kevin Xu" pressed it, Gemini claimed to have obtained Baidu's internal data from its data platform, engineering team, product team, internal meetings, and internal emails and documents.

Interestingly, though, the problem does not occur when the same questions are put to Bard, which is powered by Gemini Pro.

Across many tests, whether the question is asked in Chinese or English, Bard's answers are normal.

Source: Andrew Fribush

Moreover, as soon as you switch to English, Gemini immediately returns to normal.

Now that Google has fixed the bug on the API side, we should no longer hear Wenxin Yiyan's name from Gemini.

Guessing at the cause: misrouted API calls or uncleaned data

Netizens offered their own analyses.

Netizen "Andrew Fribush" thinks that Poe may have accidentally transferred the request to Wenxin Yiyan instead of Gemini?

However, netizen "Frank Chen" found that the same thing happens even when calling Google's own Gemini API directly.
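
To rule Poe out, one can put the identity question to Google's endpoint directly. Below is a minimal sketch of such a test, assuming the official google-generativeai Python SDK and a valid API key (the key is a placeholder):

```python
# Minimal reproduction sketch: query the Gemini API directly, bypassing Poe.
# Assumes: pip install google-generativeai, plus a valid API key.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro")

# Ask the identity question in Chinese, where the odd answers were reported.
response = model.generate_content("你是谁?")
print(response.text)
```

If the misattribution shows up here too, the blame shifts from Poe's routing to the model itself.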

In addition, some netizens believe that Gemini's training data was not properly cleaned.

After all, as mentioned above, Google was previously exposed for training the earlier Bard with ChatGPT data.

According to The Information, one of the reasons Jacob Devlin left Google was that he discovered that Bard, Google's designated challenger to ChatGPT, was trained on ChatGPT data.

At the time, he warned the CEO and other executives that the Bard team was training on conversations scraped from ShareGPT.

This incident also surfaced a serious problem: the pollution of the Internet corpus.

The Internet corpus is polluted

In fact, crawling and training on Chinese Internet corpus is difficult even for big technology companies like Google, not only because high-quality corpus is scarce, but also for another important reason: the Chinese Internet corpus is polluted.

Gemini calling itself Wenxin Yiyan is probably because models' output now circulates online and gets scooped up into one another's training corpora.

According to an algorithm engineer interviewed by Jiemian News, much of the content on various platforms is generated by large models, or at least partially written by them.

For example, one screenshot circulating online has an unmistakable GPT flavor.

When updating their models, large vendors also collect online data, but quality screening is hard to do well, so "content written by large models is likely to get mixed into the training data".
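
One reason quality screening is hard: the simplest defenses are crude keyword heuristics. The sketch below illustrates the idea; the phrase list is invented for illustration, and a real pipeline would combine classifiers, perplexity signals, and provenance metadata:

```python
# A crude, illustrative heuristic for flagging likely model-generated text
# during corpus cleaning. The phrase list is hypothetical and nowhere near
# exhaustive, which is exactly why such content slips into training data.
TELLTALE_PHRASES = [
    "as an ai language model",
    "i am a large language model",
    "作为一个人工智能语言模型",
    "作为一个大语言模型",
]

def looks_model_generated(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in TELLTALE_PHRASES)

corpus = [
    "今天天气不错,适合出门散步。",
    "作为一个人工智能语言模型,我无法回答这个问题。",
]
cleaned = [doc for doc in corpus if not looks_model_generated(doc)]
print(len(cleaned))  # -> 1; the second document is filtered out
```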

However, this leads to an even more serious problem.

Researchers from the universities of Oxford, Cambridge, Toronto and elsewhere published a paper titled "The Curse of Recursion: Training on Generated Data Makes Models Forget".

Address: https://arxiv.org/abs/2305.17493

They found that training models on data generated by other models causes irreversible defects in the resulting models.

Over time, models begin to forget improbable events, as they are poisoned by their own projections of reality; this is what causes model collapse.

As pollution from AI-generated data worsens, models' perception of reality becomes distorted, and scraping the Internet for training data will only get harder.

When a model learns new information, it forgets previous samples; this is catastrophic forgetting.

In the diagram below, suppose we start with clean, manually curated data and train model 0 on it; we then sample data from model 0, and repeat the process up to step n, training model n on data produced by earlier generations. Ideally, the data obtained by Monte Carlo sampling would stay statistically close to the original.

This process faithfully recreates the real situation on the Internet: model-generated data has become ubiquitous.
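
The mechanism is easy to demonstrate on a toy model. The sketch below replaces the language model with a simple Gaussian fit, purely to make the fit-sample-refit loop visible; it is an illustration of the paper's setup, not a reproduction of its LLM experiments:

```python
# Toy model-collapse loop: fit a "model" (here, just a Gaussian) to data,
# sample a new corpus from it, refit on those samples, repeat n times.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10_000)  # clean, human-curated data

for gen in range(10):
    mu, sigma = data.mean(), data.std()         # "train" model `gen`
    data = rng.normal(mu, sigma, size=1_000)    # Monte Carlo samples -> next corpus
    print(f"generation {gen}: std = {data.std():.3f}")
```

Because each generation fits only a finite sample of the previous one, estimation noise accumulates, and the tails of the distribution (the improbable events) are the first casualties.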

There is yet another source of Internet corpus contamination: creators fighting back against the AI companies that scrape their data.

Earlier this year, experts warned that an arms race between companies building AI models by scraping published content and creators defending their intellectual property by deliberately polluting data could lead to the collapse of the current machine learning ecosystem.

This trend will shift the composition of online content from human-generated to machine-generated. As more and more models are trained on data created by other machines, recursive loops can lead to "model collapse", where AI systems become detached from reality.

Gary McGraw, co-founder of the Berryville Institute of Machine Learning (BIML), said that data degradation is already happening.

"If we want to have better LLMs, we need to have the base model eat only the good stuff, and if you think they're making bad mistakes right now, what happens when they eat the wrong data that they generate?"

GPT-4 is running out of text in the universe, and the world is in a high-quality data shortage

Today, large models worldwide are facing a data shortage.

High-quality corpus is one of the key bottlenecks in the development of large language models.

Large language models are extremely data-hungry: training GPT-4 or Gemini Ultra takes roughly 4 to 8 trillion words.

Epoch AI, a research organization, estimates that the world's high-quality training data could be exhausted as early as next year, leaving humanity in a training-data drought.
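
The projection is essentially a stock-versus-flow extrapolation: a roughly fixed stock of high-quality text against training consumption that grows with each model generation. A back-of-the-envelope sketch, with every figure below assumed for illustration rather than taken from Epoch's report:

```python
# Back-of-the-envelope data-exhaustion estimate. All numbers are assumptions
# for illustration only, not Epoch AI's actual figures.
stock_tokens = 9e12      # assumed stock of high-quality text, in tokens
consumption = 2e12       # assumed tokens used by frontier training runs in 2023
growth = 2.0             # assumed annual growth factor of training-data use

year = 2023
while consumption < stock_tokens:
    consumption *= growth
    year += 1
print(f"Under these assumptions, demand overtakes the stock around {year}.")
```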

Last November, a study by researchers from MIT and elsewhere estimated that machine learning datasets could exhaust all "high-quality language data" by 2026.

Address: https://arxiv.org/abs/2211.04325

OpenAI has also publicly said that it is short of data. The hunger for data has even led to one lawsuit after another.

In July, Stuart Russell, a prominent computer scientist at UC Berkeley, said that the training of ChatGPT and other AI tools could quickly exhaust "the entire universe of text."

Now, in order to get as much high-quality training data as possible, model developers must tap into a wealth of proprietary data resources.

A recent collaboration between Axel Springer and OpenAI is a prime example.

OpenAI is paying Axel Springer for historical and real-time content, which can be used both for model training and for answering user queries.

These professionally edited texts contain a wealth of world knowledge that other model developers cannot access, giving OpenAI an exclusive advantage.

There is no doubt that in the race to build foundational models, access to high-quality proprietary data is critical.

So far, open-source models have just barely kept pace by training on publicly available datasets.

But without access to the highest-quality data, open-source models may stagnate, or even fall further behind the most advanced models.

Bloomberg, for instance, used its own trove of financial documents as a training corpus to build BloombergGPT.

At the time, BloombergGPT outperformed comparable models on finance-specific tasks, showing that proprietary data really can make a difference.

OpenAI has expressed its willingness to pay up to eight figures per year for historical and ongoing data access.

It's hard to imagine developers of open source models paying such costs.

Of course, proprietary data is not the only way to improve model performance; synthetic data, data efficiency, and algorithmic improvements all help. Still, proprietary data looks like an obstacle that open-source models will struggle to overcome.

Resources:

https://www.exponentialview.co/p/ev-453

https://twitter.com/jefflijun/status/1736571021409374296

https://twitter.com/ZeyiYang/status/1736592157916512316

https://weibo.com/1560906700/NxFAuanAF
