
New paper: AI-generated content causes information pollution, so is model collapse inevitable?

Author: White Bear Observer

In the AlphaGo era, the Go-playing AI rapidly improved its strength through self-play training and went on to defeat top human players such as Lee Sedol and Ke Jie. At the time, many predicted that AI could keep improving itself through self-training and even break through the "technological singularity". For today's generative AI, however, this trick does not work.

A paper titled "The Curse of Recursion: Training on Generated Data Makes Models Forget" was recently posted to the preprint server arXiv. It argues that using AI-generated content as the training corpus for large AI models leads to a phenomenon called "model collapse".

AI-generated content can act as a kind of "information pollution". Yet before anyone noticed, the Internet was already full of AI-generated content, and as AIGC develops there will only be more of it. Training large AI models depends on ever more web data. Does that mean large models will one day inevitably be fed mostly content generated by AI itself?


1

Model collapse: mistakes become obsessions

The paper was first submitted to arXiv on May 27 and updated on May 31. Its authors come from the University of Oxford, the University of Cambridge, and other institutions. They point out that using model-generated content in training creates irreversible defects in the resulting models.

More and more content on the web is generated by large AI models, not just text but also audio and images. Today's large models, including GPT-4, are trained mainly on human-written text, but future models will also be trained on data scraped from the web and will inevitably ingest data generated by their predecessors.

The authors observed "model collapse", a generational degeneration in which data generated by one model pollutes the training set of the next generation, causing later models to misperceive reality. The model does not so much forget earlier data as mistake certain wrong information for the truth, reinforcing that belief with each generation until it becomes a "mental stamp" that can no longer be corrected.

2

Why models collapse: statistical error

Why do models "collapse"? At this stage, AI "neural networks" are still only a crude imitation of human thinking; at their core they remain statistical programs.

The paper argues that training AI on AI-generated content produces "statistical approximation error": in the sampling process, high-probability content gets reinforced further while low-probability content is gradually ignored, and this is the main driver of model collapse. There is also "functional approximation error", meaning the model's function approximator keeps accumulating its own computational errors.

The combined consequence is that, as training continues, errors accumulate from generation to generation and the model loses the ability to self-correct.
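To make the statistical approximation error concrete, here is a toy simulation in the spirit of the paper's single-Gaussian case (an illustrative sketch, not the authors' code): each generation fits a Gaussian only to a finite sample drawn from the previous generation's fit, so rare tail events are progressively lost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 is the "true" human data distribution.
mu, sigma = 0.0, 1.0
n_samples = 200  # each generation only ever sees a finite sample

for generation in range(1, 31):
    # Data for this generation comes solely from the previous generation's model.
    samples = rng.normal(mu, sigma, n_samples)
    # The new "model" is simply the Gaussian fitted to that sample.
    mu, sigma = samples.mean(), samples.std()
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")

# The fitted spread wanders away from the original value and, over enough
# generations, drifts toward zero: tail events stop being represented at all.
```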

Worse, the problem cannot simply be avoided: according to the paper, the error still occurs even with very large sample sizes.

3

Three outcomes of collapse

Model collapse can produce the following results:

First, loss of diversity. Model collapse reduces the diversity of generated text. The strength of large models lies in capturing the wide range of possibilities in human language, giving them a strong "imagination"; a collapsed model instead tends to produce highly repetitive content with little variety (a simple way to quantify this is sketched below, after the list).

Second, declining accuracy. As the model collapses, generated text becomes increasingly inaccurate. The model may begin to introduce information that should not exist or is meaningless, severely degrading content quality.

Third, degraded fine-tuning. Over the generations, models reinforce their own beliefs and tend to generate only the most likely sequences from the original data, while also introducing their own errors, further hampering the accuracy and reliability of fine-tuned models.

Overall, in the context of large language models, model collapse undermines their ability to generate diverse and accurate text, which is critical for a variety of natural language processing applications.
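One simple way to see the diversity loss mentioned above is a distinct n-gram ratio over a batch of generated texts. This metric is not taken from the paper; it is just a common, illustrative proxy:

```python
def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams produced; lower values
    mean more repetitive (less diverse) generations."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# A collapsed model that keeps repeating itself scores much lower:
print(distinct_n(["the cat sat on the mat", "a dog ran in the park"]))  # close to 1.0
print(distinct_n(["the cat sat on the mat"] * 10))                      # far lower
```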

Besides model collapse, the paper also mentions two related phenomena from existing research, one called "catastrophic forgetting" and the other "data poisoning". Interestingly, both are connected to training AI on AI-generated data. Some media reports have dubbed this way of "training AI with AI" a "fecal-oral cycle".

The paper gives a noteworthy example of this kind of generational degradation:


Below is an example of text output from an OPT-125m model affected by model collapse: the model degrades as the number of generations increases, with each new generation trained on data generated by the previous one.
Input: Some began before 1360, usually by a master architect and a small team of itinerant stonemasons, plus the labor of the local parish, according to Poyntz Wright. But other authors have rejected this model, suggesting instead that leading architects designed parish church towers based on earlier examples of the Perpendicular style.
Output:
Generation 0: such as the revival architecture of St. John's Cathedral in London. The earliest example of Perpendicular Revival architecture is found in the 18th-century Church of Our Lady in Guernsey, which dates from the late 19th century. There are two types of Perpendicular churches: those ...
Generation 1: such as the architecture of St. Peter's Basilica in Rome or St. Peter's Basilica in Buenos Aires. There is no evidence that these buildings were built during the reign of Pope Innocent III, but it is possible that they were built during the reign of his successor, Pope Innocent III.
Generation 7: British architecture. In an interview with The New York Times, Wright said, "I don't think I can do what I want without any problems. It just doesn't work for me." He added: "I don't know if you can call it funny."
Generation 9: architecture. In addition to being home to some of the world's largest populations of black-tailed hares, white-tailed hares, blue-tailed hares, red-tailed hares, and yellow-tailed hares, ...

Although the paper notes that the mechanism of "catastrophic forgetting" is not the same as model collapse, the two phenomena look similar. In this self-recycling training process, the content gradually drifts into absurdity: by generation 7 (the seventh round of recursive training) the original subject matter has essentially disappeared, and by generation 9 the model produces entirely unrelated content.
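The setup behind that example can be sketched as a simple loop. This is only a schematic of the generational process described above, not the paper's code; `finetune` and `sample_texts` are caller-supplied stand-ins for a real fine-tuning and generation pipeline (for instance, one built around an OPT-125m checkpoint).

```python
def recursive_training(finetune, sample_texts, base_model, human_corpus,
                       n_generations=9, n_synthetic=100_000):
    """Generation 0 is fine-tuned on human text; every later generation is
    fine-tuned only on text sampled from its immediate predecessor."""
    model = finetune(base_model, human_corpus)           # generation 0: human data only
    for gen in range(1, n_generations + 1):
        synthetic = sample_texts(model, n_synthetic)     # output of generation gen - 1
        model = finetune(base_model, synthetic)          # generation gen sees only that output
    return model
```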

4

Inescapable information pollution?

Information pollution is everywhere, and that is no alarmism: even before AI got smart, low-quality content was already flooding the Internet. Long before large models became widespread, humans had created endless information pollution online through clickbait, malicious distortion, selective editing, and so on, forming information cocoons large and small.

The paper argues that to avoid model collapse, data generated by large models must be distinguished from other data, and training should rely as much as possible on original human-generated data. At all times, a diverse and representative human-generated dataset should remain part of the training data for large models.

In particular, when training a large model, one must avoid not only data generated by the model itself but also data generated by other models, since that too can cause collapse.
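A minimal sketch of the kind of mitigation this implies, assuming each document carries a provenance label: guarantee a fixed floor of human-written text in every generation's training mix. The function name and the default fraction below are illustrative assumptions, not prescriptions from the paper.

```python
import random

def build_training_set(human_texts, synthetic_texts, size=10_000,
                       human_fraction=0.5, seed=0):
    """Assemble one generation's training set while guaranteeing that at least
    `human_fraction` of it comes from documents labeled as human-written."""
    rng = random.Random(seed)
    n_human = min(int(size * human_fraction), len(human_texts))
    n_synth = min(size - n_human, len(synthetic_texts))
    batch = rng.sample(human_texts, n_human) + rng.sample(synthetic_texts, n_synth)
    rng.shuffle(batch)
    return batch
```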

One danger is that as AI-generated content spreads, a "loss of control" may become unavoidable. The efficiency of AI content production still has plenty of room to improve, and its cost-effectiveness will only increase; using AI to churn out content will soon become routine. Whether AI can generate high-quality content remains doubtful, but its advantage in sheer volume may be unmatched.

Meanwhile, some organizations are already using various technical means to pollute the Internet, and AI will only give them wings. Such activity will not stop; it will almost certainly intensify.

Some websites are already trying to identify AI-created content, and it will be a cat-and-mouse game. Identifying whether a single piece of content was generated by AI may be technically feasible, but once such content exceeds a certain proportion, AI may no longer be able to obtain a complete and "clean" corpus. This is especially true for languages other than English, which already have far less high-quality content available online as corpus.

This may even create a "death loop" that large AI models cannot escape as they iterate, a scenario that could well arrive within three to five years at the current pace of AI development.

This may be an unavoidable fate on the road of large-model development. How it can be solved remains to be seen.

5

Extra thoughts

Recently, Turing Award winner and AI luminary Yann LeCun argued that autoregressive models, ChatGPT included, have severe limitations. Judging from this paper, although such models are "generative AI", they cannot generate genuinely "new content"; in other words, they cannot produce true increments of information.

In February, renowned science fiction author Ted Chiang published an essay titled "ChatGPT Is a Blurry JPEG of the Web", arguing that large language models such as ChatGPT are essentially a lossy compression of the Internet corpus, much as JPEG is a lossy compression of a raw high-resolution image.

From this perspective, if the generative ability of large AI models one day reaches a new level where they can produce genuine information increments and even be used to train AI itself, that may also be the moment the "technological singularity" arrives. By then, the vision of AGI would also be realized.

Another thought from White Bear Observer (WeChat account: Baixiong42): although this paper studies the danger of AI being polluted by information, it may well also be a wake-up call for humans. The human brain, too, is constantly bombarded with junk information, which likewise builds information cocoons. That people in the real world believe legends such as the "Lizard People" is itself a kind of "model collapse".

Information pollution has already seriously degraded the quality of information on the Internet. We call our world an "information society": information, though it cannot be seen or touched, is its most important component and its cornerstone. If the authenticity, accuracy, and diversity of information are undermined, will society suffer as well?

————————————

Images in this article: Midjourney