
"Mentally handicapped" post: The effect of training AI is far ahead?, the research team responded

Author: Web of Science

"China Science News" reporter Zhao Guangli

Bai Yuelin and his friends never imagined that the Chinese instruction fine-tuning dataset they recently built would go viral because it used post data from Baidu Tieba's "Ruozhiba" ("Mentally Handicapped Bar") subforum.

Bai Yuelin is a third-year master's student at the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences. In the study titled "COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning," his team found that a large model fine-tuned on "Ruozhiba post titles + GPT-4 answers" outperformed models fine-tuned on any of the other supervised fine-tuning (SFT) instruction data they collected, which came from social platforms including Zhihu, encyclopedia sites, Douban, and Xiaohongshu. One industry insider remarked, "I laughed out loud when I saw the paper."

Netizens commented: "Ruozhiba is really scoring big this time," "Great wisdom looks like foolishness," and "Ruozhiba is humanity's last bastion against AI."

"I didn't expect this work to be 'out of the circle', but there are some misinterpretations on the Internet, such as some people ridiculed this study as 'Zhihu is not as good as the mentally handicapped'. As the co-first author of the paper, Bai Yuelin told China Science Daily that the authors of the article are from many top institutions at home and abroad, and "considering the team's academic reputation and social impact, these misinterpretations need to be clarified."

"Mentally handicapped" post: The effect of training AI is far ahead?, the research team responded

Bai Yuelin

It's not "mentally retarded" who "go to Oita"

"Mentally Handicapped Bar" is a sub-forum of Baidu Tieba. In this forum, users often post challenging content with puns, polysemous words, causal inversions, harmonic words, etc., many of which are designed with logical traps that can be challenging even for humans.

The titles of Ruozhiba posts read roughly like this:

"How many hours and a half hours is an hour and a half?"

"Why do meteorites always fall in craters?"

"Can a man survive if he has only one heart left?"

"The Bluetooth headset is broken, go to the hospital to hang the ear department or dentistry?"

There are also humorous one-liners: "Sashimi is actually dead fish slices," "Waiting for a red light is really waiting for the green light," "Caffeine comes from coffee fruit," "Fighting a fire means putting the fire out," "The south-pointing compass mainly points north," "Xiao Ming turned on the faucet because the faucet had scalded Xiao Ming's hands"...

"Mentally handicapped" post: The effect of training AI is far ahead?, the research team responded

A screenshot of the Ruozhiba subforum.

Because many of the questions on Ruozhiba are so mind-bending, they are often used to test the capabilities of large models.

Such corpus data naturally did not escape the research team's keen eye.

China Science News also learned that the members of this research team are mostly in their twenties, and most are master's or doctoral students. They frequent platforms such as Zhihu, Douban, and Xiaohongshu, and Ruozhiba is, of course, on the list.

So when they decided to hand-build a high-quality Chinese instruction fine-tuning dataset, the Ruozhiba corpus naturally became one of their options.

However, things are not as the online legend has it: that "Ruozhiba ranked first in 8 tests, far surpassing Zhihu, Douban, and Xiaohongshu" and "has become the best Chinese AI training data." In fact, what performed best on the Yi-34B large model was not simply "Ruozhiba" data; specifically, Ruozhiba contributed only the post titles.

The paper notes that the research team collected the 500 most upvoted posts on Ruozhiba and used their titles as instructions, with GPT-4 generating the corresponding responses. The team then manually reviewed, refined, and filtered GPT-4's responses, ultimately obtaining 240 (instruction, response) pairs. The Yi-34B large model fine-tuned on these 240 pairs recorded high scores on the Belle-Eval test set.
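The construction process described above is simple enough to sketch in a few lines of Python. The snippet below is only a minimal illustration, not the team's actual code: the prompt usage, file names, and helper functions (generate_response, build_pairs) are assumptions, and the crucial manual review step appears only as a comment.

```python
# A minimal sketch of the construction process described above, not the
# team's actual code. Prompt usage, file names, and helpers are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_response(title: str) -> str:
    """Ask GPT-4 to answer a Ruozhiba post title used as an instruction."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": title}],
    )
    return completion.choices[0].message.content


def build_pairs(titles: list[str]) -> list[dict]:
    """Turn post titles into (instruction, response) pairs for SFT."""
    pairs = []
    for title in titles:
        answer = generate_response(title)
        # In the paper, the generated responses were then manually reviewed,
        # refined, and filtered; only pairs that passed review were kept
        # (240 in total).
        pairs.append({"instruction": title, "output": answer})
    return pairs


if __name__ == "__main__":
    # "ruozhiba_top500_titles.json" is a hypothetical file holding the titles
    # of the 500 most upvoted posts collected from the subforum.
    with open("ruozhiba_top500_titles.json", encoding="utf-8") as f:
        titles = json.load(f)
    with open("ruozhiba_sft.json", "w", encoding="utf-8") as out:
        json.dump(build_pairs(titles), out, ensure_ascii=False, indent=2)
```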

"Mentally handicapped" post: The effect of training AI is far ahead?, the research team responded

The Ruozhiba-derived subset is far ahead of the other data sources in training effect. Image from the paper.

It should be pointed out that, unlike with Ruozhiba, the research team did not use GPT-4 to generate answers for the data from sources such as Zhihu, Xiaohongshu, Douban, and encyclopedia sites; instead, they applied strict data filtering to retain, as far as possible, the high-quality content written by humans on these platforms.

Take Zhihu, which hosts a large amount of high-quality user-generated content, as an example: the research team set filter conditions such as "highly upvoted answers," and after content filtering and scoring, kept the highest-scoring original content.
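As a rough illustration of what such filtering might look like, here is a hypothetical Python sketch; the field names, thresholds, and the scoring pass are assumptions for illustration, not the exact filter conditions used in the paper.

```python
# A hypothetical sketch of upvote-based filtering for Zhihu-style Q&A data.
# Field names, thresholds, and the scoring pass are illustrative assumptions,
# not the exact filter conditions used in the paper.
from dataclasses import dataclass


@dataclass
class Answer:
    question: str
    text: str
    upvotes: int


def keep_high_quality(answers: list[Answer],
                      min_upvotes: int = 50,
                      min_length: int = 100) -> list[Answer]:
    """Keep only highly upvoted, sufficiently detailed human-written answers."""
    kept = []
    for a in answers:
        if a.upvotes < min_upvotes:   # the "highly upvoted answers" condition
            continue
        if len(a.text) < min_length:  # drop one-liners and low-effort replies
            continue
        kept.append(a)
    # A further scoring/ranking pass (manual or model-assisted) would follow,
    # keeping only the top-scoring original content as (instruction, response)
    # pairs, with the question as the instruction.
    return sorted(kept, key=lambda a: a.upvotes, reverse=True)
```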

In contrast, for Ruozhiba the research team used only the post titles as instructions for training the large model; netizens' replies and comments were not included at all, and GPT-4 was used to assist in manually constructing the answers.

Therefore, facing online remarks such as "Ruozhiba is really scoring big," Bai Yuelin responded: "The online hype exaggerates the facts."

"Many readers mistakenly think that we can achieve good results by training a large model with the comments of 'mentally handicapped' netizens, but in fact, we only keep the title of the mentally handicapped post. Bai Yuelin said: "The experimental results do not represent the mentally handicapped, because the data is actually equivalent to the collaborative construction of multiple parties (netizens, authors and large model systems)." ”

Pitting data from different platforms against each other was not the study's intent

Why, then, did the research team treat "Ruozhiba" differently?

"Because our goal is to build data that meets the fine-tuning quality requirements of large model instructions, and the comments of netizens in Tieba are usually not suitable for direct fine-tuning data, we did not include the comments of 'mentally handicapped' netizens in our data. Bai Yuelin told China Science News.

Zhang Ge, the paper's corresponding author and a doctoral student at the University of Waterloo in Canada, further explained to China Science News: "The brain-racking questions that netizens come up with on Ruozhiba do provide high-quality, sharply angled instructions for a large model. However, the replies under those posts contain many offensive statements and even factual errors, and many answers go for wit and wordplay, whereas GPT-4's answers are basically 'very serious'; after manual screening, we can generally obtain more reliable answers."

"Mentally handicapped" post: The effect of training AI is far ahead?, the research team responded

Zhang Ge

Because this "differential treatment" of the Ruozhiba data is easily overlooked as the story spreads, casual onlookers readily misread the work as showing that Ruozhiba content alone can train a large model far beyond what other platforms can.

Bai Yuelin added: "Our experimental results do not fully represent the various platforms on the Internet, and any sentiment of platform rivalry is not something we want to explore or want to see."

However, it was precisely this special handling of the Ruozhiba data that led some people to question the experimental results once the paper's content spread.

Some skeptics pointed out that the sub-datasets from platforms such as Zhihu and Douban sample original content and netizen comments, while the Ruozhiba sub-dataset alone excludes netizen comments entirely and instead uses GPT-4 to synthesize responses; such answers are obviously more complete, accurate, and diverse, and the final scoring is also done by GPT-4. "With GPT-4 acting as both athlete and referee, won't the evaluation bias explode? Isn't it a bit sloppy to mislead the public and attract traffic with this kind of operation?"

Bai Yuelin responded to this question head-on.

"Getting traffic is not our original intention, we have no intention of grandstanding, let alone planning or arranging any promotional content, our original intention is just to silently contribute some high-quality datasets to the Chinese NLP (natural language processing) community; the original intention of the platform's 'running score' is to observe the impact of each platform data on each task in the test set." Bai Yuelin explained.

As for why only the Ruozhiba subset excludes netizen comments: as noted above, some Ruozhiba comments were judged unable to meet the quality standard required of the trained language model's answers, so the team decided to reconstruct the answers, and GPT-4 was used to assist mainly to minimize the human effort required. Bai also said the team has taken note of the evaluation-bias problem and plans to "add a human evaluation experiment" in the next version of the paper.

Zhang Ge told China Science News that hand-building a general, high-quality Chinese instruction fine-tuning dataset requires a great deal of screening, checking, and tuning. "It is a labor-intensive job," and wherever the machine can help, the team certainly won't pass up the chance.

All for "AI more suitable for Chinese babies"

Zhang Ge is a central figure in this research and one of the initiators of the COIG (Chinese Open Instruction Generalist) series of Chinese datasets.

Talking about his original motivation for initiating the research, he told China Science News that China has had no particularly high-quality open-source Chinese instruction fine-tuning datasets, with the few existing projects being only "barely usable"; hence the idea of providing the industry with a fully open-source dataset, drawn from diverse sources including Chinese social media, that can be used directly to fine-tune large models.

Screening and collecting challenging, authentic Chinese interaction data into structured corpora is undoubtedly valuable for training and evaluating large language models' ability to understand and follow Chinese instructions. Most directly, it helps reduce "hallucinations" in model responses (content in the model's output that does not conform to facts or common sense).

In this work, the author team constructed a Chinese instruction fine-tuning dataset containing more than 40,000 high-quality samples and open-sourced it to research institutions, enterprises, and other parties, providing a valuable resource for the Chinese NLP community.
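For readers who want to try the data, an open release like this can typically be loaded with the Hugging Face datasets library; note that the repository id "m-a-p/COIG-CQIA" and the subset name "ruozhiba" in the sketch below are assumptions to be checked against the official release page.

```python
# A hedged usage sketch: loading the open-sourced dataset with the Hugging Face
# `datasets` library. The repository id "m-a-p/COIG-CQIA" and the subset name
# "ruozhiba" are assumptions; check the official release for the exact names.
from datasets import load_dataset

cqia = load_dataset("m-a-p/COIG-CQIA", "ruozhiba", split="train")
print(len(cqia))   # number of samples in this subset
print(cqia[0])     # one instruction-response record
```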

The work is tedious and complex, however: it requires not only crawling high-quality content from various platforms but also cleaning and reviewing it with a variety of technical means, a huge workload that demands a collective effort. That is why the paper lists 20 authors.

Besides Bai Yuelin from the Shenzhen Institute of Advanced Technology of the Chinese Academy of Sciences, the team includes members from top institutions such as the Institute of Automation of the Chinese Academy of Sciences, the University of Science and Technology of China, Peking University, the University of Waterloo in Canada, and the University of Manchester.

China Science News further learned that this group of young people started the study in November 2023 and completed almost all of the work in under four months. How did they organize and collaborate so efficiently?

"We have created an open-source community dedicated to multimodal AI - M-A-P (Multimodal Art Projection), there is no offline entity, no profit purpose, as long as we can come and do things together, we welcome it. Zhang Ge said that more than two years ago, he and several friends came together for a music large-scale model training project and co-founded M-A-P. After that, friends, friends, friends, friends, friends...... More and more partners are interested in joining, and an open source community with stable contributions has been formed.

He told the reporter that in the M-A-P community, whoever initiates a project looks for partners to work on it together. If resources are needed, the group negotiates with technology companies and the like; if a company is willing to invest resources, they cooperate and develop it together. The precondition, however, is that after the project is completed the company must share the results with the open-source community, beyond whatever private resources it retains.

"The goal of all our projects is to be able to make something good open source for everyone to use. Zhang Ge said that the open source community has the flexibility and purity that universities and enterprises do not have, and the work of the Chinese Instruction Fine-tuning Dataset (CQIA) was initiated in the M-A-P community and gradually brought together domestic and foreign scientific research forces.

Zhang Ge said frankly that, from the start of the work to its completion, he never even met some of the collaborators in person.

(Liang Yiming, a Ph.D. student at the Institute of Automation, Chinese Academy of Sciences, and co-first author of the paper, also contributed to this article.)

Related Paper Information:

https://arxiv.org/abs/2403.18058