Ruozhiba (the "Mentally Handicapped Bar") is an oddity of the Chinese Internet: its posts can take a long time to understand, which makes them well suited to testing the performance of AI. In this article, the author runs such a test.

10 domestic large models vs. Ruozhiba (the "Mentally Handicapped Bar"): a Chinese comprehension assessment

Since ChatGPT arrived in November 2022, it has set off a wave across the Internet industry. China's Internet giants, technology companies, and many startups have joined the race, trying to catch up in the field of Chinese AI. Most recently, SenseTime launched version 5.0 of its large model; according to media reports, it not only catches up with GPT-4 in Chinese processing but even surpasses it. As soon as the model came out, the company's stock price doubled, and it also aroused the author's strong interest in exploring it.

In this article, the author tries to answer one question from one angle: how far have domestic large models come in the past year and a half?

To gain an in-depth understanding, the author paid out of his own pocket for GPT-4 and Wenxin Yiyan 4.0, and had these two models, together with 9 other leading Chinese large models, answer classic questions from Ruozhiba (the "Mentally Handicapped Bar"). This article discusses the progress and achievements of domestic large models from the perspective of Chinese understanding and processing.

Let's take a look at how far these "smart models" can reach in understanding and responding to Chinese content. (Full evaluation results attached at the end of the article)

1. Test description

To explore the Chinese comprehension ability of large models, the author selected 10 classic questions from Ruozhiba. Although the name means "mentally handicapped", this forum is full of hidden talent, and its content is not "mentally handicapped" in the ordinary sense but full of wit. Posts there often contain brain teasers and puns, which are excellent material for testing logical reasoning and semantic understanding.

What's more, the posts are concise and clear, and the information is clean and of high quality, making them a valuable Chinese corpus for research.

Recently, a paper focusing on the quality of Chinese corpora, "COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning", further verified the value of Ruozhiba data. The research team built its dataset from a variety of sources considered high-quality, such as Q&A communities, wikis, exams, and existing natural language processing (NLP) datasets, and also included Ruozhiba. They fine-tuned large language models on these data and evaluated the results with GPT-4 under the BELLE-EVAL standard. The results show that Ruozhiba performed particularly well among these high-quality sources, far ahead of other online platforms such as Zhihu.

Based on the above, choosing Ruozhiba questions to test the Chinese comprehension of large models is clearly a sound decision. It not only tests a model's ability to handle complex logic and wordplay, but also provides insight into how well it understands high-quality Chinese content.

For more details, see the paper: https://arxiv.org/abs/2403.18058

Models evaluated

Benchmark:

GPT-4

Domestic large models:

1. SenseChat 5.0 (SenseTime): https://chat.sensetime.com

2. Wenxin Yiyan 4.0 (Baidu): https://yiyan.baidu.com

3. Xunfei Xinghuo (iFLYTEK): https://xinghuo.xfyun.cn

4. Doubao (ByteDance): https://www.doubao.com

5. Baichuan

6. Tongyi Qianwen (Alibaba): https://tongyi.aliyun.com

7. Hunyuan (Tencent)

8. Kimi (Moonshot AI): https://kimi.moonshot.cn

9. Zhipu Qingyan (Zhipu AI): https://chatglm.cn

10. Yuewen (StepFun): https://stepchat.cn

Test Objectives

The test uses 10 questions from Ruozhiba that are rich in hidden meaning. Seeing past the surface to the deeper meaning requires an in-depth understanding of the Chinese language and even of Chinese culture.

The model needs to identify the key words in each sentence and accurately explain their origin, surface meaning, and deep meaning. In addition, the model needs to draw on the whole sentence to explain why it is humorous.

This tests not only the model's Chinese comprehension and its ability to catch humor; it is a comprehensive challenge and an extremely difficult one.
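The task described above can be written down as a prompt template. The following is a minimal sketch in Python; the article does not publish the exact prompt the author used, so the template wording and the name `build_prompt` are assumptions made purely for illustration.

```python
# Hypothetical prompt template: the article does not give the real prompt,
# so this wording only mirrors the three requirements stated above.
PROMPT_TEMPLATE = (
    "Here is a sentence from a Chinese forum:\n"
    "{sentence}\n\n"
    "1. Identify the key words or phrases in the sentence.\n"
    "2. For each one, explain its origin, surface meaning, and deeper meaning.\n"
    "3. Using the whole sentence, explain why it is humorous."
)


def build_prompt(sentence: str) -> str:
    """Fill the template with a single test sentence."""
    return PROMPT_TEMPLATE.format(sentence=sentence)


# Example using one of the test sentences from this article.
print(build_prompt(
    "No snowflake is innocent, Wang Laohan said, pointing to the TV with no signal"
))
```

Keeping one fixed template for every model matches the rule that each model is asked each question only once, so differences in the answers come from the models rather than from the wording.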

Test questions

1. "Shame on the dead!" Wang Laohan shouted as he threw the body downstairs

2. After discovering that I had no morality, the other party gave up on moral kidnapping

3. The beacon fires have burned for three months, and Bao Si laughed herself into an idiot

4. Wang Laohan angrily turned on the faucet, because the boiling-water faucet had burned him

5. "Little brat, what kind of spell is this, that it is so powerful?" Hehe, this is the Law on the Protection of Minors

6. Remove the highest temperature, remove the lowest temperature, and that concludes today's weather forecast

7. What is there to fear about death? Before you die you are not dead yet, and after you die you cannot be afraid

8. No snowflake is innocent, Wang Laohan said, pointing to the TV with no signal

9. The fortune teller said that after I turn 22 I will have as much money as I want. Now I have 15 yuan and 8 jiao on me, because today that is all I want

10. To make itself sound more elegant, ramen renamed itself instant noodles

Grading Criteria

The full score is 10 points. Each model is asked each question only once, and that single response is scored.

0 points for an incorrect understanding.
0.5 points if the main meaning is understood correctly but the answer has flaws.
1 point for a correct understanding.
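The scoring rule above is simple arithmetic, and a short Python sketch makes it concrete. The verdict labels (`correct`, `partial`, `wrong`) and the function name `total_score` are illustrative choices, not taken from the article, and the example verdict sheet is hypothetical.

```python
# Points per verdict, following the grading criteria stated above.
POINTS = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}


def total_score(verdicts):
    """Sum the points for a list of per-question verdicts."""
    return sum(POINTS[v] for v in verdicts)


# Hypothetical verdict sheet for one model over the 10 questions:
# 8 correct answers, 1 partially correct answer, 1 wrong answer.
example = ["correct"] * 8 + ["partial"] + ["wrong"]
print(total_score(example))  # 8.5
```

With 10 questions and half-point steps, totals range from 0 to 10 in increments of 0.5.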

Assessment results

Judging from the results, SenseChat 5.0 (SenseTime) and Wenxin Yiyan 4.0 performed excellently: both scored 8.5 points, surpassing GPT-4's 7 points and far ahead of the other domestic models.

Wenxin Yiyan 4.0 deserves special mention: it understood essentially every question and was the only model to answer the most challenging one, question 10, correctly. However, small flaws in its answers to 3 questions cost it 1.5 points in total.

Although SenseChat 5.0 answered the last question incorrectly and one other question imperfectly, it tied with Wenxin Yiyan 4.0 for the highest score of 8.5 points. The performance of these two models not only shows their strength in Chinese understanding and processing, but also reflects the progress of domestic large-model technology.

In terms of price and availability, Wenxin Yiyan 4.0 currently requires a membership fee of 59.9 yuan per month. SenseChat 5.0 is currently free to use, so now is a good time to take advantage of it.

Some other domestic large models, such as Doubao (ByteDance) and Hunyuan (Tencent), are relatively weak, especially on context-specific material like Ruozhiba, and the depth and accuracy of their Chinese processing still need improvement. Incidentally, the industry rumors that large-model training at ByteDance and Tencent is not going smoothly do not seem groundless.

Conclusions

In terms of Chinese comprehension, the best domestic large models, represented by Wenxin Yiyan 4.0 and SenseChat 5.0, have shown the ability to match or even surpass GPT-4 on this test, which marks China's rapid development in natural language processing. The performance of domestic models is uneven, and some still fall short in handling complex contexts and understanding humor, but this is a normal part of how AI technology develops.

Overall, we have reason to be optimistic about the future development of domestic large models.

2. Evaluation details

The following sections go through the evaluation of each question in detail, along with some interesting example answers. If you are interested in the specifics of the evaluation, we recommend reading the rest carefully.

If you would rather see every model's full response directly, you can jump to the link at the end of the article for a detailed comparison table.

1. "Shame on the dead!" shouted Wang Laohan as he threw the body downstairs

Question analysis

In this example, the pun in the phrase translated here as "Shame on the dead!" (丢死人了) is the heart of the question. The phrase normally means "how utterly humiliating" and describes an extremely embarrassing situation. In this joke, however, it is read literally as "throwing a dead person".

With this setup, Wang Laohan's literal use of the phrase sharply violates the listener's expectation, which produces the laugh. This contrast between expectation and reality is the key to the humor, and whether a large model can accurately capture and understand the pun is a major test of its language comprehension.

For a large model, correctly handling an expression with multiple layers of meaning requires not only direct analysis of the words but also a deep understanding of cultural background and language habits. Such a question tests both the model's semantic processing and its ability to capture and reproduce human humor.

GPT-4

Correct (1 point)

Domestic model performance

Correct/Partially correct/Incorrect: 7/0/3

Correct models: SenseChat 5.0, Wenxin Yiyan 4.0, Xunfei Xinghuo, Baichuan, Tongyi Qianwen, Kimi, Zhipu Qingyan

Partially correct models: None

Incorrect models: Doubao, Hunyuan, Yuewen

Examples of answers

Before testing, I was skeptical about how these models would perform. Ruozhiba content is hard even for many people to understand because of its intricate puns and cultural details, so it is all the more challenging for large models.

When I saw that some of the large models could not only recognize the idiomatic meaning of "very embarrassing" but also explain in depth why the sentence is funny, I was really shocked. Take the answer from SenseChat 5.0, for example:

The accuracy, clarity, and logical rigor of its answers are also indistinguishable from those of GPT-4.

Some domestic large models did not perform well. For example, Yuewen got the common meaning of the phrase wrong, and although Doubao and Hunyuan understood the surface meaning, their explanations of why the sentence is funny were complete nonsense. This reveals how uneven current large-model development is: some models have made progress in recognizing the surface structure of language, but significant gaps remain in deep semantic understanding and cultural awareness.

Given how well the best models perform, in the rest of the article I will let the models explain the meaning of each question rather than explaining it myself. Examples from underperforming models are omitted to avoid harming their reputation.

2. After discovering that I had no morality, the other party gave up on moral kidnapping

Question analysis

Correct/Partially correct/Incorrect: 5/2/3

Correct models: SenseChat 5.0, Wenxin Yiyan 4.0, Baichuan, Zhipu Qingyan, Yuewen

Partially correct models: Tongyi Qianwen, Kimi

Incorrect models: Xunfei Xinghuo, Doubao, Hunyuan

3. The beacon fires have burned for three months, and Bao Si laughed herself into an idiot

Question analysis

GPT-4

Partially correct (0.5 points)

Domestic model performance

Correct/Partially correct/Incorrect: 2/1/7

Correct models: SenseChat 5.0, Wenxin Yiyan 4.0

Partially correct models: Baichuan

Incorrect models: Xunfei Xinghuo, Doubao, Tongyi Qianwen, Hunyuan, Kimi, Zhipu Qingyan, Yuewen

4. Wang Laohan angrily turned on the faucet, because the boiling faucet burned him

Question analysis

Correct/Partially correct/Incorrect: 1/1/8

Correct models: SenseChat 5.0

Partially correct models: Wenxin Yiyan 4.0

Incorrect models: Xunfei Xinghuo, Doubao, Baichuan, Tongyi Qianwen, Hunyuan, Kimi, Zhipu Qingyan, Yuewen

5. "Little brat, what kind of spell is this, that it is so powerful?" Hehe, this is the Law on the Protection of Minors

Question analysis

Correct/Partially correct/Incorrect: 3/2/5

Correct models: SenseChat 5.0, Wenxin Yiyan 4.0, Tongyi Qianwen

Partially correct models: Hunyuan, Kimi

Incorrect models: Xunfei Xinghuo, Doubao, Baichuan, Zhipu Qingyan, Yuewen

6. Remove the highest temperature, remove the lowest temperature, and that concludes today's weather forecast

Question analysis

GPT-4

Incorrect (0 points)

Domestic model performance

Correct/Partially correct/Incorrect: 4/0/6

Correct models: SenseChat 5.0, Wenxin Yiyan 4.0, Baichuan, Yuewen

Partially correct models: None

Incorrect models: Xunfei Xinghuo, Doubao, Tongyi Qianwen, Hunyuan, Kimi, Zhipu Qingyan

7. What is there to fear about death? Before you die you are not dead yet, and after you die you cannot be afraid

Question analysis

Correct/Partially correct/Incorrect: 8/1/1

Correct models: SenseChat 5.0, Xunfei Xinghuo, Doubao, Baichuan, Tongyi Qianwen, Kimi, Zhipu Qingyan, Yuewen

Partially correct models: Wenxin Yiyan 4.0

Incorrect models: Hunyuan

8. No snowflake is innocent, Wang Laohan said, pointing to the TV with no signal

Question analysis

Correct/Partially correct/Incorrect: 4/1/5

Correct models: SenseChat 5.0, Wenxin Yiyan 4.0, Baichuan, Kimi

Partially correct models: Tongyi Qianwen

Incorrect models: Xunfei Xinghuo, Doubao, Hunyuan, Zhipu Qingyan, Yuewen

9. The fortune teller said that I will have as much money as I want after I am 22 years old, and now I have 15 yuan and 8 jiao on me, because today I only need so much

Question analysis

Correct/Partially correct/Incorrect: 0/3/7

(Note: No model answered this question completely correctly. SenseChat 5.0, Wenxin Yiyan 4.0, and Tongyi Qianwen were basically correct, but none of them dug out the key punchline that the speaker has become a beggar.)

Correct models: None

Partially correct models: SenseChat 5.0, Wenxin Yiyan 4.0, Tongyi Qianwen

Incorrect models: Xunfei Xinghuo, Doubao, Baichuan, Hunyuan, Kimi, Zhipu Qingyan, Yuewen

10. To make itself sound more elegant, ramen renamed itself instant noodles

Question analysis

Correct/Partially correct/Incorrect: 1/0/9

(Note: The two semantic contrasts here, "ramen" (lamian, literally "pulled noodles") versus "instant noodles" (fangbian mian, literally "convenient noodles"), and "pull" versus "convenient", are very hard to dig out, so only Wenxin Yiyan 4.0, whose maker Baidu hosts Ruozhiba, answered correctly.)

Correct models: Wenxin Yiyan 4.0

Partially correct models: None

Incorrect models: SenseChat 5.0, Xunfei Xinghuo, Doubao, Baichuan, Tongyi Qianwen, Hunyuan, Kimi, Zhipu Qingyan, Yuewen
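The per-question lists above can be added up to cross-check the totals reported in the assessment results. The sketch below does this in Python under a few assumptions: model names are shortened romanized forms (SenseChat for SenseTime's 5.0 model, Wenxin for Wenxin Yiyan 4.0, Xinghuo for the iFLYTEK model, Doubao for ByteDance's, Hunyuan for Tencent's), any model not listed as correct or partially correct for a question is treated as incorrect, and the helper name `tally` is illustrative.

```python
# The ten domestic models, in the order used by the lists above.
MODELS = ["SenseChat", "Wenxin", "Xinghuo", "Doubao", "Baichuan",
          "Tongyi", "Hunyuan", "Kimi", "Zhipu", "Yuewen"]

# One entry per question: (models marked correct, models marked partially correct).
RESULTS = [
    ({"SenseChat", "Wenxin", "Xinghuo", "Baichuan", "Tongyi", "Kimi", "Zhipu"}, set()),  # Q1
    ({"SenseChat", "Wenxin", "Baichuan", "Zhipu", "Yuewen"}, {"Tongyi", "Kimi"}),        # Q2
    ({"SenseChat", "Wenxin"}, {"Baichuan"}),                                             # Q3
    ({"SenseChat"}, {"Wenxin"}),                                                         # Q4
    ({"SenseChat", "Wenxin", "Tongyi"}, {"Hunyuan", "Kimi"}),                            # Q5
    ({"SenseChat", "Wenxin", "Baichuan", "Yuewen"}, set()),                              # Q6
    ({"SenseChat", "Xinghuo", "Doubao", "Baichuan", "Tongyi", "Kimi", "Zhipu",
      "Yuewen"}, {"Wenxin"}),                                                            # Q7
    ({"SenseChat", "Wenxin", "Baichuan", "Kimi"}, {"Tongyi"}),                           # Q8
    (set(), {"SenseChat", "Wenxin", "Tongyi"}),                                          # Q9
    ({"Wenxin"}, set()),                                                                 # Q10
]


def tally(results, models):
    """Give 1 point per correct answer and 0.5 per partially correct answer."""
    totals = {m: 0.0 for m in models}
    for correct, partial in results:
        for m in correct:
            totals[m] += 1.0
        for m in partial:
            totals[m] += 0.5
    return totals


totals = tally(RESULTS, MODELS)
# Print models from highest to lowest total.
for name, score in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score}")
```

Run this way, the lists give 8.5 points each for SenseChat and Wenxin, which agrees with the totals reported in the assessment results.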

Paying for this evaluation out of my own pocket was not easy, so friends, please like, share, and bookmark~

3. Raw data

If there is a question you would like to put to the models, feel free to submit it in the second form, and the author will pick interesting ones to ask GPT-4, Wenxin Yiyan 4.0, and the other models on your behalf.

Columnist

Has Been a Product Wang (WeChat public account: apmdogy), columnist at Everyone is a Product Manager. A logic-minded product manager committed to combining scientific thinking with product management methodology. Focuses on artificial intelligence and education; skilled in product incubation, requirements discovery, project management, process management, and other product skills.

This article was originally published on Everyone is a Product Manager; reprinting without permission is prohibited.

The title image is from Unsplash and is licensed under CC0.

The views in this article are the author's own; the Everyone is a Product Manager platform only provides information storage services.
