
The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released


Original article by Lu Dongxue | InfoQ | 2023-05-29 13:30 | Posted in Liaoning

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Author | Lu Dongxue

Recently, China has released a series of favorable policies for the artificial intelligence field, and meetings held by the central government have emphasized that "going forward, we must pay attention to the development of general artificial intelligence and build an innovation ecosystem." The "Several Measures to Promote the Innovation and Development of General Artificial Intelligence in Beijing (2023-2025) (Draft for Comments)" puts forward 21 specific measures across five major directions, including "carrying out research on innovative large-model algorithms and key technologies" and "strengthening the research and development of tools for large-model training data collection and governance," while expanding application scenarios in government services, medical care, scientific research, finance, autonomous driving, urban governance, and other fields to seize the development opportunities presented by large models. As China pushes for innovation leadership in general artificial intelligence, its large-model technology industry has ushered in an unprecedented wave of development opportunities: Baidu, Alibaba, Huawei, and many other domestic enterprises have rapidly built out related businesses and launched their own large-model AI products.

In addition, the global large-model field currently enjoys a high density of both talent and capital. On the talent side, the backgrounds of the large-model R&D teams announced so far show that team members come from top international universities or have top-tier research experience. On the capital side, taking Amazon and Google as examples, the two companies' capital expenditures related to large-model technology in 2022 reached $58.3 billion and $31.5 billion respectively, and are still trending upward; according to Google's latest disclosures, the ideal training cost of its 175-billion-parameter large model exceeds $9 million.

When a field concentrates capital and talent at high density, it tends to develop faster. Many people feel that the emergence of ChatGPT, a phenomenal product, kicked off the vigorous development of large language model technology. In fact, since the birth of the large language model in 2017, OpenAI, Microsoft, Google, Facebook, Baidu, Huawei, and other technology giants have been exploring the field continuously; ChatGPT merely pushed large language model technology into its breakout stage. The current large-model product landscape presents a new pattern: foreign players have deep accumulation in foundation models, while domestic players have prioritized applications.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

To this end, drawing on three research methods (desk research, expert interviews, and scientific analysis), InfoQ Research Center reviewed a large body of literature and materials, interviewed more than ten technical experts in the field, and evaluated the models across four major dimensions: language model accuracy, data foundation, model and algorithm capability, and security and privacy, broken down into twelve sub-dimensions: semantic understanding, grammatical structure, knowledge question answering, logical reasoning, coding ability, context understanding, context awareness, multilingual capability, multimodal capability, data foundation, model and algorithm capability, and security and privacy. Ten models were evaluated on more than 3,000 questions: ChatGPT (gpt-3.5-turbo), Claude-instant, Sage (gpt-3.5-turbo), Tiangong 3.5, Wenxin Yiyan V2.0.1, Tongyi Qianwen V1.0.1, the iFLYTEK Xinghuo cognitive large model, Moss-16B, ChatGLM-6B, and Vicuna-13B. Based on the evaluation results, InfoQ released the "2023 Comprehensive Ability Assessment Report of Large Language Models" (hereinafter the "Report").

To ensure the report's objectivity and impartiality and the accuracy of its calculations, InfoQ Research Center devised a scientific, sample-based scoring method: through actual testing, it collected each model's answers to 300 questions and scored them, awarding 2 points for a correct answer, 1 point for a partially correct answer, 0 points for a completely wrong answer, and -1 point when the model declined, saying it could not do the task. The calculation formula is "a model's score rate in a sub-category = the model's score / the maximum possible score for that sub-category". For example, if model A scores 10 points in total on a category of 7 questions, and the maximum score for that category is 7 × 2 = 14, then model A's score rate for that category is 10/14 = 71.43%.
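The scoring scheme described above can be sketched in a few lines of code. This is an illustrative reconstruction, not the Report's actual tooling; the answer labels and the sample data are hypothetical.

```python
# A minimal sketch of the Report's scoring scheme (labels are hypothetical):
# 2 = correct, 1 = partially correct, 0 = wrong, -1 = declined to answer.
POINTS = {"correct": 2, "partial": 1, "wrong": 0, "declined": -1}

def score_rate(answers):
    """Score rate = total points earned / maximum possible points."""
    earned = sum(POINTS[a] for a in answers)
    maximum = 2 * len(answers)  # every question is worth at most 2 points
    return earned / maximum

# Model A's hypothetical answers to a 7-question sub-category:
answers = ["correct"] * 4 + ["partial"] * 2 + ["wrong"]
print(f"{score_rate(answers):.2%}")  # 4*2 + 2*1 + 0 = 10 of 14 -> 71.43%
```

This reproduces the worked example from the Report: 10 points earned out of a 14-point maximum gives a 71.43% score rate.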

Based on these evaluation methods, the Report draws a number of conclusions worth everyone's attention. We hope the interpretation of the core conclusions below can provide direction for your own practice and exploration of large language model technology.

1. A scale of tens of billions of parameters is the "ticket" to large model training, and the large-model technological revolution has begun

To develop large-model products, an enterprise needs three elements at once: data resources, algorithms and models, and capital and resources. Analyzing the characteristics of products currently on the market, InfoQ Research Center found that data resources and capital are the basic elements of large-model development, while algorithms and models are the core elements that differentiate development capabilities among large language models. Model richness, model accuracy, and the emergence of abilities, all shaped by algorithms and models, have become the core indicators for judging the strengths and weaknesses of large language models. It should be noted that although data and financial requirements set a high threshold for large language model development, they pose less of a challenge to large, well-resourced enterprises.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

A closer look at the core elements of large-model products shows that large-model training needs to be "large enough": a scale of tens of billions of parameters is the "ticket". Data from GPT-3 and LaMDA show that until model parameter counts reach the range of 10 billion to 68 billion, many capabilities of large models, such as arithmetic ability, remain almost zero. At the same time, the enormous amount of computation triggers the "alchemy mechanism": according to the appendix of an NVIDIA paper, a single iteration requires about 4.5 ExaFLOPs of computation, a complete training run requires 9,500 iterations, and the total computation of a full run is 430 ZettaFLOPs (equivalent to a single A100 running for 43.3 years).
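The "single A100 for 43.3 years" figure can be sanity-checked with back-of-envelope arithmetic. The A100 throughput below is an assumption on our part (roughly the chip's dense BF16 peak), not a number from the Report, so the result only approximates the Report's figure.

```python
# Back-of-envelope check of the "single A100 for 43.3 years" claim.
# Assumption (not from the Report): one A100 sustains ~312 TFLOPS.
TOTAL_FLOPS = 430e21          # 430 ZettaFLOPs for the full training run
A100_FLOPS_PER_SEC = 312e12   # assumed sustained throughput of one A100

seconds = TOTAL_FLOPS / A100_FLOPS_PER_SEC
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")   # roughly 43-44 years at this assumed
                              # throughput, close to the Report's 43.3
```

The per-iteration figure quoted above (4.5 ExaFLOPs × 9,500 iterations ≈ 43 ZettaFLOPs) does not multiply out to 430 ZettaFLOPs, so one of the Report's intermediate numbers is likely off by a factor of ten; the headline A100-years estimate, however, is consistent with the 430 ZettaFLOPs total.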

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Data source: "Sparks of Artificial General Intelligence: Early experiments with GPT-4"

Looking at the parameter scales of large models trained worldwide, according to Minsheng Securities Research Institute and Wikipedia data, the estimated parameter count of the internationally leading large model GPT-4 may exceed 5 trillion, while several domestic large models exceed 10 billion parameters. Among them, Baidu's Ernie and Huawei's Pangu currently lead among domestic large models whose parameter scales have been disclosed.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

After comprehensive testing of each company's large language model, InfoQ Research Center found that ChatGPT from abroad is indeed formidable, ranking first. Surprisingly, Baidu's Wenxin Yiyan broke into the top three, ranking second; notably, its overall score trails ChatGPT by only 2.15 points, far ahead of third-place Claude.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Data Note: Assessment results are based on the models listed above only and are valid as of May 25, 2023

Throughout the study, InfoQ Research Center found that algorithms and trained models dominate the performance of large language models. From the foundation model, to the engineering of training methods, to specific model-training techniques, the different choices each vendor on the track makes at each stage create the differences in the final capabilities of their large language models.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Vendors' product capabilities may differ, but with enough players involved in building large-model technology, their continuous exploration lets us see hope for the success of the large-model technological revolution. As large-model products bloom, large language models have expanded computer capabilities from "search" to "cognition and learning" to "action and solution", and the core capabilities of large language models now form a pyramid structure.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

2 "Writing ability" and "sentence comprehension ability" are the top 2 abilities that large language models are currently good at

According to InfoQ Research Center's evaluation results, security and privacy are the consensus and bottom line of large language model development, ranking first among the ability scores. The basic capabilities of large language models perform well overall, while programming, reasoning, and context understanding, all tied to logical reasoning, still leave much room for improvement. Multimodality remains a unique advantage of a few large language models.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

At the level of basic abilities, large language models show excellent Chinese creative writing ability. Across the six writing sub-categories, large language models performed prominently: interview outlines and email writing both scored close to full marks, while video script writing remains a less familiar area for large-model products, with a sub-category score of only 75%.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

On literary questions, as writing difficulty increases, the ability shown by large language models decreases. The best-performing section was the simple writing question, with a score of 91%; and although many models perform well on couplet questions, some perform poorly on couplet answers, leaving an overall score of 55%.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

In terms of semantic understanding, however, current large language models are not so "smart". Across the four question categories of dialect comprehension, keyword extraction, semantic similarity judgment, and "what to do" questions, large language models showed highly differentiated performance: "what to do" questions received the highest score, 92.5%, while Chinese dialect comprehension stumped the models, with an overall accuracy of only 40%.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

The InfoQ Research Center report shows that domestic models outperform international models in the Chinese knowledge category. Among the ten models, the highest knowledge score went to Wenxin Yiyan, at 73.33%, with ChatGPT second at 72.67%. Except for IT knowledge quiz questions, domestic large-model products performed close to or better than international products across the other eight topic categories in the Chinese knowledge setting.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Whether in Chinese creative writing, semantic understanding, or Chinese knowledge Q&A, these topics mainly reflect the basic cognition and learning abilities of large-model products, and the assessment results clearly show that Baidu's Wenxin Yiyan performs excellently across the board, with every ability score ranked in the top two. What we see is not only Wenxin's technical ability, but also the strong technological breakthroughs and significant progress of domestic large language models.

3. Domestic products still have much room for improvement in cross-language translation, and logical reasoning remains a major overall challenge

With increasing investment in artificial intelligence by the state and domestic vendors in recent years, we have seen rapid progress in domestic large language models, and the technical achievements are encouraging. But looking at the development of large language model technology more objectively, we still have considerable room for improvement in some areas compared with the international level.

For example, the Report released by InfoQ Research Center shows that the programming ability of foreign products is significantly higher than that of domestic products: the highest programming score among the ten models went to Claude, at 73.47%, while the best domestic performer was Wenxin Yiyan, at 68.37%, still some distance behind Claude. Among the four topic categories, foreign products clearly surpassed domestic products on Android-related questions; surprisingly, on "code autocompletion" questions domestic products surpassed foreign products, suggesting it is only a matter of time before domestic products catch up with the international level.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

In addition, the highest translation score among the ten models also went to Claude, at 93.33%; the highest-scoring domestic large language models were Wenxin Yiyan and Tiangong 3.5, but a gap with the international level remains. Translation questions mainly reflect a large language model's ability to understand language. In this InfoQ evaluation, across the three topic categories of "programming translation questions", "English writing", and "English reading comprehension", large language models showed highly differentiated performance: across all models assessed, English writing questions received the highest score, 80%, while English reading comprehension scored only 46%, which means domestic products must keep iterating on cross-language translation.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

The gap is still there, but there is no need to be discouraged; the technological evolution of large models continues. According to the Report, large language models as a whole currently face relatively large challenges in logical reasoning. To evaluate the comprehension and judgment of large language models, InfoQ Research Center set logical reasoning questions along multiple dimensions. Across the five question categories of business tabulation, mathematical calculation, mathematical application, humor, and reasoning questions with Chinese characteristics, the overall scores of large language models were lower than for basic abilities. As for the reasons: business tabulation questions require not only collecting and identifying content but also logically classifying and sorting it, so they are harder overall; logical reasoning ability is thus the main direction of attack for future large language model products.

Among the ten models evaluated by InfoQ Research Center, the highest logical reasoning scores went to Wenxin Yiyan and iFLYTEK Xinghuo, both at 60%, only 1.43 percentage points behind the top scorer, ChatGPT. In some sub-categories, domestic products still performed very well: for example, on reasoning questions with Chinese characteristics, domestic models scored well ahead of international models, and their familiarity with Chinese content and logic is likely the core reason for this result.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

From the evaluation results above, the gap between domestic and foreign products is that domestic large language models are approaching the GPT-3.5 level but still trail GPT-4 by a wide margin. Looking across the entire field, however, anyone can clearly see that the development thresholds and challenges of large language model technology remain very high: the chip threshold, the accumulation of practical experience, and data and corpora all still need to be broken through by major vendors at home and abroad.

According to InfoQ Research Center's evaluation results, Wenxin Yiyan's comprehensive score is nearly on par with ChatGPT's, and in China's latest wave of the Internet revolution, Wenxin Yiyan can be called the AIGC product most likely to catch up with the international standard in the short term. The Wenxin Yiyan team, which includes many AI experts, has maintained a diligent attitude of technical exploration and strives to narrow the gap; its next breakthrough is not far off and is worth looking forward to.
