
The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released


Original article by Lu Dongxue | InfoQ | 2023-05-29 13:30 | Posted in Liaoning

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Author | Lu Dongxue

Recently, China has released a series of favorable policies for the artificial intelligence field, and meetings held by the central government have emphasized that "going forward, we must pay attention to the development of general artificial intelligence and build an innovation ecosystem." The "Several Measures to Promote the Innovation and Development of General Artificial Intelligence in Beijing (2023-2025) (Draft for Comments)" puts forward 21 specific measures across five major directions, including "carrying out research on innovative large-model algorithms and key technologies" and "strengthening the research and development of tools for large-model training data collection and governance," while expanding application scenarios in government services, medical care, scientific research, finance, autonomous driving, urban governance, and other fields to seize the development opportunities presented by large models. As China pushes for innovation leadership in general artificial intelligence, its large-model technology industry has ushered in an unprecedented wave of development opportunities: Baidu, Alibaba, Huawei, and many other domestic enterprises have rapidly built out related businesses and launched their own large-model AI products.

In addition, the global large-model field currently enjoys a high density of both talent and capital. On the talent side, the backgrounds of the large-model R&D teams announced so far show that team members come from top international universities or have top-tier research experience. On the capital side, taking Amazon and Google as examples, the two companies' capital expenditures related to large-model technology in 2022 reached $58.3 billion and $31.5 billion respectively, and are still trending upward; according to Google's latest disclosures, the ideal training cost of its 175-billion-parameter large model exceeds $9 million.

When a field concentrates capital and talent at high density, it tends to develop faster. Many people feel that the emergence of ChatGPT, a phenomenal product, kicked off the vigorous development of large language model technology. In fact, since the birth of the large language model in 2017, OpenAI, Microsoft, Google, Facebook, Baidu, Huawei, and other technology giants have been exploring the field continuously; ChatGPT merely pushed large language model technology into its breakout stage. The current large-model product landscape presents a new pattern: foreign players have deep accumulation in foundation models, while domestic players have prioritized applications.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

To this end, drawing on three research methods (desk research, expert interviews, and scientific analysis), InfoQ Research Center reviewed a large body of literature and materials, interviewed more than ten technical experts in the field, and evaluated the models across four major dimensions: language model accuracy, data foundation, model and algorithm capability, and security and privacy, broken down into twelve sub-dimensions: semantic understanding, grammatical structure, knowledge question answering, logical reasoning, coding ability, context understanding, context awareness, multilingual capability, multimodal capability, data foundation, model and algorithm capability, and security and privacy. Ten models were evaluated on more than 3,000 questions: ChatGPT (gpt-3.5-turbo), Claude-instant, Sage (gpt-3.5-turbo), Tiangong 3.5, Wenxin Yiyan V2.0.1, Tongyi Qianwen V1.0.1, the iFLYTEK Xinghuo cognitive large model, Moss-16B, ChatGLM-6B, and Vicuna-13B. Based on the evaluation results, InfoQ released the "2023 Comprehensive Ability Assessment Report of Large Language Models" (hereinafter the "Report").

To ensure the report's objectivity and impartiality and the accuracy of its calculations, InfoQ Research Center devised a scientific, sample-based scoring method: through actual testing, it collected each model's answers to 300 questions and scored them, awarding 2 points for a correct answer, 1 point for a partially correct answer, 0 points for a completely wrong answer, and -1 point when the model declined, saying it could not do the task. The calculation formula is "a model's score rate in a sub-category = the model's score / the maximum possible score for that sub-category". For example, if model A scores 10 points in total on a category of 7 questions, and the maximum score for that category is 7 × 2 = 14, then model A's score rate for that category is 10/14 = 71.43%.
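The scoring scheme described above can be sketched in a few lines of code. This is an illustrative reconstruction, not the Report's actual tooling; the answer labels and the sample data are hypothetical.

```python
# A minimal sketch of the Report's scoring scheme (labels are hypothetical):
# 2 = correct, 1 = partially correct, 0 = wrong, -1 = declined to answer.
POINTS = {"correct": 2, "partial": 1, "wrong": 0, "declined": -1}

def score_rate(answers):
    """Score rate = total points earned / maximum possible points."""
    earned = sum(POINTS[a] for a in answers)
    maximum = 2 * len(answers)  # every question is worth at most 2 points
    return earned / maximum

# Model A's hypothetical answers to a 7-question sub-category:
answers = ["correct"] * 4 + ["partial"] * 2 + ["wrong"]
print(f"{score_rate(answers):.2%}")  # 4*2 + 2*1 + 0 = 10 of 14 -> 71.43%
```

This reproduces the worked example from the Report: 10 points earned out of a 14-point maximum gives a 71.43% score rate.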

Based on these evaluation methods, the Report draws a number of conclusions worth everyone's attention. We hope the interpretation of the core conclusions below can provide direction for your own practice and exploration of large language model technology.

1. A scale of tens of billions of parameters is the "ticket" to large model training, and the large-model technological revolution has begun

To develop large-model products, an enterprise needs three elements at once: data resources, algorithms and models, and capital and resources. Analyzing the characteristics of products currently on the market, InfoQ Research Center found that data resources and capital are the basic elements of large-model development, while algorithms and models are the core elements that differentiate development capabilities among large language models. Model richness, model accuracy, and the emergence of abilities, all shaped by algorithms and models, have become the core indicators for judging the strengths and weaknesses of large language models. It should be noted that although data and financial requirements set a high threshold for large language model development, they pose less of a challenge to large, well-resourced enterprises.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

A closer look at the core elements of large-model products shows that large-model training needs to be "large enough": a scale of tens of billions of parameters is the "ticket". Data from GPT-3 and LaMDA show that until model parameter counts reach the range of 10 billion to 68 billion, many capabilities of large models, such as arithmetic ability, remain almost zero. At the same time, the enormous amount of computation triggers the "alchemy mechanism": according to the appendix of an NVIDIA paper, a single iteration requires about 4.5 ExaFLOPs of computation, a complete training run requires 9,500 iterations, and the total computation of a full run is 430 ZettaFLOPs (equivalent to a single A100 running for 43.3 years).
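The "single A100 for 43.3 years" figure can be sanity-checked with back-of-envelope arithmetic. The A100 throughput below is an assumption on our part (roughly the chip's dense BF16 peak), not a number from the Report, so the result only approximates the Report's figure.

```python
# Back-of-envelope check of the "single A100 for 43.3 years" claim.
# Assumption (not from the Report): one A100 sustains ~312 TFLOPS.
TOTAL_FLOPS = 430e21          # 430 ZettaFLOPs for the full training run
A100_FLOPS_PER_SEC = 312e12   # assumed sustained throughput of one A100

seconds = TOTAL_FLOPS / A100_FLOPS_PER_SEC
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years")   # roughly 43-44 years at this assumed
                              # throughput, close to the Report's 43.3
```

The per-iteration figure quoted above (4.5 ExaFLOPs × 9,500 iterations ≈ 43 ZettaFLOPs) does not multiply out to 430 ZettaFLOPs, so one of the Report's intermediate numbers is likely off by a factor of ten; the headline A100-years estimate, however, is consistent with the 430 ZettaFLOPs total.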

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Data source: "Sparks of Artificial General Intelligence: Early experiments with GPT-4"

Looking at the parameter scales of large models trained worldwide, according to Minsheng Securities Research Institute and Wikipedia data, the estimated parameter count of the internationally leading large model GPT-4 may exceed 5 trillion, while several domestic large models exceed 10 billion parameters. Among them, Baidu's Ernie and Huawei's Pangu currently lead among domestic large models whose parameter scales have been disclosed.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

After comprehensive testing of each company's large language model, InfoQ Research Center found that ChatGPT from abroad is indeed formidable, ranking first. Surprisingly, Baidu's Wenxin Yiyan broke into the top three, ranking second; notably, its overall score trails ChatGPT by only 2.15 points, far ahead of third-place Claude.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Data Note: Assessment results are based on the models listed above only and are valid as of May 25, 2023

Throughout the study, InfoQ Research Center found that algorithms and trained models dominate the performance of large language models. From the foundation model, to the engineering of training methods, to specific model-training techniques, the different choices each vendor on the track makes at each stage create the differences in the final capabilities of their large language models.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Vendors' product capabilities may differ, but with enough players involved in building large-model technology, their continuous exploration lets us see hope for the success of the large-model technological revolution. As large-model products bloom, large language models have expanded computer capabilities from "search" to "cognition and learning" to "action and solution", and the core capabilities of large language models now form a pyramid structure.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

2 "Writing ability" and "sentence comprehension ability" are the top 2 abilities that large language models are currently good at

According to InfoQ Research Center's evaluation results, security and privacy are the consensus and bottom line of large language model development, ranking first among the ability scores. The basic capabilities of large language models perform well overall, while programming, reasoning, and context understanding, all tied to logical reasoning, still leave much room for improvement. Multimodality remains a unique advantage of a few large language models.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

At the level of basic abilities, large language models show excellent Chinese creative writing ability. Across the six writing sub-categories, large language models performed prominently: interview outlines and email writing both scored close to full marks, while video script writing remains a less familiar area for large-model products, with a sub-category score of only 75%.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

On literary questions, as writing difficulty increases, the ability shown by large language models decreases. The best-performing section was the simple writing question, with a score of 91%; and although many models perform well on couplet questions, some perform poorly on couplet answers, leaving an overall score of 55%.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

In terms of semantic understanding, however, current large language models are not so "smart". Across the four question categories of dialect comprehension, keyword extraction, semantic similarity judgment, and "what to do" questions, large language models showed highly differentiated performance: "what to do" questions received the highest score, 92.5%, while Chinese dialect comprehension stumped the models, with an overall accuracy of only 40%.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

The InfoQ Research Center report shows that domestic models outperform international models in the Chinese knowledge category. Among the ten models, the highest knowledge score went to Wenxin Yiyan, at 73.33%, with ChatGPT second at 72.67%. Except for IT knowledge quiz questions, domestic large-model products performed close to or better than international products across the other eight topic categories in the Chinese knowledge setting.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

Whether in Chinese creative writing, semantic understanding, or Chinese knowledge Q&A, these topics mainly reflect the basic cognition and learning abilities of large-model products, and the assessment results clearly show that Baidu's Wenxin Yiyan performs excellently across the board, with every ability score ranked in the top two. What we see is not only Wenxin's technical ability, but also the strong technological breakthroughs and significant progress of domestic large language models.

3. Domestic products still have much room for improvement in cross-language translation, and logical reasoning remains a major overall challenge

With increasing investment in artificial intelligence by the state and domestic vendors in recent years, we have seen rapid progress in domestic large language models, and the technical achievements are encouraging. But looking at the development of large language model technology more objectively, we still have considerable room for improvement in some areas compared with the international level.

For example, the Report released by InfoQ Research Center shows that the programming ability of foreign products is significantly higher than that of domestic products: the highest programming score among the ten models went to Claude, at 73.47%, while the best domestic performer was Wenxin Yiyan, at 68.37%, still some distance behind Claude. Among the four topic categories, foreign products clearly surpassed domestic products on Android-related questions; surprisingly, on "code autocompletion" questions domestic products surpassed foreign products, suggesting it is only a matter of time before domestic products catch up with the international level.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

In addition, the highest translation score among the ten models also went to Claude, at 93.33%; the highest-scoring domestic large language models were Wenxin Yiyan and Tiangong 3.5, but a gap with the international level remains. Translation questions mainly reflect a large language model's ability to understand language. In this InfoQ evaluation, across the three topic categories of "programming translation questions", "English writing", and "English reading comprehension", large language models showed highly differentiated performance: across all models assessed, English writing questions received the highest score, 80%, while English reading comprehension scored only 46%, which means domestic products must keep iterating on cross-language translation.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

The gap is still there, but there is no need to be discouraged; the technological evolution of large models continues. According to the Report, large language models as a whole currently face relatively large challenges in logical reasoning. To evaluate the comprehension and judgment of large language models, InfoQ Research Center set logical reasoning questions along multiple dimensions. Across the five question categories of business tabulation, mathematical calculation, mathematical application, humor, and reasoning questions with Chinese characteristics, the overall scores of large language models were lower than for basic abilities. As for the reasons: business tabulation questions require not only collecting and identifying content but also logically classifying and sorting it, so they are harder overall; logical reasoning ability is thus the main direction of attack for future large language model products.

Among the ten models evaluated by InfoQ Research Center, the highest logical reasoning scores went to Wenxin Yiyan and iFLYTEK Xinghuo, both at 60%, only 1.43 percentage points behind the top scorer, ChatGPT. In some sub-categories, domestic products still performed very well: for example, on reasoning questions with Chinese characteristics, domestic models scored well ahead of international models, and their familiarity with Chinese content and logic is likely the core reason for this result.

The "2023 Comprehensive Ability Assessment Report of Large Language Models" was released

From the evaluation results above, the gap between domestic and foreign products is that domestic large language models are approaching the GPT-3.5 level but still trail GPT-4 by a wide margin. Looking across the entire field, however, anyone can clearly see that the development thresholds and challenges of large language model technology remain very high: the chip threshold, the accumulation of practical experience, and data and corpora all still need to be broken through by major vendors at home and abroad.

According to InfoQ Research Center's evaluation results, Wenxin Yiyan's comprehensive score is nearly on par with ChatGPT's, and in China's latest wave of the Internet revolution, Wenxin Yiyan can be called the AIGC product most likely to catch up with the international standard in the short term. The Wenxin Yiyan team, which includes many AI experts, has maintained a diligent attitude of technical exploration and strives to narrow the gap; its next breakthrough is not far off and is worth looking forward to.
