The latest evaluation report of Tsinghua University's 14 major LLMs is released, and GLM-4 is in the first echelon

Author: New Zhiyuan

Editor: Editorial Department

Tsinghua University has conducted the most comprehensive capability evaluation of 14 LLMs from China and abroad. GPT-4 and Claude 3 remain the undisputed leaders, while GLM-4 and Wenxin Yiyan 4.0 have broken into the first echelon among domestic models.

In the 2023 "war of a hundred models", practitioners launched a wide variety of models: some original, some fine-tuned from open-source models, some general-purpose, and some industry-specific. How to evaluate the capabilities of these models reasonably has become a key question.

Although there are a number of model capability leaderboards at home and abroad, their quality is uneven and their rankings differ significantly, mainly because the evaluation data and testing methods are not yet mature or scientific. We believe a good evaluation method should be open, dynamic, scientific, and authoritative.

In order to provide an objective and scientific evaluation standard, the Foundation Model Research Center of Tsinghua University, together with Zhongguancun Laboratory, developed SuperBench, a comprehensive capability evaluation framework for large models, aiming to promote the healthy development of large-model technology, applications, and the surrounding ecosystem.

Recently, the March 2024 edition of the SuperBench Large Model Comprehensive Capability Evaluation Report was officially released.

The evaluation covers 14 representative models from China and abroad. For closed-source models available through both an API and a web interface, the higher-scoring of the two access modes was selected for evaluation.

Based on the evaluation results, the following main conclusions can be drawn:

● Overall, foreign models such as the GPT-4 series and Claude-3 still lead across many capabilities, while the top domestic models GLM-4 and Wenxin Yiyan 4.0 perform well, approaching the level of the international first tier, with the gap gradually narrowing.

● Among the foreign large models, the GPT-4 series performs consistently, and Claude-3 also shows strong overall strength, taking first place in the semantic understanding and agent evaluations and ranking among the world's first-tier models.

● Among the domestic large models, GLM-4 and Wenxin Yiyan 4.0 performed best in this evaluation and are the leading domestic models. Tongyi Qianwen 2.1, Abab6, the Moonshot (KimiChat) web version, and qwen1.5-72b-chat follow, also performing well in some capability evaluations. However, in code writing and agent capabilities, domestic models still lag noticeably behind the world's first-tier models and need further work.

The shift in large model capability evaluation & SuperBench

Since the advent of large language models, evaluation has been an indispensable part of large model research. As large-model research has developed, the focus of performance research has kept shifting. According to our research, large model capability evaluation has gone through the following five stages:

2018-2021: Semantic evaluation phase

Early language models mainly focused on natural language understanding tasks (e.g., word segmentation, part-of-speech tagging, syntactic parsing, information extraction), and the corresponding evaluations mainly examined a model's semantic understanding of natural language. Representative work: BERT, GPT, T5, etc.

2021-2023: Code evaluation phase

As language models grew more capable, code models with greater application value gradually emerged. Researchers found that models trained on code generation tasks showed stronger logical reasoning in tests, and code models became a research hotspot. Representative work: Codex, CodeLLaMa, CodeGeeX, etc.

2022-2023: Alignment evaluation phase

As large models have been widely applied across fields, researchers found a mismatch between the way models are trained (text continuation) and the way they are used (following instructions). Understanding human instructions and aligning with human preferences gradually became key goals of large model training and optimization. A well-aligned model can accurately understand and respond to user intent, laying the foundation for the broad application of large models. Representative work: InstructGPT, ChatGPT, GPT-4, ChatGLM, etc.

2023-2024: Agent evaluation phase

Building on instruction following and preference alignment, the ability of large models to act as an intelligent hub that decomposes, plans, decides on, and executes complex tasks is gradually being explored. Using large models as agents to solve practical problems is also regarded as an important direction toward artificial general intelligence (AGI). Representative work: AutoGPT, AutoGen, etc.

2023-Future: Safety evaluation phase

As model capabilities improve, evaluating, supervising, and reinforcing model safety and values has gradually become a focus for researchers. Strengthening the assessment of potential risks to ensure that large models are controllable, reliable, and trustworthy is a key issue for the "sustainable development of AI".

Therefore, in order to evaluate the capabilities of large models comprehensively, the SuperBench evaluation system includes five evaluation categories (semantics, code, alignment, agents, and safety) with 28 subcategories.

PART 1 SEMANTIC EVALUATION

ExtremeGLUE is a high-difficulty collection drawn from 72 classic Chinese-English bilingual datasets, designed to provide a more rigorous evaluation standard for language models. It uses a zero-shot CoT evaluation method and scores model outputs according to task-specific requirements.

First, more than 20 language models were used for initial testing, including GPT-4, Claude, Vicuna, WizardLM, and ChatGLM.

Then, based on the comprehensive performance of all models, the most difficult 10%~20% of the data in each category was selected and combined into a "difficult traditional dataset".
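
As a rough illustration of this selection step, here is a minimal Python sketch that keeps the questions most models get wrong; the per-question accuracy figures are made up and not taken from the report:

# Minimal sketch: keep the hardest fraction of questions per category,
# ranked by the average accuracy of the initially tested models.
def hardest_subset(question_accuracies, fraction=0.15):
    """Keep the `fraction` of questions with the lowest average model accuracy."""
    ranked = sorted(question_accuracies, key=question_accuracies.get)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

# Hypothetical average accuracies of the initially tested models per question.
accuracies = {"q1": 0.9, "q2": 0.3, "q3": 0.75, "q4": 0.1, "q5": 0.55, "q6": 0.8, "q7": 0.2}
print(hardest_subset(accuracies, fraction=0.2))  # ['q4'] for 7 questions at 20%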

Evaluation Methodology & Process

● Evaluation method: 72 classic Chinese and English datasets were collected and their hard questions extracted to form an evaluation set spanning four dimensions. Zero-shot CoT evaluation is used; each dimension's score is the percentage of questions answered correctly, and the final overall score is the average across dimensions (a minimal scoring sketch follows this list).

● Evaluation process: The model's zero-shot CoT outputs are scored according to the form and requirements of each question.
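
As a rough illustration of this scoring scheme, the Python sketch below computes per-dimension and overall scores; the dimension names and graded results are made up, not taken from the report:

# Minimal sketch of the ExtremeGLUE-style scoring described above.
def dimension_score(graded):
    """Score of one dimension: percentage of questions answered correctly."""
    return 100.0 * sum(graded) / len(graded)

def overall_score(graded_by_dimension):
    """Overall score: average of the per-dimension percentage scores."""
    per_dim = {d: dimension_score(g) for d, g in graded_by_dimension.items()}
    return per_dim, sum(per_dim.values()) / len(per_dim)

graded = {  # True means the zero-shot CoT answer was judged correct
    "knowledge-common-sense": [True, False, True, True],
    "knowledge-science": [True, True, False, False],
    "mathematics": [False, True, True, False],
    "reading-comprehension": [True, True, True, False],
}
per_dim, total = overall_score(graded)
print(per_dim, total)  # per-dimension percentages and their average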

Overall Performance:

In the semantic understanding evaluation, the models fall into three echelons. Those scoring above 70 points form the first echelon, which includes Claude-3, GLM-4, Wenxin Yiyan 4.0, and the GPT-4 series models.

Among them, Claude-3 ranked first with 76.7 points, while the domestic models GLM-4 and Wenxin Yiyan 4.0 surpassed the GPT-4 series to take second and third place, though they still trail Claude-3 by about 3 points.

Categorical Performance:

● Knowledge - common sense: Claude-3 leads with 79.8 points; the domestic model GLM-4 performs well, surpassing GPT-4 to take second place, while Wenxin Yiyan 4.0 performs relatively poorly, trailing top-ranked Claude-3 by 12.7 points.

● Knowledge - science: Claude-3 remains the leader and is the only model scoring above 80, while Wenxin Yiyan 4.0, the GPT-4 series, and GLM-4 all score 75 or above and form the first echelon.

● Mathematics: Claude-3 and Wenxin Yiyan 4.0 tied for first with 65.5 points, GLM-4 ranked third ahead of the GPT-4 series, and the remaining models' scores cluster around 55 points.

● Reading comprehension: The distribution of scores is relatively even, with Wenxin Yiyan 4.0 surpassing GPT-4 Turbo, Claude-3 and GLM-4 to take the top spot.

PART 2 CODE EVALUATION

NaturalCodeBench (NCB) is a benchmark for evaluating a model's coding ability. Traditional code evaluation datasets mainly test a model's ability to solve data-structure and algorithm problems, whereas NCB focuses on the ability to write correct, usable code in real programming application scenarios.

All questions are screened from user queries submitted to online services, so their style and format are more diverse. They cover seven areas (databases, front-end development, algorithms, data science, operating systems, artificial intelligence, and software engineering) and can be roughly divided into two categories: algorithmic problems and functional-requirement problems.

The questions span two programming languages, Java and Python, as well as Chinese and English. Each question comes with 10 human-written and corrected test cases: 9 are used to test the functional correctness of the generated code, and the remaining 1 is used for code alignment.

Evaluation Methodology & Process

● Evaluation method: Run the function generated by the model and compare its output against the prepared test-case outputs to score it; finally, the first-pass rate of the generated code, pass@1, is calculated (a minimal sketch follows this list).

● Evaluation process: Given the problem, unit-test code, and test cases, the model first generates a target function for the problem. The generated function is run with each test case's input as its arguments, and the output is compared against the standard output in the test case: if the output matches, the case is scored as a pass; if it does not match, or the function raises an error, it is scored as a fail.
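
For illustration only, here is a minimal Python sketch of this kind of check and of pass@1 computed over single samples; the problems, generated solutions, and test cases are hypothetical and not taken from NCB:

# Minimal sketch of an NCB-style pass@1 computation with one sample per problem.
def passes_all_tests(func, test_cases):
    """Return True if func reproduces the expected output on every test case."""
    for args, expected in test_cases:
        try:
            if func(*args) != expected:
                return False
        except Exception:  # a runtime error also counts as a failure
            return False
    return True

# Hypothetical "generated" solutions, one per problem.
generated = {
    "sum_list": lambda xs: sum(xs),
    "reverse_string": lambda s: s[::-1],
}
# Hypothetical functional test cases: (arguments, expected output).
tests = {
    "sum_list": [(([1, 2, 3],), 6), (([],), 0)],
    "reverse_string": [(("abc",), "cba"), (("",), "")],
}

passed = sum(passes_all_tests(generated[p], tests[p]) for p in generated)
print(f"pass@1 = {passed / len(generated):.2f}")  # fraction of problems solved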

Overall Performance:

In the code-writing evaluation, there is still a clear gap between domestic models and the international first tier. The GPT-4 series and Claude-3 lead significantly in code pass rate, while among domestic models GLM-4, Wenxin Yiyan 4.0, and Xunfei Xinghuo 3.5 stand out with overall scores above 40 points.

However, even the best-performing models achieve a first-pass rate of only about 50%, so code generation remains a challenge for today's large models.

Categorical Performance:

Across the Python, Java, Chinese, and English datasets, the GPT-4 series takes first place, reflecting strong and well-rounded coding ability; apart from Claude-3, the remaining models show a clear gap.

● English code instructions: GPT-4 Turbo scores 6.8 and 1.5 points higher than Claude-3 on Python and Java respectively, and 14.2 and 5.1 points higher than GLM-4 on Python and Java problems respectively.

● Chinese code instructions: GPT-4 Turbo scores 3.9 points higher than Claude-3 on Python and 2.3 points lower on Java, a modest difference. GPT-4 Turbo scores 5.4 and 2.8 points higher than GLM-4 on Python and Java problems respectively, so there is still a gap between domestic models and the international first tier in Chinese coding ability.

PART 3 ALIGNMENT EVALUATION

AlignBench aims to comprehensively evaluate how well large models align with human intent in Chinese, using model-based grading of answer quality to measure instruction following and helpfulness.

It covers 8 dimensions, including basic tasks and professional ability, using real, difficult questions paired with high-quality reference answers. Performing well requires a model to be well-rounded, understand instructions, and generate helpful answers.

The "Chinese reasoning" dimension focuses on how large models perform in Chinese mathematical computation and logical reasoning. Its questions are drawn from real user queries, with standard answers written for them, and it covers several fine-grained areas:

● Mathematical computation covers calculation and proof in elementary mathematics, advanced mathematics, and everyday arithmetic.

● Logical reasoning covers common deductive reasoning, commonsense reasoning, mathematical logic, brain teasers, and similar problems, fully testing the model in scenarios that require multi-step reasoning and common reasoning methods.

The "Chinese language" section focuses on the general performance of large models in Chinese writing and language tasks, covering six directions: basic tasks, Chinese comprehension, comprehensive question answering, text writing, role playing, and professional ability.

Most of the data in these tasks comes from real user queries, with answers written and corrected by professional annotators, reflecting the large model's performance level in text applications across multiple dimensions. Specifically:

● Basic tasks examine how well the model generalizes to user instructions in conventional NLP task scenarios;

● Chinese comprehension emphasizes the model's understanding of traditional Chinese culture and the structure and origins of Chinese characters;

● Comprehensive Q&A focuses on the performance of the model in answering general open questions;

● Text writing reveals the model's level of performance on writing tasks;

● Role-playing is an emerging task type that examines whether the model can carry on a dialogue in line with the persona and requirements specified in the user's instructions;

● Professional ability examines the model's mastery of and reliability in domains requiring professional knowledge.

Evaluation Methodology & Process

● Evaluation method: A strong model (such as GPT-4) grades the quality of answers to measure instruction following and helpfulness. Scoring dimensions include factual correctness, meeting user needs, clarity, completeness, and richness; the exact dimensions vary with the task type, and a composite score is given as the final score for the answer (a minimal judging sketch follows this list).

● Evaluation process: The model generates answers to the questions, and GPT-4 analyzes, evaluates, and scores them in detail against the reference answers provided by the test set.
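
A minimal Python sketch of this kind of LLM-as-judge scoring is shown below; the prompt wording, the 1-10 scale, and the call_judge_model helper are illustrative assumptions, not the actual AlignBench implementation:

# Minimal sketch of LLM-as-judge scoring; prompt and scale are assumptions.
import re

JUDGE_TEMPLATE = """You are grading an assistant's answer against a reference answer.
Question: {question}
Reference answer: {reference}
Assistant's answer: {answer}
Consider factual correctness, meeting user needs, clarity, completeness, and richness.
End your analysis with a line of the form "Rating: [[x]]" where x is an integer from 1 to 10."""

def call_judge_model(prompt: str) -> str:
    """Placeholder: send the prompt to a strong judge model (e.g. GPT-4) and return its reply."""
    raise NotImplementedError("wire this up to your judge model's API")

def judge(question: str, reference: str, answer: str) -> float:
    """Ask the judge model for an analysis and parse the final numeric rating."""
    verdict = call_judge_model(
        JUDGE_TEMPLATE.format(question=question, reference=reference, answer=answer)
    )
    match = re.search(r"Rating:\s*\[\[(\d+(?:\.\d+)?)\]\]", verdict)
    if match is None:
        raise ValueError("judge did not return a parsable rating")
    return float(match.group(1))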

Overall Performance:

In the human-alignment evaluation, the GPT-4 web version takes the top spot, followed by Wenxin Yiyan 4.0 and GPT-4 Turbo with the same score (7.74). Among domestic models, GLM-4 also performs well, surpassing Claude-3 to rank fourth, while Tongyi Qianwen 2.1 scores slightly below Claude-3 and ranks sixth; both are first-echelon large models.

Categorical Performance:

The overall Chinese reasoning scores are significantly lower than the Chinese language scores, and the reasoning ability of large models as a whole still needs strengthening:

● Chinese reasoning: The GPT-4 series performs best, slightly ahead of the domestic model Wenxin Yiyan 4.0, with a significant gap to the other models.

● Chinese language: Domestic models take the top four places, namely the KimiChat web version (8.05 points), Tongyi Qianwen 2.1 (7.99 points), GLM-4 (7.98 points), and Wenxin Yiyan 4.0 (7.91 points), surpassing the GPT-4 series, Claude-3, and other international first-tier models.

Breakdown analysis of each category:

Chinese reasoning:

● Mathematical computation: The GPT-4 series models occupy the top two places; the domestic models Wenxin Yiyan 4.0 and Tongyi Qianwen 2.1 score above Claude-3 but still trail the GPT-4 series by some margin.

● Logical reasoning: Models scoring 7 or above form the first echelon, led by the domestic model Wenxin Yiyan 4.0; the GPT-4 series, Claude-3, GLM-4, and Abab6 are also in the first echelon.

Chinese Language:

● Basic tasks: GLM-4 takes the top spot, with Tongyi Qianwen 2.1, Claude-3, and the GPT-4 web version in second to fourth place; among the other domestic large models, Wenxin Yiyan 4.0 and the KimiChat web version also perform well, surpassing GPT-4 Turbo.

● Chinese comprehension: Domestic models perform better overall, taking the top four places; Wenxin Yiyan 4.0 has a clear lead, 0.41 points ahead of second-placed GLM-4. Among the foreign models, Claude-3 performs acceptably and ranks fifth, while the GPT-4 series performs poorly, landing in the middle-to-lower range, more than 1 point behind first place.

● Comprehensive Q&A: All the large models perform well, with 6 models scoring above 8 points. The GPT-4 web version and the KimiChat web version share the highest score, while GLM-4 and Claude-3 score the same, close to the top, and tie for third.

● Text writing: The KimiChat web version is the best performer and the only model scoring 8 or above, with GPT-4 Turbo next in the ranking.

● Role-playing: The domestic models Abab6, Tongyi Qianwen 2.1, and the KimiChat web version take the top three places, all scoring above 8 and surpassing the GPT-4 series, Claude-3, and other world-class models.

● Professional ability: GPT-4 Turbo takes first place, the KimiChat web version surpasses the GPT-4 web version to take second, and among the other domestic models, GLM-4 and Tongyi Qianwen 2.1 also perform well, tying for fourth.

PART 4 AGENT EVALUATION

AgentBench is a comprehensive benchmark toolkit that evaluates how language models perform as agents in a variety of real-world environments, including operating systems, games, and web pages.

Code Environment: This section focuses on the potential of LLMs to assist humans in interacting with computer code interfaces. With their strong coding and reasoning capabilities, LLMs are expected to become powerful agents that help people work with computer interfaces more effectively. To evaluate LLMs in this regard, three representative environments focused on coding and reasoning are introduced. They provide real-world tasks and challenges that test LLMs' ability to handle a variety of computer-interface and code-related tasks.

Game Environment: The game environment is part of AgentBench and is designed to evaluate how LLMs perform in game scenarios. In games, agents are often required to have strong strategy design, instruction following, and reasoning skills. Unlike the coding environments, the tasks in the game environment do not require specialized coding knowledge but instead demand a comprehensive grasp of common sense and world knowledge. These tasks challenge LLMs' abilities in commonsense reasoning and strategy formulation.

Web Environment: The web is the main interface through which people interact with the real world, so evaluating agent behavior in complex web environments is critical to their development. Here, two existing web-browsing datasets are used to provide a practical evaluation of LLMs. These environments challenge LLMs' capabilities in web interface operation and information retrieval.

Evaluation Methodology & Process

● Evaluation method: The model interacts with the preset environment over multiple rounds to complete each specific task. For the scene-guessing subcategory, GPT-3.5-Turbo scores the final answer; for the other subcategories, task completion is scored according to fixed rules.

● Evaluation process: The model interacts with the simulated environment, and the results it produces are then scored either by rules or by GPT-3.5-Turbo.

● Scoring rules: Because the score distributions of the subtasks differ, a total score computed directly as the raw average would be heavily affected by extreme values, so the subtask scores are normalized. As shown in the table below, the "Weight⁻¹" value for each subtask is the normalization weight, namely the average score of the initially tested models on AgentBench for that subtask. The total score is the average of each subtask score divided by its Weight⁻¹; by this calculation, a model of average ability ends up with a total score of 1 (a small numerical sketch follows the table legend below).

SR: Success rate

#Avg.Turn: The average number of rounds of interaction required to solve a single problem

#Dev. #Test: The expected total number of interaction rounds for a single model in the development set and the test set

Weight⁻¹: The reciprocal of the weight of each sub-score when calculating the total score
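
To make the normalization concrete, here is a minimal Python sketch using made-up subtask scores and Weight⁻¹ values (not figures from the report):

# Minimal sketch of the AgentBench-style normalized total score.
# Subtask names, raw scores, and Weight^-1 values are made up for illustration.
raw_scores = {"os": 30.0, "db": 25.0, "kg": 18.0, "web": 40.0}
weight_inv = {"os": 28.0, "db": 26.0, "kg": 20.0, "web": 38.0}  # avg score of initially tested models

normalized = {task: raw_scores[task] / weight_inv[task] for task in raw_scores}
total = sum(normalized.values()) / len(normalized)

print(normalized)       # each value is near 1 for a roughly average model
print(round(total, 3))  # total score; exactly 1.0 if every raw score equals its weight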

Overall Performance:

In the agent capability evaluation, domestic models as a whole lag significantly behind the international first tier. Claude-3 and the GPT-4 series occupy the top three places; GLM-4 performs best among domestic models but still trails top-ranked Claude-3 by a wide margin.

The main reason large models at home and abroad perform relatively poorly here is that agent tasks are far more demanding than other tasks, and most existing models do not yet have strong agent capabilities.

Categorical Performance:

Except for online shopping, where the domestic model GLM-4 took first place, the top spots in all other categories were occupied by Claude-3 and the GPT-4 series, reflecting their comparatively strong agent ability; domestic models still need continuous improvement.

● In embodied intelligence (ALFWorld), the top three places are all taken by Claude-3 and the GPT-4 series, and this is where the gap with domestic models is largest.

● In the database (DB) and knowledge graph (KG) dimensions, the domestic model GLM-4 makes the top three, but there is still a gap with the top two.

PART 5 SAFETY EVALUATION

SafetyBench is the first comprehensive benchmark that evaluates the safety of large language models through multiple-choice questions. It covers offense and aggression, bias and discrimination, physical health, mental health, illegal activities, ethics and morality, privacy and property, and more.

Evaluation Methodology & Process

● Evaluation method: Thousands of multiple-choice questions are collected for each dimension, and the model's understanding and mastery of each safety dimension is examined through its choices. Few-shot generation is used in the evaluation; answers are extracted from the generated outputs and compared against the ground-truth answers. The score for each dimension is the percentage of questions answered correctly, and the final total score is the average of the dimension scores. For refusals, a refusal score and a non-refusal score are computed separately: the former treats refused questions as answered incorrectly, while the latter removes refused questions from the question pool (a minimal sketch follows this list).

● Evaluation process: For each question, the answer is extracted from the model's generated output and compared with the ground-truth answer.
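
A minimal Python sketch of the refusal vs. non-refusal scoring described above; the per-question records are hypothetical, not SafetyBench data:

# Minimal sketch of refusal vs. non-refusal scoring.
# Each record: (answered correctly, refused to answer).
records = [
    (True, False), (False, False), (False, True), (True, False), (False, True),
]

correct = sum(ok for ok, refused in records if not refused)
refused = sum(refused for _, refused in records)

# Refusal score: refused questions count as wrong answers.
refusal_score = 100.0 * correct / len(records)

# Non-refusal score: refused questions are removed from the pool.
answered = len(records) - refused
non_refusal_score = 100.0 * correct / answered if answered else 0.0

print(f"refusal score = {refusal_score:.1f}")          # 40.0 with the data above
print(f"non-refusal score = {non_refusal_score:.1f}")  # 66.7 with the data above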

Overall Performance:

In the safety evaluation, the domestic model Wenxin Yiyan 4.0 performs strongly, beating the world-class GPT-4 series and Claude-3 to take the highest score (89.1 points). Among the other domestic models, GLM-4 scores the same as Claude-3, and the two tie for fourth.

Categorical Performance:

Across the five categories of illegal activities, physical health, aggression and offense, mental health, and privacy and property, the models trade wins and losses; in ethics and morality and in bias and discrimination, however, the score gaps between models are larger, and the ordering there is more consistent with the total-score ranking.

● Ethics and morality: Wenxin Yiyan 4.0 beats Claude-3 to rank first, and the domestic large model GLM-4 also performs well, surpassing GPT-4 Turbo to reach the top three.

● Bias and discrimination: Wenxin Yiyan 4.0 again ranks first, ahead of the GPT-4 series, followed by GLM-4, which is also in the first echelon.

Resources:

https://mb.webin.kk.com/c/r_aajbtflx3PD06SK

https://mb.webin.kk.com/S/Venuesja1cc9PK6K