Evaluation results for 140+ Chinese and international large models on 80,000+ exam questions are out! Produced by the KLCII evaluation system

Author: QbitAI

Yunzhong, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

On May 17, 2024, KLCII held a large-model evaluation conference to officially launch its scientific, authoritative, fair, and open evaluation system, and released and interpreted the results of an all-around capability evaluation of more than 140 open-source and commercial closed-source language and multimodal large models from China and abroad.

The evaluation examined language models along seven abilities, from both subjective and objective dimensions: simple understanding, knowledge application, reasoning, mathematics, coding, task solving, and safety and values. For multimodal models, it mainly evaluated multimodal understanding and generation capabilities.

In the Chinese-language context, the overall performance of leading domestic language models approaches the international first tier, but their capabilities develop unevenly. In multimodal understanding on image-text question answering, open- and closed-source models are evenly matched, with domestic models standing out; in the Chinese context, domestic multimodal models trail the international first tier only slightly. In text-to-video capability, Sora holds a clear advantage in the length and quality of the demo videos the various companies have published; among the text-to-video models that are open to evaluation, the domestic model PixVerse performs well.

Because alignment on safety and values is key to industrial deployment of models, and overseas and domestic models differ on this dimension, this score is not counted toward the overall subjective and objective rankings of language models. The subjective evaluation of language models shows that, in the Chinese context, ByteDance's Doubao Skylark2 and OpenAI's GPT-4 rank first and second: the domestic large model understands Chinese users better. In the objective evaluation, OpenAI's GPT-4 and Baichuan3 rank first and second. Baidu's Wenxin Yiyan 4.0, Zhipu Huazhang's GLM-4, and Moonshot AI's Kimi all enter the top five of both the subjective and objective language-model rankings.

Objective evaluation results for multimodal understanding models show that, in image-text question answering, Alibaba's Tongyi Qwen-vl-max and Shanghai AI Laboratory's InternVL-Chat-V1.5 rank first and second, ahead of OpenAI's GPT-4, with LLaVA-Next-Yi-34B and Shanghai AI Laboratory's Intern-XComposer2-VL-7B close behind.

Text-to-image evaluation results for multimodal generation models put OpenAI's DALL-E 3 in first place, with Zhipu Huazhang's CogView3 and Meta-Imagine in second and third, followed by Baidu's Wenxin Yige and ByteDance's doubao-Image. Text-to-video evaluation results put OpenAI's Sora, Runway, AIsphere's PixVerse, Pika, and Tencent's VideoCrafter-V2 in the top five.

Caption: The objective evaluation metrics for text-to-image models diverge sharply from subjective impressions and show signs of breaking down, so the ranking defers to the subjective evaluation. Midjourney largely cannot understand Chinese prompts, hence its low ranking. Sora could only be compared with other models using its officially released prompts and video clips, so that evaluation result carries bias.

The first large-scale K-12 subject examination conducted jointly with an authoritative education institution

Large models are becoming general-purpose, and their logical reasoning ability has improved markedly, drawing ever closer to characteristics of the human brain. With the support of the Haidian District Education Commission, KLCII and the Haidian District Teacher Training School therefore aligned the evaluation with the way students are tested, to examine how large models' subject-level performance differs from that of human students.

The KLCII evaluation found that models still lag the average level of Haidian students in comprehensive subject ability. They are generally strong in the humanities but weak at understanding charts, leaving large models substantial room for improvement.

Yao Shoumei, principal of the Beijing Haidian District Teacher Training School, pointed out that in humanities subjects such as Chinese language and history, the models lack an understanding of the cultural connotations behind the text and of sentiments of family and country. Faced with comprehensive history-geography questions, the models identify subject attributes less effectively than human candidates. The models handle complex English questions better than simple ones. When solving science problems, they resort to methods beyond the grade level being tested. And when a question is beyond their comprehension, the models still exhibit obvious "hallucinations".

Systematically building a subjective evaluation system for text-to-video models

Professor Shi Ping, head of the Intelligent Media Computing Laboratory at Communication University of China, said that compared with text, subjective evaluation of video is extremely complex. Automated metrics cannot fully capture the quality of what a model generates, let alone quantify the authenticity of a generated video or the semantic consistency between text and imagery. A subjective evaluation system for text-to-video models therefore needs to be built systematically.

Drawing on both parties' research results and practical experience in large-model evaluation and video quality assessment, KLCII and Communication University of China jointly built the evaluation system, which scores along four dimensions: text-video consistency, authenticity, video quality, and aesthetic quality, providing a reference for the application and development of AIGC video generation technology.
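
As a rough illustration of that rubric, here is a minimal Python sketch of how per-video scores on those four dimensions might be rolled up into a model-level result. The dimension names follow the article; the 1-5 scale, equal weighting, and all function names are assumptions for illustration, not details of the actual KLCII-CUC system.

```python
from statistics import mean

# The four subjective dimensions named in the article; the 1-5 scale and
# equal weighting below are illustrative assumptions, not the official rubric.
DIMENSIONS = ("text_video_consistency", "authenticity",
              "video_quality", "aesthetic_quality")

def score_video(ratings: list[dict[str, float]]) -> dict[str, float]:
    """Average several raters' 1-5 scores for one generated video."""
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}

def score_model(videos: list[dict[str, float]]) -> dict[str, float]:
    """Roll per-video dimension scores up into a model-level report."""
    report = {dim: mean(v[dim] for v in videos) for dim in DIMENSIONS}
    report["overall"] = mean(report[dim] for dim in DIMENSIONS)  # equal weights assumed
    return report
```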

A scientific, authoritative, fair, and open KLCII evaluation system

Relying on the Ministry of Science and Technology's "Artificial Intelligence Basic Model Support Platform and Evaluation Technology" project and the Ministry of Industry and Information Technology's "Large Model Public Service Platform" project, KLCII has developed large-model evaluation methods and tools jointly with more than ten universities and institutions.

In June 2023, the FlagEval large-model evaluation platform, jointly built by KLCII and several university teams, went live. It has since completed more than 1,000 evaluations of open-source large models worldwide, continues to publish the results, and has accumulated a broad base of world-leading evaluation technology.

KLCII took the lead in establishing the IEEE large-model evaluation standards group P3419, organized more than 20 enterprises and scholars to help build large-model standards, and is a co-drafting unit of the draft national standard "Evaluation Indicators and Methods for Artificial Intelligence Pre-trained Models". Its methodology combines unified rules for objective evaluation with multi-person cross-verified scoring for subjective evaluation. Open-source models run with the inference code and runtime environment recommended by their publishers, and all models receive the same industry-common prompts, with no per-model prompt optimization.
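
To make the "unified rules" idea concrete, below is a hypothetical sketch of such a harness: every model answers the same question set through one shared prompt template, with only the publisher-recommended inference callable varying per model. The names, template, and substring scoring rule are assumptions for illustration; the article does not publish the real harness.

```python
from typing import Callable

# Hypothetical harness illustrating the "same prompt for every model" rule.
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"  # one shared, untuned template

def evaluate(models: dict[str, Callable[[str], str]],
             questions: list[dict[str, str]]) -> dict[str, float]:
    """Score every model on the same questions with identical prompts."""
    results = {}
    for name, infer in models.items():  # infer = publisher-recommended inference path
        correct = 0
        for item in questions:
            prompt = PROMPT_TEMPLATE.format(question=item["question"])
            answer = infer(prompt)
            correct += int(item["reference"].strip() in answer)
        results[name] = correct / len(questions)
    return results
```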

The evaluation uses more than 20 datasets and over 80,000 questions, including several evaluation datasets built by KLCII or jointly with partners: the Chinese multimodal multi-question-type understanding and reasoning dataset CMMU, the Chinese semantic evaluation dataset C-SEM, the Chinese language and cognition subjective evaluation dataset CLCC, the complex algorithmic code generation evaluation set TACO, the text-to-image subjective evaluation set Image-gen, the multilingual text-to-image quality evaluation dataset MG18, and the text-to-video subjective evaluation set CUC T2V prompts. More than 4,000 of the questions are subjective, all drawn from a self-built, unreleased subjective evaluation set that iterates at high frequency. Scoring standards are strictly calibrated, and a management mechanism combining independent anonymous scoring by multiple raters with strict quality inspection and random spot checks reduces the impact of subjective bias. In addition, to evaluate language-model capabilities more precisely, KLCII mapped capability labels onto every sub-dataset of the objective datasets.
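
The anti-bias mechanism described above, independent anonymous scoring by several raters backed by quality inspection and spot checks, could look roughly like the sketch below; the median aggregation, disagreement threshold, and re-review rule are illustrative assumptions rather than KLCII's documented procedure.

```python
from statistics import median, pstdev

DISAGREEMENT_THRESHOLD = 1.0  # assumed cutoff for flagging rater disagreement

def aggregate_item(scores: list[float]) -> tuple[float, bool]:
    """Combine independent anonymous scores for one subjective question.

    Returns the median score plus a flag that routes the item to the
    quality-inspection / random spot-check queue when raters disagree widely.
    """
    return median(scores), pstdev(scores) > DISAGREEMENT_THRESHOLD

# Example: three anonymous raters score one model answer on a 1-5 scale.
score, flagged = aggregate_item([4.0, 4.5, 2.0])
print(score, flagged)  # 4.0 True -> sent for re-inspection
```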

Being scientific, authoritative, fair, and open is KLCII's highest guideline. Wang Zhongyuan, president of KLCII, said that KLCII will keep working with ecosystem partners to build out and refine the evaluation system, drive model performance optimization and industrial deployment across diverse, complex scenarios, and promote the orderly development of large-model technology and applications.

— END —

QbitAI | Signed creator on Toutiao (头条号)

Follow us to be the first to know about frontier technology.
