KLCII has evaluated more than 100 large models: the top models are close to world-class, but their capabilities are still "lopsided"

author:CBN
In 2024, having demystified OpenAI's technology, domestic large-model vendors have been releasing large-model products in quick succession with the help of open-source platforms, each accompanied by its own leaderboard-topping results to prove its technical strength.

According to Wang Zhongyuan, president of KLCII, while the large-model industry is currently flourishing, it also faces the problem of uneven quality. "As an AI researcher, I sometimes can't tell which one is strong and which one is weak," Wang Zhongyuan said.

In addition, most current evaluations are "open-book exams": vendors train the relevant abilities against existing question banks and end up with test results that are temporarily higher than those of their peers. The main problem is that, because vendors can simply open the book and drill the questions, leaderboard rankings cannot objectively and fairly reflect the technical gap between large models.

On May 17, KLCII launched the KLCII evaluation system, releasing and interpreting the results for more than 140 open-source and commercial closed-source language and multimodal large models from China and abroad. The Beijing Academy of Artificial Intelligence was jointly established on November 14, 2018, under the guidance and support of the Ministry of Science and Technology and the Beijing Municipal Party Committee and Municipal Government, drawing on Peking University, Tsinghua University, the Chinese Academy of Sciences, Baidu, Xiaomi, ByteDance and other leading organizations in artificial intelligence in Beijing.

The evaluation used more than 20 datasets and more than 80,000 test questions, of which more than 4,000 subjective questions came from original, self-built and undisclosed subjective evaluation sets that are updated frequently.

KLCII examined language models along two dimensions, subjective and objective, across seven capabilities: simple understanding, knowledge application, reasoning, mathematics, coding, task solving, and security and values. For multimodal models, it evaluated multimodal understanding and generation. The results show that, in the Chinese-language context, the overall performance of the top domestic language models is close to the international first-class level, but their capabilities are still unevenly developed. In the image-text question answering task for multimodal understanding, for example, open-source and closed-source models are evenly matched, and domestic models perform strongly.

On this point, Lin Yonghua, vice president and chief engineer of the Zhiyuan Artificial Intelligence Research Institute, told a CBN reporter in an interview that there is no conclusion on whether open source or closed source is better, because it is hard to know whether a closed-source offering is backed by a single model or by several, and it may even be connected to a retrieval system.

In addition, the evaluation results show that, in the Chinese-language context, domestic multimodal models trail the international first-class level by only a small margin. In text-to-video generation, Sora has a clear advantage in the length and quality of the demo videos that companies have published, and among the other text-to-video models that are open to evaluation, the domestic model PixVerse performs well.

Alignment on security and values is key to deploying models in industry, but overseas and domestic models differ on this dimension, so this individual score is not included in the overall subjective and objective rankings of language models. The subjective evaluation of language models shows that, in the Chinese-language context, ByteDance Doubao Skylark2 and OpenAI GPT-4 rank first and second, and that domestic large models understand Chinese users better. In the objective evaluation of language models, OpenAI GPT-4 and Baichuan3 rank first and second. Baidu Wenxin Yiyan 4.0, Zhipu Huazhang GLM-4 and Moonshot AI Kimi all placed in the top five of both the subjective and the objective evaluations.

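The objective evaluation of multimodal understanding models shows that, in image-text question answering, Alibaba Tongyi Qwen-vl-max and Shanghai Artificial Intelligence Laboratory InternVL-Chat-V1.5 rank ahead of OpenAI GPT-4, in that order, followed closely by LLaVA-Next-Yi-34B and Shanghai Artificial Intelligence Laboratory Intern-XComposer2-VL-7B.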

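The text-to-image evaluation of multimodal generation models shows OpenAI DALL-E3 in first place, Zhipu Huazhang CogView3 and Meta-Imagine in second and third, followed by Baidu Wenxin Yige and ByteDance doubao-Image. The text-to-video evaluation shows OpenAI Sora, Runway, Aishi Technology PixVerse, Pika and Tencent VideoCrafter-V2 in the top five.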

Large models are becoming increasingly general-purpose, and their logical reasoning ability has improved significantly, bringing them ever closer to the characteristics of the human brain. With the support of the Haidian District Education Commission, KLCII therefore worked with the Haidian District Teacher Training School to apply student testing methods to large models and examine how their subject-level ability differs from that of human students. The KLCII evaluation found that the models still fall short of the average Haidian student in overall subject ability; they are generally stronger in the humanities than in the sciences and weak at understanding charts, so large models still have considerable room for improvement.

Yao Shoumei, principal of Beijing Haidian District Teacher Training School, pointed out that in humanities subjects such as Chinese and history, the models lack an understanding of the cultural connotations behind the text and of patriotic sentiment. When faced with a question that combines history and geography, the models are less able than human candidates to identify which subject is being tested. The models do better on complex English questions than on simple ones. When solving a science problem, the models use methods that go beyond the scope of the grade level. When they meet a question they cannot understand, the models still show obvious "hallucinations".

Professor Shi Ping, head of the Intelligent Media Computing Laboratory at Communication University of China, said that compared with text, the subjective evaluation of video is extremely complex. Automated metrics cannot fully capture the quality of what a model generates, let alone quantify the realism of a generated video or the semantic consistency between the video and its text prompt. It is therefore necessary to build a systematic subjective evaluation system for text-to-video models.

Across the large-model industry as a whole, the new trend is to stop chasing leaderboards and start a price war. Tan Cheng, President of Volcano Engine, said, "This year, the industry will no longer compete on parameter scale, because everyone has 'understood'."

On this point, Wang Zhongyuan told a CBN reporter that the large-model industry will develop in two directions: the top large models will continue to pursue the goal of AGI, but this will consume large amounts of computing power and data. Another group of practitioners will therefore pursue opportunities to transform industries while cutting costs as far as possible. This is why major vendors have recently begun competing fiercely on unit prices while releasing large-model products.

In Wang Zhongyuan's view, price cuts will advance industrialization and help vendors capture the market quickly before planning their next move, but large-model capabilities are still improving rapidly and are far from reaching a ceiling. If products are priced below their actual cost, that could disrupt the entire market.

(This article is from Yicai)