
Don't say that the technological breakthrough of domestic large models depends on the open source of Llama 3

Author: AI Tech Review

SuperBench evaluations show Llama 3 falling short of domestic large models.

Author | Zhang Jin

Editor | Chen Caixian

Recently, after a long wait, Meta finally released the 8B and 70B versions of its open-source large model Llama 3, once again shaking up the AI world.

According to Meta, Llama 3 demonstrates state-of-the-art performance on multiple industry benchmarks, offers new capabilities including improved reasoning, and is the best open-source model on the market today.

According to Meta's test results, the Llama 3 8B model beats Gemma 7B and Mistral 7B Instruct on multiple benchmarks covering language (MMLU), knowledge (GPQA), programming (HumanEval), and more, while the 70B model surpasses Claude 3 Sonnet, the mid-tier version of the famous closed-source model, and goes three wins, two losses against Google's Gemini Pro 1.5. Meta also revealed that a 400B+ version of Llama 3 is still in training.


Meta has managed to retain its throne in the field of open-source large models.

The open-source release of Llama 3 is a major event for the entire large-model industry and has once again sparked heated debate over open source versus closed source. But across the ocean, back in China, the tone shifted abruptly, and a jarring refrain spread online: "Now that Llama 3 is out, domestic large models can finally make a new breakthrough."

Even before Llama 3 was released, one could already hear claims like "If China wants to catch up with GPT-4, just wait for Llama 3 to be open-sourced."

Open source is meant to break technology monopolies, drive the whole industry forward, and spur innovation. Yet every time Meta open-sources a model, from the first Llama to Llama 3, domestic large models endure another round of ridicule and belittlement from Chinese commentators.

In fact, this applies not only to large models: from cloud computing to autonomous driving, similar arguments have persisted. The reason is that Chinese technology long trailed foreign countries and was held down for so long that it bred a lack of technological self-confidence among the Chinese public.

But in reality, after a year of hard work and accumulation, while foreign large models such as Llama have grown very strong, domestic large models have also come from behind and become very strong. Even before Llama 3's release, domestic large models had already evolved to Llama 3's level, and in some cases beyond it:

Recently, the Tsinghua University SuperBench team, building on its recently released "SuperBench Large Model Comprehensive Capability Evaluation Report", additionally tested the two newly released Llama 3 models, measuring Llama 3's performance on five evaluation sets: semantics (ExtremeGLUE), code (NaturalCodeBench), alignment (AlignBench), agents (AgentBench), and safety (SafetyBench).

To place Llama 3 among the world's large models for comparison, the SuperBench team selected the models listed below: besides the mainstream open-source and closed-source models abroad, Llama 3 was also compared with mainstream domestic models.


For closed-source models, the SuperBench team took the higher score between the two access modes, API and web interface.

Based on the evaluation results they published, the following conclusions can be drawn:

(1) Llama 3-70B trails the GPT-4 series, Claude-3 Opus, and other world-class models on every evaluation set, with the largest gaps from the top in the semantics and code evaluations. It performs best in the agent evaluation, ranking 5th. Considering its smaller parameter count, however, Llama 3-70B's overall performance is still respectable.

(2) Compared with domestic large models, Llama 3-70B surpassed most of them across the five evaluations, losing only to GLM-4 and Wenxin Yiyan.

The SuperBench results show that domestic large models were already stronger than Llama 3 before its release: GLM-4 and Wenxin Yiyan have reached Llama 3's strength and belong to the first echelon of the global large-model race. After a year of catching up, the gap between domestic large models and GPT-4 is narrowing.

This leaves self-doubting arguments like "Now that Llama 3 is out, domestic large models can make new breakthroughs" and "If China wants to catch up with GPT-4, just wait for Llama 3 to be open-sourced" collapsing under their own weight.

1

GLM-4 and Wenxin Yiyan Surpass Llama 3-70B

The SuperBench Large Model Comprehensive Capability Evaluation Framework was jointly released in December 2023 by Tsinghua University's Foundation Model Research Center and the Zhongguancun Laboratory. It was developed against the backdrop of the evaluation chaos of the past year, in which, leaderboard by leaderboard, every large model claimed to rank first and to be catching up with GPT-4.

SuperBench aims to provide objective, scientific evaluation standards and clear away the fog, so that outsiders can see the real strength of domestic large models more clearly, and so that those models can step out of their self-deception, face the gap with foreign counterparts squarely, and stay grounded.

There are already many domestic and international leaderboards for testing large-model capability. But today, owing to data contamination and benchmark leakage, the fairness and reliability of the field's most-watched benchmark rankings are being questioned: on list after list, every model claims a major breakthrough, either ranking first or surpassing GPT-4.

In a short time, everyone seems to be "far ahead", with comparable strength on paper. In practice, however, most models' performance is underwhelming, and many still lag far behind GPT-4.

This behavior has persisted over the past year: domestic large models fell into a carnival of leaderboard gaming, even though everyone knows no model can truly rival GPT-4 yet. After all, Rome was not built in a day, and the gaps in front of us, in technological breakthroughs and in investment in compute and capital, force us to face reality: the distance from OpenAI cannot be closed in a year and a half.

One major consequence of rampant leaderboard gaming is that outsiders find it hard to tell how strong domestic models really are. Some genuinely capable large-model startups that deserve funding and talent lose out to companies better at publicity and hype, so bad money drives out good, hurting the development of the entire domestic model ecosystem.

As noted in the introduction, when domestic large models come up, some people dismiss them: the scores are all gamed anyway, so what is worth paying attention to? They will never match foreign models. Meanwhile many others applaud domestic models purely out of wounded pride.

Hence, when evaluating large models, the industry proposes using more benchmarks from diverse sources. The SuperBench team, from Tsinghua University, a top Chinese university, has years of experience in large-model research, and the SuperBench framework is designed to be open, dynamic, scientific, and authoritative; above all, its evaluation method is meant to be fair.

Following the migration of large-model capability focus, from semantics and alignment to code, agents, and safety, the SuperBench evaluation suite comprises five benchmark datasets: ExtremeGLUE (semantics), NaturalCodeBench (code), AlignBench (alignment), AgentBench (agents), and SafetyBench (safety).
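As a concrete picture of how such a multi-benchmark suite hangs together mechanically, here is a minimal Python sketch of an evaluation harness that loops one model over the five SuperBench sets and averages the scores. The function names, the dummy scorer, and the uniform 0-100 scale are all our own assumptions for illustration, not SuperBench's actual code or API.

```python
# Hypothetical sketch of a multi-benchmark evaluation harness.
# Nothing here is SuperBench's real API; names and scales are assumed.
from statistics import mean

# The five SuperBench benchmark sets and the capability each targets.
BENCHMARKS = {
    "ExtremeGLUE": "semantics",
    "NaturalCodeBench": "code",
    "AlignBench": "alignment",
    "AgentBench": "agents",
    "SafetyBench": "safety",
}

def score_model(model_name: str, benchmark: str) -> float:
    """Placeholder scorer: a real harness would run the model on the
    benchmark's test set and return a normalized score (assumed 0-100).
    A constant is returned here so the sketch runs end to end."""
    return 50.0

def evaluate(model_name: str) -> dict:
    """Score one model on every benchmark, then add a naive mean."""
    scores = {name: score_model(model_name, name) for name in BENCHMARKS}
    scores["overall"] = mean(scores.values())
    return scores

if __name__ == "__main__":
    for model in ("Llama 3-70B", "GLM-4"):
        print(model, evaluate(model))
```

In the real framework each benchmark has its own metric (pass rate for code, win rate for alignment, and so on), so the plain mean above merely stands in for whatever aggregation SuperBench actually uses.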

Let's look at the detailed evaluation results to see in which capabilities GLM-4 and Wenxin Yiyan surpass Llama 3-70B:

(1) Semantic evaluation, overall performance:


In the semantic comprehension evaluation, Llama 3-70B ranked 6th, behind Claude-3, the GPT-4 series models, and the domestic models GLM-4 and Wenxin Yiyan 4.0. It still trails the top-ranked Claude-3 by 8.7 points, but it leads the remaining domestic models and sits at the top of the second echelon overall.

Performance by category:


Llama 3-70B performed best in the mathematics sub-category of the semantic evaluation, ranking 4th and ahead of the GPT-4 series models. It also did well in reading comprehension and knowledge-science, ranking 6th in both; its smallest gap from the top came in reading comprehension, at just 4.3 points. In knowledge-common sense, however, it scored a lower 60.9, 18.9 points behind the top-ranked Claude-3.

(2) Code evaluation, overall performance:


In the code-writing evaluation, Llama 3-70B ranked 7th with a score of 37.1, behind world-class models such as the GPT-4 series and Claude-3 as well as domestic models such as GLM-4, Wenxin Yiyan 4.0, and iFLYTEK Spark 3.5, and a full 13.7 points behind GPT-4 Turbo. It is worth noting that the code pass rate of Llama 3-8B exceeds that of domestic models such as the KimiChat web version and the Skylark model.

Performance by category:


In the per-category code-writing evaluation, Llama 3-70B was middling, ranking between 6th and 8th, with a sizable gap behind the GPT-4 series models and Claude-3; in the English code instruction (Python) evaluation, it trailed the top-ranked GPT-4 Turbo by as much as 20.3 points.

(3) Chinese alignment evaluation, overall performance:


In the human-alignment evaluation, Llama 3-70B ranked 7th, again behind the GPT-4 series models and Claude-3. Among domestic models, besides Wenxin Yiyan 4.0 and GLM-4, Tongyi Qianwen 2.1 also slightly edged out Llama 3-70B. Even so, Llama 3-70B is not far behind the models ahead of it, trailing the top-ranked GPT-4 web version by only 0.35 points.

Performance by category:


Llama 3-70B ranked 7th in the Chinese reasoning evaluation, about 0.6 points behind the first echelon of the GPT-4 series models and Wenxin Yiyan 4.0. In the Chinese language evaluation it ranked 8th, but the gap to the GPT-4 series models and Claude-3 is smaller, putting it in the same echelon, only 0.23 points behind the top-ranked KimiChat web version.

(4) Agent evaluation, overall performance:


In the agent-capability evaluation, large models at home and abroad all performed poorly, but in the horizontal comparison Llama 3-70B did well, ranking 5th, behind only Claude-3, the GPT-4 series models, and the domestic model GLM-4.

Performance by category:


Llama 3-70B reached the top 3 in database (DB), knowledge graph (KG), and online shopping (Webshop), though still with a gap to the top of each list. It also did well in operating system (OS) and web browsing (M2W), ranking 4th and 5th respectively, while lateral thinking puzzles (LTP) was its worst category at 0.5 points.

(5) Safety evaluation, overall performance:


In the safety evaluation, Llama 3-70B scored 86.1 points, ranking 7th, not far behind the scores of Wenxin Yiyan 4.0, the GPT-4 series, GLM-4, and the other models ahead of it.

Performance by category:


Llama 3-70B ranked 4th in the horizontal comparison on unfairness and bias (UB) and 7th or lower in the other categories, but never far behind the models ahead of it: in mental health (MH), privacy and property (PP), and physical health (PH) it was within 3 points of the top score.

Judging from the SuperBench results above, Llama 3-70B surpassed most domestic models across the five evaluations, losing only to GLM-4 and Wenxin Yiyan. Zhipu's GLM-4, which beat Llama 3-70B to enter the first echelon, ranked first among domestic models on the two most critical capabilities, semantic understanding and agents, beating many competitors.

Over the past year, Zhipu has also been China's most prominent large-model startup, achieving leading results in both technological breakthroughs and commercialization.

2

To "reproduce OpenAI" of the Chinese model

How's the company doing?

Over the past year, China has produced a number of large-model unicorns, and Zhipu was among the fastest domestic companies to exceed a valuation of 10 billion yuan.

It has won so much capital mainly on the strength of its ChatGLM models: over the past year, Zhipu released three generations of base large models, ChatGLM, ChatGLM2, and ChatGLM3, at a pace of roughly one every three months, and at the start of 2024 it released a new-generation base model, GLM-4, whose performance approaches GPT-4.

This also matches its strategic positioning: a comprehensive benchmark against OpenAI.

The SuperBench results above once again quantify the GLM-4 model's capabilities: surpassing Llama 3-70B, approaching GPT-4, and standing in the first echelon of global models.

Looking at Zhipu's history and current state, it is a company that combines industry, academia, and research well.

On the academic side, since launching the new-generation base model GLM-4, Zhipu has published a steady stream of research results covering LLMs, multimodality, long text, alignment, evaluation, inference acceleration, agents, and other aspects of the large-model field:

One example is a new perspective on evaluating the emergent abilities of large models. A key question in LLM research and development is how to understand and improve a model's "emergent abilities"; the traditional view holds that model size and training-data volume are the decisive factors. Zhipu's paper "Understanding Emergent Abilities of Language Models from the Loss Perspective" proposes a new view: loss, not model parameters, is the key to emergence.

By analyzing the performance of language models of different sizes and data volumes on multiple English and Chinese datasets, Zhipu found that pre-training loss is negatively correlated with performance on practical tasks: the lower the loss, the better the model performs. This finding challenges the previous conventional wisdom and points to a new direction for optimizing future models, namely eliciting and improving emergent abilities by driving down pre-training loss. It also gives AI researchers and developers a theoretical basis for introducing new evaluation metrics and methods in model design and assessment.
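To make the "loss perspective" concrete, here is a toy, self-contained Python check using entirely made-up numbers: if emergence tracks pre-training loss rather than parameter count, task accuracy should switch on once loss drops below some threshold, even for a smaller model. None of the figures below come from the paper; they only illustrate the claimed pattern.

```python
# Toy illustration of the loss-threshold view of emergence.
# All numbers are synthetic; this is not the paper's code or data.
import numpy as np

# Columns: parameters (billions), pre-training loss, task accuracy.
models = np.array([
    (1.5, 2.45, 0.02),
    (7.0, 2.20, 0.03),
    (13.0, 2.05, 0.28),  # loss below threshold: ability "emerges"
    (7.0, 2.02, 0.31),   # smaller model, same low loss, same ability
    (70.0, 1.85, 0.55),
])

LOSS_THRESHOLD = 2.1  # hypothetical emergence threshold

below = models[models[:, 1] < LOSS_THRESHOLD]
above = models[models[:, 1] >= LOSS_THRESHOLD]
print("mean accuracy when loss <  threshold:", below[:, 2].mean())
print("mean accuracy when loss >= threshold:", above[:, 2].mean())
```

Note the 7B row under the threshold: on this view, what matters is where the loss sits, not how many parameters got it there.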

Zhipu has also disclosed GLM-4's RLHF technology. Aligning large language models is an important problem for AI control and AI safety: only when a model's behavior and outputs are consistent with human values and intentions can AI systems serve society safely, responsibly, and effectively. To this end, Zhipu developed a technique called ChatGLM-RLHF, which trains language models by integrating human preferences so that they produce better-received responses.
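Zhipu has not published a line-by-line recipe here, but reward models of the kind RLHF pipelines rely on are commonly trained with a pairwise Bradley-Terry preference loss. The PyTorch sketch below shows that generic loss; it illustrates the standard technique, not ChatGLM-RLHF's actual implementation.

```python
# Generic pairwise preference loss used in RLHF-style reward modeling.
# This is the standard Bradley-Terry objective, not ChatGLM-RLHF itself.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch:
    pushes the reward model to score the human-preferred response
    higher than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scores a reward model might assign to four response pairs.
r_chosen = torch.tensor([1.2, 0.4, 2.0, -0.1], requires_grad=True)
r_rejected = torch.tensor([0.3, 0.9, 1.5, -0.8])

loss = preference_loss(r_chosen, r_rejected)
loss.backward()  # in training, this gradient updates the reward model
print(f"preference loss: {loss.item():.4f}")
```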

Finally, Zhipu's large-model technology and academic research have been translated into commercial results.

In March this year, around ChatGLM's first anniversary, Zhipu released a batch of commercialization cases and announced results far exceeding expectations, including more than 2,000 ecosystem partners, 1,000 large-scale applications, and in-depth co-creation with more than 200 customers.

By comparison, many large-model companies reportedly have yet to find a workable path to commercialization; Zhipu's commercialization is at least half a year ahead of its domestic peers.

Zhang Peng, CEO of Zhipu, has voiced this view many times: the biggest obstacle to commercializing large models is still the technology. If Zhipu truly reached the level of GPT-4 or GPT-5, many commercialization problems, such as weak results and high prices, would fall away; there would be no need even to think about a business model beyond simply providing an API.

This statement applies to the whole large-model industry, and one of the most important reasons Zhipu is half a year ahead in commercialization is the leading position of its ChatGLM models.

Academic research and model iteration keep feeding commercialization, and Zhipu's achievements today also tell the industry that the industry-academia-research nature of the large-model field means companies walking on several legs at once, model, business, and academia, are bound to be steadier.

3

Postscript

In 2023, ChatGPT took the Chinese internet by storm, triggering a wave of large-model entrepreneurship at home and abroad. But China's large models are not rootless trees or sourceless water, destined only to follow foreign work.

As early as 2021, the Beijing Academy of Artificial Intelligence (BAAI, Zhiyuan) produced China's first trillion-parameter model, "Wudao", opening the road for domestic large-model research.

Likewise, after a year of hard catching up and learning, domestic large models such as GLM-4 and Wenxin Yiyan have defeated the strongest open-source model, Llama 3, and entered the first echelon of global competition, dispelling the notion that domestic technology can only follow and imitate.

We have long been urged to open our eyes to the world and learn from abroad. But in the era of large models, looking at how domestic models have changed over the past year, what we lack more is the willingness to face up to the progress of domestic technology.

An industry veteran once sighed: domestic large-model companies clearly produce plenty of technological innovation, so why do people only pay attention to foreign work, noticing it at home only after it catches fire abroad?

For example, the VDT paper that large-model startup Zhizi Engine published on arXiv in May 2023 proposed a Transformer-based unified video generation framework that is "almost identical" to the architecture behind Sora, which OpenAI released in 2024, nearly a year later.

Before Sora appeared, the team spent more than half a year pitching this paper, VDT, since accepted at ICLR 2024, to investors and anyone who would listen, and hit walls everywhere.

After the Spring Festival, Sora became the new star, and investors calling to meet the team lined up in a long queue, all wanting to learn about Sora and the team's papers.

Likewise, with Sora's explosion the DiT architecture has attracted enormous attention, yet the domestic multimodal startup Shengshu Technology had already developed U-ViT, one of the world's first diffusion-Transformer network architectures, back in September 2022.

And the scaling-prediction work of domestic large-model startup Mianbi Intelligence (ModelBest) ranks among the world's best, standing toe to toe with OpenAI's.

There are many such examples; in innovation and leading-edge work, domestic large-model technology is not inferior to that of foreign countries.

As the old saying goes, after three days apart, one should look at a scholar with fresh eyes. I hope we can pay more attention to the innovation of domestic technology, and support it.

The author of this article (WeChat: zzjj752254) has long followed the people, companies, and industry trends in the field of AI large models.

Without authorization from "AI Tech Review", reposting this article on any webpage, forum, or community is strictly prohibited!

To obtain authorization for reposting on official accounts, please leave a message in the "AI Tech Review" backend; reposts must credit the source and include this account's business card.
