
Behind the "war of a hundred models": a two-hundred-fold computing power gap, and the last mile of industry adoption

Source: Tech Walker
Author: Zhou Ya

If not for this year's WAIC 2023 (World Artificial Intelligence Conference), it would be hard to see so many large models in one place. Reportedly, more than 30 large models are on display, and the domestic large language models benchmarked against ChatGPT alone include:

the hundred-billion-parameter Chinese-English dialogue model ChatGLM-130B from the Knowledge Engineering Laboratory of Tsinghua University's Department of Computer Science, MOSS from Fudan University's Natural Language Processing Laboratory, Baidu's "Wenxin Yiyan", Alibaba's "Tongyi Qianwen", iFLYTEK's Xinghuo cognitive large model, SenseTime's SenseChat Chinese language model, Yunzhisheng's Shanhai large model, and more. These are only the tip of the iceberg; by incomplete statistics, it is no exaggeration to describe the domestic scene as a "war of a hundred models."


The sudden boom in domestic large models has in turn produced unprecedented, steeply rising demand for computing power. According to statistics, before the advent of deep learning, the computing power used for AI training doubled roughly every 20 months; afterward, it doubled roughly every 6 months; and after 2012, demand for training the leading AI models accelerated to doubling every 3 to 4 months, that is, an average annual growth in computing power of an astonishing 10 times. With large-model development now in full swing, demand for training compute is expected to expand to 10 to 100 times its current level.
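The annual growth rates quoted above follow directly from the doubling times. A minimal conversion sketch (only the doubling times come from the article; the function name is ours):

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Convert a compute doubling time (in months) into a yearly growth factor."""
    return 2 ** (12 / doubling_months)

# Doubling every 20 months -> ~1.5x per year (pre-deep-learning era)
print(round(annual_growth_factor(20), 2))   # 1.52
# Doubling every 6 months -> 4x per year
print(round(annual_growth_factor(6), 2))    # 4.0
# Doubling every 3-4 months -> roughly 8x to 16x per year,
# consistent with the article's "average annual growth ... 10 times"
print(round(annual_growth_factor(3.5), 1))  # 10.8
```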

"Over the past two years, large models have driven a 750-fold increase in demand for computing power, while hardware supply has grown only 3-fold." In an interview, Zhang Dixuan, President of Huawei's Ascend Computing Business, explained the imbalance between the growth of large models and the supply of computing power this way. In other words, there is a computing power gap of more than two hundred times.

Gao Wen, academician of the Chinese Academy of Engineering and director of Peng Cheng Laboratory, went further during WAIC 2023: "Computing power is also an index of digital-economy development. With sufficient computing power, the digital economy can develop well; otherwise, it cannot." He cited a 2022 consulting report from Tsinghua University in support: "The computing power index is directly proportional to GDP; the stronger the computing power, the stronger the GDP."

This means that if, as agencies previously predicted, "AIGC will create a trillion-scale market by 2030," then the most pressing task for domestic large models is to find highly reliable, cost-effective computing power.

A massive ten-thousand-card "computing power factory"

As is well known, training AI algorithms requires massive GPU computing resources. So in the era of large models, how can available computing resources be turned into usable computing power?

One effective solution in the industry: since a single server can hardly meet the computing demand, why not concentrate firepower and connect many servers into one "supercomputer"? That supercomputer is a computing power cluster.

Take Huawei as an example. In 2018, Huawei released its AI strategy and began building the Ascend AI software and hardware platform. Today, Huawei has built Ascend AI into a computing power cluster that combines the strengths of HUAWEI CLOUD across computing, storage, networking, and energy. Huawei's concept is "DC as a Computer": designing an entire AI computing center as a single supercomputer.

In 2019, Huawei released the Atlas 900 AI training cluster, composed of thousands of Huawei's self-developed Ascend 910 AI chips. Its supported scale has grown from 4,000 cards in June of this year to 8,000 cards today, and Huawei plans to reach 16,000 cards by the end of this year or early next year, making it the industry's first ten-thousand-card AI cluster.


Zhang Dixuan, President of Huawei's Ascend Computing Business

Why build a computing power cluster?

Zhang Dixuan explained in the interview that in the past, small models were customized for each scenario, resulting in high development costs and weak monetization; since the emergence of large models, model generalization has kept improving and capabilities have kept growing, so they can effectively empower many industries. "We judged at the time that for AI to develop, it had to move toward large models plus large computing power plus big data." Hence Ascend AI's iteration toward the ten-thousand-card cluster, to make large-model training ever faster.

What does a ten-thousand-card cluster mean in practice? Take training the 175-billion-parameter GPT-3 model as an example: with a single NVIDIA V100 card, training would take an estimated 288 years; with 8 V100 cards, 36 years; with 512 V100s, nearly 7 months; and with 1,024 A100s, the time can be reduced to a month.
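The jump from 288 years on one card to 36 years on 8 cards is simply ideal linear scaling. A minimal sketch reproducing the article's V100 figures; the `efficiency` parameter is our own addition, there only to show that real clusters scale sub-linearly:

```python
def training_years(single_card_years: float, n_cards: int,
                   efficiency: float = 1.0) -> float:
    """Idealized data-parallel scaling: time divides by the card count,
    discounted by a (hypothetical) scaling-efficiency factor."""
    return single_card_years / (n_cards * efficiency)

print(training_years(288, 8))         # 36.0 years, matching the article
print(training_years(288, 512) * 12)  # 6.75 months, i.e. "nearly 7 months"
```

The 1,024-A100 figure (one month) does not follow from card count alone; it also reflects the A100's higher per-card throughput than the V100.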

According to Huawei's evaluation, training a 175-billion-parameter GPT-3 model on 100B of data takes one day on an 8,000-card Atlas 900 AI cluster, and can be shortened to half a day on a 16,000-card cluster. "It's like writing code: tap the keyboard and the file comes out," Zhang Dixuan said.
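Huawei's figure can be sanity-checked with the common rule of thumb that transformer training costs roughly 6 × parameters × tokens FLOPs. The assumptions below are ours, not Huawei's methodology: "100B of data" is read as 100 billion tokens, per-card throughput uses the Ascend 910's announced 256 TFLOPS FP16 peak, and 50% utilization is a guess, so this is an order-of-magnitude sketch only:

```python
def training_days(params: float, tokens: float, n_cards: int,
                  peak_flops_per_card: float, utilization: float) -> float:
    """Estimate training time via total FLOPs ~= 6 * params * tokens."""
    total_flops = 6 * params * tokens
    cluster_flops_per_s = n_cards * peak_flops_per_card * utilization
    return total_flops / cluster_flops_per_s / 86400  # 86400 seconds per day

# 175B-parameter model, 100B tokens, 8,000 cards at 256 TFLOPS FP16, 50% util
days = training_days(175e9, 100e9, 8_000, 256e12, 0.5)
print(round(days, 1))  # ~1.2 days, the same order as the article's "one day"
```

Doubling the card count to 16,000 halves the estimate, consistent with "shortened to half a day."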

"About half of China's large-model innovations are currently supported by Ascend AI," Ken Hu, Huawei's rotating chairman, emphasized during WAIC 2023. "Ascend AI clusters can currently improve large-model training efficiency by more than 10%, improve system stability more than 10-fold, and support more than 30 days of uninterrupted, stable training."

Ken Hu also presented Ascend AI's report card for the past year: the number of developers has doubled from 900,000 to more than 1.8 million; more than 30 large models with over a billion parameters each have been natively incubated or adapted, accounting for about half of China's large models; Ascend has brought on more than 30 hardware partners and more than 1,200 ISVs (independent software vendors), jointly launching more than 2,500 industry AI solutions; the Ascend AI cluster has supported the construction of AI computing centers in 25 cities nationwide, of which 7 municipal public computing platforms were selected for the first batch of national "new-generation artificial intelligence public computing open innovation platforms," accounting for 90% of that computing power; and 23 companies have launched new products in the Ascend AI series, covering cloud, edge, and device intelligent hardware, jointly improving the efficiency of large-model development, training, fine-tuning, and deployment.

To sum up: facing AI's promising sea of opportunities, Huawei has mainly taken three paths.

First, in computing power: move from single-point computing power to cluster computing power and build a strong computing base. This part rests mainly on Ascend AI.

Second, in industry: adhere to open source and openness to strengthen the Ascend AI industry ecosystem. This part centers on cooperation among government, industry, academia, research, and application.

Third, in the ecosystem: push Ascend's AI services from general-purpose large models toward industry large models, making AI "go deeper and more real." The goal here is to serve a wide range of industries.


Putting large models into practice

Matching Huawei's three AI development paths, public attention around "large models" has shifted from the early "what" and "why" to "how to use them." In other words, more people now care about where large models can actually make a difference.

Against this backdrop, a relatively specialized category, the industry large model, has begun to draw attention.

"Oriental Wingwind" is a three-dimensional supercritical-wing fluid simulation large model developed by COMAC's Shanghai Aircraft Design and Research Institute. It can simulate the full range of a large aircraft's flight conditions with high precision in one-thousandth of the original time, which is equivalent to speeding up three-dimensional airfoil design for large aircraft by 1,000 times and shortening the R&D cycle of commercial airliners.

Consider that about 50% of an aircraft's drag in flight comes from the wings, so designing a wing that meets the aircraft's flight requirements is critical. According to Chen Yingchun, member of the standing committee of COMAC's Science and Technology Committee and chief designer of its long-range wide-body airliner, large passenger aircraft design currently relies on three complementary methods: numerical simulation, wind tunnel testing, and flight testing.

However, numerical simulation is time-consuming and costly, a major bottleneck in aircraft design, and flight tests and wind tunnel tests are expensive as well: traditional numerical simulation methods are neither nimble nor fast. AI technology is seen as the way to break through this problem.

With AI, the "Oriental Wingwind" large model has achieved breakthroughs on four levels: efficiency, accuracy, modeling, and scenarios. First, efficiency: AI models replace traditional Navier-Stokes equation solving, greatly improving end-to-end simulation efficiency. Second, accuracy: regions where the flow changes sharply, such as the shock waves that occur during cruise, are finely captured, improving the model's predictive accuracy. Third, modeling: component-level modeling and distributed parallelism over big-data samples greatly improve the efficiency of developing new models. Fourth, scenarios: a unified mapping from fluid data to AI data has been established, applicable to other simulation scenarios such as automobiles and high-speed rail.

A closer look at COMAC's "Oriental Wingwind" model reveals two prerequisites: first, the technical base of the large model, which comes from Huawei's Ascend AI; second, the design ideas, expert experience, and industry data in fluid dynamics, which fall within COMAC's domain.

Here we can see the development logic of large models: when technology reaches deep into industry scenarios, it keeps the whole business system running healthily and in turn drives high-quality development of the industry. In this process, technology vendors and industry players each play their role and complement one another.

"There is a division of labor across the industry. Ascend mainly focuses on doing computing power well and will not touch large models themselves," Zhang Dixuan emphasized in the interview.

How to get through the "last mile"?

Speaking of the explosive growth of large models, Zhang Dixuan said frankly that although it is a "war of a hundred models" now, the future focus should be on each player finding its own division of labor.

Among them, only a handful of large companies can afford to "burn" money on L0 general-purpose large models; more companies are building L1 industry large models, and some are building scene-level large models. In the financial industry, for example, since an L0 model lacks industry-specific knowledge, some companies take an L1 model, build a finance large model on top of it, and then combine it with subdivided scenarios (such as precision marketing, risk control, and intelligent customer service) to build scene-level models. This is an industry trend.

In Zhang Dixuan's view, commercial competition among large models is about to begin: on one hand, everyone will race to build models; on the other, they will race to seize market position. After that, the landscape will converge.

With large computing power and large models in place, how can the industry's last mile be opened up?

At present, the pain points across the industry are the long R&D cycle of large models, the high threshold for deployment, and business security. To address them, Huawei and four partners, Facewall Intelligence, Zhipu AI, iFLYTEK, and CloudWalk Technology, jointly released integrated training-and-inference large-model solutions, providing industry customers with "out-of-the-box" all-in-one large-model offerings through joint design, joint development, coordinated go-to-market, and continuous iteration.

"Customers only need to select a suitable large model and feed in industry data to complete the entire training, fine-tuning, and inference workflow," Zhang Dixuan pointed out. "Huawei has achieved more than 20-fold model compression with precision loss under 5 per thousand, helping large models be compressed for use in real scenarios and reducing deployment difficulty and development cost."

"Ascend AI supports nearly half of China's homegrown large models, and it is currently the only system in China that has completed the training and commercial deployment of hundred-billion-parameter large models." Zhang Dixuan offered this final set of numbers almost casually.