Break through the shackles of computing power! Ascend embraces the era of "brute-force computing" for large AI models!

Author: Optimist AI talks about technology

In the past two years, the rise of large models has driven an explosive increase in demand for computing power: demand has grown 750 times, while the hardware supply of computing power has grown only 3 times. Zhang Dixuan, President of Huawei's Ascend Computing Business, laid out this compute gap created by large models at the 2023 World Artificial Intelligence Conference. The gap is still widening; by 2030, the computing power required by artificial intelligence is expected to be 500 times that of 2020. At the same time, for a variety of reasons, localizing the supply of computing power has also become urgent.

On how to address the compute shortage, Zhang Qingjie, Partner for Digital Empowerment at KPMG China, believes it must be tackled in three ways: building computing power, sharing and optimizing infrastructure, and improving algorithms and data quality, with compute construction as the first priority.

Huawei has been very active in building computing power in recent years. According to a July research report from CITIC Securities, Huawei currently holds roughly 79% of China's market for urban intelligent computing centers.

Beyond sheer volume, improving the capacity of computing clusters is equally important. At the 2023 World Artificial Intelligence Conference, Huawei announced a full upgrade of the Ascend AI cluster, expanding it from the initial 4,000 cards to 16,000 cards and ushering computing clusters into the "ten-thousand-card" era.

Ken Hu, Huawei's rotating chairman, said the Ascend AI cluster will be designed as a single supercomputer, improving the performance and efficiency of Ascend AI clusters by more than 10% and improving system stability by more than 10 times.

Zhang Dixuan also revealed in an interview that as early as 2018, Huawei predicted that artificial intelligence would develop rapidly and moved away from the earlier small-model approach toward a paradigm in which large computing power plus big data produces large models. Huawei therefore began developing computing-cluster products at that time.

In the era of artificial intelligence, computing power can no longer be scaled simply by stacking chips, as in the standalone era; the computing infrastructure must be systematically redesigned. Beyond massively expanding the compute supply, the problems of low utilization and high barriers to use must also be solved, ultimately building a computing-power ecosystem.

The emergence of ChatGPT this year triggered a surge in demand for computing power. GPUs became the first hardware products to benefit, and NVIDIA's market value has risen 66% this year to $1.05 trillion.

NVIDIA's A100 GPU has become the go-to choice for large models, but simply stacking more cards can no longer meet the explosive demands of the "battle of a hundred models." So how can precious computing resources be used to maximum effect?

Since a single server can no longer meet computing needs, connecting many servers into a "supercomputer" has become the main direction for today's computing infrastructure: the computing cluster.

Huawei released the Atlas 900 AI training cluster in 2019. It initially consisted of thousands of Huawei's self-developed Ascend 910 AI chips (used mainly for training) and had grown to 8,000 cards by June of this year. At the just-concluded World Artificial Intelligence Conference, Huawei announced plans to expand the cluster to more than 16,000 cards by the end of this year or early next year. A "ten-thousand-card" cluster is one that uses on the order of ten thousand compute cards, such as GPUs, for training or inference. Scale matters: training a 175-billion-parameter GPT-3 model on eight V100 GPUs would take an estimated 36 years; on 512 V100s, close to 7 months; on 1,024 A100s, the training time shrinks to about 1 month.
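The training times quoted above are roughly consistent with near-linear scaling. A back-of-the-envelope check, assuming perfect scaling and an illustrative (not measured) per-card speedup of an A100 over a V100:

```python
# Ideal-scaling estimate for the GPT-3 training times quoted above.
# Assumes near-linear scaling and that one A100 is roughly 2.5x a V100
# for this workload -- both are illustrative assumptions, not measured figures.

V100_BASELINE_YEARS = 36.0  # quoted time for 8 V100 cards
BASELINE_CARDS = 8

def ideal_time_years(n_cards: int, speedup_per_card: float = 1.0) -> float:
    """Ideal (perfectly linear) training time with n_cards cards."""
    return V100_BASELINE_YEARS * BASELINE_CARDS / (n_cards * speedup_per_card)

print(f"512 V100:  {ideal_time_years(512) * 12:.2f} months")        # ~6.75, "close to 7 months"
print(f"1024 A100: {ideal_time_years(1024, 2.5) * 12:.2f} months")  # ~1.35, "about 1 month"
```

In practice communication overhead keeps real clusters below this ideal, which is exactly why software planning and scheduling (discussed below) matter so much.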

By Huawei's own evaluation, an Atlas 900 AI cluster with 8,000 Ascend AI compute cards can train a 100-billion-parameter GPT-3-scale model in just one day, and a 16,000-card cluster can complete the training in half a day. Using a ten-thousand-card cluster for model training, however, is far from easy.

Gao Wen, an academician of the Chinese Academy of Engineering, has pointed out that only a few thousand researchers worldwide can choose models suited to more than 1,000 compute cards, fewer than 100 can train models on more than 4,000 cards, and fewer still can train on more than 10,000. Training and inference at the thousand-card and ten-thousand-card scale pose great challenges for software planning and resource scheduling.

Ten-thousand-card-scale training places higher demands on distributed parallel training. Distributed parallel training is an efficient machine-learning approach that splits a large dataset into multiple parts and trains the model in parallel across multiple computing nodes, greatly reducing training time while improving the model's accuracy and reliability.
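The data-parallel idea can be sketched in a few lines. This is a minimal single-process simulation, assuming synchronous SGD: the dataset is sharded across "workers", each worker computes gradients on its shard, and the gradients are averaged (a stand-in for an all-reduce) before every weight update. Real frameworks run the workers on separate devices; here they are plain loops.

```python
# Minimal sketch of synchronous data-parallel SGD on a 1-D linear model y = w*x.

def grad_mse(w, shard):
    """Gradient of mean-squared error for y = w*x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_sgd(data, n_workers=4, lr=0.05, steps=50):
    shards = [data[i::n_workers] for i in range(n_workers)]  # shard the dataset
    w = 0.0
    for _ in range(steps):
        grads = [grad_mse(w, s) for s in shards]   # computed in parallel in practice
        w -= lr * sum(grads) / n_workers           # "all-reduce": average the gradients
    return w

# Data generated from y = 3x; training should recover w close to 3.
data = [(x / 10, 3 * x / 10) for x in range(1, 41)]
print(round(data_parallel_sgd(data), 2))  # → 3.0
```

Because every worker sees the averaged gradient, the result matches single-node training on the full dataset, which is what makes the scheme correct as well as fast.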

Distributed parallel training on Ascend computing clusters relies on Huawei's self-developed MindSpore AI framework. MindSpore supports multiple model types and provides an automatic hybrid-parallel solution that combines data parallelism with model parallelism. This dual parallel strategy achieves a higher computation-to-communication ratio under the same compute and network conditions, removes the difficulty of hand-crafting parallel architectures, and improves the efficiency of large-model development and tuning.
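The model-parallel half of such a hybrid scheme can also be sketched simply: one weight matrix is split by output rows across "devices", each device computes its slice of the output, and the slices are gathered back into the full result. This is an illustration of the general technique, not MindSpore's API; names and shapes are made up.

```python
# Sketch of model parallelism: split a matrix-vector product across devices.

def matvec(rows, v):
    """Plain matrix-vector product (the unsplit reference computation)."""
    return [sum(r * x for r, x in zip(row, v)) for row in rows]

def model_parallel_forward(weight, x, n_devices=2):
    # Split the weight matrix's output rows across devices.
    shards = [weight[i::n_devices] for i in range(n_devices)]
    partials = [matvec(s, x) for s in shards]   # runs on separate devices in practice
    # "All-gather": interleave the partial outputs back into the full result.
    out = [0.0] * len(weight)
    for d, part in enumerate(partials):
        for j, val in enumerate(part):
            out[d + j * n_devices] = val
    return out

W = [[1, 0], [0, 2], [3, 0], [0, 4]]
x = [10, 1]
print(model_parallel_forward(W, x))  # → [10, 2, 30, 4], same as matvec(W, x)
```

Combining this with the data sharding above gives the data-plus-model hybrid the text describes: each device holds only part of the model and sees only part of the data, which is what lets models outgrow any single card's memory.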

In addition, distributed parallel training requires all chips to synchronize on every training step, and errors can occur in this process; at ten-thousand-card scale the stability requirements are even stricter. Huawei's Ascend AI chips are designed for reliability and availability, sustaining stable training for 30 days versus roughly 3 days at the industry's most advanced level, an improvement in stability and availability of nearly 10 times.
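Why stability gets harder with scale is simple probability: if each card independently survives a run with probability p, a synchronous job spanning n cards survives only with probability p^n, so per-card reliability must rise sharply as clusters grow. The numbers below are made up purely for illustration:

```python
# Illustrative reliability arithmetic for synchronous training at scale.
# p_card is a hypothetical per-card probability of surviving a 30-day run.

def cluster_survival(p_card: float, n_cards: int) -> float:
    """Probability that every card in a synchronous job survives the run."""
    return p_card ** n_cards

for n in (8, 512, 16000):
    print(f"{n:>6} cards: {cluster_survival(0.9999, n):.3f}")
```

Even 99.99% per-card reliability leaves a 16,000-card job likely to fail, which is why fault tolerance and checkpoint/recovery are first-class concerns at ten-thousand-card scale.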

Recently, Zhuyu Future Technology and several listed companies announced plans to integrate ChatGPT with their virtual digital humans to build smarter, more lifelike digital humans. Zhuyu Future Technology's main product is the "Zhuyu" app, through which celebrities can create hyper-realistic AI virtual-human models for free. Using the company's AI cross-modal digital human 3.0 technology, these avatars achieve highly human-like "thoughts and behaviors," and users can order customized greeting videos featuring the celebrity's real likeness and real voice.

Now, for Huawei and other large-model companies alike, quickly producing L2 models from L1 models and deploying them across device, edge, and cloud has become the problem of opening up the last mile of industry applications.

To tackle this last mile, Ascend has partnered with upstream large-model players such as iFLYTEK, Zhipu AI, and Yuncong to propose an integrated training-and-inference solution.

Put simply, model training is like studying at university, inference deployment (running the trained model in a production environment) is like starting a job, and integrated training and inference is "learning while doing."

For the entire AI computing-power ecosystem, opening up this last mile as soon as possible has become the top priority: getting through it means the ecosystem is truly activated, with limitless possibilities and sustainable industry development ahead.