OpenAI is short of computing power, and domestic manufacturers are the first to break the deadlock! Breaking the single-chip limit, computing efficiency up 33%

Author: New Zhiyuan

Editor: Editorial Department

That all computing is AI computing has become an industry consensus. Large model parameters have grown from hundreds of billions to trillions, and architectures have moved from dense models to MoE, driving an ever larger demand for computing power. What needs to be made clear is that the computing power of a single chip can no longer keep up with the development of LLMs.

Is domestic AI lagging because domestic chips are lagging?

Is the gap with NVIDIA's chips simply too large to close?

Recently, these questions have been hotly debated in the industry.

In fact, dig a little deeper and you will find this is not the case at all. Even NVIDIA's most advanced chips cannot meet AI's current demand for computing power.

As parameters and data scale up, intelligence continues to emerge, and the need for ever-larger clusters grows more urgent. At home and abroad, everyone is still far from the destination.

Computing power ≠ chips

Today, the state of large-scale neural network training looks like this.

The freshly released Llama 3, with its 8B and 70B parameter versions, was trained on a cluster of 24,576 H100s.

Mark Zuckerberg has revealed that by the end of this year, Meta will have built out infrastructure comprising 350,000 H100s.

GPT-4, which is said to have 1.8 trillion parameters, was trained on 10,000-25,000 A100s.

Sora, the viral hit, may have only about 3 billion parameters, yet it is estimated to have used 4,200-10,500 H100s for a month of training.

Tesla's FSD V12 was trained on a massive set of 10 million video clips, requiring about 10,000 H100s at a cost of $300 million.

Even Sam Altman recently named the "core bottleneck" of OpenAI's current growth in an interview with 20VC:

We have some of the best researchers and research culture in the world. If we don't have enough computing resources, it will slow us down.

In a word: give me computing power!

However, as Moore's Law slows, process advances from 14nm to 7nm to 5nm deliver increasingly limited performance gains.

We must understand that AI's appetite for computing power is effectively unlimited, and AI chips alone cannot satisfy it.

So what to do?

What is the bottleneck?

In fact, the new DGX SuperPOD built from DGX GB200 systems, launched by NVIDIA at GTC 24, has already pointed to the answer.

By focusing on accelerated computing, networking, and software, the new cluster provides stable support for the training and inference of trillion-parameter models.

Moreover, compared with the previous generation, the in-network computing power of the new DGX SuperPOD architecture has quadrupled.

In other words, the answer to the problem above is to break through the computing power bottleneck with larger clusters.

However, as more and more chips are integrated, we must deal with technical challenges such as inefficient algorithms, insufficient computing resources, and limited interconnect bandwidth.

Insufficient computing resources

On the one hand, the performance of AI systems comes mainly from accelerators such as GPUs, so systems need strong heterogeneous scaling capabilities.

However, traditional computer architecture treats the accelerated computing module as a peripheral of the CPU, attached to the system over the PCIe bus, which supports only a limited number of heterogeneous units and thus limits the scalability of heterogeneous accelerators.

In addition, the communication bandwidth with the CPU is very limited.

Interconnection bandwidth is limited

On the other hand, connectivity has become the new bottleneck.

AI clusters have grown from 1,000 cards to 10,000 and even 100,000 cards, and the massive communication traffic generated by parallel nodes poses a serious challenge to existing interconnect capabilities.

For example, the GPT-4 cluster just mentioned, with its 25,000 A100s, achieved a model FLOPs utilization (MFU) of only 32% to 36%.

The utilization is clearly very low, yet under current technical conditions it is already close to the achievable peak.

Article address: https://www.semianalysis.com/p/gpt-4-architecture-infrastructure

Part of the reason is the sheer number of failures, which force training to restart from checkpoints.

If OpenAI's cloud cost for the A100 were $1/hour, the cost of this training run alone would be as high as $63 million.
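That figure is easy to reproduce with back-of-the-envelope arithmetic. In the sketch below, only the $1/hour rate and the 25,000-GPU cluster size come from the article; the 105-day wall-clock duration is our assumption, chosen to match the total:

```python
# Reproducing the ~$63M estimate under stated assumptions.
gpus = 25_000              # A100s, upper end of the reported cluster size
usd_per_gpu_hour = 1.0     # the article's assumed cloud rate
days = 105                 # assumed wall-clock training time (hypothetical)

cost = gpus * 24 * days * usd_per_gpu_hour
print(f"${cost / 1e6:.0f}M")   # -> $63M
```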

The algorithm is not efficient

Of course, hardware alone is not everything: AI training is an extremely complex systems problem.

If the model's algorithmic structure is mismatched with the hardware, or the parallelization scheme is poorly designed, the utilization of the whole computing platform will stay low.

In addition, high-speed interconnects between cabinets bring challenges of both high power consumption and insufficient heat dissipation.

All in all, solving the problems above requires innovation: tackling AI's challenges with systematic, creative thinking.

The 10,000-card cluster

Nowadays, many people like to say that the AI industry "lacks chips and lacks software," as if any failure to develop AI were the fault of the chip manufacturing industry.

But in reality?

A little analysis shows that AI computing systems are now designed at the 10,000-card level, where the performance of any single card is no longer decisive.

For large models with hundreds of billions or trillions of parameters, single-machine, single-card efficiency no longer matters much; what matters is the overall efficiency of the computing platform.

Take GPT-3 as an example: its training MFU was only 21.3%, meaning nearly 79% of the computing power was wasted.

Address: https://arxiv.org/pdf/2204.02311.pdf
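For readers who want the definition behind that number: MFU compares the FLOPs a training run actually achieves against the hardware's theoretical peak. A minimal sketch, using the standard ~6N FLOPs-per-token estimate for dense decoder models; the throughput and GPU figures below are hypothetical, chosen only to land near the reported 21.3%:

```python
def mfu(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization: achieved training FLOP/s over hardware peak.
    Uses the common ~6 * N FLOPs-per-token estimate (forward + backward)
    for a dense decoder-only model."""
    achieved_flops = 6 * n_params * tokens_per_second
    return achieved_flops / (n_gpus * peak_flops_per_gpu)

# Hypothetical GPT-3-scale illustration (V100 FP16 peak is 125 TFLOPS):
print(mfu(n_params=175e9, tokens_per_second=250_000,
          n_gpus=10_000, peak_flops_per_gpu=125e12))  # 0.21
```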

The reason for such waste is that in large-scale computing, single-point efficiency matters little: however strong the raw computing power, close to 80% of the time is spent waiting.

Why? First, interconnect bandwidth is limited; second, algorithms are often designed without bandwidth optimization in mind, so efficiency plummets.

Under these conditions, system-level interconnect optimization, efficient orchestration, and algorithm optimization become more and more important.

Hardware

To this end, Inspur Information released the "Convergence Architecture 3.0" last year.

This is a new large-scale computing architecture that decouples compute and storage via a high-speed interconnect bus.

When GPU computing power falls short, a GPU pool is needed, so that one server can attach not just 8 cards but 16 or 32.

At the same time, simply stacking cards runs into a bottleneck, because there is an optimal ratio between CPUs and GPUs.

Depending on the type of model and the amount of interaction between models, some GPUs play a larger role and some play a smaller role.

Multiple nodes are connected over a high-speed system bus, and CPUs, GPUs, and memory are all pooled, letting the converged architecture adapt to the algorithmic model.

This new architecture is not a stand-alone system with chips at its core, but one that takes the 10,000-card cluster as its design starting point and the system as its core.

Going forward, a key point of innovation in AI computing will be how to extract the full value of the system and improve its efficiency.

And in this system, the next problem to be solved is how to interconnect.

Interconnection

Obviously, as clusters grow from 1,000 cards to 10,000, high-speed interconnection within the system becomes more and more important.

The previous single-task AI factory model has long been unable to meet the demand.

Clusters must serve not only large model training but also provide services to users, which is exactly what the AICloud model addresses.

However, traditional supercomputing private networks cannot adequately support the flexible requirements of multi-user, multi-task, multi-tenant operation.

For high-speed GPU-to-GPU interconnection, NVIDIA's proprietary NVLink network is the most typical representative.

In the DGX SuperPOD, NVIDIA leverages fifth-generation NVLink and Quantum-X800 InfiniBand networking, providing up to 1,800 GB/s of bandwidth to every GPU in the system.

Point-to-point GPU communication bandwidth has thus grown from 32 GB/s in 2017 to today's high of 1,800 GB/s, a 56-fold increase.

For the future of large model training, Inspur Information is betting firmly on "Super AI Ethernet", which can deliver 1.6 times the efficiency of traditional RoCE.

Why?

Because it achieves "device-network collaboration", bringing peak computing efficiency to model training.

Device-network collaboration refers to the close cooperation between AI switches and smart NICs, and the introduction of innovative functions to the network based on open technologies.

Multi-path load balancing is one of the best applications.

Per-packet spraying can be deployed on the switch (network side) to maximize bandwidth utilization, but it causes packets to arrive out of order.

This is a problem the switch cannot easily solve on its own.

However, the smart NIC (device side) has enough compute and resources to reorder them, making the impossible possible and unlocking the network's potential.

Specifically, out-of-order reassembly can reorder the packets before handing them to upper-layer AI applications, improving effective bandwidth utilization from 60% to more than 95%.
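To make the idea concrete, here is a toy sketch of NIC-side reassembly: packets sprayed across multiple paths arrive in any order, and a small buffer releases them to the application strictly in sequence. The names are ours for illustration; real smart NICs do this in hardware:

```python
class ReorderBuffer:
    """Toy model of out-of-order reassembly on the NIC (device side)."""

    def __init__(self):
        self.expected = 0   # next sequence number owed to the application
        self.pending = {}   # early packets parked by sequence number

    def receive(self, seq, payload):
        """Accept one packet; return whatever can now be delivered in order."""
        self.pending[seq] = payload
        released = []
        while self.expected in self.pending:
            released.append(self.pending.pop(self.expected))
            self.expected += 1
        return released

buf = ReorderBuffer()
print(buf.receive(2, "p2"))   # []            -- arrived early, parked
print(buf.receive(0, "p0"))   # ['p0']        -- in order, delivered
print(buf.receive(1, "p1"))   # ['p1', 'p2']  -- gap filled, both released
```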

It is the emergence of Super AI Ethernet that realizes a tighter coupling between switches and network cards.

On the one hand, the switch can perform fine-grained routing and scheduling of network packets. On the other hand, the SmartNIC provides order-preserving services to achieve efficient and balanced network traffic.

At the same time, the NIC can run dynamically programmable congestion control based on multi-dimensional telemetry stamped by the switch, achieving no blocking and zero packet loss.
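As an illustration of that feedback loop (and not Inspur's actual algorithm), a classic additive-increase/multiplicative-decrease rule driven by one telemetry signal, the switch queue depth, might look like this:

```python
def aimd_step(rate, queue_depth, threshold,
              increase=0.05, decrease=0.5, max_rate=1.0):
    """One NIC-side congestion-control update reacting to a queue-depth
    reading stamped by the switch. Purely a sketch: real programmable
    congestion control consumes richer, multi-dimensional telemetry."""
    if queue_depth > threshold:
        return rate * decrease             # congestion seen: back off hard
    return min(max_rate, rate + increase)  # headroom left: probe upward
```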

An efficient network built from switch + smart NIC is the defining feature of "Super AI Ethernet".

Clearly, to truly unlock network performance, raw bandwidth is not enough; what matters more is raising "effective bandwidth" through good scheduling.

Software

A system this complex needs corresponding scheduling software, covering workload awareness, automatic resource scheduling, and elastic scaling.

In addition, fault isolation and self-healing are becoming more and more important in the development of large models.

This, too, is handled by the software system: on failure, training can seamlessly roll back to the previous checkpoint.
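In skeleton form, that "save periodically, resume from the latest checkpoint" behavior might look like the loop below; the file layout and names are our own invention for illustration:

```python
import glob
import os
import pickle

def latest_step(ckpt_dir):
    """Highest step among files named step_<n>.ckpt (hypothetical layout)."""
    steps = [int(os.path.basename(p).split("_")[1].split(".")[0])
             for p in glob.glob(os.path.join(ckpt_dir, "step_*.ckpt"))]
    return max(steps, default=0)

def train(ckpt_dir, total_steps, save_every=500):
    os.makedirs(ckpt_dir, exist_ok=True)
    step = latest_step(ckpt_dir)        # after a crash, resume from here
    state = {"step": step}              # stand-in for model/optimizer state
    while step < total_steps:
        step += 1                       # stand-in for one real training step
        state["step"] = step
        if step % save_every == 0:      # periodic checkpoint for self-healing
            with open(os.path.join(ckpt_dir, f"step_{step}.ckpt"), "wb") as f:
                pickle.dump(state, f)
```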

Cooling

At the same time, improving the efficiency of a 10,000-card cluster requires ever more computing power per node.

High-density AI computing is therefore inevitable: cabinet power will climb from 12-16 kW to 120 kW, and heat dissipation will gradually move toward liquid cooling.

Coincidentally, NVIDIA also uses liquid cooling in the latest DGX SuperPOD.

Algorithm

Moreover, computing power is driven not only by chips, but also by algorithms.

Since the Transformer's birth in 2017, Moore's Law (chip performance doubling every 18 months) would predict only about an 8-fold increase in chip performance.

However, in fact, the performance of AI computing has been improved by more than 1,000 times.

This is not only due to the optimization of the chip manufacturing process, but also due to the improvement of the entire system.

On the algorithm side, the numerical precision of large models has moved from FP32 to FP16, entered FP8 this year, and will head toward FP4 in the future.

With this shift, the computing power that algorithms demand drops dramatically, but getting there takes a hunger for innovation.
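The arithmetic behind the precision march is simple: halving the bits halves the memory for weights (and raises achievable throughput on hardware that supports the format). For a 100-billion-parameter model:

```python
# Bytes needed just to store the weights of a 100B-parameter model.
params = 100e9
for fmt, bytes_per_param in [("FP32", 4), ("FP16", 2), ("FP8", 1), ("FP4", 0.5)]:
    print(f"{fmt}: {params * bytes_per_param / 1e9:.0f} GB")
# FP32: 400 GB, FP16: 200 GB, FP8: 100 GB, FP4: 50 GB
```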

Through technical optimizations including algorithm parallelism and parameter parallelism, Inspur Information has raised computing efficiency by as much as 33%.

Specifically, on Source 2.0 Inspur Information adopts non-uniform pipeline parallelism + optimizer parameter parallelism (ZeRO) + data parallelism + loss partitioning, which requires less bandwidth yet achieves higher performance than the classic 3D parallelism approach.

For example, with uniform pipeline parallelism, a 24-layer model is split across 8 computing devices, 3 layers per device.

In this setup, memory in the first stage hits the GPU's upper limit. As a result, training the model needs more devices and a longer pipeline, lowering computing efficiency.

With non-uniform partitioning, per-stage memory can be evened out, so the model can be trained within limited computing resources.
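The partitioning idea can be sketched in a few lines: given a per-layer memory profile in which early layers (embeddings, long-lived activations) are heavier, split contiguous layers into stages of roughly equal memory rather than equal layer counts. The greedy heuristic below is our illustration, not Inspur's actual scheduler:

```python
def partition_by_memory(layer_mem, n_stages):
    """Greedy contiguous split of per-layer memory costs into n_stages
    pipeline stages with roughly equal memory per stage."""
    target = sum(layer_mem) / n_stages
    stages, current, acc = [], [], 0.0
    for i, mem in enumerate(layer_mem):
        layers_left = len(layer_mem) - i
        stages_left = n_stages - len(stages)
        # Close the stage once it would exceed the target, as long as every
        # remaining stage can still receive at least one layer.
        if (current and acc + mem > target
                and len(stages) < n_stages - 1
                and layers_left >= stages_left):
            stages.append(current)
            current, acc = [], 0.0
        current.append(i)
        acc += mem
    stages.append(current)
    return stages

# 24 layers; layer 0 is memory-heavy (e.g. embeddings plus activations):
print(partition_by_memory([4.0] + [1.0] * 23, n_stages=8))
# -> [[0], [1, 2, 3], [4, 5, 6], ...]: the heavy first layer sits alone,
#    so stage-0 memory no longer caps the whole pipeline
```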

However, under the pipeline-parallel strategy, the pipeline is still relatively long.

To solve this, the team introduced optimizer parameter parallelism to further reduce per-node memory overhead.

With memory saved, stages can be merged into larger ones, reducing the number of nodes used and saving computing resources.
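A rough memory model shows why sharding optimizer state frees so much room. With mixed-precision Adam, a common estimate is 16 bytes per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for the FP32 master weights and Adam moments); ZeRO stage 1 splits the 12-byte optimizer portion across the data-parallel ranks. The sketch below keeps the whole model on each rank, ignoring the pipeline split, purely to isolate the effect:

```python
def adam_bytes_per_gpu(params, dp_ranks, zero1=True):
    """Estimated training-state bytes per GPU for mixed-precision Adam:
    2 (FP16 weights) + 2 (FP16 grads) + 12 (FP32 optimizer states) per
    parameter; ZeRO stage 1 shards only the optimizer states."""
    optimizer = params * 12 / (dp_ranks if zero1 else 1)
    return params * (2 + 2) + optimizer

p = 100e9  # hypothetical 100B-parameter model
print(adam_bytes_per_gpu(p, dp_ranks=64, zero1=False) / 1e9)  # 1600.0 GB
print(adam_bytes_per_gpu(p, dp_ranks=64, zero1=True) / 1e9)   # ~418.8 GB
```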

Algorithmic innovation also has a proof point in the large model field itself: MoE.

Scaling a dense 100-billion-parameter model to the trillion level is hard: the compute and training time far exceed what systems can bear, and efficiency is extremely low.

A mixture-of-experts (MoE) architecture, by contrast, combines several hundred-billion-parameter expert models.

Moreover, such expert scheduling better matches the human brain's way of producing intelligence through complex collaboration.
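The mechanism fits in a few lines: a gate scores all experts for each token, only the top-k actually run, and their outputs are blended. A minimal NumPy sketch; the shapes and the top-2 choice are illustrative, not a description of any particular model:

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """One token through a mixture-of-experts layer.
    x: (d,) token vector; experts: list of callables (d,) -> (d,);
    gate_w: (d, n_experts) gating weights."""
    logits = x @ gate_w                # score every expert for this token
    top = np.argsort(logits)[-top_k:]  # indices of the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                       # softmax over the selected experts
    # Only k experts execute, so compute scales with k, not with n_experts.
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n_experts)]
out = moe_forward(rng.normal(size=d), experts, rng.normal(size=(d, n_experts)))
print(out.shape)  # (8,)
```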

Try it for yourself

The "system-centric" innovation strategy for AI development is the fruit of Inspur Information's years of deep work in computing power and large models.

As early as 2021, before ChatGPT was born, Inspur Information had already become an early practitioner of large models, releasing "Source 1.0".

After more than two years of iteration, the 100-billion-level parameter basic model "Source 2.0" has been fully open-sourced.

In a sense, their goal is not to be a company that makes its living off large models,

but simply to explore: how much computing power do LLMs require, what matters most in 10,000-card interconnection, what are the application scenarios, and where does the value of innovation lie?

Because only by trying and doing can you find the answers and gain real understanding.

At the IPF 2024 conference, Peng Zhen, chairman of Inspur Information, gave an example:

When the team was training large models on a domestic platform, they found the interconnect bandwidth unsatisfactory. To overcome this, engineers made extensive optimizations at the algorithm layer, applying algorithm parallelism and parameter parallelism, which improved overall computing efficiency by 33%.

Bear in mind that a 30% chip performance gain normally requires at least one process-node iteration. Yet in practice, Inspur Information found that software algorithms could deliver it quickly.

For another example, in developing the nearly 250-billion-parameter "Source 1.0", the team established a basic law of cognitive large models: as the parameter count increases, LLM accuracy improves.

Innovation, then, is not about standing on the shore pondering how to swim; it is about getting into the water.

From the process of solving problems, find the path of innovation.

This is the philosophy Inspur Information has been practicing: building computing systems through all-round innovation in technology, frameworks, and standards, and opening a new era of AI!

Resources:

https://mp.weixin.qq.com/s/Cl6lxxjs2UTXEMlh9-EDfg
