
This domestic supercomputing cluster is Tencent Cloud's newly released next-generation HCC (High-Performance Computing Cluster), built for large-model training, with overall performance improved 3x over the previous generation.
It is equipped with NVIDIA H800 Tensor Core GPUs and delivers high-performance, high-bandwidth, low-latency intelligent computing.
Today's wave of AI model training cannot do without high-performance computing clusters, and we are glad to share this good news with you right away.
Ordinary computation is handled by individual compute cards (chips).
But when the workload is too massive for any single chip to carry, thousands of servers must be linked over a network into one large computing cluster, working together to go higher and stronger.
A large AI model is typically trained on trillions of words, and parameter counts have soared into the trillions; at that scale, only a high-performance computing cluster can keep up.
How strong a computing cluster is depends on single-machine compute, network, and storage; like the staves of a sound wooden barrel, none of the three can be missing.
By jointly optimizing single-machine compute, network architecture, and storage performance, Tencent Cloud's next-generation cluster provides high-performance, high-bandwidth, low-latency intelligent computing for large-model training.
In general, it has the following characteristics:
In terms of computing, performance is strong -
On top of maximizing single-node compute performance, we also pair different kinds of chips, GPU + CPU, so that each chip goes where it fits best and does what it does best.
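The division of labor behind GPU + CPU pairing can be sketched in a few lines. This is purely illustrative (the `Task` and `assign` names are hypothetical, not Tencent's scheduler): parallel-friendly work goes to the GPU, control-heavy serial work to the CPU.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    kind: str  # "parallel" for bulk math, "serial" for control-heavy work

def assign(tasks):
    """Route each task to the processor type that suits it best."""
    placement = {}
    for t in tasks:
        # GPUs excel at massively parallel arithmetic; CPUs at branchy, serial logic.
        placement[t.name] = "GPU" if t.kind == "parallel" else "CPU"
    return placement

jobs = [
    Task("matmul", "parallel"),       # dense matrix math -> GPU
    Task("tokenize_text", "serial"),  # preprocessing / control flow -> CPU
    Task("attention", "parallel"),
]
print(assign(jobs))
# -> {'matmul': 'GPU', 'tokenize_text': 'CPU', 'attention': 'GPU'}
```

Real schedulers weigh data movement costs as well, but the principle is the same: each chip does what it is best at.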
In terms of networking, bandwidth is ample -
GPUs excel at parallel computing and can work on many pieces of a task at once. Our self-developed Xingmai high-performance network lets thousands of GPUs keep each other informed, moving data fast and without congestion so they fight as one coordinated team, raising large-model cluster training efficiency by 20%.
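What GPUs actually exchange during training is mostly gradients. A minimal all-reduce sketch (illustrative only, not Xingmai itself; the `allreduce` helper is hypothetical) shows why this traffic makes network bandwidth and latency so critical:

```python
# Each "worker" holds the gradients its GPU computed on its own data shard.
def allreduce(per_worker_grads):
    """Sum gradients element-wise across workers; every worker gets the total."""
    total = [sum(vals) for vals in zip(*per_worker_grads)]
    # In a real cluster this exchange travels the network as a ring or tree
    # collective, so slow links stall every GPU waiting on the result.
    return [list(total) for _ in per_worker_grads]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 workers, 2 parameters each
print(allreduce(grads))  # every worker now holds [9.0, 12.0]
```

This exchange happens after every training step, which is why faster, uncongested links translate directly into higher cluster training efficiency.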
In terms of storage, reads are fast -
When training a large model, thousands of servers read a batch of datasets at the same time; if loading takes too long, it becomes the barrel's short stave. Our latest self-developed storage architecture sorts data into different "containers" for use in different scenarios, making reads faster and more efficient.
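The "different containers for different scenarios" idea resembles a tiered read path. Here is a minimal sketch under that assumption (the `TieredStore` class is hypothetical, not Tencent's storage architecture): hot dataset shards are served from a fast tier, cold ones from bulk storage and promoted on first read.

```python
class TieredStore:
    """Serve reads from a fast tier when possible; fall back to slow storage."""

    def __init__(self, fast, slow):
        self.fast = fast   # e.g. in-memory or SSD cache tier
        self.slow = slow   # e.g. bulk object storage

    def read(self, key):
        if key in self.fast:
            return self.fast[key]          # fast-path hit
        value = self.slow[key]             # slow-path miss
        self.fast[key] = value             # promote for the next readers
        return value

store = TieredStore(fast={"shard-0": b"hot"}, slow={"shard-1": b"cold"})
print(store.read("shard-0"), store.read("shard-1"))  # b'hot' b'cold'
```

When thousands of servers request the same batch, keeping it in the fast tier is what stops data loading from becoming the short stave.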
With demand for computing power surging, buying GPUs on your own is expensive, and sometimes money cannot even buy them, which puts great pressure on startups and small and medium-sized enterprises. Our next-generation HCC cluster lets them train large models on the cloud and, we hope, eases that pressure.
We also have the training framework AngelPTM, which supports the training of Tencent's Hunyuan large model internally and is available as a service through Tencent Cloud. Last October it completed its first trillion-parameter large-model training run, cutting training time by 80%.
Our TI platform (a one-stop machine learning platform) offers large-model capabilities and a toolbox that help enterprises fine-tune models for their specific scenarios, raise production efficiency, and quickly build and deploy AI applications.
Our self-developed chips are already in mass production, including the Zixiao chip for AI inference. It adopts a self-developed memory architecture and self-developed acceleration modules, delivering up to 3x compute acceleration and more than 45% overall cost savings.
In general, building on the new-generation HCC, with self-developed chips and self-developed servers, we are constructing a software-hardware-integrated high-performance intelligent computing network for AIGC, and continuing to accelerate on-cloud innovation across society.
What do you want computing power to do in the future? See you in the comments.