
The Langlang Mountains of the Data Center

At the beginning of 2023, a little pig demon from Langlang Mountain went viral. In the first episode of the animated anthology "China Qitan", the pig demon is full of ambition and wants to make something of himself, but after much hard work his results are rejected, prompting the now-classic line: "I want to leave Langlang Mountain."

The little pig demon's experience touches the hidden pain of today's workers, and it also mirrors the little-known, nagging dilemmas inside the data center.

In recent years, forums, summits, and conferences have offered plenty of far-sighted, big-picture pronouncements: "computing power is productivity", "the infrastructure of the digital economy", "empowering intelligence with cloud and data", and so on. These macro-level trends and roadmaps underpin the rapid development of computing clusters such as cloud data centers and intelligent computing centers, and we have analyzed them at length in previous articles.

In actual construction, however, all sorts of concrete challenges arise that can be hard to imagine for people sitting in an office or research institute, pointing at PowerPoint slides.

For example, a staff member at a university computing center in western China once told me that their servers rely mainly on air cooling; keeping them cool means increasing the airflow, so the center's female employees cannot wear skirts when entering the machine room. The room is also so noisy that colleagues responsible for year-round operation and maintenance have suffered hearing damage.

These detailed, real problems are the Langlang Mountains the data center must climb; otherwise, like the little pig demon, it will toil to no avail. And such problems can only be learned from the ground beneath one's feet, from talking with front-line personnel. Today, drawing on some on-the-ground experience, let's talk about the mountains the data center has yet to climb.

The first mountain: power

What comes to mind when you think of the differences between Chinese and American data centers? Chips, architecture, software, the industry chain? There is one factor that is easily overlooked yet important: power supply.

Since 2018, Yiqi Research Institute has visited a number of domestic cloud data centers and found that the dual-socket 2U server is the mainstream specification in the domestic market. IDC's server market tracking reports confirm this: from 2018 to 2021, 2U models accounted for about 70% of rack servers. In the US market, by contrast, 1U is more popular.

What exactly are 1U and 2U? What causes this difference, and what does it mean?

(A 2U server in the machine room of the China Electronics Information Innovation Cloud Base in Shunyi)

As IT equipment has evolved, the servers used in modern data centers are generally 1U or 2U tall. The "U" (rack unit) measures the height of a rack server: 1U is 4.45 cm. Rack servers in early data centers were generally 3-5U.

The fewer rack units a server occupies, the shorter it is and the more machines fit in a cabinet: 1U servers can deliver twice the computing density of 2U servers. Meanwhile, the requirements for the data center clusters of the "East Data, West Computing" project, at the Beijing-Tianjin-Hebei, Yangtze River Delta, Guangdong-Hong Kong-Macao Greater Bay Area, and Chengdu-Chongqing hub nodes, all emphasize "high density", because only higher density can supply more computing power on a limited plot of land and improve land-use efficiency.

By this logic, 1U should be the better choice, yet field visits show that 2U models dominate China's cloud data centers. The decisive factor here is power supply capacity.

A cabinet of 1U servers draws more power than a cabinet of 2U servers: a single cabinet holding about 18 2U servers needs a 6 kW supply, while the same cabinet packed with 36 1U servers would need 12 kW. If a single cabinet's power supply cannot reach that level, the density advantage of 1U cannot be realized.
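
To make the arithmetic concrete, here is a minimal sketch in Python, assuming (per the figures above) that each server draws roughly the same power regardless of height, about 333 W:

```python
# Back-of-the-envelope cabinet power, using the article's figures.
# Assumption: each server draws ~333 W (6 kW / 18 servers), whether 1U or 2U.

WATTS_PER_SERVER = 6000 / 18  # ~333 W, implied by "18 x 2U servers = 6 kW"

def cabinet_power_kw(servers_per_cabinet: int) -> float:
    """Total power one cabinet must be fed, in kW."""
    return servers_per_cabinet * WATTS_PER_SERVER / 1000

print(f"18 x 2U servers: {cabinet_power_kw(18):.1f} kW")  # 6.0 kW
print(f"36 x 1U servers: {cabinet_power_kw(36):.1f} kW")  # 12.0 kW
# Doubling server count doubles the feed a single cabinet needs, which is
# why a 4-6 kW cabinet cannot realize the density advantage of 1U.
```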

(Inside the Linger Eastern Supercomputing Cloud Data Center)

At present, cabinet power in mainland China's data centers is still generally low. Mainstream cabinets run at 4-6 kW; publicity for the "East Data, West Computing" project even cites "2.5 kW standard rack" configurations; and cabinets above 6 kW account for only 32% of the total.

The data center's power supply system suffers from both old and new ailments. The old one: in a traditional data center, each electromechanical system operates separately, telemetry precision is insufficient, and the scope of regulation is limited, so power supply capacity cannot be finely matched to IT demand. Once the power density of a single cabinet rises, the reliability of continuous power supply may suffer, and the risk of downtime and interruption grows. For cloud service providers, a power outage in a cloud data center directly interrupts customers' businesses and brings economic losses, which is unbearable.
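
As a toy illustration of the kind of per-cabinet matching that finer-grained telemetry would enable, here is a minimal sketch; the headroom margin and all load figures are hypothetical:

```python
# Toy per-cabinet capacity check: flag cabinets whose projected IT load
# leaves no safe headroom on the rated feed. Figures are hypothetical.

def cabinet_ok(feed_kw: float, it_load_kw: float, margin: float = 0.8) -> bool:
    """True if the IT load stays within `margin` of the cabinet's rated feed."""
    return it_load_kw <= feed_kw * margin

for feed, load in [(6.0, 4.5), (6.0, 12.0), (12.0, 11.0)]:
    verdict = "ok" if cabinet_ok(feed, load) else "RISK: insufficient headroom"
    print(f"{feed:4.1f} kW feed, {load:4.1f} kW load -> {verdict}")

# Without telemetry at this granularity, operators must over-provision
# conservatively, which is one reason rack power ratings stay low.
```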

The new ailment: since the state proposed the "dual carbon" strategy, building green, energy-saving data centers has become a consensus, yet raising single-machine power density directly raises cooling requirements, and with them the power consumed by air conditioning and air-cooling equipment. Take the cloud data centers visited by "Digital China Miles" in 2021: Tencent Cloud's Huailai Ruibei Data Center uses 52U cabinets, and UCloud's Ulanqab Cloud Base uses 47U and 54U cabinets. If they all switched to 1U servers, density would not genuinely improve, but the server cooling design would become more challenging.

We know the data center must raise computing density, which means raising single-cabinet density, and single-cabinet power in turn needs more reliable, highly available power supply capacity behind it. The conclusion: power supply capacity will be a mountain China's data centers must climb.

The second mountain: cooling

As mentioned above, higher cabinet power density drives up the power consumed by cooling. Sharp-witted readers may ask: with more efficient, energy-saving cooling methods, can't we solve this problem and evolve smoothly to high density?

Indeed, the data center industry has gone to great pains in pursuit of more energy-efficient cooling. One approach is to accelerate "computing in the west": exploit the climate advantages of Ulanqab and other western regions, build new data centers there, and draw on outdoor natural cooling sources. "Digital China Miles" inspected 7 data center clusters and found that data centers in the Zhangjiakou and Linger clusters can use natural cooling sources for more than 10 months a year, reaching an annual average PUE of 1.2.
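
PUE (Power Usage Effectiveness) is the ratio of total facility power to IT equipment power, so a PUE of 1.2 means only 20% overhead on top of the IT load. A minimal sketch, with hypothetical load figures, shows where natural cooling helps:

```python
# PUE = total facility power / IT equipment power.
# 1.0 would mean every watt reaches the IT gear; cooling is usually
# the largest overhead above that. Load figures below are hypothetical.

def pue(it_kw: float, cooling_kw: float, other_kw: float) -> float:
    """Power Usage Effectiveness of a facility."""
    return (it_kw + cooling_kw + other_kw) / it_kw

print(pue(10_000, 4_000, 1_000))  # 1.5, a typical air-conditioned facility
print(pue(10_000, 1_500, 500))    # 1.2, feasible where outside air can
                                  # provide free cooling most of the year
```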

Another approach is to exploit liquid cooling's advantage in cutting energy consumption, gradually replacing air-cooled servers with liquid-cooled ones. For example, in 2018 Alibaba deployed an immersion-cooled machine room in Zhangbei County, Zhangjiakou City, Hebei Province: horizontal 54U cabinets, each holding 32 1U dual-socket servers and 4 4U JBODs. And as mentioned at the beginning, liquid cooling neatly solves the small trouble that air-cooled machine rooms cause for female employees' dress.

Does this mean liquid cooling will soon become widespread across the data center industry? After the 2021 "Digital China Miles" tour concluded, the "2021 China Cloud Data Center Investigation Report" released by Yiqi Research Institute gave its answer: cautious wait-and-see.

In our view, there are three reasons:

1. The mature air-cooled ecosystem.

Although liquid cooling is far more efficient than air cooling, air-cooled machine rooms have long dominated data center construction. Decades of air-cooled servers have built a mature ecosystem chain with advantages in construction and operating costs. In regions with favorable climates, air-cooled solutions can already meet PUE-reduction targets; Huawei's Ulanqab cloud data center, for example, is dominated by 8 kW air-cooled cabinets. And while some eastern and central regions have the demand and the will to introduce liquid cooling, they must also weigh the cost: if significant energy savings can be achieved by optimizing the UPS architecture and adopting intelligent energy efficiency management, then air-cooled it stays.

2. Technical issues during the transition period.

Of course, for HPC, AI, and similar workloads, liquid cooling's advantages are substantial, so some companies want to try it without rebuilding their air-cooled machine rooms. Hence, during the transition from air to liquid cooling, there is market demand for "air-liquid hybrid" solutions.

We know that air-cooled servers can be loosely coupled with the cooling plant, giving them strong environmental adaptability and flexibility. Immersion liquid cooling, by contrast, requires the server's boards, CPUs, memory, and other heat-generating components to be fully submerged in coolant, while spray liquid cooling requires modifying the chassis or cabinet; both carry relatively high costs. During the transition, mixing cold-plate liquid cooling with air cooling is the more practical solution. But cold-plate cooling fixes a cold plate onto the server's main heat-generating devices and relies on liquid flowing through the plate to carry heat away, so the requirements for sealing and leak prevention are strict, and design and manufacturing are very difficult.
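
The underlying physics explains both the appeal and the risk: the heat a cold plate removes is Q = m_dot * c_p * delta_T, so even a small water flow carries a large heat load. A rough sketch, with hypothetical chip power and temperature-rise figures:

```python
# Coolant flow needed by a cold plate: Q = m_dot * c_p * delta_T.
# Hypothetical assumptions: a 300 W CPU, water as coolant, and a 10 K
# allowable temperature rise across the plate.

WATER_CP = 4186        # specific heat of water, J/(kg*K)
KG_PER_LITER = 0.997   # density of water near 25 C

def flow_lpm(heat_w: float, delta_t_k: float) -> float:
    """Water flow in liters/minute needed to remove heat_w watts."""
    kg_per_s = heat_w / (WATER_CP * delta_t_k)
    return kg_per_s / KG_PER_LITER * 60

print(f"{flow_lpm(300, 10):.2f} L/min per 300 W CPU")  # ~0.43 L/min
# A mere trickle of water does the work of a large volume of air, but
# every joint in that loop must stay sealed, hence the design difficulty.
```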

(The Atlas 900 cluster deployed in HUAWEI CLOUD's Dongguan Songshan Lake Data Center uses air-liquid hybrid cooling.)

3. Industry chain collaboration.

A liquid-cooled data center requires collaborative innovation up and down the industry chain, spanning manufacturing, design, materials, construction, and operation and maintenance. Air cooling's loose coupling has also kept the refrigeration industry and the data center industry relatively separate. Pushing data centers toward liquid cooling means building a new ecosystem, strengthening the ties between the various roles, and reducing both the upfront manufacturing cost and the subsequent maintenance cost of liquid-cooled servers. That is a process of multi-party adjustment and cooperation, and it cannot happen overnight.

Seen from these angles, liquid-cooled data centers, though the general trend, still have a long way to go, and the whole industry is watching the shift closely.

The third mountain: chips

If power supply efficiency and the shift from air to liquid cooling are the major changes in the cloud data center's machine-room infrastructure, then the chip may be the focal point of its IT infrastructure.

In 2021, during its inspection of Guizhou, and of Ulanqab and Linger in Inner Mongolia, "Digital China Miles", exclusively sponsored by Arm Technology, observed a new phenomenon: China's "core" strength is rising, with domestic technology growing in maturity and adoption and catching up with the mainstream. Alibaba Cloud's Yitian 710, AWS's Graviton, and Ampere's Altra have all seen substantial development and application.

There are many reasons for this. Full-stack cloud autonomy provides market support for China's "core"; accelerating digitalization in government, finance, transportation, electric power, manufacturing, and other industries provides application scenarios; and the coexistence of x86 and Arm provides an R&D basis for customizing and optimizing China's "core" on the newer architecture.

But it must be said: the moon has a dark side. Behind the rise of China's "core", we must also see that China's semiconductor field still has a hard road to travel.

First come the shackles of process technology. The continuation of Moore's Law rests on advances in process technology, but improvement in semiconductor processes has been nearing its ceiling for some time and can no longer keep pace with rising chip specifications. Cloud data centers have therefore resorted to "stacking CPUs" to raise cabinet density, but the performance gained by piling on hardware has its limits, and the story cannot end there.

So in the post-Moore era, many domestic chip manufacturers have turned to chiplets. This new chip design paradigm packages multiple dies together into a network of chips, and it is being adopted by both the x86 and Arm ecosystems. Note, however, that while today's IP reuse practices offer relatively mature methods for testing and verifying individual IP blocks, how to test multiple chiplets after packaging, and how to ensure yield, remain problems China's "core" must solve.
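
The yield concern follows from simple multiplication: a package works only if every chiplet and the assembly step are all good, so yields compound. A toy sketch with hypothetical yield figures:

```python
# Why multi-chiplet packages make yield harder: the package works only if
# every die AND the assembly are good, so yields multiply.
# All yield figures below are hypothetical, for illustration only.

from math import prod

def package_yield(die_yields: list[float], assembly_yield: float) -> float:
    """Fraction of finished packages that work."""
    return prod(die_yields) * assembly_yield

# One large monolithic die at 80% yield:
print(f"monolithic:          {package_yield([0.80], 1.00):.1%}")      # 80.0%
# Four smaller chiplets at 95% each, with 98% assembly yield:
print(f"4 untested chiplets: {package_yield([0.95] * 4, 0.98):.1%}")  # ~79.8%
# The chiplet advantage evaporates unless each die is tested BEFORE
# packaging ("known-good die"), which is exactly the hard part.
```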

More importantly, chiplet assembly relies on advanced packaging, and co-designing and optimizing the chip's I/O interfaces with the package matters greatly for chip performance. That demands tight interaction between advanced packaging design and chip design, which in turn places demands on design tools. EDA tools, as we know, have long been a "soft underbelly" of the mainland's semiconductor field; until that is fixed, in an era when chiplets matter more and more, China's "core" can hardly rest easy.

At present, data center clusters, as an important part of digital infrastructure, are undergoing a series of changes. How well they are faring, and which problems remain to be solved, are questions that must be answered yet are not easy to answer.

"I cannot tell the true face of Mount Lu, for I myself am standing in the mountain." For many things, only by getting close to the front line, then stepping back to survey the whole, can we see the "Langlang Mountains" blocking the data center's progress.

There are still many mountains for the data center to cross in 2023. The road is long and obstructed, but keep walking and there will come a day when the sea is wide enough for fish to leap and the sky high enough for birds to fly.
