
Digital Intelligence QA | What capabilities are indispensable for an AI server?

Author: Lenovo

With just a few keywords, a static photo can dance to the rhythm of music; fed a few paragraphs, an AI can generate a well-written, tightly argued piece to specification; and with only 10 seconds of audio, timbre cloning is realistic enough to make a hopelessly off-key singer sound like a professional, or let a crosstalk performer speak fluent English.

In the past, such applications might have sounded like far-fetched fantasies. In the AI era, however, thanks to strong support from the three pillars of artificial intelligence (data, algorithms, and computing power), these former fantasies have become reality one by one.

Among these three pillars, computing power plays a particularly critical role. It is not only the foundation on which data and algorithms take effect, but also the core driving force behind the high-quality development of artificial intelligence. Without powerful computing power, the progress of artificial intelligence would be severely limited.


With the rise of AI applications such as generative AI and the emergence of new needs such as large model training, the scale of computing power is experiencing unprecedented rapid growth. In this context, AI servers have become the core carrier of intelligent computing power. Compared with traditional servers, AI servers have significant advantages in computing, storage, and network transmission capabilities, and can meet the growing demand for intelligent computing power.

However, the exponential growth in demand for intelligent computing power has driven IT infrastructure spending upward. How to maximize server utilization, and thereby optimize cost and return, while ensuring business continuity and stability has become a common challenge for many enterprises.

In this issue of Digital Intelligence QA, we discuss, in Q&A form, the key capabilities required to build an efficient, stable, and reliable AI server across dimensions such as hardware configuration, software optimization, and product design.

Q What are the trends in the development of computing power under the wave of AI?


In the future, the development of computing power will show the following trends:

  • Heterogeneous computing is becoming mainstream: The traditional model of simply stacking CPUs can no longer meet the growing demand for AI computing. Heterogeneous computing that combines GPUs, NPUs, ASICs, and other chips is gradually becoming mainstream; it can greatly improve AI computing efficiency and meet the needs of diverse, complex application scenarios.
  • Edge computing is becoming an important supplement: Edge computing deploys computing resources closer to terminal devices, meeting AI application requirements such as real-time response and security.
  • Rack density is increasing: Because data center space is constrained, increasing rack density has become an important trend in data center design.
  • The importance of intelligent computing power is increasingly prominent: To adapt to this trend, intelligent computing centers are being built on hybrid architectures, which has become an inevitable direction for the industry.
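The heterogeneous-computing trend above can be pictured as a dispatcher that routes each workload to the best available compute unit rather than running everything on the CPU. The sketch below is a toy illustration only: the device names, task kinds, and preference order are invented for this example, not drawn from any real scheduler.

```python
# Toy sketch of heterogeneous dispatch: prefer a suitable accelerator
# (GPU, NPU, ...) for each task kind, falling back to the CPU.
# All names and preferences here are illustrative assumptions.

def dispatch(task_kind: str, available: set) -> str:
    """Return the preferred available device for a task, defaulting to CPU."""
    preference = {
        "matmul": ["gpu", "npu"],      # dense math favors the GPU
        "inference": ["npu", "gpu"],   # inference favors a dedicated NPU
    }
    for dev in preference.get(task_kind, []):
        if dev in available:
            return dev
    return "cpu"  # general-purpose fallback

print(dispatch("matmul", {"gpu", "cpu"}))        # gpu
print(dispatch("inference", {"npu", "gpu"}))     # npu
print(dispatch("inference", {"cpu"}))            # cpu (no accelerator present)
```

Real heterogeneous schedulers weigh far more than a static preference list (memory capacity, interconnect locality, current load), but the routing idea is the same.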

Q What is the difference between an AI server and a regular server?

An AI server is a server specially designed for AI application scenarios. AI servers are mainly used to handle large-scale and complex computing tasks, such as AI deep learning training and inference, to meet the needs of various AI applications.


The main differences between AI servers and ordinary servers are as follows:

  • Processing power: Thanks to high-performance processors and dedicated accelerators, AI servers offer higher processing power and can meet the demands of large-model training and other compute-intensive applications. Ordinary servers are optimized mainly for general network applications and face bottlenecks when processing large-scale data and complex computing tasks.
  • Storage capacity: AI servers often attach large storage clusters to handle large-scale data. Ordinary servers vary in storage configuration by application scenario, and their storage capacity is relatively limited.
  • High-speed network: AI servers place higher demands on network bandwidth, latency, jitter, and packet loss. They generally require high-speed fabrics such as InfiniBand or RoCE to support large-scale parallel AI computing, whereas ordinary servers typically rely on standard TCP/IP networks.
  • Energy consumption: Because AI servers process large volumes of computing tasks, their energy consumption is relatively high; a mainstream AI server can draw as much as 10kW at full load. An ordinary server handling general network applications consumes far less, roughly 0.5kW.
  • Application scenarios: AI servers mainly handle computing tasks in AI scenarios such as deep learning training and inference, while ordinary servers serve a wide range of network applications such as web and database workloads.
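The energy-consumption gap above has direct consequences for rack planning. Using the rough per-server figures cited in the text (10 kW for a fully loaded AI server versus about 0.5 kW for an ordinary server), a simple sketch shows how few AI servers fit into a rack's power budget. The 40 kW rack budget below is an assumed example value, not a figure from this article.

```python
# Illustrative only: how many servers fit within a rack's power budget,
# given the per-server draws cited above (10 kW AI vs ~0.5 kW ordinary).
# The 40 kW rack budget is an assumption for the example.

def servers_per_rack(rack_budget_kw: float, server_power_kw: float) -> int:
    """Return how many servers of a given draw fit in the rack's power budget."""
    return int(rack_budget_kw // server_power_kw)

ai = servers_per_rack(40.0, 10.0)        # fully loaded AI servers
ordinary = servers_per_rack(40.0, 0.5)   # ordinary servers

print(ai, ordinary)  # 4 vs 80
```

This 20x difference in density per kilowatt is one reason rack power and cooling, not floor space alone, increasingly constrain AI data center design.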

Q What are the types of AI servers that are suitable for different scenarios?


AI applications fall into two scenarios: AI training and AI inference. To match their different computing-power requirements, AI servers are divided into training servers, combined training-and-inference servers, inference servers, and edge servers.

  • AI training server: Mainly used to train machine learning models; it must provide powerful intelligent computing power to meet the training requirements of large models.
  • AI inference server: Mainly used to run trained AI models and perform tasks such as prediction or classification on new input data. The Lenovo ThinkSystem SR645 V3 is a typical example: it can handle complex AI inference workloads, with two 4th Gen AMD EPYC processors delivering up to 256 cores. Multiple PCIe 4.0 and PCIe 5.0 slots let users flexibly expand the configuration to business needs, and the machine supports up to four single-width GPUs to fully meet AI inference requirements.
  • AI training-and-inference server: Combines training and inference to provide a one-stop AI computing solution. The Lenovo Wentian WA5480 G3 training-and-inference all-in-one server, for example, supports multiple types of compute; its rich PCIe 5.0 interfaces accommodate up to 10 double-width GPUs, and it supports inference, training, rendering, scientific computing, and other scenarios with a variety of topologies, further extending its applicability.
  • AI edge server: Mainly used for inference in edge computing scenarios, where computation is performed closer to the user to reduce data transmission latency and improve response speed. Edge servers typically have a smaller footprint and lower power consumption to fit the constraints of the edge environment. Lenovo recently launched the new ThinkEdge SE455 V3 edge server, further enriching its AI edge product line. Powered by AMD EPYC 8004 series processors, it delivers up to 34% faster performance for greater multitasking efficiency. Thanks to Lenovo's technological innovation and design optimization, the SE455 V3 can save up to 50% of energy, and its rich expansion options meet storage, network, and GPU expansion requirements.
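The training/inference split that these server categories map onto can be shown in miniature without any AI framework: training iteratively updates model parameters (the compute-heavy workload of a training server), while inference only runs the frozen model forward (the lighter workload of an inference server). The linear model and toy data below are illustrative assumptions, not anything from the article.

```python
# Framework-free sketch of the training vs. inference workloads.
# Training: repeated gradient updates to a parameter (compute-heavy).
# Inference: a single forward pass with frozen parameters (lightweight).

def train(data, lr=0.1, epochs=200):
    """Fit y = w*x by gradient descent -- the 'training server' workload."""
    w = 0.0
    for _ in range(epochs):
        # gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def infer(w, x):
    """Forward pass only, no parameter updates -- the 'inference server' workload."""
    return w * x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples from y = 2x
w = train(data)
print(round(infer(w, 5.0), 2))  # 10.0
```

Real training multiplies this loop across billions of parameters and thousands of accelerators, which is why training servers emphasize raw compute and interconnect bandwidth while inference servers emphasize throughput per watt.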

Q How do I ensure that AI servers are efficient, stable, and reliable?

AI servers can be kept running efficiently, stably, and reliably through reasonable hardware configuration, excellent cooling and energy management, system optimization and tuning, high-availability and fault-tolerant design, and high-standard quality control.

  • Reasonable hardware configuration: Selecting high-performance processors, GPUs and other accelerators, memory, and storage devices meets the high compute, memory, and storage requirements of AI applications and significantly improves the training and inference efficiency of large models. For example, the Lenovo Wentian WR5220 G3 server uses Intel's latest 5th Gen Xeon® Scalable processors, supporting up to two 64-core CPUs with a 385W TDP. The platform pairs them with 5600 MT/s high-performance DDR5 memory, low-latency high-bandwidth NVMe storage, PCIe 5.0 expansion slots, and the latest GPUs to maximize system performance.
  • Excellent cooling and energy management: A well-designed cooling system keeps the server at high performance even under sustained heavy load, while an effective energy management strategy significantly reduces energy consumption and improves efficiency. As the TDP of CPUs and GPUs keeps climbing, liquid cooling is seen as the key technology for breaking through the bottleneck of air cooling. Lenovo's acclaimed Neptune™ warm-water cooling technology enables a fanless, fully water-cooled cabinet design with server heat-removal efficiency of up to 98%, supports waste-heat recovery, cuts energy consumption by 42%, and reduces data center PUE to 1.1. Its parallel cooling-channel design also reduces performance jitter in the CPU and other devices, improving Linpack performance by 5-10% over air cooling. Lenovo has deployed more than 70,000 sets of Neptune warm-water cooling globally, consolidating its position as a leader in server liquid cooling and continuing to help enterprises achieve green, sustainable development. On the energy management side, Lenovo's LiCO management platform monitors cluster energy consumption and provides management strategies: it can dynamically adjust CPU frequency and fan speed according to system conditions, reducing the energy consumption of the entire cluster.
  • System optimization and tuning: Optimizing and tuning the operating system, AI framework, and algorithm libraries improves the server's overall performance and stability. In job scheduling, for example, Lenovo LiCO uses intelligent scheduling algorithms and cluster management software to allocate parallel computing tasks sensibly across compute nodes, reducing resource contention and queuing time between tasks, improving cluster efficiency, and lowering energy consumption.
  • High-standard quality control: To pursue higher quality and improve server reliability and stability, every server should be subject to strict quality control. Strict quality control measures run through every stage of Lenovo server design, R&D, production, and testing. For example, before leaving the factory, every Lenovo server board undergoes a 1000V DC Hipot (high-voltage withstand) test, ensuring withstand-voltage quality that exceeds industry test standards. As of December 14, 2023, Lenovo servers had won a total of 536 world records in performance tests and passed 87 NCTC testing certifications.
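PUE (Power Usage Effectiveness), the metric cited in the cooling discussion above, is simply total facility power divided by IT equipment power; a PUE of 1.1 means only 10% overhead goes to cooling, power delivery, and other non-IT loads. The sample wattages below are assumed for illustration; only the 1.1 figure comes from the article.

```python
# PUE = total facility power / IT equipment power (>= 1.0 by definition).
# The 150/110/100 kW figures are assumed example values.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: facility draw per unit of IT draw."""
    return total_facility_kw / it_equipment_kw

air_cooled = pue(total_facility_kw=150.0, it_equipment_kw=100.0)     # 1.5
liquid_cooled = pue(total_facility_kw=110.0, it_equipment_kw=100.0)  # 1.1

print(air_cooled, liquid_cooled)
```

Moving from a typical air-cooled PUE of around 1.5 to 1.1 means the same IT load requires roughly 27% less total facility power, which is where much of the claimed energy saving comes from.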

Q How to improve server O&M efficiency and ensure business continuity?

An easy-to-maintain server significantly reduces an enterprise's operation and maintenance costs, shortens downtime, and keeps IT systems running continuously and stably, sparing the enterprise O&M headaches.


Take Lenovo AI servers as an example. Lenovo's innovative tool-less design makes internal core components easier to install and deploy and allows faulty components to be replaced quickly. Replaceable components inside the server are uniformly marked in blue, so O&M personnel can quickly and accurately identify and replace them themselves, reducing the risk of damage from improper operation. And because common components are shared across all architecture platforms, Lenovo servers simplify support and greatly ease later maintenance.

Lenovo servers also offer other maintainability features. Light Path Diagnostics, for example, uses LEDs to pinpoint faulty memory slots and hard drives, significantly reducing maintenance effort and downtime. Hot-swappable parts make it easy to replace server components without cutting power, reducing downtime and avoiding the data loss or damage that hardware replacement can otherwise cause.


Lenovo servers also support one-click, second-level maintenance. For example, two patented plastic parts let the internal RAID card be fixed to the motherboard with a single click, rather than the cumbersome screw-locking of the past, reducing the difficulty of the operation. This one-click fastening also greatly improves component assembly efficiency and enables second-level maintenance.

Analysts predict that global AI server shipments will exceed 1.6 million units in 2024, an annual growth rate of 40%, and the industry is seeing explosive demand for intelligent infrastructure, including AI servers. As the world's leading provider of computing infrastructure and services, Lenovo will rely on full-stack intelligent products, solutions, and services to drive the continued development and application of AI technology, empower industries to accelerate their intelligent transformation, and seize new opportunities in the AI era.
