
CICC | AI Wave Top Series: AI device-side landing acceleration, opening a new era of real-time interaction

Author: CICC Research
This week, OpenAI and Google each released a new generation of models: GPT-4o and the Gemini series. This article reviews the two AI giants' progress in large models and discusses the implications for hardware, operating systems, and computing power. We believe that as AI gradually lands on the device side, it will drive innovation and upgrades in consumer electronics terminals and place higher demands on cloud computing hardware systems, especially on the inference side.

Summary

What are the similarities and differences between Gemini 1.5 Pro and GPT-4o? We believe GPT-4o's innovation lies in its end-to-end model, which brings a new breakthrough in human-computer interaction, while Google has upgraded Gemini's performance and integrated AI capabilities broadly across its ecosystem. Both are natively multimodal large models, which is likely to spur imitation across the industry; native multimodality may become the future development trend. They differ in that Gemini offers a larger context window and more attractive pricing, while GPT-4o delivers stronger model performance and puts more emphasis on human-computer interaction innovation in practical application scenarios.

The arrival of AI on the device side is changing how consumers interact with electronics terminals; we focus on operating system upgrades and application prospects. On the hardware side, we believe the two releases accelerate device-side AI from four angles: 1) innovation in multimodal interaction; 2) anthropomorphization of AI voice assistants; 3) demonstrated AI functions on mobile devices; and 4) commercialization prospects. Although large models today still rely mainly on cloud computing power, given vendors' ongoing efforts to compress model parameters and the prospect of device-side monetization, we believe pushing part of the computing workload down to the device side will become an inevitable path, and the corresponding consumer electronics terminals will see hardware-level innovation and upgrades. On the operating system and application side, more anthropomorphic voice assistants make AI agents possible, and future changes in interaction modes may shift traffic entry points, profoundly reshaping the ecosystem.

Cloud computing hardware: some of GPT-4o's functions are free and open, and Gemini's capability improvements may require lower unit computing costs; AI infra faces significant optimization. We have seen the industry's yardstick for computing hardware performance and cost gradually shift from training to inference. Beyond continuous upgrades on the chip side and the network hardware side (such as optical modules), systems engineering capabilities are also being strengthened: to achieve higher hardware utilization and lower inference costs, memory optimization, operator fusion and optimized operator implementations, low-precision (quantized) inference, and distributed inference are all mainstream approaches. We believe the computing hardware market is expected to enter an era of trading price for volume as applications land, and the market size may continue to grow.

Risks

AI algorithm and application progress falls short of expectations; the AI monetization model is uncertain; demand for consumer electronics smart terminals is sluggish.

Body

GPT-4o vs. Google Gemini: how far have large models iterated?

Figure 1: Google I/O Conference and OpenAI Spring Conference


Source: Google I/O Conference 2024, OpenAI Spring Conference, CICC Research

OpenAI: GPT-4o is an innovation in end-to-end modeling, bringing new breakthroughs in human-computer interaction

On May 13, OpenAI launched its new flagship model GPT-4o ("o" for "omni") at its spring event. GPT-4o adds voice processing capabilities on top of GPT-4: it can accept any combination of text, audio, and images as input and generate any combination of text, audio, and images as output. It also comes closer to human expression in latency, tone simulation, and expressiveness, a step toward more natural human-computer interaction.

Low latency and fast response improve the voice assistant's anthropomorphic quality. Before GPT-4o, the average latency of voice-mode conversations was 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). Thanks to the shift from a three-model pipeline to a single end-to-end model, GPT-4o can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response times in conversation, improving the user experience.

Free availability and improved API cost-performance are expected to open up commercialization space. OpenAI announced at the launch event that GPT-4o will be available to all users for free[1], with paid users enjoying five times the usage limit. The GPT-4o API is 2x faster than the GPT-4 Turbo API at half the price.

In addition, the GPT-4o launch event focused on combining AI with practical application scenarios. OpenAI demonstrated many real-world scenarios of GPT-4o interacting with users as a voice assistant, including voice search, image recognition, and emotional-feedback recognition. These live demonstrations let the market see the broad space of potential AI application scenarios.

Figure 2: GPT-4o's functional characteristics and application scenarios


Source: OpenAI official website, Shanghai Kaili Wuzhi Technology Co., Ltd. official website, CICC Research

Google: Gemini's performance has been upgraded, and AI capabilities have been widely integrated into its ecosystem

On May 14, 2024, Google held its 2024 I/O conference and released a series of large-model products and AI applications. With OpenAI and Google holding launch events back to back, there is a clear head-to-head dynamic: Gemini 1.5 Pro against GPT-4o, Project Astra against GPT-4o's voice mode, Gems against GPTs, Veo against Sora, and so on, reflecting that Google is accelerating its effort to close the gap with OpenAI in large models. In addition, we believe the launch of features such as AI Overview, Ask Photos, and AI + Workspace shows Google actively leveraging its industrial and ecosystem advantages to integrate AI with applications.

Project Astra benchmarks against GPT-4o to create a smoother, richer human-computer interaction experience. Project Astra is based on Google's Gemini model; it can process multimodal signals such as vision and speech simultaneously, and shows strong comprehension, memory, and instant-response capabilities. We observed Project Astra running on at least two hardware devices, a Google Pixel phone and prototype glasses, and we believe AI models are accelerating their deployment across smart devices.

Figure 3: Project Astra demo


Note: The left image shows operation on a smartphone; the right image shows operation on smart glasses.

Source: Google I/O Conference 2024, CICC Research

Device-side application functions have been upgraded. 1) Gemini Nano: the on-device Nano model currently handles only the text modality, but Google announced real-time voice interaction for this summer[2] and video interaction later this year, rounding out its multimodal capabilities. 2) Gems: similar to OpenAI's GPTs, Gems lets users customize AI assistants with specific traits for tasks such as fitness, companionship, cooking, programming, and writing. 3) Android upgrades: Google announced three new AI functions, Circle to Search, AI agents, and on-device model deployment, raising the intelligence level of Android while ensuring privacy and security.

Relying on its ecosystem advantages, Google is actively promoting the integration of AI and applications. 1) Search: AI Overview automatically summarizes content from across the web within search, providing overview, reasoning, planning, and layout functions. 2) Photos: Ask Photos uses natural language to search for photos in an album. 3) Office: Workspace gains work summaries, email Q&A, smart replies, and other functions, using AIGC to automate enterprise workflows and improve office productivity. 4) Multimodal: Google launched large models such as Imagen 3, Music AI Sandbox, and Veo for image, music, and video generation, respectively. Among them, Veo can generate 1080p HD video longer than one minute from prompts spanning multiple modalities, further narrowing the gap with OpenAI's Sora. Google's business today spans smart terminals, the Internet, enterprises, healthcare, autonomous driving, and other industries; as a technology company with an exceptionally broad ecosystem, we believe Google has inherent advantages on the application side. As large-model technology matures, we expect Google to accelerate application adoption.

Comparison: what do Gemini 1.5 Pro and GPT-4o have in common, and how do they differ?

Traditional multimodal large models are often trained separately and then fused together: although one large model can then handle data of different modalities, coordination between modalities is lacking. By contrast, the training corpora of Gemini and GPT-4o include text, image, audio, and video data simultaneously, and all inputs and outputs are processed in the same neural network. Judging from the two companies' demos, the result is a large model that understands multimodal information and the relationships between modalities at the same time. We believe that with Google and OpenAI, two industry leaders, both arriving at natively multimodal large models, the industry is likely to follow suit, and native multimodality may become a future development trend.

Figure 4: Traditional multimodal large model architecture


Source: V7 Labs, CICC Research

Figure 5: Gemini's native multimodal large model


Source: Google, CICC Research

Gemini's context window is larger. At the 2024 I/O conference, Google announced it would expand Gemini 1.5 Pro's context window from 1 million to 2 million tokens, equivalent to 2 hours of video, 22 hours of audio, more than 60,000 lines of code, or more than 1.4 million words of text, far ahead of other large models (Claude 3 offers 200,000 tokens, and GPT-4o only 128,000).

Figure 6: Context window length of Gemini 1.5 Pro, Claude 3, and GPT-4 Turbo


Source: Google, CICC Research

Gemini's pricing is more attractive. Taking a context length of 128,000 tokens as an example, Google's official website lists Gemini 1.5 Pro's input and output prices at $3.5/1M tokens and $10.5/1M tokens, respectively, versus $5/1M tokens for input and $15/1M tokens for output for GPT-4o (128,000-token context window); Gemini 1.5 Pro's call cost is thus 30% lower than GPT-4o's.
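As a quick sanity check on the 30% figure, the arithmetic can be sketched in a few lines. The prices are the list prices quoted above; the 3:1 input/output token split is an illustrative assumption, not vendor data (since both price ratios are 0.7, the split does not change the conclusion):

```python
# Illustrative cost comparison using the list prices quoted in the text
# (USD per 1M tokens). The 75k/25k request mix is an assumption.

def call_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Cost in USD for one request; prices are per 1M tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

gemini_15_pro = (3.5, 10.5)  # input, output ($/1M tokens) at 128k context
gpt_4o = (5.0, 15.0)

g = call_cost(75_000, 25_000, *gemini_15_pro)
o = call_cost(75_000, 25_000, *gpt_4o)
print(f"Gemini 1.5 Pro: ${g:.4f}, GPT-4o: ${o:.4f}, ratio: {g / o:.0%}")
# prints "Gemini 1.5 Pro: $0.5250, GPT-4o: $0.7500, ratio: 70%"
```

At these list prices the per-call cost ratio is 70% regardless of the input/output mix, consistent with the 30% saving cited above.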

Figure 7: Pricing of Gemini 1.5 Pro and GPT-4o


Note: GPT-4o's context window is 128,000 tokens.

Source: Google, OpenAI, CICC Research

GPT-4o emphasizes human-computer interaction innovation in practical application scenarios. The GPT-4o launch event did not dwell on technical details; instead, it devoted most of its time to showing how GPT-4o can be used in realistic scenarios on phones and PCs, with the AI voice assistant playing a central role and performing well in cross-modal human-computer interaction.

GPT-4o's model performance is stronger. According to evaluation data on OpenAI's official website, GPT-4o outperforms Gemini 1.5 Pro on a range of tasks, including text benchmarks (such as MMLU, MATH, and HumanEval) and visual understanding benchmarks (such as MMMU and MathVista). In our view, OpenAI's technology remains ahead of the industry.

Figure 8: Comparison of the performance of text and visual comprehension tests of mainstream large models


Source: OpenAI official website, CICC Research

Terminal hardware: human-computer interaction is transformed, and device-side AI accelerates

As AI development enters the second half, focused on application monetization, delivering AI capabilities to consumers has become a key question. We observed that beyond routine model and technology releases, both OpenAI's GPT-4o event and Google's I/O conference highlighted practical AI scenarios on mobile devices such as phones and PCs. We believe device-side AI applications and consumer-facing monetization may become a new development focus.

AI + consumer electronics terminals: the hardware upgrade trend is clear

AI phones and PCs are drawing near, and the prospects for device-side AI are broadening.

► Interaction innovation: end-to-end multimodal capability frees human-computer interaction from text alone, enriching interaction forms and strengthening synergy with existing phone applications.

► Anthropomorphization of AI voice assistants: low latency, the ability to interrupt at any time, flexible adjustment of output based on instant feedback, and rich emotional color make AI voice assistants more human-like, shedding the old image of a cold assistant that could only answer mechanically, turn by turn.

► Demonstrated AI functions on mobile devices and expanded application scenarios: the combination of Google's Gemini model with the Android ecosystem, and the demonstration of GPT-4o on iPhone, let consumers see how AI can be combined with a phone's operating system, making it possible for AI to call existing apps and even connect across apps, opening richer application scenarios.

► Commercialization prospects: beyond diverse application scenarios, GPT-4o is open to free users; given the huge To C user base of phones and PCs, the broad prospects of device-side AI have attracted greater attention.

Figure 9: GPT-4o can change the tone of voice to answer user questions


Source: GPT-4o press conference, New Zhiyuan, CICC Research

Figure 10: GPT-4o instructs users to do math problems on a tablet


Source: GPT-4o press conference, CICC Research

AI phones: Xiaomi, Samsung, Google, and other manufacturers have successively launched AI phone products. Counterpoint forecasts that global AI-phone penetration will reach about 8% in 2024, with shipments exceeding 100 million units, and about 40% in 2027, with shipments reaching 522 million units.

Exhibit 11: Latest AI phones released by brand


Note: Statistics are as of May 15, 2024 and may be incomplete.

Source: Companies' official websites, CICC Research

AIPC: the launch of AIPCs may trigger a PC replacement cycle. Given AI's potential to improve productivity and drive application innovation, IDC forecasts that AIPC penetration will reach 85% by 2027.

Figure 12: Layout of AIPC products by PC vendors


Source: Lenovo official website, HP official website, Dell official website, Acer official website, Honor official website, CICC Research

GPT-4o hints at the rudiments of spatial computing, with both spatial awareness and user perception. GPT-4o demonstrates initial recognition capability: through the camera it can recognize a handwritten equation, offer clues, and guide the user step by step through the solution with real-time feedback. At the same time, GPT-4o can perceive the user's face, posture, voice, expression, and emotional changes; it understands the interruption habits of human conversation, stopping to listen in time and responding accordingly, generating natural, coherent, non-mechanical dialogue that follows the user's intonation. We believe GPT-4o shows a rudimentary form of spatial computing; although room for improvement remains in 3D reconstruction, spatial awareness, and user perception, a new ecosystem and interaction model combining software and hardware is gradually taking shape.

Figure 13: GPT-4o perceives user facial expressions during a "video call".


Source: OpenAI GPT-4o press conference, CICC Research

Device-side AI's advantage lies in its use of ambient sensory data, which complements spatial computing and may lead AR products toward more application scenarios. AR glasses such as Meta's Ray-Ban Meta have successively added multimodal AI use cases. On the hardware side, chip manufacturers are moving in step; AR's importance keeps rising, and its product definition has shifted from phone accessory to standalone device.

We believe that pushing part of the computing workload down to the device side will become an inevitable path.

Figure 14: Hardware upgrade trend of AI mobile phones


Source: AI Mobile Phone White Paper (IDC & OPPO, 2024), IDC official website, CICC Research

Figure 15: AIPC Hardware Upgrade Trend


Source: Counterpoint, TrendForce, CICC Research

Operating systems & applications: AI enters a new era of real-time interaction; focus on OS upgrades and application prospects

Human-computer interaction is upgraded; real-time interaction expands the scenarios where AI applications can land

Multimodal interaction greatly enriches the possible scenarios for AI applications. GPT-4o's end-to-end interaction capability connects modalities such as voice, text, video, and image, greatly enriching usage scenarios.

Figure 16: The mobile app Be My Eyes, connected to GPT-4o, helps blind users explore their surroundings through the camera with voice output


Source: OpenAI GPT-4o press conference, CICC Research

Figure 17: Google Gemini's "Ask with Video" enables video search


Source: Google I/O 2024, CICC Research

Watch the Apple and Android camps: opening up the underlying system will become a trend

Android ecosystem: Google Gemini amplifies Google's full-ecosystem advantages and is expected to open up the underlying Android system. We believe future interaction between AI and consumers is inseparable from deep integration of large models with phone operating systems, including opening OS-level permissions, cross-app content calling, and unified output. Google has begun to lay this out on the strength of its influence over the Android ecosystem. At this I/O conference, Google demonstrated deep integration of Gemini with its native products, especially at the Android OS level: the locally running multimodal Gemini Nano model will come to Pixel phones, the Gemini app will support real-time voice and video interaction, and Google will launch Gems, a custom AI-assistant feature that can interact with Google's suite of first-party apps. Looking ahead, we expect Google to leverage its strength in the Android ecosystem to accelerate the penetration of AI capabilities into mobile devices, especially Android phones.

Exhibit 18: Google I/O 2024 upgrade to Android


Source: Google I/O Conference 2024, CICC Research

Apple: device-cloud collaboration optimizes the user experience and accelerates terminal intelligence. The GPT-4o demonstration was mainly on iPhones, but also on Macs: in addition to the mobile app, OpenAI launched a desktop app for macOS. Media reports[3] earlier indicated that Apple is in talks with OpenAI and Google on AI cooperation. We believe Apple's generative AI may take a hybrid device-and-cloud approach across phones, tablets, computers, and MR devices: Apple can train small and mid-sized models with the computing power of its A- and M-series chips, and borrow cloud computing power when needed. Apple has accumulated many native apps, such as Music, TV+, Fitness+, and News; we believe that after training large cloud models (such as Ajax), Apple could combine them with users' daily search and usage habits to deliver personalized recommendations. Considering startups Apple has acquired, such as AI Music, we believe Apple could generate customized content based on user preferences and further improve user stickiness.

Figure 19: Apple's AI mode exploration (* indicates that Apple has not officially released it yet, but has potential layout direction in the future)


Source: Apple's official website, OpenAI, Google, Anthropic's official website, CICC Research

Android phone manufacturers: beyond the large-model vendors, Android phone makers are also accelerating development of their own large models. Aiming to take the lead in the coming device-side AI era, Huawei, Xiaomi, OPPO, vivo, Samsung, Transsion, and other manufacturers have launched large models and integrated them into their own operating systems.

Voice assistants make AI agents possible, and traffic entrances may usher in changes

Voice assistants are becoming more anthropomorphic, making AI agents possible. GPT-4o's most eye-catching feature is an AI voice assistant with emotional color and real-time multimodal feedback, and Google's Astra likewise provides multimodal feedback. Looking at the trend in mobile AI, we believe the direction for mobile AI agents is an agent that can independently call mobile applications, giving users a dedicated intelligent phone assistant that breaks app barriers and performs cross-application operations through independent planning and decision-making.

Figure 20: GPT-4o interacts with users through voice assistants


Source: OpenAI GPT-4o press conference, CICC Research

Changes in interaction may shift traffic entry points, profoundly affecting the mobile Internet ecosystem. The AI-agent interaction mode described above will partly replace direct contact between standalone apps and consumers, consolidating needs into the AI agent, which is of far-reaching significance. We believe human-computer interaction is likely to evolve from text toward speech, with multimodal combinations becoming characteristic. Voice assistants are expected to become an important entry point for users to obtain information and interact, even directly handling content screening and generation. In the long run, as interactions occur through cross-app calls, the gateway role of individual apps and app stores will weaken, and the current business model of the mobile Internet ecosystem may change.

Cloud computing hardware: utilization comes first, and inference demand is driving AI Infra into a period of deep optimization

Some of GPT-4o's functions are free and open, and Gemini's capability improvements may require lower unit computing costs; AI infra faces significant optimization. Although GPT-4o remains at roughly GPT-4-level model capability, this release greatly expands device-side applications; meanwhile, within the free usage quota, interaction latency drops to as little as 232 milliseconds, close to human response time. Gemini 1.5 Flash is a new model focused on optimized response time, speed, and cost-effectiveness. Functionally, we believe the push for applications and lower interaction latency places higher demands on the inference capabilities of cloud computing chips.

Figure 21: GPT-4o opens some of its features to users for free


Source: OpenAI GPT-4o press conference, CICC Research

Accelerating application development is tilting computing resources toward inference tasks. In the inference process, the prefill phase demands high single-card compute, while memory bandwidth is the limiting factor in the decode phase. Optimizing a computing hardware system for inference can be approached in two ways: directly, by upgrading hardware performance; or, since every second of GPU time is a cost, by first measuring what bottlenecks system efficiency for the actual inference task and applying targeted engineering optimizations based on the metrics the service is sensitive to. In practice, "trading compute for memory" or "trading memory for compute" is mostly used to improve hardware utilization (compute utilization, MFU, and memory-bandwidth utilization, MBU), reducing latency, increasing throughput, and lowering inference costs.
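Why the decode phase is memory-bandwidth bound can be shown with a back-of-the-envelope estimate: each generated token must stream all model weights through the memory system, so bandwidth caps single-stream throughput. The figures below (a 70B-parameter model, 3.35TB/s of HBM bandwidth) are illustrative assumptions only, and the sketch ignores KV-cache traffic and batching:

```python
# Back-of-the-envelope sketch: decode speed ceiling if memory-bound.
# All numbers are illustrative assumptions, not vendor figures.

def decode_tokens_per_sec(param_count: float, bytes_per_param: float,
                          mem_bandwidth_gbs: float) -> float:
    """Upper bound on single-stream decode tokens/s when every token
    must stream all weights through memory."""
    bytes_per_token = param_count * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

# Assumed: 70B parameters, 3350 GB/s of HBM bandwidth
fp16 = decode_tokens_per_sec(70e9, 2.0, 3350)  # FP16 weights
int8 = decode_tokens_per_sec(70e9, 1.0, 3350)  # 8-bit quantized weights

print(f"FP16: ~{fp16:.1f} tok/s, INT8: ~{int8:.1f} tok/s per stream")
# prints "FP16: ~23.9 tok/s, INT8: ~47.9 tok/s per stream"
```

The estimate also illustrates why low-precision (quantized) inference, mentioned earlier, is a mainstream optimization: halving bytes per parameter roughly doubles the memory-bound decode ceiling.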

Figure 22: Detailed explanation of the inference optimization method of large models


Source: Nvidia's official website, CICC Research

From model innovation to accelerated application deployment, the capital expenditure mix may tilt toward the edge inference side. In 2023, total capital expenditure of the four largest North American cloud vendors (Amazon, Microsoft, Google, and Meta) reached $147.45 billion; combined with their guidance, the current Bloomberg consensus expects 2024 total capex to rise 33% YoY to $196.61 billion, with AIGC development driving the increase. Combined with the GPT-4o launch, we see AI's development thread gradually shifting from model innovation to device-side deployment of large models, and the resulting change in computing-resource needs may tilt the capex structure toward the edge inference side.

Hardware optimizations

NVIDIA launched the GB200 NVL72, which delivers 30x the real-time LLM inference performance of the H100 for trillion-parameter language models and is expected to help the industry explore application development. Looking ahead, we are optimistic that the GB200 NVL72, with its integrated high compute, excellent interconnect, and large memory bandwidth, will continue to support application exploration and provide cost-effective computing solutions for inference tasks.

Figure 23: GB200 NVL72 vs. H100 inference speed


Source: Nvidia's official website, CICC Research

Google's TPU has been iteratively upgraded, deepening its chip self-development. TPU v6 Trillium was officially launched; Google says its single-chip peak compute performance is 4.7 times that of TPU v5e, with energy efficiency more than 67% higher. Internally, TPU v6 enlarges the matrix multiply unit (MXU) and raises clock speeds, doubles HBM capacity and bandwidth, doubles inter-chip interconnect bandwidth, and adds a dedicated SparseCore accelerator to optimize workloads, ultimately achieving significant performance and energy-efficiency gains.

Figure 24: Google TPU v6 Trillium


Source: Google I/O Conference 2024, CICC Research

We believe that because GPU parallel computing in AI data centers requires high-frequency communication of intermediate results, and communication efficiency affects overall cluster performance, there are high requirements for communication bandwidth, network latency, network stability, and automated deployment.

In terms of C2C (chip-to-chip) interconnect, the GB200 NVL72 leverages the fully interconnected fifth-generation NVLink network: a single GB200 Tensor Core GPU supports up to 18 NVLink connections at 100GB/s each, for a total bidirectional bandwidth of 1.8TB/s, double that of fourth-generation NVLink and more than 11 times the 160GB/s of the first generation released in 2014.
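The headline bandwidth figure follows directly from the link count and per-link speed quoted above; a quick arithmetic sanity check (using the 18-link, 100GB/s-per-link figures from the text):

```python
# Sanity check of the NVLink bandwidth figures quoted above.
links_per_gpu = 18        # fifth-generation NVLink links per GB200 GPU
bw_per_link_gb_s = 100    # bidirectional bandwidth per link, in GB/s

total_bw_gb_s = links_per_gpu * bw_per_link_gb_s
print(total_bw_gb_s)      # 1800 GB/s, i.e. 1.8 TB/s

# Ratio versus the 160 GB/s of first-generation NVLink (2014)
print(total_bw_gb_s / 160)  # 11.25x
```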

Figure 25: NVLink bidirectional bandwidth upgraded to 1.8TB/s


Source: Nvidia's official website, CICC Research

NVSwitch is an extension of NVLink technology that solves the problem of uneven all-to-all communication between GPUs. The next-generation NVLink Switch system can connect up to 576 GPUs with a total connection bandwidth of up to 1PB/s.

In terms of B2B (board-to-board) interconnect, 200G SerDes enables the X800 series switches to upgrade both bandwidth and port speed. The NVIDIA X800 series switches and ConnectX-8 NICs represent the highest level of B2B interconnection in the data center. The ConnectX-8 SmartNIC is upgraded to 800Gb/s and supports up to 48 lanes of PCIe 6 connectivity. It combines communication optimization with compute offload functions and can work with the switches to improve B2B transmission efficiency.

Google's liquid-cooled data centers are steadily advancing. In 2018, Google introduced liquid cooling into its data centers with the release of the TPU 3.0 Pod, and to date roughly 1GW of its data center capacity has deployed liquid cooling systems. We believe that liquid cooling, as a more efficient heat dissipation method, is expected to replace traditional air cooling by increasing computing power deployment density and reducing system power consumption, and its penetration rate is expected to keep rising.

Optimization of systems engineering

1) GPU memory optimization: during inference, in addition to the per-layer model weights, the KV Cache (the Key and Value matrices in the Attention block, cached to avoid repeated computation in the decode stage) also occupies a large amount of GPU memory. Most current KV Cache optimization work is engineering-focused, but we have recently seen architectural innovation in the Attention block itself, such as the MLA mechanism introduced in DeepSeek's latest V2 large model, which achieves a substantial KV Cache reduction while keeping model performance stable. The industry is searching for better ways to cut KV Cache overhead in order to support larger batch sizes and thus higher hardware utilization.
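The KV Cache footprint grows linearly with batch size, sequence length, and layer count, which is why it quickly dominates GPU memory at long context lengths. A rough sizing sketch; the model configuration below (32 layers, 32 KV heads, head dimension 128, FP16) is an illustrative assumption for a 7B-class model, not a measured figure:

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Memory held by the KV Cache: two tensors (K and V) per layer,
    each of shape [batch, n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * dtype_bytes

# Illustrative 7B-class configuration (assumed, not from the article).
gib = kv_cache_bytes(batch=8, seq_len=4096, n_layers=32,
                     n_kv_heads=32, head_dim=128) / 1024**3
print(f"{gib:.1f} GiB")  # 16.0 GiB
```

Techniques like grouped-query attention reduce the effective `n_kv_heads`, while MLA-style approaches compress the cached K/V representation itself; both shrink this footprint and free memory for larger batches.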

2) Operator fusion / operator optimization: during training, engineers generally work with small operators so they can repeatedly inspect the inputs and outputs of each step while tuning the model; for inference, once training is complete, larger fused operators are preferred for better hardware execution efficiency. At the operator-implementation level (i.e., how computing logic is mapped onto the chip architecture), frequently executed operators can also be identified and compiled with more optimized strategies on the physical GPU, raising GPU utilization per unit time.
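The benefit of fusion comes from avoiding round-trips of intermediate results to memory. A minimal NumPy sketch of the idea (a CPU illustration, not a real GPU kernel): the unfused path materializes each intermediate, while the fused expression is what a compiler would lower into a single kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 256)).astype(np.float32)
w = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal(256).astype(np.float32)

# Unfused: each step writes a named intermediate back to memory.
t1 = x @ w
t2 = t1 + b
y_unfused = np.maximum(t2, 0.0)      # ReLU

# "Fused" equivalent: bias-add and ReLU applied in one pass over the
# matmul output. On a GPU, operator fusion would emit this as a single
# kernel, cutting memory traffic for the intermediates.
y_fused = np.maximum(x @ w + b, 0.0)

assert np.allclose(y_unfused, y_fused)
```

In practice this kind of fusion is done automatically by compilers such as XLA or `torch.compile` rather than by hand.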

3) Low-precision (quantization) acceleration: quantizing weights is one of the main ways to speed up inference. In practice, FP16 weights generally deliver accuracy similar to FP32, meaning that by quantizing weights to FP16 for inference we obtain essentially the same results with only half the GPU memory; quantizing to even lower precision such as INT8/INT4 yields further savings. We note that dedicated processors often include acceleration units designed specifically for integer computation, which can achieve better compute utilization and energy efficiency.

4) Distributed inference: although inference requires far less computation than training, as mentioned earlier, some large-model inference scenarios are memory-access bound (especially with the longer context windows of upgraded models). To improve GPU utilization, Tensor Parallelism (TP) is therefore used to partition LLM model parameters across devices, reducing the time each device spends reading parameters from GPU memory.
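A minimal NumPy sketch of the core idea behind tensor parallelism, simulating devices with array shards: each "device" holds one column slice of the weight matrix, computes its partial output independently, and the results are concatenated (the all-gather step in a real system). The 4-way split and matrix sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 512)).astype(np.float32)     # activations
w = rng.standard_normal((512, 1024)).astype(np.float32)  # full weight

# Column-wise tensor parallelism: each of 4 "devices" holds a 512x256
# shard of W, so per-device weight reads (and memory) drop 4x.
shards = np.split(w, 4, axis=1)
partials = [x @ s for s in shards]          # computed independently
y_tp = np.concatenate(partials, axis=1)     # all-gather of outputs

assert np.allclose(y_tp, x @ w, atol=1e-3)
```

Row-wise splits work analogously but require an all-reduce over partial sums instead of a concatenation; real frameworks interleave the two across consecutive layers to minimize communication.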

Risk Warning

Slower-than-expected implementation of AI algorithms: ChatGPT/GPT-4 and other models are not open source, and data security problems such as private data leakage, model theft, data reconstruction, and prompt injection attacks, as well as answer accuracy and ethical issues, threaten the implementation of model applications.

Uncertain AI monetization model: although the emergence of AI may change the production relations of digital content, 1) on the ToC side, apart from GPT-4, other AI models remain in a free-trial mode for users, as do applications represented by Microsoft 365 and New Bing, so the charging model is still uncertain; 2) on the ToB side, the ChatGPT and GPT-4 API interfaces accessed by a large number of start-ups currently charge low fees, and future pricing standards and models are likewise uncertain.

Sluggish demand for consumer electronics smart terminals: affected by the overall macro economy, international geopolitical conflicts, and the semiconductor down-cycle, the consumer electronics market has been hit hard, with demand at home and abroad weakening to varying degrees. If the recovery of consumer electronics demand in 2024 falls short of expectations, we believe hardware's progress in benefiting from AI may also fall short of expectations.

[1]https://openai.com/index/hello-gpt-4o/

[2]https://io.google/2024/intl/zh/

[3]https://www.japantimes.co.jp/business/2024/04/27/tech/iphone-openai-ai-features/

Article source:

This article is excerpted from: "The Top of the AI Wave Series: AI End-to-End Acceleration and Opening a New Era of Real-time Interaction" released on May 16, 2024

Peng Hu Analyst SAC License No.: S0080521020001 SFC CE Ref: BRE806

Wen Hanjing Analyst SAC License No.: S0080521070003 SFC CE Ref: BSJ666

Cheng Qiaosheng Analyst SAC License No.: S0080521060004

Li Shiwen Analyst SAC License No.: S0080521070008 SFC CE Ref: BRG963

Huang Tianqing Analyst SAC License No.: S0080523060005 SFC CE Ref: BTL932

Kong Yang Contact SAC License Number: S0080122110018

Zha Yujie Contact SAC License Number: S0080122120012

Li Chengning Analyst SAC License No.: S0080522050003 SFC CE Ref: BSM544

Shi Xiaobin Analyst SAC License No.: S0080521030001

Jia Shunhe Analyst SAC License No.: S0080522060002 SFC CE Ref: BTN002

Chen Hao Analyst SAC License No.: S0080520120009 SFC CE Ref: BQS925

Legal Notices

