
AI large computing power chip industry report: a hundred boats vie on the stream, innovators lead

Author: Titanium Media APP

Image source: @VisualChina

Text | Insight into Cyprus

From ChatGPT's debut on November 30, 2022 to the release of the 360 Smart Brain Large Model 2.0 on June 13, 2023, the global AI community has been obsessed with large models for more than seven months. Large models have sprung up like mushrooms, dropping "bombs" on the AI market: office work, healthcare, education, and manufacturing all urgently need AI empowerment.

AI applications number in the thousands, but building the large model itself is what ultimately matters.

In the "world" of large models, the algorithm is the "relations of production": the rules and methods for processing data and information. Computing power is the "productive force": it determines the speed and scale of data processing and algorithm training. Data is the "means of production": high-quality data is the nutrient that drives the continuous iteration of algorithms. Among these, computing power is the precondition that keeps the large model turning.

What we all know is that large models place unprecedented demands on computing power. Concretely, according to NVIDIA, before large models built on the Transformer architecture, demand for computing power grew roughly 8x every two years; since the adoption of the Transformer, demand has grown roughly 275x every two years. Accordingly, the 530B-parameter Megatron-Turing NLG model swallows more than 1 billion FLOPS of computing power.
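To make the growth rates quoted above concrete, here is a minimal back-of-the-envelope sketch. The 8x and 275x per-two-year multipliers come from the NVIDIA figures cited in this report; the time horizons are illustrative.

```python
# Back-of-the-envelope: how fast compute demand grows under the two regimes
# quoted above (8x per two years pre-Transformer, 275x per two years since).
# The multipliers come from the article's NVIDIA figures; the horizons are illustrative.

def growth_factor(per_two_years: float, years: float) -> float:
    """Total growth over `years`, given a fixed multiplier per two-year period."""
    return per_two_years ** (years / 2)

for label, multiplier in [("pre-Transformer (8x / 2yr)", 8), ("Transformer era (275x / 2yr)", 275)]:
    print(f"{label}: x{growth_factor(multiplier, 4):,.0f} over 4 years, "
          f"x{growth_factor(multiplier, 6):,.0f} over 6 years")
```

With these multipliers, the Transformer-era curve outruns the earlier one by several orders of magnitude within just a few years, which is the point the report is making.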


(Computing power iteration of AI algorithms across model types. Source: Gelonghui)

As the "brain" of large models, AI chips are the basic prerequisite for supporting the efficient training and application of systems like ChatGPT. Ensuring an efficient and sufficient supply of computing power is the urgent problem that makers of AI large computing power chips need to solve.

While GPT-4 and other large models make enormous demands of chip manufacturers, they also bring good news, especially for start-up chip makers: the importance of the software ecosystem is declining.

When the technology was not yet mature, researchers could only start by solving one specific problem, and small models with fewer than a million parameters were born. For example, DeepMind, the Google-owned AI company, had AlphaGo specifically "learn" the moves of millions of games played by human professionals.

As small models multiplied, adapting hardware such as chips to them became pressing. So when NVIDIA launched CUDA as a unified ecosystem, GPU + CUDA quickly won the recognition of the computer science community and became the standard configuration for AI development.

Today's emerging large models are multimodal: they can handle text, images, programming, and other problems, and can cover multiple verticals such as office work, education, and healthcare. This also means that adapting to the mainstream ecosystem is no longer the only choice: as large models' demand for chips skyrockets, a chip maker may only need to adapt to one or two large models to match the order volume that many small models used to bring.

In other words, the emergence of ChatGPT gives start-up chip makers a chance to overtake on the curve. It also means the AI chip market structure is about to change dramatically: no longer a one-man show by a few manufacturers, but a contest among a crowd of innovators.

This report sorts out the development of the AI chip industry and the state of its players, summarizes the paths by which players can raise computing power in the era of large computing power, and, on that basis, looks ahead to the development trends of AI large computing power chips.

Domestic AI chips are moving towards the AI 3.0 era

At this stage, AI chips are divided according to the type of technical architecture, mainly including GPGPU, FPGA, ASIC represented by VPU and TPU, and storage and computing integrated chips.


According to its location in the network, AI chips can be divided into cloud AI chips, edge and terminal AI chips;

The cloud mainly deploys AI training chips and inference chips with high computing power to undertake training and inference tasks, such as intelligent data analysis and model training tasks.

The edge and terminal mainly deploy inference chips to undertake inference tasks, and need to independently complete data collection, environmental awareness, human-computer interaction, and some inference decision control tasks.


According to its goals in practice, it can be divided into training chips and inference chips:


Throughout the history of the development of AI chips in China, the localization process of AI chips is roughly divided into three eras.

The 1.0 era belongs to the era of ASIC architecture

After the Internet wave raised the curtain on AI chips in 2000, the four factors of data, algorithms, computing power, and application scenarios gradually matured around 2010 and officially triggered explosive growth in the AI industry. Shenwei, Phytium, Zhaoxin, Loongson, Soulcore, and cloud AI chips came out one after another, marking the official start of domestic AI chips.

In May 2016, when Google revealed that the secret weapon behind AlphaGo was the TPU, ASICs immediately became a hot commodity. By 2018, domestic manufacturers such as Cambricon and Horizon Robotics had followed suit and launched ASIC-architecture chips for cloud AI applications, opening the 1.0 era of domestic AI chips.


ASIC chips can achieve better performance and lower power consumption in a specific scenario and when the algorithm is fixed, which meets the pursuit of extreme computing power and energy efficiency of enterprises.

Therefore, manufacturers at the time worked mainly through bundled cooperation: most chip makers sought out large customers to land "specific scenarios," while large players with comprehensive ecosystems chose to go it alone.

AI chip makers such as Horizon Robotics and Neon Technology each focused on a subdivision of the AI chip market and used the "big customer bundling" model to enter large customers' supply chains.


While these manufacturers were binding themselves to large customers for coordinated development, Alibaba, a large player with its own ecosystem, established the wholly-owned chip company Pingtouge (T-Head), focusing on AI and quantum computing.

In 2019, Pingtouge released its first AI chip, the Hanguang 800, built on an ASIC architecture for cloud inference. According to Alibaba, one Hanguang 800 delivers the computing power of 10 GPUs: its inference performance reaches 78,563 IPS with an energy efficiency ratio of 500 IPS/W, and compared with traditional GPU computing power its cost performance is improved by 100%.

In the 1.0 era, newly founded domestic chip makers chose to bind themselves to large customers, while large players with comprehensive ecosystems chose to develop in-house; together they set out to explore AI chip computing power.

In the 2.0 era, the more versatile GPGPU "leads the way"

Although ASICs have extreme computing power and energy efficiency, there are also problems such as limited application scenarios, reliance on self-built ecosystems, difficult customer migration, and long learning curves.

As a result, the more versatile GPGPU (general-purpose graphics processing unit), through continuous iteration and development, became the latest direction in AI computing and the guide of the AI chip 2.0 era.

Since 2020, the GPGPU architecture represented by NVIDIA has shown strong results. Comparing NVIDIA's three most recent generations of flagship products in terms of FP16 tensor computing power, performance has roughly doubled generation over generation while the cost of computing power has declined.


As a result, many domestic manufacturers laid out GPGPU chips, focused on CUDA compatibility, and tested the limits of AI computing power chips. Since 2020, new forces such as Zhuhai Core Power, Bicheng Technology, Muxi, Denglin Technology, Tianzhixin, and Hanbo Semiconductor have gathered strength, and their consistent playbook has been: develop their own architecture, follow the mainstream ecosystem, and enter through edge-side scenarios.


In the first two eras, domestic AI chip manufacturers are trying their best to follow the trend of the times, follow the pace of international manufacturers, and solve the challenges of AI computing power chips by developing the latest chips.

The visible change is that, in the 2.0 era, domestic AI chip makers awakened an independent consciousness and tried to break through with architectures of their own.

In the 3.0 era, storage-computing integrated chips may be the best choice for GPT-4 and other large models

The weak versatility of ASIC chips struggles to cope with an endless stream of downstream applications, GPGPUs are constrained by high power consumption and low computing power utilization, and large models place unprecedented requirements on computing power: at present, the large computing power required by large models is at least 1,000 TOPS.

Take GPT-3, the pre-trained language model released in 2020. It used the most advanced GPU of 2020, the NVIDIA A100, with 624 TOPS of computing power. In 2023, as models iterate in the pre-training stage and demand explodes in the deployment stage, future models will require chips with computing power of at least 1,000 TOPS.

Another example is autonomous driving. According to Caitong Securities Research Institute, the single-chip computing power required for autonomous driving will need to reach 1,000+ TOPS in the future. In April 2021, NVIDIA released the DRIVE Atlan chip with 1,000 TOPS of computing power; it has since gone further with the Thor chip, reaching 2,000 TOPS.

As a result, the industry urgently needs new architectures, new processes, new materials, and new packaging to break through the ceiling of computing power. In addition, the increasingly tense geopolitical relationship has undoubtedly posed new challenges to AI large-computing chip manufacturers that rely heavily on advanced process technology.

In these contexts, a group of start-ups established from 2017 to 2021 chose to break away from the traditional von Neumann architecture and lay out emerging technologies such as storage and computing integration, and the era of China's AI chip 3.0 officially kicked off.

At present, the integration of storage and computing is on the rise:

In academia, the number of ISSCC papers related to storage-computing integration has grown rapidly, from 6 in 2020 to 19 in 2023; among them, papers on in-memory computing rose quickly between 2021 and 2022, reaching 4 in 2022.

In industry, the giants have laid out storage-computing integration, and nearly a dozen start-ups in China are betting on this architecture:

At the end of Tesla's 2023 Investor Day presentation, Tesla's Dojo supercomputing center and storage-computing integrated chip were unveiled in turn. Earlier, Samsung, Alibaba's DAMO Academy, and AMD had also laid out and launched related products: DAMO Academy said that compared with traditional CPU computing systems, storage-computing integrated chips improve performance by more than 10x and energy efficiency by more than 300x; Samsung said that, compared with GPU accelerators without HBM-PIM, GPU accelerators equipped with HBM-PIM reduce energy consumption by about 2,100 GWh per year.

At present, more than ten domestic start-ups, such as Yizhu Technology, Zhicun Technology, Pingxin Technology, and Jiutian Ruixin, are pursuing AI computing power with storage-computing integrated architectures; among them, Yizhu Technology and Qianxin Technology lean toward large computing power scenarios such as data centers.


At this stage, industry insiders say that storage-computing integration is expected to become the third computing power architecture after the CPU and the GPU.

The confidence behind this claim is that storage-computing integration theoretically offers a high energy-efficiency ratio, can bypass the blockade on advanced processes, and combines stronger versatility with better cost performance, leaving huge room for computing power to grow.

On this basis, new memories can help storage-computing integration realize the above advantages even better. Mature memories usable for storage-computing integration currently include NOR Flash, SRAM, DRAM, RRAM, MRAM, and so on. Among them, RRAM offers low power consumption, high computational accuracy, high energy efficiency, and manufacturing compatibility with CMOS processes:


The new memory technology RRAM has already landed: in the first half of 2022, the domestic start-up Xinyuan Semiconductor announced that the first RRAM 12-inch pilot production line in mainland China had officially passed installation acceptance and achieved mass production and commercial use in industrial control. According to Dr. Qiu Shengdi, CTO of Xinyuan Semiconductor, the yield of Xinyuan's RRAM products has exceeded 93%.

With the mass production of new memory devices, AI chips that integrate storage and computing have entered the AI large-computing power chip landing competition.

Whether it is a traditional computing chip or a storage-computing integrated chip, when actually accelerating AI computing, it often needs to deal with a large number of computing tasks in non-AI accelerated computing fields such as logic computing and video encoding and decoding. As multi-modality becomes the general trend of the era of large models, AI chips will need to process text, voice, images, videos and other types of data in the future.

In this regard, the start-up Yizhu Technology was the first to propose an ultra-heterogeneous, storage-computing integrated path to AI large computing power. Yizhu's vision is that if emerging memristor technology (RRAM), storage-computing integrated architecture, chiplet technology, 3D packaging, and other technologies can be combined, chips can achieve greater effective computing power, more parameters, a higher energy-efficiency ratio, and better software compatibility, thereby raising the development ceiling of AI large computing power chips.

Standing at the door of the 3.0 era, the independent consciousness of domestic AI large computing power chip makers has erupted, creating the possibility for China's AI large computing power chips to overtake on the curve.

The development momentum of the AI chip market probably comes from the following factors.

Central and local governments are scrambling to provide sufficient computing power

In February 2023, the central government issued a number of relevant reports and layout plans emphasizing the coordination of computing power under "East Data, West Computing," and a concrete move has already landed: the East Data, West Computing integrated service platform.


At the local level, Chengdu released "computing power coupons" in January 2023, sharing government computing power resources with computing power intermediaries, technology-based SMEs and makers, research institutions, universities, and others, effectively improving computing power utilization. In March 2023, Beijing issued opinions on accelerating the implementation of computing power, speeding up the construction of infrastructure such as computing centers, computing power hubs, the industrial Internet, and the Internet of Things.


Guided by national and local policies, AI manufacturers have been building supercomputing and intelligent computing centers. What is different from the past is that this year saw the birth of the first market-oriented operating model for computing power, and the scale of intelligent computing center capacity has made a qualitative leap: according to the Intelligent Computing Center Innovation and Development Guide jointly issued by the State Information Center and relevant departments, more than 30 cities across the country are building or planning to build intelligent computing centers.


The layout planning of the AI chip industry continues to land

It can be seen that policy on AI chips has moved from the planning stage of the 13th Five-Year Plan to the implementation stage of the 14th Five-Year Plan: improving AI chip R&D and promoting AI applications.


At the same time, all localities have clearly proposed to strengthen the layout of the AI chip industry. Among them, Zhejiang, Guangdong, Jiangsu and other provinces have proposed the specific development direction of the field of artificial intelligence chips by 2025.


The integration of storage and computing is becoming a new opportunity for the local computing power industry

The integration of storage and computing is becoming a new opportunity for the innovation and development of Shenzhen's computing power industry chain, and it is actively landing.

On April 2, 2023, at the New Generation Information Technology Industry Development Forum of the 2nd China Industrial Chain Innovation and Development Summit, Yang Yuchao, deputy dean of the School of Information Engineering at Peking University's Institute of Advanced Research, said that Shenzhen will build on its relatively complete industrial chain cluster and tackle the challenges of industrializing storage-computing integration from four aspects: advanced processes and packaging, innovative circuits and architectures, the EDA toolchain, and the software and algorithm ecosystem.

In April this year, China's large model officially broke out, and in the future, the demand for AI large-computing power chips will only increase.


Existing large models are already demanding NVIDIA A100-class large computing power chips in huge quantities:


Therefore, AI players such as SenseTime are turning to domestic AI chips: on April 10, 2023, SenseTime disclosed that localized AI chips already account for 10% of the total chips it uses. This will undoubtedly accelerate the growth of domestic AI chip makers.

NVIDIA has said that it will move from the GPU architecture toward "GPU + DPU super-heterogeneity": launching NVLink-C2C, supporting UCIe, chiplets, and 3D packaging, and launching the 2,000 TOPS "super-heterogeneous" Thor chip.

AMD has said that hardware breakthroughs will become harder in the future and that it will move toward "system-level innovation," improving performance through collaborative design across multiple links upstream and downstream of the overall design.

The hundred-billion-dollar AI chip market is hot in 2023

The overall artificial intelligence industry chain is basically divided into three levels: basic layer, technical layer and application layer:

The basic layer includes AI chips, smart sensors, cloud computing, etc.; The technical layer includes machine learning, computer vision, natural language processing, etc.; The application layer includes robots, drones, smart healthcare, smart transportation, smart finance, smart home, smart education, smart security, etc.


As the foundation of the development of the artificial intelligence industry, the basic layer provides data and computing power support for artificial intelligence, among which AI chips are the basis of artificial intelligence computing power.

While the AI industry is not yet mature, basic layer enterprises currently capture the most value: they account for 83% of enterprises in China's AI industry chain, versus 5% for the technology layer and 12% for the application layer.


The base layer determines whether the building is stable, while the downstream application level determines the height of the building. At the application layer, smart terminals such as intelligent robots and drones have unlimited potential, and there is a lot of gold that can be mined in the fields of smart cities and smart healthcare. At present, the scale of the mainland intelligent robot market continues to grow rapidly.

Data show that from 2017 to 2021, the market size of mainland intelligent robots increased from 44.8 billion yuan to 99.4 billion yuan, with an average annual compound growth rate of 22.05% during the period, and its market size is expected to reach 130 billion yuan in 2023.


According to statistics from the China Academy of Information and Communications Technology, China's smart city market has maintained a growth of more than 30% in recent years, reaching 21.1 trillion yuan in 2021 and is expected to reach 28.6 trillion yuan in 2023.


In the 100 billion dollar market, AI chips are infinitely attractive

Under the wave of global digitalization and intelligence, the technology of the technical layer is constantly iterating: autonomous driving, image recognition, computing and other technologies are deepening their application in various fields; At the same time, the application layer of IoT devices is constantly enriching: industrial robots, AGV/AMR, smart phones, smart speakers, smart cameras, etc.

This will undoubtedly promote the rapid growth of the AI chip and technology market at the basic layer. According to China Insights Consulting, the global AI chip market reached $96 billion in 2022 and is expected to reach $308.9 billion in 2027, with a compound annual growth rate of 23% from 2022 to 2027:


The domestic AI chip market is even hotter: according to China Insights Consulting, China's AI chip market reached US$31.9 billion in 2022 and is expected to reach US$115 billion in 2027, a compound annual growth rate of 29.2% from 2022 to 2027.


In 2021, the AI chip track caught the wind

With growing downstream demand from the security and automotive markets, plus continued U.S. sanctions on domestic manufacturers since 2019, the domestic AI chip track caught the wind in 2021. In that year, investors competed to pick the most promising players in China's AI chip market in order to secure a say in the future chip market. Although investment enthusiasm declined in 2022, the total still exceeded 10 billion yuan.


(Overall financing of China's AI chip industry, 2016-2023. Source: Prospective Economist APP)

Little financing after the C round: the AI chip market is still in its infancy

By analyzing the investment rounds, it is found that the AI chip market is still in its infancy: the current financing rounds of the AI chip industry are still in the early stages, and the number of financing after the C round is small.


(Investment and financing rounds in China's smart chip industry, 2016-2023. Source: Prospective Economist APP)

Storage-computing integration becomes a hot commodity

Among the sub-tracks, GPUs command the highest valuations: GPU players such as Moore Threads have raised more than 1 billion yuan and taken the "MVP";


The storage-computing integration track has the largest number of funded companies, with seven players, such as Yizhu Technology and Zhicun Technology, favored by capital. Notably, four start-ups on this track, Yizhu Technology, Zhicun Technology, Pingxin Technology, and Houmo Intelligent, have received financing for two consecutive years.


The domestic AI large computing power track: how many players are there?

At present, 1.0-era players such as Cambricon and Pingtouge have become listed companies with high-quality AI computing power chips; unlisted AI computing chip companies that emerged in the 2.0 era, such as Bicheng Technology, Denglin Technology, and Tianzhixin, continue to push on the product side; and in the 3.0 era, start-ups such as Qianxin Technology and Yizhu Technology are seeking breakthroughs with storage-computing integrated architectures.


At present, most AI chip companies target small computing power scenarios on the edge and center sides, such as smart security, smart cities, and smart healthcare. Bicheng Technology, Pingtouge, and Yizhu Technology can cover large computing power scenarios on both the edge and center sides; among the new batch of start-ups, Yizhu Technology has made the bold attempt to address large computing power scenarios with a storage-computing integrated architecture.

Therefore, we classify according to architecture and application scenarios to present the following panorama of AI computing chip midstream manufacturers:


ChatGPT's boom has set off a huge wave in the AI industry, and domestic AI chips are entering the 3.0 era. In the 3.0 era of large models, AI large computing power chips are urgently needed to supply enough computing power to keep increasingly heavy large models turning.

Large models are prevalent, how do chip manufacturers solve the problem of large computing power?

Computing power, that is, national power

With the opening of the "metaverse" era and the onrush of GPT-4 and other large models, data traffic will see explosive growth. IDC forecasts that global computing power will grow at a rate of more than 50% over the next five years, reaching a total of 3,300 EFLOPS by 2025. By 2025, the number of IoT devices worldwide will exceed 40 billion, generating nearly 80 ZB of data, and more than half of that data will need to be processed by terminal or edge computing power.


(Future growth of global computing power demand Source: China Galaxy Securities Research Institute)


(The growth rate of global computing power is significantly lagging behind the growth of data volume Source: China Galaxy Securities Research Institute)

With data volumes skyrocketing, countries urgently need computing power to keep that data flowing, and competition for computing power between nations has officially begun. In fact, it is far more than a battle over computing power; behind it lies a contest of national strength.

In March 2022, the 2021-2022 Global Computing Power Index Assessment Report, jointly compiled by IDC, Inspur Information, and Tsinghua University's Global Industry Research Institute, revealed the basic relationship between computing power and national strength today:

The scale of a country's computing power is significantly and positively correlated with its level of economic development: the larger the computing power, the higher the economic development. For every 1-point increase in the computing power index, the digital economy and GDP rise by 3.5‰ and 1.8‰ respectively. The United States and China score 77 and 70 points on the computing power index, significantly ahead of other countries.
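As a rough illustration of the elasticities quoted above (a 1-point rise in the computing power index corresponding to +3.5‰ for the digital economy and +1.8‰ for GDP), the sketch below applies them to hypothetical inputs; the 5-point gain and the GDP base are made-up numbers, not figures from the report.

```python
# Worked example of the relationship quoted above: +1 point of computing power
# index -> digital economy +3.5 per mille, GDP +1.8 per mille.
# The 5-point index gain and the GDP base are hypothetical inputs.

DIGITAL_ECONOMY_ELASTICITY = 3.5e-3  # per index point (from the cited report)
GDP_ELASTICITY = 1.8e-3              # per index point (from the cited report)

index_increase = 5                   # hypothetical improvement in the index
gdp_base_trillion_usd = 17           # hypothetical GDP base

gdp_uplift = gdp_base_trillion_usd * GDP_ELASTICITY * index_increase
print(f"A {index_increase}-point index gain implies roughly "
      f"{gdp_uplift * 1000:.0f} billion USD of additional GDP "
      f"({GDP_ELASTICITY * index_increase:.1%} uplift).")
```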


There are many scenarios, and different computing power scenarios have different requirements for chips

From small headphones, mobile phones, PCs, to automobiles, the Internet, artificial intelligence (AI), data centers, supercomputers, aerospace rockets, etc., "computing power" plays a fundamental core role. Different computing power scenarios have different requirements for chips:


It can be seen that data centers, with their diverse algorithms and fast iteration, place particularly high requirements on chips: not only high computing power, but also low power consumption, low cost, high reliability, and high versatility.

Data center construction is imminent

Among the many application scenarios, data centers are particularly important. As an AI infrastructure, the data center carries multiple applications of center-side and edge-side computing power:

1. The national data center cluster supports industrial Internet, financial securities, disaster early warning, telemedicine, video calling, and artificial intelligence reasoning.

2. As the "edge" end of computing power, the data center in the city serves high-frequency trading in the financial market, VR/AR, ultra-high-definition video, Internet of Vehicles, networked drones, smart power, smart factory, intelligent security, etc.

Nowadays, the battle for computing power, and even national strength, has begun.

U.S. sanctions targeting China's data centers, intelligent computing centers, and supercomputing centers began in 2021: in April 2021, the U.S. Department of Commerce added Chinese supercomputing entities, including the National Supercomputing Centers in Jinan, Shenzhen, Wuxi, and Zhengzhou, to the Entity List.

Driven by downstream demand growth, geopolitics, and other factors, China's data center buildout was also quickly put on the agenda: in May 2021, the state proposed the "East Data, West Computing" project, clearly focusing on eight national computing power hubs and promoting the construction of national data center clusters and in-city data centers.


Today, there is still a certain gap between China's data center construction and that of the United States:

According to the 2021-2022 Global Computing Power Index Assessment Report, there are currently about 600 hyperscale data centers in the world, each with more than 5,000 servers. About 39% of them are in the United States, four times China's share, while the servers in China, Japan, the United Kingdom, Germany, and Australia together account for about 30% of the total.


By the end of 2021, data center racks in use on the Chinese mainland totaled 5.2 million standard racks, data center servers in use numbered 19 million, and total computing power exceeded 140 EFLOPS.

Against the backdrop of computing power as national strength and catalyzed by large models, low-cost, low-power large computing power is bound to become a hard requirement. China urgently needs autonomous and controllable data centers that can carry this computing power, and data center computing power in turn depends on progress in domestic chip substitution.

In the data center scenario, domestic AI chips still lag the mainstream

Servers account for 69% of the data center infrastructure. Today, in the data center acceleration server market, GPGPUs dominate with higher performance and higher versatility:

According to IDC, GPU/GPGPU servers dominated China's accelerated server market in 2021 with a 91.9% share; non-GPU accelerated servers, such as the ASICs and FPGAs mentioned earlier, accounted for only 8.1%.


At this stage, in the cloud data center scenario, there is still a gap between domestic GPGPU chips and the international top level.

Before comparing, we need to make it clear that in the cloud (server-side), the requirements for training chips and inference chips are not exactly the same:

The training chip needs to train a complex neural network model through massive data to adapt it to specific functions, correspondingly, it has high requirements for performance and accuracy, and needs to have certain versatility;

The inference chip uses the neural network model for inference prediction, and the peak computing performance requirements are low, and more attention is paid to comprehensive indicators such as unit energy consumption computing power, delay, and cost.


AI training chips: a gap remains for domestic products

At present, players such as Bicheng Technology, Pingtouge, Kunlun Chip, Muxi, and Tianzhixin have laid out products for cloud data centers. Most manufacturers, such as Kunlun Chip and Pingtouge, have launched inference chips, while Cambricon, Muxi, and Tianzhixin have launched training-inference integrated chips.


In recent years, domestic manufacturers have made continuous breakthroughs in the hardware performance of training chips, but a gap remains with the market mainstream, NVIDIA's A100:


Taking the T20 as an example, its FP32 single-precision floating-point performance reaches 32 TFLOPS, higher than the A100's 19.5 TFLOPS, and it has an advantage in power consumption; however, its memory bandwidth is less than one-third of the A100's, leaving a gap in meeting the bandwidth requirements of machine learning and deep learning.

Meanwhile, according to Zheshang Securities, the Siyuan 590 series launched by Cambricon at the end of last year may perform better on some models thanks to its ASIC specialization, but its limited versatility still requires post-adaptation and technical support. On balance, China's AI training chips still trail NVIDIA in both performance and ecosystem (compatibility).

AI inference chips: domestic products approach parity


At present, domestic manufacturers such as Cambricon, Flint, and Kunlun Chip can compete head-on with the market mainstream Tesla T4: an energy efficiency ratio of 1.71 TOPS/W is only a small gap from the T4's 1.86 TOPS/W.

Computing power optimization path

The gap remains, and domestic AI makers urgently need to catch up with the international pace. The first step everyone takes to improve chip performance is to push toward more advanced process nodes.

At this stage, the design cost of advanced process chips is high: the cost per unit area increases sharply after 14/16nm.


(The cost per unit area of advanced process chips has increased, Source: TF Securities)

1. According to singular Moore data, as the process evolves from 28nm to 5nm, R&D investment has also increased sharply from $51.3 million to $542 million, and the development cost of 2nm is close to $2 billion.

2. According to EETOP public account data, in the 7nm node, the cost of designing a chip is as high as 300 million US dollars. And with the continuous slowdown of Moore's Law, transistors are approaching the physical limit and cost limit at the same time.

As a result, upstream chip companies keep raising prices: supplier TSMC's advanced-process wafer prices rise year after year, each round higher than the last.


3. Earlier, the increases came process by process: in 2021, TSMC notified customers at noon on August 25 of an across-the-board price increase, raising 7nm and 5nm advanced-process prices by 7%-9% with immediate effect and the remaining mature-process prices by about 20%;

4. At the beginning of 2023, TSMC raised prices sharply across the board: according to the Electronic Times, TSMC's 12-inch 5nm wafers were priced as high as $16,000 apiece, a 60% increase over the previous-generation 7nm wafers.

Rising costs will become the norm, and what is even more regrettable is that even though domestic manufacturers have pushed their processes to 7nm, performance has still not caught up with NVIDIA.


If they push on to 5nm in pursuit of higher performance, chip makers stand to lose more than they gain:

First, the cost is unaffordable; NVIDIA's moat in GPGPU was built with money. According to NVIDIA's Jensen Huang, R&D of the A100 chip alone cost $2-3 billion (tens of billions of yuan) and four years. In the short term, domestic start-ups are not that big and cannot afford the time cost either.


At present, high R&D costs have kept Cambricon and other manufacturers unprofitable.

Second, the money spent does not pay off: performance no longer scales "positively." Logic chips still evolve along Moore's Law, but continued shrinking of memory chips no longer brings cost or performance advantages, and shrinking analog chip processes may even degrade analog circuit performance.

At the same time, 7nm chips are more cost-effective than 5nm in the long run:

Georgetown University in the United States released an AI chip research report, which analyzes the economic benefits of AI chips using different process nodes. The report reveals through quantitative models that 7nm process chips are cost-effective compared to 5nm process nodes.

From this cost analysis model, the researchers came to two conclusions:

1. Within two years of normal operation, the energy consumption cost of advanced process (7/5nm) chips exceeds their production costs, and the energy consumption costs of chips using the old process (10nm and above) grow faster. When production costs and operating costs are taken into account, advanced process chips are 33 times more cost-effective than older process chips.

2. Comparing 7nm and 5nm chips, their total costs become comparable after about 8.8 years of normal operation. This means that if a chip is replaced within 8.8 years, 7nm is the more cost-effective choice. Given that most AI accelerators used for data center training and inference are replaced roughly every 3 years, 7nm chips are more cost-effective than 5nm (a small numerical sketch of this break-even follows).
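A minimal sketch of the break-even logic described in the two conclusions above: 5nm costs more to produce but less energy per year, so total costs cross over at some point (about 8.8 years per the report). All dollar figures are hypothetical placeholders chosen only so that the crossover lands near the reported 8.8 years.

```python
# Break-even sketch: production cost + yearly energy cost over a chip's lifetime.
# The 7nm/5nm cost figures below are hypothetical placeholders; only the ~8.8-year
# crossover and the ~3-year replacement cycle come from the report cited above.

def total_cost(production_cost: float, energy_cost_per_year: float, years: float) -> float:
    return production_cost + energy_cost_per_year * years

chip_7nm = {"production": 2000.0, "energy_per_year": 1200.0}   # hypothetical
chip_5nm = {"production": 6500.0, "energy_per_year": 690.0}    # hypothetical

for years in (3, 8.8, 12):
    c7 = total_cost(chip_7nm["production"], chip_7nm["energy_per_year"], years)
    c5 = total_cost(chip_5nm["production"], chip_5nm["energy_per_year"], years)
    cheaper = "7nm" if c7 < c5 else "5nm"
    print(f"{years:>4} years: 7nm={c7:,.0f}, 5nm={c5:,.0f} -> {cheaper} is cheaper")
```

With a roughly 3-year replacement cycle, the 7nm option wins in this toy model, which is the report's point.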

In addition, there are geopolitical factors: domestic R&D on advanced processes has been repeatedly obstructed. Chips have long been hostage to advanced processes, and improving chip computing power is not just about improving a single chip's performance; the macro total computing power of chips must be considered.

Macro total computing power = performance × quantity (scale) × utilization (a small numerical sketch follows the list below). At present, among CPU, GPU, AI, and other large computing power chips, many solutions cannot balance these three factors:

1. Some computing power chips can achieve soaring performance, but less consideration is given to the versatility and ease of use of the chip, resulting in low chip sales and small landing scale. For example, through FPGA customization, the scale is too small, the cost and power consumption are too high.

2. Some computing power improvement solutions focus on scale investment, but they cannot solve the foundation of future computing power demand order of magnitude improvement.

3. Some solutions improve the utilization rate of computing power through various resource pooling and sharing of computing power across different boundaries, but they cannot change the nature of the current performance bottleneck of computing power chips.
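The sketch below puts rough numbers on the "macro total computing power = performance × scale × utilization" relation and the three failure modes listed above; all figures are hypothetical and only illustrate why optimizing a single factor is not enough.

```python
# Illustrative sketch of: macro total computing power = performance x scale x utilization.
# The three "solutions" mirror the failure modes listed above; every number is hypothetical.

def macro_compute(perf_tops: float, units: int, utilization: float) -> float:
    """Deployed compute in TOPS: per-chip performance x fleet size x utilization."""
    return perf_tops * units * utilization

solutions = {
    "extreme single-chip perf, tiny fleet": (2000, 100, 0.6),
    "huge fleet, mediocre chips":           (300, 5000, 0.3),
    "balanced perf / scale / utilization":  (1000, 2000, 0.55),
}

for name, (perf, units, util) in solutions.items():
    # 1 EOPS = 1e6 TOPS
    print(f"{name:40s} -> {macro_compute(perf, units, util) / 1e6:.2f} EOPS")
```

In this toy comparison the balanced option delivers the most total compute, even though it leads on none of the three factors individually.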

To achieve large computing power, the three factors of performance, scale, and utilization must all be addressed, which requires a solution designed with the big picture in mind.

Computing power solution, ready to go

Taking AI cloud inference cards as an example, from 2018 to 2023 it has proven difficult to balance cost, power consumption, and computing power, for reasons including the race to ever more advanced processes.

However, the battle for national strength has begun, ChatGPT has arrived, and the market urgently needs a solution that takes into account cost, power consumption, and computing power.

At present, international manufacturers, domestic mainstream manufacturers, and start-ups are all seeking computing architecture innovation, trying to find solutions that take into account performance, scale, and utilization rate, and break through the ceiling of computing power.

For architecture innovation, the industry gives many technologies and solutions: quantum computing (quantum chip), photonic chip, storage and computing integration, chiplet, 3D packaging, HBM...

Among them, those compatible with CMOS processes and closest to mass production include HBM, chiplets, 3D packaging, and storage-computing integration. Storage-computing integration and chiplets are currently the two routes the industry broadly agrees can break through the AI computing power dilemma through architectural innovation.

Eliminate data barriers with storage and computing

From the traditional von Neumann architecture to the integrated storage and computing architecture, in layman's terms, it is to eliminate the gap between data and make it work more efficiently.

Under the traditional von Neumann architecture, a chip's storage and compute areas are separate. During computation, data must be shuttled back and forth between the two regions, and as neural network depth, model scale, and data volumes keep growing, the data increasingly "cannot keep up," becoming the bottleneck for high-performance computing's performance and power consumption, commonly known in the industry as the "storage wall."

(Specific performance of storage wall restrictions Source: Zheshang Securities)

The storage wall in turn brings the problems of the energy wall and the compilation wall (ecosystem wall). The compilation wall, for example, arises because heavy data movement is prone to congestion: the compiler cannot optimize operators, functions, programs, or networks as a whole under static, predictable conditions, and engineers must optimize programs manually, one by one or layer by layer, which consumes enormous time.

These "three walls" lead to needless waste of computing power: according to statistics, in large-computing-power AI applications, data movement accounts for 90% of the time and power consumption, and moving data consumes 650 times the power of computing on it.
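A rough energy-split sketch using the figures quoted above (data movement taking roughly 90% of time and power, and costing about 650x as much as the computation itself); the operation counts and per-operation energies are hypothetical placeholders.

```python
# Rough energy-split sketch based on the figures quoted above: moving a datum
# costs ~650x the energy of computing on it, so movement quickly dominates.
# The per-MAC energy, op count, and fetch ratio are hypothetical placeholders.

COMPUTE_ENERGY_PER_OP = 1.0    # arbitrary energy unit per MAC
MOVE_ENERGY_PER_OP = 650.0     # 650x figure from the article

ops = 1_000_000                # hypothetical number of MACs
moves_per_op = 0.2             # hypothetical fraction of operands fetched from off-chip

compute_energy = ops * COMPUTE_ENERGY_PER_OP
movement_energy = ops * moves_per_op * MOVE_ENERGY_PER_OP
total = compute_energy + movement_energy
print(f"data movement share of total energy: {movement_energy / total:.0%}")
# Even with only one in five operands fetched externally, movement dominates;
# storage-computing integration attacks exactly this term.
```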

Storage-computing integration fuses storage and computation, removing the shuttling of data between memory and compute and greatly reducing power consumption. On this basis, a Zheshang Securities report notes that its advantages include, but are not limited to: greater computing power (over 1,000 TOPS), higher energy efficiency (over 10-100 TOPS/W), and cost reduction and efficiency gains (potentially more than an order of magnitude)...

As shown in the figure below, compared with GPGPU, storage-computing integrated chips can achieve lower energy consumption and higher energy efficiency ratio, which can help data centers reduce costs and increase efficiency in terms of application implementation, and empower green computing power.

On this basis, for the same daily query volume, the initial investment in a storage-computing integrated chip is 13%-26% that of the A100, and the daily electricity bill is 12% of the A100's.
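The cost ratios above can be turned into a simple first-year comparison; the A100 baseline capex and daily electricity figures below are hypothetical placeholders, and only the 13%-26% and 12% ratios come from the text.

```python
# Simple first-year cost comparison using the ratios quoted above.
# A100 baseline figures are hypothetical placeholders; only the ratios are from the text.

a100_capex = 1_000_000.0     # hypothetical upfront cost for an A100 deployment, USD
a100_daily_power = 2_000.0   # hypothetical daily electricity bill, USD

POWER_RATIO = 0.12           # storage-computing chip's daily electricity vs. A100

for capex_ratio in (0.13, 0.26):
    cim_capex = a100_capex * capex_ratio
    cim_daily_power = a100_daily_power * POWER_RATIO
    first_year_saving = (a100_capex - cim_capex) + (a100_daily_power - cim_daily_power) * 365
    print(f"capex ratio {capex_ratio:.0%}: first-year saving ~ {first_year_saving:,.0f} USD")
```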

Give chips more power with Chiplet

Beyond tearing down the walls between data and compute, chip designers are trying to give chips more capability: distributing tasks to hardware compute units of different architectures (such as CPUs, GPUs, and FPGAs) so that each does what it does best, works in concert with the others, and improves overall efficiency.

Looking back at the history of computer development, AI chip processors from single-core-multi-core, computing from serial-parallel, from homogeneous parallel to heterogeneous parallel.

When Moore's Law was still an iron law of industry, the first stage, computer programming was almost always serial. The vast majority of programs have only one process or thread.

At that point, performance depended on the hardware process. After 2003, as process technology hit a bottleneck, hardware upgrades alone no longer worked. Even the arrival of homogeneous computing (stacking multiple cores to brute-force more computing power) left the overall ceiling in place.

The advent of heterogeneous parallel computing has opened up a new technological change: distributing tasks to hardware computing units of different architectures (such as CPU, GPU, FPGA), allowing them to perform their respective duties, work synchronously, and improve efficiency.

From a software perspective, heterogeneous parallel computing frameworks enable software developers to efficiently develop heterogeneous parallel programs and make full use of computing platform resources.

From a hardware perspective, on the one hand, the many different types of compute units raise computing power by increasing clock frequencies and core counts; on the other hand, each type of compute unit improves execution efficiency through targeted optimization.

Among them, Chiplet is the key technology.

Given current technology, the chiplet approach reduces chip design complexity and design cost. In the IC design stage, an SoC is decomposed into multiple dies by functional module; some of these dies are designed modularly and reused across different chips, which lowers design difficulty, eases subsequent product iteration, and shortens time to market.

Widening the "data lane" with HBM technology

Due to the development and demand of the semiconductor industry, processors and memories have gone to different process routes, which means that the processes, packaging, and requirements of processors and memories are very different.

As a result, the performance gap between the two has widened since 1980. The data shows that from 1980 to 2000, the speed mismatch between processors and memory increased at a rate of 50% per year.

(From 1980 to 2000, the speed mismatch between processors and memory increased at a rate of 50% per year. Source: Electronic Engineering Album)

Memory access speed cannot keep up with the processor's data processing speed; the narrow data path between the two and the resulting high energy consumption erect a "memory wall" between storage and compute.

To reduce the impact of the memory wall, increasing memory bandwidth has long been a focus for memory chips. Jensen Huang has said that the biggest weakness in scaling computing performance is memory bandwidth.

HBM is the solution to this problem.

High bandwidth memory is a hardware storage medium. Based on its high throughput and high bandwidth, it has attracted the attention of industry and academia.

One of HBM's advantages is that the distance between the memory and the processor is shortened through the interposer, and the memory and computing unit are packaged together through advanced 3D packaging to improve the speed of data handling.
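A roofline-style sketch of why memory bandwidth caps attainable compute, which is the problem HBM attacks: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity. The peak-compute and bandwidth numbers below are hypothetical placeholders for a generic accelerator.

```python
# Why bandwidth is the "biggest weakness" noted above: a simple roofline bound.
# Attainable throughput = min(peak compute, bandwidth x arithmetic intensity).
# Peak compute and bandwidth are hypothetical placeholders for a generic accelerator.

PEAK_TFLOPS = 300.0      # hypothetical peak compute, TFLOPS
BANDWIDTH_TB_S = 2.0     # hypothetical memory bandwidth, TB/s (HBM-class)

def attainable_tflops(flops_per_byte: float) -> float:
    """Roofline bound for a kernel with the given arithmetic intensity (FLOPs per byte moved)."""
    return min(PEAK_TFLOPS, BANDWIDTH_TB_S * flops_per_byte)

for intensity in (4, 32, 256):
    print(f"intensity {intensity:>3} FLOP/B -> {attainable_tflops(intensity):.0f} TFLOPS attainable")
```

Low-intensity kernels are throttled by the bandwidth term no matter how much peak compute the chip has, which is why widening the "data lane" with HBM raises real-world throughput.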

Super-heterogeneity: an emerging solution balancing performance, scale, and utilization

Super-heterogeneous computing integrates and reorganizes an even wider range of heterogeneous compute, so that the various types of processors can interact fully and flexibly.

Simply put, it aggregates the advantages of multiple engine types such as DSAs, GPUs, CPUs, and CIM, and combines emerging approaches such as chiplets and 3D packaging to achieve a leap in performance (a small routing sketch follows the list below):

√ The DSA handles relatively fixed, well-defined computation;

√ GPU is responsible for some performance-sensitive and elastic work in the application layer;

√ The CPU can do anything and serves as the catch-all;

√ CIM is in-memory computing; the main difference between super-heterogeneity and ordinary heterogeneity is the addition of CIM, which delivers the same computing power at lower energy consumption, or more computing power at the same energy consumption. In addition, thanks to device-level advantages, CIM can carry more computing power than a DSA.
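As a purely illustrative sketch of the division of labor listed above, the routine below routes tasks to DSA, GPU, CIM, or CPU according to the roles the text assigns them; the task taxonomy and routing rules are hypothetical, not any vendor's actual scheduler.

```python
# Illustrative routing of tasks to the engines described above:
# DSA for fixed heavy compute, GPU for elastic performance-sensitive work,
# CIM for memory-bound kernels, CPU as the catch-all. Entirely hypothetical.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    fixed_function: bool   # computation pattern known and stable
    perf_sensitive: bool   # latency/throughput critical, but shape varies
    memory_bound: bool     # dominated by data movement

def route(task: Task) -> str:
    if task.memory_bound:
        return "CIM"   # in-memory computing: same work, far less data movement
    if task.fixed_function:
        return "DSA"   # domain-specific accelerator for stable, heavy kernels
    if task.perf_sensitive:
        return "GPU"   # elastic, performance-sensitive application-layer work
    return "CPU"       # general-purpose fallback ("catch-all")

tasks = [
    Task("conv backbone inference",      fixed_function=True,  perf_sensitive=True,  memory_bound=False),
    Task("attention over long context",  fixed_function=False, perf_sensitive=True,  memory_bound=True),
    Task("dynamic-shape preprocessing",  fixed_function=False, perf_sensitive=True,  memory_bound=False),
    Task("control / orchestration",      fixed_function=False, perf_sensitive=False, memory_bound=False),
]

for t in tasks:
    print(f"{t.name:32s} -> {route(t)}")
```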

Super-heterogeneous computing can address performance, scale, and utilization all at once.

On performance, the addition of storage-computing integration delivers the same computing power at lower energy consumption, or more computing power at the same energy consumption;

On scale, because super-heterogeneity is built on a computing platform that aggregates multiple engine types, it can combine flexibility with versatility, avoiding the small deployment scale that comes from a lack of generality. And because the platform is more general and can handle many kinds of tasks, utilization can also be improved.

Future research directions for super-heterogeneity

The reality is that even plain heterogeneous computing faces a programming problem; NVIDIA worked for years to make CUDA friendly enough for developers to form the mainstream ecosystem.

Super-heterogeneity is even harder: the difficulty shows up not only in programming, but also in the design and implementation of the processing engines, and in integrating the software and hardware capabilities of the whole system.

To better tame super-heterogeneity, software-hardware co-design points the way:

1. Balance performance and flexibility. From a system perspective, as system tasks sink from the CPU into hardware acceleration, the question is how to choose the right processing engine so as to achieve the best performance and the best flexibility, and not merely to trade one off against the other, but to have both.

2. Programming and ease of use. Systems are gradually moving from hardware-defined software to software-defined hardware. How can these characteristics be exploited, existing software assets leveraged, and the result integrated into cloud services?

3. Products. Beyond the requirements themselves, one must consider how needs differ across users and how each user's needs iterate over the long term. How can users be given better products that meet both short-term and long-term needs? Better to teach people to fish than to give them fish: how can users be offered a fully programmable, high-performance hardware platform rather than fixed, narrowly specific functions?

Computing power is national strength, and data centers are the "bases" from which countries compete. Data centers urgently need large computing power chips to meet the needs of major center-side and edge-side application scenarios.

However, in data center scenarios, China's existing cloud AI training and inference chips still lag far behind NVIDIA's top A100. At the same time, process technology is approaching its physical and cost limits, so seeking a more efficient computing architecture is the best option.

Today, technologies such as storage-computing integration, chiplets, and 3D packaging are maturing, and solutions such as super-heterogeneity are highly feasible. On traditional architectures the gaps between countries are obvious; on these new technologies, the contenders are hard to tell apart.

The pattern of computing power competition is quietly changing.

Domestic AI chips: a hundred boats vie on the stream, and the outcome is undecided

Under the traditional architecture, NVIDIA stands alone

According to the market structure, in the field of AI chips, there are currently three types of players.

The first is the established chip giants represented by NVIDIA and AMD, with deep accumulated experience and outstanding product performance. As noted above, in the cloud scenario domestic manufacturers trail them on both inference chips and training chips.

The second is the cloud computing giants represented by Google, Baidu, and Huawei, which are building general-purpose large models and have developed their own AI chips and deep learning platforms to support them: Google's TensorFlow and TPU, Huawei's Kunpeng and Ascend, and the Hanguang 800 from Alibaba's T-Head.

Finally, AI chip unicorns such as Cambricon, Bicheng Technology, and Horizon, armed with strong technical strength, capital, and R&D teams, have broken into the AI chip track.

At present, NVIDIA holds more than 80% of the Chinese accelerator card market, and domestic AI chips still have ground to make up: according to IDC, accelerator card shipments in China exceeded 800,000 units in 2021, of which NVIDIA took more than 80% of the market. The remaining share went to brands such as AMD, Baidu, Cambricon, Enflame, New H3C, and Huawei.

Behind the technical paths, a hidden logic

Classified by computing architecture, Chinese players currently fall into three camps: ASIC, GPGPU, and storage-computing integration.

Combing through each vendor's architecture, application scenarios, and resource endowments, the following pattern emerges:

Large vendors and specialist autonomous-driving chip makers prefer ASICs.

Domestically, Huawei's HiSilicon, Baidu, and Alibaba's T-Head all choose ASIC as their chip architecture:

1. Huawei opts to deploy a complete end-to-end ecosystem: the Ascend 910, for example, is meant to be used with Huawei's large-model framework MindSpore and its Pangu large model.

2. Alibaba positions itself here as a system integrator and service provider, using its own chips to build acceleration platforms and selling the capability as a service.

3. Baidu's Kunlun chips are used mainly in its own intelligent computing clusters and servers, as well as by domestic enterprises, research institutes, and governments.

ASICs are highly integrated, can exert their full performance, and keep power consumption well under control, but their drawbacks are just as obvious: limited application scenarios, reliance on a self-built ecosystem, difficult customer migration, and a long learning curve.

Large vendors, however, have many specific in-house scenarios, so the ASIC drawbacks of "limited application scenarios and difficult customer migration" largely disappear in their case; and in mass production and the manufacturing supply chain, an ASIC is markedly easier to bring up than a GPU.

AI chip makers focused on autonomous driving, such as Horizon and Black Sesame, likewise escape the drawbacks of ASICs thanks to order volume: as of April 23, 2023, shipments of Horizon's Journey chips had exceeded 3 million units, with mass-production agreements reached with more than 20 car companies across more than 120 vehicle models.

After 2017, AI chip unicorns joined the GPGPU camp.

Since ASICs only deliver extreme performance for specific scenarios and fixed algorithms, a manufacturer either needs scenarios of its own (large vendors such as Huawei) or must bind itself to large customers (such as Neoneng Technology). Once the more general GPGPU demonstrated the performance it ought to have, it became the first choice of domestic AI chip startups.

It can be seen that Denglin Technology, Tianzhixin, and Enflame, which chose GPGPU, cover both training and inference, while most ASIC chips, such as T-Head's, focus on either inference or training alone.

Around 2019, a new batch of AI chip unicorns bet on the integration of storage and computing

By around 2019, domestic AI computing-power chip makers had realized that under the traditional architecture, CPUs, GPUs, and FPGAs were monopolized by foreign vendors and highly dependent on advanced process nodes; lacking reserves of advanced process technology, domestic AI manufacturers went looking for a new answer and found it in storage-computing integrated chips. The competitive pattern within storage-computing integration is still unsettled, and it may yet become the key that lets domestic manufacturers break through. The mainstream way of classifying storage-computing integration is by the distance between the computing unit and the storage unit: processing near memory (PNM), processing in memory (PIM), and computing in memory (CIM); the rough energy accounting sketched below shows why shrinking that distance matters.
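A minimal sketch of the energy argument, assuming illustrative order-of-magnitude per-operation energies rather than measured figures: the further an operand has to travel, the more the data movement, not the arithmetic, dominates the cost of each multiply-accumulate.

```python
# Back-of-the-envelope energy accounting for one multiply-accumulate (MAC)
# under different data placements. The picojoule figures are illustrative
# order-of-magnitude placeholders, not measured values.
MAC_PJ = 1.0                      # energy of the arithmetic itself
FETCH_PJ = {
    "von_neumann_dram": 100.0,    # operand travels off-chip every time
    "near_memory_pnm":   10.0,    # memory die stacked or bonded next to compute
    "in_memory_cim":      1.0,    # operand barely moves at all
}

for style, fetch in FETCH_PJ.items():
    total = MAC_PJ + fetch
    print(f"{style:18s} {total:6.1f} pJ per MAC "
          f"({fetch / total:.0%} of it spent just moving data)")
```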

Tesla, Alibaba DAMO Academy, Samsung, and other large manufacturers choose near-memory computing.

According to Ganesh Venkataramanan, head of the Dojo project, the D1 chip in Tesla's Dojo AI training computer delivers 4 times the performance of comparable chips in the industry at the same cost, 1.3 times the performance at the same energy consumption, and occupies 5 times less space. Specifically, each D1 training tile is built from a 5x5 array of D1 chips interconnected in a two-dimensional mesh. Thanks to the near-memory computing architecture, on-chip cross-core SRAM reaches a striking 11GB, with an efficiency of 0.6 TFLOPS/W at BF16/CFP8. Industry insiders say that for a CPU-style architecture, this energy efficiency ratio is very good.

In 2021, Alibaba DAMO Academy released a 3D-stacking technology based on hybrid bonding, which bonds compute dies and memory dies face-to-face using specific metal materials and processes. By DAMO Academy's own measurements, in a real recommendation-system application the storage-computing integrated chip delivers more than 10 times the performance of a traditional CPU-based system and more than 300 times the energy efficiency.

Samsung released HBM-PIM (strictly speaking a PNM product), a memory product built on a processing-in-memory architecture. Samsung says the architecture delivers higher performance at lower energy consumption: fitted to an AMD GPU accelerator card, HBM-PIM roughly doubles performance over a card without it while cutting energy consumption by about 50% on average; compared with HBM-only GPU accelerators, HBM-PIM-equipped accelerators would save about 2,100 GWh of energy per year.

At home, Zhicun Technology chose processing in memory: in March 2022, its mass-produced PIM-based SoC, the WTM2101, officially went to market. In less than a year, the WTM2101 has been commercialized on the device side, providing AI processing for voice, video, and similar workloads and helping products achieve more than a 10x gain in energy efficiency.

Computing in memory (CIM) is what most domestic startups actually mean when they talk about storage-computing integration:

Yizhu Technology, building on a CIM architecture with RRAM as the storage medium, is developing an "all-digital storage-computing integrated" large-computing-power chip: it raises computing energy efficiency by reducing data movement, while the digital approach preserves computational accuracy, making it suitable for cloud AI inference and edge computing.

At the end of 2022, Zhixin launched the industry's first edge-side AI-enhanced image processor based on SRAM CIM.

Within the storage-computing integration camp, large manufacturers and startups have likewise diverged along different technical paths.

Large companies and startups have "consciously" split into two camps: Tesla, Samsung, Alibaba, and other large vendors with rich ecosystems, together with traditional chip makers such as Intel and IBM, are almost all laying out PNM; startups such as Zhicun Technology, Yizhu Technology, and Zhixin are betting on PIM, CIM, and the other routes where "storage" and "computing" sit closer together.

Vendors with comprehensive ecosystems are thinking about how to punch through the computing-power and power-consumption bottlenecks quickly and land their rich application scenarios; chip makers, responding to customers' demand for efficient computing power and low power consumption, develop whatever technology meets those needs.

In other words, what large vendors demand of a storage-computing integrated architecture is that it be "practical and quick to land", and near-memory computing, the route closest to engineering deployment, has become their first choice.

Chinese startups, young and with thinner technical reserves, lack advanced 2.5D and 3D packaging capacity and know-how; to break the US technology monopoly, they focus on CIM, which does not hinge on advanced process nodes.

In the cloud scenario, players go from shallow to deep

Different players show their strengths in different business scenarios and are still exploring business models at home and abroad. But domestic or foreign, they point in the same direction in the cloud: inference.

The industry generally holds that training chips are harder to develop and commercialize: a training chip can do inference, but an inference chip cannot do training.

The reason is that during AI training the neural network model is not fixed, which places high demands on the chip's versatility, whereas inference is simpler and its market is growing faster. Training chips are therefore a sterner test of a chip company's design capability; the sketch below makes the asymmetry between the two workloads concrete.
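A minimal NumPy sketch of the asymmetry: inference is just the forward pass, while training repeats that forward pass and then adds gradient computation and a weight update on top, which is why hardware that can train can also infer, but not the reverse.

```python
# Minimal single-layer example: inference is the forward pass only;
# training adds a backward pass and a weight update on top of it.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))          # weights of a tiny linear layer
x = rng.standard_normal(4)               # one input sample
y_true = rng.standard_normal(3)          # its target

def infer(W, x):
    return W.T @ x                        # forward pass: all an inference chip must do

def train_step(W, x, y_true, lr=0.01):
    y = infer(W, x)                       # training still needs the forward pass...
    grad_y = 2 * (y - y_true)             # ...plus gradients (backward pass)...
    grad_W = np.outer(x, grad_y)
    return W - lr * grad_W                # ...plus an optimizer/weight update

print("inference output:", infer(W, x))
W = train_step(W, x, y_true)              # one SGD step on a squared-error loss
```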

Across the global AI chip market, "inference first, training later" is the mainstream path, taken by Habana (the AI chip company acquired by Intel) as well as by many domestic AI startups.

This choice is also driven by the downstream market:

As AI model training has matured and AI applications have landed in recent years, the cloud inference market has gradually overtaken the training market:

According to the "2020-2021 Chinese Intelligent Computing Power Development Evaluation Report" jointly released by IDC and Inspur, the inference load of AI servers in the Chinese market exceeded the training load in 2021, and as AI enters the application period, the compound growth rate of data center inference computing power demand is more than twice that of the training side, and it is expected that the proportion of accelerators used for inference will exceed 60% by 2026.

The threshold for storage-computing integration, the new star of AI chips, is extremely high

After 2019, most newly founded AI chip makers have been laying out storage-computing integration: according to incomplete statistics by Insight into Cyprus, 20 new AI chip manufacturers emerged between 2019 and 2021, 10 of which chose the storage-computing integration route.

All of this suggests that storage-computing integration will be the rising star after GPGPU, ASIC, and the other architectures. But not everyone can reach this star.

With academia, industry, and capital unanimously bullish on storage-computing integration, strong technical strength, a solid talent pool, and an accurate read on how much migration cost customers will accept are what keep a startup competitive, and they are also the three thresholds standing in front of new players.

Storage-computing integration tears down the three walls and can deliver low power consumption, high computing power, and a high energy efficiency ratio, but reaching that performance raises plenty of challenges:

First, storage-computing integration touches the entire chip-building chain: from devices at the lowest level, through circuit design, architecture design, and the toolchain, up to software-layer development;

Second, while making the corresponding changes at each layer, one must also keep the layers well matched to one another.

Let us look layer by layer at the technical problems involved in creating a storage-computing integrated chip.

First, in device selection, manufacturers are "walking on thin ice": the choice of memory device determines the chip's yield, and once the direction is wrong the chip may never reach mass production.

Next comes the circuit level. With a device in hand, it must be turned into a circuit design for the storage array. At present there is no EDA tool support to guide in-memory computing circuit design, so the work must be completed by hand, which greatly increases the difficulty.

Once the circuits exist, the architecture layer must be designed. Each circuit is a basic computing module, the whole architecture is assembled from different modules, and the design of the storage-computing integrated module determines the chip's energy efficiency ratio. Analog circuits, moreover, are subject to noise, and a chip whose computation is disturbed by noise runs into many problems.

The architect therefore needs to understand the process characteristics of analog in-memory computing, design the architecture around those characteristics, and keep the architecture compatible with software development; the small simulation below gives a feel for the noise effect being designed around.
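A small simulation, with an arbitrary illustrative noise level, of the effect the architect has to design around: an analog in-memory matrix-vector multiply whose accumulated bit-line results pick up Gaussian noise, compared against the exact digital answer.

```python
# Toy model of an analog compute-in-memory matrix-vector multiply:
# the ideal result is W @ x, but each column's accumulated current is
# read back through a noisy ADC. The noise level here is purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, size=(64, 64))     # conductances of the crossbar array
x = rng.uniform(-1, 1, size=64)           # input voltages

ideal = W @ x                              # what a digital engine would produce
noisy = ideal + rng.normal(scale=0.05 * np.abs(ideal).mean(), size=ideal.shape)

rel_err = np.linalg.norm(noisy - ideal) / np.linalg.norm(ideal)
print(f"relative error from analog noise: {rel_err:.1%}")
```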

Once the architecture design is complete, the software level follows: a corresponding toolchain has to be developed.

Because the programming model of storage-computing integration differs from that of the traditional architecture, the compiler must be adapted to a completely different storage-computing integrated architecture, ensuring that every computing unit can be mapped onto the hardware and run smoothly; the sketch below shows the kind of mapping problem involved.
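A sketch of the mapping problem such a compiler faces, under the assumption of a fixed 32x32 CIM macro (the size and naming are illustrative): the layer's weight matrix is cut into macro-sized tiles, each tile is assigned to a macro, and the partial results are summed back together.

```python
# Sketch of mapping one layer's weight matrix onto fixed-size CIM macros:
# tile the matrix, assign each tile to a macro, then reduce partial sums.
# The 32x32 macro size is an assumption chosen for illustration.
import numpy as np

MACRO_ROWS, MACRO_COLS = 32, 32

def map_to_macros(W):
    """Split W into macro-sized tiles; returns {(row_block, col_block): tile}."""
    tiles = {}
    for i in range(0, W.shape[0], MACRO_ROWS):
        for j in range(0, W.shape[1], MACRO_COLS):
            tiles[(i // MACRO_ROWS, j // MACRO_COLS)] = W[i:i + MACRO_ROWS, j:j + MACRO_COLS]
    return tiles

def cim_matvec(tiles, x):
    """Compute y = W @ x by summing each macro's partial result into its output slice."""
    n_row_blocks = max(r for r, _ in tiles) + 1
    y = np.zeros(n_row_blocks * MACRO_ROWS)
    for (r, c), tile in tiles.items():
        x_slice = x[c * MACRO_COLS: c * MACRO_COLS + tile.shape[1]]
        y[r * MACRO_ROWS: r * MACRO_ROWS + tile.shape[0]] += tile @ x_slice
    return y

W = np.random.default_rng(2).standard_normal((96, 64))
x = np.random.default_rng(3).standard_normal(64)
assert np.allclose(cim_matvec(map_to_macros(W), x), W @ x)
```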

Building out the full technical chain tests capability at every link: device, circuit design, architecture design, toolchain, and software-layer development, all while keeping the links matched to one another. It is a protracted battle that consumes time, labor, and money.

From the process above, it is clear that storage-computing integrated chips urgently need experienced circuit designers and chip architects.

Moreover, given the peculiarities of storage-computing integration, a company that can pull it off needs the following characteristics in its talent pool:

1. The leader needs enough nerve, with clear judgment on device selection (RRAM, SRAM, etc.) and on the computing paradigm (traditional von Neumann, storage-computing integration, etc.).

That is because, as a disruptive technology, storage-computing integration has no one to follow, and the cost of trial and error is extremely high. Founders of companies that reach commercialization tend to have deep industry, big-company, and academic backgrounds, and can lead a team through rapid product iteration.

2. The core team needs experienced people at every technical layer. The architect, for example, is the heart of the team: architects must deeply understand both the underlying hardware and the software tools, turn the proposed storage-computing architecture into working technology, and carry it through to a shipping product;

3. In addition, according to a QbitAI report, China is short of high-end circuit-design talent, especially for mixed-signal circuits. In-memory computing involves a great deal of analog circuit design, and unlike digital design, which leans on teamwork, analog design depends on individual designers who are intimately familiar with the process, the design, the layout, the device models and PDK, and the packaging.

Landing is the primary productive force. At delivery, customers look not only at the storage-computing technology itself but at whether the overall SoC's energy efficiency, area efficiency, ease of use, and other indicators improve enough over previous products, and, more importantly, whether the migration cost falls within an acceptable range.

If adopting a new chip to speed up an algorithm means learning a new programming system, and the labor spent migrating models costs more than simply buying another GPU, the customer will most likely not choose the new chip.

Whether storage-computing integration can minimize migration cost during deployment is therefore a key factor in customers' product choices.

At present, NVIDIA dominates the Chinese AI accelerator card market with its more general-purpose GPGPU.

But with low power consumption and a high energy efficiency ratio, the storage-computing integrated chip is becoming the rising star of the chip track.

The storage-computing integration market is still at the "budding lotus just showing its tip" stage. But there is no denying that its players have already raised three high walls: without technical strength, a solid talent pool, and a grip on migration cost, there is no way in.

Industry trends

Storage-computing integration, the next stage of computing power

With the rise of AI and other big-data applications, storage-computing integration has been widely researched and applied by academia and industry at home and abroad. At the top microarchitecture conference MICRO 2017, NVIDIA, Intel, Microsoft, Samsung, UC Santa Barbara, and others all presented prototypes of storage-computing integrated systems.

Since then, the number of ISSCC papers on in-memory/near-memory computing has grown rapidly: from 6 in 2020 to 19 in 2023; within that, papers on in-memory computing proper have climbed quickly since 2021, reaching 4 in 2022 and 6 in 2023.

(Storage-computing integration papers at ISSCC. Source: ISSCC 2023)

System-level innovation on the rise

System-level innovation keeps appearing at top semiconductor conferences, showing its potential to break through the computing-power ceiling.

In her keynote "Innovation for the next decade of compute efficiency," AMD President and CEO Lisa Su pointed to the rapid development of AI applications and the demands it places on chips.

Lisa Su noted that if computing efficiency keeps improving at its current rate of 2.2x every two years, then by 2035 reaching zettascale computing (on the order of 10^21 operations per second) would require about 500 MW of power, roughly what half a nuclear power plant produces, "which is extremely outrageous and unrealistic."
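Read as arithmetic, the projection is a compounding exercise. The sketch below uses an assumed baseline efficiency of 50 GFLOPS/W and a 12-year horizon, figures chosen purely for illustration rather than taken from the keynote, to show how target throughput, efficiency growth, and the resulting power requirement relate.

```python
# Back-of-the-envelope sketch of the efficiency extrapolation described above.
# The baseline numbers below are illustrative assumptions, not keynote figures.

def projected_power_mw(target_ops_per_s: float,
                       base_eff_ops_per_w: float,
                       years_elapsed: float,
                       eff_growth_per_2y: float = 2.2) -> float:
    """Power (in MW) needed to hit a target throughput, assuming chip
    efficiency compounds by `eff_growth_per_2y` every two years."""
    eff = base_eff_ops_per_w * eff_growth_per_2y ** (years_elapsed / 2)
    return target_ops_per_s / eff / 1e6

# Example: a hypothetical 50 GFLOPS/W system today, extrapolated 12 years out,
# asked to deliver zettascale (1e21 ops/s) throughput.
print(projected_power_mw(1e21, 50e9, 12))   # hundreds of MW under these assumptions
```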

In order to achieve such efficiency improvements, system-level innovation is one of the most critical ideas.

(The relationship between computing power and power consumption Source: ISSCC2023 Conference)

In another keynote, delivered jointly by imec, CEA-Leti, and Fraunhofer, three of Europe's best-known semiconductor research institutes, system-level innovation was again a central keyword.

The speech noted that as semiconductor processes approach their physical limits, the chip demands of new applications must be met by thinking at the system level, and named next-generation smart cars and AI as the two core applications that most need system-level chip innovation to support their new requirements.

"From head to toe" breaks the hash ceiling

System-level innovation means co-designing multiple links across the upstream, midstream, and downstream of the chain to improve performance. It is also referred to as system-process co-optimization.

System-process co-optimization is an "outside-in" development model: it starts from the workloads and software the product must support, moves through the system architecture and the chip types that must go into the package, and ends at the semiconductor process technology.

(System process collaborative optimization Source: ISSCC2023 Conference)

Simply put, all the links are optimized together so that the final product improves as much as it possibly can.

Lisa Su offered a classic case: adopt an innovative number format (such as 8-bit floating point, FP8) at the model and algorithm level, then support and optimize it at the circuit level, and the result is an order-of-magnitude efficiency gain in compute: with this kind of system-level innovation, FP8 can raise computing efficiency by as much as 30 times over traditional 32-bit floating point (FP32).

(Domain-specific computing supports workload optimization to improve performance and efficiency Source: ISSCC2023 Conference)

This is why system-level innovation is the critical path: if circuit design stops at the circuit layer, merely squeezing more efficiency out of the FP32 compute unit, an order-of-magnitude improvement is out of reach no matter what. A rough numeric illustration of the format side of this trade-off follows below.
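A rough illustration of the number-format half of that case: the snippet below simulates E4M3-style FP8 rounding (ignoring subnormals and special values) and compares storage cost and rounding error against FP32. It is a simplification for intuition, not AMD's implementation, and it captures only the memory side; the larger gains come from the smaller, cheaper compute units the format enables.

```python
# Crude simulation of FP8 (E4M3-style) rounding versus FP32:
# keep ~4 significant binary digits and clamp to the E4M3 max of 448.
# Subnormals and special values are ignored; this is a simplification.
import numpy as np

def fake_fp8_e4m3(x):
    x = np.clip(x, -448.0, 448.0)
    mant, exp = np.frexp(x)               # x = mant * 2**exp, 0.5 <= |mant| < 1
    mant = np.round(mant * 16) / 16       # keep 1 implicit + 3 stored mantissa bits
    return np.ldexp(mant, exp)

w = np.random.default_rng(4).standard_normal(1_000_000).astype(np.float32)
w8 = fake_fp8_e4m3(w)

rel_err = np.abs(w8 - w).mean() / np.abs(w).mean()
print(f"bytes: fp32={w.nbytes}, fp8~={w.size}  (4x less memory traffic)")
print(f"mean relative rounding error: {rel_err:.1%}")   # roughly a couple of percent
```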

In the section of the talk on future opportunities, Lisa Su sketched the general shape of a future system-in-package architecture: heterogeneous compute clusters, domain-specific acceleration units, advanced packaging, high-speed chip-to-chip UCIe interconnect, and memory technologies such as storage-computing integration.

(Future system-in-package architecture Source: ISSCC2023 Conference)

Innovators first

With the technical path and the plan made clear, what comes next is the stage of hard work.

Every developer of an emerging technology inevitably hits problems at every level in the early days: walls in technical exploration, downstream manufacturers that refuse to buy in. Whoever first reads the future trend, takes the step of exploring it, and commits reasonable resources to the attempt will seize the opportunity.

Chip giant NVIDIA has set a good example in this regard.

Back when the data center wave had not yet crested and AI training was still a niche field, NVIDIA invested heavily in general-purpose computing GPUs and the unified programming software CUDA, carving out for itself a fine new line of work: the computing platform.

At the time, making GPUs programmable looked like a thankless bet: no one knew whether it would double performance, but it would certainly double the product-development effort, and no customer was willing to pay for that. NVIDIA, however, judged that a single-function graphics processor was no long-term answer and decided to apply CUDA across all of its product lines.

In an interview, Dr. Lai Junjie, Senior Director of Engineering and Solutions at NVIDIA China, said: "To realize the vision of a computing platform, Jensen Huang quickly mobilized a great deal of resources across NVIDIA in the early days."

Foresight plus heavy investment paid off, and in 2012 NVIDIA collected the innovator's reward: when the computational performance of deep learning algorithms stunned academia that year, GPU+CUDA, a high-computing-power, more general, easy-to-use productivity tool, quickly swept the computer science community and became the standard kit for AI development.

Today, storage-computing integration has demonstrated powerful performance, excelling in large-computing-power scenarios such as AI neural networks, multimodal AI computing, and brain-inspired computing.

Domestic manufacturers began laying out storage-computing integration around 2019, pairing it with emerging technologies such as 3D packaging and chiplets and with memories such as RRAM and SRAM to break through the computing-power ceiling.

In the war of AI big computing power chips, innovators come first.

Epilogue:

ChatGPT's boom has set off a huge wave across the AI industry, and domestic AI chips are entering a 3.0 era. In this era, storage-computing integration, the chip architecture better suited to large models, will come to the fore, system-level innovation will become the development trend, and the manufacturers who bet first will eat the dividends that ChatGPT brings.