
The era of AI chip customization is coming

The increasing complexity of AI models and the explosion in the number and variety of networks have left chipmakers torn between fixed-function acceleration and programmable accelerators, and are spawning new approaches that include some of both.

Overall, general-purpose AI processing isn't keeping up. General-purpose processors are just that: they are not designed or optimized for any particular workload. And because AI accounts for a significant share of a system's power budget, focusing on a specific use case or workload can deliver greater power savings and better performance in a smaller footprint.

"Over the past decade, AI has had a profound impact on the computing and semiconductor industries – so much so that specialized processor architectures are now being adopted, and specialized components are also being developed and adopted to serve the AI market only," said Steven Woo, a Rambus researcher and distinguished inventor. ”

But this specialization comes at a price. "When it comes to ML and AI, the compute demands are endless," said Ian Bratt, fellow and vice president of machine learning technology at Arm. "If you can do 10 times more compute, people will use it, because you can do better by running a model that's 10 times bigger. Because that demand is endless, it pushes you to optimize for the workload, and different types of NPUs have been built that achieve very good energy efficiency on a particular class of neural network models, where you can get excellent throughput and performance per watt. However, this comes at the cost of flexibility, because no one knows where the models are headed. So it sacrifices future-proofing."

As a result, some engineering teams are pursuing different optimization approaches. "General-purpose computing platforms, such as CPUs and GPUs, have been adding more internal acceleration for neural networks without sacrificing their general-purpose programmability," Bratt said. "Arm has a roadmap of CPU instructions and architectural features, and has been adding them to its CPUs over the years to improve ML performance. While this is still a general-purpose platform, you can get a long way there. It's not as good as a dedicated NPU, but it's a more flexible and future-proof platform."

Improving efficiency is critical, and it affects everything from the energy required to train AI models in hyperscale data centers to the battery life of edge devices for inference.

"If you take a classical neural network where there are multiple layers of nodes, and information is passed from one node to another, the essential difference between training and execution is that during training, you have backpropagation," said Marc Swinnen, director of product marketing at Ansys. You take the dataset and run it in the node. Then calculate the error function, which is how wrong the answer is compared to the labeled result that you know needs to achieve. Then you take that error and backpropagate it, and adjust the ownership weight of the connections on and between nodes to reduce the error. Then you scan it again with more data, and then backpropagate the error again. You go back and forth, that's training. With each scan you improve the weights, eventually you want to converge to a set of trillions of weights and values made up of nodes, biases, and weights and values that can provide reliable output. Once you have the weights and all the parameters for each node, and the actual AI algorithm is executed, then you don't need to backpropagate. You don't need to correct it anymore. All you have to do is enter the data and pass it on. It's a simpler, one-way way of processing data. ”

Backpropagation requires a lot of energy to do all the calculations.

"You have to average all the nodes and all the data to form an error function, and then you have to weight and divide it and so on." Swinnen explains. "Backpropagation requires all the math, which doesn't happen in the actual execution (during inference). This is one of the biggest differences. There is much less math required in reasoning. ”

However, this still requires a lot of processing, and as AI algorithms become more sophisticated and the number of floating-point operations increases, the trend line will only point upwards and to the right.

"Over the past five years, the winning ImageNet 'Top-1' algorithms have increased the number of floating-point operations they perform 100-fold," said Russ Klein, program director at Siemens Digital Industries Software. "And of course, LLMs are setting new records for model parameter counts. As the computational load increases, it becomes increasingly impractical to run these models on general-purpose CPUs. AI algorithms typically have a high degree of data parallelism, which means operations can be distributed across multiple CPUs, so more CPUs can be applied to the problem to meet performance requirements. But the amount of energy needed to perform those calculations on CPUs can be very high. GPUs and TPUs typically have higher power consumption, but they complete the computation faster, reducing the energy consumed for the same operation."
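The data parallelism Klein describes can be illustrated with a toy Python sketch (the layer, tensor shapes, and four-worker split are hypothetical): each CPU worker processes an independent slice of the batch, and the results are simply concatenated.

```python
# Minimal sketch of data parallelism on CPUs: the same layer computation is
# applied to independent slices of the input batch, so the slices can be
# farmed out to separate CPU workers. Sizes and worker count are illustrative.
import numpy as np
from multiprocessing import Pool

WEIGHTS = np.random.default_rng(1).normal(size=(256, 256))

def layer(chunk):
    """One layer's work on a slice of the batch (independent of other slices)."""
    return np.maximum(chunk @ WEIGHTS, 0.0)

if __name__ == "__main__":
    batch = np.random.default_rng(2).normal(size=(1024, 256))
    chunks = np.array_split(batch, 4)           # split the batch four ways
    with Pool(processes=4) as pool:             # one CPU worker per chunk
        outputs = pool.map(layer, chunks)       # run the slices in parallel
    result = np.concatenate(outputs)
    # Same result as doing it all on one CPU, just spread over more of them.
    assert np.allclose(result, layer(batch))
```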

Still, the demand for more processing power is growing. Gordon Cooper, product manager for the Solutions Group at Synopsys, noted that the sharp rise in benchmark requests for generative AI inference is indicative of growing interest. "More than 50 percent of our recent benchmark requests have at least one generative AI model on the list," he said. "What's harder to assess is whether they have a specific use case, or whether they're hedging their bets and saying, 'This is the trend; I have to tell people I have this.' I think the need to claim this capability is still running ahead of the use cases."

At the same time, the pace of change in these models is accelerating. "We're still a long way from hard-wiring AI into an ASIC, where we can say, 'This is it, the standards have been set, these are the benchmarks, and this will be the most efficient,'" Cooper said. "So programmability is still crucial, because you have to provide a certain level of programmability for whatever comes next to make sure you have some flexibility. But if you're too programmable, then you're just a general-purpose CPU or even a GPU, and you're not taking advantage of the power and area efficiency of an edge device. The challenge is to optimize as much as possible while providing programmability for the future. That's where we and some of our competitors try to land: in areas that are flexible enough. An example is activation functions, such as ReLU (rectified linear unit). We used to hardwire them, but now that looks ridiculous, because we can't guess what they'll need next time. So now we have a programmable lookup table to support whatever comes in the future. It took us a few generations to realize we had to start making it more flexible."
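The programmable lookup table Cooper mentions can be sketched in software. The snippet below is illustrative only (the 256-entry table, the input range, and GELU as the example activation are assumptions): the table is "programmed" by sampling whatever activation the model needs and evaluated by interpolation, so switching activations does not require new hardwired logic.

```python
# Minimal sketch of a programmable activation lookup table: instead of
# hardwiring ReLU, store N sampled points of whatever function the compiler
# loads, and interpolate between them at run time. Table size and input
# range are illustrative assumptions.
import numpy as np

def build_lut(fn, lo=-8.0, hi=8.0, entries=256):
    """'Program' the table by sampling any activation function."""
    xs = np.linspace(lo, hi, entries)
    return xs, fn(xs)

def lut_activation(x, xs, ys):
    """Evaluate the programmed activation via linear interpolation."""
    return np.interp(np.clip(x, xs[0], xs[-1]), xs, ys)

# Today the table might hold GELU; tomorrow it can be reloaded with something else.
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
xs, ys = build_lut(gelu)

x = np.linspace(-4, 4, 9)
print(np.round(lut_activation(x, xs, ys), 4))   # LUT approximation
print(np.round(gelu(x), 4))                     # reference values
```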

AI processing is constantly evolving

The rapid development of AI has been driven by tremendous advances in computing performance and capacity. "We're in the AI 2.0 era now," said Rambus' Woo. "AI 1.0 was really characterized by the first attempts to apply AI across the field of computing. Voice assistants and recommendation engines, among others, started to gain traction because they could use AI to deliver higher-quality results. But looking back, they were limited in some ways. Those systems could take certain types of inputs and produce certain types of outputs, but they couldn't really generate the kind of high-quality information that can be generated today. Where we are today is built on top of AI 1.0. AI 2.0 is characterized by systems that can create something new from the data they learn from and the input they receive."

The most important of these technologies are large language models and generative AI, along with co-pilots and digital assistants that help humans become more productive. "These systems are characterized by multimodal inputs and outputs," Woo explained. "They can accept many kinds of input (text, video, voice, even code) and generate something new from it. In fact, they can also generate many types of media. All of this is another step toward the larger goal of artificial general intelligence (AGI), and we, as an industry, are working to deliver more human-like behaviors that build on what AI 1.0 and AI 2.0 have set up for us. The idea is to truly adapt to the environment and tailor results to specific users and specific use cases. The way content is generated will improve, especially in things like video, and in the future AGI will even be used to guide autonomous agents, such as robotic assistants that can both learn and adapt."

Along the way, AI model sizes have been growing dramatically, by roughly 10x or more per year. "The largest models available in 2024 have already broken through the trillion-parameter mark," he said. "That's because larger models provide more accuracy, and we're still in the early stages of making these models truly efficient. Of course, this is still a stepping stone toward AGI."
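A quick back-of-the-envelope calculation (with assumed precisions, purely for illustration) shows why trillion-parameter models translate directly into pressure on memory capacity and bandwidth:

```python
# Back-of-the-envelope sketch (assumed precisions): memory needed just to hold
# the weights of a trillion-parameter model, before activations, optimizer
# state, or KV caches are counted.
params = 1_000_000_000_000                 # one trillion parameters
for name, bytes_per_param in [("fp32", 4), ("bf16", 2), ("int8", 1)]:
    tb = params * bytes_per_param / 1e12
    print(f"{name}: {tb:.0f} TB of weight storage")
# fp32: 4 TB, bf16: 2 TB, int8: 1 TB -- far beyond a single accelerator's DRAM.
```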

Three or four years ago, before the advent of vision transformers and LLMs, SoC requirements for new NPU capability were typically limited to a small selection of well-known, heavily optimized detectors and image classifiers, such as ResNet50, ImageNet v2, and the older VGG16. "Semiconductor companies typically evaluated third-party IP for these networks, but ultimately decided to build their own accelerators for the common building-block graph operators in those benchmark networks," said Steve Roddy, chief marketing officer at Quadric. "In fact, the vast majority of AI acceleration in high-volume SoCs comes from in-house-developed accelerators. Teardowns of the leading mobile SoCs of 2024 prove the point: all six of the major high-volume mobile SoCs use internal NPUs."

Many of these may be superseded or complemented by more flexible commercial NPU designs. "Requests for proposals for new NPU IP typically include 20, 30, or more networks, covering a range of classic CNNs such as ResNet and ResNext, newer complex CNNs (e.g., ConvNext), vision transformers (such as Swin transformers and deformable transformers), and GenAI LLMs/SLMs, where there are too many model variants to count," said Roddy. "It's not feasible to build hard-wired logic to accelerate such a diverse set of networks with hundreds of different variations of AI graph operators. As a result, SoC architects are looking for more fully programmable solutions, and most internal teams are turning to external third-party IP vendors that can provide the more robust compiler toolsets needed to quickly compile new networks, rather than the labor-intensive approach of manually porting ML graphs used in the past."

History repeats itself

This evolution of AI mirrors what has happened in computing more broadly over time. "First, computers came out in the data center, and then computing started to proliferate," said Jason Lawley, director of product marketing for Cadence's Neo NPUs. "We moved to the desktop, then into people's homes, and outward. Then we had laptops, and then mobile phones. It's the same with artificial intelligence. Look at the computational intensity required for AI getting started in the data center. We're seeing that now with NVIDIA.

"That being said, mainframes and data centers will always have a place. We're going to see AI proliferate from the data center outward to the edge. When you move to the edge, you get a variety of different types of applications. Cadence focuses on video, audio, and radar, along with other classes of compute around them, and each of those pillars gets an accelerator for the application processor. Within each pillar, they may need to do more AI, so the AI NPU becomes an accelerator for the accelerators."

Customer behavior is also evolving. "More and more system companies and end users have their own proprietary models, or models that have been retrained on proprietary datasets," said Roddy. "These OEMs and downstream users can't, or won't, release proprietary models to silicon vendors so that the silicon vendors' porting teams can do the porting. And even with NDA protections in place up and down the supply chain, a working model that relies on manual tuning and porting of ML models won't scale to support the entire consumer and industrial electronics ecosystem. The new working model is a fully programmable, compiler-based toolchain that can be used by the data scientists or software developers who create the final application, exactly the way the toolchains that have led CPUs, DSPs, and GPUs for decades are deployed."

Increasing algorithm complexity puts more pressure on engineering teams

As algorithms continue to grow in complexity, designers are forced to pursue higher levels of acceleration. "The more an accelerator is tailored to a specific model, the faster and more efficient it will be, but the less versatile it will be," said Siemens' Klein, "and the less adaptable it is to changes in the application and requirements."


Figure 1: Power and performance of CPUs, GPUs, TPUs, and custom accelerators running AI models


Figure 2: The increasing complexity of inference

Rambus' Woo also sees a trend toward larger AI models because they can deliver higher-quality, more powerful, and more accurate results. "This trend shows no signs of slowing down, and we expect the demand for greater DRAM capacity and greater DRAM bandwidth to continue to grow significantly. We all know that AI training engines are the showcase piece of AI, at least on the hardware side. Compute engines from companies like NVIDIA and AMD, as well as specialized engines such as the TPUs produced by Google, have made huge strides in compute capability and in the industry's ability to deliver better AI. But those engines have to be fed a lot of data, and data movement is one of the key factors limiting how quickly we can train models today. If these high-performance engines are waiting for data, they are not doing their job. We have to make sure the entire pipeline is designed to supply data in a way that keeps these engines busy.

"If we look at that pipeline from left to right, it usually starts with a large amount of data, sometimes stored in a very unstructured way on devices such as SSDs or hard drives. The job of those storage systems is to extract the most relevant and important data for the model being trained and convert it into a form the engine can use. These storage systems also have a lot of regular memory for buffers and the like; some of them can have up to 1TB of memory. Once the data is pulled from storage, it is sent to a set of servers for data preparation. Some people call it the read layer. The idea is to take this unstructured data and prepare it so the AI engine can train on it most effectively."
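The "keep the engines fed" idea can be sketched as a simple producer-consumer pipeline. The snippet below is a toy illustration (the queue depth, timings, and stage names are hypothetical stand-ins for a real storage, data-prep, and training pipeline): a background data-preparation stage pulls and transforms records while the compute loop consumes ready batches from a bounded buffer.

```python
# Toy sketch of a data pipeline that keeps a compute engine fed: a background
# thread pulls raw records from (slow) storage and prepares them, while the
# compute loop consumes ready batches from a bounded queue.
import queue, threading, time

batches = queue.Queue(maxsize=8)            # bounded buffer between stages

def data_prep_worker(n_batches):
    for i in range(n_batches):
        time.sleep(0.01)                    # pretend: read + reshape from SSD
        batches.put(f"prepared-batch-{i}")  # structured, engine-ready data
    batches.put(None)                       # signal end of the dataset

def training_loop():
    while True:
        batch = batches.get()               # blocks only if prep falls behind
        if batch is None:
            break
        time.sleep(0.005)                   # pretend: one training step
    print("engine never starved (in this toy run)")

threading.Thread(target=data_prep_worker, args=(50,), daemon=True).start()
training_loop()
```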

At the same time, alternative number formats suggest that PPA can be improved further. "Floating-point numbers are often used for AI training and inference in Python ML frameworks, but they're not the ideal format for these calculations," Klein explained. "The numbers in AI computations mostly fall between -1.0 and 1.0, and data is usually normalized to this range. A 32-bit floating-point number can represent values from roughly -10^38 to 10^38, which leaves a lot of unused range in the numbers and in the operators that perform calculations on them. The operator hardware and the memory that stores the values take up silicon area and consume power."

Google created a 16-bit floating-point format called brain float (bfloat16), targeted at AI computing. Because the storage for model parameters and intermediate results is halved, PPA improves considerably. Vectorized (SIMD) bfloat instructions are now an optional instruction-set extension for RISC-V processors. Some algorithms are deployed using integer or fixed-point representations. Moving from a 32-bit floating-point number to an 8-bit integer requires one quarter of the memory, data moves through the design four times faster, and the multiplier shrinks by about 97 percent. Smaller multipliers allow more operators within the same silicon area and power budget, resulting in higher parallelism. Posits are another novel representation that works well for AI algorithms.
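As an illustration of the storage savings Klein describes, here is a minimal sketch of symmetric 8-bit quantization (the tensor size and the single per-tensor scale are assumptions, not any specific toolchain's scheme): the float32 weights are mapped to int8 values plus one scale factor, cutting weight storage to a quarter.

```python
# Minimal sketch of symmetric int8 quantization: a float tensor is mapped to
# 8-bit integers plus one scale factor, cutting weight storage to a quarter of
# fp32. Tensor size and scheme are illustrative.
import numpy as np

w = np.random.default_rng(3).uniform(-1.0, 1.0, size=(1024, 1024)).astype(np.float32)

scale = np.abs(w).max() / 127.0             # one scale for the whole tensor
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_restored = w_int8.astype(np.float32) * scale

print("fp32 size:", w.nbytes // 1024, "KiB")
print("int8 size:", w_int8.nbytes // 1024, "KiB")       # 4x smaller
print("max abs quantization error:", float(np.abs(w - w_restored).max()))
```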

"General-purpose AI accelerators, such as those produced by NVIDIA and Google, must support 32-bit floating-point numbers because some AI algorithms require them," Klein said. In addition, they can add support for integers of various sizes, and possibly brain floating-point numbers or assumptions. But supporting each new numeric representation requires an operator for that representation, which means more wafer area and power are needed, to the detriment of PPA. In addition to 32-bit floating-point numbers, some Google TPUs also support 8-bit and 16-bit integer formats. But if the optimal size of the application is 11-bit features and 7-bit weights, it is not quite suitable. A 16-bit integer operator is required. But a custom accelerator with an 11 x 7 integer multiplier will use about 3.5 times the area and energy. For some applications, this will be a strong reason to consider a custom accelerator. ”

All roads lead to customization, and chip designers need to know a lot about customizing AI engines.

"When you license a product with a high degree of customization or varying degrees of customization, you get something different," said Paul Karazuba, vice president of marketing at Expedera. "It's not a standard product. Therefore, you need a little time to learn. You're getting boutique products, and some of those products have hooks that are unique to you as a chip designer. This means that, as a chip designer and architect, you need a learning curve to understand exactly how these products will function in your system. Doing so has its advantages. If standard IP, such as PCIe or USB, contains something you don't want or need, the hooks in it may not be compatible with the architecture you've chosen as a chip designer. ”

This is essentially margin in the design, and it affects performance and power consumption. "When you get a custom AI engine, you can make sure that the hooks you don't like aren't there," Karazuba said. "You can make sure the IP works well in your system. So there are definitely benefits to doing it. But there are drawbacks. You don't get the scale that standard IP has. With something highly customized, you get custom features that benefit your system, but you'll need to deal with longer lead times, and you may have something unique to manage. There will be some complications."

However, those benefits can outweigh the learning curve. In one early customer example, Karazuba recalled, "They developed their own in-house AI network designed to reduce noise in 4K video streams, and they wanted to hit 4K video rates. They spent millions of dollars to build it. They were originally going to use the existing NPU on their application processor, which, as you might guess, was a general-purpose NPU. They put the algorithm on that NPU and got a frame rate of two frames per second, which is obviously not video rate. They came to us, and we licensed them a targeted, customized version of our IP. They built a chip that included our IP and ran the exact same network, getting a frame rate of 40 frames per second. So by building a focused engine, performance was 20 times better. Another benefit is that, because it's focused, they were able to run it at half the power consumed by the NPU on the application processor. As a result, they got 20x the throughput at less than half the power.

"To be fair, it used the same process node as the application processor, so it's truly an apples-to-apples comparison. These are the benefits you see from something like this. Now, obviously, there is the question of cost. Building your own chip is much more expensive than using what's already on a chip you've bought. However, if you can leverage this AI to differentiate your products, and you can get this level of performance, then the extra cost may not be an obstacle."

Conclusion

As for where things go from here, Arm's Bratt says there is room for both approaches in AI/ML. "We're going to see that in situations where people really care about energy efficiency and the workload is stable, like deep embedded environments, these dedicated NPUs, running models highly optimized for them, will deliver great performance. But in general, programmable platforms like CPUs will continue to move forward. They're going to keep advancing on ML, and they're going to run those brand-new workloads that maybe can't be mapped onto existing NPUs because they have new operators or new data types.

"But as things stabilize, for some verticals you're going to take those models that run on programmable platforms, optimize them for NPUs, and get the best performance in embedded verticals like surveillance cameras and other applications. These two models will coexist for quite some time to come."

According to Lawley of Cadence, chip architects and design engineers need to understand that the changes AI processing brings come down to three things: storing data, moving data, and computing on data.

"Fundamentally, these three things haven't changed since the beginning of Moore's Law, but the most important thing they have to be aware of is the trend towards low power consumption and optimal data usage, and advances in quantification – the ability to pin memory into a system and reuse it efficiently. So what kind of layer fusion should be used in data movement, data storage, and data computing? Software plays just as much of a role in this as much as hardware, so algorithms are able to calculate what doesn't need to be calculated without incorrectly and move things that don't need to be moved – that's what we're focused on. How do we get the most performance with the least amount of energy? This is a difficult problem to solve. ”

Reference Links

https://semiengineering.com/mass-customization-for-ai-inference/

Source | Semiconductor Industry Insights compiled from SemiEngineering
