Under the "AI Everywhere" vision of AMD Chair and CEO Lisa Su, the company's AI portfolio is becoming steadily broader and deeper. Products suited to AI workloads, such as the EPYC and Ryzen processor families, are already widely used in cloud and enterprise applications. For the next revolution in edge AI, however, AMD needs more efficient and compact solutions. In fact, that revolution is already under way across industries such as healthcare, transportation, smart retail, smart factories, and smart cities. As the compute demands of new applications grow, the industry faces a range of challenges around power consumption and size constraints. In the past, AMD relied mainly on Versal, Zynq, and related products to meet these needs; to address the next level of compute requirements, it is continuing to upgrade the line to provide more robust support.
Embedded AI bottleneck – Single-chip acceleration is urgently needed
Embedded systems have always faced severe constraints: extreme temperatures, limited power and board space, and the need to respond in real time to guarantee safety and reliability. As AI becomes more ubiquitous, embedded systems must absorb heavier workloads on top of these traditional challenges. In an AI-driven embedded system, data processing involves three key steps: pre-processing, AI inference, and post-processing, each of which must be accelerated for the system to run in real time. Pre-processing, which fuses and conditions data from multiple sensors, is a critical step for real-time operation; AI inference is typically handled by vector processors; and post-processing relies on high-performance embedded CPUs. Since no single type of processor can optimize all three phases, a mix of processors is needed, and multi-chip solutions are commonly used to build such systems. Typically, pre-processing is optimized by combining FPGAs and SoCs, inference uses non-adaptive SoCs, and post-processing uses high-performance embedded CPUs. AMD's first-generation Versal AI Edge family offered an alternative, using programmable logic for pre-processing and its vector processors, the AI Engines, for inference, but post-processing still required the support of an external processor.
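The three stages described above can be sketched as a simple pipeline. The following minimal Python sketch is purely illustrative; the function names, the averaging-based "fusion," and the thresholding decision are hypothetical stand-ins, not AMD APIs:

```python
# Illustrative three-stage embedded AI pipeline (hypothetical, not an AMD API).
# Pre-processing fuses and conditions sensor data, inference scores it,
# and post-processing turns the score into a decision.

def preprocess(sensor_frames):
    # Sensor-fusion sketch: average per-pixel readings across sensors.
    fused = [sum(px) / len(px) for px in zip(*sensor_frames)]
    # Data conditioning: normalize into [0, 1].
    peak = max(fused) or 1.0
    return [v / peak for v in fused]

def infer(features, weights):
    # Stand-in for the vector-processor inference stage: one dot product.
    return sum(f * w for f, w in zip(features, weights))

def postprocess(score, threshold=0.5):
    # Scalar decision logic of the kind an embedded CPU would run.
    return "alert" if score > threshold else "ok"

frames = [[0.2, 0.4, 0.8], [0.4, 0.6, 1.0]]  # two sensors, three "pixels"
weights = [0.1, 0.3, 0.6]
print(postprocess(infer(preprocess(frames), weights)))
```

The point of the sketch is the structure, not the math: each stage has a different compute character (parallel data movement, vector arithmetic, scalar branching), which is why each benefits from a different kind of silicon.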
Both of these approaches require a multi-chip solution, which introduces a number of problems: higher power requirements, power-delivery complexity, larger footprint and system size, greater external memory requirements, and added latency from chip-to-chip interconnects. More components also widen the attack surface and add potential points of failure, raise the risk of component obsolescence, and increase board design time and effort, reducing productivity.
AMD's 2nd Gen Versal Adaptive SoC Enables "Single-Chip Smart"
In response to these industry pain points, AMD announced the second-generation Versal Adaptive SoCs for embedded systems: the second-generation Versal AI Edge series, designed for AI-driven embedded systems, and the second-generation Versal Prime series, designed for classic embedded systems. According to Manuel Uhm, Director of Product Marketing for Versal in AMD's Adaptive & Embedded Computing Group (AECG), the core advance is that a single device provides end-to-end acceleration across all three stages: data pre-processing, inference, and post-processing.
Manuel Uhm, Director of Product Marketing, Versal, Adaptive & Embedded Computing Group (AECG), AMD
The second-generation Versal Adaptive SoC uses programmable logic for pre-processing, including sensor fusion, data conditioning, and new hardened image and video processing capabilities; delivers a 3x increase in TOPS per watt in the inference phase; and provides up to 10x the scalar compute in the post-processing phase through the integration of up to eight Arm Cortex-A78AE application processor cores and up to ten Arm Cortex-R52 real-time processor cores.
In addition, given edge computing's stringent requirements for information security and functional safety, the second-generation Versal series supports standards such as ASIL D and SIL 3, ensuring safety is designed in from the earliest stages. "Unlike the first generation, which acted more as an accelerator alongside a host CPU, the second-generation Versal AI Edge series is designed to serve as the central compute of the system," says Manuel Uhm. Building on decades of experience in the embedded space, AMD offers strong support for embedded AI. An intuitive set of comparisons shows the system-level performance gains of the second-generation Versal in ADAS, smart-city, and video-streaming applications:
- In L2+/L3 ADAS applications, the second-generation AI Edge series delivers 4x the image processing capability at similar power, thanks to the newly hardened image processing functions.
- In smart-city scenarios, the second-generation AI Edge series shrinks the footprint of edge AI devices by 30% while supporting twice as many video streams, meaning each video stream occupies a 65% smaller footprint.
- In video streaming, the second-generation Versal Prime series delivers up to 2x the video processing performance of the Zynq MPSoC for multi-port encoding and streaming, with a 35% smaller footprint per stream.
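The per-stream figure in the smart-city comparison follows directly from the other two numbers; a quick arithmetic check:

```python
# Smart-city comparison from the text: 30% smaller device footprint,
# 2x the number of video streams handled.
footprint_ratio = 0.70  # new footprint relative to old (30% reduction)
stream_ratio = 2.0      # new stream count relative to old

per_stream = footprint_ratio / stream_ratio  # footprint per stream, new vs old
reduction = (1 - per_stream) * 100
print(f"per-stream footprint shrinks by {reduction:.0f}%")
```

Half the footprint ratio spread over twice the streams gives 0.70 / 2 = 0.35 of the original per-stream footprint, i.e. the 65% reduction quoted above.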
How to achieve "single-chip intelligence" in the three processing stages?
Manuel Uhm explained in depth how the second-generation Versal Adaptive SoC implements the three phases of pre-processing, inference, and post-processing. The main goal of the pre-processing phase is to reduce latency and increase determinism. At this stage, a non-adaptive SoC offers only a limited number of I/O interfaces or hard ISPs and lacks flexibility: bringing in different sensors or data types forces a round trip through external memory or cache, which lowers processing efficiency and adds latency. "In the pre-processing phase, adaptability equals flexibility, meaning it can connect to any sensor over any interface. Processors are limited by their instruction sets, while adaptability lets the hardware be customized for different performance targets while remaining real-time. With programmability, true flexibility can be achieved," points out Manuel Uhm.
For AI inference, unlike the first generation, which controlled the AI Engines mainly through programmable logic, the second-generation products harden the control processor inside the AI Engine array itself. In other words, AI Engine control no longer consumes programmable logic, and the freed-up logic resources can be used for sensor and other data processing. Because AI inference demands both high throughput and accuracy, the second-generation Versal AI Edge series supports multiple data types to meet different accuracy and throughput requirements. For example, the introduction of shared-exponent data types significantly increases throughput without sacrificing accuracy: top-end performance reaches 369 TFLOPS in the dense MX6 configuration, which AMD cites as roughly 60% above the 184 TOPS maximum of INT8. In addition, the AIE-ML v2 AI Engines can also handle signal-processing workloads such as FIRs and FFTs.
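The shared-exponent idea behind MX-style data types can be illustrated in a few lines: a block of values shares one power-of-two exponent, and each element stores only a small integer mantissa, so most of the per-element bits go to precision rather than range. The Python sketch below is a simplified block-floating-point illustration; the block size and 4-bit mantissa width are illustrative choices, not the MX6 specification:

```python
import math

def quantize_shared_exponent(block, mantissa_bits=4):
    """Quantize a block of floats to integer mantissas with one shared exponent.

    Simplified sketch of a shared-exponent (block floating point) format:
    the per-block scale is a power of two chosen from the largest magnitude.
    """
    max_mag = max(abs(v) for v in block)
    if max_mag == 0:
        return [0] * len(block), 0
    # Pick the shared exponent so the largest value fits the signed mantissa range.
    shared_exp = math.ceil(math.log2(max_mag)) - (mantissa_bits - 1)
    scale = 2.0 ** shared_exp
    mantissas = [round(v / scale) for v in block]
    return mantissas, shared_exp

def dequantize(mantissas, shared_exp):
    scale = 2.0 ** shared_exp
    return [m * scale for m in mantissas]

block = [0.9, -0.25, 0.5, 0.05]
m, e = quantize_shared_exponent(block)
print(m, e, dequantize(m, e))
```

Because the exponent is stored once per block instead of once per element, hardware can move and multiply narrow integers at high throughput while keeping dynamic range close to floating point, which is the trade-off the text describes.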
To get the most out of the AI Engines, the accompanying software must also be robust and easy to use, so that developers can deploy and optimize with familiar tools. Vitis AI is one such package: it lets developers use open-source frameworks such as PyTorch and TensorFlow for model optimization and inference, better realizing the potential of the Versal AI Edge family of devices. In the post-processing stage, as mentioned earlier, the new products provide up to 10x the scalar compute. This is driven by an application processing unit (APU) for complex decision-making and similar workloads, with eight Arm Cortex-A78AE cores running at up to 2.2 GHz for up to 200.3K DMIPS, and a real-time processing unit (RPU) for control functions with up to ten Arm Cortex-R52 cores running at up to 1.05 GHz for up to 28.5K DMIPS. The ASIL D and SIL 3 level design also greatly improves the new products' resilience to system faults.
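The aggregate DMIPS figures quoted above imply a per-core efficiency (DMIPS/MHz) that can be recovered with simple arithmetic. The helper function below is just illustrative bookkeeping, not an AMD tool:

```python
# Recover implied per-core efficiency from the aggregate figures in the text:
# aggregate_dmips = cores * freq_mhz * (dmips per MHz), so
# dmips per MHz = aggregate / (cores * freq_mhz).

def dmips_per_mhz(aggregate_kdmips, cores, freq_ghz):
    return (aggregate_kdmips * 1000) / (cores * freq_ghz * 1000)

apu = dmips_per_mhz(200.3, 8, 2.2)   # Cortex-A78AE APU cluster
rpu = dmips_per_mhz(28.5, 10, 1.05)  # Cortex-R52 RPU cluster
print(f"APU ~{apu:.1f} DMIPS/MHz per core, RPU ~{rpu:.1f} DMIPS/MHz per core")
```

The quoted totals work out to roughly 11.4 DMIPS/MHz per Cortex-A78AE core and about 2.7 DMIPS/MHz per Cortex-R52 core, consistent with an application-class versus real-time-class core split.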
"Compared to previous multi-chip AI-driven embedded systems, the second-generation Versal AI Edge family enables end-to-end embedded system acceleration in a single device and minimizes the need for external safety microcontrollers or external memory, eliminating the need to split workflows across multiple processors, improving efficiency and removing extra overhead," concludes Manuel Uhm. The Subaru EyeSight vision system is a prime example of the second-generation Versal AI Edge family in use. The partnership will further enhance the next-generation EyeSight system's pre-collision braking, lane departure warning, adaptive cruise control, and lane keeping assist. In addition, using programmable logic, Subaru can modify the stereo camera's processing algorithms in real time, further enhancing vehicle safety.
The early access program for the second-generation Versal AI Edge and Versal Prime series is now under way, early-access documentation has been released, and AMD is engaging with key customers, including Subaru. Chip samples are due in the first half of 2025, evaluation kits and system-on-modules (SOMs) in mid-2025, and production silicon in late 2025.
Promote "AI Everywhere" and achieve broader intelligence
AI is developing and changing rapidly: emerging models such as Transformers have become the industry's focus within just a few years, and which models emerge next is unpredictable. To stay competitive in such a fast-moving environment, the adaptability and flexibility of the platform become critical. That is why AMD is committed to building a highly scalable platform that can flex with the processing needs of tomorrow's markets. AMD's AI strategy currently centers on inference and training. According to Manuel Uhm, the training side will rely mainly on the power of CPUs and GPUs, supplemented by adaptive acceleration products such as Alveo, while edge inference will rely mainly on the AI Engines and programmable logic, playing to the key strengths of the adaptive platform. With the trend toward distributed machine learning, training and learning tasks are also being pushed out to edge devices rather than centralized in the cloud. This reduces the latency of backhauling data to the cloud and lets edge devices learn and adapt in real time, another area AMD's products can serve. In addition, privacy has become an important consideration in AI applications: with growing concern about data privacy, more and more users and businesses want to process data locally on the device rather than uploading it to the cloud, and AMD is likewise exploring solutions for on-device training and inference to meet that need. Through this strategic layout, Manuel Uhm said, AMD is actively addressing the main challenges in AI, aiming to advance "AI Everywhere" and achieve broader intelligence.