
Chaoxing Future's Liang Shuang: Software-Hardware Co-Optimization Empowers the New Era of AI 2.0

Recently, the 3rd Tsinghua University Automotive Chip Design and Industrial Application Seminar and Alumni Forum was held in Wuhu. Dr. Liang Shuang, co-founder and CEO of Chaoxing Future, attended as an invited guest and delivered a keynote speech titled "Software-Hardware Co-Optimization, Empowering the New Era of AI 2.0".


Large models are the "steam engine" of the AI 2.0 era

The landing of AI+X applications and edge computing will be the key

Since the release of ChatGPT, large models have ignited the "Fourth Industrial Revolution" and become the "steam engine" of the AI 2.0 era, driving the intelligent transformation of countless industries. As Paul Mantoux once observed, "The steam engine did not create large-scale industry, but it provided the power for large-scale industry." The same holds for large models: by themselves they will not directly create new industries, but they create value by combining with existing industry application scenarios and data.

After WAIC 2024 came to a close, some media commented that there are no new players in large models, and that the second half of AGI is about computing and applications. Liang Shuang believes the second half of AGI will be the landing of AI+X applications and edge computing. In the AI 1.0 era, neural network models gradually migrated from the server side to the edge in applications such as security and intelligent driving; this trend will surely repeat in the AI 2.0 era, creating a broader incremental market in smart cities, automobiles, robotics, and consumer electronics.


Looking back at the evolution of AI, the dominant mode of the AI 1.0 era was completing a single task with a single model, such as security, face recognition, speech recognition, and intelligent assisted-driving solutions built from perception-decision-control sub-modules. Liang Shuang believes we are now entering an "AI 1.5 era": in complex systems such as intelligent driving and robotics, neural networks implement the function of each module, hand-written rules are minimized, and performance improves through a data-driven paradigm, greatly reducing the difficulty of manually handling the many long-tail problems. In the AI 2.0 era, systems will be supported by a unified general foundation model that handles multi-source data input and completes a variety of complex tasks.

End-to-end large models get on board vehicles

Smart cars are a necessary stage on the way to general-purpose robots

In recent years, intelligent driving systems have been upgrading from traditional single-sensor CNN perception to multi-sensor CNN BEV, Transformer-based BEV, and Occupancy solutions, and are now evolving toward end-to-end large models. As the planning-and-control portion is progressively replaced by neural network models and hand-written rules are removed from the middle of the pipeline, the performance ceiling rises sharply when driven by massive high-quality data, manual handling of long-tail problems drops dramatically, and the amount of software engineering can shrink by up to 99%. In addition, introducing large vision models helps the intelligent driving system better understand the complex semantics of the physical world, makes driving behavior closer to a human's, and improves generalization to unknown scenes.


Liang Shuang pointed out that smart cars are a necessary stage on the road to general-purpose robots. For example, Tesla's Optimus robot and its smart cars share the same FSD platform, and the two are highly similar in system composition and in how they are iterated and upgraded. The robot, however, operates in higher dimensions with more complex tasks; deploying large models onto edge-side devices to form a "Robot-Brain" will become the key to the industry's development.

Landing large models at the edge poses great challenges

Software-hardware co-optimization is a realistic and feasible path to deployment

The past decade has been called the golden decade of AI accelerators, with the energy efficiency of CNN accelerators improved to 100 TOPS/W. But the scale and parameter counts of large models have grown far faster than in the CNN era, and far faster than traditional computing hardware. The energy efficiency of current large-model processors is still below 1 TOPS/W, a gap of two orders of magnitude from edge-side application requirements, which severely limits the deployment of large models.


(Excerpted from Professor Wang Yu's January 2024 report, "Edge-Side Large Model Inference: Current Status and Prospects of Smart Chips")

At present, the "small" models (under 2B parameters) deployed locally on mobile phones typically hit capability limits in edge-side scenarios, such as forgetting conversation history, while the 7B-and-above models that are in greater demand and perform significantly better are usually difficult to deploy on existing edge chips, mainly for the following reasons:

(1) Traditional architectures show an obvious gap in matrix computing power: 50-80% of the computation in a large model takes place in the various matrix operations of the Attention layers, and the KV matrices exhibit obvious sparsity, both of which require dedicated support;

(2) The parameter and bandwidth requirements of large models are enormous: a 7B floating-point model alone requires 28 GB of storage, and since weight locality is low, computation must repeatedly read weights from external memory, pushing per-token bandwidth demand above 10 GB/s (see the back-of-envelope sketch after this list);

(3) The precision types supported by current architectures are insufficient: traditional CNNs can usually achieve good results with INT8 alone, while the various operators in a large model require support for multiple precisions such as INT4/FP8/BF16, and layers such as activations and norms have large dynamic ranges that many existing quantization algorithms cannot handle well.
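As a rough sanity check on point (2), the back-of-envelope calculation below reproduces the 28 GB figure and shows how per-token bandwidth scales with weight bit width and generation speed. This is illustrative arithmetic under simple assumptions (every weight read once per token), not a measurement of any specific chip.

```python
# Back-of-envelope: storage and per-token bandwidth for a 7B-parameter model.
PARAMS = 7e9  # 7B parameters

# FP32 storage: 4 bytes per parameter -> 28 GB (decimal), ~26.1 GiB.
fp32_bytes = PARAMS * 4
print(f"FP32 weights: {fp32_bytes / 1e9:.0f} GB ({fp32_bytes / 2**30:.1f} GiB)")

# Autoregressive decoding streams (roughly) every weight once per token,
# so required bandwidth scales with weight size times token rate.
tokens_per_s = 10  # a modest interactive target
for bits in (32, 16, 8, 4):
    weight_bytes = PARAMS * bits / 8
    bw = weight_bytes * tokens_per_s
    print(f"{bits:2d}-bit weights at {tokens_per_s} tok/s -> {bw / 1e9:.0f} GB/s")
```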

Chaoxing Future Liang Shuang: Collaborative optimization of software and hardware to empower the new era of AI 2.0

From the perspective of improving the energy efficiency of large models at the edge, one path is scaling down through more advanced process nodes, but this is difficult to sustain given the slowing of Moore's Law and the international situation. Another is new devices and new systems, but their application maturity still needs further improvement. The most realistic path at present is software-hardware co-optimization for large-model applications: new hybrid quantization methods and sparsity processing in software, plus hardware acceleration designed around the algorithmic structures common to large models, together yielding an overall energy-efficiency improvement of 2-3 orders of magnitude.
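How such gains could compound into 2-3 orders of magnitude is sketched below with hypothetical round-number factors. The individual multipliers are our illustrative assumptions, not figures from the talk; the point is that independent software and hardware gains multiply.

```python
# Hypothetical illustration: independent efficiency gains multiply.
gains = {
    "low-bit quantization (FP16 -> ~4-bit)": 4.0,
    "sparsity (skip most attention/KV work)": 10.0,
    "dedicated acceleration of common structures": 5.0,
}

total = 1.0
for name, factor in gains.items():
    total *= factor
    print(f"{name}: x{factor:g} (cumulative x{total:g})")

# 4 * 10 * 5 = x200, i.e. on the order of 2+ orders of magnitude.
print(f"combined: ~x{total:g}")
```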

In-depth optimization for new requirements of large model tasks

Chaoxing Future achieves industry-leading AGI computing on the edge side

Chaoxing Future provides energy-efficient computing solutions for a wide range of edge intelligence scenarios, centered on AI computing chips and built on software-hardware co-design, and is committed to becoming a leader in edge-side AGI computing.

l "Pinghu/Gaoxia" NPU: The team has sharpened a sword for ten years to achieve the ceiling of the performance industry

For the neural network computing tasks required by intelligent driving and large models, Chaoxing Future has developed its own high-performance AI processing cores, "Pinghu" and "Gaoxia". The "Pinghu" NPU provides efficient computing for perception tasks that use CNNs and a small amount of Transformer computation, while the "Gaoxia" NPU is an acceleration core designed specifically for high-level intelligent driving and real-time large-model processing.

Compared with a widely recognized competitor in the market, the "Pinghu" NPU delivers 10 times the inference frame rate per unit of computing power on CNN tasks and 25 times on Transformer models.

The "Gaoxia" NPU architecture adopts a mixed-granularity instruction set design, a single cluster can achieve 40TOPS computing power, supports INT4/INT8/FP8/BF16 different computing accuracy, and optimizes the internal cache design, and designs a special acceleration structure for Sparse Attention and 3D sparse convolution. Through these optimized designs, the "Gaoxia" NPU realizes real-time computing support for typical generative large models, and the generation speed of LLaMA3-8B can reach up to 60 tokens/s. In addition, the "High Gap" NPU can achieve nearly equivalent 3D sparse convolution processing rates with 1% of the computational logic area compared to the NVIDIA Orin chip.

l "Stinging" series chips: It has been implemented in batches in many fields, and the latest products realize real-time computing on the edge side of large models

Based on its self-developed NPU cores, Chaoxing Future released the edge-side AI computing chip "Jingzhe R1" at the end of 2022, with 16TOPS@INT8 of NPU computing power and a typical power consumption of only 7-8 W, enabling passively cooled designs across a variety of system solutions. "Jingzhe R1" has been deployed at scale in automobiles, electric power, coal mining, and robotics.


Chaoxing Future will also release the next generation of "Jingzhe" series chips, which will enable real-time large-model processing, matching on a 12 nm process the performance of SOTA mobile phone chips such as the Snapdragon 8 Gen 3 and Dimensity 9300. Along its chip product roadmap, the company will maintain the scalability of its product matrix, upgrading from edge perception to intelligent driving and gradually moving toward the "Robot-Brain".

l "Luban" model deployment tool chain: integrates a new method of large model optimization, and software collaboration achieves 40 times performance improvement

On top of this efficient hardware architecture, Chaoxing Future has built the deeply optimized "Luban" toolchain for neural network applications, which can raise edge-side inference speed by more than 40 times. It includes:

(1) Industry-leading mixed-precision quantization tools, supporting PTQ/QAT/AWQ workflows and INT4/INT8/FP8/BF16 precisions, with quantization loss below 1% (a minimal PTQ sketch follows this list);

(2) Efficient model optimization tools, supporting sensitivity analysis, distillation, and LoRA, achieving compression rates above 10x with accuracy loss below 1%;

(3) High-performance compilation tools, providing rich computation-graph optimizations and efficient instruction scheduling for heterogeneous cores, improving inference efficiency by a further 4-5x or more.
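To make the quantization step in (1) concrete, here is a minimal sketch of symmetric per-tensor INT8 post-training quantization (PTQ) of a weight matrix: choose a scale, round to INT8, and measure the reconstruction error. It illustrates the kind of operation such a toolchain automates; it is not the actual "Luban" implementation, which also covers per-channel scales, QAT, and AWQ.

```python
# Symmetric per-tensor INT8 PTQ: quantize weights, then check reconstruction.
import numpy as np

def ptq_int8(w):
    scale = np.abs(w).max() / 127.0            # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)
q, scale = ptq_int8(w)
w_hat = q.astype(np.float32) * scale           # dequantized approximation
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative quantization error: {rel_err:.2%}")
```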

For large-model tasks in particular, "Luban" can reduce the average weight bit width to 2.8 bits through its distinctive combination of sparse outlier preservation and mixed bit-width quantization. With sparse masking, LLaMA3-8B can be compressed by more than 90% while retaining comparable model capability, greatly reducing the model's parameter and compute cost.
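An average figure like 2.8 bits can be understood through simple bookkeeping: keep a small fraction of outlier weights at high precision and push the rest to very low bit widths. The sketch below illustrates this general idea with hypothetical fractions of our own choosing; it is not Chaoxing Future's actual method or its exact bit allocation.

```python
# Outlier-preserving mixed bit-width quantization: bookkeeping sketch.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1_000_000)

# Keep the largest ~1% of weights (by magnitude) at FP16.
outlier_frac = 0.01
cut = np.quantile(np.abs(w), 1 - outlier_frac)
outliers = np.abs(w) >= cut
print(f"outliers kept in FP16: {outliers.sum()} of {w.size}")

# Hypothetical split for the remaining dense weights.
frac_int4, frac_int2 = 0.25, 0.74              # plus the 1% FP16 outliers
avg_bits = outlier_frac * 16 + frac_int4 * 4 + frac_int2 * 2
print(f"average bits/weight: {avg_bits:.2f}")  # ~2.64 under these assumptions
```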

l "Cangjie" data closed-loop platform: realize automatic data production and build a closed-loop application iteration

In the era of large models, high-quality algorithm iteration requires powerful data closed-loop tools. Chaoxing Future has therefore built the "Cangjie" platform, which covers data management, data mining, data augmentation, ground-truth production, model production, and algorithm evaluation, and applies large models at many of these stages to enhance them.

Built on this platform, a complete pipeline lets customers collect effective data from the field with minimal manual involvement, automating data mining and annotation and thereby enabling data-driven algorithm iteration. The "Cangjie" platform already serves automakers, Tier 1 suppliers, and other customers, and its capabilities are being extended to robotics customers.

Keep your feet on the ground and move forward quickly

Providing customers with efficient "AI+" capabilities

Drawing on the team's more than ten years of R&D and hands-on experience in AI, Chaoxing Future closely follows the development path from AI 1.0 to AI 2.0, continuously polishing its core products and bringing AI+X applications to the ground.

In edge-side scenarios, Chaoxing Future's chips have landed in volume in pan-security fields such as electric power and coal mining, generating revenue at scale; these deployments continuously iterate the surrounding product ecosystem and feed back into long-term directions such as intelligent driving and AGI. "In the current harsh market environment, rapid landing is the key to survival."


In intelligent driving scenarios, the "Jingzhe" series chips support multi-tier intelligent driving solutions, such as smart front-view all-in-one units, dual-camera solutions, cost-effective 5-7V driving-parking integration, and high-performance 11V1L driving-parking integration, covering the mainstream driving, parking, and stereo-vision functions shared by intelligent driving and robotics. The related reference solutions have been integrated and engineering-optimized on real vehicles. Chaoxing Future has already cooperated with an industry-leading commercial-vehicle OEM and reached business cooperation with a number of passenger-car OEM customers, with mass deployment on vehicles expected as early as 2025.

In edge-side large-model inference scenarios, leveraging the software-hardware co-optimization of the "Luban" toolchain, Chaoxing Future's latest chip achieves generation speeds above 15 tokens/s on its verification platform, demonstrating that a 10 W chip can support the edge deployment of high-performance large models. On the "Gaoxia" NPU platform, Stable Diffusion 1.5 completes image generation in 3.5 s. Based on these capabilities, Chaoxing Future has reached cooperation agreements with leading robotics customers and large-model vendors.

The road is long, but those who keep walking will arrive

Jointly build a new era of AI 2.0

"Our prediction and awareness of technological development are usually underestimated and lagging behind, and once the development of technology breaks through a certain threshold, it will explode in growth and coverage, such as from the release of ChatGPT to today's 'thousand-model war'. Whether it is high-level intelligent driving or general-purpose robot applications, as long as the technical paradigm is correct and people and funds continue to be invested, the 'ChatGPT moment' will definitely come, and this moment may come sooner than we think." Liang Shuang said, "In the future, Chaoxing looks forward to working with all partners to gradually move forward from the AI 1.0 era and build a new era of AI 2.0." ”
