
Smooth, real-time AI video decoding on a smartphone: Qualcomm develops the first neural video decoder

Author: Heart of the Machine Pro

Report from the Heart of the Machine editorial department

It turns out that decoding video with a neural network can be surprisingly efficient.

With advances in communication and Internet technology, especially the popularity of smartphones and the maturation of 4G and 5G mobile communications, diverse forms of video entertainment such as video chat and video gaming keep emerging, and ordinary users' appetite for video keeps growing. Cisco's 2018 Visual Networking Index report predicted that by 2022, 82% of Internet traffic would be generated by video.

Beyond everyday entertainment and communication, video also plays a role in many industry scenarios: the security field built around video technology, video surveillance and worker-behavior recognition in smart factories, real-time environment perception from camera video in assisted and automated driving, and the live-streaming commerce that more and more celebrities have joined in recent years. Meanwhile, with the rapid development of computer vision (CV) in AI, the combination of CV and video will play an indispensable role in ever more applications.


However, massive volumes of video data pose enormous challenges for transmission, storage, and other processing, which makes technologies such as video compression and codecs crucial. When watching video, users want higher image quality and smoother playback, and that depends on more efficient video processing. For many years, video decoding on computers was mostly handled by the CPU, which is easy to use but not very efficient. Decoding on the GPU is another option. With the rise of applications such as short video, real-time decoding on the dedicated decoding units of smartphones and other mobile devices has become a new direction, one that matters greatly for real-time services such as live streaming.

At the same time, as deep neural networks advance, more and more companies are exploring how to bring them into their products. The AI Engine in Qualcomm Snapdragon SoCs incorporates extensive neural network capabilities: on the hardware side, the Hexagon vector processor supports 8-bit fixed-point acceleration of neural network operations, and on the software side, the Snapdragon Neural Processing Engine (SNPE) SDK supports CNNs, LSTMs, and custom layers.

The flagship Snapdragon 888 SoC integrates the sixth-generation Qualcomm AI Engine, delivering 26 TOPS of AI compute, and its Neural Processing SDK brings a series of improvements, including support for RNN models, lifting on-device AI performance to a new level.

So, can the considerable compute in the AI Engine be applied more broadly and deeply to video? Recently, Qualcomm made such an attempt, using the Snapdragon 888's built-in AI Engine and CPU for video decoding. The results show that neural-network-based video decoding performs quite well.

This new work from Qualcomm AI Research delivers the industry's first neural video decoder to run in real time on a commercial smartphone through a combination of software and hardware, achieving real-time decoding at more than 30 fps on video at nearly 720p HD resolution.


From software and hardware decoding to neural video decoding

As an important video processing technology, video codecs are widely used in communications, computing, and broadcasting, and have spawned practical applications such as Internet TV, broadcast television, digital cinema, distance education, and video conferencing.

In terms of its main role, video coding pursues, within the available computing resources, the highest possible reconstruction quality at the highest possible compression ratio, so as to meet bandwidth and storage requirements. A video codec is a program or device that compresses or decompresses digital video.
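This trade-off is usually written as a rate-distortion objective: minimize the rate plus λ times the distortion. Below is a minimal Python sketch with made-up numbers, purely to illustrate how a codec picks an operating point; none of the candidate values come from a real encoder.

```python
# Rate-distortion trade-off: loss = rate + lambda * distortion.
# The (bits per pixel, mean squared error) pairs below are illustrative only.
candidates = [
    (0.10, 41.0),
    (0.25, 18.5),
    (0.50, 7.2),
    (1.00, 2.1),
]

def rd_loss(bpp: float, mse: float, lam: float = 0.1) -> float:
    """Rate-distortion loss: lower is better; lam trades size for quality."""
    return bpp + lam * mse

best = min(candidates, key=lambda c: rd_loss(*c))
print(f"chosen operating point: {best[0]} bpp at MSE {best[1]}")
```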

For a long time, CPU-based software decoding ("soft decoding") dominated the market, for example the video codec engines shipped with Intel CPUs and the libavcodec decoder in the open-source FFmpeg project. Soft decoding is easy to use, but it occupies CPU resources, increases power consumption, and is not very efficient, making stutters, visual artifacts, and other anomalies more likely and interfering with other running applications.

As a result, encoding and decoding video on GPUs or dedicated processors ("hard decoding") became another option. NVIDIA's GPU-based hardware codec module NVCODEC (the NVENC/NVDEC engines), for example, not only achieves good codec performance but also, by encoding on the graphics card, avoids consuming excessive system resources or hurting application performance.

However, growing demand for video consumption places higher requirements on future video codecs, which should offer the following capabilities:

Direct optimization of bit rate and perceptual quality metrics

Simplified codec development

Intrinsic massive parallelism

The ability to run efficiently on already-deployed hardware

Downloadable codec updates

With major advances in deep neural network (DNN) technology and its wide adoption in computer vision and communication systems, neural-network-based video codecs have the potential to provide all of the desired capabilities above. Specifically, such codecs can not only run on AI hardware accelerators developed for other AI applications, but also enable more efficient parallelization of entropy coding.

Driven by this potential, neural video codecs have become a popular research topic over the past few years: the hyperprior autoencoder proposed by Google in 2017, the end-to-end deep video compression framework proposed by Shanghai Jiao Tong University and other institutions in 2018, and Scale-Space Flow, the end-to-end optimized video compression model from the Google Research Perception team in 2020. These neural video codecs exhibit impressive compression performance and are closing the gap with traditional codecs.
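As a rough illustration of the pattern these models share (not the architecture of any specific paper above), a learned codec pairs an analysis transform that maps pixels to a compact latent with a synthesis transform that maps quantized latents back to pixels. A toy PyTorch sketch, orders of magnitude smaller than the real models:

```python
import torch
import torch.nn as nn

class TinyNeuralCodec(nn.Module):
    """Toy analysis/synthesis pair; illustrative only."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.encode = nn.Sequential(      # analysis transform (pixels -> latents)
            nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2),
        )
        self.decode = nn.Sequential(      # synthesis transform (latents -> pixels)
            nn.ConvTranspose2d(ch, ch, 5, stride=2, padding=2,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 5, stride=2, padding=2,
                               output_padding=1),
        )

    def forward(self, x):
        # Quantize latents; real training uses additive noise or a
        # straight-through estimator, and an entropy coder turns the
        # quantized latents into the actual bitstream.
        y_hat = torch.round(self.encode(x))
        return self.decode(y_hat)

frame = torch.rand(1, 3, 704, 1280)        # one frame at the demo resolution
print(TinyNeuralCodec()(frame).shape)      # torch.Size([1, 3, 704, 1280])
```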


In some comparisons, AI-based compression already shows a clear advantage.

It should also be noted, however, that bringing AI research from the lab into real-world scenarios is rarely easy, which means the practical deployment of neural video codecs is challenging. Most related studies rely on wall-powered, high-end GPUs with floating-point arithmetic, and the network architectures are usually not optimized for fast inference. For mobile devices with fixed compute, power, and thermal constraints, running real-time inference with such decoder models is therefore impractical or impossible.

On a commercial smartphone powered by the Snapdragon 888 SoC, Qualcomm AI Research has now achieved a breakthrough: a neural video decoder built on a combination of software and hardware.

Leveraging the Snapdragon 888's CPU and AI Engine to achieve 30+ fps HD video decoding

Drawing on its expertise in power-efficient AI and the strong AI compute of the Snapdragon 888 platform, Qualcomm implemented real-time intra-frame neural video decoding on a commercial smartphone. Intra-frame coding in High Efficiency Video Coding (HEVC) can be seen as an extension of Advanced Video Coding (AVC) that encodes using spatial sample prediction. The processing steps shared by intra-frame and inter-frame coding include transform, quantization, entropy coding, and so on. To this end, Qualcomm AI Research optimized the following aspects (a code sketch follows the list):

Redesigning the network architecture to reduce complexity;

Quantizing and accelerating the neural network on an AI inference engine;

Parallel entropy coding.
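To make the shared coding steps concrete, here is a toy, self-contained illustration of transform, quantization, and entropy coding on a single 8×8 block. Real codecs use block-wise integer transforms and arithmetic coding; zlib is only a stand-in for the entropy coder:

```python
import zlib
import numpy as np
from scipy.fft import dctn, idctn

block = np.random.rand(8, 8).astype(np.float32)     # one 8x8 block of pixels
coeffs = dctn(block, norm="ortho")                  # transform
q = np.round(coeffs / 0.1).astype(np.int8)          # quantization (step 0.1)
bitstream = zlib.compress(q.tobytes())              # entropy-coding stand-in
print(len(bitstream), "bytes vs", block.nbytes, "bytes of raw pixels")

# Decoder side: entropy-decode, dequantize, inverse transform.
q_dec = np.frombuffer(zlib.decompress(bitstream), np.int8).reshape(8, 8)
recon = idctn(q_dec.astype(np.float32) * 0.1, norm="ortho")
```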

Based on the above optimizations, Qualcomm used the CPU and AI Engine of the Snapdragon 888 mobile platform to build a neural video decoder combining software and hardware, decoding HD video at 1280×704 resolution at more than 30 fps without any help from the hardware video decoding unit. The Snapdragon 888 integrates the sixth-generation Qualcomm AI Engine; designed as a complete system of cooperating processors, this generation includes the redesigned Hexagon 780 processor and brings AI into high-speed connectivity, professional imaging, gaming, and many other experiences.


8-bit model with efficient decoding performance

Decoder architecture optimization, parallel entropy coding (PEC), and AIMET quantization-aware training are the three key steps by which Qualcomm AI Research achieved efficient neural decoding on a smartphone.


In the first step, starting from a state-of-the-art frame compression network, the decoder architecture is optimized by pruning channels and streamlining network operations, relying on the Snapdragon 888's built-in AI Engine for acceleration and reduced computational complexity.
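Channel pruning itself is a generic technique. Below is a hedged sketch of one common recipe (keeping the output channels with the largest L1 weight norm); it is not Qualcomm's exact procedure:

```python
import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, keep: int) -> nn.Conv2d:
    """Keep the `keep` output channels with the largest L1 weight norm."""
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # per-channel L1
    idx = scores.topk(keep).indices
    slim = nn.Conv2d(conv.in_channels, keep, conv.kernel_size,
                     conv.stride, conv.padding,
                     bias=conv.bias is not None)
    slim.weight.data = conv.weight.data[idx].clone()
    if conv.bias is not None:
        slim.bias.data = conv.bias.data[idx].clone()
    return slim                 # downstream layers must be re-wired to match

conv = nn.Conv2d(64, 128, 3, padding=1)
print(prune_conv_channels(conv, keep=96))   # Conv2d(64, 96, ...)
```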

The second step is a fast parallel entropy decoding algorithm that exploits data-level and thread-level parallelism for higher entropy decoding throughput. In Qualcomm's scheme, the Snapdragon 888's CPU handles the parallel entropy decoding.
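The property that makes this possible is that chunks of the bitstream are coded independently, so each can be decoded on its own thread. A minimal sketch of the idea, with zlib standing in for the real entropy coder and the chunking scheme assumed (zlib also releases the GIL, so the threads genuinely run in parallel):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Eight independently coded chunks, as an encoder might emit them.
chunks = [zlib.compress(bytes(range(256)) * 64) for _ in range(8)]

def decode_chunk(blob: bytes) -> bytes:
    return zlib.decompress(blob)        # no dependency on other chunks

with ThreadPoolExecutor(max_workers=4) as pool:
    latents = list(pool.map(decode_chunk, chunks))
print(sum(len(c) for c in latents), "decoded bytes")
```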

In the third step, the weights and activations of the optimized model are quantized to 8 bits, and the rate-distortion performance lost to quantization is then recovered through quantization-aware training. This uses the AI Model Efficiency Toolkit (AIMET), which the Qualcomm Innovation Center open-sourced in May 2020 as a library of advanced quantization and compression techniques for neural network models.
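AIMET's public documentation describes a quantization-simulation flow along these lines; the sketch below follows it, but exact arguments can differ across AIMET versions, and the tiny `decoder` model and calibration pass are placeholders:

```python
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

# Placeholder stand-in for the real decoder network.
decoder = torch.nn.Sequential(torch.nn.Conv2d(64, 3, 3, padding=1))
dummy_input = torch.rand(1, 64, 176, 320)

# Simulate 8-bit weights and activations.
sim = QuantizationSimModel(
    decoder, dummy_input=dummy_input,
    quant_scheme=QuantScheme.training_range_learning_with_tf_init,
    default_param_bw=8, default_output_bw=8)

def calibrate(model, _):
    with torch.no_grad():
        model(dummy_input)              # pass calibration data through

sim.compute_encodings(calibrate, None)
# ...then fine-tune sim.model with the usual rate-distortion loss to
# recover the quality lost to 8-bit quantization...
```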

Through these three steps, Qualcomm AI Research built an 8-bit model with efficient decoding performance.

The results of AI decoding

For the demo setup, Qualcomm AI Research selected video at 1280×704 resolution (close to 720p HD) and generated a compressed bitstream offline by running the encoder network and entropy coding. The compressed bitstream is then processed on a Snapdragon 888 mobile device (a commercial smartphone) by parallel entropy decoding and the decoder network, with parallel entropy decoding running on the CPU and the decoder network accelerated by the sixth-generation Qualcomm AI Engine.


In the end, Qualcomm AI Research obtained a neural decoding algorithm that decodes 1280×704 video at more than 30 frames per second. In the live demonstration on a commercial smartphone, the upper right corner shows the video decoding speed (Speed) and the number of loops over the same video (Loop), along with the running average bit rate (Bit Rate) and the bits per pixel of each video frame (BPF).
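For intuition about how these on-screen metrics relate: bit rate equals bits per pixel times pixels per frame times frames per second. A back-of-the-envelope calculation with an assumed bits-per-pixel value (not a measured number from the demo):

```python
# bit rate = bpp * width * height * fps; the bpp value here is assumed.
width, height, fps = 1280, 704, 30
bpp = 0.12
bitrate_mbps = bpp * width * height * fps / 1e6
print(f"{bitrate_mbps:.1f} Mbit/s")   # ~3.2 Mbit/s at these settings
```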

In the demo, the video and decoding parameters were set to high quality, and a series of challenging, finely textured natural scenes were selected. While decoding at more than 30 frames per second, the neural decoding network accurately preserves rich visual structures and textures, reproducing the scenes very well. The bit rate matches the chosen all-intra quality configuration, indicating that this neural video decoder can support the data throughput required for high-quality video streaming.


Because AI-based codecs can synthesize visual detail that is not present in the bitstream, they should achieve the same or higher video quality at a lower bit rate than traditional codecs. This also means video codecs will increasingly be driven by a combination of hardware and software: any new codec can be handled by the SoC's CPU and built-in AI accelerator, provided they are powerful enough.

Currently, this neural video decoder supports only intra-frame decoding, meaning each video frame is decoded independently, without exploiting the small changes between frames the way other video codecs do. Qualcomm reportedly will continue working on inter-frame video decoding that runs in real time on mobile devices. The sketch below contrasts the two modes.
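The contrast fits in a few lines; `net`, `decode_intra`, and `decode_inter` are illustrative placeholders, not real APIs:

```python
def decode_intra(frames, net):
    return [net(f) for f in frames]      # no state carried across frames

def decode_inter(frames, net):
    recon, prev = [], None
    for f in frames:
        prev = net(f, prev)              # conditions on the previous reconstruction
        recon.append(prev)
    return recon
```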

As for the significance of this work: although real-time decoding of 30+ fps HD video on the Snapdragon 888 SoC still leaves room for improvement, unleashing on-device AI compute and imaging capabilities can bring richer video applications and a clearer, smoother viewing experience to smartphone users. For example, the recently released Snapdragon 888 Plus mobile platform, although only a partial upgrade over the Snapdragon 888, raises AI compute to a striking 32 TOPS; combined with Qualcomm's continued in-depth research, it is foreseeable that real-time AI HD video decoding will soon improve further.

Beyond mobile phone platforms, Qualcomm has also brought AI video processing to other platforms such as PCs, XR devices, and automobiles. For example, the AI-enabled Spectra ISP of the Snapdragon XR2, the world's first 5G extended reality platform, delivers 11 times the AI performance of the original XR platform, greatly improving video processing; on the PC side, the AI-capable Spectra ISP of the second-generation Snapdragon 8cx 5G compute platform supports 4K HDR video capture and background bokeh; and the 4th-generation Snapdragon automotive digital cockpit platform, with enhanced graphics, computer vision, and AI capabilities, can provide drivers and passengers with a smarter, more comfortable video experience.

From a broader perspective, then, using AI compute for video processing represents a future direction that is bound to empower more application scenarios.

Reference Links:

https://segmentfault.com/a/1190000038930366

https://cnx-software.cn/2021/06/28/neural-video-decoder-leverages/

https://finance.sina.com.cn/tech/2021-02-04/doc-ikftssap3428399.shtml

https://www.leiphone.com/category/industrynews/0YEJQiyu3umwEyjJ.html
