
Meta develops a new virtual-background AI so that human figures in the metaverse are no longer blurry


Reporting by XinZhiyuan

Editor: Yuan Xie La Yan

To give video call users a better experience, and to win more AR and VR users over to the metaverse, Meta's AI research team recently developed an AI model that handles virtual backgrounds better.

Since the start of the COVID-19 pandemic, most people have grown accustomed to remote video calls with friends, colleagues, and family, and virtual backgrounds have become a fixture of video chats.

Changing the background during a video call gives users control over the environment shown in their virtual image: it reduces distraction, protects privacy, and can even make the user look more energetic on camera.


However, the virtual background does not always behave the way the user expects. Most people have seen it block part of the face when they move, or fail to find the boundary between a hand and the table.

Recently, Meta has used enhanced AI models for image segmentation, optimizing the background blur, virtual background, and AR effects of its products so they can better distinguish the different parts of photos and videos.

A cross-departmental team of researchers and engineers from Meta AI, Reality Labs, and other parts of Meta recently developed new image segmentation models, which are already used for live video calling on platforms such as Portal, Messenger, and Instagram, and in augmented reality applications for Spark AR.

The team also optimized a two-person image segmentation model, which is already deployed on Instagram and Messenger.

How to get AI to improve the virtual background

In optimizing image segmentation, the team faced three major challenges:

1. Teach the AI to recognize people reliably in different environments: dark scenes, varied skin tones, skin tones close to the background color, unusual poses (such as bending over to tie shoelaces, or stretching), occluded subjects, moving subjects, and so on.

2. Make the edges look smoother, more stable, and more coherent. These characteristics are rarely discussed in current research, but user feedback studies show that they greatly affect people's experience with background effects.

3. Ensure the model runs flexibly and efficiently on billions of smartphones around the world. It is not acceptable for it to work only on a small selection of the most advanced phones with the latest processors.

Moreover, the model must support the various aspect ratios of different devices, so it works correctly on laptops, on Meta's portable video calling devices, and on phones in both portrait and landscape mode.


An example of virtual backgrounds processed with Meta's AI model: a head-and-shoulders portrait on the left and a full-body portrait on the right.

The challenge of real-world person segmentation models

The concept of image segmentation is easy to grasp, but highly accurate person segmentation is hard to achieve. For good results, the model processing the image must be extremely consistent and have very low latency.

Incorrect segmentation output produces all kinds of effects that distract users of virtual backgrounds. Worse, segmentation errors can unintentionally expose the user's real physical environment.

For these reasons, the accuracy of an image segmentation model must exceed 90% intersection over union (IoU) before it can ship in an actual product. IoU is a standard metric that measures the overlap between a predicted segment and the ground truth.
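
As a concrete reference, here is a minimal sketch of how IoU is computed for binary segmentation masks. It is illustrative code, not Meta's implementation:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU = |pred AND gt| / |pred OR gt| for boolean masks of equal shape."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty masks count as a perfect match.
    return float(intersection) / float(union) if union > 0 else 1.0
```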

Given the huge variety of scenes and subjects, the final 10% of segmentation quality proved far harder for Meta to achieve than everything before it.

Meta's software engineers found that once IoU reaches 90%, the measurable image metrics tend to saturate, while temporal consistency and spatial stability remain hard to improve.

To overcome this hurdle, Meta developed a video-based measurement system that, together with several other metrics, addresses this additional difficulty.

Develop AI training and measurement strategies for real-world applications

AI models can only learn from the data they are given. To train a high-precision image segmentation model, it is therefore not enough to record large numbers of videos of users sitting in bright rooms; the sample types must be as rich and as close to the real world as possible.

Meta's AI lab used its own ClusterFit model to extract usable data from massive samples covering different genders, skin tones, ages, body poses, movements, complex backgrounds, and numbers of people.

Metrics on still images do not accurately reflect the quality of a model's real-time processing of dynamic video, because real-time models often rely on temporal information for tracking. To measure real-time quality, the Meta AI lab designed a quantitative video evaluation framework that computes the metrics for every frame as the model predicts it.
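
A per-frame evaluation loop of this kind might look like the sketch below, where `model`, `frames`, and `gt_masks` are hypothetical placeholders for a real pipeline:

```python
import numpy as np

def evaluate_video(model, frames, gt_masks):
    """Mean per-frame IoU over one video (illustrative, not Meta's system)."""
    scores = []
    for frame, gt in zip(frames, gt_masks):
        pred = model(frame) > 0.5                      # binarize the soft mask
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        scores.append(inter / union if union else 1.0)
    return float(np.mean(scores))
```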

Unlike the idealized settings found in papers, Meta's person segmentation model is judged by huge numbers of everyday users. If the output shows jagged edges, distortion, or other unpleasant effects, beating a benchmark score is useless.

The Meta AI lab therefore asked its own product users directly how they rated the segmentation results. The answer: rough, blurry edges hurt the user experience most.

In response, the Meta AI lab added a new "boundary IoU" metric to its video evaluation framework. Once ordinary IoU exceeds 90% and is nearly saturated, boundary IoU becomes the more important indicator to watch.
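
One simple way to approximate such a boundary metric is to restrict IoU to a thin band around each mask's edge. The sketch below, which is not necessarily Meta's exact formulation, obtains that band by subtracting an eroded mask from the original:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def _iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def boundary_band(mask: np.ndarray, width: int = 5) -> np.ndarray:
    """Pixels within `width` of the mask edge: the mask minus its erosion."""
    return np.logical_and(mask, ~binary_erosion(mask, iterations=width))

def boundary_iou(pred, gt, width: int = 5) -> float:
    """IoU computed only over the two boundary bands."""
    return _iou(boundary_band(pred, width), boundary_band(gt, width))
```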

Moreover, a lack of temporal consistency makes object edges flicker and blur, which also hurts the user experience. The Meta AI lab uses two methods to measure the temporal consistency of the output.

First, the Meta researchers assume that two immediately adjacent frames are essentially identical, so any difference between the model's predictions for them indicates temporal inconsistency in the final output.

Second, the Meta researchers start from the foreground motion between two adjacent frames. Optical flow in the foreground lets them warp the model's prediction from frame N forward to frame N+1, which they then compare with the actual prediction for frame N+1.

The differences measured by both methods are expressed as IoU scores.
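
The flow-based check could be sketched as follows, using OpenCV's Farneback dense optical flow; the function name and parameters are illustrative assumptions, not Meta's production pipeline:

```python
import cv2
import numpy as np

def flow_warped_agreement(prev_gray, next_gray, prev_mask, next_mask):
    """Warp the frame-N mask onto frame N+1 with dense optical flow, then
    measure IoU against the frame-N+1 prediction (higher = more stable)."""
    # Flow from frame N+1 back to frame N enables backward sampling.
    flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_mask.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (gx + flow[..., 0]).astype(np.float32)
    map_y = (gy + flow[..., 1]).astype(np.float32)
    warped = cv2.remap(prev_mask.astype(np.float32), map_x, map_y,
                       cv2.INTER_NEAREST) > 0.5
    union = np.logical_or(warped, next_mask).sum()
    return np.logical_and(warped, next_mask).sum() / union if union else 1.0
```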

The Meta AI lab evaluated its models on 1,100 video samples spanning dozens of categories and more than 100 demographic groups, annotated by gender and by skin tone on the Fitzpatrick scale.

The analysis shows that the Meta AI model delivers consistently high accuracy across all demographic subgroups, with IoU above 95% and a gap of only about 0.5 percentage points between categories: excellent, reliable performance.


IoU results of Meta's AI model on videos of people of different skin tones and genders.

Optimize the model

Architecture

Meta's researchers used FBNet V3 as the backbone of the optimized model, paired with a decoder that fuses multiple layers sharing the same spatial resolution.

The researchers found that an architecture with a heavyweight encoder and a lightweight decoder performs better than a fully symmetrical design. The resulting architecture was derived with neural architecture search and is highly optimized for on-device speed.
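
The sketch below illustrates the general idea of such an asymmetric design: a heavier strided-convolution encoder and a very light decoder of 1x1 projections that fuses features at matching resolutions. It is a toy stand-in, not the actual FBNet V3-based network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegNet(nn.Module):
    """Toy asymmetric encoder-decoder for person segmentation."""
    def __init__(self):
        super().__init__()
        # Heavyweight encoder: three strided conv stages.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU())
        # Lightweight decoder: 1x1 projections plus bilinear upsampling.
        self.proj3 = nn.Conv2d(128, 32, 1)
        self.proj2 = nn.Conv2d(64, 32, 1)
        self.proj1 = nn.Conv2d(32, 32, 1)
        self.head = nn.Conv2d(32, 1, 1)  # single-channel person mask

    def forward(self, x):
        f1 = self.stage1(x)   # 1/2 resolution
        f2 = self.stage2(f1)  # 1/4 resolution
        f3 = self.stage3(f2)  # 1/8 resolution
        y = self.proj3(f3)
        y = F.interpolate(y, size=f2.shape[-2:], mode="bilinear", align_corners=False)
        y = y + self.proj2(f2)  # fusion point at 1/4 resolution
        y = F.interpolate(y, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        y = y + self.proj1(f1)  # fusion point at 1/2 resolution
        y = F.interpolate(y, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(y))
```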


Semantic segmentation model architecture. The green rectangle represents the convolutional layer, and the black circle represents the fusion point of the layers.

Data learning

The researchers used a large offline PointRend model to generate pseudo ground-truth labels for unannotated data, increasing the amount of training data. They also used a teacher-student semi-supervised approach to reduce bias in the pseudo labels.
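
In outline, pseudo-labeling works like the following sketch, where `teacher` stands in for the large offline model (PointRend in Meta's case) and the thresholded outputs become training targets for the smaller student model:

```python
import torch

def make_pseudo_labels(teacher, unlabeled_images, threshold=0.5):
    """Label unannotated images with an offline 'teacher' model so a smaller
    on-device 'student' can train on them (illustrative names and shapes)."""
    teacher.eval()
    labels = []
    with torch.no_grad():
        for img in unlabeled_images:                 # img: (3, H, W) tensor
            prob = teacher(img.unsqueeze(0))[0]          # soft mask
            labels.append((prob > threshold).float())    # hard pseudo label
    return labels
```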

Aspect ratio-dependent resampling

Traditional deep learning models resample each image into a small square before feeding it to the neural network. This resampling distorts the image, and because frames come in different aspect ratios, the amount of distortion varies.

Distortion, and its varying degree, pushes the network to learn low-level features that are not robust, and the resulting limitations are magnified in image segmentation applications.

As a result, if most of the training images are portrait-oriented, the model performs much worse on images and videos with other aspect ratios.

To solve this problem, the research team adopted Detectron2's aspect-ratio-dependent resampling: images with similar aspect ratios are grouped together and resampled to the same size.
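
The grouping step can be sketched as bucketing images by their closest aspect ratio before resizing each bucket to a shared size; the bucket list here is an arbitrary example, not Detectron2's actual configuration:

```python
from collections import defaultdict

def bucket_by_aspect_ratio(images, buckets=((1, 1), (3, 4), (4, 3), (9, 16))):
    """Group images so each bucket can be resampled to one shared size,
    keeping per-image distortion small. Buckets are (width, height) ratios."""
    groups = defaultdict(list)
    for img in images:
        h, w = img.shape[:2]
        ratio = w / h
        # Assign each image to the bucket with the closest aspect ratio.
        best = min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))
        groups[best].append(img)
    return groups
```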


On the left, a baseline image distorted by an irregular aspect ratio; on the right, the improved result after processing by the AI model.

Custom padding

Aspect-ratio-dependent resampling requires padding images with similar aspect ratios to a common size, but the usual zero padding produces artifacts.

Worse still, as the network deepens, the artifacts spread to other regions. Replicate padding has been used in the past to remove these artifacts.

A recent study showed that reflection padding in the convolutional layers can further improve model quality by minimizing artifact propagation, though at a corresponding cost in latency.
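
In a framework such as PyTorch, these three padding strategies correspond to the convolution's `padding_mode` option, as the hedged example below shows:

```python
import torch.nn as nn

# How the convolution fills the border controls whether padding artifacts appear.
conv_zero = nn.Conv2d(32, 32, 3, padding=1, padding_mode="zeros")      # default; can create border artifacts
conv_repl = nn.Conv2d(32, 32, 3, padding=1, padding_mode="replicate")  # repeats the edge pixels
conv_refl = nn.Conv2d(32, 32, 3, padding=1, padding_mode="reflect")    # mirrors the interior; fewer artifacts, more latency
```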


Tracking

Temporal inconsistency makes the model's predictions differ from frame to frame, producing flicker that severely damages the user experience.

To improve temporal consistency, the researchers designed a process they call "mask detection." It takes the three channels of the current frame (YUV) plus a fourth channel.

For the first frame, the fourth channel is just an empty matrix; for every subsequent frame, it is the prediction from the previous frame.
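
Concretely, assembling that four-channel input might look like this sketch (tensor shapes and names are assumptions, not Meta's code):

```python
import torch

def build_tracking_input(yuv_frame, prev_mask=None):
    """Stack the 3 YUV channels with the previous frame's predicted mask as
    a 4th channel. `yuv_frame`: (3, H, W); `prev_mask`: (1, H, W) or None."""
    if prev_mask is None:                       # first frame: empty matrix
        prev_mask = torch.zeros_like(yuv_frame[:1])
    return torch.cat([yuv_frame, prev_mask], dim=0)  # shape (4, H, W)
```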

The researchers found that this fourth-channel tracking strategy significantly improves temporal consistency. They also drew on state-of-the-art tracking models, such as CRVOS and transformation-invariant CNN strategies, to obtain a more temporally stable segmentation model.


"Mask Detection" Method Flowchart

Boundary cross entropy

Producing smooth, sharp boundaries is essential for AR image segmentation. In addition to the standard cross-entropy loss used in segmentation, the researchers also had to consider a boundary-weighted loss.

The researchers found that the interior of an object is easier to segment, which is why the authors of UNet and most subsequent variants proposed a trimap-weighted loss to improve model quality.

However, the trimap-weighted loss has a limitation: the boundary region is computed only from the ground truth, so the loss is insensitive to all false positives; it is an asymmetric weighting.

Inspired by boundary IoU, the researchers extract boundary regions from both the ground truth and the predictions, and build the cross-entropy loss over those regions. Models trained with this boundary cross entropy clearly outperform the baselines.
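
A simplified version of such a symmetric boundary loss is sketched below: boundary bands are extracted from both the ground truth and the prediction via a max-pooling morphological gradient, and cross entropy is averaged only over their union. This illustrates the idea and is not Meta's implementation:

```python
import torch
import torch.nn.functional as F

def boundary_cross_entropy(pred_logits, gt, width=5):
    """Cross entropy restricted to a band around the edges of BOTH the
    ground truth and the prediction, so false positives are also penalized.
    `pred_logits`, `gt`: (N, H, W); `gt` holds 0/1 values."""
    def band(mask):
        # Morphological gradient via max pooling: dilation minus erosion.
        m = mask.float().unsqueeze(1)
        dil = F.max_pool2d(m, 2 * width + 1, stride=1, padding=width)
        ero = -F.max_pool2d(-m, 2 * width + 1, stride=1, padding=width)
        return (dil - ero).squeeze(1) > 0
    region = band(gt) | band(torch.sigmoid(pred_logits) > 0.5)
    loss = F.binary_cross_entropy_with_logits(pred_logits, gt.float(),
                                              reduction="none")
    # Fall back to the full-image loss if no boundary pixels exist.
    return loss[region].mean() if region.any() else loss.mean()
```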

Besides making the boundary regions of the final mask output sharper, the new method also lowers the model's false positive rate.


The new AI model behind Meta's virtual background processing is more efficient, more stable, and handles more diverse scenes. These optimizations improve the quality and consistency of background filters, and thus how well they perform in products.

For example, the optimized segmentation model can identify the full bodies of everyone in a multi-person scene, including full-body portraits partly hidden behind a sofa, desk, or dining table.

Beyond video calls, by combining virtual environments with people and objects in the real world, this technology can also add new dimensions to AR and VR. Such applications are especially important for building the metaverse and creating immersive experiences.

References: https://ai.facebook.com/blog/creating-better-virtual-backdrops-for-video-calling-remote-presence-and-ar/
