Editor: Editorial Department
Who would have thought that the person chatting with you one day might actually be an AI? Researchers from TUM and other institutions have proposed a new algorithm, NPGA, capable of generating high-fidelity 3D avatars with expressions so realistic that you will doubt your own eyes.
What does the real uncanny valley effect look like?
Look at the girl below: she runs through a range of vivid, rich expressions, squinting her eyes, raising her eyebrows, pouting...
Look at this boy, too: the constantly changing mouth shapes, paired with subtle eye movements, are indistinguishable from a real person's.
And yet, who would have thought that they are not real people at all!
Netizens have even called it AGI, so realistic that it is terrifying.
Such powerful 3D avatar generation is in no way inferior to the photorealistic digital avatar that Zuckerberg previously used, wearing a Quest Pro, when he appeared as a guest on a podcast in the "metaverse."
So, who is behind this research?
Recently, research teams from the Technical University of Munich (TUM) in Germany and University College London have proposed a new algorithm, NPGA, which can generate high-fidelity 3D avatars.
Address: https://arxiv.org/pdf/2405.19331
It is a data-driven approach to creating high-fidelity, controllable avatars from multi-view videos.
Traditionally, mesh-based 3D Morphable Models (3DMMs) have been used to generate rendered avatars, but the results are mediocre.
The innovation of NPGA lies in its use of Gaussian point clouds (3D Gaussian splatting), that is, composing the 3D head out of countless Gaussian primitives, making rendering both more efficient and more realistic.
Another innovation in the research is the use of a neural parametric model, the Neural Parametric Head Model (NPHM), to capture subtle changes in human facial expressions, so that the 3D digital avatar can simulate human expressions more realistically.
Finally, to enhance the expressiveness of the digital avatar, the researchers also proposed Laplacian regularization terms on the latent features and the predicted dynamics.
Experimental evaluations show that NPGA improves on the previous SOTA model by about 2.6 dB PSNR on the self-reenactment task.
Some people exclaimed that this simply brings scams one step closer.
Meanwhile, netizens did not miss the chance to mock a bizarre video recently released by Google.
With its strange art style and unstable avatars, it simply cannot compete with NPGA.
This is the ChatDirector algorithm proposed by the Google team; according to Google's promotional copy, 3D virtual avatars can make online meetings more "immersive."
NPGA: Neural Parametric Gaussian Avatars
This technology can be applied to many scenarios, such as movies, games, AR/VR remote meetings, and the metaverse that Zuckerberg dreams of.
As realistic as the videos look, capturing images from the real world and reconstructing a 3D avatar is a challenging task. It requires both the accurate perception capabilities of computer vision (CV) and the high-fidelity, real-time rendering performance of computer graphics (CG).
In recent years, the intersection of these two fields has made the 3D avatars of virtual worlds more and more realistic. However, one core problem remains unresolved: how to achieve control.
The main reason Google ChatDirector's video looks so strange is not the rendering itself, but the poor control of facial movements and expressions.
A netizen in the Reddit comment section asked, "When will I see the open-source version of this model, so that I can generate a similar 3D avatar with just a few photos?"
Unfortunately, current technology likely cannot do 3D reconstruction from just a few images.
The training set used by the team, NeRSemble, is a video dataset containing more than 4,700 high-resolution, high-frame-rate multi-view video sequences of more than 220 human heads, captured with 16 synchronized cameras and covering a rich variety of head movements, emotions, expressions, and speech.
This dataset was published in 2023 by a team including the NPGA authors, and was accepted at SIGGRAPH 2023 and published in ACM TOG.
Dataset address: https://tobias-kirschstein.github.io/nersemble/
A tip: if you click through to watch the sample videos, you may need strong nerves; the exaggerated expressions in them could be called award-winning examples of abstract human behavior.
When the dataset was first published last year, the reconstructed movements and expressions were stiff and lacked rich facial detail.
That such realistic results were achieved in just one year is due to the team's methodological improvements.
Overview of the methodology
a) Based on the MonoNPHM model, point clouds computed by COLMAP on the NeRSemble dataset are used to track MonoNPHM, achieving geometrically accurate model tracking.
b) A cycle-consistency objective is proposed to invert MonoNPHM's backward deformation field; the resulting forward deformation field is directly compatible with rasterization-based rendering.
c) The NPGA consists of a canonical Gaussian point cloud and MLPs, containing a distilled prior network F for forward deformation and a network G that learns fine-grained dynamic details.
d) The input of the deformation field is lifted to a higher-dimensional space by attaching latent features to each primitive, so that the deformation behavior of each primitive can be described more accurately.
Specific algorithm details
Most previous head-reconstruction work used 3D Morphable Models (3DMMs), learning a representation of head geometry via principal component analysis (PCA) and separating the parameter spaces for identity and expression.
Although the parameter space of a 3DMM is compact, the authors argue that its underlying linear nature limits the fidelity achievable in the expression space.
At the same time, the paper shows that the underlying expression space plays a crucial role in avatar quality: it not only affects controllability, but also determines the upper limit of detail. If the underlying expression space is insufficient, optimizing the model is likely to lead to overfitting.
Therefore, the team used an improved 3DMM, NPHM (Neural Parametric Head Models), to track multi-view image sequences and extract a latent identity code z_id and expression code z_exp.
After that, a backward deformation field B can convert a point x_p in pose space to a coordinate x_c in canonical space:
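The formula itself is not reproduced in the article; in our notation (whether B is also conditioned on the identity code is our assumption based on the surrounding description), it takes roughly this form:

```latex
x_c = B\big(x_p;\; z_{\mathrm{id}},\, z_{\mathrm{exp}}\big)
```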
Unfortunately, this study only focuses on reconstructing the head; the torso is masked out in the dataset because it is not covered by the expression space of the z_exp extracted from NPHM.
On top of the per-primitive scene representation defined in 3DGS, the authors additionally attach a latent Gaussian feature to each primitive. Although these features are themselves static, they provide semantic information about each primitive's dynamic behavior, acting in a way similar to a positional encoding.
After this parametric lifting, the proposed dynamics module D for modeling facial expressions consists of two multilayer perceptrons (MLPs), illustrated in the sketch after this list:
- a coarse prior network F
- a network G, which goes beyond the prior and models the remaining details
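To make the division of labor concrete, here is a minimal PyTorch sketch of what such a dynamics module might look like. This is not the authors' code: the class name, dimensions, and the purely additive combination of F and G are our assumptions for illustration.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=128):
    """A small helper: plain 3-layer MLP."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class DynamicsModule(nn.Module):
    """Hypothetical sketch of the dynamics module D = (F, G).

    F: distilled prior network (pre-trained, then reused for all avatars).
    G: per-avatar network adding fine-grained dynamic detail.
    """

    def __init__(self, num_gaussians, feat_dim=32, exp_dim=100):
        super().__init__()
        # Per-Gaussian latent features: the "lifted" input of the field.
        self.features = nn.Parameter(torch.zeros(num_gaussians, feat_dim))
        in_dim = 3 + feat_dim + exp_dim  # position + feature + expression code
        self.F = mlp(in_dim, 3)  # coarse prior deformation
        self.G = mlp(in_dim, 3)  # residual detail deformation

    def forward(self, xyz_canonical, z_exp):
        # xyz_canonical: (N, 3) canonical Gaussian centers
        # z_exp: (exp_dim,) expression code, broadcast to every Gaussian
        z = z_exp.expand(xyz_canonical.shape[0], -1)
        inp = torch.cat([xyz_canonical, self.features, z], dim=-1)
        # Deform canonical positions into pose space: prior plus detail.
        return xyz_canonical + self.F(inp) + self.G(inp)
```

In this reading, F would be pre-trained and then frozen, while the per-Gaussian features and G are optimized per avatar.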
Among them, the training and use of F is one of the core innovations of this paper. F is first trained on image sequences of 20 subjects from the NeRSemble dataset, and the network is then reused in the reconstruction of all avatars.
The prior knowledge in F is distilled from the backward deformation field B (F is essentially the inverse of B) via "cycle-consistency distillation."
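The article does not reproduce the loss; in spirit, cycle-consistency distillation asks the forward field F to undo the backward field B. A schematic form, in our own notation rather than the paper's exact formulation, is:

```latex
\mathcal{L}_{\mathrm{cyc}}
  = \mathbb{E}_{x_c}
    \left\|\, B\big(F(x_c;\, z_{\mathrm{exp}});\, z_{\mathrm{exp}}\big) - x_c \,\right\|_2^2
```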
Then, using the dynamics module D, we obtain the Gaussian point-cloud representation in the reconstructed pose space, A_p:
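The formula is again omitted here; schematically (our notation), the posed cloud is the canonical cloud A_c pushed through the dynamics module:

```latex
\mathcal{A}_p = D\big(\mathcal{A}_c;\; z_{\mathrm{exp}}\big)
```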
After rendering A_p to screen space, the team also proposes a CNN to enhance the detail of the rendered latent image, rather than applying super-resolution. Subsequent ablation experiments also confirm the CNN's effectiveness in improving performance.
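A minimal sketch of what such a screen-space refinement could look like, assuming a rasterized latent feature image as input; the layer sizes and depth here are illustrative, not the paper's:

```python
import torch.nn as nn

class ScreenSpaceRefiner(nn.Module):
    """Hypothetical screen-space CNN: decodes a rasterized latent feature
    image into RGB with sharper detail. Note it is not super-resolution:
    input and output share the same resolution."""

    def __init__(self, feat_channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),  # final RGB image
        )

    def forward(self, latent_image):
        # latent_image: (B, feat_channels, H, W), rendered from A_p
        return self.net(latent_image)
```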
In addition to the design of the algorithm and architecture, the team also made two improvements in the optimization strategy.
The first is Laplacian smoothing over a k-nearest-neighbor (KNN) graph, applied to the canonical space A_c and the dynamics module D.
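As a rough illustration of the idea (our sketch, not the paper's formulation), a KNN-graph Laplacian term penalizes each per-Gaussian quantity for deviating from the average over its neighbors:

```python
import torch

def knn_laplacian_loss(values: torch.Tensor, knn_idx: torch.Tensor) -> torch.Tensor:
    """Penalize each per-Gaussian quantity for deviating from the mean of
    its K nearest neighbors in canonical space.

    values:  (N, D) per-Gaussian quantities (e.g. latent features or
             predicted offsets).
    knn_idx: (N, K) neighbor indices, precomputed once on A_c.
    """
    neighbor_mean = values[knn_idx].mean(dim=1)            # (N, D)
    return ((values - neighbor_mean) ** 2).sum(-1).mean()  # scalar
```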
The second is adaptive density control (ADC), a core factor in the success of 3DGS, which uses heuristics to densify under-reconstructed regions and prune potentially redundant Gaussians, originally in the setting of static scenes.
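For intuition, the pruning half of 3DGS-style density control can be sketched as a simple keep-mask over opacity and screen-space size; the thresholds below are illustrative defaults, not values from the paper:

```python
import torch

def prune_mask(opacities: torch.Tensor,
               screen_radii: torch.Tensor,
               opacity_min: float = 0.005,
               radius_max: float = 20.0) -> torch.Tensor:
    """Toy keep-mask in the spirit of 3DGS adaptive density control:
    drop Gaussians that are nearly transparent or that project to an
    overly large screen-space radius (in pixels)."""
    return (opacities > opacity_min) & (screen_radii < radius_max)
```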
Experimental evaluation
The researchers evaluated the fidelity of the NPGA algorithm through the self-reenactment task.
In self-reenactment, NPGA depicts unseen expressions more accurately and preserves sharper details in relatively static areas such as hair.
Interestingly, GHA_NPHM performs slightly worse than GHA, suggesting that simply swapping in MonoNPHM's expression codes does not provide an immediate performance boost.
Instead, the researchers hypothesize that, without NPHM's deformations as initialization, NPHM's latent expression distribution may present a more complex training signal than BFM's linear blendshapes.
The following is a qualitative comparison of the different methods on held-out sequences.
The quantitative results of these methods are as follows.
Next, how does the new algorithm perform on the cross-reenactment task?
Cross-reenactment refers to transferring one person's expressions onto another person's avatar.
As shown in the figure below, all of the methods successfully disentangle identity and expression information, enabling effective cross-reenactment.
However, the NPGA avatars retain more of the details of the driving expressions.
To demonstrate the real-world applicability of the algorithm, Figure 6 shows the researchers driving high-fidelity avatar animations from MonoNPHM's monocular RGB tracking.
Ablation studies
Finally, to validate several important components of NPGA, the researchers ran ablation experiments on three subjects. The quantitative and qualitative results are shown in Table 2 and Figure 5, respectively.
Without the per-Gaussian features (Vanilla), the 3D avatar cannot present highly detailed expressions, especially in complex areas such as the eyes and lower teeth.
After adding the per-Gaussian features (p.G.F.), the reconstruction is noticeably clearer, but prone to artifacts under extreme expressions.
When the researchers added the Laplacian regularization and the screen-space CNN, this artifact problem was finally resolved.
In addition, the experiments show that the default point-cloud densification strategy suppresses the reconstruction of detail, making the adaptive density control (ADC) strategy essential.
The following table shows that the generalization gap between the training sequences (NVS) and the test sequences (self-reenactment) can be significantly narrowed with the regularization strategies.
Limitations
According to the researchers, the controllability and reconstruction quality of avatars created with NPGA are fundamentally limited by the underlying 3DMM expression space.
As a result, areas such as the neck, torso, tongue, and eye rotation cannot be fully explained by NPHM's expression codes.
Consequently, the algorithm cannot reliably animate these regions and may even introduce artifacts due to overfitting.
A possible solution is to extend the underlying 3DMM to provide a more detailed description of the human state.
In addition, NPGA, as a data-driven avatar creation method, is somewhat limited by the data available.