
A single image can "dance": SHERF, a generalizable, driveable human neural radiance field from one picture

Author: Heart of the Machine Pro

Heart of the Machine Editorial Office

The goal of human neural radiance fields is to recover and drive high-quality 3D digital humans from 2D images of the human body, avoiding the substantial manpower and resources needed to capture 3D human geometry directly. Progress in this direction has potential impact on a range of applications, such as virtual reality and augmented reality.

Existing techniques for generating and driving human neural radiance fields fall mainly into two categories.

  • The first category reconstructs and drives 3D digital humans from monocular or multi-view human videos. These methods model and drive a specific digital human, require time-consuming per-subject optimization, and do not generalize to large-scale digital human reconstruction.
  • The second category aims to improve the efficiency of 3D digital human reconstruction by taking multi-view human images as input to reconstruct the human neural radiance field.

Although the second category achieves some success in 3D human reconstruction, these methods typically require multi-view images of the body captured from specific camera angles as input. In real life, we can often obtain only a single image of the body from an arbitrary camera angle, which poses a challenge for applying such techniques.

At ICCV 2023, the S-Lab team from the Nanyang Technological University-SenseTime Joint Research Centre presented SHERF, a generalizable, driveable human neural radiance field method based on a single image.


Paper address: https://arxiv.org/abs/2303.12791

Project address: https://skhu101.github.io/SHERF

Code open source: https://github.com/skhu101/SHERF

Given a single human image captured from an arbitrary camera angle, together with the camera parameters and human pose and shape (SMPL) parameters for that view, plus arbitrary camera parameters and SMPL parameters for a target output space, SHERF can reconstruct and drive a 3D digital human. In other words, the method aims to reconstruct and drive a 3D human neural radiance field from a human image taken from any camera angle.


Figure 1

Basic principle

Reconstructing and driving the human neural radiance field proceeds in four main steps, as shown in Figure 2.


Figure 2

The first step is coordinate transformation from the target space to the canonical space. Given the SMPL pose and shape parameters and the camera extrinsics in the user-specified target output space, rays are cast in the target space and a series of points is sampled along each ray; these target-space points are then mapped into the canonical space via the inverse linear blend skinning (inverse LBS) of the SMPL algorithm.
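The inverse LBS mapping can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name is hypothetical, and it assumes the blend weights of each sampled point with respect to the SMPL joints are already available, whereas in practice they are typically looked up from the nearest SMPL vertices.

```python
import numpy as np

def inverse_lbs(x_target, joint_transforms, skinning_weights):
    """Map a target-space point back to canonical space by inverting
    the per-point blended SMPL bone transform (inverse LBS).

    x_target:         (3,) point sampled along a ray in target space
    joint_transforms: (J, 4, 4) rigid transforms of the J SMPL joints
                      for the target pose
    skinning_weights: (J,) blend weights of this point w.r.t. each joint
    """
    # Blend the per-joint transforms with the skinning weights.
    blended = np.tensordot(skinning_weights, joint_transforms, axes=1)  # (4, 4)
    # Invert the blended transform and apply it to the homogeneous point.
    x_h = np.append(x_target, 1.0)
    x_canonical = np.linalg.inv(blended) @ x_h
    return x_canonical[:3]
```

Forward LBS (used later in pixel-aligned feature extraction) is the same blend applied without the inversion.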

The second step extracts hierarchical features for the 3D points in the canonical space.

  • Global feature extraction: a 2D encoder extracts a 1D feature vector from the input image; a mapping network and a style-based encoder then convert this 1D feature into tri-plane features in the canonical space, and the 3D points in the canonical space are projected onto the three planes to gather the corresponding global features.
  • Point-level feature extraction: a 2D encoder first extracts a 2D feature map from the input image, and the SMPL vertices in the observation space are projected onto the input image plane to gather per-vertex features; the inverse linear blend skinning of the SMPL algorithm then transfers these vertices into the canonical space to build a sparse 3D tensor, and sparse convolution yields point-level features for the 3D points in the canonical space.
  • Pixel-aligned feature extraction: a 2D encoder first extracts a 2D feature map from the input image; the linear blend skinning of the SMPL algorithm transfers the 3D points in the canonical space to the observation space, where they are projected onto the input image plane to gather the corresponding pixel-aligned features.
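The tri-plane lookup in the global branch can be sketched as below. This is a simplified illustration under assumed shapes (plane resolution, channel count), with nearest-neighbour sampling for brevity where a real implementation would interpolate bilinearly; the function name is hypothetical.

```python
import numpy as np

def triplane_features(points, planes, bound=1.0):
    """Gather global features for canonical-space points from tri-planes.

    points: (N, 3) canonical-space points, assumed inside [-bound, bound]^3
    planes: (3, C, R, R) feature planes for the XY, XZ and YZ planes
    returns: (N, 3*C) concatenated features from the three projections
    """
    _, C, R, _ = planes.shape
    # Axis pairs defining each plane's 2D coordinates.
    axes = [(0, 1), (0, 2), (1, 2)]  # XY, XZ, YZ
    feats = []
    for plane, (a, b) in zip(planes, axes):
        # Map coordinates from [-bound, bound] to pixel indices.
        u = ((points[:, a] + bound) / (2 * bound) * (R - 1)).astype(int)
        v = ((points[:, b] + bound) / (2 * bound) * (R - 1)).astype(int)
        u = np.clip(u, 0, R - 1)
        v = np.clip(v, 0, R - 1)
        feats.append(plane[:, v, u].T)  # (N, C) nearest-neighbour lookup
    return np.concatenate(feats, axis=1)
```

Each 3D point thus receives one feature vector per plane, and the three are concatenated into its global feature.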

The third step is the feature fusion transformer, which uses a Transformer model to fuse the three levels of features. The fourth step decodes the human neural radiance field to generate the corresponding image: the canonical-space coordinates, ray direction vectors, and fused features are fed into the neural radiance field decoding network to obtain the density and color of each 3D point, and volume rendering then produces the color value of the corresponding pixel in the target space, yielding the image under the SMPL parameters and camera extrinsics of the user-specified target output space.
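The volume-rendering step at the end of the pipeline follows the standard NeRF quadrature; a minimal sketch (the function name and array layout are assumptions for illustration):

```python
import numpy as np

def volume_render(densities, colors, deltas):
    """Composite per-sample densities and colors along one ray into a
    pixel color via the standard NeRF volume-rendering quadrature.

    densities: (S,) non-negative densities of the S samples on the ray
    colors:    (S, 3) RGB colors of the samples
    deltas:    (S,) distances between consecutive samples
    """
    # Opacity contributed by each sample segment.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans                        # (S,)
    return (weights[:, None] * colors).sum(axis=0)  # (3,) pixel color
```

Running this per ray over all pixels of the target view produces the rendered image for the given SMPL and camera parameters.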

With these steps, given arbitrary SMPL motion-sequence parameters in the target output space, a 3D digital human can be recovered and driven from a single 2D image.

Comparison of results

The paper conducts experiments on four human datasets: THuman, RenderPeople, ZJU_MoCap, and HuMMan.

The study compares against NHP and MPS-NeRF, the state-of-the-art generalizable human neural radiance field methods that take multi-view human images as input, using peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). As shown below, SHERF significantly surpasses the previous methods on all datasets and all metrics.
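As a reminder of what the first metric measures, PSNR between a rendered image and the ground truth can be computed as follows (a minimal sketch assuming images normalized to [0, 1]; higher is better):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a rendered image and the
    ground truth, in dB. Both arrays share the same shape and range."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM and LPIPS complement it by measuring structural and learned perceptual similarity, respectively, rather than raw per-pixel error.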

[Table: quantitative comparison of SHERF with NHP and MPS-NeRF in PSNR, SSIM, and LPIPS on the four datasets]

The results of SHERF dynamically driving 3D human bodies are shown below:


From left to right: input image, motion sequence 1, motion sequence 2

The paper also verifies generalization and driving on in-the-wild DeepFashion data, as shown in Figure 3 below. Given an arbitrary input image, a state-of-the-art single-view SMPL estimation algorithm is used to estimate the SMPL parameters and the corresponding camera angle, and the proposed method then drives the 3D human body. The results show that SHERF generalizes strongly.


From left to right: input image, motion sequence 1, motion sequence 2

Application prospects

In game and film production, virtual reality, augmented reality, and other scenarios requiring digital human modeling, a user without professional skills or software can reconstruct and drive a 3D digital human simply by providing a single human image from any camera angle, together with the camera parameters and the corresponding SMPL pose and shape parameters for that view.

Epilogue

This paper proposes SHERF, a generalizable and driveable human neural radiance field method based on a single input image. Admittedly, the method still has certain limitations.

First, for parts of the human surface that the input image observes poorly, the rendered results can show certain artifacts; one solution is to build an occlusion-aware human representation.

Second, how to complete the parts of the human body not observed in the input image remains a challenging problem. SHERF is formulated from a reconstruction perspective, so it can only produce a deterministic completion of the unobserved parts and lacks diversity in reconstructing them. One possible direction is to use generative models to produce diverse, high-quality 3D human results for the unseen parts of the body.

Finally, our code has been fully open-sourced, and a large number of digital human results generated from a single image have been uploaded to the project homepage. Everyone is welcome to download and try it out!