Metaverse Technical Guide: High-Quality Single-View RGB Human Reconstruction

2022-01-20 19:49:00

01 Introduction to Human Reconstruction

Building realistic 3D human models in the metaverse is one of the core tasks in creating virtual digital humans. 3D human reconstruction aims to recover 3D geometry from 2D observations of the human body, for example from RGB images.

The reconstructed 3D human body can be used in film and television visual effects, motion-driven animation, and other applications.

02 Introduction to the classic methods of human reconstruction

Classic single-view RGB methods for human reconstruction fall into two groups: methods based on a parametric 3D representation of the human body, and methods based on neural implicit functions.

Methods based on a 3D body representation use SMPL as a human body prior and estimate the corresponding model parameters.

HMR is a representative example. It takes a single image as input, extracts image features with a neural network, and regresses the SMPL shape and pose parameters together with the corresponding camera parameters. Because an arbitrary set of shape and pose parameters does not necessarily describe a plausible person (for example, the pose may be one that no human can achieve), a discriminator is added to screen out unrealistic results.
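As an illustration of what such a regressor returns, the sketch below splits a flat parameter vector into camera, pose, and shape parts and applies a weak-perspective projection. The sizes (3 camera values, 72 pose values, 10 shape values) follow how HMR is commonly described; the ordering of the three parts and the function names are assumptions of this sketch, not part of the original text.

```python
from typing import List, Sequence, Tuple

# Sizes as commonly described for HMR: 3 camera + 72 pose + 10 shape = 85.
NUM_CAM, NUM_POSE, NUM_SHAPE = 3, 72, 10


def split_hmr_params(theta: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Split a flat 85-D vector into (camera, pose, shape).

    The [camera | pose | shape] ordering is an assumption of this sketch.
    """
    expected = NUM_CAM + NUM_POSE + NUM_SHAPE
    if len(theta) != expected:
        raise ValueError(f"expected {expected} values, got {len(theta)}")
    cam = list(theta[:NUM_CAM])
    pose = list(theta[NUM_CAM:NUM_CAM + NUM_POSE])
    shape = list(theta[NUM_CAM + NUM_POSE:])
    return cam, pose, shape


def weak_perspective_project(
    points3d: Sequence[Tuple[float, float, float]],
    cam: Sequence[float],
) -> List[Tuple[float, float]]:
    """Project 3D points with a weak-perspective camera (scale s, shift tx, ty).

    Depth is ignored: x_2d = s * x + tx, y_2d = s * y + ty.
    """
    s, tx, ty = cam
    return [(s * x + tx, s * y + ty) for x, y, _z in points3d]
```

The pose and shape parts would be passed to the SMPL model to produce a mesh, and the camera part places that mesh in the image.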

This approach is efficient, but it is limited by the expressive power of the body model and can usually reconstruct only the unclothed body.

Methods based on neural implicit functions leverage the ability of neural networks to learn priors from large amounts of data, together with the flexibility of neural implicit representations, to reconstruct the human body with clothing detail. PIFu is the representative example.

PIFu supports single-view or multi-view input and can reconstruct highly complex shapes such as hairstyles and clothing, digitizing their variations and deformations in a unified way.
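The "pixel-aligned" idea behind PIFu can be sketched as follows: each 3D query point is projected into the image, an image feature is sampled at that pixel, and the feature is concatenated with the point's depth before being passed to an occupancy network. The snippet below is a minimal NumPy sketch under stated assumptions (orthographic projection, coordinates normalized to [-1, 1], hypothetical function names); it shows only the feature lookup, not the network.

```python
import numpy as np


def project_orthographic(points):
    """Split (x, y, z) points into image-plane coordinates (x, y) and depth z."""
    points = np.asarray(points, dtype=np.float64)
    return points[:, :2], points[:, 2]


def sample_bilinear(feat, xy):
    """Bilinearly sample a (C, H, W) feature map at normalized coords in [-1, 1].

    Assumes H >= 2 and W >= 2. Returns an (N, C) array.
    """
    _c, h, w = feat.shape
    # Map [-1, 1] to continuous pixel coordinates [0, W-1] and [0, H-1].
    px = (xy[:, 0] + 1.0) * 0.5 * (w - 1)
    py = (xy[:, 1] + 1.0) * 0.5 * (h - 1)
    x0 = np.clip(np.floor(px).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(py).astype(int), 0, h - 2)
    wx = px - x0
    wy = py - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x0 + 1] * wx
    bottom = feat[:, y0 + 1, x0] * (1 - wx) + feat[:, y0 + 1, x0 + 1] * wx
    return (top * (1 - wy) + bottom * wy).T


def pixel_aligned_input(feat, points):
    """Concatenate the sampled image feature with the depth of each point."""
    xy, z = project_orthographic(points)
    return np.concatenate([sample_bilinear(feat, xy), z[:, None]], axis=1)
```

In a real system the feature map would come from an image encoder, and an MLP would map each row of the result to an occupancy value.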

PIFuHD further enhances the detail of PIFu with an end-to-end trainable, multi-level architecture. The coarse level observes the entire image at a lower resolution and focuses on holistic reasoning. It provides context to the fine level, which estimates highly detailed geometry from the higher-resolution image. By taking full advantage of 1k-resolution input images, PIFuHD reconstructs the shape of the human body in detail.

However, these methods depend heavily on their training data, and high-precision 3D human datasets are very difficult to obtain. A representation that generalizes better while preserving detail is therefore needed for 3D human reconstruction from monocular RGB.

03 ICON: A new method for single-view human reconstruction from in-the-wild RGB images

Insufficient data has always been one of the biggest problems in deep learning. Because existing public datasets did not contain large amounts of in-the-wild scene data with 3D human ground truth (GT), the AGORA dataset (Avatars in Geography Optimized for Regression Analysis) was constructed first.

AGORA uses 4,240 commercial human scans, including 257 scans of children, covering different textures and poses. The scans were placed into different scenes and rendered into a total of 14K training and 3K test images. The data provides not only color images but also 3D ground truth and SMPL-X registrations.

ICON is a deep learning model that infers the 3D shape of clothed humans from color images.

Specifically, ICON takes as input an RGB image containing a segmented clothed person, together with an estimated "under-clothing" body shape (SMPL), and outputs a pixel-aligned 3D shape reconstruction of the clothed person.

ICON has two main modules: (1) SMPL-guided normal prediction for the clothed body and (2) implicit surface reconstruction based on local features.

Inferring full 360° 3D normals from a single RGB image of a person wearing clothes is challenging, mainly because the normals of the occluded parts require guessing based on the observed parts. This is an ambiguous prediction problem that is challenging for deep networks.

For the SMPL-guided clothed-body normal prediction module, the input is the SMPL body normal maps together with the RGB image, and the output is the estimated clothed-body normal maps.

First, the existing method PARE is used to estimate the SMPL body from the image. The estimated SMPL mesh is then rendered with the PyTorch3D differentiable renderer to obtain SMPL normal maps of the front and back of the body.

A neural network then takes the SMPL normal maps and the RGB image as input and predicts the front and back normal maps of the clothed body, which are used to build the features for the implicit representation.

The neural implicit representation module takes sample points in 3D space and their corresponding features as input, and the network outputs an occupancy value for each point. An explicit mesh can then be extracted from the occupancy field with the Marching Cubes algorithm.
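The mesh-extraction step can be sketched as: evaluate the occupancy function on a regular 3D grid, then run Marching Cubes on that grid at a chosen level (0.5 here). The snippet below covers only the grid evaluation and thresholding. `toy_occupancy` is a hypothetical stand-in for the trained network (a soft sphere), and the grid resolution and bounds are arbitrary assumptions.

```python
import numpy as np


def toy_occupancy(points, radius=0.5):
    """Stand-in for the trained network: soft occupancy of a sphere at the origin.

    Returns values near 1 inside the sphere and near 0 outside.
    """
    dist = np.linalg.norm(points, axis=-1)
    return 1.0 / (1.0 + np.exp((dist - radius) * 50.0))


def occupancy_grid(occ_fn, resolution=17, bound=1.0):
    """Evaluate occ_fn on a regular grid covering [-bound, bound]^3."""
    axis = np.linspace(-bound, bound, resolution)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    points = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    return occ_fn(points).reshape(resolution, resolution, resolution)


def inside_mask(grid, level=0.5):
    """Boolean voxel mask of the region the function considers occupied."""
    return grid > level
```

A Marching Cubes implementation would take the same grid and level and return the vertices and faces of the surface where the occupancy crosses 0.5.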

For each sample point, a local feature is constructed. Compared with the global feature used by PIFu, the local feature can represent finer local details and reduces the influence of the global pose on the prediction. Specifically, the local feature of a query point $P$ is

$F_P = [F_s(P),\ F_n^b(P),\ F_n^c(P)]$

where $F_s(P)$ is the signed distance from the query point $P$ to its nearest point $P^b$ on the SMPL body, $F_n^b(P)$ is the barycentric surface normal of the SMPL body at $P^b$, and $F_n^c(P)$ is the clothed-body normal for the point. If $P^b$ is visible, $F_n^c(P)$ is read from the front normal map at the 2D projection of $P$; if it is not visible, it is read from the back normal map. Selecting the map by visibility brings the prediction closer to the ground truth than using the same predicted normal map regardless of visibility.

Note that $F_P$, built this way, is independent of the global pose.
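The construction of the local feature described above can be sketched in NumPy as follows. This is a simplified illustration, not ICON's implementation: the body is a set of sampled points with normals rather than a mesh (so there is no barycentric interpolation), the nearest point is found by brute force, the clothed normals are assumed to be already sampled at each query point's projection, and the "positive outside" sign convention is an assumption.

```python
import numpy as np


def local_features(query, body_pts, body_normals, body_visible,
                   front_normals, back_normals):
    """Build a 7-D feature per query point:
    [signed distance (1), nearest body normal (3), clothed normal (3)].

    query:         (Q, 3) query points
    body_pts:      (B, 3) points sampled on the body surface
    body_normals:  (B, 3) outward unit normals at body_pts
    body_visible:  (B,)   True if the body point faces the camera
    front_normals: (Q, 3) front clothed normal sampled at each query projection
    back_normals:  (Q, 3) back clothed normal sampled at each query projection
    """
    query = np.asarray(query, dtype=np.float64)
    body_pts = np.asarray(body_pts, dtype=np.float64)
    body_normals = np.asarray(body_normals, dtype=np.float64)
    body_visible = np.asarray(body_visible, dtype=bool)
    front_normals = np.asarray(front_normals, dtype=np.float64)
    back_normals = np.asarray(back_normals, dtype=np.float64)

    # Nearest body point for every query point (brute force).
    diff = query[:, None, :] - body_pts[None, :, :]   # (Q, B, 3)
    dist = np.linalg.norm(diff, axis=-1)              # (Q, B)
    idx = dist.argmin(axis=1)
    rows = np.arange(len(query))

    nearest_normal = body_normals[idx]
    offset = diff[rows, idx]
    # Assumed convention: positive when the query lies on the outward side.
    sign = np.where(np.einsum("ij,ij->i", offset, nearest_normal) >= 0, 1.0, -1.0)
    signed_dist = sign * dist[rows, idx]

    # Front map if the nearest body point is visible, back map otherwise.
    clothed = np.where(body_visible[idx][:, None], front_normals, back_normals)
    return np.concatenate([signed_dist[:, None], nearest_normal, clothed], axis=1)
```

Because every entry is defined relative to the nearest body point, the feature does not change when the whole body is posed differently elsewhere, which is the sense in which it is local.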

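At inference time, the SMPL fit is also refined, mainly by minimizing the difference between the SMPL normal maps and the predicted clothed-body normal maps, together with the difference between their silhouettes:

$L_{SMPL} = \lambda_N L_{N\_diff} + L_{S\_diff}$, with $L_{N\_diff} = |N^b - \hat{N}^c|$ and $L_{S\_diff} = |S^b - \hat{S}^c|$

Here $N^b$ and $S^b$ are the normal maps and silhouette rendered from the SMPL body, $\hat{N}^c$ and $\hat{S}^c$ are the predicted clothed-body normal maps and silhouette, and the loss is minimized over the SMPL parameters.

In turn, the normals are predicted again from the refined SMPL body to obtain better normals. The refinement of the SMPL parameters and the refinement of the normals alternate.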
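The body-refinement objective (a normal-map difference plus a silhouette difference) can be sketched as below. This is a toy illustration under stated assumptions: `toy_render` is a hypothetical stand-in for a differentiable renderer and draws a square instead of a body, the differences are averaged L1 terms, the weight `lambda_n` is arbitrary, and a search over candidate placements replaces gradient-based optimization of SMPL parameters.

```python
import numpy as np


def toy_render(size, top_left, side):
    """Hypothetical renderer: a square silhouette plus a normal map whose
    foreground normals all point toward the camera, (0, 0, 1)."""
    mask = np.zeros((size, size), dtype=bool)
    row, col = top_left
    mask[row:row + side, col:col + side] = True
    normals = np.zeros((size, size, 3))
    normals[mask] = (0.0, 0.0, 1.0)
    return normals, mask


def refinement_loss(body_normals, clothed_normals, body_mask, clothed_mask,
                    lambda_n=1.0):
    """lambda_n * mean|N_body - N_clothed| + mean|S_body - S_clothed|."""
    normal_diff = np.abs(body_normals - clothed_normals).mean()
    silhouette_diff = np.abs(
        body_mask.astype(float) - clothed_mask.astype(float)
    ).mean()
    return lambda_n * normal_diff + silhouette_diff


def refine_placement(target_normals, target_mask, side, candidates):
    """Toy 'optimizer': try each candidate placement, keep the lowest loss."""
    size = target_mask.shape[0]

    def loss_at(top_left):
        normals, mask = toy_render(size, top_left, side)
        return refinement_loss(normals, target_normals, mask, target_mask)

    return min(candidates, key=loss_at)
```

In the alternating scheme, a step like `refine_placement` would be followed by re-running the normal predictor on the refined body, and the two steps would repeat for a fixed number of iterations.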

The normal prediction network is trained with the loss

$L_N = L_{pixel} + \lambda_{VGG} L_{VGG}$

where $L_{pixel} = |N^c - \hat{N}^c|$ is the L1 loss between the ground-truth (GT) and predicted clothed-body normal maps, and $L_{VGG}$ is the perceptual loss, added to improve detail, proposed in the 2016 paper "Perceptual Losses for Real-Time Style Transfer and Super-Resolution" by Johnson, Alahi, and Fei-Fei Li.
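The combination of a pixel-wise L1 term and a perceptual term can be sketched as below. A real perceptual loss compares activations of a pretrained VGG network. To keep the example self-contained, `toy_features` is a hypothetical stand-in that uses image gradients as "features", and the weight `lambda_vgg` is an arbitrary assumption.

```python
import numpy as np


def l1_pixel_loss(pred, gt):
    """Mean absolute difference between predicted and GT normal maps."""
    return np.abs(pred - gt).mean()


def toy_features(img):
    """Hypothetical stand-in for pretrained VGG activations:
    horizontal and vertical finite differences of an (H, W, 3) map."""
    return np.diff(img, axis=1), np.diff(img, axis=0)


def perceptual_loss(pred, gt, feature_fn=toy_features):
    """Mean squared distance between feature maps of pred and gt."""
    return sum(
        np.mean((fp - fg) ** 2)
        for fp, fg in zip(feature_fn(pred), feature_fn(gt))
    )


def normal_loss(pred, gt, lambda_vgg=1.0):
    """Pixel-wise L1 term plus a weighted perceptual term."""
    return l1_pixel_loss(pred, gt) + lambda_vgg * perceptual_loss(pred, gt)
```

With these stand-in features, a constant offset between prediction and ground truth is penalized only by the L1 term, while errors in local structure are penalized by both terms, which is the motivation for adding a feature-space loss to recover detail.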

04 Advantages of ICON

Compared with the PIFu family of methods, ICON achieves higher reconstruction quality with less data, thanks to its carefully designed local feature, which expresses finer local detail without being affected by the global pose.

Because training requires less data and the network generalizes better, ICON is a good solution for human reconstruction, where data is extremely scarce.

ICON also provides a way to build a human avatar. Given a video sequence of a single person, ICON reconstructs the region visible in each frame, and SCANimate is then used to build an animatable human avatar.

Early avatar production required high-precision capture equipment and a series of complicated operations. By comparison, the simpler equipment and end-to-end generation used here offer a feasible route to low-cost digital avatar production.

Limitations. Because ICON relies on the SMPL parametric body model as a prior, loose clothing far from the body may be difficult to reconstruct.

While ICON is robust to small errors in body fitting, a serious body-fitting failure can cause the reconstruction to fail. Because it is trained on orthographic views, ICON struggles with strong perspective effects and can produce asymmetric limbs or anatomically impossible shapes.

A key future application is to create datasets of clothed avatars from images alone. Such datasets could facilitate research into human body shape modeling, be valuable to the fashion industry, and support graphics applications.
