
Exploding! Accurate to every pixel! CVPR'24's latest 3D face tracking is simply amazing!

author:3D Vision Workshop

Source: 3D Vision Workshop

To join the group, add the assistant: dddvision, with the note: direction + school/company + nickname. The industry subdivisions are listed at the end of the article.

This article introduces an advanced face tracking method built around a highly robust and accurate 2D alignment module, validated across multiple benchmarks and downstream tasks. The method uses a two-stage pipeline: it first predicts a dense 2D alignment of the face model, then fits a parametric 3D model to the alignment results. Experiments show strong face tracking and 3D reconstruction accuracy, with performance gains on downstream tasks such as avatar synthesis and speech-driven 3D facial animation. The paper also notes the method's limitations, such as the pipeline not being fully differentiable and the training data being restricted to lab captures, and proposes future directions, including extending the alignment network to predict depth directly and using synthetic datasets to alleviate the data problem.

Let's read about this work together~

Paper title: 3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow

Authors: Felix Taubner, Prashant Raina, et al.

Affiliation: LG Electronics

Paper link: https://arxiv.org/pdf/2404.09819.pdf

When working with 3D facial data, improving fidelity and avoiding the uncanny valley effect depend heavily on accurate capture of 3D facial representations. Because such capture setups are expensive while 2D video is widely available, recent methods have focused on monocular 3D face tracking. However, these methods often struggle to capture precise facial motion due to limitations in their network architectures, training procedures, and evaluation protocols. To address these challenges, we propose a novel face tracker, FlowFace, which introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weakly supervised or synthetic data. Our 3D model fitting module jointly fits a 3D face model to one or more observations, integrates an existing neutral shape prior to enhance the decoupling of identity and expression, and applies per-vertex deformations for detailed facial feature reconstruction. In addition, we propose a novel metric and benchmark for evaluating tracking accuracy. Our approach demonstrates superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D video, which yields performance gains on downstream tasks.

Figure 5. Qualitative results on two sequences (top three and bottom three rows) of our Multiface benchmark. Warm colors indicate high error; cool colors indicate low error. DECA, HRN, and MPT struggle with motion in the cheekbone and forehead regions, visible in the SSME error plots (right column). Despite using only 2D alignment as supervision, our method achieves better 3D reconstruction (CD, middle column).


Sample frames extracted from one sequence per subject in our Multiface subset. Our benchmark incorporates a variety of expressions from different subjects and viewpoints.


Examples of FLAME registrations from the FaceScape (left four columns) and Stirling (right two columns) datasets. The top row shows the original images, the middle row the raw scan data, and the bottom row the fitted FLAME mesh. For the Stirling dataset, we rendered synthetic views from the available colored 3D scan data.

The main contributions of this work are as follows:
  • The 2D alignment network has a novel architecture, with a vision transformer backbone and iterative, recurrent refinement blocks.
  • In contrast to previous methods that rely on weak supervision or synthetic data, the alignment network is trained on high-quality annotations derived from 3D scans.
  • The alignment network predicts dense, per-vertex alignment rather than sparse keypoints, which makes it possible to reconstruct finer details.
  • An off-the-shelf neutral shape prediction model is integrated to improve the decoupling of identity and expression.

This article describes a method for monocular 3D face tracking. The pipeline consists of two main stages:

  • Dense 2D Face Alignment Network:

This network predicts the position of each vertex of the face model in image space. For each vertex, it outputs an expected position and an associated uncertainty, along with a UV-to-image flow map and a corresponding uncertainty map. The architecture consists of an image feature encoder, a UV position encoding module, and an iterative optical flow module. Training minimizes a Gaussian negative log-likelihood loss, applied both to the per-vertex position predictions and to the UV-to-image flow (a sketch follows below).
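To make the loss concrete, here is a minimal sketch of a Gaussian negative log-likelihood over predicted per-vertex image positions; the tensor shapes and names are our own assumptions for illustration, not the authors' code:

```python
import torch

def vertex_gaussian_nll(mu, log_var, gt):
    """Gaussian negative log-likelihood over per-vertex 2D positions.

    mu:      predicted vertex positions in image space, shape (B, V, 2)
    log_var: predicted log-variance (uncertainty) per vertex, shape (B, V, 1)
    gt:      ground-truth 2D projections from registered 3D scans, shape (B, V, 2)
    """
    var = log_var.exp()
    # Confident (low-variance) errors are penalized strongly, while the network
    # can down-weight ambiguous vertices by predicting a higher variance.
    nll = 0.5 * (log_var + (gt - mu) ** 2 / var)  # additive constants dropped
    return nll.mean()
```

The same form can be applied densely to the UV-to-image flow, treating each UV texel's predicted image position as mu.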

  • 3D Model Fitting:

At this stage, a parametric 3D face model is fitted to the predicted 2D alignment observations by minimizing an energy function. The energy combines a 2D alignment data term, regularization terms on the FLAME model parameters, a temporal motion smoothness term, a 3D neutral-shape prior, and a constraint on the per-vertex deformations. Minimizing this energy yields the 3D model and camera parameters that best explain the observed data.
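As an illustration of this stage, a hedged sketch of such an energy in PyTorch follows; the term weights, the parameter fields, and the `flame_model`/`cam` interfaces are hypothetical placeholders, not the paper's implementation:

```python
import torch

def fitting_energy(params, flame_model, cam, pred_pos, pred_sigma, neutral_prior):
    """E = E_align + E_reg + E_smooth + E_prior + E_deform (weights are illustrative)."""
    verts = flame_model(params)   # (T, V, 3): posed vertices over T frames
    proj = cam.project(verts)     # (T, V, 2): projections into image space

    # Data term: uncertainty-weighted distance to the predicted 2D alignment.
    e_align = (((proj - pred_pos) / pred_sigma) ** 2).mean()

    # FLAME regularization: keep shape/expression coefficients near the prior mean.
    e_reg = params.shape_coeffs.pow(2).mean() + params.expr_coeffs.pow(2).mean()

    # Motion smoothness: penalize frame-to-frame vertex displacement.
    e_smooth = (verts[1:] - verts[:-1]).pow(2).mean()

    # Neutral-shape prior: keep the identity close to the off-the-shelf prediction.
    e_prior = (flame_model.neutral_shape(params) - neutral_prior).pow(2).mean()

    # Deformation constraint: keep the per-vertex offsets (ΔD) small.
    e_deform = params.delta_d.pow(2).mean()

    return e_align + 1e-3 * e_reg + 1e-2 * e_smooth + 1e-2 * e_prior + 1e-1 * e_deform
```

The energy would then be minimized over the model and camera parameters with a gradient-based optimizer (the paper uses AdamW) until convergence.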

The key advantage of this method is that it uses dense 2D face alignment instead of traditional sparse keypoints; combined with 3D model fitting, this enables accurate and robust 3D facial reconstruction and motion capture.

The main experimental settings and results are as follows:
  • Training data: Multiple datasets were used, including FaceScape, Stirling, and FaMoS, with FLAME model fits and keypoint annotations.
  • 2D alignment network: Segformer-b5 serves as the backbone, with D_img = 512, D_uv = 64, and N_iter = 3. Training uses the AdamW optimizer and image augmentation (see the sketch after this list).
  • 3D model fitting: The AdamW optimizer with an automatic learning-rate scheduler is run until convergence. The per-vertex deformation ΔD is enabled only for multi-view reconstruction and is restricted to the nasal region.
  • Baseline methods: 3DDFAv2, SADRNet, PRNet, DECA, EMOCA, and HRN were implemented and tested, and extended with a temporal prior.
  • Multiface benchmark: The benchmark has two settings, using a single image or a full sequence as observation. The proposed method improves facial-region SSME by 54% and sequence prediction by 46% over the best published method.
  • FaceScape benchmark: The method improves Chamfer distance (CD) by 38% over previous regression methods across a wide range of viewing angles and facial expressions.
  • NoW Challenge: The method performs well in both single- and multi-view settings, improving on the baseline methods by 4% to 13% in the non-metrical challenge.
  • Downstream tasks: The practical value of the approach is further demonstrated by gains on avatar synthesis and speech-driven facial animation tasks.
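For reference, a minimal sketch of what the reported alignment-network training setup might look like; the placeholder network, learning rate, weight decay, and augmentation choices are our assumptions, since the article only states that AdamW and image augmentation were used:

```python
import torch
from torchvision import transforms

# Placeholder for the actual network (Segformer-b5 backbone plus the
# iterative flow module, with D_img = 512, D_uv = 64, N_iter = 3).
alignment_net = torch.nn.Conv2d(3, 64, kernel_size=3)

# AdamW, as reported; lr and weight_decay are assumed values.
optimizer = torch.optim.AdamW(alignment_net.parameters(), lr=1e-4, weight_decay=1e-2)

# Illustrative image augmentations; the paper does not list its exact set.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomGrayscale(p=0.1),
])
```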

In this paper, we propose a state-of-the-art face tracking pipeline with a highly robust and accurate 2D alignment module, and validate its performance on a variety of benchmarks and downstream tasks. However, the proposed two-stage pipeline is not fully differentiable, which precludes end-to-end learning. In addition, our training data is limited to lab-captured data. In future work, we intend to extend the alignment network to predict depth directly, eliminating the 3D model fitting step, and to use synthetic datasets to alleviate the data limitations. We believe our tracker will accelerate research on downstream tasks by generating large-scale facial capture data from off-the-shelf video datasets, and that our new motion capture evaluation benchmark will focus and align future research efforts toward more accurate methods.


This article is for academic sharing only. If there is any infringement, please contact us to have it removed.

3D Vision Workshop Exchange Group

We have established multiple communities in 3D vision-related directions, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D Computer Vision: Image Classification/Segmentation, Object Detection/Tracking, Medical Imaging, GAN, OCR, 2D Defect Detection, Remote Sensing Mapping, Super-Resolution, Face Detection, Behavior Recognition, Model Quantization/Pruning, Transfer Learning, Human Pose Estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point clouds, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, array cameras, photometric stereo vision, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, object tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition to these, there are also groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

To join a group, add the assistant: dddvision, with the note: research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry).

3D Vision Workshop Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflectometry, Robotic Arm Grasping, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control Planning, LeGO-LOAM, Multimodal Fusion SLAM, LOAM-SLAM, Indoor and Outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D Reconstruction, colmap, Line and Surface Structured Light, Hardware Structured Light Scanners, Drones, etc.