Camera parameters? Not needed! CVPR'24 S2DHand Dual-View Hand Pose Estimation Framework

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Paper title: Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation

Authors: Ruicong Liu, Takehiko Ohkawa, et al.

Affiliation: The University of Tokyo, Tokyo, Japan

Paper link: https://arxiv.org/pdf/2403.04381.pdf

Code link: https://github.com/ut-vision/S2DHand

This paper proposes a novel single-to-dual-view adaptation (S2DHand) solution designed to adapt a pre-trained single-view estimator to a dual-view setup. Unlike existing multi-view training methods, the adaptation process of S2DHand is unsupervised, requires no multi-view annotations, and can handle arbitrary dual-view pairs with unknown camera parameters, making the model suitable for diverse camera settings. S2DHand is built on two stereo constraints: pairwise cross-view consistency and transformation invariance between the two views. The two constraints are used in a complementary way to generate pseudo-labels, enabling reliable adaptation. Evaluation results show that S2DHand achieves significant improvements on different camera pairs, in both in-dataset and cross-dataset settings, and outperforms existing adaptation methods.

Reader's takeaway:

This paper introduces a novel single-to-dual-view adaptation framework (S2DHand) that adapts a single-view hand pose estimator to a dual-view setup. S2DHand is unsupervised and requires no multi-view labels. The method also requires no camera parameters, so it is compatible with arbitrary dual-view pairs. Two stereo constraints are used as two pseudo-labeling modules that complement each other. The method achieves significant performance improvements for all dual-view pairs in both in-dataset and cross-dataset settings. Its novelty and performance give it broad application prospects for dual-view hand pose estimation.

This paper introduces a novel method, S2DHand, for estimating 3D hand pose from an egocentric perspective. The method adapts a single-view estimator to dual views without multi-view labels or camera parameters. Specifically, it exploits cross-view consistency and transformation invariance between the two camera coordinate systems to generate reliable pseudo-labels and thereby improve the model's fit under dual views. The evaluation shows that the proposed method achieves significant improvements on different camera pairs and outperforms existing adaptation methods in both in-dataset and cross-dataset settings. The main contribution of this paper is an unsupervised single-to-dual-view adaptation method, providing a new solution for egocentric 3D hand pose estimation.

The contributions of this paper are:

  • A novel unsupervised single-to-dual-view adaptation (S2DHand) solution is proposed for egocentric 3D hand pose estimation. The authors' approach can adapt a conventional single-view estimator to arbitrary dual views without annotations or camera parameters.
  • A pseudo-label-based adaptation strategy is established. It exploits cross-view consistency and transformation invariance between the two camera coordinate systems to generate reliable pseudo-labels, leading to two key modules: attention-based merging and rotation-guided refinement.
  • The evaluation shows that the method benefits arbitrarily placed camera pairs, achieving significant improvements for all camera pairs in both in-dataset and cross-dataset settings.

This section discusses the problem setup for single-to-dual-view adaptive hand pose estimation. First, the representation of the dual-view dataset is introduced: it contains image pairs from two views but no ground-truth hand poses and no camera parameters. The goal is then described, i.e., adapting a pre-trained single-view estimator to an arbitrary dual-view setup without ground truth or camera parameters. The input to the method is a pre-trained estimator and unlabeled dual-view data, and the output is an adapted estimator with parameters specific to that dual-view configuration. Finally, an example layout of a multi-view head-mounted camera rig is presented, along with the synthetic training data used to explore the method's performance.
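
To make the setup concrete, the sketch below shows the intended interface: unlabeled dual-view image pairs go in together with a pre-trained single-view estimator, and an adapted estimator comes out. The class and function names (UnlabeledDualViewDataset, adapt_single_to_dual) are illustrative placeholders, not the authors' actual API.

```python
import copy
from torch.utils.data import Dataset

class UnlabeledDualViewDataset(Dataset):
    """Image pairs (view 0, view 1) from two head-mounted cameras.
    No ground-truth poses and no camera parameters are stored."""
    def __init__(self, pairs):
        self.pairs = pairs  # list of (img_view0, img_view1) tensors

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        img0, img1 = self.pairs[idx]
        return img0, img1   # note: no labels are returned

def adapt_single_to_dual(pretrained_estimator, dual_view_loader):
    """Hypothetical wrapper: takes a supervised single-view estimator and
    unlabeled dual-view data, returns a copy specialized to this camera pair."""
    adapted = copy.deepcopy(pretrained_estimator)
    # ... the unsupervised adaptation of Sections 3.1-3.4 updates `adapted`
    return adapted
```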

This section introduces the proposed S2DHand framework. An initialization step is first performed to estimate the rotation matrix between the two views, which is essential for establishing the transformation between the two camera coordinate systems. The architecture consists of two branches: the estimator H and its momentum version H'. The adaptation process is built on two stereo constraints: pairwise cross-view consistency and rotational transformation invariance between the two camera coordinate systems. These lead to two key pseudo-labeling modules: attention-based merging and rotation-guided refinement. The two modules work in a complementary way, ensuring reliable pseudo-labels based on prediction accuracy.
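
A minimal sketch of the two-branch layout is given below, assuming a generic PyTorch nn.Module estimator; the momentum coefficient value is an illustrative choice, not the paper's setting.

```python
import copy
import torch

def build_branches(estimator):
    student = estimator                     # H, updated by gradient descent
    teacher = copy.deepcopy(estimator)      # H', the momentum version
    for p in teacher.parameters():
        p.requires_grad_(False)             # the teacher is never back-propagated
    return student, teacher

@torch.no_grad()
def momentum_update(student, teacher, m=0.99):
    # theta' <- m * theta' + (1 - m) * theta  (exponential moving average)
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)
```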

3.1 Initialization

This section describes the initialization step, which estimates a reasonably accurate rotation matrix R relating the two camera coordinate systems. The step assumes that the pre-trained estimator can already produce reasonable predictions. Using unlabeled dual-view data, the estimator outputs a set of paired predictions, from which the rotation matrix R is then estimated. This ensures rotational alignment during adaptation.
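
One standard way to recover such a rotation from paired predictions is to solve an orthogonal Procrustes (Kabsch) problem with an SVD, as sketched below; this is a plausible stand-in for the estimation step, and the paper's exact procedure may differ.

```python
import torch

def estimate_rotation(preds_view0, preds_view1):
    """preds_view*: (N, J, 3) root-relative 3D joints predicted in each view.
    Kabsch-style fit used here as an illustrative estimator of R."""
    X = preds_view1.reshape(-1, 3)            # source points (view 1)
    Y = preds_view0.reshape(-1, 3)            # target points (view 0)
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    U, _, Vt = torch.linalg.svd(X.T @ Y)      # SVD of the 3x3 cross-covariance
    d = float(torch.sign(torch.linalg.det(U @ Vt)))
    D = torch.diag(torch.tensor([1.0, 1.0, d]))  # fix possible reflection
    return (U @ D @ Vt).T                     # R such that R @ x1 ≈ x0
```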

3.2 Single-to-dual-view adaptation

This section describes the single-to-dual-view adaptation process. The adaptation starts by initializing the rotation matrix R. The S2DHand framework comprises two branches: an estimator H(·|θ) with dynamically updated parameters θ, and its momentum version H'(·|θ'), whose parameters θ' are updated as an exponential moving average of θ. During adaptation, the momentum model H' generates the pseudo-labels that supervise the model H, and the loss is computed by comparing H's predictions with these pseudo-labels. Finally, the estimator follows the DetNet implementation and directly outputs heatmaps, from which the 3D joint positions are computed.
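
The sketch below shows one adaptation step under these definitions. It assumes the estimators return 3D joints directly (the heatmap-to-joint decoding is folded into the model), that the pseudo-labeling of Sections 3.3 and 3.4 is wrapped in a hypothetical `make_pseudo_labels` helper, and that a per-joint L2 loss is used; all of these are illustrative simplifications. `momentum_update` is the EMA routine sketched earlier.

```python
import torch

def adaptation_step(student, teacher, img0, img1, R, optimizer, m=0.99):
    with torch.no_grad():                        # the teacher only predicts
        t_pred0, t_pred1 = teacher(img0), teacher(img1)
        # make_pseudo_labels: hypothetical wrapper around Sections 3.3-3.4
        pseudo0, pseudo1 = make_pseudo_labels(t_pred0, t_pred1, R)

    s_pred0, s_pred1 = student(img0), student(img1)
    loss = ((s_pred0 - pseudo0) ** 2).mean() + ((s_pred1 - pseudo1) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    momentum_update(student, teacher, m)         # EMA update of the teacher H'
    return loss.item()
```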

3.3 Pseudo-labeling: Attention-based merging

This section describes the attention-based merging module for generating pseudo-labels. The module builds on cross-view consistency: predictions from the two views should agree after being transformed into the same coordinate system, which can be exploited to produce accurate pseudo-labels. To account for differences in how each view captures the hand, a joint-level attention mechanism is introduced. The module transforms the two predictions into a common coordinate system and fuses them with joint-level attention weights to produce the final pseudo-label.
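
A hedged sketch of such a merge is shown below; deriving the joint-level weights from per-joint confidence scores is an assumption made here for illustration, and the paper defines its own weighting.

```python
import torch

def attention_merge(pred0, pred1, conf0, conf1, R):
    """pred*: (J, 3) joints in each camera frame; conf*: (J,) per-joint scores
    (assumed confidence values used as a stand-in for the learned attention)."""
    pred1_in_0 = pred1 @ R.T                    # bring view-1 joints into view 0
    w = torch.softmax(torch.stack([conf0, conf1], dim=0), dim=0)   # (2, J)
    merged = w[0, :, None] * pred0 + w[1, :, None] * pred1_in_0    # (J, 3)
    return merged                               # pseudo-label in the view-0 frame
```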

3.4 Pseudo-labeling: Rotation-guided refinement

This section describes the rotation-guided refinement (RGR) module, which further refines the predictions so that they are consistent across the two views. The module builds on rotational transformation invariance: predictions from the two views should agree after being transformed into the same coordinate system. By minimizing the discrepancy between the predictions and the target rotation matrix, the module makes the predictions more accurate. The final pseudo-labels are obtained as a weighted average of the refined predictions and the pseudo-labels produced by the attention-based merging module. This improves pseudo-label quality and further optimizes the model.
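
Below is a hedged sketch of one way to realize this refinement: the two predictions are nudged toward agreement under the initialized rotation R with a few gradient steps, then blended with the attention-merged pseudo-label. The gradient-based refinement and the blending weight `beta` are illustrative choices rather than the paper's exact rule.

```python
import torch

def rotation_guided_refine(pred0, pred1, merged, R, steps=10, lr=0.1, beta=0.5):
    """pred*: (J, 3) teacher predictions; merged: attention-merged pseudo-label.
    Illustrative refinement, not the paper's exact optimization."""
    pred0, pred1 = pred0.detach(), pred1.detach()
    p0 = pred0.clone().requires_grad_(True)
    p1 = pred1.clone().requires_grad_(True)
    opt = torch.optim.SGD([p0, p1], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        consistency = ((p0 - p1 @ R.T) ** 2).mean()    # cross-view disagreement under R
        anchor = ((p0 - pred0) ** 2).mean() + ((p1 - pred1) ** 2).mean()
        (consistency + anchor).backward()
        opt.step()
    refined = p0.detach()                              # refined view-0 prediction
    return beta * refined + (1.0 - beta) * merged      # final pseudo-label
```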

The experiments center on the single-to-dual-view adaptation task, using the new large-scale benchmark AssemblyHands as the evaluation set. Two adaptation scenarios are considered for the training set:

1) In-dataset scenario, i.e., the training set comes from the same AssemblyHands dataset;

2) Cross-dataset scenario, using synthetic datasets (Rendered Handpose and GANerated Hands) as the training set. The experiments comprise the following:

  • Dataset description: AssemblyHands is a large-scale benchmark dataset with accurate 3D hand pose annotations. GANerated Hands contains more than 330,000 color hand images, and Rendered Handpose contains about 44,000 samples.
  • Experimental setup: The mean per-joint position error (MPJPE) in root-relative coordinates is used as the evaluation metric (a minimal MPJPE sketch is given after this list). A new dual-view MPJPE metric is proposed, and the conventional single-view MPJPE is also reported. The method is implemented in PyTorch, and all experiments run on a single NVIDIA A100 GPU.
  • Adaptation results: Compared with the pre-trained model, S2DHand achieves significant accuracy improvements on all camera pairs under both in-dataset and cross-dataset settings, with an average improvement of more than 10% and a maximum improvement of more than 20%.
  • Cross-dataset comparison: S2DHand is compared with leading domain adaptation methods, including SFDAHPE, RegDA, DAGEN, and ADDA. The results show that S2DHand outperforms these methods in the cross-dataset setting.
  • Ablation study: The contribution of each component is analyzed. The results show that both the attention-based merging module and the rotation-guided refinement module significantly improve hand pose estimation performance.
  • Number of input image pairs: The performance of S2DHand is evaluated for different numbers of input image pairs; performance stabilizes when N ≥ 1000, so N = 1000 is selected as the optimal number.
  • Complementarity of the two pseudo-labels: The experiments show that the rotation-guided refinement module plays an important role in handling inaccurate predictions and effectively improves the pseudo-labels.
  • Hyperparameter analysis: The optimal values are determined by tuning the hyperparameters α and β.
  • Qualitative results: By projecting the 3D hand joints onto the image plane, the clear improvement of S2DHand in dual-view hand pose estimation is demonstrated.
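
For reference, here is a minimal sketch of root-relative MPJPE, assuming joint index 0 is the root (e.g., the wrist) and coordinates are in millimetres; treating the dual-view metric as the average of the two single-view errors is an assumption made here, not the paper's exact definition.

```python
import torch

def mpjpe_root_relative(pred, gt, root_idx=0):
    """pred, gt: (N, J, 3) joint positions in millimetres."""
    pred_rel = pred - pred[:, root_idx:root_idx + 1, :]   # subtract root joint
    gt_rel = gt - gt[:, root_idx:root_idx + 1, :]
    return torch.linalg.norm(pred_rel - gt_rel, dim=-1).mean()

def dual_view_mpjpe(pred0, gt0, pred1, gt1):
    # assumed form: average of the two single-view root-relative errors
    return 0.5 * (mpjpe_root_relative(pred0, gt0) + mpjpe_root_relative(pred1, gt1))
```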

In summary, the experimental results show that S2DHand achieves significant performance improvements on the single-to-dual-view adaptation task, especially in cross-dataset settings, which gives it high practical value and broad application prospects.

In this paper, we propose a novel single-to-dual-view adaptation framework (S2DHand), which adapts a single-view hand pose estimator to a dual-view setting. S2DHand is unsupervised and eliminates the need for multi-view labels. The method also requires no camera parameters and is compatible with arbitrary dual-view pairs. The two stereo constraints are used as two complementary pseudo-labeling modules. The method achieves significant performance improvements on all dual-view pairs in both in-dataset and cross-dataset settings.

This article is for academic sharing only. If there is any infringement, please contact us to have it deleted.

3D Vision Workshop Exchange Group

We have established multiple communities covering 3D vision, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, UAVs, and more. The subdivisions include:

2D computer vision: image classification/segmentation, object detection/tracking, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, action recognition, model quantization and pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point clouds, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, camera arrays, photometric stereo, etc.

SLAM: visual SLAM, LiDAR SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, LiDAR, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane detection, Occupancy, object tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition, there are also groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with a note in the form research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, structured light, phase deflectometry, robotic arm grasping, hands-on point clouds, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision C++, 3D vision Python, dToF, camera calibration, ROS2, robot control and planning, LeGO-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor and outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, colmap, line and area structured light, hardware structured-light scanners, UAVs, etc.
