
CVPR'24 Highlight!

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant: dddvision, note: direction + school/company + nickname, and you will be added to the group. The industry subdivisions are listed at the end of the article.

Motion estimation has traditionally been handled by two paradigms: feature tracking and optical flow. While each enables many applications, neither fully captures motion in video: optical flow only produces motion between adjacent frames, while feature tracking only tracks sparse pixels.

An ideal solution would estimate dense and long-range pixel trajectories across a video sequence. However, current solutions still struggle in challenging scenarios, especially when complex deformations are accompanied by frequent self-occlusion. One potential reason is that tracking is performed only in 2D image space, ignoring the inherently three-dimensional nature of motion. Since motion takes place in 3D space, some of its properties can only be expressed adequately through 3D representations. For example, rotation can be described succinctly with three parameters in 3D, and occlusion can be represented simply with a z-buffer, but both are much more complex to express in 2D. Image projection can also bring spatially distant regions close together in 2D, so the local 2D neighborhoods used for correlation may contain irrelevant context (especially near occlusion boundaries), making inference difficult.
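As a toy illustration of this point (not from the paper), the sketch below applies a single 3-parameter axis-angle rotation to a set of 3D points and projects them with a pinhole camera; the resulting 2D flow varies strongly with position and depth, which is exactly what makes purely 2D reasoning about such motion hard.

```python
# Illustrative sketch only: one 3-parameter 3D rotation produces a spatially
# varying 2D flow field after perspective projection.
import numpy as np

def axis_angle_to_matrix(rotvec):
    """Rodrigues' formula: 3-parameter axis-angle -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-8:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def project(points, f=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of Nx3 camera-space points to Nx2 pixel coordinates."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([f * x / z + cx, f * y / z + cy], axis=-1)

# Random 3D points at different depths, all rotated by the same axis-angle vector.
pts = np.random.uniform([-1, -1, 3], [1, 1, 6], size=(1000, 3))
R = axis_angle_to_matrix(np.array([0.0, 0.3, 0.1]))    # just 3 parameters in 3D
flow_2d = project(pts @ R.T) - project(pts)             # depth-dependent 2D flow
print(flow_2d.std(axis=0))  # large spread: the same 3D motion looks very different in 2D
```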

To address these challenges, the authors propose to leverage the geometric priors of a state-of-the-art monocular depth estimator to lift 2D pixels into 3D and track them in 3D space. This means computing feature correlations in 3D, which provides more meaningful 3D context for tracking, especially under complex motion. Tracking in 3D also allows 3D motion priors, such as the as-rigid-as-possible (ARAP) constraint, to be enforced. Encouraging the model to learn which points move rigidly together helps track blurry or occluded pixels, since their motion can be inferred from clearly visible neighboring regions in the same rigid group.
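Concretely, lifting a pixel (u, v) with depth z into camera space is standard pinhole back-projection. Below is a minimal NumPy sketch of that step, assuming known intrinsics (fx, fy, cx, cy); the depth map would come from an off-the-shelf monocular estimator, and this is only an illustration, not the authors' code.

```python
# Minimal sketch of back-projecting a depth map into a 3D point cloud.
import numpy as np

def lift_pixels_to_3d(depth, fx, fy, cx, cy):
    """depth: (H, W) depth map -> (H, W, 3) camera-space 3D points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grid
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=-1)

# Usage: in practice depth comes from a monocular depth network; a constant map
# is used here purely for illustration.
depth = np.full((480, 640), 2.0)
points_3d = lift_pixels_to_3d(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
print(points_3d.shape)  # (480, 640, 3)
```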

Let's read about this work together~

Title: SpatialTracker: Tracking Any 2D Pixels in 3D Space

Authors: Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, Xiaowei Zhou

Institutions: Zhejiang University, UC Berkeley, Ant Group

Original link: https://arxiv.org/abs/2404.04319

Code link: https://github.com/henry123-boy/SpaTracker

Official Website: https://henry123-boy.github.io/SpaTracker/

Recovering dense and long-range pixel motion in video is a challenging problem. Part of the difficulty comes from the 3D-to-2D projection process, which leads to occlusions and discontinuities in the 2D motion field. While 2D motion can be complex, we believe the underlying 3D motion is often simple and low-dimensional. In this work, we propose to alleviate the problems caused by image projection by estimating point trajectories in 3D space. Our method, named SpatialTracker, uses a monocular depth estimator to lift 2D pixels into 3D, uses a triplane representation to efficiently encode the 3D content of each frame, and uses a transformer to perform iterative updates that estimate 3D trajectories. Tracking in 3D allows us to exploit the as-rigid-as-possible (ARAP) constraint while learning rigidity embeddings that cluster pixels into different rigid parts. Extensive evaluation shows that our method achieves state-of-the-art tracking performance both qualitatively and quantitatively, especially in challenging scenarios such as out-of-plane rotation.

Tracking 2D pixels in 3D space. To estimate 2D motion under occlusion and complex 3D motion, the authors lift 2D pixels into 3D and perform tracking in 3D space.


Comparison with the 2D trackers TAPIR and CoTracker. SpatialTracker can handle challenging scenarios such as out-of-plane rotation and occlusion.


Segmentation of rigid parts in video. SpatialTracker identifies different rigid parts of the scene by clustering their 3D trajectories.


(1) The authors propose a triplane feature map to represent the 3D scene of each frame: image features are first lifted into a 3D feature point cloud and then splatted onto three orthogonal planes. The triplane representation is compact and regular, making it well suited to a learning framework (a sketch illustrating (1) and (2) follows after this list).

(2) The triplanes densely cover 3D space, so a feature vector can be extracted for any 3D point and used for tracking. The 3D trajectories of the query pixels are then initialized and iteratively updated by a transformer using features sampled from the triplane representation.

(3) To regularize the estimated 3D trajectories with a 3D motion prior, the model additionally predicts a rigidity embedding for each trajectory, which makes it possible to softly group pixels exhibiting the same rigid-body motion and enforce an ARAP regularization within each rigid cluster. The authors demonstrate that the rigidity embeddings can be learned in a self-supervised manner and produce reasonable segmentations of different rigid parts.

(4) The model achieves state-of-the-art performance on a variety of public tracking benchmarks, including TAP-Vid, BADJA, and PointOdyssey. Qualitative results on challenging internet videos also demonstrate the model's ability to handle fast, complex motion and long-term occlusion.
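To make points (1) and (2) concrete, here is a simplified NumPy sketch of the triplane idea: per-point features are splatted onto three orthogonal planes and later read back for arbitrary 3D query points. Nearest-cell splatting and normalized coordinates in [0, 1]^3 are assumptions made purely for brevity; a real implementation would use bilinear splatting/sampling and learned encoders.

```python
# Simplified triplane splat-and-sample sketch (illustration, not the paper's code).
import numpy as np

def splat_to_triplanes(xyz, feats, res=64):
    """xyz: (N, 3) points in [0, 1]^3, feats: (N, C) -> (3, res, res, C) planes."""
    C = feats.shape[1]
    planes = np.zeros((3, res, res, C))      # XY, XZ, YZ planes
    counts = np.zeros((3, res, res, 1))
    idx = np.clip((xyz * res).astype(int), 0, res - 1)
    axes = [(0, 1), (0, 2), (1, 2)]          # which two coordinates index each plane
    for p, (a, b) in enumerate(axes):
        np.add.at(planes[p], (idx[:, a], idx[:, b]), feats)
        np.add.at(counts[p], (idx[:, a], idx[:, b]), 1.0)
    return planes / np.maximum(counts, 1.0)  # average features that land in a cell

def sample_triplanes(planes, xyz):
    """Fetch and sum the three plane features for (M, 3) query points."""
    res = planes.shape[1]
    idx = np.clip((xyz * res).astype(int), 0, res - 1)
    axes = [(0, 1), (0, 2), (1, 2)]
    return sum(planes[p][idx[:, a], idx[:, b]] for p, (a, b) in enumerate(axes))

# Usage: splat the lifted 3D feature point cloud once per frame, then sample
# features along candidate 3D trajectories during the iterative updates.
pts, feats = np.random.rand(5000, 3), np.random.rand(5000, 32)
planes = splat_to_triplanes(pts, feats)
print(sample_triplanes(planes, np.random.rand(10, 3)).shape)  # (10, 32)
```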

Pipeline overview. Each frame is first encoded into a triplane representation by a triplane encoder (a). Then, using features extracted from these triplanes as input, point trajectories (c) are initialized and iteratively updated in 3D space by a transformer. The 3D trajectories are trained with ground-truth annotations and regularized by an as-rigid-as-possible (ARAP) constraint using the learned rigidity embeddings (d). The ARAP constraint enforces that the 3D distance between points with similar rigidity embeddings remains constant over time. Here d_ij denotes the distance between points i and j, and s_ij denotes their rigidity similarity. SpatialTracker produces accurate long-range motion trajectories (e) even under fast motion and severe occlusion.
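Using the d_ij / s_ij notation from the caption, the sketch below shows one way such an ARAP regularizer could be computed from learned rigidity embeddings. It illustrates the idea only and is not the authors' exact loss; the sigmoid similarity and its temperature are assumptions.

```python
# Hedged sketch of an ARAP-style regularizer with soft rigidity weights.
import torch

def arap_loss(traj_3d, rigid_emb):
    """
    traj_3d:   (T, N, 3) estimated 3D trajectories over T frames for N points.
    rigid_emb: (N, D) learned per-point rigidity embeddings.
    """
    # s_ij: soft probability that points i and j belong to the same rigid part.
    emb = torch.nn.functional.normalize(rigid_emb, dim=-1)
    s = torch.sigmoid(emb @ emb.t() * 10.0)   # (N, N); temperature 10 is an assumption

    # d_ij(t): pairwise 3D distances at every frame.
    d = torch.cdist(traj_3d, traj_3d)         # (T, N, N)

    # Penalize variation of d_ij over time, weighted by the rigidity similarity s_ij.
    var_over_time = (d - d.mean(dim=0, keepdim=True)).abs().mean(dim=0)  # (N, N)
    return (s * var_over_time).mean()

# Usage with dummy tensors:
print(float(arap_loss(torch.randn(8, 64, 3), torch.randn(64, 16))))
```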


The TAP-Vid benchmark contains several datasets: TAP-Vid-DAVIS (30 real-world videos of roughly 34-104 frames), TAP-Vid-Kinetics (1,144 real-world videos of 250 frames), and RGB-Stacking (50 synthetic videos of 250 frames). Each video in the benchmark is annotated with ground-truth 2D tracks and occlusions. Performance is evaluated with the standard TAP-Vid metrics: average position accuracy (< δ^x_avg), Average Jaccard (AJ), and Occlusion Accuracy (OA). SpatialTracker consistently outperforms all baseline methods on all three datasets, with the exception of OmniMotion on RGB-Stacking, demonstrating the benefits of tracking in 3D space. OmniMotion also performs tracking in 3D and obtains the best results on RGB-Stacking by optimizing over all frames at once, but this requires very expensive test-time optimization.
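For reference, here is a hedged sketch of how the position-accuracy and occlusion-accuracy metrics are commonly computed on TAP-Vid-style annotations (our reading of the metric definitions, not the official evaluation code; Average Jaccard additionally combines position and occlusion correctness and is omitted here).

```python
# Hedged sketch of TAP-Vid-style position and occlusion accuracy metrics.
import numpy as np

def tapvid_metrics(pred_xy, gt_xy, pred_occ, gt_occ, thresholds=(1, 2, 4, 8, 16)):
    """
    pred_xy, gt_xy:   (T, N, 2) predicted / ground-truth 2D tracks in pixels.
    pred_occ, gt_occ: (T, N) boolean occlusion flags (True = occluded).
    """
    visible = ~gt_occ
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)   # (T, N) pixel error

    # < delta^x_avg: fraction of visible points within x pixels, averaged over thresholds.
    delta_avg = np.mean([(err[visible] < t).mean() for t in thresholds])

    # OA: fraction of points whose visibility/occlusion state is classified correctly.
    occ_acc = (pred_occ == gt_occ).mean()
    return delta_avg, occ_acc

d, oa = tapvid_metrics(np.zeros((5, 10, 2)), np.zeros((5, 10, 2)),
                       np.zeros((5, 10), bool), np.zeros((5, 10), bool))
print(d, oa)  # 1.0 1.0 on this trivial example
```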


BADJA is a benchmark containing seven videos of moving animals with keypoint annotations. The metrics used in this benchmark are segment-based accuracy (segA) and 3px accuracy (δ3px). SpatialTracker demonstrates competitive performance on δ3px and significantly outperforms all baseline methods on segment-based accuracy.


PointOdyssey is a large-scale synthetic dataset containing a wide variety of animated characters, from humans to animals, placed in diverse 3D environments. Evaluation is performed on PointOdyssey's test set, which contains 12 videos with complex motion, each roughly 2,000 frames long, using the evaluation metrics proposed by PointOdyssey, which are designed to assess very long trajectories. SpatialTracker consistently outperforms the baseline methods on all metrics by clear margins. In particular, the authors show that performance can be further improved by using more accurate ground-truth depth, demonstrating SpatialTracker's potential to keep improving as monocular depth estimation advances.


3D tracking results.


In this work, the authors demonstrate that a properly designed three-dimensional representation is critical to solving the long-standing challenge of dense and long-range motion estimation in video. Motion occurs naturally in 3D space, and tracking in 3D allows the model to better exploit the regularities of 3D motion, such as the ARAP constraint. The authors propose a novel framework that estimates 3D trajectories with a triplane representation and a learnable ARAP constraint, identifying rigid groups in the scene and enforcing rigidity within each group. Experiments show that SpatialTracker outperforms existing baseline methods and is suitable for challenging real-world scenarios.

SpatialTracker relies on off-the-shelf monocular depth estimators, whose accuracy may affect the final tracking performance. However, the authors expect that advances in monocular reconstruction techniques will improve motion estimation performance; the two problems are closely coupled and can benefit each other.

Readers who are interested in more experimental results and details of the article can read the original paper~

This article is for academic sharing only; if there is any infringement, please contact us to delete it.

3D Vision Workshop Exchange Group

We have established multiple communities covering 3D vision, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D Computer Vision: Image Classification/Segmentation, Object Detection, Medical Imaging, GAN, OCR, 2D Defect Detection, Remote Sensing Mapping, Super-Resolution, Face Detection, Behavior Recognition, Model Quantization and Pruning, Transfer Learning, Human Pose Estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflection, Halcon, photogrammetry, array camera, photometric stereo vision, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, object tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition, there are exchange groups for job hunting, hardware selection, vision product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant: dddvision, note: research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflection, Robotic Arm Grasping, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control Planning, LeGO-LOAM, Multimodal Fusion SLAM, LOAM-SLAM, Indoor and Outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D Reconstruction, colmap, Linear and Surface Structured Light, Hardware Structured Light Scanners, Drones, etc.