
The latest | SceneTracker: Track everything in 4D space-time

Author: 3D Vision Workshop
Hello everyone. Today 3DCV shares with you the first published (March 2024) work to effectively address online 3D point tracking, also known as long-term scene flow estimation (LSFE): SceneTracker. If you have any work to share, please contact cv3d008!

Reader's personal understanding

In the 4D space-time formed by time and 3D space, accurately capturing and analyzing long-term, fine-grained object motion online plays a crucial role in higher-level scene understanding for robotics, autonomous driving, the metaverse, and embodied intelligence.

SceneTracker, proposed in this study, is the first published (March 2024) work to effectively address online 3D point tracking, i.e., long-term scene flow estimation (LSFE). It quickly and accurately captures the 3D trajectory of any target point in 4D space-time (RGB-D video), allowing computers to gain insight into how objects move and interact in a given environment.

SceneTracker is a novel learning-based LSFE network that iteratively approximates the optimal trajectory. It dynamically indexes and constructs appearance and depth correlation features, and uses a Transformer to mine and exploit long-range dependencies within and across trajectories. Detailed experiments show that SceneTracker handles 3D occlusion and depth-noise interference well, making it well suited to the demands of the LSFE task.

Finally, this study builds the first real-world LSFE evaluation dataset, LSFDriving, on which experiments further confirm SceneTracker's commendable generalization ability.

Paper information

Title: SceneTracker: Long-term Scene Flow Estimation Network

Authors: Bo Wang, Jian Li, Yang Yu, Li Liu, Zhenping Sun, Dewen Hu

Institution: National University of Defense Technology

Original link: https://arxiv.org/abs/2403.19924

Code Link: https://github.com/wwsource/SceneTracker

Introduction to the proposed methodology

Our goal is to track 3D points in an RGB-D video. We formalize the problem as follows: an RGB-D video is a sequence of RGB-D frames, and long-term scene flow estimation aims to generate, in the camera coordinate system, the 3D trajectories of query points whose initial locations are known. By default, all trajectories start at the first frame of the video; note, however, that our method can flexibly start tracking from any frame. The overall architecture of our approach is shown in Figure 1.
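
To make the setup concrete, the problem can be written in the following form; the symbols below are illustrative choices, not necessarily the paper's own notation:

```latex
% Illustrative notation: an RGB-D video of T frames and N query points
% with known initial camera-frame locations.
V = \{(I_t, D_t)\}_{t=1}^{T}, \qquad
I_t \in \mathbb{R}^{H \times W \times 3}, \quad D_t \in \mathbb{R}^{H \times W}

\{P_1^{\,i}\}_{i=1}^{N} \;\longmapsto\;
\{P_t^{\,i} \in \mathbb{R}^{3} \;\mid\; t = 1,\dots,T,\ i = 1,\dots,N\}
```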


Figure 1

1. Trajectory initialization

The first step of initialization is to divide the entire video into sliding windows of a fixed length with a fixed sliding stride. As shown on the left side of Figure 1, the query points to be tracked are illustrated by the red, green, and blue dots. For the first sliding window, the trajectory in every frame is initialized to the query point's initial location. For each subsequent sliding window, the frames that overlap the previous window are initialized with the previous window's estimates for those frames, and the remaining frames are initialized with the previous window's estimate at its last frame. Taking any one sliding window as an example, we thus obtain its initialized trajectory.
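
As a rough illustration of this initialization scheme (not the repository's actual code), the sketch below assumes an even window length T_w, a stride of T_w / 2, and trajectories stored as PyTorch tensors:

```python
import torch

def init_window_trajectories(queries, prev_window_est, T_w):
    """Initialize one sliding window's trajectories (illustrative sketch).

    queries:          (N, 3) query-point coordinates, used only for the first window
    prev_window_est:  (T_w, N, 3) estimates from the previous window, or None
    T_w:              window length in frames (assumed even, stride T_w // 2)
    """
    if prev_window_est is None:
        # First window: every frame starts at the query point's initial location.
        return queries.unsqueeze(0).repeat(T_w, 1, 1)
    half = T_w // 2
    init = torch.empty_like(prev_window_est)
    # Overlapping first half: copy the previous window's second-half estimates.
    init[:half] = prev_window_est[half:]
    # New second half: repeat the previous window's last-frame estimate.
    init[half:] = prev_window_est[-1:].repeat(T_w - half, 1, 1)
    return init
```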

2. Feature encoding and downsampling

We run network inference at a reduced resolution, determined by a downsampling factor. First, we use a Feature Encoder network to extract image features; the Feature Encoder is a convolutional neural network consisting of 8 residual blocks and 5 downsampling layers. Unlike the RGB images, the raw depth map of each frame is downsampled directly by sampling at equal intervals. In addition, we use the camera intrinsics to convert the trajectory from the camera coordinate system to a coordinate system consisting of the image-plane and depth dimensions. The conversion formula is as follows:
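
The formula referred to here is the standard pinhole-camera conversion; the symbol names below are assumed for illustration:

```latex
% Camera-frame point (X, Y, Z) -> image-plane-plus-depth coordinates (u, v, d),
% with intrinsics f_x, f_y, c_x, c_y:
u = \frac{f_x X}{Z} + c_x, \qquad
v = \frac{f_y Y}{Z} + c_y, \qquad
d = Z
```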

Further, we downsample the initialized trajectory by the same factor.

3. Updates to template features and trajectories

In the Flow Iteration module, we iteratively update the template features and 3D trajectories of the query points. When processing the first frame of the first sliding window, we bilinearly sample the feature map at the query points' coordinates to obtain the template features of the first frame. We then duplicate these features along the time dimension to obtain the initial template features for all subsequent sliding windows. Thus, all sliding windows share the same initial template features but have their own distinct initialized trajectories; after passing through the same Transformer Predictor modules, both are updated to the refined template features and trajectories.
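
A minimal PyTorch-style sketch of the bilinear sampling and time replication described above; the tensor shapes, argument names, and use of `grid_sample` are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn.functional as F

def init_template_features(feat_map, query_xy, num_frames):
    """Sample template features at the query points and repeat them along time.

    feat_map:   (C, H, W) feature map of the first frame
    query_xy:   (N, 2) query coordinates, in pixels, on the downsampled feature map
    num_frames: window length over which the features are replicated
    """
    C, H, W = feat_map.shape
    # grid_sample expects coordinates normalized to [-1, 1].
    norm = query_xy.clone().float()
    norm[:, 0] = 2.0 * query_xy[:, 0] / (W - 1) - 1.0
    norm[:, 1] = 2.0 * query_xy[:, 1] / (H - 1) - 1.0
    grid = norm.view(1, 1, -1, 2)                                  # (1, 1, N, 2)
    sampled = F.grid_sample(feat_map[None], grid,
                            mode='bilinear', align_corners=True)   # (1, C, 1, N)
    templates = sampled[0, :, 0].t()                               # (N, C)
    # Duplicate along the time dimension for the whole window.
    return templates.unsqueeze(0).repeat(num_frames, 1, 1)         # (T, N, C)
```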

4. Trajectory output

We first upsample the estimated trajectories to match the original input resolution. Then, using the camera intrinsics, we convert them from the image-plane-plus-depth coordinate system back to the camera coordinate system to obtain the final 3D trajectories. Finally, we link all the sliding windows together; for the overlapping portions of adjacent windows, the result of the later window is taken.
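
These two output steps can be sketched as follows, assuming known intrinsics and equal-length windows; the names and shapes are illustrative, not the paper's exact implementation:

```python
import torch

def unproject_to_camera(uvd, fx, fy, cx, cy):
    """Convert (u, v, d) trajectories back to camera coordinates (X, Y, Z)."""
    u, v, d = uvd.unbind(-1)
    X = (u - cx) * d / fx
    Y = (v - cy) * d / fy
    return torch.stack([X, Y, d], dim=-1)

def link_windows(window_results, stride):
    """Stitch per-window trajectories; overlapping frames take the later window.

    window_results: list of (T_w, N, 3) tensors; window k starts at frame k * stride.
    """
    T_w, N, _ = window_results[0].shape
    T_total = stride * (len(window_results) - 1) + T_w
    full = torch.zeros(T_total, N, 3)
    for k, win in enumerate(window_results):
        full[k * stride:k * stride + T_w] = win  # later windows overwrite the overlap
    return full
```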

Introduction to the proposed dataset

Given a sequence of autonomous-driving data, our goal is to construct an RGB-D video together with the 3D trajectories of points of interest selected in the first frame. Specifically, we sample points of interest from static backgrounds, moving rigid vehicles, and moving non-rigid pedestrians.

1. Annotating the background

First, we use the camera intrinsic and extrinsic parameters to extract the LiDAR points of the first frame that project correctly onto the image. We then use 2D object-detection bounding boxes to filter out all foreground LiDAR points. For each remaining LiDAR point, we project it onto the subsequent frames based on the ego vehicle's pose. Formally, the projected point at time t is:
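
A plausible form of this projection, using the ego-vehicle poses and notation assumed here rather than taken verbatim from the paper:

```latex
% A background LiDAR point p_1, observed in the car-body frame at time 1, is carried
% to the car-body frame at time t via the world frame (T_t: body-to-world pose at t):
p_t = T_t^{-1}\, T_1\, p_1
```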

Here, the matrix in question is the transformation from the car-body coordinate system to the world coordinate system at time t (i.e., the ego-vehicle pose at that time).

2. Annotating vehicles

Unlike the background, each vehicle has its own independent motion. We therefore introduce the 3D bounding boxes from 3D object tracking, which provide, for each time step, the transformation matrix from the world coordinate system to the bounding-box coordinate system. We use the 3D bounding boxes to select the LiDAR points belonging to the vehicles. Taking one such LiDAR point as an example, its projected point at time t is:
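
Again, a plausible form of this projection under the rigid-vehicle assumption, with notation assumed here (B_t denotes the world-to-box transformation given by the tracked 3D box at time t):

```latex
% A vehicle LiDAR point p_1 (world frame, time 1) is fixed in the rigid box frame,
% so its world-frame position at time t follows from the box poses:
p_t = B_t^{-1}\, B_1\, p_1
```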

3. Annotating pedestrians

The complexity and non-rigidity of pedestrian motion make annotation difficult, as evidenced by the absence of such data in existing scene flow estimation datasets. We address this challenge indirectly with stereo (binocular) video. First, we prepare a frame-by-frame rectified stereo video. We then use a semi-automatic annotation framework to efficiently and accurately label the 2D trajectories of the points of interest in the left and right videos. In the first step, we label the points of interest: using custom annotation software we developed, we mark the 2D coordinates of each point of interest in the first frame of the left-view image. In the second step, we compute the coarse left-view trajectory by running CoTracker on the left-view video. In the third step, we compute the coarse right-view trajectory: we use LEAStereo to estimate the disparity of the point of interest frame by frame and derive the coarse right-view trajectory from it. The fourth step is manual refinement, in which the left and right coarse trajectories are displayed in the annotation software and all low-quality annotations are corrected by human annotators. Finally, we combine the refined left-view trajectory with the disparity sequence to construct the 3D trajectory, as sketched below. Figure 2 illustrates the LSFE labeling process for pedestrians.


Figure 2
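
To make the final step concrete, here is a minimal sketch of lifting the refined left-view trajectory and the disparity sequence to a 3D trajectory via standard rectified-stereo triangulation; the function name, parameters, and NumPy usage are illustrative assumptions, not the dataset's actual tooling:

```python
import numpy as np

def stereo_track_to_3d(uv_left, disparity, fx, fy, cx, cy, baseline):
    """Lift a refined left-view 2D trajectory to 3D via rectified-stereo triangulation.

    uv_left:   (T, 2) per-frame pixel coordinates of the point in the left view
    disparity: (T,)   per-frame disparity (left u minus right u), in pixels
    fx, fy, cx, cy, baseline: rectified intrinsics and stereo baseline
    """
    # Standard rectified-stereo depth: Z = fx * b / disparity.
    Z = fx * baseline / np.clip(disparity, 1e-6, None)
    X = (uv_left[:, 0] - cx) * Z / fx
    Y = (uv_left[:, 1] - cy) * Z / fy
    return np.stack([X, Y, Z], axis=-1)  # (T, 3) trajectory in the left-camera frame
```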

Experimental results

1. Examples from the proposed LSFDriving dataset

Figure 3 shows examples from the proposed LSFDriving dataset for its three categories (background, vehicle, and pedestrian).


Figure 3

2. Estimation results of the proposed SceneTracker

Figure 4 shows an example of SceneTracker's estimation results on the LSFOdyssey test set. We show 12 point-cloud frames sampled at equal intervals from a 40-frame video; the trajectories estimated by our method are drawn in blue on the corresponding point clouds. As can be seen in Figure 4, our method consistently outputs smooth, continuous, and accurate estimates in the face of complex camera motion and dynamic objects in the scene.


Figure 4

3. Qualitative comparison with SF and TAP methods

Figure 5 shows qualitative results of our method against the scene flow (SF) baseline and the tracking-any-point (TAP) baseline on the LSFOdyssey test set. We visualize the predicted and ground-truth trajectories up to the last frame, colored with the JET colormap. Solid-line boxes mark regions where the SF baseline goes significantly wrong due to occlusion or points moving out of bounds. As can be seen in Figure 5, our method estimates 3D trajectories with centimeter-level accuracy compared with the other methods.


Figure 5

4. Quantitative comparison with SF and TAP methods

Table 1 shows the quantitative results on the 3D metrics of the LSFOdyssey test set. All methods are trained on the Odyssey training data. As can be seen from Table 1, our method significantly outperforms the other methods on all metrics.


Table 1

5. Performance on the real-world dataset LSFDriving

Table 2 shows the evaluation results of our method on LSFDriving under different inference modes. As can be seen from Table 2, our method achieves commendable estimation performance on real-world scenes despite being trained solely on synthetic data.


Table 2
