
IJCV 2024 | National University of Defense Technology proposes SplatFlow, a multi-frame optical flow framework

Author: 3D Vision Workshop

1. Introduction

The multi-frame setup has the potential to alleviate the challenging occlusion problem in optical flow estimation (OFE). Unfortunately, multi-frame OFE (MOFE) remains under-explored. In this work, we propose a new MOFE method, SplatFlow, which introduces a splatting-based transformation to align the motion features of the previous frame and designs a Final-to-All embedding method to feed the aligned motion features into the current frame's estimation, thereby remodeling existing two-frame backbone networks. Extensive experiments show that SplatFlow achieves state-of-the-art results on both the KITTI 2015 and Sintel benchmarks, significantly outperforming all published methods. This work has been published in the International Journal of Computer Vision (IJCV), a top journal in computer vision.

2. Paper Information

Title: SplatFlow: Learning Multi-frame Optical Flow via Splatting

Authors: Bo Wang, Yifan Zhang, Jian Li, Yang Yu, Zhenping Sun, Li Liu, Dewen Hu

Institution: National University of Defense Technology

Original link: https://arxiv.org/pdf/2306.08887

Code link: https://github.com/wwsource/SplatFlow

3. Method

Our method is a multi-frame approach designed for single-resolution iterative backbone networks such as RAFT and GMA. We take RAFT as an example to illustrate it, as shown in Figure 1.


Figure 1

The bold orange and purple horizontal arrows in Figure 1 denote the two optical flow estimation processes of the original RAFT, from frame t-1 to frame t and from frame t to frame t+1, respectively. Our multi-frame method aggregates motion information from the former process into the latter. It first extracts the motion features after each iteration of the t-1→t process. A splatting-based alignment method is then used to obtain motion features aligned with the coordinate system of frame t. Finally, a Final-to-All embedding method feeds the aligned motion features into the t→t+1 process.
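To make this data flow concrete, here is a minimal structural sketch (not the authors' released code); `two_frame_pass` and `splat_align` are hypothetical placeholders for the backbone's iterative estimation and the alignment detailed in the next paragraphs.

```python
# Data-flow sketch only (not the released SplatFlow code). `two_frame_pass` and
# `splat_align` are hypothetical stand-ins for the backbone's iterative
# estimation and the splatting-based alignment, passed in as arguments.
def multi_frame_flow(two_frame_pass, splat_align, img_prev, img_cur, img_next, iters=12):
    # Process t-1 -> t: keep the motion features and coarse-resolution flow
    # produced after each iteration.
    flows_prev, motion_feats = two_frame_pass(img_prev, img_cur, iters, motion_prior=None)
    # Align each iteration's motion features to the coordinate system of frame t
    # by splatting them with the corresponding coarse flow.
    aligned = [splat_align(f, fl) for f, fl in zip(motion_feats, flows_prev)]
    # Process t -> t+1: embed the aligned features from the final iteration into
    # every update of the current estimation (Final-to-All).
    flows_cur, _ = two_frame_pass(img_cur, img_next, iters, motion_prior=aligned[-1])
    return flows_cur[-1]
```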

As shown by the motion feature encoder network in Figure 1, we extract motion features from the two-frame backbone RAFT. Specifically, the network jointly encodes the correlation features and the coarse-resolution optical flow of the n-th iteration of the t-1→t process to obtain the motion features of that iteration.
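As a rough illustration, such an encoder can be a small convolutional network that fuses the two inputs; the layer choices and channel sizes below are assumptions for the sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MotionFeatureEncoder(nn.Module):
    """Fuses correlation features and coarse flow of one iteration into a motion feature."""
    def __init__(self, corr_dim=324, feat_dim=128):   # 324 = RAFT's 4 levels x 9x9 lookup window
        super().__init__()
        self.enc_corr = nn.Sequential(nn.Conv2d(corr_dim, 192, 1), nn.ReLU(inplace=True))
        self.enc_flow = nn.Sequential(nn.Conv2d(2, 64, 7, padding=3), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(192 + 64, feat_dim, 3, padding=1)

    def forward(self, corr, flow):
        # corr: (B, corr_dim, H/8, W/8), flow: (B, 2, H/8, W/8), both from iteration n
        return self.fuse(torch.cat([self.enc_corr(corr), self.enc_flow(flow)], dim=1))

# e.g. one iteration's outputs at 1/8 resolution
motion_feat = MotionFeatureEncoder()(torch.randn(1, 324, 46, 62), torch.randn(1, 2, 46, 62))
```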

As shown by the splatting-based aggregator network in Figure 1, we use it to implement the proposed splatting-based motion feature alignment. After the motion features of each iteration are extracted, the coarse-resolution optical flow of the same iteration is used to splat them unidirectionally into the coordinate system of frame t, yielding the aligned motion features. This realizes differentiable, sub-pixel-accurate filling of the motion features.
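The forward-warping step can be pictured as a differentiable bilinear summation splat. The snippet below is a self-contained sketch of that operation (the exact splatting variant and normalization used in the paper may differ):

```python
import torch

def splat_forward(feat, flow):
    """Forward-splat feat (B,C,H,W) from frame t-1 into frame t using flow (B,2,H,W)."""
    B, C, H, W = feat.shape
    gy, gx = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing="ij")
    x = gx[None].float() + flow[:, 0]                  # target x of each source pixel
    y = gy[None].float() + flow[:, 1]                  # target y of each source pixel
    x0, y0 = x.floor().long(), y.floor().long()
    out = feat.new_zeros(B, C, H, W)
    for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]:    # four bilinear neighbours
        xi, yi = x0 + dx, y0 + dy
        w = (1 - (x - xi).abs()).clamp(min=0) * (1 - (y - yi).abs()).clamp(min=0)
        valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
        idx = (yi.clamp(0, H - 1) * W + xi.clamp(0, W - 1)).view(B, 1, -1).expand(-1, C, -1)
        src = (feat * (w * valid)[:, None]).view(B, C, -1)
        out.view(B, C, -1).scatter_add_(2, idx, src)   # accumulate into frame t's grid
    return out
```

Because the bilinear weights depend on the flow, gradients propagate to both the features and the flow, which is what makes the alignment trainable end to end.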

As shown by the Final-to-All embedder network in Figure 1, the aligned motion features obtained from the last iteration are fed into the t→t+1 process, providing an effective motion prior for every update of the optical flow starting from frame t.
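A minimal way to picture the Final-to-All embedding, assuming it simply concatenates the aligned prior with the usual update inputs at every iteration (the real update block is a ConvGRU with more structure), is:

```python
import torch
import torch.nn as nn

class FinalToAllUpdate(nn.Module):
    """Stand-in for an iterative update block that also sees the aligned motion prior."""
    def __init__(self, hidden=128, context=128, motion=128, prior=128):
        super().__init__()
        self.mix = nn.Conv2d(hidden + context + motion + prior, hidden, 3, padding=1)
        self.flow_head = nn.Conv2d(hidden, 2, 3, padding=1)

    def forward(self, h, ctx, motion_feat, aligned_prior):
        h = torch.tanh(self.mix(torch.cat([h, ctx, motion_feat, aligned_prior], dim=1)))
        return h, self.flow_head(h)                      # new hidden state, flow increment

update = FinalToAllUpdate()
h = torch.zeros(1, 128, 46, 62)
ctx = torch.randn(1, 128, 46, 62)
aligned_prior = torch.randn(1, 128, 46, 62)              # from the last iteration of t-1 -> t
flow = torch.zeros(1, 2, 46, 62)
for _ in range(12):                                       # the same prior enters every update
    motion_feat = torch.randn(1, 128, 46, 62)             # placeholder per-iteration motion input
    h, dflow = update(h, ctx, motion_feat, aligned_prior)
    flow = flow + dflow
```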

4. Experimental Results

We first look at the impact of the multi-frame setup on occlusion. Table 1 reports results in three types of regions (non-occluded, occluded, and all) on the Things val and Sintel train Clean datasets after the C+T training process, and on the Sintel train and Sintel test datasets after the S-finetune training process, for the "SplatFlow-RAFT" and "SplatFlow-GMA" baselines, their two-frame backbones RAFT and GMA, and the relative performance gains. After every training process, our method achieves significant improvements in all three regions on all datasets, with the largest gains in occluded regions, indicating that the network benefits from the multi-frame setup in every region, and especially in occluded ones.
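For reference, the region-wise metric behind this comparison can be sketched as end-point error averaged separately over occluded and non-occluded pixels given a ground-truth occlusion mask (a generic helper, not the benchmarks' official evaluation code):

```python
import torch

def region_epe(flow_pred, flow_gt, occ_mask):
    """EPE over all / occluded / non-occluded pixels; occ_mask: (B,H,W), 1 = occluded."""
    epe = torch.norm(flow_pred - flow_gt, dim=1)        # per-pixel end-point error, (B,H,W)
    occ = occ_mask.bool()
    return {
        "all": epe.mean().item(),
        "occ": epe[occ].mean().item() if occ.any() else float("nan"),
        "noc": epe[~occ].mean().item() if (~occ).any() else float("nan"),
    }
```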


Table 1

Figure 2 shows qualitative results of our method and GMA on the Sintel Clean dataset after the S-finetune training process and on the KITTI test dataset after the K-finetune training process. Regions marked with solid boxes are clearly occluded in frame t+1, while regions marked with dashed boxes are not occluded but are difficult to estimate. The enlarged boxes show that our method produces finer results in non-occluded regions and more plausible results in occluded regions, avoiding large-area estimation failures. Meanwhile, the evaluation values reported by the Sintel benchmark in Figure 2 (a)-(c) show that our method surpasses GMA in all three regions, consistent with the conclusions drawn from Table 1.


Figure 2

We evaluated our method on the public Sintel and KITTI benchmarks and compared the results with previous work, as shown in Table 2. After the S-finetune training process (part 2 of Table 2), our method ranks first on both the Sintel test Clean and Sintel test Final datasets, with EPEs of 1.12 and 2.07, respectively. Compared with the previous best method GMA, the errors are reduced by 19.4% and 16.2%, respectively. After the K-finetune training process (part 3 of Table 2), our method ranks first among all optical-flow-based methods on the KITTI test dataset. These results show that our method achieves new state-of-the-art performance on both public benchmarks, demonstrating its effectiveness and superiority.

Table 2

This article is for academic sharing only. If there is any infringement, please contact us to have it deleted.

Computer Vision Technology Exchange Group

We have established multiple communities covering 3D vision and related directions, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, etc. The subdivided directions include:

2D computer vision: image classification/segmentation, object detection, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, behavior recognition, model quantization/pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, array cameras, photometric stereo, etc.

SLAM: visual SLAM, LiDAR SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, LiDAR, camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, BEV perception, Occupancy, object tracking, end-to-end autonomous driving, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition, there are also exchange groups for job hunting, hardware selection, vision product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with a note of research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Technology Planet

3DGS, NeRF, structured light, phase deflectometry, robotic arm grasping, hands-on point cloud processing, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision C++, 3D vision Python, dToF, camera calibration, ROS2, robot control and planning, LeGo-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor and outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, colmap, line and surface structured light, hardware structured-light scanners, UAVs, etc.