
How to solve the scale ambiguity problem in feature matching?

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant: dddvision, with a note of direction + school/company + nickname, to be added to the group. Industry subdivisions are listed at the end of the article.

Whether you prefer inches or centimeters, we measure and understand the world in metric units. Unfortunately, when we project the world onto the image plane, this metric scale is lost. Scale ambiguity is one of the aspects that make computer vision, and the applications built on it, difficult. Imagine an augmented reality scenario where two people view the same scene through their phones. Say we want to insert correctly scaled virtual content, such as a virtual person, into both views. To do this in a believable way, we need to recover the relative pose between the two cameras, and we need it at metric scale.
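To make the ambiguity concrete, here is a minimal numpy sketch (focal length, principal point, and 3D points are made-up values for illustration) showing that scaling the whole scene leaves the pinhole projection unchanged, so images alone cannot reveal metric scale:

```python
import numpy as np

def project(points_cam, f=500.0, c=320.0):
    """Pinhole projection of Nx3 camera-space points to pixel coordinates."""
    return f * points_cam[:, :2] / points_cam[:, 2:3] + c

points = np.array([[0.5, 0.2, 2.0], [-0.3, 0.1, 3.0]])
for s in (1.0, 10.0):            # scale the whole scene by s
    print(project(s * points))   # identical pixels for every s: scale is lost
```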

Estimating the relative pose between two images is a long-standing problem in computer vision. Feature-matching-based solutions deliver excellent quality even under adverse conditions such as wide baselines or seasonal changes. However, their geometric reasoning operates purely in the 2D image plane, so the distance between the cameras remains unknown: the pose is recovered only up to scale.
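The following self-contained OpenCV sketch (synthetic data; all values are illustrative) demonstrates this limitation: the translation recovered from an essential matrix always comes back with unit norm, no matter the true baseline.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(100, 3))  # 3D scene
R_gt = cv2.Rodrigues(np.array([0.0, 0.1, 0.0]))[0]  # small ground-truth rotation
t_gt = np.array([[2.0], [0.0], [0.0]])              # a 2 m baseline

def project(P, R, t):
    """Project Nx3 points into a camera with pose (R, t)."""
    x = (K @ (R @ P.T + t)).T
    return x[:, :2] / x[:, 2:3]

pts1 = project(X, np.eye(3), np.zeros((3, 1)))
pts2 = project(X, R_gt, t_gt)
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
print(np.linalg.norm(t))  # 1.0: the 2 m baseline cannot be recovered from 2D-2D
```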

In some cases, we can recover scene scale with the help of specialized hardware. Modern phones are equipped with IMU sensors, but these require the user to move. Some phones carry LiDAR sensors that measure depth directly, but these sensors are limited in range and available only on a small number of high-end devices.

The recently proposed "map-free relocalization" task provides two images and their intrinsics, but no further measurements. So far, the best solutions for recovering a metric relative pose combine 2D feature matching with a separate depth estimation network to lift the correspondences into 3D metric space. However, there are two problems. First, the feature detector and the depth estimator are separate components that operate independently. Feature detectors often fire at corners and depth discontinuities, which is exactly where depth estimators struggle. Second, learning a good metric depth estimator usually requires strong supervision with ground-truth depth, which depends on the data domain. For example, for images recorded by pedestrians with mobile phones, very few metric depth measurements are available.

We propose Metric Keypoints (MicKey), a feature detection pipeline that addresses both problems. First, MicKey regresses keypoint positions in camera space, which lets us establish metric correspondences through descriptor matching. From metric correspondences, we can recover the metric relative pose. Second, by training MicKey end-to-end with differentiable pose optimization, we need only image pairs and their ground-truth relative poses as supervision, and no depth measurements. MicKey implicitly learns the correct keypoint depths, and only for feature areas that are actually found and accurate. Our training procedure is robust to image pairs with unknown visual overlap, so information such as image overlap obtained from structure-from-motion reconstruction is not required. This weak supervision makes MicKey very easy to use and appealing, since it can be trained on a new domain without any additional information.

MicKey ranks at the top of the map-free relocalization benchmark, surpassing very recent, competing methods. MicKey provides reliable, metrically scaled pose estimates even under extreme viewpoint changes, enabled by depth predictions tailored specifically to sparse feature matching.

Let's read about this work together~

Title: Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences

Authors: Axel Barroso-Laguna, Sowmya Munukutla, Victor Adrian Prisacariu, Eric Brachmann

Institutions: Niantic, University of Oxford

Original link: https://arxiv.org/abs/2404.06337

Code link: https://github.com/nianticlabs/mickey

Given two images, we can estimate the relative camera pose between them by establishing correspondences. Typically, these correspondences are 2D-to-2D, and the estimated pose is defined only up to scale. Some applications, such as instant augmented reality, require metrically scaled pose estimates and therefore rely on external depth estimators to recover the scale. We propose MicKey, a keypoint matching pipeline that predicts metric correspondences in 3D camera space. By learning to match 3D coordinates across images, we can infer the metric relative pose without depth measurements. Depth measurements are not needed for training either, nor are scene reconstructions or image overlap information. MicKey is supervised only by image pairs and their relative poses. MicKey achieves state-of-the-art performance on the Map-free Relocalization benchmark while requiring less supervision than competing methods.

MicKey is a neural network that predicts 3D metric keypoint coordinates in camera space from 2D input images. Given two images, MicKey establishes 3D-3D correspondences through descriptor matching and then applies a Kabsch solver to recover the metric relative pose.
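For intuition, here is a minimal, unweighted Kabsch solver in numpy. MicKey's actual solver is a weighted, differentiable variant inside a RANSAC loop, so treat this only as a sketch of the closed-form core; the demo with an identity rotation is likewise illustrative.

```python
import numpy as np

def kabsch(P, Q):
    """Closed-form rigid alignment: find R, t such that Q_i ≈ R @ P_i + t."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP                            # metric translation, scale included
    return R, t

P = np.random.default_rng(1).normal(size=(50, 3))
R, t = kabsch(P, P + np.array([1.0, 2.0, 3.0]))
print(t)  # recovers the full metric translation [1. 2. 3.]
```

Because the correspondences are 3D-3D in metric camera space, the translation retains its scale, unlike the essential-matrix result above.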


Examples of correspondences, scores, and depth maps generated by MicKey. Even under large scale changes or wide baselines, MicKey finds valid correspondences. Note that, owing to our feature encoder, the depth maps have a resolution 14 times smaller than the input image. We follow the depth-map visualization used in DPT, where brighter means closer.

The main contributions are:

1) A neural network, MicKey, which predicts metric 3D keypoints and their descriptors from a single image, enabling metric relative pose estimation between image pairs.

2) An end-to-end training strategy that requires only relative pose supervision, so neither depth measurements nor knowledge of image pair overlap is needed during training.

Training process. MicKey predicts the 3D coordinates of keypoints in camera space. The network also predicts keypoint selection probabilities (the keypoint distribution) and descriptors that induce matching probabilities (the match distribution). Combining these two distributions yields the probability P_{I↔I'} that two keypoints form a correspondence, and we optimize the network so that correct correspondences become more likely. In a differentiable RANSAC loop, we generate multiple relative pose hypotheses ĥ and compute their losses with respect to the ground-truth transformation. We train the correspondence probabilities P_{I↔I'} with gradients generated by REINFORCE. Since our pose solver and loss function are differentiable, backpropagation also provides a direct signal for training the 3D keypoint coordinates.
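As a rough illustration of the REINFORCE part, here is a toy PyTorch sketch with stand-in tensors (not MicKey's actual training code): the expected pose loss is differentiated through the discrete correspondence sampling via the log-probability of what was sampled, with a mean baseline for variance reduction.

```python
import torch

H = 8                                       # number of pose hypotheses
log_p = torch.randn(H, requires_grad=True)  # stand-in: log-prob of the sampled
                                            # correspondences behind each hypothesis
loss_h = torch.rand(H)                      # stand-in: pose loss per hypothesis

baseline = loss_h.mean()                    # baseline for variance reduction
reinforce = ((loss_h - baseline) * log_p).mean()  # REINFORCE surrogate loss
reinforce.backward()                        # gradients for the keypoint/match
print(log_p.grad)                           # distributions; the differentiable
                                            # solver adds a direct gradient path
```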


MicKey architecture. MicKey uses a feature extractor to split the image into patches. For each patch, MicKey computes a 2D offset, a keypoint confidence, a depth value, and a descriptor vector. The 3D keypoint coordinates are obtained from the absolute position of the patch, its 2D offset, and its depth.
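A hedged numpy sketch of this composition (patch grid coordinates, predicted offsets and depths, and the intrinsics K are assumed inputs; the function name and the exact offset convention are illustrative, with the patch size matching the encoder's 14-pixel patches):

```python
import numpy as np

def patch_to_3d(patch_xy, offset, depth, K, patch_size=14):
    """Assemble a 3D camera-space keypoint per patch from the three predictions."""
    uv = (patch_xy + offset) * patch_size      # absolute pixel position of keypoint
    x = (uv[..., 0] - K[0, 2]) / K[0, 0] * depth
    y = (uv[..., 1] - K[1, 2]) / K[1, 1] * depth
    return np.stack([x, y, depth], axis=-1)    # Nx3 metric points in camera space
```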


The map-free dataset contains 460, 65, and 130 scenes for training, validation, and testing. Each training scene consists of two different scans of the scene with absolute poses available. In the validation and test sets, the data is restricted to one reference image and a series of query images. Test ground truth is not available, so all results are evaluated through the map-free benchmark website. We compare MicKey with different feature matching pipelines and relative pose regressors (RPRs). All matching baselines are paired with DPT to recover the metric scale. In addition, we report two versions of MicKey: one that relies on overlap scores and uses the whole batch during training, and one that follows our curriculum learning strategy. For MicKey w/ Overlap, we use the same overlap range (40%-80%) as proposed in the benchmark.

The evaluation on the map-free test set is shown in Table 1. Rather than focusing on relative pose errors, the benchmark measures the suitability of the methods for AR applications: it quantifies their quality via the virtual correspondence reprojection error (VCRE) in the image plane, arguing that this is more relevant to the user experience. Specifically, the benchmark reports the area under the curve (AUC) and precision values (Prec.). The AUC takes the network's confidence into account and thus also evaluates a method's ability to decide whether its estimates should be trusted. Precision measures the percentage of estimates below the threshold (90 pixels). We observe that both variants of MicKey perform well on the VCRE results, in terms of both AUC and precision. We see only a small benefit from training MicKey with overlap-score supervision, and argue that our simple curriculum learning approach achieves the best performance when such data is not available. In addition, we note that training a simple RPR method without overlap scores (RPR w/o Overlap) reduces performance significantly.
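For reference, a simplified VCRE-style check might look as follows. This is only a sketch: the benchmark's exact definition of the virtual points differs, and the poses, points, and intrinsics here are illustrative stand-ins.

```python
import numpy as np

def vcre_ok(R_est, t_est, R_gt, t_gt, K, pts3d, thresh=90.0):
    """Mean reprojection error of virtual 3D points under estimated vs. GT pose."""
    def proj(R, t):
        x = (K @ (R @ pts3d.T + t)).T
        return x[:, :2] / x[:, 2:3]
    err = np.linalg.norm(proj(R_est, t_est) - proj(R_gt, t_gt), axis=1)
    return err.mean() < thresh                 # precision threshold: 90 px
```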


The evaluation on the ScanNet test set is shown in Table 2. We use the same criteria as the map-free benchmark and evaluate VCRE with a threshold of 10% of the image diagonal. In contrast to map-free, the ScanNet test set contains image pairs that are guaranteed to overlap, and the results show that all methods perform well under these conditions. As in the previous experiments, we observe that MicKey does not benefit much from using overlap scores during training. The results thus show that training MicKey with pose supervision alone yields results comparable to fully supervised methods, demonstrating that a state-of-the-art metric relative pose estimator can be trained with as little supervision as relative poses.


The depth evaluation in Table 3 shows that state-of-the-art matchers exhibit their best performance when paired with our depth maps. Even though other depth methods could be trained on the map-free data, it is unclear how well standard photometric losses work across scans, where images may have large baselines, and whether such methods would produce better depth maps for the metric pose estimation task.


Limitations

As shown in Tables 1 and 2, MicKey excels at estimating poses that are good enough for AR applications. For very fine thresholds, other methods may obtain more accurate pose estimates, i.e., smaller translation and rotation errors. Future work could investigate backbone architectures that enable high-resolution feature maps without compromising the expressiveness of our current feature encoder.

Conclusion

We propose MicKey, a neural network that matches 2D images in 3D camera space. Our evaluation shows that MicKey ranks first on the map-free relocalization benchmark with only weak training supervision, and achieves better or comparable results on ScanNet than other state-of-the-art methods trained with full supervision. Thanks to our end-to-end training, MicKey learns to compute correspondences that go beyond low-level pattern matching. Moreover, by interweaving keypoint and depth estimation during training, our depth maps are tailored to the feature matching task, and top-ranked matchers perform better with our depth maps. Our experiments demonstrate that state-of-the-art keypoint and depth regressors can be trained without strong supervision.

Readers interested in more experimental results and details can read the original paper~

This article is for academic sharing only. If there is any infringement, please contact us to have it deleted.

3D Vision Workshop Exchange Group

At present, we have established multiple communities across 3D vision directions, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, etc. The subdivisions include:

2D computer vision: image classification/segmentation, object detection, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, behavior recognition, model quantization/pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, array cameras, photometric stereo vision, etc.

SLAM: visual SLAM, LiDAR SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, LiDAR, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, object tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition to these, there are also exchange groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant: dddvision, with a note of research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), to be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, structured light, phase deflectometry, robotic arm grasping, hands-on point clouds, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision C++, 3D vision Python, dToF, camera calibration, ROS2, robot control and planning, LeGo-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor and outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, colmap, line and area structured light, hardware structured-light scanners, UAVs, etc.