
CVPR'24 Open Source | New SOTA in Visual Re-Localization! New Scenes Need Only Minutes of Fine-Tuning!

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant on WeChat: dddvision, with a note in the form "direction + school/company + nickname", to be added to the group. Industry subgroups are listed at the end of the article.

Today, neural networks have conquered almost every area of computer vision, but there is at least one task they still struggle with: visual re-localization. What is visual re-localization? Given a set of mapping images and their poses in a common coordinate system, build a representation of the scene; then, given a query image, estimate its pose relative to that scene, i.e., its position and orientation. Successful visual re-localization methods predict image-to-scene correspondences via matching or direct regression, and then solve for the pose with traditional, robust algorithms such as PnP and RANSAC. Taking a different perspective, pose-regression-based approaches attempt visual re-localization without a traditional pose solver, instead using a single feed-forward neural network to infer the pose from a single image. The mapping data is treated as a training set, with the camera extrinsics serving as supervision. Broadly, there are two families of pose regression methods, but both still lag behind correspondence-based methods in accuracy.
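To make the correspondence-based baseline concrete, here is a minimal sketch of the classic solve step using OpenCV's solvePnPRansac. The 2D-3D matches and intrinsics below are random placeholders standing in for a matcher's output, not anything from the paper.

```python
import cv2
import numpy as np

# Placeholder 2D-3D correspondences standing in for a matcher's output.
pts_3d = np.random.rand(100, 3).astype(np.float32)          # scene coords (m)
pts_2d = (np.random.rand(100, 2) * 480).astype(np.float32)  # pixel positions
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)           # intrinsics

# RANSAC rejects outlier matches; PnP fits the camera pose to the inliers.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts_3d, pts_2d, K, distCoeffs=None,
    iterationsCount=100, reprojectionError=8.0)

if ok:
    R, _ = cv2.Rodrigues(rvec)              # axis-angle -> rotation matrix
    cam_position = (-R.T @ tvec).ravel()    # camera center in the scene frame
```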

Absolute Pose Regression (APR) trains a dedicated pose regressor for each individual scene, which predicts the camera pose with respect to that particular scene. Although the scene coordinate space can be implicitly encoded in the network's weights, the accuracy of absolute pose regressors is low, mainly because the training data available for each scene is usually limited and generalizes poorly to unseen views. Relative Pose Regression (RPR) is the second family: the regressor is trained to predict the relative pose between two images. In a typical inference scenario, the regressor is applied to a pair consisting of an unseen query and an image from the mapping set; the predicted relative pose is then combined with the known pose of the mapping image to obtain the absolute query pose. These methods can be trained on large amounts of scene-agnostic data, but their accuracy is still limited: the metric pose between two images can only be predicted approximately.
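The chaining step just described is a single pose composition. A minimal numpy sketch, under our own illustrative convention that poses are 4x4 camera-to-world matrices and the regressor outputs the query camera expressed in the mapping camera's frame:

```python
import numpy as np

def to_absolute(pose_map_c2w, rel_query_in_map):
    """Chain a known mapping-image pose with a predicted relative pose.

    Convention of this sketch: 4x4 camera-to-world matrices, with the
    regressor predicting the query camera expressed in the mapping
    camera's frame; the absolute query pose is then one composition.
    """
    return pose_map_c2w @ rel_query_in_map

pose_map = np.eye(4)            # known pose of the retrieved mapping image
rel = np.eye(4)                 # placeholder prediction: 10 cm sideways
rel[:3, 3] = [0.1, 0.0, 0.0]
pose_query = to_absolute(pose_map, rel)   # absolute query pose, scene frame
```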

Motivated by these limitations, this paper proposes a new flavor of absolute pose regression: Map-Relative Pose Regression (marepo). It combines a scene-specific representation, which encodes the scale-metric reference frame of each target scene, with a generic, scene-agnostic absolute pose regression network. Concretely, a fast-to-train scene coordinate regression model serves as the scene representation, and a pose regression network is pre-trained to learn the relationship between scene coordinate predictions and the corresponding camera pose. Because this relationship is generic, the pose regressor can be trained on hundreds of different scenes, which effectively removes the data bottleneck that limits absolute pose regression models. At the same time, because the pose regressor is conditioned on the scene-specific map representation at localization time, it can predict accurate scale-metric poses, unlike relative pose regressors.

Let's read about this work together~

Title: Map-Relative Pose Regression for Visual Re-Localization

Authors: Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann

Institutions: University of Oxford, Niantic

Original link: https://arxiv.org/abs/2404.09884

Code link: https://github.com/nianticlabs/marepo

Official Website: https://nianticlabs.github.io/marepo/


Pose regression networks predict the camera pose of a query image relative to a known environment. Within this category, Absolute Pose Regression (APR) has recently shown promising accuracy, with position errors in the range of a few centimeters. APR networks encode the scene geometry implicitly in their weights. To achieve high accuracy, they require vast amounts of training data which, in practice, can only be created by synthesizing novel views, a process that takes days and must be repeated for every new scene. We propose a new pose regression method, Map-Relative Pose Regression (marepo), that satisfies the data hunger of the pose regression network in a scene-agnostic fashion. We couple the pose regressor with a scene-specific map representation so that its pose predictions are relative to the scene map. This allows us to train the pose regressor across hundreds of scenes to learn the generic relationship between scene-specific map representations and camera poses. Our map-relative pose regressor can be applied to new map representations immediately, or fine-tuned for maximum accuracy in a few minutes. Our method far outperforms previous pose regression methods on two public datasets, indoor and outdoor.

Camera pose estimation performance versus mapping time. Shown are the median translation errors of several pose-regression re-localization methods on the 7-Scenes dataset, together with the time required to train each re-localizer on the target scene (proportional to circle size). marepo achieves excellent performance on both metrics by integrating a scene-specific geometric map prior into an accurate, map-relative pose regression framework.


(1) marepo, a novel absolute pose regression method that combines a generic, scene-agnostic map-relative pose regressor with a scene-specific metric map representation. The network performs end-to-end inference on previously unseen images and directly estimates accurate absolute metric poses, thanks to the strong, unambiguous knowledge of 3D geometry encoded in the scene-specific component.

(2) A Transformer-based network architecture that processes the dense correspondences between 2D positions in the query image and the corresponding 3D coordinates in the reference frame of the previously mapped scene, and estimates the pose of the camera that captured the query. It is further shown how the method's performance improves significantly by applying dynamic positional encoding to the query image, i.e., by encoding the camera intrinsics into the Transformer's input.

Schematic of the marepo network. The scene-specific geometry prediction module G_S processes the query image to predict a scene coordinate map H. The scene-agnostic map-relative pose regressor M then directly regresses the camera pose. Both training and inference rely solely on RGB images I and camera intrinsics K, with no need for depth information or pre-built point clouds.
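The caption maps directly onto a two-stage forward pass. Below is a hedged PyTorch sketch of that structure; the sub-networks are stand-ins, not the released marepo architecture, and only the data flow (I, K → H → P) follows the figure.

```python
import torch
import torch.nn as nn

class MarepoStyleLocalizer(nn.Module):
    """Illustrative two-stage pipeline following the figure: I, K -> H -> P.

    The sub-networks are stand-ins, not the authors' released architecture;
    only the structure (scene-specific G_S feeding a scene-agnostic M)
    mirrors the caption above.
    """
    def __init__(self, g_s, m):
        super().__init__()
        self.g_s = g_s   # scene coordinate regressor, trained per scene (minutes)
        self.m = m       # map-relative pose regressor, pre-trained on many scenes

    def forward(self, image, intrinsics):
        scene_coords = self.g_s(image)            # H: dense 3D coords, (B,3,h,w)
        return self.m(scene_coords, intrinsics)   # P: camera pose, e.g. (B,4,4)

# Toy usage with stand-in modules (checks shapes only, no real learning):
g_s = nn.Conv2d(3, 3, 1)
m = lambda h, k: torch.eye(4).expand(h.shape[0], 4, 4)
pose = MarepoStyleLocalizer(g_s, m)(torch.rand(1, 3, 480, 640),
                                    torch.eye(3).unsqueeze(0))
```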


The map-relative pose regressor M takes the predicted scene coordinate map tensor and the corresponding camera intrinsics as input, embeds this information into high-dimensional features via dynamic positional encoding, and finally estimates the camera pose P.
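The exact form of the dynamic positional encoding is a detail of the paper, but one plausible way to make a positional signal depend on the intrinsics, sketched below as our own assumption rather than the released implementation, is to encode each pixel's viewing ray K⁻¹[u, v, 1]ᵀ instead of its raw grid position, so the encoding adapts to focal length and principal point:

```python
import torch

def ray_positional_encoding(K, h, w):
    """Per-pixel viewing rays as an intrinsics-aware positional signal.

    Assumption of this sketch (not the released code): backprojecting the
    pixel grid through K^-1 gives an encoding that changes with focal
    length and principal point, so one regressor can serve many cameras.
    """
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)]).reshape(3, -1)  # (3, h*w)
    rays = torch.linalg.inv(K) @ pix                 # backproject pixel grid
    rays = rays / rays.norm(dim=0, keepdim=True)     # unit ray directions
    return rays.reshape(3, h, w)                     # (3, h, w) encoding
```

A plain grid-based sine-cosine encoding would be blind to the camera; tying the signal to K is what would let a single pre-trained regressor handle query images from cameras with different intrinsics.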


The method is first evaluated on the Microsoft 7-Scenes dataset, an indoor re-localization dataset that provides up to 7,000 mapping images per scene. Each scene covers a limited volume (between 1 m³ and 18 m³); still, previous APR methods needed tens of hours, or even days, of training to re-localize in it. This is impractical in real settings, since the scene's appearance may change in the meantime, rendering the trained APR obsolete. In contrast, marepo needs only about 5 minutes of training to produce the scene-specific geometry prediction network G_S tuned to the target environment. Comparing marepo against previous pose regression methods in Table 1 shows that, as a partially scene-agnostic method, it not only enjoys the fastest mapping time of all APR-based methods but also improves average performance by about 50% (measured by median error).
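For a sense of what the minutes-long mapping step involves, here is an illustrative per-scene training loop in the style of scene coordinate regression: only the scene-specific head is optimized, supervised by reprojecting its predictions through the known mapping poses (consistent with the RGB-only setup above). All names (make_scene_coordinate_head, mapping_loader) and hyperparameters are hypothetical, not the repository's API.

```python
import torch

def pixel_grid(h, w, b):
    # Pixel coordinates, shape (B, 2, h*w), aligned with the projection below.
    v, u = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing="ij")
    return torch.stack([u, v]).reshape(2, -1).expand(b, 2, -1)

g_s = make_scene_coordinate_head()   # hypothetical scene-specific regressor
optim = torch.optim.AdamW(g_s.parameters(), lr=3e-3)

for image, pose_w2c, K in mapping_loader:   # hypothetical mapping-set loader
    coords = g_s(image)                      # predicted scene coords (B,3,h,w)
    b, _, h, w = coords.shape
    pts = coords.reshape(b, 3, -1)
    cam = pose_w2c[:, :3, :3] @ pts + pose_w2c[:, :3, 3:4]   # world -> camera
    proj = K @ cam
    uv = proj[:, :2] / proj[:, 2:].clamp(min=0.1)   # pinhole projection
    loss = (uv - pixel_grid(h, w, b)).abs().mean()  # reprojection error
    optim.zero_grad()
    loss.backward()
    optim.step()
```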


Further evaluation uses the Wayspots dataset, which contains challenging outdoor scenes that even current geometry-based methods struggle with. The dataset comprises scans of 10 different areas, with ground-truth poses provided by a visual-inertial odometry system. Table 2 compares the proposed marepo (and marepo_S, a variant fine-tuned on each scene's mapping frames) against two APR-based methods, as well as two scene coordinate regression methods: DSAC* and ACE, the current state of the art on Wayspots. marepo clearly outperforms previous APR-based methods such as PoseNet and MS-Transformer, which require hours of training on average, and it holds up well even against geometry-based approaches. For the first time, an end-to-end image-to-pose regression approach that exploits geometric priors reaches a level of performance similar to methods that must deploy a (slower) robust solver to estimate the camera pose from a set of potentially noisy 2D-3D correspondences. More specifically, marepo needs only five minutes to train the scene-specific coordinate regressor G_S that encodes the geometry of the location of interest, plus (optionally) about one minute to fine-tune the map-relative pose regressor M (Wayspots scans contain significantly fewer frames than the 7-Scenes scenes described above). At inference time, marepo (or its fine-tuned variant) runs at ≈56 frames per second, making it not only accurate but also extremely efficient compared with other methods.


marepo is a novel method that combines the strengths of a scene-agnostic pose regression network with the strong geometric prior provided by a quickly trained, scene-specific metric representation. It addresses the limitations of previous APR techniques, offering both scalability and the accuracy to predict precise scale-metric poses across diverse scenes. The authors demonstrate marepo's superior accuracy relative to existing APR methods on two datasets, as well as its ability to adapt to new scenes within minutes. In addition, they show how the Transformer-based architecture, combined with dynamic positional encoding, ensures robustness to varying camera intrinsics, establishing marepo as a versatile and efficient solution for regression-based visual re-localization.

Readers interested in more experimental results and details can refer to the original paper~

This article is shared for academic purposes only. If there is any infringement, please contact us to have it removed.

3D Vision Workshop Exchange Group

We have established multiple communities covering 3D vision topics, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D computer vision: image classification/segmentation, object detection/tracking, medical imaging, GAN, OCR, 2D defect detection, remote sensing and mapping, super-resolution, face detection, action recognition, model quantization/pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point clouds, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, camera arrays, photometric stereo, etc.

SLAM: visual SLAM, LiDAR SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, LiDAR, camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane detection, Occupancy, object tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition, there are exchange groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with a note in the form "research direction + school/company + nickname" (e.g., 3D point cloud + Tsinghua + Little Strawberry), to be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, structured light, phase deflectometry, robotic arm grasping, hands-on point clouds, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision with C++, 3D vision with Python, dToF, camera calibration, ROS2, robot control and planning, LeGO-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor/outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, colmap, line/surface structured light, hardware structured-light scanners, UAVs, etc.
