SPVLoc: From panorama to perspective, 6D pose estimation in unknown environments

Source: 3D Vision Workshop


This article introduces SPVLoc, a method for 6D camera localization in indoor environments, i.e., accurately determining the position and orientation of a camera indoors. The method relies on a simple, semantically annotated, untextured 3D scene model and on a novel image matching scheme that bridges two gaps at once: perspective image to panoramic image, and RGB image to semantic image. Because matching and retrieval remain efficient and scalable under sparse reference sampling, the method improves both localization accuracy and inference speed over existing methods, and the use of a 3D model reduces ambiguity when estimating 6D poses. The article also discusses future work that combines localization with image analysis to enrich digital building models or to support augmented reality applications.

Let's read about this work together~

Paper title: SPVLoc: Semantic Panoramic Viewport Matching for 6D Camera Localization in Unseen Environments

Authors: Niklas Gard et al.

Affiliations: Fraunhofer Heinrich Hertz Institute et al.

Link to paper: https://arxiv.org/pdf/2404.10527.pdf

This paper introduces SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image while requiring minimal scene-specific prior knowledge and no scene-specific training. The method employs a novel matching procedure to locate the viewport of a perspective camera, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. These panoramas are rendered from an untextured 3D reference model that contains only approximate structural information about the room shapes, together with annotations for doors and windows. The authors demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama matching and, ultimately, image-to-model matching. Using a viewport classification score, the reference panoramas are ranked and the best match for the query image is selected. Then, the 6D relative pose between the selected panorama and the query image is estimated. The experiments show that the method not only bridges the domain gap effectively, but also generalizes well to previously unseen scenes that are not part of the training data. In addition, it achieves higher localization accuracy than state-of-the-art methods while estimating more degrees of freedom of the camera pose. The source code is to be made publicly available at: https://github.com/fraunhoferhhi/spvloc.

ZInD data preparation. The annotations are used to generate a 3D reference model (left), while resampled images provide the perspective training and test images (right).


Qualitative localization results, from top to bottom: query image, rendering at the top-1 estimated pose, panorama with the estimated viewport, and map. Green box: the top-1 match is successful. Yellow box: the top-2 match is successful. Red box: failure case.

Reference locations for 0.7m, 1.5m global grids, and 1.5m local grids (from left to right).

A model-based 6D camera pose estimation system is introduced for unknown indoor environments, which does not require training on specific scenes.
A novel perspective-to-panorama image matching concept is proposed, which achieves high retrieval accuracy even with wide camera baselines.
Compared to state-of-the-art methods, our method exhibits higher positioning accuracy while estimating more degrees of freedom.

This article introduces a method called SPVLoc for 6D localization of 2D RGB images indoors. The basic principle is to use a semantic, textureless 3D scene model, estimate the viewport of the image through cross-domain image-to-panorama matching, and then determine the pose of the image relative to the best-matching reference panorama by relative 6D pose regression. The method consists of the following steps and key points:

Semantic Panoramic Viewport Matching: The indoor localization problem is reformulated as a cross-domain image-to-panorama matching problem. Semantic reference panoramas are rendered from the model, and the network locates the viewport of the perspective camera within each panorama. The viewport is described by a viewport mask and a bounding box, both of which are predicted by the network.
Feature-Correlation-based Pose Regression: The query image features are correlated with the panorama features, which encodes where the image's viewport lies in the panorama. This information is used to estimate the relative pose offset of the camera, i.e., the precise pose of the image relative to the panorama.
Optimization: During training, multi-task learning is used to balance the weights of the different loss functions and thereby improve the accuracy and robustness of the model. The loss covers both the pose offset and the viewport predictions.
Inference: In the inference phase, the panorama positions are defined by a fixed 2D grid superimposed on the floor plan, and the reference location with the highest classification score is selected. The absolute pose is then determined from the output of the pose head, and its accuracy is improved further by rendering a new reference panorama.
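The retrieval part of the inference step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` container, its field names, and the assumption that the predicted translation offset is expressed along the axes of the model coordinate system are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Candidate:
    """One reference panorama with the (hypothetical) network outputs for a query."""
    position: Vec3  # reference panorama location in the model frame, in metres
    score: float    # viewport classification score for this panorama
    offset: Vec3    # predicted translation offset of the query relative to the panorama


def select_top_k(candidates: List[Candidate], k: int = 1) -> List[Candidate]:
    """Rank the reference panoramas by classification score, best first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:k]


def absolute_position(candidate: Candidate) -> Vec3:
    """Compose the absolute camera position: reference location plus predicted offset.

    Assumes the offset is given in the same axis-aligned frame as the position.
    """
    return tuple(p + o for p, o in zip(candidate.position, candidate.offset))
```

The sketch covers only ranking and composing the translation. The rotation part of the pose and the refinement step that renders a new reference panorama are left out.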

The paper introduces a 6D camera localization method for indoor environments that combines panoramic images with semantic 3D models to achieve high-precision localization in unseen scenes.

Data set:

Two public datasets are used: Structured3D (S3D) and Zillow Indoor (ZInD).

S3D contains 3,500 near-photorealistic models of indoor environments, each with ground-truth 3D structure annotations, and a total of 21,835 panoramic images.

ZInD contains 67,448 panoramic images taken in 1,575 unfurnished residences, all globally aligned and registered onto a single floor plan.

Data preprocessing:

Before training, all data is converted into a uniform format.

Training Details:

Training uses a variable field of view: the angle of view of each query image is sampled randomly between 45 and 135 degrees.
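To make the effect of a variable angle of view concrete, the following sketch samples an angle in the stated range and converts it to a pinhole focal length in pixels. The function names are hypothetical, and treating the angle as the horizontal field of view is an assumption rather than something stated in the source.

```python
import math
import random


def sample_fov_deg(rng: random.Random, lo: float = 45.0, hi: float = 135.0) -> float:
    """Draw an angle of view uniformly from [lo, hi] degrees."""
    return rng.uniform(lo, hi)


def focal_length_px(fov_deg: float, image_width_px: int) -> float:
    """Pinhole focal length in pixels for a given horizontal field of view.

    Assumption: the angle of view is measured horizontally across the image.
    """
    return (image_width_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
```

A wider angle of view yields a shorter focal length, so a network trained this way sees the same scene at many different apparent zoom levels.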

For each query, panoramas are rendered at s random locations, sampled uniformly around the query position within ±R1 in the xy plane and ±R2 in the vertical direction.

In addition, a random negative panorama is generated in a different room to improve the network's ability to distinguish subtle differences between rooms.

Query images were sampled with a random yaw angle and with random pitch and roll angles within ±10°.

The batch size was set to 40, i.e., each batch contains 40 query images and 200 panoramic images, and training ran on a single NVIDIA A100 GPU.

In the loss calculation, query images with fewer than three semantic categories are ignored.
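A filter of this kind could look like the sketch below. It assumes that each query comes with a semantic label map (a 2D grid of integer class IDs) and that the categories are counted in that map. Both the data layout and the function name are hypothetical.

```python
from typing import List, Sequence


def has_enough_classes(semantic_map: Sequence[Sequence[int]], min_classes: int = 3) -> bool:
    """Return True if the label map contains at least `min_classes` distinct class IDs."""
    labels = {label for row in semantic_map for label in row}
    return len(labels) >= min_classes


def loss_mask(batch_maps: List[Sequence[Sequence[int]]]) -> List[bool]:
    """Per-query mask: False means the query is skipped when the loss is computed."""
    return [has_enough_classes(m) for m in batch_maps]
```

Views that show, for example, only a blank wall carry little layout information, which is a plausible reason for excluding them from the loss.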

Training ran for approximately 42,000 steps with an initial learning rate of 2.5×10^-4, which was halved twice during training.
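A step-wise schedule of this kind can be written as a small function. The source does not say at which steps the learning rate is halved, so the milestones below (20,000 and 30,000) are purely hypothetical placeholders.

```python
from typing import Sequence


def learning_rate(
    step: int,
    base_lr: float = 2.5e-4,
    milestones: Sequence[int] = (20_000, 30_000),  # hypothetical, not given in the source
    factor: float = 0.5,
) -> float:
    """Learning rate at a given training step, halved once per milestone passed."""
    drops = sum(step >= m for m in milestones)
    return base_lr * factor ** drops
```

With two milestones, the rate ends at a quarter of its initial value, i.e., 6.25×10^-5.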

Test Details:

During testing, the reference panoramas were sampled on a grid with a spacing of 1.2×1.2 meters.
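The sketch below shows one way to place reference locations on a regular grid over a floor plan. It assumes the floor plan is given as a simple 2D polygon and that grid points falling outside it are discarded. The cell-centred layout and the function names are illustrative assumptions, not details from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Even-odd ray casting test for a point against a simple polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def reference_grid(polygon: Sequence[Point], spacing: float = 1.2) -> List[Point]:
    """Cell-centred grid points inside the floor-plan polygon, `spacing` metres apart."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    y = min(ys) + spacing / 2.0
    while y < max(ys):
        x = min(xs) + spacing / 2.0
        while x < max(xs):
            if point_in_polygon(x, y, polygon):
                points.append((x, y))
            x += spacing
        y += spacing
    return points
```

A coarser grid means fewer panoramas to render and compare per query, which is why sparse reference sampling matters for inference speed.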

To evaluate the accuracy of the 2D positioning, the 3D rotation and translation errors were reported.
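The source does not spell out how these errors are computed. A common choice, assumed here, is the Euclidean distance between estimated and ground-truth camera positions and the geodesic angle between the two rotation matrices. The function names are hypothetical.

```python
import math
from typing import Sequence

Matrix3 = Sequence[Sequence[float]]


def translation_error(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between estimated and ground-truth positions (metres)."""
    return math.dist(t_est, t_gt)


def rotation_error_deg(r_est: Matrix3, r_gt: Matrix3) -> float:
    """Geodesic angle between two 3x3 rotation matrices, in degrees.

    Uses trace(R_est^T R_gt) = sum_ij R_est[i][j] * R_gt[i][j].
    """
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))


def yaw_matrix(yaw_deg: float) -> Matrix3:
    """Rotation about the vertical (z) axis, used here only to build test inputs."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

For example, comparing `yaw_matrix(0)` with `yaw_matrix(90)` gives a rotation error of about 90 degrees.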

Comparison with the state of the art:

Compared to the LASER method, SPVLoc showed higher localization accuracy and recall.

The LASER method estimates only two translational and one rotational degree of freedom, while the SPVLoc method estimates the full 6D pose.

Ablation Studies: Removing individual components, such as the viewport supervision and the viewport segmentation task head, degrades network performance.

Removing the negative samples from different rooms significantly reduces localization accuracy.

Replacing the image encoder EfficientNet-S with the smaller ResNet-18 results in degraded performance.

Replacing all convolutional layers of the panoramic encoder with Equiconv does not result in a performance gain.

Adding an additional panorama input modality slightly improves the results.

Performance Studies:

Using a local grid instead of a global grid reduces the risk of missing a room altogether and improves performance at 10cm recall.
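Recall at a distance threshold, such as the 10 cm recall mentioned above, is typically the fraction of queries whose localization error stays within that threshold. The sketch below assumes exactly this definition. The paper's metric may additionally involve a rotation threshold, which is not modeled here.

```python
from typing import Sequence


def recall_at(errors_m: Sequence[float], threshold_m: float) -> float:
    """Fraction of queries whose translation error is at most `threshold_m` metres."""
    if not errors_m:
        raise ValueError("at least one error value is required")
    return sum(e <= threshold_m for e in errors_m) / len(errors_m)
```

Calling `recall_at(errors, 0.1)` on a list of per-query translation errors in metres would then give the 10 cm recall under this assumed definition.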

Networks trained with known camera focal lengths performed slightly better at matching images, but lost precision when testing images at different focal lengths.

The network is able to process test images with different pitch and roll angles, demonstrating robust estimation capabilities.

Limitations:

In buildings with many repetitive room layouts, the effectiveness of the approach may be limited by the level of detail of the semantic reference model.

The paper introduces a scene-independent, model-based 6D localization method for indoor scenes, built on a novel cross-modal image matching scheme (panoramic image to perspective image, RGB to semantics). Matching and retrieval are efficient and scalable under sparse reference sampling. Localization accuracy and inference speed are superior to those of existing methods, and the inclusion of the 3D model reduces ambiguity when estimating the 6D pose. Future work involves combining localization with image analysis to enrich digital building models or to explore applications in augmented reality scenarios.
