3.9k stars! Dense 3D scene reconstruction from just 2 images, with no camera parameters needed!

author:3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant on WeChat: dddvision, with a note in the form "direction + school/company + nickname", and you will be added to the group. Industry subgroups are listed at the end of the article.

Unconstrained dense 3D reconstruction from multiple views is one of the long-standing goals of computer vision research. In a nutshell, the task aims to estimate the 3D geometry and camera parameters of a scene, given a set of photographs of that scene.

Overall, the modern structure-from-motion (SfM) and multi-view stereo (MVS) pipeline boils down to solving a series of minimal problems: matching points, finding the essential matrix, triangulating points, sparsely reconstructing the scene, estimating the cameras, and finally performing dense reconstruction. None of these sub-problems is solved perfectly, so each step adds noise to the next, increasing the complexity and engineering effort required for the whole pipeline.

In this article, the authors propose DUSt3R, a radically new approach to dense, unconstrained stereo 3D reconstruction from uncalibrated and unposed cameras. Its main component is a network that regresses a dense and accurate representation of the scene from just a pair of images, without any prior information about the scene or the cameras (not even the intrinsics). The resulting scene representation is based on 3D pointmaps with rich properties: they simultaneously encapsulate (a) the scene geometry, (b) the relationship between pixels and scene points, and (c) the relationship between the two viewpoints. From this output alone, virtually all scene parameters (i.e., cameras and scene geometry) can be extracted directly. This is possible because the network jointly processes the input images and the resulting 3D pointmaps, thus learning to associate 2D structures with 3D shapes and getting the chance to solve several minimal problems at once, enabling internal "collaboration" between them.
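
To make the pointmap idea concrete, here is a minimal sketch (not the authors' code) of how two of the quantities mentioned above can be read directly off a pointmap X of shape (H, W, 3) expressed in the camera frame of the first image: the depth map is simply the z-coordinate, and a single focal length can be fitted by least squares under the assumption of a pinhole camera with the principal point at the image center (the paper uses a robust iterative variant of this fit). Function names are illustrative.

```python
# A minimal sketch of reading scene parameters off a pointmap X of shape (H, W, 3)
# expressed in the camera frame of view 1, assuming a pinhole camera whose
# principal point sits at the image center.
import numpy as np

def depth_from_pointmap(X):
    # The per-pixel depth is simply the z-coordinate of the predicted 3D point.
    return X[..., 2]

def focal_from_pointmap(X, eps=1e-6):
    """Estimate one focal length f such that (u - cx, v - cy) ~= f * (x/z, y/z)."""
    H, W, _ = X.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    cx, cy = (W - 1) / 2.0, (H - 1) / 2.0
    z = np.clip(X[..., 2], eps, None)
    # Normalized camera-plane coordinates implied by the pointmap.
    xz = X[..., 0] / z
    yz = X[..., 1] / z
    # Plain least-squares fit of a single scalar f over both axes; a robust
    # (outlier-down-weighting) fit would be used in practice.
    num = (u - cx) * xz + (v - cy) * yz
    den = xz ** 2 + yz ** 2
    return float(num.sum() / (den.sum() + eps))
```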

Let's read about this work together~

Title: DUSt3R: Geometric 3D Vision Made Easy

Authors: Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, Jérôme Revaud

Institutions: Aalto University, Naver Labs Europe

Original link: http://arxiv.org/abs/2312.14132

Code link: https://github.com/naver/dust3r

Official Website: https://dust3r.europe.naverlabs.com/

Multi-view stereo reconstruction (MVS) in the wild first requires estimating camera parameters, such as intrinsics and extrinsics. These parameters are usually tedious and cumbersome to obtain, yet they are necessary to triangulate corresponding pixels in 3D space, which is at the heart of all best-performing MVS algorithms. In this work, we take the opposite stance and introduce DUSt3R, a radically novel paradigm for dense and unconstrained stereo 3D reconstruction of arbitrary image collections, i.e. operating without prior information about camera calibration or viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of the usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. When more than two images are provided, we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. Our network architecture is based on standard Transformer encoders and decoders, allowing us to leverage powerful pretrained models. Our representation directly provides a 3D model of the scene as well as depth information, and, interestingly, we can also seamlessly recover pixel matches and relative and absolute cameras from it. Detailed experiments on all these tasks show that the proposed DUSt3R unifies various 3D vision tasks and sets new SoTAs on monocular/multi-view depth estimation as well as relative pose estimation. In short, DUSt3R makes geometric 3D vision tasks easy.

Given an unconstrained image collection, i.e. a set of photographs with unknown camera poses and intrinsics, DUSt3R outputs a corresponding set of pointmaps, from which various geometric quantities that are usually hard to estimate all at once can be recovered directly, such as camera parameters, pixel correspondences, depth maps, and a fully consistent 3D reconstruction. Note that DUSt3R also works on a single input image (in which case it performs monocular reconstruction). The authors also show qualitative examples of models obtained without any known camera parameters. For each sample, from left to right: input image, colored point cloud, and a shaded rendering that better reveals the underlying geometry.

Reconstruction examples of two scenes never seen during training. From left to right: RGB, depth map, confidence map, reconstruction. The scene on the right shows the result of the global alignment.

Examples of 3D reconstruction from only two images of unseen scenes: KingsCollege (top left), OldHospital (top center), StMarysChurch (top right), ShopFacade (bottom left), GreatCourt (bottom right). Notably, this is the raw output of the network, i.e. a novel viewpoint of the colored point cloud is shown.

An example of reconstructing an unseen scene from two images. Notably, this is the raw output of the network, i.e. a novel viewpoint of the colored point cloud is shown, with the camera parameters recovered from the raw pointmaps.

(1) The first holistic end-to-end 3D reconstruction pipeline for uncalibrated and unposed images is proposed, unifying monocular and binocular 3D reconstruction.

(2) A pointmap representation for MVS applications is introduced, which lets the network predict 3D shapes in a canonical frame while preserving the implicit relationship between pixels and the scene. This effectively drops many of the constraints of the usual projective camera formulation.

(3) For multi-view 3D reconstruction, an optimization procedure is introduced to globally align the pointmaps, from which all the usual intermediate outputs of classical SfM and MVS pipelines can easily be extracted. In a sense, this approach unifies all 3D vision tasks and greatly simplifies the traditional reconstruction pipeline, making DUSt3R simple and easy to use.

(4) Strong performance is demonstrated on a range of 3D vision tasks. In particular, the all-in-one model achieves state-of-the-art results on monocular and multi-view depth benchmarks, as well as on multi-view camera pose estimation.

DUSt3R is trained in a fully supervised manner with a simple regression loss, using large public datasets whose ground-truth annotations are either synthetically generated, reconstructed with SfM software, or captured with dedicated sensors. The authors depart from the trend of integrating task-specific modules and adopt a fully data-driven strategy based on a generic Transformer architecture that enforces no geometric constraints at inference time but can benefit from powerful pretraining schemes. The network thereby learns strong geometric and shape priors, similar to the cues typically exploited in MVS, such as texture, shading, or contours.
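
As a rough illustration of what such a regression loss can look like, below is a hedged PyTorch sketch of a confidence-weighted, scale-normalized pointmap loss in the spirit of the paper; the tensor shapes, the normalization granularity (a single scale per batch here rather than per view), and the weight alpha are simplifying assumptions, not the authors' exact formulation.

```python
# A sketch of a confidence-weighted pointmap regression loss: the per-pixel
# Euclidean error is scale-normalized, weighted by a predicted confidence, and
# regularized so the network cannot set the confidence to zero everywhere.
import torch

def pointmap_loss(pred_pts, gt_pts, conf, valid, alpha=0.2, eps=1e-8):
    """pred_pts, gt_pts: (B, H, W, 3); conf: (B, H, W), positive; valid: (B, H, W) bool."""
    def normalize(pts):
        # One scale for the whole batch for brevity; a per-view scale is closer
        # to the paper's setup.
        scale = pts.norm(dim=-1)[valid].mean().clamp(min=eps)
        return pts / scale

    # Scale-normalized Euclidean error between predicted and ground-truth 3D points.
    err = (normalize(pred_pts) - normalize(gt_pts)).norm(dim=-1)
    # Confidence-weighted regression: confident pixels contribute more, and the
    # log term penalizes the trivial solution of zero confidence everywhere.
    loss = conf * err - alpha * torch.log(conf)
    return loss[valid].mean()
```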

In order to fuse predictions coming from multiple image pairs, the authors revisit bundle adjustment (BA) for the case of pointmaps, enabling full-scale MVS. A global alignment procedure is introduced which, unlike BA, does not involve minimizing reprojection errors. Instead, the camera poses and the geometry are aligned and optimized directly in 3D space, which is fast and converges well in practice.
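
The sketch below illustrates, in simplified PyTorch (not the released implementation), what such a 3D alignment can look like: each image gets a free global pointmap, each image pair gets a similarity transform (rotation, translation, scale) mapping its pairwise predictions into the world frame, and everything is optimized jointly by gradient descent on a plain 3D distance, with no reprojection error involved. Confidence weighting and the exact parameterization used in the paper are omitted; all names and hyperparameters here are illustrative.

```python
# Simplified global alignment of pairwise pointmap predictions in 3D space.
import torch

def quat_to_rot(q):
    """Convert a (possibly unnormalized) quaternion (w, x, y, z) to a 3x3 rotation."""
    q = q / q.norm()
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

def global_align(pair_preds, steps=500, lr=1e-2):
    """pair_preds: dict {(i, j): (X_i, X_j)} of (N, 3) pointmaps expressed in frame i."""
    # One free global pointmap per view, initialized from any pairwise prediction of it.
    world_pts = {}
    for (i, j), (Xi, Xj) in pair_preds.items():
        world_pts.setdefault(i, Xi.detach().clone().requires_grad_(True))
        world_pts.setdefault(j, Xj.detach().clone().requires_grad_(True))
    # One similarity transform (quaternion, translation, log-scale) per image pair.
    params = {e: {'q': torch.tensor([1., 0., 0., 0.], requires_grad=True),
                  't': torch.zeros(3, requires_grad=True),
                  's': torch.zeros(1, requires_grad=True)} for e in pair_preds}
    opt = torch.optim.Adam([p for d in params.values() for p in d.values()]
                           + list(world_pts.values()), lr=lr)
    for _ in range(steps):
        # Centering the log-scales keeps their product at 1, ruling out the trivial
        # collapse where every scale (and hence every point) shrinks to zero.
        mean_log_s = torch.stack([d['s'] for d in params.values()]).mean()
        loss = 0.0
        for (i, j), (Xi, Xj) in pair_preds.items():
            p = params[(i, j)]
            R, t = quat_to_rot(p['q']), p['t']
            s = (p['s'] - mean_log_s).exp()
            for view, X in ((i, Xi), (j, Xj)):
                # 3D distance between the transformed pairwise prediction and the
                # current global pointmap of that view (no reprojection involved).
                loss = loss + (world_pts[view] - (s * (X @ R.T) + t)).norm(dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return world_pts
```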

Network architecture. The two views of the scene (I1, I2) are first encoded with a shared-weight ViT encoder. The resulting token representations F1 and F2 are then passed to two Transformer decoders that continuously exchange information through cross-attention. Finally, two regression heads output the two corresponding pointmaps and their associated confidence maps. Importantly, both pointmaps are expressed in the coordinate frame of the first image I1.
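
For orientation, here is a schematic PyTorch sketch of that two-branch layout. Layer sizes and module names are placeholders; the real model embeds image patches and uses dense per-pixel prediction heads, whereas this sketch takes already-patchified token sequences and regresses per-patch values to stay short.

```python
# Schematic two-view network: shared encoder, two cross-attending decoders,
# and two regression heads (layer norms and positional details omitted).
import torch
import torch.nn as nn

class CrossBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, other):
        x = x + self.self_attn(x, x, x)[0]            # tokens attend to their own view
        x = x + self.cross_attn(x, other, other)[0]   # ...and to the other view's tokens
        return x + self.mlp(x)

class TwoViewNet(nn.Module):
    def __init__(self, dim=768, depth=12):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=depth)
        self.dec1 = nn.ModuleList(CrossBlock(dim) for _ in range(depth))
        self.dec2 = nn.ModuleList(CrossBlock(dim) for _ in range(depth))
        self.head1 = nn.Linear(dim, 3 + 1)  # per-patch (x, y, z) + confidence
        self.head2 = nn.Linear(dim, 3 + 1)

    def forward(self, tok1, tok2):
        f1, f2 = self.encoder(tok1), self.encoder(tok2)   # shared-weight encoder
        for b1, b2 in zip(self.dec1, self.dec2):
            f1, f2 = b1(f1, f2), b2(f2, f1)               # continuous cross-view exchange
        return self.head1(f1), self.head2(f2)             # both pointmaps in I1's frame
```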

For each scene of the two datasets, comparisons are made against recent results in Table 1. DUSt3R achieves accuracy on par with existing approaches, whether based on feature matching or on end-to-end learning, and in some cases even exceeds strong baselines such as HLoc. The authors point out two reasons why this is significant. First, DUSt3R was never trained for visual localization in any form. Second, neither the query images nor the database images were seen during DUSt3R's training.

In the zero-shot setting, the recent SlowTV represents the current state of the art. That method collects a large mixture of datasets covering urban, natural, synthetic, and indoor scenes and trains a general-purpose model. For each dataset in the mixture, the camera parameters are either known or estimated with COLMAP. As shown in Table 2, DUSt3R adapts well to both outdoor and indoor environments. It outperforms the self-supervised baselines and is on par with the state-of-the-art supervised baselines.

As observed in Table 3, DUSt3R achieves state-of-the-art accuracy on ETH3D and is generally superior to the most recent state-of-the-art methods, even those using ground-truth camera poses. In terms of runtime, DUSt3R is also much faster than the traditional COLMAP pipeline. This demonstrates the applicability of DUSt3R to indoor, outdoor, small-scale, and large-scale scenes, while there was no training on the test domains, except for the ScanNet test set, which is part of the Habitat dataset.

This article proposes a new paradigm that not only solves in-the-wild 3D reconstruction without any prior information about the scene or cameras, but also addresses a wide range of 3D vision tasks.

Readers interested in more experimental results and details can read the original paper~

This article is for academic sharing only. If there is any infringement, please contact us and the article will be deleted.

3D Vision Workshop Exchange Group

At present, we have established multiple communities across 3D vision topics, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, etc. The subdivisions include:

2D Computer Vision: image classification/segmentation, object detection/tracking, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, behavior recognition, model quantization and pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflection, Halcon, photogrammetry, array camera, photometric stereo vision, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, target tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition to these, there are also groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with a note in the form "research direction + school/company + nickname" (e.g. 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflection, Robotic Arm Grasping, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control and Planning, LeGO-LOAM, Multimodal Fusion SLAM, LOAM-SLAM, Indoor and Outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D Reconstruction, colmap, Line and Area Structured Light, Hardware Structured-Light Scanners, Drones, etc.
