3.9k stars! Dense 3D scene reconstruction from just 2 images, with no camera parameters needed!

author:3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant on WeChat: dddvision, with a note in the form "direction + school/company + nickname", and you will be added to the group. Industry subgroups are listed at the end of the article.

Unconstrained dense 3D reconstruction from multiple views is one of the long-standing goals of computer vision research. In a nutshell, the task aims to estimate the 3D geometry and camera parameters of a scene, given a set of photographs of that scene.

Overall, the modern structure-from-motion (SfM) and multi-view stereo (MVS) pipeline boils down to solving a series of minimal problems: matching points, finding the essential matrix, triangulating points, sparsely reconstructing the scene, estimating the cameras, and finally performing dense reconstruction. None of these sub-problems is solved perfectly, so each step adds noise to the next, increasing the complexity and engineering effort required for the whole pipeline.

In this article, the authors propose DUSt3R, a radically new approach to dense, unconstrained stereo 3D reconstruction from uncalibrated and unposed cameras. Its main component is a network that regresses a dense and accurate representation of the scene from just a pair of images, without any prior information about the scene or the cameras (not even the intrinsics). The resulting scene representation is based on 3D pointmaps with rich properties: they simultaneously encapsulate (a) the scene geometry, (b) the relationship between pixels and scene points, and (c) the relationship between the two viewpoints. From this output alone, virtually all scene parameters (i.e., cameras and scene geometry) can be extracted directly. This is possible because the network jointly processes the input images and the resulting 3D pointmaps, thus learning to associate 2D structures with 3D shapes and getting the chance to solve several minimal problems at once, enabling internal "collaboration" between them.
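
To make the pointmap idea concrete, here is a minimal sketch (not the authors' code) of how two of the quantities mentioned above can be read directly off a pointmap X of shape (H, W, 3) expressed in the camera frame of the first image: the depth map is simply the z-coordinate, and a single focal length can be fitted by least squares under the assumption of a pinhole camera with the principal point at the image center (the paper uses a robust iterative variant of this fit). Function names are illustrative.

```python
# A minimal sketch of reading scene parameters off a pointmap X of shape (H, W, 3)
# expressed in the camera frame of view 1, assuming a pinhole camera whose
# principal point sits at the image center.
import numpy as np

def depth_from_pointmap(X):
    # The per-pixel depth is simply the z-coordinate of the predicted 3D point.
    return X[..., 2]

def focal_from_pointmap(X, eps=1e-6):
    """Estimate one focal length f such that (u - cx, v - cy) ~= f * (x/z, y/z)."""
    H, W, _ = X.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    cx, cy = (W - 1) / 2.0, (H - 1) / 2.0
    z = np.clip(X[..., 2], eps, None)
    # Normalized camera-plane coordinates implied by the pointmap.
    xz = X[..., 0] / z
    yz = X[..., 1] / z
    # Plain least-squares fit of a single scalar f over both axes; a robust
    # (outlier-down-weighting) fit would be used in practice.
    num = (u - cx) * xz + (v - cy) * yz
    den = xz ** 2 + yz ** 2
    return float(num.sum() / (den.sum() + eps))
```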

Let's read about this work together~

Title: DUSt3R: Geometric 3D Vision Made Easy

Authors: Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, Jérôme Revaud

Institutions: Aalto University, Naver Labs Europe

Original link: http://arxiv.org/abs/2312.14132

Code link: https://github.com/naver/dust3r

Official Website: https://dust3r.europe.naverlabs.com/

Multi-view stereo reconstruction (MVS) in the wild first requires estimating camera parameters, such as intrinsics and extrinsics. These parameters are usually tedious and cumbersome to obtain, yet they are necessary to triangulate corresponding pixels in 3D space, which is at the heart of all best-performing MVS algorithms. In this work, we take the opposite stance and introduce DUSt3R, a radically novel paradigm for dense and unconstrained stereo 3D reconstruction of arbitrary image collections, i.e. operating without prior information about camera calibration or viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps, relaxing the hard constraints of the usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. When more than two images are provided, we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. Our network architecture is based on standard Transformer encoders and decoders, allowing us to leverage powerful pretrained models. Our representation directly provides a 3D model of the scene as well as depth information, and, interestingly, we can also seamlessly recover pixel matches and relative and absolute cameras from it. Detailed experiments on all these tasks show that the proposed DUSt3R unifies various 3D vision tasks and sets new SoTAs on monocular/multi-view depth estimation as well as relative pose estimation. In short, DUSt3R makes geometric 3D vision tasks easy.

Given an unconstrained image collection, i.e. a set of photographs with unknown camera poses and intrinsics, DUSt3R outputs a corresponding set of pointmaps, from which various geometric quantities that are usually hard to estimate all at once can be recovered directly, such as camera parameters, pixel correspondences, depth maps, and a fully consistent 3D reconstruction. Note that DUSt3R also works on a single input image (in which case it performs monocular reconstruction). The authors also show qualitative examples of models obtained without any known camera parameters. For each sample, from left to right: input image, colored point cloud, and a shaded rendering that better reveals the underlying geometry.

Reconstruction examples of two scenes never seen during training. From left to right: RGB, depth map, confidence map, reconstruction. The scene on the right shows the result of the global alignment.

Examples of 3D reconstruction from only two images of unseen scenes: KingsCollege (top left), OldHospital (top center), StMarysChurch (top right), ShopFacade (bottom left), GreatCourt (bottom right). Notably, this is the raw output of the network, i.e. a novel viewpoint of the colored point cloud is shown.

An example of reconstructing an unseen scene from two images. Notably, this is the raw output of the network, i.e. a novel viewpoint of the colored point cloud is shown, with the camera parameters recovered from the raw pointmaps.

(1) The first holistic end-to-end 3D reconstruction pipeline for uncalibrated and unposed images is proposed, unifying monocular and binocular 3D reconstruction.

(2) A pointmap representation for MVS applications is introduced, which lets the network predict 3D shapes in a canonical frame while preserving the implicit relationship between pixels and the scene. This effectively drops many of the constraints of the usual projective camera formulation.

(3) For multi-view 3D reconstruction, an optimization procedure is introduced to globally align the pointmaps, from which all the usual intermediate outputs of classical SfM and MVS pipelines can easily be extracted. In a sense, this approach unifies all 3D vision tasks and greatly simplifies the traditional reconstruction pipeline, making DUSt3R simple and easy to use.

(4) Strong performance is demonstrated on a range of 3D vision tasks. In particular, the all-in-one model achieves state-of-the-art results on monocular and multi-view depth benchmarks, as well as on multi-view camera pose estimation.

DUSt3R is trained in a fully supervised manner with a simple regression loss, using large public datasets whose ground-truth annotations are either synthetically generated, reconstructed with SfM software, or captured with dedicated sensors. The authors depart from the trend of integrating task-specific modules and adopt a fully data-driven strategy based on a generic Transformer architecture that enforces no geometric constraints at inference time but can benefit from powerful pretraining schemes. The network thereby learns strong geometric and shape priors, similar to the cues typically exploited in MVS, such as texture, shading, or contours.
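
As a rough illustration of what such a regression loss can look like, below is a hedged PyTorch sketch of a confidence-weighted, scale-normalized pointmap loss in the spirit of the paper; the tensor shapes, the normalization granularity (a single scale per batch here rather than per view), and the weight alpha are simplifying assumptions, not the authors' exact formulation.

```python
# A sketch of a confidence-weighted pointmap regression loss: the per-pixel
# Euclidean error is scale-normalized, weighted by a predicted confidence, and
# regularized so the network cannot set the confidence to zero everywhere.
import torch

def pointmap_loss(pred_pts, gt_pts, conf, valid, alpha=0.2, eps=1e-8):
    """pred_pts, gt_pts: (B, H, W, 3); conf: (B, H, W), positive; valid: (B, H, W) bool."""
    def normalize(pts):
        # One scale for the whole batch for brevity; a per-view scale is closer
        # to the paper's setup.
        scale = pts.norm(dim=-1)[valid].mean().clamp(min=eps)
        return pts / scale

    # Scale-normalized Euclidean error between predicted and ground-truth 3D points.
    err = (normalize(pred_pts) - normalize(gt_pts)).norm(dim=-1)
    # Confidence-weighted regression: confident pixels contribute more, and the
    # log term penalizes the trivial solution of zero confidence everywhere.
    loss = conf * err - alpha * torch.log(conf)
    return loss[valid].mean()
```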

In order to fuse predictions coming from multiple image pairs, the authors revisit bundle adjustment (BA) for the case of pointmaps, enabling full-scale MVS. A global alignment procedure is introduced which, unlike BA, does not involve minimizing reprojection errors. Instead, the camera poses and the geometry are aligned and optimized directly in 3D space, which is fast and converges well in practice.
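
The sketch below illustrates, in simplified PyTorch (not the released implementation), what such a 3D alignment can look like: each image gets a free global pointmap, each image pair gets a similarity transform (rotation, translation, scale) mapping its pairwise predictions into the world frame, and everything is optimized jointly by gradient descent on a plain 3D distance, with no reprojection error involved. Confidence weighting and the exact parameterization used in the paper are omitted; all names and hyperparameters here are illustrative.

```python
# Simplified global alignment of pairwise pointmap predictions in 3D space.
import torch

def quat_to_rot(q):
    """Convert a (possibly unnormalized) quaternion (w, x, y, z) to a 3x3 rotation."""
    q = q / q.norm()
    w, x, y, z = q
    return torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])

def global_align(pair_preds, steps=500, lr=1e-2):
    """pair_preds: dict {(i, j): (X_i, X_j)} of (N, 3) pointmaps expressed in frame i."""
    # One free global pointmap per view, initialized from any pairwise prediction of it.
    world_pts = {}
    for (i, j), (Xi, Xj) in pair_preds.items():
        world_pts.setdefault(i, Xi.detach().clone().requires_grad_(True))
        world_pts.setdefault(j, Xj.detach().clone().requires_grad_(True))
    # One similarity transform (quaternion, translation, log-scale) per image pair.
    params = {e: {'q': torch.tensor([1., 0., 0., 0.], requires_grad=True),
                  't': torch.zeros(3, requires_grad=True),
                  's': torch.zeros(1, requires_grad=True)} for e in pair_preds}
    opt = torch.optim.Adam([p for d in params.values() for p in d.values()]
                           + list(world_pts.values()), lr=lr)
    for _ in range(steps):
        # Centering the log-scales keeps their product at 1, ruling out the trivial
        # collapse where every scale (and hence every point) shrinks to zero.
        mean_log_s = torch.stack([d['s'] for d in params.values()]).mean()
        loss = 0.0
        for (i, j), (Xi, Xj) in pair_preds.items():
            p = params[(i, j)]
            R, t = quat_to_rot(p['q']), p['t']
            s = (p['s'] - mean_log_s).exp()
            for view, X in ((i, Xi), (j, Xj)):
                # 3D distance between the transformed pairwise prediction and the
                # current global pointmap of that view (no reprojection involved).
                loss = loss + (world_pts[view] - (s * (X @ R.T) + t)).norm(dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return world_pts
```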

Network architecture. The two views of the scene (I1, I2) are first encoded with a shared-weight ViT encoder. The resulting token representations F1 and F2 are then passed to two Transformer decoders that continuously exchange information through cross-attention. Finally, two regression heads output the two corresponding pointmaps and their associated confidence maps. Importantly, both pointmaps are expressed in the coordinate frame of the first image I1.
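
For orientation, here is a schematic PyTorch sketch of that two-branch layout. Layer sizes and module names are placeholders; the real model embeds image patches and uses dense per-pixel prediction heads, whereas this sketch takes already-patchified token sequences and regresses per-patch values to stay short.

```python
# Schematic two-view network: shared encoder, two cross-attending decoders,
# and two regression heads (layer norms and positional details omitted).
import torch
import torch.nn as nn

class CrossBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, other):
        x = x + self.self_attn(x, x, x)[0]            # tokens attend to their own view
        x = x + self.cross_attn(x, other, other)[0]   # ...and to the other view's tokens
        return x + self.mlp(x)

class TwoViewNet(nn.Module):
    def __init__(self, dim=768, depth=12):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=depth)
        self.dec1 = nn.ModuleList(CrossBlock(dim) for _ in range(depth))
        self.dec2 = nn.ModuleList(CrossBlock(dim) for _ in range(depth))
        self.head1 = nn.Linear(dim, 3 + 1)  # per-patch (x, y, z) + confidence
        self.head2 = nn.Linear(dim, 3 + 1)

    def forward(self, tok1, tok2):
        f1, f2 = self.encoder(tok1), self.encoder(tok2)   # shared-weight encoder
        for b1, b2 in zip(self.dec1, self.dec2):
            f1, f2 = b1(f1, f2), b2(f2, f1)               # continuous cross-view exchange
        return self.head1(f1), self.head2(f2)             # both pointmaps in I1's frame
```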

For each scene of the two datasets, comparisons are made against recent results in Table 1. DUSt3R achieves accuracy on par with existing approaches, whether based on feature matching or on end-to-end learning, and in some cases even exceeds strong baselines such as HLoc. The authors point out two reasons why this is significant. First, DUSt3R was never trained for visual localization in any form. Second, neither the query images nor the database images were seen during DUSt3R's training.

In the zero-shot setting, the recent SlowTV represents the current state of the art. That method collects a large mixture of datasets covering urban, natural, synthetic, and indoor scenes and trains a general-purpose model. For each dataset in the mixture, the camera parameters are either known or estimated with COLMAP. As shown in Table 2, DUSt3R adapts well to both outdoor and indoor environments. It outperforms the self-supervised baselines and is on par with the state-of-the-art supervised baselines.

As observed in Table 3, DUSt3R achieves state-of-the-art accuracy on ETH3D and is generally superior to the most recent state-of-the-art methods, even those using ground-truth camera poses. In terms of runtime, DUSt3R is also much faster than the traditional COLMAP pipeline. This demonstrates the applicability of DUSt3R to indoor, outdoor, small-scale, and large-scale scenes, while there was no training on the test domains, except for the ScanNet test set, which is part of the Habitat dataset.

This article proposes a new paradigm that not only solves in-the-wild 3D reconstruction without any prior information about the scene or cameras, but also addresses a wide range of 3D vision tasks.

Readers interested in more experimental results and details can read the original paper~

This article is for academic sharing only. If there is any infringement, please contact us and the article will be deleted.

3D Vision Workshop Exchange Group

At present, we have established multiple communities across 3D vision topics, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, etc. The subdivisions include:

2D Computer Vision: image classification/segmentation, object detection/tracking, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, behavior recognition, model quantization and pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflection, Halcon, photogrammetry, array camera, photometric stereo vision, etc.

SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithm, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, lidar, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, target tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc

Unmanned aerial vehicles: quadrotor modeling, unmanned aerial vehicle flight control, etc

In addition to these, there are also groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with a note in the form "research direction + school/company + nickname" (e.g. 3D point cloud + Tsinghua + Little Strawberry), and you will be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, Structured Light, Phase Deflection, Robotic Arm Grasping, Point Cloud Practice, Open3D, Defect Detection, BEV Perception, Occupancy, Transformer, Model Deployment, 3D Object Detection, Depth Estimation, Multi-Sensor Calibration, Planning and Control, UAV Simulation, 3D Vision C++, 3D Vision Python, dToF, Camera Calibration, ROS2, Robot Control and Planning, LeGO-LOAM, Multimodal Fusion SLAM, LOAM-SLAM, Indoor and Outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D Reconstruction, colmap, Line and Area Structured Light, Hardware Structured-Light Scanners, Drones, etc.
