
Ultra-fast end-to-end MeshLRM enables high-quality reconstruction in less than 1s!

Author: 3D Vision Workshop

Source: 3D Vision Workshop

Add the assistant on WeChat (ID: dddvision) with a note in the format "direction + school/company + nickname" to be added to the group. Industry subdivisions are listed at the end of the article.

This article introduces MeshLRM, a new large reconstruction model that directly outputs high-quality meshes. The authors fine-tune a pre-trained NeRF-based LRM, originally trained with volume rendering, using Differentiable Marching Cubes (DiffMC) and differentiable rasterization. To make DiffMC training efficient, they make several improvements to the LRM architecture, including small MLPs and a simplified image tokenization, which benefit both NeRF and mesh training. They also find that a low-to-high-resolution training strategy significantly accelerates the training of NeRF-based models. Compared with existing methods, the proposed approach improves both quality and speed, and is the only one among them that outputs high-quality meshes. The article also shows that the method can be applied directly to tasks such as text-to-3D and image-to-3D generation. Since meshes are the most widely accepted format for 3D assets in industry, the authors argue that this approach is a major step toward automated 3D asset creation and may open up possibilities for new kinds of 3D workflows.

Let's read about this work together~

Paper title: MeshLRM: Large Reconstruction Model for High-Quality Meshes

Authors: Xinyue Wei, Kai Zhang, et al.

Affiliations: UC San Diego, Adobe Research

Paper link: https://arxiv.org/pdf/2404.10556.pdf

Project Homepage: https://sarahweiii.github.io/meshlrm/

We propose MeshLRM, a novel LRM-based method that reconstructs a high-quality mesh from just four input images in under one second. Unlike previous Large Reconstruction Models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering into the LRM framework, enabling end-to-end mesh reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering. We also improve the LRM architecture by simplifying several complex designs of previous LRMs. MeshLRM's NeRF initialization is trained sequentially on low- and then high-resolution images; this new LRM training strategy converges significantly faster and reaches better quality with less compute. Our approach achieves state-of-the-art mesh reconstruction from sparse-view inputs and also supports many downstream applications, including text-to-3D and single-image-to-3D generation.

Figure: Qualitative comparison of MeshLRM with other feed-forward methods. "In3D-LRM" is the Triplane-LRM in Instant3D; "MC" denotes Marching Cubes. In3D-LRM uses volume rendering, while the others use surface rendering.

Figure: Text-to-3D results, obtained by applying Instant3D's diffusion model to generate 4-view images from the text input. Our method produces more precise and smoother geometry along with sharp textures.

Figure: Image-to-3D results, obtained by applying Zero123++ to generate 6 multi-view images from a single input image. Our results are superior to those of other reconstruction methods. Note that our model is trained on 4 views and generalizes zero-shot to 6 views.

The main contributions of this work are:
  • A novel LRM-based framework that integrates differentiable mesh extraction and rendering for end-to-end sparse-view mesh reconstruction.
  • A ray-opacity-based loss that stabilizes DiffMC training.
  • An efficient LRM architecture and training strategy for fast, high-quality reconstruction.

The authors benchmark MeshLRM on 3D reconstruction (on synthetic and real-world datasets) and 3D generation (in combination with other multi-view generation methods). Figure 1 shows high-quality meshes reconstructed by MeshLRM, each in under one second.

The paper presents MeshLRM, a model that reconstructs high-quality meshes in less than one second. It uses a Transformer-based architecture consisting mainly of a stack of self-attention Transformer blocks that process concatenated image tokens and triplane tokens. Compared to previous LRMs, MeshLRM simplifies the image tokenization and triplane NeRF decoding designs, resulting in faster training and inference.
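As a rough illustration of this token-mixing design (not the authors' code), the following PyTorch sketch concatenates image tokens with triplane tokens and runs them through a stack of self-attention blocks; all names, sizes, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # One pre-norm Transformer block: self-attention followed by an MLP.
    def __init__(self, dim: int, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention over all tokens
        return x + self.mlp(self.norm2(x))

def mix_tokens(image_tokens, triplane_tokens, blocks):
    # Concatenate per-view image tokens with learnable triplane tokens and
    # process them jointly; only the triplane part is decoded afterwards.
    n_tri = triplane_tokens.shape[1]
    x = torch.cat([image_tokens, triplane_tokens], dim=1)
    for blk in blocks:
        x = blk(x)
    return x[:, -n_tri:]  # updated triplane tokens
```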

Training proceeds in two phases: first, the model is trained to predict a NeRF from sparse input images with volume-rendering supervision; then, mesh surface extraction is optimized by running differentiable Marching Cubes on the predicted density field and minimizing a surface rendering loss.
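The second phase can be pictured with the following hedged sketch of a single fine-tuning step. Here `diff_marching_cubes`, `rasterize`, and `loss_fn` are hypothetical stand-ins (not real library APIs) for the DiffMC operator, a differentiable rasterizer, and the surface rendering losses detailed later.

```python
def stage2_step(model, batch, optimizer, diff_marching_cubes, rasterize, loss_fn):
    fields = model(batch["images"], batch["cameras"])      # predicted triplane fields
    verts, faces = diff_marching_cubes(fields["density"])  # differentiable iso-surface
    rendered = rasterize(verts, faces, fields["color"], batch["cameras"])
    loss = loss_fn(rendered, batch)                        # surface rendering losses
    optimizer.zero_grad()
    loss.backward()                                        # gradients flow through DiffMC
    optimizer.step()
    return loss
```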

Specifically, the model uses a simplified image tokenization: the camera parameters of each view are converted into per-pixel Plücker ray coordinates and concatenated with the RGB pixels to form a 9-channel feature map. The feature map is then divided into non-overlapping patches that are linearly projected into Transformer inputs. A deep Transformer network of self-attention and MLP layers then exchanges information across all input views, effectively modeling intra-view, inter-view, and cross-modal relationships.
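A hedged sketch of this tokenization: per-pixel Plücker coordinates (ray direction plus moment, 6 channels) are concatenated with RGB (3 channels), and a strided convolution implements the patchify-plus-linear-projection step. The patch size and token dimension are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def plucker_rays(origins, dirs):
    # origins, dirs: (H, W, 3) per-pixel ray origins and directions.
    d = F.normalize(dirs, dim=-1)
    m = torch.cross(origins, d, dim=-1)  # moment o x d
    return torch.cat([d, m], dim=-1)     # (H, W, 6) Plücker coordinates

class Tokenizer(nn.Module):
    # A strided conv over the 9-channel map is equivalent to cutting
    # non-overlapping patches and applying a shared linear projection.
    def __init__(self, patch=8, dim=1024):
        super().__init__()
        self.proj = nn.Conv2d(9, dim, kernel_size=patch, stride=patch)

    def forward(self, rgb, rays):
        feat = torch.cat([rgb, rays], dim=-1)           # (H, W, 9)
        x = feat.permute(2, 0, 1).unsqueeze(0)          # (1, 9, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)  # (1, n_tokens, dim)
```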

The triplane tokens are decoded into a renderable triplane NeRF. To improve training efficiency, the model decodes density and color with small MLPs instead of the heavier shared MLP used in previous LRMs. In addition, the model uses Differentiable Marching Cubes (DiffMC) to extract the mesh surface from the density field and renders it with a differentiable rasterizer for high-quality mesh reconstruction.
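A minimal sketch of triplane decoding under stated assumptions: a 3D point is projected onto the three axis-aligned feature planes, the bilinearly sampled features are summed, and two tiny MLPs decode density and color. The channel width (32) and hidden size (64) are placeholders, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, pts):
    # planes: (3, C, R, R) feature planes; pts: (N, 3) with coords in [-1, 1],
    # the normalized range grid_sample expects.
    coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]  # XY, XZ, YZ projections
    feats = 0
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, -1, 1, 2)
        feats = feats + F.grid_sample(plane[None], grid, align_corners=False)[0, :, :, 0].T
    return feats  # (N, C) summed plane features

# Tiny decoding heads in place of a large shared MLP.
tiny_density = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
tiny_color = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))
```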

For training, the model is first pre-trained with ray-marching-based radiance field (volume) rendering, and then fine-tuned on high-resolution images for better mesh reconstruction quality. During fine-tuning, multiple loss functions supervise rendering quality, and a ray opacity loss is introduced to stabilize training and prevent artifacts in the mesh. The final mesh reconstruction loss combines several terms, including an L2 loss, a perceptual loss, and the ray opacity loss, as well as a normal loss that improves geometric accuracy and smoothness.
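The loss composition might look like the following sketch. Here `lpips_fn` is a stand-in for a perceptual-loss network, the loss weights are placeholders, and the normal-smoothness term is only one plausible reading of the paper's normal loss.

```python
import torch

def mesh_losses(pred_rgb, gt_rgb, ray_opacity, gt_mask, pred_normals, lpips_fn,
                w_lpips=0.1, w_opacity=0.5, w_normal=0.1):
    l2 = torch.mean((pred_rgb - gt_rgb) ** 2)    # photometric L2
    perceptual = lpips_fn(pred_rgb, gt_rgb)      # perceptual (LPIPS-style) loss
    # Ray opacity loss: pull per-ray opacity toward the foreground mask,
    # which stabilizes DiffMC training and suppresses floaters.
    opacity = torch.nn.functional.binary_cross_entropy(
        ray_opacity.clamp(1e-5, 1 - 1e-5), gt_mask.float())
    # Simple smoothness proxy on normals of adjacent pixels, (B, H, W, 3);
    # the paper's normal loss may be formulated differently.
    normal = (1.0 - torch.nn.functional.cosine_similarity(
        pred_normals[:, :-1], pred_normals[:, 1:], dim=-1)).mean()
    return l2 + w_lpips * perceptual + w_opacity * opacity + w_normal * normal
```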


The experiments in the paper cover three parts: datasets and evaluation protocols, analysis and ablation studies, and comparisons with baseline methods.

Datasets and evaluation protocols

  • The model is trained on the Objaverse dataset, which contains 730K objects, for the first-stage volume rendering training, and is then fine-tuned on the higher-quality Objaverse-LVIS subset of 46K objects; prior work has shown that such fine-tuning improves quality. The authors evaluate the reconstruction quality of MeshLRM on the GSO, NeRF-Synthetic, and OpenIllumination datasets. Evaluation metrics include PSNR, SSIM, and LPIPS for rendering quality, and the bidirectional Chamfer distance (CD) for mesh geometry quality. Since unseen parts cannot be reconstructed accurately, the authors cast rays from all test views and sample 100K points at the ray-surface intersections on each object to compute the Chamfer distance, as sketched below.
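A hedged, minimal implementation of the bidirectional Chamfer distance between two point sets; in the paper the 100K evaluation points come from ray-surface intersections, which we take as given here.

```python
import torch

def chamfer_distance(p, q, chunk=4096):
    # p: (N, 3), q: (M, 3); mean nearest-neighbor distance in both directions.
    def one_way(a, b):
        mins = []
        for i in range(0, a.shape[0], chunk):   # chunk to bound memory
            d = torch.cdist(a[i:i + chunk], b)  # pairwise distances (chunk, M)
            mins.append(d.min(dim=1).values)
        return torch.cat(mins).mean()
    return one_way(p, q) + one_way(q, p)
```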

Analysis and ablation studies

  • Volume rendering (stage 1) training strategy: To validate the strategy of 256-res pre-training followed by 512-res fine-tuning, the authors compare it against a model trained only at high resolution (i.e., 512-res from scratch). At the same total compute cost, the low-to-high-resolution strategy yields a 2.6 dB PSNR improvement and significantly better performance; a schematic of the schedule follows this list.
  • Effectiveness of surface fine-tuning: The authors verify the second-stage surface fine-tuning by comparing the final mesh with a mesh extracted directly from the stage-1 model using Marching Cubes. The model fine-tuned with DiffMC-based surface rendering is significantly better in both mesh rendering quality and geometric quality.
  • Surface fine-tuning losses: An ablation study shows that a model trained without the proposed ray opacity loss produces severe floating artifacts; the ray opacity loss is important for stable training and for preventing floaters.
  • Tiny MLPs: The authors use tiny MLPs for triplane decoding instead of the large MLPs of previous LRMs. Results show that the tiny MLPs bring a significant training speed advantage without degrading quality.
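For the first bullet, the low-to-high-resolution schedule can be pictured as below; `make_loader` and `train_step` are hypothetical stand-ins, and the step counts are placeholders, not the paper's values.

```python
def train_stage1(model, make_loader, train_step):
    # Pre-train stage 1 at 256-res, then continue at 512-res.
    for res, num_steps in [(256, 80_000), (512, 20_000)]:
        loader = make_loader(resolution=res)  # supervision rendered at `res`
        for step, batch in zip(range(num_steps), loader):
            train_step(model, batch)          # one volume-rendering update
```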

Comparison with baseline methods

  • Comparison with feed-forward methods: The authors compare MeshLRM with the earlier Instant3D model; MeshLRM outperforms Instant3D in both rendering quality and geometry quality, while having a smaller model size, lower compute cost, and faster inference.
  • Comparison with per-scene optimization methods: The authors also compare MeshLRM with recent per-scene optimization methods such as ZeroRF and FreeNeRF. Results on the NeRF-Synthetic and OpenIllumination datasets show that MeshLRM outperforms these methods in both rendering quality and geometry quality, with faster inference.

In this paper, we propose MeshLRM, a novel large reconstruction model capable of directly outputting high-quality meshes. We fine-tune a pre-trained NeRF-based LRM, trained with volume rendering, by applying Differentiable Marching Cubes (DiffMC) and differentiable rasterization. Since DiffMC training demands an efficient backbone, we make several improvements to the LRM architecture (small MLPs and a simplified image tokenization) that benefit both NeRF and mesh training. We also find that the low-to-high-resolution training strategy significantly accelerates the training of NeRF-based models. Our method improves on existing methods in both quality and speed, and is the only one among them that outputs high-quality meshes. In addition, we show that our method applies directly to applications such as text-to-3D and image-to-3D generation. Since meshes are the most widely accepted format for 3D assets in industry, we believe our approach is a step toward automating 3D asset creation and may open up possibilities for new kinds of 3D workflows.

This article is shared for academic purposes only. If there is any infringement, please contact us and it will be deleted.

3D Vision Workshop Exchange Group

We have established multiple communities covering 3D vision, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, and drones. The subdivisions include:

2D computer vision: image classification/segmentation, object detection, medical imaging, GAN, OCR, 2D defect detection, remote sensing and mapping, super-resolution, face detection, behavior recognition, model quantization and pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc.

Industrial 3D vision: camera calibration, stereo matching, 3D point cloud, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflection, Halcon, photogrammetry, array camera, photometric stereo vision, etc.

SLAM: visual SLAM, LiDAR SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, LiDAR, visual camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane line detection, Occupancy, target tracking, a general autonomous driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, COLMAP, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, UAV flight control, etc.

In addition, there are groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat (ID: dddvision) with a note in the format "research direction + school/company + nickname" (e.g., 3D point cloud + Tsinghua + Little Strawberry) to be added to the group.

3D Vision Workshop Knowledge Planet

3DGS, NeRF, structured light, phase deflectometry, robotic arm grasping, point cloud practice, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision C++, 3D vision Python, dToF, camera calibration, ROS2, robot control and planning, LeGO-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor and outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, COLMAP, line and surface structured light, hardware structured-light scanners, UAVs, etc.