Enjoy! Adobe releases ultra-high-quality Mesh reconstruction models!

Author: 3D Vision Workshop

Editor: Computer Vision Workshop

Add the assistant on WeChat: dddvision, with a note of your direction + school/company + nickname, to be pulled into the group. Industry subdivisions are listed at the end of the article.

High-quality 3D mesh models are at the heart of 3D vision and graphics applications, and 3D editing, rendering, and simulation tools are optimized for them. 3D mesh assets are often created manually by professional artists or reconstructed from multi-view 2D images.

Traditionally, this has been done with sophisticated photogrammetry systems, although more recent neural approaches, such as NeRF, have provided a simpler end-to-end pipeline through scene-level optimization. However, these neural methods often produce volumetric representations, and converting them to meshes requires additional optimization in post-processing. In addition, both traditional and neural methods require a large number of input images and long processing times (up to several hours), limiting their applicability in time-sensitive design workflows.

The goal of this work is efficient and accurate 3D asset creation via few-shot mesh reconstruction, using direct feed-forward network inference without per-scene optimization. MeshLRM builds on recent Large Reconstruction Models (LRMs) for 3D reconstruction and generation. Existing LRMs use a triplane NeRF as the 3D representation, which renders at high quality. While these NeRFs can be converted to meshes with Marching Cubes (MC) post-processing, doing so significantly degrades rendering quality and geometric accuracy. The authors address this with MeshLRM, a novel Transformer-based large reconstruction model designed to output high-fidelity 3D meshes directly from sparse-view inputs.

Let's read about this work together~

Title: MeshLRM: Large Reconstruction Model for High-Quality Mesh

Authors: Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, Zexiang Xu

Institutions: University of California, San Diego; Adobe

Original link: https://arxiv.org/abs/2404.12385

Official Website: https://sarahweiii.github.io/meshlrm/

We propose MeshLRM, a novel LRM-based method that can reconstruct a high-quality mesh from just four input images in under one second. Unlike previous Large Reconstruction Models (LRMs) that focus on NeRF-based reconstruction, MeshLRM incorporates differentiable mesh extraction and rendering within the LRM framework. This allows for end-to-end mesh reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering. Moreover, we improve the LRM architecture by simplifying several complex designs of previous LRMs. MeshLRM's NeRF initialization is trained sequentially with low- and high-resolution images; this new LRM training strategy significantly accelerates convergence, achieving better quality with less compute. Our approach achieves state-of-the-art mesh reconstruction from sparse-view inputs and also enables many downstream applications, including text-to-3D and single-image-to-3D generation.

MeshLRM is an LRM-based content creation framework designed to generate high-quality 3D assets. The complex 3D meshes and textures in this scene were all reconstructed by MeshLRM, which combines existing image/text-to-3D generation with end-to-end sparse-view mesh reconstruction.


The main contributions of this work are:

(1) a novel LRM-based framework integrating differentiable mesh extraction and rendering for end-to-end few-shot mesh reconstruction;

(2) a novel ray opacity loss for stabilizing DiffMC-based training;

(3) an efficient LRM architecture and training strategy for fast, high-quality reconstruction. MeshLRM is benchmarked on 3D reconstruction (on synthetic and real-world datasets) and 3D generation (in combination with other multi-view generation methods).

MeshLRM integrates differentiable surface extraction and rendering into a NeRF-based LRM. It applies the recent Differentiable Marching Cubes (DiffMC) technique to extract an isosurface from the density field of the triplane NeRF, and renders the extracted mesh with a differentiable rasterizer. This allows MeshLRM to be trained end-to-end with mesh rendering losses, optimizing it to produce high-quality meshes that render realistic images. Training begins by initializing the model with volumetric NeRF rendering; since the mesh components introduce no new parameters, mesh fine-tuning can start directly from these pre-trained weights.
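To make this pipeline concrete, here is a minimal sketch of one mesh-rendering fine-tuning step. It assumes the diso package's DiffMC operator and nvdiffrast for rasterization — both call signatures are written from memory and should be checked against the libraries' documentation — and `decode_density` / `decode_color` are hypothetical stand-ins for the triplane MLP decoding described below.

```python
import torch
import torch.nn.functional as F
import nvdiffrast.torch as dr
from diso import DiffMC  # differentiable Marching Cubes (assumed API)

GRID = 256                                  # density-grid resolution (illustrative)
diffmc = DiffMC(dtype=torch.float32).cuda()
glctx = dr.RasterizeCudaContext()

def finetune_step(triplane, decode_density, decode_color, mvp, gt, opt, level=5.0):
    # 1) Query the triplane NeRF's density on a dense grid over [-1, 1]^3.
    lin = torch.linspace(-1, 1, GRID, device='cuda')
    pts = torch.stack(torch.meshgrid(lin, lin, lin, indexing='ij'), dim=-1)
    density = decode_density(triplane, pts.reshape(-1, 3)).reshape(GRID, GRID, GRID)

    # 2) Extract the isosurface differentiably: gradients flow from the mesh
    #    vertices back into the density field (sign and isovalue are assumptions).
    verts, faces = diffmc(level - density)
    verts = verts * 2.0 - 1.0               # assuming verts come back in [0, 1]

    # 3) Rasterize with nvdiffrast and shade surface points with the color MLP
    #    (background handling omitted for brevity).
    v_hom = torch.cat([verts, torch.ones_like(verts[:, :1])], dim=-1)
    v_clip = (v_hom @ mvp.t()).unsqueeze(0)                  # (1, V, 4) clip space
    rast, _ = dr.rasterize(glctx, v_clip, faces.int(), resolution=[512, 512])
    surf, _ = dr.interpolate(verts.unsqueeze(0), rast, faces.int())
    img = decode_color(triplane, surf.reshape(-1, 3)).reshape(1, 512, 512, 3)

    # 4) Mesh rendering loss against the ground-truth view (the paper also uses
    #    perceptual and opacity terms; plain MSE is shown here).
    loss = F.mse_loss(img, gt)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```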

However, the authors find that training a mesh-based LRM with differentiable MC remains very challenging. The main problem lies in the (spatially) sparse gradients of the DiffMC operation, which affect only surface voxels and leave most of the volume untouched. This leads to poor local minima during optimization and shows up as floaters in the reconstruction. The authors address this with a novel ray opacity loss that drives the density toward zero along the empty segments of all pixel rays, stabilizing training and guiding the model to learn accurate, floater-free surface geometry. The resulting end-to-end training reconstructs high-quality meshes whose rendering quality even exceeds NeRF volumetric rendering, and the DiffMC-based meshing technique is general: it can potentially be applied to any NeRF-based LRM for mesh reconstruction.
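The exact formulation of this loss isn't given in this summary, but the core idea — drive the accumulated opacity toward zero along the empty portion of each pixel ray — can be sketched in a few lines of PyTorch. The names and sampling layout below are illustrative assumptions, not the authors' exact recipe:

```python
import torch

def ray_opacity_loss(sigma, t_vals, surf_depth):
    """Illustrative ray-opacity-style loss.

    sigma:      (R, S) densities at S samples along each of R rays
    t_vals:     (R, S) depths of those samples
    surf_depth: (R,)   depth of the rasterized surface hit (+inf if the
                       ray misses the mesh, so the whole ray is "empty")
    """
    deltas = t_vals[:, 1:] - t_vals[:, :-1]                  # segment lengths
    alpha = 1.0 - torch.exp(-sigma[:, :-1] * deltas)         # per-sample opacity
    # Transmittance up to each sample (standard volume-rendering recursion).
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-6], dim=1),
        dim=1)[:, :-1]
    # Only penalize samples strictly in front of the surface: that space
    # should be empty, so its accumulated opacity should be near zero.
    empty = (t_vals[:, :-1] < surf_depth[:, None]).float()
    return (trans * alpha * empty).sum(dim=1).mean()
```

With a term like this, density that would otherwise linger in free space (and surface as floaters after marching cubes) is explicitly pushed to zero.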

In this work, the authors also propose a simple and efficient LRM architecture: a large Transformer that directly processes concatenated multi-view image tokens and triplane tokens, using pure self-attention layers to regress the final triplane features for NeRF and mesh reconstruction. In particular, they simplify many complex design choices of previous LRMs, including removing the pre-trained DINO module from image tokenization and replacing the large triplane decoder with small two-layer MLPs. A novel low-resolution pre-training plus high-resolution fine-tuning strategy is used to train MeshLRM on the Objaverse dataset. These design choices yield a state-of-the-art LRM with faster training and inference and higher reconstruction quality.

Model architecture of MeshLRM: input images are first tokenized into patches, and the Transformer takes the concatenated image and triplane tokens as input. The output triplane tokens are upsampled with an unpatchify operator, while the output image tokens are discarded (not drawn in the figure). With two tiny MLPs for density and color decoding, the model supports both volume rendering and DiffMC-based fine-tuning.
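A minimal sketch of this triplane sampling and tiny-MLP decoding stage follows; the channel and hidden sizes are assumptions, and view-dependent shading is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDecoder(nn.Module):
    """Sample triplane features at 3D points, decode with two tiny MLPs."""
    def __init__(self, c=32, hidden=32):
        super().__init__()
        self.density_mlp = nn.Sequential(
            nn.Linear(3 * c, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.color_mlp = nn.Sequential(
            nn.Linear(3 * c, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def sample(self, planes, pts):
        # planes: (3, C, H, W) for the XY/XZ/YZ planes; pts: (N, 3) in [-1, 1]
        feats = []
        for plane, axes in zip(planes, ([0, 1], [0, 2], [1, 2])):
            uv = pts[:, axes].view(1, -1, 1, 2)              # (1, N, 1, 2) grid
            f = F.grid_sample(plane[None], uv, align_corners=True)
            feats.append(f[0, :, :, 0].t())                  # (N, C)
        return torch.cat(feats, dim=-1)                      # (N, 3C)

    def forward(self, planes, pts):
        h = self.sample(planes, pts)
        sigma = F.softplus(self.density_mlp(h))              # nonnegative density
        rgb = torch.sigmoid(self.color_mlp(h))               # color in [0, 1]
        return sigma, rgb
```

With 32-channel planes each decoder has only a few thousand parameters, which is what keeps the dense grid queries needed for DiffMC cheap.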


The Stage 1 model is trained using ray-marching-based radiance field rendering, which provides good initialization weights for mesh reconstruction. Unlike previous LRM work that trains directly on high-resolution (512×512) input images, the authors adopt an efficient training protocol inspired by ViT: the model is first pre-trained on 256×256 images until convergence, then fine-tuned for fewer iterations on 512×512 images. This scheme significantly reduces computational cost and yields better quality for the same amount of compute than training at 512 resolution from scratch (Tab. 1).
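A sketch of this two-stage schedule is shown below; the step counts, learning rates, and the `model(views, res)` forward interface are illustrative placeholders, not the paper's values:

```python
import itertools
import torch
import torch.nn.functional as F

def train_stage(model, loader, num_steps, res, lr):
    """Run one resolution stage; `loader` yields (input_views, target_views)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for views, target in itertools.islice(loader, num_steps):
        # Resize inputs and supervision to the stage resolution.
        views = F.interpolate(views, size=(res, res), mode='bilinear',
                              align_corners=False)
        target = F.interpolate(target, size=(res, res), mode='bilinear',
                               align_corners=False)
        loss = F.mse_loss(model(views, res), target)  # rendering loss, simplified
        opt.zero_grad(); loss.backward(); opt.step()

# Low-resolution pre-training to convergence (cheap steps), then a brief
# high-resolution fine-tune (expensive steps, far fewer of them):
# train_stage(model, loader, num_steps=100_000, res=256, lr=4e-4)
# train_stage(model, loader, num_steps=10_000,  res=512, lr=1e-4)
```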


The effectiveness and importance of the Stage 2 surface fine-tuning is demonstrated by comparing the final mesh with a mesh extracted directly from the Stage 1 model using Marching Cubes on the GSO dataset. While MeshLRM (NeRF) achieves high volume-rendering quality, extracting meshes from it with Marching Cubes (MC) and rendering them, i.e., "MeshLRM (NeRF) + MC", significantly degrades rendering quality across all metrics. In contrast, the final MeshLRM model, fine-tuned with DiffMC-based mesh rendering, achieves higher mesh rendering and geometric quality than the MeshLRM (NeRF) + MC baseline, improving PSNR by a significant 2.5 dB and reducing Chamfer distance (CD) by 0.58. Its mesh rendering quality is even comparable to the Stage 1 MeshLRM (NeRF) volume-rendering results, with PSNR slightly lower but SSIM and LPIPS improved.
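For reference, the two metrics quoted above can be computed as follows (a simple sketch; the paper's exact evaluation protocol, e.g. point-sampling density for CD, may differ):

```python
import torch

def psnr(pred, gt):
    """PSNR in dB for images with values in [0, 1]."""
    mse = torch.mean((pred - gt) ** 2)
    return -10.0 * torch.log10(mse)

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point clouds p: (N, 3), q: (M, 3).
    Brute-force O(N*M) memory; use a KD-tree or batching for large clouds."""
    d = torch.cdist(p, q)                          # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```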


Small MLPs. To justify the choice of small MLPs for triplane decoding over the large MLPs of previous LRMs, the authors compare against a variant that replaces the two small MLPs with a shared 10-layer (i.e., 9 hidden layers), 64-wide MLP decoder. The small MLPs achieve performance similar to the large MLP while training significantly faster (2.7 s/step vs. 3.6 s/step).
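The speed gap is easy to appreciate from parameter counts alone; here is a quick comparison of the two decoder shapes (input/output dimensions are assumptions):

```python
import torch.nn as nn

def mlp(in_dim, hidden, n_layers, out_dim):
    layers, d = [], in_dim
    for _ in range(n_layers - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

tiny = mlp(96, 32, 2, 4)      # two-layer decoder head (dims assumed)
big = mlp(96, 64, 10, 4)      # shared 10-layer, 64-wide decoder
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"tiny: {count(tiny)} params, big: {count(big)} params")
```

Because these MLPs are evaluated per sample point (millions of queries per batch), even a modest per-query saving translates into the reported per-step speedup.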


This article introduced MeshLRM, a novel large reconstruction model that directly outputs high-quality meshes. A NeRF-based LRM pre-trained with volume rendering is fine-tuned using the Differentiable Marching Cubes (DiffMC) method and differentiable rasterization. Because DiffMC fine-tuning demands an efficient backbone, several improvements to the LRM architecture (tiny shared MLPs and simplified image tokenization) were made to facilitate both NeRF and mesh training. The authors also found that a low-to-high-resolution training strategy significantly accelerates the training of NeRF-based models. Compared with existing methods, MeshLRM improves on both quality and speed and, among the compared methods, is the only one that outputs high-quality meshes directly. It applies directly to tasks such as text-to-3D and image-to-3D generation.

Readers interested in more experimental results and details can read the original paper~

This article is for academic sharing only; if there is any infringement, please contact us to delete it.

Computer Vision Workshop Exchange Group

At present, we have established multiple communities in 3D vision and related directions, including 2D computer vision, large models, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subdivisions include:

2D computer vision: image classification/segmentation, object detection/tracking, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, action recognition, model quantization/pruning, transfer learning, human pose estimation, etc.

Large models: NLP, CV, ASR, generative adversarial models, reinforcement learning models, dialogue models, etc

Industrial 3D vision: camera calibration, stereo matching, 3D point clouds, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase deflectometry, Halcon, photogrammetry, array cameras, photometric stereo, etc.

SLAM: visual SLAM, LiDAR SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion algorithms, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.

Autonomous driving: depth estimation, Transformer, millimeter-wave radar, LiDAR, camera sensors, multi-sensor calibration, multi-sensor fusion, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane detection, BEV perception, Occupancy, object tracking, end-to-end autonomous driving, a general autonomous-driving group, etc.

3D reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, COLMAP, texture mapping, etc.

Unmanned aerial vehicles: quadrotor modeling, UAV flight control, etc.

In addition to these, there are also exchange groups for job hunting, hardware selection, visual product deployment, the latest papers, the latest 3D vision products, and 3D vision industry news.

Add the assistant on WeChat: dddvision, with a note of your research direction + school/company + nickname (e.g., 3D point cloud + Tsinghua + Little Strawberry), to be pulled into the group.

3D Vision Learning Knowledge Planet

3DGS, NeRF, structured light, phase deflectometry, robotic arm grasping, hands-on point clouds, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, UAV simulation, 3D vision C++, 3D vision Python, dToF, camera calibration, ROS2, robot control and planning, LeGO-LOAM, multimodal fusion SLAM, LOAM-SLAM, indoor/outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, COLMAP, line/plane structured light, hardware structured-light scanners, UAVs, etc.
