
The latest and most complete summary! A Review of Autonomous Driving Occupancy: A Perspective of Information Fusion

Author: 3D Vision Workshop

Source: 3D Vision Workshop


Title: A Survey on Occupancy Perception for Autonomous Driving: The Information Fusion Perspective

Authors: Huaiyuan Xu, Junliang Chen, Shiyu Meng, Yi Wang, Lap-Pui Chau

Institution: The Hong Kong Polytechnic University

Original link: https://arxiv.org/abs/2405.05173

GitHub: https://github.com/HuaiyuanXu/3D-Occupancy-Perception

3D occupancy perception technology aims to observe and understand dense 3D environments for autonomous vehicles. Owing to its comprehensive perception capability, this technology is emerging as a key component of autonomous driving perception systems and has attracted considerable attention from both industry and academia. Similar to traditional bird's-eye view (BEV) perception, 3D occupancy perception has a multi-source input nature and requires information fusion; the difference is that it captures the vertical structure that 2D BEV ignores. In this survey, we review the latest work on 3D occupancy perception and provide an in-depth analysis of methods with different input modalities. Specifically, we summarize general network pipelines, highlight information fusion techniques, and discuss effective network training. We evaluate and analyze the occupancy perception performance of state-of-the-art methods on the most popular datasets. In addition, we discuss challenges and future research directions. We hope this report will inspire the community and encourage more research on 3D occupancy perception. The papers reviewed in this survey are collected in an actively maintained repository that continuously tracks the latest work: https://github.com/HuaiyuanXu/3D-Occupancy-Perception.

2.1. Occupancy perception in autonomous driving

Autonomous driving can improve the efficiency of urban transportation and reduce energy consumption. For reliable and safe autonomous driving, a critical capability is an accurate and comprehensive understanding of the surrounding environment, that is, perceiving and observing the world. At present, bird's-eye view (BEV) perception is the mainstream perception paradigm; it has the advantages of absolute scale and an occlusion-free description of the environment. BEV perception provides a unified representation space for multi-source information fusion (e.g., information from different viewpoints, modalities, sensors, and time series) and supports many downstream applications (e.g., interpretable decision-making and motion planning). However, BEV perception does not capture height information and therefore cannot provide a complete representation of a 3D scene. To address this, occupancy perception was proposed for autonomous driving to capture the dense 3D structure of the real world. This emerging perception technology aims to infer the occupancy state of each voxel in the voxelized world, and is characterized by strong generalization to open-set objects, irregularly shaped vehicles, and special road structures. Compared to 2D representations such as perspective and bird's-eye views, occupancy perception is inherently 3D, making it better suited to 3D downstream tasks such as 3D detection and segmentation.
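The difference between a BEV map and a 3D occupancy grid can be seen in a small Python sketch (illustrative grid sizes and class ids, not from any dataset): collapsing the vertical axis discards exactly the height structure that occupancy perception retains.

# Minimal sketch: contrasting a BEV grid with a 3D occupancy grid.
# Grid sizes, class ids, and scene content are made up for illustration.
import numpy as np

FREE, ROAD, CAR, BARRIER = 0, 1, 2, 3

# 3D semantic occupancy grid: X x Y x Z voxels, one class id per voxel.
occ = np.full((200, 200, 16), FREE, dtype=np.uint8)
occ[:, :, 0] = ROAD                    # ground plane
occ[90:110, 90:120, 1:4] = CAR         # a car occupying a limited height
occ[50:52, 20:180, 1:10] = BARRIER     # a tall barrier next to the road

def to_bev(grid):
    """Collapse the vertical axis: each BEV cell keeps only one label
    (here, the highest non-free voxel), so vertical structure is lost."""
    bev = np.zeros(grid.shape[:2], dtype=np.uint8)
    for z in range(grid.shape[2]):     # higher slices overwrite lower ones
        layer = grid[:, :, z]
        bev = np.where(layer != FREE, layer, bev)
    return bev

bev = to_bev(occ)
print("3D grid voxels:", occ.size, "BEV cells:", bev.size)   # 640000 vs 40000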

In both academia and industry, occupancy perception is significant for the holistic understanding of 3D scenes. From an academic standpoint, estimating dense voxel occupancy of the real 3D world from complex input formats, including multiple sensors, modalities, and time series, is challenging. In addition, further reasoning about the semantic categories, textual descriptions, and motion states of occupied voxels is valuable for a more comprehensive understanding of the environment. From an industrial perspective, deploying a LiDAR kit on every autonomous vehicle is expensive. With cameras as an inexpensive alternative to LiDAR, vision-centric occupancy perception is a cost-effective solution that can reduce vehicle manufacturing costs.


2.2. Motivation for information fusion research

The essence of occupancy perception is to understand the complete and dense 3D scene, including occluded areas. However, observations from a single sensor, such as a 2D image or a point cloud, capture only part of the scene. Figure 1 visually illustrates that a single image or point cloud cannot provide a 3D panorama or a dense scan of the environment, resulting in insufficient scene perception. To this end, fusing information from multiple sensors and multiple frames facilitates comprehensive occupancy perception: on the one hand, information fusion expands the range of spatial perception; on the other hand, it densifies observations of the scene. Moreover, for occluded areas, it is beneficial to integrate multi-frame observations, as the same scene is observed from many viewpoints, providing sufficient scene features for occlusion inference.
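As a minimal sketch of multi-frame fusion (assuming known 4 x 4 ego poses in a shared world frame and an illustrative voxel grid; the names and sizes are not from any specific method), successive LiDAR sweeps can be transformed into the current ego frame and accumulated into one occupancy grid, densifying the observation:

# Minimal multi-frame fusion sketch; all constants are illustrative.
import numpy as np

VOXEL = 0.4                              # voxel size in meters
RANGE = np.array([-40.0, -40.0, -1.0])   # grid origin (x, y, z)
DIMS  = (200, 200, 16)                   # grid resolution

def transform(points, pose):
    """Map an N x 3 point cloud through a 4 x 4 pose matrix."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    return (homog @ pose.T)[:, :3]

def fuse_frames(frames, poses, current_pose):
    """Accumulate multi-frame LiDAR sweeps into one binary occupancy grid
    expressed in the current ego frame."""
    grid = np.zeros(DIMS, dtype=bool)
    world_to_ego = np.linalg.inv(current_pose)
    for pts, pose in zip(frames, poses):
        pts_ego = transform(transform(pts, pose), world_to_ego)
        idx = np.floor((pts_ego - RANGE) / VOXEL).astype(int)
        keep = np.all((idx >= 0) & (idx < np.array(DIMS)), axis=1)
        grid[tuple(idx[keep].T)] = True
    return grid

# Two synthetic sweeps observed from slightly different ego positions.
rng = np.random.default_rng(0)
frames = [rng.uniform(-30, 30, size=(5000, 3)) for _ in range(2)]
poses = [np.eye(4), np.eye(4)]
poses[1][0, 3] = 2.0                     # ego moved 2 m between frames
fused = fuse_frames(frames, poses, poses[1])
print("occupied voxels after fusion:", fused.sum())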

In addition, in dynamic outdoor scenarios, autonomous vehicles must navigate complex environments under different lighting and weather conditions, so stable occupancy perception is crucial. Perception robustness is essential to ensure driving safety and efficiency. Here, the study of multimodal fusion promotes robust occupancy perception by combining the advantages of different data modalities. For example, LiDAR and radar data are not affected by changes in illumination and can perceive the precise depth of a scene; this ability is especially important when driving at night or when shadows and glare obscure critical information. Camera data excel at capturing detailed visual textures and are adept at identifying long-range and color-based environmental elements such as road signs, traffic lights, and lane markings. The fusion of data from these modalities presents a holistic view of the environment while resisting adverse environmental changes.

2.3. Contributions

On related perception topics, 3D semantic segmentation and 3D object detection have been extensively reviewed. However, these tasks do not facilitate a dense understanding of the environment. BEV perception, which addresses this problem, has also been thoroughly reviewed. Our survey focuses on 3D occupancy perception, which captures the environmental height information that BEV perception ignores. Roldao et al. conducted a literature review on 3D scene completion for both indoor and outdoor scenes, which is closely related to our focus. Unlike their work, our survey is tailored specifically to autonomous driving scenarios. In addition, in view of the multi-source nature of 3D occupancy perception, we conduct an in-depth analysis of information fusion techniques in this field. The main contributions of this survey are threefold:

• We systematically review the latest research on 3D occupancy perception in the field of autonomous driving, covering the overall research background, a comprehensive analysis of its significance, and an in-depth discussion of related techniques.

• We provide a taxonomy of 3D occupancy perception and detail the core methodological issues, including network pipelines, multi-source information fusion, and effective network training.

• We evaluate 3D occupancy perception methods and provide a detailed performance comparison. In addition, current limitations and future research directions are discussed.


Recent autonomous driving occupancy perception methods and their characteristics are detailed in Table 1. The table lists where each method was published, its input modalities, network design, target tasks, network training and evaluation, and open-source status. Below, we divide occupancy perception methods into three categories based on the modality of the input data: LiDAR-centric occupancy perception, vision-centric occupancy perception, and multimodal occupancy perception. Subsequently, the training of occupancy networks and their loss functions are discussed. Finally, a variety of downstream applications built on occupancy perception are introduced.


4.1. Perceptual Accuracy

SemanticKITTI is the first dataset with 3D occupancy labels for outdoor driving scenarios. Occ3D-nuScenes is the dataset used in the CVPR 2023 3D Occupancy Prediction Challenge. These two datasets are currently the most popular. Therefore, we summarize the performance of various 3D occupancy methods trained and tested on these datasets, as reported in Tables 3 and 4. The tables further organize the occupancy methods by input modality and type of supervised learning, with the best performance highlighted in bold. Table 3 evaluates 3D geometric and 3D semantic occupancy perception using the IoU and mIoU metrics. Table 4 evaluates semantic occupancy perception using mIoU and mIoU*. Unlike mIoU, the mIoU* metric excludes the "others" and "other flat" classes and is used by the self-supervised OccNeRF. For fairness, we also calculate mIoU* for the other self-supervised occupancy methods. It is worth noting that the OccScore metric is used in the CVPR 2024 Autonomous Grand Challenge, but it is not yet widespread; therefore, we do not use this metric to summarize occupancy performance. Next, we compare perception accuracy from three aspects: overall comparison, modality comparison, and supervision comparison.
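To make the metrics concrete, the sketch below (Python, with toy voxel grids rather than real predictions) computes per-class IoU from label volumes and then mIoU and mIoU*; the six classes and the indices assumed for "others" and "other flat" are only for the example.

# Illustrative metric computation; grids and class count are toy values.
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Voxel-wise IoU for each semantic class from two integer label grids."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else np.nan)
    return np.array(ious)

def mean_iou(ious, exclude=()):
    """mIoU is an unweighted average over classes: class frequency plays no
    role, so rare small classes with low IoU pull the score down sharply.
    mIoU* additionally excludes the given class indices."""
    keep = [i for i, v in enumerate(ious) if i not in exclude and not np.isnan(v)]
    return float(np.mean(ious[keep]))

# Toy prediction/ground-truth grids with 6 classes (0 = "others", 4 = "other flat").
rng = np.random.default_rng(0)
gt = rng.integers(0, 6, size=(50, 50, 4))
pred = np.where(rng.random(gt.shape) < 0.7, gt, rng.integers(0, 6, size=gt.shape))

ious = iou_per_class(pred, gt, num_classes=6)
print("per-class IoU:", np.round(ious, 3))
print("mIoU :", round(mean_iou(ious), 3))
print("mIoU*:", round(mean_iou(ious, exclude={0, 4}), 3))   # drop "others", "other flat"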


(1) Overall comparison. Table 3 shows that: (i) the IoU scores of occupancy networks are below 50%, while the mIoU scores are below 30%. The IoU score (which reflects geometric perception, i.e., ignoring semantics) far exceeds the mIoU score. This is because predicting occupancy is challenging for some semantic categories, such as bicycles, motorcycles, persons, bicyclists, motorcyclists, poles, and traffic signs. Each of these categories accounts for a small proportion of the dataset (less than 0.3%), and their small physical size makes them difficult to observe and detect. Low IoU scores on these categories therefore drag down the overall mIoU significantly, because the mIoU calculation does not account for class frequency: it simply averages the per-class IoU scores over the number of classes. (ii) A higher IoU does not guarantee a higher mIoU. One possible explanation is that the semantic perception (reflected in mIoU) and geometric perception (reflected in IoU) abilities of an occupancy network are distinct and not positively correlated. Table 4 shows that: (i) on Occ3D-nuScenes, the mIoU scores of occupancy networks are within 50%, higher than their scores on SemanticKITTI. For example, TPVFormer has an mIoU of 11.26% on SemanticKITTI but 27.83% on Occ3D-nuScenes; the same holds for OccFormer and SurroundOcc. We think this may be due to the more accurate occupancy labels in Occ3D-nuScenes. SemanticKITTI labels each voxel based on the LiDAR point cloud, i.e., a label is assigned to a voxel by majority vote over all labeled points within it. In contrast, Occ3D-nuScenes uses a more elaborate label generation process that includes voxel densification, occlusion reasoning, and image-guided voxel refinement, producing more precise and denser 3D occupancy labels. (ii) COTR achieves the highest IoU scores across all categories.

(2) Modality comparison. The input data modality significantly affects the accuracy of 3D occupancy perception. The "Mod." column of Table 3 reports the input modality of each occupancy method. It can be seen that LiDAR-centric occupancy methods achieve higher IoU and mIoU scores, because LiDAR provides accurate depth information. For example, S3CNet has the highest mIoU (29.53%), while DIFs achieves the highest IoU (58.90%). We observe that the multimodal approaches in Table 3 do not outperform S3CNet and DIFs, suggesting that they do not take full advantage of the richness of multimodal fusion and input data; there is still much room for improvement in multimodal occupancy perception. In addition, although vision-centric occupancy perception has developed rapidly in recent years, Table 3 shows that there is still a gap in IoU and mIoU between state-of-the-art vision-centric and LiDAR-centric occupancy methods. We believe further improvement of depth estimation in vision-centric methods is necessary.

(3) Supervision comparison. The "Sup." column of Table 4 gives an overview of the types of supervised learning used to train occupancy networks. Strongly supervised training, which directly uses 3D occupancy labels, is the most prevalent type. Table 4 shows that occupancy networks trained with strong supervision achieve impressive performance: the mIoU scores of FastOcc, FB-Occ, PanoOcc, and COTR are significantly higher (by 12.42%-38.24% mIoU) than those of weakly supervised or self-supervised methods. This is because the occupancy labels provided by the dataset are carefully annotated, have high accuracy, and impose strong constraints on network training. However, annotating these dense occupancy labels is time-consuming and laborious, so it is necessary to explore weakly supervised or self-supervised training to reduce the dependence on occupancy labels. Vampire is the best-performing method based on weakly supervised learning, achieving an mIoU score of 28.33%; it shows that semantic LiDAR point clouds can supervise the training of 3D occupancy networks. However, collecting and annotating semantic LiDAR point clouds is expensive. SelfOcc and OccNeRF are two representative works based on self-supervised learning. They use volume rendering and photometric consistency to obtain self-supervision signals, proving that a network can learn 3D occupancy perception without any labels. However, their performance is still limited, with SelfOcc achieving 7.97% mIoU and OccNeRF achieving 10.81% mIoU*.
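As a rough illustration of the volume-rendering-based self-supervision mentioned above (a simplified sketch, not the actual SelfOcc or OccNeRF code), the snippet below renders an expected depth per camera ray from a predicted density field; in the real methods this rendered depth drives photometric reprojection losses between neighboring frames. The tensor shapes and the placeholder target are assumptions for the example.

# Simplified volume-rendering depth used as a self-supervision signal.
import torch
import torch.nn.functional as F

def render_depth(sigmas, t_vals):
    """Expected ray depth from per-sample densities via volume rendering.
    sigmas: (num_rays, num_samples) non-negative densities
    t_vals: (num_samples,) sample distances along each ray."""
    deltas = torch.diff(t_vals, append=t_vals[-1:] + 1e10)          # sample spacing
    alpha = 1.0 - torch.exp(-sigmas * deltas)                       # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                         # rendering weights
    return (weights * t_vals).sum(dim=1)                            # expected depth per ray

# Densities predicted along 1024 camera rays (toy values).
sigmas = torch.rand(1024, 64, requires_grad=True)
t_vals = torch.linspace(0.5, 50.0, 64)
depth = render_depth(sigmas, t_vals)

# In practice the rendered depth warps a neighboring image into the current
# view, and the photometric (e.g., L1/SSIM) difference is the training loss.
# A constant placeholder target stands in for that reprojection error here.
loss = F.l1_loss(depth, torch.full_like(depth, 10.0))
loss.backward()
print("photometric-style loss:", float(loss))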


4.2. Inference Speed

Recent research on 3D occupancy perception has begun to consider not only perception accuracy but also inference speed. Based on the data provided by FastOcc and FullySparse, we collate the inference speeds of 3D occupancy methods and report their running platforms, input image sizes, backbone architectures, and occupancy accuracy on the Occ3D-nuScenes dataset, as shown in Table 5. A practical occupancy method should have both high accuracy (mIoU) and fast inference (FPS). As can be seen from Table 5, FastOcc achieves a high mIoU (40.75%), comparable to that of BEVFormer. Notably, FastOcc attains a higher FPS on a lower-performance GPU platform than BEVFormer. In addition, after acceleration with TensorRT [132], the inference speed of FastOcc reaches 12.8 Hz.
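For readers reproducing FPS numbers like those in Table 5, a common measurement pattern (illustrative only; the individual papers' benchmarking scripts may differ) is to average latency over repeated forward passes after a warm-up, synchronizing the GPU around the timed region:

# Illustrative FPS measurement; the model and input are stand-ins.
import time
import torch

@torch.no_grad()
def measure_fps(model, sample, warmup=10, iters=50):
    device = next(model.parameters()).device
    for _ in range(warmup):                      # warm-up excludes startup cost
        model(sample)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(sample)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

# Stand-in model and input (a real occupancy network takes multi-view images).
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU())
print(f"{measure_fps(model, torch.rand(1, 3, 320, 800)):.1f} FPS")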


5.1. Occupancy-based applications in autonomous driving

3D occupancy perception enables a comprehensive understanding of the 3D world and supports a wide range of tasks in autonomous driving. Existing occupancy-based applications include segmentation, detection, flow prediction, and planning.

(1) Segmentation: semantic occupancy perception can essentially be regarded as a 3D semantic segmentation task. (2) Detection: OccupancyM3D and SOGDet are two occupancy-based works for 3D object detection. OccupancyM3D first learns occupancy to enhance 3D features, which are then used for 3D detection. SOGDet develops two parallel tasks, semantic occupancy prediction and 3D object detection, and trains them jointly so that they enhance each other. (3) Flow prediction: Cam4DOcc predicts foreground flow in 3D space from an occupancy perspective, enabling an understanding of how the surrounding 3D environment changes. (4) Planning: OccNet quantizes the physical 3D scene into semantic occupancy and trains a shared occupancy descriptor. This descriptor is fed into various task heads to accomplish driving tasks; for example, the motion planning head outputs the planned trajectory of the ego vehicle. However, existing occupancy-based applications mainly focus on the perception level and less on the decision-making level. Given that 3D occupancy is more consistent with the 3D physical world than other perception modes (e.g., bird's-eye view and perspective view perception), we believe 3D occupancy offers broader application opportunities in autonomous driving. At the perception level, it can improve the accuracy of existing trajectory prediction, 3D object tracking, and 3D lane detection. At the decision-making level, it can support safer driving decisions and provide 3D explainability for driving behavior.

5.2. Deployment Efficiency

For complex 3D scenes, large amounts of point cloud data or multi-view visual information must be processed and analyzed to extract and update occupancy state information. To achieve real-time performance for autonomous driving applications, solutions must complete their computation within a limited time budget and require efficient data structures and algorithm designs. Overall, deploying deep learning algorithms on target edge devices is not an easy task.

At present, some attempts have been made at real-time occupancy perception. For example, FastOcc accelerates inference by adjusting the input resolution, the view transformation module, and the prediction head. SparseOcc is a sparse occupancy network without any dense 3D features, which minimizes computational cost through sparse convolutional layers and mask-guided sparse sampling. Tang et al. propose adopting a sparse latent representation instead of the TPV representation, together with sparse interpolation operations, to avoid information loss and reduce computational complexity. However, the above approaches are still some way from real-time deployment in autonomous driving systems.
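The sparsity argument can be illustrated with a small sketch (an illustration of the general idea, not the SparseOcc implementation): since most voxels in a driving scene are free, storing only the occupied ones as coordinates plus labels removes the need for dense 3D feature volumes.

# Illustration of sparse (COO-style) voxel storage; the scene is synthetic.
import numpy as np

dense = np.zeros((200, 200, 16), dtype=np.uint8)         # dense semantic grid
dense[95:105, 95:105, 1:3] = 2                            # a single car

coords = np.argwhere(dense > 0)                           # (N, 3) occupied voxel indices
labels = dense[tuple(coords.T)]                           # (N,) their class ids

print("dense cells:", dense.size)                         # 640,000
print("sparse entries:", len(coords))                     # 200, i.e. ~3000x fewer
# Downstream layers (e.g., sparse convolutions or sampled queries) then
# operate only on these coordinates instead of the full voxel volume.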

5.3. Robust 3D occupancy perception

In dynamic and unpredictable real-world driving environments, the robustness of perception is critical to the safety of autonomous vehicles. State-of-the-art 3D occupancy models can be vulnerable to out-of-distribution scenes and data (e.g., changes in lighting and weather) that introduce visual biases, as well as to input image blur caused by vehicle motion. In addition, sensor failures (e.g., frame loss and missing camera views) are common. Given these challenges, studying robust 3D occupancy perception is of significant value.

However, research on robust 3D occupancy perception is limited, mainly due to the scarcity of suitable datasets. Recently, the ICRA 2024 RoboDrive Challenge provided imperfect scenarios for studying robust 3D occupancy perception. We believe that work on robust bird's-eye view perception may inspire research into robust occupancy perception. M-BEV proposes randomly masking and reconstructing camera views to enhance robustness under various camera-missing situations. GKT utilizes coarse projection to achieve a robust bird's-eye view representation. In most scenarios involving natural corruptions, multimodal models outperform unimodal models thanks to the complementarity of multimodal inputs. In addition, in 3D LiDAR perception, Robo3D enhances the robustness of a student model by transferring knowledge from a teacher model with complete point clouds to the student model with imperfect inputs. Building on these efforts, future robust 3D occupancy perception may involve, but is not limited to, robust data representations, multimodality, network architectures, and learning strategies.
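As a sketch of the camera-masking idea attributed to M-BEV above (a simplified illustration under assumed tensor shapes, not the authors' code), training-time view dropout can be implemented by zeroing out randomly chosen camera views so the model learns to tolerate missing-camera failures:

# Simplified camera-view dropout for robustness training; shapes are assumptions.
import torch

def mask_random_views(images, drop_prob=0.2):
    """images: (batch, num_cams, 3, H, W). Zero out each camera view with
    probability drop_prob, while always keeping at least one view."""
    b, n = images.shape[:2]
    keep = (torch.rand(b, n) > drop_prob)
    keep[torch.arange(b), torch.randint(n, (b,))] = True      # never drop all views
    return images * keep[:, :, None, None, None].float()

imgs = torch.rand(2, 6, 3, 224, 400)       # 6 surround-view cameras per sample
augmented = mask_random_views(imgs)
print("views kept per sample:", (augmented.abs().sum(dim=(2, 3, 4)) > 0).sum(dim=1))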

5.4. Generalized 3D occupancy perception

3D labels are expensive, and large-scale 3D annotation of the real world is impractical. At present, the generalization ability of existing networks trained on limited 3D-labeled datasets has not been extensively studied. To get rid of the dependence on 3D labels, self-supervised learning represents a potential path toward generalized 3D occupancy perception: it learns occupancy perception from large amounts of unlabeled images. However, the current performance of self-supervised occupancy perception is poor. On the Occ3D-nuScenes dataset, the highest accuracy of self-supervised methods is far lower than that of strongly supervised methods. Moreover, current self-supervised methods require more data for training and evaluation. Therefore, improving self-supervised generalized 3D occupancy perception is an important future research direction.

In addition, current 3D occupancy perception can only recognize a set of predefined object categories, which limits its generalization ability and practicality. Recent advances in large language models (LLMs) and large vision-language models (LVLMs) have shown promising abilities in reasoning and visual understanding. Integrating these pre-trained large models has been shown to enhance the generalization of perception. POP-3D leverages a powerful pre-trained vision-language model to train its network and achieves open-vocabulary 3D occupancy perception. Therefore, we regard the adoption of LLMs and LVLMs as both a challenge and an opportunity for achieving generalized 3D occupancy perception.

This paper presents a comprehensive survey of 3D occupancy perception for autonomous driving in recent years. We review and discuss in detail the state-of-the-art LiDAR-centric, vision-centric, and multimodal perception solutions, and highlight information fusion techniques in this field. To facilitate further research, a detailed performance comparison of existing occupancy methods is provided. Finally, we describe several open challenges that may inspire research directions in the coming years. We hope this survey benefits the community, supports the further development of autonomous driving, and helps lay readers understand the field.

Readers interested in more experimental results and details can refer to the original paper.


