
Understanding Multimodal Sensor Fusion for Autonomous Driving in One Article


Multimodal fusion is a fundamental task for autonomous driving perception systems and has recently attracted the interest of many researchers. However, because of noisy raw data, low information utilization, and misalignment between multimodal sensors, achieving consistently good performance is not easy. This article reviews the literature on multimodal methods for autonomous driving perception tasks, analyzing more than 50 papers that use cameras and lidar to tackle object detection and semantic segmentation. Unlike the traditional way of classifying fusion models, the authors divide them into two major categories and four sub-categories through a more reasonable taxonomy based on the fusion stage. In addition, current fusion methods are examined to highlight potential research opportunities.

Recently, multimodal fusion methods for autonomous driving perception tasks have evolved rapidly, from cross-modal feature representations and more reliable modal sensors to more complex and robust multimodal fusion deep learning models and techniques. However, only a few literature reviews focus on the methodology of multimodal fusion itself, and most follow the traditional rule of dividing it into three categories: pre-fusion, deep (feature) fusion, and post-fusion, focusing on the stage at which features are fused in the deep learning model, whether data level, feature level, or proposal level. This taxonomy has two problems. First, it does not explicitly define the feature representation at each level. Second, it implicitly assumes that the lidar and camera branches are always symmetric during processing, which obscures cases where proposal-level features are fused in the lidar branch while data-level features are fused in the camera branch. In summary, the traditional taxonomy may be intuitive, but it lags behind the growing number of recently emerged multimodal fusion methods, making it hard for researchers to study and analyze them from a systematic perspective.

The figure shows a schematic diagram of the autonomous driving perception task:

[Figure: schematic of autonomous driving perception tasks]

Deep learning models are constrained by the representation of their inputs. Before the raw data is fed into a model, it must be preprocessed by a suitable feature extractor.

For the image branch, most existing methods keep the input to downstream modules in the same format as the raw data. The lidar branch, however, depends heavily on the chosen data format: each format emphasizes different characteristics and has a huge impact on downstream model design. Point cloud formats can therefore be summarized as point-based, voxel-based, and 2D-mapping representations, each suited to a different family of deep learning models.

Data-level fusion, or pre-fusion, directly fuses raw sensor data from different modalities through spatial alignment. Feature-level fusion, or deep fusion, mixes cross-modal data in feature space through concatenation or element-wise multiplication. Object-level fusion, or post-fusion, combines the predictions of each modality's model to make the final decision.
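To make the fusion operations concrete, here is a minimal PyTorch sketch of feature-level fusion by concatenation or element-wise multiplication. The `FeatureLevelFusion` module, channel sizes, and tensor shapes are illustrative assumptions, not taken from any of the surveyed papers.

```python
# Minimal sketch: feature-level fusion of a camera feature map and a lidar BEV
# feature map by concatenation or element-wise multiplication.
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    def __init__(self, cam_channels=64, lidar_channels=64, out_channels=128, mode="concat"):
        super().__init__()
        self.mode = mode
        in_channels = cam_channels + lidar_channels if mode == "concat" else cam_channels
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, cam_feat, lidar_feat):
        # cam_feat, lidar_feat: (B, C, H, W), assumed to be spatially aligned beforehand
        if self.mode == "concat":
            fused = torch.cat([cam_feat, lidar_feat], dim=1)
        else:  # element-wise multiplication; requires equal channel counts
            fused = cam_feat * lidar_feat
        return self.proj(fused)

# usage with toy tensors
cam = torch.randn(2, 64, 200, 176)
lidar = torch.randn(2, 64, 200, 176)
fused = FeatureLevelFusion()(cam, lidar)  # (2, 128, 200, 176)
```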

A new taxonomy divides all fusion methods into strong fusion and weak fusion; the relationship between the two is shown in the figure:

[Figure: relationship between strong fusion and weak fusion]

For performance comparison, the KITTI benchmark is used for 3D object detection and bird's-eye-view (BEV) object detection. The following two tables show the results of multimodal fusion methods on the KITTI BEV and 3D test sets, respectively.

[Table: results of multimodal fusion methods on the KITTI BEV test set]
[Table: results of multimodal fusion methods on the KITTI 3D test set]

According to the stage at which the lidar and camera data representations are combined, strong fusion is further divided into four categories: pre-fusion, deep fusion, post-fusion, and asymmetric fusion. As the most-studied family of fusion methods, strong fusion has produced many outstanding results in recent years.

As shown in the figure, each subclass of strong fusion depends heavily on lidar point clouds rather than camera data.

[Figure: the four subclasses of strong fusion]

Pre-fusion. Data-level fusion directly fuses the data of each modality at the raw-data level through spatial alignment and projection. Pre-fusion differs slightly: lidar data is fused at the data level, while camera data may be fused at the data level or the feature level. An example is shown in the figure:

[Figure: example of pre-fusion]

In the lidar branch, point clouds can take the form of reflection maps, voxelized tensors, front/range/bird's-eye views, or pseudo point clouds. Although these representations have different intrinsic characteristics and are closely tied to the choice of lidar backbone, most of them are generated through rule-based processing, with the exception of pseudo point clouds. In addition, the data at this stage is still interpretable compared with feature-space embeddings, so all of these lidar representations can be visualized directly.
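As an illustration of one such rule-based 2D mapping, here is a minimal NumPy sketch that rasterizes a raw point cloud into a BEV occupancy/height grid. The ranges, resolution, and the `points_to_bev` helper are assumptions for the example, not part of any specific method.

```python
# Minimal sketch: convert a raw lidar point cloud (N, 4) = (x, y, z, intensity)
# into a bird's-eye-view grid with an occupancy channel and a max-height channel.
import numpy as np

def points_to_bev(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), res=0.1):
    h = int(round((y_range[1] - y_range[0]) / res))  # rows (y axis)
    w = int(round((x_range[1] - x_range[0]) / res))  # cols (x axis)
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[keep]
    rows = ((pts[:, 1] - y_range[0]) / res).astype(np.int64)
    cols = ((pts[:, 0] - x_range[0]) / res).astype(np.int64)
    bev = np.zeros((2, h, w), dtype=np.float32)
    bev[0, rows, cols] = 1.0                          # occupancy
    np.maximum.at(bev[1], (rows, cols), pts[:, 2])    # max height (clipped at 0 for simplicity)
    return bev

# usage with a KITTI-style velodyne file (path is an assumed example)
# bev = points_to_bev(np.fromfile("velodyne/000000.bin", dtype=np.float32).reshape(-1, 4))
```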

For the image branch, a strict data-level definition would include only RGB or grayscale data, which lacks generality and rationality. Compared with the traditional definition of pre-fusion, camera data is therefore relaxed here to include both data-level and feature-level data. In particular, image semantic segmentation results that benefit 3D object detection are treated as feature-level data, because these "object-level" features differ from the final object-level proposals of the whole task.

Deep fusion. Deep-fusion methods fuse cross-modal data at the feature level in the lidar branch, but at the data level or feature level in the image branch. For example, some methods use feature extractors to obtain embedded representations of the lidar point cloud and the camera image separately, then fuse the features of the two modalities through a series of downstream modules. Unlike other strong-fusion methods, however, deep fusion sometimes fuses features in a cascaded manner, exploiting both raw and high-level semantic information. An example of deep fusion is shown in the figure:

[Figure: example of deep fusion]
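A minimal sketch of this two-branch pattern is shown below, assuming both inputs have already been brought onto the same spatial grid. The tiny backbones, channel counts, and head layout are illustrative and do not correspond to any specific published model.

```python
# Minimal sketch: deep fusion with one backbone per modality, feature-level fusion
# by concatenation, and a shared downstream detection head.
import torch
import torch.nn as nn

def tiny_backbone(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU())

class DeepFusionDetector(nn.Module):
    def __init__(self, num_anchors=2, num_classes=3):
        super().__init__()
        self.img_backbone = tiny_backbone(3, 64)   # camera branch
        self.bev_backbone = tiny_backbone(2, 64)   # lidar branch (BEV tensor input)
        self.fuse = nn.Conv2d(128, 128, 1)         # concatenation + 1x1 conv
        self.cls_head = nn.Conv2d(128, num_anchors * num_classes, 1)
        self.reg_head = nn.Conv2d(128, num_anchors * 7, 1)  # 7-DoF box per anchor

    def forward(self, img, bev):
        # img and bev are assumed to be resampled onto the same H x W grid beforehand
        f_img = self.img_backbone(img)
        f_bev = self.bev_backbone(bev)
        fused = self.fuse(torch.cat([f_img, f_bev], dim=1))
        return self.cls_head(fused), self.reg_head(fused)

# usage with toy, pre-aligned 200 x 176 grids
det = DeepFusionDetector()
cls_out, reg_out = det(torch.randn(1, 3, 200, 176), torch.randn(1, 2, 200, 176))
```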

Post-fusion. Post-fusion, also known as object-level fusion, fuses the results of each modality's pipeline. For example, some post-fusion methods use the outputs of the lidar point-cloud branch and the camera image branch and make the final prediction from the results of both modalities. Note that the proposals from the two branches should have the same data format as the final result, although they may differ in quality, quantity, and precision. Post-fusion can be viewed as an ensemble method that uses multimodal information to optimize the final proposals. The figure shows an example of post-fusion:

[Figure: example of post-fusion]
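The sketch below shows one simple, generic way such an ensemble could be realized: re-scoring the lidar branch's 3D detections with the camera branch's 2D detections. The `fuse_detections` helper, the containment test, and the weighting scheme are assumptions for illustration, not a specific published method; the projected centers are assumed to have been computed beforehand from the calibration matrices.

```python
# Minimal sketch: object-level (post-) fusion by score re-weighting. Lidar detections
# whose projected center falls inside a camera 2D box are boosted; unsupported ones
# are down-weighted.
import numpy as np

def inside(px, box2d):
    x1, y1, x2, y2 = box2d
    return (x1 <= px[0] <= x2) and (y1 <= px[1] <= y2)

def fuse_detections(boxes3d, scores3d, centers_px, boxes2d, scores2d, alpha=0.5):
    """centers_px: (N, 2) projected pixel centers of the N lidar detections."""
    fused_scores = scores3d.copy()
    for i, px in enumerate(centers_px):
        support = [s for b, s in zip(boxes2d, scores2d) if inside(px, b)]
        if support:
            fused_scores[i] = (1 - alpha) * scores3d[i] + alpha * max(support)
        else:
            fused_scores[i] *= 0.5  # no camera evidence for this detection
    return boxes3d, fused_scores
```

In practice, published post-fusion methods usually learn this combination from data rather than hand-tuning a weight like `alpha`.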

Asymmetric fusion. Besides pre-fusion, deep fusion, and post-fusion, some methods treat the cross-modal branches with different privileges; fusing object-level information from one branch with data-level or feature-level information from the other branch is defined as asymmetric fusion. While the other strong-fusion methods treat the two branches as seemingly equal, asymmetric fusion has at least one dominant branch, with the other branches providing auxiliary information for the final task. As shown in the figure, asymmetric fusion may appear similar to post-fusion in that both extract features from proposals, but asymmetric fusion has proposals from only one branch, whereas post-fusion has proposals from all branches.

[Figure: example of asymmetric fusion]

Unlike strong fusion, weak-fusion methods do not fuse data, features, or objects directly from the multimodal branches, but operate on the data in other ways. Weak-fusion approaches typically use a rule-based method that takes the data of one modality as a supervisory signal to guide the interaction with the other modality. The figure shows the basic framework of the weak-fusion pattern:

[Figure: basic framework of weak fusion]

For example, a 2D proposal from a CNN in the image branch may induce a frustum in the raw lidar point cloud. However, unlike asymmetric fusion, which combines image features, weak fusion feeds the selected raw lidar points directly into the lidar backbone to output the final proposal.
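A minimal sketch of this frustum selection is shown below, assuming a KITTI-style 3x4 projection matrix and points already expressed in the camera frame; the `frustum_points` helper is hypothetical.

```python
# Minimal sketch: weak fusion via frustum selection. A 2D box from the image branch
# selects the raw lidar points whose image projection falls inside the box; only
# those points are fed to the lidar backbone.
import numpy as np

def frustum_points(points, box2d, P):
    """points: (N, 3) lidar points in the camera frame; P: (3, 4) projection matrix."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4)
    uvw = homo @ P.T                                           # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]                              # pixel coordinates
    x1, y1, x2, y2 = box2d
    in_front = uvw[:, 2] > 0
    in_box = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    return points[in_front & in_box]   # frustum points passed on to the lidar network
```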

Some works cannot be assigned to any single one of the above fusion types because they employ multiple fusion methods within one model framework, such as combining deep fusion with post-fusion, or pre-fusion with deep fusion. From a model-design point of view these approaches contain redundancy, and they are not the mainstream of fusion module design.

Below is an analysis of the problems that remain to be solved.

Current fusion models suffer from misalignment and information loss. In addition, flat fusion operations prevent further improvements in perception-task performance. To summarize:

Misalignment and information loss: Traditional pre-fusion and deep-fusion methods use the extrinsic calibration matrix to project all lidar points directly onto the corresponding pixels, and vice versa. However, because of sensor noise, this pixel-by-pixel alignment is not precise enough; taking the surrounding information as a supplement therefore yields better performance. In addition, other information is lost during the conversion between input and feature spaces. Typically, dimensionality-reducing projections inevitably cause a large loss of information, for example when a 3D lidar point cloud is mapped to a 2D BEV image. Mapping both modalities to another high-dimensional representation designed specifically for fusion can make more effective use of the raw data and reduce information loss.
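One hedged illustration of "taking the surrounding information as a supplement": instead of sampling a single pixel per projected lidar point, gather a small neighborhood of image features around each projection. The `gather_neighborhood_features` helper is an assumption for this sketch, and `uv` is assumed to have already been computed from the calibration matrices.

```python
# Minimal sketch: per-point neighborhood gathering to make point-pixel alignment
# more tolerant of calibration and sensor noise.
import torch
import torch.nn.functional as F

def gather_neighborhood_features(feat_map, uv, k=3):
    """feat_map: (C, H, W) image feature map; uv: (N, 2) pixel coords; k must be odd.
    Returns (N, C*k*k) supplementary features per lidar point."""
    C, H, W = feat_map.shape
    # unfold extracts every k x k patch so each pixel location indexes its neighborhood
    patches = F.unfold(feat_map.unsqueeze(0), kernel_size=k, padding=k // 2)  # (1, C*k*k, H*W)
    patches = patches.squeeze(0).t().reshape(H, W, C * k * k)
    u = uv[:, 0].round().long().clamp(0, W - 1)
    v = uv[:, 1].round().long().clamp(0, H - 1)
    return patches[v, u]
```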

More reasonable fusion operations: Simple operations such as concatenation and element-wise multiplication may be unable to fuse data with very different distributions, making it difficult to bridge the semantic gap between the two modalities. Some works therefore try to fuse the data with more elaborate structures to improve performance, as in the sketch below.
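As one example of such a more elaborate structure, the sketch below replaces plain concatenation with cross-attention, letting lidar tokens query the camera tokens instead of assuming a fixed one-to-one correspondence. The module, dimensions, and residual design are assumptions for illustration, not a specific published method.

```python
# Minimal sketch: cross-attention fusion in which flattened lidar BEV cells attend
# over camera feature tokens.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_tokens, image_tokens):
        # lidar_tokens: (B, N, dim); image_tokens: (B, M, dim)
        attended, _ = self.attn(query=lidar_tokens, key=image_tokens, value=image_tokens)
        return self.norm(lidar_tokens + attended)  # residual keeps the lidar stream dominant

# usage with toy token sequences
fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 1000, 128), torch.randn(2, 600, 128))  # (2, 1000, 128)
```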

A front-view single-frame image is the typical scenario for autonomous driving perception tasks. However, most frameworks use limited information and lack carefully designed auxiliary tasks to further understand the driving scene. To summarize:

Adopt more potential information: Existing methods make poor use of information across dimensions and sources; most focus on single-frame multimodal data from the front view. Other meaningful information includes semantic, spatial, and scene-context information. Some models try to use image semantic segmentation results as additional features, while others may exploit features from intermediate layers of the neural network backbone. In an autonomous driving scenario, many downstream tasks with clear semantic information, such as lane detection and semantic segmentation, can greatly improve object detection performance. Future research could therefore improve perception performance by jointly building a complete cognitive framework of the urban scene through a variety of downstream tasks, such as detecting lanes, traffic lights, and signs. In addition, current perception tasks rely mainly on a single frame and ignore temporal information. Recent lidar-based approaches combine a sequence of frames to improve performance; temporal sequences contain serialized supervisory signals that can provide more robust results than single-frame methods.

Self-supervised representation learning: Mutually supervising signals naturally exist in cross-modal data sampled from the same real-world scene but from different viewpoints. However, due to the lack of a deep understanding of the data, current methods cannot yet mine the synergies between the modalities. Future research could focus on how to use multimodal data for self-supervised learning, including pre-training, fine-tuning, and contrastive learning. By adopting these mechanisms, fusion models would deepen their understanding of the data and achieve better results.

Domain bias and resolution mismatch are strongly tied to real-world scenes and sensors. These deficiencies hinder the large-scale training and deployment of deep learning models for autonomous driving.

Domain bias: In autonomous driving perception scenarios, the raw data extracted by different sensors carries domain-related characteristics. Different camera systems have their own optical properties, while lidar sensors range from mechanical to solid-state designs. Moreover, the data itself may be domain-biased, for example by weather, season, or geographic location. As a result, a detection model does not adapt smoothly to new scenarios. These generalization failures hinder the collection of large-scale datasets and the reuse of the original training data.

Resolution conflicts: Sensors of different modalities often have different resolutions. For example, the spatial density of lidar points is significantly lower than that of image pixels. Regardless of the projection method used, some information is lost because no correspondence can be found. This can cause the model to be dominated by the data of one particular modality, whether due to differing feature-map resolutions or an imbalance in raw information.

Reproduced from the public account "Computer Vision, Deep Learning and Autonomous Driving". The views in the text are shared for exchange only and do not represent the position of this account; for copyright or other issues, please let us know and we will handle them promptly.

-- END --
