A 10,000-Word Guide to 3D Visual Perception Algorithms for Autonomous Driving

For autonomous driving, perceiving the 3D scene is ultimately indispensable. The reason is simple: a vehicle cannot drive based on the perception results of a single 2D image, and even a human driver cannot drive from a flat picture alone. The distance and depth of objects and of the scene cannot be recovered from 2D perception results, yet this information is exactly what the autonomous driving system needs in order to judge its surroundings correctly.

In general, the cameras of an autonomous vehicle are mounted on top of the body or behind the interior rearview mirror. Wherever they are placed, what a camera captures is a projection of the real world in a perspective view (a mapping from the world coordinate system to the image coordinate system). This view is very similar to what the human visual system sees and is therefore easy for human drivers to understand. But perspective views have a fatal problem: the scale of an object changes with distance. When the perception system detects an obstacle ahead in the image, it therefore knows neither how far the obstacle is from the vehicle nor its actual 3D shape and size.

Image Coordinate System (Perspective View) vs. World Coordinate System (Bird's Eye View) [IPM-BEV]
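To make these coordinate systems concrete, the snippet below projects 3D points from the vehicle (world) coordinate system into image pixels with a simple pinhole model; the intrinsics and mounting height are made-up placeholders. It also illustrates the scale problem mentioned above: the same 2 m wide object occupies roughly four times fewer pixels at 40 m than at 10 m.

```python
import numpy as np

# Hypothetical camera intrinsics (focal lengths and principal point, in pixels)
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])

# Hypothetical extrinsics: camera axes aligned with the vehicle, mounted 1.5 m above the ground
R = np.eye(3)                      # no rotation between vehicle and camera frames (toy assumption)
t = np.array([0.0, 1.5, 0.0])      # translation from the vehicle frame to the camera frame

def project_to_image(p_world):
    """Project a 3D point (x, y, z) in the vehicle frame to pixel coordinates (u, v)."""
    p_cam = R @ p_world + t        # world -> camera coordinates
    u, v, w = K @ p_cam            # camera -> homogeneous image coordinates
    return u / w, v / w            # perspective division: image scale shrinks with distance

# The same 2 m-wide object at 10 m and at 40 m: its width in the image differs by ~4x.
for depth in (10.0, 40.0):
    left  = project_to_image(np.array([-1.0, 0.0, depth]))
    right = project_to_image(np.array([ 1.0, 0.0, depth]))
    print(f"depth {depth:4.0f} m -> image width {right[0] - left[0]:6.1f} px")
```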

The most straightforward way to obtain 3D information is to use LiDAR. On the one hand, the 3D point cloud output by a LiDAR can be used directly to obtain the distance and size of obstacles (3D object detection) and the depth of the scene (3D semantic segmentation). On the other hand, 3D point clouds can be fused with 2D images to take full advantage of the complementary information the two provide: point clouds offer accurate distance and depth perception, while images offer richer semantic information.

However, LiDAR also has shortcomings, such as high cost, the difficulty of mass-producing automotive-grade devices, and sensitivity to weather. Pure camera-based 3D perception therefore remains a meaningful and valuable research direction. The rest of this article reviews 3D perception algorithms based on single and dual cameras.

Monocular 3D perception

Perceiving the 3D environment from a single camera image is an ill-posed problem, but geometric constraints and prior knowledge can assist the task, and deep neural networks can be trained end-to-end to predict 3D information from image features.

Object detection

Single Camera 3D Object Detection (Image from M3D-RPN)

Inverse image transformation

As mentioned earlier, an image is a projection from 3D world coordinates onto a 2D image plane, so a very direct idea for 3D object detection from images is to invert the 2D image back into 3D world coordinates and then run object detection in the world coordinate system. In theory this is an ill-posed problem, but it can be solved with additional information (such as depth estimation) or geometric assumptions (such as assuming the pixels lie on the ground).

BEV-IPM[1] proposes converting the image from the perspective view to a bird's eye view (BEV). Two assumptions are made: first, that the road surface is parallel to the world coordinate system and lies at height zero; second, that the vehicle's own coordinate system is parallel to the world coordinate system. The former does not hold on non-flat roads, while the latter can be corrected with the vehicle attitude parameters (pitch and roll), which is effectively the calibration between the vehicle coordinate system and the world coordinate system. Assuming that all pixels in the image lie at height zero in the real world, a homography transform can be used to convert the image into a BEV view. In the BEV view, a YOLO-based network detects the Bottom Box of the target, that is, the rectangle in contact with the road surface. Since the Bottom Box has zero height, it can be projected accurately onto the BEV view to serve as ground truth for training the network, and the boxes predicted by the network in turn give accurate distance estimates. The underlying assumption is that the target touches the road surface, which generally holds for vehicles and pedestrians.

BEV-IPM
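For the ground-plane assumption above, the warp from perspective view to BEV is a single homography that can be composed from the camera intrinsics and extrinsics. The sketch below shows one way to build and apply it with OpenCV; all calibration values, the BEV resolution, and the input file name are hypothetical placeholders, not the BEV-IPM setup.

```python
import numpy as np
import cv2

# --- Hypothetical calibration (placeholders, not a real camera) ---
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])
pitch = np.deg2rad(5.0)                      # camera pitched slightly relative to the ground
R = np.array([[1.0, 0.0, 0.0],
              [0.0,  np.cos(pitch), -np.sin(pitch)],
              [0.0,  np.sin(pitch),  np.cos(pitch)]])
t = np.array([0.0, 1.5, 0.0])                # camera 1.5 m above a flat ground plane

# Homography from ground-plane coordinates (X, Z, 1) to image pixels:
# a ground point (X, 0, Z) projects to K @ (X*r1 + Z*r3 + t).
H_ground_to_img = K @ np.column_stack((R[:, 0], R[:, 2], t))

# Map BEV pixels to ground coordinates: 0.05 m per pixel, ego at the bottom center.
bev_w, bev_h, res = 400, 800, 0.05
A = np.array([[res, 0.0, -bev_w / 2 * res],   # u_bev -> lateral X
              [0.0, -res,  bev_h * res],      # v_bev -> forward Z (flipped)
              [0.0, 0.0,   1.0]])

H_bev_to_img = H_ground_to_img @ A
H_img_to_bev = np.linalg.inv(H_bev_to_img)    # maps image pixels to BEV pixels

image = cv2.imread("camera_frame.png")        # placeholder: any perspective-view frame
bev = cv2.warpPerspective(image, H_img_to_bev, (bev_w, bev_h))
```

Anything not on the assumed ground plane (tall objects, non-flat roads) gets distorted by this warp, which is exactly why the method only detects the Bottom Box in the BEV view.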

Another inverse-transform method is the Orthographic Feature Transform (OFT)[2]. The idea is to use a CNN to extract multi-scale image features, transform these features into a BEV view, and then perform 3D object detection on the BEV features. First, a 3D voxel grid is constructed in the BEV view (in the paper's experiments the grid covers 80 m x 80 m x 4 m with a cell size of 0.5 m). Each voxel is projected onto a region of the image via the perspective transform (approximated by a rectangular region for simplicity), and the mean of the image features inside that region is taken as the voxel's feature, yielding 3D voxel features. To reduce computation, the 3D voxel features are then compressed (by weighted averaging) along the height dimension to obtain 2D grid features, on which the final object detection is performed. The projection from 3D voxels to 2D image pixels is not one-to-one: multiple voxels map to adjacent image regions, which causes ambiguity in the voxel features. It is therefore again necessary to assume that the objects to be detected lie on the road surface and span a very narrow height range; this is why the 3D grid in the experiments is only 4 meters high, enough to cover vehicles and pedestrians on the ground. For targets such as traffic signs, however, this close-to-the-ground assumption does not apply.

Orthographic Feature Transform
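The core operation in OFT is projecting each BEV voxel into the image and mean-pooling the image features inside the projected region. Below is a simplified PyTorch sketch of that step under the assumptions stated in the comments (voxels given in camera coordinates, each voxel approximated by the bounding rectangle of its projected corners); it illustrates the idea rather than reproducing the authors' implementation.

```python
import torch

def orthographic_feature_transform(feat, K, grid_xyz, voxel_size):
    """
    feat:       image feature map of shape (C, H, W) at the feature-map scale
    K:          3x3 intrinsics (torch tensor) already scaled to the feature-map resolution
    grid_xyz:   (N, 3) voxel centers in camera coordinates
    voxel_size: edge length of each voxel in meters
    Returns (N, C) voxel features obtained by mean-pooling the image region
    that each voxel projects to (approximated by its 2D bounding rectangle).
    """
    C, H, W = feat.shape
    # Integral image allows constant-time mean pooling over any rectangle.
    integral = feat.cumsum(dim=1).cumsum(dim=2)
    integral = torch.nn.functional.pad(integral, (1, 0, 1, 0))       # (C, H+1, W+1)

    # Project the 8 corners of each voxel and take their 2D bounding box.
    offsets = torch.tensor([[sx, sy, sz] for sx in (-.5, .5)
                                          for sy in (-.5, .5)
                                          for sz in (-.5, .5)])
    corners = grid_xyz[:, None, :] + voxel_size * offsets[None]      # (N, 8, 3)
    uvw = corners @ K.T                                              # (N, 8, 3)
    uv = uvw[..., :2] / uvw[..., 2:].clamp(min=1e-3)                 # perspective division

    u1 = uv[..., 0].min(dim=1).values.clamp(0, W - 1).long()
    u2 = uv[..., 0].max(dim=1).values.clamp(0, W - 1).long() + 1
    v1 = uv[..., 1].min(dim=1).values.clamp(0, H - 1).long()
    v2 = uv[..., 1].max(dim=1).values.clamp(0, H - 1).long() + 1

    area = ((u2 - u1) * (v2 - v1)).clamp(min=1).float()
    pooled = (integral[:, v2, u2] - integral[:, v1, u2]
              - integral[:, v2, u1] + integral[:, v1, u1])           # (C, N)
    return (pooled / area).T                                         # (N, C)

# Illustrative usage with made-up numbers (features at 1/8 resolution, ground at y ~ 1 m):
feat = torch.randn(64, 96, 312)
K = torch.tensor([[100., 0., 156.], [0., 100., 48.], [0., 0., 1.]])
xs, zs = torch.meshgrid(torch.arange(-20, 20, 0.5), torch.arange(1, 41, 0.5), indexing="ij")
grid = torch.stack((xs.flatten(), torch.full_like(xs.flatten(), 1.0), zs.flatten()), dim=1)
voxel_feats = orthographic_feature_transform(feat, K, grid, voxel_size=0.5)
```

The subsequent weighted collapse along the height dimension and the BEV detection head are omitted here.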

Both of the above methods rely on the assumption that objects sit on the ground. Another idea is to use the output of depth estimation to generate pseudo point cloud data; a representative work is Pseudo-LiDAR[3]. The result of depth estimation is usually treated as an additional image channel (similar to RGB-D data), and an image-based detection network is then used to output 3D bounding boxes. The authors point out that the main reason 3D detection based on depth estimation lags far behind LiDAR-based methods is not that the depth estimates are too inaccurate, but that the data representation is problematic. First, in the image representation the area of distant objects is very small, which makes detecting distant objects very inaccurate. Second, the depth difference between neighboring pixels can be very large (for example at object edges), in which case extracting features with convolutions becomes problematic. With these two points in mind, the authors propose converting the input image, based on the depth map, into point cloud data resembling what a LiDAR would generate, and then applying point cloud or point-cloud/image fusion algorithms (such as AVOD and F-PointNet) for 3D object detection. The Pseudo-LiDAR approach does not depend on a specific depth estimation algorithm; any monocular or binocular depth estimate can be used directly. With this special data representation, Pseudo-LiDAR improves the object detection accuracy within a 30-meter range from 22% to 74%.

Pseudo-LiDAR
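The data-representation change at the heart of Pseudo-LiDAR is simply back-projecting every pixel of the estimated depth map into 3D using the camera intrinsics. A minimal NumPy sketch follows; the intrinsics and the synthetic depth map are placeholders, and real usage would feed the resulting points to a detector such as AVOD or F-PointNet.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H, W), in meters, into an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx              # inverse pinhole projection
    y = (v - cy) * z / fy
    points = np.stack((x, y, z), axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]    # drop pixels with invalid (non-positive) depth

# Synthetic example: a flat "wall" 20 m away; a real pipeline would use the output
# of a monocular or stereo depth network instead.
depth = np.full((360, 640), 20.0)
cloud = depth_to_pseudo_lidar(depth, fx=800.0, fy=800.0, cx=320.0, cy=180.0)
print(cloud.shape)                      # (230400, 3)
```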

Compared with real LiDAR point clouds, the Pseudo-LiDAR method still shows a gap in 3D object detection accuracy, mainly because of the limited accuracy of depth estimation (binocular works better than monocular); in particular, depth errors around object boundaries have a large impact on detection. Pseudo-LiDAR has therefore seen many extensions. Pseudo-LiDAR++[4] uses a low line-count LiDAR to enhance the virtual point cloud. Pseudo-LiDAR End2End[5] uses instance segmentation instead of the object boxes in F-PointNet. RefinedMPL[6] generates virtual points only on foreground points, reducing the point count to 10% of the original, which effectively reduces both false detections and the computational cost of the algorithm.

Key points and 3D models

In autonomous driving applications, many of the objects that need to be detected, such as vehicles and pedestrians, have relatively fixed and known sizes and shapes. This prior knowledge can be used to estimate the 3D information of a target.

DeepMANTA[7] is one of the pioneering works in this direction. First, a conventional image object detector such as Faster R-CNN is used to obtain the 2D bounding box and, at the same time, to detect key points on the vehicle. The 2D boxes and key points are then matched against multiple 3D vehicle CAD models in a database, and the most similar model is selected as the 3D detection output.

Deep MANTA

3D-RCNN[8] proposes an inverse-graphics approach that recovers the 3D shape and pose of every target in the scene from the image. The basic idea is to start from a 3D model of the target and find, by parameter search, the model that best matches the target in the image. These 3D models usually have many control parameters and the search space is large, so traditional methods that search for the optimum in a high-dimensional parameter space do not work well. 3D-RCNN uses PCA to reduce the parameter space to 10 dimensions and employs a deep neural network (R-CNN) to predict the low-dimensional model parameters of each target. The predicted parameters can be used to render a 2D image or depth map of each target, and the loss against the ground truth data guides the learning of the network. This is the Render-and-Compare Loss, implemented with OpenGL. 3D-RCNN requires relatively rich input data, and the loss design is fairly complex, which makes it hard to apply in engineering practice.

3D-RCNN

MonoGRNet[9] proposes to divide monocular 3D object detection into four sub-tasks: 2D object detection, predicting the depth of the object's 3D center, predicting the 2D projection of the 3D center, and predicting the 3D positions of the 8 corners. First, the predicted 2D bounding box is processed with ROIAlign to obtain the visual features of the object. These features are then used to predict the depth of the object's 3D center and the 2D projection of that center; with these two pieces of information, the 3D position of the center can be obtained. Finally, the positions of the 8 corners relative to the 3D center are predicted. MonoGRNet can be viewed as using only the object center as the key point, with the 2D/3D matching reduced to a point-distance computation. MonoGRNetV2[10] extends the center point to multiple key points and uses 3D CAD object models for depth estimation, much like DeepMANTA and 3D-RCNN described above.

MonoGRNet

MonoLoco[11] mainly addresses 3D detection of pedestrians. Pedestrians are non-rigid, and their pose and deformation are more varied, making them more challenging than vehicles. MonoLoco is also based on keypoint detection, and a prior on the relative 3D positions of the key points can be used for depth estimation. For example, the distance of a pedestrian can be estimated from the roughly 50 cm length between the shoulders and the hips. This segment is used as the reference because it deforms the least on the human body and therefore yields the most accurate depth estimate. Other key points can of course also serve as auxiliary cues. MonoLoco uses a multi-layer fully connected network to predict the pedestrian's distance from the keypoint locations, while also outputting the uncertainty of the prediction.

MonoLoco
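The geometric prior behind this kind of keypoint-based distance estimate is just the pinhole relation between a known real-world length and its length in pixels. The sketch below uses the 0.50 m shoulder-to-hip length mentioned above; the function, focal length, and keypoint coordinates are illustrative, and MonoLoco itself refines such estimates (and their uncertainty) with a fully connected network over all keypoints.

```python
# Rough distance prior used as input to MonoLoco-style networks:
# a segment of known real length L observed with pixel length l under focal length f
# lies at depth roughly Z = f * L / l (exact for a fronto-parallel segment).
def distance_from_keypoints(shoulder_px, hip_px, focal_px, torso_m=0.50):
    pixel_len = abs(hip_px[1] - shoulder_px[1])      # vertical pixel distance shoulder -> hip
    return focal_px * torso_m / max(pixel_len, 1e-6)

# A pedestrian whose shoulder-hip segment spans 25 px under an 800 px focal length
# is roughly 16 m away.
print(distance_from_keypoints((410, 200), (412, 225), focal_px=800.0))
```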

To sum up, the methods above all extract key points from the 2D image and match them against a 3D model to obtain the 3D information of the target. They assume the target has a relatively fixed shape model, which generally holds for vehicles but less so for pedestrians. In addition, these methods require multiple key points to be annotated on the 2D image, which is very labor-intensive.

2D/3D geometric constraints

Deep3DBox[12] is an early and representative work in this direction. A 3D bounding box requires 9 degrees of freedom: center, size, and orientation (3D orientation can often be reduced to yaw alone, giving 7 degrees of freedom). 2D detection provides a 2D bounding box with 4 known variables (2D center and 2D size), which is not enough to solve for 7 or 9 unknowns. Among these three groups of variables, size and orientation are relatively closely tied to visual appearance. For example, the 3D size of an object is strongly correlated with its category (pedestrian, bicycle, car, bus, truck, etc.), and the category can be predicted from visual features. The 3D position of the center, on the other hand, is hard to predict from visual features alone because of the ambiguity introduced by perspective projection. Deep3DBox therefore first uses the image features inside the 2D bounding box to estimate the object's size and orientation, and then uses a 2D/3D geometric constraint to solve for the 3D position of the center. The constraint is that the projection of the 3D box onto the image is tightly enclosed by the 2D box, i.e. on each side of the 2D box at least one projected corner of the 3D box can be found. Using the previously predicted size and orientation together with the camera calibration parameters, the 3D position of the center can then be solved.

Geometric constraints between 2D and 3D bounding boxes (image from reference [9])
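To see how the tight-box constraint determines the 3D position, assume the size and yaw have already been predicted and that we know which 3D corner touches which side of the 2D box (Deep3DBox enumerates candidate correspondences and keeps the best one). Each constraint is then linear in the unknown translation, so it can be solved by least squares. The sketch below is a simplified illustration with placeholder intrinsics and an assumed correspondence, not the original implementation.

```python
import numpy as np

def solve_translation(box2d, dims, yaw, K, corner_sides):
    """
    box2d:        (u_min, v_min, u_max, v_max) tight 2D bounding box
    dims:         predicted 3D size (h, w, l)
    yaw:          predicted orientation around the vertical axis
    K:            3x3 camera intrinsics
    corner_sides: list of (corner_index, side) pairs stating which 3D corner touches
                  which 2D box side ('umin', 'umax', 'vmin', 'vmax'); assumed known here.
    Returns the 3D center translation T best satisfying the constraints.
    """
    h, w, l = dims
    # 8 corners in the object frame (origin at the box center, y pointing down)
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * h / 2
    z = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    corners = np.stack((x, y, z), axis=1)                        # (8, 3)
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u_min, v_min, u_max, v_max = box2d

    A, b = [], []
    for idx, side in corner_sides:
        Xc = R @ corners[idx]                                    # rotated corner, before translation
        if side in ("umin", "umax"):
            u = u_min if side == "umin" else u_max
            A.append([fx, 0.0, cx - u])                          # linear in T = (Tx, Ty, Tz)
            b.append(-(fx * Xc[0] + (cx - u) * Xc[2]))
        else:
            v = v_min if side == "vmin" else v_max
            A.append([0.0, fy, cy - v])
            b.append(-(fy * Xc[1] + (cy - v) * Xc[2]))
    T, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return T

# One hypothetical assignment: corner 2 touches the left edge, corner 0 the right edge,
# a top corner (4) the upper edge, and a bottom corner (1) the lower edge.
K = np.array([[800.0, 0, 640.0], [0, 800.0, 360.0], [0, 0, 1.0]])
T = solve_translation((500, 300, 700, 420), dims=(1.5, 1.8, 4.2), yaw=0.3, K=K,
                      corner_sides=[(2, "umin"), (0, "umax"), (4, "vmin"), (1, "vmax")])
```

In practice the corner-to-side assignment is not known in advance, so the plausible assignments are enumerated and the solution with the smallest reprojection error is kept.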

This approach of exploiting 2D/3D constraints requires very precise 2D detection: under the Deep3DBox framework, a small error in the 2D bounding box can make the 3D prediction fail. The first two stages of Shift R-CNN[13] are very similar to Deep3DBox, predicting 3D size and orientation from the 2D box and visual features and then solving for the 3D position through the geometric constraints. However, Shift R-CNN adds a third stage that takes the 2D box, the 3D box, and the camera parameters from the first two stages as input and uses a fully connected network to predict a more accurate 3D position.

Shift R-CNN

In the methods above that use 2D/3D geometric constraints, the object's 3D position is obtained by solving an over-constrained system of equations, and this is done as a post-processing step outside the neural network (the first and third stages of Shift R-CNN are also trained separately). MVRA[14] instead builds the solution of this over-constrained system into the network and designs an IoU loss in image coordinates and an L2 loss in BEV coordinates to measure the bounding-box and distance errors, respectively, assisting end-to-end training. In this way, the quality of the 3D position estimate also feeds back into the earlier predictions of 3D size and orientation.

Generate 3D object boxes directly

The three types of methods introduced so far all start from the 2D image: some transform the image into a BEV view, some detect 2D key points and match them to 3D models, and some exploit the geometric constraints between 2D and 3D bounding boxes. There is yet another line of work that starts from dense 3D candidate boxes, scores all candidates using features on the 2D image, and outputs the highest-scoring candidates. This strategy is somewhat similar to the traditional sliding-window approach in object detection.

Mono3D[15] is representative of this type of approach. Dense 3D candidate boxes are first generated based on the target's prior position (the z-coordinate lies on the ground) and size; on the KITTI dataset, roughly 40K candidates (vehicles) or 70K candidates (pedestrians and cyclists) are generated per frame. These 3D candidates are projected into image coordinates and scored using features on the 2D image, drawn from semantic segmentation, instance segmentation, context, shape, and location priors. All these features are fused to score the candidate boxes, the higher-scoring ones are kept, and a CNN then scores the remaining candidates in a second round to obtain the final 3D bounding boxes.

Mono3D

M3D-RPN[16] is an anchor-based approach. It defines both 2D and 3D anchors, representing 2D and 3D bounding boxes, respectively. The 2D anchors are obtained by dense sampling over the image, while the parameters of the 3D anchors are determined from prior knowledge in the training data: each 2D anchor is matched, via IoU, to the annotated 2D boxes in the images, and the mean of the corresponding 3D boxes defines the 3D anchor parameters. It is worth mentioning that M3D-RPN uses both standard convolutions (spatially invariant) and Depth-Aware convolutions. The latter split the image rows (the Y coordinate) into several groups, each corresponding to a different scene depth, and process each group with its own convolution kernels.

Anchor design and Depth-Aware convolution in M3D-RPN
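The row-binned ("Depth-Aware") convolution can be sketched in a few lines of PyTorch: split the feature map into horizontal bands, give each band its own kernel, and stitch the outputs back together. This is an illustration of the idea, with an arbitrary number of bands, rather than the M3D-RPN implementation.

```python
import torch
import torch.nn as nn

class DepthAwareConv2d(nn.Module):
    """Applies a separate 3x3 convolution to each horizontal band of the feature map.
    Rows at different image heights correspond to different scene depths, so each
    band learns its own kernel instead of sharing one spatially invariant kernel."""
    def __init__(self, in_ch, out_ch, num_bins=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1) for _ in range(num_bins))

    def forward(self, x):
        bands = torch.chunk(x, len(self.convs), dim=2)     # split along the row (height) axis
        return torch.cat([conv(band) for conv, band in zip(self.convs, bands)], dim=2)

feat = torch.randn(1, 64, 48, 160)                         # e.g. a backbone feature map
out = DepthAwareConv2d(64, 64, num_bins=4)(feat)
print(out.shape)                                           # torch.Size([1, 64, 48, 160])
```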

Although they make use of some prior knowledge, Mono3D and M3D-RPN both rely on densely sampled object candidates or anchors, so the computational cost is very high and their practicality suffers. Some later methods propose using 2D image detection results to further shrink the search space.

TLNet[17] places anchors densely in a horizontal 2D plane, with a spacing of 0.25 meters, orientations of 0 and 90 degrees, and sizes set to the class averages. The 2D detections in the image form frustums in 3D space, and these frustums can filter out a large number of background anchors, improving the efficiency of the algorithm. The remaining anchors are projected onto the image, and the features obtained by ROI Pooling are used to further refine the 3D bounding box parameters.

TLNet

SS3D[18] uses a more efficient single-stage detector: a CenterNet-like network outputs various 2D and 3D information directly from the image, such as the object category, the 2D bounding box, and the 3D bounding box. Note that the 3D box here is not the usual 9-D or 7-D representation (which is hard to predict directly from an image) but a redundant, image-friendly 26-D representation: distance (1-D), orientation (2-D, sin and cos), size (3-D), and the image coordinates of the 8 corners (16-D), plus the 4-D representation of the 2D bounding box. All of these features are used to recover the 3D bounding box, which amounts to finding the 3D box that best fits the 26-D features. A special point is that this fitting is performed inside the neural network, so it must be differentiable, which is one of the main highlights of the paper. Thanks to its simple structure and implementation, SS3D runs at up to 20 FPS.

SS3D

FCOS3D[19] is also a single-stage method, and more concise than SS3D. The center of the 3D bounding box is projected onto the 2D image to obtain a 2.5D center (X, Y, depth) as one of the regression targets. The other regression targets are the 3D size and the orientation, where orientation is represented as an angle in (0, π) plus a heading direction.

FCOS3D

SMOKE[20] proposes a similar idea: a CenterNet-like structure predicts 2D and 3D information directly from the image. The 2D information consists of the projected image positions of the object key points (the center and the corners), and the 3D information consists of the center depth, the size, and the orientation. From the image position and depth of the center, the 3D position of the object can be recovered; the 3D positions of the corners can then be recovered from the 3D size and orientation.
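This decoding step can be written down directly: back-project the predicted 2D center with the predicted depth to get the 3D center, then place the 8 corners using the predicted size and yaw. The sketch below uses placeholder intrinsics and made-up predictions; it illustrates the geometry rather than reproducing the SMOKE code.

```python
import numpy as np

def decode_3d_box(center_uv, depth, dims, yaw, K):
    """center_uv: predicted 2D projection (u, v) of the 3D box center
       depth:     predicted depth of the center (meters)
       dims:      predicted (h, w, l)
       yaw:       predicted orientation
       Returns the 3D center and the 8 corners in camera coordinates."""
    u, v = center_uv
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    center = np.array([(u - cx) * depth / fx,          # back-projection of the 2.5D center
                       (v - cy) * depth / fy,
                       depth])
    h, w, l = dims
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * h / 2
    z = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    R = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                  [ 0,           1, 0          ],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    corners = (R @ np.stack((x, y, z))).T + center      # rotate, then translate to the center
    return center, corners

K = np.array([[800.0, 0, 640.0], [0, 800.0, 360.0], [0, 0, 1.0]])
center, corners = decode_3d_box((660, 400), depth=18.0, dims=(1.5, 1.8, 4.2), yaw=0.1, K=K)
```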

The idea behind the single-stage networks above is to regress 3D information directly from the image, without complex pre-processing (such as inverse image transforms) or post-processing (such as 3D model matching), and without precise geometric constraints (such as requiring that each side of the 2D box touch at least one corner of the projected 3D box). These methods use only a small amount of prior knowledge, such as the mean physical size of each object class and the correspondence between 2D size and depth that follows from it. Such priors define initial values for the object's 3D parameters, and the network only needs to regress the deviation from them, which greatly reduces the search space and thus the difficulty of learning.

Depth estimation

The previous section introduced representative approaches to monocular 3D object detection, with ideas ranging from early image transforms, 3D model matching, and 2D/3D geometric constraints to the more recent direct prediction of 3D information from images. Much of this shift stems from progress in depth estimation with convolutional neural networks. Most of the single-stage 3D detection networks described above include a depth estimation branch. Although the depth estimate there is only at the sparse object level rather than the dense pixel level, it is sufficient for object detection.

Besides object detection, autonomous driving perception has another important task: semantic segmentation. The most straightforward way to extend semantic segmentation from 2D to 3D is to attach a dense depth map, so that each pixel carries both semantic and depth information.

Combining these two points, monocular depth estimation plays a very important role in 3D perception. By analogy with the 3D object detection methods in the previous section, fully convolutional neural networks can also be used for dense depth estimation. The following reviews the development of this direction.

The input to monocular depth estimation is an image, and the output is also an image (generally of the same size as the input) in which each pixel value is the scene depth at the corresponding input pixel. The task is somewhat similar to image semantic segmentation, except that semantic segmentation outputs a class for each pixel. The input can of course also be a video sequence, in which case the extra information from camera or object motion can improve accuracy (corresponding to video semantic segmentation).

As mentioned earlier, predicting 3D information from a 2D image is an ill-posed problem, so traditional methods rely on cues such as geometry and motion and predict pixel depth with hand-crafted features. Similar to semantic segmentation, superpixels and conditional random fields (CRFs) are often used to improve the accuracy of the estimates. In recent years deep neural networks have made breakthroughs across image perception tasks, and depth estimation is no exception: a large body of work shows that deep networks can learn better features from training data than manual designs can. This section focuses on supervised approaches; unsupervised ideas, such as exploiting binocular disparity, monocular dual-pixel differences, or video motion, will be introduced later.

One of the representative early works is the method of Eigen et al.[21] based on fusing global and local cues. The ambiguity of monocular depth estimation comes mainly from the global scale: for example, as the paper notes, a real room and a toy model of a room may look almost the same in an image, yet their actual depths differ enormously. Although this is an extreme example, the sizes of rooms and furniture in real datasets still vary. The method therefore applies multi-layer convolution and downsampling to the image to obtain a description of the whole scene and predict the global depth, and then uses a separate local branch (at relatively high resolution) to predict the depth of local image regions. The global depth is fed into the local branch as an input to assist the local prediction.

Global and local information fusion[21]
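To cope with exactly this global-scale ambiguity, Eigen et al.[21] train with a scale-invariant error in log-depth space, which discounts the part of the error that amounts to a single global scale shift. A minimal PyTorch sketch, assuming the commonly used weight of 0.5:

```python
import torch

def scale_invariant_loss(pred_log_depth, gt_log_depth, lam=0.5):
    """Scale-invariant log-depth loss in the spirit of Eigen et al. [21]: the second
    term discounts errors that amount to a constant offset in log depth, i.e. a
    single global scale factor, reflecting the ambiguity discussed above."""
    d = pred_log_depth - gt_log_depth            # per-pixel log-depth error
    return (d ** 2).mean() - lam * d.mean() ** 2

pred = torch.rand(1, 1, 60, 80)
loss = scale_invariant_loss(pred, pred + 0.3)    # a pure scale shift is partly discounted
print(loss.item())                               # (fully discounted as lam -> 1)
```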

Reference [22] further proposes using the multi-scale feature maps output by a convolutional neural network to predict depth maps at different resolutions ([21] uses only two resolutions). These depth maps at different resolutions are then fused by continuous CRFs to obtain a depth map matching the input image.

Multi-scale information fusion[22]

Both papers above regress the depth map with convolutional neural networks. Another idea is to turn the regression problem into a classification problem, that is, to divide the continuous depth values into discrete intervals and treat each interval as a class. The representative work in this direction is DORN[23]. The network in the DORN framework is also an encoder-decoder structure, with some differences in detail, such as decoding with fully connected layers and using dilated convolutions for feature extraction.

DORN depth classification
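Turning depth regression into classification requires choosing how to discretize depth. DORN uses a spacing-increasing discretization, i.e. uniform bins in log-depth, so that nearby depths are quantized finely and distant depths coarsely. A small sketch with an illustrative depth range and bin count:

```python
import numpy as np

def depth_to_bin(depth, d_min=1.0, d_max=80.0, num_bins=80):
    """Spacing-increasing discretization: bin edges are uniform in log(depth),
    so nearby depths get finer bins than distant ones."""
    ratio = np.log(depth / d_min) / np.log(d_max / d_min)
    return np.clip((ratio * num_bins).astype(int), 0, num_bins - 1)

def bin_to_depth(bin_idx, d_min=1.0, d_max=80.0, num_bins=80):
    """Decode a bin index back to a depth value (the bin center in log space)."""
    ratio = (bin_idx + 0.5) / num_bins
    return d_min * (d_max / d_min) ** ratio

depths = np.array([2.0, 10.0, 40.0, 75.0])
bins = depth_to_bin(depths)
print(bins, bin_to_depth(bins))   # nearby depths are quantized much more finely than distant ones
```

DORN additionally trains these bins with an ordinal regression loss rather than an ordinary softmax, so that predictions far from the true bin are penalized more heavily.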

As mentioned earlier, depth estimation is similar to semantic segmentation, so the size of the receptive field is also very important. In addition to the pyramid structures and dilated convolutions mentioned above, the recently popular Transformer architecture has a global receptive field and is therefore also well suited to this task. Reference [24] proposes combining Transformers with a multi-scale structure to ensure both local accuracy and global consistency of the predictions.

Transformer for Dense Prediction

Binocular 3D perception

Although prior knowledge and contextual information in the image can be exploited, the accuracy of monocular 3D perception is not entirely satisfactory. In particular, when deep learning is used, the accuracy of the algorithm depends heavily on the size and quality of the dataset, and for scenarios that never appear in the dataset the algorithm can show large deviations in both depth estimation and object detection.

Binocular vision can resolve the ambiguity caused by perspective projection and can therefore, in theory, improve the accuracy of 3D perception. But binocular systems place high demands on both hardware and software. On the hardware side, two precisely calibrated cameras are required, and the calibration must remain correct throughout the operation of the vehicle. On the software side, the algorithm has to process data from two cameras simultaneously; the computational complexity is high, and real-time performance is harder to guarantee.

Overall, there is relatively little work on binocular visual perception compared with monocular perception; a few typical papers are reviewed below. There is also some multi-camera work, but it leans toward the system and application level, such as the 360° perception system Tesla demonstrated on AI Day.

3DOP[25] first generates a depth map from a pair of stereo images, converts the depth map into a point cloud, quantizes it into a grid data structure, and uses this as input to generate 3D object proposals. Intuitive prior knowledge is used when generating the proposals: the point cloud density inside a candidate box should be high enough, the height should be consistent with the actual object class and differ sufficiently from the points outside the box, and the overlap between the candidate box and free space should be small enough. Under these criteria, roughly 2K 3D proposals are sampled in 3D space. The proposals are mapped onto the 2D image, features are extracted via ROI Pooling, and these features are used to predict the object class and refine the bounding box. The image input here can be the RGB image from a camera or the depth map.

Overall, this is a two-stage detection method: the first stage uses depth information (the point cloud) to generate object proposals, and the second stage uses image (or depth) information for refinement. In theory, the point cloud in the first stage could also come from LiDAR, and the authors ran this comparison. LiDAR's advantage is accurate ranging, so it works better for small, partially occluded, and distant objects; binocular vision's advantage is a denser point cloud, so it works better at close range when occlusion is light and objects are relatively large. Cost and computational complexity aside, fusing the two gives the best results.

3DOP

3DOP shares the idea of Pseudo-LiDAR[3] introduced in the previous section: convert a dense depth map (from monocular, binocular, or even low line-count LiDAR) into a point cloud and then apply algorithms from the field of point cloud object detection.

Estimating a depth map from the images, generating a point cloud from the depth map, and finally applying a point cloud detector: in this pipeline the steps are performed separately and cannot be trained end-to-end. DSGN[26] proposes a single-stage algorithm that, starting from the left and right images, builds an intermediate representation such as a Plane-Sweep Volume, generates a 3D representation in the BEV view from it, and performs depth estimation and object detection at the same time. Every step of this pipeline is differentiable, so end-to-end training is possible.

DSGN

A depth map is a dense representation, but for object detection there is no need to estimate depth everywhere in the scene; it suffices to estimate it at the objects of interest. A similar idea was mentioned earlier for the monocular algorithms. Stereo R-CNN[27] does not estimate a depth map; instead, feature maps from the two cameras are stacked and object proposals are generated under the RPN framework. The key to associating information from the left and right cameras lies in how the annotation data is changed. As shown in the figure below, in addition to the left and right annotation boxes, the union of the left and right boxes is added. An anchor whose IoU with either the left or the right box exceeds 0.7 is taken as a positive sample, and an anchor whose IoU with the union box is below 0.3 is taken as a negative sample. A positive anchor regresses the positions and sizes of both the left and right boxes. In addition to the bounding boxes, the method also uses corner points as a supplement. From all of this information the 3D bounding box can be recovered.

Stereo R-CNN

A dense depth estimate of the whole scene can even hurt object detection. For example, depth estimates at object edges deviate strongly because of the overlap with the background, and the large depth range of the whole scene also slows the algorithm down. Similar to Stereo R-CNN, reference [28] therefore proposes estimating depth only at the objects of interest and generating points only on those objects. These object-centric point clouds are then used to predict the objects' 3D information.

Object-Centric Stereo Matching

Depth estimation

As in monocular perception, depth estimation is also a critical step in binocular perception. As seen in the binocular object detection methods above, many algorithms rely on depth estimation, either at the scene level or at the object level. The following briefly reviews the basic principle of binocular depth estimation and a few representative works.

The principle of binocular depth estimation is simple: the depth of a 3D point is estimated from the disparity d, i.e. the horizontal displacement of the same 3D point between the left and right images (assuming the two cameras are mounted at the same height, only the horizontal displacement matters), the focal length f of the camera, and the baseline B, the distance between the two cameras; the relation is depth = f · B / d.

In a binocular system f and B are fixed, so only the disparity d needs to be estimated, which means finding, for each pixel, its matching point in the other image. The range of possible d is limited, so the search range for matching is also limited. For every candidate d the matching error at each pixel can be computed, producing a three-dimensional array of errors called the Cost Volume. The matching error is usually computed over a local region around the pixel, and one of the simplest methods is to sum the absolute differences of the corresponding pixel values in that region:
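A minimal NumPy sketch of this window-based SAD cost volume, together with a winner-take-all disparity and the depth = f · B / d conversion; the image size, window size, focal length, and baseline are placeholder values:

```python
import numpy as np

def sad_cost_volume(left, right, max_disp=64, window=5):
    """Cost Volume by Sum of Absolute Differences: cost[d, y, x] is the matching error
    between a window around (y, x) in the left image and the window around (y, x - d)
    in the right image. Lower cost means a better match."""
    h, w = left.shape
    pad = window // 2
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])     # shift the right image by d pixels
        padded = np.pad(diff, pad, mode="edge")
        sad = np.zeros_like(diff)
        for dy in range(window):                           # box sum over the local window
            for dx in range(window):
                sad += padded[dy:dy + diff.shape[0], dx:dx + diff.shape[1]]
        cost[d, :, d:] = sad
    return cost

# Synthetic example: the right image is the left image shifted by 8 pixels.
left = np.random.rand(48, 96).astype(np.float32)
right = np.roll(left, -8, axis=1)
disparity = sad_cost_volume(left, right, max_disp=16).argmin(axis=0)   # winner-take-all
focal_px, baseline_m = 800.0, 0.54                                     # placeholder f and B
depth = focal_px * baseline_m / np.maximum(disparity, 1)               # depth = f * B / d
```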

MC-CNN[29] formulates matching as computing the similarity of two image patches and learns the patch features with a neural network. A training set can be constructed from annotated data: at each pixel a positive and a negative sample are generated, each being a pair of image patches. The positive pair comes from the same 3D point (the same depth), while the negative pair comes from different 3D points (different depths). There are many choices for the negative sample, and to keep positives and negatives balanced only one is randomly drawn. With these samples, a network can be trained to predict similarity. The core idea is to use the supervision signal to guide the network to learn image features suited to the matching task.

MC-CNN

MC-CNN has two main shortcomings: 1) the Cost Volume is computed from local image patches, which leads to large errors in regions with little texture or with repetitive patterns; 2) the post-processing steps are hand-designed, which takes a lot of effort and offers no guarantee of optimality. GC-Net[30] improves on both points. First, multi-layer convolution and downsampling are applied to the left and right images to better extract semantic features. For each disparity level (in pixels), the left and right feature maps are aligned (by pixel offsets) and concatenated to obtain the feature map at that disparity level; the feature maps of all disparity levels are then combined into a 4D Cost Volume (height, width, disparity, feature). At this point the Cost Volume simply stacks features from the two images without any interaction between them, so the next step processes it with 3D convolutions, which extract correlations both between the left and right images and across disparity levels; the output of this step is a 3D Cost Volume (height, width, disparity). Finally, an Argmin along the disparity dimension would yield the optimal disparity, but a standard Argmin is not differentiable; GC-Net uses a Soft Argmin instead, so that the whole network can be trained end-to-end.

GC-Net
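The Soft Argmin step itself is only a few lines: turn the (negated) costs into a probability distribution over disparity levels with a softmax and take the expected disparity, which is differentiable. A sketch of the operation (not the GC-Net code):

```python
import torch

def soft_argmin(cost_volume):
    """cost_volume: (B, D, H, W) matching cost per disparity level.
    Returns the expected disparity per pixel, a differentiable stand-in for argmin."""
    prob = torch.softmax(-cost_volume, dim=1)                    # low cost -> high probability
    disparities = torch.arange(cost_volume.shape[1],
                               dtype=cost_volume.dtype, device=cost_volume.device)
    return (prob * disparities.view(1, -1, 1, 1)).sum(dim=1)     # expectation over disparity

cost = torch.randn(2, 64, 48, 96)        # e.g. the output of the 3D convolution stage
disp = soft_argmin(cost)                 # (2, 48, 96), usable inside an end-to-end loss
```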

PSMNet[31] is structurally very similar to GC-Net but improves it in two ways: 1) a pyramid structure and dilated convolutions are used to extract multi-resolution information and enlarge the receptive field; thanks to the fusion of global and local features, the Cost Volume estimates become more accurate; 2) multiple stacked Hour-Glass structures are used to strengthen the 3D convolutions, further enhancing the use of global information. Overall, PSMNet improves the exploitation of global information, so that disparity estimation depends more on context at different scales than on local pixel-level information.

PSMNet

In the Cost Volume the disparity levels are discrete (in pixels): the network learns the cost distribution over these discrete points, and the extremum of the distribution corresponds to the disparity at the current position. But disparity (and depth) is actually continuous, and estimating it with discrete points introduces errors. CDN[32] therefore proposes continuous estimation: in addition to the distribution over the discrete points, an offset is estimated at each point, and the discrete points together with the offsets form a continuous disparity estimate.

CDN

References

[1] Kim et al., Deep Learning based Vehicle Position and Orientation Estimation via Inverse Perspective Mapping Image, IV 2019.

[2] Roddick et al., Orthographic Feature Transform for Monocular 3D Object Detection, BMVC 2019.

[3] Wang et al., Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving, CVPR 2019.

[4] You et al., Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving, ICLR 2020.

[5] Weng and Kitani, Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud, ICCV 2019.

[6] Vianney et al., RefinedMPL: Refined Monocular PseudoLiDAR for 3D Object Detection in Autonomous Driving, 2019.

[7] Chabot et al., Deep MANTA: A Coarse-to-fine Many-Task Network For Joint 2D and 3D Vehicle Analysis From Monocular Image, CVPR 2017.

[8] Kundu et al., 3D-RCNN: Instance-level 3D Object Reconstruction via Render-and-Compare, CVPR 2018.

[9] Qin et al., MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization, AAAI 2019.

[10] Barabanau et al., Monocular 3D Object Detection via Geometric Reasoning on Keypoints, 2019.

[11] Bertoni et al., MonoLoco: Monocular 3D Pedestrian Localization and Uncertainty Estimation, ICCV 2019.

[12] Mousavian et al., 3D Bounding Box Estimation Using Deep Learning and Geometry, CVPR 2017.

[13] Naiden et al., Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints, ICIP 2019.

[14] Choi et al., Multi-View Reprojection Architecture for Orientation Estimation, ICCV 2019.

[15] Chen et al., Monocular 3D Object Detection for Autonomous Driving, CVPR 2016.

[16] Brazil and Liu, M3D-RPN: Monocular 3D Region Proposal Network for Object Detection, ICCV 2019.

[17] Qin et al., Triangulation Learning Network: from Monocular to Stereo 3D Object Detection, CVPR 2019.

[18] Jörgensen et al., Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss, 2019.

[19] Wang et al., FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection, 2021.

[20] Liu et al., SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation, CVPRW 2020.

[21] Eigen et al., Depth Map Prediction from a Single Image using a Multi-Scale Deep Network, NIPS 2014.

[22] Xu et al., Monocular Depth Estimation using Multi-Scale Continuous CRFs as Sequential Deep Networks, TPAMI 2018.

[23] Fu et al., Deep Ordinal Regression Network for Monocular Depth Estimation, CVPR 2018.

[24] Ranftl et al., Vision Transformers for Dense Prediction, ICCV 2021.

[25] Chen et al., 3D Object Proposals using Stereo Imagery for Accurate Object Class Detection, TPAMI 2017.

[26] Chen et al., DSGN: Deep Stereo Geometry Network for 3D Object Detection, CVPR 2020.

[27] Li et al., Stereo R-CNN based 3D Object Detection for Autonomous Driving, CVPR 2019.

[28] Pon et al., Object-Centric Stereo Matching for 3D Object Detection, ICRA 2020.

[29] Zbontar and LeCun, Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches, JMLR 2016.

[30] Kendall et al., End-to-End Learning of Geometry and Context for Deep Stereo Regression, ICCV 2017.

[31] Chang and Chen, Pyramid Stereo Matching Network, CVPR 2018.

[32] Garg et al., Wasserstein Distances for Stereo Disparity Estimation, NeurIPS, 2020.

Reprinted from Zhihu (Automotive Electronics and Software). The views expressed are shared for exchange only and do not represent the position of this account. For copyright or other issues, please let us know and we will handle them promptly.

-- END --
