One article to understand multi-camera visual perception for autonomous driving

From the perspective of output dimension, perception methods based on vision sensors can be divided into two types: 2D perception and 3D perception. (Related reading: One article to understand 2D perception algorithms)

In terms of the number of sensors, visual perception systems can also be divided into monocular, binocular (stereo), and multi-camera systems. 2D perception tasks typically use monocular systems, which is where computer vision and deep learning are most tightly integrated. But what autonomous driving perception ultimately needs is 3D output, so 2D information has to be extended to 3D.

Before deep learning took off, the common practice was to infer the depth (distance) of a target from assumptions such as its prior size or the assumption that it sits on the ground plane, or to estimate depth from motion (Motion Stereo). With the help of deep learning, it becomes feasible to learn scene cues from large datasets and perform monocular depth estimation. But this approach relies heavily on pattern recognition and struggles with corner cases outside the dataset. Take the special construction vehicles found in work zones: if there are few or no such samples in the database, the vision system cannot reliably detect the target and therefore cannot judge its distance. A binocular (stereo) system, by contrast, obtains parallax naturally and can estimate an obstacle's distance from it. Such a system relies far less on pattern recognition: as long as stable key points can be found on the target, they can be matched, the disparity computed, and the distance estimated. However, binocular systems also have the following disadvantages.

First, if key points cannot be obtained, the distance calculation fails. A typical case is the large white truck that has repeatedly caused accidents in automated driving: when it lies across the middle of the road, the vision sensor can hardly capture usable key points within its limited field of view.

Second, a binocular system places very high demands on the calibration between the cameras and generally requires a very accurate online calibration function.

Finally, a binocular system involves a large amount of computation and requires a chip with higher computing power; FPGAs are commonly used for this.

The cost of a binocular system sits between that of a monocular system and that of lidar, and some OEMs, such as Subaru, Mercedes-Benz and BMW, have begun using binocular vision to support driving-automation systems at various levels. In theory, a binocular system can already solve the problem of acquiring 3D information, so why are multi-camera systems needed?

There are roughly two reasons: one is to improve adaptability to various environmental conditions by adding sensors of different types, such as infrared cameras; the other is to expand the system's field of view by adding cameras with different orientations and focal lengths. Let us analyze a few typical multi-camera systems.

Mobileye's trifocal system

For a fixed-focal-length lens, detection range and field of view trade off against each other: the wider the field of view, the shorter the detection range and the lower the accuracy; the narrower the field of view, the longer the detection range and the higher the accuracy. It is impractical for automotive cameras to zoom frequently, so in general the detection range and field of view are fixed.

A multi-camera system can cover scenes at different ranges by combining cameras with different focal lengths. For example, the trifocal system jointly launched by Mobileye and ZF contains a 150° wide-angle camera, a 52° mid-range camera and a 28° long-range camera. Its maximum detection range reaches 300 meters while it still preserves the field of view and accuracy needed at short and medium range, so it can monitor the surroundings of the vehicle and spot objects that suddenly appear in front of it in time.
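
To make this trade-off concrete, here is a minimal pinhole-camera sketch that estimates how many pixels a car-sized target occupies at long range for each of the three fields of view. The 1920-pixel image width and the 1.8 m target width are hypothetical values chosen for illustration, not Mobileye specifications.

```python
import math

def pixels_on_target(fov_deg: float, image_width_px: int,
                     target_width_m: float, distance_m: float) -> float:
    """Approximate horizontal pixel footprint of a target centered in the
    image, under a simple pinhole model."""
    focal_px = image_width_px / (2 * math.tan(math.radians(fov_deg) / 2))
    return focal_px * target_width_m / distance_m

# Hypothetical 1920-pixel-wide imagers; the three fields of view from the text.
for name, fov in [("wide 150°", 150), ("mid 52°", 52), ("narrow 28°", 28)]:
    px = pixels_on_target(fov, 1920, target_width_m=1.8, distance_m=150)
    print(f"{name}: a 1.8 m wide car at 150 m spans ~{px:.0f} px")
```

Under these assumptions the wide-angle camera sees only a few pixels of a car at 150 meters, while the narrow long-range camera sees dozens, which is why the narrow camera is the one relied on for distant detections.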

The trifocal camera from Mobileye and ZF

The main difficulty of such a trifocal system is handling inconsistent perception results in the overlapping regions. Different cameras give different interpretations of the same scene, so a downstream fusion algorithm has to decide which one to trust. Because the cameras have different error characteristics, it is hard to design reasonable hand-crafted rules covering every situation, which poses a considerable challenge to the fusion algorithm. As explained later in this article, a multi-camera system can instead use data-level fusion and let deep learning on large datasets learn the fusion rules. Of course, this does not mean machine learning settles the matter: a black-box deep neural network sometimes produces outputs that are hard to explain.

Foresight's quad-camera perception system

Another idea for multi-camera systems is to add sensors in different wavelength bands, such as infrared cameras (in fact, lidar and millimeter-wave radar are also sensors operating in different bands). Foresight, a company from Israel, designed and demonstrated a quad-camera perception system along these lines. On top of a visible-light stereo pair, QuadSight adds a pair of long-wave infrared (LWIR) cameras, extending coverage from the visible band into the infrared band. The added infrared band increases the amount of information on the one hand, and on the other hand improves adaptability at night and in rain and fog, giving the system the ability to operate around the clock.

The cameras in the QuadSight system have a 45° field of view and a detection range of up to 150 meters, and they can detect objects of 35 × 25 cm at a distance of 100 meters. The system runs at 45 frames per second, which is enough to cope with high-speed driving scenarios.

Foresight's QuadSight system

The QuadSight system consists of two stereo pairs. As the figure above shows, the infrared stereo cameras are mounted at the left and right edges of the windshield, so their baseline is much longer than that of a typical binocular system. This is a good place for a short digression on the baseline length of a binocular system.

Traditional binocular systems generally use a short baseline, meaning the distance between the two cameras is relatively small, which limits the maximum detection range. When a target is far enough away, its disparity between the left and right images drops below one pixel and its depth can no longer be estimated; this is the so-called baseline constraint. That is the limiting case: in practice, even when the disparity of a distant target is greater than one pixel, the depth-estimation error is still very large. In general, the depth-estimation error grows with the square of the distance.
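
A short numerical sketch of these two relationships follows, using the standard rectified-stereo relation Z = f·B/d. The 1000-pixel focal length and the 0.25-pixel matching error are hypothetical illustrative values, not figures from any particular product.

```python
def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Rectified stereo: d = f * B / Z (equivalently Z = f * B / d)."""
    return focal_px * baseline_m / depth_m

def depth_error_m(focal_px: float, baseline_m: float, depth_m: float,
                  disparity_error_px: float = 0.25) -> float:
    """First-order error propagation of a disparity matching error:
    |dZ| ≈ Z^2 / (f * B) * |dd| — quadratic in distance, inversely
    proportional to the baseline."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

# Hypothetical setup: 1000 px focal length, 0.25 px matching error.
for baseline in (0.12, 1.2):              # short vs. wide baseline (meters)
    for depth in (20, 50, 100, 200):      # target distance (meters)
        d = disparity_px(1000, baseline, depth)
        err = depth_error_m(1000, baseline, depth)
        print(f"B={baseline:.2f} m, Z={depth:3d} m: d={d:5.1f} px, error ≈ {err:5.1f} m")
```

With these numbers, a 12 cm baseline gives roughly a 20 m error at 100 m, while a 1.2 m baseline brings it down to about 2 m, which is exactly the motivation for the wide-baseline designs discussed next.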

To extend the effective detection range of a binocular system, an intuitive solution is to lengthen the baseline, which enlarges the disparity range. NODAR's Hammerhead technology achieves a wide-baseline configuration with two cameras placed far apart, reaching a detection range of up to 1000 meters while generating a high-density point cloud. Such a system can take advantage of the width of the vehicle, for example by mounting the cameras in the side mirrors, in the headlights, or on both sides of the roof.

Wide baseline configuration in Hammerhead technology

Tesla's panoramic perception system

Having analyzed the trifocal and quad-camera examples, we now come to the focus of this article: panoramic perception based on multiple cameras. The example used here is the purely vision-based FSD (Full Self-Driving) system Tesla demonstrated at AI Day 2021. Although it can only be regarded as L2 (the driver must be ready to take over at any time), if we compare it horizontally against other L2 automated driving systems, FSD performs quite well. Moreover, this pure-vision solution incorporates many of the successful ideas from deep learning in recent years and is very distinctive in its multi-camera fusion; personally, I think it is worth studying, at least from a technical point of view.

Multi-camera configuration of the Tesla FSD system

Here is a short digression about Tesla's AI and Vision work and the head of the Vision direction, Andrej Karpathy. Born in 1986, he received his Ph.D. from Stanford University in 2015 under Professor Fei-Fei Li, a leading figure in computer vision and machine learning; his research focused on the intersection of natural language processing and computer vision and the application of deep neural networks to it. Musk recruited him in 2016 and put him in charge of Tesla's AI department, where he became the chief architect of the algorithms behind FSD's pure-vision system.

In his AI Day talk, Andrej first mentioned that five years earlier Tesla's vision system obtained detection results on individual images and then mapped them into "Vector Space". This vector space is one of the core concepts of the talk; as I understand it, it is the representation space of the various objects in the environment expressed in the world coordinate system. For an object detection task, for example, the position, size, orientation, velocity and other attributes of a target in 3D space form a vector, and the space made up of the descriptive vectors of all targets is the vector space. The task of the visual perception system is to transform information in image space into information in vector space. This can be done in two ways: one is to first complete all perception tasks in image space, then map the results into vector space, and finally fuse the results from multiple cameras; the other is to first transform the image features into vector space, then fuse the features from multiple cameras, and finally complete all perception tasks in vector space.

Andrej gave two examples of why the first approach is inadequate. First, because of perspective projection, perception results that look good in the image become very inaccurate in vector space, especially at long range. As shown in the figure below, the lane lines (blue) and road edges (red) are positioned very inaccurately after being projected into vector space and cannot support autonomous driving applications.

Perception results in image space (top) and their projection into vector space (bottom)

Second, in a multi-camera system, a single camera may not see the whole target because of its limited field of view. In the example shown in the figure below, a large truck appears in the fields of view of several cameras, but many of them see only part of the target, so correct detections cannot be made from such incomplete information, and the quality of the subsequent fusion cannot be guaranteed. This is in fact a general problem of decision-level fusion across multiple sensors.

Limited field of view of a single camera

Based on the above analysis, image-space perception plus decision-level fusion is not a good solution. Fusing and perceiving directly in vector space avoids these problems, and that is the core idea of the FSD perception system. To realize it, two important problems must be solved: how to transform features from image space into vector space, and how to obtain labeled data in vector space.

Spatial transformations of features

The spatial transformation of features was also covered in this column's earlier article on 3D perception; the usual practice is to use the camera's calibration information to map image pixels into the world coordinate system. But this is an ill-posed problem and requires additional constraints. Autonomous driving applications usually adopt the ground-plane constraint, that is, the target sits on the ground and the ground is flat. This constraint is too strong and cannot be satisfied in many scenarios.
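
As an illustration of this conventional approach, the sketch below back-projects a pixel onto a flat ground plane using the camera calibration. The intrinsics, camera height and pitch are hypothetical values chosen for the example.

```python
import numpy as np

def pixel_to_ground(u, v, K, R_cw, C_w):
    """Back-project pixel (u, v) onto the ground plane z = 0 (the ground-plane
    constraint discussed above). K: 3x3 intrinsics, R_cw: camera-to-world
    rotation, C_w: camera center in world coordinates."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray_world = R_cw @ ray_cam
    if abs(ray_world[2]) < 1e-9:        # ray parallel to the ground plane
        return None
    s = -C_w[2] / ray_world[2]          # scale at which the ray reaches z = 0
    return C_w + s * ray_world if s > 0 else None

# Hypothetical calibration: 1000 px focal length, 1920x1080 image,
# camera 1.5 m above the ground, pitched 5 degrees downward.
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0,    0.0,   1.0]])
p = np.radians(5.0)
# Columns are the camera axes (right, down, forward) expressed in a world
# frame whose x axis points forward, y left and z up.
R_cw = np.array([[0.0, -np.sin(p),  np.cos(p)],
                 [-1.0,       0.0,        0.0],
                 [0.0, -np.cos(p), -np.sin(p)]])
C_w = np.array([0.0, 0.0, 1.5])

print(pixel_to_ground(960, 540, K, R_cw, C_w))   # a point roughly 17 m ahead
```

The moment the target leaves the ground plane, or the road tilts, the recovered distance is wrong, which is exactly the weakness Tesla's approach tries to avoid.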

There are three core points in Tesla's solution.

First, the correspondence between image space and vector space is established through a Transformer and Self-Attention, in which the positional encoding of the vector space plays a very important role. The implementation details will not be expanded here; they may be covered in a separate article later. Put simply, the feature at each position in the vector space can be seen as a weighted combination of the features at all positions in the images, with the corresponding positions naturally receiving larger weights. This weighted combination is realized automatically through Self-Attention and spatial encoding, without any manual design, and is learned end to end purely on the basis of the final task.
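
The sketch below conveys the general flavor of such an attention-based mapping: a set of learnable, position-encoded queries for the vector space attends to the flattened image features from all cameras. It is a minimal single-head illustration of the idea, not Tesla's actual network; all dimensions and module names are made up for the example.

```python
import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    """Each learnable vector-space (BEV) query, carrying a positional encoding
    for one cell of the output grid, computes a weighted combination of image
    features from all cameras."""
    def __init__(self, num_bev_queries: int, dim: int):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(num_bev_queries, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (num_cameras * H * W, dim), all cameras flattened
        q = self.to_q(self.bev_queries)                  # (N_bev, dim)
        k = self.to_k(image_features)                    # (N_img, dim)
        v = self.to_v(image_features)                    # (N_img, dim)
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                  # (N_bev, dim)

# Toy usage: 8 cameras with 20x15 feature maps of width 64, a 50x50 BEV grid.
feats = torch.randn(8 * 20 * 15, 64)
bev = BEVCrossAttention(num_bev_queries=50 * 50, dim=64)(feats)
print(bev.shape)   # torch.Size([2500, 64])
```

The attention weights play the role of the hand-designed projection rules in the traditional pipeline; here they are learned from data together with the downstream detection task.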

Second, in mass production the camera calibration differs slightly from vehicle to vehicle, which creates inconsistencies between the input data and the pretrained model. The calibration information therefore has to be provided to the neural network as an additional input. A simple way is to concatenate the calibration of each camera, encode it with an MLP and feed it to the network. A better approach, however, is to use the calibration to rectify the images from different cameras so that the corresponding cameras on different vehicles all output consistent images.
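
One common way to implement this kind of rectification is to warp each image into a canonical "virtual" camera through a homography built from the real and nominal calibrations. The sketch below assumes the misalignment is a pure rotation plus an intrinsics difference; the numbers are hypothetical and this is not necessarily Tesla's exact procedure.

```python
import cv2
import numpy as np

def rectify_to_canonical(image, K_real, R_real_to_canon, K_canon):
    """Warp an image from a physical camera into a canonical virtual camera
    that shares the same optical center but has nominal intrinsics and
    orientation. For a pure rotation the mapping is the homography
    H = K_canon * R * K_real^-1."""
    H = K_canon @ R_real_to_canon @ np.linalg.inv(K_real)
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))

# Hypothetical example: this vehicle's camera is rolled by 1.2 degrees and has
# a slightly different focal length than the nominal fleet camera.
K_real = np.array([[1005.0, 0.0, 958.0], [0.0, 1003.0, 542.0], [0.0, 0.0, 1.0]])
K_canon = np.array([[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]])
roll = np.radians(1.2)
R = np.array([[np.cos(roll), -np.sin(roll), 0.0],
              [np.sin(roll),  np.cos(roll), 0.0],
              [0.0,           0.0,          1.0]])

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # stand-in for a camera frame
rectified = rectify_to_canonical(frame, K_real, R, K_canon)
```

After such a per-vehicle warp, the network always sees images as if they came from the same nominal camera, so a single pretrained model can serve the whole fleet.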

Finally, video (multi-frame) input is used to extract temporal information, which stabilizes the output, handles occlusions better, and makes it possible to predict the motion of targets. An additional input to this part is the vehicle's own motion (available from the IMU), which lets the neural network align feature maps from different points in time. Temporal information can be processed with 3D convolutions, Transformers or RNNs. FSD uses RNNs, which in my personal experience currently offer the best balance between accuracy and computational cost.
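
A minimal sketch of these two ingredients is given below: the previous frame's vector-space (BEV) feature map is warped by the ego-motion, then fused with the current features through a small convolutional GRU. The grid size, channel count and the simplified 2D rigid alignment are assumptions for illustration, not the production design.

```python
import math
import torch
import torch.nn.functional as F

def align_prev_bev(prev_bev, yaw_rad, dx_cells, dy_cells):
    """Warp the previous BEV feature map into the current ego frame using a
    simplified 2D rigid ego-motion (sign conventions glossed over here).
    prev_bev: (1, C, H, W); dx_cells/dy_cells: translation in grid cells."""
    _, _, H, W = prev_bev.shape
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    theta = torch.tensor([[[c, -s, 2.0 * dx_cells / W],
                           [s,  c, 2.0 * dy_cells / H]]], dtype=torch.float32)
    grid = F.affine_grid(theta, prev_bev.shape, align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False)

class BEVGRUCell(torch.nn.Module):
    """Minimal convolutional GRU cell that fuses the aligned history with the
    current BEV features (one simple stand-in for the RNN mentioned above)."""
    def __init__(self, channels):
        super().__init__()
        self.gates = torch.nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = torch.nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde

# Toy usage: align last frame's 64-channel 100x100 BEV memory, then update it.
prev = torch.randn(1, 64, 100, 100)
curr = torch.randn(1, 64, 100, 100)
aligned = align_prev_bev(prev, yaw_rad=0.02, dx_cells=3.0, dy_cells=0.0)
new_state = BEVGRUCell(64)(curr, aligned)
print(new_state.shape)   # torch.Size([1, 64, 100, 100])
```

Because the history is carried as a spatial feature map rather than as per-object tracks, a temporarily occluded target can persist in the memory until it reappears.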

With the above algorithmic improvements, the quality of FSD's output in vector space has improved greatly. In the comparison figure below, the lower left shows the output of the image-space perception plus decision-level fusion scheme, while the lower right shows the output of the feature-space transformation plus vector-space perception and fusion scheme described above.

Image-space perception (bottom left) vs. vector-space perception (bottom right)

Labels in vector space

Since this is a deep learning algorithm, data and annotation are naturally key links. Annotation in image space is very intuitive, but what the system ultimately needs is annotation in vector space. Tesla's approach is to reconstruct a 3D scene from the images of multiple cameras and annotate within that 3D scene. The annotator only needs to label once in the 3D scene and can see the projection of the labels into each image in real time, adjusting accordingly.

Labels in 3D space

Manual annotation is only one part of the overall annotation system; to obtain annotations faster and better, automatic annotation and simulation are also needed. The automatic annotation system first generates labels from the images of a single camera and then integrates these results using a variety of spatial and temporal cues. Figuratively speaking, the cameras get together and negotiate a consistent label. Beyond the coordination of multiple cameras, multiple Tesla vehicles driving through the same scene can also jointly refine its labels. Of course, GPS and IMU sensors are needed to obtain each vehicle's position and attitude so that the outputs of different vehicles can be spatially aligned. Automatic labeling solves the efficiency problem, but for some rare scenarios, such as the pedestrian running on a highway shown in the talk, a simulator is also needed to generate synthetic data. Together, these techniques form Tesla's complete data collection and annotation system.
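
The spatial-alignment step can be pictured with a small sketch: each vehicle's labels are transformed into a shared world frame using its GPS/IMU pose, after which nearby labels can be merged. The poses, object positions and the simple averaging used below are all hypothetical.

```python
import numpy as np

def to_world(points_vehicle, T_world_from_vehicle):
    """Transform Nx3 object positions from a vehicle's local frame into a
    shared world frame using its GPS/IMU pose (a 4x4 homogeneous matrix)."""
    homo = np.hstack([points_vehicle, np.ones((len(points_vehicle), 1))])
    return (T_world_from_vehicle @ homo.T).T[:, :3]

def pose(x, y, yaw):
    """Planar pose (translation + yaw) as a 4x4 homogeneous matrix."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0, x], [s, c, 0, y], [0, 0, 1, 0], [0, 0, 0, 1]])

# Hypothetical example: two vehicles label the same parked car in their own frames.
car_seen_by_a = np.array([[12.0, 1.0, 0.0]])    # 12 m ahead of vehicle A
car_seen_by_b = np.array([[5.0, -2.0, 0.0]])    # as seen from vehicle B
world_a = to_world(car_seen_by_a, pose(100.0, 50.0, 0.0))
world_b = to_world(car_seen_by_b, pose(107.5, 52.8, 0.05))
print(world_a, world_b)           # two nearby estimates of the same object
print((world_a + world_b) / 2)    # one simple way to merge the two labels
```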

Reprinted from YanZhi intelligent car on Zhihu. The views in this article are shared for exchange only and do not represent the position of this account. For copyright or other issues, please let us know and we will handle them promptly.
