
An analysis of Tesla FSD's on-vehicle perception

More than four months have passed since Tesla AI Day, and its avant-garde concepts and extremely detailed technical disclosures have become a topic and a direction for autonomous driving practitioners around the world. Since then I have rewatched the AI Day videos several times and read many analyses and interpretations in both Chinese and English, and I have long wanted to turn my own understanding into an article, but procrastination kept delaying it. Even after such a long delay, the innovations Tesla showed on AI Day still sit at the forefront of vision-based perception for autonomous driving. I therefore hope this article helps more people understand the cutting edge of today's on-vehicle perception technology.


Launched in October 2020, Tesla FSD Beta remains the world's only urban driver-assistance system that ordinary users can operate on their own

The evolution of Tesla's perception technology

From Tesla's occasional technology sharing, one can see that Tesla overhauls its existing perception stack every so often. Based on the perception schemes Tesla had disclosed as of this writing, I divide this continuous iteration into three important stages: a first innovation represented by HydraNet, a second represented by the BEV layer, and a third represented by the spatio-temporal feature queue module. Each is described in detail below.

HydraNet: a multi-task network with shared features

To make comparison with other work easier, and constrained by limited resources and time, most network models in academic research are designed for a single specific task (object detection, semantic segmentation, instance segmentation, depth estimation, etc.) on a set of public datasets, using an architecture like the figure below: a Backbone and a Neck perform feature extraction, and a Head produces the output required by the specific task.


A network with Backbone, Neck, and Head designed for a specific detection task

However, for a car to understand the traffic environment well enough to drive autonomously on infrastructure built for humans, the AI must master a large number of perception tasks at once (vehicle, pedestrian, and bicycle detection; lane line segmentation; drivable area segmentation; traffic light and sign detection; visual depth estimation; generic obstacle detection; and so on). With limited on-vehicle compute it is impossible to design a separate network for each task. Since the lower layers of the network (Backbone, Neck, etc.) mainly extract general-purpose visual features that are shared across tasks yet still help complete the final perception tasks, industry practitioners generally adopt a multi-head network with a shared feature-extraction trunk as the structure of the perception module. On the one hand, the shared trunk avoids a great deal of repeated, nearly identical feature-extraction computation; on the other hand, each Head can be optimized relatively independently, with the backbone parameters frozen or only lightly updated, to meet the needs of autonomous driving.
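To make the shared-trunk idea concrete, here is a minimal PyTorch sketch of a multi-head network with a shared backbone. The layer sizes and the two example heads are illustrative assumptions, not Tesla's architecture.

```python
import torch
import torch.nn as nn

# A minimal sketch of a shared-trunk, multi-head network.
# The backbone and the two heads are illustrative placeholders.
class MultiTaskNet(nn.Module):
    def __init__(self, num_det_classes=10, num_seg_classes=5):
        super().__init__()
        # Shared trunk: extracts general-purpose features once per image
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Task-specific heads reuse the same features
        self.detection_head = nn.Conv2d(64, num_det_classes + 4, 1)  # class scores + box offsets
        self.segmentation_head = nn.Conv2d(64, num_seg_classes, 1)   # per-pixel class logits

    def forward(self, image):
        feats = self.backbone(image)          # computed once, shared by all heads
        return {
            "detection": self.detection_head(feats),
            "segmentation": self.segmentation_head(feats),
        }

net = MultiTaskNet()
out = net(torch.randn(1, 3, 256, 256))
print({k: v.shape for k, v in out.items()})
```

Because the heads only read the shared feature map, adding a new perception task mostly means adding a new head and training it while the trunk stays largely fixed.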

Tesla used this same technique before the complete code refactor it undertook to develop FSD, and it is the level most front-line autonomous driving companies have reached. According to Andrej Karpathy's technology sharing in the first half of 2020, Tesla had roughly 1,000 Heads covering all kinds of perception tasks, and these Heads could be grouped into more than 50 large perception categories, including but not limited to the following:

Moving Objects: Pedestrian, Cars, Bicycles, Animals, etc.

Static Objects: Road signs, Lane lines, Road Markings, Traffic Lights, Overhead Signs, Cross-Walks, Curbs, etc.

Environment Tags: School Zone, Residential Area, Tunnel, Toll Booth, etc.

Because there are so many Heads, Tesla named its neural network HydraNet, after the many-headed Hydra: one trunk, many heads. In a recent interview with Lex Fridman, however, Elon Musk mentioned HydraNet and made it clear he does not particularly like the name, saying that one day he would give the FSD neural network a new one; we will wait and see. I have described the characteristics of HydraNet in great detail in a previous article, along with some of Tesla's best practices and challenges in training it (see "Tesla Autopilot technology architecture, a further study"), so I will not repeat that here.


The Hydra, a many-headed serpent, is a legendary creature that appears in many myths

BEV Layer: spatial understanding for FSD perception

Autopilot before the FSD release, and today's mainstream autonomous driving perception networks, adopt a shared-backbone multi-head structure similar to HydraNet, but their problem is that all perception happens in Image Space, the 2D pixel world where the camera images live, while all driving decisions and path planning take place in the 3D world around the vehicle. This dimensional mismatch makes it extremely difficult to drive autonomously directly from such perception results.

For example, most new drivers have experienced how hard it is, when first learning to reverse, to intuitively build the spatial relationship between the car and its surroundings; relying on the rear-view mirrors, novices easily misjudge and scrape the car. The root cause is the missing spatial transformation from the mirror's image plane to the vehicle coordinate system.

A self-driving AI that can only perceive in the camera image plane is therefore like a novice driver without spatial understanding, and it is hard for it to drive well. At this point many companies choose sensors with depth-measuring capability, such as millimeter-wave radar and lidar, and combine them with the camera to convert image-plane perception results into the 3D world around the car. In professional terms this 3D world is the ego-vehicle coordinate system; if you ignore elevation, many people also call the flattened ego coordinate system the BEV coordinate system (bird's-eye-view coordinate system).


You don't need to be Superman or Batman; ordinary humans drive cars using only their eyes

However, as Elon Musk has said, humans are neither Superman nor Batman: we cannot see lasers and we are not equipped with radar, yet from the images captured by our eyes we still build a 3D understanding of the surrounding world and drive well. To drive autonomously with eyes (cameras) the way humans do, the system must be able to convert from the 2D image plane into the 3D ego-vehicle space.

Tesla's original scheme did exactly this: using the camera extrinsics and a ground-plane assumption, a method called IPM (Inverse Perspective Mapping) reprojected image-plane perception results back into the vehicle's BEV coordinate system. But when the local road surface violates the planar assumption, and when associating the fields of view of multiple cameras is disturbed by all kinds of complex conditions, this method becomes increasingly hard to maintain.
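As an illustration of the ground-plane reprojection described above, here is a minimal IPM sketch in Python. The intrinsics, camera height, and mounting rotation are made-up placeholder values; a real system would use per-camera calibration.

```python
import numpy as np

# A minimal sketch of IPM (Inverse Perspective Mapping) under the flat-ground
# assumption, for a forward-facing camera mounted 1.5 m above the road.
K = np.array([[1000., 0., 640.],
              [0., 1000., 360.],
              [0., 0., 1.]])                      # camera intrinsics (placeholder)
R = np.array([[0., -1., 0.],                      # ego -> camera rotation
              [0., 0., -1.],                      # (ego frame: x forward, y left, z up)
              [1., 0., 0.]])
cam_height = 1.5
t = -R @ np.array([0., 0., cam_height])           # ego -> camera translation

# For ground points (x, y, 0) in the ego frame, projection reduces to a
# 3x3 homography:  image ~ K @ [r1 | r2 | t] @ [x, y, 1]^T
H = K @ np.column_stack((R[:, 0], R[:, 1], t))
H_inv = np.linalg.inv(H)

def pixel_to_ground(u, v):
    """Map an image pixel to a point on the assumed ground plane (ego frame)."""
    g = H_inv @ np.array([u, v, 1.0])
    return g[:2] / g[2]                            # (x forward, y left) in metres

print(pixel_to_ground(640, 500))  # a pixel below the horizon -> roughly 10.7 m ahead
```

The sketch also makes the failure mode obvious: everything hinges on the flat-ground homography, so a crest, dip, or banked road immediately distorts the reprojected positions.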


Tesla initially used IPM transformation and frame-to-frame matching for multi-camera stitching and BEV conversion

Because of the difficulties with IPM, Andrej Karpathy and his team began experimenting with performing the image-plane-to-BEV spatial transformation directly inside the neural network, which became the most significant difference between the FSD Beta released in October 2020 and the previous Autopilot product.

Because the visual input lives in the 2D camera plane, and the 2D image plane and the ego-vehicle BEV differ in both size and dimension, the key to performing the image-plane-to-BEV transformation inside the neural network is finding an operation that can change the spatial scale of the feature map.

The mainstream neural network operations that can transform spatial scale are mainly two: the fully connected layer of an MLP, and cross attention in a Transformer, as shown in the figure below (image from towardsdatascience.com). Both can change the spatial scale of their input and thus achieve the desired scale transformation.

In the diagram, both the MLP and the Transformer turn a 3-dimensional input into a 2-dimensional output

I originally wanted to use as few formulas as possible, but the BEV transformation is itself a mathematical transformation of scale and is hard to explain clearly without them, so formulas are used here:

Fully Connected Layer (ignoring the nonlinear activation function and the bias):

$$O = W X$$

Cross Attention:

$$O = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V, \qquad K = W_K X,\; V = W_V X$$

Here K and V are obtained by linear transformations of the input X, while Q is obtained by a linear transformation of the index quantity defined on the output space; a slight rearrangement of the formula gives:

$$O = \left[\mathrm{softmax}\!\left(\frac{Q \,(W_K X)^{T}}{\sqrt{d_k}}\right) W_V\right] X = W(X, Q)\, X$$

Performing the BEV transformation is simply the process of converting the feature layer of the input 2D image space into a feature layer in the BEV ego-vehicle coordinates using either of these two operations. Suppose that after multi-layer CNN feature extraction, the scale of the image-space feature layer, i.e. the scale of X here, is the image height times the image width; the BEV space where the output O lives is then a raster (grid) space covering a certain number of metres in front of, behind, and to either side of the ego vehicle, with the ego position as the origin.

Comparing the two, the biggest difference between the fully connected layer and cross attention lies in the coefficient W acting on the input X: the W of a fully connected layer is fixed at inference time once training is complete, whereas with cross attention the coefficient acting on X is a function of both the input X and the query index. Using cross attention for the spatial scale transformation therefore lets the model adapt to the input and the query at inference time, which may give it stronger expressive power. For these reasons, Tesla chose cross attention to implement its BEV layer in practice.
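To make the contrast concrete, the toy PyTorch sketch below maps 100 image-plane positions to 24 output cells twice: once with a fully connected layer, whose mixing weights are fixed after training, and once with cross attention, whose weights are recomputed from the input and a query. All shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 32)          # toy input: 100 image-plane positions, 32-dim features

# (1) Fully connected layer: the mixing weights over the 100 positions are
#     fixed after training, independent of the input.
fc = nn.Linear(100, 24)
o_fc = fc(x.transpose(1, 2)).transpose(1, 2)      # -> (1, 24, 32)

# (2) Cross attention: the mixing weights are recomputed from the input (K)
#     and from a query defined on the output (BEV) space.
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
bev_query = torch.randn(1, 24, 32)   # one query per output cell (toy values)
o_attn, weights = attn(bev_query, x, x)           # -> (1, 24, 32)

print(o_fc.shape, o_attn.shape, weights.shape)    # weights: (1, 24, 100), data-dependent
```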

The Transformer-based BEV network structure presented by Karpathy on AI Day

The Query comes from rasterizing the BEV space as shown above and then giving each raster cell a distinct Positional Encoding

The two figures above show how Tesla implements its Transformer-based BEV Layer. First, a CNN backbone plus BiFPN extracts multi-scale feature layers from each camera. These multi-scale features pass through an MLP to produce the Key and Value required by the Transformer, and in parallel a global pooling over the multi-scale feature map yields a global description vector (the Context Summary in the figure). At the same time, the target output BEV space is rasterized, each BEV raster cell is given a positional encoding, and these encodings are concatenated with the global description vector and passed through an MLP layer to produce the Query required by the Transformer. In the cross-attention operation, the scale of the Query determines the output scale of the BEV layer (i.e. the scale of the BEV raster), while the Key and Value remain in the 2D image coordinate space. Following the Transformer principle, the Query and Key establish, for each BEV raster cell, the influence weights of the 2D image-plane pixels, thereby building the association from the BEV back to the input images; these weights are then used to weight the Values derived from the image-plane features, finally producing the feature map in the BEV coordinate system and completing the mission of the BEV coordinate-conversion layer. Mature perception heads can then operate directly in BEV space on this BEV feature map. Because the perception results in BEV space share the coordinate system used by decision-making and planning, the BEV transformation links perception tightly with the downstream modules.
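The sketch below captures this pipeline in simplified PyTorch form for a single camera and a single feature scale: keys and values from the image features, a query built from per-cell positional embeddings concatenated with a pooled context summary, and cross attention producing a 20x80 BEV feature map. All module choices and shapes are assumptions for illustration, not Tesla's implementation.

```python
import torch
import torch.nn as nn

# A minimal sketch of a BEV cross-attention layer in the spirit of the scheme
# described above: one camera, one feature scale, a 20x80 BEV grid.
class BEVCrossAttention(nn.Module):
    def __init__(self, feat_dim=256, bev_h=20, bev_w=80):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        self.kv_proj = nn.Linear(feat_dim, 2 * feat_dim)          # Key/Value from image features
        self.pos_embed = nn.Parameter(torch.randn(bev_h * bev_w, feat_dim))  # per-cell encoding
        self.query_mlp = nn.Linear(2 * feat_dim, feat_dim)        # query from [pos_embed, context]
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)

    def forward(self, img_feats):                 # img_feats: (B, C, H, W) from backbone + FPN
        b, c, h, w = img_feats.shape
        tokens = img_feats.flatten(2).transpose(1, 2)             # (B, H*W, C)
        k, v = self.kv_proj(tokens).chunk(2, dim=-1)
        context = tokens.mean(dim=1, keepdim=True)                # global "context summary"
        q_in = torch.cat([self.pos_embed.expand(b, -1, -1),
                          context.expand(-1, self.bev_h * self.bev_w, -1)], dim=-1)
        q = self.query_mlp(q_in)                                  # (B, bev_h*bev_w, C)
        bev, _ = self.attn(q, k, v)                               # attend image tokens per BEV cell
        return bev.transpose(1, 2).reshape(b, c, self.bev_h, self.bev_w)

layer = BEVCrossAttention()
bev_feat = layer(torch.randn(1, 256, 12, 40))
print(bev_feat.shape)   # torch.Size([1, 256, 20, 80])
```

In a multi-camera setup the image tokens from all cameras would simply be concatenated before attention, which is exactly what makes the scheme a natural multi-camera fusion framework.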

With the data-driven BEV transform, Tesla's perception outputs directly in the BEV Vector Space, as shown by the plan view in the lower right corner

In this way, the camera parameters and the changing geometry of the road surface are internalized into the network parameters during training. Karpathy also described on AI Day how Tesla handles differences in camera extrinsics: using the calibrated extrinsics, the images collected by every car are converted, through de-distortion, rotation, and resampling, into the same set of virtual standard camera poses, thereby eliminating the small extrinsic differences between individual cars.

Tesla adds a Rectify step before the raw image enters the neural network, using the camera calibration parameters to remove the effect of camera installation error
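Here is a minimal sketch of such a rectification step, assuming per-car calibration is available and using OpenCV's undistort/remap functions; all calibration numbers below are placeholders.

```python
import cv2
import numpy as np

# A minimal sketch of rectifying each car's camera to a shared "virtual
# standard camera". The intrinsics, distortion coefficients, and rotation
# below are made-up placeholders, not real calibration data.
K_car = np.array([[1020., 0., 645.],
                  [0., 1015., 355.],
                  [0., 0., 1.]])                          # this car's calibrated intrinsics
dist_car = np.array([-0.30, 0.09, 0.001, 0.0005, 0.0])    # its distortion coefficients
R_car_to_std = np.eye(3)                                  # rotation correcting mounting error

K_std = np.array([[1000., 0., 640.],
                  [0., 1000., 360.],
                  [0., 0., 1.]])                          # shared virtual standard camera

# Precompute the pixel remapping once per camera, then apply it to every frame
# before the image is fed to the neural network.
map1, map2 = cv2.initUndistortRectifyMap(
    K_car, dist_car, R_car_to_std, K_std, (1280, 720), cv2.CV_16SC2)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)          # stand-in for a camera frame
rectified = cv2.remap(frame, map1, map2, interpolation=cv2.INTER_LINEAR)
```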

The advantage of a BEV network is not only that its perception output can feed decision-making and planning directly; BEV is also a very effective framework for multi-camera fusion. With the BEV scheme, size estimation and tracking of large nearby targets that span several cameras, which are otherwise hard to associate correctly, become more accurate and stable, and the algorithm also becomes more robust to brief occlusion of one or several cameras.

Fusing multi-camera features in BEV significantly improves the accuracy of detecting nearby objects

Spatio-temporal feature queues: adding short-term memory to the driving intelligence

The BEV transform established Tesla FSD as a leader in visual perception, but relying on HydraNet and BEV alone, with only the images from a single instant as input, still loses information continuously. Our sense of speed and space comes from the time dimension. An experienced driver can handle a familiar scene well, but if a video of that scene were broken into shuffled single frames and the driver were asked, for one of those frames, how to turn the wheel and how hard to press the accelerator or brake, even the experienced driver would be stumped. The reason is that a human brain accustomed to processing continuous data cannot form a good spatial picture from a single frame, nor recover time-related information such as the speed and trajectory of surrounding objects, so our driving skills stop working.

In the same way, FSD needs the ability to process continuous spatio-temporal sequences in order to correctly handle the flashing traffic lights common in urban environments, distinguish vehicles temporarily stopped in traffic from vehicles parked at the roadside, estimate the relative speed of surrounding objects with respect to the ego vehicle, predict the likely trajectories of traffic participants from their history, handle temporary occlusion, and remember the speed limit sign it just passed, the travel direction of its lane, and so on. In other words, FSD needs short-term memory.

Problems that an autonomous driving AI lacking short-term memory cannot solve

Exactly how human memory works is still one of the unsolved mysteries of brain science, so AI algorithms have not been able to reproduce human long-term memory. Tesla's hope, however, is to train the network on video clips carrying temporal information instead of single images, giving the perception model a short-term memory; this is achieved by introducing feature queues along the time dimension and the space dimension into the network. Karpathy gave an example of why a spatial feature queue is needed: suppose the car is stopping at a traffic-light intersection and observed the permitted-direction arrows painted on each lane before reaching the stop line. Relying only on the time queue, a very long red light would eventually push out the lane directions observed earlier; with a spatial queue, since the car does not move while the light is red, the memory of the lane directions seen earlier is retained no matter how long the car waits at the intersection.

The time-based queue alone cannot handle waiting at an intersection; a spatial queue must be introduced
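A minimal sketch of the two queue policies follows, assuming a time-triggered push and a distance-triggered push; the exact intervals are assumptions, not values given in the text. It shows why the distance-based queue keeps old observations while the car waits at a red light.

```python
from collections import deque

# Assumed push policies for illustration only.
TIME_STEP_S = 0.027          # assumed: push a feature snapshot every ~27 ms
DIST_STEP_M = 1.0            # assumed: push a feature snapshot every 1 m travelled

time_queue = deque(maxlen=12)    # keeps the last 12 snapshots in time
space_queue = deque(maxlen=12)   # keeps snapshots from the last 12 m of travel

def on_new_frame(bev_features, t_since_last_push, dist_since_last_push):
    """Decide which queues receive the current BEV feature snapshot."""
    if t_since_last_push >= TIME_STEP_S:
        time_queue.append(bev_features)    # keeps rolling even when the car is stopped
    if dist_since_last_push >= DIST_STEP_M:
        space_queue.append(bev_features)   # frozen while stationary, so earlier
                                           # observations are never pushed out
```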

For fusing the temporal information, Tesla tried three mainstream schemes: 3D convolution, Transformer, and RNN. Karpathy said the ego-motion information uses only four dimensions, velocity and acceleration, which can be obtained from the IMU. These are concatenated with the feature map in BEV space (20x80x256) and with a positional encoding to form a 20x80x300x12 feature queue: the third dimension consists of 256 visual feature channels + 4 kinematic channels (vx, vy, ax, ay) + 40 positional-encoding channels, so 300 = 256 + 4 + 40, and the last dimension is the 12 frames of the temporal/spatial queue after downsampling.

Tesla tried three temporal-fusion schemes in parallel: 3D CNN, Transformer, and RNN

The feature queue adds ego kinematics and positional encoding on top of the BEV visual features
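A minimal sketch of assembling the 20x80x300x12 feature queue from the numbers quoted above; broadcasting the four kinematic values over the grid is an assumption for illustration.

```python
import torch

# 256 BEV visual channels + 4 ego-kinematics channels (vx, vy, ax, ay)
# + 40 positional-encoding channels = 300 channels on a 20x80 BEV grid,
# stacked over 12 queue slots.
H, W, T = 20, 80, 12

def make_queue_slot(bev_feat, kinematics, pos_enc):
    # bev_feat: (H, W, 256); kinematics: (4,); pos_enc: (H, W, 40)
    kin = kinematics.view(1, 1, 4).expand(H, W, 4)       # broadcast over the grid
    return torch.cat([bev_feat, kin, pos_enc], dim=-1)   # (H, W, 300)

slots = [make_queue_slot(torch.randn(H, W, 256),
                         torch.randn(4),
                         torch.randn(H, W, 40)) for _ in range(T)]
feature_queue = torch.stack(slots, dim=-1)               # (20, 80, 300, 12)
print(feature_queue.shape)
```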

3D convolution, Transformer, and RNN can all process sequence information; each has strengths and weaknesses on different tasks, but most of the time the choice makes little practical difference. On AI Day, however, Karpathy also shared a simple, effective, and very interpretable scheme called the Spatial RNN. Unlike the three methods above, the Spatial RNN processes the sequence serially, as RNNs naturally do, so the order between frames is preserved and the BEV visual features can be fed into the RNN directly without positional encoding. The input here therefore consists only of the 20x80x256 BEV visual feature map and the 1x1x4 ego-motion information.

Using an LSTM as an example to implement the Spatial RNN module

In CNNs, "spatial" features usually refer to the width and height dimensions of the image plane; the "spatial" in Spatial RNN instead refers to the two dimensions of a local coordinate system similar to the BEV coordinates at a given moment. An LSTM is used here for illustration because its high interpretability makes it the easiest example to understand. The defining property of an LSTM is that its Hidden State retains an encoding of the previous N time steps (i.e. short-term memory), and at the current step the input together with the Hidden State determines which part of the memory to use, which part to forget, and so on. In the Spatial RNN, the Hidden State is a rectangular raster area of size WxHxC that is larger than the BEV grid (see the figure above, where WxH is larger than the 20x80 BEV size), and the ego kinematics determine which part of the gridded Hidden State the current BEV features affect. The continuous stream of BEV data thus keeps updating this large rectangular Hidden State, with the position of each update following the ego vehicle's motion. After successive updates, the Hidden State forms a feature map resembling a local map, as shown in the figure below.

Dynamic update process of the local feature layers extracted by the Spatial RNN within the same Hidden State

BEV perception combined with the feature queues allows local maps to be generated end to end
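Below is a minimal sketch of the Spatial RNN idea, with a simplified convolutional update gate standing in for a ConvLSTM cell; the hidden-state size, the update rule, and the ego-position indexing are assumptions for illustration, not Tesla's implementation.

```python
import torch
import torch.nn as nn

# The hidden state is a grid larger than the 20x80 BEV patch; each new BEV
# feature map is written into the window of the hidden state corresponding
# to the current ego position.
class SpatialRNN(nn.Module):
    def __init__(self, channels=256, map_h=100, map_w=200, bev_h=20, bev_w=80):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        self.register_buffer("hidden", torch.zeros(1, channels, map_h, map_w))
        self.update_gate = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.candidate = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, bev_feat, ego_row, ego_col):
        # bev_feat: (1, C, 20, 80); (ego_row, ego_col): top-left of the window
        # in hidden-state cells, derived from integrated ego motion.
        window = self.hidden[:, :, ego_row:ego_row + self.bev_h,
                                   ego_col:ego_col + self.bev_w]
        x = torch.cat([window, bev_feat], dim=1)
        z = torch.sigmoid(self.update_gate(x))           # how much to overwrite
        h_new = (1 - z) * window + z * torch.tanh(self.candidate(x))
        self.hidden[:, :, ego_row:ego_row + self.bev_h,
                          ego_col:ego_col + self.bev_w] = h_new.detach()
        return self.hidden                               # local-map-like feature layer

rnn = SpatialRNN()
local_map = rnn(torch.randn(1, 256, 20, 80), ego_row=40, ego_col=60)
print(local_map.shape)    # torch.Size([1, 256, 100, 200])
```

Because only the window under the ego vehicle is updated, the cells the car has already driven past keep their values, which is what gives the hidden state its local-map character.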

The feature queues give the network perception results that are consistent across frames, and combined with BEV, FSD gains the ability to selectively read from and write to a local map, which helps with blind spots and occlusions of the field of view. It is precisely this real-time local map construction that allows FSD to drive in cities without relying on high-definition maps.

A Ukrainian hacker enabled FSD in the center of Kyiv by cracking it, showing that FSD can drive in a city it is visiting for the first time

One More Thing: Photon to Control, Tesla's secret weapon to replace ISPs

AI Day described Tesla FSD's perception stack in great detail, but its time was limited, and it certainly could not cover every perception technique FSD uses. If you follow the FSD Beta program, you will find that the Release Notes for each version contain descriptions of features that are not well understood, such as the one below:


A photon-to-control item appears in the FSD Release Note

Literally, photon-to-control means going from photons to vehicle control, but its specific meaning leaves even practitioners in the autonomous driving industry like me at a loss. No other autonomous driving company seems to have used anything similar, and the exact meaning of the technology was unknown.

Fortunately, some time ago Lex Fridman, a podcast host I like very much, did a two-and-a-half-hour interview with Elon Musk, in which Elon talked about photon-to-control and finally brought this Tesla-specific visual processing technique to the surface.

It turns out that most autonomous driving companies use an ISP (Image Signal Processing) module to perform white balance, dynamic range adjustment, filtering, and similar operations on the raw sensor data in order to obtain the best-quality image, and then send the adjusted image to the perception module. This step matters because the human eye is a sophisticated visual sensor that dynamically adapts to the surrounding environment, adjusting exposure to cope with changes in brightness, adjusting focus for objects at different distances, and so on; the ISP is what brings camera behaviour as close as possible to the human eye. But the ISP's original purpose is to produce a good-looking picture under changing external conditions, and there is still no consensus on what form of image autonomous driving actually needs most. That leaves the ISP without clear guidance when processing image data; more often than not its processing can only be tuned by human judgment.

Musk, however, believes that the most primitive form of the data collected by the camera is the CMOS sensor's count of the visible photons arriving through the lens, and that whatever processing the ISP applies, it inevitably discards part of the original image information. So at some point Tesla began experimenting with feeding the image formed from the raw photon counts directly into the neural network, which not only saves the compute the ISP would require and reduces latency, but also preserves the sensor's information to the greatest extent. According to Musk, even on a very dark night a pedestrian's reflections cause small changes in the raw photon-count image, and the end result is that a perception system taking photon counts as input not only responds faster with lower latency but also achieves a night-time visual range far beyond the human eye, as powerful as science fiction.

It all sounds like science fiction at first, but the appearance of a photon-to-control update in Tesla FSD's official Release Notes leaves little doubt that this technology has made it into a product feature. Of course, one can imagine that such raw photon data may be hard to label with traditional methods, so it may not completely replace the ISP in the short term, but Tesla's attempts in these cutting-edge directions can still be a great inspiration for other practitioners in the industry.
