Horizon Liu Jingchu: God's Perspective and Imagination – A New Paradigm of Autonomous Driving Perception

On March 28th, the "Horizon Autonomous Driving Technology Session" of the Zhidong Open Class concluded successfully. Dr. Liu Jingchu, architect of Horizon's autonomous driving system, gave a live lecture titled "God's Perspective and Imagination: A New Paradigm of Autonomous Driving Perception".

Starting from the new requirements that the evolution of autonomous driving architecture places on algorithms, Dr. Liu Jingchu gave an in-depth explanation of the new perception paradigm under Software 2.0, the eighteen martial arts of BEV perception, and the development of BEV perception with end-cloud integration.

The session was divided into two parts, the main lecture and a Q&A. The following is a review of the main lecture:

Hello everyone, I am Liu Jingchu, an architecture engineer for Horizon's autonomous driving system, and I am very happy to share some of Horizon's thinking and practice in the field of perception. This course is divided into the following four parts:

1. The evolution of autonomous driving architecture places new requirements on algorithms

2. A new perception paradigm under Software 2.0

3. The eighteen martial arts of BEV perception

4. BEV perception development with end-cloud integration

Figure 1

Let's start with a simple diagram to share Horizon's understanding of autonomous driving. The goals of autonomous driving are fairly simple; there are usually three: safety, comfort, and efficiency. Safety generally comes first, because no one wants an accident on autopilot; the system is designed to avoid accidents. Beyond safety come comfort and efficiency, and different vehicle types trade these two off differently. For a passenger car in daily life, comfort matters more, because few people want to be pushed back into their seats repeatedly in everyday traffic. Finally there is efficiency: no one wants a ride that is comfortable but crawls down the street at 5 miles per hour. The point is to get us to our destination as quickly as possible, on the premise that safety and comfort are satisfied.

In some other scenarios, the order of comfort and efficiency may be swapped. For example, for trucks carrying goods with no fragile items on board, comfort can be relaxed somewhat: as long as the cargo is not damaged, hard braking or acceleration is acceptable, and the priority of efficiency can be raised.

In my personal understanding, the ultimate system-level goals of autonomous driving are safety, comfort, and efficiency. To achieve them, autonomous driving has been under development for decades, and the typical system pipeline has not changed much. In a typical pipeline, the upstream is the sensors, analogous to the human senses such as eyes, ears, and touch. Downstream of that, the brain processes the raw sensor information and extracts meaningful, abstract, and condensed information from it, which is generally called environment perception.

Further down is localization and mapping. A driver unfamiliar with a road can easily get lost, so in many mainstream autonomous driving solutions the map, or high-definition map, is an important component. It essentially acts as an offline sensor: beyond-line-of-sight information that the human eye cannot see and the ear cannot hear is pre-processed in advance into a data structure. When driving online, the system recalls this information via localization and places it in the current local coordinate system, so it can see farther and more completely and obtain better perception results. Both environment perception and localization/mapping serve perception, but one is online and the other is offline, and different companies draw the boundary between these two modules differently.

Further downstream is decision and planning. It is generally divided into a decision module and a planning module, or, more finely, into four modules: decision, prediction, planning, and control. The core problem is how to handle the dynamic game with other traffic participants in the scene. The reason the "playing chess" icon is placed on the decision and planning module in Figure 1 is that the internal processes of decision and planning, the two most important technical modules here, have many similarities with chess, and many methods used to solve chess problems can, to some extent, also be applied to decision and planning.

Finally, there is control and execution. This part is about how to transmit the computed throttle, brake, steering, and other commands to the vehicle body so that the car executes them.

The middle three links are more algorithmic, so they are shown in darker colors, while the hardware peripherals at both ends are determined by the car's hardware architecture and are shown in lighter colors. Although this pipeline has gradually taken shape over recent decades, there has been a clear evolutionary trend in recent years in how software and hardware are divided to implement these modules.

Looking at the box at the bottom of Figure 1, the upper arrow expresses the mainstream division of the past, with the vertical bars corresponding to the pipeline above. Take the sensor: in the conventional sense, a sensor is a pure hardware peripheral, and even part of the perception work, or some very front-end signal processing, is completed by hardware peripherals. Hardware peripherals are characterized by fast computation and excellent power efficiency, but they are relatively inflexible.

1

The evolution of autonomous driving architecture places new requirements on algorithms

When we first started doing environment perception and talked about AI, most of the discussion was about AI applications in environment perception. To make AI algorithms run faster and better, they generally do not run on hardware peripherals, nor on the CPU: the former is not programmable enough and the latter is not energy-efficient enough. Instead, a dedicated hardware accelerator is designed, which is also Horizon's main focus.

A large area after that is the software on the CPU, which includes the underlying software and middleware at the bottom, a large number of logic algorithms on top, and some traditional mathematical optimization methods. Because this code requires high flexibility, AI accelerators support it relatively poorly, so the traditional approach runs this software on the CPU. Some practices also put it on the GPU, since the GPU is basically a general-purpose computing device for many mathematical operations.

Finally, control and execution generally runs on hardware peripherals, which is currently the mainstream approach. As you can see, although autonomous driving and AI are strongly related, in practical applications the mainstream approach lets AI algorithms play a real role only within the narrow scope of environment perception.

In recent years there has been a big change in this trend, as shown by the lowermost horizontal arrow in Figure 1, which represents a more complex but more interesting approach. The hardware peripherals at both ends are still hardware peripherals, but on the sensor side the boundary of the hardware peripheral has receded toward the front end. Many signals that used to be computed by ISP devices can now be sent as raw sensor information directly to the AI algorithm, skipping some of the dedicated hardware. At the same time, downstream of perception, a large number of rules that assist perception, that is, perception post-processing, can also be removed. So AI algorithms can basically cover environment perception and the step right after it.

Localization and mapping is a somewhat different case. It contains a lot of optimization, so many of its core problems can be solved with software on the CPU, and mapping has relatively mature automation tools in the cloud, so it does not necessarily rely on AI algorithms to get good results. Therefore, most of what localization and mapping contributes on the vehicle side still runs on the general-purpose computing power of the CPU. That said, some internal practice has found that AI algorithms work well for fusing localization, maps, and perception.

Then comes the decision and planning module. In the past, a lot of decision and planning content was hand-written rules, if-else statements or a huge decision tree in which decisions were made. Planning can be framed as an optimization problem, and the final behavior is often determined by setting boundary conditions and cost functions through rules. Recently there has also been a trend to dissolve the upper part of decision and planning, and even the boundary between decision and planning: the core initial-solution-selection problem is handled by AI algorithms, including imitation learning and reinforcement learning, and the remaining part is handed to conventional software, achieving better collaboration in this way.

Finally, there is not much change in control and execution, which remains mainly hardware peripherals. The big trend visible above is that AI algorithms occupy a larger and larger share of the whole pipeline. The person who has expressed this best is Tesla's Andrej Karpathy, who calls it Software 2.0. Software 2.0 is gradually replacing Software 1.0 in the system. So what is Software 2.0? I think the most accurate definition is using neural networks to implement functions that were originally implemented with rules and logic. There are also broader definitions that count machine learning and statistical learning models as Software 2.0, but its core feature is that the construction of the algorithm itself depends partly on people building the architecture and skeleton, and more importantly on training its performance with data. Simply put, the evolution of software-hardware collaboration is about making better use of Software 2.0 to serve the entire autonomous driving system.

Figure 2

The above is about the whole autonomous driving system; today's topic is perception. Perception is the upstream of the entire autonomous driving system. If you review the mainstream way perception systems are built, you can picture it as the leftmost part of Figure 2: to guarantee sufficiently high safety, the autonomous driving system carries many sensors. Each sensor's output is typically sent to a neural network after simple signal processing. Each sensor modality produces its own neural network output; for example, the lower half of the leftmost part of Figure 2 shows the scene seen by six cameras, and a corresponding neural network processes each video input separately to obtain per-camera outputs.

The middle part of Figure 2 is represented by semantic segmentation output, but in fact the per-camera neural network is generally a multi-task model; besides semantic segmentation there are also tasks such as object detection and depth estimation. Their main characteristic is that each network's output lives in the original sensor space: with image input, the output is also in image space. The information they give, such as semantic segmentation and 2D object detection, however good, can only be an intermediate result, because an autonomous driving system ultimately plans in 3D space. So the intermediate results in the original sensor space must first be transferred into 3D space through rules, and at the same time many different intermediate results must be combined and stitched together into a form the downstream can use.

Because there are many different sensor inputs but the downstream needs a full 360-degree output, some fusion has to be done to combine the outputs of the sensors. The fusion process contains many rules: filtering, deciding which outputs take part in the fusion and how it is done, how to weight the information from different paths, and so on. The biggest problem in mass-production practice is that this conversion, stitching, and fusion is rule-based rather than something that can be learned on its own. Rules mean that when mistakes happen, someone has to change them; rules do not change by themselves.

When an autonomous driving system is deployed in the physical world, it runs into all kinds of problems, that is, the long-tail problem. The long-tail problem means that when you think you have solved 90 percent of the problems, there may be another 90 percent still waiting for you, and it takes a lot of engineers writing rules. Take a decision tree: if one person writes it, after a month they will have forgotten the branches written earlier; if 100 people write one decision tree at the same time, the software stack eventually becomes very complicated.

Let's take a very simple example: parking. When you learn to drive you find that parking is a particularly complex process, because it relies on observations from multiple sensors: you need to check the left mirror, the right mirror, the center mirror, and the reversing camera, and some cars do not even have a reversing camera. Although there are sensor observations, they do not directly represent the car's position in 3D space, which is why many people dislike driving: it is not intuitive. In driving-school tutorials you have to pay attention to a variety of rules, such as where the left line sits in the mirror and where the right line sits, so as to implicitly work out where the car is in 3D space.

Figure 3

Many cars now have a surround-view function, in which the final stitched image gives 360 degrees of scene information around the vehicle. Take parking into a spot: with a God's-eye view that directly shows the surrounding space, parking becomes very simple. Why? Because the human brain no longer has to observe and stitch; it gets the God's-eye view directly and then does decision-making and planning. Returning to the autonomous driving system, the God's-eye view greatly simplifies and helps the system, which is also the first requirement that the evolution of autonomous driving perception architecture puts forward to us: directly output the information the downstream needs from the God's-eye view. This is abbreviated as BEV, the bird's-eye view, which is similar to the God's-eye view.

Figure 4

Compared with the traditional perception algorithm, BEV has the same input: information from multiple sensors, each of which is first processed by its own neural network. The signal modality of each raw sensor basically corresponds to one of the feature maps in the middle of Figure 4. The difference is that a neural network is added in the middle that learns how to automatically combine and fuse the outputs of the single-sensor networks, and finally directly outputs, in 3D space, the form the downstream planning and control needs.

For example, Figure 4 shows a T-shaped intersection, where the lane lines and zebra crossings are, and this information can be used for downstream planning and control after only simple structuring. Its biggest characteristic is that the perception system is one large neural network, which means that as long as there are data and ground truth, you can train it, and it automatically learns how to get the desired final output from the raw sensor input, without needing a large number of rules. This fully reflects the realization of Software 2.0 in the field of perception.

Although many companies in the industry are talking about BEV perception, especially vision-based BEV perception, many people think the difficulty lies in neural network architecture design. Horizon believes that if you rank the difficulties, neural network architecture design is relatively the least difficult. There are two other difficulties. The first is deciding what to output once you have such an architecture, because we want to replace some of the outputs of the previous perception post-processing with outputs of the perception network itself. What exactly should the output contain so that it can be used well downstream? This requires exploring what a complete set of perception tasks looks like, which is an architecture design problem at the system level.

The second, and most difficult, point is how to get ground truth for the data. Once the data are available, supervised training needs ground truth so that the neural network knows where to go. Generating ground truth is not like doing thesis research at school on relatively clean datasets; in the world of mass production, you need the ability to generate ground truth, and its production must be fast enough and accurate enough.

These challenges need to be broken through one by one to realize the BEV perception paradigm under the general direction of Software 2.0.

2

A new perception paradigm under Software 2.0

The following sections discuss Horizon's thinking and practice in this regard.

Figure 5

The first issue is architecture: how do you design it? Many students who work on algorithms know there are many ways to derive BEV, but today's discussion is about how to do BEV perception at a relatively abstract, structural level. On the far left of Figure 5 are the raw sensor inputs, mainly cameras, but also other types of input. The first step of BEV perception is a single-input frontend network, which processes the input of one sensor, does not involve interaction or coupling with other sensors, and is a separate neural network.

For many sensors, such as cameras, the frontend network can be reused; it only does preliminary information extraction. The more important architectural design lies in how to fuse multiple sources of information. This is mainly divided into three steps. First, inputs of the same modality, such as multiple visual signals, are aligned. Second, alignment is also needed across modalities, because different types of sensors output raw signals in different forms, so the signals must be brought into the BEV space before they can be fused. Third, a neural network is designed to fuse this aligned information in the spatial and temporal dimensions; after fusion, the feature map can be sent to downstream perception tasks, supporting a wide variety of task types. A rough sketch of this structure is shown below.
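
As a rough illustration of this abstract structure (not Horizon's implementation), the following minimal PyTorch sketch wires together a shared per-camera frontend, a placeholder alignment step, convolutional spatial fusion, convolutional temporal fusion, and one example task head. All module names, shapes, and hyper-parameters are assumptions made for the sketch.

```python
# Minimal sketch of the abstract BEV pipeline described above (illustrative
# assumptions throughout; not Horizon's actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class BEVPerception(nn.Module):
    def __init__(self, feat_ch=64, bev_size=(200, 200), num_classes=5):
        super().__init__()
        # Shared frontend: per-camera feature extraction, no cross-sensor coupling.
        self.frontend = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Spatial fusion over the aligned BEV features (simplest: convolution).
        self.spatial_fusion = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)
        # Temporal fusion over a short history of fused BEV frames.
        self.temporal_fusion = nn.Conv3d(feat_ch, feat_ch, (3, 3, 3), padding=(0, 1, 1))
        # Example task head: per-cell semantic segmentation in BEV space.
        self.seg_head = nn.Conv2d(feat_ch, num_classes, 1)
        self.bev_size = bev_size

    def align_to_bev(self, cam_feats):
        # Placeholder for the alignment step (IPM or a learned projection):
        # here each camera feature map is simply resampled onto the BEV grid
        # and summed, standing in for a real geometric alignment.
        bev = [F.interpolate(f, size=self.bev_size, mode="bilinear",
                             align_corners=False) for f in cam_feats]
        return torch.stack(bev, dim=0).sum(dim=0)

    def forward(self, frames):
        # frames: list over time (>= 3 steps) of lists over cameras of
        # (B, 3, H, W) image tensors.
        bev_history = []
        for cams in frames:
            cam_feats = [self.frontend(img) for img in cams]   # per-sensor frontend
            bev = self.align_to_bev(cam_feats)                 # align to BEV grid
            bev_history.append(self.spatial_fusion(bev))       # spatial fusion
        seq = torch.stack(bev_history, dim=2)                  # (B, C, T, H, W)
        fused = self.temporal_fusion(seq)[:, :, -1]            # temporal fusion
        return self.seg_head(fused)                            # BEV task head
```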

So what kind of preliminary attempts has Horizon made so far?

Figure 6

The simplest is a fixed fusion network, which was also the earliest BEV scheme. Its advantage is that everything in it has been seen before: there is no particularly novel architecture, but it is simple, robust, and easy to use, so it serves as a starting point for the several levels mentioned above. On the left are the alignment, temporal fusion, spatial fusion, and final task heads just described; Figure 6 takes cameras as the example, and lidar may have some different variants, but the basic idea is the same.

For aligning the cameras, the simplest approach is to project the image input into BEV through a perspective transform. This projection carries many assumptions, such as a flat ground and limited vehicle jitter, and it can only express content on the ground, not content above it. The good part is that content on the ground can be mapped to a fairly reasonable spatial position, so it can be used for spatial alignment.

After the spatial information is aligned into BEV space with 3D coordinates, spatial fusion can be done with a neural network; the easiest way is a convolutional network. The next step is temporal fusion. The most important thing in temporal fusion is how to select information over time: the camera frame rate is high, and obviously not all frames can be processed, so some selection and sampling are needed. The easiest way is to build a time-based queue over the inputs, a sliding window sampled at equal intervals that holds the most recent information. The temporal fusion network processes this sliding window directly to obtain the fused result. Finally, the fused feature is passed to the final perception tasks. This is the fixed architecture; a simple sketch of the alignment and the time-based queue follows.
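
To make the fixed scheme concrete, here is a minimal NumPy sketch of a flat-ground IPM lookup plus a time-based sliding-window queue. The camera parameters, grid definitions, and thresholds are illustrative assumptions, not Horizon's implementation.

```python
# Minimal sketch of flat-ground IPM alignment plus a time-based sliding-window
# queue (illustrative assumptions; not Horizon's implementation).
from collections import deque

import numpy as np


def ipm_pixels(K, R, t, bev_xy):
    """Project ground-plane points (x, y, z=0) into image pixel coordinates.

    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    bev_xy: (N, 2) ground coordinates of BEV grid cells.
    Assumes a flat ground plane and negligible vehicle pitch/roll.
    """
    # Planar homography for z = 0:  p ~ K [r1 r2 t] [x y 1]^T
    H = K @ np.column_stack([R[:, 0], R[:, 1], t])
    pts = np.column_stack([bev_xy, np.ones(len(bev_xy))])
    pix = (H @ pts.T).T
    return pix[:, :2] / pix[:, 2:3]


def warp_to_bev(image, K, R, t, grid_x, grid_y):
    """Sample the image at the pixel that each BEV grid cell projects to."""
    xs, ys = np.meshgrid(grid_x, grid_y)
    pix = ipm_pixels(K, R, t, np.column_stack([xs.ravel(), ys.ravel()]))
    u = np.clip(np.round(pix[:, 0]).astype(int), 0, image.shape[1] - 1)
    v = np.clip(np.round(pix[:, 1]).astype(int), 0, image.shape[0] - 1)
    return image[v, u].reshape(ys.shape + image.shape[2:])


# Time-based queue: a sliding window of the most recent BEV frames,
# sampled at (roughly) equal time intervals.
bev_queue = deque(maxlen=8)


def on_new_frame(bev_frame, timestamp, min_dt=0.1):
    if not bev_queue or timestamp - bev_queue[-1][0] >= min_dt:
        bev_queue.append((timestamp, bev_frame))
```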

Figure 7

Horizon quickly moved on to the next step. First, when doing cross-camera alignment, the assumptions mentioned above are easily violated, so how can this be improved? The hope is to add some adaptive components to the spatial alignment; for example, if changes in the vehicle's extrinsics can be estimated when mapping into space, a very shaky IPM projection can be made relatively stable. The same applies to time: if strict time synchronization cannot be required, the exposure times of different cameras can also be aligned temporally.

After this step there is a well-aligned spatial feature map, which is then passed through a fusion network. On the temporal side there is an approach inspired by Tesla's scheme. The earlier queue was sampled at equal time intervals, but in many real-world scenarios important events do not occur at equal intervals: they are sparse, may appear suddenly, or may not appear for a long time. To handle this, in addition to the time-based queue there is a distance-based queue, which enqueues information at fixed spatial intervals according to the vehicle's odometer, achieving an information queue at two different scales. A sketch of the distance-based queue is shown below.
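
A minimal sketch of such a distance-based queue, assuming the ego position is available from the odometer; the spatial interval and queue length are illustrative.

```python
# Minimal sketch of a distance-based queue driven by odometry
# (thresholds are illustrative assumptions).
from collections import deque

import numpy as np


class DistanceQueue:
    def __init__(self, interval_m=1.0, maxlen=16):
        self.interval_m = interval_m        # enqueue roughly every 1 m of travel
        self.queue = deque(maxlen=maxlen)   # fixed-size spatial window
        self.last_pos = None

    def push(self, bev_frame, odom_xy):
        """odom_xy: current (x, y) ego position from the odometer."""
        pos = np.asarray(odom_xy, dtype=float)
        if self.last_pos is None or np.linalg.norm(pos - self.last_pos) >= self.interval_m:
            self.queue.append((pos.copy(), bev_frame))
            self.last_pos = pos
        # Frames arriving before the vehicle has moved far enough are skipped,
        # so a stopped vehicle does not flood the queue with identical views.
```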

Finally, for temporal fusion, besides the simplest convolutional network, recurrent neural networks (RNNs) can also be used. Tesla's proposed scheme is called a Spatial RNN, which is more complex than a traditional RNN: a traditional RNN only has a current memory state that must be updated continuously at every step, whereas the Spatial RNN's memory corresponds to spatial positions, and it can only recur once the spatially aligned upstream information can be matched across time.

Figure 8

After quickly putting this scheme into practice, Horizon did not stop, and is still trying newer networks with a higher ceiling. Previously, the approach was always to align first and then fuse, or to build the queue first and then fuse. Can the two steps of alignment and fusion be integrated further? These two steps are separated by human-made rules; if a single neural network can do both together, perhaps the upper limit of the network will be higher.

Correspondingly, for spatial fusion, inspired by Tesla and some other work, you can see the Transformer approach at the bottom of Figure 8, which computes the cross-correlation between the original image pixels and the final BEV space. Learning these cross-associations is itself a kind of alignment, because it learns the association from original pixels to BEV pixels, and also learns the strength of the association, and that strength itself expresses a weighting tendency in the fusion. So through the Transformer structure, alignment and spatial fusion can be merged; a rough sketch is given below.
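
As an illustration of this idea, here is a minimal PyTorch sketch in which a grid of learnable BEV queries cross-attends to flattened multi-camera features, so that alignment and spatial fusion are learned jointly. The dimensions and module choices are assumptions made for the sketch, not Horizon's or Tesla's actual design.

```python
# Minimal sketch of Transformer-style spatial fusion: learnable BEV queries
# cross-attend to flattened multi-camera features, so alignment and fusion
# are learned jointly (dimensions are illustrative assumptions).
import torch
import torch.nn as nn


class BEVCrossAttention(nn.Module):
    def __init__(self, dim=64, bev_h=50, bev_w=50, num_heads=4):
        super().__init__()
        # One learnable query per BEV grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bev_h, self.bev_w = bev_h, bev_w

    def forward(self, cam_feats):
        # cam_feats: (B, num_cams, C, H, W) per-camera feature maps,
        # where C must equal the query dimension `dim`.
        b, n, c, h, w = cam_feats.shape
        kv = cam_feats.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        # The attention weights play the role of both the pixel-to-BEV
        # alignment and the fusion weighting described above.
        bev, _ = self.attn(q, kv, kv)
        return bev.transpose(1, 2).reshape(b, c, self.bev_h, self.bev_w)
```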

There is a similar move on the temporal side. In the past there was a fixed queue written down by rules, and at each step the neural network processed all, or the latest, samples in the queue. Going a step further, can the neural network learn how to enqueue, how to dequeue, and when to select which information from the queue to fuse? The relevant work here is called Memory Networks, a neural network structure that learns on its own how to store, read, and fuse information.

You can see that under the Software 2.0 paradigm, even the BEV architecture allows many ways to play. To try these variations and move them from the lab to the car as soon as possible, you need strong support from the chip toolchain. Nobody wants to train a good model on the GPU and then find there is no toolchain support when moving it onto the chip for deployment. Doing all kinds of fixed-point quantization on the chip by hand is painful for developers, so a toolchain is needed to automate the process smoothly.

Horizon has some achievements here. Horizon's toolchain is not only used by internal engineers; we also hope to enable our customers, both inside and outside the automotive industry, to go through the whole process of training, quantizing, and deploying neural networks smoothly as one chain, so that the BEV scheme can evolve quickly, iterate, and expand to a variety of tasks.

3

The eighteen martial arts of BEV perception

Among the challenges, I personally think the simplest is designing the neural network architecture. Slightly more difficult is designing a set of perception tasks that is neither redundant nor leaky, whose outputs the downstream can use without gaps. Non-redundant means it is a very refined set that does not repeatedly output content for the downstream to choose from; non-leaky means there is nothing the downstream needs that the perception tasks cannot provide. The previous perception stack had exactly this problem: in practice, all sorts of odd tasks kept being added to patch the omissions of the earlier task design, which in turn introduces many fusion-related problems, because the more outputs the upstream gives, the more choices the fusion has to face. Ideally no choice is needed at all: the upstream should provide the most complete, non-redundant, and non-leaky input.

Figure 9

Horizon's understanding of this can be divided into the following levels: at the low level, expressing the constraints of the physical world; at the middle, semantic level, focusing on extracting logical entities from the world; and finally, obtaining a structural level of understanding from those entities, including concepts, associations, and behaviors, and supporting downstream tasks through these different levels of content.

At the same time, this architecture can also feed the upstream directly to a capable downstream. If the downstream is itself a neural network, then communication between neural networks does not need to go second-hand through these designed tasks; they can communicate directly through the implicit language of neural network feature maps.

Figure 10

Next, let's look at some examples, starting with semantic perception. Dividing it into static and dynamic, static mainly means static information in the environment, information on the ground such as lane lines, curbs, ground markings, and stop lines; these are the most basic static constraints on driving. In the upper left corner of Figure 10 you can see that, based on the BEV architecture, the BEV output on the right is obtained from the input of six perception cameras. Besides the results of lane line and curb segmentation, more tasks can be added, including detecting the logical structure of sidewalks and intersections and detecting road markings.

Besides static, dynamic is also important, because most road users in the world are dynamic, so dynamic perception is also required. The right half of Figure 10 shows vehicle detection output directly in BEV based on the input of the six cameras; the green boxes in the images are the BEV results projected back into image space. In this scene, a row of cars is parked on the side of the road, and you can see the corresponding detections laid out in BEV space. The above is semantic perception: extracting from the world the semantic information that needs to be known.

The output of the semantic layer is generally what the downstream most wants to use, because it is relatively simple and structured; much of the information in high-definition maps is semantic-layer information. Why do people like to use radar? Because radar delivers the semantic layer of dynamic perception very directly: a reflection point on a car corresponds directly to a box. But in the real world, semantic-layer information may not be enough. Take parking as an example: if you drive a camper van on a trip and arrive at an RV camp, you will find that the camp has no marked spaces; the actual parking spots are determined by how others have parked. Since RVs have different sizes and parking positions, it is hard to decide based on semantic lines on the ground; instead you rely on very low-level constraints, namely that wherever others have parked is what must be avoided, to decide where to park. So in a messy, unstructured world, you need a certain understanding of the underlying physical logic: even if you understand nothing about the world's semantics, you at least know what must not be crashed into. This is the job of low-level vision under BEV.

Figure 11

The left side of Figure 11 shows the effect of low-level static perception; it is the result of one frame on the road surface. The blue area in the picture represents the ground, whose height is relatively low, while the red parts represent greater height. There are also some scattered items, small bumps indicating slightly raised objects on the ground; such an object might be a pizza box or a concrete block, and the driver will think about whether to drive over it and make the corresponding choice. There is no need to understand what it is; it is enough to know that something is there and not to drive over it, and the car stays safe.

The right side of Figure 11 is the dynamic counterpart, an extension of the concept of optical flow. Many students have heard of optical flow: it is the displacement of pixels between two image frames. In BEV space it has a very intuitive embodiment: through low-level physical perception you can directly obtain the speed at which objects in the world are moving. The figure on the right shows, through the BEV network, the speed and direction of moving objects: different colors represent different moving objects, yellow means moving downward and purple upward in the figure, and the shade also encodes speed, with faster motion shown in darker colors. In this way, low-level perception not only covers the static information completely but also provides speed information, and with speed information the vehicle can maneuver around the surrounding scene better. A small sketch of turning BEV flow into metric velocity is given below.
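
As a small worked example of this idea, the sketch below converts per-cell BEV flow (the displacement between consecutive BEV frames) into metric speed and heading; the grid resolution and frame interval are assumptions.

```python
# Minimal sketch: convert per-cell BEV flow into metric speed and heading
# (grid resolution and frame interval are illustrative assumptions).
import numpy as np


def bev_flow_to_velocity(flow, cell_size_m=0.2, dt=0.1):
    """flow: (H, W, 2) displacement of each BEV cell, in cells, between frames.

    Returns per-cell speed (m/s) and heading (rad), which downstream planning
    can use directly without a separate tracking stage.
    """
    disp_m = flow * cell_size_m                     # cells -> metres
    speed = np.linalg.norm(disp_m, axis=-1) / dt    # metres per second
    heading = np.arctan2(disp_m[..., 1], disp_m[..., 0])
    return speed, heading


# Example: a cell displaced by (2, 0) cells between 10 Hz frames at 0.2 m/cell
# has moved 0.4 m in 0.1 s, i.e. 4 m/s along +x.
```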

Is low-level perception enough? This brings in the other theme of this sharing: the importance of imagination to perception. My personal view is that imagination is the most important link in next-generation perception. Why is it so important?

Figure 12

Take parking as an example. Since I got the surround-view image, I have tried driving in an underground garage looking only at the surround view, and found that I could only go particularly slowly; it was not even as easy as looking at the original cameras. Why? Because in BEV space, traditional perception only sees visible content; invisible content cannot be perceived.

But what if perception could give more information? For example, perception could tell the driver: I see a pillar, and based on some prior knowledge I also see a stretch of road extending from behind it, so there is probably another drivable area back there; in effect, the drivable area can extend behind the pillar. It could even tell the driver the future driving trajectory in this scene, that the next move is to go right, without the driver having to consider whether the right side is passable: the perception network directly says you can go right. This represents a great paradigm shift.

In the past, perception perceived visible content; if perception can perceive unseen content, or content that may soon be seen, it may bring a great paradigm shift in how the downstream uses the perception system.

Figure 13

Next, static and dynamic examples. For the static case, why do people like to use a map? Because it goes beyond line of sight: content 10 kilometers away can be seen by opening a phone, while perception often cannot see it; perception only sees the information shown in pixels. What if imagination were introduced into BEV perception? It turns out the vehicle side can make good guesses from sensor information alone: if there is an intersection ahead and crosswalks are visible on both sides, there will very likely be roads extending out from them. The vehicle is effectively mapping while driving and perceiving, which is called online mapping. The left side of Figure 13 shows an online map gradually expanding its understanding of the surrounding lane lines, roads, and their connections.

What, then, is dynamic imagination? One of the most typical examples is the prediction module often encountered in planning and control, which represents what a dynamic object will do or where it will go in the future. The right half of Figure 13 shows where each car will go, with the lines representing where each vehicle will be in the next few seconds. What happens if this module becomes part of perception? Previously there was a layer of perception results sitting between perception and prediction, requiring a lot of extraction and refinement, because you do not want the perception results to pass along too much information. In fact, many mature prediction networks are already neural networks that can consume the raw perception feature maps directly. If the perception network itself adds a prediction head, then it can use many unstructured cues to make predictions and accomplish things that the old perceive-then-predict pipeline could not.

4

BEV perception development with end-cloud integration

Through this understanding of autonomous driving perception, a set of perception tasks at different levels, from semantic perception and low-level vision to imagination, has been constructed, forming a complete set of perception. But the hardest point is data: where does the data come from? Where does the ground truth come from? For mass production of autonomous driving under Software 2.0, I personally think this difficulty is several orders of magnitude greater than that of the neural networks themselves, which is why Horizon has a team mixing systems, software, algorithms, and hardware to tackle the hard problems in the cloud platform AIDI. We hope to develop this capability so that more customers can experience it in AIDI.

It is basically divided into several steps: from data collection on the vehicle side, to reconstruction of the collected data in the cloud, obtaining a more complete picture of the world through the collected data and data understanding. Then, based on this information, cloud-side perception is carried out: not only must perception on the vehicle be done well, but perception in the cloud must also be done well, so that cloud perception can act as a teacher that teaches the perception on the car.

Since this is a ground-truth production pipeline, quality control is also required to complete the work.

Figure 14

The overall process is relatively simple. What the car side considers is how to mine effective information, because the car is driving on the road all the time and sees a great deal. So what kind of information is effective for improving the performance of the neural network, and what kind of data is needed? This is itself a very difficult perception task, and there are various strategies behind it to accomplish it.

One is triggering on the vehicle side. The basic idea is to trigger based on the perception results the vehicle can obtain, writing rules and scripts that fire under certain conditions, at moments when the data is useful to me.

The second is active learning, that is, the neural network in the car has the ability to learn autonomously; it is curious, not a passive network that merely accepts sensor input and does its job. For example, if it finds a sample very interesting and wants to take it back to study, it chooses to do so on its own.

There are also multi-sensor methods: the information from different sensor sources may differ slightly, but logically we know it should be consistent, so inconsistencies in space and time can be used as a source for mining data. After the data are mined in these ways and given some privacy processing, the information is sent to the cloud. A minimal sketch of such triggers is shown below.
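
As a rough illustration of these mining strategies, the sketch below combines a rule-based trigger on perception outputs with a camera-lidar consistency trigger. The field names, thresholds, and helper functions are hypothetical, chosen only for the example.

```python
# Minimal sketch of car-side data-mining triggers: a rule-based trigger on
# perception outputs plus a camera-lidar consistency trigger. Field names,
# thresholds, and helpers are hypothetical, chosen only for illustration.
def bev_iou(a, b):
    # Axis-aligned IoU of two BEV boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def rule_trigger(frame):
    # e.g. keep frames where the detector is unusually uncertain
    # or where a rare class shows up.
    low_conf = any(d["score"] < 0.3 for d in frame["detections"])
    rare = any(d["label"] in {"animal", "debris"} for d in frame["detections"])
    return low_conf or rare


def consistency_trigger(frame, iou_thresh=0.3):
    # Camera and lidar should roughly agree in BEV; camera boxes with no
    # lidar counterpart mark the frame as worth sending to the cloud.
    cam_boxes = frame["camera_bev_boxes"]
    lidar_boxes = frame["lidar_bev_boxes"]
    unmatched = [b for b in cam_boxes
                 if all(bev_iou(b, l) < iou_thresh for l in lidar_boxes)]
    return len(unmatched) > 0


def should_upload(frame):
    return rule_trigger(frame) or consistency_trigger(frame)
```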

The cloud first reconstructs the entire world from the information transmitted by the cars, and this world is not only a 3D world but also carries information along the time dimension, a process called 4D reconstruction. The second part is cloud-side perception, that is, building the teacher model in the cloud. The constraints and optimization goals of perception differ between cloud and vehicle: the vehicle side works under certain power and compute constraints, where it is enough for the frame rate and accuracy to meet requirements, with a higher frame rate being better; the cloud has far more abundant compute, so a larger model can be chosen, and it can even use future information, which the vehicle side cannot know. The third step is QA, which can be done by people, like a production line, or automatically with related quality inspections.

After this chain of ground-truth production, very high-quality ground truth and the corresponding raw data are obtained, which can be fed to the neural network for training. At Horizon, once the ground truth is obtained on AIDI, the training task and the ground truth are connected seamlessly on AIDI, a new round of training is triggered, and after training the system is automatically integrated, a new software version is generated, and it is sent to the vehicles through OTA, completing one round of the iteration loop.

The new software then continues data mining on the vehicle side and completes another cycle like the one in Figure 14. With this iteration method, the iteration efficiency of the whole system is very high; some systems inside Horizon can achieve a round of rapid iteration on the order of two or three weeks. Such rapid iteration is largely due to the cycle above.

Within the whole process, especially the automatic labeling in the cloud, many cloud algorithm and software teams are currently working hard to turn cloud labeling into label-free work. That is, unlike the previous labeling process in which people had to mark every pixel, most of the work is solved by large neural network models plus reconstruction, and only a small amount of verification and supplementary work is completed by people, which greatly improves overall labeling efficiency. The speed of the whole iteration is also relatively fast, and the throughput is now relatively high, which is mainly due to the automation of the whole pipeline.

Figure 15

Let's look at some concrete examples, such as how to obtain ground truth on AIDI. First, data for the same location is collected from the initial data sources; if the BEV perception is not yet good enough, the trigger fleet needs to collect data in different directions at one location. Figure 15 shows different collections near the same intersection and the results of single reconstructions. You can see that although a single reconstruction is not bad by itself, it is incomplete, because only what lies along the route the car drove is visible; what the car did not pass is not visible. By aggregating the results of multiple single reconstructions, the algorithm can obtain a complete scene reconstruction.

In addition to reconstructing the static environment, dynamic scene reconstruction, or dynamics-oriented perception results, can also be produced, finally assembling complete, holographic 4D world information for use by the downstream cloud perception models. With such a 4D point cloud, how is the labeling task carried out?

Take BEV static perception labels as an example, as shown in the upper right corner of Figure 15: one car turns right once at an intersection, with the box showing its perception range; another car goes straight through the intersection once; and another goes straight in the opposite direction once. After aggregating such information, you get a complete scene. It is very similar to a map, but with some differences: a map generally emphasizes the latest appearance of the world and the position of things in a global frame.

In the labeling task, the first requirement is matching with the original sensor information: if you match image information from a year ago with reconstruction information from a year later, that is not ground truth, because a given intersection may have changed, and such labels are useless. Another point is that labeling generally only cares about the local area, because what you want to train is this: given this image input, what should the neural network output? That is inherently local content; it only needs to extend slightly beyond what the eye can reach, not very far, and that is enough for training the perception model. Many other tasks, including low-level visual information under BEV such as height, optical flow, and 3D detection, can be extracted from the same holographic information.

The above is how the most difficult part of BEV is obtained in a highly automated way through AIDI's cloud environment. Horizon's auto-annotation work is also backed by a strong algorithm team.

Finally, to sum up, this lecture provides a system-level model, hoping that through a relatively clear perspective you can see what the Horizon team sees in the evolution of the perception paradigm of the autonomous driving industry, and that its most important aspects are presented in a more formal, simplified form. But we also know what the masters of statistics have said: all models are wrong, but some are useful, because they capture some important aspect and therefore help us. The modeling around BEV has this same character.

From a personal point of view, mass production of autonomous driving technology cannot be completed by a single technology. In Horizon's mass-production technology, BEV is only one very important algorithmic perspective; in fact, to complete a mass-produced autonomous driving system, many different technologies and BEV have to borrow from each other to meet the needs at the final product level.

Mass-produced autonomous driving is not a single perspective, and building an excellent model in the cloud is not enough to achieve mass production. You need to look at both the technology and the current needs of the market, and at what kind of technology can be deployed at scale under the current compute and technology level. At the same time, Horizon believes that mass production of autonomous driving is not something one company can do alone; the world needs a hundred flowers to bloom.

Another thing I admire about Horizon is that it has a good ecological perspective and a great degree of openness toward the industrial ecosystem. A few days ago, Horizon's founder also announced that in the future, not only will the algorithms have a certain degree of openness, but even the core BPU, that is, the computing IP of the neural network accelerator, can be provided to some OEM partners as a white box to help them design hardware. In this way, Horizon hopes that there will not be only one Tesla in the world, but many companies that, through deep cooperation with Horizon, can reach or even surpass Tesla's level and create a richer set of products.

If you are interested in autonomous driving, welcome to join the industry. If you are interested in Horizon, come join us and turn the future into reality.
