
Scientists have developed a new module for autonomous driving to bring the understanding of autonomous driving scenarios closer to human cognition

Author: DeepTech

Recently, Xu Dongyang, a master's student at Tsinghua University, and his team proposed a module called LVAFusion to help advance autonomous driving technology. The module aims to integrate multimodal information more efficiently and accurately.

Figure | Xu Dongyang (Source: Xu Dongyang)

Autonomous driving systems should be able to learn from skilled human drivers on the road, because humans can quickly locate the key areas of a scene in most situations.

To improve the interpretability of end-to-end autonomous driving models, the team introduced a human driver attention mechanism for the first time.

The model predicts the driver's attention area in the current context and uses it as a mask to reweight the original image, giving autonomous vehicles the ability to locate and anticipate potential risk factors much like an experienced human driver.

Introducing predicted driver gaze areas not only provides finer-grained perceptual features for downstream decision-making tasks but also improves safety. It also brings the scene-understanding process closer to human cognition, which improves interpretability.
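As a rough illustration of this mechanism, here is a minimal sketch that reweights camera features with a predicted gaze heatmap. The gaze head and the residual reweighting scheme are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GazeMaskReweight(nn.Module):
    """Sketch: reweight image features with a predicted driver-gaze mask.

    The 1x1-conv gaze head is a hypothetical stand-in for the paper's
    driver-attention predictor.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.gaze_head = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),  # 1-channel saliency map
            nn.Sigmoid(),                           # values in [0, 1]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) features from a camera backbone
        gaze = self.gaze_head(feats)        # (B, 1, H, W) predicted gaze mask
        # Residual reweighting: keep all features, boost gazed regions
        return feats * (1.0 + gaze)

# Example: out = GazeMaskReweight(256)(torch.randn(2, 256, 32, 32))
```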


(Source: arXiv)

In terms of potential applications:

First, the newly developed LVAFusion module can be used in vehicles equipped with lidar and is expected to improve the perception and fusion capabilities of multimodal large models.

Second, this model can be combined with existing multimodal large models.

For example, the driver-attention map can be output in real time, letting passengers see which parts of the scene the large model currently weights most heavily.

If a passenger finds this unreasonable, they can give voice feedback to the end-to-end model, which can then adjust itself for continuous learning and optimization.


What's so good about end-to-end autonomous driving?

Traditional autonomous driving includes key stages such as environmental perception, localization, prediction, decision-making, planning, and vehicle control; through the coordination of these modules, a vehicle can perceive its surroundings in real time and navigate safely.

However, this system architecture involves a huge amount of code, complex post-processing logic, and high maintenance costs.

Moreover, errors tend to accumulate in practical use. For example, if a pedestrian suddenly appears ahead and the perception module misses them, the downstream prediction and decision-making modules receive no information about the pedestrian, which can lead to danger.

End-to-end autonomous driving is expected to solve this problem. It uses deep learning models to translate directly from raw input data (e.g., camera images, lidar point clouds) to control commands (e.g., steering angle, accelerator, and brake).

This approach attempts to simplify the traditional multi-module autonomous driving system by treating the entire driving task as a mapping problem from perception to behavior.
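As a concrete contrast with the modular pipeline, here is a minimal sketch of that mapping in PyTorch: raw sensor data in, control commands out. The tiny architecture is purely illustrative and is not the team's model.

```python
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    """Sketch of the end-to-end idea: one learned map from pixels to controls."""
    def __init__(self):
        super().__init__()
        # Small CNN stand-in for a real image backbone
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head outputs (steering angle, accelerator, brake)
        self.head = nn.Linear(64, 3)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(image))

# Example: controls = EndToEndDriver()(torch.randn(1, 3, 128, 128))  # shape (1, 3)
```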

The key advantage of end-to-end learning is that it reduces system complexity and has the potential to improve generalization, since the model can be trained directly on many different driving situations.

In addition, by integrating data from multiple sensors such as cameras, lidar, and radar, multimodal end-to-end autonomous driving is expected to improve the system's ability to understand and respond to complex environments, enhance the accuracy and robustness of decision-making, and improve the safety and reliability of autonomous vehicles.

However, end-to-end autonomous driving rests on black-box deep learning models, so improving driving performance while also improving interpretability is an urgent pain point.

Xu Dongyang and his team analyzed the model structures of many existing end-to-end autonomous driving methods in detail and found that multimodal information had not previously been put to good use.

Cameras capture rich semantic information but lack depth; lidar provides accurate distance information. The two modalities are therefore highly complementary.

However, most existing end-to-end methods either use separate backbone networks to extract features from each modality and then concatenate them in a high-dimensional space, or use a Transformer to fuse the multimodal information.

In the Transformer case, the query is initialized randomly, which can prevent the attention mechanism from exploiting the prior knowledge embedded in the multimodal features during fusion.

This can misalign the same key object across modalities, ultimately making the model's training converge more slowly and to a suboptimal result.
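One plausible remedy for the random-query issue, sketched below, is to initialize the fusion query from the multimodal features themselves. This illustrates the general idea only; the pooling and attention choices here are assumptions, not the actual LVAFusion design.

```python
import torch
import torch.nn as nn

class FeatureInitFusion(nn.Module):
    """Sketch: cross-attention fusion with a feature-derived (not random) query."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cam: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        # cam, lidar: (B, N, dim) token sequences from each modality's backbone
        tokens = torch.cat([cam, lidar], dim=1)
        # Initialize the query from pooled multimodal features, so fusion
        # starts from priors already present in both modalities
        query = tokens.mean(dim=1, keepdim=True)    # (B, 1, dim)
        fused, _ = self.attn(query, tokens, tokens)
        return fused                                # (B, 1, dim)

# Example: out = FeatureInitFusion()(torch.randn(2, 100, 256), torch.randn(2, 64, 256))
```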


On a snowy winter night in Zhongguancun, I was typing code for experiments

As Xu Dongyang's professional skills accumulated and end-to-end autonomous driving developed, he noticed while reading the literature that the field still had some gaps.

For example, how to improve a model's interpretability while preserving its accuracy, and whether multimodal information was being fully fused, had not been thoroughly explored. After some investigation, Xu Dongyang chose end-to-end autonomous driving as his research topic.

End-to-end autonomous driving is a large system comprising multiple modules such as perception, tracking, prediction, decision-making, planning, and control, so the method had to be designed to work effectively with all of them.

Once the method was decided, a large amount of data had to be collected: because end-to-end models are based on deep learning, they require large training sets.

It was also necessary to determine the model's inputs and outputs, collect data under multiple weather conditions on the autonomous driving simulation platform CARLA, and check the integrity of the data.
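For reference, varying the weather in CARLA during data collection can look like the sketch below; the host, port, and chosen presets are assumptions rather than the team's actual setup.

```python
import carla

# Connect to a locally running CARLA server (host/port assumed)
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Cycle through a few built-in weather presets while recording runs
for preset in (carla.WeatherParameters.ClearNoon,
               carla.WeatherParameters.HardRainNoon,
               carla.WeatherParameters.WetCloudySunset):
    world.set_weather(preset)
    # ... spawn the ego vehicle, attach camera/lidar sensors, record a run ...
```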

After data collection, it was necessary to analyze whether the model's structural design could actually help the task.

During the experiments, Xu Dongyang loaded the wrong pre-trained weights. Because the weight shapes matched, the system raised no errors, yet the experimental results were consistently unsatisfactory.

After extensive model debugging, the problem still could not be found. One night, as Xu Dongyang walked through Zhongguancun under heavy snow, it suddenly occurred to him that he had not checked the training code: could the problem lie in the training process?

So he ran straight back to his computer, examined the training process, and finally determined that the problem was in how the pre-trained weights were loaded.

After the fix, the experimental results matched expectations well. "This kind of discovery brings not only an understanding of the problem, but also a deep sense of satisfaction and achievement," Xu Dongyang said.

Because training took a long time, Xu Dongyang would submit multiple jobs to the training cluster every night. One night, with many experiments queued, some jobs were stopped because of scheduling priority.

When he checked the next day, he found some results missing, so he had to carefully re-analyze the results and resubmit the missing experiments.

Through this painstaking process, he finally completed the research. The resulting paper, titled "M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving," was posted on arXiv [1].


Figure | The related paper (Source: arXiv)

Next, the research group will focus on further optimizing the model, expanding its application scenarios, and improving the system's robustness and safety.

Specifically:

First, deepen multimodal fusion techniques.

That is, continue to explore and develop more efficient algorithms for fusing data from different sensors, for example using graph networks to match different modalities, with special attention to traffic scenarios in highly dynamic and complex environments.

Second, enhance the driver-attention model.

That is, further study how driver attention can be simulated, explore how to more accurately predict and model a human driver's focus of attention, and examine how those focal points affect driving decisions.

Third, verify safety and robustness.

That is, deploy the existing model in a real vehicle and verify its performance under real-world conditions through more physical experiments.

In this way, the research can be extended to a wider range of driving scenarios and environmental conditions, such as bad weather and night driving, to verify and improve the system's versatility and adaptability.

Finally, conduct research on human-computer interaction.

That is, explore how this technology can be integrated more closely with human-machine interaction, for example by providing drivers with more intuitive risk warnings and decision support, to enhance the interaction between autonomous vehicles and human drivers.

Through these follow-up research projects, Xu hopes not only to improve the performance of autonomous driving technology, but also to bring it closer to a human understanding of driving behavior, laying the foundation for safer and smarter autonomous driving.

Resources:

1. https://arxiv.org/pdf/2403.12552.pdf

Operation/Typesetting: He Chenlong
