
Drive Like a Human: Building Autonomous Driving with Large Language Models! Has Autonomous Driving's ChatGPT Moment Arrived?

Author: Big Data Digest

Reprinted by Big Data Digest with permission from Xi Xiaoyao Tech Talk

Author: IQ fell to the ground

Traditional autonomous driving systems rely on pre-programmed rules and patterns, which limits their adaptability and flexibility. To advance the technology, researchers have begun experimenting with large language models (LLMs), which learn from vast amounts of text and acquire the semantic understanding and generative capabilities needed to produce human-like language.

Imagine you are waiting at a traffic light when a pickup truck loaded with traffic cones crosses the intersection in front of you. As a human driver, common sense tells you that these cones are cargo on the truck, not a sign of road construction. For many autonomous driving systems, however, this situation is hard to handle. Developers can patch it with rules or by collecting more data on vehicles carrying traffic cones, but the algorithm then fails in the next unusual case, such as road markings indicating a prohibited area. Solving one problem simply uncovers another, especially in situations that are rare in the real world. This is why the authors believe traditional autonomous driving systems face a performance bottleneck.

A recent study proposes rethinking autonomous driving technology with large language models. The authors use an LLM to simulate human understanding of the driving environment and analyze its reasoning, interpretation, and memory abilities in complex situations. They argue that traditional systems hit performance limits in corner cases, and that an ideal autonomous driving system should instead solve problems through driving experience and common sense, as humans do. To that end, they identify three key capabilities: reasoning, interpretation, and memory. By building a closed-loop system, they demonstrate the feasibility of using an LLM in driving scenarios and its ability to understand and interact with the environment. The experimental results show that the LLM exhibits impressive reasoning and problem-solving in complex situations, offering valuable insights for developing human-like autonomous driving systems.

Code:

https://github.com/PJLab-ADG/DriveLikeAHuman

Background

There are two main approaches to autonomous vehicles: modular and end-to-end.

  • The modular approach is composed of components that handle separate tasks such as perception, planning, and control. Its benefits are modularity and versatility, but tuning the pipeline and tracing errors across modules can be difficult.
  • The end-to-end approach maps sensor inputs directly to planning or control commands. It is often easier to develop, but it lacks interpretability, making it hard to diagnose errors, guarantee safety, and comply with traffic rules.
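To make the contrast concrete, here is a minimal Python sketch of the two paradigms. All function names, sensor fields, and thresholds are illustrative, not taken from the paper; the point is only that the modular pipeline exposes inspectable intermediate outputs while the end-to-end mapping does not.

```python
# Hypothetical sketch contrasting modular and end-to-end driving stacks.

def perceive(sensor_input):
    # Perception module: turn raw sensor data into a structured scene.
    return {"lead_vehicle_gap_m": sensor_input["lidar_range_m"]}

def plan(scene):
    # Planning module: a rule-based decision over the structured scene.
    return "BRAKE" if scene["lead_vehicle_gap_m"] < 10.0 else "CRUISE"

def control(decision):
    # Control module: map the symbolic decision to an actuator command.
    return {"throttle": 0.0 if decision == "BRAKE" else 0.3}

def modular_pipeline(sensor_input):
    # Modular: each stage's output can be inspected and debugged.
    return control(plan(perceive(sensor_input)))

def end_to_end(sensor_input):
    # End-to-end: one learned mapping from sensors to commands, with no
    # inspectable intermediate representation (stubbed with a toy formula).
    return {"throttle": min(0.3, sensor_input["lidar_range_m"] / 100.0)}
```

When the modular pipeline brakes for the wrong reason, you can check whether perception or planning is at fault; with the end-to-end stub, only the final throttle value is visible.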

Recent studies show that combining the strengths of both methods can yield better results. However, both approaches tend to become brittle when facing long-tail or out-of-distribution scenarios in real-world environments, which poses challenges for safety-critical autonomous driving.

Rethinking the story of autonomous driving, the authors explain why traditional systems struggle with complex scenarios. Although optimization-based systems can decompose complex tasks into subtasks, their optimization objectives often fall into local optima in complex scenes, which limits generalization. Adding more data only narrows the gap between the current model and the ceiling of the optimization method: the optimization process is biased toward the dominant patterns in the data and ignores rare cases. Without injecting common sense, the model's capabilities cannot fundamentally improve.

In addition, continuous data collection will always surface endless unknown long-tail cases. These edge cases often overwhelm current solutions, yet humans solve them easily with experience and common sense. So an intuitive idea emerges: could we build a system that accumulates experience through continuous driving, as humans do, rather than fitting a limited training corpus?

According to recent research, previous modular autonomous driving systems can be viewed as "Internet AI": trained on task-specific corpora and lacking advanced abilities such as reasoning, interpretation, and self-reflection. The authors argue that to obtain an agent that drives like an experienced human, it is necessary to draw on ideas from embodied intelligence research.

Continuous learning is another important aspect of driving. Novice drivers usually drive cautiously in complex traffic because they have limited experience. Over time they gain experience, encounter new traffic scenarios, develop new driving skills, and consolidate what they have learned, eventually becoming experienced drivers. Existing optimization methods simulate this process by collecting failure cases and retraining the neural network, but that is cumbersome and expensive and does not achieve true continuous learning. A more efficient way to realize continuous learning in autonomous driving systems is therefore needed.

The success of large language models (LLMs) is exciting because it shows how much human knowledge machine learning can absorb. Recent LLM research has demonstrated impressive performance in zero-shot prompting and complex reasoning, in embodied intelligence research, and in solving critical traffic problems.

  • PaLM-E uses fine-tuning to adapt pre-trained LLMs to support multimodal prompts.
  • Reflexion combines self-reflection with chain-of-thought prompting to further enhance an agent's reasoning, generating both reasoning traces and task-specific actions.
  • VOYAGER proposes an LLM-based lifelong learning mechanism consisting of a prompting mechanism, a skill library, and self-verification; the three modules work together to develop increasingly complex agent behaviors. Generative Agents use an LLM to store an agent's complete record of experience and synthesize it into higher-level reflections for planning behavior.
  • Instruct2Act introduces a framework that uses large language models to map multimodal instructions to sequential actions for robot manipulation tasks.

Autonomous driving system design

Humans learn to drive through interaction with real environments, and improve their sense of road by interpreting, reasoning, and summarizing memories of various scenarios and corresponding operations.

  • Inductive reasoning: Thanks to their ability to reason logically, human drivers can summarize rules using common sense and apply them in more general scenarios.
  • Deductive reasoning: Previous experiences can be evoked subconsciously to deal with unpredictable situations.

In order to achieve the goal of driving like a human, three necessary capabilities of the system were identified:

  1. Reasoning: In a specific driving scenario, the model should be able to make decisions using common sense and experience.
  2. Explanation: The decisions made by the agent should be able to be explained. This indicates the capacity for introspection and the presence of declarative memory.
  3. Memory: After reasoning and interpreting scenarios, a memory mechanism is needed to remember previous experiences and enable the agent to make similar decisions when faced with similar situations.

Based on these three characteristics, the authors draw on the way humans learn to drive to design a simplified paradigm for the driving system.


Figure 1: (a) The relationship between human driving and existing autonomous driving systems, highlighting the limitations of current approaches and why they cannot cover all long-tail cases. (b) A paradigm for a system that can drive like a human: the agent explores and interacts with the environment, self-reflects based on expert feedback, and ultimately accumulates experience.

The LLM-based approach proposed by the authors is shown in Figure 1(b), which consists of four parts:

  1. The environment interacts with the agent, providing the stage on which everything happens;
  2. The agent acts as the driver, perceiving the environment and making decisions based on memory and expert advice;
  3. Memory allows the agent to accumulate experience from past interactions and apply it to its actions;
  4. The expert provides advice during agent training and feedback when the agent's behavior deviates from it.

Specifically, the environment, agent, and expert can be represented as real-world or simulator, human driver or driving algorithm, and feedback from simulator or coach, respectively.
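As a rough illustration of this closed loop, the toy sketch below shows an agent whose deviations from an expert are recorded in memory and recalled on the next encounter. The class, scenario strings, and decision names are purely hypothetical, not the paper's implementation.

```python
class ClosedLoop:
    """Toy closed loop: environment scenario -> agent decision ->
    expert feedback -> memory of corrections."""

    def __init__(self, agent, expert):
        self.agent, self.expert = agent, expert
        self.memory = []  # (scenario, corrected decision) pairs

    def step(self, scenario):
        decision = self.agent(scenario, self.memory)
        expert_decision = self.expert(scenario)
        if decision != expert_decision:
            # Record only scenarios where the agent deviated from the expert.
            self.memory.append((scenario, expert_decision))
        return decision

def agent(scenario, memory):
    # Recall a previous correction if this scenario was seen before.
    for past, correct in memory:
        if past == scenario:
            return correct
    return "IDLE"  # default behaviour without relevant experience

def expert(scenario):
    # Stand-in for developer evaluation or a real human driver's decision.
    return "DECELERATE" if scenario == "cones_on_road" else "IDLE"

loop = ClosedLoop(agent, expert)
first = loop.step("cones_on_road")   # agent has no experience yet
second = loop.step("cones_on_road")  # same scenario, correction recalled
```

On the first encounter the agent idles, the expert disagrees, and the correction is stored; on the second encounter the agent retrieves the stored correction and decelerates.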

The agent's behavior is tracked by an independent memory module, which records only those decision scenarios where the agent deviates from the expert's decision. The expert can be a developer evaluating the LLM's decisions, or a human driver in the real world. Once expert feedback arrives, the LLM self-reflects to identify why its decision was biased; it then summarizes the traffic scenario and adds it to memory as a new entry together with the correct decision. When a similar situation occurs again, the LLM can quickly retrieve this memory and make an informed decision.
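Retrieving "similar situations" could, for instance, be done by embedding each scenario as a vector and comparing by cosine similarity. The sketch below assumes tiny hand-made vectors in place of real LLM or encoder embeddings, and the 0.9 similarity threshold is an arbitrary illustrative choice.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class DrivingMemory:
    """Toy memory: store (scene embedding, correct decision) pairs and
    retrieve the decision of the most similar stored scene."""

    def __init__(self, threshold=0.9):
        self.entries = []
        self.threshold = threshold

    def add(self, embedding, decision):
        self.entries.append((embedding, decision))

    def recall(self, embedding):
        best, best_sim = None, self.threshold
        for stored, decision in self.entries:
            sim = cosine(stored, embedding)
            if sim >= best_sim:
                best, best_sim = decision, sim
        return best  # None when nothing is similar enough
```

A near-duplicate scene embedding retrieves the stored decision, while an unrelated one returns nothing, mirroring the "similar situation, similar decision" behavior described above.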

Experiment


▲ Figure 4: The lane-change decision-making process of GPT-3.5

In the example in Figure 4, the green car is in the rightmost lane, following vehicle 2 at a certain distance. Earlier, GPT-3.5 judged that the vehicle ahead was too far away and decided to accelerate to keep up with vehicle 2. When the ReAct process starts, GPT-3.5 calls the Get_available_action tool to obtain the four actions available at the current time step. It then finds that vehicle 2 is still driving ahead and that both the idle and acceleration actions are safe. GPT-3.5's final decision is to keep accelerating because, as it explains in its final answer, it "selects actions consistent with previous decisions." As a result, the gap to the vehicle ahead shrinks, which helps the overall flow of traffic. Compared with the first example, GPT-3.5 makes far fewer tool calls and incurs lower inference cost because it references its previous decision results.
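The ReAct loop described above can be caricatured as follows. The tool names and the hard-coded "prefer consistency with the previous decision" policy are illustrative stand-ins for the repository's actual tool set and for GPT-3.5 itself.

```python
# Hypothetical ReAct-style decision step: observe via tools, then decide.
TOOLS = {
    "get_available_actions": lambda state: state["available_actions"],
    "is_action_safe": lambda state, action: action in state["safe_actions"],
}

def react_step(state, llm_policy):
    """One step: tool calls produce observations, the policy produces
    the final answer; the trace mirrors a ReAct transcript."""
    trace = []
    actions = TOOLS["get_available_actions"](state)
    trace.append(("observation", actions))
    safe = [a for a in actions if TOOLS["is_action_safe"](state, a)]
    trace.append(("observation", safe))
    decision = llm_policy(safe)  # stand-in for the LLM's reasoning
    trace.append(("final_answer", decision))
    return decision, trace

# Stub policy mimicking the "stay consistent with the previous
# decision" behavior: keep accelerating whenever it is safe.
policy = lambda safe: "accelerate" if "accelerate" in safe else safe[0]

state = {
    "available_actions": ["idle", "accelerate", "decelerate", "turn_left"],
    "safe_actions": ["idle", "accelerate"],
}
decision, trace = react_step(state, policy)
```

Here the trace plays the role of the Thought/Action/Observation transcript: the agent first queries what it may do, filters for safety, and only then commits to a final answer.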


In the second case, shown in Figure 6(b), traffic cones are scattered not only on the truck bed but also on the ground. LLaMA-Adapter describes this situation accurately. Although it differs only slightly from the first case, GPT-3.5's answer is exactly the opposite: it considers the situation potentially dangerous because of the cones scattered around the truck, and advises the self-driving car to slow down and keep its distance to avoid colliding with them.

The above examples demonstrate the LLM's powerful zero-shot comprehension and reasoning in driving scenarios. Common-sense knowledge not only lets the LLM better understand the meaning of a scene, but also lets it make more rational decisions that are closer to human driving behavior. Having common-sense knowledge therefore raises the ceiling of an autonomous driving system's capabilities, allowing it to handle unknown special situations and truly approach the driving ability of a human driver.

Summary

Recent research shows that large language models (LLMs) possess remarkable capabilities and new techniques such as instruction following and in-context learning, and has demonstrated their ability to reason, interpret, and remember. Inspired by this, the paper makes a preliminary exploration of LLMs' ability to understand driving traffic scenarios, analyzing their reasoning, interpretation, and memory in long-tail corner-case-like scenarios through a series of experiments. The main contributions are as follows:

  • It investigates how to make autonomous driving systems drive like humans so as to avoid catastrophic forgetting in long-tail corner cases, and summarizes three key capabilities for human-like driving: reasoning, interpretation, and memory.
  • It demonstrates, for the first time, the feasibility of using an LLM in driving scenarios and exploits its decision-making capabilities in a simulated driving environment.
  • Through extensive experiments, it demonstrates the LLM's strong comprehension and its ability to solve long-tail cases.

Previous autonomous driving systems were limited in handling special situations because they tend to forget previous experience. The authors therefore summarize three abilities an autonomous driving system should have: reasoning, interpretation, and memory. They then devise a new method that mimics how humans learn to drive. Finally, using GPT-3.5 as a testbed, they demonstrate its impressive ability to understand traffic scenarios. The work offers an initial glimpse of this approach's potential in closed-loop driving and highlights the benefits and opportunities of adopting the technology.

By training the model, it can understand and mimic the behavior and decision-making process of human driving. This makes the autonomous driving system smarter, more flexible, and able to adapt to a variety of driving scenarios and situations.

However, this approach also faces some challenges and limitations:

  • Large language models require large amounts of computing resources and data to train, which can increase the cost and complexity of the system.
  • The performance and accuracy of the model can be limited by the quality and diversity of the training data.

Despite the challenges, there is still a lot of potential to rethink autonomous driving technology with LLM.

This approach can make autonomous driving systems smarter and more adaptable, providing safer and more convenient solutions for future transportation and mobility. It is hoped that this research will provide new ideas for promoting innovation in academia and industry to build an AGI-based autonomous driving system that drives like humans.
