On July 5, at the Intelligent Driving Summer Conference, the official launch of the early bird program for the end-to-end + VLM system was announced.
According to the announcement, the biggest feature of the end-to-end approach is that NPN is removed: the system no longer relies on prior map information, so it can genuinely operate nationwide and drive wherever navigation is available.
The end-to-end model goes a step further: sensor data is fed directly into the model, which outputs the driving trajectory.
By deploying the end-to-end model and the large model on the vehicle, the autonomous driving system processes information faster, with lower latency and a higher performance ceiling, and users perceive the system's actions and decisions as more human-like.
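As an illustration only, the sketch below (hypothetical PyTorch module names and dimensions, not the production system) shows the basic idea of an end-to-end planner: sensor features go in at one end and a driving trajectory comes out the other, with no hand-written intermediate modules.

```python
import torch
import torch.nn as nn

class EndToEndPlanner(nn.Module):
    """Hypothetical sketch: a single network maps sensor features to a trajectory."""
    def __init__(self, feat_dim=256, horizon=30):
        super().__init__()
        self.camera_encoder = nn.Sequential(nn.Linear(512, feat_dim), nn.ReLU())
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Output head: (x, y) waypoints for the next `horizon` steps.
        self.trajectory_head = nn.Linear(feat_dim, horizon * 2)
        self.horizon = horizon

    def forward(self, camera_feats):
        # camera_feats: (batch, num_cameras, 512) pre-extracted image features
        tokens = self.camera_encoder(camera_feats)
        fused = self.fusion(tokens).mean(dim=1)          # pool over camera tokens
        return self.trajectory_head(fused).view(-1, self.horizon, 2)

planner = EndToEndPlanner()
trajectory = planner(torch.randn(1, 6, 512))   # -> (1, 30, 2) waypoints
```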
VLM stands for vision language model. Its overall algorithm architecture is a unified Transformer model: the Prompt text is encoded with a tokenizer, the visual information from the 120-degree and 30-degree forward-facing camera images and the navigation map is encoded, the modalities are aligned through an image-text alignment module, and the combined result is handed to the VLM for autoregressive inference.
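A rough sketch of that pipeline follows; the module names, dimensions, and use of PyTorch are assumptions for illustration rather than the actual implementation. Prompt tokens and encoded camera/navigation-map features are projected into a shared space, concatenated into one token sequence, and decoded autoregressively.

```python
import torch
import torch.nn as nn

class VLMSketch(nn.Module):
    """Illustrative only: prompt tokens + aligned image tokens -> autoregressive decoder."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.image_encoder = nn.Linear(768, dim)       # stand-in for a vision backbone
        self.align = nn.Linear(dim, dim)               # stand-in image-text alignment module
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=6)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, prompt_ids, cam_120_feats, cam_30_feats, navmap_feats):
        text_tok = self.text_embed(prompt_ids)
        vis = torch.cat([cam_120_feats, cam_30_feats, navmap_feats], dim=1)
        vis_tok = self.align(self.image_encoder(vis))  # align visual tokens to the text space
        seq = torch.cat([vis_tok, text_tok], dim=1)    # unified token sequence
        hidden = self.decoder(seq)
        return self.lm_head(hidden[:, -1])             # next-token logits (one decode step)

vlm = VLMSketch()
logits = vlm(
    prompt_ids=torch.randint(0, 32000, (1, 16)),
    cam_120_feats=torch.randn(1, 64, 768),
    cam_30_feats=torch.randn(1, 64, 768),
    navmap_feats=torch.randn(1, 16, 768),
)
```

Autoregressive inference would simply repeat this decode step, appending each generated token to the sequence.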
The VLM's output includes an understanding of the environment, driving decisions, and driving trajectories, which are passed to System 1 to control the vehicle.
The VLM continuously reasons about the current driving environment and gives reasonable driving suggestions to System 1. At the same time, System 1 can invoke different Prompt questions in different scenarios and actively ask System 2 for help with situations it cannot resolve on its own.
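That division of labor can be pictured as the hypothetical loop below; every object and method name (system1.plan, system2.ask, needs_help, and so on) is invented for illustration. The fast System 1 plans every cycle, while the slower System 2 (the VLM) is consulted at a lower rate or whenever System 1 flags a scenario it cannot handle alone.

```python
# Hypothetical dual-system control loop; names and update rates are assumptions.
def driving_loop(system1, system2, sensors, vehicle, steps=10_000):
    suggestion = None
    for step in range(steps):
        frame = sensors.read()
        # System 1: fast end-to-end planner, runs every cycle.
        trajectory = system1.plan(frame, hint=suggestion)
        vehicle.execute(trajectory)

        # System 2: slower VLM reasoning, refreshed periodically
        # or when System 1 asks for help in a difficult scenario.
        if step % 10 == 0 or system1.needs_help(frame):
            prompt = system1.select_prompt(frame)      # scenario-specific Prompt question
            suggestion = system2.ask(prompt, frame)    # environment understanding + advice
```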
Simply put, the VLM gives the vehicle the ability to think, making the autonomous driving system behave more like a human driver.