Li Feifei's new result: a robot connected to a large model carries out complex instructions with zero training...

Author: Billion Euronet

The latest achievements of Li Feifei's team.

Li Feifei's team has made new progress in embodied intelligence research. They have integrated large models into a robot so that it can follow natural-language instructions without additional data or training, for example "Open the upper drawer and be careful with the vase". A large language model combined with a visual language model identifies goals and obstacles in 3D space and helps the robot plan its actions.

In the real world, the robot performs such tasks directly, without being trained for them, and can manipulate objects that were not specified in advance. Examples include opening a bottle, pressing a switch, and unplugging a charging cable. The project homepage and paper are now online, and the code will be released soon. The work has also drawn considerable interest and discussion in the academic community. A former Microsoft researcher commented that the study is at the forefront of the most important and complex AI systems.

How can a robot understand people directly? Li's team named the system VoxPoser. The principle is simple. First, the system is given information about the environment and the natural-language instruction to execute. The large language model (LLM) then writes code based on this input. The generated code interacts with the visual language model (VLM) and guides the system in generating a map of the corresponding operation instructions. In VoxPoser, this map is a 3D value map.
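To make the idea of a 3D value map concrete, here is a minimal sketch in Python. It is not the team's code: the grid size, the function `build_value_map`, the scoring rule, and the hard-coded positions of the drawer handle and the vase are all assumptions made for illustration. In the system described above, the positions would come from the VLM and the scoring rule from code written by the LLM.

```python
import math

GRID = 10  # hypothetical 10x10x10 voxel workspace


def build_value_map(target, obstacles, avoid_radius=2.0, penalty=5.0):
    """Return a dict mapping each voxel to a value; higher is more desirable.

    Assumed scoring rule: voxels closer to the target score higher, and
    voxels within avoid_radius of any obstacle are penalized.
    """
    value_map = {}
    for x in range(GRID):
        for y in range(GRID):
            for z in range(GRID):
                voxel = (x, y, z)
                value = -math.dist(voxel, target)
                for obstacle in obstacles:
                    if math.dist(voxel, obstacle) <= avoid_radius:
                        value -= penalty
                value_map[voxel] = value
    return value_map


# Hypothetical scene for "Open the upper drawer and be careful with the vase".
drawer_handle = (8, 5, 3)
vase = (5, 5, 3)

value_map = build_value_map(drawer_handle, [vase])
best_voxel = max(value_map, key=value_map.get)
```

In this sketch the highest-valued voxel is the drawer handle itself, while voxels around the vase are pushed down by the penalty, so a planner that maximizes the value is drawn to the handle and steered away from the vase.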

The resulting 3D map serves as the objective function for the action planner, which synthesizes the final manipulation trajectory. Because no additional training is required, the approach sidesteps the scarcity of robot training data. While synthesizing trajectories, the system uses closed-loop visual feedback with cached output and replans quickly when it is disturbed. As a result, VoxPoser is robust to interference.
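The planning loop described above can be sketched as follows. This is an illustration under stated assumptions, not the planner used by the team: the article does not say which planner is used, so Dijkstra's shortest-path search stands in for it, and the value map is expressed as a cost (lower is better). The names `step_cost`, `plan`, `execute`, and `observe`, the workspace size, and all positions are hypothetical. The `observe` stub plays the role of visual feedback and moves the target partway through to simulate a disturbance.

```python
import heapq
import math

SIZE = 6  # hypothetical 6x6x6 voxel workspace


def step_cost(voxel, obstacle, avoid_radius=1.0, penalty=100.0):
    """Cost of entering a voxel: 1 per move, plus a penalty near the obstacle."""
    near = math.dist(voxel, obstacle) <= avoid_radius
    return 1.0 + (penalty if near else 0.0)


def neighbors(voxel):
    """Yield the 6-connected neighbors that lie inside the workspace."""
    for axis in range(3):
        for delta in (-1, 1):
            n = list(voxel)
            n[axis] += delta
            if 0 <= n[axis] < SIZE:
                yield tuple(n)


def plan(start, target, obstacle):
    """Dijkstra search: return the cheapest voxel path from start to target."""
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost, voxel = heapq.heappop(queue)
        if voxel == target:
            break
        if cost > best[voxel]:
            continue  # stale queue entry
        for n in neighbors(voxel):
            new_cost = cost + step_cost(n, obstacle)
            if new_cost < best.get(n, math.inf):
                best[n] = new_cost
                parent[n] = voxel
                heapq.heappush(queue, (new_cost, n))
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


def execute(start, observe, max_steps=100):
    """Closed loop: re-observe the scene, replan, and take one step at a time."""
    current, visited = start, [start]
    for step in range(max_steps):
        target, obstacle = observe(step)
        if current == target:
            break
        current = plan(current, target, obstacle)[1]
        visited.append(current)
    return visited


def observe(step):
    """Hypothetical perception stub: the target is moved after three steps."""
    target = (5, 2, 2) if step < 3 else (5, 5, 2)
    return target, (3, 2, 2)


trajectory = execute((0, 2, 2), observe)
```

Because the scene is re-observed and the path replanned before every step, the trajectory ends at the moved target rather than the original one, and it never enters a voxel next to the obstacle. The scoring rule itself stays fixed across replans; only the observed positions change, which is one way to read the role of cached output in keeping replanning fast.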

About a year ago, Li Feifei published an article in the Journal of the American Academy of Arts and Sciences identifying three directions for the development of computer vision: embodied intelligence, visual reasoning, and scene understanding. To perform tasks, machines need visual reasoning, an understanding of three-dimensional relationships in a scene, and an understanding of the people in it, including their intent and social relationships. Combining large models with robots is one way to address these problems.
