Fei-Fei Li's new "embodied intelligence" result! Hooked up to a large model, the robot understands human words directly

Feng Se and Meng Chen, from Aofeisi

Qubits | Official account QbitAI

The latest embodied-intelligence result from Fei-Fei Li's team is here:

A large model is connected to the robot, turning complex instructions into concrete action plans without any additional data or training.

With this, humans can freely give robots instructions in natural language, such as:

Open the upper drawer and watch out for the vase!
A large language model plus a visual language model analyzes, in 3D space, the target and the obstacles to route around, and helps the robot plan its motion.

And here is the key point: the real-world robot can perform this task without ever being "trained" on it.

The new method achieves zero-shot trajectory synthesis for everyday manipulation tasks; in other words, the robot can perform tasks it has never seen before, without even being given a demonstration.

The set of objects it can manipulate is also open-ended: with no scope defined in advance, it can open bottles, press switches, unplug charging cables, and so on.

The project homepage and paper are now live, the code is coming soon, and the work has already drawn plenty of interest from academia.

A former Microsoft researcher commented that this research sits at the frontier of the most important and most complex AI systems.

Within the robotics research community, peers said it opens up a whole new world for motion planning.

Even some who previously saw no danger in AI changed their minds because of this research combining AI with robots.

How can robots understand people directly?

Fei-Fei Li's team named the system VoxPoser, and its principle is quite simple.

First, the system is given environmental information (RGB-D images captured by a camera) and the natural-language instruction to be executed.

Next, an LLM (large language model) writes code based on this input; the generated code interacts with a VLM (visual language model), guiding the system to produce a corresponding operation-instruction map, namely a 3D Value Map.

The so-called 3D Value Map is an umbrella term for the affordance map and the constraint map; it marks both "where to act" and "how to act."
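
To make this concrete, here is a minimal sketch of the idea in Python. It is not the authors' code; the grid size, the Gaussian bumps, the object positions, and the weighting are all illustrative assumptions:

```python
import numpy as np

GRID = (100, 100, 100)  # voxel resolution of the workspace (assumed)

def gaussian_bump(grid, center, sigma):
    """Smooth peak of value 1.0 at `center`, decaying with distance."""
    idx = np.indices(grid).transpose(1, 2, 3, 0)       # shape (X, Y, Z, 3)
    d2 = ((idx - np.asarray(center)) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Affordance map: high value where the robot SHOULD act,
# e.g. at the drawer handle located by the VLM (position assumed).
affordance = gaussian_bump(GRID, center=(20, 50, 40), sigma=5.0)

# Constraint map: high cost near things to avoid,
# e.g. the vase the instruction warns about (position assumed).
avoidance = gaussian_bump(GRID, center=(60, 50, 45), sigma=8.0)

# 3D value map: "where to act" minus a penalty for "what to avoid".
value_map = affordance - 2.0 * avoidance  # weight is an arbitrary choice
```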

With that in hand, a motion planner can take the generated 3D map as its objective function and synthesize the final operation trajectory.
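
A toy planner in the same spirit, again a sketch rather than the paper's actual planner (which optimizes whole trajectories): greedily climb the value map from the gripper's current voxel until a local maximum is reached:

```python
import itertools

def synthesize_trajectory(value_map, start, max_steps=500):
    """Greedy hill-climb over the voxel value map; returns visited cells."""
    pos = tuple(start)
    path = [pos]
    shape = value_map.shape
    moves = [m for m in itertools.product((-1, 0, 1), repeat=3) if any(m)]
    for _ in range(max_steps):
        neighbors = [tuple(p + d for p, d in zip(pos, m)) for m in moves]
        neighbors = [n for n in neighbors
                     if all(0 <= c < s for c, s in zip(n, shape))]
        best = max(neighbors, key=lambda n: value_map[n])
        if value_map[best] <= value_map[pos]:   # local maximum: stop
            break
        pos = best
        path.append(pos)
    return path

waypoints = synthesize_trajectory(value_map, start=(90, 50, 40))
```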

Note the contrast with traditional methods, which require additional pre-training: this method uses large models to guide how the robot interacts with the environment, so it directly sidesteps the scarcity of robot training data.

It is precisely this property that also yields zero-shot capability: once the basic pipeline above is in place, it can handle any given task.

In the concrete implementation, the authors cast VoxPoser's idea as an optimization problem.

It accounts for the fact that human instructions can be very broad and require contextual understanding, so it breaks each instruction into many sub-tasks; for instance, the first example at the beginning consists of "grab the drawer handle" and "pull the drawer open."

VoxPoser optimizes each sub-task to obtain a series of robot trajectories, ultimately minimizing total effort and execution time.
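
In notation reconstructed from this description (approximate, not the paper's exact typesetting): for the $i$-th sub-task instruction $\ell_i$, a robot trajectory $\tau_i$ is sought that scores well on the task while costing little effort,

$$\min_{\tau_i} \; \mathcal{F}_{\mathrm{task}}(\tau_i, \ell_i) + \mathcal{F}_{\mathrm{control}}(\tau_i) \quad \text{subject to the motion remaining feasible,}$$

where $\mathcal{F}_{\mathrm{task}}$ measures how well the motion completes $\ell_i$ under the 3D value maps, and $\mathcal{F}_{\mathrm{control}}$ penalizes the effort and time just mentioned.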

When mapping language instructions into the 3D map with the LLM and VLM, the system exploits the rich semantic space that language conveys: it guides the robot via "entities of interest," with the values marked in the 3D Value Map reflecting which objects "attract" it and which "repel" it.

Returning to the opening example, the drawer "attracts" and the vase "repels."

How these values are generated, of course, depends on the comprehension ability of the large language model.

In the final trajectory-synthesis step, the language model's output stays constant throughout the task, so its output can be cached and the generated code re-evaluated with closed-loop visual feedback, enabling quick replanning whenever a disturbance occurs.

As a result, VoxPoser is highly robust to disturbances.

△Put the waste paper into the blue tray
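
The loop just described is essentially model-predictive control. Here is a hedged sketch reusing `synthesize_trajectory` from above; `perceive`, `execute`, and `goal_reached` are hypothetical interfaces standing in for the real perception and control stack:

```python
def run_closed_loop(cached_llm_code, perceive, execute, goal_reached,
                    max_iters=50):
    """Re-evaluate the cached LLM-generated code against fresh observations,
    so disturbances trigger cheap replanning without a new LLM call."""
    for _ in range(max_iters):
        obs = perceive()                      # fresh RGB-D observation (dict)
        value_map = cached_llm_code(obs)      # cached code, re-run on new obs
        path = synthesize_trajectory(value_map, start=obs["gripper_cell"])
        execute(path[:5])                     # act over a short horizon only
        if goal_reached(perceive()):
            return True
    return False
```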

In both real and simulated environments, measured by average success rate, VoxPoser performs significantly better than the primitive-based baselines regardless of the setting: with or without disturbances, and whether or not the instructions were seen before.

Finally, the authors were pleasantly surprised to find that VoxPoser exhibits four "emergent capabilities":

(1) Estimating physical properties: given two blocks of unknown mass, the robot can use tools to run physical experiments and determine which block is heavier;

(2) Behavioral common-sense reasoning: in a table-setting task, you can tell the robot "I am left-handed," and it infers the implication from context;

(3) Fine-grained correction: during a high-precision task such as "putting the lid on the teapot," you can issue precise instructions like "you are off by 1 centimeter" to correct its operation;

(4) Vision-based multi-step operation: asked to open a drawer exactly halfway, the robot might be unable to comply for lack of an object model, but VoxPoser can propose a multi-step strategy from visual feedback: first open the drawer fully while recording the handle's displacement, then push it back to the midpoint (sketched below).
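
That last strategy is a simple measure-then-correct routine. A sketch, with `perceive`, `pull_fully_open`, and `push_by` as hypothetical helpers that read and move the handle along the drawer's axis:

```python
def open_drawer_halfway(perceive, pull_fully_open, push_by):
    """Open fully to measure travel, then push back to the midpoint."""
    closed = perceive()              # handle position, drawer closed
    pull_fully_open()                # step 1: open the drawer completely
    opened = perceive()              # handle position, drawer open
    travel = opened - closed         # total displacement, observed visually
    push_by(travel / 2)              # step 2: push back to the midpoint
```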

Fei-Fei Li: The 3 North Stars of Computer Vision

About a year ago, Fei-Fei Li wrote an article in the journal of the American Academy of Arts and Sciences, pointing out three directions for the development of computer vision:

  • Embodied AI
  • Visual Reasoning
  • Scene Understanding

Fei-Fei Li argues that embodied intelligence is not limited to humanoid robots: any tangible intelligent machine that can move through space is a form of embodied intelligence.

Just as ImageNet aimed to capture the breadth and diversity of real-world images, embodied-intelligence research must tackle complex and diverse human tasks, from folding laundry to exploring a new city.

Carrying out these tasks from instructions requires vision, but not vision alone; it also takes visual reasoning to understand the three-dimensional relationships in a scene.

Finally, the machine must understand the people in the scene, including their intentions and social relationships: seeing a person open a refrigerator, it can infer that they are hungry; seeing a child sitting on an adult's lap, it can infer that they are parent and child.

Robots combined with large models may be one way to solve these problems.

Besides Fei-Fei Li, the authors include Jiajun Wu, an alumnus of Tsinghua's Yao Class who received his Ph.D. from MIT and is now an assistant professor at Stanford University.

Wenlong Huang, now a doctoral student at Stanford, participated in PaLM-E research during his internship at Google.

Paper Address:

https://voxposer.github.io/voxposer.pdf

Project Homepage:

https://voxposer.github.io/

Reference Links:

[1]https://twitter.com/wenlong_huang/status/1677375515811016704

[2]https://www.amacad.org/publication/searching-computer-vision-north-stars

— End —

Qubits QbitAI · Signed author on Toutiao

Follow us and be the first to know the latest scientific and technological trends
