
Google lets a robot act as the hands and eyes of a large language model, breaking a single task into 16 actions in one go

Report from Heart of the Machine

Editors: Zhang Qian, Mayte

Large models have found their place in robotics.

"I spilled my drink, can you help me?" This is a perfectly normal form of help in our daily lives. Hearing this, your family or friends will often hand you a rag, a few tissues or simply help you clean it up.

But for a robot, things are not so simple. It has to understand what "the drink is spilled" and "help me" mean, and then figure out how to help. That is genuinely hard for robots that only understand simple commands such as "move to (x, y)" or "grab the Coke bottle".

You could break the sentence down into a few simple instructions the robot does understand, but that tedious process might make you give up on using the robot altogether. To remove this friction, the robot needs a smarter brain.

Research in deep learning has shown that large language models with strong language understanding (such as GPT-3 and PaLM) are promising candidates for this brain. Given the same sentence ("I spilled my drink, can you help me?"), a large model might respond, "You could try a vacuum cleaner" or "Would you like me to find a cleaning tool?"

As this shows, large language models can understand somewhat more complex, high-level instructions, but their answers are not always feasible (for example, the robot may not be able to reach a vacuum cleaner, or there may be no vacuum cleaner in the house). To combine the two more effectively, the language model also needs to know the robot's skill repertoire and the constraints of its environment.

Robotics at Google recently took aim at this problem and proposed an algorithm called SayCan (from the paper "Do As I Can, Not As I Say"). In this algorithm, the robot acts as the "hands and eyes" of the language model, while the language model supplies high-level semantic knowledge about the task.

In this cooperative mode, the robot can even complete a long-horizon task of 16 steps:

So how does this work? The research team describes the approach on the project website.

Project site: https://say-can.github.io/

Paper: https://arxiv.org/abs/2204.01691

Method overview

The researchers combine large language models (LLMs) with the robot's physical tasks using the following principle: instead of having the LLM simply interpret a single instruction, they use it to assess how likely each individual action is to be helpful for completing the overall high-level instruction. Put simply, each action has a language description, and the prompted language model scores these actions. In addition, each action has an affordance function (such as a learned value function) that quantifies how likely the action is to succeed from the current state. The product of the two probabilities is the probability that the robot will successfully perform an action that is helpful for the instruction. The candidate actions are ranked by this probability, and the one with the highest value is selected.
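To make this concrete, here is a minimal Python sketch of the scoring step. It is not the authors' implementation; the candidate skills and both probability tables are invented purely for illustration.

```python
# Minimal sketch of SayCan-style action scoring:
# combined score = usefulness (LLM) x feasibility (affordance).
# The probabilities below are invented for illustration only.

def select_action(llm_probs: dict, affordance_probs: dict) -> str:
    """Return the action with the highest combined score."""
    combined = {a: llm_probs[a] * affordance_probs[a] for a in llm_probs}
    return max(combined, key=combined.get)

# Hypothetical candidate skills for "I spilled my drink, can you help me?"
llm_probs = {"find a sponge": 0.5, "find a vacuum cleaner": 0.3, "go to the table": 0.2}
affordance_probs = {"find a sponge": 0.9, "find a vacuum cleaner": 0.1, "go to the table": 0.8}

print(select_action(llm_probs, affordance_probs))  # -> "find a sponge"
```

Here the language model slightly prefers "find a sponge", and the affordance scores rule out the vacuum cleaner (which the robot cannot reach in this made-up scenario), so the combined score cleanly picks the sponge.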

Once an action is selected, the robot executes it, and the process repeats by iteratively selecting the next action and appending it to the command sequence. In practice, the planning is structured as a dialogue between the user and the robot: the user gives a high-level instruction such as "How would you bring me a Coke can?", and the language model responds with an explicit sequence, such as "I would: 1. find a Coke can; 2. pick up the Coke can; 3. bring it to you; 4. done."

In summary, given a high-level instruction, SayCan combines the probability from the language model (that an action is useful for the instruction) with the probability from the value function (that the action can be executed successfully), and then selects the action to perform. Actions chosen this way are both feasible and useful. The researchers repeat this process by appending the selected action to the robot's response and querying the model again, until the output step is "done".
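The overall loop can be sketched as follows. This is a toy, runnable illustration under our own assumptions: `toy_llm_probs` and `toy_affordance` are invented stand-ins for the real language model and learned value functions, and the scripted plan is made up.

```python
# Toy, runnable sketch of a SayCan-style planning loop (illustrative only).
# The "LLM" prefers a scripted plan, and the "affordance" model only allows
# picking up an object after it has been found.

SKILLS = ["find a sponge", "pick up the sponge", "bring it to you", "done"]
SCRIPTED_PLAN = ["find a sponge", "pick up the sponge", "bring it to you", "done"]

def toy_llm_probs(steps_so_far):
    """Toy LLM: puts most probability on the next step of the scripted plan."""
    next_step = SCRIPTED_PLAN[len(steps_so_far)]
    return {s: (0.7 if s == next_step else 0.1) for s in SKILLS}

def toy_affordance(state, skill):
    """Toy value function: picking up the sponge is infeasible before finding it."""
    if skill == "pick up the sponge" and "find a sponge" not in state["completed"]:
        return 0.05
    return 0.9

def saycan_plan(max_steps=16):
    state = {"completed": []}
    steps = []
    for _ in range(max_steps):
        probs = toy_llm_probs(steps)
        scores = {s: probs[s] * toy_affordance(state, s) for s in SKILLS}
        best = max(scores, key=scores.get)   # most useful AND feasible skill
        if best == "done":                   # stop when the model outputs "done"
            break
        steps.append(best)                   # append the chosen step, then re-query
        state["completed"].append(best)
    return steps

print(saycan_plan())  # -> ['find a sponge', 'pick up the sponge', 'bring it to you']
```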

Experimental results

The researchers tested SayCan in two environments, an office kitchen and a simulated office kitchen, on 101 tasks specified by natural-language instructions. Some highlights of the results are shown below.

The figure below visualizes SayCan's decision-making process. Blue bars show the (normalized) LLM probabilities, red bars show the (normalized) probabilities of successfully executing each candidate action, and green bars show the combined score; the algorithm selects the action with the highest combined score. This visualization highlights SayCan's interpretability.
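For reference, a bar chart in this style can be drawn with a few lines of matplotlib; the scores below are invented and do not come from the paper.

```python
# Sketch of a SayCan-style score visualization (the numbers are invented).
import matplotlib.pyplot as plt
import numpy as np

actions = ["find a sponge", "pick up the sponge", "go to the trash can"]
llm = np.array([0.5, 0.3, 0.2])           # normalized LLM probabilities (blue)
affordance = np.array([0.45, 0.1, 0.45])  # normalized success probabilities (red)
combined = llm * affordance               # overall score (green)

x = np.arange(len(actions))
width = 0.25
plt.bar(x - width, llm, width, color="tab:blue", label="LLM")
plt.bar(x, affordance, width, color="tab:red", label="Affordance")
plt.bar(x + width, combined, width, color="tab:green", label="Combined")
plt.xticks(x, actions, rotation=15)
plt.legend()
plt.show()
```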

For example, given the task "I spilled my Coke, can you bring me something to clean it up?", SayCan successfully planned and executed these steps: 1. find a sponge; 2. pick up the sponge; 3. bring it to you; 4. done. This is shown below:

If the task is adjusted slightly to "I spilled my Coke, can you bring me a replacement?", SayCan instead executes these steps: 1. find a Coke can; 2. pick up the Coke can; 3. bring it to you; 4. done. This shows that SayCan can exploit the large capacity of LLMs, whose semantic knowledge of the world is useful both for interpreting instructions and for understanding how to execute them.

In the next example, SayCan uses affordances to "overrule" the choice made by the language model. Although the language model considers picking up the sponge the right action, the affordance model recognizes that this is currently impossible, so "find the sponge" is chosen instead. This highlights the need for affordance grounding.
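A small invented numerical example (not taken from the paper) shows how a near-zero affordance score flips the decision even when the language model strongly prefers the infeasible action.

```python
# Invented numbers illustrating how affordance grounding can overrule the LLM.
llm = {"pick up the sponge": 0.6, "find the sponge": 0.4}
feasible = {"pick up the sponge": 0.05, "find the sponge": 0.9}  # no sponge in view yet

scores = {a: llm[a] * feasible[a] for a in llm}
print(max(scores, key=scores.get))  # -> "find the sponge" (0.36 beats 0.03)
```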

The authors also gave SayCan a very complex command, "I spilled my Coke on the table, can you throw it away and bring me something to wipe it up?", which requires 8 actions; the robot successfully planned and completed the task. In the experiments, the longest task ran to 16 steps.

Overall, the method achieves a 70% planning success rate and a 61% execution success rate across the 101 tasks, and roughly half of the performance is lost if affordance grounding is removed.

More details can be found in the original paper. It has 43 authors, with Karol Hausman, Brian Ichter, and Fei Xia as co-corresponding authors.
