Are housekeeping robots about to get another upgrade? Chelsea Finn's team releases the new BID algorithm to make robots smarter in one step

Chelsea Finn's team at Stanford University has released new work.

Chelsea Finn's team has long been one of the Stanford groups at the forefront of embodied intelligence research, and the ALOHA stir-fry robot that went viral online was created by this team. Pi, the startup co-founded by team leader Chelsea Finn, received $70 million in funding from Sequoia Capital, OpenAI, and other investors less than a month after it was founded. (Leifeng Net)

Recently, Chelsea Finn's team found that lengthening the action chunk improves a policy's ability to capture temporal dependencies, but it also means the policy acts on less recent observations of the robot's state, which makes it more error-prone in stochastic environments.

To overcome this dilemma, they developed a novel algorithm called Bidirectional Decoding (BID). BID combines action chunking with closed-loop operation: at each time step it samples multiple predictions and selects the best one, which improves the temporal consistency of long action sequences while still allowing adaptive replanning in stochastic environments.

To verify BID's effectiveness, they ran simulation tests on the Franka Kitchen dataset and found that the robot performed well in a household setting. They also ran real-world experiments with a Franka Panda robot, and the results showed that BID significantly improved the robot's placement success rate when the target was moving.

These tests are reminiscent of the team's earlier stir-fry robot, and perhaps the team is planning to apply BID to ALOHA to give housekeeping robots a full technical upgrade.

It is worth mentioning that half of the team members are of Chinese descent, and the developers of the earlier ALOHA project were all Chinese students.

The paper has been posted on arXiv, and the related code has been open-sourced.

Paper title: Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling

Paper address: https://bid-robot.github.io/static/BID_paper.pdf

Project Website: https://bid-robot.github.io/

Code address: https://github.com/YuejiangLIU/bid_diffusion

https://github.com/Jubayer-Hamid/bid_lerobot

Overview of the paper

Research question

This paper addresses challenges in robot learning, with a particular focus on action chunking: predicting and executing a sequence of actions, often learned from human demonstrations, without intermediate replanning. The challenges include the trade-off between capturing temporal dependencies and reacting to unexpected changes in a stochastic environment, as well as the large stylistic variability across demonstrations.

The motivation of the study is to improve how robot systems learn and execute behavior by analyzing action chunking in more depth and providing a practical decoding algorithm. The issues to be addressed include:

The trade-off between temporal dependency and reactivity in action chunking

The considerable stylistic variability across demonstrations

The need for a practical decoding algorithm that improves the performance of robot behavior cloning

Proposed methodology

In this paper, a bidirectional decoding (BID) method is proposed.

BID is an inference algorithm that combines action chunking with closed-loop operation in robot learning. It samples multiple predictions at each time step and selects among them based on backward coherence (alignment with previous decisions) and forward contrast (closeness to the predictions of a stronger policy and distance from those of a weaker one).

This integrated approach enhances the temporal consistency of long action sequences while maintaining the flexibility to adapt to changes in the dynamic environment. BID significantly outperforms existing closed-loop approaches in a variety of robotic tasks, representing a significant improvement in the learning and execution process of robotic systems.

Experiments and results

Datasets

In this paper, experiments were conducted on three datasets: Push-T, RoboMimic, and Franka Kitchen.

For the Push-T dataset, the paper evaluates the proposed Bidirectional Decoding (BID) algorithm on seven tasks, including placing objects into a cup held by a human. The robot used in the experiment is a Franka Panda equipped with two cameras that provide visual observations at a resolution of 256 x 256 pixels. The paper also evaluates how BID scales with larger batch sizes and how compatible it is with existing inference methods.

For the RoboMimic dataset, the paper uses five tasks: Lift, Can, Square, Transport, and Tool Hang. The training set for each task contains 300 demonstration episodes collected from multiple humans.

For the Franka Kitchen dataset, the paper evaluates the learned policy on test cases involving four or more objects, a challenging but practical robotic manipulation setting in the home environment.

Real-world experiments

The paper further evaluates BID through two real-world experiments.

Dynamic placement experiments

The team collected a total of 150 demonstration episodes, including 50 clean, consistent demonstrations and 100 noisy, varied ones. The robot used in the experiment is a Franka Panda running a vision-based diffusion policy.

The robot's task is to deliver the object in its gripper into a cup held by a human. Each demonstration consists of four main phases: (a) randomly initializing the robot position, (b) approaching the target cup, (c) slowing down near the target cup, and (d) releasing the object. The position of the target cup may change during the demonstration.

Notably, BID achieves a similar success rate in the dynamic setting as in the static setting, suggesting that it has the potential to extend action chunking to uncertain environments.

Dynamic pick-up experiments

The paper compares several methods: vanilla open-loop and closed-loop sampling, BID with open-loop and closed-loop sampling, and EMA with closed-loop sampling.

The robot's task is to pick up a cup and place it on a nearby plate. The five main stages are: (a) initializing the robot, (b) approaching the target cup, (c) grasping the target cup, (d) picking up the cup, and (e) placing the cup on the target plate. The position of the target cup may change during an episode.

The results show that in a dynamic environment, the success rate of BID is at least 2 times higher than that of other methods, while maintaining its performance in a static environment.

Interpretation of BID technology

Action chunking is good at modeling the temporal dependencies in demonstrations, but it sacrifices the ability to react to unexpected states in a stochastic environment. The team chose to solve this problem by bridging long action chunks with closed-loop operation.
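The reactivity side of this trade-off can be seen in a toy one-dimensional example. This is only a minimal sketch and not the paper's setup: the task, the `plan_chunk` policy, and the `rollout` helper are all hypothetical, and the target is assumed to jump once at step 3.

```python
def plan_chunk(pos, target, horizon):
    """Toy policy: plan `horizon` unit steps toward the target seen right now."""
    actions = []
    for _ in range(horizon):
        if target > pos:
            step = 1
        elif target < pos:
            step = -1
        else:
            step = 0
        actions.append(step)
        pos += step
    return actions


def rollout(execute_steps, horizon=8, total_steps=16):
    """Run the toy task, replanning after `execute_steps` executed actions."""
    pos, queue = 0, []
    for t in range(total_steps):
        target = 10 if t < 3 else -10  # hypothetical: the target jumps at t = 3
        if not queue:
            # Plan a full chunk but keep only the part that will be executed.
            queue = plan_chunk(pos, target, horizon)[:execute_steps]
        pos += queue.pop(0)
    return pos


open_loop_pos = rollout(execute_steps=8)    # executes each chunk in full
closed_loop_pos = rollout(execute_steps=1)  # replans at every step
print(open_loop_pos, closed_loop_pos)
```

In this toy run the open-loop agent keeps following its stale plan for five steps after the target has moved, while the closed-loop agent turns around immediately. What the toy does not capture is the other half of the trade-off: replanning at every step can break the temporal consistency that long chunks provide.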

Their key hypothesis is that although the probability of any given pair of samples sharing the same latent strategy is low, the probability of finding a consistent pair among a large number of samples is much higher. This intuition led them to frame closed-loop action chunking as a search for the optimal action within a batch of plans sampled at each time step.
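A rough back-of-the-envelope calculation illustrates this intuition. Assume, purely for illustration, that each new sample independently matches the strategy of the previous decision with probability `p`; the value 0.1 and the helper name below are hypothetical and do not come from the paper.

```python
def prob_at_least_one_match(p, n_samples):
    """Chance that at least one of `n_samples` independent samples matches."""
    return 1.0 - (1.0 - p) ** n_samples


p = 0.1  # hypothetical per-sample match probability
for n in (1, 4, 16, 64):
    print(n, round(prob_at_least_one_match(p, n), 4))
```

Under these assumptions, the chance of finding a consistent sample rises from 10% with a single sample to roughly 81% with 16 samples, which is why searching over a batch can pay off.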

$$a^{*} = \arg\min_{a \in \mathcal{A}} \; \mathcal{L}_B(a) + \mathcal{L}_F(a)$$

where $\mathcal{A}$ is the set of sampled action chunks, and $\mathcal{L}_B$ and $\mathcal{L}_F$ are two criteria for measuring temporal dependency, both described in detail below.

$\mathcal{L}_B$ denotes backward coherence.

Here, ρ is a decay hyperparameter that accounts for uncertainty growing over time. This backward loss encourages similar latent strategies between adjacent steps, while allowing gradual adaptation to unforeseen transition dynamics.
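One way such a backward criterion could look in code is sketched below. It is an assumption-laden illustration rather than the paper's exact formula: it assumes the candidate chunk is compared with the chunk chosen one step earlier over the time steps they share, that the distance is Euclidean, and that step τ is weighted by ρ^τ. The function name and the toy chunks are hypothetical.

```python
import math


def backward_coherence(candidate, previous, rho=0.5):
    """Decay-weighted distance between a candidate chunk and the previous one.

    `previous` was planned one step earlier, so previous[tau + 1] and
    candidate[tau] refer to the same future time step (an assumption).
    """
    overlap = min(len(candidate), len(previous) - 1)
    loss = 0.0
    for tau in range(overlap):
        # Later steps are more uncertain, so they are weighted less.
        loss += (rho ** tau) * math.dist(candidate[tau], previous[tau + 1])
    return loss


previous = [(0, 0), (1, 0), (2, 0), (3, 0)]    # chunk chosen at the last step
consistent = [(1, 0), (2, 0), (3, 0), (4, 0)]  # continues the same plan
divergent = [(1, 1), (2, 2), (3, 3), (4, 4)]   # drifts away from it

loss_consistent = backward_coherence(consistent, previous)
loss_divergent = backward_coherence(divergent, previous)
print(loss_consistent, loss_divergent)
```

A candidate that continues the earlier plan gets a loss of zero, while one that drifts away is penalized most heavily for its near-term disagreement.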

$\mathcal{L}_F$ denotes forward contrast.

where $\mathcal{A}^{+} = \mathcal{A} \setminus \{a\}$ is the positive set of predictions from the strong policy $\pi$, $\mathcal{A}^{-}$ is the negative set of predictions from the weak policy $\pi'$, and $N$ is the sample size.
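The forward criterion can be sketched in the same spirit. The version below assumes the loss is the summed distance from a candidate to the other strong-policy samples minus the summed distance to the weak-policy samples, divided by the sample size; this form, the Euclidean distance, and all names and numbers are assumptions made for illustration.

```python
import math


def chunk_distance(a, b):
    """Sum of per-step Euclidean distances between two chunks."""
    return sum(math.dist(x, y) for x, y in zip(a, b))


def forward_contrast(index, strong_samples, weak_samples):
    """Score one strong-policy sample against positives and negatives."""
    candidate = strong_samples[index]
    positives = [s for i, s in enumerate(strong_samples) if i != index]
    n = len(strong_samples)
    pull = sum(chunk_distance(candidate, p) for p in positives)     # stay close
    push = sum(chunk_distance(candidate, q) for q in weak_samples)  # stay away
    return (pull - push) / n


# Hypothetical one-step chunks with one-dimensional actions.
strong = [[(1.0,)], [(1.1,)], [(3.0,)]]
weak = [[(3.0,)], [(2.9,)], [(3.1,)]]

losses = [forward_contrast(i, strong, weak) for i in range(len(strong))]
best = min(range(len(losses)), key=losses.__getitem__)
print(losses, best)
```

The third strong sample sits right among the weak policy's predictions, so it receives the highest loss; the first sample, which is close to another strong sample and far from the weak ones, is selected.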

The figure below illustrates how the backward coherence and forward contrast criteria affect sample selection.

Since all steps in BID can be computed in parallel, the overall computational cost remains modest on modern GPU devices.
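To show why the selection step batches well, here is a NumPy sketch that scores every candidate at once with array operations. It combines simplified versions of the two criteria described above and is not the authors' implementation: the array shapes, the `select_chunk` name, the Euclidean distances, and the equal weighting of the two terms are all assumptions.

```python
import numpy as np


def pairwise(a, b):
    """Chunk-to-chunk distances between a (N, H, D) and b (M, H, D)."""
    diff = a[:, None] - b[None, :]                     # (N, M, H, D)
    return np.linalg.norm(diff, axis=-1).sum(axis=-1)  # (N, M)


def select_chunk(strong, weak, previous, rho=0.5):
    """Pick one chunk from `strong` (N, H, D) by minimizing both terms."""
    n, horizon, _ = strong.shape
    # Backward term: compare overlapping steps with the previous chunk.
    overlap = min(horizon, previous.shape[0] - 1)
    weights = rho ** np.arange(overlap)
    step_dist = np.linalg.norm(
        strong[:, :overlap] - previous[1:overlap + 1], axis=-1
    )                                                  # (N, overlap)
    backward = (step_dist * weights).sum(axis=1)
    # Forward term: near other strong samples, far from weak samples.
    # The self-distance is zero, so the row sum covers the other samples.
    forward = (pairwise(strong, strong).sum(axis=1)
               - pairwise(strong, weak).sum(axis=1)) / n
    scores = backward + forward
    return int(np.argmin(scores)), scores


previous = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float)
strong = np.array([
    [[1, 0.0], [2, 0.0], [3, 0.0]],
    [[1, 0.1], [2, 0.1], [3, 0.1]],
    [[1, 2.0], [2, 4.0], [3, 6.0]],
])
weak = np.array([
    [[1, 2.0], [2, 4.0], [3, 6.0]],
    [[1, 2.2], [2, 4.2], [3, 6.2]],
])

best, scores = select_chunk(strong, weak, previous)
print(best, scores)
```

Each candidate's score depends only on fixed inputs, so no loop over candidates is needed; the same array expressions would map naturally onto a GPU array library.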

Meet the team

Chelsea Finn

Dr. Chelsea Finn earned her PhD at the University of California, Berkeley, where she studied under Sergey Levine. She spent 6 years at Google DeepMind and is now an assistant professor in the Departments of Computer Science and Electrical Engineering at Stanford University and a co-founder of Pi.

Her research focuses on enabling robots and other agents to develop a wide range of intelligent behaviors through learning and interaction. Her lab, IRIS, studies intelligence through large-scale robotic interaction and is part of SAIL and the ML Group.

The three Chinese researchers on the team are:

Yuejiang Liu

Yuejiang Liu is a postdoctoral fellow at the IRIS Lab and a graduate of the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. His research focuses on self-supervised learning, causal representation learning, and test-time adaptation, with applications to computer vision and multi-agent systems.

Annie Xie

Annie Xie graduated from the University of California, Berkeley, where she worked with Sergey Levine at the Berkeley Artificial Intelligence Research (BAIR) Lab. She is now a PhD student supervised by Chelsea Finn. Her research focuses on developing robotic systems that learn with minimal human supervision.

Maximilian Du

Maximilian Du graduated from Stanford University this year with a bachelor's degree in Computer Science and minors in Psychology and Creative Writing. He worked on robot learning at Chelsea Finn's IRIS Lab and is now an incoming PhD student in her group.