
Making robots perceive your "Here you are": Tsinghua builds generalizable human-to-robot handover from a million scenes

Heart of the Machine Column

Heart of the Machine Editorial Department

Researchers from the Institute for Interdisciplinary Information Sciences at Tsinghua University have proposed the "GenH2R" framework, which lets robots learn generalizable vision-based human-to-robot handover policies. Such generalizable policies enable robots to more reliably receive objects with diverse geometries and complex motion trajectories from people's hands, opening up new possibilities for human-robot interaction.


With the advent of the embodied AI era, we expect intelligent agents to actively interact with their environment. In this process, it is important for robots to integrate into human living environments and interact with humans. We need to think about how to understand human behavior and intentions, meet human needs in the way that best matches their expectations, and put humans at the center of embodied AI. One of the key skills is generalizable human-to-robot handover, which enables robots to better cooperate with humans on a variety of everyday tasks such as cooking, housekeeping, and furniture assembly.

However, having robots learn through large-scale interaction with humans in the real world is both dangerous and expensive, and the robot could easily harm the human.


Training in simulation, where human-body simulation and dynamic grasping motion planning automatically provide massive and diverse robot learning data, and then deploying to a real robot (sim-to-real transfer), is a more reliable learning-based approach that can greatly expand robots' ability to collaborate and interact with humans.


The "GenH2R" framework is therefore proposed. Approaching the problem from the three perspectives of simulation, demonstration, and imitation, it allows a robot, for the first time, to learn in an end-to-end fashion generalizable handovers covering arbitrary grasps, arbitrary handover trajectories, and arbitrary object geometries: 1) the GenH2R-Sim environment provides millions of easily generated, complex simulated handover scenes; 2) an automated pipeline generates expert demonstrations with vision-action correlation; and 3) imitation learning is performed on 4D information (point clouds + time) with a prediction-based auxiliary objective.

Compared with the SOTA method (a CVPR 2023 Highlight), GenH2R achieves a 14% higher average success rate and a 13% shorter handover time across various test sets, and obtains more robust results in real-world experiments.

  • Paper address: https://arxiv.org/abs/2401.00929
  • Paper homepage: https://GenH2R.github.io
  • Paper Video: https://youtu.be/BbphK5QlS1Y

Methodology

A. Simulation environment (GenH2R-Sim)

To generate a high-quality, large-scale hand-object handover dataset, the GenH2R-Sim environment models the scene in terms of both grasp poses and motion trajectories.

For grasp poses, GenH2R-Sim introduces rich 3D object models from ShapeNet, selects 3,266 everyday objects suitable for handover, and uses the dexterous grasp generation method DexGraspNet to generate a total of 1 million scenes of human hands grasping objects. For motion trajectories, GenH2R-Sim uses several control points to generate multiple smooth Bézier curves, and introduces rotation of the hand and object to simulate the various complex motion trajectories that occur when handing over an object.
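To make the trajectory-generation idea concrete, here is a minimal Python sketch of sampling a smooth handover path from a few random control points via a Bézier curve, paired with a simple rotation, in the spirit of GenH2R-Sim's motion synthesis. The function names, workspace bounds, and the simple yaw rotation are illustrative assumptions, not the paper's actual implementation.

import numpy as np
from math import comb

def bezier_curve(control_points: np.ndarray, num_steps: int = 100) -> np.ndarray:
    """Evaluate a Bezier curve defined by (n+1, 3) control points at num_steps samples."""
    n = control_points.shape[0] - 1
    t = np.linspace(0.0, 1.0, num_steps)                        # (num_steps,)
    # Bernstein basis: C(n, i) * t^i * (1 - t)^(n - i), stacked as (num_steps, n + 1).
    basis = np.stack(
        [comb(n, i) * t ** i * (1.0 - t) ** (n - i) for i in range(n + 1)], axis=1
    )
    return basis @ control_points                                # (num_steps, 3)

def sample_handover_trajectory(rng: np.random.Generator, num_control: int = 4,
                               workspace: float = 0.5, num_steps: int = 100):
    """Sample random control points in a cube and pair the path with a slow rotation."""
    control = rng.uniform(-workspace, workspace, size=(num_control, 3))
    positions = bezier_curve(control, num_steps)
    # Illustrative in-plane rotation of the hand/object along the trajectory (one yaw per step).
    yaw = np.linspace(0.0, rng.uniform(-np.pi / 2, np.pi / 2), num_steps)
    return positions, yaw

rng = np.random.default_rng(0)
positions, yaw = sample_handover_trajectory(rng)
print(positions.shape, yaw.shape)  # (100, 3) (100,)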


GenH2R-Sim's 1 million scenes not only far surpass the latest prior work in both the number of motion trajectories (1 million vs. 1,000) and the number of objects (3,266 vs. 20), but also introduce interaction information close to real situations (for example, when the robot arm gets close enough to the object, the person cooperatively stops moving and waits for the handover to complete), rather than simply replaying recorded trajectories. Although the simulated data is not perfectly realistic, the experimental results show that large-scale simulation data helps learning more than small-scale real data.

B. Large-scale generation of distillation-friendly expert demonstrations

Based on the large-scale hand and object motion trajectory data, GenH2R automatically generates a large number of expert demonstrations. The "expert" GenH2R relies on is an improved motion planner (such as OMG Planner): a non-learning, optimization-based method that does not depend on vision (point clouds) and usually requires privileged scene state (such as the target grasp pose on the object). To ensure that the downstream visual policy network can distill useful information, it is critical that the expert demonstrations exhibit vision-action correlation. If the final grasp location is known during planning, the robot arm can ignore vision, plan directly to that final position, and simply lie in wait; this may leave the object outside the robot camera's view, which does not help the downstream visual policy network. Conversely, if the planner frequently re-plans according to the object's current position, the arm may move discontinuously, take on strange configurations, and fail to complete a reasonable grasp.


To generate distillation-friendly expert demonstrations, GenH2R introduces landmark planning: the hand trajectory is divided into multiple segments according to its smoothness and travelled distance, with landmarks marking the segment boundaries. Within each segment the hand trajectory is smooth, and the expert planner plans toward the landmark point. This guarantees both vision-action correlation and continuity of motion.
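The segmentation criterion can be illustrated with a small sketch: split the hand trajectory whenever the accumulated travel distance or the change in motion direction exceeds a threshold, and treat each split point as a landmark for the planner to target. The thresholds and function names below are illustrative assumptions rather than the paper's exact criteria.

import numpy as np

def select_landmarks(traj: np.ndarray, dist_thresh: float = 0.10,
                     angle_thresh: float = np.deg2rad(30)) -> list[int]:
    """Return indices of landmark frames along an (N, 3) hand trajectory."""
    landmarks = [0]
    acc_dist = 0.0
    prev_dir = None
    for i in range(1, len(traj)):
        step = traj[i] - traj[i - 1]
        norm = np.linalg.norm(step)
        acc_dist += norm
        turned = False
        if norm > 1e-6:
            direction = step / norm
            if prev_dir is not None:
                turned = np.arccos(np.clip(direction @ prev_dir, -1.0, 1.0)) > angle_thresh
            prev_dir = direction
        # Start a new segment when the hand has moved far enough or turned sharply.
        if acc_dist > dist_thresh or turned:
            landmarks.append(i)
            acc_dist = 0.0
    if landmarks[-1] != len(traj) - 1:
        landmarks.append(len(traj) - 1)
    return landmarks

traj = np.cumsum(np.random.default_rng(0).normal(scale=0.02, size=(200, 3)), axis=0)
print(select_landmarks(traj)[:5])

Planning toward one landmark per smooth segment keeps the arm's motion continuous while still forcing the demonstration to depend on what the camera actually observes.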


C. Prediction-assisted 4D imitation learning network

Based on the large-scale expert demonstrations, GenH2R uses imitation learning to build a 4D policy network that factorizes the observed time-series point clouds into geometry and motion. For each point cloud frame, the Iterative Closest Point (ICP) algorithm computes the pose transformation between the current frame and the previous one to estimate per-point flow, so that every frame carries motion features. PointNet++ then encodes each frame's point cloud, and the network decodes not only the required 6-DoF egocentric action but also an additional prediction of the object's future pose, strengthening the policy's ability to anticipate future hand and object motion.
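A minimal PyTorch sketch of such a prediction-assisted policy is given below: each frame's points are concatenated with their estimated per-point flow, encoded into a global feature, and decoded into both a 6-DoF egocentric action and an auxiliary future object pose. The simple shared MLP with max pooling stands in for the PointNet++ backbone, and all layer sizes and the quaternion pose parameterization are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class Handover4DPolicy(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Per-point encoder over (x, y, z, flow_x, flow_y, flow_z).
        self.point_mlp = nn.Sequential(
            nn.Linear(6, 128), nn.ReLU(),
            nn.Linear(128, feat_dim), nn.ReLU(),
        )
        self.action_head = nn.Linear(feat_dim, 6)       # 6-DoF egocentric action
        self.future_pose_head = nn.Linear(feat_dim, 7)  # auxiliary: future object pose (xyz + quaternion)

    def forward(self, points: torch.Tensor, flow: torch.Tensor):
        """points, flow: (B, N, 3) current-frame point cloud and its per-point flow (e.g., from ICP)."""
        per_point = self.point_mlp(torch.cat([points, flow], dim=-1))  # (B, N, feat_dim)
        global_feat = per_point.max(dim=1).values                      # permutation-invariant pooling
        return self.action_head(global_feat), self.future_pose_head(global_feat)

policy = Handover4DPolicy()
pts, flw = torch.randn(2, 1024, 3), torch.randn(2, 1024, 3)
action, future_pose = policy(pts, flw)
print(action.shape, future_pose.shape)  # torch.Size([2, 6]) torch.Size([2, 7])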


Unlike more complex 4D backbones (e.g., Transformer-based ones), this architecture offers fast inference, making it better suited to low-latency human-robot interaction scenarios such as handovers, while still exploiting temporal information effectively, striking a balance between simplicity and effectiveness.

Experiments

A. Simulation environment experiments

GenH2R and the SOTA method were compared under various settings; training with the large-scale simulation data in GenH2R-Sim achieves a clear advantage over training with small-scale real-world data (on average a 14% higher success rate and a 13% shorter time across the test sets).

On the real-data test set s0, GenH2R's approach successfully hands over more complex objects and adjusts its pose in advance, avoiding frequent pose adjustments when the gripper is already close to the object.


On the simulation test set t0 (introduced by GenH2R-Sim), GenH2R's approach can predict the future pose of an object and follow a more reasonable approach trajectory.


On the real-data test set t1 (introduced by GenH2R-Sim from HOI4D, about 7 times larger than the s0 test set used in prior work), GenH2R's approach generalizes to unseen real-world objects with different geometries.

B. Real-robot experiments

GenH2R also deploys the learned policies on a real robotic arm to complete the sim-to-real transfer.

GenH2R's policy adapts better to more complex trajectories (e.g., involving rotation) and generalizes better to more complex object geometries.

GenH2R has completed real-world tests and user studies on various handover objects, demonstrating strong robustness.


For more information about the experiments and methods, please refer to the paper homepage.

Meet the team

The authors of the paper are Tsinghua University students Wang Zifan (co-first author), Chen Junyu (co-first author), Chen Ziqing, and Xie Pengwei; the supervisors are Yi Li and Chen Rui.

Tsinghua University's 3D Visual Computing and Machine Intelligence Laboratory (3DVICI Lab) is an artificial intelligence laboratory under the Institute for Interdisciplinary Information Sciences at Tsinghua University, established and directed by Professor Yi Li. 3DVICI Lab targets the frontier of general 3D vision and intelligent robot interaction; its research directions cover embodied perception, interaction planning and generation, and human-robot collaboration, and are closely connected to application fields such as robotics, virtual reality, and autonomous driving. The team's research goal is to equip agents with the ability to understand and interact with the three-dimensional world, and its results have been published in top computer science conferences and journals.
