Humanoid robots are really about to arrive! "Cyber nannies" kick off the new year as startups rake in funding and orders

Zhidx (WeChat official account: zhidxcom)

Author | vanilla

Editor | Li Shuiqing

Just three weeks into 2024, the AI + robotics field has gotten off to an explosive start!

First a Stanford robot showed off cooking a "Manchu-Han Imperial Feast," then Tesla's Optimus played nanny and folded a T-shirt into a crisp square, and two startups' robots competed at making coffee. Meanwhile, OpenAI-backed robotics company 1X announced a $100 million financing, and the startup Figure announced that its robots are moving into a BMW car factory.

This seems to confirm a prediction made late last year by Jim Fan, a senior scientist at Nvidia: 2024 will be the breakout year for robotics, second in importance only to large language models (LLMs). "We are still 3 years away from the ChatGPT moment of physical AI agents."

▲Jim Fan said robotics in 2024 will be second in importance only to LLMs (Source: X)

However, beneath the robot companies' "carnival," the authenticity of their promotional videos and the practicality of their products have stirred controversy. Many netizens pointed out that the demos appear to rely on misleading editing.

So what exactly can AI robots do today? Are the behaviors on display autonomous, or is a human operating them behind the scenes? What stage of development has the AI robot field reached? What pain points remain on the road to deployment? Zhidx spoke in depth with Hu Debo, CEO of Kepler Exploration Robot, Xiong Youjun, co-founder, chief technology officer, and executive director of UBTECH, and other practitioners to find answers.

Hu Debo said the scenarios where AI robots are most likely to land first involve simple, repetitive, and relatively controllable tasks, including industrial manufacturing, warehousing and logistics, and certain hazardous settings. He believes the latency introduced by calling large cloud-hosted models is the biggest pain point for deployment.

On the pain points of AI robots, Xiong Youjun analyzed data, scenarios, safety, and migration costs. For example, most existing training data is collected on tabletop setups, far removed from real-world applications, and the uninterpretability of large models can cause problems akin to the "hallucinations" seen in language models.

1. Cooking, making coffee, folding clothes: Stanford, Google, and Tesla compete with flashy demos

If predicting at the end of last year that 2024 would become the "Year of Robotics" was still an empty slogan, then this year Stanford, Google, Figure, and Tesla have released more than six new demos or developments in under a month, lending the claim strong support.

In the early hours of January 4, a three-person team from Stanford University released a demonstration video of a robot based on the Mobile ALOHA system, showing it completing complex mobile manipulation tasks: cooking, wiping down a tabletop, even pressing the button and riding an elevator.

▲Mobile ALOHA cooking, elevator ride, and cleaning demonstration (Source: Mobile ALOHA team)

The team open-sourced all of Mobile ALOHA's software, hardware, and data; according to the bill of materials, the hardware costs about $31,800 in total, roughly 228,000 yuan.

▲Mobile ALOHA hardware bill of materials (Source: Mobile ALOHA team)

According to reports, Mobile ALOHA is a low-cost whole-body teleoperation system for data collection. During training, only 50 demonstrations were collected per task; the key is to use the data collected with Mobile ALOHA for supervised behavior cloning, co-trained with existing static ALOHA data, which can raise success rates by up to 90%.
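As a rough illustration of the co-training recipe described above (a hypothetical sketch, not the team's actual code; all dataset and function names are placeholders), each training sample is drawn either from the small set of Mobile ALOHA demonstrations or from the larger static ALOHA corpus:

```python
import random

def cotrain_sample(mobile_demos, static_demos, static_ratio=0.5, rng=random):
    """Draw one (observation, action) training pair, mixing the small
    Mobile ALOHA dataset with the larger static ALOHA dataset.

    The pair would feed a supervised behavior-cloning loss, i.e. the
    policy is trained to reproduce the demonstrated action.
    """
    source = static_demos if rng.random() < static_ratio else mobile_demos
    return rng.choice(source)

# Toy datasets: 50 mobile demos for one task vs. a larger static corpus.
mobile = [(f"mobile_obs_{i}", f"mobile_act_{i}") for i in range(50)]
static = [(f"static_obs_{i}", f"static_act_{i}") for i in range(800)]
batch = [cotrain_sample(mobile, static) for _ in range(32)]
```

The sampling ratio here is an invented placeholder; the point is only that every gradient step sees a blend of the two data sources, so the scarce mobile demonstrations are regularized by the larger static dataset.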

ALOHA is a low-cost, open-source hardware system for bimanual teleoperation, released last March by teams from Stanford, UC Berkeley, Meta, and other institutions.

▲ALOHA system demonstration (Source: ALOHA team)

Less than 24 hours later, Google DeepMind released three new advances, AutoRT, SARA-RT, and RT-Trajectory, aimed at improving robots' speed, data collection, and generalization.

All three build on DeepMind's RT-2 (Robotics Transformer) model, a vision-language-action (VLA) model that learns from web and robot data and translates that knowledge into generalized instructions for robot control.

▲Demonstration of how the RT-2 model works (Source: DeepMind)

AutoRT is an embodied foundation-model system for orchestrating robot agents at scale.

Each robot first uses a vision-language model (VLM) for scene understanding, feeds the scene description into a large language model (LLM) that proposes natural-language task instructions, and then refines those instructions toward safer behavior under the guidance of another LLM applying a set of rules called the Robot Constitution.

▲How AutoRT works (Source: DeepMind)

The Robot Constitution contains three types of rules: a foundational rule, stating that a robot must not harm humans; safety rules, which forbid robots from attempting tasks involving humans, animals, or other living things, and from interacting with sharp objects such as knives; and embodiment rules, which forbid a robot with only one arm from attempting tasks that require two arms.
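To make the idea concrete, here is a toy sketch of constitution-style task filtering. This is an illustration only: the keyword lists, rule checks, and function names are invented for this example, not DeepMind's implementation.

```python
# Toy "robot constitution" filter: an LLM proposes candidate tasks, and
# rule checks reject unsafe or infeasible ones before execution.
FORBIDDEN_KEYWORDS = {"human", "animal", "knife", "sharp"}  # safety rules

def passes_constitution(task: str, num_arms: int = 1) -> bool:
    text = task.lower()
    # Safety rules: no tasks involving humans, animals, or sharp objects.
    if any(word in text for word in FORBIDDEN_KEYWORDS):
        return False
    # Embodiment rule: a one-armed robot must not attempt two-arm tasks.
    if num_arms < 2 and "both hands" in text:
        return False
    return True

candidates = [
    "pick up the sponge from the table",
    "hand the knife to a human",
    "lift the box with both hands",
]
approved = [t for t in candidates if passes_constitution(t, num_arms=1)]
```

In the real system the checks are carried out by an LLM critic plus human oversight rather than keyword matching; the sketch only shows where such a filter sits in the pipeline.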

According to DeepMind, over more than seven months of real-world evaluation, the AutoRT system safely orchestrated up to 20 robots simultaneously, collecting 77,000 robot trials spanning 6,650 unique tasks.

▲Latency demonstration of AutoRT running on 8 robots (Source: DeepMind)

SARA-RT introduces a self-adaptive robust attention mechanism that converts RT models into more efficient versions without losing quality. Given a short history of images, the best SARA-RT-2 model was 10.6% more accurate and 14% faster than the RT-2 model.
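The efficiency gain comes from replacing quadratic softmax attention with linear attention. Below is a minimal NumPy sketch contrasting the two; it is illustrative only, and SARA-RT's actual feature map and "up-training" procedure are more involved than the placeholder `phi` used here.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: cost grows quadratically with sequence length n."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (n, n) matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """Linear attention: the (n, n) weight matrix is never materialized.
    phi is a positive feature map; K'V is only (d, d), so cost is linear in n."""
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                    # (d, d) summary
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T      # (n, 1) normalizer
    return (Qp @ kv) / norm
```

In `softmax_attention` the (n, n) weight matrix dominates cost for long token sequences; `linear_attention` instead accumulates a small (d, d) summary, which is why such conversions speed up Transformer-based robot policies.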

▲SARA-RT-2 model is used for robot operation tasks (Source: DeepMind)

RT-Trajectory is a model that augments robot tasks with post-hoc trajectory sketches to improve robots' generalization ability. It takes each video in the training dataset and overlays a 2D sketch of the robot arm gripper's trajectory as it performs the task, providing practical low-level visual hints.

When tested on 41 tasks not seen in the training data, the robotic arm controlled by RT-Trajectory achieved a 63% success rate, versus 29% for RT-2.
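The overlay idea can be sketched as follows. This is a hypothetical illustration; the function name and drawing details are assumptions, not the RT-Trajectory code. Given a frame and a sequence of 2D gripper positions, the trajectory is drawn directly onto the image that conditions the policy.

```python
import numpy as np

def overlay_trajectory(frame, gripper_xy, color=(255, 0, 0)):
    """Draw the gripper's 2D path onto a video frame as a visual hint.

    frame:      H x W x 3 uint8 image
    gripper_xy: list of (row, col) gripper positions over the episode
    Returns a copy of the frame with the trajectory pixels colored in.
    """
    out = frame.copy()
    for (r0, c0), (r1, c1) in zip(gripper_xy, gripper_xy[1:]):
        # Linearly interpolate between consecutive waypoints.
        steps = max(abs(r1 - r0), abs(c1 - c0), 1)
        for t in np.linspace(0.0, 1.0, steps + 1):
            r = int(round(r0 + t * (r1 - r0)))
            c = int(round(c0 + t * (c1 - c0)))
            if 0 <= r < out.shape[0] and 0 <= c < out.shape[1]:
                out[r, c] = color
    return out
```

Because the hint lives in image space, a policy conditioned on such overlays can be steered at test time by drawing a new sketch, which is the generalization mechanism the paragraph above describes.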

▲How the RT-Trajectory model works (Source: DeepMind)

On January 7, the startup Figure released a video of its robot Figure 01 making coffee, emphasizing that the robot used an end-to-end AI system and completed its training in 10 hours just by watching humans make coffee.

▲The robot Figure 01 making coffee (Source: Figure)

According to Figure, Figure 01's neural network takes in video and outputs motion trajectories. The robot has also learned to self-correct: for example, when the coffee pod isn't seated properly, it nudges it into the right position.

▲The robot Figure 01 self-correcting (Source: Figure)

On January 11, OpenAI-backed AI and robotics company 1X announced the completion of a $100 million Series B funding round from investors including Samsung NEXT Fund and Swedish private equity fund EQT.

The funds will be used primarily to bring its second-generation bipedal humanoid, the android NEO, to market, and to support existing enterprise customers in logistics and security. Designed for everyday household assistance, NEO is meant to handle a wide range of housekeeping tasks in the consumer market.

▲1X's second-generation bipedal humanoid android NEO (Source: 1X)

Within days, Optimus, the biggest star in the humanoid robot world, joined the fun. On January 16, Musk posted a video of Optimus folding clothes, which instantly set social networks alight and racked up more than 71 million views.

In the video, Optimus takes a T-shirt from the basket beside it and folds it into a crisp square in a few deft motions.

▲Optimus folding a shirt (Source: X)

On January 18, Figure announced that it had signed a commercial agreement with BMW for the robot Figure 01 to enter BMW factories to "automate difficult, unsafe, and tedious tasks" in the car manufacturing process.

On January 20, Chinese startup MagicLab released a video of a humanoid robot doing a somersault, said to be the first somersault achieved by an electrically actuated humanoid robot. MagicLab also showed the robot making coffee and pouring latte art.

▲MagicLab robot makes latte art (source: X)

2. False advertising, or the real deal? Viral demos draw controversy over authenticity and practicality

It must be said that in the first three weeks of the new year, industry, academia, and research have gone "crazy" over AI robots. But as these achievements went viral, they also drew controversy, such as whether the demonstrations are genuine and whether the robot systems are actually practical.

After the Mobile ALOHA demonstration video was released, the comment section filled with doubts alongside the praise.

Bloomberg columnist Karl Smith commented: "Sorry, I don't think the shrimp are fully cooked. This is another Gemini Ultra-style demo."

▲Netizens questioned the authenticity of the demonstration video and the practicality of the robot (Source: X)

As an aside, Google's misleading editing in its Gemini demo video has clearly left a deep impression; "Gemini-style demo" has become a new epithet.

"But how does it taste?" said developer Nick Dobos.

▲Netizens questioned the practicality of robot cooking (source: X)

Netizen Sarah Roark suspected human teleoperation: "To be clear - is this definitely not remotely controlled?"

▲Netizens questioned whether the robot is in autonomous mode (Source: X)

In the face of these doubts, especially the controversy over autonomy versus teleoperation, the Mobile ALOHA team quickly released a blooper reel on January 6 to clarify.

In fact, Stanford had released several Mobile ALOHA demonstration videos at the same time, including one posted by co-author Zipeng Fu showing operation in autonomous mode.

▲Zipeng Fu released a demonstration video of the autonomous mode (Source: X)

The other co-author, Tony Z. Zhao, released the "Manchu-Han Imperial Feast" demonstration, which was teleoperated by humans in hybrid mode, but many people mistakenly believed all the demonstrations were completed autonomously.

▲Mobile ALOHA in hybrid mode (Source: X)

In the clarification video, the team shows some of the "silly mistakes" the robot made in autonomous mode.

For example, you might think it can gracefully pick up a goblet, when in fact the glass slipped from its grip many times:

▲Mobile ALOHA lets the wine glass slip (Source: X)

The fried shrimp that should have been tipped into a bowl instead landed on the table, with half of it left burnt in the pan:

▲Mobile ALOHA spills shrimp onto the table (Source: X)

While stir-frying shrimp, it often lost its grip on the spatula:

▲Mobile ALOHA fails at stir-frying shrimp (Source: X)

Yet after the blooper reel was released, netizens did not pile on; instead, they offered encouragement.

"Thanks for sharing these. Many people saw the earlier video and assumed the robot was fully autonomous, when in reality it was teleoperated. As this video shows, autonomous mode is much harder!" said netizen Phil Trubey.

Tony Z. Zhao also responded: "It's indeed hybrid mode, and we really hope people will visit the project website and read the paper/code!"

▲Tony Z. Zhao responds to comments (Source: X)

"I prefer this video because it shows the effort and progress behind it." Netizen Kevin Hu praised this candid display of the failures behind the scenes.

▲ Netizens commented on the mobile ALOHA mistake video (source: X)

Yoshihiro Tanaka, CEO of Japanese creative studio Taziku, said: "It's not perfect, but put another way, it's cute and likable."

▲ Netizens commented on the mobile ALOHA mistake video (source: X)

As for Optimus, sharp-eyed netizens noticed what appeared to be a human hand in the lower-right corner of the video, remotely guiding its movements.

▲A human hand appears in the lower-right corner of the Optimus video (Source: X)

Musk promptly added in the comments: "Optimus cannot yet fold clothes autonomously, but in the future it will certainly be able to do this fully autonomously in an arbitrary environment (it won't require a fixed table with a box that has only one shirt)."

▲Musk clarified that Optimus did not complete the task autonomously (Source: X)

Like Mobile ALOHA, Optimus's clothes-folding demonstration has been questioned on practicality.

One netizen quipped: "My mom would have shooed it away by now: too slow, let me do it."

▲Netizens question Optimus's practicality (Source: X)

"It operates remotely, like an ALOHA robot... In my opinion, the biggest problem with Optimus is cost," said Bindu Reddy, CEO of AI startup Abacus.AI.

▲Netizens question Optimus's cost-effectiveness (Source: X)

Some netizens felt it was too slow: "Will they be this slow when they try to take over the world? If so, I'm less worried about the Terminator than I used to be."

▲Netizens question how slowly Optimus moves (Source: X)

3. Scarce data, complex scenarios, poor real-time performance: the hurdles embodied robots must clear to deploy

Although these demonstrations contain an element of hype and packaging, it is undeniable that they have contributed a great deal to the field of embodied intelligent robots.

On one hand, the viral demo videos have drawn more attention to the field; on the other, they have demonstrated the potential of fine-grained physical manipulation, low-cost solutions, and more.

Regarding the blooper video released by the Stanford Mobile ALOHA team, Hu Debo, CEO of Kepler Exploration Robot, told Zhidx that it should not be viewed as a "failure" but as the inevitable experience behind success.

He believes Mobile ALOHA's popularity stems mainly from how it stoked expectations for robots in household-chore scenarios. On the technical level, its greatest contribution lies in the finesse of its physical manipulation. Cooking, watering flowers, doing laundry... Mobile ALOHA demonstrates robots tackling exactly the menial tasks they must master to enter the home.

▲Hu Debo and Kepler humanoid robots at CES 2024 (Source: provided by the interviewee)

Xiong Youjun, co-founder, chief technology officer, and executive director of UBTECH, likewise believes this is not a "failure" but an inevitable part of technological development. Collecting data through teleoperation and other methods in real scenarios lays the groundwork for future robot training and offers a more efficient path.

On Mobile ALOHA's main contribution, he noted that the system demonstrates a low-cost approach, with hardware choices such as webcams and laptops. And it is still at the demo stage; if mass-produced in the future, costs would fall further.

Using the iteration of GPT models as an analogy, Hu Debo puts the current stage of AI robot development at roughly GPT-2.

Specifically, robots at this stage show some intelligence and autonomy, able to learn and complete simple operations on their own; the robots' "IQ" has clearly improved. But there is not yet a robot that, like GPT-3, can solve problems at scale, attract a large user base, and become a phenomenon-level product.

At the deployment level, Hu Debo sees real-time performance as the biggest pain point. Because calling a cloud-hosted large model can take on the order of seconds to respond, it is difficult to deploy robots that require real-time operation in the field.

In addition, Xiong Youjun told Zhidx that data, scenarios, safety, and migration costs are pain points facing many companies.

▲Xiong Youjun, co-founder, chief technology officer and executive director of UBTECH (Source: World Robot Conference Forum)

The first challenge in training a large robot model is data collection. The data required differs from that for a large language model: it needs not only text corpora but also large volumes of images, real-scene recordings, and other data.

As for scenarios, because real physical environments are complex while most existing training is tabletop-based, a large gap remains between training and real-life deployment.

On safety, because large models operate as black boxes, many behaviors cannot be explained. In a language model, an error like a "hallucination" may merely mislead the user; if a robot model errs, it can damage the environment or harm humans, with irreversible consequences.

Finally, the success rate of transferring from training to real-world scenarios is still low, and engineers must spend great effort bridging the gap, so migration costs are high; there is still a long way to go to reach over 99% accuracy and reliability.

Although many problems remain in deploying AI robots, Xiong Youjun is optimistic. The AI robot field has drawn enormous attention, with many companies and abundant resources pouring in; coupled with the rapid advance of AI technology, more progress has been made in the past two years than in the previous ten.

Overall, Hu Debo said the scenarios where AI robots are most likely to land first center on simple, repetitive, and relatively controllable tasks.

First are manufacturing scenarios, which include a great deal of auxiliary, relatively simple work; second are warehousing and logistics, including repetitive manual labor such as sorting and handling; third are hazardous settings, such as inspection patrols at nuclear power plants, chemical plants, and munitions factories.

Conclusion: it's too early for robots to "replace humans"

Robots that cook, clean, and fold clothes on their own are certainly eye-catching, but a sober look reveals that these robots still need human teleoperation, and in fully autonomous mode they remain "clumsy," still some distance from true intelligence.

Issues of data, scenarios, and safety remain the robots' "Achilles' heel"; the good news is that organizations like DeepMind are already making progress on these fronts.

In any case, this fierce competition among companies and institutions is a good thing, and we look forward to AI robots learning more skills in 2024 and going further down the road into industrial, home, and other settings.
