Tesla's Optimus robot video goes viral! Powered by an end-to-end AI brain, it takes on difficult yoga poses

Author: New Zhiyuan

Editor: Peach So sleepy

The latest video of Tesla's humanoid robot "Optimus Prime" has been released. Powered by an end-to-end neural network, it can accurately sort objects and keep its balance, leading many netizens to exclaim that it will change humanity.

Over the weekend, Tesla's humanoid robot "Optimus Prime" received a wave of updates that drew crowds of netizens.

In an official video, Optimus Prime can now sort objects autonomously.

This is all thanks to the neural network behind it, which is trained end to end: "video input, control output".

It is now able to precisely control the movements of the hands and legs, allowing it to learn tasks more efficiently.

Even with only vision and joint position encoders, it can precisely locate its hands in space.

In addition, its neural networks run entirely on-board and rely on vision alone.

Backed by this technology, "Optimus Prime" can automatically sort blocks of different colors.

Even when someone interferes, Optimus Prime carries on undeterred. It can also correct itself: when a block falls over, it picks the block up and sets it upright.

Not only can it sort the blocks, it can also perform the reverse action and take them out again.

After a day's work, it does some stretching: "Optimus Prime" stands upright on one leg with its arms extended.

Finally, it puts its hands together in a "Namaste".

Netizens who have seen the video exclaimed that less than 2 years ago, "Optimus Prime" needed to be pushed onto the stage, but now it can complete the performance so quickly! Moreover, this is not a pre-made trick! It uses AGI, it's amazing!

Some netizens joked: look at the balance of "Optimus Prime"... it already beats me at yoga.

Back in October 2022, at AI Day, the "Optimus Prime" prototype had to be carried on stage by three strong men to greet the audience.

Musk has said that "Optimus Prime" shares the powerful vision system built for Tesla FSD (Full Self-Driving), and that the underlying modules of the two have been connected.

In his view, Tesla has always been an AI company, not just a car company.

"Soon, we will see that the number of Optimus Prime will far exceed Tesla cars."

How to achieve this?

At Tesla's shareholder meeting this year, a video was released showing five "Optimus Prime" robots walking forward together.

Compared with the "Optimus Prime" that debuted last year, this one has gone through a major iterative upgrade.

This time, with vision-based fine control of its hand movements, it has gained even more capabilities.

Nvidia senior scientist Jim Fan "reverse-engineered" Optimus Prime, analyzing how its tech stack might be implemented.

It is worth mentioning that Jim Fan's in-depth analysis even drew a reply from Musk!

1. Imitation learning

Optimus' fluid hand movements were almost certainly learned through imitation learning (behavioral cloning) from human operators.

By contrast, policies trained with reinforcement learning in simulation tend to produce jittery movements and unnatural hand poses.
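To make the idea of behavioral cloning concrete, here is a minimal sketch, not Tesla's method: a policy is fitted by ordinary supervised learning to (observation, action) pairs recorded from a demonstrator. The one-dimensional "hand position" task, the demonstration data, and the linear policy are all assumptions made for illustration.

```python
# Behavioral cloning in miniature: fit a policy to demonstration pairs.
# The task, data and linear policy are hypothetical illustrations.

def fit_linear_policy(observations, actions):
    """Least-squares fit of action = w * observation + b (1-D case)."""
    n = len(observations)
    mean_o = sum(observations) / n
    mean_a = sum(actions) / n
    cov = sum((o - mean_o) * (a - mean_a) for o, a in zip(observations, actions))
    var = sum((o - mean_o) ** 2 for o in observations)
    w = cov / var
    b = mean_a - w * mean_o
    return w, b

# Hypothetical demonstrations: the operator moves the hand toward a target
# at position 1.0, commanding a velocity proportional to the remaining gap.
demo_obs = [0.0, 0.2, 0.4, 0.6, 0.8]
demo_act = [0.5 * (1.0 - o) for o in demo_obs]

w, b = fit_linear_policy(demo_obs, demo_act)

def policy(observation):
    """The cloned policy: imitates the demonstrator on unseen observations."""
    return w * observation + b
```

A real system would replace the linear fit with a neural network and the scalar observation with camera frames, but the training signal is the same: match the demonstrator's action for each observation.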

Specifically, there are at least 4 ways to collect human demonstrations:

(1) Custom teleoperation system: This is the approach the Tesla team most likely adopted.

Open source example: ALOHA is a low-cost bimanual robot arm and teleoperation system developed by Stanford, UC Berkeley and Meta. It enables very precise and dexterous movements, such as inserting AAA batteries into a remote control or handling contact lenses.

ALOHA Project Address: https://tonyzhaozh.github.io/aloha/

(2) Motion capture (MoCap), option 1: Use the MoCap systems used in Hollywood movies to capture the subtle movements of the hand joints.

Optimus' five-fingered hands are a great design choice for direct mapping: there is no "embodiment gap" between the robot and the human operator.

For example, the demonstrator puts on a CyberGlove and picks up a block on a table. The CyberGlove captures motion signals and haptic feedback in real time and retargets them to Optimus.

(3) Motion capture (MoCap), option 2: Use computer vision.

NVIDIA's DexPilot enables marker-less, glove-free data collection, allowing human operators to complete tasks with their bare hands.

Four Intel RealSense depth cameras and two NVIDIA Titan XP GPUs (yes, this is work from 2019) convert pixels into precise motion signals for the robot to learn from.

In NVIDIA's official demonstration, a robot arm driven by the DexPilot system accurately completes grasping and placing tasks.

(4) VR headset: Turn the training room into a VR game that allows humans to "play" Optimus.

Controlling the hands of a virtual Optimus with a native VR controller or a CyberGlove has the advantage of remote data collection: annotators from around the world can contribute without being on site.

For example, research projects such as Jim Fan's iGibson Home Robot Simulator have similar VR demonstration technology.

iGibson Project Address: https://svl.stanford.edu/igibson/

These 4 approaches are not mutually exclusive, and Optimus could combine them depending on the scenario.

2. Neural architecture

Optimus is trained end to end: video in, actions out.

It is almost certainly a multimodal Transformer with the following components:

(1) Image: An efficient ViT variant, or just an old ResNet/EfficientNet backbone. The block pick-and-place demonstration does not require sophisticated vision. The spatial feature maps from the image backbone can easily be turned into tokens.

EfficientNet paper address: https://arxiv.org/abs/1905.11946

(2) Video: Two options. Either flatten the video into a sequence of images and generate tokens for each frame independently, or use a video-level tokenizer.

There are many ways to process video pixel volumes efficiently. You don't necessarily need a Transformer backbone: SlowFast Network and RubiksNet are two examples.

SlowFast Network paper address: https://arxiv.org/abs/1812.03982

RubiksNet Project Address: https://stanfordvl.github.io/rubiksnet-site/
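The first option, tokenizing each frame independently, can be sketched as follows. This is an illustrative example rather than anything known about Optimus: each frame is cut into non-overlapping patches, each patch becomes one token, and the tokens of successive frames are concatenated in time order. The function names and the tiny 4x4 "video" are hypothetical, and the sketch assumes the frame height and width are divisible by the patch size.

```python
def patchify(frame, patch):
    """Split an H x W frame (a list of rows) into non-overlapping
    patch x patch tiles, each flattened into a list of pixel values."""
    h, w = len(frame), len(frame[0])
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([frame[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens

def tokenize_video(frames, patch):
    """Tokenize every frame independently and concatenate in time order."""
    return [tok for frame in frames for tok in patchify(frame, patch)]

# Hypothetical 3-frame video of 4x4 single-channel frames.
video = [[[t * 100 + r * 4 + c for c in range(4)] for r in range(4)]
         for t in range(3)]
tokens = tokenize_video(video, patch=2)  # 3 frames x 4 patches per frame
```

In a ViT-style model each flattened patch would then be linearly projected to an embedding. The token count grows with the number of frames, which is one reason video-level tokenizers are attractive.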

(3) Language: It's unclear whether Optimus supports language prompts. If it does, it needs a way to "fuse" the language representation with perception.

The lightweight neural network module FiLM, for example, can do this. Intuitively, you can think of it as a lightweight form of "cross-attention" that injects the language embedding into the image-processing pathway.

FiLM paper address: https://arxiv.org/abs/1709.07871
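As a rough sketch of what FiLM does: the conditioning input (here a language embedding) is passed through two small linear layers to predict a per-channel scale gamma and shift beta, and every value in a channel's feature map is transformed as gamma * x + beta. The weights, shapes, and helper names below are made up for illustration and are not taken from any particular model.

```python
def linear(vec, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(feature_maps, lang_embedding, gamma_params, beta_params):
    """Scale and shift each channel by values predicted from language."""
    gammas = linear(lang_embedding, *gamma_params)
    betas = linear(lang_embedding, *beta_params)
    return [[[g * x + b for x in row] for row in channel]
            for channel, g, b in zip(feature_maps, gammas, betas)]

# Hypothetical inputs: 2 channels of 2x2 features, 2-D language embedding.
features = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
lang = [2.0, 0.0]
gamma_params = ([[1, 0], [0, 1]], [0, 0])   # (weights, bias) for gamma
beta_params = ([[0, 0], [0, 0]], [1, -1])   # (weights, bias) for beta

modulated = film(features, lang, gamma_params, beta_params)
```

With these toy weights the instruction amplifies the first channel (gamma = 2) and suppresses the second (gamma = 0), which is how language can steer which visual features the controller attends to.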

(4) Action tokenization: Optimus needs to convert continuous motion signals into discrete tokens so that the autoregressive Transformer can work with them.

- Bin the continuous value of each hand joint control directly into intervals: [0,0.01)->token#0, [0.01,0.02)->token#1, etc. This method is straightforward, but may be inefficient because of the long sequence length.

- Joint movements are highly dependent on each other, which means they occupy a low-dimensional "state space". Applying VQVAE to the motion data yields a shorter, compressed set of tokens.
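The first option, uniform binning, can be sketched in a few lines. The value range [0, 1], the 100 bins, and the function names are assumptions chosen to match the intervals quoted above. They are not known details of Optimus.

```python
def encode(value, low=0.0, high=1.0, n_bins=100):
    """Map a continuous joint value to a token id by uniform binning."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # value == high falls in the last bin

def decode(token, low=0.0, high=1.0, n_bins=100):
    """Map a token id back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width

# Hypothetical pose: one normalized control value per joint.
pose = [0.005, 0.015, 0.5, 1.0]
pose_tokens = [encode(v) for v in pose]
```

Each joint contributes one token per time step, so a hand with many joints produces long sequences. That is the inefficiency the VQVAE option is meant to address.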

(5) Putting the above parts together, we have a Transformer controller that consumes video tokens (optionally modulated by language) and outputs action tokens one step at a time.

The next frame of the table is fed back to the Transformer controller so that it knows the outcome of its action. This is what gives it the self-correcting ability shown in the demo.
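The feedback loop described here can be illustrated with a toy example. In this sketch a hand-written proportional controller stands in for the learned Transformer, and a single number stands in for the camera frame. Both are simplifying assumptions, as are the names and the disturbance. The point is only that re-observing after every action lets the controller recover when something knocks the object off course.

```python
def controller(observed_pos, target, max_step=0.2):
    """Stand-in for the learned policy: step toward the target, capped."""
    error = target - observed_pos
    return max(-max_step, min(max_step, error))

def run_episode(start, target, steps, disturbances=None):
    """Closed loop: act, observe the result, then act again."""
    disturbances = disturbances or {}
    pos = start
    trajectory = [pos]
    for t in range(steps):
        action = controller(pos, target)  # policy sees the latest observation
        pos = pos + action + disturbances.get(t, 0.0)
        trajectory.append(pos)
    return trajectory

# Hypothetical run: at step 4 someone pushes the block back by 0.5.
traj = run_episode(start=0.0, target=1.0, steps=12, disturbances={4: -0.5})
```

Because the controller never assumes its previous action succeeded, the push at step 4 only delays the episode: the final position still reaches the target.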

The structure will be similar to Google's RT-1 and NVIDIA's VIMA:

Google RT-1: https://blog.research.google/2022/12/rt-1-robotics-transformer-for-real.html?m=1

NVIDIA VIMA: https://vimalabs.github.io

3. Hardware quality

As mentioned earlier, following the human form closely is a very wise decision, because it leaves no gap when imitating humans.

In the long run, Optimus' five-fingered hands will perform better in everyday tasks than Boston Dynamics' cruder hands.

FSD is the starter, Optimus Prime is the future

Another netizen marveled at the upgrade of Tesla's humanoid robot: "This will change the world forever."

In the long post that follows, he analyzes Optimus Prime's technological upgrades and his vision of the future.

On August 19, 2021, Tesla first showed the world the humanoid robot it planned to launch, the "Optimus Bot".

The only thing that appeared and danced on stage was a human in a robot costume.

Musk then gave a 10-minute presentation outlining plans to expand the product lineup to humanoid robots.

Fast forward to now, and Tesla has built multiple usable robot prototypes.

They can walk autonomously, pick up and place objects, navigate their surroundings, and perform tasks such as sorting.

In the latest video, Optimus Prime is able to sort blocks.

At first glance this may not seem impressive, especially compared with Boston Dynamics' robot Atlas doing backflips and parkour.

But "how to learn to sort" is a breakthrough I want to focus on, with exciting implications not only for Tesla, but for the global labor market.

"Video input, control output."

This is a topic that Musk has been talking about for a long time. The idea is to build a neural network system that doesn't require humans to write code telling the machine what to do.

Moreover, this is the same principle behind Tesla's self-driving system, FSD.

Some time ago, when Musk live-streamed a test drive of FSD v12, he proudly explained that the neural network behind it was trained entirely on video data and performs driving tasks without a single line of hand-written code.

Tesla headquarters has an "AI brain" that analyzes the vast amount of video data collected by its cars and then tells each car how to drive in every scenario it encounters on the road.

Instead of humans writing code line by line to interpret stop signs, traffic lights and so on, Tesla FSD learned how to do this by observing driving.

This is indeed a big deal.

That means Tesla is now limited only by how much video data it can collect from its electric vehicles on the road, and by how many chips (Nvidia H100s and its in-house Dojo chips) it has to process that data.

Fortunately, it is no longer limited by breakthroughs in "code": given enough examples, its AI brain can solve the problem.

What's more, this approach to solving real-world driving problems can be applied to any physical task.

It only needs video as input, and the AI outputs control signals. That is why the "Optimus Prime" robot is the real future.

Even though Optimus Prime and a Tesla car seem like two completely different things, they have far more in common than it appears.

Both use software to navigate the physical objects in their environment, the same on-board computer to run that software, the same batteries to power the motors that move them, and an AI brain that teaches itself how to perform tasks by analyzing vast amounts of video data.

Based on the information Tesla has published so far, it is safe to assume that the robot can do this not because of human-written code saying "pick up the blue block and put it in the blue area"...

but because it analyzed video footage of blocks being sorted by color, which is no different from how the cars learn to drive themselves.

One seemingly minor detail underscores this point and shows how powerful the approach can be.

In a later clip, "Optimus Prime" rights a block that has tipped onto its side. This could mean that the AI brain was trained on footage showing blocks being sorted upright rather than on their sides.

Without any human-written code, the robot recognizes that the block it is sorting has fallen on its side, picks it up, reorients it, and sets it back down the right way up.

This means the robot can adapt dynamically to the complexities of the real world without explicit instructions on how to handle them.

All that remains is for Tesla to build a robot that can reliably execute these commands physically, which means actuators, batteries, hands, joints and so on that are extremely durable and capable of repetitive work.

Then the world will be changed forever.

With enough strength and dexterity, Tesla's robots could handle almost any physical task just by watching video footage of people performing it.

Pick up a vacuum cleaner and run it around the house, sort and fold laundry, clean up the house, move materials from point A to point B, pick up garbage and put it in the bin, push a lawn mower, monitor an area for safety-related problems, lay bricks, hammer nails, use power tools, wash dishes...

Like the cars, the robots are not limited by breakthroughs in code when it comes to handling these tasks.

They are limited by the amount of video data and the number of chips available to Tesla's AI brain, which tells the robot what to do.

Now, with Optimus Prime, Tesla is moving into a product category that the vast majority of the world believed would take decades or even millennia to achieve. In fact, the company is knocking on the door of a paradigm shift that could upend what it means to work.

The recent biography of Musk includes an excerpt of a discussion between Musk and his engineers.

"The goal of the robot should be to run for 16 hours without charging." This is equivalent to 2 8-hour shifts of human labor, completely uninterrupted.

This would drastically reduce labor costs, so that products and services could cost a fraction of what they do today. And in 5 years it would give businesses no reason to hire a person to do the same job at 7 times the cost.

The reality is that this future is much closer than many people think.

Tesla seems to have solved the most difficult problem in replacing human labor: the AI brain automatically generates actions from video of the real world.

With its manufacturing expertise, Tesla should be able to produce millions of these robots per year for decades to come, which should lead to tremendous abundance.

Resources:

https://twitter.com/Tesla_Optimus/status/1705728820693668189

https://twitter.com/DrJimFan/status/1705982525825503282

https://twitter.com/farzyness/status/1706006003135779299
