GPT-4-powered Alter3 robot plays guitar; Figure 01 learns to make coffee by watching videos

Author: New Zhiyuan

Editor: Editorial Department

The ChatGPT moment for robots has truly arrived! The startup Figure's robot learned to make coffee after watching 10 hours of video. Meanwhile, the University of Tokyo's Alter3 robot, powered by GPT-4, can imitate any human movement. Humans only need to issue natural-language instructions, with no programming required at all!

This week, robots really did usher in their ChatGPT moment!

The startup Figure has built a robot that learns to make coffee by watching humans.

Just yesterday, Figure's founder previewed his lab's major breakthrough on social media.

The University of Tokyo has connected GPT-4 to the humanoid robot Alter3.

GPT-4 converts natural-language commands into executable code, letting the robot mimic all kinds of human actions: playing guitar, taking selfies, pretending to be a ghost, even sneaking someone else's popcorn at the cinema.

In a laid-back mood, enjoying a cup of tea.

Rock 'n' roll with a guitar.

Pretending to be a snake.

Taking a selfie with a playful, exaggerated face, like an influencer.

Eating popcorn while watching a movie in the theater, suddenly realizing it is someone else's popcorn, and laughing awkwardly.

Watch humans make coffee for 10 hours and learn this skill

The robot, named Figure 01, uses an end-to-end artificial intelligence system.

It only needed to watch 10 hours of video of humans making coffee to learn the skill.

The robot uses a neural network to process and analyze the video data. By watching, it learns the human movements and gestures, then mimics them to master the coffee-making process.

This shows that the robot can learn skills autonomously, without being explicitly programmed!

Just say to it: Figure 01, can you make me a cup of coffee?

It puts the coffee capsule into the machine, presses the button with its hand, and in no time a cup of fragrant coffee is ready!

One of the most valuable parts of this process is that the robot learns to correct its own mistakes: if the coffee capsule is not seated correctly, it fixes it by itself.

Brett Adcock explains why video data training is so important.

This is groundbreaking because if you have access to human data for an application (e.g., making coffee, folding laundry, working in a warehouse, etc.), you can train AI systems end-to-end on top of Figure 01.

This approach can scale to every application: as the number of robots grows, more data is collected from the robot fleet, and the system is retrained for better performance.
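Figure has not published details of its training pipeline, so the following is only a minimal behavior-cloning sketch of what "training end-to-end on human video" can look like in principle: a small network maps a short stack of video frames to an action vector and is fit by supervised imitation. All names and shapes here (VideoToAction, the frame stack, the 7-dimensional action) are illustrative assumptions, not Figure's actual system.

```python
# Minimal behavior-cloning sketch (hypothetical; not Figure's actual system).
# A small network maps a short stack of video frames to a robot action vector
# and is trained by supervised imitation against recorded human demonstrations.
import torch
import torch.nn as nn

class VideoToAction(nn.Module):
    def __init__(self, n_frames=8, action_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(              # tiny frame-stack encoder
            nn.Conv2d(3 * n_frames, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)      # predict one action per clip

    def forward(self, frames):                     # frames: (B, 3*n_frames, H, W)
        return self.head(self.encoder(frames))

model = VideoToAction()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Dummy batch standing in for (video clip, demonstrated action) pairs.
frames = torch.randn(4, 3 * 8, 96, 96)
actions = torch.randn(4, 7)

opt.zero_grad()
loss = loss_fn(model(frames), actions)
loss.backward()
opt.step()
```

The point of the "fleet" argument is exactly this loop: more demonstration pairs from more robots simply means more (frames, actions) batches to retrain on.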

It is worth mentioning that many netizens were surprised by how fast the robot brewed the coffee. The official response: the video was not sped up.

The University of Tokyo's ghost-pretending robot

This University of Tokyo humanoid robot had already gone viral for its lifelike "ghost" impression.

Researchers at the University of Tokyo have linked the humanoid robot, called Alter3, to GPT-4.

Given a command, it can perform a range of human actions, such as playing the guitar, taking selfies, and pretending to be a ghost, and can even act out sneaking someone else's popcorn at the movie theater.

In this process, the LLM converts written instructions into executable code that lets the robot mimic a variety of human movements. Judging from the videos, it holds up well next to the much-hyped, still somewhat stumbling Stanford housework robot.

In other words, the reason Alter3 can use its upper body to act out the "ghost" so convincingly is that GPT-4's prompt output is so good!

"0 creates a wide-eyed facial expression of fear, opens your mouth and screams silently",

"1 leans back quickly, as if startled by a sudden apparition",

"2 Raise your hands and wave them at your face, imitating ghostly movements",

"3 big mouths shaking their heads, showing a dramatic reaction of fear",

"4 Move your upper body from one side to the other as if haunted by the presence of a ghost",

"5 hands in front of the chest, showing extreme anxiety",

"6 The eyes glanced from one side to the other, as if they were witnessing a strange activity",

"7 leaning forward, then leaning backward, imitating the floating motion of a ghost",

"8 Slowly return to the resting position while maintaining a frightened expression."

Through code, human movements are mapped onto the robot

How does Alter3 use LLMs to generate spontaneous motion?

Concretely, the researchers integrated GPT-4 into Alter3, effectively coupling the model with Alter3's body movements.

In general, low-level robot control is hardware-dependent and falls outside the LLM's training corpus, which makes direct LLM-based robot control challenging.

With Alter3, however, the researchers achieved a breakthrough: through program code, they mapped linguistic descriptions of human actions onto the robot's body, making direct control feasible.

Instead of explicitly programming each body part, this approach lets Alter3 take a variety of poses, such as taking a selfie or pretending to be a ghost, and generate sequences of actions over time.

This fully demonstrates the robot's zero-shot learning ability.
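The article does not describe Alter3's control API, but the mapping can be pictured roughly as follows: each natural-language movement description becomes a short piece of Python that sets some of the robot's 43 axes to target values over time. The set_axis function and the axis numbers below are purely illustrative stand-ins, not the real interface.

```python
# Illustrative only: a stand-in for the kind of Python GPT-4 is asked to emit.
# set_axis(axis, value) and the axis indices are hypothetical; the real Alter3
# control API is not described in the article.
import time

def set_axis(axis: int, value: int) -> None:
    """Pretend to drive one of Alter3's 43 actuation axes (0-255)."""
    print(f"axis {axis} -> {value}")

def pretend_to_be_a_ghost() -> None:
    # "lean back quickly, as if startled by a sudden apparition"
    set_axis(16, 255)   # e.g. torso pitch
    set_axis(20, 40)    # e.g. head tilt
    time.sleep(0.5)
    # "raise your hands and wave them in front of your face"
    for value in (80, 200, 80, 200):
        set_axis(31, value)   # e.g. right arm lift
        set_axis(35, value)   # e.g. left arm lift
        time.sleep(0.3)

if __name__ == "__main__":
    pretend_to_be_a_ghost()
```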

Better still, verbal feedback can adjust the robot's posture without any fine-tuning.

For example, tell the robot via GPT-4 to "play metal music," and it starts miming an electric guitar, head shaking to the rhythm.

"0 creates a facial expression of intense excitement, opens your eyes wide, opens your mouth slightly, and shows a wild smile",

"1 Leans forward aggressively, as if ready to dive into the music",

"2 Imitate the movement of holding the neck of the guitar with the left hand",

"3 Start playing the air with your right hand as if you were playing a heavy improvisation",

"4 Shake your head up and down rhythmically, mimicking the head-shaking action associated with metal music",

"5 Raise your left hand as if reaching for a higher guitar note, your eyes locked on your imaginary fretboard",

"6 Imitate a dramatic guitar strumming with your right hand, as if hitting a powerful chord",

"7 The right hand slowly sweeps over the imaginary guitar strings, imitating the guitar solo",

"8 Imitating the action of smashing an imaginary guitar on the floor, embodying the wild spirit of metal music",

"9 Gradually return to a resting position, but maintain intense facial expressions and show lingering excitement"

LLMs liberate humans from iterative work

Before LLMs, to make the robot mimic a person's posture or act out a behavior such as serving tea or playing chess, researchers had to control all 43 axes in a specific sequence.

Along the way, many refinements had to be made manually by the researchers.

Thanks to LLMs, researchers are now freed from this iterative work: verbal commands alone are enough to control Alter3.

The researchers applied two chain-of-thought prompts written in natural language in succession, with no iterative training process required (i.e., zero-shot learning).

As shown, the researchers used the following protocol.

It is important to note that GPT-4 is non-deterministic, even when $temperature = 0$.

As a result, different movement patterns can be produced even if the inputs are the same.

The program that controls the Alter3 humanoid robot with verbal commands. Alter3 is controlled in natural language: CoT-based Prompt 1 and Prompt 2 output Python code that drives the robot.
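The protocol in the figure can be sketched as a two-call chain: Prompt 1 asks GPT-4 to break an instruction into numbered movement descriptions, and Prompt 2 asks it to turn those descriptions into Python control code. The sketch below uses the OpenAI Python client; the prompt wording is paraphrased rather than the paper's exact text, the temperatures are those reported later in the article (0.7 for Prompt 1, 0.5 for Prompt 2), and set_axis refers to the illustrative stub shown earlier.

```python
# Sketch of the two-prompt chain-of-thought pipeline described above.
# Prompt wording is paraphrased; only the two-step structure and the
# temperatures (0.7, then 0.5) come from the article.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def gpt4(prompt: str, temperature: float) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def instruction_to_robot_code(instruction: str) -> str:
    # Prompt 1: decompose the instruction into numbered movement descriptions.
    steps = gpt4(
        f"Break the action '{instruction}' into a numbered list of concrete "
        f"upper-body movements and facial expressions for a humanoid robot.",
        temperature=0.7,
    )
    # Prompt 2: turn each description into Python that drives the robot's axes.
    code = gpt4(
        "Convert each numbered movement below into Python calls of the form "
        "set_axis(axis, value) for a 43-axis humanoid robot:\n" + steps,
        temperature=0.5,
    )
    return code  # in the real system this code is then executed on Alter3

print(instruction_to_robot_code("take a selfie"))
```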

Verbal feedback

Alter3 cannot observe the physical effects of the motions it generates, which is highly unnatural by human standards.

As a result, Alter3 is unable to accurately understand details such as "how high the hand is raised" and therefore cannot improve its movements accordingly.

By building and using an external memory through feedback, Alter3's body model can be integrated with GPT-4 without updating the model's parameters.

Alter3 can now rewrite code based on human verbal feedback.

For example, if the user suggests that the arm be raised a little higher when taking a selfie, Alter3 can then store the modified action code in the database as an action memory.

This ensures that the next time the action is generated, the robot uses the improved version.

Through this feedback, the robot accumulates information about its own body, and the memory effectively acts as a body schema.

The diagram above illustrates the language feedback system in Alter3.

During this process, the user provides verbal feedback to guide Alter3 in each motion segment, such as "setting axis 16 to 255" or "moving the arm more forcefully".

In this process, the user only needs to provide verbal instructions without having to rewrite any code, and Alter3 will automatically modify the corresponding code.

Once the action is perfected, it is saved in a JSON database with descriptive tags, such as "hold the guitar" or "thoughtfully tap the chin".

When generating actions with Prompt 2, a JsonToolkit searches the database for these tags, and the LLM decides whether to reuse a stored memory or create a new action.
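A plain JSON file is enough to mimic the memory described here: refined motion code is stored under a descriptive tag and looked up again when a similar action is requested. The file name, helper functions, and the naive substring match below are illustrative; in the actual system a JsonToolkit searches the database and the LLM itself decides whether to reuse or regenerate.

```python
# Minimal sketch of the motion memory: refined action code is stored in a JSON
# database under a descriptive tag and reused on later requests. File name,
# helpers, and the substring match are all illustrative assumptions.
import json
from pathlib import Path

DB_PATH = Path("motion_memory.json")

def load_db() -> dict:
    return json.loads(DB_PATH.read_text()) if DB_PATH.exists() else {}

def save_motion(tag: str, code: str) -> None:
    """Store the human-refined motion code under a descriptive tag."""
    db = load_db()
    db[tag] = code
    DB_PATH.write_text(json.dumps(db, indent=2))

def find_motion(request: str) -> str | None:
    """Return stored code whose tag appears in the request, if any."""
    for tag, code in load_db().items():
        if tag in request.lower():
            return code
    return None

# After feedback like "raise the arm a little higher", store the revised code:
save_motion("hold the guitar", "set_axis(31, 220)\nset_axis(35, 180)")

# Later, a new request can reuse the improved motion instead of regenerating it:
print(find_motion("please hold the guitar and play metal"))
```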

(b) compares scores with and without feedback: motions refined with feedback score higher than those without.

Results

To quantify GPT-4's ability to generate actions, the researchers evaluated 9 videos of different generated actions, grouping them into two categories.

The first category is "instant gestures", which includes everyday actions such as "taking a selfie" and "drinking tea", as well as imitation actions such as "pretending to be a ghost" and "pretending to be a snake".

The second category covers actions that unfold over time and involve more complex scenarios: for example, the embarrassing scene of "eating popcorn at the movie theater and realizing you are actually eating the popcorn of the person next to you", and the more lyrical scene of "jogging in the park, where the world seems to tell an ancient survival story and every footstep echoes with an eternal existence".

These actions were all generated by GPT-4, with the temperature of Prompt 1 set to $0.7$ and the temperature of Prompt 2 set to $0.5$. Participants ($n=107$) were recruited through the Prolific platform.

They watched the videos and rated the expressiveness of GPT-4 (GPT-4-0314) on a 5-point scale, with 1 being the worst.

For the control group, the researchers used randomly generated Alter3 motions and attached GPT-4-generated action descriptions to them as labels.

These labeled control videos were cleverly incorporated into the survey, with 3 of them scattered among the main experimental videos shown to the participants.

To determine whether the control videos scored differently from the others, the team first ran a Friedman test, which showed a clear difference in ratings across the videos. A post-hoc Nemenyi analysis then showed no significant differences among the control videos themselves, while the p-values for comparisons between control videos and the other videos were small, indicating a significant difference (see figure).
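That statistical procedure can be reproduced with standard tools: a Friedman test over each participant's ratings of every video, followed by Nemenyi post-hoc pairwise comparisons. The ratings below are random placeholders rather than the study's data; scikit-posthocs is one common choice for the Nemenyi step.

```python
# Sketch of the rating analysis: Friedman test across videos, then Nemenyi
# post-hoc pairwise comparisons. Ratings here are random placeholders, not the
# study's data. Requires scipy and the scikit-posthocs package.
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
n_participants, n_videos = 107, 12          # 9 generated + 3 control videos
ratings = rng.integers(1, 6, size=(n_participants, n_videos))  # 5-point scale

# Friedman test: do the videos differ in their rating distributions?
stat, p = friedmanchisquare(*[ratings[:, j] for j in range(n_videos)])
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Nemenyi post-hoc test: pairwise p-values between every pair of videos.
pairwise_p = sp.posthoc_nemenyi_friedman(ratings)
print(pairwise_p.round(3))
```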

In other words, the motions generated by GPT-4 scored significantly higher than the control group, suggesting that viewers perceive the GPT-4-generated android motions as clearly distinct from random ones. The system can thus generate a variety of actions, from everyday gestures such as taking selfies and drinking tea to imitations of non-human things such as ghosts or snakes.

The average assessment score for each action

An LLM's training data contains linguistic representations of a wide range of actions, and GPT-4 can map these representations accurately onto Alter3's body.

Most notably, Alter3 has a human-like body, which allows GPT-4 to directly apply its wealth of knowledge about human behavior and movement.

In addition, with Alter3, LLMs can express emotions such as embarrassment and joy.

Even in texts that do not explicitly express emotions, LLMs are able to infer appropriate emotions and reflect them in Alter3's performance. This integration of verbal and nonverbal communication can enhance the potential for more nuanced and compassionate interactions with humans.

LLMs can drive embodied intelligence

Alter3's impressive demonstration answers the question of whether LLMs can drive embodied intelligence.

First of all, Alter3 does not require additional training to perform many movements. This means that the dataset on which the LLM is trained already contains action descriptions.

In other words, Alter3 achieves zero-shot learning.

In addition, it is also able to imitate ghosts and animals (or people who imitate animals), which is quite amazing.

It can even understand what it hears and reflect whether the story is sad or happy through facial expressions and gestures.

At this point, the boost Alter3 gets from the LLM is plain to see.

Resources:

https://tnoinkwms.github.io/ALTER-LLM/?continueFlag=bcae05c73de8a193cf0ec0b4e1046f97

https://twitter.com/Figure_robot/status/1743985067989352827?t=lMaAK1frDFSgyjuaE5KyOw&s=19
