Robots plus large models: the buffs stack! AI academic star Li Feifei releases embodied intelligence results

Author: Head Technology

Written by Congerry

Is the day when AI rules the world near?

A research team led by AI scientist Li Feifei at Stanford University has announced a new result in embodied intelligence: using large language models (LLMs) and vision-language models (VLMs) to drive robots.

The robot can plan and execute manipulation tasks from complex instructions given by humans in natural language. Put plainly, you can command the robot in everyday speech.

Open the top drawer and watch out for that vase!

What's more, with the support of large models, the robot can not only interact effectively with its environment but also complete a variety of tasks without additional data or training: bypassing obstacles, opening bottles, pressing switches, unplugging charging cables, and so on.

The system, which Li Feifei's team named VoxPoser, needs no additional pre-training process of the kind traditional methods require, directly sidestepping the scarcity of robot training data.

Paper: https://voxposer.github.io/voxposer.pdf

How does VoxPoser work?

How does VoxPoser manage to understand natural language instructions without the need for predefined motion primitives or additional data and training?

First, the robot uses a camera to collect environmental information.

Second, given the language instruction, a large language model (LLM) generates code that interacts with a vision-language model (VLM).
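
As a rough illustration of the idea (not the paper's actual code), the snippet below mimics what such LLM-written code might look like: `detect` stands in for an open-vocabulary VLM query, and the composed result is a voxel value map that attracts the gripper toward the drawer handle while repelling it from the vase. All names here are hypothetical.

```python
import numpy as np

# Toy stand-in for the perception call: in a real system this would query
# an open-vocabulary detector (VLM) rather than a hard-coded dictionary.
SCENE = {"drawer handle": (12, 5, 8), "vase": (20, 5, 8)}

def detect(name):
    """Return the voxel coordinate of a named object (stubbed perception)."""
    return SCENE[name]

def make_value_map(shape, target, obstacle=None, sigma=3.0):
    """Voxel map: a Gaussian bump attracts the gripper toward `target`;
    a stronger negative bump repels it from `obstacle`."""
    grid = np.indices(shape).astype(float)  # shape (3, X, Y, Z)

    def bump(center):
        d2 = sum((grid[i] - center[i]) ** 2 for i in range(3))
        return np.exp(-d2 / (2 * sigma ** 2))

    value = bump(detect(target))
    if obstacle is not None:
        value -= 2.0 * bump(detect(obstacle))  # strong repulsion
    return value

# "Open the top drawer and watch out for that vase!"
vmap = make_value_map((32, 16, 16), "drawer handle", obstacle="vase")
best = tuple(int(i) for i in np.unravel_index(np.argmax(vmap), vmap.shape))
print(best)  # (12, 5, 8) — the peak sits at the handle, away from the vase
```

The point of the sketch is the division of labor: the LLM writes glue code like this, while the VLM grounds object names in 3D space.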

A 3D value map of the scene is then generated.

Finally, the robot plans a trajectory over the map and executes the action.
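
A minimal sketch of what planning over such a map could look like, assuming a simple greedy hill-climb (the real system uses a more capable motion planner; `greedy_plan` and the toy map below are illustrative only):

```python
import numpy as np

def greedy_plan(value_map, start, max_steps=200):
    """Greedy hill-climb over a 3D value map: from `start`, repeatedly move
    to the best-valued neighboring voxel until no neighbor improves."""
    pos = tuple(start)
    path = [pos]
    for _ in range(max_steps):
        x, y, z = pos
        neighbors = [
            (x + dx, y + dy, z + dz)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)
            if (dx, dy, dz) != (0, 0, 0)
        ]
        neighbors = [  # discard voxels outside the grid
            n for n in neighbors
            if all(0 <= n[i] < value_map.shape[i] for i in range(3))
        ]
        step = max(neighbors, key=lambda n: value_map[n])
        if value_map[step] <= value_map[pos]:
            break  # local maximum reached: treat this voxel as the goal
        pos = step
        path.append(pos)
    return path

# Tiny demo: value increases toward voxel (5, 5, 5).
grid = np.indices((8, 8, 8)).astype(float)
vmap = -((grid[0] - 5) ** 2 + (grid[1] - 5) ** 2 + (grid[2] - 5) ** 2)
route = greedy_plan(vmap, (0, 0, 0))
print(route[-1])  # (5, 5, 5)
```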

The planned trajectory is then mapped back to the real world, and the robot completes the drawer-opening operation.

This process makes the robot more human-like: it no longer relies on a database entered in advance, achieving zero-shot capability. (Receive the instruction → perceive with the "eyes" → act.)

In addition to opening drawers, the robot can "sort the garbage into a blue tray", "remove the bread from the toaster", "take out a napkin", "open the vitamin bottle", "measure the weight of the apple", "close the top drawer", "sweep the trash into the dustpan", "unplug the phone charger", "hang the towel on the shelf", "press the moisturizer pump", "put down the spoon", "turn on the light", etc.

And even when disturbed partway through, the robot can still complete the task.

In addition, VoxPoser exhibits four emergent behavioral capabilities.

  • Evaluating physical properties: given two blocks of unknown mass, the robot must run a physical experiment with the available tools to determine which block is heavier.
  • Behavioral commonsense reasoning: when setting the table, the user can state a preference such as "I am left-handed," and the robot must work out what that implies for the task.
  • Fine-grained language correction: for tasks demanding high precision, such as "put a lid on a teapot," the user can give corrective instructions like "you are off by 1 centimeter."
  • Multi-step visual programs: given the task "open the drawer exactly halfway," and lacking an object model, the robot can propose a multi-step strategy from visual feedback: first open the drawer fully while recording the handle's displacement, then push it back to the midpoint.
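
That last strategy can be sketched in a few lines, assuming a hypothetical robot API (`handle_pos_sensor`, `pull`, and `push` are stand-ins for perception and control calls, simulated here with a one-dimensional drawer):

```python
def open_drawer_halfway(handle_pos_sensor, pull, push):
    """Multi-step strategy sketch: fully open the drawer while recording
    the handle's displacement, then push it back to the midpoint."""
    start = handle_pos_sensor()
    pull()                       # open the drawer fully
    fully_open = handle_pos_sensor()
    travel = fully_open - start  # measured via vision, not an object model
    push(travel / 2)             # push back half the measured travel
    return handle_pos_sensor()

# Simulated 1-D drawer for the demo: closed at 0.0 m, fully open at 0.30 m.
state = {"pos": 0.0}
final = open_drawer_halfway(
    handle_pos_sensor=lambda: state["pos"],
    pull=lambda: state.update(pos=0.30),
    push=lambda d: state.update(pos=state["pos"] - d),
)
print(final)  # 0.15
```

The key design point is that the midpoint is never looked up anywhere: it is derived at run time from two sensor readings, which is what lets the robot act without a model of the drawer.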

At present, VoxPoser has some limitations: it depends on an external perception module, prompts for the built-in large model must be written by hand, and a general dynamics model would be needed to achieve more diverse actions.

Embodied intelligence: Li Feifei points to a development direction for computer vision

Who is Li Feifei?

Li Feifei is one of the world's top Chinese female AI experts: a tenured professor at Stanford University and director of its Artificial Intelligence Lab, and a former vice president of Google and chief scientist of Google Cloud. Her research spans computer vision, machine learning, deep learning, and cognitive neuroscience.

She has also trained many outstanding AI researchers, such as Andrej Karpathy, a founding member of OpenAI who went on to lead Tesla's Autopilot vision team.

VoxPoser came about because Li Feifei knows both how important data is to machine learning and how hard it is to obtain.

Li Feifei began leading the creation of the ImageNet dataset in 2006, the first large-scale labeled image dataset for computer vision. With over ten million labeled images capable of training complex machine learning models, it is considered a milestone in the history of artificial intelligence.

But the data was time-consuming to collect and label: nearly 50,000 crowdworkers from 167 countries spent close to three years completing it.

In 2022, Li Feifei and Ranjay Krishna published a paper in the journal Daedalus titled "Searching for Computer Vision North Stars."

In the paper, Li Feifei argues that after the success of ImageNet and object recognition, computer vision still has many exciting research directions and challenges, such as embodied intelligence, visual reasoning, and scene understanding.

Li Feifei believes that embodied intelligence is an important and challenging direction of artificial intelligence, which requires robots or other agents to be able to interact with the physical world in a complex and changeable environment, combining vision, language, reasoning and other capabilities.

Moreover, embodied intelligence is not limited to humanoid robots: any intelligent machine that moves through and acts on the physical world counts as embodied intelligence.

In addition to Li Feifei, NVIDIA founder Jensen Huang and Tesla CEO Elon Musk are also very optimistic about the prospects of embodied intelligence.

Now that Li Feifei's team has taken the first step, does this mean the day AI rules the world is one step closer?

If you have anything to say, welcome to leave a comment below to discuss! Likes, comments, and follows are much appreciated~
