From text models to world models! Meta's new research allows AI agents to understand the physical world

Author: New Zhiyuan

Editor: Mindy

Meta's newly released Open-Vocabulary Embodied Question Answering (OpenEQA) benchmark aims to measure an AI agent's ability to understand physical space, but today's AI agents still fall well short of human performance.

LLMs can already understand text and images, and can answer questions based on their historical knowledge, but they may be ignorant of what is currently happening in the world around them.

Now LLMs are also starting to learn to understand 3D physical space. By enhancing an LLM's ability to "see" the world, people can build new applications and get help from LLMs in far more scenarios.

AI agents, such as robots or smart glasses, can sense and understand the environment to answer open-ended questions such as "Where did I put my key?"

Such an AI agent needs to use perceptual modalities such as vision to understand its surroundings, and to communicate with people effectively in clear, everyday language.

This amounts to building a "world model": the AI agent forms its own internal understanding of the external world and can answer human queries about it posed in natural language.

This is a long-term vision and a challenging area, and an important step towards artificial general intelligence.

Meta's new work on OpenEQA, the Open-Vocabulary Embodied Question Answering framework, opens up new possibilities for exploring this field.

What is EQA?

EQA (Embodied Question Answering) is a way to check whether an AI agent truly understands what is happening in the world around it.

After all, when we want to determine how well a person understands a concept, we ask them questions and form an assessment based on their answers. We can do the same with embodied AI agents.

For example, consider the following types of questions:

[Object Recognition]

Q: What is the red object on the chair?

A: A backpack

[Attribute Identification]

Q: Out of all the chairs, what is the unique color of this chair?

A: Green

[Spatial Understanding]

Q: Can this room accommodate 10 people?

A: Yes

[Object State Recognition]

Q: Is the plastic water bottle open?

A: No, it is not

[Functional Reasoning]

Q: What can I write on with a pencil?

A: Paper

[World Knowledge]

Q: Have any students been here lately?

A: Yes

[Object Positioning]

Q: Where is my unfinished Starbucks drink?

A: On the table next to the whiteboard at the front

Beyond this, EQA also has very direct practical uses.

For example, when you are about to head out but can't find your work badge, you can ask your smart glasses where it is. The AI agent will then use its episodic memory to reply that the badge is on the dining table.

Or, if you're hungry on your way home, you can ask your home robot whether there is any fruit left. Based on its active exploration of the environment, it might reply that there are ripe bananas in the fruit basket.

These behaviors may seem simple; after all, LLMs already excel at tasks many people find challenging, such as passing the SAT or the bar exam.

But the reality is that even today's most advanced models struggle to match human performance levels when it comes to EQA.

That's why Meta has also released the OpenEQA benchmark, which allows researchers to test their own models and see how they compare to human performance.

OpenEQA: a new benchmark for AI agents

The Open-Vocabulary Embodied Question Answering (OpenEQA) framework is a new benchmark that measures an AI agent's understanding of its environment by asking it open-vocabulary questions.

The benchmark contains more than 1,600 non-templated question-answer pairs written by human annotators to reflect real-world usage, along with pointers to videos and scans of more than 180 physical environments.

OpenEQA consists of two tasks:

(1) Episodic memory EQA, in which an embodied AI agent answers questions based on its recollection of past experience.

(2) Active EQA, in which the AI agent must take action in the environment to gather the necessary information and answer questions.
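
To make these two settings concrete, here is a minimal sketch in Python of how a single benchmark example might be represented. The EQAExample class, its field names, and the frame paths are illustrative assumptions, not OpenEQA's actual schema; the questions and answers are taken from the examples above.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EQAExample:
    question: str                  # open-vocabulary question written by a human annotator
    answer: str                    # reference answer written by a human annotator
    episode_history: List[str] = field(default_factory=list)  # e.g. paths to RGB frames of the environment
    task: str = "episodic_memory"  # "episodic_memory" or "active"

# Episodic-memory EQA: the agent answers from its recorded experience.
episodic_example = EQAExample(
    question="Where is my unfinished Starbucks drink?",
    answer="On the table next to the whiteboard at the front",
    episode_history=["frame_000.png", "frame_001.png"],  # placeholder frame names
)

# Active EQA: the agent must first explore the environment to gather
# the information before answering; only the task label differs here.
active_example = EQAExample(
    question="Is there any fruit left?",
    answer="Yes, there are ripe bananas in the fruit basket",
    task="active",
)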

OpenEQA is also equipped with LLM-Match, an automated evaluation metric used to score open-vocabulary answers.

LLM-Match scoring works as follows: given the question and the scene, the AI model produces an answer; that answer is then compared against the human-provided answer to produce a score.
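
Based on that description, the following is a minimal sketch of LLM-Match-style scoring in Python. The prompt wording, the 1-to-5 grading scale, and the call_llm helper supplied by the caller are assumptions for illustration, not Meta's exact implementation.

def llm_match_score(question: str, human_answer: str, model_answer: str, call_llm) -> float:
    """Return a correctness score in [0, 1] for a single question."""
    prompt = (
        "You are grading an answer to a question about a physical environment.\n"
        f"Question: {question}\n"
        f"Reference answer: {human_answer}\n"
        f"Candidate answer: {model_answer}\n"
        "On a scale of 1 (completely wrong) to 5 (fully correct), reply with a single integer."
    )
    raw = call_llm(prompt)    # call_llm is any text-completion function provided by the caller
    grade = int(raw.strip())  # assumes the judge replies with just the digit
    return (grade - 1) / 4.0  # map 1..5 onto 0..1

def benchmark_score(examples, predict, call_llm) -> float:
    """Average LLM-Match score, as a percentage, over (question, human_answer) pairs."""
    scores = [llm_match_score(q, a, predict(q), call_llm) for q, a in examples]
    return 100.0 * sum(scores) / len(scores)

An aggregate like benchmark_score is what makes figures such as the 48.5% and 85.9% quoted below directly comparable across models and humans.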

How VLMs perform at this stage

Generally speaking, the visual capabilities of AI agents are built on vision-language models (VLMs).

The researchers used OpenEQA to evaluate several state-of-the-art VLMs and found that even the best-performing model (GPT-4V, at 48.5%) falls well short of human performance (85.9%).

It's worth noting that even the best VLMs are almost "blind" on questions that require spatial understanding, i.e., they perform hardly better than text-only models.

For example, for "I'm sitting on the couch in the living room watching TV. Which room is behind me?", the models essentially guess among different rooms at random, without drawing any understanding of the space from visual episodic memory.

This suggests that VLMs are actually falling back on textual prior knowledge about the world to answer visual questions; the visual information brings them no substantial benefit.
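
One simple way to quantify this "blindness" is to score the same questions twice, once with the visual episode and once without it, and compare the results. The sketch below assumes a hypothetical evaluate() helper that runs a model over a set of examples and returns its average LLM-Match score; it is not part of any released API.

def spatial_blindness_gap(evaluate, model, examples) -> float:
    """Gap, in percentage points, between scores with and without visual input."""
    with_vision = evaluate(model, examples, use_frames=True)   # full episodic memory available
    text_only = evaluate(model, examples, use_frames=False)    # "blind" baseline: question text only
    return with_vision - text_only  # a gap near zero means the visual input is not helping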

It also shows that, at this stage, AI agents are not yet able to fully understand the physical world.

But it is too early to be discouraged: OpenEQA is only the first open-vocabulary EQA benchmark.

By combining challenging open-vocabulary questions with the ability to answer in natural language, OpenEQA can spur more research, help AI understand and communicate information about the world it sees, and help researchers track future advances in multimodal learning and context understanding.

Who knows: perhaps one day AI agents will suddenly bring us a big surprise.

Resources:

https://ai.meta.com/blog/openeqa-embodied-question-answering-robotics-ar-glasses/
