From text models to world models! Meta's new research allows AI agents to understand the physical world

Author: New Zhiyuan

Editor: Mindy

Meta's newly released Open-Vocabulary Embodied Question Answering (OpenEQA) benchmark aims to measure an AI agent's ability to understand physical space, but today's AI agents still fall well short of human performance.

LLMs can already understand text and images, and can answer questions based on their historical knowledge, but they may be ignorant of what is currently happening in the world around them.

Now LLMs are also starting to learn to understand 3D physical space. By enhancing an LLM's ability to "see" the world, people can build new applications and get help from LLMs in far more scenarios.

AI agents, such as robots or smart glasses, can sense and understand the environment to answer open-ended questions such as "Where did I put my key?"

Such an AI agent needs to use perceptual modalities such as vision to understand its surroundings, and to communicate with people effectively in clear, everyday language.

This amounts to building a "world model": the AI agent forms its own internal understanding of the external world and can answer human queries about it posed in natural language.

This is a long-term vision and a challenging area, and an important step towards artificial general intelligence.

Meta's new work on OpenEQA, the Open-Vocabulary Embodied Question Answering framework, opens up new possibilities for exploring this field.

What is EQA?

EQA (Embodied Question Answering) is a way to check whether an AI agent truly understands what is happening in the world around it.

After all, when we want to determine how well a person understands a concept, we ask them questions and form an assessment based on their answers. We can do the same with embodied AI agents.

For example, consider the following types of questions:

[Object Recognition]

Q: What is the red object on the chair?

A: A backpack

[Attribute Identification]

Q: Out of all the chairs, what is the unique color of this chair?

A: Green

[Spatial Understanding]

Q: Can this room accommodate 10 people?

A: Yes

[Object State Recognition]

Q: Is the plastic water bottle open?

A: No, it is not

[Functional Reasoning]

Q: What can I write on with a pencil?

A: Paper

[World Knowledge]

Q: Have any students been here lately?

A: Yes

[Object Positioning]

Q: Where is my unfinished Starbucks drink?

A: On the table next to the whiteboard at the front

Beyond this, EQA also has very direct practical uses.

For example, when you are about to head out but can't find your work badge, you can ask your smart glasses where it is. The AI agent will then use its episodic memory to reply that the badge is on the dining table.

Or, if you're hungry on your way home, you can ask your home robot whether there is any fruit left. Based on its active exploration of the environment, it might reply that there are ripe bananas in the fruit basket.

These behaviors may seem simple; after all, LLMs already excel at tasks many people find challenging, such as passing the SAT or the bar exam.

But the reality is that even today's most advanced models struggle to match human performance levels when it comes to EQA.

That's why Meta has also released the OpenEQA benchmark, which allows researchers to test their own models and see how they compare to human performance.

OpenEQA: a new benchmark for AI agents

The Open-Vocabulary Embodied Question Answering (OpenEQA) framework is a new benchmark that measures an AI agent's understanding of its environment by asking it open-vocabulary questions.

The benchmark contains more than 1,600 non-templated question-answer pairs written by human annotators to reflect real-world usage, along with pointers to videos and scans of more than 180 physical environments.

OpenEQA consists of two tasks:

(1) Episodic memory EQA, in which an embodied AI agent answers questions based on its recollection of past experience.

(2) Active EQA, in which the AI agent must take action in the environment to gather the necessary information and answer questions.
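
To make these two settings concrete, here is a minimal sketch in Python of how a single benchmark example might be represented. The EQAExample class, its field names, and the frame paths are illustrative assumptions, not OpenEQA's actual schema; the questions and answers are taken from the examples above.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EQAExample:
    question: str                  # open-vocabulary question written by a human annotator
    answer: str                    # reference answer written by a human annotator
    episode_history: List[str] = field(default_factory=list)  # e.g. paths to RGB frames of the environment
    task: str = "episodic_memory"  # "episodic_memory" or "active"

# Episodic-memory EQA: the agent answers from its recorded experience.
episodic_example = EQAExample(
    question="Where is my unfinished Starbucks drink?",
    answer="On the table next to the whiteboard at the front",
    episode_history=["frame_000.png", "frame_001.png"],  # placeholder frame names
)

# Active EQA: the agent must first explore the environment to gather
# the information before answering; only the task label differs here.
active_example = EQAExample(
    question="Is there any fruit left?",
    answer="Yes, there are ripe bananas in the fruit basket",
    task="active",
)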

OpenEQA is also equipped with LLM-Match, an automated evaluation metric used to score open-vocabulary answers.

LLM-Match scoring works as follows: given the question and the scene, the AI model produces an answer; that answer is then compared against the human-provided answer to produce a score.
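
Based on that description, the following is a minimal sketch of LLM-Match-style scoring in Python. The prompt wording, the 1-to-5 grading scale, and the call_llm helper supplied by the caller are assumptions for illustration, not Meta's exact implementation.

def llm_match_score(question: str, human_answer: str, model_answer: str, call_llm) -> float:
    """Return a correctness score in [0, 1] for a single question."""
    prompt = (
        "You are grading an answer to a question about a physical environment.\n"
        f"Question: {question}\n"
        f"Reference answer: {human_answer}\n"
        f"Candidate answer: {model_answer}\n"
        "On a scale of 1 (completely wrong) to 5 (fully correct), reply with a single integer."
    )
    raw = call_llm(prompt)    # call_llm is any text-completion function provided by the caller
    grade = int(raw.strip())  # assumes the judge replies with just the digit
    return (grade - 1) / 4.0  # map 1..5 onto 0..1

def benchmark_score(examples, predict, call_llm) -> float:
    """Average LLM-Match score, as a percentage, over (question, human_answer) pairs."""
    scores = [llm_match_score(q, a, predict(q), call_llm) for q, a in examples]
    return 100.0 * sum(scores) / len(scores)

An aggregate like benchmark_score is what makes figures such as the 48.5% and 85.9% quoted below directly comparable across models and humans.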

How VLMs perform at this stage

Generally speaking, the visual capabilities of AI agents are built on vision-language models (VLMs).

The researchers used OpenEQA to evaluate several state-of-the-art VLMs and found that even the best-performing model (GPT-4V, at 48.5%) falls well short of human performance (85.9%).

It's worth noting that even the best VLMs are almost "blind" on questions that require spatial understanding, i.e., they perform hardly better than text-only models.

For example, for "I'm sitting on the couch in the living room watching TV. Which room is behind me?", the models essentially guess among different rooms at random, without drawing any understanding of the space from visual episodic memory.

This suggests that VLMs are actually falling back on textual prior knowledge about the world to answer visual questions; the visual information brings them no substantial benefit.
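
One simple way to quantify this "blindness" is to score the same questions twice, once with the visual episode and once without it, and compare the results. The sketch below assumes a hypothetical evaluate() helper that runs a model over a set of examples and returns its average LLM-Match score; it is not part of any released API.

def spatial_blindness_gap(evaluate, model, examples) -> float:
    """Gap, in percentage points, between scores with and without visual input."""
    with_vision = evaluate(model, examples, use_frames=True)   # full episodic memory available
    text_only = evaluate(model, examples, use_frames=False)    # "blind" baseline: question text only
    return with_vision - text_only  # a gap near zero means the visual input is not helping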

It also shows that, at this stage, AI agents are not yet able to fully understand the physical world.

But it is too early to be discouraged: OpenEQA is only the first open-vocabulary EQA benchmark.

By combining challenging open-vocabulary questions with the ability to answer in natural language, OpenEQA can spur more research, help AI understand and communicate information about the world it sees, and help researchers track future advances in multimodal learning and context understanding.

Who knows: perhaps one day AI agents will suddenly bring us a big surprise.

Resources:

https://ai.meta.com/blog/openeqa-embodied-question-answering-robotics-ar-glasses/
