USTC and others made an unexpected discovery: large models can answer visual questions correctly without looking at pictures!

Author: QbitAI

Contributed by Chen Lin, from Aofei Temple

量子位 | WeChat official account: QbitAI

Large models can answer visual questions correctly without even looking at the image?!

The research team from the University of Science and Technology of China, The Chinese University of Hong Kong, and Shanghai AI Lab stumbled upon this bizarre phenomenon.

They first noticed that large models such as GPT-4V, GeminiPro, Qwen1.5-72B, Yi-VL-34B, and LLaVA-Next-34B, whether closed-source or open-source, language-only or multimodal, could achieve good results on the multimodal benchmark MMMU using nothing but the question and option text.

△ Blue represents LVLMs that can see the image, while orange and green represent LLMs and LVLMs, respectively, that receive only the question and option text

(LLMs: Large Language Models; LVLMs: Multimodal Large Models)

If you didn't know better, you might think a hidden skill of large models had just been discovered.

Some netizens posed the pointed question: are we even evaluating multimodal models the right way?

The results also piqued the researchers' curiosity, and they decided to investigate further.

A newly discovered hidden skill of large models?

Looking at the existing evaluation samples and the evaluation process, the researchers identified two main problems behind this phenomenon.

First, some multimodal evaluation samples lack dependence on visual content.

This reflects a design flaw in existing benchmarks, and it arises in two scenarios:

In the first scenario, the answer is already contained in the question and options, so there is no need to look at the image at all.

For example, there are questions like: what is the shape of this circular piece of soil? (The word "circular" in the question already gives away the answer.)

In the second scenario, a large language model can answer the sample directly from the rich world knowledge embedded in its parameters, without relying on the image.

For example, a question like: What is the capital of Nebraska?

Second, the existing evaluation process ignores the possibility of data leakage during the training of large language models and multimodal large models.

An LVLM typically consists of a vision encoder, a language model base, and a vision-language connector. Moreover, a large number of evaluation samples in existing multimodal benchmarks are converted from unimodal text corpora (e.g., from exam questions).

Consequently, if evaluation samples that were not sufficiently transformed for the multimodal benchmark inadvertently leak into a large language model's training data, fair comparison between LVLMs is compromised.
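
To see why the language-model base matters so much here, consider a minimal sketch of this architecture in Python. The class and module names are illustrative placeholders, not any specific model from the paper:

```python
import torch
import torch.nn as nn

class ToyLVLM(nn.Module):
    """Minimal sketch of a typical LVLM: vision encoder -> vision-language connector -> LLM base.
    All names here are illustrative, not taken from the paper."""

    def __init__(self, vision_encoder: nn.Module, connector: nn.Module, llm_base: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # maps pixels to patch features
        self.connector = connector            # projects patch features into the LLM's embedding space
        self.llm_base = llm_base              # text-only LLM; any leaked benchmark text lives here

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        visual_tokens = self.connector(self.vision_encoder(pixel_values))  # [B, N, D]
        # Visual tokens are simply prepended to the text sequence. If a benchmark sample
        # is answerable from the text alone (or was seen during LLM pre-training), the
        # LLM base can ignore these visual tokens and still produce the right answer.
        return self.llm_base(torch.cat([visual_tokens, text_embeds], dim=1))
```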

To quantify how widespread this leakage is in large language models, the researchers evaluated 22 of them on six public benchmarks.

These include 2 closed-source models (GPT4-Turbo and GeminiPro) and 20 open-source models of various sizes and architectures (such as the Qwen series, LLaMA2 series, Baichuan series, Mixtral-8x7B, etc.); a 2-shot inference strategy was used to reduce refusals and standardize response formats.
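
As a rough illustration of that setup, here is a hedged sketch of a text-only, 2-shot evaluation loop. The prompt template and the `ask_llm` callable are hypothetical stand-ins, not the authors' code:

```python
# Hypothetical sketch: the model sees two worked examples plus the target question
# and its options, but never the image.

TWO_SHOT_PREFIX = (
    "Question: <example question 1>\nOptions: A. ... B. ... C. ... D. ...\nAnswer: A\n\n"
    "Question: <example question 2>\nOptions: A. ... B. ... C. ... D. ...\nAnswer: C\n\n"
)

def build_text_only_prompt(question: str, options: list[str]) -> str:
    opts = " ".join(f"{chr(ord('A') + i)}. {o}" for i, o in enumerate(options))
    return f"{TWO_SHOT_PREFIX}Question: {question}\nOptions: {opts}\nAnswer:"

def text_only_accuracy(samples: list[dict], ask_llm) -> float:
    """`samples` holds dicts with 'question', 'options', and 'answer' (a letter);
    `ask_llm` is any callable that maps a prompt string to the model's reply."""
    correct = sum(
        ask_llm(build_text_only_prompt(s["question"], s["options"]))
        .strip().upper().startswith(s["answer"])
        for s in samples
    )
    return 100.0 * correct / len(samples)
```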

The results show that the closed-source GeminiPro and the open-source Qwen1.5-72B reach astonishing scores of 42.7 and 42.4, respectively, on the challenging MMMU benchmark, close to the performance of multimodal models such as GeminiPro-Vision (44.4), LLaVA-Next-34B (47.0), and Yi-VL-34B (43.2) when those models can see the images.

They then quantitatively measured data leakage in multimodal large models during their training as well: each LVLM's image input was masked so that the model was evaluated using only the text of the questions and options (labeled LVLM-text).
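
A minimal sketch of that "LVLM-text" protocol, with a hypothetical `model.answer` interface standing in for whatever each LVLM actually exposes:

```python
# Illustrative only: score the same LVLM twice, once with the real image and once with
# the image withheld ("LVLM-text"), to see how much of its score needs the image at all.

def lvlm_accuracy(model, samples: list[dict], use_image: bool) -> float:
    correct = 0
    for s in samples:
        image = s["image"] if use_image else None  # None = masked image input (LVLM-text)
        pred = model.answer(question=s["question"], options=s["options"], image=image)
        correct += int(pred == s["answer"])
    return 100.0 * correct / len(samples)

# lvlm_accuracy(lvlm, mmmu_samples, use_image=True)   # ordinary multimodal score
# lvlm_accuracy(lvlm, mmmu_samples, use_image=False)  # "LVLM-text" score used in the study
```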

It can be seen that, after multimodal training, Sphinx-X-MoE and Monkey-Chat score an astonishing 17.9 and 12.6 points higher on the MMMU benchmark than their original base language models without even looking at the image, yet they gain only 1.2 and 4.7 more points when they can actually see it.

GPT-4V doesn't earn a passing grade on the new benchmark

To address these problems and enable a fairer, more accurate evaluation, the researchers designed a new multimodal evaluation benchmark, MMStar:

It contains 1,500 high-quality, vision-dependent evaluation samples covering six core capabilities (coarse perception, fine-grained perception, instance reasoning, logical reasoning, science and technology, and mathematics) and 18 finer-grained ability dimensions.
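
For readers who want to inspect the benchmark themselves, here is a minimal sketch of loading it with the Hugging Face `datasets` library, assuming the dataset ID taken from the links at the end of this article:

```python
# Minimal sketch; the dataset ID below is the one linked at the end of the article.
from datasets import load_dataset

mmstar = load_dataset("Lin-Chen/MMStar")   # 1,500 vision-dependent multiple-choice samples
print(mmstar)                              # inspect the available splits

first_split = next(iter(mmstar.values()))  # take whichever split is present
print(first_split[0].keys())               # inspect the field names of one sample
```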

Alongside the MMStar benchmark, the authors also propose two evaluation metrics, multimodal gain (MG) and multimodal leakage (ML), to reflect the genuine performance gain LVLMs obtain from multimodal training and the degree of data leakage that occurs during it.
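
In spirit, MG measures how much an LVLM's score improves when it can actually see the images, while ML measures how much its text-only score exceeds that of its underlying LLM base. A minimal sketch under that reading (see the paper for the exact definitions):

```python
# Hedged sketch of the two metrics as described above; the exact formulas are in the paper.
# s_lvlm_img: LVLM score with images; s_lvlm_txt: same LVLM, text only;
# s_llm_txt:  the LVLM's LLM base, text only.

def multimodal_gain(s_lvlm_img: float, s_lvlm_txt: float) -> float:
    # Performance the model genuinely owes to seeing the image at evaluation time.
    return s_lvlm_img - s_lvlm_txt

def multimodal_leakage(s_lvlm_txt: float, s_llm_txt: float) -> float:
    # Extra text-only accuracy picked up during multimodal training, a sign that
    # evaluation samples (or close paraphrases) leaked into the training data.
    return max(0.0, s_lvlm_txt - s_llm_txt)
```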

Subsequently, in order to test the quality of the proposed MMStar, they carried out three evaluations.

1) The 22 large language models were evaluated using only the questions and options in MMStar. Their performance was close to random guessing, indicating that MMStar samples have barely leaked into the training corpora of existing large models.

2) The performance of 16 multimodal large models was evaluated on MMStar.

GPT-4V achieves the highest average score of 57.1 under the high-resolution setting (but still falls short of a passing grade).

Among open-source models, InternLM-XComposer2 achieves an average score of 55.4, and LLaVA-Next even performs slightly better than GPT-4V and GeminiPro-Vision in the mathematics dimension.

Notably, no multimodal large model manages a passing score in fine-grained perception (FP), logical reasoning (LR), science and technology (ST), or mathematics (MA).

3) The MG and ML metrics of 16 LVLMs were evaluated extensively on six public benchmarks and the proposed MMStar.

As the results show, MMStar exhibits the lowest average degree of data leakage.

The research team believes that reporting such cross-model ML metrics will also help the community vet newly developed multimodal benchmarks in the future.

Paper Links:

https://arxiv.org/pdf/2403.20330.pdf

Project Links:

https://mmstar-benchmark.github.io/

https://huggingface.co/datasets/Lin-Chen/MMStar

Code Links:

https://github.com/MMStar-Benchmark/MMStar

— END —
