Domestic "Little Steel Cannon" Beats GPT-4V and Gemini Pro Overnight! Firmly on the Iron Throne of On-Device Multimodal Models

Author: New Zhiyuan

Editors: Momoko, So Sleepy

【New Zhiyuan Guide】Overnight, the title of the world's strongest on-device multimodal model changed hands again: with only 8B parameters, the newcomer beats the multimodal giants Gemini Pro and GPT-4V. Its OCR and long-image recognition set a new SOTA, and its image encoding is 150 times faster. It is a romantic 520 gift to developers from a leading Chinese large-model company.

Punching GPT-4V and kicking Gemini Pro: with only 8B parameters, it defeats the reigning kings of multimodal large models.

Today, the world's strongest on-device multimodal model has gone all out!

As we all know, on-device models are a major trend in AI: from Microsoft and Google to Apple and Intel, global technology giants are racing to bring AI to devices such as PCs and mobile phones.

What few expected is that on-device models could become this strong, this fast!

Even more surprising, the model comes not from a big foreign company but from ModelBest (面壁智能, literally "Facing Wall Intelligence"), a leading Chinese large-model developer, which has just released its latest "little steel cannon", MiniCPM-Llama3-V 2.5.

The launch was timed for today, May 20 ("520" sounds like "I love you" in Chinese), and is billed as a Valentine's gift to the open source community: surprisingly romantic for a technology company~

MiniCPM-Llama3-V 2.5 open source address:

https://github.com/OpenBMB/MiniCPM-V

MiniCPM series open source address:

https://github.com/OpenBMB/MiniCPM

Hugging Face download address:

https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5

So how strong is this little steel cannon? What earns it the title of the world's strongest on-device multimodal model?

In summary, MiniCPM-Llama3-V 2.5 supports more than 30 languages and also offers:

  • The strongest overall on-device multimodal performance: surpasses the multimodal giants Gemini Pro and GPT-4V;
  • SOTA OCR capability: 9 times as many pixels for sharper input, with accurate recognition of difficult images, long images, and long texts;
  • Image encoding 150 times faster: the first system-level multimodal acceleration on the device side.

The chart below shows the global trend toward multimodal large models with fewer parameters and higher performance.

The brightest star among them is ModelBest's little steel cannon, MiniCPM-Llama3-V 2.5.

MiniCPM-Llama3-V 2.5 proves on the merits that bigger is not always better: the strongest performance can be leveraged from the smallest parameter count!

In addition, as large models shrink in parameter count and the computing power of devices grows, high-performance on-device models are gaining strong momentum.

However, because smart terminals such as mobile phones and PCs process images at high frequency, deploying AI models on the device places higher demands on multimodal recognition and reasoning capabilities.

Judging from the rapid "triple jump" that ModelBest's little steel cannon series has made in three months, sharply lower inference costs and efficient deployment of large models are within sight.

SOTA OCR capability and the strongest on-device multimodal performance

An 8B on-device model that surpasses GPT-4V and Gemini Pro

This time, with just 8B parameters, MiniCPM-Llama3-V 2.5 delivers a SOTA OCR (Optical Character Recognition) score, along with the best overall multimodal score and the best hallucination performance among on-device models.

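Model radar chart: MiniCPM-Llama3-V 2.5 shows excellent all-round capability

On OpenCompass, an authoritative comprehensive evaluation platform, MiniCPM-Llama3-V 2.5 punches above its weight, with overall performance surpassing the multimodal giants GPT-4V and Gemini Pro.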

OCR (Optical Character Recognition) is one of the most important capabilities of multimodal large models, and a demanding test of their multimodal recognition and reasoning abilities.

On OCRBench, an authoritative leaderboard for overall OCR capability, the new-generation MiniCPM-Llama3-V 2.5 outperforms benchmark models well above its weight class, such as Claude 3V Opus and Gemini Pro, achieving SOTA performance.

On hallucination, an important indicator of the performance and reliability of multimodal large models, MiniCPM-Llama3-V 2.5 surpasses GPT-4V and many other models on the Object HalBench leaderboard (note: the ideal object hallucination rate is 0).

On the RealWorldQA leaderboard, which evaluates the basic real-world spatial understanding of multimodal models, MiniCPM-Llama3-V 2.5 once again surpasses GPT-4V and Gemini Pro, a rare achievement for an 8B model.

150 times faster! The first system-level acceleration on the device side

Supports 30+ languages, embracing the global open source community

With system-level acceleration on the device side carried out for the first time, MiniCPM-Llama3-V 2.5 has been deployed efficiently on mobile phones.

For image encoding, ModelBest integrates NPU and CPU acceleration frameworks for the first time, speeding up MiniCPM-Llama3-V 2.5 image encoding by a factor of 150.

For language model inference, results currently reported by the open source community put the decoding speed of the Llama 3 language model on mobile phones at around 0.5 token/s. Running a multimodal large model on the device is an even greater efficiency challenge, yet through CPU optimization, compilation optimization, video memory management, and other techniques, the language decoding speed of MiniCPM-Llama3-V 2.5 on mobile phones rises to 3-4 token/s.

Further acceleration work on the language model is also under way, and a more responsive, more interactive experience is coming.
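To put the decoding figures quoted above in perspective, the short Python sketch below converts them into speedup factors and waiting times. The 0.5 and 3-4 token/s rates come from the article; the 100-token reply length is a hypothetical value chosen only for illustration.

```python
# Decoding rates quoted in the article (tokens per second on a mobile phone).
BASELINE_TOKENS_PER_S = 0.5          # community-reported Llama 3 figure
OPTIMIZED_TOKENS_PER_S = (3.0, 4.0)  # MiniCPM-Llama3-V 2.5 range after optimization

# Hypothetical reply length, chosen only for illustration.
REPLY_TOKENS = 100


def seconds_to_decode(num_tokens: int, tokens_per_s: float) -> float:
    """Time needed to decode num_tokens at a constant rate."""
    return num_tokens / tokens_per_s


# Speedup at the low and high end of the optimized range.
speedups = tuple(rate / BASELINE_TOKENS_PER_S for rate in OPTIMIZED_TOKENS_PER_S)
baseline_wait = seconds_to_decode(REPLY_TOKENS, BASELINE_TOKENS_PER_S)
optimized_waits = tuple(
    seconds_to_decode(REPLY_TOKENS, rate) for rate in OPTIMIZED_TOKENS_PER_S
)

print(f"speedup: {speedups[0]:.0f}x to {speedups[1]:.0f}x")
print(f"{REPLY_TOKENS} tokens at baseline: {baseline_wait:.0f} s")
print(f"{REPLY_TOKENS} tokens optimized: {optimized_waits[1]:.0f} to {optimized_waits[0]:.0f} s")
```

Under these assumptions, the quoted rates correspond to a 6x to 8x speedup, cutting a 100-token reply from about 200 seconds to roughly 25 to 33 seconds.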

(The GIF is shown at 2x speed; ModelBest is working on further acceleration and optimization)

Unlike common Chinese-English bilingual models, MiniCPM-Llama3-V 2.5 supports more than 30 languages.

These include mainstream languages such as German, French, Spanish, Italian, and Russian, largely covering the countries along the Belt and Road.

Using ModelBest's self-developed cross-lingual generalization technology, multilingual multimodal conversational ability can be generalized efficiently through instruction fine-tuning on only a small amount of translated multimodal data.

Now, billions of people in hundreds of countries can finally communicate with on-device large models freely in their native languages. No longer cut off from the mainstream of cutting-edge technology, they gain more opportunities to put AI applications to use, improve their quality of life, and take part in technological competition. More people can truly enjoy the fun of large models!

Multilingual example (language model acceleration work is ongoing; shown here at 2x speed)

Evaluation results on the multilingual version of LLaVABench: MiniCPM-Llama3-V 2.5 shows superior conversational ability

9 times the pixels, sharper images

Accurate recognition of difficult images, long images, and long texts

With further refinement of its OCR technology and advances in complex reasoning and multimodal recognition, MiniCPM-Llama3-V 2.5 once again delivers outstanding performance in accurately recognizing difficult images, long images, and long texts!

ModelBest's self-developed efficient encoding technology for high-definition images can efficiently encode and losslessly recognize images of up to 1.8 million pixels at any aspect ratio, even "slightly extreme" images at a 1:9 ratio, breaking through the bottleneck of conventional techniques that can only recognize small images of 200,000 pixels.
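The "9x" headline figure follows directly from the two pixel budgets quoted above. The Python sketch below reproduces that arithmetic and estimates the dimensions of an image at the 1:9 limit ratio. The assumption that such an image may use the full 1.8-million-pixel budget is ours, for illustration only, and the helper name `dimensions_for_ratio` is hypothetical.

```python
import math
from typing import Tuple

# Pixel budgets quoted in the article.
MAX_PIXELS = 1_800_000   # high-definition images the model can encode
LEGACY_PIXELS = 200_000  # small-image limit of conventional approaches


def dimensions_for_ratio(total_pixels: int, width_ratio: int, height_ratio: int) -> Tuple[int, int]:
    """Largest width x height with an exact aspect ratio that fits in total_pixels.

    Assumes (for illustration) that the image may use the whole pixel budget.
    """
    # width = width_ratio * k and height = height_ratio * k,
    # so the area is width_ratio * height_ratio * k**2.
    k = math.isqrt(total_pixels // (width_ratio * height_ratio))
    return width_ratio * k, height_ratio * k


pixel_gain = MAX_PIXELS / LEGACY_PIXELS
width, height = dimensions_for_ratio(MAX_PIXELS, 1, 9)

print(f"pixel gain: {pixel_gain:.0f}x")
print(f"1:9 image within budget: {width} x {height} pixels")
```

Under this assumption, a 1:9 image could be as large as 447 x 4023 pixels while staying within the 1.8-million-pixel budget.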

The MiniCPM-V series of multimodal models had already earned a good reputation for efficiently parsing difficult scenes such as street views and long images.

With this technical upgrade, MiniCPM-Llama3-V 2.5 makes further breakthroughs in complex reasoning. It understands images better and can think through and solve problems at a more complex, more human-like level, earning it the nickname "little Sherlock Holmes" of large models.

Complex reasoning lets the model go beyond understanding a single modality, such as text or images alone, and analyze combined information across modalities more accurately and in greater depth.

For example, given a picture of buildings covered in dense handwriting that is hard for the human eye to make out, MiniCPM-Llama3-V 2.5 recognizes the "The Three-Body Problem" theme at a glance and correctly infers that the buildings were designed to commemorate "The Three-Body Problem" and its contribution to Chinese science fiction literature, which brings a smile.

Given the same question, GPT-4V produces a less satisfactory answer.

In addition, recognizing flowcharts that contain complex logic is a direct demonstration of a multimodal model's reasoning ability. MiniCPM-Llama3-V 2.5 not only readily understands the spatial positions of, and the complex logical relationships among, the text and arrows of the different modules in a flowchart, but also gives clear, easy-to-follow explanations.

Want to forward an Asian food pyramid chart to your mom, but she can't read English?

With its strong reasoning ability, MiniCPM-Llama3-V 2.5 not only understands and analyzes in depth the types and distribution of foods in the image, but also grasps the underlying need for balanced nutrition, combines the foods sensibly, and recommends a full week of three daily meals in Chinese in a single response.

For full-text OCR, improved structured information extraction greatly helps the accurate recognition of long images and long texts.

For example, given a long image packed with dense information, MiniCPM-Llama3-V 2.5 recognizes the full text verbatim.

Even for long image-and-text content that takes several screens to scroll through, MiniCPM-Llama3-V 2.5 can give the correct answer.

Given a photo of a train ticket taken with a phone, MiniCPM-Llama3-V 2.5 can also extract the information accurately and output it as error-free JSON.
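When structured output like this feeds a downstream application, it is worth checking before use. The Python sketch below is a minimal illustration, not ModelBest's pipeline: the sample reply, the helper `parse_ticket`, and the field names (`train_number`, `departure`, `arrival`, `date`, `price`) are all hypothetical, and only the standard `json` module is used.

```python
import json

# Hypothetical field names; the article does not specify the output schema.
REQUIRED_FIELDS = ("train_number", "departure", "arrival", "date", "price")


def parse_ticket(model_reply: str) -> dict:
    """Parse a model reply as JSON and check that the expected fields exist."""
    try:
        ticket = json.loads(model_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(ticket, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [field for field in REQUIRED_FIELDS if field not in ticket]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return ticket


# Made-up reply standing in for real model output.
sample_reply = """
{"train_number": "G1234", "departure": "Beijing South",
 "arrival": "Shanghai Hongqiao", "date": "2024-05-20", "price": 553.0}
"""

ticket = parse_ticket(sample_reply)
print(ticket["train_number"], ticket["departure"], "->", ticket["arrival"])
```

Rejecting malformed or incomplete replies with a `ValueError` keeps the failure at the parsing step instead of letting a missing field surface later in the application.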

Finally, ModelBest is an enthusiastic contributor to the open source community, as well as a beneficiary of it.

The performance leap of MiniCPM-Llama3-V 2.5 builds on the ModelBest team's innovative refinement of multimodal technology, and would not be possible without the strong foundation provided by Llama3-8B-Instruct as the base model.

Thanks to the excellent work of outstanding peers around the world, we stand on one another's shoulders, reach for the stars, and point toward higher and brighter ground in science.

We will also continue to give back to the community by open-sourcing more excellent models, data, and infrastructure tools, spreading the spark of open source and openness across the sky of global collaborative innovation.

Resources:

MiniCPM-Llama3-V 2.5 open source address:

https://github.com/OpenBMB/MiniCPM-V

MiniCPM series open source address:

https://github.com/OpenBMB/MiniCPM

Hugging Face download address:

https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5
