KOSMOS-2.5: Multimodal Large Language Model for Reading "Text-Dense Images"

Author: Translation Technology Qianwen

A significant current trend is toward building ever larger and more complex models, with tens to hundreds of billions of parameters, capable of producing impressive linguistic output.

However, existing large language models focus mainly on textual information and cannot understand visual information.

Multimodal large language models (MLLMs) are designed to address this limitation: they fuse visual and textual information in a single Transformer-based model, enabling it to learn from and generate content grounded in both modalities.

MLLMs show potential in a variety of practical applications, including natural-image understanding and text-image understanding. These models use language modeling as a common interface for multimodal problems, enabling them to process and generate responses conditioned on both textual and visual input.

However, existing MLLMs focus mainly on natural images at low resolution, and MLLM research on text-dense images remains rare. Making full use of large-scale multimodal pre-training to process text images is therefore an important research direction for MLLMs.

By incorporating text images into the training process and developing models that reason over both textual and visual information, we can open up new possibilities for multimodal applications involving high-resolution, text-dense images.

Paper: https://arxiv.org/abs/2309.11419

KOSMOS-2.5 is a multimodal large language model for text-dense images. Developed on the basis of KOSMOS-2, it is positioned as a multimodal literate model for reading such images.

The model delivers excellent performance in understanding text-dense images, bridging the gap between vision and text.

At the same time, it marks an evolution of the task paradigm, moving from the earlier encoder-decoder architecture to a pure decoder-only architecture.

The goal of KOSMOS-2.5 is seamless processing of the visual and textual data in text-rich images, so as to understand image content and generate structured text descriptions.

Figure 1: Overview of KOSMOS-2.5

As shown in Figure 1, KOSMOS-2.5 is a multimodal model designed to handle two closely related tasks using a unified framework.

The first task generates spatially aware text blocks, i.e., the content of each text block together with its bounding-box coordinates; the second generates structured text output in Markdown format, capturing various styles and structures.
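To make the two output styles concrete, here is a minimal sketch of what serialized targets might look like; the coordinate-token format and names below are illustrative assumptions, not the paper's verbatim vocabulary.

```python
# Task 1: spatially aware text blocks -- each text line is paired with
# the discretized coordinates of its bounding box (token names assumed).
spatial_text_output = (
    "<bbox><x_12><y_34><x_580><y_58></bbox> KOSMOS-2.5 is a multimodal model\n"
    "<bbox><x_12><y_62><x_455><y_86></bbox> for machine reading of text images."
)

# Task 2: structured Markdown output -- the same content rendered as
# Markdown, preserving style and structure instead of coordinates.
markdown_output = (
    "# KOSMOS-2.5\n\n"
    "KOSMOS-2.5 is a multimodal model for **machine reading** of text images."
)
```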

Figure 2: KOSMOS-2.5 architecture diagram

As shown in Figure 2, the two tasks utilize a shared Transformer architecture with task-specific prompts.

KOSMOS-2.5 combines a Vision Transformer-based vision encoder with a Transformer-based language decoder, connected via a resampling module.
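As a rough illustration of this layout, the following is a minimal PyTorch sketch with toy dimensions; the module names (`Resampler`, `TextImageReader`) and all sizes are my own assumptions, and the wiring mirrors the description above rather than the actual implementation.

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    """Perceiver-style resampler (assumed design): compresses a variable
    number of image patch embeddings into a fixed set of learned latents."""
    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, dim) -> (batch, num_latents, dim)
        q = self.latents.unsqueeze(0).expand(patches.size(0), -1, -1)
        out, _ = self.attn(q, patches, patches)
        return out

class TextImageReader(nn.Module):
    """Toy KOSMOS-2.5-style pipeline: ViT-like encoder -> resampler ->
    decoder-only LM over [visual tokens; text tokens]."""
    def __init__(self, dim: int = 256, vocab: int = 32000):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.vision_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.resampler = Resampler(dim)
        self.token_embed = nn.Embedding(vocab, dim)
        # A causally masked encoder stack stands in for the decoder-only LM.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, image: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        visual = self.resampler(self.vision_encoder(patches))
        seq = torch.cat([visual, self.token_embed(tokens)], dim=1)
        # Full causal mask (a simplification: visual tokens could attend freely).
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.decoder(seq, mask=mask)
        return self.lm_head(hidden[:, visual.size(1):])  # logits at text positions

model = TextImageReader()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

Note that, in keeping with the decoder-only paradigm, the visual tokens are simply prepended to the text tokens and processed by one causally masked stack, rather than attended to through a separate cross-attention decoder.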

Figure 3: Pre-training dataset

As shown in Figure 3, to train the model the authors assembled a huge pre-training corpus totaling 324.4M samples.

Figure 4: Example of a training sample with text lines and bounding boxes

Figure 5: Example of a training sample in Markdown format

The dataset covers various types of text-dense images, including text lines with bounding boxes and plain text in Markdown format; example training samples are visualized in Figures 4 and 5.
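A common way to turn such annotations into training targets is to discretize pixel coordinates into a fixed number of bins and emit them as text tokens. The sketch below assumes this scheme, with made-up token names and bin count; KOSMOS-2.5's exact serialization may differ.

```python
def to_location_tokens(bbox, img_w, img_h, num_bins=1000):
    """Discretize a pixel-space bounding box (x0, y0, x1, y1) into integer
    bins and render it as coordinate tokens (token format is assumed)."""
    x0, y0, x1, y1 = bbox
    bx = [round(x / img_w * (num_bins - 1)) for x in (x0, x1)]
    by = [round(y / img_h * (num_bins - 1)) for y in (y0, y1)]
    return f"<bbox><x_{bx[0]}><y_{by[0]}><x_{bx[1]}><y_{by[1]}></bbox>"

# Building a text-line training target from a hypothetical OCR annotation:
line = {"text": "Multimodal pre-training", "bbox": (64, 120, 880, 152)}
target = to_location_tokens(line["bbox"], img_w=1024, img_h=1024) + line["text"]
print(target)  # <bbox><x_62><y_117><x_859><y_148></bbox>Multimodal pre-training
```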

This multi-task training approach enhances the overall multimodal capability of KOSMOS-2.5.

Figure 6: End-to-end document-level text recognition experiments

Figure 7: Experiments on generating Markdown-format text from images

As shown in Figures 6 and 7, KOSMOS-2.5 is evaluated on two tasks: end-to-end document-level text recognition and generation of Markdown-format text from images.

The experimental results demonstrate the excellent performance of KOSMOS-2.5 on text-dense image understanding tasks.
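For document-level text recognition, one common scoring choice is word-level precision, recall, and F1 between predicted and reference text; a minimal sketch follows (the paper's exact evaluation protocol may differ).

```python
from collections import Counter

def word_f1(prediction: str, reference: str) -> float:
    """Word-level F1 between predicted and reference text, counting
    multiset overlap of whitespace-separated words."""
    pred, ref = Counter(prediction.split()), Counter(reference.split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(word_f1("KOSMOS-2.5 reads text dense images",
              "KOSMOS-2.5 reads text-dense images"))  # ~0.667
```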

Figure 8: Sample input and output of KOSMOS-2.5

In addition, KOSMOS-2.5 demonstrates promising capabilities in few-shot and zero-shot scenarios, making it a versatile tool for practical applications involving text-rich images.

The authors point out that instruction fine-tuning is a promising approach to broadening the model's range of applications.

In the broader research landscape, an important direction is further scaling up model parameters.

As the scope of tasks continues to expand and their complexity increases, scaling models to handle larger amounts of data is critical to the development of text-dense multimodal models.

The ultimate goal is to develop a model that can effectively interpret both visual and textual data and generalize to more text-dense multimodal tasks.

-END-

This article is reproduced from: Translation Technology Education and Research Public Account

Reprinted Editor: Panpan
