
HuggingFace teaches you how to build a SOTA vision model

Author: Quantum Position

Cressy, from Aofei Temple

Quantum Position | WeChat official account QbitAI

First came OpenAI's GPT-4o, then a string of blockbuster releases from Google; advanced multimodal large models have been arriving one after another.

Practitioners elsewhere were shaken and began to think about how to catch up with these super models.

Against this backdrop, a paper from HuggingFace and Sorbonne University in France summarizes the key lessons of building large vision models and points out a path for developers.


These lessons cover model architecture selection, training methods, training data, and more. The authors distilled them after comparing many alternatives, and the core points include the following:

  • Architecture choice matters a great deal if you want to build a good vision model.
  • The language model has a greater impact on overall performance than the vision module.
  • A phased pre-training strategy is better for building up the model's capabilities.
  • Training data should include multiple types, with attention paid to balancing their proportions.

It is these lessons that enabled HuggingFace to build Idefics2, a vision model that is SOTA at its scale.

Idefics2 is built on Mistral-7B, has 8B parameters in total, and can accurately recognize handwriting.
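As a quick illustration of what that looks like in practice, here is a minimal sketch of running the publicly released checkpoint with the transformers library; the model ID, processor calls, and image path below follow the public release and are assumptions rather than details from this article:

```python
# Minimal sketch: running the public Idefics2 checkpoint with the transformers library.
# Assumes transformers >= 4.40; the image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("handwritten_note.jpg")  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Transcribe the handwriting in this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```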


Practitioners commented that this is a good survey, very helpful for vision model developers, while also cautioning against treating it as a panacea.


Of course, some joked that architecture and data are all secondary, and that having GPUs is what matters most.


There is some truth to that, but jokes aside, let's look at what lessons HuggingFace has to offer.

Lessons from building a SOTA model

The lessons in the HuggingFace paper come from the development of its vision model, Idefics2.

Compared with its predecessor Idefics1, Flamingo, and other earlier SOTA models of similar scale, Idefics2 performs better on multiple datasets and even outperforms larger 13B models.

Meanwhile, compared with MM1, which edges out Idefics2 on the COCO dataset, Idefics2 consumes significantly fewer tokens per image.


From the development of Idefics2, the lessons HuggingFace offers cover at least the following aspects:

  • Choice of backbone and architecture
  • Training methods and strategies
  • Data diversity and processing strategies

Language models have a greater impact on overall performance

Current vision models are mostly built as a language model plus a vision encoder, and the authors evaluated how much each contributes to overall performance.

The results show that the quality of the language model matters more than that of the vision encoder.

Using a better language model of the same parameter count (e.g., replacing Llama-7B with Mistral-7B) significantly improves the large vision model's performance on downstream tasks.

Upgrading the vision encoder brings limited benefit, so when a trade-off must be made, the best move is to prioritize a stronger language model.


Of course, this does not mean upgrading the vision encoder is useless; if resources allow, a better vision encoder still brings some performance improvement.

The encoder should also be matched to the task: for text recognition, choose a vision encoder that supports variable resolutions; if the task demands fast inference, choose a more lightweight model.

In real-world applications, inference speed and memory footprint must also be weighed; the SigLIP-SO400M chosen for Idefics2 strikes a good balance between performance and efficiency.

Select the architecture type based on your needs

On architecture, the paper discusses the two common types: fully autoregressive and cross-attention.

A fully autoregressive architecture generates each output autoregressively, conditioning on the dependencies of the entire sequence;

a cross-attention architecture allows the model to dynamically focus on different parts of one modality while processing another, enabling more flexible cross-modal interaction.

In their experiments, the authors found that which architecture performs better depends on whether the pre-trained backbone is frozen.

(Simply put, if the pre-trained backbone's weights are updated during the main training run, it is not frozen; if they stay fixed, it is frozen.)

If the backbone is not frozen, the fully autoregressive architecture performs better; if it is frozen, the cross-attention architecture is better.


Whether to freeze the backbone depends on what the developer needs most.

With limited resources, if the priority is high performance under tight latency requirements, freezing is more appropriate.

If you want your model to be more flexible and adaptable, you should choose a non-frozen training method.

For Idefics2 specifically, the choice was not to freeze the backbone, so a fully autoregressive architecture was adopted accordingly.
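For reference, here is a minimal PyTorch sketch of what freezing (or not freezing) the pre-trained backbones means in practice; ToyVLM and its module names are hypothetical stand-ins, not Idefics2's actual modules:

```python
# Minimal sketch: what "freezing" the pre-trained backbones means in PyTorch.
# ToyVLM and its module names are hypothetical stand-ins for a real vision-language model.
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(64, 32)   # stand-in for a pre-trained vision encoder
        self.language_model = nn.Linear(32, 32)   # stand-in for a pre-trained language model
        self.connector = nn.Linear(32, 32)        # newly initialized part, always trained

def set_backbone_trainable(model: nn.Module, train_vision: bool, train_language: bool) -> None:
    """Toggle whether the pre-trained backbones receive gradient updates (False = frozen)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = train_vision
    for p in model.language_model.parameters():
        p.requires_grad = train_language

model = ToyVLM()
# Frozen backbones: the setting where the cross-attention architecture tends to win.
set_backbone_trainable(model, train_vision=False, train_language=False)
# Unfrozen backbones: the setting where fully autoregressive tends to win (Idefics2's choice).
set_backbone_trainable(model, train_vision=True, train_language=True)
```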


Lessons from the training phase

Choosing the right architecture matters, but so does the training process. From training Idefics2, the authors summarize the following lessons for reference:

First, adopt a phased pre-training strategy overall: use lower-resolution images in the initial stage, then introduce higher-resolution PDF documents. This gradually builds up the model's different capabilities.

Second, use learned pooling instead of feeding image features directly into the language model. This greatly reduces the number of image tokens, significantly improves training and inference efficiency, and also brings performance gains.
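To make the idea concrete, here is a minimal sketch of learned pooling: a small set of learned query vectors cross-attends to the image features, so the language model only ever sees a fixed, much smaller number of image tokens. The module below is a simplified stand-in under assumed dimensions, not Idefics2's actual connector:

```python
# Minimal sketch of learned pooling: learned queries cross-attend to image patch features,
# compressing them into a fixed, small number of image tokens. Dimensions are assumptions.
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, dim) -- typically hundreds of patch tokens
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, image_feats, image_feats)
        return self.norm(pooled)  # (batch, num_queries, dim) -- far fewer tokens

pooler = LearnedPooling()
patch_feats = torch.randn(2, 729, 1024)   # e.g. a 27x27 patch grid from a vision encoder
print(pooler(patch_feats).shape)          # torch.Size([2, 64, 1024])
```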

Third, data augmentation: one method is to split an image into several sub-images and feed them to the model during training. This trades extra computation for stronger performance at inference time, and is especially effective for tasks such as text recognition, though not every image needs to be processed this way.
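A minimal sketch of that sub-image trick, using an illustrative 2x2 grid rather than the paper's exact settings:

```python
# Minimal sketch: split an image into a grid of sub-images (plus the original),
# so the model sees both global context and fine-grained detail. Grid size is illustrative.
from PIL import Image

def split_into_subimages(image: Image.Image, rows: int = 2, cols: int = 2) -> list[Image.Image]:
    w, h = image.size
    tile_w, tile_h = w // cols, h // rows
    crops = [
        image.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(rows)
        for c in range(cols)
    ]
    return crops + [image]  # each crop is encoded separately alongside the full image

# Usage: tiles = split_into_subimages(Image.open("document_page.png"))
```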

Fourth, using more diverse data and tasks in the instruction fine-tuning stage improves the model's generalization and robustness.

In addition, to stabilize training, the authors used LoRA to adapt the pre-trained parameters whenever the pre-trained unimodal backbones participated in training (i.e., were not frozen).
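For readers unfamiliar with LoRA, here is a minimal sketch using the peft library; the base checkpoint, rank, alpha, and target modules below are placeholder choices, not the paper's configuration:

```python
# Minimal sketch: wrapping a pre-trained language backbone with LoRA adapters via peft.
# The checkpoint, rank, alpha, and target_modules are placeholders, not the paper's settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights receive updates
```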

Data diversity and processing strategies

In addition to the training process itself, the data you choose can have a significant impact on your model's performance.

Starting at the collection stage, take care to gather multiple types of data. Idefics2, for example, uses three types: interleaved image-text documents (e.g., web pages), image-text pairs (e.g., image captions), and PDF documents with OCR annotations.

The proportions of the different data types should also be balanced according to actual needs, rather than simply split evenly.

As for data volume, more is better, though care should be taken to filter out low-quality data.
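As an illustration of applying such a mix at sampling time, here is a minimal sketch; the source names and weights are made up for the example and are not the paper's actual ratios:

```python
# Minimal sketch: sampling training examples from several sources according to a target mix.
# The source names and weights are illustrative, not the paper's actual ratios.
import random

SOURCE_WEIGHTS = {
    "interleaved_web_documents": 0.5,
    "image_text_pairs": 0.3,
    "pdf_ocr_documents": 0.2,
}

rng = random.Random(0)

def sample_source() -> str:
    """Pick which dataset the next training example is drawn from."""
    names, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]

counts = {name: 0 for name in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)  # roughly follows the 0.5 / 0.3 / 0.2 mix
```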

Of course, collection is only the first step toward usable training data; to train the model well, the data also needs a certain amount of processing.

Different preprocessing and augmentation strategies suit different data types: OCR data, for example, needs higher-resolution images, while other data can use lower resolutions.

Note that the original aspect ratio and resolution should be preserved when processing images, which greatly reduces the computational overhead of training and inference while improving the model's adaptability.
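Here is a minimal sketch of resolution handling that preserves the aspect ratio, with a higher longest-side cap for OCR-style data; the pixel limits are illustrative assumptions:

```python
# Minimal sketch: resize while preserving aspect ratio, with a higher resolution cap
# for OCR-style data. The pixel limits are illustrative assumptions.
from PIL import Image

MAX_LONGEST_SIDE = {"ocr": 1536, "default": 768}

def resize_keep_aspect(image: Image.Image, data_type: str = "default") -> Image.Image:
    max_side = MAX_LONGEST_SIDE.get(data_type, MAX_LONGEST_SIDE["default"])
    w, h = image.size
    scale = min(1.0, max_side / max(w, h))  # never upscale
    if scale < 1.0:
        image = image.resize((round(w * scale), round(h * scale)))  # default bicubic resample
    return image

# Usage: small = resize_keep_aspect(Image.open("scan.png"), data_type="ocr")
```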

If these lessons have inspired you, read the original paper for more details, and feel free to share your own development experience in the comments.

Paper address:

https://arxiv.org/abs/2405.02246

— END —

Quantum Position (QbitAI) is a contracted creator on Toutiao.

Follow us and be the first to know about cutting-edge technology trends
