
HuggingFace teaches you how to build a SOTA vision model

Author: Quantum Position

Cressy, from Aofei Temple

Quantum Position | WeChat official account QbitAI

First came OpenAI's GPT-4o, then a string of blockbuster releases from Google; advanced multimodal large models have been arriving one after another.

Practitioners elsewhere were shaken and began to think about how to catch up with these super models.

Against this backdrop, a paper from HuggingFace and Sorbonne University in France summarizes the key lessons of building large vision models and points out a path for developers.


These lessons cover model architecture selection, training methods, training data, and more. The authors distilled them after comparing many alternatives, and the core points include the following:

  • Architecture choice matters a great deal if you want to build a good vision model.
  • The language model has a greater impact on overall performance than the vision module.
  • A phased pre-training strategy is better for building up the model's capabilities.
  • Training data should include multiple types, with attention paid to balancing their proportions.

It is these lessons that enabled HuggingFace to build Idefics2, a vision model that is SOTA at its scale.

Idefics2 is built on Mistral-7B, has 8B parameters in total, and can accurately recognize handwriting.
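As a quick illustration of what that looks like in practice, here is a minimal sketch of running the publicly released checkpoint with the transformers library; the model ID, processor calls, and image path below follow the public release and are assumptions rather than details from this article:

```python
# Minimal sketch: running the public Idefics2 checkpoint with the transformers library.
# Assumes transformers >= 4.40; the image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, Idefics2ForConditionalGeneration

model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = Idefics2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("handwritten_note.jpg")  # placeholder image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Transcribe the handwriting in this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```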


Practitioners commented that this is a good survey, very helpful for vision model developers, while also cautioning against treating it as a panacea.


Of course, some joked that architecture and data are all secondary, and that having GPUs is what matters most.


There is some truth to that, but jokes aside, let's look at what lessons HuggingFace has to offer.

Lessons from building a SOTA model

The lessons in the HuggingFace paper come from the development of its vision model, Idefics2.

Compared with its predecessor Idefics1, Flamingo, and other earlier SOTA models of similar scale, Idefics2 performs better on multiple datasets and even outperforms larger 13B models.

Meanwhile, compared with MM1, which edges out Idefics2 on the COCO dataset, Idefics2 consumes significantly fewer tokens per image.


From the development of Idefics2, the lessons HuggingFace offers cover at least the following aspects:

  • Choice of backbone and architecture
  • Training methods and strategies
  • Data diversity and processing strategies

Language models have a greater impact on overall performance

Current vision models are mostly built as a language model plus a vision encoder, and the authors evaluated how much each contributes to overall performance.

The results show that the quality of the language model matters more than that of the vision encoder.

Using a better language model of the same parameter count (e.g., replacing Llama-7B with Mistral-7B) significantly improves the large vision model's performance on downstream tasks.

Upgrading the vision encoder brings limited benefit, so when a trade-off must be made, the best move is to prioritize a stronger language model.


Of course, this does not mean upgrading the vision encoder is useless; if resources allow, a better vision encoder still brings some performance improvement.

The encoder should also be matched to the task: for text recognition, choose a vision encoder that supports variable resolutions; if the task demands fast inference, choose a more lightweight model.

In real-world applications, inference speed and memory footprint must also be weighed; the SigLIP-SO400M chosen for Idefics2 strikes a good balance between performance and efficiency.

Select the architecture type based on your needs

On architecture, the paper discusses the two common types: fully autoregressive and cross-attention.

A fully autoregressive architecture generates each output autoregressively, conditioning on the dependencies of the entire sequence;

a cross-attention architecture allows the model to dynamically focus on different parts of one modality while processing another, enabling more flexible cross-modal interaction.

In their experiments, the authors found that which architecture performs better depends on whether the pre-trained backbone is frozen.

(Simply put, if the pre-trained backbone's weights are updated during the main training run, it is not frozen; if they stay fixed, it is frozen.)

If the backbone is not frozen, the fully autoregressive architecture performs better; if it is frozen, the cross-attention architecture is better.


Whether to freeze the backbone depends on what the developer needs most.

With limited resources, if the priority is high performance under tight latency requirements, freezing is more appropriate.

If you want your model to be more flexible and adaptable, you should choose a non-frozen training method.

For Idefics2 specifically, the choice was not to freeze the backbone, so a fully autoregressive architecture was adopted accordingly.
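For reference, here is a minimal PyTorch sketch of what freezing (or not freezing) the pre-trained backbones means in practice; ToyVLM and its module names are hypothetical stand-ins, not Idefics2's actual modules:

```python
# Minimal sketch: what "freezing" the pre-trained backbones means in PyTorch.
# ToyVLM and its module names are hypothetical stand-ins for a real vision-language model.
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(64, 32)   # stand-in for a pre-trained vision encoder
        self.language_model = nn.Linear(32, 32)   # stand-in for a pre-trained language model
        self.connector = nn.Linear(32, 32)        # newly initialized part, always trained

def set_backbone_trainable(model: nn.Module, train_vision: bool, train_language: bool) -> None:
    """Toggle whether the pre-trained backbones receive gradient updates (False = frozen)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = train_vision
    for p in model.language_model.parameters():
        p.requires_grad = train_language

model = ToyVLM()
# Frozen backbones: the setting where the cross-attention architecture tends to win.
set_backbone_trainable(model, train_vision=False, train_language=False)
# Unfrozen backbones: the setting where fully autoregressive tends to win (Idefics2's choice).
set_backbone_trainable(model, train_vision=True, train_language=True)
```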


Lessons from the training phase

Choosing the right architecture matters, but so does the training process. From training Idefics2, the authors summarize the following lessons for reference:

First, adopt a phased pre-training strategy overall: use lower-resolution images in the initial stage, then introduce higher-resolution PDF documents. This gradually builds up the model's different capabilities.

Second, use learned pooling instead of feeding image features directly into the language model. This greatly reduces the number of image tokens, significantly improves training and inference efficiency, and also brings performance gains.
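To make the idea concrete, here is a minimal sketch of learned pooling: a small set of learned query vectors cross-attends to the image features, so the language model only ever sees a fixed, much smaller number of image tokens. The module below is a simplified stand-in under assumed dimensions, not Idefics2's actual connector:

```python
# Minimal sketch of learned pooling: learned queries cross-attend to image patch features,
# compressing them into a fixed, small number of image tokens. Dimensions are assumptions.
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, dim) -- typically hundreds of patch tokens
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, image_feats, image_feats)
        return self.norm(pooled)  # (batch, num_queries, dim) -- far fewer tokens

pooler = LearnedPooling()
patch_feats = torch.randn(2, 729, 1024)   # e.g. a 27x27 patch grid from a vision encoder
print(pooler(patch_feats).shape)          # torch.Size([2, 64, 1024])
```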

Third, data augmentation: one method is to split an image into several sub-images and feed them to the model during training. This trades extra computation for stronger performance at inference time, and is especially effective for tasks such as text recognition, though not every image needs to be processed this way.
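A minimal sketch of that sub-image trick, using an illustrative 2x2 grid rather than the paper's exact settings:

```python
# Minimal sketch: split an image into a grid of sub-images (plus the original),
# so the model sees both global context and fine-grained detail. Grid size is illustrative.
from PIL import Image

def split_into_subimages(image: Image.Image, rows: int = 2, cols: int = 2) -> list[Image.Image]:
    w, h = image.size
    tile_w, tile_h = w // cols, h // rows
    crops = [
        image.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(rows)
        for c in range(cols)
    ]
    return crops + [image]  # each crop is encoded separately alongside the full image

# Usage: tiles = split_into_subimages(Image.open("document_page.png"))
```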

Fourth, using more diverse data and tasks in the instruction fine-tuning stage improves the model's generalization and robustness.

In addition, to stabilize training, the authors used LoRA to adapt the pre-trained parameters whenever the pre-trained unimodal backbones participated in training (i.e., were not frozen).
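For readers unfamiliar with LoRA, here is a minimal sketch using the peft library; the base checkpoint, rank, alpha, and target modules below are placeholder choices, not the paper's configuration:

```python
# Minimal sketch: wrapping a pre-trained language backbone with LoRA adapters via peft.
# The checkpoint, rank, alpha, and target_modules are placeholders, not the paper's settings.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights receive updates
```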

Data diversity and processing strategies

In addition to the training process itself, the data you choose can have a significant impact on your model's performance.

Starting at the collection stage, take care to gather multiple types of data. Idefics2, for example, uses three types: interleaved image-text documents (e.g., web pages), image-text pairs (e.g., image captions), and PDF documents with OCR annotations.

The proportions of the different data types should also be balanced according to actual needs, rather than simply split evenly.

As for data volume, more is better, though care should be taken to filter out low-quality data.
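As an illustration of applying such a mix at sampling time, here is a minimal sketch; the source names and weights are made up for the example and are not the paper's actual ratios:

```python
# Minimal sketch: sampling training examples from several sources according to a target mix.
# The source names and weights are illustrative, not the paper's actual ratios.
import random

SOURCE_WEIGHTS = {
    "interleaved_web_documents": 0.5,
    "image_text_pairs": 0.3,
    "pdf_ocr_documents": 0.2,
}

rng = random.Random(0)

def sample_source() -> str:
    """Pick which dataset the next training example is drawn from."""
    names, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(names, weights=weights, k=1)[0]

counts = {name: 0 for name in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)  # roughly follows the 0.5 / 0.3 / 0.2 mix
```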

Of course, collection is only the first step toward usable training data; to train the model well, the data also needs a certain amount of processing.

Different preprocessing and augmentation strategies suit different data types: OCR data, for example, needs higher-resolution images, while other data can use lower resolutions.

Note that the original aspect ratio and resolution should be preserved when processing images, which greatly reduces the computational overhead of training and inference while improving the model's adaptability.
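Here is a minimal sketch of resolution handling that preserves the aspect ratio, with a higher longest-side cap for OCR-style data; the pixel limits are illustrative assumptions:

```python
# Minimal sketch: resize while preserving aspect ratio, with a higher resolution cap
# for OCR-style data. The pixel limits are illustrative assumptions.
from PIL import Image

MAX_LONGEST_SIDE = {"ocr": 1536, "default": 768}

def resize_keep_aspect(image: Image.Image, data_type: str = "default") -> Image.Image:
    max_side = MAX_LONGEST_SIDE.get(data_type, MAX_LONGEST_SIDE["default"])
    w, h = image.size
    scale = min(1.0, max_side / max(w, h))  # never upscale
    if scale < 1.0:
        image = image.resize((round(w * scale), round(h * scale)))  # default bicubic resample
    return image

# Usage: small = resize_keep_aspect(Image.open("scan.png"), data_type="ocr")
```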

If these lessons have inspired you, read the original paper for more details, and feel free to share your own development experience in the comments.

Paper address:

https://arxiv.org/abs/2405.02246

— END —

Quantum Position (QbitAI) is a contracted creator on Toutiao.

Follow us and be the first to know about cutting-edge technology trends
