
LLM/Ali: Tongyi Qianwen Qwen-VL-Chat multimodal large model [Benchmark VisualGLM]

Author: Every day up o

Demo: Tongyi Qianwen - multimodal dialogue demo

Paper: https://arxiv.org/pdf/2308.12966.pdf

Code: https://github.com/QwenLM/Qwen-VL (the official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision-language model proposed by Alibaba Cloud)
Tutorial: https://github.com/QwenLM/Qwen-VL/blob/master/TUTORIAL_zh.md
FAQ: https://github.com/QwenLM/Qwen-VL/blob/master/FAQ_zh.md
TouchStone: https://github.com/QwenLM/Qwen-VL/blob/master/touchstone/README_CN.md

0 Introduction

Qwen-VL is a Large Vision Language Model (LVLM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as input, and produces text and bounding boxes as output. Capabilities of the Qwen-VL series include zero-shot image captioning, general visual question answering (VQA), text-oriented VQA (document and in-image text understanding, including Chinese text), and referring expression comprehension (fine-grained visual grounding):

  • Powerful performance: achieves the best results among generalist models of comparable size on the standard English benchmarks for four classes of multimodal tasks (zero-shot captioning, general VQA, text-oriented VQA/DocVQA, and fine-grained visual grounding);
  • Multilingual dialogue: naturally supports dialogue in English, Chinese, and other languages, including end-to-end recognition of long Chinese and English text in images;
  • Multi-image interleaved dialogue: supports multi-image input and comparison, question answering about a specified image, multi-image storytelling, and so on (see the sketch after this list);
  • First generalist model supporting Chinese open-domain grounding: produces detection boxes from open-domain Chinese descriptions, i.e. it can accurately locate the described object in the image;
  • Fine-grained recognition and understanding: Qwen-VL is the first open-source LVLM to use 448-resolution input, compared with the 224 resolution used by other open-source LVLMs. The higher resolution improves fine-grained text recognition, document QA, and box annotation.
  • Qwen-VL: uses the pre-trained Qwen-7B model to initialize the language model and OpenCLIP ViT-bigG to initialize the visual encoder, with a single randomly initialized cross-attention layer added between them; it is trained on roughly 1.5B image-text pairs. The final image input resolution is 448.
  • Qwen-VL-Chat: a visual AI assistant built on top of Qwen-VL and trained with an alignment mechanism. It supports more flexible interaction, including multi-image input, multi-turn question answering, and creative writing, and is better suited for direct conversation or for building chatbots with multimodal input.
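
The multi-image interleaved dialogue mentioned above can be exercised through the same from_list_format / chat interface shown in the quick-start section below; the sketch here is only illustrative, and the file names and question are hypothetical.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()

# Multi-image interleaved input: two images plus a comparison question in a single query.
query = tokenizer.from_list_format([
    {'image': 'photo_a.jpg'},   # hypothetical local files (URLs also work)
    {'image': 'photo_b.jpg'},
    {'text': 'Which of the two pictures was taken at night, the first or the second?'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)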

1 Model structure

Qwen-VL is based on the Qwen-7B language model. On top of this LLM, it introduces a visual encoder (ViT), using OpenCLIP ViT-bigG, which encodes the input image into a sequence of features so that the model can understand and process visual information. A position-aware vision-language adapter, a single randomly initialized cross-attention layer, then fuses these visual features into the language model, connecting the two components so that the model accepts visual signals as input.
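
To make the adapter idea concrete, below is a minimal PyTorch sketch of a position-aware cross-attention adapter: a fixed set of learnable query tokens attends over position-encoded ViT features and yields a short sequence of visual tokens for the LLM. The dimensions, the number of query tokens, and the positional-encoding handling are illustrative assumptions, not the exact Qwen-VL implementation.

import torch
import torch.nn as nn

class VisionLanguageAdapter(nn.Module):
    """Sketch of a position-aware vision-language adapter (sizes are assumptions)."""
    def __init__(self, vit_dim=1664, llm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        # A fixed set of learnable query tokens that summarize the image.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        # Project ViT patch features into the LLM embedding space.
        self.proj = nn.Linear(vit_dim, llm_dim)
        # Single randomly initialized cross-attention layer.
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, vit_features, pos_embed):
        # vit_features: (B, N, vit_dim) patch features from the visual encoder
        # pos_embed:    (B, N, llm_dim) positional encodings for the patches
        kv = self.proj(vit_features) + pos_embed                 # position-aware keys/values
        q = self.queries.unsqueeze(0).expand(vit_features.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)                    # (B, num_queries, llm_dim)
        return fused                                             # fed to the LLM as visual tokens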

2 Data

The whole model is trained on an image-text dataset of roughly 1.5B pairs. Qwen-VL supports image input at 448 resolution, whereas previously open-sourced LVLMs typically supported only 224 resolution.

The training data is organized into three stages, matching the three training steps described in the next section.


3 Training process

The training process is divided into three steps (a parameter-freezing sketch follows the list):

  • Pre-training: only the visual encoder (ViT) and the vision-language adapter are optimized, while the LLM is frozen. Large-scale image-text pairs are used, with an input image resolution of 224x224.
  • Multi-task pre-training: higher-resolution (448x448) multi-task vision-language data is introduced, such as VQA, text VQA, and referring expression comprehension, for joint multi-task pre-training.
  • Supervised fine-tuning: the visual encoder (ViT) is frozen, and the LLM and the adapter are optimized. Dialogue interaction data is used for prompt tuning to obtain the final Qwen-VL-Chat model with interactive capability.
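
A rough sketch of this stage-wise freezing schedule in PyTorch is shown below. The attribute names (model.visual, model.adapter, model.llm) are hypothetical, and the assumption that every component is trainable during the multi-task stage is not stated above.

def set_trainable(module, trainable):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage):
    if stage == "pretrain":        # stage 1: train ViT + adapter, freeze the LLM
        set_trainable(model.visual, True)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, False)
    elif stage == "multitask":     # stage 2: 448x448 multi-task data (all parts assumed trainable)
        set_trainable(model.visual, True)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)
    elif stage == "sft":           # stage 3: freeze ViT, tune LLM + adapter on dialogue data
        set_trainable(model.visual, False)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)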

4 Model evaluation

We evaluate the capabilities of both models from two perspectives:

4.1 Basic task capability on standard English benchmarks. Qwen-VL is tested on four major classes of multimodal tasks: zero-shot captioning (zero-shot image description generation), general VQA (general visual question answering), text-oriented VQA (document and in-image text understanding, e.g. DocVQA), and grounding (fine-grained visual localization, i.e. referring expression comprehension):

  • Zero-shot Captioning: evaluates the model's zero-shot image description ability on unseen datasets;
  • General VQA: evaluates general question answering, such as yes/no questions, colors, counts, and categories;
  • Text-oriented VQA: evaluates the model's ability to recognize and answer questions about text in images, such as document QA, chart QA, and scene-text QA;
  • Referring Expression Comprehension: evaluates the model's ability to produce a bounding box for a given object description.

4.2 TouchStone: measures the model's overall vision-language dialogue ability and alignment with human preferences. TouchStone is a benchmark test set we built on a GPT-4 scoring mechanism to evaluate LVLMs. In TouchStone-v0.1:

  • The evaluation covers 300+ images, 800+ questions, and 27 categories, including basic attribute QA, celebrity and landmark QA, film and TV QA, visual reasoning, counterfactual reasoning, poetry composition, story writing, product comparison, solving problems from pictures, and other categories, as broad as possible.
  • To compensate for GPT-4's current inability to read images directly, we provide a manually annotated, fully detailed description for every evaluated image, and feed the detailed image description, the question, and the model's output to GPT-4 for scoring (sketched after this list).
  • The evaluation is available in both English and Chinese.
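
The official TouchStone implementation is linked in the touchstone directory of the repo above; the sketch below only illustrates the general idea of GPT-4-as-judge scoring, and the prompt wording, example data, and score scale are assumptions rather than the official template.

def build_judge_prompt(image_description, question, model_answer):
    """Assemble a judging prompt from the human image description, the question,
    and the candidate model's answer (illustrative wording, not the official template)."""
    return (
        "You are grading a vision-language assistant. The image is described below "
        "by a human annotator; the assistant could see the image, but you cannot.\n\n"
        f"[Human image description]\n{image_description}\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant answer]\n{model_answer}\n\n"
        "Rate the answer from 0 to 10 for correctness and helpfulness, then justify the score."
    )

prompt = build_judge_prompt(
    image_description="A golden retriever catching a red frisbee on a lawn.",  # hypothetical example
    question="What is the dog doing?",
    model_answer="The dog is jumping to catch a red frisbee.",
)
# `prompt` would then be sent to GPT-4 and the returned scores aggregated over all questions.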

The results of the evaluation are as follows:

Qwen-VL achieves the best results among open-source LVLMs of comparable size. Compared with SOTA generalist models, it shows significant advantages on multiple vision-language tasks and covers a more comprehensive range of capabilities.

5 Quick start

To run inference with Qwen-VL-Chat, you only need a few lines of code like the following. Make sure you are using the latest code.

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)
# Note: the tokenizer's default behavior has changed; special-token attack protection is now disabled by default.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# Enable bf16 precision (recommended on A100, H100, RTX3060, RTX3070, etc. to save GPU memory)
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# Enable fp16 precision (recommended on V100, P100, T4, etc. to save GPU memory)
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# Run inference on CPU (requires about 32GB of RAM)
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cpu", trust_remote_code=True).eval()
# Default: run inference on GPU (requires about 24GB of GPU memory)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()


# You can specify different generation lengths, top_p, and other related hyperparameters
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
# 1st dialogue turn
query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'}, # Either a local path or an url
    {'text': '这是什么?'},  # "What is this?"
])


response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # e.g. "The picture shows a woman playing with her dog, a Labrador, on the beach."
# 2nd dialogue turn
response, history = model.chat(tokenizer, '框出图中击掌的位置', history=history)  # "Box the high-five in the image"
print(response)  # <ref>击掌</ref><box>(536,509),(588,602)</box>
image = tokenizer.draw_bbox_on_latest_picture(response, history)
if image:
  image.save('1.jpg')
else:
  print("no box")           

