
Alibaba Cloud Tongyi Qianwen goes open source again, continuing to build China's large-model ecosystem

Author: Intelligent Relativity

The second wave of Tongyi Qianwen open sourcing is here. On August 25, Alibaba Cloud released Qwen-VL, a large vision-language model, and open-sourced it from day one. Qwen-VL is built on Qwen-7B, Tongyi Qianwen's 7-billion-parameter base language model; it accepts combined image and text input and can understand multimodal information. In mainstream multimodal task benchmarks and multimodal chat evaluations, Qwen-VL performs far better than general-purpose models of the same scale.

Compared with earlier VL models, Qwen-VL offers not only basic image recognition, captioning, question answering, and dialogue, but also new capabilities such as visual grounding and understanding of Chinese text within images.


Multimodality is one of the key technical directions in the evolution toward general artificial intelligence. The industry broadly believes that moving from single-modality language models that accept only text to multimodal models that accept text, images, audio, and other inputs opens up the possibility of a major leap in model intelligence. Multimodality improves a large model's understanding of the world and greatly expands its range of use cases.

Vision is humans' primary sense, and it is also the first modality researchers want to give to large models. Following its earlier M6 and OFA series of multimodal models, the Alibaba Cloud Tongyi Qianwen team has now open-sourced Qwen-VL, a large vision-language model (LVLM) based on Qwen-7B. Qwen-VL and its visual AI assistant Qwen-VL-Chat are available in the ModelScope community and are open source, free, and licensed for commercial use.

Users can download the models directly from the ModelScope community, or access and call Qwen-VL and Qwen-VL-Chat through Alibaba Cloud's Lingji (DashScope) platform, which provides a full range of services covering model training, inference, deployment, and fine-tuning.
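For local experimentation, the checkpoints can be pulled from ModelScope or Hugging Face and driven through the chat interface shipped with the model's remote code. The sketch below is illustrative only: it assumes the Qwen/Qwen-VL-Chat checkpoint and the custom helpers (tokenizer.from_list_format, model.chat) bundled with the repository; the image path and question are placeholders.

```python
# Minimal sketch: running Qwen-VL-Chat locally via Hugging Face transformers.
# The chat-style helpers used here come from the model repo's custom code
# (loaded via trust_remote_code=True); exact signatures may differ between
# releases, so treat this as illustrative rather than canonical.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Mixed image + text query: the image can be a local path or a URL.
query = tokenizer.from_list_format([
    {"image": "floor_directory.jpg"},  # hypothetical local image path
    {"text": "Which floor is the orthopedics department on?"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```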


Qwen-VL can be used in scenarios such as knowledge Q&A, image captioning, image Q&A, document Q&A, and fine-grained visual grounding.

For example, a foreign tourist who does not read Chinese visits a hospital and does not know how to find the right department. He photographs the floor directory and asks Qwen-VL "Which floor is the orthopedics department on?" or "Which floor should I go to for otolaryngology?", and Qwen-VL answers in text based on the image: that is image question answering. Or, given a photo of the Bund in Shanghai and asked to find the Oriental Pearl Tower, Qwen-VL can accurately mark the corresponding building with a bounding box: that is visual grounding.
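Continuing the loading sketch above, a grounding request can be issued the same way as a normal question; the helper for rendering the returned box (tokenizer.draw_bbox_on_latest_picture) is part of the model's bundled tokenizer code, and its availability and exact name should be treated as assumptions here.

```python
# Visual-grounding sketch against an already loaded Qwen-VL-Chat model.
query = tokenizer.from_list_format([
    {"image": "the_bund.jpg"},  # hypothetical photo of the Bund
    {"text": "Find the Oriental Pearl Tower and box it."},
])
response, history = model.chat(tokenizer, query=query, history=None)

# The grounded answer embeds box coordinates; this helper draws them on the image.
image_with_box = tokenizer.draw_bbox_on_latest_picture(response, history)
if image_with_box is not None:
    image_with_box.save("oriental_pearl_boxed.jpg")
```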

Qwen-VL is the industry's first general-purpose model to support open-domain grounding in Chinese. Open-domain visual grounding determines how accurately a large model can "see", that is, whether it can precisely locate the thing you are looking for in an image. This is crucial for deploying VL models in real-world applications such as robot control.


Qwen-VL uses Qwen-7B as its base language model and adds a visual encoder to the architecture so that the model can accept visual input; a carefully designed training pipeline gives the model fine-grained perception and understanding of visual signals. Qwen-VL supports image input at 448x448 resolution, whereas previously open-sourced LVLMs typically supported only 224x224. On top of Qwen-VL, the Tongyi Qianwen team used an alignment mechanism to build Qwen-VL-Chat, an LLM-based visual AI assistant that lets developers quickly build conversational applications with multimodal capabilities.
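As a rough mental model of the design described above (a visual encoder feeding visual tokens into the base LLM), the toy PyTorch module below shows how a 448x448 image could be encoded into a fixed number of visual tokens and concatenated with text embeddings before the language model. The layer choices, dimensions, and token count are illustrative assumptions, not the actual Qwen-VL implementation.

```python
# Conceptual sketch only: image -> visual encoder -> visual tokens -> LLM input.
import torch
import torch.nn as nn

class ToyVisionLanguageModel(nn.Module):
    def __init__(self, llm_dim: int = 4096):
        super().__init__()
        # Stand-in for a ViT-style encoder: 28x28 patches over a 448x448 image
        # yield a 16x16 grid, i.e. 256 visual tokens (hypothetical sizes).
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, llm_dim, kernel_size=28, stride=28),
            nn.Flatten(2),
        )
        # Adapter projecting visual features into the LLM embedding space.
        self.adapter = nn.Linear(llm_dim, llm_dim)

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, 448, 448); text_embeds: (B, T, llm_dim)
        feats = self.vision_encoder(image).transpose(1, 2)  # (B, 256, llm_dim)
        visual_tokens = self.adapter(feats)
        # Prepend visual tokens to the text tokens before the language model.
        return torch.cat([visual_tokens, text_embeds], dim=1)

if __name__ == "__main__":
    toy = ToyVisionLanguageModel()
    img = torch.randn(1, 3, 448, 448)
    txt = torch.randn(1, 10, 4096)
    print(toy(img, txt).shape)  # torch.Size([1, 266, 4096])
```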

In standard English evaluations across four categories of multimodal tasks (zero-shot captioning, VQA, DocVQA, and grounding), Qwen-VL achieved the best results among open-source LVLMs of comparable size. To test multimodal dialogue ability, the Tongyi Qianwen team built a test set, "TouchStone", based on a GPT-4 scoring mechanism to compare Qwen-VL-Chat with other models; Qwen-VL-Chat achieved the best open-source LVLM results in both the Chinese and English alignment evaluations.

In early August, Alibaba Cloud open-sourced Qwen-7B, a 7-billion-parameter general-purpose model, together with Qwen-7B-Chat, becoming the first major Chinese technology company to join the ranks of large-model open source. The Tongyi Qianwen open-source models drew wide attention as soon as they launched, topping the HuggingFace trending list that week; in less than a month they have received more than 3,400 stars on GitHub, and cumulative model downloads have exceeded 400,000.

Open Source Address:

ModelScope community:

Qwen-VL https://modelscope.cn/models/qwen/Qwen-VL/summary

Qwen-VL-Chat https://modelscope.cn/models/qwen/Qwen-VL-Chat/summary

Online demo: https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo/summary

HuggingFace:

Qwen-VL https://huggingface.co/Qwen/Qwen-VL

Qwen-VL-Chat https://huggingface.co/Qwen/Qwen-VL-Chat

GitHub:

https://github.com/QwenLM/Qwen-VL

Technical Paper Address:

https://arxiv.org/abs/2308.12966
