Demystifying OpenAI's multimodal models.
OpenAI's multimodal models unify different sources and forms of information, converting between modalities such as vision, hearing, touch, and smell; the information itself may arrive as text, voice, or video. These models support capabilities such as image-to-text generation (captioning), image-based reasoning, mathematical reasoning, and video reasoning, and they suit scenarios such as story generation, web page development, image review, video recognition, and homework answering.
OpenAI's multimodal models use the CLIP approach for text encoding and image encoding, unifying the two modalities by aligning their encoding vectors in a shared embedding space. Compared with open-source models, OpenAI's models have advantages in training data, compute, and model scale, and their overall results are better.
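As a concrete illustration of this alignment, the sketch below scores an image against candidate captions using the openly released CLIP checkpoint on Hugging Face. The checkpoint name and file path are assumptions for the example; OpenAI's production encoders are not public.

```python
# Minimal sketch of CLIP-style text/image alignment using the open
# "openai/clip-vit-base-patch32" checkpoint (an assumption -- OpenAI's
# internal encoder is not public). "photo.jpg" is a placeholder path.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
texts = ["a photo of a cat", "a photo of a dog"]

# Encode both modalities into the same embedding space.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each text embedding; higher means better alignment.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```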
OpenAI's in-house visual extraction model is trained on a larger volume of high-quality data. Aligning vision with language takes several training stages, chiefly pre-training and instruction fine-tuning: the pre-training phase aligns the visual and linguistic modalities, while the instruction fine-tuning phase teaches the model to answer questions that users pose in natural language.
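The sketch below illustrates this two-stage recipe in the style common to open-source multimodal models: a projector feeds image features into a language model, trained alone first and then jointly. Every component here is a stand-in assumption (CLIP's vision tower, GPT-2, a single linear projector); this is not OpenAI's actual architecture or code.

```python
# Hedged sketch of the two training stages: (1) alignment pre-training,
# (2) instruction fine-tuning. Small open models are used as stand-ins.
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2LMHeadModel, GPT2TokenizerFast

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
llm = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
# The projector maps vision features into the LLM's embedding space.
projector = nn.Linear(vision.config.hidden_size, llm.config.n_embd)

def step(pixel_values, text, optimizer):
    """One training step: prepend projected image tokens to the text."""
    img_tokens = projector(vision(pixel_values=pixel_values).last_hidden_state)
    ids = tok(text, return_tensors="pt").input_ids
    embeds = torch.cat([img_tokens, llm.transformer.wte(ids)], dim=1)
    # Next-token loss on the text positions only (-100 masks image tokens).
    labels = torch.cat(
        [torch.full(img_tokens.shape[:2], -100, dtype=torch.long), ids], dim=1
    )
    loss = llm(inputs_embeds=embeds, labels=labels).loss
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss

# Stage 1 (pre-training): freeze both towers and train only the projector
# on image-caption pairs, so the two modalities become aligned.
for p in list(vision.parameters()) + list(llm.parameters()):
    p.requires_grad = False
stage1_opt = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# Stage 2 (instruction fine-tuning): unfreeze the LLM and train on
# (image, question, answer) examples phrased as natural-language questions.
for p in llm.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.AdamW(
    list(projector.parameters()) + list(llm.parameters()), lr=2e-5
)
```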
Training a multimodal model requires NVIDIA GPUs such as the A100 or H100; training a model of roughly 7 billion parameters usually takes about three days on multiple GPUs. The pre-training phase demands substantial compute, while the fine-tuning phase needs less because many open-source base models are available in the community. For the inference phase, lower-cost inference cards can be used, such as NVIDIA's T4, A2, and A40; inference cards from Unigroup and Cambricon are also suitable for deploying language or multimodal models.
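These hardware choices follow from simple memory arithmetic. Below is a rough sketch for a 7-billion-parameter model using the standard rule of thumb of about 16 bytes per parameter for mixed-precision Adam training; the figures are rules of thumb, not OpenAI-published numbers.

```python
# Back-of-the-envelope memory arithmetic for a 7B-parameter model.
params = 7e9

# Mixed-precision Adam training state: fp16 weights (2 B) + fp16 grads (2 B)
# + fp32 master weights, momentum, and variance (4 B * 3) = ~16 bytes/param,
# before activations -- hence multi-GPU A100/H100 (80 GB) setups.
print(f"training state: ~{params * 16 / 1e9:.0f} GB")  # ~112 GB

# Inference only needs the weights: ~14 GB in fp16, ~7 GB in int8 --
# small enough for cheaper cards such as a 16 GB T4 or a 48 GB A40.
print(f"fp16 weights: ~{params * 2 / 1e9:.0f} GB")  # ~14 GB
print(f"int8 weights: ~{params * 1 / 1e9:.0f} GB")  # ~7 GB
```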
OpenAI's multimodal models have advantages in scenarios such as image understanding and recognition, image and mathematical reasoning, and video recognition. For image-related reasoning, the model can identify people, comment on pictures, recognize movies, and so on. For mathematical reasoning, it can solve complex problems such as math exercises and homework questions.
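For example, a homework-style image question can be sent to a vision-capable model through the OpenAI Python SDK, as sketched below; the model name "gpt-4o" and the local file name are placeholders for this example.

```python
# Hedged sketch: asking an image-reasoning question via the OpenAI SDK.
# "gpt-4o" and "homework.jpg" are example placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("homework.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Solve the math problem in this image step by step."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)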
For video reasoning, the model can understand video content and answer questions about it by extracting individual frames from the video and reasoning over them as a whole.
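A minimal sketch of the frame-extraction step is shown below, sampling roughly one frame per second with OpenCV; the sampling rate and file name are assumptions, and the downstream model call would mirror the image example above.

```python
# Sample roughly fps_out frames per second from a video with OpenCV,
# producing a list of images to pass to a multimodal model as a batch.
import cv2

def sample_frames(path, fps_out=1.0):
    """Return about fps_out frames per second of video as BGR numpy arrays."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if unknown
    step = max(1, round(native_fps / fps_out))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)  # encode to JPEG/base64 before sending
        i += 1
    cap.release()
    return frames

frames = sample_frames("clip.mp4")  # "clip.mp4" is a placeholder path
print(f"sampled {len(frames)} frames")
```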
The advantages of OpenAI's multimodal models include larger model scale, stronger reasoning capabilities, and stronger visual feature extraction. However, owing to limitations of the visual extraction module, the models still have difficulty recognizing some fine details, text, and markings in images.