
AIGC in the computer industry: how GPT-4V achieves powerful multimodality, from image understanding to image generation

Author: Reporter

Report producer: Shenwan Hongyuan

The following is an excerpt from the original report

------

1. Overseas AI application updates, with a focus on multimodal capabilities

Recently, overseas AI applications have seen new catalysts: 1) OpenAI has upgraded image and voice multimodal capabilities, which will soon be available in the latest ChatGPT; 2) Microsoft announced an update to its AI Copilot system at the end of this month, fully integrating OpenAI model capabilities.

1.1 OpenAI upgrades image and voice multimodal capabilities in ChatGPT

On September 25, OpenAI announced the upcoming release of new multimodal capabilities, including image understanding, voice conversation, and speech generation. ChatGPT will open these new features to Plus and enterprise users within two weeks; the image capability will be available on all platforms, while voice conversation with ChatGPT will be available only in the iOS and Android clients.


Voice conversation capability: users can chat with ChatGPT directly by voice, and GPT can reply by voice, with five custom voices to choose from; this is supported in the iOS and Android mobile apps.

Image understanding capability: in addition to text, ChatGPT can understand images uploaded by users, including photos, screenshots, and documents containing images. Users can upload one or more pictures, and even mark the key content with a brush for the system to read and understand; this can be used for scenarios such as tutoring students on homework or searching for everyday recipes.

Voice and images offer more ways to use ChatGPT in daily life. For example: photographing a landmark while traveling and having a real-time conversational Q&A about it; photographing the fridge and pantry to decide what to eat for dinner (with follow-up questions for step-by-step recipes); or getting answers by photographing homework directly or having complex work-related charts analyzed.


Previously, OpenAI also upgraded the DALL·E 3 model. The new DALL·E model is combined with ChatGPT, making generated images more refined; it can accurately render details without elaborate prompts and can add text within images. Plus and Enterprise users can generate various types of images directly in ChatGPT from text, which improves the prompt-based image generation experience, strengthens the model's understanding of user instructions, and raises image quality.


The model better grasps every description the user provides. For example, in one demonstration image, details that are hard to render, such as "pedestrians enjoying the nightlife", "the glow of the full moon", "a steampunk telephone", and "bargaining with an angry old merchant", are all reflected in the picture.

At the same time, multiple rounds of natural-language dialogue can be used to edit the generated content. For example, the DALL·E model can be asked to generate several hedgehog pictures; the user picks one and names it Larry, asks the model to generate more pictures of Larry, and can even ask "Why is Larry so cute?", to which the model replies in text. In the demonstration, five rounds of dialogue and modification are completed.
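For readers who want to experiment programmatically, the sketch below shows roughly how an image could be generated with the DALL·E 3 model through the OpenAI Images API. This is an illustration under assumptions (the model identifier, parameters, and availability may differ), not the ChatGPT integration described above.

```python
# Minimal sketch (assumption): generating an image with the DALL·E 3 model via the
# OpenAI Images API, as a stand-in for the ChatGPT integration described in the report.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",  # assumed model identifier
    prompt="A hedgehog named Larry wearing a tiny detective hat, watercolor style",
    size="1024x1024",
    n=1,
)

print(response.data[0].url)  # URL of the generated image
```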

1.2 GPT-4V: usage methods, working modes, and task capabilities

Following OpenAI's release, Microsoft published a detailed study of GPT-4V, "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)".

Five ways to use it: images, sub-images, texts, scene texts, and visual pointers. That is, it supports pure image input and interleaved image-text input, and can also accept directional prompts drawn on pictures (such as arrows and circles), basically covering every image-text multimodal scenario.


Three supported abilities: instruction following, chain-of-thought, and in-context few-shot learning.


In addition, Microsoft demonstrated a number of basic capabilities of GPT-4V: 1) vision-language capabilities; 2) interaction with humans: visual referring prompting; 3) temporal and video understanding; 4) others, including IQ tests, emotional-intelligence tests, and novel application scenarios.

1) Vision-language ability: in addition to common recognition of people, landmarks, and so on, GPT-4V can understand relationships between people and objects, count objects, generate captions and descriptions, explain jokes, answer scientific questions, generate LaTeX code from handwritten mathematical equations, and more.
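As a rough illustration of the handwritten-equation-to-LaTeX use case, the sketch below sends an image to a vision-capable GPT model through the OpenAI chat completions API. The model name ("gpt-4-vision-preview") and its availability are assumptions; the request shape follows OpenAI's publicly documented image-input format.

```python
# Minimal sketch (assumption): asking a vision-capable GPT model to transcribe a
# handwritten equation into LaTeX via the OpenAI chat completions API.
import base64
from openai import OpenAI

client = OpenAI()

with open("handwritten_equation.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this handwritten equation into LaTeX."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=300,
)

print(response.choices[0].message.content)
```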


2) Interaction with humans: visual referring prompting. In human-computer interaction with multimodal systems, pointing to a specific spatial location is an essential capability, for example in vision-grounded conversation.


3) Temporal and video understanding: multi-image sequences, video understanding, and visual referring prompts grounded in temporal understanding. Given a few keyframes of a video, the model can understand the event.
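To make the keyframe idea concrete, here is a hedged sketch that samples a few frames from a video with OpenCV and packages them as a multi-image prompt. The file name, frame count, and model identifier are illustrative assumptions.

```python
# Minimal sketch (assumption): sampling keyframes from a video with OpenCV and
# sending them as a multi-image prompt to a vision-capable chat model.
import base64
import cv2  # pip install opencv-python
from openai import OpenAI

def sample_keyframes(path: str, num_frames: int = 4) -> list[str]:
    """Return num_frames evenly spaced frames as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf).decode("utf-8"))
    cap.release()
    return frames

content = [{"type": "text",
            "text": "These are keyframes from a short video. Describe the event."}]
for b64 in sample_keyframes("clip.mp4"):  # illustrative file name
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name
    messages=[{"role": "user", "content": content}],
    max_tokens=300,
)
print(reply.choices[0].message.content)
```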


4) Visual reasoning, IQ and emotional-intelligence tests, and so on. In addition, GPT-4V can be applied in industry, medicine, auto insurance, embodied intelligence, GUI interaction, and more.


Overall, GPT-4V: 1) shows strong mixed-input capability and supports the test-time techniques observed in LLMs, including instruction following, chain-of-thought, and in-context few-shot learning;

2) demonstrates strong capability and generality across tasks in different domains, including open-world visual understanding, visual description, multimodal knowledge, common sense, scene-text understanding, document reasoning, coding, temporal reasoning, abstract reasoning, emotion understanding, and more;

3) pixel-level editing of input images (visual referring prompting) extends the usage boundaries of GPT-4V;

4) with the emergence of GPT-4V, the application space for artificial intelligence has opened further, with potential applications visible in industrial, medical, financial, embodied-intelligence, and other products.

1.3 Microsoft AI Copilot system update; Office Copilot capabilities will be released soon

AI Copilot will be released on September 26, and Office Copilot will be generally available from November 1. 1) On September 21, Microsoft announced its updated AI Copilot, saying that from September 26 Copilot will run free of charge across multiple apps and devices, arriving in an early form with the Windows 11 update; 2) Office Copilot will be generally available on November 1; Microsoft had said in July that Copilot would be priced at $30 per user per month, an additional fee on top of a traditional Office 365 subscription.

This Windows 11 release adds more than 150 new features. AI Copilot can stay pinned on the taskbar at all times and can be launched via the Win+C shortcut. New features include Copilot for Windows PCs and AI capabilities in apps such as Paint, Photos, and Clipchamp, and Bing will add support for OpenAI's latest DALL·E 3 model.

We believe the highlights of this AI Copilot / Office Copilot announcement include:

1. Significantly improved image capabilities: the DALL·E 3 model was officially added, bringing features such as text-to-image generation, image understanding, and AI photo editing.

Previously, OpenAI released the third-generation AI drawing tool DALL·E 3, which integrates with ChatGPT so that users no longer need to spend extra time crafting prompts to generate images. Compared with the previous generation, DALL·E 3 renders more detail, better understands requirements, and produces more accurate images.


Microsoft has also integrated the model into its AI design tool, Microsoft Designer. With Designer, users can add original, high-quality images to their designs with simple operations such as dragging and prompting.

For example, users can design a cover from local images, directly remove the background, or use AI-generated content to extend an image.


In addition, based on DALL·E 3, Microsoft has updated the AI capabilities of the Bing search engine and the Edge browser. In shopping, for example, users can search for product details with combined image and text queries, and Bing can help find suitable products and the best prices based on customer reviews across the web, combined with coupons and promotional discount codes.

At the same time, Microsoft adds cryptographic "Content Credentials" to all AI-generated images in Bing, i.e., an invisible digital watermark that records the original creation time and date.

2. AI Copilot upgrades multi-device and team collaboration capabilities.

With AI Copilot support, Outlook for Windows can connect to multiple (cloud) accounts from providers such as Google and Apple. File Explorer surfaces important and relevant content directly, allowing collaboration without opening a file. Windows Backup seamlessly transfers most files, applications, personalization settings, and more from one Windows computer to another.

Copilot can also pull content from the user's phone (such as SMS messages) and import it into Windows 11. For example, if a user wants to send a flight itinerary to a family member, Copilot will import the data to the desktop on request, and the message can be sent without picking up the phone.

3. The event showcased Copilot capabilities in Word, Excel, PowerPoint, and OneNote.

The Office plug-in capabilities shown at this event do not differ much from previous releases. They still include: Word: document summarization, content rewriting, tone adjustment, generating tables from copy, etc.

Excel: Visualize data, add calculation formulas, and more with natural language prompts.

OneNote: Ask more comprehensive questions about notes, generate summaries, quickly edit articles, etc.


On top of the above, a new AI-assistant feature has been added to the Office suite: Microsoft 365 Chat. It organizes information across data domains at work, including email, meetings, chat logs, documents, and web information. Microsoft 365 Copilot for enterprises will draw on users' enterprise data to help compose emails, plan campaigns, and more.

We believe the relatively unexpected points of this event include: 1) demonstrating system-wide management of AI capabilities in the Windows operating system; 2) integrating the DALL·E 3 image model, upgrading from text-only capability to text-image multimodality, with image AIGC quality far exceeding the previous generation; 3) making clear that the Windows 11 update is free, enabling more people to experience AI Copilot; 4) giving a definite release date for Office Copilot.

At the same time, however, we believe there are debatable points in this release, including: 1) the capabilities demonstrated for Office Copilot, especially language and text comprehension, show no significant advantage over the March release; 2) whether the $30/month Office Copilot price reflects incremental value is debatable; 3) in some Windows scenarios, invoking AI operations requires a large number of prompts, and the convenience remains to be verified.

2. Multimodal principle analysis: from text-to-image to image-to-text

After 2022, with the development of Transformer technology, Transformers were also applied to the computer-vision (CV) field, forming the Vision Transformer. After 2023, multimodal large models based on Transformers appeared, opening new space for AI large-model applications.
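For intuition about what a Vision Transformer does, here is a minimal, self-contained sketch of the patch-embedding plus Transformer-encoder pipeline in PyTorch; all dimensions and the classification head are illustrative and not tied to any particular production model.

```python
# Minimal sketch of the Vision Transformer (ViT) idea: split an image into fixed-size
# patches, linearly embed each patch, prepend a [CLS] token, add position embeddings,
# and run a standard Transformer encoder. Dimensions are illustrative only.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # A conv with stride=patch is equivalent to a per-patch linear projection
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])              # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
```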


2.1 Text-to-image: the first mature AIGC application, with CLIP at its core

DALL·E: based on CLIP, it can generate corresponding images from text descriptions. DALL·E is a multimodal text-to-image model released by OpenAI in 2021. It is based on GPT-3 and trained on a text-image dataset, with 12 billion parameters.


The innovation of the first-generation DALL·E: CLIP, which contrastively aligns text and images.

1) For the text input, a Transformer language model similar to GPT-3 is still used, but with a greatly reduced number of parameters.

DALL·E has 12B parameters, significantly fewer than GPT-3's 175B, and the model was trained on a dataset of 250M image-text pairs. The trained model generates several candidate samples (up to 512) for a given text prompt, which are then ranked by CLIP.
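The ranking step can be illustrated with the public CLIP checkpoint released by OpenAI: the sketch below scores candidate images against a prompt using the Hugging Face transformers implementation. This shows the reranking idea only, not DALL·E's internal pipeline, and the file paths are placeholders.

```python
# Rough sketch: rank candidate images by CLIP similarity to a text prompt, using the
# public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "an armchair in the shape of an avocado"
candidates = [Image.open(p) for p in ["cand_0.png", "cand_1.png", "cand_2.png"]]  # placeholder paths

inputs = processor(text=[prompt], images=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

scores = out.logits_per_text[0]            # similarity of the prompt to each candidate image
ranking = scores.argsort(descending=True)  # best candidates first
print(ranking.tolist(), scores.tolist())
```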

2) CLIP, a brute-force text-image matching tool, is DALL·E's biggest innovation.

CLIP (Contrastive Language-Image Pre-training) is used to map related text and images to each other. The idea behind it is simple: OpenAI crawled already-captioned text-image pairs from the web, but at enormous scale, so that the dataset reached 400 million pairs.


A contrastive model is then trained on this dataset: it produces high similarity scores for images and text that come from the same pair, and low scores for mismatched text and images (in the original figure, the left side shows the contrastive unsupervised pre-training).
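In code form, this contrastive objective can be written as a symmetric cross-entropy over a batch's image-text similarity matrix, in the spirit of the pseudocode in the CLIP paper; the sketch below uses random tensors in place of real encoder outputs.

```python
# Minimal sketch of CLIP's symmetric contrastive (InfoNCE) objective: in a batch of N
# matched image-text pairs, the N correct pairings on the diagonal should score high
# and the mismatched pairings should score low. Encoders here are placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize embeddings so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))            # matched pair i <-> i

    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random "embeddings" standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```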


--- End of excerpt from the report ---


(Special note: this article is derived from public information; the excerpt is for reference only and does not constitute any investment advice. Please refer to the original report if you need to use it.)
