
AIGC in the computer industry: how GPT-4V achieves powerful multimodality, from image understanding to image generation

Author: Reporter

Report producer: Shenwan Hongyuan

The following is an excerpt from the original report

------

1. Overseas AI application updates, with a focus on multimodal capabilities

Recently, overseas AI applications have seen new catalysts: 1) OpenAI has upgraded image and voice multimodal capabilities, which will soon be available in the latest ChatGPT; 2) Microsoft announced an update to its AI Copilot system at the end of this month, fully integrating OpenAI model capabilities.

1.1 OpenAI upgrades image and voice multimodal capabilities in ChatGPT

On September 25, OpenAI announced the upcoming release of new multimodal capabilities, including image understanding, voice conversation, and speech generation. ChatGPT will open these new features to Plus and enterprise users within two weeks; the image capability will be available on all platforms, while voice conversation with ChatGPT will be available only in the iOS and Android clients.


Voice conversation capability: users can chat with ChatGPT directly by voice, and GPT can reply by voice, with five custom voices to choose from; this is supported in the iOS and Android mobile apps.

Image understanding capability: in addition to text, ChatGPT can understand images uploaded by users, including photos, screenshots, and documents containing images. Users can upload one or more pictures, and even mark the key content with a brush for the system to read and understand; this can be used for scenarios such as tutoring students on homework or searching for everyday recipes.

Voice and images offer more ways to use ChatGPT in daily life. For example: photographing a landmark while traveling and having a real-time conversational Q&A about it; photographing the fridge and pantry to decide what to eat for dinner (with follow-up questions for step-by-step recipes); or getting answers by photographing homework directly or having complex work-related charts analyzed.


Previously, OpenAI also upgraded the DALL·E 3 model. The new DALL·E model is combined with ChatGPT, making generated images more refined; it can accurately render details without elaborate prompts and can add text within images. Plus and Enterprise users can generate various types of images directly in ChatGPT from text, which improves the prompt-based image generation experience, strengthens the model's understanding of user instructions, and raises image quality.


The model better grasps every description the user provides. For example, in one demonstration image, details that are hard to render, such as "pedestrians enjoying the nightlife", "the glow of the full moon", "a steampunk telephone", and "bargaining with an angry old merchant", are all reflected in the picture.

At the same time, multiple rounds of natural-language dialogue can be used to edit the generated content. For example, the DALL·E model can be asked to generate several hedgehog pictures; the user picks one and names it Larry, asks the model to generate more pictures of Larry, and can even ask "Why is Larry so cute?", to which the model replies in text. In the demonstration, five rounds of dialogue and modification are completed.
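For readers who want to experiment programmatically, the sketch below shows roughly how an image could be generated with the DALL·E 3 model through the OpenAI Images API. This is an illustration under assumptions (the model identifier, parameters, and availability may differ), not the ChatGPT integration described above.

```python
# Minimal sketch (assumption): generating an image with the DALL·E 3 model via the
# OpenAI Images API, as a stand-in for the ChatGPT integration described in the report.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="dall-e-3",  # assumed model identifier
    prompt="A hedgehog named Larry wearing a tiny detective hat, watercolor style",
    size="1024x1024",
    n=1,
)

print(response.data[0].url)  # URL of the generated image
```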

1.2 GPT-4V: usage methods, working modes, and task capabilities

Following OpenAI's release, Microsoft published a detailed study of GPT-4V, "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)".

Five ways to use it: images, sub-images, texts, scene texts, and visual pointers. That is, it supports pure image input and interleaved image-text input, and can also accept directional prompts drawn on pictures (such as arrows and circles), basically covering every image-text multimodal scenario.


Three supported abilities: instruction following, chain-of-thought, and in-context few-shot learning.


In addition, Microsoft demonstrated a number of basic capabilities of GPT-4V: 1) vision-language capabilities; 2) interaction with humans: visual referring prompting; 3) temporal and video understanding; 4) others, including IQ tests, emotional-intelligence tests, and novel application scenarios.

1) Vision-language ability: in addition to common recognition of people, landmarks, and so on, GPT-4V can understand relationships between people and objects, count objects, generate captions and descriptions, explain jokes, answer scientific questions, generate LaTeX code from handwritten mathematical equations, and more.
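As a rough illustration of the handwritten-equation-to-LaTeX use case, the sketch below sends an image to a vision-capable GPT model through the OpenAI chat completions API. The model name ("gpt-4-vision-preview") and its availability are assumptions; the request shape follows OpenAI's publicly documented image-input format.

```python
# Minimal sketch (assumption): asking a vision-capable GPT model to transcribe a
# handwritten equation into LaTeX via the OpenAI chat completions API.
import base64
from openai import OpenAI

client = OpenAI()

with open("handwritten_equation.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this handwritten equation into LaTeX."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    max_tokens=300,
)

print(response.choices[0].message.content)
```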


2) Interaction with humans: visual referring prompting. In human-computer interaction with multimodal systems, pointing to a specific spatial location is an essential capability, for example in vision-grounded conversation.


3) Temporal and video understanding: multi-image sequences, video understanding, and visual referring prompts grounded in temporal understanding. Given a few keyframes of a video, the model can understand the event.
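To make the keyframe idea concrete, here is a hedged sketch that samples a few frames from a video with OpenCV and packages them as a multi-image prompt. The file name, frame count, and model identifier are illustrative assumptions.

```python
# Minimal sketch (assumption): sampling keyframes from a video with OpenCV and
# sending them as a multi-image prompt to a vision-capable chat model.
import base64
import cv2  # pip install opencv-python
from openai import OpenAI

def sample_keyframes(path: str, num_frames: int = 4) -> list[str]:
    """Return num_frames evenly spaced frames as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf).decode("utf-8"))
    cap.release()
    return frames

content = [{"type": "text",
            "text": "These are keyframes from a short video. Describe the event."}]
for b64 in sample_keyframes("clip.mp4"):  # illustrative file name
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name
    messages=[{"role": "user", "content": content}],
    max_tokens=300,
)
print(reply.choices[0].message.content)
```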


4) Visual reasoning, IQ and emotional-intelligence tests, and so on. In addition, GPT-4V can be applied in industry, medicine, auto insurance, embodied intelligence, GUI interaction, and more.


Overall, GPT-4V: 1) shows strong mixed-input capability and supports the test-time techniques observed in LLMs, including instruction following, chain-of-thought, and in-context few-shot learning;

2) demonstrates strong capability and generality across tasks in different domains, including open-world visual understanding, visual description, multimodal knowledge, common sense, scene-text understanding, document reasoning, coding, temporal reasoning, abstract reasoning, emotion understanding, and more;

3) pixel-level editing of input images (visual referring prompting) extends the usage boundaries of GPT-4V;

4) with the emergence of GPT-4V, the application space for artificial intelligence has opened further, with potential applications visible in industrial, medical, financial, embodied-intelligence, and other products.

1.3 Microsoft AI Copilot system update; Office Copilot capabilities will be released soon

AI Copilot will be released on September 26, and Office Copilot will be generally available from November 1. 1) On September 21, Microsoft announced its updated AI Copilot, saying that from September 26 Copilot will run free of charge across multiple apps and devices, arriving in an early form with the Windows 11 update; 2) Office Copilot will be generally available on November 1; Microsoft had said in July that Copilot would be priced at $30 per user per month, an additional fee on top of a traditional Office 365 subscription.

This Windows 11 release adds more than 150 new features. AI Copilot can stay pinned on the taskbar at all times and can be launched via the Win+C shortcut. New features include Copilot for Windows PCs and AI capabilities in apps such as Paint, Photos, and Clipchamp, and Bing will add support for OpenAI's latest DALL·E 3 model.

We believe the highlights of this AI Copilot / Office Copilot announcement include:

1. Significantly improved image capabilities: the DALL·E 3 model was officially added, bringing features such as text-to-image generation, image understanding, and AI photo editing.

Previously, OpenAI released the third-generation AI drawing tool DALL·E 3, which integrates with ChatGPT so that users no longer need to spend extra time crafting prompts to generate images. Compared with the previous generation, DALL·E 3 renders more detail, better understands requirements, and produces more accurate images.


Microsoft has also integrated the model into its AI design tool, Microsoft Designer. With Designer, users can add original, high-quality images to their designs with simple operations such as dragging and prompting.

For example, users can design a cover from local images, directly remove the background, or use AI-generated content to extend an image.


In addition, based on DALL·E 3, Microsoft has updated the AI capabilities of the Bing search engine and the Edge browser. In shopping, for example, users can search for product details with combined image and text queries, and Bing can help find suitable products and the best prices based on customer reviews across the web, combined with coupons and promotional discount codes.

At the same time, Microsoft adds cryptographic "Content Credentials" to all AI-generated images in Bing, i.e., an invisible digital watermark that records the original creation time and date.

2. AI Copilot upgrades multi-device and team collaboration capabilities.

With AI Copilot support, Outlook for Windows can connect to multiple (cloud) accounts from providers such as Google and Apple. File Explorer surfaces important and relevant content directly, allowing collaboration without opening a file. Windows Backup seamlessly transfers most files, applications, personalization settings, and more from one Windows computer to another.

Copilot can also pull content from the user's phone (such as SMS messages) and import it into Windows 11. For example, if a user wants to send a flight itinerary to a family member, Copilot will import the data to the desktop on request, and the message can be sent without picking up the phone.

3. The event showcased Copilot capabilities in Word, Excel, PowerPoint, and OneNote.

The Office plug-in capabilities shown at this event do not differ much from previous releases. They still include: Word: document summarization, content rewriting, tone adjustment, generating tables from copy, etc.

Excel: Visualize data, add calculation formulas, and more with natural language prompts.

OneNote: Ask more comprehensive questions about notes, generate summaries, quickly edit articles, etc.


On top of the above, a new AI-assistant feature has been added to the Office suite: Microsoft 365 Chat. It organizes information across data domains at work, including email, meetings, chat logs, documents, and web information. Microsoft 365 Copilot for enterprises will draw on users' enterprise data to help compose emails, plan campaigns, and more.

We believe the relatively unexpected points of this event include: 1) demonstrating system-wide management of AI capabilities in the Windows operating system; 2) integrating the DALL·E 3 image model, upgrading from text-only capability to text-image multimodality, with image AIGC quality far exceeding the previous generation; 3) making clear that the Windows 11 update is free, enabling more people to experience AI Copilot; 4) giving a definite release date for Office Copilot.

At the same time, however, we believe there are debatable points in this release, including: 1) the capabilities demonstrated for Office Copilot, especially language and text comprehension, show no significant advantage over the March release; 2) whether the $30/month Office Copilot price reflects incremental value is debatable; 3) in some Windows scenarios, invoking AI operations requires a large number of prompts, and the convenience remains to be verified.

2. Multimodal principle analysis: from text-to-image to image-to-text

After 2022, with the development of Transformer technology, Transformers were also applied to the computer-vision (CV) field, forming the Vision Transformer. After 2023, multimodal large models based on Transformers appeared, opening new space for AI large-model applications.
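For intuition about what a Vision Transformer does, here is a minimal, self-contained sketch of the patch-embedding plus Transformer-encoder pipeline in PyTorch; all dimensions and the classification head are illustrative and not tied to any particular production model.

```python
# Minimal sketch of the Vision Transformer (ViT) idea: split an image into fixed-size
# patches, linearly embed each patch, prepend a [CLS] token, add position embeddings,
# and run a standard Transformer encoder. Dimensions are illustrative only.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # A conv with stride=patch is equivalent to a per-patch linear projection
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])              # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 10)
```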


2.1 Text-to-image: the first mature AIGC application, with CLIP at its core

DALL·E: based on CLIP, it can generate corresponding images from text descriptions. DALL·E is a multimodal text-to-image model released by OpenAI in 2021. It is based on GPT-3 and trained on a text-image dataset, with 12 billion parameters.


The innovation of the first-generation DALL·E: CLIP, which contrastively aligns text and images.

1) For the text input, a Transformer language model similar to GPT-3 is still used, but with a greatly reduced number of parameters.

DALL·E has 12B parameters, significantly fewer than GPT-3's 175B, and the model was trained on a dataset of 250M image-text pairs. The trained model generates several candidate samples (up to 512) for a given text prompt, which are then ranked by CLIP.
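The ranking step can be illustrated with the public CLIP checkpoint released by OpenAI: the sketch below scores candidate images against a prompt using the Hugging Face transformers implementation. This shows the reranking idea only, not DALL·E's internal pipeline, and the file paths are placeholders.

```python
# Rough sketch: rank candidate images by CLIP similarity to a text prompt, using the
# public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "an armchair in the shape of an avocado"
candidates = [Image.open(p) for p in ["cand_0.png", "cand_1.png", "cand_2.png"]]  # placeholder paths

inputs = processor(text=[prompt], images=candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

scores = out.logits_per_text[0]            # similarity of the prompt to each candidate image
ranking = scores.argsort(descending=True)  # best candidates first
print(ranking.tolist(), scores.tolist())
```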

2) CLIP, a brute-force text-image matching tool, is DALL·E's biggest innovation.

CLIP (Contrastive Language-Image Pre-training) is used to map related text and images to each other. The idea behind it is simple: OpenAI crawled already-captioned text-image pairs from the web, but at enormous scale, so that the dataset reached 400 million pairs.


A contrastive model is then trained on this dataset: it produces high similarity scores for images and text that come from the same pair, and low scores for mismatched text and images (in the original figure, the left side shows the contrastive unsupervised pre-training).
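In code form, this contrastive objective can be written as a symmetric cross-entropy over a batch's image-text similarity matrix, in the spirit of the pseudocode in the CLIP paper; the sketch below uses random tensors in place of real encoder outputs.

```python
# Minimal sketch of CLIP's symmetric contrastive (InfoNCE) objective: in a batch of N
# matched image-text pairs, the N correct pairings on the diagonal should score high
# and the mismatched pairings should score low. Encoders here are placeholders.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Normalize embeddings so the dot product is cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))            # matched pair i <-> i

    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random "embeddings" standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```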


--- End of excerpt from the report ---


(Special note: this article is derived from public information; the excerpt is for reference only and does not constitute any investment advice. Please refer to the original report if you need to use it.)
