
Let's talk about ChatGPT-4o from the perspective of human-computer interaction

Author: Everyone is a Product Manager

Over the past few days, the AI circle has felt like it's been on holiday: first the products OpenAI announced at Monday night's launch event, which were broadly in line with earlier reports; then last night's Google I/O conference, which also unveiled new large-model products. In this article, we focus on GPT-4o.

This GPT-4o release covers seven updates in total. The author has selected four key points to analyze from the perspective of human-computer interaction, and to see how that reading differs from the common understanding.


The ChatGPT-4o launch event held in the middle of the night the day before yesterday will probably be the hottest topic in the AI circle over the next few days, and several of the updates can be read from a human-computer interaction perspective.

First, let's take a look at the main content of the GPT-4o update (readers who only care about the interaction-perspective interpretation can skip this list):

  1. Multimodal interaction capability: ChatGPT 4.0 supports image and text input and can output text, giving it multimodal interaction capability. This means it can understand image content and perform tasks such as captioning, classification, and analysis. (A minimal API sketch follows this list.)
  2. Improved natural language understanding: natural language understanding has improved significantly, allowing ChatGPT 4.0 to better understand user input and give more accurate responses based on the user's context.
  3. Increased context length: the longer context allows the model to perform better in long conversations, keeping track of the context of the entire exchange to give more accurate and appropriate answers.
  4. Data analysis and visualization: ChatGPT 4.0 can analyze data and produce charts through natural-language interaction, drawing on its knowledge base and data retrieved online by connecting directly to the relevant functional modules.
  5. DALL·E 3 features: ChatGPT 4.0 introduces DALL·E 3, allowing users to upload images and ask questions about them, browse the web directly through Bing, and use DALL·E 3 to create new images based on them.
  6. Advances in model architecture and training data: this release introduces a more advanced model architecture, more training data, and more language data, taking the chatbot's performance to a new level.
  7. API opening and price reductions: the new GPT-4 Turbo supports a 128k context with knowledge updated to April 2023; DALL·E 3, text-to-speech (TTS), and other capabilities are all exposed through the API, and API prices have come down as well.
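To make the multimodal input in point 1 and the opened-up API in point 7 concrete, here is a minimal sketch, assuming the OpenAI Python SDK (openai >= 1.x) and an OPENAI_API_KEY in the environment; the prompt and image URL are placeholders, not examples from the article.

    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    # One request mixing text and an image, answered in text.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe what is happening in this picture."},
                    {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                ],
            }
        ],
    )

    print(response.choices[0].message.content)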

Points 1, 2, 3, and 5 can all be talked about from the perspective of human-computer interaction.

Point 1: Multimodal interaction capabilities

Today, the author also read some articles about the GPT-4o update, and some of them treat multimodal interaction as merely "we can now talk to GPT with more than just text", which badly underestimates what multimodal interaction means.

Humans express themselves through both text and speech, and even when the words are exactly the same, the information they carry differs greatly. Text is static, while speech carries information along many more dimensions: tone of voice, intonation, volume, speaking rate, pauses, stress, and so on.

Take "Hello" as an example: in text it expresses essentially one meaning, while spoken aloud it may express four to six. For a program, multimodal interaction means obtaining information from more sources (visual, auditory, textual, environmental, and so on), and also obtaining more information from each source (for instance the tone, intonation, volume, rate, pauses, and stress in the voice dimension just mentioned).

With information from more sources, and more of it, GPT can shorten its reasoning and judgment and reply more quickly. It is as if the user had automatically described the question in more detail and stated their requirements more clearly, so the speed and quality of GPT's feedback naturally improve. (Of course, the model itself has also improved.)

In addition to sound, GPT-4o's multimodal interaction capabilities include visual comprehension, such as recognizing faces in images and analyzing gender, age, and expression. Again, this is the same point: more sources of information, and more information from each.

The above covers the input stage of multimodal interaction, from human to GPT. The other half of the interaction, the stage where GPT outputs back to the human, matters just as much.

GPT-4o can respond in whichever modality is most appropriate: previously GPT could only reply in text, but now the reply can be text, voice, or image. The voice modality supports more communication scenarios and makes interaction more accessible. The significance of images hardly needs explaining; whether it is the graphical interface that replaced the command line, or the slide deck you prepare for a promotion review, images show their advantages over plain text.
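As one hedged illustration of the voice output modality, the sketch below assumes the OpenAI Python SDK and its text-to-speech endpoint (tts-1 with the "alloy" voice); the reply text and output file name are made up for the example.

    from openai import OpenAI

    client = OpenAI()

    reply_text = "Sure, your meeting has been moved to three o'clock tomorrow afternoon."

    # Turn the text reply into speech so it can be played back instead of displayed.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply_text,
    )

    with open("reply.mp3", "wb") as f:
        f.write(speech.content)  # raw audio bytes returned by the endpoint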

Point 2: Improved natural language understanding

If multimodal interaction capability covers the two stages of input and output, then natural language understanding covers the stage in between: processing. Once GPT has gathered information from multiple sources, the next step is to understand it; only then can it respond. Better natural language understanding means GPT-4o identifies user intent more accurately, so both the content of the follow-up reply and the modality chosen for it will be of higher quality.
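To show what "identifying user intent" can look like at the application level, here is a hypothetical sketch (not the author's method): the model is asked to classify a request into made-up intent labels before the application decides how, and in which modality, to reply. It assumes the OpenAI Python SDK and its JSON response format.

    import json

    from openai import OpenAI

    client = OpenAI()

    user_utterance = "Can you move my 2 pm meeting to tomorrow morning?"

    # Ask the model to label the intent before generating the actual reply.
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the user's intent as one of: schedule_change, "
                    'information_query, small_talk. Answer as JSON: {"intent": "..."}'
                ),
            },
            {"role": "user", "content": user_utterance},
        ],
    )

    print(json.loads(resp.choices[0].message.content))  # e.g. {"intent": "schedule_change"}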

Point 3: Increased context length

The significance of this point shows up first in long conversations. We can draw an analogy with how people communicate: two friends who have known each other for many years can pack a very large amount of information into a short exchange, for example:

Zhang San said to Li Si: Your design plan last time was really awesome!

The information that is not stated in the sentence itself, but which both Zhang San and Li Si understand, may include:

  1. Which plan "last time" refers to: the project, the deliverable, and roughly when it was presented;
  2. The background of that plan: the problem it addressed, the constraints it worked under, and how it was received;
  3. The shared standard behind "awesome": what the two of them consider a good design plan to look like.

Spelling out the specific information in points 1, 2, and 3 above might take thousands of words or ten minutes of conversation. But because this information is already stored in both people's memories, they can omit the detailed descriptions and preconditions and convey a great deal in a single sentence.

For GPT-4o, the increase in how much context it can remember means it becomes a program that is more familiar with you. Users can then communicate with GPT-4o the way Zhang San and Li Si do: conveying more with less, while keeping the quality of communication intact.
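A minimal sketch of what the "remembered" context actually is in an API-based chat application: the client keeps the running message history and resends it with every request, so earlier turns stay inside the model's context window. The function and messages below are illustrative assumptions, not the article's.

    from openai import OpenAI

    client = OpenAI()

    # Everything the model "remembers" is whatever history we keep sending back.
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    def chat(user_message: str) -> str:
        history.append({"role": "user", "content": user_message})
        resp = client.chat.completions.create(model="gpt-4o", messages=history)
        answer = resp.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        return answer

    chat("My design review for the payments app is next Tuesday.")
    print(chat("Remind me what I have next Tuesday."))  # relies on the earlier turn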

Note that I deliberately said "a program that is more familiar with you", not "a friend that is more familiar with you". There are two key differences. The first is the so-called context length, which maps onto how long two people have known each other, how much information they have exchanged, and how well they understand each other. The second is illustrated by the following scenario:

Imagine that the new generation of children starts using AI tools from a very young age, that those tools live on portable smart devices and perceive the surroundings alongside the user in multiple modalities, and that the context GPT-4o can remember stretches over decades. Such an AI could become the user's most familiar companion, far more so than parents and family. Give it the right hardware, and it could almost be regarded as an intelligent machine family member.

Point 5: DALL·E 3 features

The ability to generate images and to edit them intelligently already exists in many other products, but this GPT-4o update spares users the data-type conversions they previously had to perform themselves, handing that work to GPT-4o, which is a gain in operating efficiency. For example, if we spotted a new concept in an image, we used to have to turn the image into text by typing it out or running OCR before we could do anything with it; GPT-4o will save users that step.
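For the image-generation side, a hedged sketch using the same SDK's images endpoint with the dall-e-3 model; the prompt is arbitrary and the returned URL is only valid temporarily.

    from openai import OpenAI

    client = OpenAI()

    # Generate one image from a text prompt; the response contains a temporary URL.
    image = client.images.generate(
        model="dall-e-3",
        prompt="A clean, minimal illustration of a person talking to a voice assistant",
        size="1024x1024",
        n=1,
    )

    print(image.data[0].url)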

As for its significance for creative work, advertising production, product design, educational presentations, and so on, there is little need to elaborate: many similar products are already on the market.

Another point that drew a "wow" during the event was GPT-4o's response time of as little as 232 milliseconds (320 milliseconds on average), close to the rhythm of real-time human conversation and a clear improvement over the latency of the previous generation of models.

In fact, the interpretation above already suggests why GPT-4o's response time has improved so much:

  • Point 1: GPT-4o obtains information faster and obtains more of it.
  • Point 2: GPT-4o understands that information faster.
  • Point 3: GPT can draw more information from context that the user never states directly.

Taken together with the improvement of the model itself, these three points make it easier to understand how GPT-4o reaches a 232-millisecond response time.
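If you want a rough feel for this responsiveness yourself, one common trick is to stream the reply and time how long the first token takes to arrive. The sketch below assumes the OpenAI Python SDK; it measures perceived latency in your own environment, not OpenAI's quoted 232/320 ms audio figures.

    import time

    from openai import OpenAI

    client = OpenAI()

    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
        stream=True,
    )

    # Print the delay to the first non-empty chunk (time to first token).
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(f"first token after {time.perf_counter() - start:.3f} s")
            break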

When the response time of GPT-4o reaches the level of human-to-human dialogue, there is more room for imagination in many application scenarios. Specifically, the author thought of the following aspects:

  1. Enhanced real-time interactivity: this level of responsiveness makes human-machine conversation nearly seamless, all but eliminating the perceptible delay between a request and a reply that traditional AI assistants impose. When users communicate with GPT-4o, it feels like a natural, fluid conversation with another real person, which greatly improves the realism and satisfaction of the interaction.
  2. User experience optimization: faster responses reduce the psychological burden of waiting for feedback, making communication more comfortable and efficient. This matters most in scenarios that demand fast feedback, such as information queries in an emergency, instant decision support, or fast-paced business communication. It also feels more like person-to-person communication: when chatting with friends, we don't normally pause for three seconds before answering, right?
  3. Application Scenario Expansion: With its ability to process audio, visual, and text information in real time, GPT-4o opens the door to more application scenarios. For example, in areas such as customer service, educational tutoring, telemedicine, virtual assistants, and game interactions, real-time interaction capabilities are key to improving service quality and efficiency.

Columnist

Du Zhao, WeChat official account: AI and User Experience. Columnist at Everyone is a Product Manager and practicing designer, currently responsible for mobile-phone OS interaction design at a phone manufacturer, on a product serving hundreds of millions of users. His main focus is the integration of AI with human-computer interaction design and the impact of human factors on user experience.

This article was originally published on Everyone is a Product Manager. Reproduction without permission is prohibited.

The title image is from Unsplash and is licensed under CC0.

The views in this article represent only the author's own. Everyone is a Product Manager provides information storage services only.
