GPT-4o's first experience: a leapfrog upgrade of vision and hearing

Author: Ray Technology

ChatGPT users are ecstatic, and AI startups shake their heads.

Just ahead of the Google I/O conference, in the early hours of May 14, OpenAI released a new model: GPT-4o.

Yes, not a search product, not GPT-5, but a new multimodal large model in the GPT-4 series. According to OpenAI CTO Mira Murati, GPT-4o (the "o" stands for "omni") can accept any combination of text, audio, and images as input and produce any combination of them as output.

The new GPT-4o model responds faster and processes input more efficiently, which to some extent changes human-computer interaction qualitatively.

In fact, during the less-than-30-minute launch event, the most talked-about part was not the GPT-4o model itself but the ChatGPT experience it enables. Voice conversations with the machine now feel much closer to real-time conversations between people, and improved visual recognition lets the AI ground those voice interactions in the real world.

In short, it is more natural human-computer interaction, and it reminded many people, OpenAI CEO Sam Altman included, of the AI virtual assistant in the film Her:

Figure / X

But for many, what matters more may be that GPT-4o will be available to free users (excluding the new voice mode) in the coming weeks. ChatGPT Plus subscribers, of course, still have their "privileges": they can try out the GPT-4o model starting today.

Photo/ ChatGPT

However, the desktop app shown in OpenAI's demo has not yet launched, and the ChatGPT mobile apps (both Android and iOS) have not yet been updated to the version demonstrated at the event. In other words, ChatGPT's new GPT-4o voice mode is not yet available even to ChatGPT Plus users.

Figure / X

So in a sense, the GPT-4o experience that ChatGPT Plus users get right now is essentially what free users will get in the coming weeks.

But how does GPT-4o actually perform? Is it worth free users coming back to ChatGPT? Ultimately, you have to try it yourself. In the meantime, the current text- and image-based conversations already give a glimpse of what the new ChatGPT (GPT-4o) can do.

From a single photo featuring Genshin Impact, GPT-4o shows it understands images better

All of GPT-4o's upgrades can be summed up as a comprehensive improvement in native multimodal ability: not only can it take any combination of text, audio, and images as input and output, but its comprehension of each modality has also clearly improved.

Especially image comprehension.

In this picture there are partially obscured books and a phone running a game. GPT-4o not only accurately recognized the text on the books and, drawing on its knowledge base or the internet, correctly identified their full titles; most impressively, it could tell straight away which game the phone was running: Genshin Impact.

Photo/ ChatGPT

Frankly, players familiar with Genshin Impact can probably recognize it at a glance, but from this picture alone, most people who have never played the game and don't know its characters simply couldn't identify it.

When Xiao Lei asked how it could tell the game was Genshin Impact, GPT-4o's answer was quite logical: it came down to the on-screen content, the game interface, and the art style.

Photo/ ChatGPT

Given the same picture and question, however, both Tongyi Qianwen (Alibaba's model) and GPT-4 gave unsatisfactory answers.

Similarly, when shown a meme Musk had just posted, GPT-4o also understood the joke and the irony more accurately.

Photo/ ChatGPT

On the ChatGPT mobile app, GPT-4o also accurately described a scene from a very partial photo, and even roughly inferred whether the area was residential or office buildings.

Photo/ ChatGPT

All of the above examples illustrate, to some extent, GPT-4o's progress in image understanding. It's worth noting that under OpenAI's new policy, free ChatGPT users will also be able to take photos or upload images to GPT-4o within a few weeks.
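Incidentally, the same kind of image question can also be posed through OpenAI's API rather than the ChatGPT app. Below is a minimal sketch assuming the official openai Python SDK (v1.x) and the documented image-input format of the Chat Completions endpoint; the file name and prompt are placeholders for illustration, not the exact test used above.

```python
# Minimal sketch: asking GPT-4o a question about a local photo via the API.
# Assumes the official `openai` Python SDK (v1.x) and an OPENAI_API_KEY
# environment variable; the file name and question are illustrative placeholders.
import base64

from openai import OpenAI

client = OpenAI()

# Encode the local photo as a base64 data URL so it can be sent inline.
with open("desk_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What game is running on the phone in this photo, and how can you tell?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```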

In addition, free users will be able to upload files and have GPT-4o summarize, write, and analyze them. In terms of the number and size of files allowed, however, ChatGPT is still less generous than Kimi and other Chinese AI chatbots, and the limits are noticeable.

Of course, it still has its advantages; after all, GPT-4o retains GPT-4's top-tier "intelligence".

The new voice mode isn't here yet, but the voice experience is already a level up

Compared with image comprehension, though, the most important upgrade in GPT-4o this time, in Xiao Lei's view, has to be voice.

Although the new voice mode has not yet rolled out and many of the demoed experiences can't be tried, simply opening the existing voice mode and chatting for a bit makes it clear that GPT-4o's voice experience has been significantly upgraded.

First, not only are the timbre and pitch very close to a real person's voice, but more importantly the AI now handles filler words like "um" and "ah" naturally, and its speech carries a degree of intonation and cadence. By comparison, the voice mode running on GPT-4o clearly comes across as far more "emotional" in the everyday sense.

Compared with voice assistants such as Siri, it is a huge improvement; even against the current crop of generative AI voice chat products, GPT-4o's voice sounds more faithful and natural.

Second, in the past, during a voice-mode conversation it often took ChatGPT a long time to realize I had finished speaking before it would upload, process, and output an answer, so much so that I would often just control it manually. With GPT-4o, ChatGPT detects much more promptly that I'm done and starts processing, requiring far less manual intervention.

For now it's still the old voice mode and interface. Figure / ChatGPT

There are shortcomings too. Some of them, Xiao Lei suspects, will be hard to improve noticeably even once the new mode officially launches, such as the much-discussed "hallucination" problem, where no obvious improvement can be felt. Others, however, may change qualitatively at rollout, such as conversation latency.

Judging from the current version, even with a normal network connection in chat mode, voice mode can take a long time to connect at the start, or fail to connect at all. And even once connected, conversation latency is still high: I often have to wait several seconds after finishing speaking before hearing a response.

In fact, the old voice mode is a three-stage pipeline: OpenAI's Whisper model transcribes the user's speech into text, GPT-3.5/GPT-4 processes that text and generates a reply, and a text-to-speech model converts the reply back into audio. After this relay, it's not hard to see why ChatGPT's voice responses used to be slow and the voice interaction experience poor.
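To make that relay concrete, here is a minimal sketch of such a transcribe-reply-synthesize pipeline, assuming the official openai Python SDK (v1.x); the model and file names are illustrative, and this is only a reconstruction of the general approach described above, not OpenAI's actual ChatGPT implementation.

```python
# Minimal sketch of the old three-model voice pipeline described above:
# speech -> text (Whisper), text -> reply (GPT), reply -> speech (TTS).
# Assumes the official `openai` Python SDK (v1.x); names are illustrative.
from openai import OpenAI

client = OpenAI()

# 1. Transcribe the user's recorded speech to text.
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Generate a text reply with a chat model.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# 3. Convert the reply back into audio with a text-to-speech model.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
with open("assistant_reply.mp3", "wb") as out:
    out.write(speech.content)
```

Each of the three hops adds its own network round trip and inference time, which is where the old voice mode's latency comes from.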

This is also precisely why the new voice mode is exciting. According to OpenAI, GPT-4o is a single new model trained end to end across text, vision, and audio, so in the new voice mode all inputs and outputs are processed by the same neural network. Beyond text and voice, the new voice mode can even hold a conversation based on the real-time view from the phone's camera.

The new voice mode and interface. Figure / OpenAI

To put it simply, ChatGPT used to have to run your voice through three "brains" (models) in sequence before responding. With the upcoming new mode, ChatGPT only needs a single "brain" (model) that handles text, speech, and even images at the same time, so the efficiency gain is easy to imagine.

As for whether the ultra-low-latency responses in OpenAI's demonstration can actually be achieved, we will have to wait for the new voice mode to roll out in the next few weeks, and Xiao Lei will try it as soon as possible.

Final thoughts

It's true that in the year since GPT-4's release, large models around the world have kept emerging and iterating at a frantic pace, narrowing the gap with GPT-4 and, in one case (Claude 3 Opus), even surpassing it for a time. Still, judging from authoritative benchmarks, head-to-head arena rankings, and feedback from a large number of users, GPT-4 remains one of the top large models in the world.

More importantly, technology shapes capability and products shape experience. GPT-4o once again proves OpenAI's strength in both, and its overhaul of the voice interaction experience may once again wipe out a batch of startups working on AI chat and AI voice assistants.

On the other hand, it also renews our hope for a qualitative leap in human-computer voice interaction.