
OpenAI Upends Voice Assistants Overnight! ChatGPT Learns to Look at Screens, and the Real-Life "Her" Is Here


Author | Zhidong Editorial Department

A showdown between top AI products is playing out over these two days.

Zhidong reported on May 14 that at 1 a.m. today, just ahead of Google's annual developer conference Google I/O, OpenAI held its spring launch livestream, announcing a desktop version of ChatGPT and a new flagship AI model, GPT-4o.

GPT-4o is free and open to everyone, and it reasons in real time across text, audio, and vision (images and video), with API pricing half that of GPT-4 Turbo and twice its speed. Paid ChatGPT Plus users get 5x the message limits, along with early access to the new macOS desktop app and next-generation voice and video features.


OpenAI's latest upgrade to its AI chatbot ChatGPT once again lands with full force: its real-time voice translation is so natural and fluid that it feels ready to replace a simultaneous interpreter outright.

Not only does it respond quickly and answer accurately, it can also change its tone of voice on demand, from cold and mechanical to humorous or shy, and it can break into song at any moment and sound like a real person.

Beyond voice chat, GPT-4o now supports real-time video interaction. For example, it can work through linear equations shown to it on camera, and it has learned to "read the room," judging people's emotions from their facial expressions and intonation.


▲GPT-4o recognizes text in the video feed and reacts shyly

What's more, it can look directly at your screen and answer questions based on what it sees. Show it a piece of code and it understands it and points out what's wrong; show it a data chart and it explains what the chart conveys.

The whole launch was brisk, only about half an hour, and Apple devices featured prominently throughout, which seemed to lend weight to reports of close cooperation between OpenAI and Apple.

The new features will reach both free and paid users. Text and image input are available starting today; the voice and video features enter a beta phase today limited to ChatGPT Plus users, and will roll out to a wider audience in the coming weeks.

Also worth noting: the keynote for this spring launch was delivered not by OpenAI co-founder and CEO Sam Altman, but by OpenAI CTO Mira Murati.

Altman also cryptically posted the single word "her" on social media, seemingly implying that ChatGPT has realized the lifelike, human-like AI of the classic movie "Her."


X user Dogan Ural commented, "You finally made it," attaching a meme that swaps OpenAI's logo into a still of the AI from the movie "Her."


1. OpenAI's "omni" large model goes live: performance on par with GPT-4 Turbo, free for everyone, and sharply lower API pricing

The "o" in GPT-4o stands for "omni." According to Murati, GPT-4o gives every user GPT-4-level intelligence while improving on GPT-4's text, vision, and audio capabilities.

Previously, GPT-4 was trained on image and text data and could analyze both, extracting text from images or describing their content. GPT-4o builds on this by adding voice capabilities, bringing interaction with ChatGPT closer to the way people interact with each other. GPT-4o matches GPT-4 Turbo on English text and code, with significant improvements on non-English text.

Murati said the release of GPT-4o is a big step forward in the usability of OpenAI's large models, which are reshaping the paradigm of human-computer collaboration. She noted that things humans handle effortlessly in conversation, such as interrupting each other, filtering background noise with multiple voices, and understanding intonation, are all very hard for a model.


Previously, when users spoke to ChatGPT in voice mode, average latency was 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. That voice mode was a pipeline of three separate models: a simple model transcribed audio to text, GPT-3.5 or GPT-4 took text in and produced text out, and a third simple model converted that text back into audio.

A lot of information is lost along the way: GPT-4 cannot directly observe pitch, multiple speakers, or background noise, and it cannot output laughter, singing, or expressed emotion.
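As a conceptual sketch (hypothetical stubs, not OpenAI's actual code), the old cascade looked like this, with plain text as the only thing passed between stages:

```python
# Conceptual sketch of the old cascaded voice pipeline. These are
# hypothetical stubs, not OpenAI's actual code; the point is that only
# plain text crosses each stage boundary, so pitch, laughter, speaker
# identity, and background sound are dropped before the LLM sees anything.

def speech_to_text(audio: bytes) -> str:
    # Stage 1: a simple ASR model transcribes audio to text.
    # Prosody and emotion are lost right here.
    return "hello, how are you?"  # dummy transcript for illustration

def chat_model(transcript: str) -> str:
    # Stage 2: GPT-3.5 or GPT-4 sees only the transcript, never the audio.
    return "I'm doing well, thank you!"  # dummy reply for illustration

def text_to_speech(reply: str) -> bytes:
    # Stage 3: a simple TTS model re-synthesizes audio; it cannot sing or
    # emote because that information never reached it.
    return reply.encode("utf-8")  # stand-in for synthesized audio

def voice_turn(audio: bytes) -> bytes:
    # Three model calls in sequence: the source of the 2.8-5.4 s latency.
    return text_to_speech(chat_model(speech_to_text(audio)))
```

Training one network end-to-end on audio, vision, and text removes these stage boundaries entirely, which is why GPT-4o can both perceive and produce non-textual signals.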

With GPT-4o, OpenAI trained the new model end-to-end across text, vision, and audio, so that all inputs and outputs are processed by the same neural network, further reducing latency.

An important mission of OpenAI is to make advanced AI tools available to everyone for free, Murati said.

She also announced that OpenAI will launch a desktop version of ChatGPT that integrates easily into users' workflows. OpenAI has also refreshed the user interface so that interacting with ChatGPT feels easier and more natural: users can stop thinking about the interface and focus purely on the collaboration.


More than 100 million users already use ChatGPT for work and study, but until now OpenAI's more advanced features were available only to paid users.


Starting today, free users can also access GPTs and the GPT Store. Murati revealed that more than a million users have created experiences with GPTs, custom versions of ChatGPT built for specific use cases and distributed through the GPT Store.

These users now also get vision capabilities, letting them upload screenshots, photos, and documents mixing text and images, plus memory, which makes conversations more continuous. They can also use the Browse feature to pull real-time information into a conversation, and Advanced Data Analysis to analyze uploaded charts and data.


OpenAI has also improved ChatGPT's quality and speed in 50 languages. Paid users get message limits up to 5x higher than free users.

In addition, GPT-4o is available through the API for developers to build and deploy AI applications. Compared with GPT-4 Turbo, GPT-4o is 2x faster, 50% cheaper, and has rate limits up to 5x higher.
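For developers, switching to GPT-4o through the official openai Python SDK is essentially a model-name change. A minimal sketch (the prompt text here is made up for illustration):

```python
# Minimal sketch of calling GPT-4o via the official openai Python SDK.
# Requires `pip install openai` and the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # previously "gpt-4-turbo"; GPT-4o is faster and cheaper
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize GPT-4o's launch in one sentence."},
    ],
)
print(response.choices[0].message.content)
```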

Murati emphasized that presenting this technology in a way that is both useful and safe is very challenging, and that OpenAI's team has been working on mitigations against misuse of the technology.

2. Five capabilities demonstrated live: voice conversation, code, math problems, real-time translation, and emotional intelligence

Mark Chen, OpenAI's head of frontier research, and Barret Zoph, head of the post-training team, demonstrated real-time voice conversation on stage. Tapping the small icon in the lower-right corner of ChatGPT enters voice mode.

What makes GPT-4o-based voice interaction different?

Chen named several key differences from the previous voice mode: first, users can interrupt the model rather than waiting for it to finish; second, it responds in real time, with none of the awkward lag before a reply; and finally, it perceives emotion and can generate speech in a range of emotional styles.

1. Telling stories with emotion, singing and improvising on the spot

First, Chen asked ChatGPT to tell a bedtime story about robots and love to help a friend with insomnia, then asked it to retell the story with more emotion and more drama.

GPT-4o launched into an expressive telling: "Once upon a time, in a world not so different from ours, there was a robot called 'Byte,' a curious robot always exploring new circuits..." Finally, at Chen's request, GPT-4o ended the story with a song.


2. ChatGPT grows eyes: it can see a handwritten equation on camera

Next, Zoph demonstrated combined vision and voice interaction.

"I want you to help me solve a math problem." Zoph opens a mobile phone video call in ChatGPT and says to ChatGPT, "I'm going to write down a linear equation on a piece of paper...... Don't tell me the solution, just help give it tips for the process. ”

When Zoph wrote the equation on camera and asked ChatGPT what he had written, it replied, "I see it, you wrote 3x+1=4."


Zoph asked ChatGPT how to approach the problem, and it first hinted at dealing with the "+1" term. As Zoph wrote out his work and asked for the next hint, ChatGPT suggested dividing both sides by three, leading him to the correct answer, x=1.

Throughout, ChatGPT used leading questions to draw Zoph forward. For example: "Now you've isolated the x term on one side, and it looks like 3x equals 3. What do you think we should do next?" When Zoph said he wasn't sure, ChatGPT nudged further: "You have 3x, and you want the value of x. Think about what would undo the multiplication." With this guidance and encouragement, Zoph solved the problem by dividing both sides of the equation by three.
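For reference, the full solution ChatGPT was steering toward takes just two moves:

```latex
\begin{align*}
3x + 1 &= 4 \\
3x &= 3 \quad\text{(subtract 1 from both sides, the ``+1'' hint)} \\
x  &= 1 \quad\text{(divide both sides by 3)}
\end{align*}
```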

After solving the problem, ChatGPT and Zoph recapped together how linear equations show up in the real world. Finally, Zoph wrote a note of affection for ChatGPT to see, and it squealed bashfully: "Oh, I see 'I love ChatGPT,' that's so sweet of you!"


3. Reading the screen in real time: answering code questions and analyzing charts

Zoph then demonstrated ChatGPT's ability to analyze code. He pasted a piece of Python code into ChatGPT and asked it to summarize in one sentence what the code does.


ChatGPT quickly answered that the code fetches daily weather data, then detailed what it does with that data.

Zoph asked, "There's a function foo in the code, can you describe what the image would look like without it?" ”

ChatGPT explained that the function applies a rolling-mean calculation to denoise the data and damp its fluctuations, which is what produces the smoother curve in the plot.
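The exact script from the demo was not published, but a plausible reconstruction of that kind of code, with a smoothing function standing in for the `foo` Zoph asked about (all names and data here are hypothetical), might look like this:

```python
# Hypothetical reconstruction of the kind of script shown in the demo.
# The actual code was not published; names and data are made up.
import math
import pandas as pd
import matplotlib.pyplot as plt

def foo(series: pd.Series, window: int = 7) -> pd.Series:
    # Rolling mean: smooths day-to-day noise. Removing this is what
    # ChatGPT said would leave a jagged, fluctuating plot.
    return series.rolling(window=window, min_periods=1).mean()

# Dummy daily temperature data standing in for the demo's dataset.
dates = pd.date_range("2023-01-01", periods=365, freq="D")
temps = [20 + 10 * math.sin(2 * math.pi * i / 365) for i in range(365)]
df = pd.DataFrame({"date": dates, "temp_max": temps})

plt.plot(df["date"], df["temp_max"], alpha=0.3, label="raw daily max")
plt.plot(df["date"], foo(df["temp_max"]), label="7-day rolling mean")
plt.legend()
plt.title("Daily max temperature")
plt.show()
```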


Zoph then ran the code to demonstrate ChatGPT's ability to analyze charts.


After sending the resulting plot to ChatGPT, Zoph again asked it to describe what it saw in one sentence, and it responded promptly.

Chen then asked which months were the warmest; ChatGPT not only correctly identified July and August, but also described how high temperatures climbed during that period.


4. Near-zero-delay speech translation that mimics the speaker's tone

At the suggestion of a user on X, Murati and Chen demonstrated ChatGPT's real-time translation ability.

Chen first briefed ChatGPT on the task: translate any English it hears into Italian, and any Italian into English. ChatGPT replied in Italian: "Perfetto" (perfect).

The two then spoke in Italian and English, and ChatGPT translated each into the other language with barely any delay, mimicking the speakers' tone and even adding a laugh to Murati's answer.
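The live demo ran on GPT-4o's native speech, but the same two-way interpretation instruction can be approximated in text through the chat API. A rough sketch (the system prompt wording is our own invention, not OpenAI's):

```python
# Text-only analogue of the live translation demo, using the official
# openai SDK. The system prompt wording here is our own invention.
from openai import OpenAI

client = OpenAI()

def translate_turn(utterance: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You are a live interpreter. Translate any English you "
                "receive into Italian, and any Italian into English."
            )},
            {"role": "user", "content": utterance},
        ],
    )
    return response.choices[0].message.content

print(translate_turn("Hey, how has your week been going?"))
print(translate_turn("Alla grande! E tu?"))
```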


5. Reading people's emotions, with a few "feelings" of its own

Finally, Zoph demonstrated ChatGPT's ability to read a person's emotions.

He told ChatGPT by voice that he would show it a selfie and wanted it to judge his emotions from the image. ChatGPT happily accepted the "fun challenge."

There was a small mix-up here: Zoph initially had the rear camera on, and although he quickly switched to the front camera and began the selfie, ChatGPT's response seemed to lag by a few seconds; it remarked, "This looks like the surface of a wooden board."


"Don't worry, I'm not a table." After Zoph explained that he had just taken the wrong shot, ChatGPT restarted analyzing the footage and said, "You look very happy, maybe a little excited, and you should be in a good mood." ”

ChatGPT asked Zoph why he was so happy, and Zoph half-joked that he was giving a presentation to show "how incredible you are." ChatGPT, seemingly with feelings of its own, replied in an exaggerated tone: "Don't say that! You're making me blush."


3. A "next big thing" teased, and GPT-4o revealed as the arena's mysterious "gpt2-chatbot"

Beyond the event itself, Altman live-commented the launch on the social platform X, retweeting a string of posts introducing the new model.

According to OpenAI researcher William Fedus, GPT-4o is in fact a version of the mysterious "gpt2-chatbot" that recently swept the large-model arena; he attached a chart of arena scores showing it more than 100 Elo points above GPT-4 Turbo.


On reasoning, GPT-4o scored highest on benchmarks including MMLU, GPQA, MATH, and HumanEval, surpassing frontier models such as GPT-4 Turbo, Claude 3 Opus, and Gemini Pro 1.5.


On audio ASR (automatic speech recognition), GPT-4o substantially improves speech recognition over Whisper-v3 across all languages, especially lower-resource ones.
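For context, the Whisper family it is benchmarked against is what developers use today through OpenAI's speech-to-text endpoint. A minimal sketch (note that the hosted `whisper-1` model may not correspond exactly to the Whisper-v3 checkpoint used in the benchmark, and the file path is a placeholder):

```python
# Minimal sketch of transcription with OpenAI's hosted Whisper endpoint,
# the model family GPT-4o's ASR results are compared against.
# "speech.mp3" is a placeholder path; "whisper-1" is the hosted model name.
from openai import OpenAI

client = OpenAI()
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```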


On speech translation, GPT-4o also sets a new state of the art, outperforming Whisper-v3 as well as speech models from Meta and Google on the MLS benchmark.


After all the demos, Murati concluded, "As you can see, it's really amazing."

OpenAI will roll these features out to all users over the coming weeks. Murati also revealed that OpenAI will keep pushing the frontier of the technology and will soon share "the next big thing."

Conclusion: With ChatGPT for Mac and the all-capable GPT-4o making a strong debut, how will Google respond?

In February this year, no sooner had Google launched its Gemini 1.5 family of large models, with context windows reaching a million tokens, than OpenAI unveiled its AI video generation model Sora, stunning the global tech community and stealing the limelight.

Now OpenAI has thrown down the gauntlet again, announcing the Mac desktop version of ChatGPT and GPT-4o on the eve of Google I/O, and running the entire demo on an iPhone and a MacBook Pro. Combined with recent rumors of an Apple-OpenAI partnership, this has people eagerly awaiting Apple's WWDC Worldwide Developers Conference in June.

Do these AI launches pose a direct threat to Google? What further innovation and surprise can the fiercely competitive generative AI industry deliver? Can Google answer the challenge OpenAI has thrown down? We'll find out in the early hours of tomorrow morning.
