
OpenAI just upended Siri and simultaneous interpretation overnight, and GPT-4o's five core capabilities are mind-blowing!

Author: New List

OpenAI really has been saving up a big move.


At 10 a.m. local time on May 13 in the United States (1 a.m. Beijing time on May 14), OpenAI's spring launch event arrived as scheduled. There was no GPT-5 and no search engine, but a new flagship model was unveiled: GPT-4o.

"o" is an abbreviation for Omni, which means "all-round", which accepts any combination of text, audio, and images as input, and generates text, audio, and image outputs.

Judging from the live demonstration, GPT-4o's multimodal and real-time interaction capabilities are impressive enough to make the sci-fi film "Her" feel like it has become reality.


It's worth noting that GPT-4o's capabilities, along with features previously exclusive to ChatGPT Plus, will be available to all users for free!

However, GPT-4o's new voice mode will be prioritized for ChatGPT Plus members in the coming weeks.

In addition, GPT-4o is also available to developers through the API. Compared with GPT-4 Turbo, GPT-4o is half the price, twice as fast, and has a 5x higher rate limit. OpenAI says it will offer the new audio and video capabilities to a select group of API partners in the future.
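For developers, calling GPT-4o through the API works the same way as calling earlier chat models, just with the new model name, and it accepts mixed text-and-image input. Below is a minimal sketch using the official Python SDK; the prompt and image URL are placeholders of our own, not from OpenAI's materials.

```python
from openai import OpenAI  # official OpenAI Python SDK (v1.x)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# GPT-4o accepts mixed text-and-image content in a single message.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```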

How strong is GPT-4o? Last night, "number one AI player" watched the entire livestream. Let's review the details together.


GPT-4o makes a stunning debut: a full rundown of its core capabilities

01. Near-zero-latency real-time voice interaction: natural, lifelike, and emotional

The first is near-zero-latency, real-time voice interaction, in which GPT-4o behaves like an emotionally expressive real person.

During the live demo, presenter Mark Chen told GPT-4o, "I'm doing a demo and I'm a little nervous." He then deliberately started breathing very rapidly, and GPT-4o quickly picked up on his breathing and told him, "Oh, don't be nervous, slow down, you're not a vacuum cleaner," coaching him to steady his breathing.

Throughout, GPT-4o's tone was so natural, lifelike, and emotional that you could interrupt it at any time and ask it to adjust its tone and intonation.

Another presenter asked GPT-4o to tell a bedtime story about "robots and love." Barely a sentence in, Mark Chen interrupted, saying its storytelling wasn't emotional enough. After GPT-4o adjusted, Mark Chen quickly interrupted again and asked it to be even more emotional and dramatic, at which point GPT-4o's delivery went up another level, to the point of being theatrical.

The presenter then asked it to switch to a robotic voice, and GPT-4o's voice and intonation immediately turned cold and mechanical.

And that wasn't all: the presenter asked GPT-4o to finish the story in song, and it adapted the story into a song on the spot and sang it outright, making for great entertainment.

By comparison, ChatGPT's existing voice mode has an average latency of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4, which undoubtedly breaks the immersion of a conversation.

Moreover, because that pipeline first transcribes speech into text, has GPT-3.5 or GPT-4 receive, process, and output text, and then converts the text back into speech, GPT-3.5 and GPT-4 never directly perceive information such as tone, intonation, or background noise, and they cannot output laughter, singing, or emotional expression.
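Expressed in code, that cascaded flow looks roughly like the sketch below. This is our own illustrative approximation built from public OpenAI API calls, not OpenAI's internal implementation; each hop adds latency, and everything except the transcribed words is lost before the language model ever sees the turn.

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech -> text: prosody, tone, and background sounds are dropped here.
with open("user_turn.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2. The language model only ever sees the transcribed words.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# 3. Text -> speech: the synthesized voice cannot laugh or sing on cue,
#    because the model upstream never produced anything but text.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("assistant_turn.mp3", "wb") as out:
    out.write(speech.read())
```

GPT-4o collapses these three stages into a single model that takes audio in and produces audio out, which is what makes the near-instant, emotion-aware responses possible.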

02. Seeing through the camera and solving equations in real time

Beyond voice interaction, users can also interact with GPT-4o multimodally, combining vision and voice through real-time video, uploaded images, and so on.

At the launch event, OpenAI demonstrated the full process of GPT-4o using these multimodal capabilities to help a user solve a math problem.


Visual analysis of charts and diagrams is handled just as easily. In OpenAI's official blog, a user sketches on a tablet while talking to GPT-4o and solves a geometry problem guided by the voice conversation.


03. A smarter, hand-holding programming assistant

Real-time programming with GPT-4o is also far more interactive than the previous approach of plain-text chats or uploading screenshots for a text-based conversation.

In the official demo, OpenAI used GPT-4o on the desktop to inspect code: it could not only explain what the code does, but also tell the user what to expect if a particular piece of code is tweaked.

Through step-by-step, real-time Q&A, GPT-4o can help users program more efficiently, and the whole process feels remarkably smooth.


With such powerful real-time voice and visual interaction applied to programming assistance, who will need a "programmer motivator" in the future?

04. Video calls and real-time analysis of facial emotions

The presenters also took live requests from users on X, one of whom challenged them to turn on the camera and see whether GPT-4o could analyze facial emotions in real time.

The presenter first turned on the rear camera, which captured the table in front of him, and GPT-4o immediately observed: "You look like a table."

After he switched to the front-facing camera, his face appeared in the GPT-4o interface, giving the impression of a video call.

GPT-4o immediately said, "You look very happy, with a big smile. Do you want to share what's making you so happy?" There was even a hint of curiosity and playfulness in its tone.

The presenter replied, "Because I'm doing a live demo to show people how capable you are."

GPT-4o laughed and said, "Please, don't make me blush."

Watching this, "number one AI player" couldn't help but hear echoing in their head the lovers' whispers between Samantha and Theodore.

"Her" has really become a reality.

05. Simultaneous interpretation, with support for multiple languages

Currently, ChatGPT supports more than 50 languages. According to reports, GPT-4o's language capabilities have been improved in both quality and speed.

In the official demonstration, one person spoke English and the other Spanish, and the two communicated smoothly by instructing GPT-4o by voice to translate in real time. GPT-4o paused only one to two seconds at the start of each turn, with no stalling or stuttering mid-sentence.
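The demo relied on GPT-4o's voice mode, which isn't yet exposed through the API, but the interpreting behavior itself comes from a plain instruction. A rough text-mode approximation might look like the sketch below; the system-prompt wording is our own, not OpenAI's.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical interpreter instruction; in the live demo an equivalent
# instruction was given by voice.
interpreter_prompt = (
    "You are a real-time interpreter between English and Spanish. "
    "When you receive English, reply only with the Spanish translation; "
    "when you receive Spanish, reply only with the English translation."
)

turn = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": interpreter_prompt},
        {"role": "user", "content": "Hey, how has your week been going?"},
    ],
)
print(turn.choices[0].message.content)  # a Spanish rendering of the question
```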


One imperfection, however, is that with GPT-4o acting as the communication intermediary, the two speakers don't look at each other directly; both stare at the phone instead. In the future, new AI-powered devices may allow people who speak different languages to communicate more naturally.

A revolution in human-computer interaction, but still not GPT-5

OpenAI CTO Mira Murati explained in the livestream that GPT-4o is an iteration of the flagship GPT-4 model: it offers GPT-4-level intelligence, but is faster and has improved capabilities across text, speech, and vision.

OpenAI CEO Sam Altman posted that the model is "natively multimodal": a new model trained end-to-end across text, vision, and audio, with all of GPT-4o's inputs and outputs processed by the same neural network.

According to the official blog, on benchmarks GPT-4o matches GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new highs in multilingual, audio, and vision capabilities.

[Benchmark chart. Source: OpenAI official website]

OpenAI plans to roll out GPT-4o's capabilities gradually over the coming weeks. Text and image capabilities began rolling out in ChatGPT at launch; Plus users get early access and up to 5 times the message limit of free users. A new version of voice mode powered by GPT-4o will also arrive in ChatGPT Plus in the coming weeks.

Free users are not left out either: over the next few weeks, OpenAI will roll out GPT-4o and its related features to all users, who will be able to:

1. Experience GPT-4-level intelligence

2. Get responses drawing on both the model and the web (browsing)

3. Analyze data and create charts

4. Upload photos and chat about what's in them

5. Upload files for help with summarizing, writing, or analysis

6. Discover and use GPTs and the GPT Store

7. Enjoy a more personalized experience with a customizable ChatGPT "memory"

Readers who have already tried GPT-4o tell us the experience is wonderfully smooth. "Number one AI player" will follow up with more detailed hands-on reviews.

Of course, OpenAI isn't the only one making a splash. In May, dubbed the global "AI month", we will also see major AI events such as Google's I/O developer conference, Microsoft's Build developer conference, and NVIDIA's first-quarter earnings release.

In addition, Apple's WWDC on June 10 is expected to introduce a new AI app store, and Apple may upgrade the Siri voice assistant with a new generative AI system.

It's easy to imagine that if Apple does reach a deal with OpenAI, bringing GPT-4o to the iPhone to replace (or upgrade) Siri would be a logical next step.

Overall, compared with the graphical user interface, GPT-4o's near-real-time voice and video interaction marks a new revolution in human-computer interaction: a more natural and intuitive experience that comes very close to the artificial intelligence we see in science fiction, which is why Spike Jonze's film "Her" keeps being mentioned.

Fu Sheng, chairman and CEO of Cheetah Mobile, even recorded a video overnight praising OpenAI: "While everyone else is still fighting over the parameters and performance of large models, OpenAI has changed tack and gotten serious about integration and application."


After watching today's OpenAI launch, it's hard to imagine what kind of killer feature Google will have to unveil tomorrow to escape its fate as the "Wang Feng of AI" (forever upstaged in the headlines).
