GPT-4o: One small step for OpenAI, one giant step for the human "AI assistant".

Geek Park

2024-05-14 09:27, posted on the official account of Beijing Geek Park

Text | Li Shiyun

Editor | Jing Yu

On May 13, OpenAI once again stirred up the entire AI industry with its spring launch event.

With Sam Altman absent, OpenAI CTO Mira Murati took the stage to introduce the new ChatGPT product, GPT-4o.

The short, 26-minute press conference played out almost like a real-life version of the sci-fi movie "Her". When you open ChatGPT, you are no longer facing a tool that merely generates content or carries on stilted voice chats; you are summoning a voice assistant that can do almost anything, a "species" drawing ever closer to human.

It has "eyes" that can see you through the camera, for example, it can judge the mood of a researcher by the upturned corners of his mouth, what he is doing by the background of his surroundings, and even give styling suggestions; It can "see" your computer desktop and directly help you see what is wrong with the code you have written.

It has a more sensitive "ear" that understands not only speech but also a researcher's rapid breathing, and can coach him to breathe slowly and steadily.

OpenAI CTO Mira Murati announces the launch of GPT-4o | Image source: OpenAI

It has a more nimble "mouth". There is no longer any lag in conversation; you can interrupt it at any time, and it picks up your words at any time. Its voice can carry emotion, a little calmer, a little more passionate, even a little sarcastic. It can also sing.

It also has a smarter "brain". It can walk researchers through solving an inequality step by step, and it can serve as a simultaneous interpreter so that you can communicate with people who speak different languages.

Behind these powerful capabilities is OpenAI's newly launched model, GPT-4o. Its biggest improvement over existing models is that it can reason across audio, vision, and text in real time; in other words, it lets ChatGPT achieve truly multimodal interaction.

This is a pursuit not only of technological progress but also of widespread adoption. Part of OpenAI's mission is to make AI accessible to everyone, and for that it is critical that users can interact with AI smoothly. In the era of "model as application", that interactive experience ultimately depends on the model's capabilities. OpenAI says GPT-4o (the "o" stands for "omni") is a step toward more natural human-computer interaction.

At the launch, Mira Murati announced that GPT-4o will be made available to all users for free, while paid and enterprise users will get early access.

The movie Her, released in 2013, tells the story of a human who falls in love with an artificial intelligence voice assistant. Judging by the capabilities ChatGPT showed today, that piece of imagination is becoming reality at an accelerating pace.

01 ChatGPT's amazing progress: transforming into a human "super assistant", at times without any human involvement

OpenAI's official website showcases even more impressive scenarios for ChatGPT as a personal voice "super assistant".

The first set serves individual users, providing, much as a person would, "emotional value" and "cognitive value". It can tell jokes, sing happy birthday, play games, tease a puppy, lull people to sleep, and help them relax; it can act as a mock interviewer and offer interview advice; it can even serve as a blind person's eyes, describing what it sees and calling out road conditions as he crosses the street.

A blind user uses GPT-4o to "see" the world | Image source: OpenAI

The second set involves multiple users, where it provides more of a "collaborative value". It can interpret between two people who do not share a language so they can communicate without barriers; referee a game of rock-paper-scissors, calling the start and then accurately judging who won; act as a "tutor" helping a father guide his child through homework; and even serve as a neutral "third party" that chairs and records multi-person meetings.

The most intriguing demos were conversations between different instances of ChatGPT. Such communication without human involvement not only feels like science fiction but also invites us to imagine a future in which machines collaborate with no humans in the loop. In one demo, a user asked the ChatGPT on one phone to request after-sales service from the ChatGPT on another phone on his behalf; the two ChatGPTs chatted smoothly for two minutes and successfully got the goods exchanged for the user. And OpenAI president Greg Brockman gave a mischievous demonstration in which he had two ChatGPTs interact and sing together.

OpenAI President Greg Brockman demonstrates an interaction between two GPTs | Image source: OpenAI

A former executive at a major company, who began building "AI voice assistants" ten years ago, told Geek Park that he had long envisioned the ultimate form of the AI assistant as "multimodal and omnipotent", but the technology at the time could not support it. He believed ChatGPT would accelerate the realization of that vision, yet even he did not expect it to arrive this quickly.

He believes a key indicator of AGI is whether a machine can learn, iterate, and solve problems on its own. That breakthrough may still seem distant, but when two ChatGPTs start chatting with each other, the chasm looks a little shallower.

02 The technical progress and safety of the GPT-4o multimodal large model

These impressive product demonstrations rest fundamentally on the technical progress of the GPT-4o multimodal large model. Its capabilities divide into three parts, text, voice, and images, and GPT-4o has improved on all three, especially the latter two.

In terms of text, according to OpenAI's technical report, GPT-4o surpassed GPT-4 Turbo, GPT-4 (originally released in March 2023), and competitors Claude 3 Opus, Gemini Pro 1.5, Llama 3 400B, and Gemini Ultra 1.0 on MMLU (language), GPQA (knowledge), MATH (mathematics), and HumanEval (programming). For example, on 0-shot CoT MMLU, GPT-4o set a new high score of 88.7%.

GPT-4o's scores on text are quite good | Image source: OpenAI

Crucially, though, the advances had to come in audio, multilingual capability, and vision.

In terms of audio, ChatGPT's previous weakness was that voice had to pass through a pipeline of three separate models, which introduced latency and could not carry rich information. A first model transcribed the audio to text, GPT-3.5 or GPT-4 took that text in and produced text out, and a third model converted the text back to audio. On the one hand, this pipeline delayed the audio round trip, averaging 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. On the other hand, the model lost a great deal of information: it could not directly observe pitch, multiple speakers, or background noise, and it could not output laughter, singing, or emotion.
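
To make that contrast concrete, here is a minimal sketch of such a three-model cascade from a developer's point of view, written against OpenAI's Python SDK. The file names, prompt, and model choices are illustrative assumptions rather than OpenAI's actual production pipeline; the point is that every hop is a separate model call, and the middle model only ever sees plain text.

```python
# A minimal sketch of the old cascaded voice pipeline (assumed setup):
# speech-to-text -> text LLM -> text-to-speech, three separate models.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Speech-to-text: pitch, tone, and background sounds are lost here.
with open("user_question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2) Text-to-text: the LLM only ever sees a flat transcript.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) Text-to-speech: any emotion must be re-synthesized from plain text.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("assistant_reply.mp3")
```

Each of the three calls above is a separate network round trip to a separate model, which is exactly where the multi-second latency and the information loss came from.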

GPT-4o's solution is to train a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. OpenAI calls this its latest step in pushing the boundaries of deep learning. GPT-4o can now respond to audio input in as little as 232 milliseconds, and in 320 milliseconds on average, similar to human response times in conversation. At the same time, GPT-4o outperforms Whisper-v3 (OpenAI's speech recognition model) on audio ASR and speech translation benchmarks.

The M3Exam benchmark serves as both a multilingual and a visual evaluation: it consists of multiple-choice questions, some of which include figures and charts. GPT-4o is stronger than GPT-4 on this benchmark across all languages. In addition, GPT-4o achieves state-of-the-art performance on visual perception benchmarks.

GPT-4o is also good at visual comprehension | Image source: OpenAI

A large-model trainer once told Geek Park that a model's technical leadership has never rested on leaderboard scores, but on users' most genuine feelings and experience. From that perspective, GPT-4o's technical lead should be easy to see.

OpenAI said that GPT-4o's text and image capabilities were rolled out in ChatGPT on the day of the launch. Free users can use them, while paid users enjoy up to 5 times the message limit. In the coming weeks, OpenAI will launch a new alpha version of voice mode powered by GPT-4o in ChatGPT Plus.

Developers can now also access GPT-4o as a text and vision model in the API. Compared with GPT-4 Turbo, GPT-4o is 2x faster, half the price, and has 5x higher rate limits. In the coming weeks, OpenAI plans to roll out GPT-4o's new audio and video capabilities to a small group of trusted partners.
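
As a rough illustration of what that looks like in practice, the sketch below sends a text-plus-image request to GPT-4o through the OpenAI Python SDK. The prompt and image URL are illustrative placeholders, and, as noted above, audio and video were not yet exposed in the API at launch.

```python
# A minimal sketch of a GPT-4o text + vision request via the API.
# The prompt and image URL below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # drop-in replacement for e.g. "gpt-4-turbo"
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is wrong with the code in this screenshot?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

For existing GPT-4 Turbo users, the switch is essentially a one-line model-name change, with the price and rate-limit differences described above.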

One of the most worrying things about a powerful technology is its safety and controllability. This is also one of OpenAI's core considerations.

OpenAI says safety was designed into GPT-4o across modalities, through techniques such as filtering training data and refining the model's behavior with post-training. It also built a new safety system to provide guardrails on voice output. To ensure safety, OpenAI says it will spend the coming weeks and months working on the technical infrastructure, post-training usability, and the safety needed to release the remaining modalities.

03 OpenAI, which has never disappointed the outside world, once again points the way for the tech industry

As the initiator and leader of this wave of AI, OpenAI finds that its every release and update bears on the rise and fall of its huge user base, the company's competitive standing, and the attention and direction of the entire industry.

Before this press conference, there were plenty of rumors and doubts surrounding OpenAI. A week earlier, foreign media reported that OpenAI was going to release a search engine. Meanwhile, the fact that the company would not launch GPT-5 at its most important event of the year drew widespread doubts about its capacity for innovation; without sufficiently innovative technologies and products, it would struggle to reignite user growth and meet the market's overall expectations.

Since ChatGPT launched at the end of 2022, the company's user numbers have seen plenty of ups and downs. Similarweb estimates that its global visits peaked at 1.8 billion in May 2023; in the second half of 2023 they declined, and they have yet to return to that peak.

ChatGPT's traffic growth globally and in the U.S. since November 2022 | Image source: Similarweb

This event, then, bears directly on the growth of its user base.

The outside world remained quite interested in a search engine; Similarweb said the rumor sent ChatGPT's traffic soaring that day. However, two days before the event, OpenAI CEO Sam Altman clarified that neither GPT-5 nor a search engine would be released this time: "but we have been hard at work on some new stuff we think people will love! It feels like magic to me." If anything, his description played it down.

Perhaps people went into this OpenAI event with lowered expectations; what they got in the end was a powerful jolt. That contrast may be exactly what OpenAI wanted.

Whether with GPT-3.5 at the very beginning, GPT-4 this time last year, GPTs at the end of last year, or Sora at the start of this year, OpenAI has once again proven that it does not disappoint. Even as competitors such as Google, Anthropic's Claude, Character AI, and Perplexity grab more new users and capital, OpenAI keeps proving that it can hold the "high ground" of technological innovation.

Comparison of visits to ChatGPT and other chatbots | Image source: Similarweb

Now that OpenAI has launched a GPT-4o-based "super intelligent assistant", this looks set to become the direction major technology companies chase next.

According to foreign media reports, Google has recently been testing the use of artificial intelligence to make phone calls, and is rumored to be preparing a multimodal personal assistant called "Pixie" as an alternative to Google Assistant, one that can view objects through the camera and carry out its owner's commands. We will find out tomorrow at Google I/O.

Foreign media have also recently reported that Apple is close to a deal with OpenAI: at its annual Worldwide Developers Conference in June, Apple may introduce a ChatGPT-powered "chatbot" in iOS 18, which could have a disruptive impact on Apple's own voice assistant, Siri.

It is hard to believe that in just a year and a half, OpenAI has pushed the technology this far, making the once-imagined "super assistant" unfold so quickly before our eyes. And this is only the tip of the iceberg of what OpenAI has unleashed; after all, today we are still talking about an update to GPT-4, not yet GPT-5. How much more shock OpenAI will bring us, and how much worry, remains unknown.

Looking back from somewhere down the tunnel of technology's future, the birth of the GPT-4o "super voice assistant" today may well stand as a landmark moment in the history of the field. But perhaps, as OpenAI COO Brad Lightcap said a few days ago, "In the next 12 months, we should feel ridiculously bad about the [AI] systems we use today."
