
GPT-4o opens a super entrance for OpenAI, posing a challenge to Google?

Author: The Paper

With ChatGPT or GPT-4o as a foundation, the way humans obtain information is likely to change, and GPT-4o may open a super entrance for OpenAI, with a potential impact on Google. Next, OpenAI needs to judge whether an ultimate product experience is something users truly need.

"GPT-4o is a huge step forward in ultra-natural interaction." On May 14, Xu Peng, vice president of Ant Group and head of NextEvo, told The Paper. In the early morning of May 14, 2024, OpenAI showed the world its latest multimodal large-model product, GPT-4o, where the "o" stands for "omni", meaning all-encompassing.


Compared with existing models, GPT-4o demonstrates excellent visual and audio understanding. With its arrival, outside observers have speculated that the era depicted in the American science-fiction film "Her" is approaching us step by step. Released in 2013, "Her" tells the story of a man who falls in love with a voice assistant.

Competing with Google for native multimodality?

According to Mira Murati, OpenAI's chief technology officer, GPT-4o can reason in real time across audio, vision, and text, accepting any combination of text, audio, and images as input and generating any combination of text, audio, and images as output. It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.
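The "any combination in, any combination out" idea described above can be made concrete with a small sketch. The message structure below follows OpenAI's publicly documented Chat Completions format for mixing text and image inputs; the model name "gpt-4o", the prompt, and the image URL are illustrative assumptions, and the sketch only builds the request payload rather than sending it.

```python
# Sketch: composing one user message that mixes text and an image,
# in the style of OpenAI's Chat Completions message format.
# The prompt text and image URL below are hypothetical placeholders.

def build_multimodal_message(text: str, image_url: str) -> dict:
    """Combine a text prompt and an image reference into a single user message."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

payload = {
    "model": "gpt-4o",  # assumed model identifier
    "messages": [
        build_multimodal_message(
            "What is shown in this picture?",
            "https://example.com/photo.jpg",
        )
    ],
}
```

Because both modalities travel in one message to one model, there is no hand-off between a separate vision system and a separate language system, which is what allows the low end-to-end latency described above.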

Fu Sheng, chairman and CEO of Cheetah Mobile, said that although GPT-4o has "disappointed" some AI practitioners, he also pointed out that "GPT-4o is equivalent to combining a series of engines, such as images, text, and sound, so that users don't need to switch back and forth. Most importantly, because the voice assistant released this time uses end-to-end large-model technology, it can perceive emotional changes in real time and interject when it is time to interject. In fact, this is the future of large models."

Xu Peng said in an interview with The Paper that although OpenAI did not launch the GPT-5 the public expected, GPT-4o is a huge step forward in ultra-natural interaction. Compared with GPT-4, the biggest difference is that all modalities are integrated into a single model, and the multimodal integration is more refined: latency is only about 300 milliseconds, and at the same time the model can perceive emotions, tones, and expressions to achieve more natural interaction. This requires capabilities in data organization, focused breakthroughs, and engineering optimization, and it also expands people's imagination about what interaction can be.

Xu Peng said that OpenAI's goal is to deeply integrate speech capabilities with language understanding; as early as the GPT-3 era, its automatic speech recognition system Whisper was preliminary research in this direction. "Putting data from various modalities such as voice, images, video, and text under a unified representation framework is a very natural way to pursue what they see as AGI (artificial general intelligence), because humans are also agents who understand and interact through multiple modalities."

Xu Peng believes that after Google launched its native multimodal Gemini model in December 2023, OpenAI has been preparing to compete in the native multimodal field: rather than "patching together" separate single-modality models, a native multimodal model is trained on multiple modalities (e.g., audio, video, and images) from the start.

GPT-5 may have a difficult birth for a while?

"Image, text, speech, and video encodings all go into one model, where they share unified representations. These data are fed to the model for training, and the model learns each modality; as long as their information is related, the internal representations end up very close, which makes generation more flexible." Xu Peng said that because the internal representations are fused, GPT-4o can output generated speech at the fastest possible speed, achieving low-latency, seamless interaction. "OpenAI's engineering capability is truly amazing: with so many modalities and a very large number of input tokens, it can still output with a latency of two or three hundred milliseconds, which is rare engineering progress."

At present, OpenAI executives have not disclosed what data was used to train the GPT-4o model, nor whether OpenAI can train it with less computing power. Xiong Weiming, a technology investor and founding partner of China Growth Capital, told The Paper that although OpenAI did not disclose many technical details of GPT-4o's training at the launch event, it can be inferred that this kind of end-to-end large-model technology relies on strong computing-power support. "Brute force works miracles. The computing-power market in the United States is indeed much more mature, and the capital market also supports large-scale investment in computing power," Xiong Weiming said.

Fu Sheng believes that if one simply piles on parameters regardless of cost to improve so-called large-model capability, this road will certainly run into difficulties. He expects that GPT-5 may be difficult to deliver for some time.

The super entrance is already open?

According to OpenAI's official website, GPT-4o's text and image capabilities have begun rolling out for free in ChatGPT, with Plus users enjoying five times the usage quota. The new voice mode will be rolled out to Plus users in the coming weeks, and support for GPT-4o's new audio and video capabilities will be made available on a small scale through the API (Application Programming Interface).

Xiong Weiming believes that OpenAI's product strategy can, on the one hand, attract free users and collect a large amount of user data to feed back into model training, which will help further improve the product. "On the other hand, cultivating users' willingness to pay is also an attempt at commercialization."

"I think OpenAI's attempt may change the habits of some domestic users when using software. People may become willing to pay to use AI platforms," Xiong Weiming said.

Xu Peng believes that the services OpenAI has opened to users for free are built on GPT-4o's native multimodal capabilities, and that in the future more companies can develop more natural, vertical interactive products on top of GPT-4o.

In the past week, foreign media have reported that OpenAI will launch an AI search product. Although OpenAI has not yet launched a search engine, Xu Peng believes that, based on ChatGPT or GPT-4o, the way humans obtain information is likely to change: GPT-4o may open a super entrance for OpenAI, which could have an impact on Google. Next, OpenAI needs to judge whether an ultimate product experience is something users truly need.

Chen Lei, vice president and head of big data and AI at Xinye Technology, told The Paper that from a technical standpoint, the release of GPT-4o is of cross-era significance, truly realizing multimodal interaction; what deserves more attention is how it will be commercialized going forward. "Speech recognition and speech generation are not the hardest parts; the hardest is inference. GPT-4o is harder to build than what came before. Tuning the algorithm to a certain level can achieve fluent interaction, but thinking, reasoning, inducing, and summarizing like a human is the embodiment of higher intelligence."

Chen Lei also said that while China was still benchmarking against GPT-4, OpenAI launched GPT-4o. What the industry needs to think about is how to differentiate amid the continuous catch-up. "We are always chasing; just as we catch up to a certain point, we find a new generation of products has launched, and we always feel we are falling behind. So we have to adjust our mentality and find another way," Chen Lei said.

Fu Sheng believes that OpenAI's release of the GPT-4o model shows that large models have a great future at the application level; model capabilities will keep iterating, but whether a large model is ultimately used well depends on the application.
