GPT-4o focuses on end-to-end applications, and OpenAI cares more about emotional value

Author: Titanium Media APP
Text | The Home of the Large Model

On May 13, at OpenAI's spring launch event, CTO Mira Murati unveiled the company's new flagship model, GPT-4o, to the world. OpenAI devoted much of the event to showing how far GPT-4o's interaction with humans on mobile devices has advanced, emphasizing the new model's multimodal capabilities through live conversation with GPT-4o, singing, and real-time problem solving.

According to OpenAI, before GPT-4o, when users spoke with ChatGPT in voice mode, the average latency was 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4. GPT-4o cuts this latency dramatically, to as little as 320 milliseconds.

This is because the previous voice mode was a pipeline of three separate models: one transcribes audio to text, GPT-3.5/GPT-4 processes the text, and a third converts text back to audio. Information is lost along the way: the core model cannot directly perceive intonation, multiple speakers, or background noise, and it cannot output laughter, singing, or emotion. GPT-4o, by contrast, is a single new model trained end-to-end across text, vision, and audio, which means all inputs and outputs are processed by the same neural network.
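For illustration, here is a minimal sketch of what such a three-stage pipeline looks like when assembled from OpenAI's public API with the openai Python SDK. The model names are real API models, but the file names are placeholders, and this is of course not OpenAI's internal implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stage 1: speech-to-text. Intonation, speaker identity, and background
# sound are discarded here -- only the transcript survives.
with open("question.mp3", "rb") as audio_file:  # placeholder file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Stage 2: a text-only model reasons over the transcript.
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = completion.choices[0].message.content

# Stage 3: text-to-speech. The reply cannot carry laughter or genuine
# emotion, because the reasoning model never "heard" the user at all.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer,
)
with open("answer.mp3", "wb") as f:
    f.write(speech.content)
```

Each hop adds latency and drops information, which is precisely what a single end-to-end model avoids.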

Multimodal upgrade: does GPT want to be humanity's confidant?

GPT-4o has been significantly optimized for both performance and efficiency. Thanks to improvements in model architecture and training methods, it delivers higher accuracy and faster responses on complex tasks. OpenAI says the launch of GPT-4o will give users an unprecedented experience, with markedly better results in areas such as natural language processing, dialogue systems, data analysis, and programming assistance.

According to the official website, GPT-4o matches GPT-4 Turbo-level performance in text, reasoning, and coding intelligence, while setting new high-water marks in multilingual, audio, and visual capabilities.

  • Text Evaluation:

In text processing, GPT-4o set a record score of 88.7% on the MMLU general-knowledge benchmark under the 0-shot CoT (chain-of-thought) setting. This shows that GPT-4o can carry out complex reasoning and answer questions with no worked examples in the prompt. GPT-4o also posted a new high of 87.2% on the traditional 5-shot, no-CoT MMLU test. These results were produced with a new evaluation library, which helps ensure their reliability and comparability. The improvements raise not only the model's reasoning ability but also its applicability to a wide range of tasks.
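To make the two evaluation settings concrete, here is an illustrative sketch in Python of how the two prompt styles differ. The question and examples are invented, and this is not OpenAI's actual evaluation harness:

```python
# Hypothetical MMLU-style question, for illustration only.
question = (
    "Which planet is closest to the Sun?\n"
    "A. Venus  B. Mercury  C. Earth  D. Mars"
)

# 0-shot CoT: no worked examples; the model is asked to reason step by
# step before committing to an answer.
zero_shot_cot = (
    f"{question}\n"
    "Let's think step by step, then answer with a single letter."
)

# Traditional 5-shot, no CoT: five solved examples precede the real
# question, and the model answers directly, without visible reasoning.
solved_examples = "\n\n".join(
    f"Question {i}: <example question>\nAnswer: <letter>"
    for i in range(1, 6)
)
five_shot_no_cot = f"{solved_examples}\n\nQuestion 6: {question}\nAnswer:"
```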

  • Audio ASR Performance:

In audio processing, GPT-4o significantly improves automatic speech recognition (ASR) performance over Whisper-v3, especially for low-resource languages. This means GPT-4o can not only handle mainstream languages but also deliver high-quality speech recognition across a much wider range of languages.

  • Audio Translation Performance:

GPT-4o also sets a new standard in speech translation, outperforming Whisper-v3 on the MLS benchmark and demonstrating superior ability in cross-language communication and translation.

  • M3Exam Zero-Shot Results:

In multilingual and visual evaluation, GPT-4o performed well across all languages on the M3Exam benchmark. This shows that GPT-4o excels not only in single-language settings but can also handle complex tasks in multilingual environments, demonstrating strong cross-language understanding and processing.

  • Visual Understanding Evaluation:

In visual understanding, GPT-4o achieves state-of-the-art performance across multiple visual perception benchmarks, including 0-shot MMMU, MathVista, and ChartQA. In other words, GPT-4o maintains a high level of visual comprehension and reasoning even in a zero-shot setting, with no task-specific examples. This makes it excellent at working with images, charts, and complex visual information, further expanding its potential in real-world applications.

GPT-4o thus delivers significant performance gains across several technical areas, and its multimodal capability makes it practical in many more application scenarios. OpenAI also exposes the model through its standard API, making it easy for developers to integrate GPT-4o into their own applications; the model is accessible from a variety of platforms and programming languages, further enhancing its flexibility and ease of use.
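As a minimal sketch of such an integration, the following Python snippet sends text plus an image to GPT-4o through the chat completions endpoint. The prompt and image URL are placeholders for illustration:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single request can mix modalities: here, a text question about an
# image. The URL below is a placeholder, not a real chart.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```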

Focusing on device-side applications, OpenAI cares more about emotional value

Throughout the event, OpenAI seemed less intent on highlighting how the technology reshapes industries than on showing that artificial intelligence can not only improve quality and efficiency for businesses, but also better serve people's daily lives.

Perhaps this is one of the reasons Sam Altman chose the more approachable female CTO, Mira Murati, to host the event.


Beyond the real-time voice conversation, R&D lead Barret also showed off GPT-4o solving math problems in real time. He wrote out an equation by hand and pointed the camera at it so GPT-4o could tutor him through it online. With the voice assistant's step-by-step guidance, Barret solved the problem with ease.

OpenAI also demonstrated GPT-4o on a variety of tasks such as coding and real-time translation. Although some demos still produced errors, the whole event proceeded in a very relaxed atmosphere. It gave enterprises and research teams a new direction and reference point, while letting more consumer (C-end) users feel what AI-empowered products are like.


On the PC side, OpenAI launched a new macOS app designed to streamline workflows. The app is available to both free and paid users and integrates seamlessly with whatever the user is doing on the computer. With a simple keyboard shortcut, users can instantly ask ChatGPT a question or take a screenshot in the app to discuss.

For Plus users, the macOS app will offer additional features and services. OpenAI plans to roll the app out to a wider user base in the coming weeks, and to launch a Windows version later this year to meet the needs of different users.

The Home of the Large Model believes that the consumer (C-end) market occupies a crucial position in the commercialization of artificial intelligence. As a leading AI company, OpenAI attaches great importance to the C-end user experience; this not only meets user needs, but also paves the way for its further commercial exploration of that market.

In the C-end market, user needs are diverse and change rapidly, so an intelligent, user-friendly experience matters all the more. That means not only optimizing the model and algorithms, but also making the interface friendlier and easier to use, so that users enjoy a smooth, natural interaction with the product.

It is worth noting that OpenAI chose to hold its event the day before Google I/O 2024, stealing some of its rival's thunder. This reflects not only the importance OpenAI places on the consumer multimodal large-model market, but also its aggressive posture and strategic vision in the competitive landscape of the large-model business.

The move has undoubtedly won OpenAI a bigger voice in the industry. In tech, that voice is often tied to influence, market share, and commercialization potential. Through this strategy, OpenAI attracted a great deal of user and media attention, further consolidating its leading position in AI.

More importantly, OpenAI's move points directly at the consumer entry point for multimodal large models. As the technology advances and application scenarios keep expanding, multimodal large models have become an important direction for the AI field: they can process text, images, audio, and other forms of information, and they enable more intelligent, natural human-computer interaction, giving users a more convenient and richer experience. Whoever seizes this entry point, therefore, has a chance to take the lead in future market competition.

From the perspective of commercial competition among large models, making GPT-4o freely available has undoubtedly intensified competition in the industry, while also signaling OpenAI's "ambition" for commercial expansion. Google, Meta, and the other tech giants clearly will not sit still. In this "fully upgraded" large-model business war, how will Google fight back? The answer should be revealed soon.
