
Hello GPT-4o

Author: Silicon Star Man

Today, OpenAI officially launched its latest flagship model, GPT-4o, which can reason across audio, vision, and text in real time.

GPT-4o ("o" stands for "omni") marks a major step toward more natural human-computer interaction. It accepts any combination of text, audio, and images as input and generates any combination of text, audio, and images as output. It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response time in a conversation. It matches GPT-4 Turbo performance on English text and code, with significant improvement on non-English text. In addition, GPT-4o is faster in the API and costs 50% less. GPT-4o is especially strong in vision and audio understanding compared to existing models.
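To make the input side concrete, here is a minimal sketch of sending text plus an image to GPT-4o through the OpenAI Python SDK; the prompt and image URL are placeholders, and an API key is assumed to be available in the OPENAI_API_KEY environment variable.

```python
# Minimal sketch: a text + image request to GPT-4o via the OpenAI Python SDK.
# The prompt and image URL are placeholders; OPENAI_API_KEY must be set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```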

Before GPT-4o, people could use Voice Mode to talk to ChatGPT with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). That voice mode was a pipeline of three separate models: a simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in the text and outputs text, and a third simple model converts that text back to audio. This pipeline means the main model loses a lot of information: it cannot directly perceive intonation, multiple speakers, or background noise, and it cannot output laughter, singing, or expressions of emotion.
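For comparison, below is a rough sketch of that older three-stage pipeline, using OpenAI's public transcription, chat, and text-to-speech endpoints as stand-ins for the internal models; the file names and model choices are illustrative only.

```python
# Sketch of the pre-GPT-4o voice-mode pipeline described above:
# audio -> text -> LLM -> text -> audio. Intonation, speaker identity, and
# background sound are all discarded at the transcription step.
from openai import OpenAI

client = OpenAI()

# 1) Transcribe the user's speech to plain text.
with open("user_question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) Generate a text reply (GPT-3.5 or GPT-4 in the old flow).
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) Convert the text reply back to speech.
speech = client.audio.speech.create(
    model="tts-1", voice="alloy", input=reply.choices[0].message.content
)
with open("assistant_reply.mp3", "wb") as out:
    out.write(speech.content)
```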

Now, OpenAI's new model, GPT-4o, is trained end to end across text, vision, and audio, which means that all inputs and outputs are processed by the same neural network. Since GPT-4o is the first model to combine all of these modalities, the team is still only beginning to explore the model's capabilities and limitations.

Exploration of model capabilities


Model evaluation

Based on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.

Improved Reasoning - GPT-4o sets a new high score of 87.2% on MMLU (a multiple-choice general-knowledge benchmark) under the traditional 5-shot setting.

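For context, the sketch below shows how a 5-shot multiple-choice prompt of this kind is typically assembled; the questions and helper function are invented for illustration and this is not OpenAI's actual evaluation harness.

```python
# Illustrative sketch of a 5-shot MMLU-style prompt: five solved examples are
# placed before the question to be answered. The data here is made up.
def format_example(question, choices, answer=None):
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

few_shot = [
    ("What is the chemical symbol for gold?", ["Ag", "Au", "Gd", "Go"], "B"),
    # ...four more solved examples would follow in a real 5-shot prompt...
]

test_question = ("Which planet is closest to the Sun?",
                 ["Venus", "Mercury", "Earth", "Mars"])

prompt = "\n\n".join(
    [format_example(q, c, a) for q, c, a in few_shot]
    + [format_example(*test_question)]
)
print(prompt)  # The letter the model produces next is compared with the gold answer.
```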

Audio ASR Performance - GPT-4o dramatically improves speech recognition performance over Whisper-v3 across all languages, particularly for lower-resourced languages.

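Speech-recognition comparisons like this are usually reported as word error rate (WER), where lower is better; a minimal self-contained sketch of the metric follows.

```python
# Minimal sketch of word error rate (WER): edit distance over words divided by
# the reference length. Lower is better; this is the metric behind ASR charts.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 0.333...
```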

Audio Translation Performance - GPT-4o sets a new state of the art in speech translation and outperforms Whisper-v3 on the MLS benchmark.

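Speech-translation results like these are commonly scored with BLEU over the translated transcripts; the sketch below assumes the sacrebleu package, and the example sentences are made up.

```python
# Minimal sketch: scoring translated transcripts with BLEU using sacrebleu.
# The sentences are made-up placeholders, not benchmark data.
import sacrebleu

hypotheses = ["the weather is nice today"]       # model's translations into English
references = [["the weather is lovely today"]]   # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```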

M3Exam Assessment - The M3Exam benchmark is both a multilingual and a vision evaluation, consisting of multiple-choice questions from other countries' standardized tests that sometimes include figures and diagrams. GPT-4o outperforms GPT-4 on this benchmark across all languages.


Visual Comprehension Assessment - GPT-4o achieves industry-leading performance on visual perception benchmarks.


Language tokenization

Twenty languages were chosen as representative of the new tokenizer's compression improvements across different language families (the examples include Chinese compression performance).

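One way to see the compression gain directly is to count tokens for the same text under GPT-4 Turbo's tokenizer (cl100k_base) and GPT-4o's new tokenizer (o200k_base) using the tiktoken library; the Chinese sample sentence below is only illustrative.

```python
# Token counts for the same Chinese sentence under the old and new tokenizers.
# Requires a tiktoken version that ships the o200k_base encoding.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

text = "你好，我的名字是GPT-4o。我是一种新型的语言模型，很高兴见到你!"

print("cl100k_base tokens:", len(old_enc.encode(text)))
print("o200k_base tokens:", len(new_enc.encode(text)))
```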

Model safety and limitations

GPT-4o has safety built in by design across all modalities, using techniques such as filtering the training data and refining the model's behavior through post-training. OpenAI has also created new safety systems to provide guardrails on voice outputs.

OpenAI evaluated GPT-4o according to its Preparedness Framework and in line with its voluntary commitments. Evaluations of cybersecurity, chemical, biological, radiological, and nuclear (CBRN) threats, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved a suite of automated and human evaluations run throughout the model training process. The team tested both pre- and post-safety-mitigation versions of the model, using custom fine-tuning and prompts to better elicit the model's capabilities.

GPT-4o has also undergone extensive external red teaming with more than 70 external experts in domains such as social psychology, bias and fairness, and misinformation, to identify risks introduced or amplified by the newly added modalities. These learnings were used to build safety interventions that improve the safety of interacting with GPT-4o.

The team also recognizes that GPT-4o's audio modalities present a variety of new risks. Today, OpenAI is publicly releasing text and image inputs with text outputs. Over the coming weeks and months, work will continue on the technical infrastructure, post-training usability improvements, and the safety measures needed to release the other modalities. For example, at launch, audio output will be limited to a selection of preset voices and will abide by existing safety policies. OpenAI will share further details on the full range of GPT-4o's modalities in the forthcoming system card.

Through testing and iterating on the model, the team observed a number of limitations that exist across all of the model's modalities.


OpenAI welcomes user feedback to help identify tasks where GPT-4 Turbo still outperforms GPT-4o, so the model can continue to be improved.

Model availability

GPT-4o is OpenAI's latest step in pushing the boundaries of deep learning, this time in the direction of practical usability. Over the past two years, the team has worked on many efficiency improvements at every layer of the stack. As a first fruit of this research, a GPT-4-level model can now be made available much more broadly. GPT-4o's capabilities will be rolled out iteratively (with expanded red-team access starting today).

GPT-4o's text and image capabilities are rolling out in ChatGPT starting today. GPT-4o is available to all free users, and Plus users get up to 5x higher message limits. In the coming weeks, a new version of Voice Mode with GPT-4o will launch in alpha for ChatGPT Plus. Developers can also now access GPT-4o in the API as a text and vision model.

Compared with GPT-4 Turbo, GPT-4o is 2x faster, half the price, and has 5x higher rate limits. OpenAI plans to roll out GPT-4o's new audio and video capabilities in the API to a small group of trusted partners in the coming weeks.