
OpenAI Releases New Flagship Generative AI Model, GPT-4o, Improves Text, Visual and Audio Capabilities

Author: cnBeta

OpenAI has released a new flagship generative AI model called GPT-4o, which will be rolled out "iteratively" across the company's products in the coming weeks. OpenAI CTO Mira Murati said GPT-4o offers "GPT-4-level" intelligence but improves on GPT-4's text, vision, and audio capabilities, and will be free for all users, with paid users continuing to have "up to five times the capacity limits" of free users.

Speaking during a keynote at OpenAI's offices, Murati said GPT-4o "reasons across voice, text, and vision." OpenAI will also release a desktop version of ChatGPT and a refreshed user interface.

"We know these models are getting more and more complex, but we want the interactive experience to be more natural and simpler, so that you don't have to focus on the user interface at all, but only on working with GPT," Murati said. This is very important because we are looking at the future of interaction between humans and machines. "


GPT-4, OpenAI's previous leading model, could handle a combination of images and text: it could analyze both to complete tasks such as extracting text from images or describing their content. GPT-4o adds speech to the mix.

What does that mean in practice? Several things.


GPT-4o dramatically improves the experience in ChatGPT, OpenAI's viral AI chatbot. ChatGPT has long offered a voice mode that transcribes the user's speech and reads the chatbot's responses aloud with a text-to-speech model. GPT-4o improves on this, letting users interact with ChatGPT more like an assistant.

For example, a user can ask the GPT-4o-powered ChatGPT a question and interrupt it mid-answer. OpenAI says the model delivers "real-time" responses and can even pick up on emotion in the user's voice and generate speech in "a range of different emotional styles."
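The "real-time" feel in the demo comes from OpenAI's own voice pipeline, which, as noted below, is not yet exposed to API customers. The closest public analogue at the text level is the streaming mode of the chat completions API, which emits tokens as they are generated. Here is a minimal sketch, assuming the official openai Python SDK (v1+) and an OPENAI_API_KEY environment variable; this illustrates streaming in general, not the voice pipeline itself.

# Sketch: stream GPT-4o's reply token by token instead of waiting
# for the full response. Assumes the openai Python SDK v1+ and an
# OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a short story about robots."}],
    stream=True,  # yields incremental chunks rather than one final message
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (e.g. the final stop chunk)
        print(delta, end="", flush=True)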


GPT-4o also upgrades ChatGPT's visual capabilities. Given a photo or a desktop screenshot, ChatGPT can now quickly answer questions ranging from "What's going on in this software code?" to "What brand of shirt is this person wearing?"

GPT-4o is available today in ChatGPT's free tier; OpenAI's premium ChatGPT Plus and Team subscribers get a "5x higher" message limit, and an enterprise option is "coming soon." (OpenAI notes that ChatGPT will automatically switch to GPT-3.5 when a user hits the usage cap.) OpenAI says it will roll out the GPT-4o-enhanced voice experience to Plus users within the next month or so.

"We know these models are getting more and more complex, but we want the interactive experience to be more natural and simpler, so that you don't have to focus on the user interface at all, but only on working with [GPT]," Murati said. "

OpenAI also claims that GPT-4o is more multilingual, with improved performance in 50 different languages. In OpenAI's API, GPT-4o is twice as fast as GPT-4 (specifically GPT-4 Turbo), half the price of GPT-4 Turbo, and comes with higher rate limits.
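For reference, calling GPT-4o through the API looks the same as calling earlier chat models, just with the new model name. The sketch below assumes the official openai Python SDK (v1+) and an OPENAI_API_KEY environment variable; the prompt is only illustrative.

# Minimal sketch: a text chat completion against GPT-4o via the API.
# Assumes the openai Python SDK v1+ and an OPENAI_API_KEY environment
# variable; "gpt-4o" is the model name from the announcement.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the GPT-4o announcement in one sentence."},
    ],
)
print(response.choices[0].message.content)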

Users can simply say "Hey, ChatGPT" and receive a lively spoken response from the agent. They then submit their query in plain spoken language, optionally attaching text, audio, and/or visuals, the latter of which can include photos, live footage from the phone's camera, or anything else the agent can "see."

For audio input, the model has an average response time of 320 milliseconds, which the company says is similar to human response times in conversation. In the presentation there was no awkward lag in the agent's answers, which carried a strikingly human-like emotional range. Users can also interrupt the agent mid-answer without derailing the flow of the conversation.

In the demos, GPT-4o acted as a live interpreter for a conversation between an Italian speaker and an English speaker, helped a person solve a handwritten algebra equation, analyzed sections of programming code, and even ad-libbed a bedtime story about robots.

For now, voice functionality is not available to all customers in the GPT-4o API. Citing the risk of abuse, OpenAI says it plans to roll out support for GPT-4o's new audio capabilities to "a small group of trusted partners" first in the coming weeks.


Elsewhere, OpenAI is releasing a revamped ChatGPT UI on the web, with a new home screen and a "more conversational" message layout, as well as a desktop ChatGPT app for Mac that lets users ask ChatGPT questions via a keyboard shortcut and take screenshots to discuss by typing or speaking. (Plus users get access first, starting today; a Windows version of the app will arrive later this year.) In addition, free ChatGPT users can now access the GPT Store, OpenAI's library of third-party chatbots built on its AI models.

GPT-4o's text and image capabilities are rolling out now to paid ChatGPT Plus and Team users, with enterprise availability coming soon. Free users will also get access, subject to rate limits.

The voice version of GPT-4o will begin rolling out "in the coming weeks."

Developers can use GPT-4o's text and vision modes in the API now; audio and video capabilities will open up to a "small group of trusted partners" in the coming weeks.
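As a rough illustration of the vision mode, the sketch below sends an image alongside a text question in a single chat request, again assuming the official openai Python SDK (v1+); the image URL is a placeholder, not a real resource.

# Sketch: GPT-4o vision input via the chat completions API. A message's
# content can be a list mixing text parts and image parts. Assumes the
# openai Python SDK v1+; the URL below is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What brand of shirt is this person wearing?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)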
