
OpenAI and Google Arm-Wrestle Over Large Models, Giving Artificial Intelligence "Eyes, Ears and a Mouth"

Author: CNR

CNR Beijing, May 15 (Reporter Niu Guyue) At 1 a.m. Beijing time on the 15th, Google held its annual I/O developer conference. By Google's own count, CEO Sundar Pichai mentioned AI 121 times in the 110-minute keynote while launching a series of AI-centered products and services. Just the day before, Mira Murati, chief technology officer of OpenAI, the company behind ChatGPT, went on a livestream to announce OpenAI's big spring update, including a desktop version of ChatGPT and its latest flagship large model, GPT-4o, which can reason across audio, vision, and text in real time. Google's newly released suite of AI products is widely seen as a riposte and a direct challenge to GPT-4o.

Humanity's exploration of AI is in full swing, and human-computer interaction has once again taken a great stride, breaking free of the constraints of traditional "voice assistants." These large models, in effect, give AI "eyes, ears, and a mouth," so that it can "experience" your joys and sorrows. Is the future already here?

"Reading human emotions" – a step towards more natural human-computer interaction

At OpenAI's spring event, GPT-5 did not make an appearance, but the debut of GPT-4o was striking nonetheless. According to OpenAI's official website, the "o" in GPT-4o stands for "omni"; it is a multimodal large model built on GPT-4.

OpenAI calls it a step toward more natural human-computer interaction, because it accepts any combination of text, audio, and images as input and can generate any combination of text, audio, and images as output.

Notably, GPT-4o can interact with users in a variety of tones and accurately pick up on changes in their emotions. At the launch event, Mark Chen, OpenAI's head of frontier research, asked GPT-4o to listen to his breathing; the chatbot detected his rapid breaths and advised him not to "breathe like a vacuum cleaner" and to slow down. When Chen then took a deep breath, GPT-4o told him that was the right way to breathe. Researcher Barret Zoph also demonstrated how GPT-4o can observe a user's facial expression through the front-facing camera and analyze their emotion.

"GPT-4o is not only able to understand the user's tone, but also to respond appropriately." Liu Xingliang, president of the DCCI Internet Research Institute, sighed, "Imagine GPT-4o being able to comfort you when you are nervous, make you take a deep breath, and even make a little joke to relieve your stress." This emotion recognition ability makes human-computer interaction more natural and intimate, as if we have a caring friend around us who understands our feelings. ”

A day later, at its I/O developer conference, Google followed with an AI assistant of its own called Project Astra. This general-purpose model captures and analyzes its surroundings through a smartphone camera and can hold a real-time conversation with the user. In the demo video, the user holds up a phone, points the camera at different corners of an office, and interacts with the system by voice. When the user asks, "Please tell me where the smart glasses are," Astra quickly identifies the object and replies in real time. When the user looks out the window, the assistant immediately identifies the location: "This is obviously the King's Cross area in London." It can also understand drawings and images, for instance commenting on a system flowchart sketched on a whiteboard and noting that adding a cache between the server and the database could improve speed.

Demis Hassabis, co-founder and CEO of Google DeepMind, said Project Astra is the prototype of the AI assistant he has been looking forward to for decades and points to the future of general-purpose AI: "AI personal assistants can process information faster by continuously encoding video frames, combining the video and voice input into a timeline of events, and caching that information for efficient recall."
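Hassabis's description follows a familiar pattern: observations from different modalities are stamped onto a single timeline and cached for later recall. The toy Python sketch below illustrates only that general idea; the class and method names are hypothetical and have no connection to Google's actual implementation.

```python
import time
from collections import deque

class EventTimeline:
    """Bounded cache of timestamped observations for later recall (illustrative only)."""

    def __init__(self, max_events: int = 1000):
        self.events = deque(maxlen=max_events)  # oldest events fall off automatically

    def add_frame(self, description: str) -> None:
        # Record a video frame as a short description on the shared timeline.
        self.events.append((time.time(), "video", description))

    def add_utterance(self, text: str) -> None:
        # Record a voice input on the same timeline.
        self.events.append((time.time(), "audio", text))

    def recall(self, keyword: str):
        # Return the most recent cached events that mention the keyword.
        return [event for event in self.events if keyword in event[2]][-3:]

timeline = EventTimeline()
timeline.add_frame("desk with a pair of smart glasses next to a red apple")
timeline.add_utterance("where are the smart glasses?")
print(timeline.recall("glasses"))
```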

"While OpenAI's GPT-4o is strong in natural language processing capabilities, Google has also shown strong competitiveness in multimodal understanding, data richness, and developer support," Liu said. Both have significant strengths in their respective areas of expertise and continue to push the boundaries of AI technology. ”

Quick response! – Response times approaching those of human conversation

From "it can experience your happiness and sadness" to "it can experience your happiness and sadness and give a timely response", the shortening of the response time of the artificial intelligence model makes the human-computer interaction more silky.

At the OpenAI event, people saw GPT-4o's faster response time: it can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, roughly the speed of a human response in conversation. "You could say GPT-4o is the 'Flash' of the AI world; it is outrageously fast," Liu Xingliang observed. "By contrast, traditional voice assistants such as Siri, Alexa, and their domestic counterparts have to go through a tedious process of converting audio to text and back to audio when handling voice input. GPT-4o, trained end to end, processes all inputs and outputs directly, achieving a true millisecond-level response."

The reporter learned that before GPT-4o, ChatGPT's voice mode relied on a relay of several models: first converting audio to text, then handling the text input and output, and finally converting the text back to audio. Much information was lost along the way: the system could not capture intonation, distinguish multiple speakers, or register background noise, and it could not produce laughter, singing, or other emotional expression. GPT-4o is OpenAI's first model to integrate text, visual, and audio inputs and outputs; because a single new model was trained end to end, all input and output processing is handled by the same neural network.
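To make the contrast concrete, here is a minimal, purely illustrative Python sketch of the two architectures described above. Every function in it is a hypothetical stand-in, not a real OpenAI or Google API call.

```python
# Illustrative sketch only: all functions below are hypothetical placeholders.

def transcribe(audio: bytes) -> str:
    """Stand-in speech-to-text stage (tone, speakers, and background noise are lost here)."""
    return "transcribed user speech"

def text_reply(text: str) -> str:
    """Stand-in text-only language model."""
    return f"reply to: {text}"

def synthesize(text: str) -> bytes:
    """Stand-in text-to-speech stage."""
    return text.encode()

def legacy_voice_mode(audio: bytes) -> bytes:
    # Three model hops in a relay: latency accumulates, and paralinguistic
    # detail (intonation, laughter, background sound) never reaches the model.
    return synthesize(text_reply(transcribe(audio)))

def omni_voice_mode(audio: bytes) -> bytes:
    # One end-to-end multimodal model: audio in, audio out, a single pass,
    # which is what makes sub-second responses plausible.
    return b"audio reply generated directly from the audio input"

if __name__ == "__main__":
    sample_audio = b"\x00" * 16  # placeholder audio buffer
    print(legacy_voice_mode(sample_audio))
    print(omni_voice_mode(sample_audio))
```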

Zhou Hongyi, founder and chairman of 360 Group, marveled in a video he posted: "This brings a whole new experience. With a delay of about 300 milliseconds, it reaches the response speed of a conversation between two people; it can not only understand the emotion in your words, but also carry happiness, sadness, disappointment, excitement, or more complex feelings in the answers it gives."

A Guotai Junan research note published on the 15th said that GPT-4o, as a foundational tool, provides fertile ground for more innovative applications. The note argues that its understanding of images and video, combined with efficient real-time interaction, could to some extent replace other single-function AI software, and that the GPT Store is expected to gain more convenient applications and a richer app-store ecosystem. Meanwhile, as GPT continues to spread across desktop and mobile, AI-assistant applications are expected to reach users faster, and new business models may gradually emerge.

