
Multimodal features go live as OpenAI lets ChatGPT speak and read images

Author: Three Easy Life

Previously, Google was the undisputed leader in AI, and its open-source deep learning framework TensorFlow was a cornerstone of the AI world. All of that came to an abrupt end in the fall of 2022, when the arrival of OpenAI's ChatGPT quickly eclipsed Google. And what the outside world did not expect was that just a year later, OpenAI would once again steal Google's thunder.

A few days ago, without warning, OpenAI published an announcement titled "ChatGPT can now see, hear, and speak", saying it would roll out voice and image capabilities to ChatGPT Plus and Enterprise users over the following two weeks.


According to information disclosed by OpenAI, the multimodal version of ChatGPT finished training ten months ago. So why was it kept hidden until now, only to be released so suddenly? Outside observers speculate that OpenAI simply could not allow Google to take the lead.

Rumors have recently circulated that Google is about to release its multimodal model Gemini, which may become a game changer in the AI industry. According to Sundar Pichai, Gemini integrates multiple technologies, supports outputting text and images simultaneously, and can also use tools and APIs. In the eyes of outside observers, then, OpenAI naturally had to answer a menacing Google with concrete action.


With this round of updates, ChatGPT can not only understand text entered by the user, but also recognize and understand voice and image input. The voice capability is the more straightforward of the two: it gives ChatGPT abilities similar to Siri or Xiaomi's XiaoAI, offering five different voices to choose from and supporting features such as transcribing voice audio into text and translating podcast content into other languages. In fact, ChatGPT has supported speech-to-text since May of this year, so the arrival of text-to-speech is not especially surprising; a minimal sketch of the speech-to-text side follows below.
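OpenAI has said ChatGPT's voice input is transcribed by Whisper, its speech recognition system, which is also available as a hosted API. Here is a minimal sketch of transcribing an audio clip with the openai Python SDK, assuming the hosted whisper-1 endpoint and a placeholder file name; whether the ChatGPT app uses this exact endpoint is an assumption:

```python
# A minimal sketch of speech-to-text with OpenAI's Whisper API, the same
# speech-recognition family behind ChatGPT's voice input. Requires the
# `openai` package and an OPENAI_API_KEY in the environment; the file
# name below is a placeholder.
from openai import OpenAI

client = OpenAI()

with open("podcast_clip.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # OpenAI's hosted Whisper model
        file=audio_file,
    )

print(transcript.text)
```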

As for ChatGPT's ability to read images, it was first glimpsed this spring when OpenAI demonstrated GPT-4. OpenAI co-founder Greg Brockman sketched a website concept on paper, photographed it, and uploaded it to GPT-4, which immediately generated the HTML code for the site. That ability was simply overshadowed at the time by GPT-4's more dazzling reasoning and judgment.


While these two new capabilities that OpenAI has given ChatGPT may seem a bit mundane, they actually take the ChatGPT experience to the next level.

First, ChatGPT can now understand what the user is saying and reply directly by voice. OpenAI reportedly worked with professional voice actors to give ChatGPT five different synthetic voices: Juniper, Sky, Cove, Ember, and Breeze. The voice capability itself is nothing surprising, since at its core it is text-to-speech (TTS) synthesis.

After more than a decade of development, today's TTS technology is quite mature. The system first segments the input text into words and sentences and marks tone and stress, then determines the text's structure and semantic content; a speech synthesis model then produces acoustic features including pitch, volume, speaking rate, and prosody; finally, waveform synthesis turns those features into audible speech. ChatGPT's advantage is that its conversation is natural and fluid, closely imitating the way humans talk, so that users feel they are talking with a person across the screen rather than a machine.
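To make those three stages concrete, here is a toy walk-through in Python: a front end that segments text and attaches crude prosody marks, an "acoustic model" that maps each token to pitch, duration, and volume, and a stand-in "vocoder" that renders a waveform. This is a didactic simplification, with sine tones in place of speech, not any real synthesizer's code:

```python
# Toy illustration of the classic TTS pipeline: front-end analysis ->
# acoustic features -> waveform synthesis. A didactic sketch only; real
# systems use neural acoustic models and vocoders, not sine tones.
import numpy as np

SAMPLE_RATE = 16_000

def front_end(text: str) -> list[dict]:
    # 1) Front end: split into sentences/words and attach a crude prosody
    # mark (the last word of a question gets a rising pitch).
    tokens = []
    for sentence in text.replace("?", "?|").replace(".", ".|").split("|"):
        words = sentence.split()
        for i, word in enumerate(words):
            rising = sentence.strip().endswith("?") and i == len(words) - 1
            tokens.append({"word": word, "rising": rising})
    return tokens

def acoustic_model(tokens: list[dict]) -> list[dict]:
    # 2) Acoustic model: map each token to pitch (Hz), duration (s), and
    # volume -- the features a real model would predict frame by frame.
    return [
        {"f0": 180.0 if t["rising"] else 120.0, "dur": 0.25, "amp": 0.3}
        for t in tokens
    ]

def vocoder(features: list[dict]) -> np.ndarray:
    # 3) Waveform synthesis: a sine tone per token stands in for a real
    # vocoder rendering speech from the acoustic features.
    chunks = []
    for f in features:
        t = np.linspace(0.0, f["dur"], int(SAMPLE_RATE * f["dur"]), endpoint=False)
        chunks.append(f["amp"] * np.sin(2 * np.pi * f["f0"] * t))
    return np.concatenate(chunks)

audio = vocoder(acoustic_model(front_end("Hello there. Can you hear me?")))
print(f"Synthesized {audio.size / SAMPLE_RATE:.2f} s of audio")
```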


If the voice capability makes ChatGPT more like a "person", then the image-reading capability is arguably the most surprising part of ChatGPT's multimodality. The draft-to-website ability OpenAI previously demonstrated on GPT-4 was tied to what was then called Code Interpreter (since renamed Advanced Data Analysis), which targeted extremely limited scenarios. ChatGPT's new image-reading capability is much closer to users' daily lives; after all, a casual photo is now enough to get a response from ChatGPT.

In the examples OpenAI gives, users can now photograph the contents of their refrigerator and have ChatGPT recommend a recipe; snap a landmark while traveling and have ChatGPT tell the story behind the attraction; photograph a math problem and have ChatGPT solve it; or capture a candlestick (K-line) chart while trading stocks and have ChatGPT read the market for them. A sketch of what such an image-plus-text request could look like programmatically follows below.
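Here is a minimal sketch of such a request using OpenAI's Python SDK. The model name and the photo URL are illustrative assumptions; at the time of the announcement these capabilities were shipping in the ChatGPT app, with API access arriving separately:

```python
# A minimal sketch of asking a vision-capable model about a photo via
# OpenAI's chat completions API. The model name and image URL below are
# assumptions for illustration. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What could I cook with the ingredients in this photo?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/fridge.jpg"}},
        ],
    }],
    max_tokens=300,
)

print(response.choices[0].message.content)
```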


It is worth noting, however, that OpenAI also deliberately limits ChatGPT's image-reading ability. Ask ChatGPT to identify a movie from a screenshot and it will ignore you; ask it to rate a person from a celebrity's photo and it will likewise refuse. Simply put, ChatGPT rejects any request that could raise legal or ethical risks. This is easy to understand: OpenAI, already in the eye of the storm, needs to protect its reputation and avoid being drawn into further controversy.

In tests by users who have already received the update, ChatGPT's image reading has proven to be more than a traditional "image search". One netizen fed it a picture generated by Midjourney, and ChatGPT still accurately identified its content, suggesting that ChatGPT understands images in a genuine sense. Of course, the feature is no panacea: OpenAI notes in related papers that ChatGPT can "hallucinate" in scenarios involving spatial relationships, multi-layer blending, contextual reasoning, and occluded textures.


If that were all, ChatGPT's image reading might not be particularly exciting; its real trump card is recognizing CAPTCHAs. That many users are fed up with CAPTCHAs is an indisputable fact. Faced with all manner of bizarre challenges, especially image-recognition puzzles like those on China's 12306 railway ticketing site, which stump machines and frustrate many humans, users may in the future simply let ChatGPT handle them. However, this capability also brings certain drawbacks.

After all, the CAPTCHA, a technology now ubiquitous on the Internet, was born precisely to distinguish human operations from machine ones. If ChatGPT can accurately solve CAPTCHAs, it shakes the foundations of the entire CAPTCHA system, which, as a reverse "Turing test", has genuinely kept bots out to some degree. Once ChatGPT's CAPTCHA-solving ability is abused by hackers, bots on social platforms such as X, Instagram, Weibo, and Zhihu may become even more rampant.

Perhaps this is the growing pain that new technology brings.
