
Multimodal features go live as OpenAI lets ChatGPT speak and read images

Author: Three Easy Life

Previously, Google was the undisputed leader in AI, and its open-source deep learning framework TensorFlow was a cornerstone of the AI world. All of that came to an abrupt end in the fall of 2022, when the arrival of OpenAI's ChatGPT quickly eclipsed Google. And what the outside world did not expect was that just a year later, OpenAI would once again steal Google's thunder.

A few days ago, without warning, OpenAI published an announcement titled "ChatGPT can now see, hear, and speak", saying it would roll out voice and image capabilities to ChatGPT Plus and Enterprise users over the following two weeks.


According to information disclosed by OpenAI, the multimodal version of ChatGPT finished training ten months ago. So why was it kept hidden until now, only to be released so suddenly? Outside observers speculate that OpenAI simply could not allow Google to take the lead.

Rumors have recently circulated that Google is about to release its multimodal model Gemini, which may become a game changer in the AI industry. According to Sundar Pichai, Gemini integrates multiple technologies, supports outputting text and images simultaneously, and can also use tools and APIs. In the eyes of outside observers, then, OpenAI naturally had to answer a menacing Google with concrete action.


With this round of updates, ChatGPT can not only understand text entered by the user, but also recognize and understand voice and image input. The voice capability is the more straightforward of the two: it gives ChatGPT abilities similar to Siri or Xiaomi's XiaoAI, offering five different voices to choose from and supporting features such as transcribing voice audio into text and translating podcast content into other languages. In fact, ChatGPT has supported speech-to-text since May of this year, so the arrival of text-to-speech is not especially surprising; a minimal sketch of the speech-to-text side follows below.
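OpenAI has said ChatGPT's voice input is transcribed by Whisper, its speech recognition system, which is also available as a hosted API. Here is a minimal sketch of transcribing an audio clip with the openai Python SDK, assuming the hosted whisper-1 endpoint and a placeholder file name; whether the ChatGPT app uses this exact endpoint is an assumption:

```python
# A minimal sketch of speech-to-text with OpenAI's Whisper API, the same
# speech-recognition family behind ChatGPT's voice input. Requires the
# `openai` package and an OPENAI_API_KEY in the environment; the file
# name below is a placeholder.
from openai import OpenAI

client = OpenAI()

with open("podcast_clip.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # OpenAI's hosted Whisper model
        file=audio_file,
    )

print(transcript.text)
```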

As for ChatGPT's ability to read images, it was first glimpsed this spring when OpenAI demonstrated GPT-4. OpenAI co-founder Greg Brockman sketched a website concept on paper, photographed it, and uploaded it to GPT-4, which immediately generated the HTML code for the site. That ability was simply overshadowed at the time by GPT-4's more dazzling reasoning and judgment.


While these two new capabilities that OpenAI has given ChatGPT may seem a bit mundane, they actually take the ChatGPT experience to the next level.

First, ChatGPT can now understand what the user is saying and reply directly by voice. OpenAI reportedly worked with professional voice actors to give ChatGPT five different synthetic voices: Juniper, Sky, Cove, Ember, and Breeze. The voice capability itself is nothing surprising, since at its core it is text-to-speech (TTS) synthesis.

After more than a decade of development, today's TTS technology is quite mature. The system first segments the input text into words and sentences and marks tone and stress, then determines the text's structure and semantic content; a speech synthesis model then produces acoustic features including pitch, volume, speaking rate, and prosody; finally, waveform synthesis turns those features into audible speech. ChatGPT's advantage is that its conversation is natural and fluid, closely imitating the way humans talk, so that users feel they are talking with a person across the screen rather than a machine.
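To make those three stages concrete, here is a toy walk-through in Python: a front end that segments text and attaches crude prosody marks, an "acoustic model" that maps each token to pitch, duration, and volume, and a stand-in "vocoder" that renders a waveform. This is a didactic simplification, with sine tones in place of speech, not any real synthesizer's code:

```python
# Toy illustration of the classic TTS pipeline: front-end analysis ->
# acoustic features -> waveform synthesis. A didactic sketch only; real
# systems use neural acoustic models and vocoders, not sine tones.
import numpy as np

SAMPLE_RATE = 16_000

def front_end(text: str) -> list[dict]:
    # 1) Front end: split into sentences/words and attach a crude prosody
    # mark (the last word of a question gets a rising pitch).
    tokens = []
    for sentence in text.replace("?", "?|").replace(".", ".|").split("|"):
        words = sentence.split()
        for i, word in enumerate(words):
            rising = sentence.strip().endswith("?") and i == len(words) - 1
            tokens.append({"word": word, "rising": rising})
    return tokens

def acoustic_model(tokens: list[dict]) -> list[dict]:
    # 2) Acoustic model: map each token to pitch (Hz), duration (s), and
    # volume -- the features a real model would predict frame by frame.
    return [
        {"f0": 180.0 if t["rising"] else 120.0, "dur": 0.25, "amp": 0.3}
        for t in tokens
    ]

def vocoder(features: list[dict]) -> np.ndarray:
    # 3) Waveform synthesis: a sine tone per token stands in for a real
    # vocoder rendering speech from the acoustic features.
    chunks = []
    for f in features:
        t = np.linspace(0.0, f["dur"], int(SAMPLE_RATE * f["dur"]), endpoint=False)
        chunks.append(f["amp"] * np.sin(2 * np.pi * f["f0"] * t))
    return np.concatenate(chunks)

audio = vocoder(acoustic_model(front_end("Hello there. Can you hear me?")))
print(f"Synthesized {audio.size / SAMPLE_RATE:.2f} s of audio")
```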


If the voice capability makes ChatGPT more like a "person", then the image-reading capability is arguably the most surprising part of ChatGPT's multimodality. The draft-to-website ability OpenAI previously demonstrated on GPT-4 was tied to what was then called Code Interpreter (since renamed Advanced Data Analysis), which targeted extremely limited scenarios. ChatGPT's new image-reading capability is much closer to users' daily lives; after all, a casual photo is now enough to get a response from ChatGPT.

In the examples OpenAI gives, users can now photograph the contents of their refrigerator and have ChatGPT recommend a recipe; snap a landmark while traveling and have ChatGPT tell the story behind the attraction; photograph a math problem and have ChatGPT solve it; or capture a candlestick (K-line) chart while trading stocks and have ChatGPT read the market for them. A sketch of what such an image-plus-text request could look like programmatically follows below.
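Here is a minimal sketch of such a request using OpenAI's Python SDK. The model name and the photo URL are illustrative assumptions; at the time of the announcement these capabilities were shipping in the ChatGPT app, with API access arriving separately:

```python
# A minimal sketch of asking a vision-capable model about a photo via
# OpenAI's chat completions API. The model name and image URL below are
# assumptions for illustration. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What could I cook with the ingredients in this photo?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/fridge.jpg"}},
        ],
    }],
    max_tokens=300,
)

print(response.choices[0].message.content)
```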


It is worth noting, however, that OpenAI also deliberately limits ChatGPT's image-reading ability. Ask ChatGPT to identify a movie from a screenshot and it will ignore you; ask it to rate a person from a celebrity's photo and it will likewise refuse. Simply put, ChatGPT rejects any request that could raise legal or ethical risks. This is easy to understand: OpenAI, already in the eye of the storm, needs to protect its reputation and avoid being drawn into further controversy.

In tests by users who have already received the update, ChatGPT's image reading has proven to be more than a traditional "image search". One netizen fed it a picture generated by Midjourney, and ChatGPT still accurately identified its content, suggesting that ChatGPT understands images in a genuine sense. Of course, the feature is no panacea: OpenAI notes in related papers that ChatGPT can "hallucinate" in scenarios involving spatial relationships, multi-layer blending, contextual reasoning, and occluded textures.


If that were all, ChatGPT's image reading might not be particularly exciting; its real trump card is recognizing CAPTCHAs. That many users are fed up with CAPTCHAs is an indisputable fact. Faced with all manner of bizarre challenges, especially image-recognition puzzles like those on China's 12306 railway ticketing site, which stump machines and frustrate many humans, users may in the future simply let ChatGPT handle them. However, this capability also brings certain drawbacks.

After all, the CAPTCHA, a technology now ubiquitous on the Internet, was born precisely to distinguish human operations from machine ones. If ChatGPT can accurately solve CAPTCHAs, it shakes the foundations of the entire CAPTCHA system, which, as a reverse "Turing test", has genuinely kept bots out to some degree. Once ChatGPT's CAPTCHA-solving ability is abused by hackers, bots on social platforms such as X, Instagram, Weibo, and Zhihu may become even more rampant.

Perhaps this is the growing pain that new technology brings.
