A world where only Google gets hurt has been achieved, but should the "omni model" be followed?

Naojiti (脑极体)

2024-05-19 14:30 · Posted in Henan Science and Technology Creators

Among recent high-profile AI news, the back-to-back product launches from OpenAI and Google undoubtedly grabbed the most headlines.

Our team was no exception: we not only watched both companies' events as they happened, but also tried out GPT-4o, billed as "world-changing" and "science fiction made real".

In a word:

OpenAI press conference, disappointed;

Google press conference, boring.

It's not that we are posing as jaded critics. In fact, AI industry professionals generally share similar opinions.

Some engineers working on AI projects in China said, "I don't really care, because I can't use it anyway." And many AI scientists and experts bluntly said, "I almost fell asleep watching it; nearly everything Google showed was benchmarking and catching up, with little real novelty."

Once again, in this clash with OpenAI, a "world where only Google gets hurt" has been achieved.

Although the new technical directions of the two AI giants are still worth watching, it is clear that as the industrialization of large models deepens, players at home and abroad are growing calmer and focusing more on their own AI strategies and rhythms.

Some have likened the two launch events to a game of Dou Dizhu (Fight the Landlord): OpenAI led with a bomb, and Google answered by slapping down four jokers. So, should the domestic AI industry follow up on the core of this contest, the multimodal large model? And if it does, what problems should be thought through in advance?

Every time a new product lands, being merely "shocked" along with the headlines gets you nowhere. So join us in doing the math on GPT-4o.

The omni model: what exactly is "amazing" about it?

Google's counterpunch to OpenAI's event was a firehose of a launch, rolling out more than a dozen new products and upgrades in one go. The reason people still dozed off watching it is that everyone had already been "wowed" by GPT-4o the day before.

Moreover, most of the other products demonstrated at the Google developer conference had already been released by OpenAI in some form. Gemini Astra, which benchmarks against GPT-4o, performs slightly worse, so it is no wonder interest was lacking. Clearly, this was a precision strike on Google: Google had earlier released a teaser video of its voice-assistant demo, and the most stunning thing about GPT-4o is precisely its "ceiling-level" natural voice interaction.

So what is the magic of this multimodal large model that OpenAI went all-in on and Google prepared so hard for?

The "o" in GPT-4o stands for "omni", which means "omni", and this is the version number, highlighting the multi-functional characteristics of GPT-4o, which can be understood from three aspects:

1. Multimodality.

GPT-4o accepts any combination of text, audio, and images as input, reasons over audio, vision, and text in real time, and generates the corresponding output. Compared with ChatGPT's text-to-text and text-to-image, or Sora's text-to-video, GPT-4o is natively multimodal at its core. Google's Gemini Astra, which supports multimodal reasoning, can do this too: in the demo video, Google's assistant can understand the world captured by the phone's camera (video, images) and describe it in detail.
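
To make "any combination of text, audio, and images" concrete, here is a minimal sketch of a mixed text-plus-image request through the OpenAI Python SDK's chat completions interface. The prompt and image URL are illustrative placeholders, and this public API exposes only part of what the launch demoed (real-time audio in and out went beyond it at the time):

```python
# A minimal sketch of a mixed text + image request via the OpenAI
# Python SDK (openai>=1.0). The prompt and image URL are illustrative
# placeholders; adapt them to your own account and data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this picture."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/street-scene.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```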

Of course, multimodal large models are nothing new. Beyond these two AI giants, China also has R&D efforts in this field; for example, Zhejiang University alumni previously open-sourced LLaVA, a multimodal model benchmarked against OpenAI's GPT-4V. Since multimodal large models are not unusual, why is GPT-4o "amazing"? The answer lies in the second point.

2. Low latency.

GPT-4o is an end-to-end, full-pipeline multimodal large model.

Previously, voice products were generally chained from three separate models: a speech model transcribes audio into text, an LLM takes that text and outputs a text reply, and a second speech model converts the generated text back into audio. The latency of each step adds up, with the result that AI inference cannot keep pace with human speech. Most of us have had the experience: you finish talking and the model still has not finished recognizing your words; the interaction keeps getting interrupted; a lot of information is lost along the way. If the system cannot even hear the basic text clearly, it certainly cannot read emotion from laughter, pauses, and sighs, and naturally people lose interest in talking to it.

GPT-4o's end-to-end approach removes the intermediate processing steps: the same neural network receives and processes input from different modalities (text, vision, audio) and directly produces the output. In this way, voice-interaction latency can be brought down to as little as 232 milliseconds, approaching human conversational response times.
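
The latency arithmetic in the two paragraphs above can be illustrated with a toy simulation. Nothing below is real model code: the stage structure and the millisecond figures are invented placeholders, chosen only to show how a cascaded ASR-LLM-TTS chain pays the sum of its stages while an end-to-end model pays a single inference cost:

```python
# A toy simulation (no real models) of the latency argument above:
# a cascaded ASR -> LLM -> TTS chain pays the sum of its stage latencies,
# while an end-to-end model pays a single inference cost. Every number
# here is an invented placeholder, not a measurement.
import time

def simulate_stage(seconds: float) -> None:
    """Stand-in for one model call; it just sleeps for the given time."""
    time.sleep(seconds)

def cascaded_reply() -> float:
    """Speech in -> text -> text -> speech out, three separate models."""
    start = time.perf_counter()
    simulate_stage(0.30)  # ASR: speech -> text (tone and pauses are lost here)
    simulate_stage(0.70)  # LLM: text -> text
    simulate_stage(0.40)  # TTS: text -> speech
    return time.perf_counter() - start

def end_to_end_reply() -> float:
    """One network handles audio in and audio out in a single pass."""
    start = time.perf_counter()
    simulate_stage(0.30)  # omni model: speech -> speech
    return time.perf_counter() - start

print(f"cascaded:   {cascaded_reply() * 1000:.0f} ms")
print(f"end-to-end: {end_to_end_reply() * 1000:.0f} ms")
```

On top of the summed compute, a cascaded product usually also pays network hops between the three services, and the transcription step discards tone, laughter, and pauses, exactly the information a single end-to-end network can keep.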

After OpenAI's GPT-4o demo, everyone said the human-machine future of the sci-fi film "Her" was about to come true. Google, however, does not see it that way.

(Screenshot from Sam Altman's social media)

At Google's event a day later, Gemini 1.5 Flash actually responded very fast as well, interacting with humans smoothly and almost without delay, though still a bit slower than GPT-4o. Pointedly, Google stressed that both of its demo videos were "shot in a single take and recorded in real time."

We read this as a hint that OpenAI is once again "leading on credit": GPT-4o may not actually ship anytime soon. OpenAI does have a history of misleading marketing; with Sora, it came out that videos edited by artists had been promoted as raw model output, so the demonstrated results were not entirely AI-generated.

Whether the demos are genuine remains for time to verify, but the end-to-end work of OpenAI and Google does show that ultra-low-latency voice interaction, at a level comparable to human conversation, is achievable. That lays a new technical foundation for applying voice interaction across many scenarios.

3. Multi-scenario.

Everyone remembers the shock when ChatGPT first appeared: the strong comprehension and generalization of large language models let NLP disrupt a wide variety of text tasks, and text tasks exist in almost every industry.

Looking at GPT-4o, this multimodal model excels in particular at audio and video understanding, and voice is likewise a ubiquitous, general-purpose interface. It is no exaggeration to say that GPT-4o reaches the "ceiling" of the voice-interaction experience and could bring change to nearly every voice scenario.

For example, OpenAI's demo of tutoring a child through a math problem could take homework duty off parents' hands and keep households harmonious; the "Her"-style scenario of falling for a voice assistant could give everyone an online romance or emotional companion of their own. Extend the list further: phone voice assistants once mocked as "artificial stupidity", customer-service bots in banking and telecoms, remote schools short of qualified teachers, game NPCs that banter with players, precision marketing that reads user emotions......

With the evolution and deployment of end-to-end multimodal large models, more natural, more lifelike, and more emotionally attuned human-computer interaction becomes possible.

Seen this way, the technological foresight GPT-4o represents is indeed worthy of the "omni" in its name. If so, why is Google the only one hurt?

Unhurried: a world where only Google gets hurt

Whenever OpenAI releases a new product, the anticipation and nerves of the Chinese public run about as high as Google CEO Sundar Pichai's; that has almost become routine.

Catering to those expectations, many Chinese media outlets churned out a wave of "world-changing" and "mind-blowing" stories on the morning after OpenAI's spring launch. Some declared it was out to end Google, end Siri, end simultaneous interpretation, and end one-on-one services such as psychological counseling, relationship coaching, and personal training......

Some uninformed onlookers may have believed it, and Google did scramble to respond, but most people in the domestic AI industry just laughed. This may be the first time that, in the face of an OpenAI offensive, a world where only Google gets hurt has been achieved.

Why are domestic AI practitioners so lukewarm about GPT-4o and its Google counterpart Gemini Astra, to the point of dozing off during the launch events?

The first reason, of course, is that the new products fell short of expectations.

Many people were expecting OpenAI to release GPT-5, or failing that, something at least as stunning as Sora; GPT-4o, however, is more an iterative upgrade within the existing technical framework. And Google's previously released Gemini already had multimodal capability. Both sides improved multimodal processing, but neither delivered a fundamental technological leap. Hence the quip that everyone was waiting for a "nuclear bomb" and OpenAI showed up with a firecracker.

Another reason is that OpenAI has cried "wolf" too many times.

That OpenAI knows how to market itself is by now a consensus; after Sora's stumble, many people, investor Zhu Xiaohu among them, said they were "tired of OpenAI's exquisite demo marketing". OpenAI CEO Sam Altman always times the PR push precisely and sets the "atmosphere", yet months later the product has still not been opened to the public.

More and more people are realizing this and have grown distrustful and impatient with OpenAI's "demo launches".

(Screenshot from social media, netizens' comments on OpenAI)

Most important of all, after more than a year of large-model deployment, the upstream and downstream of China's AI industry chain may have become "disenchanted" with OpenAI and with large models themselves.

It is like arriving at a card table someone else has set up: newly seated and unfamiliar with the rules and strategies, you naturally start by watching and copying the other side, rushing to get a large language model out first, and half-consciously taking the onlookers' advice too. You are the one actually building the AI, yet the moment media analysts or netizens call you a "laggard", you panic and scramble to benchmark against ChatGPT and GPT-4, which makes embarrassing flops and PR crises all too easy. For a newcomer to the table, following OpenAI step by step was hard to avoid.

However, more than a year on, the people and companies actually building large models and landing them in industry may not have fully figured out the industrialization and commercialization of China's large models, but one consensus is clear: it cannot be done the way OpenAI and Google do it. The simplest example: GPT-4o gets first access to NVIDIA's most advanced GPUs, a resource domestic vendors can hardly match.

In addition, in the ToB field, requirements such as model controllability and private deployment mean that the intelligentization of domestic enterprises has to start from groundwork such as data cleaning and knowledge bases, rather than directly calling the API of the most advanced model......

These problems have steadily drained the domestic AI industry's interest in chasing each of OpenAI's "blockbuster launches", as it finds its own rhythm and strategy for building large models.

Put these threads together, and Google, the only player still locked in pursuit of OpenAI, ends up the one most hurt by GPT-4o.

What is the return ratio of the multimodal large model?

Of course, no longer blindly chasing OpenAI's rhythm does not mean that the technical direction OpenAI and Google are pushing is unimportant and can be ignored.

Rather, while keeping an eye on the trend, you have to play the whole game deliberately: calculate the return ratio, and work out when, and in what order, to play your cards so that the potential return-to-risk ratio of commercializing large models is highest.

So what are the potential returns and risks of end-to-end multimodal large models such as GPT-4o and Gemini Astra for enterprises?

Let's start with the returns.

For now, the payoff is that, combined with a rich software and hardware ecosystem, such models can land faster and extract the most value.

For example, although Google's Gemini Astra trails GPT-4o in comprehension and latency, Google's share price still rose on the strength of its application ecosystem.

In hardware, Gemini's multimodal capability has been integrated into XR glasses, giving "Google Glass", whose commercialization had long stalled, a fresh shape;

In software, GPT-4o is rumored to be pairing with Apple to accelerate AI on iOS. Google, for its part, is folding multimodal capability into Search, letting users interact with the engine through voice and images, with support for searching video content.

(Screenshot from social media, netizens' comments on GPT-4o)

However, these are all just prospects. In actually landing and integrating with hardware and software, AI companies may also lose chips along the way. The potential risks include:

Long-term losses. Even OpenAI has run into a traffic crisis and begun trading free access for user scale, which means long-term spending on compute and people. AGI is a long-haul undertaking, possibly ten or twenty years; if commercialization fails to scale at each stage, a company betting big and counting on non-linear growth later to turn losses into profit may well collapse midway, its great cause unfinished.

Homogeneous competition. OpenAI's and Google's large models are locked in close combat, and the technology cannot stay closed for long, which means base-model capabilities will soon converge, users will become price-sensitive, and a brutal price war will follow. Without a differentiated revenue model, blindly chasing absolute leadership in the base model will leave margins ever thinner.

Some may say it is vulgar to worry about commercialization and money before even building a domestic GPT-4o.

To be fair, from the perspective of optimizing the return ratio, OpenAI could be called a master of deciding when to play which card. ChatGPT was launched as a chatbot precisely to seize attention, and GPT-5 has been held back; besides speculation that its capability falls short of expectations, release timing is also a consideration. Altman has said more than once: "GPT-5 is powerful, but we haven't decided how to bring these products to market."

For long-term, healthy development, domestic AI companies likewise have to learn to step to the market's rhythm and make the more sensible business choices with the higher return ratio. Netizens' "double standards" toward large models cannot be the guide.

Judging from practical experience with domestic LLM-to-B work (large models for the B-end market), several concrete problems still stand in the way of landing something like GPT-4o.

Take controllability. Text and images generated by AIGC are comparatively easy to police: other models or human reviewers can check output quality and compliance risk after the fact. But if hallucinations, nonsense, or outright illegal content surfaces in serious scenarios such as homework tutoring, doctor consultations, or psychological counseling, how do we catch it in time?
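
One common mitigation, sketched below under stated assumptions, is to gate every generated reply through a moderation model before it reaches the user. The moderation endpoint shown is OpenAI's; the fallback message and overall flow are illustrative, not any vendor's actual design. And in a real-time voice product this extra check itself adds latency, which is precisely the tension with the low-latency experience described earlier:

```python
# A sketch of one common mitigation: gate each generated reply through a
# moderation model before it is shown (or spoken) to the user. The
# moderation endpoint is OpenAI's; the fallback message and the overall
# flow are illustrative assumptions, not any vendor's actual design.
from openai import OpenAI

client = OpenAI()

def safe_reply(generated_text: str) -> str:
    """Return the model's reply only if the moderation check passes."""
    result = client.moderations.create(input=generated_text)
    if result.results[0].flagged:
        # In a tutoring or counseling product, escalate to a human or a
        # canned safe response instead of delivering flagged content.
        return "Sorry, I can't help with that. Let me bring in a human."
    return generated_text
```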

OpenAI's technical blog for GPT-4o notes that the model rates "no higher than medium" on risk dimensions such as cybersecurity; in other words, its safety capability is currently middling at best. Whether C-end consumers or B-end government and enterprise customers, who can comfortably confide their joys, sorrows, and private information to a multimodal model? Dispelling users' security concerns will take thorough, painstaking work on data sources, model training, rule and mechanism design, and product features.

What's more, when startups' and developers' efforts keep getting steamrolled by new base-model capabilities, is that a kind of "backstab" from the big model vendors? What kind of intelligent-voice industry ecosystem would actually attract them to build on it?

If these landing problems go unsolved, the much-touted "Her"-style sci-fi future will live forever only in OpenAI's demos.

In practice, keeping up with the technology is not the real problem. Working out the cost-to-return ratio, reading the cards in your own hand and your opponents', and playing the multimodal large model's commercial game well: that is the harder, more urgent problem.

In the GPT-4o wave, domestic AI companies have no need to rush back to the table.
