
Head of Ant's Bailing large model: The release of GPT-4o is no surprise, and the direction of native multimodality is clear

Author: QbitAI (量子位)

Bai Jiao, reporting from Aofei Temple | QbitAI official account

GPT-4o's biggest improvement over the previous version is the fineness of its integration: it integrates all modalities in a single end-to-end model (All in One).

OpenAI has three key capabilities worth learning from: data organization, technical focus, and engineering optimization. If these key points are done well, it is possible to develop a model with a similar effect.

Native multimodal large models will very likely become the main point of competition among domestic large models.

OpenAI has shaken up the world again with GPT-4o. How will it affect the industry this time? These are the views of Dr. Xu Peng, head of NextEvo, Ant Group's AI innovation R&D and application department.


Who is Dr. Xu Peng?

Dr. Xu Peng is currently a Vice President of Ant Group and the head of NextEvo, the company's AI innovation R&D and application department. He worked at Google for 11 years, where he led core technology development for Google Translate and participated in algorithm development for Google's display advertising system. NextEvo undertakes all of Ant's core AI technology R&D, including all research and development work on the Ant Bailing large model.

According to Xu Peng, Ant identified native multimodality as a technical direction at the beginning of this year and has continued to invest in it; it is currently developing products related to all-modal digital humans and all-modal agents.

GPT-4o is OpenAI's first large model to integrate all modalities, and its response speed is remarkable. How exactly is this achieved? What are the implications for the industry and for the companies betting on large models?


QbitAI spoke with Dr. Xu Peng at the first opportunity. With the help of a large model, and without changing the original meaning, QbitAI has organized the conversation as follows.

Dialogue with Xu Peng, head of Ant's Bailing large model

1. What do you think of OpenAI's release of GPT-4o?

Xu Peng: The demo OpenAI showed this time is not unexpected. From OpenAI's perspective, they have always wanted to deeply integrate speech capabilities with language understanding capabilities. A few years ago they launched the Whisper speech recognition model, which can be seen as their early research in this area.

They integrate data of various modalities, including speech, images, video, and text, under a unified representation framework. This is actually a natural path for them toward artificial general intelligence (AGI), because in their view humans are exactly this kind of multimodal agent, understanding the world and interacting with it through multiple modalities, and the agents they build will ultimately develop in the same direction.

In terms of effect, GPT-4o's biggest improvement over the previous version is the fineness of its integration.

It integrates all modalities in one end-to-end model, whereas the previous GPT-4 handled speech recognition and voice response with three separate modules. Those modules already provided a pretty good experience, but a response could take a second or two.

With this integration, GPT-4o achieves a response latency of about 300 milliseconds and can perceive human emotions as well as other non-speech signals, which is a very significant improvement.

It also brings to mind the similar capabilities Google may show at tomorrow's Google I/O, since Google has already emphasized natively multimodal models as a key feature. So, even though we were looking forward to the launch of GPT-5, it is completely understandable that they launched this product instead: it is a huge step forward in intelligent interaction, especially in this extremely natural mode of interaction.

2. What is most impressive about this release?

Xu Peng: I think one of OpenAI's most impressive traits is that after Google launched the native multimodal large model Gemini, they clearly began to respond to the competition in a planned way.

In terms of integrating resources and focusing on breakthroughs, their organizational ability is indeed admirable. To develop such a product, both data preparation and end-to-end model training require processing extremely large amounts of data. Even with GPT-4 as a foundation, training the model to reach a response latency under 300 milliseconds is undoubtedly a test of their ability to organize data, focus technically, and optimize engineering, and that is truly commendable.

Learning from their practice, if these key points are done well, we are likely to be able to develop a model with a similar effect.

Over the past six months, I have noticed that the industry, including some domestic companies, has invested considerably in native multimodality. While these companies may not be as fast as OpenAI, they have also made progress in this area, especially in end-to-end speech models. Ant Group is among them: it made its strategic judgment and major investments in native multimodality at the beginning of this year.

3. What is the difference between multimodality and native multimodality?

Xu Peng: In my opinion, the main difference between multimodality and native multimodality is whether the system relies on the simple collaboration of multiple models or on a single model completing all tasks end to end.

Taking GPT-4 as an example, it can convert speech into text through a speech recognition model, extract image content through an image recognition model, and then use the GPT-4 large language model as the central controller to generate a high-quality answer. Once the answer is complete, the system decides whether to return an image, a piece of text, or, through speech synthesis, a piece of speech output.

These features are all possible in GPT-4, but it is not a native multimodal model; rather, it is a combination of multiple models, each trained with its own independent training objective.

In contrast, a native multimodal model integrates the encodings of images, text, speech, and even video into a single model. During training, the data of these different modalities are fed into the model and learned jointly. When information from different modalities is related and points to the same kind of thing, their internal representations become very similar. In the generation stage, the model can then use this unified representation more flexibly to generate output in different modalities.

Therefore, the core difference lies in whether the data of all modalities is processed at the same time during the model training process, or whether it is optimized for different goals.
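To make the contrast concrete, here is a minimal Python sketch of the two architectures described above. All class and method names are hypothetical placeholders, not real OpenAI or Ant Bailing APIs; the sketch only illustrates where the modality boundary sits.

```python
# A minimal, illustrative sketch of the two architectures described above.
# All class and method names here are hypothetical placeholders, not real
# OpenAI or Ant Bailing APIs; the point is only where the modality boundary sits.

class PipelineAssistant:
    """GPT-4-style combination: separate models, each with its own training objective."""

    def __init__(self, asr, llm, tts):
        self.asr, self.llm, self.tts = asr, llm, tts

    def reply(self, audio):
        text = self.asr.transcribe(audio)      # speech -> text (model 1)
        answer = self.llm.generate(text)       # text -> text (model 2)
        return self.tts.synthesize(answer)     # text -> speech (model 3)


class NativeMultimodalModel:
    """GPT-4o-style model: one network over a single, shared representation space."""

    def __init__(self, tokenizer, transformer, detokenizer):
        self.tokenizer = tokenizer       # maps audio/image/text into unified tokens
        self.transformer = transformer   # one end-to-end model trained on all modalities
        self.detokenizer = detokenizer   # maps tokens back out as speech, text, or image

    def reply(self, audio=None, image=None, text=None):
        tokens = self.tokenizer.encode(audio=audio, image=image, text=text)
        output = self.transformer.generate(tokens)
        return self.detokenizer.decode(output)   # output modality chosen by the model
```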

4. Is it difficult to shift from traditional large model technology to native multimodal large models?

Xu Peng: The technology itself may not be as difficult as everyone thinks. The real challenge is how, in actual operation, to effectively aggregate data from multiple modalities and then build an end-to-end model that integrates the various capabilities.

It is not just an engineering challenge; it is also about how the data is prepared and what methods are used during training to make things go smoothly. Training a model like this runs into all kinds of small problems along the way, and solving them requires accumulated experience and knowledge.

5. Will it become a competitive point for domestic large-scale model companies?

Xu Peng: I think there is a high probability that it will. But whether it is a large company or a start-up, the key is to focus on capabilities, and then continue to optimize in this field.

6. How is GPT-4o's low latency related to end-to-end training?

Xu Peng: The two are directly related.

Take the existing GPT-4 as an example. When performing speech recognition, the system has to wait until the user finishes a sentence before it can recognize the whole sentence. Once recognition is complete, the entire sentence is fed into the language model, which generates a response based on it. Only then can the speech synthesis model be invoked to convert the response into speech.

Some optimization is possible within this pipeline, but only to a point. For example, it is hard to fully stream the output of speech recognition into the large model's comprehension, because some utterances can only be understood once they have been fully expressed. Similarly, the more content speech synthesis has to work with, the more natural the synthesized tone and intonation will be.

Now that these functions are integrated into a single model, there are fewer dependencies between the parts. Because the representations inside the model are already fused together, it can start generating speech output much sooner, without having to wait for all the preceding information to be processed. Within this model, the information is already represented in a unified way, so it can be processed as a whole rather than waiting for everything to be ready.
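As a rough illustration of this latency argument, here is a hedged Python sketch contrasting the two flows. The function and method names are hypothetical, and no timings are implied; the point is only that the pipelined version blocks at each stage, while a unified model can begin emitting output before the input is fully consumed.

```python
# An illustrative-only sketch of the latency argument. Function and method
# names are hypothetical; nothing here is a real measurement or real API.

def pipelined_reply(audio_stream, asr, llm, tts):
    """Old pipeline: each stage waits for the previous one to finish a full sentence."""
    full_audio = list(audio_stream)       # wait for the user to stop speaking
    text = asr.transcribe(full_audio)     # then recognize the whole sentence
    answer = llm.generate(text)           # then generate the whole answer
    return tts.synthesize(answer)         # then synthesize speech for it


def end_to_end_reply(audio_stream, model):
    """Unified model: consume input incrementally and start emitting output early."""
    for chunk in audio_stream:
        model.consume(chunk)                        # update the shared internal state
        # Because input and output live in one representation space, the model
        # can begin producing speech tokens before the utterance has ended.
        yield from model.emit_ready_speech_tokens()
    yield from model.finish()                       # flush whatever remains
```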

7. How do you see its commercial value?

Xu Peng: I don't think this launch is really a direct commercialization move. At the event, they mentioned that GPT-4o will be made available to the public for free.

From this point of view, OpenAI seems to value the future development potential built on this capability: it expects that more companies will develop and commercialize more natural, human-friendly, and professional products on top of it.

In the past, due to the limitations of technical conditions, it was difficult to achieve breakthroughs in some innovative ideas in product design, and what could be done was relatively limited. However, the emergence of the GPT-4o model has raised the upper limit of development a lot, so that enterprises in different industries can more confidently hand over the interaction tasks to this natural interaction mode when designing their business.

8. Can it be understood that OpenAI wants to form a new entrance, or become a super entrance?

Xu Peng: I think that's the case. Their cooperation with Apple and their self-developed search engine are both moves in this direction.

9. OpenAI has now shown a new form of software. How should the relationship between technological innovation and business model be balanced? How have partnerships with the likes of Apple and Duolingo developed?

Xu Peng: My understanding is that technical capability is important, but to truly achieve effective practical applications, we also need a deep understanding of the core needs of different business fields and industries. Only by understanding the problems an industry faces can the application of technology bring about a transformation of business models, which is exactly what we expect from AI: to drive new business models through the development of AI technology.

At present, OpenAI seems to be more focused on in-depth preparation at the technical level. Their earlier launch of the GPT Store was designed to encourage developers to build their own applications on GPT technology. For now, however, these applications may not yet have the depth and breadth OpenAI hoped for to spark industry change.

But I think OpenAI's technology demonstration this time may inspire more anticipation and exploration, and more industry players may be willing to use its technical capabilities to explore business models in greater depth. While it is still unclear whether this will be commercially successful, I believe it will take both a deep understanding of the industry and real integration within the industry to achieve substantial change. The foundation OpenAI provides is a good starting point for that change.

10. If I were an entrepreneur or product manager who wants to innovate on applications based on Bailing or the Ant ecosystem, what should I do? What should I not do?

Xu Peng: I'm not a product manager, so I can only discuss from a technical perspective how product and technology should work together. After all, the product ultimately exists to serve the user. I think what product managers should do is gain a deep understanding of where AI models stand today, clarify the boundaries of their capabilities, and anticipate the directions in which those capabilities may improve. On top of that, think about how these capabilities create value for users and how they will change user habits.

From Ant Group's point of view, we have the underlying technology and are constantly evolving this technology, and we are not worried about falling too far behind in terms of technology. I think we should invest more at the product level and think about how to create a truly valuable product, connect with users faster, and let users get services quickly through this new and ultimate experience interaction model.

This is probably the direction we need to focus on in future product development.

11. What are the technical challenges in the human-computer interaction experience? Is the path of native multimodality the best?

Xu Peng: The release of GPT-4o is really impressive: it can capture the intonation and tone of the speaker's voice to a certain extent, and can also recognize people's facial expressions and emotions through its visual ability. However, the extent to which these capabilities hold up in practical application scenarios still needs to be further explored and verified.

Still, being able to understand a person in an all-round way during communication is undoubtedly an important direction for future products and technology to break through, and when that happens it will bring about real change in interaction.

Native multimodality should give the best experience, especially in terms of interaction fluency. However, the technical challenges are not small, such as accurately understanding and responding to all visual and auditory information. In addition, single-modal data is much easier to collect than multimodal data.

One product decision to consider next is whether this ultimate experience is a must-have for the product. On the technical side, can we use unimodal data to synthesize data that is useful for multimodal model training?
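As one hedged illustration of that last idea, the sketch below pairs an easy-to-collect text-only corpus with synthesized speech to produce paired multimodal training examples. The tts_model and augment_audio hooks are hypothetical placeholders, not a description of Ant's actual data pipeline.

```python
# A minimal sketch of the synthetic-data idea raised above: pairing an
# easy-to-collect text-only corpus with synthesized speech. The tts_model and
# augment_audio hooks are hypothetical placeholders, not Ant's actual pipeline.

def build_synthetic_speech_pairs(text_corpus, tts_model, augment_audio=None):
    """Yield (audio, text) training pairs synthesized from a text-only corpus."""
    for text in text_corpus:
        audio = tts_model.synthesize(text)    # generate speech for real text
        if augment_audio is not None:
            audio = augment_audio(audio)      # optional noise / speed / pitch variation
        yield {"audio": audio, "text": text}  # paired example for multimodal training
```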

— END —


Follow us and be the first to know about cutting-edge science and technology
