Hu Luhui: There are four major development trends in post-GPT-4, and only by understanding the physical world can we approach AGI

Author: Zhidongxi (Smart Things)

Editor | GenAICon 2024

The 2024 China Generative AI Conference was held in Beijing on April 18-19. At the main venue, Hu Luhui, former chief engineering director at Meta, delivered a speech titled "From Multimodal Large Models to Understanding the Physical World".

Hu Luhui said that the post-GPT-4 era, centered on multimodal large models, shows four major trends: first, the shift from language models to multimodal models; second, the integration of data into vector databases; third, the evolution from agents toward large models as operating systems; and fourth, the move from model fine-tuning to plugin platforms.

He believes that large models are a reliable path toward AGI. In applying large models, enterprises and research institutions face many challenges. The first is data standardization: data from different sources and in different formats must be converted into a unified format to support model training and application.

In addition, model fragmentation and the complexity of application scenarios greatly increase development difficulty. For example, in different physical environments a model needs to adjust its parameters to suit specific hardware and software conditions. At the same time, computing cost and training time are important factors restricting the wide application of large models.

Hu Luhui predicts that the next AI 2.0 flashpoint and landing direction will be AI for Robotics. Progress in this field requires models not only to handle programming or language processing, but also to reach into concrete applications in the physical world. This involves understanding and designing for the physical environment, which calls for large models that can integrate many kinds of perception data, make rapid decisions, and learn in response to changing external conditions. In this process, model training and deployment will rely even more on efficient computing power and advanced hardware support.

The following is a transcript of Hu Luhui's speech:

Today I want to share "From Multimodal Large Models to Understanding the Physical World". The rapid development of large models, coupled with continuous technological evolution, has changed a great deal, and I would like to share some of my practical experience with you.

I will cover four topics. First, starting from the principles of large models, I will discuss the major changes in Silicon Valley and around the world after GPT-4. Second, combining the characteristics of large models and multimodality, I will talk about the Transformer and my relevant work at Meta. Third, today's focus: why we need to understand the physical world. Relying only on language models cannot lead to artificial general intelligence; understanding the physical world may make it possible to move toward it. Finally, I will discuss how to approach AGI by combining multimodal large models with understanding of the physical world.

1. Large models open the AI 2.0 era, and Meta is an open source leader

The rapid development of every technology rests on a large amount of scientific research and innovation behind it, and that is the reason for the revival of artificial intelligence: it is developing and iterating rapidly. The importance of artificial intelligence is very prominent; it can be called the fourth computing era or the fourth industrial revolution. The third computing era was the era of the mobile Internet, which we are living in now, and with each step the scale grows: the fourth era will be larger than the third, and in terms of economic benefits its impact on human society will be greater.

There have been two inflection points in the history of artificial intelligence: AlphaGo and ChatGPT. Each inflection point was only one product or technology, but its impact on humanity goes beyond the technology itself. Take AlphaGo: not every company can build chess products or platforms, yet for society the technology behind that inflection point (computer vision and related techniques) kicked off the AI 1.0 era. This time, the AI 2.0 era has begun, based on the generalization and emergent abilities of large models.

ChatGPT has been out for more than a year, and it still ranks near the top in performance. Meanwhile the cost of training large models keeps rising: GPT-4 reportedly cost about $60 million to train, and GPT-5 may cost even more.

At present, OpenAI is the leader in closed-source large models, and Meta is the leader in open source. OpenAI's leading position in closed-source large models is widely recognized, and Meta's open-source models, Llama and the vision model SAM, are relatively far ahead. Among them, Llama has given the teams of many language-model companies a solid foundation to build on.

Among current models, three closed-source and three open-source models are relatively ahead. You may wonder why Meta's Llama is missing from the list. Meta is doing something else, something more meaningful: understanding the physical world, which they call the world model. Llama has not been iterated recently, but wait and see; this ranking will change. Llama has laid the foundation for many large language models and helped many enterprises develop rapidly.

2. Meta has three major SOTA vision models; multimodality fuses vision and language

Meta's vision models have contributed a great deal. The Transformer was originally applied to language models and was gradually extended to vision; one of the more popular results is ViT, the Vision Transformer.

Meta has kept iterating on ViT and the Transformer, and three of its vision Transformers have had great impact. The first is DETR, the Detection Transformer, which performs end-to-end object detection (a minimal inference sketch follows below). The second is DINO, which brought self-supervision to the vision field through the Transformer; whether for large language models or other large models, we cannot rely on labeling alone, so models need to learn and supervise themselves. The third is SAM, whose strength is zero-shot segmentation, that is, the ability to generalize.
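
For readers who want to try the first of these, here is the minimal inference sketch mentioned above; it is not from the speech, and it assumes the Hugging Face transformers library and the public facebook/detr-resnet-50 checkpoint, with a placeholder image path.

```python
# Minimal DETR inference sketch (assumes: pip install torch transformers pillow).
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Public DETR checkpoint: ResNet-50 backbone + Transformer encoder-decoder.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("example.jpg")            # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)                # set prediction over learned object queries

# Rescale boxes to the original image size and keep confident detections.
target_sizes = torch.tensor([image.size[::-1]])          # (height, width)
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```

Because DETR frames detection as set prediction over learned queries, there are no anchor boxes and no non-maximum suppression; post-processing only rescales boxes and applies a confidence threshold.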

In the vision field, aside from Sora, SAM has had the greatest influence. How do you train a SAM, how many resources does it take, and what should you pay attention to during training? Last year I wrote an article on fine-tuning SAM, which discussed how to fine-tune it, how to control resources, and how to use resources more efficiently.
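
The article itself is not reproduced here, but a common resource-saving recipe when fine-tuning SAM is to freeze the heavy image encoder and the prompt encoder and train only the lightweight mask decoder. The sketch below illustrates that idea; it is not the speaker's exact method, and it assumes the open-source segment-anything package, a downloaded ViT-B checkpoint, and a hypothetical dataloader yielding preprocessed 1024x1024 images, box prompts, and ground-truth masks.

```python
# Sketch: fine-tune only SAM's mask decoder (assumes: pip install torch segment-anything,
# plus a downloaded ViT-B checkpoint and your own loader of (image, box, gt_mask) samples).
import torch
import torch.nn.functional as F
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.train()

# Freeze the heavy image encoder and the prompt encoder; only the mask decoder is trained.
for frozen in (sam.image_encoder, sam.prompt_encoder):
    for p in frozen.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(sam.mask_decoder.parameters(), lr=1e-5)

# Hypothetical loader: image is 1x3x1024x1024 (already SAM-preprocessed),
# box is 1x4 in xyxy pixel coordinates, gt_mask is 1x1x1024x1024.
for image, box, gt_mask in dataloader:
    with torch.no_grad():                    # frozen parts: no gradients, less memory
        image_embedding = sam.image_encoder(image)
        sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=box, masks=None)

    low_res_masks, _ = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=False,
    )
    pred = F.interpolate(low_res_masks, size=gt_mask.shape[-2:],
                         mode="bilinear", align_corners=False)
    loss = F.binary_cross_entropy_with_logits(pred, gt_mask.float())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Since the frozen image encoder dominates the compute, its embeddings can also be precomputed once and cached, which is the main lever for keeping fine-tuning cheap.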

A few years ago, when people spoke of artificial intelligence, two branches came to mind, vision and language, and CNNs and RNNs kept to themselves like well water and river water. The NLP crowd and the CV crowd had their own academic schools, different methods, and different conferences. In the deep learning era, language models went from LSTM and Word2Vec to, more recently, GPT and BERT. Vision models went from classification to detection to segmentation, and then from semantic segmentation to instance segmentation.

The two are similar in many places. A so-called language model is, at heart, nothing more than deeper correlation and logical reasoning, and the same is true of vision. Logically the two are fusing, and technically they converge on the Transformer. On the language side, GPT-4 and Llama are the classic examples; on the vision side, Sora and SAM are, and the backbone behind them is the Transformer.

Both logically and technically, the two fields are gradually converging on the Transformer backbone.
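
To make the convergence concrete, here is an illustrative PyTorch sketch (not from the speech): the very same Transformer encoder consumes a sequence of word embeddings or a sequence of image-patch embeddings, and only the embedding step differs. All dimensions are arbitrary.

```python
# Illustrative only: one Transformer encoder shared by a text path and a vision path.
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)

# Text path: token ids -> embeddings -> the shared encoder.
vocab = nn.Embedding(30_000, d_model)
token_ids = torch.randint(0, 30_000, (2, 16))             # batch of 2, 16 tokens each
text_features = encoder(vocab(token_ids))                  # (2, 16, 256)

# Vision path (ViT-style): cut the image into 16x16 patches, project, reuse the encoder.
images = torch.randn(2, 3, 224, 224)
patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
patches = patchify(images).flatten(2).transpose(1, 2)       # (2, 196, 256) patch tokens
image_features = encoder(patches)                           # (2, 196, 256)

print(text_features.shape, image_features.shape)
```

This is essentially the ViT move: once an image becomes a sequence of patch tokens, vision and language share the same machinery.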

This is good news. For R&D practitioners, NLP and CV, which used to be as separate as well water and river water, will finally merge one day. A qualitative change is underway.

At present, the Transformer, as the core technology of AI, is also a relatively reliable route toward AGI, one that can be extended from one technology and one direction to the next stage. Meta's chief AI scientist Yann LeCun does not think so; his JEPA line has its own theory, from the original Image JEPA to Video JEPA. But in any case, in engineering and application terms, the Transformer's results are genuinely outstanding.

What are the core capabilities needed to build large models? Most people would say there are three: data, computing power, and algorithms. Based on my own work experience, I would add two more.

One is the model architecture. A key difference between today's large models and earlier deep learning algorithms is how much the model architecture matters. Transfer learning or fine-tuning by reshaping the backbone or the model architecture is not just a matter of feeding in domain data or domain knowledge; it means changing the model architecture to produce a new model, and thereby the desired domain model.
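
As a small-scale, hedged illustration of reshaping an architecture for a new domain (a stand-in for the idea, not the large-model case described here): keep a pretrained backbone, replace the head for the target domain, and fine-tune. The class count is hypothetical, and the sketch assumes torchvision's public ResNet-50 weights.

```python
# Sketch: reuse a pretrained backbone and reshape only the head for a new domain.
import torch
import torch.nn as nn
from torchvision import models

num_domain_classes = 12        # hypothetical number of classes in the target domain

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the pretrained backbone; only the new head is trained at first.
for p in model.parameters():
    p.requires_grad = False

# Reshape the architecture: swap the original 1000-way ImageNet head for a domain head.
model.fc = nn.Linear(model.fc.in_features, num_domain_classes)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
# ...train on domain data as usual...
```

The same pattern scales up: for large models, the reshaping may involve adapters, projection layers, or whole sub-networks rather than a single classification head.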

The other is intelligence engineering. Llama is open source; OpenAI produced GPT-3.5, that is, ChatGPT, and the singularity that changed the world happened. Given GPT-3, the data, and the computing power, could another company have built GPT-3.5? Different companies would fare differently, and the fundamental reason is that their intelligence engineering differs.

Which of these five is the most central and crucial? Many people will say computing power, since it is very expensive and they cannot buy H100s or A100s. But neither Google nor Microsoft lacks computing power, and neither currently has a model that leads the world the way GPT-4 does.

In China people like to talk about data, and it is indeed difficult to build a good model without data, but many large companies do not lack data either. Algorithms are basically open source; the Transformer and some relatively new algorithms are open, so algorithms are not the most critical factor. The model architecture, too, can be explored through fine-tuning and different attempts.

Therefore, considering both foreign models and the current situation in China, the core capability for building large models should be intelligence engineering.

Consider Claude: some people left OpenAI to start a company, Anthropic, and built it, and as you just saw, the second-ranked model on the leaderboard is Claude. It shows that talent is the most valuable asset.

3. Four predicted "post-GPT-4" development trends, and seven characteristics of understanding the physical world

Now that GPT-4 is a multimodal model, what are the trends in the development of artificial intelligence in Silicon Valley and worldwide? I see four, and the illustration here was generated by GPT-4 from my predictions.

First, from large language models to multimodal large models.

Second, the move toward vector databases. Current large language models and multimodal large models, however large, have certain limits, which has driven the popularity of vector databases. You can keep some or most of your data in a vector database and pass only the relevant data to the large model (a minimal retrieval sketch follows the fourth trend below).

Third, from automated agents to large models as operating systems. Agents are popular right now, but behind them there is still a large language model or a multimodal large model; the agent part is implemented automatically in software. Going forward, the multimodal large model may sit more at the center, like an operating system.

Fourth, open-source models are shifting from fine-tuning toward plugin platforms. ChatGPT is effectively a platform: it can not only be fine-tuned but also extended through plugins, so plugins may be a direction for the future.
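
Returning to the second trend, here is the minimal retrieval sketch mentioned above (not from the speech), assuming FAISS as the vector store and sentence-transformers for embeddings; any vector database and embedding model could stand in, and the documents and query are placeholders. The retrieved passages are what would be handed to the large model as context.

```python
# Sketch of the vector-database pattern: embed documents, index them, and retrieve
# only the relevant ones as context for a large model.
# Assumes: pip install faiss-cpu sentence-transformers
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")     # small public embedding model

documents = [
    "Placeholder document about multimodal large models.",
    "Placeholder document about vector databases.",
    "Placeholder document about AI for Robotics.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vectors.shape[1])        # inner product = cosine on unit vectors
index.add(doc_vectors)

query = "How do vector databases complement large models?"
query_vector = embedder.encode([query], normalize_embeddings=True)
scores, ids = index.search(query_vector, k=2)          # top-2 most relevant passages

context = "\n".join(documents[i] for i in ids[0])
print(context)   # this context, plus the query, would be sent to the large model
```

In production the index would usually be a persistent vector database rather than an in-memory FAISS index, but the retrieve-then-generate flow is the same.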

Why have models evolved so quickly, and why has the scaling law been sustainable? A large part of the answer is the development of computing power. The CPU era had Moore's Law, and the GPU era is moving even faster. Last year Nvidia released the DGX GH200, with exaflop-class AI compute, and this year the new DGX GB200; the new system is smaller and faster, another order-of-magnitude step, and many DGX units can be strung together at large scale. IBM mainframes were quite large a decade or so ago, and today a mobile phone matches that earlier computing power; GPUs are on the same trajectory.

With these large models and this computing power, what happens to applications? In AI 2.0 the users and the scenarios may look much the same as in traditional software or the Internet era. But where the path used to run from the user to the app, to the service software, to the CPU, it now runs from the user to a multimodal interface, to the foundation model, to the GPU, with databases or training data in between.

Next, on understanding the physical world: AI is empowering smartphones, smart cars, smart homes, and more, and the core of the surrounding computing is the intelligent cloud. Now, or in the future, the center will be the AI factory: its input is tokens, whether text, vision, or video, and its output is AI. In the past there were phones and cars; in the future there will be all kinds of robots, and in a sense the car of the future is also a robot. Architecturally, AI for Robotics is a future direction, and it is the direction about to explode, building up from cloud computing, AI engineering, foundation models, and generative AI to AI for Robotics on top.

Understanding the physical world is also challenging: current language models are confined to the scope of their training, and their understanding of the outside world remains quite limited.

What are the characteristics of understanding the physical world, how can we move from today's multimodal large models to an understanding of the physical world, and how do we then approach AGI? I think there are seven dimensions. In the chart, the outermost purple ring represents a capable human: people differ in ability, and a capable human sets the level of understanding of the physical world.

What about GPT-4, or the latest GPT-4 Turbo? It is the smaller circle inside. At present, GPT-4 Turbo is still far from human level, and only by improving along each of these dimensions can a model truly understand the physical world and get closer to artificial general intelligence.

Understanding the physical world is not just understanding space or spatial intelligence, because "space" conceptually amounts to 3D and does not cover core AI capabilities such as language.

That may sound a bit abstract, but it is what Meta has been doing lately. Meta currently appears to be "lagging behind" in open-source large models and open-source multimodal large models, but Llama 3 is coming soon. The reason is that Meta is putting a lot of energy into the world model while also improving the model's capabilities along the seven dimensions above.

I recently founded a company called Zhicheng AI, dedicated to artificial general intelligence. The "Cheng" in the name means moving step by step toward true intelligence.

This concludes the edited transcript of Hu Luhui's speech.
