
Turing Award winner Yann LeCun: Generative AI is a bit outdated


"I don't think there is such a concept as general artificial intelligence, artificial intelligence is very professional."

Recently, Yann LeCun, Meta's chief AI scientist and winner of the 2018 Turing Award, shared his latest views on the development of large AI models in a talk at MIT.

Yann LeCun argues that machine learning today falls far short of human learning: it lacks the reasoning and planning abilities that humans acquire, which is why artificial intelligence cannot yet match human intelligence. He also believes that today's LLMs (large language models) are the product of research from two years ago and are already somewhat outdated, and that the more current way for AI to learn is self-supervised learning.

Finally, he argued that we need to build goal-driven artificial intelligence, abandon generative training methods, and develop, as soon as possible, AI that can reason and carry out complex, hierarchical planning. He also stated that there is no such concept as artificial general intelligence: artificial intelligence is highly specialized.

Highlights of the presentation:

1. Self-supervised learning can be seen as an "ideal state" of machine learning, in which the model learns directly from unlabeled data, with no need for labeled data.

2. Open innovation has allowed us to benefit greatly during the development of AI, and bringing visibility, scrutiny, and trust to these technologies is our goal.

3. I think there are three challenges for future AI and machine learning research. The first is to learn representations and predictive models of the world; the answer to this is self-supervised learning. The second is to learn to reason; what current systems do basically corresponds to the human subconscious, producing responses without much deliberate thought. The third is to learn to build action plans hierarchically, so that a goal can be achieved through a large number of complex actions.

4. Most human knowledge is non-verbal. Everything we learn before the age of one has nothing to do with language. Unless we have systems that can take in direct sensory information, such as vision, we will not be able to build artificial intelligence that reaches the level of human intelligence.

5. Ultimately, what we want to do is use self-supervised learning and the JEPA architecture to build the kind of systems mentioned earlier that can predict the world and reason and plan, that are hierarchical, and that can predict what will happen in the world.

6. I don't think there is such a concept as artificial general intelligence; artificial intelligence is highly specialized.

The following is Yann LeCun's speech at MIT (abridged):


Machine learning is nowhere near human learning

We should recognize that machine learning is still quite poor compared with how humans and animals learn. Humans and animals can understand how the world works, can reason and plan tasks, and their actions are driven by goals; machine learning cannot do these things. But with the application of self-supervised learning, the gap between the biological world and machine learning is narrowing. Self-supervised learning has taken hold in machine learning for text, natural language understanding, images, videos, 3D models, speech, protein folding, and more.

Self-supervised learning can be seen as an "ideal state" for machine learning, where models learn on their own directly from unlabeled data, with no need for labeled data. In natural language understanding, the recipe is to take a piece of text, corrupt it by removing some of the words (for example, replacing them with blank markers), and then train a neural network to predict the missing words, simply measuring the reconstruction error on the missing part. In the process, the system has to store or represent grammar, semantics, and so on, which can then be used for downstream tasks such as translation or topic classification.

This works very well for text because the uncertainty can be handled: you cannot predict exactly which word will appear at a particular position, but you can predict a probability distribution over all the words in the dictionary, easily compute the probability of each word, and thereby handle the uncertainty in the prediction well.
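To make the masking idea concrete, here is a minimal sketch using the Hugging Face transformers library with an off-the-shelf BERT-style model as a stand-in (not one of the specific systems discussed in this talk): the model fills a blank and returns a probability distribution over candidate words.

```python
# Illustrative only: masked-word prediction with a generic BERT-style model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model cannot know with certainty which word was removed, but it can
# output a probability distribution over the vocabulary for the blank.
for candidate in fill_mask("The cat sat on the [MASK]."):
    print(f"{candidate['token_str']:>10s}  p={candidate['score']:.3f}")
```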

Autoregressive language models are a learning approach we have heard a lot about lately, and they work in a similar way; in fact, they are a special case of the self-supervised method just mentioned. You convert a sequence of tokens (words) into vectors, then train a system to predict the last token in the sequence. Once you have a system trained to produce the next token, you can predict token after token in an autoregressive, recursive way; this is autoregressive prediction. It lets systems generate text one token at a time, and the amount of knowledge they capture from text is quite amazing. These systems often have billions or even hundreds of billions of parameters and need to be trained on one to two trillion tokens, sometimes even more.
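As a rough illustration of autoregressive prediction, the sketch below generates text one token at a time with a small open model (GPT-2, used here purely as a stand-in for the much larger systems named next), feeding each newly chosen token back into the input.

```python
# Minimal sketch of autoregressive generation: predict one token, append it,
# and feed the longer sequence back in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Machine learning is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                          # generate 20 tokens, one at a time
        logits = model(ids).logits[:, -1, :]     # distribution over the next token
        next_id = torch.argmax(logits, dim=-1, keepdim=True)  # greedy choice
        ids = torch.cat([ids, next_id], dim=-1)  # append and repeat (recursion)
print(tokenizer.decode(ids[0]))
```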

Such models have a long history: BlenderBot, Galactica, Llama 1 and Llama 2, Google's LaMDA and Bard, DeepMind's Chinchilla, and of course OpenAI's ChatGPT. These models are all great as writing aids, but their knowledge of the underlying world is limited because they are trained purely on text.

We just released Llama 2, an open-source large model, currently available in three versions with 7 billion, 13 billion, and 70 billion parameters, free for commercial use. The model was pre-trained on 2 trillion tokens with a context length of 4,096; some versions have been fine-tuned for dialogue, and on many benchmarks it has an advantage over other systems, whether open source or closed source. Its essential feature is openness, and together with the model we published a statement signed by many people. That document records our approach to open innovation in AI research. Open innovation has allowed us to benefit greatly during AI development, and bringing visibility, scrutiny, and trust to these technologies is our goal.

AI is so powerful that people hesitate over whether to tightly control and regulate it, and the debate between open source and closed source is fierce. There are certainly risks, but there is plenty of evidence that open-source software is actually more secure than proprietary software. And the benefits of AI and LLMs are so great that keeping them secret would be shooting ourselves in the foot; Meta is firmly on the side of open research. Training LLMs is very expensive, so we do not need 25 different proprietary LLMs; we need a few open-source models so that people can build fine-tuned products on top of them.

In the future, all of our interactions with the digital world will be mediated by AI virtual assistants. They will become a treasure trove of human knowledge: we will no longer need to ask Google or do a literature search; we will just talk to our AI assistant, perhaps consulting the original material, but overall getting the information we need through the AI system. It will be infrastructure that everyone uses, so that infrastructure must be open source. In the history of the Internet, vendors such as Microsoft and Sun Microsystems competed to provide the software infrastructure of the Internet, and they all lost to open source: Linux, Apache, Chrome, Firefox, JavaScript are what run the Internet today.

Human knowledge is so vast that it takes millions of people contributing in a crowdsourced way. These systems will be repositories of all human knowledge, similar to Wikipedia. Wikipedia could not have been created by a proprietary company; it had to integrate the wisdom of people all over the world. The same will happen with AI-based systems: open-source AI is inevitable, and we are just taking the first step.


"The LLMs models seen today will disappear in 3~5 years"

For researchers in artificial intelligence, the LLM revolution actually happened two years ago, so it is already a bit outdated. Still, it is new to the public, which has only been exposed to ChatGPT in recent months. In fact, you can also see that these models are not that good: they do not reliably give answers consistent with the facts, they hallucinate and even produce gibberish, and they cannot take recent information into account because their training data stops a year or two in the past. So you have to adjust them with RLHF (reinforcement learning from human feedback), but RLHF cannot do this perfectly. These AI systems cannot reason or plan, whereas humans can.

It is easy to be fooled by their fluency into thinking they are smart, but in fact they are very limited: they have no connection to physical reality and no idea how the world works. And they are built to produce answers token by token, autoregressively: if each generated token has some probability e of drifting outside the range of correct answers, those errors accumulate. For a sequence of n tokens, the probability of being correct is roughly P(correct) = (1 - e)^n, so the probability of staying correct decreases exponentially with the length of the generated sequence, and this cannot be fixed without a redesign. It is an essential flaw of autoregressive prediction.
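The exponential decay is easy to see numerically. The per-token error rate e below is a made-up value, and the independence assumption is a simplification; the point is only the trend.

```python
# If each generated token independently has error probability e, the chance
# that an n-token answer stays entirely on track is (1 - e) ** n.
e = 0.01          # hypothetical per-token error probability
for n in (10, 100, 1000, 4000):
    print(f"n = {n:>5d}   P(correct) = {(1 - e) ** n:.4f}")
# n =    10   P(correct) = 0.9044
# n =   100   P(correct) = 0.3660
# n =  1000   P(correct) = 0.0000  (about 4e-5)
# n =  4000   P(correct) = 0.0000
```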

Not long ago, Jacob Browning and I co-authored an essay in the philosophy magazine Noema that points out the fundamental limitations of current large-model technology. It argues that most human knowledge is non-verbal: everything we learn before the age of one has nothing to do with language. Unless we have systems that can take in direct sensory information, such as vision, we will not be able to create artificial intelligence that reaches the level of human intelligence. In fact, research papers from both cognitive science and classical subfields of artificial intelligence point out that LLMs really cannot plan: they do not have a genuine ability to think, nor the reasoning and planning abilities that humans have.

So I think there are three challenges for AI and machine learning research in the future. The first is to learn representations and predictive models of the world; the answer to this is self-supervised learning. The second is to learn to reason; what current systems do basically corresponds to the human subconscious, producing reactions without much deliberate thought. The third is to learn to build action plans hierarchically, so that a goal can be achieved through a large number of complex actions.

I previously wrote a vision paper, "A Path Towards Autonomous Machine Intelligence," describing what I now call "goal-driven AI." It is built around the idea of a cognitive architecture in which different modules interact: a perception module provides the system with the state of the world, and by combining that perceptual estimate of the state of the world with existing memory, the system can make useful predictions about what will happen in the world.

The state of the world is used to initialize the world model; you then feed the world model the initial configuration together with an imagined sequence of actions, and pass the result to an objective function. That is why I call it goal-driven. You cannot get around the system, because it is hardwired to optimize these objectives, and you cannot make it produce output that violates them unless you modify the objectives.

The world model can be run over multiple action steps: for example, you take two actions and run the world model twice, so you can predict what will happen two steps ahead. Of course the world is uncertain, so when latent variables vary over a set, or are sampled from a distribution, you get multiple predictions, which complicates planning. Ultimately, what we really want is some kind of hierarchical way of operating.
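To make the goal-driven loop concrete, here is a minimal sketch under the assumption of a learned encoder, world model, and cost function (all hypothetical stand-ins, not LeCun's actual modules). It searches over imagined action sequences and keeps the one the objective scores best.

```python
# Minimal sketch of goal-driven planning with a learned world model.
import torch

def plan(encode, world_model, cost, observation,
         horizon=2, candidates=64, action_dim=4):
    """Pick the imagined action sequence whose rollout minimizes the objective."""
    state = encode(observation)                       # perception: abstract world state
    best_actions, best_cost = None, float("inf")
    for _ in range(candidates):                       # naive random-shooting search
        actions = torch.randn(horizon, action_dim)    # an imagined sequence of actions
        s, total = state, 0.0
        for a in actions:                             # run the world model step by step
            s = world_model(s, a)                     # predict the next abstract state
            total += cost(s).item()                   # score it with the objective
        if total < best_cost:
            best_actions, best_cost = actions, total
    return best_actions                               # execute (or further refine) the plan
```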

For example, suppose I am sitting in my office at New York University and want to travel to Paris. My first step is to take transportation to the airport, and my second step is to catch a flight to Paris. My first goal, getting to the airport, can be broken down into sub-goals: first I have to get to the street so I can travel to the airport. How do I get to the street? I need to get up from my chair and walk out of the building, and before that I need to move my muscles to rise from the chair. We do this kind of layered planning all the time, without even thinking about it, subconsciously. But our current AI systems cannot learn to do this on their own. What we need are systems that can learn the state of the world and break complex tasks down into simpler levels. I think this is a huge challenge for AI research.
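The Paris example can be written down as a toy decomposition. The table of sub-goals below is invented purely to illustrate the recursive structure that a hierarchical planner would have to learn on its own rather than have hand-coded.

```python
# Toy sketch: a goal is split into sub-goals until the pieces are primitive actions.
DECOMPOSITION = {
    "travel to Paris": ["go to the airport", "catch a flight to Paris"],
    "go to the airport": ["get to the street", "take transportation to the airport"],
    "get to the street": ["stand up from the chair", "walk out of the building"],
}

def expand(goal, depth=0):
    print("  " * depth + goal)
    for sub_goal in DECOMPOSITION.get(goal, []):   # goals with no entry are primitive actions
        expand(sub_goal, depth + 1)

expand("travel to Paris")
```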

The LLMs we see today will disappear within 3 to 5 years, replaced by new models that can do hierarchical planning and reasoning, with language used only to turn the answers into fluent text. That way we get something that is both fluent and truthful. We may fail to get there, but I think it is the direction to go.

If we have such a system, we will not need RLHF or human feedback beyond training the cost model, and we will not need to fine-tune the whole system for safety: we just set objectives so that everything it produces is safe. We would not need to retrain the entire encoder and everything else for this, which would greatly simplify training and reduce its cost.

When we observe babies, we find that in the first few months of life they mainly build a background understanding of the world by observing it; once they can actually act on the world, they acquire knowledge through interaction. Most of what they learn is intuitive physics, such as gravity, inertia, and conservation of momentum, and it takes babies about nine months to really understand that unsupported objects fall. They clearly do not need a trillion tokens of training the way LLMs do, and humans never read that much text anyway. Any ten-year-old can learn to clear the table in minutes, yet we do not have a robot that can do that. Some things that seem easy for humans are difficult for AI, and vice versa: AI is much better than humans at many specialized tasks.

We have not yet found a mechanism by which machines can understand the world the way humans do. The approach is self-supervised learning: filling in the gaps. But if we train a neural network to predict video, the predictions it generates are very blurry, because the system can only make a single prediction and cannot know exactly what will happen in the video; it predicts a kind of blur, the average of all possible future outcomes. If you use a similar system on natural videos, the result is the same: blurry predictions. So our solution is the Joint Embedding Predictive Architecture (JEPA), and the main idea behind JEPA is to give up the idea that predictions need to be generated. The most fashionable thing right now is generative AI, but I think it should be abandoned; it is no longer a very good solution.

A generative model takes an input x, say the initial segment of a video or a piece of text, runs it through an encoder and a predictor, and tries to predict the variable y; the error that measures the system's performance is some measure of the difference between the predicted y and the actual y. A joint embedding predictive architecture does not try to predict y itself but rather a representation of y: both x and y go through encoders, and the prediction is performed in representation space. The advantage is that the encoder of y may have invariance properties, mapping many different inputs to the same representation, so if something is hard to predict, the encoder can eliminate that unpredictable detail and make the prediction problem easier.

For example, say you are driving a self-driving car, and the predictive model wants to predict what the other cars on the road will do. There may be trees by the roadside, and today it is windy, so the leaves are moving in a somewhat chaotic way; behind the trees there is a pond whose surface is rippled by the wind. The ripples and the leaf movements are hard to predict because they are chaotic, and although they carry a lot of information at the pixel level, it is not the information we need for the task. With a generative model you have to spend a lot of resources trying to predict all of those task-irrelevant details, which is expensive. JEPA can simply drop those details from the scene and keep only the parts of y that are comparatively easy to predict, such as the motion of the other cars, which makes the prediction problem much simpler. You can still use generative models if you want to, but if you want to understand the world and then be able to plan, you need a joint embedding predictive architecture.
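Here is a minimal sketch of the idea in PyTorch, with invented dimensions and modules: the loss is computed between representations rather than between raw y values. Real JEPA variants also need a mechanism, such as regularization or a target encoder, to avoid the collapse failure mode discussed in the next section.

```python
# Minimal sketch of a joint embedding predictive architecture (JEPA).
import torch
import torch.nn as nn

enc_x = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))  # encoder for x
enc_y = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))  # encoder for y
predictor = nn.Linear(32, 32)        # prediction happens in representation space

x, y = torch.randn(8, 128), torch.randn(8, 128)   # placeholder data
s_x, s_y = enc_x(x), enc_y(y)                     # abstract representations of x and y
loss = ((predictor(s_x) - s_y) ** 2).mean()       # predict s_y, not y itself
loss.backward()
```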


How do we train such a system?

Experiments show that the only effective way to use self-supervised learning on images, rather than text, is a joint embedding predictive architecture. If you train a system by giving it a pair of images or video clips, x and y, and asking it to compute identical representations for x and y, the system collapses: it produces constant sx and sy and ignores x and y entirely. How can this be corrected? We need to place ourselves in the framework of energy-based models. Energy-based learning can be seen as an alternative to probabilistic estimation for prediction, classification, or decision-making tasks. Energy-based models are not explained in terms of probabilistic modeling but in terms of energy functions that capture the dependencies between variables. Suppose your dataset has two variables, x and y; the energy-based model captures the dependency between them by computing an energy function, an implicit scalar-valued function that takes x and y as inputs and gives low energy in regions of high data density. If you have a function that can compute this energy landscape, it captures the dependency between x and y: you can infer y from x even when the mapping between x and y is not a function, because multiple values of y may be compatible with a single x, so it captures multimodality.
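As a toy illustration of the energy-based view (not any specific model from the talk), the sketch below defines a scalar energy E(x, y) with a small network and performs inference by scanning candidate y values for the lowest energy. Nothing here is trained; it only shows the shape of the interface.

```python
# Toy energy function: a scalar E(x, y) that should be low where x and y are compatible.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))

def energy(x, y):
    return net(torch.cat([x, y], dim=-1)).squeeze(-1)   # scalar score per (x, y) pair

x = torch.randn(1, 1)
candidates = torch.linspace(-2, 2, 41).unsqueeze(-1)     # candidate values of y
scores = energy(x.expand_as(candidates), candidates)
best_y = candidates[scores.argmin()]                      # inference = minimize energy over y
# Several y values can have nearly equal energy: the model captures multimodality.
```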

How do we train such a system? There are two types of methods:

The first is the contrastive method: change the parameters of the energy function so that the energy takes low values at the data points and higher values at contrastive points. I contributed to the birth of this method back in the early 1990s, but I do not like it anymore, because in high-dimensional spaces, to give the energy function the right shape, the number of contrastive points you need to generate grows exponentially.

That is not good, so I prefer the other approach, regularized methods, which use some kind of regularizer to minimize the volume of space that can take low energy, so that as the system lowers the energy of the data points by changing the parameters of the energy function, the low-energy region shrinks and wraps around the data; this is much more efficient. The question is how to do this, and it requires abandoning generative AI models, probabilistic models, contrastive methods, and reinforcement learning, because they are all too inefficient. A newer method is VICReg (Variance-Invariance-Covariance Regularization), a self-supervised learning approach. It is a general method that can be applied to joint embedding predictive architectures for image recognition, segmentation, and other applications, and it works very well. Without boring you with the details: you pre-train a convolutional network with this self-supervised method, chop off the expander, attach a linear classifier, train it with supervision, and measure performance. This achieves very good results on ImageNet, especially for out-of-distribution learning and transfer learning. A modified version of the method, called VICRegL, was published at NeurIPS last year.
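For reference, here is a sketch of the three VICReg terms on a batch of paired embeddings, following the structure described in the VICReg paper; the coefficients and the epsilon below are typical values, not something specified in this talk.

```python
# Sketch of the VICReg loss: invariance + variance + covariance regularization.
import torch

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    n, d = z_a.shape
    # Invariance: the two views of the same sample should map to similar embeddings.
    sim = ((z_a - z_b) ** 2).mean()
    # Variance: keep each embedding dimension's std above 1 to prevent collapse.
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = torch.relu(1 - std_a).mean() + torch.relu(1 - std_b).mean()
    # Covariance: decorrelate dimensions by penalizing off-diagonal covariance.
    za, zb = z_a - z_a.mean(dim=0), z_b - z_b.mean(dim=0)
    cov_a, cov_b = (za.T @ za) / (n - 1), (zb.T @ zb) / (n - 1)
    off_a = cov_a - torch.diag(torch.diag(cov_a))
    off_b = cov_b - torch.diag(torch.diag(cov_b))
    cov = (off_a ** 2).sum() / d + (off_b ** 2).sum() / d
    return sim_w * sim + var_w * var + cov_w * cov
```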

A few weeks ago, at CVPR (the IEEE Conference on Computer Vision and Pattern Recognition), we introduced a new method called I-JEPA (an image-based computer vision model) that uses masking and a transformer architecture to learn features from images. The advantage of this approach is that it does not require any data augmentation other than masking, so it does not need to know much about the type of data you are working with, and it works very well. Our colleagues in Paris came up with another method, called DINO (another self-supervised learning approach), which gets more than 80% accuracy on ImageNet completely self-supervised, without fine-tuning and without any data augmentation, which is quite amazing.

Ultimately, what we want to do is use self-supervised learning and the JEPA architecture to build the kind of systems mentioned earlier that can predict the world and reason and plan, that are hierarchical, and that can predict what is going to happen in the world. We have some early results from video-trained systems: good representations of images and videos are learned by training on successive frames and distorted images from videos.

Goal-driven means that we set objectives that drive the system's behavior, making it controllable and safe. To make it work, we are trying to do self-supervised learning from video. We are using these JEPA architectures, but we do not have the final recipe yet. We can use this to build goal-driven LLMs that reason and plan, in the hope of building learning systems that can plan hierarchically, just like animals and humans. We still have many problems to solve: JEPA trained with regularization, latent variables to handle uncertainty, planning algorithms under uncertainty, learning cost modules with something like inverse reinforcement learning...

We still lack the basic concepts needed for human-level AI, and we lack the basic techniques for learning perceptual models from complex modalities like video. Perhaps in the future we will be able to build systems that can plan answers that satisfy objectives. I don't think there is such a concept as artificial general intelligence; artificial intelligence is highly specialized. So let's try to create human-level intelligence: artificial intelligence with the same skills and learning abilities as humans. There is no doubt that at some point in the future machines will surpass humans in every area of human intelligence. We should not feel threatened by this; everyone will be helped by systems smarter than themselves. Don't worry, AI will not escape our control, just as our brain's neocortex cannot escape the control of our basal ganglia.