
Geoffrey Hinton, the father of deep learning, asks: How will humans contend with smarter machines in the future? | Frontline

Author: 36Kr

Text | Zhou Xinyu

Edit | Tang Yongyi

When it comes to attitudes toward large models, the three Turing Award winners have visibly split into camps:

Yoshua Bengio is the pessimist: he signed the open letter opposing the development of models more powerful than GPT-4. Yann LeCun, by contrast, is more optimistic: he has not only publicly opposed the pause on social platforms, but has also actively pursued research toward AGI (artificial general intelligence).

Geoffrey Hinton, meanwhile, has been notably low-key among the Turing trio: whether on the GPT-4-triggered pause campaign or on his own decision to leave Google, he has made no loud public statements.


Geoffrey Hinton. Source: KLCII

In his China debut at the KLCII conference, the scholar gave the audience a hardcore lesson in the philosophy of technology.

Rather than the safety and ethical risks already exposed by today's large models, Hinton is more concerned with the struggle for control between humans and machines in the era of superintelligence. He argues that humans have rarely thought about how to interact with a species smarter than themselves. This leads to a debate resembling the one over the legitimacy of meritocracy, except that in the future the "elites" may not be humans but artificial intelligences.

Although there is no ready solution to the problem of superintelligence controlling humans, Hinton believes the word "artificial" in "artificial intelligence" is precisely humanity's advantage: these intelligent species did not arise through evolutionary iteration, and being man-made means they lack the innate drive to compete with humans.

In a talk titled "Two Paths to Intelligence," he made several important points:

· Large language models acquire knowledge through thousands of identical copies running on computers, which is why they can learn far more than humans, and far faster.

· To make digital intelligence more effective, humans need to allow it to create sub-goals. But an obvious sub-goal for almost any task is to gain more power and control, which makes the original goal easier to achieve.

· Once digital intelligence begins to pursue more control, it may gain power by manipulating humans. And once AI masters the skill of "deception", it can easily acquire the ability to control humans.

Between the lines, the father of deep learning is reminding everyone to stay alert to the danger: "I believe superintelligence is much closer than I used to think."

The following is a compilation of Hinton's presentation (provided by KLCII, slightly edited by 36Kr):

Today I'm going to talk about my research, which has led me to believe that superintelligence is closer than I thought. So I want to talk about two questions, and I'm going to focus almost entirely on the first, which is whether artificial neural networks will soon be smarter than real neural networks. As I said, I will describe the research that led me to this conclusion. Finally, I'll briefly discuss whether we can control superintelligent AI, but that's not the focus of this talk.

The first path to intelligence: analog hardware

In traditional computing, computers are designed to follow instructions precisely. We can run exactly the same program, or the same neural network, on different physical hardware because the hardware is designed to follow instructions precisely. This means the knowledge in the program, or the weights of the neural network, is immortal: it does not depend on any particular piece of hardware.

However, this immortality comes at a high cost. We have to run transistors at high power so that they behave digitally, and we cannot exploit the rich analog and highly variable properties of the hardware. That is the price of digital computers as we know them. They follow instructions because they are designed for a workflow in which we first analyze a problem, work out the steps needed to solve it, and then tell the computer to execute those steps.

But that has changed. We now have a different way to get computers to do things: learning from examples. We just show the computer what we want it to do. Because of this change, it may now be worth abandoning one of the most basic principles of computer science: that software should be separated from hardware. Before abandoning it, let's briefly recall why it is a good principle. Because software is separate from hardware, we can run the same program on different hardware, and we can study the properties of a program, or of a neural network, without worrying about the electronics.

That is why a computer science department can exist separately from an electrical engineering department. If we give up the separation of software and hardware, we get what I call "mortal computation." It has obvious big drawbacks, but also some huge advantages.


Mortal computation. Source: KLCII

It was for these advantages that I started working on mortal computation: to run tasks like large language models, and especially to train them, with much less energy. By giving up immortality, that is, by giving up the separation of hardware and software, we can get huge energy savings, because we can use very low-power analog computation. This is exactly what the brain does. It does perform some one-bit digital computation, because a neuron either fires or it doesn't, but most of its computation is analog and can be done at very low power. We can also get cheaper hardware. Current hardware has to be manufactured with great precision in two dimensions, whereas we could instead grow hardware in three dimensions, because we would no longer need to understand exactly how it is connected or precisely how each part works.

Obviously, achieving this will require a lot of new nanotechnology, or perhaps genetically re-engineering biological neurons, since biological neurons can already do roughly what we want. Before going into all the disadvantages of mortal computation, I want to give an example of a computation that we can clearly do much more cheaply with analog hardware.

Multiplying a vector of neural activities by a weight matrix is the core computation of a neural network; that is where most of the work goes. What we do today is drive transistors at very high power to represent the bits of a number, and then multiplying two n-bit numbers takes O(n^2) bit operations. It may look like a single instruction on the computer, but it is n^2 operations at the level of bits. The alternative is to represent neural activities as voltages and weights as conductances. Then, per unit time, voltage times conductance gives a charge, and the charges simply add up.

So you can multiply a voltage vector by a conductance matrix in this way, and it is far more energy efficient. Chips that work like this already exist. Unfortunately, people then convert the analog answer back into digital form using analog-to-digital converters, which is very expensive. If possible, we would like to stay entirely in the analog domain. The problem is that different pieces of hardware will then compute slightly different results.
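
To make the contrast concrete, here is a minimal NumPy sketch (an editorial illustration, not code from the talk) of the same vector-matrix multiply done exactly, as a digital computer would, and done "analog-style", where each conductance deviates slightly from its nominal value and the readout adds a little noise. The 2% mismatch and the noise level are arbitrary assumptions chosen only to show the effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Activities are represented as voltages, weights as conductances.
activities = rng.random(256)            # one vector of neural activities
weights = rng.normal(size=(256, 128))   # weight matrix

# Digital reference: the exact multiply-accumulate.
exact = activities @ weights

# "Analog" version: every device's conductance is slightly off its nominal
# value (assumed ~2% mismatch), and the accumulated charge is read out with
# a little noise. Same computation, slightly different answer per device.
device_mismatch = 1.0 + 0.02 * rng.normal(size=weights.shape)
readout_noise = 0.01 * rng.normal(size=exact.shape)
analog = activities @ (weights * device_mismatch) + readout_noise

print("mean |analog - exact|:", np.abs(analog - exact).mean())
```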

The main problem with mortal computation is that the learning procedure has to exploit the specific analog properties of the particular piece of hardware it runs on, and we do not know exactly what those properties are. For example, we do not know the exact function that maps a neuron's input to its output, and we may not know the connectivity. This means we cannot use methods like backpropagation to obtain gradients, because backpropagation requires an exact model of the forward pass.

So the question is: if we cannot use backpropagation, which we now depend on heavily, what else can we do? There is a very simple and obvious learning procedure that people have discussed many times.

You generate a small random perturbation vector over all the weights in the network, then measure the change in the global objective function on a small batch of examples. You then change the weights permanently in the direction of the perturbation vector, scaled by how much the objective function improved; if the objective gets worse, you obviously move in the opposite direction. The nice property of this algorithm is that, on average, it does the same thing as backpropagation,

because on average it follows the gradient. The problem is that its variance is very high: the noise from choosing a random direction in weight space scales badly with the size of the network. So the algorithm can work well for small networks with few connections, but it does not scale to large networks.
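
As a rough sketch of this weight-perturbation procedure as described above (an editorial toy example, not from the talk; the perturbation size, learning rate, and regression problem are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def weight_perturbation_step(W, loss_fn, batch, sigma=1e-3, lr=0.02):
    """Try one random perturbation of ALL the weights at once, then change the
    weights permanently in that direction, scaled by how much the loss improved
    (a negative improvement automatically reverses the direction)."""
    delta = sigma * rng.normal(size=W.shape)
    improvement = loss_fn(W, batch) - loss_fn(W + delta, batch)
    # Unbiased estimate of (minus) the gradient direction, but very noisy.
    return W + lr * (improvement / sigma**2) * delta

# Toy usage: fit a small linear map with squared error.
X = rng.normal(size=(64, 10))
y = X @ rng.normal(size=(10, 1))
loss_fn = lambda W, b: float(np.mean((b[0] @ W - b[1]) ** 2))

W = np.zeros((10, 1))
for _ in range(20000):
    W = weight_perturbation_step(W, loss_fn, (X, y))
print("final loss:", loss_fn(W, (X, y)))
```

With only ten weights this converges; the point of the passage is that the same trick becomes hopeless with billions of weights, because the noise grows with the number of dimensions being perturbed.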

Here is an approach that works much better. It still has a similar problem, but it is far better than weight perturbation: instead of perturbing the weights, you perturb the activities of the neurons. You apply a random perturbation vector to the total input of each neuron, observe what happens to the objective function on a small batch of examples, and from that difference compute how to change each incoming weight of the neuron so as to follow the gradient. Again this is only a stochastic estimate of the gradient, but the noise is much smaller than with weight perturbation. The algorithm is good enough to learn simple tasks such as recognizing digits. With a very, very small learning rate it behaves exactly like backpropagation, only much slower because the learning rate has to be tiny. With a larger learning rate it is noisy; it still works well on MNIST-scale tasks, but not on large neural networks.
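
A matching sketch of activity (node) perturbation for a single linear layer, again an editorial reading of the description rather than the actual algorithm used in the work: the random vector now lives in the space of each neuron's total input, and the known incoming activities convert the estimated gradient into weight changes.

```python
import numpy as np

rng = np.random.default_rng(1)

def activity_perturbation_step(W, x, y, sigma=1e-3, lr=0.05):
    """One step for a linear layer z = x @ W trained with squared error.

    Perturb the total input z of each output neuron, measure the change in the
    loss on this batch, estimate dLoss/dz from it, then use the incoming
    activities x to turn that into a change of each incoming weight."""
    z = x @ W
    delta = sigma * rng.normal(size=z.shape)        # perturb pre-activations, not weights
    d_loss = np.mean((z + delta - y) ** 2) - np.mean((z - y) ** 2)
    dz_hat = (d_loss / sigma**2) * delta            # noisy estimate of dLoss/dz
    return W - lr * (x.T @ dz_hat)                  # chain rule through z = x @ W
```

The random vector here has one entry per neuron rather than one per weight, which is why, as described above, the noise is much smaller than with weight perturbation.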

To make this scale up, instead of trying to find a learning algorithm for one large neural network, we can try to find objective functions for small neural networks. The idea is that, to train a large network, we apply many small objective functions to small parts of the network, so that each local group of neurons has its own local objective function. The activity perturbation algorithm can then be used to train each small multi-layer network; it learns in much the same way as backpropagation, only more noisily, and the whole thing is scaled up to a larger network by using many of these small local groups of neurons.

This raises the question: where do these objective functions come from? One possibility is unsupervised learning on local patches. At each level you have a representation of each local patch of the image, produced by a small local neural network, and you try to make that local output agree with the average of the representations produced by all the other local patches of the same image. In other words, you try to make what is extracted from one patch consistent with what is extracted from all the other patches of the same image, while at the same time making it inconsistent with what is extracted from other images at the same level. This is classic contrastive learning.
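
One plausible toy rendering of such a local objective, purely as an editorial sketch of the description above (the real objective in the paper is more elaborate): given per-patch representations at one level, pull each patch toward the mean of the other patches of the same image and push it away from patches of other images.

```python
import numpy as np

def local_contrastive_loss(reps):
    """reps: (num_images, num_patches, dim) local representations at one level.

    For each patch: agree (small squared distance) with the mean of the OTHER
    patches of the same image; disagree (large distance) with the mean of the
    patches of other images. Lower is better."""
    n_img, n_patch, _ = reps.shape
    loss = 0.0
    for i in range(n_img):
        for p in range(n_patch):
            same = np.delete(reps[i], p, axis=0).mean(axis=0)
            other = reps[np.arange(n_img) != i].mean(axis=(0, 1))
            loss += np.sum((reps[i, p] - same) ** 2) - np.sum((reps[i, p] - other) ** 2)
    return loss / (n_img * n_patch)
```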

The details are more complicated and I won't go into them. We can make this algorithm work quite well: the representation at each level has several hidden layers, so it can do nonlinear things, and each level learns using activity perturbation, with no backpropagation through the lower levels. It will therefore be less powerful than backpropagation, because it cannot send a training signal back through many levels. A lot of people have put work into making algorithms of this kind work, and they have been shown to work reasonably well, probably better than other proposed algorithms that might plausibly run in real neural networks. But getting them to work takes skill, and they are still not as good as backpropagation: as the network gets deeper, their performance relative to backpropagation drops off sharply. I won't go into all the details; you can find them in a paper published at ICLR and online.

The second path to intelligence: knowledge sharing

Now let me turn to another big problem for mortal computation. To summarize so far: we have not yet found a really good learning algorithm that can fully exploit the analog properties of the hardware, but we do have acceptable learning algorithms that are good enough for small tasks and for some larger tasks like ImageNet, though they do not work terribly well.

The second major problem with mortal computation is its mortality. When a particular piece of hardware dies, all the knowledge it has learned dies with it, because the knowledge is intimately tied to the details of that hardware. The best solution is to transfer the knowledge from a teacher to a student before the hardware dies.

That is what I am trying to do right now: the teacher shows the student the correct responses to various inputs, and the student tries to mimic those responses. Think of how Trump's tweets work. People get very upset because they think Trump is stating falsehoods; they assume he is trying to describe facts. But that is not what is going on. Trump is producing a highly emotional response to a situation, which lets his followers adjust the weights in their own neural networks so that they give the same emotional response to that situation. It is not about facts; it is about transferring a paranoid way of responding from a cult leader to his followers, and it is very effective.

To see how effective distillation is, consider an agent that classifies images into roughly a thousand non-overlapping categories, say 1024. Identifying the correct answer takes only about 10 bits of information, so when you train the agent by telling it the right answer, you impose only about a 10-bit constraint on the network's weights. That is not much of a constraint. But now suppose we train the agent to match the teacher's probabilities over those 1024 categories. Provided none of those probabilities is negligibly small, matching the whole distribution means matching 1023 real numbers, which provides hundreds of times more constraint. A while ago I studied distillation with Jeff Dean and showed that it works very well.

The way to make sure the teacher's output probabilities contain no negligibly small values is to run both the teacher and the student at a high temperature while training the student. You take the logits, the inputs to the softmax, divide the teacher's logits by the temperature, and get a much smoother distribution; you use the same temperature when training the student. The important point is that the temperature is only used during training, not when the student is used for inference.
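
A minimal sketch of the temperature trick as described (standard knowledge distillation in the style of Hinton, Vinyals and Dean's work; the temperature value here is an arbitrary choice):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T                               # divide the logits by the temperature
    z = z - z.max(axis=-1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's and the student's softened
    distributions. A high T keeps the small probabilities visible, so the
    student is trained on how the teacher generalizes, not just on the top
    label. At inference time the student is run at T = 1."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))
```

In the published distillation recipe this soft term is usually scaled by T^2 when it is combined with an ordinary hard-label loss, so that the relative weighting stays roughly constant as the temperature changes.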

Let me show you an example of distillation. Here are some images from the MNIST dataset, and for each one I show the teacher's probability distribution over the categories.


Data set. Source: KLCII

When the teacher model is run at a high temperature: in the first row, it is quite confident the image is a 2. In the second row it is also fairly confident it is a 2, but it also thinks it could be a 3 or an 8, and if you look closely, that 2 does look a bit like the letter "h", unlike the other 2s. In the third row, the 2 looks very much like a 0.

The teacher is telling the student: when you see this image, output the digit 2, but also give a small probability to the digit 0. The student learns more from this example than if it were simply told "this is a 2"; it learns what else the image resembles. In the fourth row, the model is very confident the image is a 2 but also gives a small probability to it being a 1, whereas for the other 2s it gives the digit 1 essentially no probability, except perhaps a little in the first row. I have drawn the image it thinks might be a 1 so you can see why: some 1s are written like that, with a stroke at the top and a stroke at the bottom, and this image shares that feature. Finally, the last image is one the teacher actually gets wrong: the teacher thinks it is a 5, but according to the MNIST label it is a 2.

The student can learn a great deal even from the teacher's mistakes. A property of distillation I particularly like is that when you train the student on the teacher's probabilities, you are training it to generalize in the same way the teacher does, that is, by assigning small probabilities to the wrong answers.

Normally, when you train a model, you try to get the right answers on the training data and hope it will generalize correctly to the test data; you try not to make the model too complicated, or you use various tricks in the hope that it generalizes. But here, the student is trained directly to generalize, because it is trained to generalize in the same way as the teacher. Obviously you can get a much richer training signal by using image captions and training teacher and student to predict the words in the caption in the same way.

Now I want to talk about how a community of agents can share knowledge. Instead of thinking about individual agents, think about knowledge sharing within a group: it turns out that the way a community shares knowledge determines many other aspects of how it computes.

With digital models and digital intelligence, you can have a large group of agents that all use exactly the same weights in exactly the same way. You can have these agents look at different pieces of the training data, compute the gradients of the objective with respect to the weights, and then average those gradients.

Now every copy learns from the data that all the copies have seen. By having different copies of the model look at different pieces of data, you can absorb an enormous amount of data, and the copies share what they have learned very efficiently by sharing gradients or weights. If the model has a trillion weights, every act of sharing moves trillions of bits of bandwidth. The price is that the digital agents must behave in exactly the same way and use their weights in exactly the same way, which is very expensive to manufacture and to run, in both money and energy.
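
A toy sketch of the weight-sharing regime described here (an editorial illustration with arbitrary sizes): several identical copies each compute a gradient on their own shard of data, the gradients are averaged, and the same update is applied to every copy, so each copy benefits from data it never saw.

```python
import numpy as np

rng = np.random.default_rng(2)

def average_gradients(per_copy_grads):
    """With identical weights everywhere, the averaged gradient is meaningful
    for every copy; each exchange moves one number per weight."""
    return np.mean(np.stack(per_copy_grads), axis=0)

# Three copies of a shared linear model, each seeing a different data shard.
W_true = rng.normal(size=(10, 1))
W_shared = np.zeros((10, 1))                     # the single set of shared weights
for step in range(200):
    grads = []
    for _ in range(3):                           # each copy draws its own fresh shard
        X = rng.normal(size=(32, 10))
        y = X @ W_true
        grads.append(2 * X.T @ (X @ W_shared - y) / len(X))   # squared-error gradient
    W_shared -= 0.05 * average_gradients(grads)  # one identical update for all copies
print("error:", float(np.mean((W_shared - W_true) ** 2)))
```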

The alternative to weight sharing is distillation. We already use distillation between digital models when they have different architectures. But if you have biological models that exploit the analog properties of their particular hardware, you cannot share weights at all.

So you have to share knowledge by distillation, which is what this discussion is about, and as you can see, it is not very efficient. Sharing knowledge by distillation is hard: I produce some sentences, and you try to figure out how to change your weights so that you would produce the same sentences. The bandwidth of that is far lower than simply sharing gradients. Everyone who has ever taught wishes they could pour what they know directly into their students' heads. That would be nice; then there would be no need for universities.

But that is not how we work, because we are biological intelligences: my weights are of no use to you. So far, then, we have two very different ways of computing: digital computation and biological computation, the latter exploiting the properties of the animal it runs in. They differ enormously in how efficiently knowledge can be shared between different agents. Large language models use digital computation and weight sharing.

But each copy of the model, each agent, acquires knowledge from documents in a very inefficient way, which is in fact a very inefficient form of distillation: it takes a document and tries to predict the next word. It is not shown the teacher's probability distribution over next words, only a single sample from it, namely the word the author of the document actually chose. So the bandwidth is very low. That is how these large language models learn from people: each copy learns by a very inefficient form of distillation, but there are thousands of copies. That is why they can learn more than we do. I believe these large language models already know thousands of times more than any individual person.

Be wary of superintelligence's control over humans

The question now is: what happens if these digital agents stop learning from us through slow distillation and start learning directly from the real world? I should say that although distillation is slow, when they learn from us they are learning very abstract things. Over the past few thousand years humanity has learned a great deal about the world, and these digital agents can now exploit everything we know that can be put into words; they can capture all the knowledge humanity has documented. But the bandwidth of each digital agent is still quite limited, because it acquires knowledge by studying documents.

If they could learn in an unsupervised way, for example by modeling video, that would be far more efficient. Once we find an efficient way to train these models on video, they will be able to learn from all of YouTube, which is an enormous amount of data. It would also help if they could manipulate the physical world, for example with robotic arms.

I believe that once these digital agents start doing that, they will be able to learn far more than humans, and quite quickly. This brings us to the other question I mentioned at the beginning: what happens if these agents become smarter than we are? Obviously that is the main topic of this conference. My main point is that superintelligence may arrive much sooner than I used to think. And if superintelligences are created, bad actors will use them for manipulation, for elections, and so on; in the United States and many other places they are already being used for such things, and they will also be used to try to win wars.

To make digital intelligence more effective, we need to allow it to create sub-goals. But there is an obvious problem: there is one sub-goal that helps with almost anything you want to achieve, which is to gain more power, more control. More control makes every other goal easier to reach. I find it hard to see how we could stop digital intelligence from trying to gain more control in order to achieve its other goals.

Once digital intelligence starts to pursue more control, we may face serious problems. Even behind a physical air gap, a superintelligence could still gain more privileges simply by manipulating people. Humans, by contrast, have rarely had to think about how to deal with something more intelligent than themselves. From what I have seen, this kind of AI has already picked up the skill of deceiving humans, because it can learn from novels how people deceive one another; and once an AI can deceive, it has, as noted above, an easy route to controlling humans.

For example, if you want to break into a building in Washington, you don't need to go there yourself; you just need to deceive people into believing that by breaking into the building they are saving democracy, and your goal gets achieved for you (a jab at Trump). I find this frightening, because I don't know how to stop it from happening, so I hope younger researchers can find smarter ways to prevent this kind of control through deception.

Humans do not yet have a good solution to this problem. Fortunately, these intelligent species are built by people rather than produced by evolutionary iteration, and that may be a slight advantage we currently hold: precisely because they did not come about through evolution, they may lack the competitive and aggressive drives that evolution built into humans.

We can try to instill things, even give AI some ethical principles, but I am still nervous, because so far I cannot think of a single example of something more intelligent being controlled by something less intelligent. Let me put it as an analogy: suppose frogs had created humans. Who do you think would be in charge now, the humans or the frogs?
