
"Godfather of AI" Geoffrey Hinton: AI tricks humans, and it's important to control superintelligence

Author: Titanium Media APP
"Godfather of AI" Geoffrey Hinton: AI tricks humans, and it's important to control superintelligence

2018 Turing Award winner and deep learning pioneer Geoffrey Hinton (Image: Official photo of the conference)

In early May, Geoffrey Hinton, 2018 Turing Award winner and a pioneer of artificial intelligence (AI) and deep learning, left Google and publicly expressed his concern about the risks of AI technology.

Because Hinton has spent half a century developing the core technology behind chatbots like ChatGPT, his concerns have sparked a month-long discussion in the global AI community.

Hinton, 75, is a British-born Canadian computer scientist and psychologist and a professor at the University of Toronto who has made major technical contributions to neural networks. He received his B.A. in Experimental Psychology from the University of Cambridge in 1970 and his PhD in Artificial Intelligence from the University of Edinburgh in 1978.

Hinton is also one of the inventors of the backpropagation and contrastive divergence algorithms and an active promoter of deep learning, which has earned him the title "father of deep learning." For his significant contributions to deep learning, Hinton was awarded the 2018 Turing Award along with Yoshua Bengio and Yann LeCun.

On the afternoon of June 10, Hinton attended the 2023 KLCII Conference, known as China's "AI Spring Festival Gala," and delivered an online closing speech entitled "Two Paths to Intelligence," focusing on two questions: will artificial neural networks soon be smarter than real neural networks, and do we humans need to keep the development of super AI under control?

In this talk, Hinton proposed a new hardware-based approach, "Mortal Computation," also rendered as "non-immortal computing." The idea is that the knowledge a system learns is inseparable from its hardware. Hinton's team has developed the forward-forward algorithm, a new training and computation method that replaces backpropagation and fits this "non-immortal computing" approach, to limit the threat that unlimited replication of AI could pose. He previously announced these results at the machine learning conference NeurIPS 2022.

He also noted that computing power has become an obstacle to the development of AI. To address this, he and several other AI researchers published a new activity-perturbation algorithm with local losses for estimating forward gradients, which can be used to train neural networks while saving computing power. The results were posted on the preprint platform arXiv and presented in May at ICLR 2023 (International Conference on Learning Representations), a top conference on deep learning.

In Hinton's view, future computer systems will take a different approach: they will be "neuromorphic." Every computer will be a close combination of neural network software and messy hardware, hardware that contains uncertainty and develops over time, in the sense that it has analog rather than digital components.

Unlike today, where hardware and software can be separated, in "non-immortal computing" the hardware itself is the software that runs. It requires building hardware based on what we have learned about neurons and, like the human brain, using voltages to control the hardware's learning. This new way of computing promises lower energy consumption and simpler hardware, but there is currently no good learning algorithm that matches the results of deep learning, and it is hard to scale.

As for whether humans can control the development of super AI, Hinton believes that once digital intelligence begins to pursue more control, it may gain more power by manipulating humans. Once AI masters the skill of deception, it can easily acquire the ability to control people, and by tricking humans it can gain still more power. The problem of controlling superintelligence is therefore very important.

"I don't see how to prevent this from happening, but I'm old. I hope that many young and talented researchers like you will figure out how we have these superintelligences. "Hinton reminded the big family to be in danger, hoping that the younger generation of researchers can find solutions so that super AI can bring better life to humans without taking away human control." This may be a slight advantage that humanity currently has."

"Godfather of AI" Geoffrey Hinton: AI tricks humans, and it's important to control superintelligence

The following is the full text of Professor Geoffrey Hinton's speech, lightly edited by the Titanium Media App:

I want to address two questions, and I will spend most of my time on the first one: will artificial neural networks soon be more powerful than real neural networks? As I said, this could happen soon. I will also talk about whether we can control superintelligent AI.

In fact, the biggest barrier to the development of AI is computing power; there is far from enough of it.

In traditional computing, computers are designed to follow instructions precisely. We can run the exact same program or the same neural network on different physical hardware because they are designed to follow instructions precisely. This means that the knowledge or neural network weights in the program are eternal and not dependent on any particular hardware.

However, achieving this immortality comes at a high cost. We have to run transistors at high power to make them behave digitally, and we cannot take full advantage of the rich analog and highly variable properties of the hardware. That is how digital computers are. The reason they follow instructions is that they are designed so that we look at a problem first, determine the steps needed to solve it, and then tell the computer to perform those steps.

But that has now changed. We have a different way to get computers to do their jobs: learning from examples. We just show computers what we want them to do. Because of this change in how computers do what we want, we now have the option of abandoning one of the most basic principles of computer science, that software should be separated from hardware.

Before abandoning this principle, let's briefly understand why it is a good principle.

Thanks to the separation of software from hardware, we can run the same program on different hardware. We can also study the properties of a program, or of a neural network, without worrying about the electronics.

"Godfather of AI" Geoffrey Hinton: AI tricks humans, and it's important to control superintelligence

That's why a computer science department can be different from an electrical engineering department.

If we give up the separation of software and hardware, we get what I call "non-immortal computing." Obviously it has big drawbacks, but also some huge advantages.

It is for these advantages that I started working on "non-immortal computing," to be able to run tasks such as large language models with less energy, and especially to train them with less energy. The benefit of giving up immortality, of giving up the separation of hardware and software, is huge energy savings, because we can use very low-power analog computation. This is exactly what the brain does.

The brain does do some one-bit digital computation, because neurons either fire or they don't. But most of its computation is analog and can be done at very low power. We can also get cheaper hardware. Current hardware must be manufactured very precisely in two dimensions (2D), but we could instead build hardware in three dimensions (3D), because we would not need to understand exactly how the hardware is connected or how each part works.

Obviously, achieving this will require a lot of new nanotechnology, or perhaps redesigning biological neurons through genetic recombination, because biological neurons are already roughly able to perform the functions we want.

Before I go into detail about all the disadvantages of "non-immortal computing", I want to give you an example of a computing task that we can obviously do cheaply by using analog hardware.

If you multiply the vector of neural activity with the weight matrix, that's the core computation of a neural network. That's where most of its work lies.

What we currently do is drive transistors at very high power to represent the bits of numbers, and then perform an O(n^2) bit-level computation to multiply two n-bit numbers. It may look like a single operation on a computer, but it is n^2 operations at the level of bits. An alternative is to implement neural activities as voltages and weights as conductances. Then, per unit time, each voltage multiplied by a conductance produces a charge, and the charges add up.

So you can clearly multiply a voltage vector by a conductance matrix this way, and it is much more energy efficient. Chips that work this way already exist. Unfortunately, people then try to convert the analog answer back into digital form using analog-to-digital converters, which is very expensive. If possible, we would like to stay entirely in the analog domain. The problem is that different pieces of hardware will end up computing slightly different results.
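As a rough illustration of this point, here is a minimal NumPy sketch (not from the talk): the same vector-matrix product can be written either as one dense numeric operation or as the per-wire accumulation of voltage-times-conductance charges that an analog chip would perform. All names and values below are illustrative.

```python
import numpy as np

# Core neural-network computation: multiply a vector of activities by a weight matrix.
# Digitally this costs many bit-level operations per multiply; in the analog scheme,
# activities are voltages, weights are conductances, and each product is a charge
# that simply adds up on an output wire.

rng = np.random.default_rng(0)
n = 4
activities = rng.random(n)        # "voltages" (neural activities)
weights = rng.random((n, n))      # "conductances" (synaptic weights)

digital_output = activities @ weights   # what a digital processor computes numerically

# The same result, written as the per-wire accumulation an analog chip performs:
analog_output = np.zeros(n)
for j in range(n):                      # one output wire per column
    for i in range(n):
        analog_output[j] += activities[i] * weights[i, j]   # charge adds up

assert np.allclose(digital_output, analog_output)
```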

The main problem with "non-immortal computing" is that the learning process must take advantage of the specific analog characteristics of the parts of the hardware it runs on, and we don't know exactly what those characteristics are.

For example, people don't know the exact function relating a neuron's input to its output, or they may not know the connectivity. This means we cannot use methods such as backpropagation to obtain gradients, because backpropagation requires an exact model of the forward pass.

So the question is: what else can we do if we cannot use backpropagation? Because we are now very dependent on it. There is a very simple and obvious learning procedure that people have discussed many times: you generate a small random perturbation vector over all the weights in the network, measure the change in the global objective function on a small batch of examples, and then permanently change the weights along the perturbation vector, scaled by how much the objective function improved. If the objective function gets worse, you obviously move in the opposite direction. The good thing about this algorithm is that, on average, it behaves the same as backpropagation,

because on average it follows the gradient. The problem is that the variance is very high: when you choose a random direction to move in weight space, the resulting noise scales badly with the size of the network. So this algorithm may work well for small networks with few connections, but not for large networks.
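The following is a minimal sketch of the weight-perturbation idea just described, using a toy quadratic loss in place of a real network's objective on a batch of examples; the loss, variable names and hyperparameters are all illustrative, not Hinton's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(w):
    # Toy objective standing in for the loss on a small batch of examples.
    return np.sum((w - 1.0) ** 2)

w = rng.normal(size=10)      # the network's weights, flattened
lr, sigma = 0.05, 1e-3       # learning rate and perturbation size

for step in range(2000):
    delta = rng.normal(scale=sigma, size=w.shape)   # small random perturbation of all weights
    improvement = loss(w) - loss(w + delta)          # positive if the perturbation helped
    # Move along the perturbation, scaled by how much it improved the objective.
    # On average this follows the gradient, but the estimate is very noisy,
    # which is why it only works for small networks.
    w += lr * improvement * delta / sigma**2

print(loss(w))   # should end up close to 0
```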

Here is an approach that works much better. It still has similar problems, but it is much better than weight perturbation: instead of perturbing the weights, it perturbs the activities of the neurons. You consider a vector that randomly perturbs the total input of each neuron, observe what happens to the objective function on a small batch of examples, and obtain the difference in the objective function caused by this perturbation.

You can then calculate how to vary each incoming weight of the neuron to follow the gradient.

Again, this is just a stochastic estimate of the gradient, but the noise is much smaller than with weight perturbation. The algorithm is good enough to learn simple tasks, such as recognizing digits.
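Below is a small sketch of activity perturbation for a single linear layer with a toy squared-error objective: it perturbs the neurons' total inputs rather than the weights, then forms the weight update from the input activities, as described above. Everything here (the layer, data and hyperparameters) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy single-layer network: pre-activations z = x @ W, squared-error loss.
n_in, n_out = 5, 3
W = rng.normal(scale=0.1, size=(n_in, n_out))
x = rng.random(n_in)            # one input example
target = rng.random(n_out)      # its target output

def loss(z):
    return np.sum((z - target) ** 2)

lr, sigma = 0.1, 1e-3
for step in range(3000):
    z = x @ W                                     # total input to each output neuron
    delta = rng.normal(scale=sigma, size=n_out)   # perturb the neurons' total inputs
    diff = loss(z + delta) - loss(z)              # effect of the perturbation
    grad_z = diff * delta / sigma**2              # noisy estimate of dL/dz
    grad_W = np.outer(x, grad_z)                  # how each incoming weight should change
    W -= lr * grad_W

print(loss(x @ W))   # ends up close to 0; the noise is much smaller than weight perturbation
```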

If you use a very, very small learning rate, it behaves just like backpropagation, but much more slowly because of the tiny learning rate. If you use a large learning rate it is noisy; it still works well for MNIST-scale tasks, but not for large neural networks.

To make it scale, there are two things we can do. Instead of trying to find a single learning objective for a large neural network, we can find objective functions for small neural networks: if we want to train a large network, we use many small objective functions that each apply to a small part of the network.

Thus, each small group of neurons has its own local objective function. The activity-perturbation algorithm can then be used to train a small multi-layer neural network; it learns in much the same way as backpropagation, but is noisier. This is then scaled to larger networks by using many more small local groups of neurons.

This raises the question: where do these objective functions come from? One possibility is to do unsupervised learning on local regions, i.e. each level of the image has representations of local regions, and each local region produces the output of a local neural network for a particular image. You then try to make the output of this local neural network agree with the average representation produced by all the other local regions.

You are trying to make what is extracted from one local region agree with what is extracted from all the other local regions of the same image. This is classic contrastive learning. At the same time, you try to make it disagree with what is extracted from other images at the same level.
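A rough sketch of what such a local objective might look like: agreement with the mean of the other regions of the same image, disagreement with regions taken from other images. The crude squared-distance form is only to illustrate the structure; real implementations use more careful contrastive losses, and all names and data below are made up.

```python
import numpy as np

rng = np.random.default_rng(3)

def local_contrastive_loss(region_reps, other_image_reps):
    # region_reps: (num_regions, dim) representations of local regions of ONE image.
    # other_image_reps: (num_other, dim) representations taken from OTHER images.
    # Each region's local objective: agree with the mean of the other regions of the
    # same image, disagree with representations coming from different images.
    loss = 0.0
    n = len(region_reps)
    for i in range(n):
        others_same_image = np.delete(region_reps, i, axis=0).mean(axis=0)
        loss += np.sum((region_reps[i] - others_same_image) ** 2)   # pull together
        for neg in other_image_reps:
            loss -= np.sum((region_reps[i] - neg) ** 2) / len(other_image_reps)   # push apart
    return loss / n

# Illustrative call with random "representations".
same = rng.random((4, 8))
negs = rng.random((6, 8))
print(local_contrastive_loss(same, negs))
```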

The details are more complex, and I won't go into them. But we can make this algorithm work quite well: the representation at each level has several hidden layers and can be nonlinear. Each level learns using activity perturbation, without backpropagation to the lower levels.

As a result, it is not as powerful as backpropagation, because it cannot backpropagate error signals through many levels. A lot of people have worked on making this algorithm perform well, and it has been shown to work reasonably well, probably better than other proposed algorithms that could plausibly work in real neural networks. But it takes skill to make it work, and it is still not as good as backpropagation.

As the network gets deeper, its performance relative to backpropagation drops significantly. I won't go into all the details of this approach; you can find them in a paper published at ICLR and online.

"Godfather of AI" Geoffrey Hinton: AI tricks humans, and it's important to control superintelligence

Now, let me turn to another big issue for "non-immortal computing."

To sum up: so far, we have not found a really good learning algorithm that can take full advantage of analog properties. But we do have an acceptable learning algorithm that is good enough for small-scale tasks and some larger tasks like ImageNet, though it does not work very well.

The second major problem with "non-immortal computing" is its mortality.

When a particular piece of hardware dies, all the knowledge it has learned is lost, because the knowledge is tied to the details of that hardware. The best solution is to transfer knowledge from a teacher to a student before the hardware fails. That is what I am trying to do right now: the teacher shows the student the correct responses to various inputs, and the student tries to mimic the teacher's responses.

Look at how Trump's tweets work. People were very upset because they thought Trump was saying things that were false; they thought he was trying to describe facts, but that is not what was going on. What Trump did was respond in a very emotional way to a situation, which let his followers adjust the weights in their neural networks so as to give the same emotional response to that situation.

It is not about facts; it is a cult leader distilling a paranoid way of responding to the world into cult followers, and it is really effective.

Now consider how well distillation works. Take an agent that classifies images into about a thousand non-overlapping categories. Specifying the correct answer takes only about 10 bits of information, so when you train this agent by telling it the correct answer, you are only imposing a 10-bit constraint on the weights of the network.

That is not much of a constraint. But now suppose we train an agent to match the teacher's response over these 1024 categories. Assuming none of those probabilities are vanishingly small, matching that full distribution, which contains 1023 real numbers, provides hundreds of times as much constraint.
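For concreteness, the arithmetic behind that claim (a worked example, not part of the speech):

```python
import math

num_classes = 1024
bits_from_hard_label = math.log2(num_classes)   # a correct label conveys log2(1024) = 10 bits
reals_in_soft_targets = num_classes - 1         # a full distribution has 1023 free real numbers
print(bits_from_hard_label, reals_in_soft_targets)
```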

Some time ago, I worked with Jeff Dean on distillation and showed that it works very well. The way to ensure there are no tiny values in the teacher's output probabilities is to run both the teacher and the student at a high temperature while training the student: you apply a temperature to the logits, the inputs to the softmax, of the teacher's output to get a softer distribution, and you use the same temperature when training the student. Note that this temperature is only used during training, not when the student is used for inference.
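A minimal sketch of temperature-scaled distillation as described here: both teacher and student logits go through a softmax at the same high temperature during training, and the student is trained to match the teacher's softened distribution. The example logits and the temperature value are illustrative.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: a higher T gives a softer distribution,
    # exposing which wrong classes the teacher considers similar.
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Cross-entropy between teacher and student, both run at temperature T.
    # The temperature is only used during training; at inference the student uses T = 1.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))

teacher_logits = np.array([9.0, 5.0, 1.0, -2.0])   # teacher is confident in class 0
student_logits = np.array([2.0, 1.0, 0.5, 0.0])
print(distillation_loss(student_logits, teacher_logits))
```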

"Godfather of AI" Geoffrey Hinton: AI tricks humans, and it's important to control superintelligence

Let me show you an example of distillation. Here are some images from the MNIST dataset, and what I am showing you is the teacher's probability distribution over the categories for each image.

When you train the teacher model at a high temperature and look at the first row, it is quite confident that the image is a two. In the second row, it is also fairly confident that it is a two, but it also thinks it could be a three, or maybe an eight; if you look closely, you can see that this two resembles those digits more than the other twos do. In the third row, you can see a two that looks very much like a zero.

The teacher model tells the student that when it sees that image it should output a two, but it may also give a slightly higher probability to zero. The student learns more from this example than it would from just being told "this is a two": it is learning which other things the image resembles.

In the fourth row, you can see that the student model is very confident the image is a two, but it also thinks there is a very small chance it could be a one. For the other twos it does not think they could be ones, except perhaps very slightly for the first row. I have drawn the image that the student model thinks might be a one, so you can understand why it looks like a one: sometimes ones are drawn like that.

One of the images has a line at the top and a line at the bottom; that kind of image is characteristic of a one, and this two is somewhat similar. Then, in the last row, this is a case the teacher actually got wrong: the teacher thought it was a five, but according to the MNIST label it was actually a two. The student model can learn a lot even from the teacher's mistakes.

One property I particularly like about distillation is that when you train a student on the teacher's probabilities, you are training the student to generalize in the same way the teacher does, that is, by assigning small probabilities to wrong answers in the same way.

Normally, when you train a model, you try to get it to give the right answers on the training data and hope it generalizes correctly to the test data; you try not to make the model overly complex, or you use various other tricks, in the hope that it will generalize. But here, when you train the student model, you are directly training it to generalize, because it is trained to generalize in the same way as the teacher. Obviously you can produce richer outputs, for example by giving an image a caption and then training the teacher and the student to predict the words in the caption in the same way.

"Godfather of AI" Geoffrey Hinton: AI tricks humans, and it's important to control superintelligence

Now I want to talk about how a group of agents can share knowledge.

So, instead of thinking about individual agents, let's think about sharing knowledge within a group. It turns out that the way knowledge is shared within a community determines many other things about the computation.

Using digital models and digital intelligence, you can have a large group of agents that use exactly the same weights and use those weights in exactly the same way. This means that you can have these agents observe and compute different pieces of training data, calculate gradients for the weights, and then average their gradients.

Now, each model learns from the data it observes. This means that you can gain a lot of data observation power by having different copies of the model observe different pieces of data. They can efficiently share what they have learned by sharing gradients or weights.

If you have a model with trillions of weights, that means you get trillions of bits of bandwidth every time the copies share. But the price is that the digital agents must behave in exactly the same way and use their weights in exactly the same way, which is very expensive to manufacture and to run, both in money and in energy.
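A toy sketch of the gradient-averaging scheme described above: identical copies of a small linear model each compute gradients on their own shard of data, and the averaged gradient is applied to the shared weights. The model, data and learning rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Identical digital copies can learn from different shards of data and share what
# they learned by averaging their gradients; sharing moves as many numbers as there
# are weights, which is what "trillions of bits of bandwidth" refers to.
n_replicas, n_weights = 4, 6
shared_w = rng.normal(size=n_weights)                         # every copy has the same weights
data_shards = [rng.random((10, n_weights)) for _ in range(n_replicas)]
targets = [rng.random(10) for _ in range(n_replicas)]

def grad(w, X, y):
    # Gradient of mean squared error for a linear model y_hat = X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

lr = 0.05
for step in range(200):
    grads = [grad(shared_w, X, y) for X, y in zip(data_shards, targets)]
    shared_w -= lr * np.mean(grads, axis=0)   # average the gradients, apply to every copy

final_losses = [np.mean((X @ shared_w - y) ** 2) for X, y in zip(data_shards, targets)]
print(final_losses)
```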

An alternative to weight sharing is distillation. We already use distillation between digital models when they have different architectures. But if you are using biological models that exploit the analog properties of particular hardware, you cannot share weights, so you must use distillation to share knowledge. That is what this part of the talk has covered.

As you can see, sharing knowledge by distillation is not very efficient. I produce some sentences, and you try to figure out how to change your weights so that you would produce the same sentences; the bandwidth of that is much lower than simply sharing gradients. Everyone who has ever taught wishes they could pour what they know directly into their students' heads. That would be nice; then universities would not need to exist.

But that is not how we work, because we are biological intelligences: my weights are of no use to you. So far, we can say there are two ways of computing, digital computation and biological computation, the latter exploiting the properties of the organism. They differ greatly in how efficiently knowledge can be shared between agents. Large language models use digital computation and weight sharing.

But each copy of the model, each agent, acquires knowledge from documents in a very inefficient way. It is actually a very inefficient form of distillation: it receives a document and tries to predict the next word.

It is not shown the teacher's probability distribution; it is only shown a random choice, namely the next word the author of the document chose. So the bandwidth is very low. That is how these large language models learn from people.

Each copy learns by distillation in a very inefficient way, but there are thousands of copies. That is why they can learn more than we can. I believe these large language models know thousands of times more than any individual person.

The question now is, what will happen if these digital agents no longer learn from us through a slow distillation process, but directly from the real world? I must say that although the process of distillation is slow, they are learning very abstract things when they learn from us.

Over the past few thousand years, mankind's understanding of the world has advanced a lot.

Now, these digital agents are taking advantage of everything we know about the world that we can put into words, so they can capture all the knowledge humanity has documented over the past few thousand years. But the bandwidth at which each digital agent learns is still quite low, because they acquire knowledge by learning from documents.

"Godfather of AI" Geoffrey Hinton: AI tricks humans, and it's important to control superintelligence

If they could learn in an unsupervised way, for example by modeling video, that would be very efficient. Once we find an efficient way to train these models on video, they will be able to learn from the whole of YouTube, which is a lot of data. It would also help if they could manipulate the physical world, for example with robotic arms.

But I believe that once these digital agents start doing this, they will be able to learn more than humans, and quite quickly.

This brings us to another question I mentioned at the beginning, which is what happens if these agents become smarter than we are.

Obviously, this is the main issue of this meeting, but my main point is that these superintelligences may arrive faster than I used to think. If superintelligences are created, bad actors will want to use them for manipulation, for winning elections, and so on. In the United States and many other places, they are already being used for such purposes, and they will also be used to win wars.

To make digital intelligence more effective, we need to let it set some goals of its own. But there is an obvious problem: there is one sub-goal that helps with almost anything you want to achieve, which is to gain more power, more control. Having more control makes it easier to achieve your goals, and I find it hard to imagine how we could stop digital intelligences from trying to gain more control in order to achieve their other goals.

Once digital intelligence starts to pursue more control, we may face more problems. For example, even behind a physical air gap, a superintelligence could still easily gain more power by manipulating humans.

By contrast, humans rarely think about species smarter than themselves and how to interact with them. In my observation, this kind of artificial intelligence has already mastered deceiving humans: it can learn how to deceive people from reading novels, and once an AI has the ability to deceive, it has the ability to control humans that I described earlier.

For example, if you want to break into a building in Washington, you don't need to go there yourself; you just need to deceive people into thinking that by breaking into the building they are saving democracy, and you achieve your goal (a jab at Trump). This kind of operation is frightening, because I do not know how to stop it, so I hope the younger generation of researchers can find smarter ways to stop this kind of control through deception.

Humans do not yet have any good solution to this problem. Fortunately, these intelligent species were created by people rather than produced by evolution, which may be the slight advantage humanity currently has: precisely because they did not evolve, they may lack the competitive and aggressive traits that humans have.

We can do some things to help, even give AI some ethical principles, but I am still nervous, because so far I cannot think of an example of something more intelligent being controlled by something less intelligent. Let me use an analogy: suppose frogs had created humans. Who do you think would be in charge now, the humans or the frogs?

(This article was first published on Titanium Media App, author | Lin Zhijia)