
Tesla AI Director: Reproducing LeCun's neural network from 33 years ago and finding it isn't so different from today's

Author: Jishi Platform

Source: The Heart of the Machine

Edited by: Jishi Platform

In 1989, Yann LeCun et al. published a paper titled "Backpropagation Applied to Handwritten Zip Code Recognition." In my opinion, this paper has some historical significance because, as far as I know, it is the earliest real-world application of a neural network trained end-to-end with backpropagation.

Paper link: http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf

Although both the dataset and the neural network are relatively small by today's standards (7,291 16x16 grayscale digit images; about 1,000 neurons), the paper still doesn't feel outdated 33 years later: it lays out a dataset, describes the architecture, the loss function, and the optimization, and reports experimental classification error rates on the training and test sets. It is so recognizable that it could pass for a modern deep learning paper, yet it is 33 years old. So I set out to reproduce it, partly for fun and partly as a case study of the nature of progress in deep learning.


Implementation

I tried to stay as close to the paper as possible and reproduced every detail in PyTorch; see the following GitHub repository:


Reproduction repo: https://github.com/karpathy/lecun1989-repro

The original network was implemented in Lisp using Bottou and LeCun's 1988 backpropagation simulator SN (later named Lush). The paper for SN is in French, so I couldn't really read it, but syntactically you specify neural networks with a high-level API, similar to what you would do in PyTorch today.

By contemporary software standards, such a library is organized into the following three parts:

1) a fast (C/CUDA) general-purpose Tensor library that implements basic mathematical operations on multidimensional tensors;

2) an autograd engine that tracks the forward compute graph and generates the operations for the backward pass;

3) a scriptable (Python) deep-learning-aware high-level API of common deep learning operations, layers, architectures, optimizers, loss functions, and so on.
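
To make that split concrete, here is a tiny illustrative PyTorch snippet of my own (not code from the repro) that touches each of the three layers in turn:

import torch
import torch.nn as nn
import torch.nn.functional as F

# (1) Tensor library: fast math on multidimensional tensors (optionally on CUDA).
x = torch.randn(64, 16 * 16)                      # a batch of flattened 16x16 images
w = torch.randn(16 * 16, 10, requires_grad=True)  # a weight matrix
h = torch.tanh(x @ w)                             # basic tensor ops

# (2) Autograd engine: the forward pass above recorded a graph; backward() traverses it.
h.sum().backward()                                # populates w.grad

# (3) High-level deep learning API: layers, losses, optimizers.
layer = nn.Linear(16 * 16, 10)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
loss = F.cross_entropy(layer(x), torch.randint(0, 10, (64,)))
loss.backward()
opt.step()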

Training

During training, we make 23 passes over the training set of 7,291 examples, for a total of 167,693 sample/label presentations to the neural network. The original network trained for 3 days on a SUN-4/260 workstation (released in 1987). Today, this implementation runs on my M1 MacBook Air CPU in about 90 seconds (roughly a 3,000x speedup). My conda environment is set up to use native arm64 builds rather than Rosetta emulation. The speedup could be even more dramatic if PyTorch fully supported the M1, including its GPU and NPU.

I also tried naively running the code on an A100 GPU, but training was actually slower, most likely because the network is so small (a 4-layer convnet with at most 12 channels, 9,760 parameters in total, 64K MACs, 1K activations) and SGD uses only one example at a time. That said, if one really wanted to attack this problem with modern hardware (an A100) and software infrastructure (CUDA, PyTorch), we would need to replace per-example SGD with full-batch training to maximize GPU utilization; that would likely buy an additional training speedup of roughly 100x.
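
To illustrate the contrast, here is a toy sketch with a stand-in model and random data of roughly the 1989 size (not the repro's code, and using cross-entropy as a stand-in loss):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative only: a placeholder model and random data roughly the size of the
# 1989 setup, contrasting per-example SGD with a full-batch step.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 12), nn.Tanh(),
                      nn.Linear(12, 10)).to(device)
X = torch.randn(7291, 1, 16, 16, device=device)
Y = torch.randint(0, 10, (7291,), device=device)
opt = torch.optim.SGD(model.parameters(), lr=0.03)

# 1989-style: one example per step -> thousands of tiny kernel launches per pass.
for i in range(X.shape[0]):
    loss = F.cross_entropy(model(X[i:i + 1]), Y[i:i + 1])
    opt.zero_grad(); loss.backward(); opt.step()

# Full-batch alternative: one large, GPU-friendly step per pass over the dataset.
loss = F.cross_entropy(model(X), Y)
opt.zero_grad(); loss.backward(); opt.step()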

Reproducing the results of the 1989 experiment

The original paper gives the following experimental results:

eval: split train. loss 2.5e-3. error 0.14%. misses: 10
eval: split test . loss 1.8e-2. error 5.00%. misses: 102           

But after the 23rd pass, my training script repro.py printed out:

eval: split train. loss 4.073383e-03. error 0.62%. misses: 45
eval: split test . loss 2.838382e-02. error 4.09%. misses: 82           

So I only roughly reproduced the paper's results; the numbers are not exact. It seems impossible to get exactly the same results as the original paper because, as far as I can tell, the original dataset has been lost to time. Instead, I had to simulate it using the larger MNIST dataset: take its 28x28 digits, scale them down to 16x16 pixels with bilinear interpolation, and randomly draw, without replacement, the correct number of training and test examples from it.
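
Here is roughly how such a simulation can be set up (a sketch under my own assumptions; repro.py may differ in details such as normalization):

import torch
import torch.nn.functional as F
from torchvision import datasets

# Downscale MNIST digits to 16x16 with bilinear interpolation and draw
# 1989-sized splits without replacement.
def make_split(train: bool, n: int):
    ds = datasets.MNIST(root="data", train=train, download=True)
    x = ds.data.float().unsqueeze(1) / 127.5 - 1.0   # roughly [-1, 1]
    x = F.interpolate(x, size=(16, 16), mode="bilinear", align_corners=False)
    idx = torch.randperm(x.shape[0])[:n]             # sample without replacement
    return x[idx], ds.targets[idx]

Xtr, Ytr = make_split(train=True, n=7291)    # 1989 training-set size
Xte, Yte = make_split(train=False, n=2007)   # 1989 test-set size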

But I'm sure there are other factors preventing an exact reproduction. For example, the paper's description of the weight initialization scheme is a bit too abstract, and the PDF appears to contain some formatting errors (e.g., erased decimal points and square root symbols). The paper tells us the weights are initialized uniformly from "2 4 / F", where F is the fan-in; I'm guessing this actually means "2.4/sqrt(F)", where the sqrt helps preserve the standard deviation of the outputs. The particular sparse connectivity structure between the H1 and H2 layers is also a problem: the paper only says it was "chosen according to a scheme that will not be discussed here", so I had to make some reasonable guesses using an overlapping block-sparse structure.
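
Written out as a hypothetical PyTorch helper, my guess at that initialization looks like this:

import math
import torch.nn as nn

def init_1989(layer: nn.Module) -> None:
    # My guess at the paper's scheme: weights ~ U(-2.4/sqrt(fan_in), 2.4/sqrt(fan_in)).
    # The paper's sparse connectivity changes the effective fan-in per unit, so this
    # is only an approximation.
    if isinstance(layer, (nn.Linear, nn.Conv2d)):
        fan_in = layer.weight[0].numel()
        bound = 2.4 / math.sqrt(fan_in)
        nn.init.uniform_(layer.weight, -bound, bound)
        if layer.bias is not None:
            nn.init.zeros_(layer.bias)

# usage: model.apply(init_1989)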

The paper also claims to use a tanh nonlinearity, but I worry this may actually have been a "normalized tanh" that maps ntanh(1) = 1, possibly with a scaled-down skip connection added, which was popular at the time to make sure the flat tails of the tanh still carry at least a little gradient. Finally, the paper uses a "special version of Newton's method that uses a positive diagonal approximation of the Hessian", but I only used SGD because it is significantly simpler, and, as the authors note, "this algorithm is not believed to bring a tremendous increase in learning speed."
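
For reference, one plausible form of such a "normalized tanh", though I can't be sure this is the exact variant the 1989 network used:

import torch

def ntanh(x: torch.Tensor) -> torch.Tensor:
    # A classic scaled tanh from that era: 1.7159 * tanh(2x/3), which satisfies
    # ntanh(1) == 1 and is less saturated near the +/-1 targets. A small linear
    # "leak" term (e.g. + 0.01 * x) could be added to keep some gradient in the
    # flat tails; whether the 1989 net did either is my speculation.
    return 1.7159 * torch.tanh((2.0 / 3.0) * x)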

Trying out modern methods

This is my favorite part. We live 33 years in this paper's future, and deep learning is an extremely active area of research. With our current understanding of deep learning and 33 years of accumulated R&D experience, how much can we improve on the original result?

My initial results were:

eval: split train. loss 4.073383e-03. error 0.62%. misses: 45
eval: split test . loss 2.838382e-02. error 4.09%. misses: 82           

First, to be clear, we are doing simple 10-way classification. But at the time this was modeled as a mean squared error (MSE) regression onto targets of -1 (for the negative class) or +1 (for the positive class), with tanh nonlinearities on the output neurons as well. So I removed the tanh on the output layer to get class logits and swapped in the standard (multiclass) cross-entropy loss function. This change greatly improves the training error, to the point of completely overfitting the training set:

eval: split train. loss 9.536698e-06. error 0.00%. misses: 0
eval: split test . loss 9.536698e-06. error 4.38%. misses: 87           

I suspect that if your output layer has a (saturating) tanh nonlinearity on top of an MSE loss, you have to be much more careful about the details of weight initialization. Second, in my experience a carefully tuned SGD works fine, but the modern Adam optimizer (with a 3e-4 learning rate) is almost always a strong baseline that needs little to no tuning.
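
For concreteness, here is a minimal sketch of the loss swap described above, using placeholder tensors rather than the repro's actual model outputs:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 10, requires_grad=True)   # raw outputs, output tanh removed
targets = torch.randint(0, 10, (8,))              # integer class labels

# 1989-style objective: tanh outputs regressed onto -1/+1 targets with MSE.
onehot = torch.full((8, 10), -1.0).scatter_(1, targets.view(-1, 1), 1.0)
mse_loss = ((torch.tanh(logits) - onehot) ** 2).mean()

# Modern objective: treat the raw outputs as logits and apply cross-entropy.
ce_loss = F.cross_entropy(logits, targets)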

So, to gain more confidence that the optimization wasn't hurting performance, I switched to AdamW with a learning rate of 3e-4, decaying it to 1e-4 toward the end of training. The results are as follows:

eval: split train. loss 0.000000e+00. error 0.00%. misses: 0
eval: split test . loss 0.000000e+00. error 3.59%. misses: 72           
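
The optimizer setup for this step might look roughly like the following; the model, the pass count, and the exact schedule are placeholders rather than the repro's actual code:

import torch
import torch.nn as nn

model = nn.Linear(16 * 16, 10)                    # stand-in for the real network
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

n_passes = 23                                     # pass count at this stage
for p in range(n_passes):
    # ... one full pass over the training data, calling opt.step() per example ...
    if p == int(0.75 * n_passes):                 # lower the LR late in training
        for g in opt.param_groups:                # (my guess at the exact schedule)
            g["lr"] = 1e-4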

The AdamW run gives a slight improvement over the SGD result. However, keep in mind that AdamW's default parameters also bring in a bit of weight decay, which helps fight overfitting. Since overfitting was still severe, I next introduced a simple data augmentation strategy: shift the input image by up to 1 pixel horizontally or vertically. However, because this effectively simulates a larger dataset, I also had to increase the number of passes from 23 to 60 (I verified that naively increasing the number of passes in the original setup did not significantly improve the results):

eval: split train. loss 8.780676e-04. error 1.70%. misses: 123
eval: split test . loss 8.780676e-04. error 2.19%. misses: 43           
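
The shift augmentation described above could be sketched as follows (my own version; the repro may handle borders and sampling differently):

import torch
import torch.nn.functional as F

def shift_augment(x: torch.Tensor, background: float = -1.0) -> torch.Tensor:
    # Shift a batch of images (N, C, H, W) by up to 1 pixel horizontally and/or
    # vertically, filling the vacated border with the background value.
    _, _, h, w = x.shape
    dx, dy = torch.randint(-1, 2, (2,)).tolist()
    xp = F.pad(x, (1, 1, 1, 1), value=background)      # pad 1 pixel on every side
    return xp[:, :, 1 + dy:1 + dy + h, 1 + dx:1 + dx + w]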

As the test error above shows, augmentation helps! Data augmentation is a fairly simple, standard tool against overfitting, but I didn't see it mentioned in the 1989 paper; perhaps it is an innovation that arrived a bit later? Since some overfitting remained, I pulled another modern tool out of the toolbox: dropout. I added a weak dropout of 0.25 just before the layer with the most parameters (H3). Because dropout sets activations to zero, it doesn't mix well with tanh, whose active range is [-1, 1], so I also replaced all the nonlinearities with the simpler ReLU. Because dropout introduces more noise during training, we also have to train longer, so I increased the number of passes to 80. The final result is as follows:

eval: split train. loss 2.601336e-03. error 1.47%. misses: 106
eval: split test . loss 2.601336e-03. error 1.59%. misses: 32           
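
As a rough illustration of these last two changes, a dropout-plus-ReLU head might look like this (layer sizes are placeholders, not the exact 1989 or repro architecture):

import torch.nn as nn

head = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.25),          # zeroed activations pair naturally with ReLU
    nn.Linear(12 * 4 * 4, 30),   # an "H3"-like fully connected layer
    nn.ReLU(),
    nn.Linear(30, 10),           # class logits (output tanh already removed)
)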

This leaves us with only 32/2007 errors on the test set! I verified that simply swapping tanh for ReLU in the original network does not bring substantial gains, so most of the improvement here comes from dropout. Overall, if I traveled back to 1989, I would have cut the error rate by about 60% (from roughly 80 misses down to 30), putting the overall test-set error at about 1.5%. This gain isn't quite a "free lunch", because we have also increased training time nearly 4x (from 3 days to 12 days), but inference time is unaffected. The remaining errors are shown below:

[Image: the remaining misclassified test-set digits]

Take it a step further

However, after swapping MSE → softmax and SGD → AdamW, adding data augmentation and dropout, and replacing tanh with ReLU, I started to run out of low-hanging fruit. I tried a few more things (e.g., weight normalization) but did not get substantially better results.

I also tried shrinking a ViT down to a "micro-ViT" that roughly matched the parameter count and FLOPs, but it couldn't match the convnet's performance. Of course, we have seen many other innovations over the past 33 years, but many of them (e.g., residual connections, layer/batch normalization) only become relevant in much larger models and are mostly used to stabilize the optimization of large models. At this point, further gains would have to come from scaling up the network, but that would increase inference latency at test time.

Scaling up the dataset

Another way to improve performance is to scale up the dataset, although this comes at the (dollar) cost of labeling. Here, again, is our original baseline:

eval: split train. loss 4.073383e-03. error 0.62%. misses: 45
eval: split test . loss 2.838382e-02. error 4.09%. misses: 82           

Since all of MNIST is now available to us, we can scale the training set up by about 7x (from 7,291 to 50,000 examples). Just from the added data, running the baseline training for 100 passes already shows some improvement:

eval: split train. loss 1.305315e-02. error 2.03%. misses: 60
eval: split test . loss 1.943992e-02. error 2.74%. misses: 54           
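
A sketch of that dataset-scaling step, under my own assumptions about how the MNIST training split is used (the repro may slice it differently):

import torch.nn.functional as F
from torchvision import datasets

# Use 50,000 of MNIST's training digits instead of 7,291, still downscaled to 16x16.
mnist = datasets.MNIST(root="data", train=True, download=True)
x = mnist.data.float().unsqueeze(1) / 127.5 - 1.0
x = F.interpolate(x, size=(16, 16), mode="bilinear", align_corners=False)
Xtr, Ytr = x[:50_000], mnist.targets[:50_000]     # ~7x the 1989 training set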

Further combining this with the modern techniques described in the previous section gives the best performance:

eval: split train. loss 3.238392e-04. error 1.07%. misses: 31
eval: split test . loss 3.238392e-04. error 1.25%. misses: 24           

In summary, back in 1989, simply scaling up the dataset would have been an effective way to improve system performance without compromising inference latency.

Reflections

Let's summarize what we, as time travelers from 2022, learned about the deep learning SOTA of 1989:

  • First, not much has changed at the macro level in 33 years. We are still building differentiable neural network architectures made of layers of neurons and optimizing them end-to-end with backpropagation and stochastic gradient descent. Everything reads as very familiar, except that the network in 1989 was much smaller.
  • By today's standards, the 1989 dataset is a "baby": the training set has only 7,291 16x16 grayscale images. Today's visual datasets typically contain hundreds of millions of high-resolution color images from the web (Google has JFT-300M, OpenAI CLIP was trained on 400M images) and are growing toward billions. Each image carries roughly 1,000x more pixel information (384*384*3 / (16*16)), and the number of images has grown roughly 100,000x (1e9/1e4), so the raw pixel input has grown by well over a factor of 100,000,000.
  • Neural networks at the time were also "babies": they had about 9760 parameters, 64K MACs, and 1K activations. The size of current (visual) neural networks reaches billions of parameters, while natural language models can reach trillions of parameters.
  • In those days it took 3 days to train a SOTA classifier on a workstation; today the same thing trains in 90 seconds on a fanless laptop (about 3,000x faster), and switching to full-batch optimization on a GPU could buy another 100x or more.
  • In fact, by modernizing the model, augmentation, loss function, and optimizer, I was able to cut the error rate by 60% while keeping the dataset and the model's test-time latency unchanged.
  • Modest gains can be made simply by expanding the data set.
  • Further significant gains would have to come from a larger model, which would require more compute and additional R&D to stabilize training at ever-growing scale. If I had actually been transported back to 1989 without a bigger computer, I would ultimately have hit a limit on how much I could improve the system.

Assuming the lessons of this exercise stay invariant in time, what does that imply for the deep learning of 2022? What would a time traveler from 2055 think of the performance of our current networks?

  • The neural networks of 2055 are basically the same as the neural networks of 2022 at the macro level, just much larger.
  • Our datasets and models of today look like a joke; both are roughly 10,000,000 times larger in 2055.
  • One can train a 2022 SOTA model in about a minute, on a personal computer, as a weekend hobby project.
  • Today's models are not optimally specified; just changing some details of the model, the loss function, the augmentation, or the optimizer can roughly halve the error.
  • Our datasets are too small, and modest gains come simply from scaling them up.
  • Further gains are not possible without scaling up the computing infrastructure and investing in the R&D needed to train models efficiently at that scale.

But the most important trend I want to point out is that, with the emergence of foundation models like GPT, the whole setup of training a neural network from scratch for some target task (such as digit recognition) is quickly becoming obsolete in favor of fine-tuning. These foundation models are trained by only a few institutions with substantial computing resources, and most applications are achieved through lightweight fine-tuning, prompt engineering, or an optional step of distilling the data and model into smaller, special-purpose inference networks.

I think this trend will be very much alive in the future. To put it boldly: you won't want to train a neural network at all. In 2055, you will ask a 10,000,000x-larger neural network megabrain to perform some task simply by talking to it. And if your request is clear enough, it will oblige.

Sure, you can also train a neural network yourself, but why would you?

Original link: https://karpathy.github.io/2022/03/14/lecun1989/