Why are bigger neural networks better? This NeurIPS paper proves that robustness is the basis of generalization

Reporting by XinZhiyuan

Editor: LRS

That bigger neural networks are better has almost become a consensus, yet the idea runs contrary to traditional function-fitting theory. Recently, Microsoft researchers published a paper at NeurIPS that mathematically demonstrates why large-scale neural networks are necessary, and even larger than one might expect.

As the field has gradually shifted toward hyperscale pre-trained models, researchers' goals seem to have become giving networks more parameters, more training data, and more diverse training tasks.

This approach has indeed been very effective: as neural networks grow bigger and bigger, the models understand and master more data, surpassing humans on some specific tasks.

Mathematically, however, modern neural networks look rather bloated: the number of parameters often far exceeds what the prediction task seems to require, a situation known as over-parameterization.

A recent NeurIPS paper offers a completely new interpretation of this phenomenon. The authors argue that networks larger than expected are entirely necessary to avoid certain fundamental problems, and their findings provide a more general insight into the question.

Address of the paper: https://arxiv.org/abs/2105.12806

The paper's first author, Sébastien Bubeck, leads the Machine Learning Foundations group at MSR Redmond, which conducts cross-cutting research on a variety of topics in machine learning and theoretical computer science.

Neural networks need to be this big

A common task of neural networks is to identify target objects in an image.

To create a network capable of this task, researchers first feed it many images and their corresponding target labels, and train it to learn the correlations between them. Afterwards, the network can correctly identify targets in images it has already seen.

In other words, the training process causes the neural network to remember this data.

Once the network has memorized enough training data, it can also predict, with varying degrees of accuracy, the labels of objects it has never seen before, a process known as generalization.

The size of the network determines how many things it can remember.

This can be understood geometrically. Suppose you have two data points on the XY plane and connect them with a line described by two parameters: its slope and the height at which it crosses the y-axis. Anyone who knows these two parameters, plus the x-coordinate of one of the original data points, can compute the corresponding y-coordinate from the line (or directly from the parameters).

In other words, the line has memorized those two data points, and a neural network does something quite similar.
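
To make this concrete, here is a minimal sketch in Python (the two data points and the fit_line helper are made up for illustration, not taken from the paper):

```python
def fit_line(p1, p2):
    """Return the slope and y-intercept of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# Two made-up data points "memorized" by two parameters:
slope, intercept = fit_line((1.0, 3.0), (4.0, 9.0))

# Anyone who knows (slope, intercept) and the x-coordinate 4.0 recovers y = 9.0:
print(slope * 4.0 + intercept)
```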

For example, an image is described by hundreds or thousands of numeric values, one for each pixel. Mathematically, this collection of free values is equivalent to the coordinates of a point in a high-dimensional space, and the number of coordinates is called the dimension.

The traditional mathematical conclusion is that fitting n data points with a curve requires a function with n parameters; in the straight-line example, two points are described by a curve with two parameters.

When neural networks first emerged as a new kind of model in the 1980s, researchers likewise believed that only n parameters should be needed to fit n data points, regardless of the dimension of the data.
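
A quick way to see this classical counting rule in action is the sketch below (randomly generated points and standard NumPy polynomial fitting; nothing here comes from the paper itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = np.sort(rng.uniform(-1.0, 1.0, size=n))   # n arbitrary, distinct inputs
y = rng.uniform(-1.0, 1.0, size=n)            # n arbitrary labels

coeffs = np.polyfit(x, y, deg=n - 1)          # a degree-(n-1) polynomial has n coefficients
print(np.allclose(np.polyval(coeffs, x), y))  # True: every one of the n points is fit exactly
```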

Alex Dimakis of the University of Texas at Austin says that this is no longer the case: the number of parameters in today's neural networks far exceeds the number of training samples, which means the textbooks need to be rewritten and corrected.

The researchers study the robustness of neural networks, that is, a network's ability to handle small changes. For example, a non-robust network may have learned to recognize giraffes but will mistake a barely modified version of the same image for a gerbil.

In 2019, Bubeck and colleagues were trying to prove a theorem about this problem when they realized it was connected to the size of the network.

In their new proof, the researchers show that over-parameterization is necessary for a network to be robust. They do so by working out how many parameters are needed to fit data points with a curve that is smooth, a mathematical property equivalent to robustness.

To understand this, you can again imagine a curve on a plane where the x-coordinate represents the color of a pixel and the y-coordinate represents an image label.

If the curve is smooth, slightly modifying the pixel's color moves you only a short distance along the curve, so the corresponding prediction changes only slightly. On a jagged curve, by contrast, a small change in the x-coordinate (the color) can lead to a huge change in the y-coordinate (the label), and the giraffe can turn into a gerbil.
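
The contrast can be illustrated numerically with a hypothetical one-dimensional toy (not the paper's setting): fit the same points once with an exact interpolating polynomial and once with a smooth low-degree curve, then nudge the input slightly and compare how much each prediction moves.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 9
x = np.linspace(0.0, 1.0, n)
y = rng.uniform(0.0, 1.0, size=n)       # arbitrary "labels", one per point

jagged = np.polyfit(x, y, deg=n - 1)    # exact interpolation: hits every point, oscillates
smooth = np.polyfit(x, y, deg=2)        # low-degree fit: misses points but varies slowly

x0, eps = 0.43, 1e-3                    # a tiny change in the "pixel color"
for name, coeffs in (("jagged", jagged), ("smooth", smooth)):
    delta = abs(np.polyval(coeffs, x0 + eps) - np.polyval(coeffs, x0))
    print(name, delta)                  # the jagged curve typically moves far more
```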

Bubeck and Sellke show in the paper that smoothly fitting n high-dimensional data points requires not just n parameters but n×d parameters, where d is the input dimension (for example, d = 784 for an image with 784 pixels).
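
Stated informally (my paraphrase of the paper's "law of robustness"; n is the number of data points, d the input dimension, p the number of parameters, and Lip(f) the Lipschitz constant measuring smoothness; see the paper for the precise assumptions):

```latex
% Informal paraphrase of the paper's result: a model with p parameters that
% fits n generic d-dimensional data points below the noise level must have a
% large Lipschitz constant unless p is itself large.
\[
  \mathrm{Lip}(f) \;\gtrsim\; \sqrt{\frac{n\,d}{p}},
  \qquad\text{so requiring } \mathrm{Lip}(f) = O(1) \text{ forces } p \gtrsim n\,d .
\]
```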

In other words, if you want a network to robustly memorize its training data, over-parameterization is not merely helpful, it is a must. The proof relies on a fact of high-dimensional geometry: points randomly distributed on the surface of a sphere almost all sit far apart, at nearly the same large distance from one another, and this huge spacing means that fitting them with a smooth curve requires many additional parameters.
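
That spacing phenomenon is easy to check numerically; the sketch below (random points on the unit sphere, not code from the paper) shows that pairwise distances concentrate tightly around √2 times the radius:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 200                       # dimension and number of random points
pts = rng.normal(size=(n, d))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)       # project onto the unit sphere

gram = pts @ pts.T                                      # pairwise inner products
dists = np.sqrt(np.clip(2.0 - 2.0 * gram, 0.0, None))   # |x - y| for unit vectors
dists = dists[np.triu_indices(n, k=1)]                  # keep each distinct pair once
print(dists.mean(), dists.std())   # mean ~ sqrt(2) ~ 1.41, with a very small spread
```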

Amin Karbasi of Yale praised the paper's proof as very concise, free of long strings of formulas, and as saying something very general.

This proof also provides a new way to understand why simple strategies for scaling up neural networks are so effective.

Other studies have revealed additional reasons why over-parameterization helps: for example, it can make the training process more efficient and improve a network's ability to generalize.

We now know that over-parameterization is necessary for robustness, but it is unclear how necessary robustness is for other things. By linking robustness to over-parameterization, the new proof suggests that robustness may be more important than previously thought, which could pave the way for further studies explaining the benefits of large models.

Robustness is indeed a prerequisite for generalization: if you build a system and a slight perturbation throws it out of control, what kind of system is that? Clearly not a reasonable one.

So, Bubeck thinks, it is a very basic, fundamental requirement.

Resources:

https://www.quantamagazine.org/computer-scientists-prove-why-bigger-neural-networks-do-better-20220210/
