MLP was killed overnight! MIT, Caltech and others discovered mathematical theorems that crushed DeepMind

Author: New Zhiyuan

Editor: Peach LRS

There is no need to mourn MLP: the new KAN network, built on the Kolmogorov-Arnold theorem, has fewer parameters, stronger performance, and better interpretability. The innovation of deep learning architectures has entered a new era!

Overnight, the machine learning paradigm looks set to change!

The foundational architecture that dominates deep learning today is the multilayer perceptron (MLP), which places activation functions on neurons (nodes).

So, do we have a new route to take beyond that?

Just today, teams from MIT, Caltech, Northeastern University, and other institutions released a new neural network architecture: Kolmogorov-Arnold Networks (KAN).

The researchers made a simple change to MLP, moving the learnable activation function from nodes (neurons) to edges (weights)!

Address: https://arxiv.org/pdf/2404.19756

This change may sound arbitrary at first glance, but it has a deep connection to approximation theory in mathematics.

It turns out that the Kolmogorov-Arnold representation corresponds to a two-layer network with learnable activation functions on the edges rather than the nodes.

Inspired by the representation theorem, the researchers explicitly parameterized the Kolmogorov-Arnold representation with a neural network.

It is worth mentioning that the origin of the name KAN is in honor of two great late mathematicians, Andrey Kolmogorov and Vladimir Arnold.

Experimental results show that KAN outperforms traditional MLPs, improving both the accuracy and the interpretability of neural networks.

Most unexpectedly, the visualization and interactivity of KAN make it potentially useful in scientific research, helping scientists discover new mathematical and physical laws.

In the study, the authors used KAN to rediscover mathematical laws in knot theory!

Moreover, KAN reproduced DeepMind's 2021 results with a much smaller network and far more automation.

In physics, KAN can help physicists study Anderson localization, which is a phase transition in condensed matter physics.

Incidentally, all of the KAN examples in the study (except the parameter sweeps) can be reproduced in under 10 minutes on a single CPU.

The arrival of KAN directly challenges the MLP architecture that has long dominated machine learning, and it has caused an uproar across the internet.

A new era of machine learning begins

Some people say that a new era of machine learning has begun!

A research scientist at Google DeepMind commented: "Kolmogorov-Arnold strikes again! A little-known fact is that this theorem appeared in a seminal paper on permutation-invariant neural networks (Deep Sets), demonstrating the intricate connection between this representation and the way set/GNN aggregators are built (as a special case)."

A new neural network architecture is born! KAN will dramatically change the way AI is trained and fine-tuned.

Could it be that AI has entered the 2.0 era?

Some netizens offered a vivid plain-language analogy for the difference between KAN and MLP:

The Kolmogorov-Arnold network (KAN) is like a three-layer cake recipe that can bake any cake, while the multilayer perceptron (MLP) is a custom cake with different layers. The MLP is more complex but more versatile, whereas KAN is static yet simpler and faster for a single task.

Co-author and MIT professor Max Tegmark said the new paper shows that an architecture completely different from standard neural networks achieves higher accuracy with fewer parameters on interesting physics and mathematics problems.

Next, let's take a look at how KAN, billed here as the future of deep learning, is actually realized.

KAN returns to the table

The theoretical basis of KAN

The Kolmogorov-Arnold representation theorem states that if f is a multivariate continuous function on a bounded domain, then f can be written as a finite composition of continuous univariate functions and addition.
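
Concretely, in its standard form the theorem says that any such f of n variables can be written as

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right),$$

where the inner functions φ_{q,p} and outer functions Φ_q are continuous functions of a single variable.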

For machine learning, the problem can be restated as follows: learning a high-dimensional function can, in principle, be reduced to learning a polynomial number of one-dimensional functions.

But these one-dimensional functions may be non-smooth, or even fractal, and therefore unlearnable in practice. It is precisely because of this "pathological behavior" that the Kolmogorov-Arnold representation theorem was long considered essentially "dead" in machine learning: theoretically correct but practically useless.

In this article, the researchers remain optimistic about the application of the theorem to the field of machine learning and propose two improvements:

1. The original formula has only two layers of nonlinearities and a single hidden layer of width 2n+1; the network can be generalized to arbitrary width and depth;

2. Most functions in science and everyday life are smooth and have a sparse compositional structure, which favors smooth Kolmogorov-Arnold representations. This is akin to the difference between a physicist and a mathematician: the physicist cares about the typical case, while the mathematician worries about the worst case.

KAN architecture

The core design idea of the Kolmogorov-Arnold network (KAN) is to turn the approximation of a multivariate function into the problem of learning a set of univariate functions. In this framework, each univariate function is parameterized as a B-spline: a local, piecewise-polynomial curve with learnable coefficients.
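
To make this concrete, here is a minimal sketch (not the authors' implementation; the class and variable names below are our own) of a single learnable edge function φ(x) parameterized by a B-spline basis with trainable coefficients:

import torch

def bspline_basis(x, grid, k=3):
    # Cox-de Boor recursion: returns basis values of shape (len(x), num_basis)
    # for splines of order k on the given (uniform) grid.
    x = x.unsqueeze(-1)                                        # (n, 1)
    B = ((x >= grid[:-1]) & (x < grid[1:])).float()            # order-0 basis
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)]) * B[:, :-1]
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d]) * B[:, 1:]
        B = left + right
    return B

class SplineEdge(torch.nn.Module):
    # One KAN edge: phi(x) = sum_i c_i * B_i(x), with trainable coefficients c_i.
    def __init__(self, grid_size=5, k=3, xmin=-1.0, xmax=1.0):
        super().__init__()
        h = (xmax - xmin) / grid_size
        # extend the grid by k knots on each side so [xmin, xmax) is fully covered
        self.register_buffer("grid", torch.arange(-k, grid_size + k + 1) * h + xmin)
        self.k = k
        self.coef = torch.nn.Parameter(0.1 * torch.randn(grid_size + k))

    def forward(self, x):                                       # x: (n,)
        return bspline_basis(x, self.grid, self.k) @ self.coef  # (n,)

phi = SplineEdge()
print(phi(torch.linspace(-1.0, 0.99, 8)).shape)                 # torch.Size([8])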

In order to extend the two-layer network in the original theorem deeper and wider, the researchers proposed a more "generalized" version of the theorem to support the design of KAN:

Inspired by the way MLPs stack layers to gain depth, the researchers introduce an analogous building block, the KAN layer, which consists of a matrix of one-dimensional functions, each with its own trainable parameters.

According to the Kolmogorov-Arnold theorem, the original two-layer KAN consists of inner and outer functions corresponding to different input and output dimensions. Stacking KAN layers in this way not only extends the depth of KANs but also preserves the interpretability and expressiveness of the network: every layer is composed of univariate functions that can be studied and understood individually.
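
As an illustration, a KAN layer can be written as an out_dim × in_dim grid of learnable univariate functions whose outputs are summed, and a deeper KAN is just a composition of such layers. The sketch below is again our own (not the official code); ScalarEdge is a stand-in edge so the snippet runs on its own, and in a real KAN each edge would be a spline-parameterized function like SplineEdge above.

import torch

class KANLayer(torch.nn.Module):
    # One KAN layer: output_j(x) = sum_i phi[j][i](x_i), where each phi[j][i]
    # is a learnable univariate function on the edge from input i to output j.
    def __init__(self, in_dim, out_dim, make_edge):
        super().__init__()
        self.edges = torch.nn.ModuleList(
            torch.nn.ModuleList(make_edge() for _ in range(in_dim))
            for _ in range(out_dim)
        )

    def forward(self, x):                                   # x: (batch, in_dim)
        cols = [sum(row[i](x[:, i]) for i in range(len(row))) for row in self.edges]
        return torch.stack(cols, dim=-1)                    # (batch, out_dim)

class ScalarEdge(torch.nn.Module):
    # Stand-in univariate function (a tiny MLP on a scalar) used only to keep
    # this sketch self-contained; a real KAN edge would be a B-spline.
    def __init__(self, hidden=8):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, hidden), torch.nn.SiLU(), torch.nn.Linear(hidden, 1))

    def forward(self, x):                                   # x: (batch,)
        return self.net(x.unsqueeze(-1)).squeeze(-1)

# A deeper KAN with widths [2, 5, 1] is simply a stack of two KAN layers.
kan = torch.nn.Sequential(KANLayer(2, 5, ScalarEdge), KANLayer(5, 1, ScalarEdge))
print(kan(torch.randn(4, 2)).shape)                         # torch.Size([4, 1])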

In other words, a function f of the form below is exactly a KAN: the full network is a composition of L KAN layers,

$$\mathrm{KAN}(\mathbf{x}) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0)\,\mathbf{x},$$

where each Φ_l is a matrix of univariate functions whose outputs are summed over the incoming dimensions.

Implementation details

Although the design of KAN looks simple, just layers stacked on top of each other, it is not easy to optimize, and the researchers worked out several tricks during training.

1. Residual activation functions: each activation φ(x) is constructed as a combination of a basis function b(x) and a spline, borrowing the idea of residual connections, which helps stabilize training (see the sketch after this list);

2. Initialization scales: each activation is initialized with its spline part close to zero, while the weight w uses Xavier initialization, which helps keep gradients stable at the start of training;

3. Updating the spline grid: since each spline is defined on a bounded interval while activation values may drift outside it during training, the spline grid is updated dynamically so that the splines always operate over the appropriate range.
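
A minimal sketch of trick 1 (together with the near-zero spline initialization of trick 2), following the form given in the paper, φ(x) = w · (b(x) + spline(x)) with b(x) = silu(x); the exact parameterization in the released code may differ, and the class name here is our own:

import torch

class ResidualActivation(torch.nn.Module):
    # phi(x) = w * (silu(x) + spline(x)): the SiLU basis term acts like a residual
    # path, so useful gradients flow even though the spline part starts near zero.
    def __init__(self, spline):
        super().__init__()
        self.spline = spline                          # any learnable 1D function
        self.w = torch.nn.Parameter(torch.ones(1))    # overall scale (Xavier-initialized in the paper)

    def forward(self, x):
        return self.w * (torch.nn.functional.silu(x) + self.spline(x))

# quick check, with an identity map standing in for the spline part
act = ResidualActivation(spline=torch.nn.Identity())
print(act(torch.linspace(-2.0, 2.0, 5)))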

The number of parameters

1. Network depth: L

2. The width of each layer: N

3. Each spline is defined on G intervals (G + 1 grid points) and has order k (usually k = 3)

So the total number of parameters in a KAN is on the order of O(N^2 · L · (G + k)).

By contrast, an MLP has on the order of O(L · N^2) parameters, which at first glance looks more efficient than a KAN. However, KANs can get away with a much smaller layer width N, which not only improves generalization performance but also aids interpretability.
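
For a rough sense of scale (illustrative numbers, chosen to match the orders of magnitude reported later in the knot-theory experiment):

$$\text{MLP: } L=4,\ N=300 \;\Rightarrow\; O(LN^2) = 4 \times 300^2 \approx 3.6\times10^{5}, \qquad \text{KAN: } L=2,\ N=5,\ G=3,\ k=3 \;\Rightarrow\; O(N^2L(G+k)) = 25 \times 2 \times 6 = 300.$$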

Where does KAN beat MLP?

Stronger performance

As a sanity check, the researchers constructed five examples known to have smooth KA (Kolmogorov-Arnold) representations as validation datasets, and trained KANs by increasing the number of grid points every 200 steps, covering the range {3, 5, 10, 20, 50, 100, 200, 500, 1000}.

MLPs of various depths and widths were used as baselines. Both KANs and MLPs were trained with the LBFGS algorithm for a total of 1,800 steps, and RMSE was used as the comparison metric.

As the results show, the KAN loss curves drop rapidly, converge to a plateau, and scale better than the MLP curves, especially in the high-dimensional cases.

It can also be seen that a three-layer KAN performs much better than a two-layer one, indicating that deeper KANs have stronger expressive power, which is in line with expectations.

Interpreting KAN interactively

The researchers designed a simple regression experiment to show that users can get the most interpretable results when interacting with KAN.

Suppose a user wants to recover a symbolic formula; this takes a total of 5 interactive steps (a code sketch of the whole workflow follows the steps below).

Step 1: Training with sparsity.

Starting from a fully connected KAN, training with sparsity regularization makes the network sparse, so that 4 of the 5 neurons in the hidden layer appear to contribute nothing.

Step 2: Pruning

After automatic pruning, all useless hidden neurons are discarded, leaving a minimal KAN whose activation functions can be matched to known symbolic functions.

Step 3: Set the symbolic functions

If the user can correctly guess these symbolic forms just by inspecting the KAN plot, they can set them directly.

If the user has no domain knowledge or cannot tell which symbolic functions these activation functions might be, the researchers provide a function, suggest_symbolic, that proposes symbolic candidates.

Step 4: Further training

Once all the activation functions in the network have been symbolized, the only remaining parameters are the affine parameters. Continuing to train these affine parameters until the loss drops to machine precision tells the user that the model has found the correct symbolic expression.

Step 5: Output the symbolic formula

Use SymPy to compute the symbolic formula of the output node and verify that the answer is correct.
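
Putting the five steps together, the workflow looks roughly like the sketch below. It is written against the paper's companion library pykan as best recalled: the function names (create_dataset, train, prune, suggest_symbolic, fix_symbolic, symbolic_formula) and their signatures should be treated as assumptions, since the released API may differ or have changed.

from kan import KAN, create_dataset   # assumed import path
import torch

# toy target used as the paper's running example: f(x1, x2) = exp(sin(pi*x1) + x2^2)
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)

model = KAN(width=[2, 5, 1], grid=5, k=3)                # fully connected KAN
model.train(dataset, opt="LBFGS", steps=50, lamb=0.01)   # step 1: train with a sparsity penalty
model = model.prune()                                    # step 2: drop inactive neurons
model.suggest_symbolic(0, 0, 0)                          # step 3: list symbolic candidates...
model.fix_symbolic(0, 0, 0, "sin")                       # ...or set one directly
model.train(dataset, opt="LBFGS", steps=50)              # step 4: refit the affine parameters
print(model.symbolic_formula())                          # step 5: SymPy output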

Interpretability verification

The researchers first designed six supervised toy datasets to show how KAN recovers the compositional structure behind a symbolic formula.

The results show that KAN successfully learns the correct univariate functions and, through visualization, reveals its "thought process" in an interpretable way.

In the unsupervised setting, the dataset contains only the input feature x, and the ability of the KAN model to find dependencies between variables can be tested by designing connections between certain variables (x1, x2, x3).

From the results, the KAN model has succeeded in finding the functional dependence between variables, but the authors also point out that experiments are still only carried out on synthetic data, and a more systematic and controllable method is needed to discover the complete relationship.

Pareto optimal

By fitting a special function, the authors show the Pareto frontier of KAN and MLP in a plane spanned by the number of model parameters and RMSE loss.

Of all the special functions, KAN always has a better Pareto front than MLP.

Solving partial differential equations

In the task of solving partial differential equations, the researchers plotted the L2 squared and H1 squared losses between the predicted solution and the true solution.
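
For reference, these are the standard error measures (not specific to this paper) for a predicted solution u_pred against the true solution u on a domain Ω:

$$\|u_{\text{pred}} - u\|_{L^2}^2 = \int_\Omega |u_{\text{pred}} - u|^2 \, dx, \qquad \|u_{\text{pred}} - u\|_{H^1}^2 = \|u_{\text{pred}} - u\|_{L^2}^2 + \|\nabla (u_{\text{pred}} - u)\|_{L^2}^2.$$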

In the figure below, the first two panels show the training dynamics of the losses, and the third and fourth show the scaling law of the losses against the number of parameters.

As shown in the results below, KAN converges faster, has lower losses, and has a steeper scaling law compared to MLP.

Continual learning without catastrophic forgetting

We all know that catastrophic forgetting is a serious problem in machine learning.

One difference between artificial neural networks and the brain is that the brain has functionally distinct modules that are localized in space. When learning a new task, structural reorganization occurs only in the local regions responsible for the relevant skill, while other regions remain unchanged.

However, most artificial neural networks, including MLP, do not have this concept of locality, which may be the cause of catastrophic forgetting.

The study proves that KAN is locally plastic, and splines can be used to avoid catastrophic forgetting.

The idea is simple: since splines are local, a new sample only affects a few nearby spline coefficients, while coefficients farther away remain unchanged.

By contrast, since MLPs typically use global activation functions (e.g., ReLU, Tanh, SiLU), any local change can propagate uncontrollably to distant regions, destroying the information stored there.
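
A tiny self-contained illustration of this locality argument (our own toy example, not taken from the paper), using the piecewise-linear "hat" basis, i.e. the k = 1 special case of B-splines: a single new training point produces gradients only for the few coefficients whose basis functions overlap it, so the rest of the learned function is untouched.

import numpy as np

grid = np.linspace(-1.0, 1.0, 11)                 # knot points of the 1D spline

def hat_basis(x, grid):
    # piecewise-linear "hat" basis functions, each supported on two grid cells
    h = grid[1] - grid[0]
    return np.clip(1.0 - np.abs(x - grid) / h, 0.0, None)

coef = np.zeros(len(grid))                        # learnable coefficients of phi(x)
x_new, y_new = 0.62, 1.0                          # one new training sample
pred = coef @ hat_basis(x_new, grid)
grad = 2.0 * (pred - y_new) * hat_basis(x_new, grid)   # d(squared error)/d(coef)
print(np.nonzero(grad)[0])                        # [8 9]: only the coefficients near x=0.62 change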

The researchers employed a one-dimensional regression task consisting of 5 Gaussian peaks. The data around each peak is presented to KAN and MLP sequentially, rather than all at once.

As shown in the figure below, KAN only reconstructs the area where data exists in the current stage, leaving the previous area unchanged.

MLP, on the other hand, reshapes the entire function after seeing new data samples, leading to catastrophic forgetting.

Rediscovering knot theory, with results surpassing DeepMind's

What does the birth of KAN mean for the future application of machine learning?

Knot theory is a subject in low-dimensional topology; it sheds light on topological questions about three-manifolds and four-manifolds, and it has a wide range of applications in fields such as biology and topological quantum computing.

In 2021, DeepMind's team made it into Nature by using AI, for the first time, to drive new results in knot theory.

Address: https://www.nature.com/articles/s41586-021-04086-x

In that study, supervised learning combined with human domain experts led to a new theorem relating algebraic and geometric knot invariants.

Specifically, gradient saliency identified the key invariants of the supervised problem, which led the domain experts to formulate a conjecture that was subsequently refined and proven.

Here, the authors investigate whether KAN can achieve good, interpretable results on the same problem, namely predicting the signature of a knot.

In DeepMind's experiments, the main results on the knot-theory dataset were:

1. Using network attribution methods, they found that the signature depends mainly on the meridional distance μ and the longitudinal distance λ.

2. Human domain experts later found that the signature is highly correlated with a quantity called the slope, and derived an explicit relation between the two.

To investigate question (1), the authors treat the 17 knot invariants as inputs and the signature as the output.

Similar to DeepMind's setup, the signature (an even integer) is encoded as a one-hot vector, and the network is trained with a cross-entropy loss.

They found that a very small KAN reaches a test accuracy of 81.6%, while DeepMind's four-layer, width-300 MLP reaches only 78%.

As shown in the table below, KAN (G = 3, k = 3) has about 200 parameters, while MLP has about 300,000 parameters.

It is worth noting that KAN is not only more accurate but also far more parameter-efficient than the MLP.

In terms of interpretability, the researchers scale the transparency of each activation function according to its magnitude, so it is immediately clear which input variables are important, without any separate feature attribution.

Then, KAN was trained on three important variables and a test accuracy of 78.2% was obtained.

As follows, with KAN, the authors rediscovered three mathematical relationships in the knot dataset.

Tackling Anderson localization in physics

KAN also proves highly valuable in physics applications.

Anderson localization is a fundamental phenomenon in which disorder in a quantum system localizes the electron wave function, halting all transport.

In one and two dimensions, scaling arguments show that all electronic eigenstates are exponentially localized for any tiny amount of random disorder.

In contrast, in three dimensions, a critical energy forms a phase boundary that separates the extended and localized states, which is known as the mobility edge.

Understanding these mobility edges is critical for explaining fundamental phenomena such as the metal-insulator transition in solids and the localization of light in photonic devices.

Through their research, the authors found that KANs make it very easy to extract mobility edges, both numerically and symbolically.

Clearly, KAN has become a scientist's right-hand man, an important collaborator.

All in all, thanks to the advantages of accuracy, parameter efficiency, and interpretability, KAN will be a useful model/tool for AI+Science.

In the future, the further application of KAN in the field of science remains to be explored.
