
Doing Linear Algebra with Transformers: It Really Works!


Author | berry tincture

Editor | Aoki

Linear algebra is the branch of mathematics concerned with vector spaces and linear mappings.

The history of modern linear algebra can be traced back to mid-19th-century Britain. In 1843, the Irish mathematician Hamilton discovered quaternions. In 1844, Hermann Grassmann published his book Die lineare Ausdehnungslehre, which already contained some of the themes of linear algebra as we know it today. In 1848, James Sylvester introduced the term "matrix". Arthur Cayley introduced the concepts of matrix multiplication and transposition while studying linear transformations. Importantly, Cayley used a single letter to denote a matrix, thereby treating the matrix as an aggregate object. He was also aware of the connection between matrices and determinants.

This is how modern linear algebra, the subject that keeps so many students awake at night, came into being.

As the old saying goes: linear algebra torments me a thousand times, yet I treat it like my first love. A Google search for "linear algebra is too hard" instantly returns 726,000 results.


Some students can't help complaining that doing linear algebra problems makes them feel like a fool... (pats head)


From structural mechanics to artificial intelligence, once you dig into any science or engineering discipline, you will find linear algebra everywhere. Its importance is something researchers and practitioners feel most keenly in practice. Many algorithms rely on linear algebra; the very popular field of deep learning, for example, uses it heavily in its underlying implementations. If this foundation is not solid and the principles are not understood, implementing the algorithms is hard to follow, let alone improve upon.

On December 3, Facebook AI Research released new work showing that Transformers can be used to solve linear algebra problems!


Paper link: https://arxiv.org/pdf/2112.01898.pdf

The Transformer is a classic NLP model proposed by a Google team in 2017. It uses a self-attention mechanism to speed up model training, abandoning the traditional CNN and RNN: the entire network is built from attention mechanisms. It consists mainly of two parts, an encoder and a decoder.
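For readers who want a concrete picture of this encoder-decoder structure, here is a minimal sketch using PyTorch's built-in nn.Transformer. The class name, layer sizes, and vocabulary are illustrative, not the paper's configuration, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    """Toy encoder-decoder Transformer over integer token sequences."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens, tgt_tokens: (batch, seq_len) tensors of token ids.
        # Positional encodings are omitted here for brevity.
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        # Causal mask so each decoder position only sees earlier target tokens.
        mask = self.transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(tgt_tokens.device)
        hidden = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(hidden)  # logits over the token vocabulary
```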


Originally designed for machine translation, the Transformer has since been applied to a wide range of problems, from text generation to image processing, speech recognition, and more. In mathematics, applications of Transformers have mostly focused on symbolic computation, which "manipulates" mathematical symbols the way natural language processing "manipulates" words.

But mathematics ≠ symbolic processing: many practical applications involve numerical calculations, either exact (e.g., arithmetic) or approximate (e.g., function evaluation, numerical solution of equations). Few studies have applied Transformers to numerical computation, and most early arithmetic experiments yielded unsatisfactory results.

This raises an unavoidable problem: most problems in mathematics and science involve both symbolic and numerical computation. If we want Transformers to solve these problems end to end, they must be able to perform high-precision numerical calculations.

The author, François Charton, trains Transformers to compute solutions to linear algebra problems, which are a fundamental component of many scientific problems: basic operations on matrices, matrix inversion, eigenvalues, and singular value decomposition.

Next we'll look at four encoding schemes that represent problems and their solutions in a form Transformers can process, and at small Transformers (up to 6 layers, 10 to 50 million trainable parameters) trained on datasets of random matrices generated this way. The trained models compute approximate solutions (within a few percent of the L1 norm) with more than 90% accuracy (99% in most cases).
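The accuracy figures quoted throughout are relative to a tolerance threshold on the prediction error. Here is a minimal sketch of such a check, under the assumption that the error is measured as a relative L1 norm; this is our reading of the evaluation criterion, not the paper's code.

```python
import numpy as np

def within_tolerance(pred, target, tol=0.05):
    """True if the prediction is within `tol` of the target in relative
    L1 norm (an assumed reading of the evaluation criterion)."""
    return np.abs(pred - target).sum() <= tol * np.abs(target).sum()

# A 5x5 prediction that is off by a small perturbation passes at 5% tolerance.
rng = np.random.default_rng(0)
target = rng.uniform(-10, 10, size=(5, 5))
pred = target + 0.01 * rng.standard_normal((5, 5))
print(within_tolerance(pred, target, tol=0.05))  # True
```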

At the same time, training on more diverse datasets (in particular, on matrices with non-independent and identically distributed coefficients) can greatly improve the trained models' out-of-distribution accuracy.

The author believes these results open the door to a whole new world for Transformers, paving the way for their use as end-to-end solvers for mathematical and scientific problems.

1

Problem modeling


The first step is to encode the matrix as a sequence.

Because the inputs and outputs of these problems are matrices, they need to be converted into token sequences before a Transformer can handle them.

An m×n matrix is encoded as follows: its dimensions come first, as two symbolic tokens (Vm and Vn), followed by its mn coefficients, encoded as a sequence. The paper uses four encoding schemes for the matrix coefficients: P10, P1000, B1999, and FP15.

In the base-10 positional encoding (P10), a coefficient is represented as a sequence of five tokens: a sign token (+ or -), three digits of the mantissa (0 to 9), and a signed exponent token (from E-100 to E+100).

For example, 3.14 would be represented as 314 × 10^-2 and encoded as [+, 3, 1, 4, E-2]. Some examples of the encodings are shown in the figure below.
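As a rough illustration of the P10 scheme described above, here is a small sketch that rounds a float to a three-digit mantissa and a power-of-ten exponent, then emits the five tokens. The token spellings and function name are illustrative, not the paper's exact vocabulary.

```python
import math

def encode_p10(x):
    """Encode a float as [sign, digit, digit, digit, exponent-token]."""
    sign = '+' if x >= 0 else '-'
    x = abs(x)
    if x == 0:
        return [sign, '0', '0', '0', 'E0']
    exp = math.floor(math.log10(x)) - 2   # chosen so the mantissa has 3 digits
    mantissa = round(x / 10 ** exp)
    if mantissa >= 1000:                  # rounding pushed us to 4 digits
        mantissa //= 10
        exp += 1
    return [sign] + list(f'{mantissa:03d}') + [f'E{exp}']

print(encode_p10(3.14))    # ['+', '3', '1', '4', 'E-2']
print(encode_p10(-0.025))  # ['-', '2', '5', '0', 'E-4']
```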

The second step is to generate random matrices.

Most experiments train the models on datasets of random matrices with coefficients sampled uniformly from [-A, A] (with A = 10). Sometimes Gaussian coefficients with the same standard deviation are sampled instead.

When studying out-of-distribution generalization on the eigenvalue problem, random symmetric matrices with different eigenvalue distributions (corresponding to random matrices with non-i.i.d. coefficients) are generated. To do this, the author samples random symmetric matrices M with Gaussian coefficients and computes their eigendecomposition M = PDP^T, where P is the orthogonal matrix of eigenvectors. The diagonal matrix D of M's eigenvalues is then replaced with a diagonal matrix D' sampled from a different distribution.

Finally, M' = PD'P^T is recomputed: a symmetric matrix (since P is orthogonal) whose eigenvalues follow the chosen distribution and whose eigenvectors are uniformly distributed on the unit sphere.
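A small sketch of this construction, assuming NumPy and using a Laplace law as one possible replacement distribution; the function name is ours, not the paper's.

```python
import numpy as np

def symmetric_with_eigenvalue_law(n, sample_eigenvalues, rng):
    g = rng.standard_normal((n, n))
    m = (g + g.T) / 2                    # random symmetric matrix M
    _, p = np.linalg.eigh(m)             # P: orthogonal matrix of eigenvectors
    d_prime = sample_eigenvalues(n)      # eigenvalues drawn from the chosen law
    return p @ np.diag(d_prime) @ p.T    # symmetric; eigenvalues follow that law

rng = np.random.default_rng(0)
laplace = lambda n: rng.laplace(0.0, 1.0, size=n)
m_prime = symmetric_with_eigenvalue_law(8, laplace, rng)
print(np.allclose(m_prime, m_prime.T))       # True: the result is symmetric
print(np.linalg.eigvalsh(m_prime))           # recovers the sampled eigenvalues
```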

2

Experiments and results

Matrix transpose

Learning to transpose a matrix is equivalent to learning a permutation of its elements. For rectangular matrices, the permutation involves longer cycles. The author considered two cases:

1. Fixed-size case: all matrices in the dataset have the same dimensions, so only one permutation needs to be learned.

2. Variable-size case: the dataset includes matrices of different dimensions, so many permutations have to be learned.

All four encoding schemes are used in the encoder and decoder, and a Transformer with 1 layer, 256 dimensions, and 8 attention heads is trained on these datasets. The model learns to predict the solution exactly (at 0% tolerance) in more than 99% of test cases.
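To make the sequence-to-sequence formulation concrete, here is a sketch of how a single transposition example could be built as an (input, target) pair. Dimension tokens are written as plain strings and the coefficients are left as raw floats rather than P10/P1000/B1999/FP15 tokens; this is an illustration, not the paper's data pipeline.

```python
import numpy as np

def transpose_example(m, n, rng):
    a = rng.uniform(-10, 10, size=(m, n))
    src = [f'V{m}', f'V{n}'] + list(a.flatten())    # encoder input sequence
    tgt = [f'V{n}', f'V{m}'] + list(a.T.flatten())  # decoder target sequence
    return src, tgt

rng = np.random.default_rng(0)
src, tgt = transpose_example(2, 3, rng)
# After the two dimension tokens, the permutation the model must learn maps
# source position i*n + j to target position j*m + i.
print(src)
print(tgt)
```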

Matrix addition

Learning to add two m×n matrices amounts to learning the correspondence between input and output positions (as in the transposition problem) and learning the algorithm for adding two numbers in floating-point representation, applied to mn pairs of elements. The author trained Transformers with 1 or 2 layers, 8 attention heads, and 512 dimensions.

For fixed-size matrices with dimensions up to 10, both with n = m and n ≠ m, the models achieve 99% accuracy at 1% tolerance (and more than 98% at 0.5%). The FP15 model reaches 99.5% accuracy at 0.5% tolerance on 15×15 matrices, while the B1999 model reaches 89.7% accuracy at 1% tolerance on 20×20 matrices.

Variable-size matrices with dimensions up to 10 are predicted by a 2-layer Transformer with B1999 encoding with over 99.5% accuracy at 1% tolerance. Models with a 1-layer encoder and a 6-layer decoder achieve 77% and 87% accuracy on the same dataset. The figure below summarizes the experimental results.

Matrix multiplication

Multiplying a matrix M of dimension m×n by a vector V amounts to computing the m dot products between the rows of M and V.

Each dot product consists of n multiplications and n-1 additions, involving one row of the matrix and all the coefficients of the vector. The model must figure out the positions of these 2n elements in the computation, as well as the two operations involved (addition and multiplication).
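Written out explicitly (an illustrative sketch, not code from the paper), the computation the model has to emulate looks like this:

```python
import numpy as np

def matvec(m_mat, v):
    m, n = m_mat.shape
    out = np.empty(m)
    for i in range(m):                 # one dot product per output coordinate
        acc = m_mat[i, 0] * v[0]       # first of the n multiplications
        for j in range(1, n):
            acc += m_mat[i, j] * v[j]  # n-1 further additions
        out[i] = acc
    return out

rng = np.random.default_rng(0)
M = rng.uniform(-10, 10, size=(5, 4))
V = rng.uniform(-10, 10, size=4)
print(np.allclose(matvec(M, V), M @ V))  # True
```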

Experimenting with models of 1 or 2 layers on 5×5 matrices, the author observed that only models using the P10 and P1000 encodings could be trained to high accuracy. P1000 performs best, with little difference between the 1-layer and 2-layer models. For 5×5 and 10×10 square matrices, a 2-layer Transformer with P1000 encoding achieves more than 99.9% accuracy at 1% tolerance. The results are summarized in the figure below.

Multiplying matrices M and P is a scaled-up version of matrix-vector multiplication, performing the above operation for each column vector of P. As before, only models using the P10 and P1000 encodings can be trained to make high-precision predictions.

On 5×5 matrices and rectangular matrices of similar size, the models reach the same precision as for matrix-vector multiplication (over 99% at 1% tolerance), but require a deeper decoder (4 to 6 layers).

Eigenvalues

We turn our attention to nonlinear problems solved by iterative algorithms.

The author trains models with 4 or 6 layers in the encoder or decoder to predict the eigenvalues of symmetric matrices.
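A sketch of how training targets for this task could be generated with a standard numerical routine; this is an assumption about the pipeline, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.uniform(-10, 10, size=(5, 5))
a = (g + g.T) / 2                    # random 5x5 symmetric matrix (model input)
eigenvalues = np.linalg.eigvalsh(a)  # ground-truth target, in ascending order
print(eigenvalues)
```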

For samples of 5×5 random matrices, 100% accuracy is achieved at 5% tolerance and 98.5% at 1% tolerance with all four encodings. For 8×8 matrices, 100% and 85% accuracy are achieved at 5% and 1% tolerance.

Bottlenecks appear for larger problems, however, where the model struggles to learn: on 10×10 matrices, 360 million examples reach only 25% accuracy at 5% tolerance. By contrast, the model was trained to top accuracy with about 40 million examples for 5×5 matrices and about 60 million for 8×8 matrices.

This limitation can be overcome by training the model on variable-size datasets. On matrix samples with dimensions 5-10, 5-15, and 5-20, the models achieve 100% accuracy at 5% tolerance and 88%, 94%, and 45% at 1% tolerance. Using the 5-15 model, the eigenvalues of 10×10 matrices can be predicted with 100% accuracy at 2% tolerance and 73% at 1% tolerance. The results are shown in the figure below.

Eigenvectors

In addition to the eigenvalues, the author also predicts the orthogonal matrix of eigenvectors.

On 5×5 matrices, models using the P10 and P1000 encodings achieve 97.0% and 94.0% accuracy at 5% tolerance. The FP15 model performs worse, with 51.6% accuracy, but an asymmetric model with a 6-layer FP15 encoder and a 1-layer P1000 decoder achieves 93.5% accuracy at 5% tolerance and 67.5% at 1% tolerance. The P1000 model can predict the eigenvectors of 6×6 matrices with 81.5% accuracy.

Inverse matrix

Inverting 5×5 matrices is harder than the previous tasks, with 73.6% accuracy for the P10 model and 80.4% for the P1000 model (at 5% tolerance, with a 6-layer encoder and a 1-layer decoder).

Increasing the number of attention heads to 10 or 12 has little effect on accuracy but speeds up training: with about 250 million examples, the 8-head model reaches 75% accuracy. An asymmetric model achieves the highest accuracy (90.0%).

Singular value decomposition

Although this task is related to eigendecomposition, it turns out to be harder to learn: Transformers with up to 6 layers, using the P10 or P1000 encoding, can predict the singular value decomposition of 4×4 matrices. Accuracy is higher for the singular values alone (100% and 86.7% at 5% and 1% tolerance, respectively) than for the full decomposition (98.9% and 75.3%).
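For reference, here is how the ground-truth decomposition could be obtained with NumPy; an illustrative sketch, since the paper's data pipeline may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(-10, 10, size=(4, 4))
u, s, vt = np.linalg.svd(a)                  # U, singular values, V^T
print(s)                                     # singular values, descending
print(np.allclose(u @ np.diag(s) @ vt, a))   # True: exact reconstruction
```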

In addition, for out-of-distribution generalization and retraining, the author trains models on random n×n matrices with independent, identically distributed (i.i.d.) coefficients sampled from a uniform distribution on [-A, A].

If Transformers are to solve linear algebra problems in practice, it is important to understand how models trained on Wigner matrices perform on matrices with different eigenvalue distributions.

The researchers created test sets of 10,000 matrices with distributions different from the training set: matrices with positive eigenvalues (Wigner matrices whose eigenvalues are replaced by their absolute values), and matrices whose eigenvalues follow uniform, Gaussian, or Laplace laws with different standard deviations.

To improve out-of-distribution accuracy, the author trained new models on datasets with different eigenvalue distributions and evaluated them on the test sets created earlier.


This leads to an important conclusion: Wigner matrices, often regarded as the default model of random matrices, may not be the best choice for training Transformers. Out-of-distribution generalization requires special attention to how the training data is generated.
