
Breaking the "backpropagation" monopoly, "forward automatic differentiation" can also calculate gradients and reduce training time by half


Using backpropagation to compute the gradient of the objective function being optimized is the mainstream approach in machine learning today. Recently, scholars from Oxford, Microsoft, and other institutions jointly proposed a gradient estimator called the "forward gradient", based on forward-mode automatic differentiation, which computes gradients without any backpropagation at all. Experiments show that on some problems the computation time with forward gradients is half that of backpropagation.

Compiled by | Zhang Qian

Edited by | Chen Caixian

Backpropagation and gradient-based optimization are the core techniques behind machine learning's (ML) major breakthroughs in recent years.

It is widely believed that machine learning has advanced so quickly in part because researchers can write ML code using third-party frameworks (e.g., PyTorch, TensorFlow). These frameworks not only provide automatic differentiation (AD) but also supply the basic computational infrastructure on which the code runs. The software frameworks that ML depends on are built around reverse-mode AD. The main reason is that, in ML, the gradient of a scalar objective with respect to a very large number of inputs can be evaluated exactly and efficiently with a single reverse-mode pass.

Automatic differentiation comes in two modes: forward mode and reverse mode. Forward mode, however, requires only a single forward evaluation of the function (i.e., no backpropagation at all), which significantly reduces the computational cost. Motivated by this, researchers from Oxford, Microsoft, and other institutions explored this mode and showed that stable gradient descent can be achieved on a range of machine learning problems using forward-mode automatic differentiation alone.

Address: https://arxiv.org/pdf/2202.08587v1.pdf

They argue that the forward gradient could change the computational complexity of classical machine learning training pipelines, reduce the time and energy cost of training, influence the design of machine learning hardware, and even bear on the biological plausibility of backpropagation in the brain.

1

Two modes of automatic differentiation

First, let's briefly review the two basic modes of automatic differentiation.

Forward mode

Given a function f : R^n → R^m, a point θ ∈ R^n, and a vector v ∈ R^n, forward-mode AD computes f(θ) and the Jacobian-vector product J_f(θ) v, where J_f(θ) ∈ R^{m×n} is the Jacobian matrix of all partial derivatives of f evaluated at θ, and v is a perturbation vector. For f : R^n → R, the Jacobian-vector product is simply the directional derivative ∇f(θ) · v, i.e., the gradient of f at θ projected onto the direction vector v, giving the rate of change of f along that direction.

It is worth noting that forward mode evaluates the function f and its Jacobian-vector product J_f v in a single forward run. Moreover, obtaining J_f v does not require computing the Jacobian J_f itself, a property known as matrix-free computation.
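As an illustration (not the paper's own code), here is a minimal sketch of a forward-mode Jacobian-vector product using PyTorch's torch.func.jvp; the function f, the point θ, and the perturbation v are made up for the example:

```python
import torch
from torch.func import jvp

# An arbitrary illustrative function f : R^3 -> R^2.
def f(theta):
    return torch.stack([theta.prod(), (theta ** 2).sum()])

theta = torch.tensor([1.0, 2.0, 3.0])   # evaluation point in R^n
v = torch.tensor([1.0, 0.0, 0.0])       # perturbation vector in R^n

# A single forward run returns both f(theta) and the Jacobian-vector
# product J_f(theta) v, without ever materializing the m-by-n Jacobian.
f_theta, jv = jvp(f, (theta,), (v,))
print(f_theta)  # tensor([ 6., 14.])
print(jv)       # tensor([6., 2.]) = J_f(theta) @ v
```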

Reverse mode

Given a function f : R^n → R^m, a point θ ∈ R^n, and a vector v ∈ R^m, reverse-mode AD computes f(θ) and the vector-Jacobian product v^T J_f(θ), where J_f ∈ R^{m×n} is the Jacobian matrix of all partial derivatives of f evaluated at θ, and v ∈ R^m is an adjoint vector. For the case f : R^n → R with v = 1, reverse mode computes the gradient, i.e., the partial derivatives of f with respect to all n inputs: ∇f(θ) = (∂f/∂θ_1, …, ∂f/∂θ_n)^T.

Note that v^T J_f is computed in a single forward-and-backward evaluation, without computing the Jacobian J_f.
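For comparison, a corresponding sketch of reverse mode via torch.func.vjp (again illustrative, not from the paper), using the same made-up f as above:

```python
import torch
from torch.func import vjp

# The same illustrative function f : R^3 -> R^2 as above.
def f(theta):
    return torch.stack([theta.prod(), (theta ** 2).sum()])

theta = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([1.0, 0.0])            # adjoint vector in R^m

# The forward pass records intermediate results; vjp_fn then replays them
# backward to produce v^T J_f(theta), again without building the Jacobian.
f_theta, vjp_fn = vjp(f, theta)
(vT_J,) = vjp_fn(v)
print(vT_J)   # tensor([6., 3., 2.]) = v^T @ J_f(theta)
# With f : R^n -> R and v = 1, this is exactly the gradient of f at theta.
```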

Runtime costs

The runtime of the two AD modes is bounded by a constant multiple of the time it takes to run the function f that is being differentiated.

Reverse mode is more expensive than forward mode because it involves reversing the flow of data and must keep a record of the results of all operations in the forward pass, which are needed to evaluate derivatives in the subsequent backward pass. The precise memory and computational cost characteristics ultimately depend on the features the AD system implements, such as exploiting sparsity.

Costs can be analyzed by assuming computational complexities for the basic operations, such as memory operations, addition, multiplication, and nonlinear operations. Writing the time needed to evaluate the original function f as runtime(f), the time required by forward and reverse mode can be expressed as Rf × runtime(f) and Rb × runtime(f), respectively. In practice, Rf is typically between 1 and 3 and Rb between 5 and 10, though these figures depend strongly on the program being differentiated.
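These constants can be probed with a rough, entirely illustrative timing sketch (not the paper's benchmark); the function, sizes, and repeat count below are arbitrary choices, and the measured values will vary by program and hardware:

```python
import time
import torch
from torch.func import jvp, vjp

# A stand-in differentiable function; any R^n -> R^m computation would do.
def f(theta):
    return torch.tanh(theta @ theta.T).sum(dim=1)

theta = torch.randn(512, 512)
v = torch.randn_like(theta)            # tangent for forward mode
u = torch.randn(512)                   # adjoint for reverse mode

def timed(fn, repeats=50):
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

base = timed(lambda: f(theta))                  # runtime(f)
fwd = timed(lambda: jvp(f, (theta,), (v,)))     # ~ Rf * runtime(f)
rev = timed(lambda: vjp(f, theta)[1](u))        # ~ Rb * runtime(f)
print(f"Rf ≈ {fwd / base:.2f}, Rb ≈ {rev / base:.2f}")
```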

2

Method

Forward gradient

Definition 1

Given a function f : R^n → R, they define the "forward gradient" g : R^n → R^n as:

g(θ) = (∇f(θ) · v) v,

where θ ∈ R^n is the point at which the gradient is evaluated, v ∈ R^n is a perturbation vector treated as a multivariate random variable v ~ p(v) such that the scalar components v_i of v are independent and have zero mean and unit variance for all i, and ∇f(θ) · v ∈ R is the directional derivative of f at θ in the direction v.

Let us briefly explain where this definition comes from.

As mentioned earlier, forward mode directly gives us the directional derivative ∇f(θ) · v = Σ_i (∂f/∂θ_i) v_i without computing ∇f. If we evaluated f forward n times, taking the direction vectors to be the standard basis (one-hot) vectors e_i ∈ R^n, i = 1 … n, where e_i has a 1 in the i-th coordinate and 0 elsewhere, then ∇f could be computed using forward mode alone: each run would measure the sensitivity of f to one input, ∂f/∂θ_i, and assembling all the results would give the gradient ∇f.

To obtain a runtime advantage over backpropagation, however, we want to run forward mode only once per optimization iteration. In a single forward run, the direction v can be understood as the weight vector in a weighted sum of sensitivities, Σ_i (∂f/∂θ_i) v_i, even though this sum does not distinguish the contribution of each individual θ_i to the total. We therefore use the weight vector v to attribute the overall sensitivity back to each individual parameter θ_i, in proportion to that parameter's weight v_i (e.g., a parameter with a small weight contributes little to the total sensitivity, and a parameter with a large weight contributes more).

In summary, every time we evaluate a forward gradient, we only need to do the following (a minimal code sketch follows the list):

Sample a random perturbation vector v ~ p(v) of the same size as θ, the argument of f.

Run f via forward-mode AD, which evaluates f(θ) and ∇f(θ) · v together in a single forward run, without computing ∇f at any point. The resulting directional derivative ∇f(θ) · v is a scalar and is computed exactly by AD (it is not an approximation).

Multiply the scalar directional derivative ∇f(θ) · v by the vector v to obtain g(θ), the forward gradient.
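A minimal sketch of these three steps, assuming a recent PyTorch with torch.func (the paper implements its own forward-mode AD from scratch, so this approximates the procedure rather than reproducing their code):

```python
import torch
from torch.func import jvp

def forward_gradient(f, theta):
    # Step 1: sample v ~ p(v) with the same size as theta; here a standard
    # normal, so components are independent with zero mean and unit variance.
    v = torch.randn_like(theta)
    # Step 2: one forward run evaluates f(theta) and the exact directional
    # derivative  grad f(theta) . v,  without ever forming grad f.
    f_theta, dir_deriv = jvp(f, (theta,), (v,))
    # Step 3: scale v by the scalar directional derivative to obtain the
    # forward gradient  g(theta) = (grad f(theta) . v) v.
    return f_theta, dir_deriv * v
```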

Figure 1 shows several evaluations of forward gradients of the Beale function. We can see how the perturbations v_k (orange) are transformed into forward gradients (∇f · v_k) v_k (blue) for k ∈ [1, 5], occasionally deviating from the true gradient (red) while remaining oriented toward it on average. The green arrow shows a Monte Carlo estimate of the gradient obtained by averaging forward gradients, i.e., (1/K) Σ_{k=1}^{K} (∇f · v_k) v_k ≈ E[(∇f · v) v].
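As a quick sanity check (our own, not taken from the paper), one can verify numerically that averaging many forward gradients of the Beale function recovers its true gradient; the sketch below relies on torch.func:

```python
import torch
from torch.func import grad, jvp

# The Beale function from Figure 1.
def beale(p):
    x, y = p[0], p[1]
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

theta = torch.tensor([1.0, 1.0])
K = 10_000
estimate = torch.zeros_like(theta)
for _ in range(K):
    v = torch.randn_like(theta)          # v ~ N(0, I)
    _, d = jvp(beale, (theta,), (v,))    # directional derivative grad f . v
    estimate += d * v / K                # running mean of (grad f . v) v

print(estimate)            # Monte Carlo estimate, approx. E[(grad f . v) v]
print(grad(beale)(theta))  # true gradient for comparison
```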


Forward gradient descent

They construct a forward gradient descent (FGD) algorithm (Algorithm 1) by replacing the gradient ∇f in standard gradient descent with the forward gradient g.

In practice, they use a minibatch stochastic version in which f_t changes at each iteration, since it depends on the minibatch of data used at that step of training. The researchers note that the directional derivative d_t in Algorithm 1 can be either positive or negative. When it is negative, the direction of the forward gradient g_t is flipped so that, in expectation, it points toward the true gradient. Figure 1 shows two v_k samples that demonstrate this behavior.

In this paper, they restrict their scope to FGD, studying this basic algorithm on its own and comparing it against standard backpropagation, without introducing other confounding factors such as momentum or adaptive learning rates. The authors believe the forward gradient can also be applied to other gradient-based optimization algorithms.
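A compact sketch of how the FGD loop described above might look in code (illustrative only; the hyperparameters are made up, and in practice f would be a minibatch loss that changes every iteration):

```python
import torch
from torch.func import jvp

def fgd(f, theta, lr=1e-3, steps=1000):
    """Forward gradient descent: plain gradient descent with the forward
    gradient g_t = d_t * v_t substituted for the true gradient."""
    for _ in range(steps):
        v = torch.randn_like(theta)        # perturbation v_t ~ N(0, I)
        _, d = jvp(f, (theta,), (v,))      # directional derivative d_t
        theta = theta - lr * d * v         # update; flips direction if d < 0
        # In minibatch training, f would be re-bound to the next batch here.
    return theta
```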


3

Experiments

The researchers implemented forward-mode AD in PyTorch for their experiments. They found no real difference in memory use between the forward gradient and backpropagation methods (less than 0.1% in every experiment).

Logistic regression

Figure 3 shows the results of several runs of multinomial logistic regression on MNIST digit classification. We observe runtime costs, relative to the base runtime, of Rf = 2.435 for the forward gradient and Rb = 4.389 for backpropagation, which is in line with what one would expect from a typical AD system.

The ratios Rf/Rb = 0.555 and Tf/Tb = 0.553 indicate that the forward gradient is roughly twice as fast as backpropagation, both in runtime and in the time needed to reach a given loss.

For this simple model the two ratios agree because the two techniques behave almost identically in loss-per-iteration space, which means the runtime gain is reflected almost directly in the loss-per-unit-time results.

Multilayer neural networks

Figure 4 shows two experiments training a multilayer neural network for MNIST classification at different learning rates. They used three fully connected layers of sizes 1024, 1024, and 10. For this architecture, they observed costs of the forward gradient and backpropagation relative to the base runtime of Rf = 2.468 and Rb = 4.165, with the measured Rf/Rb ratio averaging 0.592, roughly the same as in the logistic regression case.


Interestingly, in the second experiment (learning rate 2×10⁻⁴), we can see that the forward gradient achieves a faster descent in the per-iteration loss plot. The authors attribute this behavior to the different randomness of conventional SGD (with backpropagation) and the forward SGD algorithm, and speculate that the noise introduced by the forward gradient may be beneficial for exploring the loss surface.

The time plots show that forward mode reduces the runtime. The loss-performance ratio Tf/Tb is 0.211, indicating that in this experiment the forward gradient reaches the target validation loss more than four times faster than backpropagation.
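To make the setup concrete, here is a hypothetical sketch of a single forward-gradient training step for a 1024-1024-10 fully connected network like the one described above, written with torch.func.functional_call. This is our reconstruction, not the authors' implementation; the layer sizes and the 2×10⁻⁴ learning rate follow the description above, while the loss and input shape are assumed (MNIST-style 28×28 inputs with cross-entropy).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call, jvp

# Assumed reconstruction of the three fully connected layers (1024, 1024, 10).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

def forward_gradient_step(model, x, y, lr=2e-4):
    params = dict(model.named_parameters())
    # One perturbation tensor per parameter tensor, each entry ~ N(0, 1).
    v = {k: torch.randn_like(p) for k, p in params.items()}

    def loss_fn(p):
        logits = functional_call(model, p, (x,))
        return F.cross_entropy(logits, y)

    # A single forward run gives the minibatch loss and its directional
    # derivative along v; no backward pass is performed.
    loss, d = jvp(loss_fn, (params,), (v,))

    # Forward-gradient update, applied parameter by parameter.
    with torch.no_grad():
        for k, p in params.items():
            p -= lr * d * v[k]
    return loss
```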

Convolutional neural networks

Figure 5 shows a comparison of the forward gradient and backpropagation for a convolutional neural network on the same MNIST classification task.

For this architecture, they observed the best forward-AD performance relative to the base runtime: Rf = 1.434 means that forward mode adds only about 43% overhead on top of the base runtime. Backpropagation's Rb = 2.211 is very close to the ideal expected of a reverse-mode AD system. The ratio Rf/Rb = 0.649 represents a significant runtime advantage of forward AD over backpropagation. In loss space, they obtain a ratio of Tf/Tb = 0.514, showing that in the validation-loss experiments the forward gradient is about twice as fast as backpropagation.

Scalability

The previous results show that:

A typical ML training pipeline can be trained without backpropagation, at competitive computational cost;

With the same hyperparameters (learning rate and learning-rate decay), forward AD takes significantly less time than backpropagation.

Relative to the base runtime, we see that for most experiments backpropagation lies within Rb ∈ [4, 5] and the forward gradient within Rf ∈ [3, 4]. We also observe that the forward gradient retains a runtime advantage across the entire range of depths considered: the Rf/Rb ratio stays below 0.6 up to 10 layers and rises to slightly above 0.8 at 100 layers. Importantly, the two methods differ very little in memory consumption.


4

Conclusion

Overall, the main contributions of this work are as follows:

They define the "forward gradient", an unbiased estimator of the gradient based on forward-mode automatic differentiation that involves no backpropagation at all.

They implemented a forward-mode automatic differentiation system from scratch in PyTorch, entirely independent of the backpropagation machinery already present in PyTorch.

They applied forward gradients in stochastic gradient descent (SGD) optimization, and the results demonstrate that a typical modern machine learning training pipeline can be built using only forward-mode automatic differentiation.

They compared the runtime and loss behavior of forward gradients and backpropagation, among other things, and showed that in some cases the forward gradient algorithm is up to twice as fast as backpropagation.


Leifeng Network
