
What happens if you scramble/skip the Transformer layer? The latest research unravels the mechanism of information flow

Xifeng, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

The mechanism of information flow in Transformer has been uncovered by the latest research:

Are all layers necessary? Is the middle layer doing the same thing? Does the order of the layers matter?

What happens if you skip some layers, say feeding Layer 4's output directly into Layer 6? What happens if you shuffle the order of the layers, say running them as 4-6-5-7?

A recent study titled "Transformer Layers as Painters" has gone viral. It was carried out by a research team from the AI startups Sakana AI and Emergence AI.


Starting from the Transformer's internal workings, they arrived at answers to the questions above through a series of experiments. According to the team, a deep understanding of these principles can not only improve the efficiency of existing models, but also help improve architectures and develop new variants.

Google DeepMind researcher and ViT author Lucas Beyer gave it a thumbs-up after reading it:

Great summary! Although some of the experiments have appeared in earlier studies, I like the new details you added, especially the emphasis that "reasoning" tasks are affected more than others!

Many scholars and engineers also highly recommend it.

Bet some of these insights will eventually be used to improve Transformer.
The experiments reaffirm that duplicating layers helps creative tasks but generally not reasoning tasks; changing the order of layers doesn't work; and pruning works best on the middle layers, though it still requires restorative fine-tuning.

So, what experiments did the research team conduct in this study? What questions were answered?

Experimental model selection and benchmarking

Let's take a look at the experimental configuration first~

The experiments were conducted on both decoder-only and encoder-only models.

The decoder-only model is Llama2. The main experiments use Llama2-7B, which has 32 layers and 7 billion parameters; extended experiments also cover the 13B (40 layers) and 70B (80 layers) models.

The encoder-only model is BERT-Large, with 24 layers and 340 million parameters.

The researchers used the standard pre-trained checkpoints of these models. In all experiments the models were frozen: their parameters were not modified by fine-tuning, with the exception of the standard fine-tuning step included in the BERT evaluation.

In terms of benchmarks, Llama2 uses the following standard benchmarks: ARC (science exam questions), HellaSwag (general-knowledge questions), GSM8K (math problems), WinoGrande (commonsense reasoning), and LAMBADA (word prediction). Among them, LAMBADA measures perplexity and is the closest to the raw next-token prediction used during training.

For the performance evaluation of Llama2, the researchers report the normalized median over these benchmarks, with performance scaled from 0 to 1 (1 being the unmodified model's performance).
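
To make the scoring concrete, here is a minimal sketch of one plausible way to compute such a normalized median. The scaling convention below (0 corresponding to a random-guessing baseline) and all of the numbers in the example are illustrative assumptions, not values taken from the paper.

```python
import statistics

def normalized_score(raw: float, random_baseline: float, full_model: float) -> float:
    """Rescale a benchmark score so that 0 ~ random guessing and 1 ~ the unmodified model."""
    return (raw - random_baseline) / (full_model - random_baseline)

# Hypothetical numbers purely for illustration (NOT results from the paper):
scores = [
    normalized_score(0.40, random_baseline=0.25, full_model=0.55),  # an ARC-style score
    normalized_score(0.60, random_baseline=0.25, full_model=0.75),  # a HellaSwag-style score
    normalized_score(0.08, random_baseline=0.00, full_model=0.15),  # a GSM8K-style score
]
print(statistics.median(scores))  # the single "normalized median" reported for one model variant
```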

For BERT, the GLUE benchmark was adopted, following its evaluation protocol and reporting the unnormalized mean score across tasks. Note that the standard BERT evaluation includes a fine-tuning step, which lets the model adapt; in the appendix, the researchers also show results from an evaluation in which only the model's head is allowed to adapt.

The motivation for the experiment originally stemmed from the question:

Is it possible to somehow merge multiple layers into a single, possibly larger, layer? The hypothesis is that the middle layers of a neural network may use a common representation space, possibly owing to the residual connections used during training. (This is not true of standard multilayer perceptrons, which have no mechanism that encourages a common representation or consistency between layers.)

If the layers do share a representation space, this would have significant implications for conditional computation, for dynamically adding new knowledge to pre-trained Transformer models, and for downstream applications.

Eight big questions about Transformers

Do layers use the same representation space?

To determine whether different layers share the same representation space, the researchers tested the Transformer's robustness to skipping specific layers or swapping the order of adjacent layers.

For example, in Llama2-7B, what happens if you change the normal flow "Layer 4 → Layer 5 → Layer 6" to "Layer 4 → Layer 6", skipping layer 5?

Or what if you send the output of layer 4 to layer 6, then send layer 6's output to layer 5, and then on to layer 7?
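
To see what "skipping a layer" or "swapping adjacent layers" means mechanically, here is a minimal sketch using GPT-2 as an openly downloadable stand-in for a frozen decoder-only model (the paper itself uses Llama2-7B). The prompt, the layer indices, and the defensive tuple handling are my own assumptions, not the paper's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids
pos = torch.arange(ids.shape[1]).unsqueeze(0)

blocks = list(model.transformer.h)                 # 12 frozen Transformer blocks
T = len(blocks)

normal = list(range(T))                            # 0, 1, 2, ..., 11
skip5  = [i for i in normal if i != 5]             # layer 4's output feeds layer 6 directly
swap45 = [0, 1, 2, 3, 5, 4] + list(range(6, T))    # run layer 5 before layer 4

def run(order):
    h = model.transformer.wte(ids) + model.transformer.wpe(pos)  # token + position embeddings
    with torch.no_grad():
        for i in order:
            out = blocks[i](h)                     # each block reads the previous hidden state
            h = out[0] if isinstance(out, tuple) else out
        h = model.transformer.ln_f(h)
        return model.lm_head(h)[0, -1]             # next-token logits at the last position

for name, order in [("normal", normal), ("skip layer 5", skip5), ("swap 4 and 5", swap45)]:
    print(f"{name:>13}: next token = {tok.decode(run(order).argmax().item())!r}")
```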

As shown in Figure 2 of the paper, Llama2-7B turns out to be quite robust to skipping layers or swapping their order, except for the first and last layers.

That is, the middle layers share a representation space, and that space is separate from the one used by the "outer" layers (the first and last layers).


To further confirm this hypothesis, the researchers measured the average cosine similarity between the hidden state activations of different layers in different models (Llama2-7B, Llama2-13B, and BERT-Large) and compared them across benchmarks.
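
Here is a hedged sketch of how such a layer-by-layer cosine-similarity matrix can be computed with the Hugging Face transformers API, again using GPT-2 as a small stand-in. Mean-pooling each layer's activations over token positions is a simplification of my own, not necessarily the paper's exact procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

inputs = tok("Transformer layers as painters", return_tensors="pt")
with torch.no_grad():
    # hidden_states is a tuple: the embedding output plus one tensor per layer
    hidden = model(**inputs, output_hidden_states=True).hidden_states

# Mean-pool over token positions so each layer becomes a single vector, then compare all pairs.
layer_vecs = torch.stack([h.mean(dim=1).squeeze(0) for h in hidden[1:]])  # (num_layers, hidden_dim)
sims = torch.nn.functional.cosine_similarity(
    layer_vecs.unsqueeze(1), layer_vecs.unsqueeze(0), dim=-1
)                                                                         # (num_layers, num_layers)
print(sims.round(decimals=2))
```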

Figure 3 of the paper illustrates the consistency among all the middle layers: for example, the activations of the fourth layer from the bottom are similar to those of the fourth layer from the top. For the 40-layer Llama2-13B, the layers can be divided into 4-5 groups by similarity: layer 0, layers 1-3, the middle layers, and then the last layer or two.


This suggests that the model may have three distinct representation spaces: one for the "beginning" layers, one for the "middle" layers, and one for the "ending" layers. The researchers also found that the number of "beginning" layers seems to grow as the total number of layers in the model increases.

In addition, high cosine similarity is consistent with a shared representation space, while low similarity is stronger evidence that the spaces are not shared. The Llama2-7B data in Figure 3 agrees well with the performance results shown in Figure 2, which further supports the conclusion that:

At least the middle layers share a representation space.

Are all layers necessary?

To further verify that the middle layers' representation space is truly shared, the researchers also ran layer-skipping experiments (with no fine-tuning).

Specifically, the output of layer N is passed directly to the input of layer N+M (M > 1), thereby "skipping" M-1 layers, as illustrated in the paper.


Layer N+M was trained only on inputs coming from layer N+M-1, so can it now make sense of activations coming from layer N?

In this type of experiment, the researchers run the first and last N-1 layers normally and skip or modify layers N+1 through T-N (where T is the total number of layers of the model).
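
As a concrete sketch, the execution order for such a "skip" variant can be built as a simple list of layer indices and plugged into a manual forward loop like the one in the earlier GPT-2 sketch. The 0-indexing and exact boundary convention below are my own and may differ from the paper's by a layer.

```python
def skip_order(total_layers: int, n: int) -> list[int]:
    """Run the first n and the last n layers normally; skip everything in between."""
    return list(range(n)) + list(range(total_layers - n, total_layers))

print(skip_order(32, 14))  # 32-layer model: only layers 14-17 are skipped
print(skip_order(32, 4))   # much more aggressive: layers 4-27 are skipped
```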

Figure 4 of the paper shows a gradual degradation in the performance of both Llama2-7B and BERT-Large across multiple benchmarks as the number of skipped layers increases (from left to right in the plot). This result reveals:

Not all layers are necessary, and omitting at least some of the middle layers does not seriously hurt overall performance.


Do the middle layers all perform the same function?

If the middle layers share a common representation space, are these layers redundant?

To answer this question, the researchers re-ran the previous "skip" experiment, but this time, instead of skipping the middle layers, they replaced the weights of all of those middle layers with the weights of the single most central layer.

In effect, the most central layer is run in a loop T-2N+1 times, where T is the total number of layers of the model (32 for Llama2-7B and 24 for BERT-Large).
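
A similar sketch works for this "middle repeat" variant, replacing the middle band with copies of the center-most layer; again this is 0-indexed, and the exact off-by-one convention is my own rather than taken from the paper's code.

```python
def middle_repeat_order(total_layers: int, n: int) -> list[int]:
    """Run the first n and last n layers normally; fill the middle band with copies of the center layer."""
    center = total_layers // 2
    band_width = total_layers - 2 * n  # number of middle layers being replaced
    return list(range(n)) + [center] * band_width + list(range(total_layers - n, total_layers))

print(middle_repeat_order(32, 12))  # first/last 12 layers normal, layer 16 repeated 8 times in between
```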


As a result, the model's performance deteriorates rapidly as the number of replaced layers grows, and it degrades far faster than when the same layers are simply skipped: this weight substitution is extremely disruptive.


Therefore, the middle layers are not redundant: they perform different functions, and sharing weights between them has disastrous consequences.

Does the order of the layers matter?

The experiments above show that although the middle layers share a representation space, they perform different operations on that space. So does the order of these operations matter? The researchers ran two sets of experiments.

First, the middle layers are executed in the reverse of the order in which they were trained: the output of layer T-N is passed to layer T-N-1, and so on down to layer N, whose output is then passed to the final layers of the model.

In the second experiment, the middle layers were run in a random order, and results were averaged over 10 random seeds.
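
The two reordering variants can be sketched the same way, as alternative index lists over the middle band (boundaries again approximate; the paper averages the shuffled variant over 10 seeds):

```python
import random

def reversed_middle_order(total_layers: int, n: int) -> list[int]:
    """Keep the first/last n layers in place; run the middle band back to front."""
    middle = list(range(n, total_layers - n))
    return list(range(n)) + middle[::-1] + list(range(total_layers - n, total_layers))

def shuffled_middle_order(total_layers: int, n: int, seed: int) -> list[int]:
    """Keep the first/last n layers in place; run the middle band in a seeded random order."""
    middle = list(range(n, total_layers - n))
    random.Random(seed).shuffle(middle)
    return list(range(n)) + middle + list(range(total_layers - n, total_layers))

print(reversed_middle_order(12, 2))
print([shuffled_middle_order(12, 2, seed) for seed in range(3)])
```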

In both cases, the model shows only a slow degradation in performance.


To preview a result from the comparison later in the article: whether run in reverse or in random order, the model performs better than when those layers are skipped outright, indicating that the layers can still produce useful output even when fed inputs in an order they were never trained on.

So, does layer order matter? The conclusion is:

Adjusting the layer order has some impact on performance: both random order and reverse order show some degradation.

It is worth noting that random order performs better than reverse order. This may be because reverse order is the exact opposite of the order seen during training, whereas any random order preserves at least some sequential coherence (i.e., some layer i still comes after some layer j with i > j).

Is it possible to run these layers in parallel?

If the presence of the layers, i.e. not skipping them, matters more than the order in which they are executed, could these layers be run independently and their results then merged?


The researchers ran an experiment in which, instead of skipping layers N through T-N, these middle layers were run in parallel on the same input and their averaged output was passed on to the final N layers.
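
Here is a hedged sketch of this parallel-and-average variant, again on GPT-2 as a small openly available stand-in (the paper does this on Llama2-7B and BERT-Large). The prompt, the choice of n, and the tuple handling are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
blocks, T, n = list(model.transformer.h), len(model.transformer.h), 2

ids = tok("The capital of France is", return_tensors="pt").input_ids
pos = torch.arange(ids.shape[1]).unsqueeze(0)

def block_out(i, h):
    out = blocks[i](h)
    return out[0] if isinstance(out, tuple) else out

with torch.no_grad():
    h = model.transformer.wte(ids) + model.transformer.wpe(pos)
    for i in range(n):                       # first n blocks run normally
        h = block_out(i, h)
    # every middle block reads the same input; their outputs are averaged
    h = torch.stack([block_out(i, h) for i in range(n, T - n)]).mean(dim=0)
    for i in range(T - n, T):                # last n blocks run normally
        h = block_out(i, h)
    logits = model.lm_head(model.transformer.ln_f(h))
print(tok.decode(logits[0, -1].argmax().item()))
```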

All benchmarks show only a slow degradation in performance, with the exception of the GSM8K math benchmark.

Interestingly, running the layers in parallel performs better than skipping them, but not as well as running them in reverse order.


In summary, is it possible to run these layers in parallel? The answer is: Yes, except for math-based benchmarks.

Is order more important for some tasks?

Most variants, including reverse order, skipping, and parallelization, degrade fastest on the abstract-reasoning benchmark ARC and the mathematical-reasoning benchmark GSM8K.

One explanation is that step-by-step reasoning tasks are more sensitive to changes in layer order than "semantic" tasks such as WinoGrande or HellaSwag.

Reasoning tasks require combining both structural and semantic information, whereas tasks such as HellaSwag can be solved with semantics alone.

From these experiments, the researchers concluded that mathematical and reasoning tasks depend more on layer order than "semantic" tasks do.

Does iteration help the parallelized layers?

Think of a Transformer's inner workings as painting a picture: the canvas (the input) is passed along a line of painters. Some specialize in birds, others are better at painting wheels... Each painter in turn receives the canvas from the previous one and decides either to add to the painting or to pass it along unchanged to the next painter (via the residual connection).

It is conceivable that some layers only "add to" the painting when they receive the right input. For example, a painter who "draws wheels" is more likely to paint wheels after seeing the body of a car.

Likewise, in a Transformer, some layers may only contribute to the forward pass when they receive the appropriate input; otherwise they simply pass the input along via the residual connection.

If so, executing the parallelized layers iteratively should improve performance compared with executing them only once.

The researchers tested this by feeding the averaged output of the parallelized layers back into those same layers for a fixed number of iterations.
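
A minimal sketch of this looped-parallel idea, written over a generic list of layer functions that each map a hidden-state tensor to a hidden-state tensor; the stand-in "layers" below are plain linear maps used only to show the control flow, not real Transformer blocks.

```python
import torch

def looped_parallel(h: torch.Tensor, middle_layers, iterations: int) -> torch.Tensor:
    """Feed the averaged output of the parallel middle layers back into them, `iterations` times."""
    for _ in range(iterations):
        h = torch.stack([layer(h) for layer in middle_layers]).mean(dim=0)
    return h

# Stand-in "layers" purely to demonstrate the control flow:
layers = [torch.nn.Linear(16, 16) for _ in range(4)]
h = torch.randn(1, 8, 16)
print(looped_parallel(h, layers, iterations=3).shape)  # torch.Size([1, 8, 16])
```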


In Figure 9 of the paper, the researchers show the results of iterating the parallelized layers 3 times, which is significantly better than executing them only once.


The only exception is when the starting layer N is 15 for Llama2-7B or 11 for BERT. In that case, looping the parallelized layers 3 times is equivalent to simply repeating the middle layer 3 times, and the single-pass parallel variant at that point is equivalent to the full model.

The researchers also repeated the experiment with different numbers of iterations.

The paper also plots how Llama2-7B's performance varies with the number of parallelized layers M and the number of iterations.


The optimal number of iterations for each M is marked with a red box. With the exception of M=29 and M=31 (where almost all layers are parallelized), the optimal number of iterations is roughly linearly proportional to the number of parallelized layers.

The conclusion, therefore, is that iteration helps the parallelized layers, and the optimal number of iterations is proportional to the number of layers parallelized.

Which variants are the least detrimental to performance?

Finally, the researchers compared all of the variants from the experiments in a single chart.

The results show that repeating a single layer (i.e., replacing the middle layers with the same number of copies of the most central layer, as described above) performs worst, with performance quickly degrading to the random baseline.


Looped parallelism and random layer order degrade performance the least, with looped parallelism performing best on both BERT and Llama2-7B.


More experimental results are included in the paper's appendix; interested readers can check out the original paper.

Paper link: https://arxiv.org/abs/2407.09298v1

Reference link: https://x.com/A_K_Nain/status/1812684597248831912

— END —
