Reported by Heart of the Machine
Editors: Chen Ping, Du Wei
Joint research from MIT and Microsoft: the task performance of large language models can be improved, and their size reduced, with no additional training.
In the era of large models, the Transformer single-handedly underpins an entire field of research. Since its release, Transformer-based LLMs have demonstrated superior performance on a wide variety of tasks; the underlying Transformer architecture has become the state of the art for natural language modeling and inference, and has shown strong promise in areas such as computer vision and reinforcement learning.
However, current Transformer models are very large, typically requiring substantial computing resources for both training and inference.
This is by design, since a Transformer trained with more parameters or more data is demonstrably more capable. Nonetheless, a growing body of work shows that Transformer models, like other neural networks, do not need all of their fitted parameters to retain what they have learned.
In general, heavy over-parameterization appears to help during training, but these models can be pruned aggressively before inference; studies have shown that neural networks can often shed more than 90% of their weights without any significant drop in performance. This phenomenon has driven researchers toward pruning strategies that aid model inference.
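As a rough illustration of this kind of pruning, here is a minimal magnitude-pruning sketch, a generic technique from the broader pruning literature rather than the method proposed in this paper; the function name and sparsity level are our own choices:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Zero out the smallest-magnitude entries of a weight matrix."""
    k = int(weight.numel() * sparsity)                # number of entries to drop
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))

W = torch.randn(4096, 4096)
W_pruned = magnitude_prune(W, sparsity=0.9)           # ~90% of entries become zero
```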
Researchers from MIT and Microsoft, in the paper "The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction", report a surprising finding: careful pruning at specific layers of a Transformer model can significantly improve its performance on certain tasks.
- Paper: https://arxiv.org/pdf/2312.13558.pdf
- Paper Homepage: https://pratyushasharma.github.io/laser/
The study calls this simple intervention LASER (LAyer SElective Rank reduction): using singular value decomposition, it selectively removes the higher-order components of the learned weight matrices at specific layers of a Transformer model, significantly improving LLM performance. The operation is applied after the model has been trained and requires no additional parameters or data.
In practice, the reduction is applied to specific weight matrices at specific layers of the model; the study also found that many such matrices can be reduced drastically, with no performance degradation observed until more than 90% of the components are removed.
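The core operation is an ordinary truncated SVD. Below is a minimal PyTorch sketch (our own naming, not the authors' released code) of replacing a weight matrix with its rank-k approximation:

```python
import torch

def low_rank_approximation(weight: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the top-k singular components of a weight matrix,
    discarding the higher-order (small-singular-value) components."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]
```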
The study further found that these reductions can significantly improve accuracy, a finding that is apparently not limited to natural language: performance gains were also observed in reinforcement learning.
In addition, the study attempts to infer what is stored in the higher-order components, such that removing them improves performance. The researchers found that after LASER the model answers questions correctly, whereas before the intervention the original model mostly responded with high-frequency words (such as "the", "of", etc.) that were not even of the same semantic type as the correct answer; in other words, without the intervention these components lead the model to generate irrelevant high-frequency words.
After a certain amount of rank reduction, however, the model's answer becomes correct.
To understand this, the study also examined what the remaining components encode on their own, approximating the weight matrix using only its higher-order singular vectors. These components turned out to describe either alternative answers in the same semantic category as the correct one or generic high-frequency words.
These results suggest that when noisy higher-order components are combined with the lower-order components, their conflicting responses produce a kind of averaged answer, which may be incorrect. Figure 1 provides a visual representation of the Transformer architecture and the procedure followed by LASER, in which the weight matrix of the multilayer perceptron (MLP) at a particular layer is replaced with its low-rank approximation.
Overview of LASER
The researchers describe the LASER intervention in detail. A single-step LASER intervention is defined by a triplet (τ, ℓ, ρ) consisting of a parameter type τ, a layer number ℓ, and a rank-reduction fraction ρ. Together these values specify which matrix will be replaced by its low-rank approximation and how aggressive the approximation is. The parameter type classifies the kind of matrix to intervene on.
The researchers focus on the matrices in W = {W_q, W_k, W_v, W_o, U_in, U_out}, i.e., the matrices of the attention and MLP layers. The layer number ℓ specifies the layer to intervene on (the first layer is indexed as 0). For example, Llama-2 has 32 layers, so ℓ ∈ {0, 1, 2, …, 31}.
Finally, ρ ∈ [0, 1) specifies what fraction of the maximum rank should be kept in the low-rank approximation. For example, for a matrix W ∈ R^{d×d}, whose maximum rank is d, the researchers replace it with a rank-⌊ρ·d⌋ approximation.
Figure 1 below shows an example of LASER, where τ = U_in and ℓ = L indicate that the first-layer MLP weight matrix in the L-th Transformer block is to be updated. The remaining parameter ρ controls the k in the rank-k approximation.
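Putting the triplet together, a single-step intervention might look like the sketch below. It assumes the HuggingFace GPT-J module layout (transformer.h[ℓ].mlp.fc_in / fc_out); the helper names are ours, and other models expose their matrices under different attributes:

```python
import torch

def get_weight(block, tau: str) -> torch.Tensor:
    # Map the parameter type τ to a weight tensor inside one Transformer block.
    # Attribute names follow the HuggingFace GPT-J implementation (an assumption).
    mapping = {
        "U_in": block.mlp.fc_in.weight,    # first MLP matrix
        "U_out": block.mlp.fc_out.weight,  # second MLP matrix
    }
    return mapping[tau]

@torch.no_grad()
def laser_step(model, tau: str, layer: int, rho: float) -> None:
    """Apply one LASER intervention (τ, ℓ, ρ): replace the selected matrix
    with its rank-⌊ρ·d⌋ approximation, d being the matrix's maximum rank."""
    W = get_weight(model.transformer.h[layer], tau)
    d = min(W.shape)                       # maximum possible rank
    k = max(1, int(rho * d))               # retained rank ⌊ρ·d⌋
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W.copy_(U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :])
```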
LASER restricts the flow of certain information through the network and, unexpectedly, can yield significant performance benefits. Interventions can also be composed straightforwardly, for example by applying a set of interventions {(τ_i, ℓ_i, ρ_i)} in any order.
The LASER method is simply a search over such interventions, applying the modification that yields the greatest benefit. There are, however, many other ways to combine these interventions, which the researchers leave as future work.
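Such a search could be sketched as a brute-force sweep over candidate triplets, scored on held-out data. Everything here (the evaluate callback, the candidate grids, deep-copying instead of restoring weights) is our illustrative assumption, not the paper's implementation:

```python
import copy

def search_laser(model, evaluate, taus=("U_in", "U_out"),
                 rhos=(0.5, 0.25, 0.1, 0.01)):
    """Try every single-step intervention and return the best-scoring one.

    evaluate(model) -> float is assumed to return validation accuracy;
    in practice one would restore the original weight rather than deep-copy.
    """
    best_score, best_intervention = evaluate(model), None  # baseline: no change
    for layer in range(len(model.transformer.h)):
        for tau in taus:
            for rho in rhos:
                candidate = copy.deepcopy(model)
                laser_step(candidate, tau, layer, rho)
                score = evaluate(candidate)
                if score > best_score:
                    best_score, best_intervention = score, (tau, layer, rho)
    return best_score, best_intervention
```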
Experimental results
In the experiments, the researchers used a GPT-J model pre-trained on the PILE dataset, with 27 layers and 6 billion parameters. The model's behavior was then evaluated on the CounterFact dataset, which contains (subject, relation, answer) triples, with three paraphrased prompts provided for each question.
First comes the analysis of GPT-J on CounterFact. Figure 2 below shows the effect on the dataset's classification loss of applying different amounts of rank reduction to each matrix in the Transformer architecture. Each Transformer layer contains a small two-layer MLP, whose input and output matrices are shown separately. Different colors indicate different percentages of removed components.
As for accuracy and robustness to paraphrasing, shown in Figure 2 above and Table 1 below, the researchers found that GPT-J's factual accuracy on CounterFact rose from 13.1% to 24.0% when rank reduction was applied at a single layer. Importantly, these gains come from rank reduction alone and involve no further training or fine-tuning of the model.
The researchers also found that the facts recovered by rank reduction are most likely to be those that appear only rarely in the training data, as shown in Figure 3 below.
What do the higher-order components store? The researchers approximate the final weight matrix using the higher-order components (unlike LASER, which keeps the lower-order components), as shown in Figure 5(a) below. They then measure the mean cosine similarity between the true answer and the predicted answer when different numbers of higher-order components are used for the approximation, as shown in Figure 5(b) below.
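To mirror that analysis, one can build the complementary approximation from the tail of the SVD and compare answer embeddings; how those embeddings are obtained is our assumption here:

```python
import torch
import torch.nn.functional as F

def high_order_approximation(weight: torch.Tensor, k: int) -> torch.Tensor:
    """Approximate the matrix using only its k smallest singular components,
    i.e. exactly the higher-order part that LASER removes."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, -k:] @ torch.diag(S[-k:]) @ Vh[-k:, :]

def mean_cosine_similarity(pred_emb: torch.Tensor, true_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between predicted- and true-answer embeddings."""
    return F.cosine_similarity(pred_emb, true_emb, dim=-1).mean()
```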
Finally, the researchers evaluated how well their findings generalize across multiple language-understanding tasks for three different LLMs. For each task, model performance was evaluated with three metrics: generation accuracy, classification accuracy, and loss. As Table 1 above shows, even a large amount of rank reduction does not degrade model accuracy and can in fact improve performance.