
No additional training needed to improve model performance by 30%: DeepMind scientist praises MIT PhD student's work

Author: Feng Se, from Aofei Temple

QbitAI | WeChat official account QbitAI

An astonishing discovery from a PhD student at MIT:

A very simple rank-reduction pruning of specific Transformer layers can significantly improve model performance while shrinking the model.


The gains show up mainly on text-understanding tasks, with improvements of up to 30%.

This was validated on 3 models (Llama 2, GPT-J, and RoBERTa) and 8 different datasets (covering cognitive reasoning, world knowledge, and more).


In addition to text comprehension, it also works for reinforcement learning.

More importantly, this operation is applied only after the model is trained, and requires no additional parameters or data.

A DeepMind research scientist praised the work after reading it:


So, how exactly does it work?

Overview of the methodology

The method's full name is "Layer-Selective Rank Reduction", or "LASER" for short.

It is an intervention that selectively removes the higher-order components of an LLM's weight matrices, applied to a specific weight matrix in a specific layer of the Transformer model.

The study found that even when more than 90% of the components were removed, model performance generally did not degrade.

Specifically, LASER replaces a specific weight matrix (W) in the Transformer model with a rank-k approximation; sometimes reducing the matrix to keep only its top 1% of components still achieves good results.

A single-step LASER intervention consists of three parameters:

type (T), layer number (ℓ), and rank reduction (ρ).

Together, these values describe which matrix will be replaced by its low-rank approximation, and to what extent.

The type parameter specifies which matrix to intervene on; the matrix W is drawn from the MLP and attention layers.

The layer number indicates the layer to intervene in (layers are indexed from 0). For example, Llama 2 has 32 layers, so ℓ ∈ {0, 1, 2, ..., 31}.

Finally, ρ ∈ [0, 1) describes the fraction of the maximum rank that should be retained when performing the low-rank approximation.
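The rank-reduction step itself can be sketched with a truncated SVD. The following is a minimal illustration of what a single intervention could look like, not the authors' actual code; the function name and the ρ value are assumptions for the example:

```python
import numpy as np

def laser_reduce(W: np.ndarray, rho: float) -> np.ndarray:
    """Replace W with a rank-k approximation that keeps a fraction
    rho of its maximum possible rank (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    k = max(1, int(rho * min(W.shape)))  # number of top singular values kept
    # Rebuild W from only the top-k singular triplets.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Example: keep only the top 1% of components of a 512x512 matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
W_low = laser_reduce(W, rho=0.01)
print(np.linalg.matrix_rank(W_low))  # prints 5 (1% of 512, rounded down)
```

The reduced matrix has the same shape as the original, so it can be swapped into the model in place; only its rank (and thus its effective information content) shrinks.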

The paper illustrates a single LASER operation with an example that updates the first-layer MLP weight matrix in the ℓ-th Transformer block.


Experimental findings:

The effect of rank reduction is not uniform across layer types: it is observed mainly in the MLP layers of later Transformer blocks, while the effect in attention layers is very weak.


At the same time, running LASER on multiple layers in one sweep can improve performance beyond what any single-layer intervention achieves.

Specifically, it can sometimes more than double the model's original performance.
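A multi-layer sweep can be sketched as follows. This is a toy illustration with made-up layer names and ρ values, not the paper's tuned settings:

```python
import numpy as np

def svd_truncate(W: np.ndarray, k: int) -> np.ndarray:
    """Keep only the top-k singular triplets of W (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Toy "model": one MLP input matrix per Transformer block.
rng = np.random.default_rng(0)
model = {f"block{l}.mlp_in": rng.standard_normal((64, 64)) for l in range(4)}

# Apply LASER-style reductions to several (matrix, rho) targets at once.
interventions = [("block2.mlp_in", 0.05), ("block3.mlp_in", 0.05)]
for name, rho in interventions:
    k = max(1, int(rho * min(model[name].shape)))
    model[name] = svd_truncate(model[name], k)

print(np.linalg.matrix_rank(model["block3.mlp_in"]))  # prints 3
```

Untouched matrices keep their full rank, so a sweep like this lets different layers receive different degrees of reduction, which is what the multi-layer results above compose.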


In addition to improving the model's text-understanding performance by up to 30%, it is also effective for reinforcement learning.

Here, the authors evaluate the impact of LASER on a Decision Transformer trained and evaluated on Sokoban (a puzzle game where boxes are pushed onto target spots).

It turned out that with LASER, the model could solve 3% more tasks.


Cause analysis

Why does such a simple operation lead to such an improvement in model performance?

The authors analyzed results from the GPT-J model (chosen mainly because its training data, D_train, is public), computing how often the corrected facts appear in the training data to figure out which data points benefited.

It was found that the largest performance gains occurred on low-frequency samples.

As shown in panel (c) of the paper's analysis figure, the bar chart plots the improvement LASER provides across the data: the biggest accuracy gains come from data points that appear least frequently in the training data.


The authors explain that eliminating the higher-order components "denoises" the model and helps it recover hidden, less-frequent information.

In this regard, the DeepMind researcher said this makes a lot of sense:

LLMs have to model a lot of erroneous reasoning and inaccurate information, and it helps to weed out some of what they've learned.

Then the question arises: what exactly do the higher-order components in the matrix store that would destroy the model?

By approximating the weight matrix using only these higher-order components, the authors find:

When the original, unmodified model does not answer correctly, higher-order components occasionally answer questions with high-frequency words that have no actual meaning (e.g., "a", "the", "of"), or directly predict entities that have the same semantic type as the correct answer but are incorrect.

By using LASER to remove these higher-order components, this problem is resolved and the model responds correctly.


Overall, this research is useful for understanding how information is stored in LLMs, how models can be compressed, and, more broadly, the behavior of large language models.

There are still many problems that need to be solved urgently, such as:

1. Why do higher-order components in the weight matrix accumulate noisy answers during training?

2. What is the impact of model architecture and structure selection on the occurrence of this phenomenon?

About the Author:

There are three authors on this paper. One is a PhD student at MIT EECS, who produced this research during her internship at Microsoft.


The remaining two are her supervisors on this research; both are senior researchers at Microsoft Research New York and contributed equally to advising.

One is Jordan T. Ash, a PhD graduate of Princeton University who specializes in deep learning and sequential decision making.

The other is Dipendra Misra, whose research focuses on interactive learning, NLP, and representation learning.

Reference Links:

[1]https://arxiv.org/abs/2312.13558

[2]https://twitter.com/pratyusha_PS/status/1739025292805468212

— END —
