
After CV, the pure MLP architecture comes to NLP, with performance comparable to large pre-trained models

Reports from the Heart of the Machine

Editors: Chen Ping, Xiao Zhou

Can't afford a large model? Try a high-performance pure-MLP architecture.

Last year, a research team from Google Brain opened up a new direction in network architecture design with MLP-Mixer, a vision architecture built purely on MLPs. It uses no convolutions or attention mechanisms, only MLPs, yet achieves performance comparable to CNNs and ViTs on the ImageNet dataset.

Researchers at Tsinghua University and other institutions have since used pure MLPs to build vision architectures and new attention mechanisms, and these studies have redirected part of the CV community's attention back to MLPs.

Many researchers have remarked that the CV field's architectural evolution from MLP to CNN to Transformer and back to MLP amounts to a "renaissance" in AI.

Less than a year later, a team from IBM Research has proposed pNLP-Mixer, which applies the MLP-Mixer idea to natural language processing (NLP) tasks.

Paper: https://arxiv.org/pdf/2202.04350.pdf

Large pre-trained language models have dramatically changed the landscape of NLP, and today they are the framework of choice for handling a variety of NLP tasks. However, due to memory footprint and inference costs, using these models in a production environment, whether in the cloud or at the edge, remains a challenge.

Researchers have begun to propose alternatives, and recent work on efficient NLP shows that small, weight-efficient models can achieve competitive performance at a fraction of the cost. pNLP-Mixer, proposed by IBM Research, is a projection-based MLP-Mixer model for NLP tasks that achieves high weight efficiency through a novel projection layer.

The study evaluates the model on two multilingual semantic parsing datasets, MTOP and multiATIS. On MTOP, pNLP-Mixer achieves performance comparable to mBERT, which has 38x more parameters, and it also outperforms pQRNN, a small model 3x its size. On long-sequence classification tasks, pNLP-Mixer, without any pretraining, outperforms RoBERTa, which has 100x more parameters.

pNLP-Mixer architecture

pNLP-Mixer is an efficient architecture designed from scratch for edge settings, where both memory and latency are constrained, and serves as a backbone network for NLP pipelines.


Figure 1 depicts the architecture of the pNLP-Mixer model. The model is projection-based and does not store large embedding tables the way transformer-based models do. pNLP-Mixer uses a projection layer built on non-trainable hash functions to capture lexical knowledge from individual tokens. This projection layer can be thought of as a feature extractor that generates representations from the input text. Once the input features are computed, they are fed into a trainable linear layer called the bottleneck layer. The output of the bottleneck layer is the input to a series of MLP blocks of the standard MLP-Mixer architecture (Tolstikhin et al., 2021).
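
To make the dataflow concrete, here is a minimal PyTorch sketch of that pipeline. The class and parameter names are illustrative placeholders, not the authors' released code, and the mixer stack is stubbed out (a sketch of one mixer block follows in the MLP-Mixer section).

```python
import torch
import torch.nn as nn

class PNLPMixerSketch(nn.Module):
    """Minimal sketch of the Figure 1 pipeline (names are illustrative, not the authors' code)."""

    def __init__(self, proj_dim: int, bottleneck_dim: int, num_blocks: int, num_labels: int):
        super().__init__()
        # Trainable bottleneck: maps projection features to the mixer width.
        self.bottleneck = nn.Linear(proj_dim, bottleneck_dim)
        # Placeholder for the stack of standard MLP-Mixer blocks (see the MixerBlock sketch below).
        self.mixer = nn.Sequential(*[nn.Identity() for _ in range(num_blocks)])
        # Token-level head for semantic parsing; classification tasks would use attention pooling instead.
        self.head = nn.Linear(bottleneck_dim, num_labels)

    def forward(self, projected: torch.Tensor) -> torch.Tensor:
        # projected: (batch, seq_len, proj_dim), features from the non-trainable hash projection
        x = self.bottleneck(projected)   # (batch, seq_len, bottleneck_dim)
        x = self.mixer(x)                # token- and feature-mixing MLP blocks
        return self.head(x)              # (batch, seq_len, num_labels)
```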

Using a pure MLP architecture for language processing has several advantages. Unlike attention-based models, MLP-Mixer captures long-range dependencies without incurring a cost that is quadratic in sequence length. In addition, a model built only from MLPs is not only simple to implement but also enjoys out-of-the-box hardware acceleration on a wide range of devices, from phones to server-grade inference accelerators.

This study shows that on NLP tasks, simple models such as MLP-Mixer can be an effective alternative to transformer-based models, even without large embedding tables. The key is to feed the model high-quality input features.

Projection layers

The projection layer is based on locality-sensitive hashing (LSH) and creates representations from text. While this concept is shared with other existing projections (e.g. pQRNN (Kaliamoorthi et al., 2021)), the projection method proposed in this study is entirely new. MinHash is used as the hash function because it is computationally simple, and subword tokenization determines the hash inputs. Subword tokenization is commonly used in transformer models and ensures that any string can be represented as a combination of subword units, i.e. there are no out-of-vocabulary words. In the context of this study, using a subword tokenizer has two main advantages:

Linguistic knowledge can be injected by training a new tokenizer or by reusing the vocabulary of an available pre-trained language model;

The representation of each subword unit can be cached to reduce inference cost.


The projection layer computes a MinHash fingerprint F^t for each input token by combining the fingerprints of its individual subword units from the vocabulary V (for MinHash, this amounts to taking the element-wise minimum). A fingerprint F ∈ N^n is an array of n positive integers F_0, ..., F_(n-1), computed with n different hash functions h_0(x), ..., h_(n-1)(x) that map strings to positive integers.
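
As an illustration, the sketch below computes such a fingerprint under stated assumptions: the concrete hash construction (a SHA-256 string hash salted with the function index) and the example subword split are hypothetical, while the element-wise minimum and the per-subword caching follow the description above.

```python
import hashlib
from functools import lru_cache

N_HASHES = 64  # n: number of hash functions, i.e. the fingerprint length

def _h(i: int, s: str) -> int:
    """i-th hash function h_i: maps a string to a positive integer (illustrative construction)."""
    return int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16)

@lru_cache(maxsize=None)  # subword fingerprints can be cached to cut inference cost
def subword_fingerprint(subword: str) -> tuple:
    return tuple(_h(i, subword) for i in range(N_HASHES))

def token_fingerprint(subwords: list) -> list:
    """MinHash fingerprint of a token: element-wise minimum over its subword fingerprints."""
    fingerprints = [subword_fingerprint(sw) for sw in subwords]
    return [min(values) for values in zip(*fingerprints)]

# Hypothetical subword split of the token "unbelievable"
print(token_fingerprint(["un", "##believ", "##able"])[:4])
```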

MLP-Mixer

MLP-Mixer is a simple architecture consisting only of mixer blocks, each containing two multilayer perceptrons (MLPs) interleaved with transposition operations. The output of the first MLP is transposed before being fed to the second, so the second MLP operates along the sequence dimension and effectively mixes information between tokens. The design follows the original architecture, with skip connections, layer normalization, and GELU nonlinearities.
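
A minimal PyTorch sketch of one such mixer block is shown below, following the original MLP-Mixer design; the expansion factor and variable names are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One mixer block: token-mixing MLP (across the sequence), then channel-mixing MLP."""

    def __init__(self, seq_len: int, hidden_dim: int, expansion: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mlp = nn.Sequential(  # operates along the sequence dimension
            nn.Linear(seq_len, expansion * seq_len), nn.GELU(),
            nn.Linear(expansion * seq_len, seq_len),
        )
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mlp = nn.Sequential(  # operates along the feature dimension
            nn.Linear(hidden_dim, expansion * hidden_dim), nn.GELU(),
            nn.Linear(expansion * hidden_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim)
        # Token mixing: transpose so the MLP mixes information between tokens, then transpose back.
        y = self.norm1(x).transpose(1, 2)           # (batch, hidden_dim, seq_len)
        x = x + self.token_mlp(y).transpose(1, 2)   # skip connection
        # Channel mixing with a second skip connection.
        x = x + self.channel_mlp(self.norm2(x))
        return x
```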

In this method, the matrix C ∈ R^((2w+1)m×s) produced by the projection layer is passed through a bottleneck layer, i.e. a linear layer, which outputs a matrix B ∈ R^(b×s), where b is the bottleneck size and s is the maximum sequence length. B is the input to the MLP-Mixer model, which in turn produces an output representation O ∈ R^(b×s) of the same dimensions. A classification head is applied on top of the output O to generate the actual predictions. For semantic parsing, this head is a linear layer applied to each token; for classification tasks, attention pooling is used.
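
For the classification case, attention pooling over the mixer output can be sketched as below (using the batch-first layout common in PyTorch); the per-token scoring layer is an assumed form rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AttentionPoolingHead(nn.Module):
    """Pools the mixer output O into a single vector, then classifies (assumed form)."""

    def __init__(self, bottleneck_dim: int, num_classes: int):
        super().__init__()
        self.score = nn.Linear(bottleneck_dim, 1)        # per-token attention score (assumption)
        self.classifier = nn.Linear(bottleneck_dim, num_classes)

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        # o: (batch, seq_len, bottleneck_dim), i.e. O with a batch-first layout
        weights = torch.softmax(self.score(o), dim=1)    # (batch, seq_len, 1)
        pooled = (weights * o).sum(dim=1)                # (batch, bottleneck_dim)
        return self.classifier(pooled)                   # (batch, num_classes)
```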

Experiments

Before evaluating the final performance of the model, the study thoroughly analyzes the proposed architecture. The experiments in this section are conducted on the validation set of the English MTOP, and the reported metric is the exact match accuracy of the best epoch. The base model is a 2-layer pNLP-Mixer with bottleneck and hidden sizes of 256, an input sequence length of 64, a fixed token feature size of 1024, and a window size of 1, trained for 80 epochs with a learning rate of 5e-4 and a batch size of 256.
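
For reference, the base configuration described above can be collected into a single dictionary; the key names are illustrative, while the values are the ones stated in the text.

```python
# Base pNLP-Mixer configuration used in the ablations (key names are illustrative)
base_config = {
    "num_mixer_layers": 2,
    "bottleneck_size": 256,
    "hidden_size": 256,
    "max_seq_len": 64,
    "token_feature_size": 1024,
    "window_size": 1,
    "epochs": 80,
    "learning_rate": 5e-4,
    "batch_size": 256,
}
```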

Projection comparison

First, the study compared the effects of different feature extraction strategies on performance, including:

BERT embedding

Binary

TSP

MinHash

SimHash

Table 1 below shows the projection scores obtained with the base model. The results show that BERT embeddings perform very poorly, which is unsurprising: one of BERT's main advantages is that it produces contextual embeddings, i.e. embeddings that incorporate information from the surrounding context, whereas here each token is embedded independently. The hash-based projections all score within the same range. Still, the gap between the best-performing projection, MinHash, at 80.8% exact match accuracy, and the worst, TSP, at 77.6%, is considerable. A difference of more than 3 points highlights the importance of carefully designing the projection layer and motivates further study of projection algorithms. Given these results, the remaining experiments use MinHash as the only projection layer.


Model comparison

The results so far show that the MinHash projection provides a powerful representation of the language. The next question is whether MLP-Mixer is the best architecture for processing this representation. To investigate this, the study first considers a baseline in which the MLP-Mixer is removed and the output of the bottleneck layer is passed directly to the classification head, using two different projection layers: one with a window size of 1 and the other with a window size of 4. The study then compares MLP-Mixer with two other architectures, keeping the same projection, bottleneck layer, and classification head and replacing only the MLP-Mixer with an LSTM or a transformer encoder with a similar number of parameters.

Table 2 shows that simply removing the MLP-Mixer and relying solely on the projection leads to a significant drop in performance. In particular, using a projection with a window size of 1 reduces the parameter count to 820K, but at the cost of a performance drop of more than 15 points. The larger projection layer, on the other hand, doubles the number of parameters while reaching only 76.5% exact match accuracy, 4.3% lower than the MLP-Mixer. Among the alternative models, the LSTM performs clearly worse than the MLP-Mixer: with 1.8M parameters, 50% more, it reaches only 73.9% exact match accuracy. The transformer model has about the same number of parameters as the MLP-Mixer (1.2M) and scores 1.4% lower. This last result is remarkable: for the same number of parameters, the MLP-Mixer outperforms the transformer while its complexity is linear rather than quadratic in input length. Overall, the evaluation shows that MLP-Mixer is a weight-efficient architecture for processing the projection output, i.e. it outperforms the alternatives with the same or fewer parameters.


Architecture research

The study conducts an extensive architectural exploration of the pNLP-Mixer model to determine the effect of different hyperparameters on downstream performance, covering both projection and MLP-Mixer hyperparameters. For the projection, the study varies the token feature size, the number of hashes, and the window size; for the MLP-Mixer, the bottleneck size and the number of layers. The learning rate is 5e-4, with a batch size of 256 and a hidden size of 256. Table 3 reports the exact match accuracy and parameter count for each configuration.


For the MLP-Mixer, increasing the bottleneck size to 512 slightly improves performance, while using 4 layers reaches a value similar to 2 layers. However, these hyperparameters are not independent of the projection layer: larger projections may require a larger MLP-Mixer to process all the information. Table 4 therefore examines the relationship between the projection size and the MLP-Mixer size.

The experiment reports results for two larger and two smaller models, where the larger models use larger feature and bottleneck sizes; the 4-layer variant achieves the best performance of all the models in the study. On the other hand, one of the small models reaches 76.9% exact match with just 200K parameters.


Table 5 shows that the large language models XLM-R and mBERT obtain the highest scores. Notably, among the smaller alternatives, pNLP-Mixer X-LARGE has only 4.4M parameters versus mBERT's 170M, yet its average exact match accuracy is only 2 and 3 points below mBERT and XLM-R, respectively. The LARGE model is similar in size to pQRNN, with an exact match accuracy nearly 3% higher than pQRNN and 0.8% higher than pQRNN after distillation.


Table 6 reports the evaluation results on the multiATIS dataset. Here, pQRNN achieves the highest intent accuracy, even 1.8% higher than mBERT. Within the pNLP-Mixer family, larger sizes do not translate into better performance: the vocabulary of ATIS queries is relatively uniform and simple, so a more expressive model is not necessarily better. In fact, the BASE model achieves the highest pNLP-Mixer score of 92.1%, only 0.5% below mBERT, with just 1.2M parameters, about 60% of pQRNN's parameter count. The smaller pNLP-Mixer models, SMALL and X-SMALL, achieve competitive performance of 91.8% and 90.0%, respectively, with very few parameters.


Long sequence experiments

Table 7 shows that on IMDB, RoBERTa and Longformer perform significantly better than pNLP-Mixer, with Longformer reaching 95.7% accuracy versus only 82.9% for the best pNLP-Mixer. On the Hyperpartisan task, however, while Longformer remains the best model, the pNLP-Mixers outperform RoBERTa, with the BASE model reaching 90.6 F1, 3.2 points higher.


The tiny pNLP-Mixer models, with roughly 1/120 and 1/100 of the parameters of Longformer and RoBERTa respectively, achieve competitive results on the Hyperpartisan task (even better than RoBERTa) without any pretraining or hyperparameter tuning. However, pNLP-Mixer performs worse on IMDB. Taken together, these results raise the question of whether large, pretrained pNLP-Mixers could serve as a lightweight alternative to large transformer models.
