
The 1,000-layer Transformer is here! Multiple new SOTA results on multilingual machine translation benchmarks


Reporting by XinZhiyuan

Editor: David Layan

Recently, researchers at Microsoft Research built a 1,000-layer Transformer that sets multiple new SOTA results on multilingual machine translation tasks.

In recent years, the pursuit of large-scale Transformer models has become a trend.

Parameter counts have grown enormously, from millions to billions and now to trillions. Large-scale models perform better on a wide range of tasks and show excellent few-shot and zero-shot learning capabilities.

Despite the growing number of parameters, the depth of these models has been limited by the instability of Transformer training. In 2019, Nguyen and Salazar found that pre-norm (Pre-LN) residual connections improve the stability of Transformers compared with post-norm (Post-LN) connections.

However, the gradients of Pre-LN at the bottom layers tend to be larger than those at the top layers, which leads to a slight drop in performance compared with Post-LN.
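
For reference, here is a minimal sketch of the two residual-connection styles, where `sublayer` stands for a Transformer sub-layer (attention or feed-forward) and `norm` for LayerNorm; this is an illustrative comparison, not code from the paper:

```python
# Post-LN (original Transformer): normalize after adding the residual.
def post_ln_block(x, sublayer, norm):
    return norm(x + sublayer(x))

# Pre-LN: normalize the sub-layer input; the residual path stays unnormalized,
# which stabilizes training but can slightly hurt final performance.
def pre_ln_block(x, sublayer, norm):
    return x + sublayer(norm(x))
```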

To solve this problem, researchers have tried to improve the optimization of deep Transformers through better initialization or better architectures. These methods can keep Transformers with up to a few hundred layers stable.

But none of these methods has scaled the Transformer to 1,000 layers.


Paper link: https://arxiv.org/abs/2203.00555

Recently, a paper from Microsoft Research pushed the depth of the Transformer up by an order of magnitude, to 1,000 layers.

The researchers' goal is to keep improving the stability of Transformer training so that the depth of the model can keep growing. They investigated the cause of the optimization instability and found that the explosive growth of the model update is what causes it.

Based on this finding, the researchers introduced a new normalization function, DEEPNORM, at the residual connection. In theory, this function bounds the model update by a constant.

The method looks simple, but it works well, and it only takes a few lines of code to implement.

With the new function, the stability of Transformers is greatly improved, and the researchers were able to scale the depth of the model up to 1,000 layers.

In addition, DEEPNORM combines the good performance of Post-LN with the stable training of Pre-LN. This makes the new method a preferred alternative for Transformers, for both deep and large-scale models.

It is worth mentioning that, on a large-scale multilingual machine translation benchmark, a 200-layer model with 3.2B parameters outperforms the state-of-the-art 48-layer model with 12B parameters by 5 BLEU.

Applying the new approach to a Post-LN Transformer is not difficult. Compared with Post-LN, DEEPNORM up-scales the residual connection before performing layer normalization.

In addition, the researchers down-scale the parameters during initialization. In particular, they scale down the feed-forward network, as well as the value projection and output projection of the attention layers.

The scales used for the residual connections and the initialization depend only on the architecture.
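
A minimal PyTorch-style sketch of these two changes (the constants `alpha` and `beta` and parameter names such as `ffn`, `v_proj`, and `out_proj` are illustrative; the paper derives the actual constants from the number of encoder and decoder layers):

```python
import torch.nn as nn

def deepnorm(x, sublayer, norm, alpha):
    # Up-scale the residual branch by alpha, then apply layer normalization.
    return norm(x * alpha + sublayer(x))

def deepnorm_init(model, beta):
    # Down-scale the feed-forward network and the value/output projections
    # of attention at initialization; query/key projections keep gain 1.
    for name, param in model.named_parameters():
        if param.dim() < 2:
            continue  # skip biases and LayerNorm weights
        if any(k in name for k in ("ffn", "v_proj", "out_proj")):
            nn.init.xavier_normal_(param, gain=beta)
        elif any(k in name for k in ("q_proj", "k_proj")):
            nn.init.xavier_normal_(param, gain=1.0)
```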


Ultra-deep Transformer: DEEPNET

The researchers introduce DEEPNET, an ultra-deep Transformer. By mitigating the exploding model update problem, DEEPNET makes the optimization process much more stable.

First, the researchers estimate the expected magnitude of DEEPNET's model update. They then give a theoretical analysis showing that, as long as DEEPNORM is used, DEEPNET's model update can be bounded by a constant.

DEEPNET is based on the Transformer architecture. Compared with the vanilla Transformer, it uses the newly proposed DEEPNORM, instead of Post-LN, at each sub-layer.

The expression for DEEPNORM can be written as:

xl+1 = LN(α · xl + Gl(xl, θl))

where α is a constant, Gl(xl, θl) is the function of the l-th Transformer sub-layer, and θl denotes its parameters. In addition, DEEPNET scales the weights inside the residual branches by β.

Both α and β are constants that depend only on the architecture.

In addition, attention is an important part of Transformer.

Without loss of generality, the researchers consider the single-head case. Q, K, and V denote the query, key, and value, respectively. WQ, WK, and WV are the projection matrices of the inputs, and WO is the output projection matrix. The attention can then be written as:

Attn(Q, K, V) = softmax((Q WQ)(K WK)^T / √d) (V WV) WO
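
To make the notation concrete, here is a minimal single-head attention in this form (a generic sketch using standard scaled dot-product attention, not the paper's implementation; tensor shapes are assumptions):

```python
import math
import torch

def single_head_attention(Q, K, V, WQ, WK, WV, WO):
    # Q, K, V: (seq_len, d_model); WQ, WK, WV, WO: (d_model, d_model)
    q = Q @ WQ
    k = K @ WK
    v = V @ WV
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # scaled dot product
    weights = torch.softmax(scores, dim=-1)
    return (weights @ v) @ WO
```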

The diagram below shows the model updates of vanilla Post-LN and DEEPNET during the early stage of training. The researchers visualized 64-128-2 tiny Transformers with depths ranging from 6L-6L to 100L-100L.

From this graph we can see that DEEPNET has a more stable update than Post-LN.

[Figure: model update magnitudes of vanilla Post-LN and DEEPNET in the early stage of training]
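
As a rough illustration of the quantity such a plot tracks, one could measure how much an optimization step changes the model's output on a fixed batch (a simplified sketch under assumed definitions, not the paper's exact measurement):

```python
import torch

def model_update_magnitude(model_before, model_after, x):
    # ||F(x, θ_after) - F(x, θ_before)||: change of the model output on the
    # same input x caused by one (or a few) optimization steps.
    with torch.no_grad():
        return (model_after(x) - model_before(x)).norm().item()
```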

Performance: 1000-layer network, significantly improving NMT performance

Neural machine translation

We validated the effectiveness of DEEPNET on popular machine translation benchmarks, including the IWSLT-14 German-English (De-En) dataset and the WMT-17 English-German (En-De) dataset.

We compared DEEPNET to several of the most advanced deep transformer models, including DLCL, NormFormer, ReZero, R-Fixup, T-Fixup, DS-init, and more.

We reproduced the baselines with the open-source code released by their authors and used the same hyperparameters. BLEU is the evaluation metric for all experiments, and the results are as follows:

[Table: BLEU scores of the baselines and DEEPNET on WMT-17 English-German translation]

The table above shows the results of the baselines and DEEPNET on the WMT-17 English-German translation dataset.

Compared with the Post-LN models, DEEPNET is more stable and can be successfully scaled up to 100L-100L, reaching 28.9 BLEU on the test set. In contrast, the baseline with Post-LN becomes unstable once the depth reaches 50L-50L. In addition, when the model is shallow, DEEPNET achieves performance comparable to these baselines.
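
As an aside, BLEU scores like those above are commonly computed with a toolkit such as sacreBLEU; here is a generic usage example (not necessarily the exact evaluation setup used in the paper):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]      # system translations, one per segment
references = [["the cat sat on the mat"]]    # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```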

Fast convergence at different depths

[Figure: BLEU convergence curves of DEEPNET and the baselines at different depths on IWSLT-14]

We vary the depth of the model from 10L-10L to 100L-100L, with an interval of 10 layers. With the exception of ReZero, all experiments are run with mixed-precision training. The figure above shows the results on the IWSLT-14 dataset.

We trained the models for 8,000 steps, because we found that most of the convergence happens at the beginning of optimization. Overall, DEEPNET is stable from shallow to deep and converges quickly, reaching more than 30 BLEU in only 8,000 steps, while most baselines do not reach this level. In addition, its performance keeps improving as the depth of the model increases.

[Figure: loss curves on the WMT-17 En-De validation set with larger learning rate, batch size, and hidden dimension]

Higher learning rates, batch sizes, and hidden dimensions

We further train DEEPNET with a larger learning rate, batch size, and hidden dimension, respectively.

Only one hyperparameter is changed in each experiment, and the other hyperparameters are fixed. The figure above shows the loss curve on the WMT-17 validation set.

The results show that DEEPNET can be trained without difficulty in these settings. With a hidden size of 1,024, the loss of DEEPNET increases after 10K steps because of overfitting. Apart from that, DEEPNET benefits from the larger settings, achieving faster convergence and lower validation loss.

Large-scale multilingual neural machine translation

We experimented with large-scale multilingual machine translation, which is a great testbed for big models.

The model is first evaluated on the OPUS-100 corpus. OPUS-100 is an English-centric multilingual corpus covering 100 languages, randomly sampled from the OPUS collection. We scaled DEEPNET up to 1,000 layers. The model has a 500-layer encoder, a 500-layer decoder, a hidden size of 512, 8 attention heads, and a 2,048-dimensional feed-forward layer.
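
For concreteness, a hypothetical configuration sketch of this 1,000-layer model (field names are illustrative and not taken from the paper's training code):

```python
from dataclasses import dataclass

@dataclass
class DeepNetConfig:
    encoder_layers: int = 500
    decoder_layers: int = 500
    hidden_size: int = 512
    attention_heads: int = 8
    ffn_size: int = 2048

config = DeepNetConfig()
print(config.encoder_layers + config.decoder_layers)  # 1000 layers in total
```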

[Figure: average BLEU on OPUS-100 for models of different depths]

As shown in the figure above, increasing the depth of the network significantly improves translation quality: the 48-layer baseline achieves an average gain of 3.2 BLEU over the 12-layer model.

DEEPNET successfully scales the depth up to 1,000 layers, outperforming the 48-layer baseline by 4.4 BLEU. Moreover, DEEPNET was trained for only 4 epochs, and its performance can be further improved given more compute budget.

More data, more language

To explore the limits of DEEPNET on multilingual neural machine translation, we also scaled up the training data with CCMatrix. We additionally expanded the data from CCAligned, OPUS, and Tatoeba to cover all languages of the Flores-101 evaluation set. The final dataset covers 102 languages, 1,932 directions, and 12 billion sentence pairs.

Using this data, we trained DEEPNET with a 100-layer encoder, a 100-layer decoder, a hidden dimension of 1,024, 16 attention heads, and a feed-forward layer with an intermediate dimension of 4,096. The results are as follows:

[Table: results of DEEPNET trained on the expanded 102-language data]

In summary, DEEPNET improves the stability of Transformers and successfully scales them to 1,000 layers. The theoretical analysis shows that the model update is bounded by a constant, which makes the optimization stable. Experimental results verify the effectiveness of the method on various benchmarks.

Current experiments focus on machine translation as a testbed. In the future, we will extend DEEPNET to more tasks, such as language model pre-training, protein structure prediction, and BEiT vision pre-training.

Reference Links:

https://arxiv.org/abs/2203.00555
