10,000 words long! DeepMind scientist summarizes 15 high-impact research highlights of 2021

Reporting by XinZhiyuan

Editor: LRS

ML and NLP continued to develop rapidly in 2021. DeepMind scientist Sebastian Ruder recently summarized fifteen research highlights of the past year; read on to see which direction might be your next research topic!

In 2021, with more powerful compute, more data, and larger models, machine learning and natural language processing technology continued to evolve rapidly.

Recently, DeepMind scientist Sebastian Ruder summarized 15 high-impact, illuminating research areas of the past year, including:

Universal Models

Massive Multi-task Learning

Beyond the Transformer

Prompting

Efficient Methods

Benchmarking

Conditional Image Generation

ML for Science

Program Synthesis

Bias

Retrieval Augmentation

Token-free Models

Temporal Adaptation

The Importance of Data

Meta-learning

Sebastian Ruder is a research scientist at DeepMind in London. He received his Ph.D. in natural language processing and deep learning from the Insight Research Centre for Data Analytics while working as a research scientist at AYLIEN, a Dublin-based text analytics startup.

1 Universal models

General-purpose AI has always been the goal of AI practitioners: the more versatile a model's capabilities, the more powerful it is.

In 2021, pre-trained models grew ever larger and more versatile, and were then fine-tuned to adapt to a wide range of application scenarios. Pre-training followed by fine-tuning has become the new paradigm of machine learning research.
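
As a concrete illustration of this paradigm, the sketch below fine-tunes a generic pre-trained encoder on a toy classification task, assuming the Hugging Face transformers library is available; the model name, data, and hyperparameters are placeholder assumptions, not details from the article.

```python
# Minimal sketch of the pre-train-then-fine-tune paradigm (illustrative only):
# a generic pre-trained encoder is adapted to a downstream classification task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a great movie", "a boring movie"]   # toy downstream data
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                            # a few fine-tuning steps
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```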

In computer vision, supervised pre-trained models such as Vision Transformer have continued to scale up, while self-supervised pre-trained models have begun to match their performance, provided the amount of data is large enough.

In speech, models based on wav2vec 2.0, such as W2v-BERT, as well as more powerful multilingual models such as XLS-R, have also shown impressive results.

At the same time, new large unified pre-trained models emerged for previously under-studied modality pairs, such as video and language, and speech and language.

In vision and language, controlled studies have shed light on the important components of multimodal models by framing different tasks within a language-model setting. Such models have also demonstrated their effectiveness in other areas, such as reinforcement learning and protein structure prediction.

Given the scaling behaviour observed in a large number of models, it has become common practice to report performance at different parameter sizes. However, improvements in pre-training performance do not necessarily translate fully into performance gains on downstream tasks.

In summary, pre-trained models have been shown to generalize well to new tasks within a given domain or modality. They exhibit strong few-shot learning and robust learning capabilities. Progress in this direction is therefore very valuable and can enable new real-world applications.

Looking ahead, we can expect to see even more, and even larger, pre-trained models being developed. At the same time, we should expect individual models to perform more tasks at once. This is already the case in language, where models can perform many tasks by framing them in a common text-to-text format. Similarly, we are likely to see image and speech models that can perform many common tasks within a single model.

2 Massive multi-task learning

Most pre-trained models are self-supervised. They generally learn from large amounts of unlabeled data through an objective that does not require explicit supervision. However, many domains already have large amounts of labeled data that can be used to learn better representations.

So far, multi-task models such as T0, FLAN, and ExT5 have been pre-trained on around 100 tasks, mostly language tasks. Such large-scale multi-task learning is closely related to meta-learning: given exposure to a diverse distribution of tasks, a model can learn different types of behaviour, such as how to learn in context.

ExT5 enables multi-task learning at scale. During pre-training, ExT5 is trained on inputs from a set of different tasks in text-to-text format to produce the corresponding outputs. These tasks include masked language modeling, summarization, semantic parsing, closed-book question answering, style transfer, dialogue modeling, natural language inference, Winograd-schema-style coreference resolution, and more.

Some recent models, such as T5 and GPT-3, use a text-to-text format, which has become the basis for training in large-scale multi-task learning. As a result, models no longer need manually designed task-specific loss functions or task-specific layers to learn effectively across tasks. These latest approaches highlight the benefits of combining self-supervised pre-training with supervised multi-task learning and demonstrate that the combination yields more general-purpose models.
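
To make the text-to-text idea concrete, here is a small illustrative sketch (not the actual ExT5/T0/FLAN pipeline) of how heterogeneous tasks can be cast into a shared (input text, target text) format so that a single model can be trained on all of them; the task names and fields are placeholders.

```python
# Illustrative sketch: casting heterogeneous tasks into one text-to-text format.
def to_text_to_text(task: str, example: dict) -> tuple:
    if task == "summarization":
        return f"summarize: {example['document']}", example["summary"]
    if task == "nli":
        return (f"nli premise: {example['premise']} hypothesis: {example['hypothesis']}",
                example["label"])          # e.g. "entailment"
    if task == "closed_book_qa":
        return f"answer: {example['question']}", example["answer"]
    raise ValueError(f"unknown task: {task}")

# Mixed-task training stream: every example is just an (input text, target text) pair.
examples = [
    ("nli", {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": "entailment"}),
    ("closed_book_qa", {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}),
]
batch = [to_text_to_text(task, ex) for task, ex in examples]
print(batch)
```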

3 Beyond the Transformer

Most of the pre-trained models mentioned above are based on the Transformer architecture. In 2021, researchers have also been looking for alternatives to the Transformer.

The Perceiver is a model architecture similar to the Transformer that uses a latent array of fixed dimensionality as its base representation and conditions it on the input via cross-attention, allowing it to scale to very high-dimensional inputs. Perceiver IO further extends the architecture to handle structured output spaces.
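
The sketch below illustrates the core Perceiver idea in simplified form (it is not the published implementation): a small, fixed-size learned latent array cross-attends to an arbitrarily large input, so compute scales with the latent size rather than the input size; all sizes are placeholder assumptions.

```python
# Minimal sketch of the Perceiver idea: latents cross-attend to a large input.
import torch
import torch.nn as nn

class TinyPerceiverBlock(nn.Module):
    def __init__(self, latent_dim=128, num_latents=64, input_dim=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))  # fixed-size latent array
        self.input_proj = nn.Linear(input_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.self_attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)

    def forward(self, x):                      # x: (batch, n_inputs, input_dim); n_inputs can be huge
        b = x.size(0)
        lat = self.latents.unsqueeze(0).expand(b, -1, -1)
        kv = self.input_proj(x)
        lat, _ = self.cross_attn(lat, kv, kv)  # latents attend to the input
        lat, _ = self.self_attn(lat, lat, lat) # cheap processing in latent space
        return lat

out = TinyPerceiverBlock()(torch.randn(2, 10_000, 32))   # 10k input elements -> 64 latents
print(out.shape)                                          # torch.Size([2, 64, 128])
```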

Other models attempt to improve on the self-attention layer of the Transformer; a notably successful example is the use of multilayer perceptrons (MLPs), as in the MLP-Mixer and gMLP models. In addition, FNet uses 1D Fourier transforms instead of self-attention to mix information at the token level.
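
Here is a simplified sketch of FNet-style token mixing (illustrative, not the exact published model): self-attention is replaced with a parameter-free Fourier transform along the token dimension, keeping the real part, so information is mixed across positions cheaply.

```python
# Simplified FNet-style mixing: FFT over the sequence dimension, keep the real part.
import torch

def fourier_token_mixing(x: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, hidden); the FFT over dim=1 mixes information across tokens
    return torch.fft.fft(x, dim=1).real

mixed = fourier_token_mixing(torch.randn(2, 16, 64))
print(mixed.shape)   # torch.Size([2, 16, 64])
```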

In general, it is valuable to decouple model architecture from pre-training strategy: if CNNs are pre-trained in the same way as Transformer models, they achieve competitive performance on many NLP tasks.

Similarly, using alternative pre-training objectives, such as ELECTRA-style pre-training, may also yield performance gains.

4 Prompting

Inspired by GPT-3, prompting has become a viable new paradigm for NLP models.

Prompts typically include a pattern that asks the model to make a certain prediction and a verbalizer that converts the prediction into a class label. Methods such as PET, iPET, and AdaPET use prompts for few-shot learning.
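
The following toy sketch illustrates the pattern-verbalizer idea in the spirit of PET (it is not the authors' code): a pattern turns the input into a cloze question, and a verbalizer maps the words a masked language model might fill in back onto class labels; the scores passed in are assumed to come from such a model.

```python
# Illustrative pattern-verbalizer sketch for prompt-based classification.
def pattern(review: str) -> str:
    return f"{review} All in all, it was a [MASK] movie."

verbalizer = {"great": "positive", "terrible": "negative"}

def classify(review: str, mask_word_scores: dict) -> str:
    # mask_word_scores: scores a masked LM assigns to candidate fill-in words (assumed given)
    best_word = max(verbalizer, key=lambda w: mask_word_scores.get(w, float("-inf")))
    return verbalizer[best_word]

print(pattern("I loved every minute."))
print(classify("I loved every minute.", {"great": 7.2, "terrible": 1.3}))  # -> "positive"
```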

However, prompts are not a panacea: model performance can vary greatly depending on the prompt, and finding the best prompt still requires labeled data.

To reliably compare models in the few-shot setting, researchers have developed new evaluation procedures. A large number of prompts are available in the Public Pool of Prompts (P3), making it possible to explore the best ways to use prompts and providing an excellent overview of this general area of study.

So far, researchers have only scratched the surface of using prompts to improve model learning. Future prompts will become more elaborate, for example including longer instructions, positive and negative examples, and general heuristics. Prompts may also be a more natural way to incorporate natural-language explanations into model training.

5 Efficient methods

Pre-trained models are usually very large and often inefficient in practice.

In 2021, more efficient architectures and more efficient fine-tuning methods emerged. On the model side, several new, more efficient versions of self-attention were proposed.

Current pre-trained models are so powerful that they can be tuned effectively by updating only a few parameters, and more efficient fine-tuning methods based on continuous prompts and adapters have evolved rapidly. This capability also enables adaptation to new modalities by learning an appropriate prefix or a suitable transformation.
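
A minimal adapter sketch follows (a generic illustration, not any specific published implementation): a small bottleneck module is inserted next to a frozen pre-trained layer, so only a tiny number of parameters are updated during fine-tuning; the layer and sizes stand in for real model components.

```python
# Minimal adapter sketch: train a small bottleneck while the backbone stays frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

frozen_layer = nn.Linear(768, 768)                      # stands in for a pre-trained sublayer
for p in frozen_layer.parameters():
    p.requires_grad = False                             # pre-trained weights stay fixed

adapter = Adapter()                                     # only these weights are trained
h = adapter(frozen_layer(torch.randn(4, 768)))
trainable = sum(p.numel() for p in adapter.parameters())
print(f"trainable adapter parameters: {trainable}")     # a tiny fraction of the full model
```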

In addition, there are other routes to improving efficiency, such as more efficient optimizers as well as quantization and sparsity methods.

Models are far less useful when they cannot be run on standard hardware or are too expensive to use. To ensure that ever-larger models can be deployed and benefit from these methods, their efficiency needs to keep improving.

As a next step, efficient models and training methods should become easier to obtain and use. At the same time, the community will develop more efficient ways to interface with large models and to adapt, combine, or modify them effectively without having to pre-train a new model from scratch.

6 Benchmarks

The rapidly improving capabilities of machine learning and natural language processing models have outpaced the ability of many benchmarks to measure them. At the same time, communities evaluate on fewer and fewer benchmarks, and these benchmarks originate from a small number of elite institutions: an analysis of dataset usage per institution shows that more than 50% of dataset usages can be attributed to just 12 institutions.

The concentration of dataset usage on specific institutions and datasets has increased, as measured by the Gini index.
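
For readers unfamiliar with the measure, here is a short sketch of how such a concentration statistic can be computed: the Gini coefficient of dataset-usage counts per institution. The counts below are made-up toy numbers, not figures from the study.

```python
# Gini coefficient of usage counts: 0 = perfectly even, values near 1 = highly concentrated.
def gini(counts):
    xs = sorted(counts)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    total = sum(xs)
    return (2 * cum) / (n * total) - (n + 1) / n

usage_per_institution = [120, 95, 60, 20, 10, 5, 2, 1]   # hypothetical usage counts
print(round(gini(usage_per_institution), 3))              # closer to 1 = more concentrated
```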

Consequently, 2021 saw much discussion of best practices and of how to reliably evaluate the future development of such models. Notable leaderboard paradigms that emerged in the NLP community in 2021 include dynamic adversarial evaluation, community-driven evaluation in which community members collaborate to create evaluation datasets such as BIG-bench, interactive fine-grained evaluation across different error types, and multidimensional evaluation that goes beyond a single performance metric. In addition, new benchmarks were proposed for impactful settings such as few-shot evaluation and cross-domain generalization.

New benchmarks also emerged for evaluating general-purpose pre-trained models in specific modalities and specific languages (such as Indonesian and Romanian), as well as in multimodal and multilingual settings. More attention should also be paid to evaluation metrics.

A meta-evaluation of machine translation shows that, of 769 machine translation papers over the past decade, 74.3 percent still use only BLEU, despite 108 alternative metrics having been proposed, many of which correlate better with human judgments. Recent leaderboards such as GEM and bidimensional leaderboards therefore recommend evaluating models and metrics jointly.
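
For context, computing BLEU is a one-liner, which partly explains its persistence; the sketch below uses the sacrebleu library (assuming it is installed), and the point of the meta-evaluation above is precisely that this single corpus-level number hides many error types.

```python
# Corpus-level BLEU with sacrebleu (illustrative toy data).
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]   # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)    # a single number; recent work recommends richer, multi-metric evaluation
```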

Benchmarking and evaluation are key to scientific progress in machine learning and natural language processing. Without accurate and reliable benchmarks, it is impossible to know whether we are making real progress or overfitting to entrenched datasets and metrics.

Given these problems with benchmarking, the next step should be to design new datasets more thoughtfully. The evaluation of new models should also rely less on a single performance metric and instead consider multiple dimensions, such as a model's fairness, efficiency, and robustness.

7 Conditional image generation

Conditional image generation, i.e. generating images based on text descriptions, has made significant progress in 2021.

Rather than generating an image directly from a text input as the DALL-E model does, recent approaches steer the output of a powerful generative model such as VQ-GAN using a joint image-text embedding model such as CLIP.

Likelihood-based diffusion models, which gradually remove noise from a signal, have become powerful new generative models that can outperform GANs. By guiding their outputs with text inputs, the generated images gradually approach photorealistic quality. Such models are also particularly well suited to image inpainting, where regions of an image can be modified according to a description.
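
The sketch below illustrates the core diffusion idea (it is a toy illustration, not a full DDPM implementation): the forward process gradually adds Gaussian noise to an image, and a denoising network is trained to predict that noise so it can be removed step by step at generation time; the schedule and shapes are common toy choices, not values from any cited model.

```python
# Forward (noising) process of a diffusion model and the quantity the network learns to predict.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # toy noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) for a batch of images x0 at timesteps t."""
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise, noise

x0 = torch.rand(4, 3, 32, 32)                         # toy "images"
t = torch.randint(0, T, (4,))
x_t, noise = add_noise(x0, t)
# Training would minimize ||noise - model(x_t, t)||^2; sampling runs the reverse chain,
# optionally guided by a text embedding to steer the image toward the description.
```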

Recent diffusion-based models are much slower to sample from than GAN-based models and need to become more efficient to be useful in real-world applications. More research on human-computer interaction is also needed in this area to determine how such models can best support human creativity and where they are most useful.

8 Machine learning for science

In 2021, machine learning technology has made some breakthroughs in advancing the natural sciences.

In meteorology, advances in precipitation nowcasting and forecasting have led to significant improvements in forecast accuracy; in both cases, the models outperform state-of-the-art physics-based forecasting models.

In biology, AlphaFold 2.0 predicts the structure of proteins with unprecedented accuracy, even in cases where no similar structure is known.

In mathematics, machine learning has been shown to be able to guide mathematicians' intuition in order to discover new connections and algorithms.

Transformer models have also been shown to be able to learn mathematical properties of differential systems, such as local stability, when given enough training data.

Using models in the loop to help researchers make new discoveries is a particularly compelling direction. It requires both the development of powerful models and work on interactive machine learning and human-computer interaction.

9 Program synthesis

One of the most compelling applications of large language models this year is code generation, with Codex, as part of GitHub Copilot, seeing its first integration into a major product.

However, generating complex, long programs remains a challenge for current models. An interesting related direction is learning to execute or model programs, which can be improved by performing multi-step computation in which intermediate steps are recorded in a "scratchpad".
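
The following is an illustrative sketch of what a scratchpad-style prompt can look like (a sketch of the idea, not the exact format used in the cited work): the model is asked to write out intermediate computation steps before giving the final answer, instead of predicting the result in one shot.

```python
# A toy scratchpad-formatted example for "learning to execute" a short program.
prompt = """Input: execute the following code and give the final value of x
x = 3
x = x * 4
x = x + 5

Scratchpad:
x = 3
x = 3 * 4 = 12
x = 12 + 5 = 17

Output: 17"""
print(prompt)
```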

In practice, it remains an open question to what extent code generation models improve the workflow of software engineers. To be genuinely helpful, these models (much like dialogue models) need to be able to update their predictions based on new information and to take both local and global code context into account.

10 Bias

Given the potential impact of large pre-trained models, it is critical that they do not contain harmful biases, are not misused to generate harmful content, and are used sustainably.

Researchers have investigated bias with respect to protected attributes such as gender, particular ethnic groups, and political leaning, highlighting the potential risks of such models.

However, simply removing bias, for example from toxicity models, may come at the cost of reduced coverage of texts about and by marginalized groups.

So far, bias has mostly been explored in English, in pre-trained models, and in specific text generation or classification applications. Given the intended use and life cycle of these models, we should also work to identify and mitigate biases in multilingual settings and across different combinations of modalities, as well as at different stages of a pre-trained model's use: after pre-training, after fine-tuning, and at test time.

11 Retrieval augmentation

Retrieval-augmented language models integrate retrieval into pre-training and downstream use.

In 2021, retrieval corpora were scaled up to a trillion tokens, and models gained the ability to query the web to answer questions. Researchers also found new ways to integrate retrieval into pre-trained language models.

Retrieval augmentation makes models more parameter-efficient, because less knowledge needs to be stored in the parameters and can instead be retrieved. It also enables efficient domain adaptation simply by updating the data used for retrieval.
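
A minimal retrieval-augmentation sketch follows (a generic illustration, not any specific published system): embed a small corpus, retrieve the passages nearest to the query, and prepend them to the prompt that the language model conditions on. The bag-of-words embedding and corpus are toy assumptions standing in for a real dense retriever and trillion-token index.

```python
# Toy retrieve-then-read pipeline: nearest passages are prepended to the model's prompt.
from collections import Counter
import math

corpus = {
    "doc1": "Marie Curie won two Nobel Prizes in physics and chemistry.",
    "doc2": "The Amazon is the largest rainforest on Earth.",
}

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(corpus.items(), key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return [text for _, text in ranked[:k]]

query = "How many Nobel Prizes did Marie Curie win?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)   # this prompt would then be fed to the language model
```

Updating the corpus dictionary is all that domain adaptation requires in this setup, which is exactly the property described above.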

In the future, we may see different forms of retrieval that exploit different kinds of information, such as common-sense knowledge, factual relations, and linguistic information. Retrieval augmentation could also be combined with more structured forms of knowledge retrieval, such as methods from knowledge-base population and open information extraction.

12 Token-free models

Since the advent of pre-trained language models such as BERT, text tokenized into subwords has been the standard input format in NLP.

However, subword tokenization has been shown to perform poorly on noisy input, such as typos or spelling variations, which are common on social media, and on certain types of morphology.

In 2021, new token-free methods emerged that consume character sequences directly. These models have been shown to outperform multilingual models and to perform particularly well on non-standard language.
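
The contrast is easy to see in code: token-free models such as byte- or character-level models skip the tokenizer entirely and consume the raw byte sequence, which is robust to typos and spelling variation. In the sketch below, the subword split shown is hypothetical, for illustration only.

```python
# Subword input (hypothetical split) vs. token-free byte-level input.
text = "definitly a grreat moviee"          # noisy, social-media-style spelling

hypothetical_subwords = ["def", "##init", "##ly", "a", "gr", "##reat", "movie", "##e"]

byte_ids = list(text.encode("utf-8"))       # token-free: just the UTF-8 bytes
print(byte_ids[:10])                        # [100, 101, 102, 105, 110, 105, 116, 108, 121, 32]
```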

Token-free models may therefore be a promising alternative to subword-based Transformers.

Because token-free models are more flexible, they can model morphology better and may generalize better to new words and language change. However, it is still unclear how they compare to subword-based approaches on different types of morphological or word-formation processes, and what trade-offs these models make.

13 Temporal adaptation

Models are biased in many ways based on the data they are trained on.

In 2021, these biases received increasing attention. One of them is a bias toward the time frame of the data on which the model was trained: given that language continually evolves and new terms constantly enter the discourse, models trained on outdated data have been shown to generalize comparatively poorly.

However, when temporal adaptation is useful may depend on the downstream task. For example, if event-driven changes in language use are irrelevant to task performance, temporal adaptation may not help much.

In the future, developing methods that can adapt to new time frames will require moving beyond the static pre-train-then-fine-tune setting and will need efficient ways to update the knowledge of pre-trained models; both efficient methods and retrieval augmentation are useful in this regard.

14 The importance of data

Data has long been a key component of machine learning, but its role is often overshadowed by advances in models.

However, given the importance of data for scaling up models, attention is slowly shifting from model-centric to data-centric approaches. Key topics include how to build and maintain new datasets efficiently and how to ensure data quality.

Andrew Ng held a workshop at NeurIPS 2021 on this very topic: data-centric AI.

There is currently a lack of best practices and principled approaches for how to efficiently build datasets for different tasks while ensuring data quality. Little is still known about how data interacts with model learning and how data shapes a model's biases.

15 Meta-learning

Although meta-learning and transfer learning share the common goal of few-shot learning, they are studied by largely different communities. On a new benchmark, large-scale transfer learning methods outperform meta-learning-based approaches.

A promising direction is to scale up meta-learning methods, which, combined with more memory-efficient training methods, can improve the performance of meta-learning models on real-world benchmarks. Meta-learning methods can also be combined with efficient adaptation methods, such as FiLM layers [110], to adapt general-purpose models to new datasets more effectively.
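
To give a feel for the adaptation mechanism mentioned above, here is a minimal FiLM (feature-wise linear modulation) sketch: a small conditioning network produces per-feature scales (gamma) and shifts (beta) that modulate the features of a general-purpose backbone. The shapes and sizes are placeholder assumptions, not values from the cited work.

```python
# Minimal FiLM layer: condition-dependent feature-wise scale and shift.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim=32, num_features=64):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_features)

    def forward(self, features, condition):
        # features: (batch, num_features), condition: (batch, cond_dim)
        gamma, beta = self.to_gamma_beta(condition).chunk(2, dim=-1)
        return gamma * features + beta        # feature-wise scale and shift

film = FiLM()
features = torch.randn(8, 64)                 # e.g. activations of a general-purpose model
condition = torch.randn(8, 32)                # e.g. an embedding of the new dataset or task
adapted = film(features, condition)
print(adapted.shape)                          # torch.Size([8, 64])
```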

Resources:

https://ruder.io/ml-highlights-2021/
