
Impossible Triangle: What's Next for Pre-Trained Language Models?

The impossible-triangle dilemma of pre-trained language models.

Compiled by | Wang Yue

Edited by | Chen Caixian

In recent years, large-scale pre-trained language models (PLMs) have significantly improved the performance of a wide range of NLP tasks. Starting with BERT and GPT-2, the paradigm of self-supervised pre-training followed by supervised fine-tuning has been a huge success, setting new state-of-the-art results across natural language processing, including semantic similarity, machine reading comprehension, commonsense reasoning, and text summarization. Moreover, these PLMs are moderate in size (i.e., fewer than 1 billion parameters), so they can be fine-tuned and adapted widely and quickly.

However, in many real-world, and especially novel, NLP scenarios, the labeled data available for effective fine-tuning is very limited due to budget or time constraints. This has spurred the development of zero-shot and few-shot NLP models.

Starting with GPT-3, super-large PLMs (SL-PLMs) have shown superior performance on general NLP tasks when given only a task description and, optionally, a few manual examples. This ability had not been observed in moderate-sized PLMs. However, the unprecedented scale of these SL-PLMs has largely hindered their widespread application: it is hard even to assemble enough computing resources to load such a model, let alone deploy or fine-tune it efficiently. The authors therefore argue that there is currently no lightweight PLM that excels in both supervised learning and zero/few-shot learning scenarios on general NLP tasks. As a result, a great deal of extra work is needed when applying these PLMs in real-world settings.

This poses a dilemma for PLMs: moderate model size, strong zero/few-shot learning ability, and strong fine-tuning ability seem impossible to achieve at the same time. Recently, Chenguang Zhu and Michael Zeng, researchers in Microsoft's Cognitive Services Research Group, named this dilemma the "impossible triangle" in their new paper "Impossible Triangle: What's Next for Pre-trained Language Models?".

Chenguang Zhu graduated from Tsinghua University's Yao Class, earned a PhD from the Department of Computer Science at Stanford University, and joined Microsoft after graduation; he is now a senior researcher in natural language processing at Microsoft. AI Technology Review previously published a profile interview with Dr. Zhu; for more, see "Zhu Chenguang: An AI Researcher Who Never Stays Overnight".

1

Impossible triangle

Paper: "Impossible Triangle: What's Next for Pre-trained Language Models?"

The impossible triangle of PLMs comprises three properties needed to deploy a model in real-world scenarios, namely:

P1: moderate model size, i.e., fewer than 1 billion parameters

P2: SoTA few-shot learning ability

P3: SoTA fine-tuning ability

Triangle image source: https://commons.wikimedia.org/wiki/File:Penrose_triangle.svg

The figure depicts the impossible triangle that current PLMs run up against, with its three key properties: P1, moderate model size; P2, SoTA few-shot learning ability; and P3, SoTA supervised learning ability. These three properties correspond to three requirements in practical PLM applications: P1 enables efficient deployment with a reasonable amount of computing resources; P2 covers cases where labeled data is zero or very scarce; and P3 covers scenarios where labeled data is relatively abundant.

One reason the triangle is impossible at present is that strong zero/few-shot learning ability only emerges once a PLM reaches a very large scale, with sufficient model capacity. Although iPET was designed to give a moderate-sized PLM better few-shot learning performance than GPT-3, it has since been surpassed by later SL-PLMs such as PaLM. As model size increases, zero/few-shot learning performance improves discontinuously: for example, the 540B-parameter PaLM makes a huge leap in accuracy on many tasks compared with its 8B- and 62B-parameter counterparts. Developing a moderate-sized model with SoTA zero/few-shot learning performance while maintaining superb supervised learning ability therefore remains a huge challenge.

While no PLM yet achieves all three properties of the impossible triangle, many PLMs already have one or two of them:

Medium-sized PLMs (with the P1 + P3 properties). These language models are moderate in size, with fewer than 1 billion parameters, enabling efficient model tuning and deployment. They achieve SoTA performance on general NLP tasks, including the GLUE benchmark, text summarization, open-domain question answering, and commonsense reasoning. However, their zero/few-shot learning ability is usually relatively weak, so using them depends on having enough labeled data in the target domain. A minimal fine-tuning sketch for this setting follows.
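For concreteness, here is a minimal sketch of the standard fine-tuning workflow for a medium-sized PLM, assuming the Hugging Face transformers and datasets libraries; the checkpoint, task, and hyperparameters are illustrative choices, not specifics from the paper:

```python
# Minimal supervised fine-tuning sketch for a medium-sized PLM (P1 + P3).
# Assumes the Hugging Face "transformers" and "datasets" libraries; the
# checkpoint, task, and hyperparameters below are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # ~110M parameters, well under the 1B bound
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)

# SST-2 from the GLUE benchmark as an example of "enough labeled data".
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True,
                            padding="max_length"),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```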

Super-large PLMs (with the P2 property). These language models are extremely large (from tens to hundreds of billions of parameters) and have been pre-trained on very large-scale data. PaLM, with 540 billion parameters pre-trained on a corpus of 780 billion tokens, falls into this category. When prompted with only a task description and a few input-output examples, they achieve SoTA performance on typical zero/few-shot NLP tasks. Overall, however, (1) SL-PLMs' zero/few-shot learning performance still falls short of supervised, fine-tuned models, and (2) even after fine-tuning, many SL-PLMs still underperform the best fine-tuned medium-sized PLMs, possibly because the models are too large. The prompting sketch below illustrates this usage.
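To make the zero/few-shot setting concrete, here is a toy sketch of how such a prompt is assembled; the task, demonstrations, and format are invented for illustration, and the resulting string would be sent to an SL-PLM for completion with no gradient updates:

```python
# Toy illustration of few-shot in-context prompting (P2): the model receives
# only a task description and a few input-output demonstrations; its
# parameters are never updated. All examples here are invented.
task_description = "Classify the sentiment of each review as Positive or Negative."

demonstrations = [
    ("The plot was gripping from start to finish.", "Positive"),
    ("A tedious, overlong mess.", "Negative"),
]

query = "The acting felt wooden and the dialogue fell flat."

parts = [task_description, ""]
for text, label in demonstrations:
    parts += [f"Review: {text}", f"Sentiment: {label}", ""]
parts += [f"Review: {query}", "Sentiment:"]

prompt = "\n".join(parts)
print(prompt)  # send this string to an SL-PLM and read off the completion
```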

2

Improvement measures

Because of the impossible triangle, academia and industry have taken many measures to compensate for the missing properties when using PLMs in practice. To summarize:

Extremely large models (lacking P1). This arises when a super-large PLM needs to be deployed. To obtain a medium-sized model with performance close to an SL-PLM's, the common practice is knowledge distillation (KD): the larger model acts as the teacher and the smaller model as the student, which learns from the teacher's predictive distribution and/or parameters. KD is very effective at producing more efficient models while sacrificing only a little performance. However, two problems remain. First, the student rarely matches the teacher's performance. Second, the sheer scale of SL-PLMs makes even inference costly, so they are inconvenient to use as teacher models. A sketch of the standard distillation loss follows.
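As an illustration, here is a minimal sketch of the classic soft-target distillation loss in the spirit of Hinton et al.'s knowledge distillation (not code from the paper); the temperature and mixing weight are illustrative values:

```python
# Minimal sketch of a knowledge-distillation (KD) training loss: the student
# matches the teacher's softened output distribution via KL divergence, mixed
# with ordinary cross-entropy on the gold labels. Hyperparameters illustrative.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Standard supervised term on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```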

Poor zero/few-shot learning performance (lacking P2). This is most common with medium-sized PLMs, which achieve SoTA performance after fine-tuning but have relatively weak zero/few-shot learning ability. In many scenarios without enough labeled data, one still wants to deploy such a model. One way to bridge the gap is data augmentation: generating pseudo-labels and pseudo-data instances so the model can exploit this additional data for effective supervised training. However, the uneven quality of pseudo-data and the diversity of data formats across tasks make a universally applicable solution difficult. A pseudo-labeling sketch follows.
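Here is a minimal sketch of the pseudo-labeling flavor of data augmentation, assuming a transformers-style sequence classifier; the confidence threshold is an illustrative choice:

```python
# Minimal sketch of pseudo-labeling for data augmentation: a trained model
# labels unlabeled text, and only high-confidence predictions are kept as
# extra training data. Assumes a transformers-style classifier; the 0.9
# threshold is an illustrative choice.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, tokenizer, unlabeled_texts, threshold=0.9):
    model.eval()
    pseudo_data = []
    for text in unlabeled_texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        probs = F.softmax(model(**inputs).logits, dim=-1).squeeze(0)
        confidence, label = probs.max(dim=-1)
        if confidence.item() >= threshold:  # drop uncertain predictions
            pseudo_data.append((text, label.item()))
    # Merge pseudo_data with the gold-labeled set for further fine-tuning.
    return pseudo_data
```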

Poor supervised-training performance (lacking P3). This is common when using SL-PLMs, where limited computing resources make it difficult to fine-tune all the parameters of a very large model. A typical solution is prompt learning: one can use hard prompts, such as discrete text templates, or soft prompts, such as continuous prompt embeddings, so that only the hard prompt words or the soft prompt parameters are updated during fine-tuning. These methods have proven very effective at improving SL-PLM accuracy. However, their effect is very sensitive to the choice of prompt and training data, and the final result generally still trails a supervised, fine-tuned medium-sized PLM. A soft-prompt sketch follows.
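Here is a minimal sketch of the soft-prompt idea, in the spirit of prompt tuning: the backbone PLM stays frozen and only a short sequence of continuous prompt embeddings is trained. The prompt length and embedding size are illustrative values:

```python
# Minimal sketch of soft prompt tuning: the backbone PLM is frozen and only a
# small matrix of continuous prompt embeddings is updated during fine-tuning.
# prompt_length and embed_dim are illustrative values.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, prompt_length=20, embed_dim=768):
        super().__init__()
        # The only trainable parameters: prompt_length learned vectors.
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, input_embeds):  # (batch, seq_len, embed_dim)
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        # Prepend the learned prompt to every sequence in the batch.
        return torch.cat([prompt, input_embeds], dim=1)

# Training setup: freeze the backbone, optimize only the soft prompt.
# backbone.requires_grad_(False)
# optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
```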

The extra work described above slows down the training and deployment of PLM-based models, and it must be repeated for each new downstream task or product. A PLM that achieved all three corners of the impossible triangle would therefore greatly accelerate model training and practical use.

3

Looking to the future

Although the impossible triangle currently holds for NLP models, the researchers believe the problem can be solved through a three-phase approach.

Phase 1: Develop PLMs that already have some of the triangle's properties while improving the missing ones. For example, improve the few-shot learning ability of a moderate-sized model that has SoTA supervised learning performance, or compress an SL-PLM with SoTA few-shot learning ability into a smaller model while giving it better supervised learning performance.

Phase 2: Achieve a PLM with all three desired properties on one or a few NLP tasks. To get there, the particularities of the target task can be exploited; for example, on certain tasks model performance depends less on the amount of training data, or the gap between zero/few-shot and supervised learning performance is smaller.

Phase 3: Building on phases one and two, develop a PLM that achieves all three properties on general NLP tasks. Possible approaches include: i) pre-training a medium-sized model on larger data; ii) better knowledge distillation; and iii) more generalizable data augmentation methods. Once a PLM possesses all three properties of the impossible triangle on general NLP tasks, it will greatly change the entire landscape of NLP research and application, enabling rapid, efficient, and high-quality model development and deployment.
