
Introduction to FLAN: A More Generalizable Language Model with Instruction Fine-Tuning

Author: Book stacks on a rainy night


For a machine learning model to generate meaningful text, it must have a great deal of knowledge about the world as well as the ability to abstract. While pre-trained language models are increasingly able to acquire this knowledge automatically as they scale, it remains unclear how best to unlock that knowledge and apply it to specific real-world tasks.

One proven technique, called fine-tuning, is to train a pre-trained model (such as BERT or T5) on a labeled dataset to adapt it to a downstream task. However, fine-tuning requires a large number of training examples, along with a stored copy of the model weights for each downstream task, which is not always practical, especially for large models.

In Finetuned Language Models Are Zero-Shot Learners, we explore a simple technique called instruction fine-tuning, or instruction tuning for short. This involves fine-tuning a model not to solve one specific task, but to make it more amenable to solving NLP tasks in general. We use instruction tuning to train a model we call Finetuned LAnguage Net (FLAN). Because the instruction tuning phase of FLAN requires only a small number of updates compared to the large amount of computation involved in pre-training, it is the metaphorical dessert to the pre-training main course. This enables FLAN to perform a variety of unseen tasks.


Background

A popular recent technique for solving tasks with language models is called zero-shot or few-shot prompting. This technique formulates a task as text that the language model might have seen during training, and the language model then generates the answer by completing the text. For example, to classify the sentiment of a movie review, you can give the language model the sentence, "The movie review 'Best RomCom since Pretty Woman' is _" and ask it to complete the sentence with either the word "positive" or "negative."
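As a minimal sketch of this cloze-style prompting idea (not any model's actual implementation), one can compare how likely the model finds each candidate completion; `score_continuation` below is a hypothetical scorer standing in for a real language model, and the toy cue-word scorer is purely illustrative:

```python
# Sketch of zero-shot cloze-style prompting for sentiment classification.
# `score_continuation` is a hypothetical stand-in for a language model
# that returns a likelihood-style score for a candidate completion.

def build_prompt(review: str) -> str:
    # Phrase the task as text the model might have seen during pretraining.
    return f'The movie review "{review}" is _'

def classify(review: str, score_continuation) -> str:
    prompt = build_prompt(review)
    # Pick whichever completion the scorer finds more likely.
    candidates = ["positive", "negative"]
    return max(candidates, key=lambda word: score_continuation(prompt, word))

# Toy scorer: counts sentiment cue words instead of querying a real LM.
CUES = {"positive": {"best", "great"}, "negative": {"worst", "boring"}}

def toy_scorer(prompt: str, word: str) -> int:
    return sum(cue in prompt.lower() for cue in CUES[word])

print(classify("Best RomCom since Pretty Woman", toy_scorer))  # → positive
```

The key point is that the task is never stated explicitly; it is smuggled into a text-completion problem, which is exactly why prompt wording matters so much.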

While this technique performs well on some tasks, it requires careful prompt engineering to cast the task as text resembling the data the model saw during training. This approach works well on some tasks, but not all, and can also be an unintuitive way for practitioners to interact with the model. For example, the creators of GPT-3, one of the largest language models in use today, found that such prompting techniques do not produce good performance on natural language inference (NLI) tasks.

Instruction tuning

FLAN instead fine-tunes the model on a large set of varied instructions that use simple and intuitive descriptions of the task, such as "Classify this movie review as positive or negative" or "Translate this sentence into Danish."

Creating a dataset of instructions from scratch to fine-tune the model would cost considerable resources. Therefore, we instead use templates to transform existing datasets into an instructional format.
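The template idea can be sketched as follows; the templates and field names here are illustrative stand-ins, not FLAN's actual templates:

```python
# Sketch of converting an existing labeled example into instruction format
# using several natural-language templates. The templates below are
# illustrative examples for an NLI-style dataset, not FLAN's real ones.

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, "
    'can we conclude that "{hypothesis}"?',
    'Read this: {premise}\nIs it true that "{hypothesis}"?',
]

def to_instructions(example: dict) -> list[str]:
    # One labeled example yields several differently phrased instructions,
    # so the model sees many surface forms of the same underlying task.
    return [template.format(**example) for template in NLI_TEMPLATES]

example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A musician is performing.",
}
for prompt in to_instructions(example):
    print(prompt, end="\n---\n")
```

Because each dataset is rendered through multiple templates, the model is encouraged to learn the task itself rather than a single prompt phrasing.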


We show that by training the model on these instructions, it becomes good not only at solving the kinds of instructions it has seen during training, but also at following instructions in general.

Evaluating the model

To compare FLAN against other techniques in a meaningful way, we use established benchmark datasets to compare its performance with that of existing models. In addition, we evaluate how FLAN performs on a given dataset when it has seen no examples from that dataset during training.

However, if we train on datasets that are too similar to an evaluation dataset, that may still skew the performance results. For example, training on one question-answering dataset might help the model do better on another question-answering dataset. Therefore, we group all datasets into clusters by task type and hold out not only the training data for a given evaluation dataset, but the entire task cluster to which that dataset belongs.
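This hold-out scheme can be sketched in a few lines; the cluster names and dataset identifiers below are illustrative, not the paper's exact grouping:

```python
# Sketch of leave-one-cluster-out evaluation: when evaluating on a
# dataset, drop its entire task cluster from the instruction-tuning mix.
# Cluster contents here are illustrative, not FLAN's actual grouping.

CLUSTERS = {
    "nli": ["anli", "rte", "cb"],
    "sentiment": ["sst2", "imdb", "yelp"],
    "translation": ["wmt16_en_de", "wmt14_en_fr"],
}

def training_mix(eval_dataset: str) -> list[str]:
    # Find the cluster the evaluation dataset belongs to, then train on
    # every dataset outside that entire cluster.
    held_out = next(
        name for name, members in CLUSTERS.items() if eval_dataset in members
    )
    return [
        d for name, members in CLUSTERS.items() if name != held_out
        for d in members
    ]

# Evaluating on "rte" excludes all NLI datasets, not just "rte" itself.
print(training_mix("rte"))
```

Excluding the whole cluster, rather than just the evaluation dataset, is what makes the zero-shot claim credible: the model has not been tuned on any dataset of the same task type.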

These clusters cover task types such as natural language inference, sentiment analysis, and translation.


Results

We evaluated FLAN on 25 tasks and found that it improves over zero-shot prompting on all but 4 of them. We also found that FLAN outperforms zero-shot GPT-3 on 20 of the 25 tasks, and even surpasses few-shot GPT-3 on some tasks.


We also found that model scale matters for the model's ability to benefit from instruction tuning. At smaller scales, instruction tuning actually degrades performance; only at larger scales does the model become able to generalize from the instructions in the training data to unseen tasks. This may be because a model that is too small does not have enough parameters to perform a large number of tasks.


Conclusion

The FLAN model is not the first to be trained on a set of instructions, but to our knowledge we are the first to apply this technique at scale and to show that it can improve the generalization ability of the model. We hope the method we present will help inspire more research into models that can perform unseen tasks and learn from very little data.

We are also publishing the code that performs the dataset transformations, so that other researchers can reproduce our results and build on them.

