Editors: Du Wei, Chen Ping
Instruction tuning of large models keeps getting better, and this time Microsoft has turned to GPT-4.
We know that from Google's T5 model to OpenAI's GPT series, large language models (LLMs) have demonstrated impressive generalization capabilities, such as in-context learning and chain-of-thought reasoning. At the same time, to make LLMs follow natural language instructions and complete real-world tasks, researchers have been exploring instruction-tuning methods. These come in two flavors: fine-tuning the model on a wide range of tasks using human-annotated prompts and feedback, and supervised fine-tuning on public benchmarks and datasets augmented with manually or automatically generated instructions.
Among these methods, Self-Instruct tuning is a simple and effective approach that aligns LLMs with human intent by learning from instruction-following data generated by state-of-the-art instruction-tuned teacher LLMs. Instruction tuning has proven to be an effective means of improving the zero-shot and few-shot generalization abilities of LLMs.
More recently, the success of ChatGPT and GPT-4 has created a huge opportunity to improve open-source LLMs through instruction tuning. Meta's LLaMA is a family of open-source LLMs whose performance rivals that of proprietary LLMs such as GPT-3. To teach LLaMA to follow instructions, Self-Instruct was quickly adopted thanks to its strong performance and low cost. For example, Stanford's Alpaca model uses 52K instruction-following samples generated by GPT-3.5, while the Vicuna model uses about 700K instruction-following samples from ShareGPT.
To push the SOTA of LLM instruction tuning further, Microsoft Research, in its paper "Instruction Tuning with GPT-4", uses GPT-4 as the teacher model for Self-Instruct tuning for the first time.

On the one hand, the researchers released the data generated by GPT-4, including a 52K instruction-following dataset in both English and Chinese, as well as GPT-4-generated feedback data that rates the outputs of three instruction-tuned models.
On the other hand, they used the GPT-4-generated data to develop an instruction-tuned LLaMA model and a reward model. To assess the quality of the instruction-tuned LLMs, the researchers evaluated test samples with three metrics: human evaluation on three alignment criteria, automatic evaluation using GPT-4 feedback, and ROUGE-L (an automatic summarization metric) on Unnatural Instructions.
The experimental results verify the effectiveness of instruction-tuning LLMs with GPT-4-generated data: the 52K instruction-following data generated by GPT-4 yields better zero-shot performance on new tasks than previous SOTA models. The researchers have released the GPT-4-generated data and the related code.
Datasets
The study used GPT-4 to generate the following four datasets:
English Instruction-Following Data: for the 52K instructions collected from Alpaca, each instruction is paired with an answer generated by GPT-4 in English. This dataset is mainly used to explore and compare GPT-4's answers with GPT-3's answers (see the sketch after this list).
Chinese Instruction-Following Data: the study used ChatGPT to translate the 52K instructions into Chinese and asked GPT-4 to answer them in Chinese.
Comparison Data: GPT-4 is asked to rate its own responses on a scale of 1 to 10. In addition, GPT-4 is asked to compare and score the responses of the GPT-4, GPT-3.5, and OPT-IML models. This dataset is primarily used to train reward models.
Answers on Unnatural Instructions: GPT-4 answers are decoded on the core dataset of 68K instruction-input-output triplets. This subset is used to quantify the gap between GPT-4 and the instruction-tuned models.
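For reference, here is a minimal sketch of how such GPT-4 answers could be collected for the Alpaca instructions with the OpenAI Python client. The prompt format, decoding settings, and file names are illustrative assumptions rather than the paper's exact pipeline.

```python
# Sketch: collecting GPT-4 answers for Alpaca-style instructions.
# The prompt format, file names, and decoding settings here are illustrative
# assumptions, not the exact configuration used in the paper.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(example: dict) -> str:
    """Format an Alpaca-style record (instruction + optional input) as one prompt."""
    if example.get("input"):
        return f"{example['instruction']}\n\n{example['input']}"
    return example["instruction"]

def gpt4_answer(prompt: str) -> str:
    """Ask GPT-4 for a single answer to one instruction."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,   # assumed; the paper's decoding settings may differ
        max_tokens=512,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("alpaca_52k_instructions.json") as f:   # hypothetical input file
        instructions = json.load(f)

    answers = []
    for example in instructions:
        answers.append({
            "instruction": example["instruction"],
            "input": example.get("input", ""),
            "output": gpt4_answer(build_prompt(example)),
        })

    with open("gpt4_english_answers.json", "w") as f:  # hypothetical output path
        json.dump(answers, f, ensure_ascii=False, indent=2)
```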
Figure 1 compares the English output response sets of GPT-4 and GPT-3.5. Figures 1(a) and 1(b) show the verb-noun pairs that appear more than 10 times in each output set, Figure 1(c) compares the 25 most frequent word pairs in the two sets, and Figure 1(d) compares the frequency distribution of sequence lengths, showing that GPT-4 tends to generate longer sequences than GPT-3.5.
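The verb-noun statistics behind Figure 1 come from dependency parsing of the responses. Below is a rough sketch of how such pairs could be extracted with spaCy; the paper does not specify its parsing toolchain, so the model name and helper function are assumptions.

```python
# Sketch: extracting (root verb, direct object) pairs from responses, the kind
# of analysis behind the Figure 1 frequency statistics. Requires:
#   python -m spacy download en_core_web_sm
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def verb_noun_pair(text: str):
    """Return (root verb, direct object) of the first sentence that has both."""
    doc = nlp(text)
    for sent in doc.sents:
        root = sent.root
        if root.pos_ != "VERB":
            continue
        for child in root.children:
            if child.dep_ == "dobj":
                return (root.lemma_, child.lemma_)
    return None

responses = [
    "The model writes a short poem about autumn leaves.",
    "It summarizes the given article in two sentences.",
]
pairs = Counter(p for p in map(verb_noun_pair, responses) if p)
print(pairs.most_common(25))  # frequencies of the top verb-noun pairs
```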
Instruction fine-tuning language models
The study started from a LLaMA 7B checkpoint and trained two models with supervised fine-tuning: (i) LLaMA-GPT4, trained on the 52K English instruction-following data generated by GPT-4; (ii) LLaMA-GPT4-CN, trained on the 52K Chinese instruction-following data generated by GPT-4.
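The paper follows an Alpaca-style supervised fine-tuning recipe. The sketch below shows roughly how such a run could be set up with Hugging Face Transformers; the checkpoint name, prompt template, data path, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: supervised fine-tuning of a LLaMA-7B checkpoint on GPT-4 generated
# instruction-following data. Checkpoint name, prompt template, data path, and
# hyperparameters are assumptions for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE_MODEL = "huggyllama/llama-7b"          # assumed checkpoint name
DATA_FILE = "gpt4_english_answers.json"     # hypothetical path from the data step

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def format_example(example):
    # Alpaca-style prompt followed by the GPT-4 answer as the training target.
    prompt = ("Below is an instruction that describes a task.\n\n"
              f"### Instruction:\n{example['instruction']}\n\n"
              f"### Response:\n{example['output']}")
    return tokenizer(prompt, truncation=True, max_length=512)

dataset = load_dataset("json", data_files=DATA_FILE, split="train")
dataset = dataset.map(format_example, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-gpt4",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```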
Reward model
Reward modeling is one of the key components of reinforcement learning from human feedback (RLHF), which aims to align LLM behavior with human preferences; the problem is usually formulated as a regression task that predicts a reward for a given prompt-response pair. This approach typically requires large-scale comparison data, and existing open-source models such as Alpaca, Vicuna, and Dolly skip RLHF because labeling comparison data is expensive. At the same time, recent studies show that GPT-4 can identify and fix its own errors and accurately judge the quality of responses. Therefore, to facilitate research on RLHF, the study created comparison data using GPT-4, as described above.
To evaluate data quality, the study also trained a reward model based on OPT 1.3B to score this dataset. The distribution of the comparison data is shown in Figure 2.
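As a rough illustration of the idea, the sketch below trains a scalar reward head on top of OPT-1.3B with the standard pairwise ranking loss used in RLHF; the exact objective and data format in the paper may differ, and the training example is made up.

```python
# Sketch: an OPT-1.3B reward model trained on comparison data with a pairwise
# ranking loss (standard RLHF practice); not the paper's exact objective.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 turns the classification head into a scalar reward head.
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def reward(prompt: str, response: str) -> torch.Tensor:
    """Scalar reward for a (prompt, response) pair."""
    inputs = tokenizer(prompt + "\n" + response, return_tensors="pt",
                       truncation=True, max_length=512)
    return reward_model(**inputs).logits.squeeze(-1)

def pairwise_loss(prompt: str, better: str, worse: str) -> torch.Tensor:
    """Push the reward of the higher-rated response above the lower-rated one."""
    return -F.logsigmoid(reward(prompt, better) - reward(prompt, worse)).mean()

# Toy training step on one comparison (illustrative data, not from the released set).
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
loss = pairwise_loss(
    "Explain photosynthesis.",
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "It is when plants eat sunlight.",
)
loss.backward()
optimizer.step()
```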
Experiments
The study used three types of evaluation: human evaluation, GPT-4 evaluation, and evaluation on Unnatural Instructions. The results confirm that, compared with other machine-generated data, data generated by GPT-4 is an efficient and effective source for instruction-tuning LLMs. Next, let's look at the specific experimental procedure.
Human evaluation
Figure 3(a) shows the LLaMA-GPT4 vs. Alpaca comparison: the GPT-4-tuned model wins on the helpfulness criterion with a score of 54.12%. Figure 3(b) shows the LLaMA-GPT4 vs. GPT-4 comparison, indicating that the performance of LLaMA tuned with GPT-4 instructions is similar to that of the original GPT-4.
Comparison with SOTA using automatic evaluation
The study used GPT-4 to automatically evaluate the responses of different models to 80 unseen questions. It first collected answers from two chatbots, LLaMA-GPT-4 (7B) and GPT-4, and used published answers from other chatbots, including LLaMA (13B), Alpaca (13B), Vicuna (13B), Bard (Google, 2023), and ChatGPT. For each pair of models, GPT-4 was asked to score the quality of their responses on a scale of 1 to 10. The results are shown in Figure 4.
Figure 4(c, d) compares all chatbots. LLaMA-GPT4 performs better: the 7B LLaMA-GPT4 outperforms the 13B Alpaca and LLaMA. However, LLaMA-GPT4 still lags behind large commercial chatbots such as GPT-4.
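The GPT-4 judging step used above can be sketched as follows; the judging prompt is an illustrative paraphrase rather than the exact template released with the paper.

```python
# Sketch: GPT-4 as an automatic judge scoring two chatbots' answers to the same
# question on a 1-10 scale. The judging prompt is an illustrative paraphrase.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are a helpful and impartial judge. Rate the quality of the two "
        "assistants' answers to the question below on a scale of 1 to 10, "
        "then briefly explain your scores.\n\n"
        f"Question: {question}\n\n"
        f"Assistant A: {answer_a}\n\n"
        f"Assistant B: {answer_b}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    return response.choices[0].message.content

print(judge("What causes tides?",
            "Tides are caused mainly by the gravitational pull of the Moon and Sun.",
            "Tides happen because of wind."))
```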
The researchers further studied the performance of all chatbots in Chinese, shown in Figure 5 below. They first used GPT-4 to translate the chatbots' English responses into Chinese, and also used GPT-4 to translate the English questions into Chinese to obtain answers. Comparisons against GPT-4's translated and natively generated Chinese responses are shown in Figures 5(a) and 5(b), and Figure 5(c) shows the results for all models asked to answer in Chinese.
In Figure 6 below, the researchers compare LLaMA-GPT4 with GPT-4 and Alpaca on Unnatural Instructions. The results show that LLaMA-GPT4 and GPT-4 perform better as the length of the ground-truth response increases, meaning they follow instructions better when the scenario is more creative. When the sequence is short, both LLaMA-GPT4 and GPT-4 can generate responses matching the simple ground-truth answers, though adding extra words makes the responses more conversational.
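For reference, ROUGE-L between a model response and an Unnatural Instructions ground-truth output can be computed with the rouge_score package, as in the minimal sketch below (not necessarily the paper's exact evaluation script; the example strings are made up).

```python
# Sketch: computing ROUGE-L between a model response and a ground-truth output,
# the metric used for the Unnatural Instructions comparison.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "The capital of France is Paris."        # ground-truth output
prediction = "Paris is the capital city of France."  # model response
score = scorer.score(reference, prediction)["rougeL"]
print(f"ROUGE-L F1: {score.fmeasure:.3f}")
```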
Paper address: https://arxiv.org/pdf/2304.03277.pdf
Project address: https://instruction-tuning-with-gpt-4.github.io/
GitHub address: https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM