Microsoft releases Phi-3: a 3.8B model that beats ChatGPT

Author: Not Bald Programmer

Microsoft released Phi-3 on April 23. With a 3.8B mini version, Phi-3 does what Mixtral-8x7B does, which translates to roughly the level of a 14B model. After quantization it is only about 1.8 GB and generates around 20 tokens per second on an iPhone 15. The mini version was trained on 3.3T tokens, and the larger models on 4.5T tokens.
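
To get a feel for the on-device numbers, here is a minimal sketch that measures tokens per second for a 4-bit quantized Phi-3-mini using llama-cpp-python; the GGUF file name and the Phi-3 chat markers are assumptions about a local setup, not something taken from the paper.

```python
# Minimal sketch: measure decoding speed of a 4-bit quantized Phi-3-mini locally.
# Assumes llama-cpp-python is installed and a GGUF quantization of Phi-3-mini has
# been downloaded; the file name below is hypothetical.
import time

from llama_cpp import Llama

llm = Llama(model_path="phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)

# Phi-3 instruct-style prompt markers (<|user|>, <|end|>, <|assistant|>).
prompt = "<|user|>\nExplain why the sky is blue in two sentences.<|end|>\n<|assistant|>\n"

start = time.time()
out = llm(prompt, max_tokens=128)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tokens/sec")
```

On a phone the model would run through a native llama.cpp build rather than Python, but the throughput measurement is the same idea.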

It has also been widely discussed on social media.

There's an interesting post on Reddit where Phi-3, at only about 4B parameters, beats GPT-3.5 Turbo on a banana logic puzzle.

Paraphrased, it is one of those trick questions: if you put a plate on top of a banana and then move the plate to another room, what happens to the banana?

GPT-3.5: the banana is intact, but its position changes.

Phi-3: no, unless the banana sticks to the plate, it stays where it is.

GPT-3.5 gets tripped up here, and Phi-3 wins this round.

As you can see, Phi-3 really does achieve good results with a much smaller parameter count.
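
If you want to reproduce the comparison yourself, here is a minimal sketch with Hugging Face transformers, assuming the microsoft/Phi-3-mini-4k-instruct checkpoint and enough memory for a 3.8B model; the question wording is paraphrased from the post.

```python
# Minimal sketch: ask Phi-3-mini the banana question locally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{
    "role": "user",
    "content": "I put a banana on the table, place a plate on top of the banana, "
               "and then carry the plate to another room. Where is the banana now?",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```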

Phi-3 is the 4th release of Microsoft's Phi family:

Let's first review how the four generations of Phi have developed.

Phi-1, June '23 [1]: the Phi-1 model has 1.3B parameters and took only 8 A100s for 4 days to train. It can only write code. Its training data consists of 6B rigorously cleaned tokens from the web, plus 1B tokens of pre-training data and 180M tokens of instruction fine-tuning data, all composed of synthetic data generated by GPT-3.5.

The data-cleaning idea is similar to the approach we covered before: use a supervised quality-scoring model to filter the data, except that here the quality labels come from GPT-4 annotation. Concretely: 6B tokens of training data filtered from The Stack and StackOverflow, 1B tokens of synthetic data generated by GPT-3.5 (for pre-training), and 180M tokens of synthetic data generated by GPT-3.5 (for SFT).
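
A minimal sketch of that filtering idea, assuming a small file of GPT-4 quality judgments and an off-the-shelf sentence-transformers embedder (neither is the actual Phi-1 setup): train a lightweight classifier on the labeled examples, then keep only the web snippets it rates as educational.

```python
# Minimal sketch: quality filtering with a small classifier trained on GPT-4 labels.
# The label file and embedding model below are illustrative assumptions.
import json

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Each line: {"text": "...code snippet...", "educational": 0 or 1}  (judged by GPT-4)
labeled = [json.loads(line) for line in open("gpt4_quality_labels.jsonl")]
X = embedder.encode([ex["text"] for ex in labeled])
y = [ex["educational"] for ex in labeled]

clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep(snippet: str, threshold: float = 0.5) -> bool:
    """Keep a web-scraped snippet only if the classifier rates it as educational."""
    prob = clf.predict_proba(embedder.encode([snippet]))[0, 1]
    return prob >= threshold
```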

After SFT on the 180M-token data, Phi-1's code benchmarks improve sharply: Pass@1 of the Phi-1-small model reaches 45% (20% before SFT), and Pass@1 of the Phi-1 model reaches 51% (29% before SFT).
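
For reference, Pass@1 with a single sample per problem is just the fraction of problems whose generated solution passes all unit tests; a minimal sketch, where generate() and run_tests() are hypothetical helpers:

```python
# Minimal sketch of Pass@1 on HumanEval-style problems: generate one completion per
# problem, run the unit tests, and report the fraction that pass.
from typing import Callable

def pass_at_1(problems: list[dict],
              generate: Callable[[str], str],
              run_tests: Callable[[str, str], bool]) -> float:
    """problems: [{"prompt": ..., "test": ...}]; generate() returns one completion."""
    passed = 0
    for p in problems:
        completion = generate(p["prompt"])
        if run_tests(completion, p["test"]):  # True if every unit test passes
            passed += 1
    return passed / len(problems)
```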

Phi-1.5, September '23 [2]: the training data for Phi-1.5 consists of two parts, the 7B tokens of training data from Phi-1 and 20B tokens of newly collected synthetic data. The new synthetic data expands the subject matter from Phi-1's code-only data to general world knowledge and common-sense reasoning. The authors constructed 20,000 topics as seeds and used GPT to generate the data.
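
A minimal sketch of topic-seeded synthetic generation in that spirit, using the OpenAI client; the topic list, prompt wording, and model name are illustrative assumptions rather than Microsoft's actual pipeline.

```python
# Minimal sketch: generate textbook-style synthetic data from seed topics.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
seed_topics = ["photosynthesis", "supply and demand", "binary search"]  # ...up to 20k

with open("synthetic_corpus.jsonl", "w") as f:
    for topic in seed_topics:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": f"Write a short, textbook-style explanation of {topic} "
                           f"with one worked example, aimed at a curious student.",
            }],
        )
        f.write(json.dumps({"topic": topic,
                            "text": resp.choices[0].message.content}) + "\n")
```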

Phi-1.5 also ran an ablation that trains only on web data, filtered from Falcon's dataset and referred to as "filtered data", alongside the runs on the original 7B training data and the 20B generated data.

Conclusion: filtered data + synthetic data + original training data > synthetic data + original training data > filtered data.

Synthetic data and code data have been shown to improve performance.

Figure: Phi-1.5 data experiments

Phi-2, December '23 [3]: Phi-2 was even less forthcoming, getting only a technical blog post. The post notes that Phi-2 keeps scaling up the filtered web data, but the final training dataset size is not stated. Phi-2 increases the model size from 1.3B to 2.7B and, similar to Phi-1.5-web, trains on a total of 1.4T tokens over the expanded mixed dataset.

Phi-3, April '24 [4]: in the Phi-3 generation, Microsoft continues the same kind of synthetic-data experiments as Llama 3 (an approach it had already used in Phi-1); the difference is that Llama 3 used 15T tokens, while Phi-3 tested up to 4.5T tokens.

You can see that Microsoft is pushing especially hard here, keeping an iteration pace of a new version roughly every three months.

To summarize the key messages of Phi-3:

1. Phi-3-mini is a 3.8B-parameter language model; despite its small size, its performance is comparable to large models such as Mixtral 8x7B and GPT-3.5.

2. A quantized Phi-3-mini can be deployed on a phone: at about 1.8 GB after quantization, it produces around 20 tokens per second on an iPhone 15.

3. Phi-3-mini's training dataset is an extended version of the one used for Phi-2, containing a large amount of heavily filtered web data and synthetic data. The base version is trained on 3.3T tokens, and the larger models use 4.5T, with better results than Llama 3.

4. Long-context support: using LongRoPE, Phi-3-mini also comes in a long-context version that extends the context length from the default 4K to 128K.

5. Training uses a phased data pipeline: the first phase uses high-quality web data, and the second phase uses a more strictly filtered subset of the phase-one data plus GPT-synthesized data. In the first phase the model learns language ability and general knowledge; in the second phase it learns logical reasoning (a minimal sketch of such a two-phase mixture follows this list).
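
Here is the minimal sketch of such a two-phase mixture referenced in point 5; the proportions and source names are illustrative assumptions, not the actual Phi-3 recipe.

```python
# Minimal sketch: a two-phase training-data schedule expressed as mixture configs.
import random

PHASE_1 = {
    "goal": "language ability and general knowledge",
    "mixture": {"filtered_web": 1.0},              # broadly filtered web data
}

PHASE_2 = {
    "goal": "logical reasoning and niche skills",
    "mixture": {
        "heavily_filtered_web": 0.4,               # stricter subset of phase-1 web data
        "gpt_synthetic": 0.6,                      # GPT-generated textbook-style data
    },
}

def sample_source(phase: dict, rng: random.Random) -> str:
    """Pick the source of the next training document according to the phase mixture."""
    sources, weights = zip(*phase["mixture"].items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(PHASE_2, rng) for _ in range(5)])
```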

Figure: Phi-3 results

A side-by-side comparison of the key information for each Phi generation is shown in the following table:

Model          Parameters   Training cost (A100 hours)   Training tokens   MMLU
Phi-1          1.3B         768                          50B               -
Phi-1.5-web    1.3B         3,000                        300B              37.9
Phi-2          2.7B         32,256                       1.4T              56.3
Phi-3          14B          -                            4.5T              68.8

As can be seen from the table, in addition to the data quality that Microsoft has been emphasizing, the growth of data volume and the expansion of model size are also quite critical.

Data has always been the core secret of today's large models: almost none of the models that claim to be open source actually open-source their data. A few models do aim to be "fully open source", but they have not attracted much attention because of their weaker results.

The source, mixing ratio, diversity, and quality of the data have become each large model's deepest "moat".

In addition, the authors highlight some limitations:

Although the Phi-3-mini model achieves a level of language understanding and reasoning similar to much larger models, it simply cannot store much "factual knowledge", as its low score on TriviaQA shows; this is likely due to its parameter count, which leaves it without enough capacity to memorize knowledge. The authors note that this can be compensated for with retrieval-augmented generation (RAG), which may make Phi-3 the most suitable efficient small model for RAG.
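
A minimal RAG sketch along those lines: retrieve a few passages with a sentence-transformers embedder and let Phi-3-mini answer only from that context. The corpus file, retriever choice, and prompt format are assumptions for illustration.

```python
# Minimal sketch: retrieval-augmented generation with Phi-3-mini as the generator.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

retriever = SentenceTransformer("all-MiniLM-L6-v2")
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True,
    device_map="auto",
)

# Hypothetical knowledge base: one factual passage per line.
corpus = open("facts.txt").read().splitlines()
corpus_emb = retriever.encode(corpus, normalize_embeddings=True)

def answer(question: str, k: int = 3) -> str:
    # Rank passages by cosine similarity and keep the top-k as context.
    q_emb = retriever.encode([question], normalize_embeddings=True)
    top = np.argsort(-(corpus_emb @ q_emb.T).ravel())[:k]
    context = "\n".join(corpus[i] for i in top)
    # Phi-3 instruct-style prompt markers.
    prompt = (f"<|user|>\nAnswer the question using only this context:\n{context}\n\n"
              f"Question: {question}<|end|>\n<|assistant|>\n")
    out = generator(prompt, max_new_tokens=128, return_full_text=False)
    return out[0]["generated_text"]

print(answer("How many tokens was Phi-3-mini trained on?"))
```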

Resources

[1] Textbooks Are All You Need: http://arxiv.org/abs/2306.11644

[2] Textbooks Are All You Need II: Phi-1.5 technical report: http://arxiv.org/abs/2309.05463

[3] Phi-2: https://huggingface.co/microsoft/Phi-2

[4] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone: http://arxiv.org/abs/2404.14219
