
Microsoft has released an open-source small model too! Trained with ChatGPT and GPT-4, it shows what open-source models can do

Author: InfoQ

Author | Nuclear Cola, Chu Xingjuan

Rankings such as Chatbot Arena have repeatedly shown that ChatGPT, a model worth billions of dollars, remains the undisputed king of chatbots. Users can only call its API; they cannot deploy it privately, nor can they train or fine-tune it themselves. As a result, the community is now keen to build AI chatbots on open-source large models, hoping to match or even surpass proprietary models like ChatGPT in performance.

Recently, Microsoft unexpectedly released Orca, an open-source small model with only 13 billion parameters yet the reasoning ability of a large model. Thanks to an innovative training method, it has become the first challenger willing to go head-to-head with proprietary models, despite being only a fraction of their size (perhaps only a hundredth the size of GPT-4). Remarkably, Orca even does better in some scenarios, and it thoroughly crushes Vicuna, hitherto billed as the strongest open-source model.

Paper: https://arxiv.org/pdf/2306.02707.pdf


So, how exactly does Orca do it?

A new way to train: ingenuity over brute force

When it comes to training AI models, capital has basically become the price of admission. For a model with billions of parameters, the implications include:

  • Collecting the training data alone costs millions of dollars;
  • Training the base model costs millions more;
  • Fine-tuning the model can also run to hundreds of thousands of dollars;
  • And that is before reinforcement learning from human feedback (RLHF): if a company's quarterly revenue is not in the billions of dollars, it had better not touch this step at all.

So when it comes to the competition over "large language models", there are really only four or five companies in the world qualified to take part.

Therefore, to compete with large proprietary models such as ChatGPT on performance, researchers have no choice but to counter their rivals' financial muscle with ingenuity. In generative AI, that "ingenuity" is distillation.

Put simply, distillation means picking an excellent model and using its responses as learning material for a smaller model. Why do this? The reason is simple: ChatGPT has billions of parameters, but only a "few" of them really matter. At the level of principle:

  • The model must first have enough parameters to capture the complex representations of the real world;
  • As a result, in most models the bulk of those parameters go largely unused.

Recognizing this, researchers arrived at the following question: even if advanced models such as GPT-4 will still need to keep growing in size, once a large model has been trained, can a much smaller model simply reproduce some or all of its capabilities?

In other words, when teaching AI models about the real world, can the large language model do the heaviest "pattern extraction" work first, and then act as a "teacher" guiding the smaller models?

The answer is yes. Distillation is exactly such a learning method: it uses the large model as a template to train a small model. The best AI chatbot development recipe for the open-source community can thus be summarized roughly as follows (a code sketch of this loop appears after the list):

  • Sample the large language model (the teacher) to build a dataset of {user instruction, output} pairs. The common choice of teacher is, of course, ChatGPT.
  • Next, pick a smaller model (somewhere between 500 million and 15 billion parameters) as the student.
  • The student's task is to minimize the discrepancy between its own output and the teacher's, learning from it and imitating it.
  • In this way, the small model picks up the teacher's style and produces similar output, while keeping training and running costs much lower.
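
To make this recipe concrete, here is a minimal sketch of such a distillation loop in Python, using PyTorch and Hugging Face Transformers. The data file, student model choice, prompt template, and hyperparameters are illustrative assumptions rather than the setup of any particular project:

# Minimal sketch of imitation-style distillation: fine-tune a small "student"
# on {user instruction, teacher output} pairs sampled from a large teacher
# (e.g. ChatGPT). Model name, file name, and hyperparameters are placeholders.
import json
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "EleutherAI/pythia-1.4b"  # any 0.5B-15B causal LM can play the student

tokenizer = AutoTokenizer.from_pretrained(STUDENT)
tokenizer.pad_token = tokenizer.eos_token
student = AutoModelForCausalLM.from_pretrained(STUDENT)

class TeacherPairs(Dataset):
    # Each JSONL line: {"instruction": "...", "teacher_response": "..."}
    def __init__(self, path, max_len=1024):
        self.rows = [json.loads(line) for line in open(path)]
        self.max_len = max_len

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        row = self.rows[i]
        text = (f"### Instruction:\n{row['instruction']}\n\n"
                f"### Response:\n{row['teacher_response']}")
        enc = tokenizer(text, truncation=True, max_length=self.max_len,
                        padding="max_length", return_tensors="pt")
        input_ids = enc.input_ids.squeeze(0)
        attention_mask = enc.attention_mask.squeeze(0)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # no loss on padding tokens
        return {"input_ids": input_ids, "attention_mask": attention_mask,
                "labels": labels}

loader = DataLoader(TeacherPairs("teacher_pairs.jsonl"), batch_size=4, shuffle=True)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

student.train()
for batch in loader:
    # Standard causal-LM cross-entropy: the student learns to reproduce the
    # teacher's tokens, i.e. to minimize the gap between its output and the teacher's.
    loss = student(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

In this framing, the "discrepancy" the student minimizes is just next-token cross-entropy on the teacher's text, which is what lets it absorb the teacher's style so cheaply.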

The result is a new state-of-the-art model at only about one percent of the large model's cost. It sounds great, but the real world is clearly not that rosy.

While such models do learn the teacher's style and linguistic fluency (Vicuna and Alpaca, for example), they often fail to pick up its powerful reasoning skills. That is, on complex tasks they perform far worse than their teachers, and "far" is the operative word.

Orca crushes open-source models and catches up with ChatGPT

Today, the performance of most open-source models is deliberately exaggerated. The strong showings of models such as Vicuna and Alpaca may be the result of careful cherry-picking by researchers; their performance on reasoning benchmarks has so far been hard to defend.

For example, Vicuna can reach roughly 89% of GPT-4's quality when judged on style and linguistic fluency, but on benchmarks of complex tasks such as seven-step logical deduction the gap widens to an embarrassing 5,400%. In other words, GPT-4 performs 55 times better than Vicuna.

Orca's researchers were aware of this problem and set out to fix it. In performance tests using zero-shot prompting on Big-Bench Hard, the figure of 2,900% indicates how much Orca improves over Vicuna.


Orca's overall performance across all tasks was slightly better than ChatGPT's but significantly behind GPT-4's, and 113% higher than Vicuna's. As with the AGIEval results, Vicuna performed poorly on this benchmark's complex reasoning tasks. Orca, while significantly better than Vicuna and slightly better than ChatGPT, averaged 49.7%, lagging GPT-4 by 26%.

In the tests, Orca outperformed ChatGPT by 102%, 3.6%, and 1.7% on temporal sequences (temporal reasoning), navigation (following navigation instructions), and colored objects (identifying object colors from context). Orca excels on causal judgment tasks, performing on par with GPT-4 while beating ChatGPT by 4.7%. Orca and ChatGPT are roughly level at detecting translation errors. On tasks requiring broad knowledge (sports, artists, humor, and so on), Orca does not perform as well as ChatGPT, but it does better on movie recommendation.

In the Web of Lies test, Orca even beat GPT-4, scoring 3% higher than the star model that is some 100 times its size. Vicuna fared no better: Orca scored 24.3% higher.


Source: Microsoft (Web of Lies example)

Impressively, across all of the tasks above, Orca's average performance surpasses GPT-3.5. This is not only a new milestone for open-source models; Orca also consistently maintains more than twice Vicuna's performance.

For the most part Orca still lags the undisputed king, GPT-4, so how exactly did it manage to beat its open-source peers and occasionally outdo its big brother?

What Orca researchers did

At present, the way small models imitate large models through instruction fine-tuning has the following main problems:

  • The instructions are simple and lack diversity.
  • The collected data is small in scale and the tasks lack variety.
  • The imitation signal is limited: imitation learning can only be done through the ⟨query, response⟩ pairs generated by the teacher model.
  • The evaluation criteria are weak. Results of instruction-tuning small models against large models generally rely on GPT-4 for automatic evaluation, which has its own biases: for example, models instruction-tuned on GPT-4 responses tend to generate longer text, and GPT-4's judgments are sensitive to the order of the candidate responses.

Orca's researchers focused on two key innovations:

  1. Explanation tuning

Before Orca, models such as Vicuna and Alpaca could only sample simple {user instruction, answer} pairs from models like GPT-4 for distillation, and then train a new model to imitate the teacher.


But on Orca's side, there has been a huge shift in R&D thinking.

Instead of simply extracting queries as before, the researchers introduced a third element: system instructions. That is, in addition to the user instruction and the model's answer, the Microsoft researchers added a set of extra instructions designed to shape the student model's behavior and thought process.


This is easy to understand: the student needs to imitate not only the quality of GPT-4's output, but also the teacher's thought process, in order to acquire similar reasoning ability.
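
To make the shift concrete, here is a toy sketch (plain Python, no training code) of how an explanation-tuning record differs from the simple imitation pair described earlier. The field names, prompt layout, and the example system instruction are illustrative assumptions; the actual system messages are described in the Orca paper:

# A plain imitation pair: only the final answer is available to the student.
plain_pair = {
    "instruction": ("If all bloops are razzies and all razzies are lazzies, "
                    "are all bloops lazzies?"),
    "teacher_response": "Yes.",
}

# An explanation-tuning record adds a system instruction that pushes the teacher
# (GPT-4) to expose its reasoning, so the student can imitate the thought process
# rather than just the answer. The wording below is illustrative, not the paper's.
explanation_record = {
    "system_instruction": ("You are a helpful assistant. Think step by step and "
                           "justify your answer before giving the final response."),
    "instruction": plain_pair["instruction"],
    "teacher_response": ("All bloops are razzies, and all razzies are lazzies. "
                         "By transitivity, every bloop is also a lazzie. "
                         "Final answer: yes."),
}

def to_training_text(record):
    # Flatten one record into the text the student is fine-tuned on.
    return (f"### System:\n{record['system_instruction']}\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Response:\n{record['teacher_response']}")

print(to_training_text(explanation_record))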

  2. Progressive learning with an intermediate teacher

So far, most open-source models use a single {student, teacher} pair. Orca, however, has two teachers. The first is, naturally, ChatGPT, which as the first teacher guides the student model through the less complex queries. GPT-4 then provides guidance on more complex queries, letting the student learn further on top of what it already knows.

The process closely resembles how humans learn: before tackling multiplication and division, we first master addition and subtraction, breaking through one difficulty at a time. Compared with training on GPT-4 alone, this progressive learning proved more effective.
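
As a rough illustration, the curriculum can be pictured as two passes of the same fine-tuning loop shown earlier, first over ChatGPT-explained data and then over GPT-4-explained data. The sketch below reuses the TeacherPairs dataset and student model from the first code sketch, and the file names are placeholders:

import torch
from torch.utils.data import DataLoader

def run_stage(student, dataset_path, lr=2e-5, batch_size=4):
    # One curriculum stage: a full pass over one teacher's explanation data.
    loader = DataLoader(TeacherPairs(dataset_path), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    student.train()
    for batch in loader:
        loss = student(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# The order of the stages is the whole point: intermediate teacher first,
# stronger teacher second.
run_stage(student, "chatgpt_explanations.jsonl")  # stage 1: simpler queries, ChatGPT as teacher
run_stage(student, "gpt4_explanations.jsonl")     # stage 2: harder queries, GPT-4 as teacher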

Conclusion

Whether the current bigger-and-more-power-hungry development model will soon reach its end remains to be seen, but new results now break through the existing rules of the game and technical boundaries almost every week, and a great deal of effort is going into efficiency.

Judging from how Orca crushes many open-source models with a bit of cleverness, we can only say that we still know very little about AI technology. Microsoft, which with ChatGPT holds a commanding lead in the market, has taken the initiative and lifted open-source models to a new level. Open-source models may be about to usher in an era of their own.

Reference Links:

https://medium.com/@ignacio.de.gregorio.noblejas/orca-microsoft-7c78ca03c803

This article is reproduced from:

https://www.infoq.cn/article/pJwogXHgoNwH1CMkF0m3
