
Fudan's MOSS model is scheduled to be open-sourced in mid-April, and Qiu Xipeng explains in detail how it was built

"We are still iterating on MOSS, which is expected to be open-sourced in mid-April. Overall, MOSS is trained on open Chinese and English data, has 20 billion parameters, can converse with humans, and can be iteratively optimized through interaction with humans. At the same time, although its language understanding ability is quite similar to ChatGPT's, MOSS's overall completeness is not high, mainly because very limited resources have been invested in its deployment and training."

Qiu Xipeng, professor at the School of Computer Science and Technology of Fudan University and head of the MOSS system.

Recently, at "Beyond ChatGPT: The Era Revolution Triggered by Large Language Models," an event hosted by the School of Management of Fudan University, Qiu Xipeng, professor at the School of Computer Science and Technology of Fudan University and head of the MOSS system, deconstructed ChatGPT from the perspective of its technology and principles and introduced the details of MOSS, China's first ChatGPT-like model.

Why is ChatGPT so strong?

ChatGPT is an artificial intelligence model released by OpenAI at the end of 2022. Its monthly active users exceeded 100 million within 60 days, making it the fastest-growing consumer application in history. Its main function is to converse directly with humans, and Bill Gates praised it as another technological breakthrough after the PC and the Internet.

The remarkable dialogue, understanding and expression skills demonstrated by ChatGPT have made more and more people realize that artificial intelligence has reached a new milestone. It is expected to become a vital foundational system, penetrating all walks of life at unprecedented speed and continuing to ignite the digital economy of the future.

So, what exactly is the technical principle behind ChatGPT?

Before formally answering this question, Qiu Xipeng believes it is necessary to first understand the "language model". A language model uses a computer to model human language: a mechanism that transforms natural language into a form that machines can understand and reason about.

Human natural language is very flexible: on the one hand it has rules, while on the other it can break those rules at any time and is often highly ambiguous. The same sentence can be understood very differently by different people in different situations, which makes modeling extremely difficult and challenging.

One approach is to treat sentences probabilistically: if a sentence conforms to the patterns of natural language, it is assigned a relatively high probability, and if not, a relatively low one. But this raises a new question: how do we assign a probability to a sentence?

According to Qiu Xipeng, this requires obtaining massive amounts of text data from the Internet. But a difficulty remains: the probability space is enormous and hard to model directly. The current solution is to greatly reduce the difficulty by decomposing the joint probability of an entire sentence into a product of conditional probabilities, one for each word. This turns the language model into a machine learning task: predicting the next word given the preceding text.
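Written out, this decomposition is just the chain rule of probability (a generic formulation, not tied to any particular model):

```latex
P(w_1, w_2, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

So the model never has to score a whole sentence at once; it only needs to estimate how likely each next word is given everything that came before it.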


The more accurately a language model can predict, the more fully it needs to understand human language and common-sense knowledge of the world. For example, for the model to predict that eggs are round rather than square requires a certain amount of everyday common sense. In addition, there is a notoriously difficult problem in linguistics, reference resolution: "you", "I" and "he" are pronouns, but who they refer to is not stated, and in Chinese pronouns are often omitted altogether, so the model must rely on context to fill them in before it can accurately predict the next word.


Another example is "12×3+9=?", where the model must predict the result of the expression. During training, the model is not necessarily told what "×" or "+" means; it is simply fed a large number of mathematical expressions and left to train itself. Yet as long as it sees enough expressions, the model can still capture the rules governing "×" and "+" and learn them from the data on its own.

Why can it learn on its own? One way to see this is through information compression. Suppose there are a trillion words of text: you can store them verbatim on a hard disk, or you can try to remember the entire corpus with a neural network of one billion parameters. Because so much of the information is redundant, this forces the network not merely to memorize surface text but to distill the knowledge and regularities it contains; remembering the rules is far more economical than memorizing mountains of raw words. In this way, the language model is pushed to discover the various laws behind the words, and thus to better understand human language and world knowledge.

As for the neural network architecture used in large language models, Qiu Xipeng pointed to the Transformer. The word originally means an electrical transformer: it has an input and an output, with a structure somewhat like two towers. Applied to a language model, one sequence goes in and another comes out; it is a concrete network that predicts the next piece of text given the text that came before.


Today the Transformer has become the most mainstream architecture in the entire field of artificial intelligence. Beyond its strong capability, another very important reason is that its design is particularly friendly to GPU (graphics processing unit) computing. Convolutional and recurrent neural networks were designed before GPUs became widespread and were only later adapted to GPU acceleration, whereas the Transformer appeared after the GPU, so its design could fully exploit the GPU's capabilities from the start, making it much easier to train language models at large scale.

However, to make the model's predictions more accurate, it must be trained with a very large number of parameters so that it can fully grasp the rules of human language and their logical relationships. With the Transformer, researchers can now build models with tens or even hundreds of billions of parameters. This is today's Large Language Model. Given a piece of preceding text, the Transformer network of artificial neurons behind the language model processes it, predicts the next word, and outputs the corresponding text.
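To make "predict the next word given the preceding text" concrete, here is a minimal sketch using GPT-2 through the Hugging Face transformers library, purely as a small, publicly available stand-in; MOSS and ChatGPT themselves are not assumed to be available through this interface.

```python
# A minimal sketch of next-word prediction with a Transformer language model.
# GPT-2 is used only as a small public stand-in for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The egg on the table is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]          # a score for every possible next token
top = torch.topk(next_token_logits, k=5)
print([tokenizer.decode(int(i)) for i in top.indices])  # most likely continuations
```

The model outputs a probability distribution over its whole vocabulary at every position; generation simply means repeatedly sampling or picking the next token and appending it to the context.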

"In the training process, scientists found that after the calculation amount is about 10 to the 22nd power, the model ability will complete the leap from quantitative change to qualitative change, showing an amazing explosive growth, which we usually call 'emergence ability'." Qiu Xipeng said.


What are the key technologies behind emergent abilities?

"Large-scale language models begin to gain 'emergence capabilities' after reaching tens of billions of scales, and behind the emergence capabilities, there are further three very important technologies: situational learning, thought chain and instruction learning, which is the key reason why ChatGPT has been able to win in the field of artificial intelligence." Qiu Xipeng said.


In-context learning profoundly changes the traditional machine learning paradigm: simply through a series of well-designed prompts that describe the task in detail, supplemented with a few in-context examples, the model can follow the given examples to complete a specific task.

Qiu Xipeng gave an example. Suppose you want to build a movie review sentiment classifier to count whether a film's reviews are mostly positive or negative. You can first design a prompt describing the task, such as: "This is a movie review sentiment classifier. Comment: 'I love this movie!' This comment is positive. Comment: 'I don't know, it's okay.' This comment is neutral. Comment: 'What a waste of time, don't recommend this movie.' This comment is negative." The model then learns from the context and makes its prediction. This is clearly different from the traditional approach of storing task knowledge directly in the parameters, and it also explains to some extent why ChatGPT is usually presented as a multi-turn conversation.
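A short sketch of that few-shot setup is below. The prompt text follows Qiu Xipeng's example; the `complete` function is a hypothetical placeholder for whatever text-completion call is available, not a real library API.

```python
# A sketch of the in-context (few-shot) setup described above.
# `complete` is a placeholder for any text-completion call, not a real API.
FEW_SHOT_PROMPT = """This is a movie review sentiment classifier.
Comment: 'I love this movie!' This comment is positive.
Comment: 'I don't know, it's okay.' This comment is neutral.
Comment: 'What a waste of time, don't recommend this movie.' This comment is negative.
Comment: '{comment}' This comment is"""


def classify(comment: str, complete) -> str:
    """Fill a new comment into the prompt and let the model continue the text."""
    # No parameters are updated: the "training examples" live entirely in the prompt.
    return complete(FEW_SHOT_PROMPT.format(comment=comment)).strip()
```

The key point is that the task is specified purely in the input text at inference time, rather than baked into the model weights by gradient updates.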

In fact, every time ChatGPT receives a message from a human, it feeds the entire previous chat history into the language model as the preceding text, and the model then writes a continuation that is returned to the user. Letting a large language model interact directly with humans in this way was, from the perspective of product innovation, a genuinely clever and far-sighted move.


Model capability can be improved by scaling up parameters, but Google researchers found a better way: prompt the model to break a complex problem down into several steps of reasoning over simpler sub-problems, so that it learns how humans derive the answer step by step. This is called chain-of-thought.

"After massive pre-training, large language models have seen a lot of reasoning, and we just need to guide it step by step to make it reason the way you want." Qiu Xipeng said that the thinking chain method further liberates the potential of the model, so that complex problems that the model will not solve can be broken down into many simple problems, and then solve simple problems one by one, and finally make complex problems also solved.

The third key technique is learning from natural instructions, or instruction learning.

Traditionally, machine learning has required large amounts of labeled data for a model to learn from. But labeling data is tedious, and people have long hoped that a language model could learn directly from instructions and directly understand a person's intent.

Facts have proved that this idea works: humans only need to give instructions on a small number of tasks. After being moderately fine-tuned on instructions for roughly 40 tasks, the model generalizes easily to hundreds or even thousands of tasks, and it copes well even with tasks it has never seen before.
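To illustrate what "instructions" mean as training data, here is a small sketch of instruction-style records, reusing examples from earlier in the article. The schema is assumed for illustration only and is not claimed to be MOSS's actual data format.

```python
# An illustrative sketch of instruction-format training data. The exact schema
# varies from project to project; this is not MOSS's actual format.
instruction_data = [
    {
        "instruction": "Classify the sentiment of the movie comment.",
        "input": "What a waste of time, don't recommend this movie.",
        "output": "negative",
    },
    {
        "instruction": "Answer the arithmetic question.",
        "input": "12 x 3 + 9 = ?",
        "output": "45",
    },
]


def render(example: dict) -> str:
    # During instruction fine-tuning, each record is flattened into one text
    # sequence and the model is trained to produce the output given the rest.
    return (
        f"Instruction: {example['instruction']}\n"
        f"Input: {example['input']}\n"
        f"Output: {example['output']}"
    )
```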

Qiu Xipeng believes the remaining problem in current technology may be that many existing tasks have not yet been "aligned" with humans. Although learning from natural instructions greatly improves generalization, models still vary widely in how well they grasp true human intent, so OpenAI (the developer of ChatGPT) set out to collect real human intents and have experts write answers to them in order to better match human preferences. "In this process, human participation is very important: it keeps the machine aligned with human values and ways of thinking as it iterates, and prevents the machine from drifting further and further away from human preferences and original intentions during its own iteration," Qiu Xipeng said.

How is MOSS made?

"After understanding these basic technical principles behind ChatGPT, we can basically try to replicate this large language model." Qiu Xipeng continued.

The first step is to build the language-model base, the second is instruction fine-tuning, and the third is to continuously strengthen and iterate on its capabilities. Although these key steps and the general direction of development are already quite clear, the details of each step still have to be explored on our own, and the process remains full of unknown challenges.


Regarding the first step, the MOSS team mainly optimized modules on top of the Transformer architecture. The most challenging part was getting the model to handle Chinese.

"First of all, for ChatGPT, it does not pay special attention to Chinese, many times just directly encode the Chinese in English, we as Chinese naturally want to optimize the Chinese, we need to re-implement better Chinese coding, and find a way to connect Chinese and English; In addition, if multimodal is accessed in the future, coding problems will also bring many problems and troubles such as architecture design and training stability. Qiu Xipeng said.

Secondly, regarding instruction fine-tuning, Qiu Xipeng believes it is even harder than pre-training: "In the pre-training stage, you can build on the mature pre-trained models of some large companies and get good results in a short time. Instruction fine-tuning, however, cannot be done overnight, and here the gap with OpenAI is very obvious."

Aligning with humans is also hard: it is difficult to make the model's answers match human habits of thinking as closely as possible. "And considering that OpenAI is not open-sourcing its work for the time being, we can only explore slowly, step by step. If you want to surpass ChatGPT, you have to find a better implementation path than it did, and that process is undoubtedly full of difficulties," Qiu Xipeng said.


Qiu Xipeng also specifically talked about the implementation of MOSS.

First, stimulate the dialogue ability of the MOSS large language model. "Because a university is not like OpenAI, which can hire many people to write answers, we started by writing some seed prompts and used the so-called self-instruct technique, letting OpenAI's 'davinci' model help us expand them into a large number of prompts and answers. With that done, we had some small-scale dialogue data, on which we could train a model with a supervised approach and then keep developing on that basis so that it gradually aligns with real human needs," Qiu Xipeng said.
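A rough sketch, under stated assumptions, of this self-instruct-style bootstrapping: `expand_with_llm` is a hypothetical callable standing in for a call to an existing large model, and the seed prompts and data flow are illustrative rather than the MOSS team's actual pipeline.

```python
# A rough sketch of self-instruct-style bootstrapping: hand-written seed
# prompts are expanded into many more prompt/answer pairs by an existing large
# model, and the result becomes supervised fine-tuning data.
# `expand_with_llm` is hypothetical, not a real API; per the talk, the MOSS
# team used OpenAI's davinci model for this expansion step.
seed_prompts = [
    "Explain why eggs are round rather than square.",
    "Write a short, polite reply declining a meeting invitation.",
]


def build_sft_dataset(seeds, expand_with_llm, n_per_seed=10):
    dataset = []
    for seed in seeds:
        for _ in range(n_per_seed):
            # Ask the existing model both to vary the prompt and to answer it.
            new_prompt = expand_with_llm(f"Write a prompt similar to: {seed}")
            answer = expand_with_llm(new_prompt)
            dataset.append({"prompt": new_prompt, "answer": answer})
    return dataset  # later used to fine-tune the base model with supervision
```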

The process of using MOSS is itself what helps it align with humans and become better and better. Here, the researchers want to let the machine write answers on its own, encouraging it to keep iterating and optimizing according to human preferences so that it produces data increasingly in line with human habits.

Qiu Xipeng revealed, "We are still working on iterating MOSS, which is expected to be open-sourced in mid-April, when everyone will be able to use it to their heart's content." On February 21, at the 2023 Global Artificial Intelligence Developer Pioneer Conference, Qiu Xipeng had said that the MOSS large model would be open-sourced at the end of March if all went well.

Overall, MOSS is trained on open Chinese and English data, has 20 billion parameters, can converse with humans, and can be iteratively optimized through interaction with humans. At the same time, Qiu Xipeng also admitted that although its language understanding ability is quite similar to ChatGPT's, MOSS's overall completeness is not high. The main reason is that very limited resources have been invested in its deployment and training: compared with ChatGPT's hundreds of billions of parameters, MOSS is only about one tenth of that scale, so there is a great deal of factual knowledge it cannot remember, and its chain-of-thought ability is relatively weak. However, Qiu Xipeng said the team is actively trying to introduce external tools, further expand the model's parameter scale, and keep improving and optimizing it.

How will AI disrupt the future of society?

Considering that ChatGPT already has general language understanding and can be extended with external interfaces, it has become a technical foundation for Artificial General Intelligence (AGI), which means that the accelerated realization of general artificial intelligence is no longer a dream for humanity at this stage. Optimistically, AI figures like those in science-fiction films may soon appear in human life.

General artificial intelligence technology represented by ChatGPT can ignite the digital economy, bring out the full value of data and computing power, and give rise to a large number of new business models; it can empower industrial digitalization and, through human-machine collaboration, ease the shortage of industry experts; it can power new forms and models of the digital economy in the shape of digital humans, personal assistants, search engines and the like; and it will profoundly change the ecology of education, social governance, justice and other fields, greatly raising the level of each industry.


"Of course, we must also face up to the fact that the current general artificial intelligence technology still has many shortcomings, including randomness, uncontrollability, easy to 'talk nonsense', etc., but I believe that these problems will be gradually improved in various ways in the future over time." Qiu Xipeng said.

As for the next stage of large language models, Qiu Xipeng believes the key task now is to "align" the model with the real world and with human values and to turn it into a real agent, with capabilities such as autonomous learning, cross-modal learning, and the use of knowledge and tools. At the same time, the "alignment" of AI with human values cannot be ignored; after all, if AI's values run counter to human values, it will be very dangerous.

At the end of the speech, Qiu Xipeng said, "Perhaps, as Turing Award winner and well-known artificial intelligence expert Yann LeCun has said, the next generation of models should be more factual, harmless, up to date, and able to flexibly use auxiliary tools such as calculators, databases, search engines and simulators. This is also a problem people urgently need to focus on."
