
Lin Yonghua: As AI enters the era of large models, how will the tide rise and fall in the new decade? | Lecture 163-1

Author: Wenhui.com

Lin Yonghua's keynote speech on large models helped the audience better understand ChatGPT and related phenomena.

This lecture was co-sponsored by Wen Wei Po, the Shanghai TreeMap Blockchain Research Institute, the Institute of Modern Chinese Thought and Culture at East China Normal University, and the Ethics and Wisdom Research Center of the Department of Philosophy at East China Normal University.

The keynote speech is published below for the benefit of listeners and readers.


AI, which had fallen into a trough, entered a new decade last year at the inflection point created by text-to-image generation and the large models represented by ChatGPT

I am very pleased to take this opportunity to share the opportunities and challenges I have experienced in AI over the years: from small models to large models, and from research results to industry. Over the past few decades, AI has had its ups and downs. Before June of last year, the whole field was still in the downswing of the previous wave. Then, in the second half of last year, two phenomenal applications appeared: text-to-image generation, and the emergence and explosion of large-model technology represented by ChatGPT. These two events carried AI through an inflection point to a new starting point, and that new starting point marks a decade in which large models will lead the development of artificial intelligence.

Thought 1: Large models change the paradigm of AI R&D

Why does the R&D paradigm matter? Because when the research community achieves a breakthrough in a technology, how widely it gets adopted across industries depends closely on the R&D paradigm and on the cost of turning research into products.

So far, the AI R&D paradigm has undergone three stages of change.

*The first-stage paradigm: training a domain model from scratch

The first stage was to train a domain model from scratch. When deep learning first took off, everyone thought about how to use the data at hand and large amounts of computing resources to train a model end to end, and then deploy it in a particular industry. Because this demands a great deal of data and computing power, and because full-stack AI talent in particular is expensive, the paradigm could not be sustained.

*The second-stage paradigm: transfer learning with a pre-trained model + fine-tuning

In 2014, papers describing the transfer-learning technique of pre-training plus fine-tuning appeared at several top AI conferences. Using an image library of more than 10 million pictures covering some 20,000 common object categories, researchers trained a general-purpose visual classification base model; by today's standards it was a small-to-medium-sized model. On top of it, we could then continue training with data from our own fields, such as medical image analysis or industrial defect detection. This process is transfer learning from a general domain to a specific one. Seen from today, it is like a junior-high-school graduate becoming a skilled specialist after three years of vocational training.

Thus the R&D paradigm entered its second stage: starting from a pre-trained base model and fine-tuning it with a small batch of data and a modest amount of computing power, different models could be produced and deployed by enterprises in different scenarios. In this paradigm, an industry enterprise only needs to handle part of the work, such as data collection and processing, model training, and model serving; measured in people, hardware, and money, the required investment drops several-fold, even ten-fold.
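To make the second-stage recipe concrete, here is a minimal sketch of pre-trained model + fine-tuning in PyTorch/torchvision. The defect-detection dataset path and the five-class head are hypothetical placeholders, not the specific setups mentioned in the talk.

```python
# Second-stage paradigm sketch: reuse a backbone pre-trained on a large general
# image corpus and fine-tune it on a small domain dataset (here, a hypothetical
# industrial-defect photo folder with 5 classes).
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Swap the general 1000-class head for the target domain's classes.
model.fc = nn.Linear(model.fc.in_features, 5)

# Freeze the backbone so only the new head is trained, cutting data and compute needs.
for name, p in model.named_parameters():
    if not name.startswith("fc."):
        p.requires_grad = False

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("defect_photos/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # a handful of epochs is often enough for transfer learning
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```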

Transfer learning in computer vision drove the rise and fall of AI over the past decade. Looking back today, that whole period was the era of small models.

From 2013 to 2015, thanks to the emergence of transfer learning, deep-learning-based computer vision analysis became highly sought after across many fields. Another landmark event came in the 2015 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) image-classification competition, where the ResNet network brought the classification error rate down to 3.57%, surpassing human recognition ability (about 5%). Because of these two milestones, artificial intelligence was widely expected to succeed at scale. Many AI companies, including SenseTime, Yuncong (CloudWalk), YITU, and Geling Shentong (DeepGlint), were founded around that time and were broadly pursued by investors.

But beginning in 2017, artificial intelligence slowly slid down from its peak.

In 2017, more than 4,000 AI companies were founded worldwide and received financing in a single year. By 2020 that number had fallen to 600-700, so much so that in the past year or two there was even widespread talk of an AI bubble bursting.

Why share this? Seeing another new decade of AI begin, practitioners need to think hard: why did the previous decade, despite such high expectations, fail to land across industries as widely as imagined? And in the coming decade, what must be done right so that the new wave of technology keeps developing after the tide comes in, rather than quickly receding?

*The third-stage paradigm: foundation large model + application prompts


The third-stage R&D paradigm shapes generalist large models and reduces application costs for downstream enterprises

In the current third-stage R&D paradigm, the foundation large model at the base is all-important. First, it is trained on massive pre-training data, usually more than 100 billion tokens. Second, the parameter count is very large: billions of parameters are the entry level, and it often reaches tens of billions or even hundreds of billions. Third, the computing power required is correspondingly greater. This foundation model learns a wide range of general knowledge and acquires capabilities such as understanding, generation, and even emergent abilities. What foundation models of this kind can be seen in the industry today? For example, GPT-4, GPT-3.5, LLaMA, and KLCII's newly developed Aquila (Skyhawk), among others. The most important capability of the foundation model is prompt learning: much like a person, it can learn a task from a few examples given in the prompt.

In the third-stage paradigm, many downstream industry enterprises do not even need the second-stage step of fine-tuning a model; the work is reduced to simply making API calls, cutting costs even further and suiting a wide range of application fields. After ChatGPT came out, people tested it with professional questions from many human domains, including law, medicine, politics, and the AP courses of the United States, and it did well, like a generalist. It sounds wonderful.
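From the application side, the third-stage paradigm can shrink to a single API call. The sketch below uses the common OpenAI-style chat-completions interface as an illustration; the endpoint, model name, and the legal-assistant system prompt are assumptions, not a recommendation from the talk.

```python
# Third-stage paradigm sketch: the downstream enterprise holds no model at all
# and simply calls a hosted foundation/dialogue model over an API.
import os
import requests

def ask(question: str) -> str:
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-3.5-turbo",
            "messages": [
                {"role": "system", "content": "You are a legal-domain assistant."},
                {"role": "user", "content": question},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("In one paragraph, what should a confidentiality clause in an NDA cover?"))
```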

Thought 2: How do large models land in industry?

How do large models land? Only by taking this step can the hundreds of millions, billions, or even tens of billions invested in large-model R&D truly drive the intelligent upgrading of every industry.

*Foundation model pre-training + continued training + instruction fine-tuning

There are two ways to apply large models: one is prompt learning, and the other is instruction fine-tuning training.

If you rely only on prompt learning, every API call must carry a lengthy, ever-growing prompt, which is hard to sustain in a real product. So when a product truly lands, instruction fine-tuning has to come in. Instruction fine-tuning teaches the model to use the knowledge of the underlying foundation model to complete a specified task, much as an undergraduate who has learned a great deal still needs on-the-job training. It is not very expensive: for one application we built a natural-language-to-SQL capability; prompt learning alone did not work, but only about twenty instruction fine-tuning examples were needed, and the whole effort, environment setup included, took about eight hours.
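As a concrete illustration of how small such an instruction-tuning set can be, here is a sketch of a tiny natural-language-to-SQL dataset in the widely used instruction/input/output layout. The schema, questions, and file name are hypothetical; the speaker's actual data and training pipeline were not disclosed.

```python
# Instruction fine-tuning sketch: a handful of NL-to-SQL examples written to a
# JSONL file that a fine-tuning pipeline could consume.
import json

samples = [
    {
        "instruction": "Translate the question into SQL for the given schema.",
        "input": "Schema: orders(id, customer, amount, created_at)\n"
                 "Question: What was the total order amount in May 2023?",
        "output": "SELECT SUM(amount) FROM orders "
                  "WHERE created_at >= '2023-05-01' AND created_at < '2023-06-01';",
    },
    {
        "instruction": "Translate the question into SQL for the given schema.",
        "input": "Schema: orders(id, customer, amount, created_at)\n"
                 "Question: Which customer placed the most orders?",
        "output": "SELECT customer FROM orders GROUP BY customer "
                  "ORDER BY COUNT(*) DESC LIMIT 1;",
    },
    # ... roughly twenty such examples, per the account above
]

with open("nl2sql_instructions.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")
```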

In fact, the ChatGPT we see today is not a foundation model; it is a dialogue model fine-tuned on a great many instructions, which is why it seems to do everything well. It is precisely because it keeps collecting instructions from people all over the world and is continually fine-tuned on them. Likewise, KLCII's AquilaChat dialogue model is built on the Aquila foundation model and, through instruction fine-tuning, can answer all kinds of questions. For example, June 8 happened to fall during the national college entrance examination, and the model completed that day's exam essay in ten seconds.

But at this point the model has only general capabilities, mainly suited to Internet applications such as casual chat and Q&A. If we want large models to truly serve more of the economy, including the real economy, we must consider how to bring them into professional industries. It is essential to build a domain foundation model by continuing to train the general foundation model on large amounts of professional domain knowledge, just as an undergraduate who has finished a general education then spends one to three years in graduate study.

Overall, then, foundation-model pre-training corresponds to a general undergraduate education, continued training on professional domain data corresponds to graduate study in a specialty, and instruction fine-tuning corresponds to on-the-job training in that profession.


To land in specific industries, the foundation model must also take two further steps: continued training on professional domain data and "on-the-job" instruction fine-tuning

*Overcoming forgetting and hallucination when large models land in industry

After all, the model is trained on hundreds of millions of articles and web pages, and like people it forgets. Research statistics point to two conclusions: first, the larger the model, the better its memory and the larger the share of the data it retains; second, regardless of model size, if the model sees the data only two or three times, it remembers only a few percent of it.

This creates a tension. On one hand, from the standpoint of copyright protection, we may not want the model to remember too well. Training large models requires obtaining huge numbers of articles and works from Internet platforms, and there is still no clear legal answer: if the model reproduces large amounts of identical content because it has read these articles, does that constitute a copyright problem? This remains to be resolved.

From that angle, if the model memorizes only a few percent, the copyright problem is less serious. But when the model actually lands in industry, this becomes a big problem: the model is trained for a long time yet cannot remember what it needs to.

"Hallucination rate" is what we often call a serious piece of nonsense. What is the cause? First, the pre-trained dataset may contain some erroneous information, a lot from twenty or thirty years ago, yesterday and today. Second, it is more likely that the data of the model is pre-trained in hundreds of millions or hundreds of millions of data that does not directly contain relevant information. This leads us to face serious industries, such as medical, financial, legal, etc., and must consider what additional technologies to use to reduce the hallucination rate.

* Large models and small models are bound to coexist in the next decade

I personally believe that large models and small models will certainly coexist over the next decade. There are three important differences between them:

First, in the era of small models, knowledge of the target domain is acquired through transfer learning and fine-tuning; the base model itself has no target-domain knowledge. In the era of large models, the foundation model itself must already contain sufficient professional domain knowledge, and instruction fine-tuning merely teaches the model how to use that knowledge.

Second, the choice is closely tied to the application. Domains with high precision requirements, especially perception tasks, need very accurate results; in medicine, for example, a single image must indicate the grade of a tumor lesion. That calls for a single model with very high accuracy, and it does not need the "zither, Go, calligraphy, and painting" breadth of generalization that large models offer, so such scenarios suit small models.

Third, computing power, infrastructure, and model choice are linked. Important scenarios with tight cost and latency requirements, such as autonomous driving and millisecond-level industrial control, still suit small models, because under communication and latency constraints small models are easier to deploy on low-compute edge devices. Large models are the opposite. The two technologies will blend into each other.

*How do companies on the small-model track fit into the era of large models?


The SAM universal segmentation model released by Meta in March this year has been widely sought after online

Many people ask: do the AI companies and research teams that built small models over the past decade need to migrate to large models? How can they do better with what they already have?

First, algorithms from the small-model era can be updated by folding new large-model techniques into small models. For example, the Transformer architecture is regarded as a signature technique of the large-model era, whereas small models, especially in computer vision, commonly use CNN networks. In one of our experiments we replaced a small-model-era CNN with a Transformer-based ViT vision model and found that, at comparable accuracy, it saved a quarter of the GPU memory in the pre-training stage, its inference latency was only 58% of ResNet50's, and it needed fewer resources during experiments. This genuinely breaks the assumption that large-model technology must be resource-hungry.
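A rough way to reproduce that kind of CNN-versus-Transformer comparison is to load both architectures from torchvision and time inference on the same input, as sketched below. The exact memory and latency ratios reported in the talk depend on the speaker's hardware and settings; this sketch only shows the measurement idea.

```python
# Compare inference latency of a CNN (ResNet-50) and a Transformer-based vision
# model (ViT-B/16) on the same batch.
import time
import torch
from torchvision import models

x = torch.randn(8, 3, 224, 224)

def latency(model, warmup=3, runs=20):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs

resnet = models.resnet50(weights=None)
vit = models.vit_b_16(weights=None)

print(f"ResNet-50 latency: {latency(resnet) * 1000:.1f} ms/batch")
print(f"ViT-B/16  latency: {latency(vit) * 1000:.1f} ms/batch")
```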

Second, new methods can solve problems that were previously hard. For example, Meta's visual segmentation model SAM, released in March this year, can accurately segment all kinds of objects in view. The technique can be used to count goods in supermarkets, warehouses, and so on, which used to be difficult or required stacking several complex techniques. I know some small-model companies have already put SAM into production.
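For reference, the public SAM release exposes an automatic mask generator, which is enough to sketch the goods-counting idea: segment everything in a shelf photo and count the masks. The checkpoint path, image file, and area threshold below are illustrative; a real deployment would add filtering to exclude shelves and background.

```python
# Goods-counting sketch with Meta's open-source SAM (segment-anything) model.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("shelf.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # one mask dict per segmented object

# Drop tiny fragments before counting.
items = [m for m in masks if m["area"] > 500]
print(f"Approximate item count: {len(items)}")
```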

Third, there are small models within the large-model family. Our newly released AquilaChat (Skyhawk) dialogue model, with only 7 billion parameters, can run in 4 GB of memory using int4 quantization, and edge-side chips from domestic vendors already come with 8 GB of video memory. So under the large-model wave, many companies on the AI small-model track can absolutely find new vitality.
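A generic way to try a ~7B dialogue model in 4-bit precision is the HuggingFace transformers + bitsandbytes path sketched below. The model ID is assumed to be the public AquilaChat-7B release, and whether that checkpoint supports this exact loading path is an assumption; KLCII's own int4 tooling for edge chips may differ.

```python
# 4-bit loading sketch for a ~7B dialogue model on a small GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "BAAI/AquilaChat-7B"  # assumed HuggingFace model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # int4 weights
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("请用一句话介绍北京。", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```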

Thought 3: The importance of building foundation large models


In the speech, Lin Yonghua illustrated the points vividly with KLCII's own experience of building large models

The most important part of a large model is the foundation ("pedestal") model underneath. Building the foundation model is as important to AI as the CPU is to a computer.

*The investment is heavy: a model with tens of billions of parameters often costs tens of millions of yuan

First, apart from chip design and CPU tape-out, the foundation model has become the single product with the largest investment in the AI large-model era. Industry figures, including our own large-model R&D, bear this out: a model with 30 billion parameters, counting data, training, evaluation, and all the manpower, hardware, and computing power, costs about 20 million yuan; a model with hundreds of billions of parameters costs about 40 million yuan or more. Training one model therefore routinely runs to tens of millions, a very high investment.

Second, the foundation model determines the key capabilities of every downstream model. You will notice that different chatbots differ: some speak only English, some can program and some cannot, some know more science, some can interpret images. These abilities are determined by the foundation model underneath; only capabilities added during pre-training can show up in the dialogue model.

The foundation model largely determines the later model's capabilities, how it lands in industry, and more. In terms of capability, a large model's understanding, emergence, and in-context learning are all set by the structure and scale of the foundation model. In terms of knowledge, both general and professional knowledge are learned during foundation-model training.

*Guaranteeing values starts with a clean corpus

Third, from the perspective of compliance and safety: for a content-generation model, whether the generated content is positive and free of bias and ethical problems is largely determined by the foundation model. How does the foundation model acquire human values? Through the training corpus. Research institutions and companies at home and abroad usually train foundation models on the Common Crawl corpus, the world's largest collection of Internet training text. But only a small share of it is Chinese, and of all the Chinese data, only 17% comes from websites and URLs hosted in China; the great majority of the Chinese corpus comes from other countries and regions. Much good domestic Chinese content does not appear in it at all. In our judgment, training a Chinese-capable foundation model on such a dataset carries considerable risk.

*Commercially licensable foundation models benefit more enterprises

Fourth, from the perspective of copyright and commercial licensing, many models are either closed source or open-sourced under non-commercial licenses. That does not affect academic research, but enterprises cannot use them for subsequent commercial business. Why do we keep advocating open source, and even grant users a commercial license when we open-source? KLCII hopes to open these resource-intensive models so that more companies can use them. By our count, from January to May this year, 39 new open-source language models were released abroad, 16 of which could be used commercially, while only 11 open-source language models were released domestically, and only 1 dialogue model carried a direct commercial license.

From another angle, foundation models are the more valuable contribution to the whole industry. Many domestic teams have open-sourced large models, but how many are genuine foundation models? By our count, as of the end of May, only 5 of the open-source language models released abroad were foundation models, and only 2 of those released in China were: Fudan's MOSS and Tsinghua's CPM-Bee.

*KLCII's development principles: Chinese-English bilingual capability + open-source models

As a non-profit research institution, we advocate investing more in three things. First, support for a Chinese-English bilingual foundation model, with genuine bilingual ability rather than reliance on translation; a great deal of knowledge exists in Chinese and must be trained into the model directly, since much of it cannot be brought in through translation. Second, support for commercial licensing agreements, which spares many companies from duplicating the resources needed to build a foundation model. Third, meeting domestic data-compliance requirements, and in particular including excellent, high-quality Chinese content. Because we see so much unclean corpus in foundation-model pre-training, we are very cautious when building ours: the Chinese corpus comes entirely from data KLCII has accumulated since 2019, and more than 99% of it comes from websites within China. The advantage of domestic website sources is that they all hold ICP licenses, which also regulates the reliability and credibility of their content.

A code model is very important for the industrial landing of large models, with broad application prospects. Building on the strong foundation capabilities of Aquila-7B, we efficiently trained an excellent Chinese-English bilingual code model with comparatively little code-training data and a small parameter count. We have completed code-model training on both NVIDIA and domestic chips, and by open-sourcing the code and models supporting different chip architectures, we hope to spur chip innovation and a flourishing ecosystem. As the demonstration showed, the code model lets us enter a simple description and automatically produce, for example, a simple login page or a plot of the sine function. KLCII is also exploring how these code models can take on bigger tasks, such as assisting the implementation of new compilers, which could change deeper layers of R&D in computing.
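Used as a coding assistant, such a model typically takes a short natural-language comment and continues it with code, as in the sketch below. The model ID is an assumption about the open-source AquilaCode release name, and the prompt format is illustrative.

```python
# Code-generation sketch: prompt a bilingual code model with a one-line description.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BAAI/AquilaCode-7B-NV"  # assumed name of the NVIDIA-trained release
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

prompt = "# Write a Python function that plots y = sin(x) on [0, 2*pi] with matplotlib\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```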

Thought 4: In the era of large models, evaluation has become extremely important


At the beginning of the lecture, Professor Oslie Yeung delivered opening remarks, arguing that technological development is a necessary stage on the way to a humanized society

Large-model training must focus on two ends: data at one end and evaluation at the other.

Why is evaluation important? A model with 30 billion parameters consumes about 100,000 yuan of computing power per day, which is very expensive. Also, precisely because the model is so large, every detail of the whole process needs attention; once a problem appears, it must be found and corrected in time.

*The subjective and objective sides of capability evaluation are not yet fully solved

In addition, the capabilities of large models are complex, and it is hard for a single indicator to capture them all, so many evaluation methods and evaluation sets must be used together. Once large-model training has stabilized, instruction fine-tuning begins, followed by iterative loops of adjustment. If only objective, machine-scored evaluation is used in this process, it is hard to see the model's subjective generation ability accurately and systematically, so subjective evaluation has to be added as well, and so far subjective evaluation can only be done by humans. We also tried pairing ChatGPT with human evaluation, but on many test cases the deviation was still large.
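The objective half of such a loop is mechanically simple, as the sketch below shows for multiple-choice items scored by exact match; the generate function stands for any model call, and the two sample items are illustrative rather than drawn from FlagEval. It is the subjective, open-ended half that still needs human raters.

```python
# Objective evaluation sketch: score a model on multiple-choice items.
def generate(prompt: str) -> str:
    # Placeholder for a real model call (local inference or an API).
    return "A"

eval_set = [
    {"question": "水的化学式是什么？ A. H2O  B. CO2  C. O2", "answer": "A"},
    {"question": "Which sort runs in O(n log n) on average? A. Quicksort  B. Bubble sort", "answer": "A"},
]

correct = 0
for item in eval_set:
    prediction = generate(item["question"] + "\nAnswer with the option letter only:")
    if prediction.strip().upper().startswith(item["answer"]):
        correct += 1

print(f"Objective accuracy: {correct}/{len(eval_set)}")
```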

Finally, the preferred model should also go through red-team evaluation: find a group of people who were not involved in developing the model to play the role of users and ask it all kinds of questions, including malicious and tricky ones, to assess how it behaves. OpenAI conducted months of such evaluation before releasing ChatGPT to secure its current results.

To evaluate language models more comprehensively and systematically, KLCII has built the FlagEval ("Libra") evaluation system, which includes 22 objective and subjective evaluation sets in Chinese and English with more than 80,000 evaluation items. On the latest evaluation, AquilaChat achieved the best results using only about 50% of the training data of comparable models. However, because its English training data so far amounts to only 40% of Alpaca's, it temporarily lags behind Alpaca, which is instruction-fine-tuned on LLaMA, on objective English evaluations; as training continues, we believe it will soon catch up and overtake.

*Cross-modal image-text evaluation drives the development of foundation models

Evaluation plays a very important role in the development of large models and is also a key driver of their progress. Take cross-modal image-text evaluation as an example: on simple image-text tasks, good models have basically reached or exceeded human level, scoring between 70 and 90 points. But on slightly more complex image-text tasks, large models score only 10-11 points. Cross-modal image-text discrimination, especially where logical understanding is required, is where the gap between large models and human ability is enormous. Evaluation is therefore a key lever: by adding more complex evaluation items, we hope to pull large models toward the more complex scenarios that humans need.

*Evaluation has evolved toward cognitive ability and human-like thinking


Evaluation has now entered its third and fourth levels: cognitive ability and human-like thinking ability

Large models have entered everyone's field of vision since last year, and their capabilities have developed rapidly. The difficulty of evaluation has climbed along with them, which is like continually lengthening the ruler in order to measure the models' abilities. Foundation-model R&D routinely costs tens of millions, so most startups, AI companies, and downstream enterprises will no longer train a model from zero; instead they will choose an open-source or closed-source large model from the market and build on it. How should that selection be done? This is a very important way in which evaluation in the large-model era affects industrial adoption.

As large models' abilities improve, evaluation has evolved through four stages:

First, understanding ability. For the past decade or two, AI evaluation has centered on comprehension, whether in computer vision or natural language processing.

Second, generation ability. Now that there is AI-generated content, evaluation has to rely on human subjective judgment. It is hard to fully guarantee consistency and objectivity in subjective evaluation, and we are gradually introducing AI-assisted methods to help.

Third, cognitive ability. People no longer regard large models as mere language models that can talk and write; they want to see various intellectual and cognitive abilities. The bigger challenge for evaluation is therefore how to characterize human cognitive abilities. In addition, many people now test these models with exam questions, but many of those questions have already leaked into the models' training corpora, so such evaluations of cognitive ability are biased.

Fourth, human-like thinking ability. Harder still, many people want the model to understand and reason more like a human mind. How to evaluate a model's capacity for thought therefore requires interdisciplinary collaboration.

Thought 5: In the era of large models, KLCII's mission, craftsmanship, and curiosity


Curiosity and craftsmanship are the two wings of the mission

KLCII is a non-profit R&D organisation with nearly 200 full-time researchers. In the era of large models we see all kinds of practical and technical problems that urgently need breakthroughs. Whether it is text-to-image applications or ChatGPT, none of them can exist without the large-model technology stack accumulated beneath the iceberg, and that is the part KLCII has committed itself to building: foundation models, datasets, data tools, evaluation tools, even AI systems and cross-chip technologies. Our mission is not only to build the technology stack below the waterline, but to open-source all of it in commercially usable form, so that both code and models can feed back into industry and academia. We also hope more academic and research teams will join us in contributing to open source; co-innovation across disciplines inside and outside AI is especially crucial.

The era of large models requires science and engineering to advance hand in hand. On one hand, every large model must be forged with a craftsman's spirit, with every step carefully done, whether data, training, or evaluation. On the other hand, large models hold too many unknowns, which must be explored with the curiosity of one chasing the stars and the moon. Only by exploring well can the technology stand firm in industry and keep developing steadily through the next decade after the tide comes in.

Transcript: Li Nian, Jin Meng

Author: Lin Yonghua (Vice President and Chief Engineer, Beijing KLCII Institute of Artificial Intelligence)

Photos: on-site photography by Zhou Wenqiang; production by Hu Yang; PPT images used with the speaker's authorization

Editor: Li Nian