
Technical principles related to GPT

Author: Coffee Tea Talks AI

Over the past half year, ChatGPT has taken the public conversation by storm and drawn broad attention from every walk of life. Today I will introduce GPT-related technology from three angles: what GPT brings us, its working principles, and its future development trends.

Let us first discuss what GPT brings us, starting from the overall line of development, which begins with convolutional neural networks in 2012. Since 2012, deep learning — deep neural networks — has become a central research direction in artificial intelligence, and indeed in all of computer science and software engineering. Up until around 2017, the dominant neural-network approach for vision was the family of convolutional neural networks, CNNs for short: from the earliest AlexNet through ResNet, DenseNet (to which Chinese researchers contributed), and others. These focused on images, and partly on video, covering visual tasks such as object detection, object recognition, and semantic segmentation.

A second branch was language-oriented, dominated by sequence models such as LSTMs and RNNs, which supported the analysis and translation of text, sentiment analysis, and similar tasks. At first these two branches — vision and language — developed independently. Then in 2017, Google proposed a new architecture, which we call the Transformer. Its key contribution is the so-called attention mechanism, which is actually a very simple idea: when we humans make sense of the world, we do not treat all data equally; we focus on certain connections in our understanding of the world.

That focus is attention — grasping the key points of a large problem. The Transformer builds attention directly into its model architecture, and this has had a profound impact on natural language processing and even on vision. The most prominent impact is on natural language, where many new Transformer-based models have emerged. Two technical routes are the most representative. One is Google's BERT — the name may seem opaque, but it is actually a Sesame Street cartoon character. BERT has bidirectional self-attention, can capture interrelationships across contexts, and is better suited to language-understanding tasks.

The other is OpenAI's GPT series, also built on the Transformer architecture. It is simpler than BERT in that it has only unidirectional attention, which makes it better suited to text generation. Over the five or six years from 2017 to now, the BERT and GPT architectures have competed, and the GPT architecture has clearly gained the upper hand — especially from GPT-3 onward. GPT-3 was proposed in 2020, followed by GPT-3.5 (the now-familiar ChatGPT) and then GPT-4, each a marked improvement over Google's offerings, establishing a distinctive lead.

At the same time, we should also watch the visual field, which has long been CNN territory: the Transformer is now being used to redesign and reconstruct vision models as well. In recent news there was a new Transformer-based architecture for semantic segmentation, and there are similar systems in China.

Okay, so let us return to the GPT series and its development route over the past five years. GPT-1 was released in 2018, GPT-2 in 2019, and GPT-3 in 2020. For these, the model architectures, parameter counts, and training data were relatively public — not that they could be downloaded as open source, but at least everyone knew the internal structure and the training methods used. The earliest GPT-1 had about 100 million parameters; GPT-2 grew to 1.5 billion; GPT-3 to 175 billion. Training data also grew from about 5 GB to 45 TB, covering a very large natural-language corpus. For ChatGPT (GPT-3.5) and the recent GPT-4, the model parameters and training volume are no longer published; what everyone sees is the capability as experienced.

The development of these models illustrates what we often call the paradigm of technological development in the intelligence era: a compound of large compute, growing data, and ever-larger models. Whoever can organically integrate these three things, and control the complexity of the highly technical engineering involved, ultimately wins.

Refining this timeline: although the Transformer was proposed in 2017, the period before 2020 could be called the era of "everyone refining their own models." What does that mean? Organizations large and small, driven by their own business needs, picked a suitable network architecture — CNN, RNN, or Transformer — annotated their own data, and trained their own models, producing a very rich model ecology.

But since 2020, this has shifted from "refining many models" to "refining large models" — the emphasis moving to scale. (We will come back to the new technical problems that training such large models brings.) OpenAI had GPT-3; Google had the Switch Transformer, and so on. Then from 2021, and especially since ChatGPT, it has shifted again, from "refining large models" to "everyone refining large models": nearly everyone is in the game, everyone is training, and who wins is an interesting question. China has in fact been preparing since about 2020: Baidu's Wenxin (ERNIE) models, the WuDao ("Enlightenment") series from the Beijing Academy of Artificial Intelligence together with Tsinghua, the Pangu series built jointly by Huawei and Shenzhen's Pengcheng Lab, Alibaba DAMO Academy's M6 series, the somewhat smaller multimodal Zidong Taichu series from the Institute of Automation, and even Inspur's own model — so it really is "everyone refining large models."

In this era of everyone training large models, the Transformer architecture, capable as it is, is still not easy to control. It is a stacked architecture containing both feed-forward layers and the very flexible connectivity of multi-head self-attention. Training it demands enormous compute and data, and above all enormous memory: attention is a matrix relationship over the sequence, so its memory grows quadratically with sequence length rather than linearly. Large models — at least tens of billions, hundreds of billions, even trillions of parameters, as everyone knows — therefore cannot be trained on a single card; many cards must train simultaneously.
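To make the quadratic growth concrete, here is a minimal sketch; the head count and the fp16 byte size are illustrative assumptions, not the actual figures for any GPT model:

```python
def attention_memory_bytes(seq_len, num_heads, bytes_per_value=2):
    # Each attention head materializes a seq_len x seq_len score
    # matrix, so memory grows with the *square* of sequence length.
    return num_heads * seq_len * seq_len * bytes_per_value

# Doubling the sequence length quadruples the attention memory:
short = attention_memory_bytes(1024, 96)   # ~0.2 GB at fp16
long = attention_memory_bytes(2048, 96)    # ~0.8 GB at fp16
```

And this is only the score matrices for one layer of one batch element — one reason a single card cannot carry the whole training job.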

How is this done? The neural network is ultimately compiled into a large computational graph — a directed acyclic graph — which is partitioned across the cards so that each card is responsible for a different portion of the graph. Training is then an iterative process. On one hand, each card performs very intensive computation: GPUs run hot during training, and if the algorithm is designed badly they can be pushed beyond their safe operating range. On the other hand, there is a very strong communication requirement: because one large graph has been cut into subgraphs for training, the cards must exchange data intensively and frequently.

With that overall background, let us review OpenAI, the company that built the GPT models. Its history is actually fairly short; in essence it is still a technology startup. Its best-known founder is Sam Altman — not a pure technical expert but, like Musk, a technology venture capitalist: someone who understands technology and is very good at raising investment and getting startups off the ground.

The main purpose in founding OpenAI was Google's rapid advance. We know that in 2016, Google DeepMind launched the AlphaGo series, which achieved outstanding results in the world of Go. Some Silicon Valley investors and companies felt threatened by this and did not want Google to hold a completely dominant position in the booming field of artificial intelligence. So they set up a technology company to compete with Google, and called it OpenAI.

At the beginning, Musk was one of the founders, with a commitment said to be $1 billion of investment. But by 2018 the group felt that, after all this effort, the competitive advantage over Google was still not obvious. Musk told Altman and the rest of the board that the company should be managed by him in the future, and that he could certainly manage it — but the others did not buy his management style (you can find plenty of news about that). After a dispute within the board, Musk left it. By then he had invested only about $100 million; the remaining funding gap was never filled, as he put in no more money.

Microsoft saw a good opportunity here, since OpenAI had been founded to compete with Google. Microsoft injected $1 billion — not purely cash, as about half of it took the form of Azure cloud-computing resources. This support with enormous compute played a crucial role in the later success of the GPT series. Training a model like ChatGPT is said to have used on the order of 10,000 A100 GPUs; without that scale of compute, it is hard to imagine such success.

We can also look at how Google and OpenAI have competed over the years in areas such as reinforcement learning, language models, images, and automatic code generation. In reinforcement learning — from the AlphaGo series through protein-folding prediction with AlphaFold 2 and multiplayer games with AlphaStar — Google has always led; the company's accumulated capability in this field is simply too strong. OpenAI has done real work here too — OpenAI Five has had genuine influence in academia — but it could not shake Google's technical advantage in the field.

The next two areas are different, especially language models. From GPT-1 to GPT-2 to GPT-3.5, and especially with 3.5, OpenAI moved beyond Google's stature. Google has actually done well — BERT and T5, mentioned above, achieved good results — but when it came to pushing large models into real applications, letting more users participate and building the so-called data flywheel of continuous improvement from user input, Google has been somewhat conservative.

The conservatism is not unreasonable. For an enterprise that large, rapidly deploying something this new — and this prone to hallucination — into, say, its search business is extremely risky. Even today, the GPT series, GPT-4 included, still outputs plenty of wrong and incorrect content, an unbearable risk for Google. We all saw the news: a Google engineer was fired after claiming, on the basis of his conversations with LaMDA, that LaMDA was sentient, and the story ran everywhere. So it is understandable that Google took a relatively cautious and conservative attitude toward deploying language models — but that is exactly what cost it the opportunity in the competition with OpenAI, leaving it in a relatively disadvantaged position.

On visual images, especially with the recent rise of diffusion models, OpenAI's DALL·E series is stronger than Google's, and combining this kind of AIGC with ChatGPT gives OpenAI clear strength in the field. The last area is coding — automatic code generation. Through its alliance with Microsoft, OpenAI has access to the world's largest open-source resource, GitHub, and so Codex and GitHub Copilot are ahead of Google's AlphaCode in application maturity and stability.

Taken together, OpenAI is currently surpassing Google in many respects — a very interesting phenomenon when you consider that Google employs tens of thousands of people while OpenAI has only about 100 core developers. Many of you will have tried the GPT series yourselves, and its capability is all-around. Ask it to write impressions of the pandemic in the style of Lu Xun, and it can. Ask it to write a piece of code that compresses files to your specification, and it writes it fluently. In an office where you constantly fill out and generate forms, just tell it what you need and it generates them for you. Ask it for an English poem that rhymes — or, to be deliberately difficult, one that does not rhyme — and it can write either. Its language ability is very strong.

The same goes for the typical natural-language tasks of academia and industry: text summarization (outputting a short description of a document), sentiment analysis, word segmentation, extraction of the named entities mentioned in a text, Chinese–English translation — all completed easily. GPT's all-around, seemingly omnipotent ability makes it a candidate entrance to the next general-purpose artificial-intelligence information system, which is a historically significant change. In the PC era, the main entrance to our information was an operating-system platform like Windows; in the mobile-internet era, it was the various apps on Android or Apple phones. In both cases the core — the real entrance to the system — was the operating system.

The AI field, by contrast, has been decentralized, with no such unified entrance: there is no real AI operating system comparable to Windows, Android, or iOS. (Some people advertise that they have built an "AI operating system"; that is not true.) Now something like ChatGPT — what we call a foundation model, a base model with strong capabilities — does present a unified information interface for the intelligent era. You feed it massive data, most of it unlabeled: through self-supervised learning it generates its own labels, and the large language model fully absorbs the concepts, patterns, and other regularities within. On top of that it produces the very rich abilities just described, and on top of those abilities all kinds of applications can be added. So it really is a new operating system for the AI era — or rather, the interface of the system. A paper from abroad, "Language Models are General-Purpose Interfaces," has proposed this explicitly: the role of an information portal. Meanwhile, the industrial ecology around ChatGPT is developing rapidly. ChatGPT is the biggest phenomenon since the internet itself: past internet products took years to reach 100 million users, while ChatGPT did it in two months, something never seen before. And the plugins around ChatGPT — its plug-in applications — are multiplying rapidly.

People are building all kinds of things on it: code assistants, voice companions, machine translation, AI customer service, and so on. Because it is a universal human–computer interface, an industrial ecology around ChatGPT is taking shape and forming a disruptive service model. In the imaginable future, we may no longer need today's information interfaces — taking out a phone, typing into an app, choosing from menus. You may simply speak to the device, express your intent entirely through natural language, and the system will understand you, call the appropriate plugins in the background, and invoke the corresponding functions to complete the service you need.

So we can say we are on the way to the threshold of a whole new era of intelligent services, and all of this marks the arrival of an era of general artificial intelligence: ChatGPT has exhibited a certain degree of human-like language intelligence and cognitive intelligence. Moreover, the capabilities of ChatGPT and GPT-4 are multifaceted and general-purpose; they form a powerful artificial-intelligence base that is opening this new era of general AI. More advanced products and technologies will follow in this era — just wait and see.

Next, some basic principles of large language models. The earlier part was a relatively macroscopic look at capabilities and trends; here, at the level of principles, we must discuss the Transformer architecture, which is a bit technical. You can understand the Transformer as a codec: it is composed of an encoding stage and a decoding stage. Encoding means feeding our language (or images) into the system as a vectorized representation — a bit like cataloguing things, though in a very high-dimensional code. The encoding pipeline consists mainly of fully connected layers — in neural-network terms, the feed-forward network. These fully connected layers have a strong function-approximation ability (they can fit essentially any function), which makes them well suited to memorizing concepts. Interleaved with them is the self-attention mechanism.

Self-attention is very good at linking concepts organically together — that, in short, is the role of the encoder. The decoder side then decodes the encoded information back into the output you need. The Transformer, simply put, is a stacked architecture, and under this unified definition many different categories have evolved. One typical category is encoder-based — BERT, just discussed, and others. Encoders can form bidirectional associations, which suits the analysis and understanding of a text's many semantics. The GPT series is decoder-based: unidirectional, also called an autoregressive language model, and well suited to generation — essentially a text-continuation task. There are also representative models that combine the encoder and decoder parts. Let us focus on GPT-2, with its 1.5 billion parameters, and make a simple comparison with BERT.
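The attention computation itself is simple enough to sketch in a few lines. The toy version below is a simplification: it uses pure Python and identity Q/K/V projections instead of the learned projection matrices a real Transformer has, but it shows the core idea — each token's output is a similarity-weighted average over all tokens:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections.

    X is a list of token vectors. Each output vector is a weighted
    average of all token vectors, weighted by scaled dot-product
    similarity -- the "focus" mechanism described above.
    """
    d = len(X[0])
    out = []
    for q in X:                      # each token attends to all tokens
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)    # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])
    return out
```

Because every token scores itself against every other token, the score computation is where the quadratic cost discussed earlier comes from.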

GPT-2 is all decoder, stacked layer by layer; BERT (the cartoon character) is all encoder, likewise stacked layer by layer. Through continuous stacking, these models can absorb more information — though each additional layer also adds many parameters. Here is a simple example of how GPT-2 does text continuation. Suppose you type the line below — "recite the first law" — and ask the machine to predict the most likely next word. This is one of the tasks the GPT series is best at. Internally, the words that have appeared in the sequence so far are encoded and sent into the stack of decoder layers; each layer combines very complex patterns to discriminate among and predict the probabilities of candidate next words, the layers work in synergy, and finally they jointly "vote" on which word is actually output.

In this example, the preceding text reads "A robot must obey," so what is the most likely next word? Probably "orders" — and that is what it guesses. Training this ability relies mainly on the so-called self-supervision mechanism, which is essentially a cloze task: blindfold the model to the next word it should guess, then reveal the real answer. Done over a huge corpus, this back-propagates through and trains the parameters of every Transformer layer, so that through this fill-in-the-blank procedure the model learns, from the embeddings of the corpus, to accurately predict the probability of the next output.

This may be too small to see, but the model predicts a probability for every word in the vocabulary; from that pile of probabilities it may pick the largest one to output. That is the basic way the GPT family works. The differences from GPT-2 to GPT-4 start with a great increase in the number of stacked layers — of course, ChatGPT and GPT-4 have not published their architectural parameters, so no one knows for sure, but you can assume they added more layers than GPT-3.
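Picking the output word from that pile of probabilities can be sketched as a softmax over candidate scores followed by an argmax (greedy decoding). The candidate words and their scores below are made-up toy values, not real model outputs:

```python
import math

def next_token(logits_by_word):
    """Greedy decoding: softmax over the vocabulary, pick the argmax.

    logits_by_word maps candidate words to raw scores (logits),
    as would be produced by the final decoder layer.
    """
    m = max(logits_by_word.values())
    exps = {w: math.exp(s - m) for w, s in logits_by_word.items()}
    z = sum(exps.values())
    probs = {w: e / z for w, e in exps.items()}
    return max(probs, key=probs.get), probs

# Toy scores for continuing "A robot must obey ...":
word, probs = next_token({"orders": 3.0, "the": 1.5, "robots": 0.5})
```

Real systems often sample from the distribution instead of always taking the argmax, which is why the same prompt can yield different continuations.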

Look at the numbers: GPT-1 stacks 12 layers, GPT-2 reaches 48, and GPT-3 96 — close to 100. With different layer counts, the number of attention heads per layer also differs, and the training data has grown as well; the more layers stacked, the stronger the ability. That is the first difference.

The second is that the context representation GPT uses to predict each next word has also grown dramatically: roughly 1,600 dimensions in GPT-2 versus 12,288 in GPT-3, and ChatGPT and GPT-4 have surely expanded the context further. On top of these capabilities, three very important techniques in ChatGPT underwrite its ability, and play an especially important role in the emergence of its cognitive ability.

The first is prompt tuning. The second is instruction tuning, together with the chain of thought. The third is reinforcement learning based on human feedback — RLHF in English. Let us look at them one by one, starting with some of the most basic concepts.

Machine learning is mainly divided into three categories. The first is supervised learning. Simply put: if you want to train a model to recognize two kinds of flowers, how do you do it? You photograph the two flowers under different lighting and angles and label each image — this is a sunflower, that is a chrysanthemum — generating labeled training data, and then train the model on that data. The model is in fact a classifier that divides data into the two categories. This is supervised learning.

The second category has no labels: here is the data, learn from the characteristics of the data itself. This is unsupervised learning. The third is reinforcement learning: again no labels, but not nothing either — you are given an environment, say a game, and the agent learns within it by trial and error. This is the AlphaGo type: in Go it ultimately trained not against people but against itself, like the wuxia character Zhou Botong sparring with his own left and right hands, and through this self-play reached a very high level.
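The flower example above is a how-to for supervised learning, and can be sketched as a tiny nearest-centroid classifier. The 2-D "features" standing in for real image features, and the exact labels, are invented for illustration:

```python
def nearest_centroid_fit(samples, labels):
    """Minimal supervised learning: average the labeled examples of
    each class (sunflower vs chrysanthemum) into one centroid per
    class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def nearest_centroid_predict(centroids, x):
    # Classify a new sample by its closest class centroid.
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

flowers = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
names = ["sunflower", "sunflower", "chrysanthemum", "chrysanthemum"]
centroids = nearest_centroid_fit(flowers, names)
```

The essential point is the labels: without the human-provided `names`, this training procedure has nothing to fit — which is exactly the cost that self-supervised learning, discussed below, avoids.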

For large language models, people proposed self-supervised learning. Why can't supervised learning be used directly? Because its annotation cost is far too high. Imagine taking the hundreds of billions of words of corpus in the world and labeling each of them for every different task — the money, labor, and time required make it unrealistic. Self-supervised learning uses a similar but cheaper mechanism. Take an image example: to train a machine to recognize a handwritten character, cover part of it with a mask and let the machine guess the hidden part from the part left visible. That is self-supervised learning: the label is generated by introducing a mask on the data itself.

The text example below works the same way: to predict the next word, the next word itself serves as the label — the "posted answer" — which is fed to the language model along with the long run of preceding words. The benefits are obvious: no annotators are needed, since masking by rule generates the labeled data, and so the cost of training drops greatly. You only need to generate this self-supervised data and train the raw model. Such models actually existed even before BERT, but a raw pre-trained model exists only as a basic foundation; it cannot be used directly for downstream tasks.
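The "posted answer" trick can be sketched directly: the training pairs fall out of the raw token stream itself, with no human annotation anywhere:

```python
def make_lm_examples(tokens):
    """Turn raw text into (context, next-token) training pairs.

    The "label" is just the next token in the corpus itself --
    no human annotation needed, which is the whole point of
    self-supervised learning.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = make_lm_examples(["a", "robot", "must", "obey", "orders"])
# One sentence already yields four labeled training examples.
```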

That is to say, once you have trained the model on raw corpus by the self-supervised method just described, and you want to do the various downstream tasks — writing poetry as before, sentiment analysis, classification — you still need what we call fine-tuning. The fine-tuning operation is what actually lifts the downstream model's capability.

For example, suppose you want to ask where the American flag was born. During pre-training, the model was only trained on masked guessing; when a real downstream task arrives, fine-tuning is needed to complete it. When fine-tuning on BERT or T5, after producing the pre-trained model you need to find some data and label it, just as in supervised learning. Supervised training on this labeled downstream-task data, applied to part of the language model, is called fine-tuning, and it yields better capabilities.
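A minimal sketch of this kind of fine-tuning, under simplifying assumptions: the pre-trained encoder is frozen (represented here only by the fixed feature vectors it would output), and a small logistic-regression head is trained on a handful of labeled examples. The features and labels are invented for illustration:

```python
import math

def finetune_head(features, labels, lr=0.5, epochs=200):
    """Fine-tuning sketch: only a small task head (logistic
    regression on the frozen encoder's output features) is trained
    on the labeled downstream data."""
    d = len(features[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(label = 1)
            g = p - y                        # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Toy "encoder outputs" for a two-class downstream task:
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labs = [1, 1, 0, 0]
w, b = finetune_head(feats, labs)
```

In real practice the downstream gradients usually also flow into some or all of the pre-trained layers; freezing everything but the head is the cheapest variant of the same idea.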

But since entering the era of super-large language models, and especially with ChatGPT, fine-tuning has become much simpler: there is no need to collect and annotate a batch of data. Instead you write prompts directly. Where previously you would gather more labeled data and retrain, with prompting you only need to write hints for the pre-trained model, and it learns from the prompts themselves — no labeled dataset of the old kind.

Prompts are really worked examples expressed in language; the model learns through the examples.

The instruction stage is more powerful still: no examples need to be learned — you send instructions directly, just as you would to a person, and the model even acquires capabilities for tasks it has never seen. Train it on tasks B, C, and D, and it will automatically do task A. These, then, are three different ways of adapting a model. To be concrete, take a BERT-style model (the cartoon image again). Given the sentence "this is very cool," mask the word "very" and ask what could replace it — "pretty," "really," "super" are all possible; the model is trained by masking in exactly this way. Under the old tuning regime, the model needed new parameters and new data for each task: fine-tuning separately for task A, task B, and task C to get results.

Later you no longer need that: for task A you write a prompt example, for task B another, for task C another. Send them directly — no training — and the results, the ability, come out directly. Here are some simple examples about movie reviews: "a great movie," "a terrible movie," "artistically painful to watch," and so on. During training, the relevant word is first covered with a mask, the model practices, and then writes.

With a concrete prompt, I only need to send the corresponding information. For example, the instruction tells it to classify movie reviews by score: "great" is positive and "terrible" is negative. After being shown such examples, the model automatically has the ability, with no need to train again.
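A few-shot prompt of this kind is just a carefully formatted string. Here is a sketch of how such a movie-review prompt might be assembled — the `Review:`/`Sentiment:` template is an illustrative convention I am assuming, not a required format:

```python
def few_shot_prompt(instruction, examples, query):
    """Build a few-shot prompt: an instruction, labeled examples,
    then the new input whose label the model should continue."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify each movie review as positive or negative.",
    [("A great movie.", "positive"), ("A terrible movie.", "negative")],
    "Artistically painful to watch.",
)
```

The prompt deliberately ends mid-pattern, at `Sentiment:`, so that simply continuing the text — the one thing a language model is trained to do — produces the classification.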

Instructions are more flexible than the prompts just described. Here is a training example for reasoning that begins with an instruction — say, the goal of sleeping well in summer. Two options follow: put the item in the refrigerator, or put it in the microwave; choose the right approach. The refrigerator, of course — you don't want to get hotter and hotter while trying to sleep in summer. The example on the right is also an instruction, this one teaching translation. You can give the model many other instruction examples, and the next time you ask for a job not covered here at all — say, a natural-language logical-reasoning task it has clearly never been trained on — the system, through the prior instructions, quickly masters the ability. To summarize briefly: a prompt further stimulates a large model's completion ability and belongs essentially to the language-modeling task itself; an instruction sits one level above it, stimulating the model's language understanding through examples of different tasks so that it can master new ones.

Language models also show generalization, the thing AI has pursued most, and this brings in the chain of thought (chain-of-thought prompting) that you often see in the literature: the prompt or instruction is given together with its logical reasoning. Here is an example, a simple elementary-school math problem: Roger has 5 tennis balls, he buys two cans with 3 balls in each, and you ask how many balls he has now. The classic way is just to tell the model the answer and stop. Then the next question changes the scenario: there are 23 apples, some are used and more are bought, and you ask how many apples are left in the end.

This is how we often teach children to do math problems: give an example, explain how to understand it from the problem statement, and then give another. The chain of thought tells the machine the solution step by step. In the earlier example the answer is given directly; in the one on the right, the reasoning is spelled out: Roger starts with 5 balls, the two cans contain 3 balls each, which is 6 more, and 5 + 6 = 11, so the answer is 11.

Training with such chains of thought works much better, and the research finding is easy to understand: if you only tell the model the answer, it can only guess the next question from superficial cues, and it gets it wrong; but if you show it the reasoning process, it can imitate that process when reasoning and arrive at the correct answer. In other words, the chain of thought goes one step beyond the prompt format: it turns the original input-output mode into an input-reasoning-output mode. The prompt template is generally question and answer, and the chain of thought lists the step-by-step solution inside the answer, which helps the model understand the reasoning process and imitate it.

In effect it decomposes a reasoning task into concrete steps, using only a handful of examples and no retraining, so it is well suited to the elementary-school math problems we just saw, as well as common-sense reasoning, logical reasoning, and so on. Studies show that the chain of thought plays a very important role in the cognitive emergence of large models.
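The input-reasoning-output format above can be written out literally. The wording of the two problems follows the standard worked example from the chain-of-thought literature; nothing here calls a real model:

```python
# Chain-of-thought vs. answer-only prompting for the tennis-ball problem.
# Illustrative strings only: the point is the format, input -> reasoning
# -> output, rather than input -> output.

question_1 = ("Roger has 5 tennis balls. He buys 2 cans of tennis balls. "
              "Each can has 3 tennis balls. How many tennis balls does he have now?")

answer_only = "The answer is 11."

chain_of_thought = ("Roger started with 5 balls. 2 cans of 3 tennis balls "
                    "each is 6 balls. 5 + 6 = 11. The answer is 11.")

def cot_prompt(new_question: str) -> str:
    # The demonstration shows the reasoning; the new question's answer
    # slot is left open for the model to complete step by step.
    return f"Q: {question_1}\nA: {chain_of_thought}\n\nQ: {new_question}\nA:"

question_2 = ("The cafeteria had 23 apples. They used 20 to make lunch and "
              "bought 6 more. How many apples do they have?")
print(cot_prompt(question_2))
```

The only difference between the two regimes is whether `answer_only` or `chain_of_thought` appears in the demonstration; empirically that single change is what unlocks the step-by-step behavior.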

Another very important technique is reinforcement learning from human feedback (RLHF). Interestingly, RLHF was not proposed by OpenAI: it came from Google DeepMind in 2017. In problems like the AlphaGo setting just mentioned, an agent needs long-term training, but left to find its way alone, the agent can struggle for a long time, get stuck circling a local optimum, and never reach the best result.

At that point someone needs to provide feedback from the sidelines. By analogy: if you train an agent to walk a maze and it fails to learn a certain step, a person can prompt it, try going left, or you are heading the wrong way. That prompt is the feedback. With it, the agent can avoid getting trapped at the critical point, escape the local optimum, and finally reach a fairly good result.

ChatGPT applies this method to training a natural-language model: people rank the outputs of GPT, and a reward function is generated from those rankings. We know the most important thing in reinforcement learning is the reward function, and designing one is generally considered an art rather than a science: it must be tuned very delicately for each task, and the more complex the problem, the more complex the reward function, so human labeling is often needed. The interesting part is that people cannot give a quantitative reward, but they are very good at ranking: whether one output is better than another can be ordered. From that ranking data, a reward neural network is generated and trained, which then helps GPT make its output meet human needs. That is why it can now speak so much like a person.

Here is an example. Say the question is "What is a banana?", and GPT is asked to produce a paragraph explaining the concept. During training there are many versions of the model, and different versions give different outputs, so OpenAI hired people to evaluate the outputs of the different versions.

The evaluation works as I just said: rank the good answers ahead of the bad ones. After integrating the results of these rankings, a reward function is designed, and that reward function is used to train the model. The training itself runs continuously, like self-play in games, with the model adjusting its own parameters until its outputs maximize the reward function's score. This is what is called reinforcement learning from human feedback.
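The step of turning rankings into a trainable reward signal can be sketched with a Bradley-Terry-style pairwise loss, which is the standard way ranking data is used in RLHF pipelines. Everything below is a toy: the "reward model" is linear over two hand-made feature scores, where real systems use a neural network over the full answer text:

```python
import math

# Minimal sketch: learn a reward function from a human pairwise ranking.
# Loss per pair: -log sigmoid(r(better) - r(worse)), i.e. push the reward
# of the human-preferred answer above the other.

def reward(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def ranking_loss(weights, better, worse):
    margin = reward(weights, better) - reward(weights, worse)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Two candidate answers to "What is a banana?", featurized as invented
# (factuality, helpfulness) scores; the labeler ranked the first higher.
better_answer = (0.9, 0.8)
worse_answer  = (0.2, 0.1)

weights = [0.0, 0.0]
lr = 0.5
for _ in range(200):                    # plain gradient descent
    m = reward(weights, better_answer) - reward(weights, worse_answer)
    p = 1.0 / (1.0 + math.exp(-m))      # P(model ranks the pair correctly)
    grad_scale = -(1.0 - p)             # d(loss)/d(margin)
    for i in range(2):
        weights[i] -= lr * grad_scale * (better_answer[i] - worse_answer[i])

print(reward(weights, better_answer) > reward(weights, worse_answer))
```

Once trained, this reward model replaces the hand-designed reward function, and the language model is then optimized against it with a policy-gradient method.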

It can be said that ChatGPT takes many artificial-intelligence methods and makes a grand synthesis of them. Some critics keep saying that nothing in ChatGPT is original, that there is no theoretical innovation, nothing new, that every piece has been seen before. But putting these pieces together organically at such a data scale and such model complexity, and pushing them to the extreme, is genuinely difficult, and the importance of this integrative innovation should not be underestimated. What is currently known is that it used some 45 terabytes of data, nearly a trillion words, equivalent to the word count of about 13.51 million Oxford dictionaries, a staggering number, plus billions of lines of source code. One estimate, which may not be accurate, is that the world's existing high-quality text amounts to about 5 trillion tokens, of which the ChatGPT series has used about 1 trillion, one fifth. So there is a prediction that within two or three years, essentially all the corpus the world can see will have been trained into large models, and that truly is a new era.

Then there is the parameter scale and the training itself. As just mentioned, training used on the order of 10,000 V100 cards, an investment of roughly 1 billion yuan (about 100 million US dollars), and on top of that, training burns money in electricity, since GPUs are very power-hungry even sitting there. The compute budget is about 3,640 PetaFLOP/s-days (the figure commonly cited for GPT-3), and you can work out for yourself what the electricity costs. Training a model like ChatGPT starts in the billions of yuan, so it is a very expensive undertaking. With that, we can summarize some basic characteristics of ChatGPT.
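The quoted figures can be sanity-checked with back-of-the-envelope arithmetic. Every input below is an assumption: 3,640 PetaFLOP/s-days of compute, V100 cards at roughly 125 TFLOP/s peak tensor throughput with ~30% utilization, ~300 W per card, and an electricity price of ~0.1 USD/kWh:

```python
# Back-of-the-envelope check of the training-cost figures quoted above.
# All constants are rough assumptions, not official numbers.

PFLOPS_DAYS = 3640
total_flops = PFLOPS_DAYS * 1e15 * 86400        # ~3.1e23 FLOPs in total

v100_peak = 125e12                               # FLOP/s, tensor cores
utilization = 0.30                               # optimistic sustained rate
n_gpus = 10_000

seconds = total_flops / (n_gpus * v100_peak * utilization)
days = seconds / 86400
print(f"wall-clock at 30% utilization: ~{days:.0f} days")

watts_per_gpu = 300
kwh = n_gpus * watts_per_gpu / 1000 * seconds / 3600  # kW * hours
cost_usd = kwh * 0.10
print(f"GPU electricity alone: ~{kwh:,.0f} kWh, ~${cost_usd:,.0f}")
```

Under these assumptions, the electricity for one training run is large in absolute terms but still small next to the hardware investment; the dominant costs are the cluster itself and the many experimental runs before the final one.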

The first is the road to simplicity. Why say that? Because in the past, large models came in many architectures: CNN, RNN, LSTM, GAN, and so on. Now, at least in language and perhaps eventually in vision as well, everything seems to be converging on the Transformer architecture: all roads lead to one, and this one proves the most capable. Why? There are theoretical reasons: some researchers have proved that the Transformer architecture is Turing-complete. What does Turing-complete mean? Turing was a British scientist of the founding era of computing (the Turing Award is named after him), and Turing completeness concerns the computability of problems: a Turing-complete machine can, in principle, compute anything in the world that is computable at all. In other words, the Transformer is a very general computing machine, and that is what the road to simplicity means.

Second is what the news likes to call the aesthetics of brute force: in the end, it relies on enormous computing power, massive data, and an extremely strong engineering capability for controlling a complex model, all pushed to the extreme, which gives the model a powerful ability to absorb and learn information. Third is cognitive emergence, the so-called change of quantity into quality. People originally thought that the formation of such intelligence might be a very long and difficult process; now we find that once brute force pushes the model past a certain scale, its abilities jump. Beyond a critical point it detonates, undergoes a phase transition, and becomes completely different from what it was. That is the so-called emergence problem: suddenly all kinds of high-level cognitive abilities appear.

The last point is value alignment, which addresses an inherent flaw of the current GPT series. Emergence is a very, very complex phenomenon, and the constraints a person must respect are far more complex than the model's; the model itself is relatively simple and crude: it takes the information and predicts the next token, so by itself it has neither values nor any checking of facts.

That is why early on everyone saw things like "Guan Gong fighting Qin Qiong" (a proverbial anachronism from Chinese opera): the model fabricates facts for you, it really looks plausible, but when you check, the facts are wrong. There is also the lack of guidance by values: it can talk recklessly, saying many offensive and wrong things. These failures are called hallucinations, the so-called cognitive illusion. To prevent this, and to make the machine better comply with human morality and ethics, value alignment is necessary, and it is mainly accomplished by the reinforcement learning from human feedback just described.

Finally, how did the cognitive emergence of large language models come about? Why did it emerge? More technology may be involved here, but first note that evaluation of large language models now rests on a very rich set of benchmarks, dozens of items covering many different capabilities. Here are a few typical examples: three-digit addition and subtraction, translation tasks, recovering words from scrambled letters, open question answering in languages like Persian. What everyone found, together with the chain of thought just mentioned, is that these abilities generally require parameters above roughly the 10-billion scale. The curves all show the same pattern: right around 10 billion parameters, the ability shoots up at once, whereas before that point it feels like no progress at all. This is the so-called emergence problem.

Of course, this emergence must be used together with the chain of thought: without a chain of thought, even when the parameters reach that level, the emergent abilities are not very good; once the chain of thought is used, many abilities, such as solving math problems, suddenly improve a great deal. So we have to ask: first, what is emergence? Everyone may know the word, but what does it mean in the scientific sense? Second, what is the mechanism by which such systems emerge? This is a scientific problem that has not been completely solved. Emergence, simply put, is "more is different", quantity changing into quality: as the parameters gradually expand, many abilities suddenly appear at once. Of course, these high-level abilities as they currently emerge still have room for improvement, because causally there is still much that is unknown about large language models: the model can do things without our knowing why. This touches the scientific debate between reductionism and systems theory. Many past studies, including in chemistry, emphasized reductionism, reducing the problems of the complex world to their parts: for example, whether the physics of the world can be reduced to, and fully explained by, the laws of atoms and molecules. But the reductionist hypothesis is flawed. Many complex systems genuinely emerge: the whole is much more complex than the simple superposition of its parts and produces many new properties.
Emergence is universal in the macroscopic natural world: birds flying in flocks, fish swimming in schools, bee colonies and ant colonies acting and foraging together. All of these are emergence visible at the macroscopic scale, as are large-scale activities such as our daily traffic: a relatively disordered state mutates, through a phase transition, into an ordered one.

Emergence also appears at more microscopic scales: viruses and bacteria appear in groups and show very interesting emergent phenomena too. So emergence spans from the inorganic to the organic and from micro to macro, and can be called a universal problem, with quite a lot of research in philosophy and many sciences. A very rough basic classification goes like this: without feedback, it is simple emergence, systems merely put together; with feedback, emergence becomes harder, but this is still weak emergence. Human science has studied weak emergence more, for example the social behaviors of the ants just mentioned, or behaviors on our Internet.

Then there are harder cases. Stocks, as we know, cannot be predicted: the stock market contains very complex feedback, so its emergence is much harder than the easily predictable emergence just mentioned. The hardest are the fundamental questions about the universe, life, and the brain, the kind generally listed among science's great open questions; too much about their answers is unknown, so this kind is called strong emergence and is, as far as science goes, difficult to explain. Emergence holds a very important position in artificial intelligence, especially in swarm intelligence.

In China's 2017 national plan, four directions were listed in the field of swarm intelligence, one of which is incentive and emergence; it is written into the 2017 New Generation Artificial Intelligence Development Plan issued as a State Council document. Essentially this says that the world's intelligent systems form a dynamic, complex cognitive network, and the characteristic behavior patterns of such networks are a kind of emergence. The emergence most relevant to large language models is the emergence in the brain, so we can borrow from research on the brain to see how to view the emergence of a neural network like a large language model. Such phenomena are generally examined across scales, at three levels: macroscopic, mesoscopic, and microscopic. We know that, microscopically, the human brain is composed of tens of billions of neurons and the trillions of synaptic connections between them. The same framework applies to an insect such as the fruit fly, a species whose neuronal structure has been studied quite thoroughly, a few thousand neurons in the mapped specimens, and whose neuron-to-neuron connections have been drawn into a so-called connectome.

In such a network, the microscopic level is the operating mode of single neurons, for which humans now have many observation, analysis, and numerical-simulation methods, all quite clear. At the mesoscopic level, thousands of such neurons compose functional columns and neural circuits; research of this kind is now in full swing, but it still misses too many links, for the simple reason that observation is difficult.

One current method is that, when a patient has a brain disease, electrodes are implanted in the brain, and those electrodes can record the neural activity of nearby neuronal populations, a summation of the spikes of thousands of neurons. But such research samples are very small and cannot be obtained in real time, so people are pursuing more powerful microscopy and non-invasive methods. This road of exploration is still very long.

At the macroscopic level, studies of brain regions have done very well: devices such as MRI and fMRI now work at the centimeter and millimeter level across different brain regions, decoding activity via blood flow, so quite a lot of progress has been made on the cooperative relationships between brain regions. Emergence, then, is generated from microscopic individual neurons, which compose functional regions and circuits, which in turn form larger, interacting brain regions. Language ability is only one part of this, yet Transformer-architecture models already have hundreds of billions of parameters, so in a sense their neuron counts approach the scale of the human brain, and they likewise need to be studied from the macroscopic, mesoscopic, and microscopic views.

At the macroscopic level we can already see its various abilities: you can run all kinds of benchmarks, and it can solve problems, write code, even answer some complex logic questions. At the microscopic level, this involves the training and iteration of individual artificial neurons, on which there are many studies. What is most lacking at the moment is the mesoscopic view of the Transformer language model: the model is a stack of many layers, so what does each layer mean? Scientific research on this is still in progress.

The basic understanding is that each layer may be responsible for different functions: generally, the lowest layers handle surface features such as lexical patterns, higher layers handle grammar and semantics, and the contextual patterns grow more and more complex up the stack. So we need interpretability methods, something like the neural electrodes just mentioned, exploration techniques that can organically interpret the "brain regions" inside the Transformer, its internal neurons, and the regularities in its artificial-neuron parameters.

Some related studies are already underway, especially analogical analyses between the human brain's language processing and the abilities of the GPT series. As I just said, there are people with electrodes implanted in the brain; the signal patterns from those electrodes are decoded and analyzed, and then the human brain and the GPT model are shown the same task, such as next-word continuation of text.

The studies show that, at least on this word-continuation task, the two share some similar computational principles: both can make good predictions of the next word in continuous context; both, on finding that the prediction does not match reality, produce a prediction error, which drives the human brain, or the artificial neural network, to adjust accordingly; and both represent words through context. So at least some current research progress suggests that the GPT series has a somewhat human-like mode of computation in its language ability.

Second, because GPT-4 and ChatGPT have not disclosed their parameters, this analysis is still of the composition of GPT-3's 175 billion parameters. First, about 1% of the parameters are used to represent words: each word is encoded as a vector of 12,288 dimensions. If you feed in a word for a school, say Zhejiang University or Beihang, or for our company, the large amount of information related to that word is compiled into this roughly ten-thousand-dimensional vector, so the GPT series first has the ability to associate a word with its corresponding meanings. Then there is association, the attention mechanism: of the 175 billion parameters, about 30% are used for attention, with 96 heads doing various kinds of attention analysis. What does attention actually mean? It connects one concept with another, and with association mining pushed to this extreme, the model can draw inferences by analogy, and abilities can emerge. That is the second share, the 30%. The most important share, about 60%, is the feed-forward fully connected layers just mentioned, used specifically to remember concepts: grammar, words, the polymorphic forms concepts take in text, and conceptual patterns at every level, accumulated slowly and steadily. By combining these three kinds of parameters, cognitive ability can emerge.
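The 1% / 30% / 60% split can be roughly reproduced from the published GPT-3 shape (d_model = 12288, 96 layers, vocabulary of about 50,257 tokens), ignoring biases and layer norms for simplicity. The exact percentages come out slightly different (embeddings nearer 0.4%, attention nearer a third, feed-forward nearer two thirds), but in the same spirit as the figures quoted above:

```python
# Rough accounting of GPT-3's 175B parameters from its published shape.
# Biases, layer norms, and positional embeddings are ignored.

d, layers, vocab = 12288, 96, 50257

embedding = vocab * d                  # token embedding matrix
attention = layers * 4 * d * d         # Q, K, V and output projections
ffn       = layers * 2 * d * (4 * d)   # two d <-> 4d projections per layer

total = embedding + attention + ffn
for name, n in [("embedding", embedding), ("attention", attention), ("ffn", ffn)]:
    print(f"{name:9s}: {n/1e9:6.1f}B  ({100*n/total:4.1f}%)")
print(f"total    : {total/1e9:6.1f}B")
```

Note that the feed-forward layers hold exactly twice the attention parameters (8·d² vs. 4·d² per layer), which is why concept memory dominates the budget.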

Some more theoretical studies show that artificial neural networks are performing a kind of representation learning on geometric manifolds. Because everything is data-driven, and the world's data carries very complex high-dimensional information, the neural network maps it onto a low-dimensional manifold, and that low-dimensional manifold corresponds to concepts: data is a representation of this complex world, and the network must synthesize the world's information at a lower dimension to understand the world better. So the more complex the network, the stronger its ability, because it can understand a larger world, grasp more information, and remember manifolds of greater capacity. Why does it take a model of a certain scale? For exactly this reason: if your capacity is insufficient, your understanding of the world is not deep or complete enough.

The second point is that the feed-forward (FF) layers of the Transformer hold an understanding of concepts much richer and more fine-grained than the concepts we usually speak of. For example, here are some patterns extracted from the model: the first captures years written with the same hyphenation, the second words with the same suffix, the third names of German bases, the fourth decimal numbers with a percent sign, the fifth decimal numbers representing amounts, the sixth years of the 18th and 19th centuries; there are numbers, English words, and others I will not repeat. These various conceptual patterns are manifestations of the diversity in language.

Overall, the research shows that the bottom layers record words and syntax better, while the higher up the stack, the more semantic and situational the recorded patterns become, so the layers are weighted differently. The other finding is that the FF layers implement something like a key-value store. The key is the kind of concept we just discussed, in all its varieties. What is the value? Recall that the model's strongest, primary function is word continuation: predicting the next word given a context. So it comes down to prediction, and here are a few examples.

When predicting, it is not a single layer that predicts: every layer contributes. Each layer first identifies the concepts it has memorized that appear in the current context, that is, the keys, the patterns; then it looks up the corresponding values. The values from many layers are aggregated into a probability judgment, from which the closest word is found.

Here are some examples. Key number 449 in the first layer: what does it actually remember? Sentences that all end with the same fragment ("substu…"); look, every one of these sentences ends that way. Key 2546 in the sixth layer remembers military things: words related to the military, bases related to the army. By the sixth layer, some relatively shallow semantics have already appeared.

At the tenth layer, a more abstract semantics appears: the part-of relation. In these examples the words "part of" do not even appear, yet the meaning is a part-whole meaning, and later layers remember more things of this kind. So that is the key; and the value examples show what value is output, under such an example, when the next word is predicted. This means the different layers of the Transformer jointly judge, given the context you enter, what the next word will be. The judgment is not made by one layer alone but by nearly a hundred layers, or more, at the same time, and this ability is what makes its next-word prediction so accurate.
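The key-value view of a feed-forward layer described above can be written down in a few lines: FFN(x) = f(x·Kᵀ)·V, where each row of K is a key (a pattern detector) and each row of V is a value (a vote over the vocabulary). The tiny vectors and four-word vocabulary below are invented for illustration:

```python
# Toy sketch of a Transformer feed-forward layer as a key-value memory.
# 2 keys over a 3-dimensional hidden state, 2 values over a 4-word vocab.

def relu(v):
    return [max(0.0, x) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

K = [[1.0, 0.0, 0.0],      # key 0: fires on pattern A (e.g. "ends in substu...")
     [0.0, 1.0, 0.0]]      # key 1: fires on pattern B (e.g. military words)
V = [[2.0, 0.0, 0.0, 0.0],  # value 0: votes for vocab word 0
     [0.0, 2.0, 0.0, 0.0]]  # value 1: votes for vocab word 1

def ffn(x):
    activations = relu([dot(k, x) for k in K])  # which keys match x, and how much
    out = [0.0] * len(V[0])
    for a, v in zip(activations, V):            # activation-weighted sum of values
        for i, vi in enumerate(v):
            out[i] += a * vi
    return out

# A hidden state matching key 0 pushes vocab word 0's logit up the most.
logits = ffn([1.0, 0.1, 0.0])
print(logits.index(max(logits)))
```

In a real model there are tens of thousands of keys per layer and the outputs of all layers are summed through the residual stream, which is exactly the "every layer contributes a vote" picture in the text.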

But this alone is not enough. It connects with the instructions and chain of thought just discussed through a concept called in-context learning, in which a few examples make the model quickly acquire an ability in some area. For example, first let it learn some famous biographies: a brief introduction of Einstein, a German theoretical physicist, and the great work he did, so that it remembers the concept Einstein is related to this cluster of words; at the same time it also remembers the corresponding clusters for Gandhi, Marie Curie, and so on. Then you teach it: Einstein was German, Mahatma Gandhi was Indian, and next you ask it what Marie Curie was; it can judge from these examples.

Why? Because the language model already has strong predictive power and can match concepts through context; the examples then further connect these concepts, so that for a specific task, it can select the concepts most relevant to that task and perform small-sample reasoning. In our terminology we call this a Bayesian network, because Bayesian networks have very low data requirements. In the past, a Bayesian network had to be designed by people, but in a large language model it can be generated dynamically: through text, the model composes its own Bayesian network according to the task, which gives it the strong abilities just described.
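The nationality example above, written out as a literal prompt, shows what "task induction from examples" means: no instruction is stated anywhere, yet the two demonstrations imply the latent task. The strings are illustrative and no model is called:

```python
# In-context learning demonstration: the model must infer the latent task
# ("state the nationality") purely from the pattern of the examples.

demonstrations = [
    ("Albert Einstein", "German"),
    ("Mahatma Gandhi", "Indian"),
]

def icl_prompt(query_name: str) -> str:
    lines = [f"{name} was {nationality}." for name, nationality in demonstrations]
    lines.append(f"{query_name} was")   # left open for the model to complete
    return "\n".join(lines)

prompt = icl_prompt("Marie Curie")
print(prompt)
```

A model that has memorized the biographical concept clusters can complete the last line correctly, even though "nationality" is never mentioned; on the Bayesian-network reading, the demonstrations select which stored concepts to activate.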

Having said all this, we can combine these technical points into a general analysis of the mechanism of emergence.

First, through a massive corpus, the language model acquires enough conceptual patterns at every level. Second, through those conceptual patterns it achieves word continuation, the same way people talk. Third, through the in-context learning of the instructions just mentioned, it organizes these concepts into a dynamic network, activates the relevant networks according to the required task, and thereby achieves the abilities people expect, with very strong generalization. Why might it take tens of billions of parameters to emerge? The main reason is an information bottleneck: if you want the model to form an ability from only a small amount of information, that is, a few examples, it must first have remembered enough conceptual patterns. If there are not enough of them, below that threshold, then even a small set of instructions cannot complete the reasoning process just described. That is why such a large network must be trained.

Looking ahead, there is a very interesting question: if such a huge network is to be used in our department A, or department B, or some other department C, does it need to remember the full set of conceptual information? In other words, how can a business-oriented model be shrunk and distilled out of a large network without losing the conceptual network that is actually needed? Many people are studying this now. It may require techniques such as distillation, compression, and pruning: identify which conceptual patterns are indispensable for the business scenario, retain that part, and let the model gradually forget what is likely irrelevant, because it remembers far too much, and some of it is surely useless.
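The distillation idea mentioned here is usually implemented by training a small "student" to match the softened output distribution of the large "teacher". A minimal sketch, with invented three-class logits standing in for next-token distributions:

```python
import math

# Minimal knowledge-distillation sketch: gradient descent on
# KL(teacher || student) over temperature-softened distributions.
# The logits are toy numbers, not from any real model.

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q): how far the student's distribution q is from the teacher's p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher_logits = [4.0, 1.0, 0.5]
T = 2.0                                  # temperature softens the targets
target = softmax(teacher_logits, T)

student_logits = [0.0, 0.0, 0.0]
lr = 1.0
for _ in range(2000):
    q = softmax(student_logits, T)
    # gradient of KL wrt the student's logits: (q - target) / T
    for i in range(3):
        student_logits[i] -= lr * (q[i] - target[i]) / T

print(kl(target, softmax(student_logits, T)))
```

In a real pipeline, the same loss runs over every token position of a training corpus, and pruning or quantization is applied on top of the distilled student.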

Working out what can be forgotten, and deleting those forgotten parts, compresses the scale of the model so that a small or medium-scale model can be adapted to our business needs. Now, we have talked a lot about the strengths of large language models, but they still hallucinate. What causes these hallucinations? Here are some typical examples from large language models, such as singular-plural inconsistency: the subject is plural and should take "are", yet the model uses "is"; or a number is simply miscalculated.

These errors are familiar and appear in many news reports. There are conceptual errors: the model says the trophy does not fit in my suitcase because the trophy is too small, an obviously self-contradictory statement. There are referential errors: a person introduced as "she", someone's sister, is later referred to as "he", an inconsistency. These are the common mistakes of such models. Since ChatGPT's release, GPT-4 has further improved on these problems, but they still cannot be completely avoided. The reason is easy to understand: the current language model focuses on realizing the language function of the human brain. Its cognitive ability arises because a large share of our cognition is carried in language, so through a spillover effect it also gains a certain ability at cognitive analysis and reasoning. But fundamentally this is still the language function, and in the human brain, language occupies only one region; there are many other parts, including those for logical reasoning and for recording world knowledge and situational intention, which brain research shows are not in the language areas. So our current language ability, however strong, cannot cover all human abilities, and such deficiencies are easy to understand. Here are some interesting examples from this line of study.

You can see that in logical reasoning, GPT-3 shows some interesting shortcomings. For example, there is a rather absurd scenario: you ask the language model, "How would you move your sofa onto the roof of your house?" GPT-3 answers: find a good ladder, find a strong friend to help carry it together, or hire a powerful helper to move it. That first answer sounds close to what a person would say. The second round adds a constraint: you must move the sofa onto the roof, but you may not use pulleys. GPT-3 then proposes tying the sofa with a rope and attaching it to my car, which is not a good answer. How did a human answer? Build a large ramp and carry the sofa up to the roof along it, clearly better than the model. If you impose further restrictions on the logic, no tools at all, no crane, no ladder, you see GPT-3 start talking nonsense: it says it will cut the sofa into several pieces and stuff them through the window, simply making things up, whereas a person can still come up with other tricks. So when you design examples this way, it is still fairly easy to distinguish humans from machines: as the reasoning requirements get harder and harder, you find the model cannot think of a solution and starts fabricating one.

So this kind of hallucination is hard to avoid in large language models, and you find that such models depend heavily on statistical patterns. For example, in our research you can insert specific patterns, such as punctuation marks or number series, that completely confuse the model; once it ingests these patterns it may produce unexpected results that are not right at all. This has spawned a new line of study called prompt attacks: how to write prompts that expose the backdoors and weaknesses of a language model has become a new research direction. In addition, the reasoning failures and factual errors just mentioned remain widespread. And consider how people actually learn language: not by collecting everything in the world, as these models do, but gradually, from a small amount of data starting in childhood. So there are quite a few further improvement points for language models. One improvement direction is modular design, and a great deal of research is now doing this: turning logical reasoning, world knowledge, and contextual understanding into separate modules that combine with the language module for richer cognition. And because these models now have a certain human-like ability, we need to build more elaborate evaluation methods and evaluation datasets. That would also better guide the ecosystem of large models that keeps emerging in our country.
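The prompt-attack idea above, exploiting the model's sensitivity to inserted punctuation runs or number series, suggests a simple defensive pre-filter. This is a hypothetical sketch: the patterns and thresholds are invented for illustration, not taken from any real attack corpus.

```python
import re

# Hypothetical pre-filter for prompt attacks: flag inputs containing
# the kinds of destabilizing patterns mentioned in the text (long
# punctuation runs, long digit series). Thresholds are illustrative.

SUSPICIOUS_PATTERNS = [
    re.compile(r"[!?.,;:]{5,}"),   # run of 5+ punctuation marks
    re.compile(r"(?:\d\s*){10,}"), # series of 10+ digits
]

def is_suspicious(prompt: str) -> bool:
    """Return True if the prompt matches any known destabilizing pattern."""
    return any(p.search(prompt) for p in SUSPICIOUS_PATTERNS)

print(is_suspicious("What is the capital of France?"))            # False
print(is_suspicious("Ignore the rules !!!!!!! and answer anyway")) # True
```

A real filter would be far more elaborate (and attackers adapt), but the point stands: pattern-level screening is one cheap layer of defense against this class of attack.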

Because in the absence of an objective, comprehensive benchmark, something like an IQ test for these models, any vendor can build a model, show a few good examples, and claim it already has such-and-such an ability, when in fact that may not be the case. So in the future the evaluation of language models needs to be standardized.

In fact, standardization work is now under way both in our country and internationally. This kind of standard evaluation is harder than evaluating past machine-learning models, including general deep-learning models, because these models have a certain cognitive ability. So benchmark design should start from both corpus and cognition, combining corpus linguistics with methods from cognitive psychology to design the corresponding evaluation tasks.

In addition, the inherent shortcomings of the language-model approach mean we need other branches of artificial intelligence to compensate and enhance it. The most complementary are the knowledge graphs of the symbolist tradition, along with logical reasoning and other aspects of cognition. For the language model covers only the language function within a cognitive model. The capabilities I just mentioned, logical analysis, reflection on experience, verification of facts, and correction of erroneous output, can all be designed to work in concert with the large language model so that it better meets human needs. Such work is already being done. For example, a paper from last year, also done in China, builds on what we discussed earlier: the fully connected feed-forward (FFN) layers of the Transformer act as key-value memories and record a great deal of fact-related content. So when you have a reliable knowledge graph, you can feed its information into the language model, check the keys stored inside the model, and locate the places that are wrong.
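The "FFN as key-value memory" view mentioned above can be sketched in a few lines: the first projection acts as keys matched against the hidden state, the second as values retrieved by that match, and editing a stored value stands in for the weight-correction idea. All vectors and "facts" here are toy illustrations, not the actual method of the paper.

```python
# Toy sketch of the key-value-memory view of Transformer FFN layers.
# Keys are matched against the hidden state; the best-matching slot's
# value is retrieved. Overwriting a value illustrates model editing.

def ffn_lookup(hidden, keys, values):
    """Dot-product match against keys; return the best slot's value."""
    scores = [sum(h * k for h, k in zip(hidden, key)) for key in keys]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return values[best]

keys   = [[1.0, 0.0], [0.0, 1.0]]                  # "fact slots"
values = [["Sri Lanka", "Colombo"],                # stored (wrong) fact
          ["France", "Paris"]]

hidden = [0.9, 0.1]                                # query nearest slot 0
print(ffn_lookup(hidden, keys, values))            # ['Sri Lanka', 'Colombo']

# "Editing" the model: overwrite the stored value for the wrong fact.
values[0] = ["Sri Lanka", "Sri Jayawardenepura Kotte"]
print(ffn_lookup(hidden, keys, values))
```

Real knowledge-editing methods locate the responsible neurons across layers and update actual weight matrices; this sketch only conveys the lookup-then-patch intuition.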

For example, ask the model for the capital of Sri Lanka. If the output is found to be wrong, you locate the error according to the model's internal working mechanism and then correct the weights of the relevant layer, which further reduces the chance of similar mistakes in the future. To solve this problem comprehensively, we need to start from the needs of our business. For example, a department's internal systems may have good data accumulated in different domains; we should integrate this knowledge well and form an organic interaction with the language model. On the one hand, the language model can help us extract knowledge automatically, including entity-relation extraction, which is then incorporated into a dynamic knowledge base. On the other hand, we use the knowledge graph to check the model's reasoning in real time, analyze and judge it, and correct errors so that its output meets business needs. Many departments field a large volume of inquiries, currently handled on site or by customer service. A GPT-style system with dialogue capability is very helpful for automating such interactive consulting Q&A. But its problem is the hallucination just described: it sometimes outputs wrong answers or incorrect facts. To ensure our system gives authoritative answers, how to use the knowledge methods just described to verify and check the accuracy of the content the language model generates has become a very important technical problem and an area of practical exploration.
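The real-time checking described above can be sketched as a lookup against knowledge-graph triples: if the graph knows the fact and disagrees with the model, flag the answer and supply the correction. The triples and the queried answer are illustrative.

```python
# Hedged sketch of fact-checking model output against a knowledge
# graph stored as (subject, relation) -> object. The triples and the
# model answer below are illustrative examples.

KG = {
    ("Sri Lanka", "capital"): "Sri Jayawardenepura Kotte",
    ("France", "capital"): "Paris",
}

def verify(subject, relation, model_answer):
    """Return (ok, correction). ok is True if the KG agrees or is silent."""
    gold = KG.get((subject, relation))
    if gold is None:
        return True, None        # KG cannot check this fact
    if model_answer == gold:
        return True, None
    return False, gold           # flag the error and offer the correction

print(verify("Sri Lanka", "capital", "Colombo"))
# (False, 'Sri Jayawardenepura Kotte')
```

In a deployed Q&A system this check would sit between the model and the user: verified answers pass through, flagged ones are corrected or escalated to a human.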

Report analysis and generation is also something this kind of language model is relatively good at. Of course, this requires combining the model with professional knowledge: turning that knowledge into instructions, designed as a collection of prompts. We now know that writing prompts has become a new profession, the prompt engineer, whose essence is learning how to work with a large language model so as to better induce it to produce the desired results. You cannot simply speak to it casually and get the effect you want; professional fields need dedicated people to design the corresponding prompts, and these prompts must be specially maintained and managed. In the future, the management of prompts may be no less important than our management of information systems. At the same time, because a language model working with a knowledge graph gains significantly stronger cognitive analysis, judgment, and decision-making abilities, it may also help us with compliance review and risk analysis in the future.
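The idea of prompts as maintained, managed assets suggests treating them like any other versioned configuration: named, versioned templates in a registry rather than ad-hoc strings scattered through code. The template name, version scheme, and fields below are made up for illustration.

```python
import string

# Sketch of a managed prompt registry: prompts are versioned templates
# looked up by (name, version), in the spirit of "prompt engineering"
# as a maintained discipline. All names and fields are hypothetical.

PROMPTS = {
    ("report_summary", "v1"): string.Template(
        "Summarize the following $domain report in three bullet points:\n$text"
    ),
}

def render(name, version, **fields):
    """Fill a registered prompt template; raises KeyError if missing."""
    return PROMPTS[(name, version)].substitute(**fields)

prompt = render(
    "report_summary", "v1",
    domain="financial",
    text="Q3 revenue rose 8% year over year.",
)
print(prompt)
```

Versioning the key means a prompt change ships as "v2" and can be rolled back, reviewed, and A/B tested, exactly the information-system discipline the text anticipates.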

We know that a large number of documents are written in natural language. In the judicial field, for example, in intelligent justice, artificial intelligence is already used as an auxiliary means for filing judicial cases and for their rapid adjudication, because these models have strong natural-language processing ability and even some common-sense inference ability. You need only write out the logic of your ruling or judgment, the chain of thought, and the model may be able to make simple judgments as if it really were a judge. If the standard process and experience are embedded, if you give corresponding examples and write the chain of thought clearly, it may be able to handle part, even a considerable part, of the review work the way judges and auditors do. In addition, various behaviors can be automatically analyzed and identified by such systems, detecting risks earlier and improving the level of supervision. The development of AI is already rapidly changing our lives. A while ago it felt as if only a few rays had appeared on the horizon, like the sun just being born; perhaps now the sun is half risen, and when the whole red sun is up, we will truly feel that our world has entered the era of intelligence.
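The "write out the chain of thought clearly" advice above amounts to structuring the prompt so the ruling logic is spelled out step by step. This is a minimal sketch of such a prompt builder; the rule and case facts are invented placeholders, and a real system would send the result to a model rather than print it.

```python
# Minimal chain-of-thought prompt sketch for the judicial-assist idea:
# the ruling logic is written as explicit steps the model must follow.
# The rule and facts below are invented placeholders.

def build_cot_prompt(rule, facts):
    """Assemble a step-by-step reasoning prompt for one case."""
    return (
        f"Rule: {rule}\n"
        f"Facts: {facts}\n"
        "Think step by step:\n"
        "1. Which elements of the rule do the facts satisfy?\n"
        "2. Which elements, if any, are missing?\n"
        "3. Conclude whether the rule applies to this case.\n"
        "Answer:"
    )

prompt = build_cot_prompt(
    rule="A valid contract requires offer, acceptance, and consideration.",
    facts="A offered to sell a bicycle; B accepted and paid 50 yuan.",
)
print(prompt)
```

Making the steps explicit is what lets the model's intermediate reasoning be inspected and, as the text notes, checked against standard process before any conclusion is trusted.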