
WAIC 2023 | Zhang Junlin: The Interaction Paradigm Shift Brought by Large Language Models

Author: Machine Heart Pro

Machine Heart report

Speaker: Zhang Junlin

On July 8, at the 2023 WAIC AI Developer Forum hosted by Machine Heart, Zhang Junlin, head of new technology R&D at Sina Weibo, delivered a keynote speech entitled "Natural Language Interaction: The Interaction Transformation Brought by Large Language Models". He focused on how large language models are changing human-computer interaction. His core idea: whether between human and computer or between AIs, interaction will adopt natural language, making the way people manipulate data simpler and more unified. The large language model sits at the center of human-computer interaction, while the complex intermediate steps are hidden behind the scenes, solved by the model through Planning + Programming.


The following is the content of Mr. Zhang Junlin's speech, which Machine Heart has edited without changing the original meaning.

Many people, both in China and abroad, are now building large models. There are two core issues to consider for large models.

The first is the foundation (base) model. Building a powerful foundation model requires enormous amounts of data, compute, and money. Although ChatGPT shocked the world when it came out, the main reason was not its foundation model. Powerful foundation models did not appear with ChatGPT; they developed gradually. From 2020 onward, models abroad grew steadily in scale, and their results steadily improved. ChatGPT's foundation model may have improved on its predecessors, but there was no qualitative leap. So the foundation model is not the main reason ChatGPT is so influential.

The second is the model's ability to understand instructions. If you ask why ChatGPT has had such a big impact, this is the main reason. ChatGPT lets large models understand human language and commands; this is probably the most critical factor, and it is what sets ChatGPT apart from earlier large models.

There is an ancient line of poetry that fits these two key components particularly well: "The swallows that once nested before the halls of the noble Wang and Xie families now fly into the homes of ordinary people."

"Tangqian Yan" is a large model of the base, but before the ChatGPT era, it was mainly researchers who paid attention to and improved. "Flying to the Ordinary People's Home" refers to RLHF, or Instruct Learning, which allows all of us to interact with large models in natural language. In this way, everyone can use it, and everyone can appreciate the power of its pedestal model. I think that's the fundamental reason why ChatGPT can cause such a big stir.

The topic I'm sharing today is "natural language interaction", which I think is probably the most fundamental change that large language models (LLMs) like ChatGPT bring us.

Traditional human-computer interaction

First, let's take a look at the traditional human-computer interaction method.


The essence of human-computer interaction is the relationship between people and data. People act in the environment, producing various types of data, which fall into two categories: unstructured data, such as text, images, and video; and structured data. Enterprises may care more about structured data, because much internal enterprise data lives in databases or tables. People need to process these various types of data through typical operations such as creating, adding, deleting, modifying, and searching.

Before large models came out, how did people relate to data? Not directly: they needed an intermediary, namely application software. Even the simplest text editing requires a text editor, with Word as a more advanced option; tables require Excel, databases require MySQL, and images require Photoshop.

From this we can see one feature of traditional interaction: different application software handles different types of data, a fragmented set of interaction interfaces. Another feature is that traditional interaction is complex and cumbersome, and much data has to be processed by professionals. Take images: even given Photoshop, ordinary people find it hard to edit images well, because the work involves complex operations, hunting for functions in multi-level menus, that only trained professionals can perform.

To sum up: before large models, our relationship with data and the ways we interacted with it were complex and fragmented.

Human-computer interaction in the era of large models

After large models emerged, what changed in essence?


In fact, there is only one key change: the large language model now stands at the center of human-computer interaction.

In the past, people interacted with a particular kind of data through a particular application. Now they interact with a large model, in a very direct and unified way: natural language. If you want to do something, you just tell the model.

In essence this is still a relationship between people and data, but with the emergence of large models, the application software is pushed behind the scenes.

Let's look at the future trends. In the short term, LLMs can replace some applications for text and other unstructured data, for example multimodal large models replacing Photoshop. That is, the LLM can handle common tasks itself, no longer needing an application's functional support behind the scenes. For now, most structured data still needs its corresponding management software, but the large model has moved to the foreground while that software hides in the background. In the long term, the large model may gradually replace the various functional applications; look again in a few years, and quite possibly only a large model will stand in the middle.

This is the fundamental change large models bring to the way people and data interact, and it is a very important one.

Human-computer interaction in the era of large models looks very simple: say what you want, and the LLM does the rest. But what actually happens behind the scenes?


For example, Apple's products have an especially good reputation. Why? Because they give users operations that are simple to the extreme, while hiding the complex parts behind the scenes.

Large models resemble Apple's approach to software. The interaction between an LLM and a person looks simple, but in fact the LLM does the complex work for the user behind the scenes.

The complex things an LLM must do fall into three broad categories:

First, understanding natural language. The language understanding of current large models is very strong; even relatively small large models understand language very well.

Second, task planning. For a complex task, the best approach is to break it into several simpler subtasks and then solve them one by one, which generally works well. That is what task planning is responsible for.

Third, formal language. Although human-machine interaction uses natural language, the subsequent data processing generally requires formal languages: programming, APIs, SQL, module calls, and so on. The forms vary, but in the end they are all programming: APIs are essentially function calls to external tools, SQL is a special-purpose programming language, and module calls are really APIs. I believe that as large models develop, their internal formal language will likely unify into programming logic. That is, after a complex task is planned into simpler subtasks, each subtask's solution usually takes the external form of program code or API calls, as the sketch below illustrates.
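To make the Planning + Programming idea concrete, here is a minimal sketch of the pattern. All names here (llm_plan, TOOL_REGISTRY, the toy tools) are hypothetical stand-ins, not the API of any specific system.

```python
# A minimal sketch of the Planning + Programming pattern described above.
# llm_plan, TOOL_REGISTRY, and the toy tools are hypothetical stand-ins.

def llm_plan(request: str) -> list:
    """Stand-in for the LLM's planning step: decompose a request into formal
    subtasks. A real system would prompt the model and parse its output."""
    return [
        {"tool": "search", "args": {"query": request}},
        {"tool": "summarize", "args": {"max_words": 50}},
    ]

# Each subtask's solution takes the external form of a function/API call.
TOOL_REGISTRY = {
    "search": lambda args, state: f"results for {args['query']!r}",
    "summarize": lambda args, state: f"short summary of ({state})",
}

def run(request: str):
    state = None
    for step in llm_plan(request):           # Planning: ordered subtask list
        tool = TOOL_REGISTRY[step["tool"]]   # Programming: each subtask is a call
        state = tool(step["args"], state)    # execute sequentially, chaining results
    return state

print(run("summarize last quarter's sales reports"))
```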

Now let's look at how people and data interact in the era of large models, for each type of data. Text needs no elaboration; ChatGPT is the typical case.

Using natural language to manipulate unstructured data

Let's start with unstructured data, and first with images.


As shown in the figure, this is a typical Planning + Programming pattern, and it illustrates well the three things, described above, that the large model does behind the scenes. In this example, people manipulate images through language: adding, deleting, modifying, and so on. The work, called Visual Programming, won the Best Paper Award at CVPR 2023.

Take the still from "The Big Bang Theory" at the bottom of the figure. The user submits a group photo and gives the large model the task: "Tag the picture with the names of the 7 main characters of the TV series 'The Big Bang Theory'."

How does the LLM accomplish this? First, it maps the task into a program (the five lines below the instruction). Why five lines? That is the Planning: each line is a subtask, executed in sequence.

Briefly, the meaning of each program line: the first statement recognizes the faces in the picture; the second has the language model issue a query to find the names of the main characters of "The Big Bang Theory"; the third has the model match each face to a name, a classification task; the fourth has the model draw a box around each face in the picture and label it with the name; the fifth outputs the edited image.

As you can see, this process combines Planning and Programming, two in one. The planning step is not easy to spot, but it is there.
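Reconstructed from the description above, the five-line program might look like the following sketch. The module names and stub outputs are illustrative, loosely in the spirit of the Visual Programming paper rather than its exact syntax.

```python
# Illustrative reconstruction of the five program lines described above.
# Module names and stub return values are hypothetical, loosely in the
# spirit of Visual Programming (VisProg), not the paper's exact syntax.

def FaceDetect(image):                  # line 1: recognize the faces in the picture
    return ["face_1", "face_2", "face_3"]   # stub: a real module returns face regions

def QueryLLM(query):                    # line 2: ask the LLM for the protagonists' names
    return ["Sheldon", "Leonard", "Penny"]  # stub: truncated to 3 of the 7 names

def Classify(objects, categories):      # line 3: match each face to a name (classification)
    return dict(zip(objects, categories))

def Tag(image, labels):                 # line 4: box each face and write the name on it
    return f"{image} with boxes labeled {list(labels.values())}"

IMAGE = "big_bang_theory_group_photo.png"
faces = FaceDetect(image=IMAGE)
names = QueryLLM(query="main characters of 'The Big Bang Theory'")
labels = Classify(objects=faces, categories=names)
result = Tag(image=IMAGE, labels=labels)
print(result)                           # line 5: output the edited image
```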

Video is similar. Given a video, you can ask questions in natural language, such as: "What is this person doing in this video?"


Of course, this is essentially a multimodal task. As you can see, the model first encodes the video, with the encoded text information at the top and the visual information below; the Chinese text comes from speech recognition. So this is a model that integrates text and images, and although the figure does not show it, the model also performs various kinds of planning and reasoning internally.

A digression here on my view of multimodal large models. Everyone is now generally optimistic about multimodality, but I am personally less optimistic about multimodal large models than most people. The reason is simple: many models that handle both text and images now work reasonably well, but not because image or video technology has made a breakthrough; rather, the text model is so powerful that it carries the image model along. In technical capability, text and image models are not on equal footing: the text side is strong, the image side weak, and the text compensates for the image. Serious technical hurdles in image and video remain unbroken. A "dark cloud" hangs over image processing, and until it is dispelled, multimodality faces major obstacles and will find it hard to make significant progress in applications.

Using natural language to manipulate structured data

Now let's look at structured data, which comes in three typical forms: tables, SQL databases, and knowledge graphs.

Let's start with tables. We can manipulate tabular data through natural language, and Microsoft's Office Copilot does exactly that. The question is how. We don't know exactly how Microsoft does it, of course, but other researchers have done similar work.


Here's an example. A sales data table has many columns, with relationships among them: one column is the unit price, one the quantity sold, one the sales revenue. LLMs work well with tables because they learn a great deal during pre-training, such as the relation sales revenue = unit price × quantity sold, and they can apply what they learn when processing tabular data.

For this sales table, the user can issue a query: "Highlight the records with sales between 200 and 500."

The LLM (here GPT-4) first plans the task into subtasks, three in this case: 1) filter out the items with sales between 200 and 500; 2) set their background to blue; 3) embed the highlighted data back into the table.

How does the model carry out the first, filtering step? Briefly: it starts by writing the prompt. As you can see in the figure, this prompt is particularly long.

Writing prompts has now become a science in itself. Some say prompting is like chanting an incantation; I think it is more like PUA-ing the large model. We can compare a large model to an actor who can play many different roles; to get it to do the task at hand, we must coax out the role best suited to the task. So we write a prompt to lure that role out: "You are very knowledgeable; you are especially suited to this task; you should do it professionally, not casually," and so on.

Then we tell it the table's schema (the meaning of each column). GPT-4 generates an API call: a filter that selects, from all the data, the rows satisfying 200-500. But note the part in red: the model wrote the wrong parameters when generating the call. What then? We can give GPT-4 documentation to learn from, with many examples showing how to call the API and fill in the parameters in such cases. After reading it, GPT-4 revised the call, and this time got it right.

Then there's execution, which filters out the data you need.
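As a rough illustration of what the three executed subtasks amount to, here is a small pandas sketch. The column names and the highlighting mechanism are my own assumptions for illustration, not the actual API the system generates.

```python
# A rough pandas sketch of the three subtasks described above. Column names
# and the highlighting mechanism are assumptions, not the system's actual API.
import pandas as pd

df = pd.DataFrame({
    "item":       ["A", "B", "C", "D"],
    "unit_price": [10, 25, 40, 8],
    "quantity":   [12, 15, 9, 30],
})
df["sales"] = df["unit_price"] * df["quantity"]   # the relation the LLM learned in pre-training

mask = df["sales"].between(200, 500)              # subtask 1: filter sales in 200-500

def blue_if_selected(row):                        # subtask 2: set the background to blue
    css = "background-color: lightblue" if mask[row.name] else ""
    return [css] * len(row)

styled = df.style.apply(blue_if_selected, axis=1) # subtask 3: embed highlights back into the table
styled.to_html("highlighted_sales.html")          # render the highlighted table
```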

Another type of structured data is the database. Our goal is to manipulate the database in natural language, which essentially means mapping natural language into SQL statements.


An example is Google's SQL-PaLM, built on PaLM 2, the foundation model Google retrained and launched in April to benchmark against GPT-4.

SQL-PaLM operates databases in two ways. One is in-context learning: give the model a few examples, each with the database schema, a natural-language question, and the corresponding SQL statement, then pose new questions and have the model output the SQL. The other is fine-tuning.
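To make the in-context-learning route concrete, here is a hedged sketch of what such a prompt might look like. The schema and questions are invented; this is not SQL-PaLM's actual prompt format.

```python
# A sketch of the in-context-learning setup described above: schema plus a
# worked example, then a new question for the model to turn into SQL.
# The schema and questions are invented, not SQL-PaLM's actual format.
prompt = """\
Schema:
  orders(order_id, customer_id, amount, created_at)
  customers(customer_id, name, country)

Q: What is the total order amount per country?
SQL: SELECT c.country, SUM(o.amount)
     FROM orders AS o JOIN customers AS c ON o.customer_id = c.customer_id
     GROUP BY c.country;

Q: Which customers placed more than 5 orders in 2023?
SQL:"""

# sql = model.generate(prompt)   # hypothetical LLM call; the model completes the SQL
print(prompt)
```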

How well does this model perform? On fairly complex database tables its accuracy is 78%, approaching practical usability. With further rapid progress, it is likely that SQL statements will no longer need to be written by humans: in the future you state clearly what you want, and the machine does the rest.

Another typical kind of structured data is the knowledge graph. The question is the same: how do we manipulate a knowledge graph in natural language?


Here the user asks: "What country is Obama from?" How does the large language model answer? It likewise plans, breaking the task down into executable knowledge-graph API calls; these queries retrieve two pieces of sub-knowledge, and reasoning over them yields the correct answer: "United States."
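A toy version of that decomposition might look like the sketch below. The triples and the kg_query API are hypothetical stand-ins for whatever graph interface the system actually calls.

```python
# A toy version of the decomposition described above: the question is planned
# into two knowledge-graph lookups, then a reasoning step chains them.
# The triples and the kg_query API are hypothetical stand-ins.
KG = {
    ("Barack Obama", "born_in"): "Honolulu",
    ("Honolulu", "country"): "United States",
}

def kg_query(entity, relation):
    """Stand-in for the knowledge graph's query API."""
    return KG[(entity, relation)]

# Plan for "What country is Obama from?": two sub-knowledge queries, then reasoning
city = kg_query("Barack Obama", "born_in")      # sub-knowledge 1
country = kg_query(city, "country")             # sub-knowledge 2, chained to the first
print(country)                                  # -> United States
```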

The relationship between the large model and the environment

So far we have discussed the relationship between people and data; now consider the relationship between the large model and the environment. The most typical case is robotics, now generally called embodied intelligence: how to give a robot a brain and let it act in the world.


Embodied intelligence has five core elements: understanding language, task planning, physically executing actions, receiving environmental feedback, and learning from that feedback.

The biggest difference between embodied intelligence in the large-model era and before is that the core of all five links can now be taken over by the large model. To command a robot, you only need to express your requirements in natural language, and the large model plans and controls all five steps. The large model gives the robot a powerful brain: it understands human language and commands better, and it uses the world knowledge the model has learned to guide behavior, a qualitative improvement over previous methods in these respects. A schematic of this loop follows.
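Here is a schematic sketch of the five links with the LLM at the core. StubLLM and StubEnv are hypothetical stand-ins, not a real robotics stack.

```python
# A schematic sketch of the five links above with the LLM at the core.
# StubLLM and StubEnv are hypothetical stand-ins, not a real robotics API.

class StubLLM:
    def plan(self, instruction):                 # links 1-2: understand language, plan tasks
        return ["go to the kitchen", "pick up the cup", "return to the user"]
    def replan(self, instruction, memory):       # link 5: learn from feedback, adjust the plan
        return ["retry: " + memory[-1][0]]

class StubEnv:
    def execute(self, action):                   # link 3: physical execution of the action
        print("executing:", action)
        return {"ok": True, "detail": ""}        # link 4: environmental feedback

def embodied_loop(instruction, llm, env):
    memory = []
    queue = llm.plan(instruction)
    while queue:
        subtask = queue.pop(0)
        feedback = env.execute(subtask)
        if not feedback["ok"]:
            memory.append((subtask, feedback))   # remember the failure...
            queue = llm.replan(instruction, memory) + queue  # ...and replan around it
    return memory

embodied_loop("bring me a cup of water", StubLLM(), StubEnv())
```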

But there's a problem: many researchers are not bullish on this direction. Why? Using a physical robot to learn and act in the real world means high cost and low data-acquisition efficiency: physical robots are very expensive, their range of action in the real world is limited, they gather data very inefficiently and learn slowly, and they cannot afford to fall, because a fall means heavy maintenance costs.

What more people do instead is create a virtual environment for robots to explore, which mitigates the high cost and inefficient data collection. Minecraft is a commonly used virtual environment: an open world, similar to wilderness survival, where the game character learns to survive better, so it is especially suitable as a stand-in for a robot's activities in the real world.


A virtual environment costs very little, and data acquisition is very efficient. But there is also a problem: the virtual world is far less complex than the real one.

Voyager, developed by NVIDIA, lets an agent explore the unfamiliar environment of Minecraft; the large model driving it is also GPT-4, and the agent and GPT-4 likewise communicate in natural language. As the left side of the figure shows, the model learns tasks of increasing difficulty step by step, from easy to hard: from the simplest logging, to crafting a table, to fighting zombies, and so on. In the general machine-learning context, we call this easy-to-hard, step-by-step mode "curriculum learning". The curriculum tasks are generated by GPT-4 according to the current state; you just need to PUA the large model in the prompt so that it readily produces the next task.

Suppose the task now is to fight zombies. Facing this task, GPT-4 automatically generates a corresponding "fight zombie" function, program code that can run in the Minecraft environment. When writing this function it can reuse the tools built while solving earlier, simpler tasks, such as the stone sword and shield the character previously learned to craft, calling them directly through API function calls. Once the "fight zombie" function is formed, the code executes and interacts with the environment; if an error occurs, the error message is fed back to GPT-4, which further corrects the program. If the program executes successfully, this experience is put into a library as new knowledge, ready for reuse next time. Then, following the curriculum, GPT-4 produces the next, harder task. A condensed sketch of this loop follows.
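Condensed into Python, the loop described above might look like this. The control flow follows the text (propose task, write code, execute, fix or store); the function names are my own stand-ins, not NVIDIA's actual implementation.

```python
# A condensed sketch of the Voyager-style loop described above. The function
# names are hypothetical stand-ins, not NVIDIA's actual implementation.

def voyager_loop(gpt4, env, skill_library, rounds=5):
    for _ in range(rounds):
        # Curriculum: GPT-4 proposes the next, slightly harder task from the current state.
        task = gpt4.propose_task(env.state, skill_library)
        # GPT-4 writes runnable code for the task, reusing stored skills via API calls.
        code = gpt4.write_skill(task, skill_library)
        while True:
            result = env.run(code)               # act in the Minecraft environment
            if result["ok"]:
                skill_library[task] = code       # success: store as reusable knowledge
                break
            # Failure: feed the error message back so GPT-4 can correct the program.
            code = gpt4.fix_skill(code, result["error"])
    return skill_library
```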

The large model of the future

The above used the relationship between people and data, and between large models and environments, to show that natural-language interaction is everywhere. Next, let's look at the role natural-language interaction plays in AI-to-AI interaction.


What research progress on foundation models over the past six months deserves attention? Apart from model sizes continuing to grow, overall progress has not been large; most new progress concentrates on the instruct part, thanks to Meta's open-source LLaMA model. As for foundation-model progress, I think two things are worth watching. First, the rapid growth of the model's input-window length: this technology is advancing very fast, and open-source models will soon handle inputs of 100K tokens or even longer. Second, large-model augmentation.

I believe the large model of the future will most likely look like the one shown above. Today's large model is a static, single model; the future one should be composed of multiple agents with different roles, which communicate with each other in natural language and join forces on tasks. Agents can also call external tools through natural-language interfaces to remedy shortcomings of current large models, such as outdated data, serious hallucinations, and weak arithmetic ability.


The current pattern for large models using tools is fairly uniform, as the figure shows: a large set of available external tools can be managed through an API management platform. After the user poses a question, the model decides from the question's requirements whether a tool is needed; if so, it further decides which tool to use, calls that tool's API with the appropriate parameters, integrates the returned results into an answer once the call completes, and returns the answer to the user. A minimal sketch follows.
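The sketch below illustrates this decide-call-integrate pattern. The tools and the decision step are toy stand-ins; in real systems the LLM itself emits the tool choice and its arguments as structured text.

```python
# A minimal sketch of the tool-use pattern described above. The tools and the
# decision step are toy stand-ins, not any specific platform's API.
import datetime

TOOLS = {
    "calculator": lambda expr: eval(expr, {"__builtins__": {}}),  # toy only: never eval untrusted input
    "today": lambda _: datetime.date.today().isoformat(),
}

def answer(question):
    # Stand-in for the LLM's decision: is a tool needed, and if so, which one?
    if question.startswith("compute:"):
        tool, arg = "calculator", question.split(":", 1)[1].strip()
    elif "date" in question:
        tool, arg = "today", None
    else:
        return "answered directly from the model's own knowledge"
    result = TOOLS[tool](arg)            # call the chosen tool's API with its parameters
    return f"answer integrating the tool result: {result}"

print(answer("compute: 23 * 7"))         # -> ... 161
print(answer("what is today's date?"))
```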


Agents are a technology well worth watching, but we currently lack a unified definition of the agent in the large-model era. You can think of agents as different roles assigned to large language models, which accomplish tasks through a division of labor. Agents are a research direction with a long history, decades of it, but in the large-model era, with LLM capabilities behind them, agents have entirely different abilities and huge technical potential. As for the definition, I think the traditional definition of an agent may no longer fit the new situation; the large-model era may need to redefine what an agent means.

The figure above shows an agent system resembling a game-sandbox environment that simulates human society: each agent has its own professional identity, and different agents can communicate in natural language and hold various gatherings. It looks like a prototype of the science-fiction series "Westworld".


Summarizing how agents collaborate, there are two main modes: competitive and collaborative (see the toy sketch below). In the competitive mode, different agents question, argue with, and debate one another to reach better task results. In the collaborative mode, they divide the work by role and ability, each shouldering one link of the task, and complete it together by helping and cooperating with one another.
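Here is a toy sketch of the two modes, with each "agent" reduced to a plain function; a real system would back each role with an LLM.

```python
# A toy sketch of the two collaboration modes described above. Each "agent"
# is reduced to a plain function; a real system backs each role with an LLM.

def proposer(task):
    return f"draft answer for {task!r}"

def critic(draft):                       # competitive: question and challenge the answer
    return f"objection: justify {draft!r} with evidence"

def planner(task):                       # collaborative: divide the work by role
    return [f"research {task}", f"write up {task}"]

def worker(subtask):                     # each agent shoulders one link of the task
    return f"done: {subtask}"

# Competitive mode: agents debate until the answer survives criticism.
draft = proposer("choose a tool-use strategy")
print(critic(draft))

# Collaborative mode: a planner splits the task, workers complete the links.
for sub in planner("the quarterly report"):
    print(worker(sub))
```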

Finally, let's discuss the advantages and problems of natural-language interaction. The advantages are that it is more natural, more convenient, and more unified, with almost no learning cost for users. But natural language also has drawbacks, such as vagueness and ambiguity.


The vagueness of natural language means it is sometimes hard to state your true intention clearly: you think you are being clear when you are not, and you may not even realize it. This is also why writing good prompts for a capable model is demanding; if the user cannot state their intention clearly, the model cannot do the job well.

The problem of ambiguity in natural language has always existed and is pervasive. For example, "bring me the apple" can mean different things, and listeners can understand it differently; how do you let the large model know which meaning is intended?

Given the vagueness and ambiguity of natural language, from the standpoint of human-computer interaction, future large models should make the interaction more proactive, that is, let the model actively ask the user questions. If the model senses that something in the user's words is off, or is not sure what exactly they mean, it should ask "What do you mean?" or "Did you mean this?" This is something large models should strengthen in the future.

Thank you all!
