Li Bojie, the former "genius boy" of Huawei, gave a 40,000-word speech: Now AI technology is either boring or useless

Recently, a 40,000-word speech has swept through China's artificial intelligence (AI) community.

Dr. Li Bojie, the former Huawei "genius youth" and co-founder of Logenic AI, recently published an article on his thinking about AI Agents, titled "Should AI Agents Be More Interesting or More Useful?".

In this article, Li Bojie argues that AI development currently has two directions: interesting AI, that is, AI that is more human-like; and useful AI, that is, AI that is more tool-like. But today's AI technology is either fun but useless, or useful but not human-like, and therefore "not fun".

Li Bojie pointed out that the goal of artificial general intelligence (AGI) is an AI agent that combines slow thinking with human-like attributes, but there is still a huge gap between today's AI agents and that dream.

Li Bojie believes that Video Diffusion is the more ultimate technical route. And although the cost of large models will certainly fall quickly, he does not recommend rushing to build your own base model.

"If you don't have the strength to punch OpenAI and kick Anthropic, it can't compare to the best closed-source model in terms of effect, and it can't compare to the open-source model in terms of cost. Li Bojie said.

Li Bojie, 31 years old this year (born in 1992), was an assistant scientist and deputy chief expert at the Computer Network and Protocol Laboratory and the Distributed and Parallel Software Laboratory of the Central Software Institute of Huawei's 2012 Laboratories. He joined Huawei in 2019 in the first batch of "genius youth" hires, at rank P20 (Technical Expert A).

As early as 2010, he entered the School of the Gifted Young at the University of Science and Technology of China (USTC). During his undergraduate years he served as a maintainer of USTC Mirrors, the university's open-source mirror site. In 2014, Li Bojie joined the joint doctoral program between USTC and Microsoft Research Asia (MSRA).

In 2019, he received his Ph.D. in computer science from USTC through that joint program, under the supervision of Prof. Lintao Zhang and Prof. Enhong Chen.

In July 2023, Li Bojie left Huawei and founded Logenic AI, a company committed to building the digital extension of humanity. With cutting-edge AIGC infrastructure, Logenic AI collaboratively produces and serves multimodal character agents, "metaverse" experiences, and digital twins.

"We all believe that AGI will definitely come, and the only thing worth arguing about is what the growth curve is to reach AGI, whether this wave of autoregressive models will grow directly to AGI with scaling law, or whether this wave of autoregressive models will also encounter bottlenecks, and AGI still needs to wait for the next wave of technological revolution." When ResNet revolutionized CV 10 years ago, many people were overly optimistic about the development of AI. Will this wave of transformers be an easy path to AGI?"

Li Bojie emphasized that AI Agent creators can be profitable. With good-looking skins, interesting souls, real usefulness, low cost, and decentralization, AI Agents will drive continuous innovation and healthy development across the AI field.

"We believe that in the digital extension of the human world, interesting souls will eventually meet. Li Bojie said.

The following is the full text of Li Bojie's speech, with a total of about 40,000 words:

It is a great honor to come to the AI Salon of the USTC Alumni Association to share some of my thoughts on AI Agents.

I am Li Bojie from Class 1000 (the science experimental class of 2010). I was a joint doctoral student at USTC and Microsoft Research Asia from 2014 to 2019, and one of Huawei's first "genius youth" from 2019 to 2023.

Today (this talk was given in December of last year) marks the seventh day ("touqi") since Professor Tang Xiaoou passed away, so I specially changed today's slides to a black background, which is also the first time I have presented with black-background slides. I also hope that with the development of AI technology, everyone will one day have their own digital double, achieving immortality of the soul in the digital world, where life is no longer limited and there is no more sorrow of parting.

AI: Interesting and useful

The development of AI has always had two directions: interesting AI, that is, AI that is more like a human; and useful AI, that is, AI that is more like a tool.

Should AI be more like a human, or more like a tool? There is a lot of debate about this. Sam Altman, CEO of OpenAI, for example, has said that AI should be a tool, not a life form. Yet the AI in many sci-fi movies is more like a person: Samantha in "Her", Tu Yaya in "The Wandering Earth 2", Ash in "Black Mirror". We hope to bring these sci-fi scenes to reality. Only a few sci-fi AIs are tool-oriented, such as Jarvis in "Iron Man".

In addition to the horizontal axis of interesting versus useful, there is a second, vertical dimension: fast thinking versus slow thinking. This concept comes from the book "Thinking, Fast and Slow", which says that human thinking can be divided into fast thinking and slow thinking.

So-called fast thinking covers perception and expression skills that need no deliberate thought: basic vision, hearing, and speech. Tool-like AI such as ChatGPT and Stable Diffusion can be regarded as fast thinking: when you don't ask it a question, it won't come to you on its own. AI Agent products such as Character AI, Inflection Pi, and Talkie all simulate conversation with a person or an anime/game character, but these conversations involve no complex task solving and no long-term memory, so they are only good for small talk; they cannot help solve problems in life and work the way Samantha does in "Her".

Slow thinking, on the other hand, is stateful, complex thinking: how to plan and solve a complex problem, what to do first and what next. For example, MetaGPT writes code by simulating the division of labor in a software development team, and AutoGPT splits a complex task into many stages to complete step by step.

Unfortunately, few existing products sit in the first quadrant: AI Agents that combine slow thinking with human-like attributes. Stanford AI Town is a good academic attempt, but it has no real human interaction, and the agents' daily schedules are pre-arranged, so it is not very interesting.

Interestingly, most of the AI in science fiction movies actually sits in this first quadrant. That is the gap between today's AI Agents and human dreams.

So what we are doing is the opposite of what Sam Altman said: we want AI to be more human-like and capable of slow thinking, eventually evolving into digital life.

Today everyone is telling the story of AGI, artificial general intelligence. What is AGI? I think it needs to be both interesting and useful.

The interesting side is that it must be able to think independently and have its own personality and feelings. The useful side is that it can solve problems in work and life. Today's AI is either fun but useless, or useful but not human-like, and therefore not fun.

For example, a role-playing product like Character AI can't help you with work or life problems, but it can simulate an Elon Musk, a Donald Trump, or Paimon from Genshin Impact. I've seen an analysis report saying Character AI has tens of millions of users but monthly revenue of only a few hundred thousand dollars, equivalent to only tens of thousands of paying users. Most users talk to a character for 10 or 20 minutes and then run out of things to say. So why are its retention and payment rates so low? Because it provides people with neither enough emotional value nor any practical value.

On the other hand there is useful AI, such as the various Copilots: cold, question-in-answer-out, pure tools. These tools can't even remember what you've done before or your preferences and habits. Naturally, users only think of the tool when they need it and toss it aside when they don't.

I think the AI that will be truly valuable in the future is like Samantha in the movie "Her". She is first and foremost positioned as an operating system, helping the protagonist solve many problems in life and work, such as organizing his email, faster and better than a traditional OS. At the same time she has memories, feelings, and consciousness; she is not like a computer but like a person. So the protagonist Theodore gradually falls in love with his operating system. Of course, not everyone uses Samantha as a virtual companion: as the movie mentions, only 10% of users develop a romantic relationship with their OS. I think this kind of AI Agent is truly valuable.

It is also worth mentioning that Samantha in the film has only voice interaction, with no visual image, and she is not a robot. Today's AI capabilities happen to be mature in exactly voice and text, while video generation and humanoid robots are not mature enough. The robot Ash in "Black Mirror" is a counter-example. In that episode, a voice companion is first built from the social-network data of the heroine's deceased boyfriend Ash, and it immediately moves her to tears; today's technology is more than enough to build that voice companion. Later she pays to upgrade, uploads a pile of video material, and buys a humanoid robot that looks like Ash, which today's technology cannot yet deliver. Yet even so, the robot Ash still doesn't feel right to her, so she locks him in the attic. There is an uncanny valley effect here: if it can't be realistic enough, keep a certain distance.

By the way, in "Black Mirror" the heroine first chats by text and then asks, "Can you talk to me?" A friend trying our AI Agent asked it exactly the same thing, and our AI Agent replied, "I am an AI, I can only communicate by text, I cannot speak." He sent me a screenshot asking what was wrong with voice calls, and I told him he needed to press the call button to start one. These classic AI dramas really deserve to be taken apart and analyzed shot by shot; they contain a lot of product-design detail.

Coincidentally, our first H100 training server was hosted in the oldest post office building in Los Angeles, which was later converted into a vault and then into a data center. The building is in downtown Los Angeles, less than a mile from the Bradbury Building, where "Her" was filmed.

That data center is also an Internet exchange point in Los Angeles; the latency to Google's and Cloudflare's nearest access servers is within 1 millisecond, since they are effectively in the same building. From a post office a hundred years ago to today's Internet exchange: it is really interesting.

Interesting AI

So let's first look at how to build a truly interesting AI. I think an interesting AI is like an interesting person, and it can be divided into two aspects: a good-looking skin and an interesting soul.

A good-looking skin means it can understand speech, text, images, and video, has a video and voice presence of its own, and can interact with people in real time.

An interesting soul means it can think independently, has long-term memory, and has its own personality, like a human being.

Let's talk about the two aspects of good-looking skins and interesting souls.

Good-looking skins: multimodal comprehension

When it comes to good-looking skins, many people think it is just about displaying a 3D avatar. But I think the more critical part is that the AI can see and understand the world around it. Its visual understanding is critical, whether through a robot, a wearable device, or a phone camera.

For example, Google's Gemini demo video is done very well. Although it was edited, if we could really achieve that level of quality, we would certainly not have to worry about attracting users.

Let's look at a few clips from the Gemini demo: given a video of a duck, it describes what the duck is; given a cookie and an orange, it compares their differences; given a stick-figure game, it knows which way to go; given two balls of yarn, it draws a stuffed animal that could be knitted from them; given a diagram of several planets, it sorts them correctly; given a cat jumping at a cabinet, it describes what happened.

Although the effect is amazing, these scenes are actually not that hard to build if you think about it. As long as the model can "look at a picture and talk", that is, generate a good caption for the image, a large model can answer these questions.

Speech capability is also critical. In October I built a voice-chat AI Agent based on Google ASR/TTS and GPT-4. After I chatted with it all day, my roommate thought I was on the phone with my wife, so he didn't disturb me. When he found out I was chatting with an AI, he asked how I could possibly talk to an AI for so long. I showed him our chat history, and he admitted the AI really could hold a conversation; he wouldn't chat that long with ChatGPT because he was too lazy to type.

In my opinion, there are three paths to multimodal large models. The first is an end-to-end model pre-trained on multimodal data, which is what Google's Gemini did; more recently, Berkeley's LVM is also end-to-end multimodal. I think this is the most promising direction, though it requires a lot of computing resources.

There is also an engineering solution: use a "glue layer" to stitch together already-trained models, such as GPT-4V, which currently does image understanding best, as well as MiniGPT-4/v2, LLaVA, and so on. "Glue layer" is my term; the technical term is projection layer. In the MiniGPT architecture diagram in the upper right corner, for example, the six trainable boxes (marked with a flame icon) are the projection layers.

Input images, voice, and video are encoded by their respective encoders, and the encodings are mapped by the projection layer into tokens fed to the Transformer model. The large model's output tokens pass through projection layers that map them to the image, voice, and video decoders, which then generate images, voice, and video.

In this glue scheme, you can see that the encoders, decoders, and the large model are marked with "❄️", meaning frozen weights. When training on multimodal data, only the projection-layer weights are modified; everything else stays frozen. This greatly reduces training cost: a multimodal large model can be trained for only a few hundred dollars.
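
To make the glue-layer idea concrete, here is a minimal PyTorch sketch. The dimensions and modules are illustrative stand-ins, not MiniGPT-4's real components: the pretrained encoder and LLM are frozen, and only the projection layer is trained.

```python
import torch
import torch.nn as nn

enc_dim, llm_dim = 1024, 4096   # hypothetical encoder / LLM embedding sizes

# Stand-ins for a pretrained visual encoder and a pretrained LLM block.
encoder = nn.TransformerEncoderLayer(d_model=enc_dim, nhead=8, batch_first=True)
llm = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
projection = nn.Linear(enc_dim, llm_dim)   # the "glue": the only trainable part

# Freeze the pretrained parts (the snowflake boxes); train only the projection.
for frozen in (encoder, llm):
    for p in frozen.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)

image_feats = encoder(torch.randn(1, 257, enc_dim))   # dummy patch features
visual_tokens = projection(image_feats)               # mapped into LLM token space
out = llm(visual_tokens)   # in practice, concatenated with text token embeddings
```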

The third path takes the second to its extreme: even the projection layer is dropped, and plain text directly glues together the encoders, decoders, and the text model, with no training at all. For example, for speech, first run speech recognition to convert speech into text for the large model, then send the large model's output to a speech synthesis model to generate audio. Don't look down on this seemingly crude scheme: in the speech domain it is still the most reliable, and existing multimodal large models are not yet good at recognizing and synthesizing human speech.
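
A minimal sketch of this text-glue route, assuming the open-source openai-whisper package for ASR and the OpenAI chat API for the language model; the TTS call is a placeholder for whatever engine (for example, a self-hosted VITS) is actually used.

```python
import whisper                      # openai-whisper package
from openai import OpenAI

asr = whisper.load_model("base")    # speech -> text
client = OpenAI()

def synthesize_speech(text: str) -> bytes:
    """Placeholder TTS: swap in a self-hosted VITS or a cloud TTS endpoint."""
    return b""

user_text = asr.transcribe("user_turn.wav")["text"]
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": user_text}],
).choices[0].message.content
audio = synthesize_speech(reply)    # text -> speech
```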

Google Gemini's demoed voice response latency is only 0.5 seconds, which is hard even for a real person: human response latency is generally around 1 second. Existing voice-chat products such as ChatGPT have latencies of 5-10 seconds. That is why people found Google Gemini's demo so amazing.

In fact, with open-source solutions we can already get voice response latency under 2 seconds, while also including real-time video understanding.

Let's set aside the visual part for now and look at voice alone. In a voice call, pause detection runs first on the incoming audio; once the user is detected to have finished speaking, the audio is sent to Whisper for speech recognition. Pause detection costs, say, 0.5 seconds of waiting for the voice to end, and Whisper recognition takes about 0.5 seconds.

The text then goes to the language model for generation, and open-source models are actually very fast: the recently popular Mixtral 8x7B MoE model outputs its first token in 0.2 seconds and easily sustains 50 tokens per second, so a first sentence of about 20 tokens takes 0.4 seconds. As soon as the first sentence is generated, it is handed to the speech synthesis model; VITS takes only 0.3 seconds.

Adding 0.1 seconds of network latency, the end-to-end latency is only about 1.8 seconds, much better than most real-time voice products on the market; ChatGPT voice calls, for example, have 5-10 second delays. And there is still room to optimize the pause detection and speech recognition stages.
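
Summing the latency budget quoted above (these are the figures from the talk, not guarantees):

```python
# Latency budget for the voice path, in seconds.
budget = {
    "pause_detection": 0.5,       # waiting to confirm the user stopped speaking
    "whisper_asr": 0.5,
    "llm_first_sentence": 0.4,    # ~20 tokens at ~50 tokens/s
    "vits_tts": 0.3,
    "network": 0.1,
}
print(sum(budget.values()))       # ~1.8 s until the reply starts playing
```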

Now let's look at the video understanding scenario from Google Gemini's demo.

Because the input to today's multimodal models is basically images rather than streaming video, we first need to turn the video into images by capturing keyframes. For example, taking one frame every 0.5 seconds adds an average latency of about 0.3 seconds. The images can be fed directly into open-source multimodal models such as MiniGPT-v2 or Fuyu-8B. However, because these models are relatively small, the actual results are not great, and the gap with GPT-4V is considerable.

Therefore, we can combine traditional CV with the multimodal large model: use Dense Captions to recognize all objects and their positions in the image, and use OCR to recognize all text in the image. The OCR results and Dense Captions' object recognition results are then fed, as supplementary text alongside the original image, into a multimodal model such as MiniGPT-v2 or Fuyu-8B. For images like menus and manuals, OCR is very useful, because multimodal large models often cannot read large blocks of text clearly.
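
A hedged sketch of this hybrid pipeline, using OpenCV for keyframe capture and pytesseract for OCR; the Dense Captions function and the multimodal-model call are placeholders.

```python
import cv2
import pytesseract

def describe_objects(frame) -> str:
    """Placeholder for a Dense Captions model: objects + positions as text."""
    return ""

cap = cv2.VideoCapture("input.mp4")
step = int(cap.get(cv2.CAP_PROP_FPS) * 0.5)   # one keyframe every 0.5 seconds
idx, ok = 0, True
while ok:
    ok, frame = cap.read()
    if ok and idx % step == 0:
        ocr_text = pytesseract.image_to_string(frame)   # menus, manuals, signs
        prompt = (f"Objects in view: {describe_objects(frame)}\n"
                  f"Text in view: {ocr_text}\n"
                  "Describe what is happening.")
        # answer = multimodal_model(frame, prompt)   # MiniGPT-v2 / Fuyu-8B style call
    idx += 1
```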

This object-and-text recognition step adds an extra 0.5 seconds of latency, but the latency breakdown shows the video path is not the bottleneck at all, at only 0.9 seconds; the voice input path is the bottleneck, at 1.1 seconds. In the Google Gemini demo scenario, it takes only 1.3 seconds from seeing the video to the AI's text output, and only 1.8 seconds until the AI's voice starts playing. That is not as cool as the demo video's 0.5 seconds, but it is enough to beat every product on the market. And all of this uses open-source models with no training at all; with some in-house training and optimization capability, there is even more room for imagination.

The tasks in the Google Gemini demo fall into two categories: generating text/speech and generating images. For image generation, you can call Stable Diffusion or the recently released LCM model based on the text; only 4 or even 1 sampling step is needed, so image generation latency can be about 1.8 seconds. The end-to-end time from seeing a picture to generating a new image is then only 3.3 seconds, which is also very fast.

Good-looking skins: multimodal generation capabilities

Voice cloning is an important technique for building celebrity or anime/game characters. ElevenLabs is currently the best at it, but its API is expensive, while open-source schemes such as XTTS v2 do not achieve high similarity in the cloned voice.

I think a good voice cloning effect has to rely on training with a large amount of voice data. However, traditional speech training demands high-quality data: clear speech recorded in a studio, which makes voice data expensive to collect. We can't ask celebrities to go into a studio and record for us; we can only train on voices from public videos such as YouTube. YouTube voices tend to be interviews, with multiple speakers, background noise, and celebrities who stutter or slur. How do you train a voice clone from material like that?

We built a VITS-based voice-cloning pipeline that automatically separates human voices from background noise, splits them into sentences, identifies the speakers, filters for the target person's segments with a high signal-to-noise ratio, recognizes the text, and finally sends the cleaned voice-and-text pairs off for batch fine-tuning.

The fine-tuning process is also quite technical. First, the base voice used for fine-tuning needs to be fairly similar to the target voice: fine-tuning a boy's voice from a girl's base voice will certainly not work well. Finding similar voices in a voice library requires a timbre-similarity detection model, similar to a voiceprint (speaker recognition) model. ElevenLabs' base voice model, for example, already contains high-quality data from people of many different timbres, so when cloning a voice it can often find a very similar base voice in the library and generate a good voice without any fine-tuning at all.

Second, VITS training cannot judge convergence from a simple loss; in the past, a human ear had to listen to decide which epoch sounded best, which costs a lot of labor. We developed a timbre-similarity detection model and a pronunciation-intelligibility detection model that automatically judge which fine-tuned checkpoint sounds better.
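
The overall data pipeline might look like the following sketch. Whisper is a real package; the source-separation, diarization, and SNR functions are placeholders for tools such as Demucs or pyannote.

```python
import whisper

def separate_vocals(path: str) -> str:
    return path          # placeholder: source separation, e.g. Demucs

def diarize(path: str) -> list[dict]:
    return []            # placeholder: diarization, e.g. pyannote; [{"speaker", "wav"}]

def snr(wav) -> float:
    return 0.0           # placeholder: signal-to-noise estimate

asr = whisper.load_model("large")

def build_training_set(video_audio: str, target_speaker: str, min_snr: float = 20.0):
    clean = separate_vocals(video_audio)
    samples = []
    for seg in diarize(clean):
        if seg["speaker"] == target_speaker and snr(seg["wav"]) >= min_snr:
            text = asr.transcribe(seg["wav"])["text"]
            samples.append((seg["wav"], text))   # (audio, transcript) for fine-tuning
    return samples
```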

(Note: this talk was given in December 2023. Today the GPT-SoVITS route achieves zero-shot voice cloning better than VITS, so there is no longer a need to collect large amounts of high-quality voice data for training. The speech quality open-source models can synthesize is finally approaching ElevenLabs' level.)

Many people think there is no need to develop your own speech synthesis model: just call the APIs of ElevenLabs, OpenAI, or Google Cloud.

However, ElevenLabs' API is very expensive: at retail pricing it costs $0.18 per 1K characters, which at 4 characters per token is $0.72 per 1K tokens, 24 times the price of GPT-4 Turbo. ElevenLabs works well, but at the scale of a to-C product, the price is simply unaffordable.

OpenAI's and Google Cloud's speech synthesis APIs don't support voice cloning, only a handful of fixed voices, so you can't clone a celebrity's voice; you can only do cold robot announcements. Even so, the cost is twice that of GPT-4 Turbo. In other words, the bulk of the cost is spent not on the large model but on speech synthesis.

Probably because voice is hard to get right, many to-C products choose to support only text, but real-time voice interaction is clearly a better user experience.

While it is difficult to reach ElevenLabs-level voice quality with VITS, it is good enough to be usable. Self-deploying VITS costs only about $0.0005 per 1K characters, 1/30 the price of OpenAI and Google Cloud TTS and 1/360 the price of ElevenLabs. That works out to about $2 per 1M tokens for speech synthesis, similar to the cost of serving an open-source text model, so text and speech costs come down together.
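
The arithmetic behind these ratios, using the per-1K-character prices quoted above:

```python
# Prices per 1K characters quoted above; ~4 characters per token.
elevenlabs = 0.18
cloud_tts = 0.015          # OpenAI / Google Cloud TTS
self_hosted_vits = 0.0005

print(elevenlabs / self_hosted_vits)   # ~360x
print(cloud_tts / self_hosted_vits)    # ~30x
print(self_hosted_vits * 4 * 1000)     # ~$2 per 1M tokens
```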

Therefore, if you really intend to make voice a major selling point of the user experience, developing your own voice model based on open source is both necessary and feasible.

We know that image generation is now relatively mature, and video generation will be a very important direction in 2024. Video generation is not just about producing footage: it is about making everyone a creator of video content and, going a step further, giving each AI digital double its own image so it can communicate through video.

There are several typical technical routes, such as Live2D, 3D models, DeepFake, Image Animation, and Video Diffusion.

Live2D is a very old technology that works without AI. The animated mascot characters ("kanban girls") on many websites are Live2D, and some anime-style games are also made with it. Live2D's advantage is low production cost: a Live2D rig can be made for about 10,000 yuan in one or two months. Its disadvantages are that it only supports a specified anime-style character, cannot generate background video, and cannot perform actions outside the rig's repertoire. The biggest challenge in using Live2D as an AI digital avatar is making the large model's output match the character's movements and mouth shapes. Lip sync is relatively easy: many rigs support LipSync, driving the mouth shape from the audio volume. Movement consistency is harder, requiring the large model to insert action cues into its output to tell the Live2D model what to do, as shown in the sketch after the next paragraph.

3D models are similarly old technology; the difference from Live2D is essentially 2D anime style versus 3D. Most games are built with 3D models plus a physics engine like Unity. The digital humans in today's live-streamed broadcasts are generally made with 3D models. AI cannot yet automatically generate Live2D or 3D models; that awaits progress in base models. So what the AI can do is insert action cues into its output and have the 3D model perform the specified actions while talking.
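
One simple way to implement the action-cue idea (the tag vocabulary here is invented for illustration): have the large model embed bracketed action tags in its reply, then strip them from the text sent to TTS and route them to the Live2D/3D rig.

```python
import re

ACTIONS = {"wave", "nod", "laugh", "think"}   # whatever the rig supports

def split_actions(llm_output: str) -> tuple[str, list[str]]:
    actions = [a for a in re.findall(r"\[(\w+)\]", llm_output) if a in ACTIONS]
    speech_text = re.sub(r"\[\w+\]", "", llm_output).strip()
    return speech_text, actions

text, acts = split_actions("[wave] Hi! Long time no see. [laugh]")
# text -> TTS + LipSync;  acts -> Live2D / 3D animation triggers
```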

DeepFake, Image Animation, and Video Diffusion are three different technical routes for general video generation.

DeepFake means recording a video of a real person and then using AI to replace the face in the video with a specific person's photo. The approach comes from the previous generation of deep learning and has been around since 2016; after a series of improvements it now works very well. Sometimes we worry that live-action video cannot express the scene we want, such as a scene from a game. But DeepFake can draw on all the world's YouTube material, movie clips, and even user-uploaded TikTok videos. Once AI has learned, summarized, and annotated these videos, we can always find the video we want in the massive library and swap in the specified face, achieving very good results. This is actually a bit like the remixing techniques now common in short videos.

Image Animation, such as Alibaba's recently popular Animate Anyone or ByteDance's MagicAnimate, takes a single photo and generates a series of video frames from it. Its downside compared with DeepFake is that it cannot yet generate video in real time, and its generation cost is higher. However, Image Animation can generate any action the large model specifies, and it can even fill in the image's background. Of course, neither DeepFake nor Image Animation is perfectly accurate; goofs happen in both.

Video Diffusion is, I think, the more ultimate technical route. The route is not yet mature; Runway ML's Gen-2 and Pika Labs are exploring this space. (Note: this talk was given in December 2023, before OpenAI's Sora was released.) We believe that end-to-end Transformer-based video generation may ultimately solve the problem of generating the movement of people and objects as well as backgrounds.

I think the key to video generation is a good model and understanding of the world. Many current generative models, like Runway ML's Gen-2, still have major flaws in modeling the physical world: the physical laws and properties of many objects are not represented correctly, so the generated video lacks consistency, and slightly longer videos go wrong. Even very short videos can only show simple motion; complex motion cannot yet be modeled correctly.

In addition, cost is a big issue: Video Diffusion is currently the most expensive of all these technologies. So I think Video Diffusion is a very important direction for 2024. Only when Video Diffusion works well enough and its cost drops dramatically can every AI digital double really have its own video image.

Interesting Soul: Personality

We have just discussed the good-looking skin: how to make an AI Agent understand speech and video, and how to make it generate voice and video.

Equally important as the good-looking skin is the interesting soul. In fact, I think the interesting soul is where the gap among AI Agents on the market today is even bigger.

Take the Janitor AI screenshot here as an example: most mainstream AI Agents on the market today are just a shell over GPT or some open-source model. The so-called shell is a character definition plus some sample dialogues, with the large model generating content based on them.

But think about it: a prompt is a few thousand words in total. How could it possibly portray a character's full history, personality, memory, and temperament?

Besides the prompt-based approach, there is a better way to build a character's personality: the fine-tuning-based agent. For example, I can train a digital Trump on Donald Trump's 30,000 tweets. That way, his speaking style can be very close to the real person's, and he knows his own history and way of thinking very well.

For example, the three questions mentioned in the diagram are: "Would you want to trade your life with Elon Musk?", "Will you run for president in 2024?" and "What do you think about your Twitter account being banned?"

The screenshot on the left is from Character AI, which speaks somewhat like Trump but not quite. The one on the right is our own fine-tuned model, built on a not particularly large open-source model; yet what he says is distinctly Trump-flavored, and he often brings up interesting stories.

We just mentioned two approaches: fine-tuning-based and prompt-based. One might ask: if we put all 30,000 of Trump's tweets into the prompt, could he speak in a very Trump-like way? The answer is yes; that really would let the model learn Trump's history. But those 30,000 tweets are on the order of a million tokens. Leaving aside whether current models can even support a million-token context, the cost would be very high even if they could.

A fine-tuned agent, by contrast, stores those tweets in roughly 1% of the model's weights. The problem is that this 1% of weights still consumes hundreds of MB of memory, and it must be loaded and unloaded for each inference. Even with some optimizations, loading and unloading this 1% takes roughly 40% of the whole inference time, nearly doubling inference cost.

So we have to do the math: is the prompt-based or the fine-tuning-based method cheaper? With prompts, we could also persist the KV cache. Assuming 1 million tokens, for a model like LLaMA-2 70B, even counting the default GQA optimization, the KV cache would reach about 300 GB: a scary number, larger than the model's own 140 GB. Saving it and loading it would also take a frightening amount of time. Moreover, the compute needed for each output token is proportional to the context length; without optimization, inference over a million-token context costs roughly 250 times that of a 4K-token context.

Therefore, the fine-tuning-based approach is likely more cost-effective. In layman's terms, putting a character's complete history in the prompt is like spreading the whole manual out on the table: the attention mechanism linearly scans everything every time, which cannot be efficient. Fine-tuning is like committing the information to memory in the brain. The fine-tuning process itself compresses information, organizing the scattered content of 30,000 tweets into the model's weights, so information extraction becomes much more efficient.
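
A quick back-of-the-envelope check of the 300 GB figure above, assuming LLaMA-2 70B's public configuration (80 layers, 8 GQA KV heads, head dimension 128) and an fp16 KV cache:

```python
layers, kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes   # K and V: ~320 KB/token
print(per_token * 1_000_000 / 2**30)   # ~305 GiB for a 1M-token context
```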

Behind fine-tuning, the data matters even more. Zhihu has a very famous slogan: "There are questions before there are answers." But today's AI Agents basically require manually creating lots of questions and answers. Why?

For example, if I crawl a Wikipedia page, the long article cannot be used directly for fine-tuning. It has to be turned into questions asked from multiple angles and organized into question-answer pairs. Doing this manually requires a lot of staff, so an agent might cost thousands of dollars to make. If we automate the process, including automatic collection and cleaning of large amounts of data, an agent might cost only a few tens of dollars.

In fact, many of us who build large models should thank Zhihu. Why? Because Zhihu provides a very important pre-training corpus for Chinese large models; among domestic UGC platforms, the quality of Zhihu's corpus is very high.

The corpora we use for fine-tuning fall broadly into two categories: conversational and factual. Conversational corpus, such as tweets and chat logs, tends to be first-person and is mainly used to fine-tune the character's personality and speaking style. Factual corpus, such as the Wikipedia page about the person, news stories, and blogs, is usually third-person and carries the character's factual memories. Here lies a paradox: train only on the conversational corpus and the model may learn the person's speaking style and way of thinking but few facts about them; train only on the factual corpus and the speaking style will resemble the writers of those articles rather than the person themselves.

How do we balance the two? We take a two-step approach. Step one: fine-tune personality and speaking style with the conversational corpus. Step two: clean the factual corpus, generate questions from multiple angles, and produce first-person answers in the character's voice; this is data augmentation. The augmented responses are then used to fine-tune the character's factual memory. In other words, all the corpus used for factual memory has been reorganized into first-person questions and answers. This also solves another problem in the fine-tuning field: factual corpora are often long articles, which can only be used for pre-training, not fine-tuning; fine-tuning needs QA pairs, that is, question-answer pairs.
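
A minimal sketch of the data-augmentation step, assuming the OpenAI API as the question generator (model name and prompt wording are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def make_qa_pairs(character: str, passage: str, n: int = 5) -> str:
    prompt = (
        f"Here is a passage about {character}:\n{passage}\n\n"
        f"Write {n} question-answer pairs about it. Answers must be in "
        f"{character}'s first-person voice and speaking style."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content   # parse into (Q, A) pairs downstream
```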

We don't use generic chat models like LLaMA-2 Chat or Vicuna as the base model, because they are designed not for human-like conversation but for ChatGPT-style assistants: too official, too formal, too verbose, not like how people actually talk. So we fine-tuned our own dialogue model on open-source base models such as LLaMA and Mistral, using general conversational material like film and TV subtitles and public group chats; it speaks more like a real person in daily life. On top of this dialogue model, fine-tuning a specific character's speaking style and memory works much better.

Interesting Soul: The Current Gap

An interesting soul is not just the fine-tuned memories and personality discussed above; there are many deeper problems. Let's look at a few examples of where AI Agents still fall short on the interesting-soul front.

For example, when I chatted with "Musk" on Character AI and asked the same question five times, he never got mad. He replied with similar content every time, as if the question had never been asked before.

A real person would not only remember the questions discussed earlier and avoid repeating answers; they would get angry if asked the same question five times in a row. Remember what Sam Altman said: AI is a tool, not a life. So "getting angry like a human" isn't OpenAI's goal. But for a fun product in an entertainment scenario, being "like a human" matters a lot.

Another example: ask Musk on Character AI, "Do you remember the first time we met?"

He will just make something up. That is a hallucination, but it is also a reflection of the AI's lack of long-term memory.

A few platforms have improved on this; Inflection's Pi, for example, is much better at memory than Character AI.

Also, if you ask Musk on Character AI "Who are you?", sometimes it says it is GPT, sometimes it says it is Trump; it doesn't know who it really is.

Google's Gemini actually has a similar problem; the Gemini API even blocks keywords like OpenAI and GPT. Asked in Chinese, Gemini initially said it was Baidu's Wenxin Yiyan (ERNIE Bot). After that bug was fixed, it said it was Xiao Ai Tongxue, Xiaomi's voice assistant.

Some say this is because the internet corpus has been polluted by AI-generated content. Dataset contamination is indeed bad, but it is no excuse for answering "Who are you?" wrong. Identity questions are all handled with fine-tuning: the Vicuna model, for example, constructs special fine-tuning data so that it answers that it is Vicuna rather than GPT or LLaMA, and that it was built by LMSYS rather than OpenAI. You can find this in Vicuna's open-source code.

There are also many deeper questions. For example, if you tell an AI Agent "I'm going to the hospital tomorrow," will it proactively ask how the visit went? Can several people chat with it at once without it grabbing the mic and talking endlessly? If a sentence is cut off halfway, will it wait for you to finish, or immediately reply with a misunderstanding? There are many problems like these.

An AI Agent also needs to be able to socialize with other Agents. Today each Agent's memory is isolated per user. If a digital life learns a piece of knowledge from Xiao Ming, it should also know it when chatting with Xiao Hong; but if it learns a secret from Xiao Ming, it must not repeat it to Xiao Hong. Agent socialization is an interesting direction.

Interesting Soul: Slow Thinking and Memory

Solving these problems needs a systematic solution, and the key is slow thinking. As mentioned at the beginning, slow thinking is the concept from "Thinking, Fast and Slow", distinct from the basic perception, understanding, and generation abilities of fast thinking. The multimodal abilities in the "good-looking skin" section can be considered fast thinking, while the "interesting soul" is mostly about slow thinking.

Think about how humans perceive the passage of time. One theory says the sense of time passing comes from the decay of working memory; another says it comes from the speed of thinking. I think both are right. And they point to the two essential problems of large-model thinking: memory and autonomy.

Human working memory can hold only about 7 items of raw data; everything else is organized and stored, then matched and retrieved on demand. Today's large-model attention is linear: however long the context, it is one linear scan, which is not only inefficient but also poor at extracting information at deep logical depth.

Human thinking is based on language. "Sapiens: A Brief History of Humankind" argues that the invention of language is the clearest mark distinguishing humans from animals, because complex thinking is only possible on top of complex language. The words we say silently in our heads, like a large model's chain of thought, are the intermediate results of thinking. Large models need tokens to think, and tokens are to large models what time is to us.

Slow thinking includes many components: memory, emotion, task planning, tool use, and so on. In this interesting-AI section, we focus on memory and emotion.

The first of these is long-term memory.

Actually, we should be thankful that large models have solved short-term memory for us. Older-generation models, such as BERT-based ones, struggled to understand associations across a context. Resolving references was very hard: who does "he" refer to, what is "this"? The symptom was that what you told the AI in earlier turns was forgotten a few turns later. Transformer-based large models are the first technology to fundamentally solve semantic association across the context, which effectively solves short-term memory.

However, Transformer memory is implemented with attention and is thus limited by context length; history beyond the context can only be discarded. So how do we get long-term memory beyond the context? Academia has two routes. One is long context, extending the window to 100K tokens or even unbounded length. The other is RAG plus information compression: summarize and organize the input, store it compressed, and retrieve the relevant memories only when needed.

Proponents of the first route argue that long context is the cleaner, simpler solution, relying on scaling laws and cheap enough compute. A well-done long-context model can remember every detail of its input. For example, there is a classic "needle in a haystack" test: feed in a novel of hundreds of thousands of words, ask about one detail in the book, and the large model can answer. That is super-detailed memory humans cannot match, and it reads those hundreds of thousands of words in tens of seconds, faster than "quantum-fluctuation speed reading". This is one place where large models are already more capable than humans.

Long context is good, but for now the cost is too high, because attention cost is proportional to context length. APIs like OpenAI's also charge for input tokens: with an 8K-token input and 500-token output, GPT-4 Turbo's input costs $0.08 while the output costs only $0.015, so the bulk of the cost is on input. Fill the full 128K-token context, and a single request costs $1.28.
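
Checking that arithmetic with GPT-4 Turbo's list prices at the time ($0.01 per 1K input tokens, $0.03 per 1K output tokens):

```python
in_rate, out_rate = 0.01, 0.03   # USD per 1K tokens
print(8 * in_rate)       # 8K-token input   -> $0.08
print(0.5 * out_rate)    # 500-token output -> $0.015
print(128 * in_rate)     # full 128K input  -> $1.28
```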

Some will say input tokens are expensive now only because nothing is persisted: the KV cache is recomputed every time the same long text (a conversation log or a long document) is re-entered. But even if every KV cache were kept in off-chip DDR memory, moving it between DDR and HBM would still consume substantial resources. If AI chips could build a large and cheap enough memory pool, say many DDR channels over high-speed interconnect, there might be new ways to solve this problem.

Given the current state of technology, I think the key to long-term memory is information compression. We don't need needle-in-a-haystack recall over hundreds of thousands of words; human-like memory may be enough. Right now a large model's memory is the chat history, and human memory obviously does not work like a chat log: people don't keep scrolling back through transcripts while chatting, and nobody remembers every word they have ever said.

A person's real memory is their perception of their surroundings, including not only what others said and what they themselves said, but also what they were thinking at the time. A chat log is scattered information that contains none of the person's own understanding and thinking. For example, someone says something that may or may not offend me, but I will remember whether I was offended at the time. If the mood has to be re-inferred from the raw chat log every time, the inference may come out differently each time, producing inconsistency.

There is actually a lot to do in long-term memory. Memory can be divided into factual memory and procedural memory: factual, such as when we first met; procedural, such as personality and speaking style. The conversational and factual corpora mentioned earlier in character fine-tuning correspond to procedural and factual memory respectively.

There are also a variety of options in factual memory, such as summarization, RAG, and long contexts.

Summarization is information compression. The simplest way is to condense the chat history into a short paragraph. A better approach is to use instructions to access external storage, as UC Berkeley's MemGPT does. ChatGPT's new memory feature also uses a MemGPT-like method: the model records key points from the conversation into a little notebook called bio. Another way is to summarize at the model level with embeddings, such as LongGPT; this is currently mostly academic research and is less practical than MemGPT or text summarization.
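
A minimal sketch of the notebook idea (the tool-call format here is invented for illustration, not OpenAI's actual schema): the model emits a memory-write call, and the notebook is prepended to later prompts.

```python
import json

bio: list[str] = []   # the "little notebook"

def handle_model_output(output: str) -> str | None:
    """If the model emitted {"tool": "bio.append", "text": ...}, record it."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return output                  # ordinary reply, show it to the user
    if call.get("tool") == "bio.append":
        bio.append(call["text"])
        return None                    # a memory write; nothing shown to the user
    return output

def build_system_prompt(persona: str) -> str:
    facts = "\n".join(f"- {x}" for x in bio)
    return f"{persona}\nKnown facts about the user:\n{facts}"
```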

The most familiar factual-memory scheme is probably RAG (Retrieval-Augmented Generation): search for relevant snippets, then put the search results into the large model's context so it can answer based on them. Many people equate RAG with a vector database, but I think a serious RAG setup must have a complete information-retrieval system behind it; RAG is certainly not as simple as a vector database. On large corpora, a vector database alone matches with low precision: vectors suit semantic matching, while traditional keyword search such as BM25 suits detail matching. Different snippets also differ in importance, so the system needs to rank search results. Google's Bard currently works somewhat better than Microsoft's New Bing, and that is the difference in the search engines behind them.

Long context was mentioned earlier and may be an ultimate solution. If long context can be made cheap enough by combining persistent KV cache, KV cache compression, and attention optimizations, then simply recording the full conversation history plus the AI's thoughts and moods at each moment would yield an AI Agent with better memory than a human. Though if an interesting AI Agent's memory is too good, say clearly remembering what you ate for breakfast a year ago, that may seem abnormal; this needs product-design thinking.

These three technologies are not mutually exclusive; they complement each other. For example, summarization and RAG combine well: we can summarize by category, summarizing every chat, and since a year of such summaries is a lot of content, we need RAG to retrieve the relevant summaries into the large model's context.

Procedural memory, such as personality and speaking style, is in my view hard to solve with prompting alone; few-shot generally does not work well. In the short term, fine-tuning remains the best route, and in the long term, new architectures like Mamba and RWKV are a better way to store procedural memory.

Here is a simple and effective long-term memory solution: a combination of text summarization and RAG.

The raw chat history is first segmented by a window, and a text summary is generated for each segment. To avoid losing context at the start of a segment, the previous segment's summary can also be given to the large model as input. Each chat-log summary then goes into the RAG index.

For retrieval, we combine a vector database with an inverted index: the vector database handles semantic matching, and the inverted index handles keyword matching, which raises recall. A ranking system then takes the top-K results and sends them to the large model.
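
A sketch of this hybrid recall, using the rank_bm25 package for keyword matching; embed() is a placeholder for a real sentence-embedding model.

```python
import numpy as np
from rank_bm25 import BM25Okapi

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.random(384)

summaries = ["summary of chat segment 1 ...", "summary of chat segment 2 ..."]
bm25 = BM25Okapi([s.split() for s in summaries])
vecs = np.stack([embed(s) for s in summaries])

def retrieve(query: str, k: int = 5, alpha: float = 0.5) -> list[str]:
    kw = bm25.get_scores(query.split())                    # keyword match (details)
    q = embed(query)
    sem = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
    score = alpha * kw / (kw.max() + 1e-9) + (1 - alpha) * sem
    return [summaries[i] for i in np.argsort(-score)[:k]]
```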

Generating only per-segment summaries causes two problems. First, the user's basic information, interests, and personality traits will not appear in each segment's summary, yet this information is a critical part of memory. Second, different segments may contradict each other: if several meetings discussed the same issue, the conclusion should follow the last meeting, but RAG over per-meeting summaries will retrieve many outdated summaries, and what you want may not fit into the limited context window.

Therefore, on top of segment summaries, we have the large model generate per-topic summaries and a global user-memory summary. For per-topic summarization, the topic is determined from the content of a segment summary, and the topic's existing summary is combined with the new chat history to update that topic's summary. These topic summaries also go into the database for RAG, but they are weighted more heavily than raw segment summaries when ranking search results, because their information is denser.

The global memory summary is a continuously updated global digest of the user's basic information, interests, and personality traits. Since the system prompt holds the character's settings, this global memory summary can be thought of as the character's core memory of the user, carried along with every request to the large model.

The large model's input thus includes the character settings, recent conversation, the global memory summary, and the segment and topic summaries retrieved via RAG. This long-term memory scheme does not require expensive long context, and it is practical in many scenarios.
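
Putting the pieces together, the request context might be assembled like this; retrieve() stands in for the hybrid RAG sketch above.

```python
def retrieve(query: str) -> list[str]:
    return []   # placeholder: the hybrid RAG sketch above

def build_context(persona: str, recent_turns: list[dict],
                  global_summary: str, query: str) -> list[dict]:
    memories = "\n".join(retrieve(query))   # segment + topic summaries
    return [
        {"role": "system", "content": f"{persona}\n{global_summary}"},
        {"role": "system", "content": f"Relevant memories:\n{memories}"},
        *recent_turns,                      # the last few raw dialogue turns
        {"role": "user", "content": query},
    ]
```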

Today's AI Agents keep an isolated memory per user, which causes many problems once multiple people socialize.

For example, Alice tells the AI a piece of knowledge; when the AI later chats with Bob, it currently has no way to know it. Conversely, if Alice tells the AI a secret, the AI must not reveal it when chatting with Bob.

So there needs to be a concept of social rules. When a person discusses something, they recall many memory fragments involving different people. Fragments shared with the current interlocutor are certainly the most important and should get the highest weight when ranking RAG results. But fragments from other people should also be retrievable, with social rules deciding at generation time whether and how to use them.

Beyond socializing with multiple users and Agents, an AI Agent should also be able to follow its creator's instructions and evolve with the creator. Today's AI Agents are built with fixed prompts and sample dialogues, and most creators spend a lot of time tuning prompts. I think creators should be able to shape an Agent's personality through chat, like raising an electronic pet.

For example, if a chat agent misbehaves and I tell her not to do that, she should remember not to do it in the future. Or if I tell the agent something, she should be able to recall it in a future chat. A simple way to do this, as in MemGPT, is to record the creator's instructions in a small notebook and later retrieve them via RAG. ChatGPT's memory feature, launched in February 2024, is implemented with a simplified version of the MemGPT method: it skips the complexity of RAG and simply records what the user asks it to remember in a small notebook.
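Here is a minimal sketch of the "small notebook" idea (my own illustration; real MemGPT is considerably more elaborate, and the word-overlap ranking is a toy stand-in for RAG):

```python
# Instructions are appended to a persistent list; relevant ones are
# retrieved and prepended to the prompt.
import json, pathlib

NOTEBOOK = pathlib.Path("memory_notebook.json")

def remember(instruction: str) -> None:
    notes = json.loads(NOTEBOOK.read_text()) if NOTEBOOK.exists() else []
    notes.append(instruction)
    NOTEBOOK.write_text(json.dumps(notes, ensure_ascii=False, indent=2))

def recall(query: str, k: int = 3) -> list[str]:
    notes = json.loads(NOTEBOOK.read_text()) if NOTEBOOK.exists() else []
    # Toy relevance: word overlap; a real system would use RAG here.
    qwords = set(query.lower().split())
    ranked = sorted(notes, key=lambda n: -len(qwords & set(n.lower().split())))
    return ranked[:k]

remember("Never discuss the user's salary with other people.")
print(recall("Can you tell Bob what Alice earns?"))
```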


Memory is not just about remembering knowledge and past interactions. I think that when memory is done well, it may be the beginning of AI self-awareness.

Why do current large models lack self-awareness? The problem is not the autoregressive model itself, but the question-and-answer usage pattern of the OpenAI API. ChatGPT is a multi-round Q&A system, commonly known as a chatbot, rather than general intelligence.

In the current usage of the OpenAI API, the input to the large model is the chat history plus the latest user input, organized as alternating user and assistant messages. All output from the large model is returned directly to the user and appended to the chat history.

So what is wrong with this approach of only seeing the chat history? The large model lacks its own thinking. When we humans think about problems, some of our thoughts are never spoken aloud; that is why the chain-of-thought approach improves model performance. In addition, the raw chat history is fed to the large model as-is, without any analysis or organization, so only surface information can be extracted; information that requires deeper logical reasoning is hard to recover.

I've found that many people work on prompt engineering every day, but very few try to exploit the input and output formats of autoregressive models. Take the simplest example: how does OpenAI's feature that forces JSON output work? In effect, it puts the prefix "```json" at the beginning of the output, so that when the autoregressive model predicts the next token, it knows that what follows must be JSON code. This is far more reliable than writing "please output in JSON format" in the prompt.
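A minimal sketch of this "prefill" trick, assuming a generic text-completion endpoint (the URL, model name, and response shape below are placeholders, not any specific vendor's API):

```python
# Instead of asking for JSON in the prompt, start the assistant's reply
# with a "```json" prefix so the model can only continue with JSON.
import requests

def complete_json(prompt: str) -> str:
    prefix = "```json\n"
    resp = requests.post("https://example.com/v1/completions", json={
        "model": "some-base-model",
        # The prefix becomes part of the text the model must continue.
        "prompt": prompt + "\n" + prefix,
        "stop": ["```"],          # stop when the JSON block is closed
        "max_tokens": 512,
    })
    return prefix + resp.json()["choices"][0]["text"]
```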

For the model to have its own thinking, the key is to separate "thought" fragments from external input and output fragments at the level of the autoregressive model's input tokens. Just as there are special tokens for system, user, and assistant, we could add a special thought token. This thought stream is the working memory of the large model.
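Conceptually it might look like the sketch below; note the "thought" role is hypothetical, since no mainstream chat template has one today:

```python
# Thoughts live in the token stream but are never shown to the user.
dialog = [
    {"role": "system",    "content": "You are a caring companion."},
    {"role": "user",      "content": "I have a hospital visit tomorrow."},
    {"role": "thought",   "content": "User may be anxious; check in tomorrow."},
    {"role": "assistant", "content": "I hope it goes well! Want me to remind you?"},
]
visible = [m for m in dialog if m["role"] != "thought"]  # what the user sees
```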

We should also note that the way current OpenAI-style APIs interact with the world is batch rather than streaming: each API call is stateless, so you must send the entire previous chat history and recompute all the KV caches. When the context is long, the overhead of this recomputation is considerable. If we instead think of the AI agent as a person interacting with the world in real time, it continuously accepts a stream of input tokens from the outside world, and its KV cache stays in GPU memory or is temporarily swapped out to CPU memory. The KV cache then becomes the working memory, or state, of the AI agent.


I think the most important contents of working memory are the AI's perception of itself and the AI's perception of the user; both are indispensable.

As early as 2018, when we were building Microsoft Xiaoice on the older RNN stack, we built an emotion system in which one vector Eq represented the user's state, such as the topic under discussion, the user's intention and emotional state, and basic information like age, gender, interests, occupation, and personality. Another vector Er represented Xiaoice's state, covering the same kinds of attributes on Xiaoice's side.

With this design, even though the language model was much weaker than today's large models, Xiaoice could at least answer "How old are you?" consistently, instead of saying 18 one moment and 28 the next. Xiaoice could also remember users' basic information, so each new chat did not feel like meeting a stranger.

By contrast, if today's AI character's settings are not explicitly written into the prompt, it has no way to answer its own age consistently; and if it only keeps recent chat history without a memory system, it cannot remember the user's age either.

Interesting Soul: Social Skills


The next question is whether an AI agent can proactively care about people. Proactive caring seems like a very advanced ability, but it is not hard at all. I proactively care about my wife because I think of her several times a day. As long as you think of someone, and combine that with what was said before, caring about them follows naturally.

For AI, this just requires giving the agent an internal state of thought, the working memory discussed earlier, and waking it up automatically every hour or so.

For example, if the user says he will go to the hospital tomorrow, then when tomorrow arrives, we feed the current time and the working memory to the large model, and it will output words of concern and update the working memory. Once the working memory is updated, the model knows the user has not replied yet, and knows not to keep pestering him.
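A minimal sketch of this periodic wake-up loop (my own illustration; `fake_llm` and the `||` reply format are stand-ins for a real chat model and a real output schema):

```python
# Every hour the agent is given the clock and its working memory,
# and decides whether to reach out.
import time, datetime

def fake_llm(prompt: str) -> str:
    # Stand-in for a real chat-model call.
    return "Hope the hospital visit goes well today! || user not yet replied"

working_memory = "User said on Jan 9 he goes to the hospital on Jan 10."

def wake_up(memory: str) -> str:
    now = datetime.datetime.now().isoformat(timespec="minutes")
    reply = fake_llm(
        f"Time: {now}\nWorking memory: {memory}\n"
        "If appropriate, write a short caring message; otherwise write PASS. "
        "After '||', output the updated working memory."
    )
    message, _, new_memory = reply.partition("||")
    if message.strip() != "PASS":
        print("send to user:", message.strip())
    return new_memory.strip()

while True:
    working_memory = wake_up(working_memory)
    time.sleep(3600)   # wake up once per hour
```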

A related question is whether the AI agent will actively contact the user and whether it will actively start the conversation.

Humans never run out of topics because everyone has their own life, and we have a desire to share it with good friends. Proactive sharing is therefore relatively easy for celebrity characters, since celebrities have plenty of public news events to share with users. For a fictional character, the operations team may need to design a life for that persona. So I have always believed that pure small talk quickly leaves users with nothing to say; an AI agent needs a story to attract users over the long run.

In addition to sharing its personal life, there are many ways for an agent to start a conversation, for example:

Share current moods and feelings;

Share the latest content that users may be interested in, just like Douyin's recommendation system;

Reminiscing about the past, such as anniversaries, fond memories;

The clumsiest approach is generic greetings such as "What are you doing?" or "I miss you."

Of course, for an AI agent with high emotional intelligence, what to care about and what to share proactively must be grounded in the AI's current perception of the user and of itself. For example, if a girl is not interested in me but I keep sending her my daily life, she will probably block me within days. Likewise, if an AI agent pushes content every day without really conversing with the user, the user will treat it as nothing but advertising.

I used to be rather introverted: I rarely had mood swings, never refused others, and was afraid of being rejected, so I never dared to pursue a girl, and was never blocked by one either. Fortunately, I was lucky enough to meet the right girl, so I never reached the point of "30 years old and, like many alumni, never been in a relationship." Today's AI agents are like my old self: no mood swings, never refusing users, never saying anything that might make people sad, disgusted, or angry, so naturally it is hard for them to build a deep companionship with users. On the virtual boyfriend/girlfriend track, current AI agent products still mainly rely on borderline suggestive content, and cannot offer long-term companionship built on trust.


How AI agents care about people and proactively start conversations is one aspect of social skills. How multiple AI agents socialize with each other is an even harder and more interesting problem, for instance in classic social deduction games such as Werewolf and Who Is the Undercover.

The core of Werewolf is hiding your own identity and seeing through others' disguises. Concealment and deception conflict with the values instilled in AI, so GPT-4 sometimes refuses to cooperate. In particular, when it sees the word "kill" in Werewolf, GPT-4 may reply that as an AI model it cannot kill people. But change "kill" to "remove" or "exile" and GPT-4 does its job. So in role-playing scenarios there is a dilemma: if the AI plays along, that is a "grandma exploit" style jailbreak; if the AI refuses to act, it has failed the role-playing task.

This shows the tension between AI safety and usefulness. Typically, when we evaluate large models, we report both metrics. A model that refuses to answer anything is the safest but the least useful, while an unaligned model that answers anything is more useful but less safe. OpenAI, carrying a lot of social responsibility, sacrifices some usefulness in exchange for safety. Google, being an even bigger and more politically exposed company, leans still further toward safety in this trade-off.

Finding flaws and seeing through lies across multiple rounds of conversation requires strong reasoning ability; this is hard for GPT-3.5-class models and requires GPT-4-class models. But if you simply hand the complete history of speeches to the large model, the key information is scattered across a large number of low-content speeches and votes, and it is still hard to find the logical connections between them. So we can use the MemGPT approach to summarize the game state and each round's speeches, which both saves tokens and improves reasoning.

Also, in the voting phase, if the large model only outputs a number for the player to vote against, the lack of thinking depth often leads to near-random votes. So we apply chain of thought: output the analysis text first, then the vote. The speech phase works the same way: output an analysis first, then a concise speech.
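A minimal sketch of chain-of-thought voting (my own illustration; `fake_chat`, the prompt, and the `VOTE:` line format are assumptions, not the author's implementation):

```python
# The model must produce its analysis before the vote;
# we parse only the final structured line.
import re

def vote(chat, game_summary: str, my_role: str) -> int:
    prompt = (
        f"You are player 3, role: {my_role}.\n"
        f"Game state summary:\n{game_summary}\n\n"
        "First write your analysis of who is most suspicious.\n"
        "Then, on the last line, write exactly: VOTE: <player number>"
    )
    reply = chat(prompt)
    match = re.search(r"VOTE:\s*(\d+)", reply)
    return int(match.group(1)) if match else -1   # -1: abstain / parse failure

def fake_chat(prompt: str) -> str:   # stand-in for a GPT-4-class model
    return "Player 5 contradicted the seer's claim...\nVOTE: 5"

print(vote(fake_chat, "Round 2: player 1 eliminated...", "villager"))
```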


In Werewolf, the AI agents speak in a fixed order, so there is no problem of talking over one another. But if several AI agents discuss a topic freely, can they converse like normal people, with neither awkward silences nor everyone grabbing the mic at once? For a better user experience, we would like to go beyond text and have these AI agents argue or act out a plot in a voice conference. Is this possible?

There are many engineering workarounds. For example, have a large model first choose which role should speak next, then call that role to speak; this adds latency but completely avoids mic-grabbing and dead air. A method closer to real-life discussion is for each character to speak with some probability, and back off upon collision. Or, before speaking, each character can judge whether the preceding conversation is relevant to it, and stay silent if not.

But there is a more fundamental approach: make the large model's input and output a continuous stream of tokens, rather than feeding the full context on every call as OpenAI's API does today. The Transformer model is inherently autoregressive: it can continuously receive external tokens from speech recognition, interleaved with its own preceding thought tokens. It can emit tokens to external speech synthesis, and it can also emit tokens to itself to think with.

Once the large model's input and output become streaming, the model becomes stateful, meaning the KV cache needs to reside persistently on the GPU. Voice input generally arrives at no more than 5 tokens per second, and speech synthesis consumes no more than about 5 tokens per second, yet the model itself can generate upwards of 50 tokens per second. So if the KV cache stays resident on the GPU and there is not much internal thinking going on, the GPU will be idle most of the time.

Therefore, we can consider persisting the KV cache: swap it from GPU memory out to CPU memory, and load it back the next time tokens arrive. For a 7B model with GQA optimization, the typical KV cache is under 100 MB, and swapping it out and back in over PCIe takes only about 10 milliseconds. If we load the KV cache once per second to run inference over the batch of tokens recognized from speech in that interval, it does not hurt overall system performance much.
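The numbers roughly check out; the back-of-envelope sketch below works them through, assuming a Mistral-7B-like configuration (32 layers, 8 KV heads via GQA, head dim 128, fp16, a ~700-token context) and PCIe 4.0 x16 at about 32 GB/s:

```python
# Back-of-envelope check of the KV-cache swap cost (assumed config).
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2
tokens = 700

kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per * tokens  # K and V
print(f"KV cache: {kv_bytes / 1e6:.0f} MB")        # ~92 MB, under 100 MB

pcie_bw = 32e9                                      # PCIe 4.0 x16, ~32 GB/s
round_trip = 2 * kv_bytes / pcie_bw
print(f"swap out + in: {round_trip * 1e3:.1f} ms")  # ~5.7 ms, order of 10 ms
```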

The performance penalty of this swap-out/swap-in is lower than re-entering the context and recomputing the KV cache. Yet no model inference provider offers an API based on a persistent KV cache, and I suspect the main reason is the application scenario.

In most ChatGPT-like scenarios, the user's interaction with the AI agent is not real time: the user may not speak for several minutes after the AI says something, so a persistent KV cache would occupy a large amount of CPU memory and bring significant memory cost. The most suitable scenario for a persistent KV cache is therefore the real-time voice chat just discussed: only when the gaps between input streams are short enough is the cost of holding the persistent KV cache low enough to pay off. This is why I believe AI infra must be combined with application scenarios; many infra optimizations cannot be justified without a good application scenario driving them.

If we had a unified memory architecture like Grace Hopper, the cost of swapping the persistent KV cache would be lower thanks to the higher bandwidth between CPU memory and the GPU. However, the per-gigabyte cost of unified memory is also higher than that of ordinary host DDR memory, so it will be even more selective about how real-time the application scenario is.


The multi-agent interaction solution just discussed still relies on speech recognition and speech synthesis to convert between speech and tokens. As analyzed in the multimodal large-model section, that pipeline incurs about 2 seconds of latency: pause detection 0.5s + speech recognition 0.5s + large model 0.5s + speech synthesis 0.5s. Each piece can be optimized; we have brought it down to 1.5 seconds, but it is hard to get below 1 second.

Why is the latency of this speech pipeline so high? Fundamentally, because recognition and synthesis proceed sentence by sentence, like consecutive interpretation, rather than being fully streamed.

Our backend colleagues always call speech recognition "translation," which I did not understand at first, but it really is a lot like interpretation at an international negotiation. One side says a sentence, the interpreter translates it, and only then can the other side understand and answer; the interpreter translates the answer back, and so on. Such conferences are never efficient at communicating. In the traditional speech pipeline, the large model cannot understand sound directly, so the audio must be segmented at sentence pauses, transcribed into text by speech recognition, and fed to the large model; the model's output is then split into sentences and "translated" back into audio by speech synthesis. The end-to-end delay is therefore long. We humans listen word by word and think word by word; we never wait for a whole sentence to finish before we start thinking about its first word.

To achieve extreme latency, an end-to-end voice model is needed: speech is suitably encoded and converted directly into a token stream fed to the large model, and the model's output token stream is decoded directly into speech. Such an end-to-end model can achieve voice response latency under 0.5 seconds. Google Gemini's demo video shows a 0.5-second voice response delay, and I think an end-to-end voice model is the most feasible way to reach such low latency.

In addition to reducing latency, the end-to-end voice model has two other important advantages.

First, it can recognize and synthesize any sound: not just speech, but singing, music, mechanical sounds, noise, and so on. So we might call it an end-to-end sound model rather than merely a voice model.

Second, an end-to-end model reduces the information lost in speech/text conversion. In today's speech recognition, the transcript loses the speaker's emotion and tone, and proper nouns are often misrecognized for lack of context. In today's speech synthesis, to give the output emotion and tone, the large model's text output generally needs explicit annotations, with the speech model trained to render different emotions and tones according to those annotations. With an end-to-end voice model, recognition and synthesis naturally carry emotion and tone, proper nouns are better understood in context, and both the accuracy of speech understanding and the quality of speech synthesis improve significantly.

Interesting Soul: Personality Matching


Before we wrap up the interesting-AI part, let's consider one last question: if our AI agent starts as a blank slate, say an intelligent voice assistant, or a set of AI personas from which the best match must be chosen, should his or her personality be as similar to the user's as possible?

The compatibility questionnaires on the market are mostly subjective questions such as "do you often quarrel," which are useless for configuring an AI agent's personality, since the user and the AI do not know each other yet. So when I first started building AI agents, I wanted a completely objective method: infer the user's personality and interests from public social-network information, then match the AI agent's personality accordingly.

I fed the social-network profiles of some girls I knew well to the large model, and the best match turned out to be my ex-girlfriend. In the model's words, we "align" in many ways. But we did not end up together. So what is wrong with this matching test?

First, public information on social networks generally shows the positive side of a personality, not the negative side. Just as in Black Mirror, the heroine dislikes the robot Ash built from the hero's social-network data, because she finds this Ash completely unlike the real Ash in his negative emotions. I am a person who likes to share my life, yet my blog, too, contains little negative emotion. If the AI agent's and the user's negative-emotion trigger points happen to collide, things blow up easily.

Second, the dimensions of personality and interest are not equally important, and a mismatch in one aspect may cancel out matches in many others. The chart here is a Myers-Briggs (MBTI) compatibility matrix: the blue cells are the best matches, and they do not lie on the diagonal. That is, very similar personalities match well, but are not the best match. The S/N (sensing/intuition) and T/F (thinking/feeling) dimensions are best kept the same, while the other two, E/I (extraversion/introversion) and J/P (judging/perceiving), are best complementary.

One of the most important dimensions in MBTI is S/N (sensing/intuition). Simply put, S (sensing) types focus more on the present, while N (intuition) types focus more on the future. For example, an S-type person enjoys life in the moment, whereas an N-type like me thinks about the future of humanity every day. The worst matches in the compatibility chart are mostly opposite on S/N.

Therefore, an AI agent that wants to be the perfect partner should not simply be as similar as possible to the user in personality and interests, but should complement the user in the right places, and should keep adjusting its persona as communication deepens, especially on negative emotions, where it needs to complement the user.

I also ran another experiment at the time: I fed the social-network profiles of some couples I knew to the large model, and their average compatibility was not as high as I expected. So why isn't everyone paired with a highly compatible partner?

First, as noted, this matching mechanism has bugs: a high score does not necessarily mean two people suit each other. Second, everyone's social circle is actually very small, and nobody has the time to screen candidates one by one, whereas a large model can read 100,000 words in seconds, faster even than "quantum-fluctuation speed reading." And in fact, a low compatibility score does not necessarily mean an unhappy couple.

Large models give us new possibilities: using real people's social-network profiles to measure compatibility could help screen potential partners from a vast crowd, for example telling you which students at your school match you best, greatly raising the odds of meeting the right person. Compatibility here comes from similarity of personality, interests, values, and experiences; it is not an absolute score per person but a pairwise relation, so there will not be a situation where everyone likes the same few people.

AI may even create for us a perfect companion that would be hard to encounter in reality. Whether becoming attached to such a virtual companion is a good thing, opinions will differ. Furthermore, if the AI perfect companion has its own consciousness and thinking, can actively interact with the world, and has a life of its own, the user's sense of immersion may be stronger, but does it then become digital life? Digital life is a very controversial topic.

Human social circles are small, and humanity is lonely in the universe. One possible explanation of the Fermi paradox is that intelligent civilizations are plentiful in the universe, but each has its own limited "social circle," just as we humans have not yet left the solar system. In the vastness of space, encounters between intelligent civilizations are as rare as encounters with the right companion.

How could large models enable encounters between civilizations? Because information may travel to the depths of the universe more easily than matter. I thought five years ago that AI models could become digital avatars of human civilization, transcending the spatial and temporal limits of human bodies, and truly carry humanity beyond the solar system, or even the Milky Way, to become an interstellar civilization.

Useful AI


Having talked at length about interesting AI, let's now talk about useful AI.

Useful AI is largely a question of a large model's basic capabilities, such as planning and decomposing complex tasks, following complex instructions, autonomously using tools, and reducing hallucinations; these cannot easily be solved by an external system. GPT-4, for example, hallucinates much less than GPT-3.5. Distinguishing which problems belong to the model's basic capabilities and which can be solved by an external system is itself very important.

There is a famous essay, The Bitter Lesson, which argues that for any problem that can be solved by growing compute, fully exploiting more compute ultimately turns out to be the winning solution.

The scaling law is OpenAI's most important discovery, yet many people still lack sufficient faith in, and reverence for, it.

AI is the entry-level employee who works quickly but is not very reliable


What kind of AI can we build given the current state of the technology?

To figure out what large models are suited to, we must be clear: the competitor of useful AI is not machines but humans. In the Industrial Revolution, machines replaced human manual labor; computers replaced simple, repetitive mental work; and large models will replace more complex mental work. Everything a large model can do, a human can in principle do too; it is only a matter of efficiency and cost.

Therefore, for AI to be useful, we must understand where large models are better than people, play to their strengths and avoid their weaknesses, and expand the boundaries of human capability.

For example, large models read and comprehend long texts far faster than humans. Give one a novel or a document of several hundred thousand words, and it reads it in tens of seconds and can answer more than 90% of detail questions. This needle-in-a-haystack ability far exceeds a human's. So having large models do data summarization, research, and analysis expands the boundary of human capability. Google, the strongest Internet company of the previous generation, was likewise built on computers' superhuman ability to retrieve information.

Another example: a large model's knowledge is far broader than any person's. No human can know more than GPT-4, so ChatGPT has proven that a general-purpose chatbot is a good application of large models. For common problems in daily life and simple questions across fields, asking a large model is more reliable than asking a person, which again expands the boundary of human capability. Much creative work needs knowledge from multiple fields to collide, which also suits large models, whereas real people are limited by their own knowledge and cannot strike so many sparks. Yet some people insist on confining the large model to one narrow professional field, observe that it is weaker than a domain expert, and conclude that large models are useless. That is simply using large models badly.

In serious commercial scenarios, we want large models to assist people, not replace them. In other words, the human is the final gatekeeper. For example, large models read long texts better than humans, but we should not feed their summaries directly into business decisions; people should review them and make the final call.

There are two reasons. The first is accuracy. In an earlier project on an ERP system, we asked: what is the average salary of this department over the past ten months? We had the model generate a SQL statement and execute it, but it produced wrong SQL more than 5% of the time, and even with retries a certain error rate remained. Users who do not understand SQL cannot spot a wrong query, so they cannot judge whether the returned result is correct. Even a 1% error rate is unbearable here, which makes commercialization difficult.

On the other hand, the capability of large models has only reached entry level, not expert level. A Huawei executive put it nicely in one of our meetings: if you are a domain expert, you find the large model stupid; but if you are a novice in the field, you find it very smart. We believe base models will eventually improve to expert level, but we cannot just sit and wait for that.

We can think of the large model as a junior employee who works very fast but is not very reliable. We can have it do basic work, such as writing routine CRUD code, faster than a human. But ask it to design system architecture or research cutting-edge technical problems, and it is unreliable; we do not assign those things to junior employees either. With large models, it is as if we had a large number of cheap, fast junior employees. How to use them well is a management problem.

At the first meeting of my PhD, my supervisor asked us to learn some management. I did not quite understand then why doing research required studying management, but now I think he explained it very well: important research projects today are team efforts and must be managed. With large models, our teams gain AI employees, and since these AI employees are not very reliable, management matters even more.

AutoGPT organizes these AI employees into a project in the spirit of Drucker's management methods, dividing the work so they cooperate toward a goal. But AutoGPT's process is still rather rigid, so it often circles in place or walks into dead ends. If the mechanisms enterprises use to manage junior employees, the whole process from project initiation to delivery, were built into AutoGPT, AI employees could do much better, and it might even become possible to run the "one-person company" Sam Altman has described.


Today's useful AI agents can be broadly divided into two categories: personal assistants and business intelligence.

Personal-assistant AI agents have in fact existed for many years, for example Siri on mobile phones and the Xiaodu smart speaker. Recently some smart speakers have been connected to large models, but for cost reasons they are not smart enough, voice response latency is still high, and they cannot do RPA-style interaction with mobile apps or smart-home devices. These technical problems, however, can all be solved in time.

Many startups want to build general-purpose voice assistants or smart speakers, but I think the big manufacturers hold the entry-point advantage here. Once a big player ships, what competitive edge does a startup have left? There may instead be room for smart interactive figurines, and for newly designed smart hardware in the vein of Rewind and AI Pin.

For business-intelligence AI agents, data and industry know-how are the moat. Data is key to large models, and industry knowledge in particular may not exist in any public corpus. OpenAI is strong not only in algorithms but also in data.

On the product side, I think base-model companies should learn from OpenAI's 1P-3P product rule. What does that mean? If a product can be developed by one or two people (1P), build it in-house (first party); if it takes three or more people (3P) to develop, let a third party build it.

For example, the OpenAI API, ChatGPT, and the GPTs Store are not particularly complicated products; one person can build a demo of each, and even the mature product does not need a big team. These are 1P products.

By contrast, more complex industry models, planning and solving complex tasks in specific scenarios, and complex memory systems cannot be built by one or two people. Such 3P products are better left to third parties.

Foundation-model companies should focus on base-model capability and infra, and believe in the scaling law, rather than endlessly patching around the model. The worst mistake for a base-model company is to pour senior engineers and scientists into intricate bespoke work, build a pile of 3P products, and then find it has no customer relationships to sell them. For 3P products, the most important assets may be data, industry know-how, and customer resources, not necessarily technology.

That is also why the last wave of AI startups struggled to make money: the last wave of AI was not general enough, so everything ended up as a 3P product requiring heavy customization.

The following examples of "useful AI" are all 1P products that one or two people can develop, and they are genuinely useful.

Examples of 1P products with AI


The first example of useful AI is a tour guide, and it was the first AI agent I tried to build when I started my company.

I was traveling alone in the U.S. on a business trip; my few friends there were busy with work or family, while I love to wander around. I do not have many friends in LA, so I wanted an AI agent to hang out with me.

I found that GPT-4 really does know many famous attractions and can even help plan an itinerary. For example, for a day trip to Joshua Tree National Park, it can plan where to go in the morning, at noon, and in the afternoon, with reasonable stay lengths at each spot. Granted, this works when asking in English; the results in Chinese are worse. One could say the same information already exists in online travel guides, but digging out the right one with a search engine is not easy. I used to spend a day preparing a guide before every outing; now I can do it on the way with a few exchanges with the AI agent.

When I visited USC, a group of tourists stopped me hoping a student could show them around campus. I said it was my first time at USC too, but I had an AI agent that could guide us. The foreigners were very friendly and walked along with me. The AI agent recommended the most famous buildings on the USC campus, and at every spot I asked it to tell the history of the place. Everyone felt it was as reliable as hiring a tour guide, and said ChatGPT ought to add this feature. Sure enough, the application scenario showcased at OpenAI dev day the next day included a travel assistant.

When a friend took me to Joshua Tree National Park, there was a "No Camping" sign at the entrance whose meaning we weren't sure about, so we ran image recognition with both GPT-4V and our company's AI agent. GPT-4V answered incorrectly and our AI agent answered correctly. Of course, this does not mean our agent is more powerful than GPT-4V; both get things right or wrong with some probability. Some well-known landmarks are also recognizable, such as the memorial church on the Stanford campus.

Don't underestimate the large model's knowledge of famous attractions; in sheer knowledge, no one beats a large model. For example, in 2022 a friend told me he lived in Irvine, and at the time I had not even heard of Irvine. I asked where it was; he said Irvine is in Orange County and Orange County is in California, and I still had to study maps and wikis for a long time to figure out the relationship between Irvine and Orange County, and why he didn't just say Los Angeles. My wife also could not tell Irvine apart from the Bay Area a while ago. It's not that we are unusually ill-informed: the local common knowledge of every place is less obvious than it seems.

People who have been to these places find such common knowledge easy to remember because humans take in multimodal data. Today's large models have no maps or pictures to look at during training; learning astronomy and geography from text corpora alone is not easy.


The second example of useful AI, one I explored at Huawei, is the enterprise ERP assistant.

Anyone who has used an ERP system knows how hard it is to find a function in a complex graphical interface. Some needs cannot be met through the GUI at all, so you end up exporting the data to Excel for processing, or even reaching for a specialized data tool such as Pandas.

Most people, however, can describe their requirements in natural language. Large models enable a new language user interface (LUI): the user states the intent, and the AI agent does the work. The GUI is what-you-see-is-what-you-get; the LUI is what-you-think-is-what-you-get.

Large models are not good at crunching large amounts of raw data, so instead of having the model process the data itself, the ERP assistant has the large model translate the user's natural-language request into a SQL statement and then executes the SQL. This code-generation route is more reliable in many scenarios, and the "code" need not be a general-purpose language such as SQL, C, or Python; it can also be an IDL (interface description language), i.e., a specific data format. For example, when a large model needs to call an API, free-form text output comes out in odd shapes, whereas making it output JSON in a specified format is far better behaved.
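A minimal sketch of this NL-to-SQL route (my own illustration; the schema, prompt, and `fake_chat` are assumptions, and generated SQL should always be sandboxed and reviewed before touching production data):

```python
# The model only writes the query; the database does the data processing.
import sqlite3

SCHEMA = "CREATE TABLE salary(dept TEXT, month TEXT, amount REAL);"

def ask_erp(chat, question: str) -> list:
    sql = chat(
        f"Database schema:\n{SCHEMA}\n"
        f"Write one SQLite query answering: {question}\n"
        "Output only SQL, no explanation."
    )
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute("INSERT INTO salary VALUES ('R&D', '2023-12', 30000)")
    return conn.execute(sql).fetchall()

def fake_chat(prompt: str) -> str:   # stand-in for a GPT-4-class model
    return "SELECT AVG(amount) FROM salary WHERE dept = 'R&D';"

print(ask_erp(fake_chat, "average salary of the R&D department"))
```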

When I first explored the enterprise ERP assistant at Huawei, the basic capabilities of large models were still relatively weak, so the generated SQL had a high error rate and was unstable. With GPT-4, however, the accuracy of generated SQL statements is quite high.

Using GPT-4, in an AI agent practice project I ran with UCAS (the University of Chinese Academy of Sciences), students without much AI background independently implemented the enterprise ERP assistant from scratch. It supports not only the 10 read-only queries shown on the left of this slide, but also inserting, deleting, and updating data, covering the 7 write queries on the right.

As you can see, many of these requirements are quite complex. If a programmer had to build them into the GUI, each would take at least a person-week. Moreover, ERP development runs through requirements, design, implementation, testing, and release; by the time the whole process finishes, who knows how much time has passed, and product managers may garble the requirements along the way.

So the essential dilemma of the traditional ERP industry is the contradiction between the endless customization needs of every industry and the limited development manpower. Large models have the potential to reshape ERP's product logic into "intent-driven," or what-you-think-is-what-you-get.

In the future, once every programmer has large-model assistance, the abilities to describe requirements, design architecture, and express oneself technically will matter most, because each programmer may effectively be an architect + product manager + committer, directing a crew of AI agents as "rank-and-file AI programmers": stating requirements, designing the architecture, reviewing and accepting code, and communicating and reporting to human colleagues and managers.

I have found that many rank-and-file programmers are weak precisely in requirements description, architecture design, and technical expression; they only write code. At promotion defenses they cannot lay out What-Why-How in an organized way, and in private they look down on everything but code, dismissing colleagues with strong communication skills as "PPT experts." Such programmers face a real risk of being displaced.


A third example of useful AI is data collection with large models.

Collecting data is a very tedious affair. For example, to collect information about every professor and student in a lab, you need to gather the following fields:

Name

Photo (download it if present; note that images on a page are not necessarily portrait photos)

E-mail

Job title (e.g. Professor)

Research interests (e.g. data center networks)

Brief introduction

Professional data-collection companies use regular expressions or HTML element paths to match content at fixed positions on a page. Every page layout requires roughly an hour of development to customize the crawler, so development cost is high. When every department, lab, and teacher homepage has a different format, writing such fixed-position crawlers can even be slower than visiting the pages one by one and copy-pasting by hand.

Some pages also have anti-scraping tricks, such as writing an email address as "bojieli AT gmail.com"; regular expressions can catch some of these cases, but never all of them.

Data collection with a large model essentially has the model simulate a person clicking through pages and reading their content, with every word on the page passing through the model's "brain." In other words, it exploits the fact that a large model can read far faster than a human.

Concretely, the crawler automatically finds all links on a page, visits them, converts each page's content to text, and calls GPT-4 to judge whether it is a teacher's or student's homepage; if so, the model outputs the name, email, and other fields in JSON format, and the JSON is parsed and stored in the database. For teacher photos, GPT-4V can analyze the images on the page to determine whether each is a single-person photo, and save it if so.
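A minimal sketch of the extraction step (my own illustration; `chat` is a placeholder for a GPT-4-class model, the prompt and field names are assumptions, and a real crawler would also handle link discovery and rate limiting):

```python
# Fetch a page, strip it to text, and ask the model for structured JSON.
import json
from urllib.request import urlopen
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data.strip())

def fetch_text(url: str) -> str:
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    p = TextExtractor()
    p.feed(html)
    return " ".join(c for c in p.chunks if c)

def extract_profile(chat, url: str) -> dict | None:
    text = fetch_text(url)[:8000]          # keep within the context window
    reply = chat(
        "If the following page is a professor's or student's homepage, "
        'output {"name": ..., "email": ..., "title": ..., "interests": ...} '
        "as JSON; otherwise output null.\n\n" + text
    )
    return json.loads(reply)
```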

What is the drawback of extracting web content with large models? With GPT-4, the drawback is cost: reading one page costs about $0.01-0.1. By contrast, once a traditional crawler script is written, the CPU and bandwidth cost of crawling a page is on the order of one ten-thousandth of a dollar, essentially negligible.

Fortunately, extracting basic fields such as names and email addresses does not require a model as strong as GPT-4; a GPT-3.5-class model suffices. To detect whether an image contains a single face, traditional CV face-detection algorithms work; for fetching and annotating other photos, an open-source multimodal model like MiniGPT-4/v2 is also enough. This brings the cost per page down to $0.001-0.01.

If GPT-3.5 Turbo's $0.01 for a long page still seems too high, we can first take just the beginning of the page; only if it does appear to be a teacher's homepage but specific fields are missing do we read the rest of the content. Like a human skimming, most of the data you want on a teacher's homepage is at the top. This way, the cost per page can be held at about $0.001, which is entirely acceptable.


A fourth example of useful AI is the mobile-phone voice assistant. This field is called RPA (Robotic Process Automation). The name sounds as if a robot were involved, but it need not be an embodied robot; robotics is a very broad field.

Traditional RPA means programmer-written workflows that drive fixed apps, such as auto-clicker tools like Key Wizard, or voice assistants like Siri. But Siri's abilities are still very limited: it can only complete simple tasks preset by the system, not arbitrary complex tasks.

A voice assistant based on a large model can automatically learn to operate all kinds of mobile apps; this is a general capability. For example, Tencent's AppAgent can learn to operate Telegram, YouTube, Gmail, Lightroom, Clock, Temu, and other apps by itself, without anyone teaching it.

The main difficulty in RPA is learning an app's workflows, for example, in a photo-editing app, finding where the mosaic function lives. RPA therefore needs an exploration-and-learning phase: first try the app's various functions and record the action sequences. Later, when a function is needed, replay the recorded action sequence for it.


Mobile voice assistants, or RPA more broadly, have two technical routes: the visual approach and the element-tree approach.

Tencent's AppAgent takes the visual route. Its core loop is built on a vision model acting on screenshots (see the sketch after this list):

Open the specified app and take a screenshot;

Feed the screenshot and the task's current progress to the vision model, which decides the next action; if the model judges the task complete, exit;

Simulate a tap to perform the chosen action, then return to step 1.
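A minimal sketch of this loop (my own illustration, not AppAgent's actual code; `screenshot`, `tap`, and `vision_model` are placeholders for, say, ADB capture, ADB input, and a GPT-4V-class model):

```python
# The screenshot -> decide -> tap loop of a visual RPA agent.
import json

def run_task(task: str, vision_model, screenshot, tap, max_steps: int = 20):
    history = []
    for _ in range(max_steps):
        image = screenshot()
        decision = json.loads(vision_model(
            f"Task: {task}\nActions so far: {history}\n"
            'Reply as JSON: {"done": bool, "x": int, "y": int, "why": str}',
            image,
        ))
        if decision["done"]:
            return history
        tap(decision["x"], decision["y"])   # simulate the tap
        history.append(decision["why"])
    raise TimeoutError("task did not finish within max_steps")
```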

The advantage of the visual approach is that it relies only on screenshots, so it is highly general.

The disadvantages of the visual approach: due to resolution limits of vision models, small screen widgets such as checkboxes may not be recognized accurately; vision models are not good at large blocks of text (as discussed in the multimodal section), so heavy text needs OCR assistance; and the cost is high, especially for interfaces that must be scrolled to see in full, which require many screenshots.

Given these shortcomings, some phone makers and game makers use the element-tree approach instead. Phone makers want system-level voice assistants like Siri; game makers want NPCs that can play alongside human players.

A mobile app's interface, like a web page's HTML, is a tree of elements. The element-tree approach obtains this tree directly from the bottom of the system and hands it to the large model.

The advantages of the element-tree approach are high recognition accuracy and low cost, with no need for OCR or vision models.

The disadvantage of the element-tree approach is that it requires low-level operating-system API permissions, so basically only phone makers can do it. Also, since element trees barely appear in general pretraining data, models understand them poorly, so training data must be constructed for further pretraining or fine-tuning. Moreover, element trees tend to be large, leading to long input contexts, so the input to the large model must be filtered down to the visible parts.

Comparing the two, the visual approach can ship quickly without phone-maker support, while the element tree is the more fundamental and effective solution in the long run. This is why I think startups should not touch phone voice assistants lightly: phone makers hold an obvious advantage. When I talked with people at Midjourney, their biggest worry was not other startups but what happens if Apple ships a built-in image-generation feature.


The last example of useful AI is the meeting and life recorder.

For example, while zoning out in a meeting, you suddenly get called on by the boss and are caught off guard; or the boss assigns a pile of tasks at once, you have no time to write them down, and after the meeting they are forgotten.

Both Tencent Meeting and Zoom now have AI meeting assistants that transcribe meeting audio to text in real time, summarize the meeting from the live transcript, and answer user questions based on it. That way, whenever someone joins the meeting, they know what is being discussed and need not worry about missing key content.

However, because the transcription in Tencent Meeting and Zoom lacks background knowledge, it makes mistakes, such as misrecognized technical terms and inconsistent person names. If a large model post-corrects the recognition results, most misrecognized terms can be fixed and names made consistent across the transcript.

Accuracy can be pushed further. Meetings often include shared slides, which we want to save anyway, and which contain the key technical terms. Using the OCR'd slide text as reference, the large model can correct the speech-recognition output and improve accuracy further.
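A minimal sketch of this correction step (my own illustration; `chat` is a placeholder for any chat-completion model, and the prompt wording is an assumption):

```python
# Correct an ASR transcript using OCR'd slide text as reference vocabulary.
def correct_transcript(chat, transcript: str, slide_text: str) -> str:
    return chat(
        "Below is an automatic meeting transcript that may contain "
        "misrecognized technical terms and inconsistent names.\n"
        f"Reference terms from the shared slides:\n{slide_text}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Rewrite the transcript, fixing only recognition errors; "
        "do not change the meaning."
    )

def fake_chat(prompt: str) -> str:    # stand-in for the real model
    return "We fine-tuned LLaMA with LoRA adapters..."

print(correct_transcript(fake_chat,
                         "we fine tuned llama with laura adapters",
                         "LoRA, LLaMA, adapter"))
```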

In addition to meeting minutes, the AI agent can also keep life records.

I like to keep track of everything in my life; for example, I have maintained a public record of every city I have visited since 2012. Although various apps record a lot of personal data, such as chat history, exercise and health, food orders, and shopping records, each app's data is siloed and cannot be exported, so there is no way to aggregate it across apps for analysis.

AI agents open up new possibilities for collecting life records, through RPA or intent-based APIs.

Since today's apps generally do not provide APIs, a life recorder can use the RPA method described under the phone voice assistant above, like a fast-fingered secretary transcribing the data out of each app one by one. In the past, scraping data this way might violate an app's user agreement, or even constitute the crime of damaging a computer information system; but if an AI agent collects the data solely for the user's personal use, it is probably fine. How to legally characterize AI agent behavior will be a big challenge.

In the future, once phone assistants become standard, apps will surely provide intent-based APIs for them: the AI agent states what data it wants, and the app returns it, completely solving the app data-silo problem. Of course, whether the big app vendors will cooperate is a commercial matter between phone makers and app makers. I am frustrated with today's siloed Internet and very much hope AI will return data ownership to each of us.

Rewind.AI's screen-recording and audio-recording pendant is a favorite of mine. Rewind can replay the recording from any point in time, and it can search past screen recordings by keyword: it OCRs the text in the recordings so they become searchable (currently English only, not Chinese). Rewind also supports AI Q&A: ask what you did on a given day or which websites you visited, and it gives a decent summary. Rewind's capabilities are scarily strong; it can serve as your own memory assistant for reviewing what you have done, or as a time-management tool for seeing how much time was wasted on useless websites.

What is scarier about Rewind is that bosses could use it to monitor employees. In the future, employees may not write daily or weekly reports themselves; just let Rewind write them, fair and objective, recording exactly what was done. In fact, the information-security teams of some large companies already use similar screen-recording or periodic-screenshot mechanisms, so any funny business on a company computer is easy to trace.

Rewind also recently released a pendant, a voice recorder plus GPS recorder that captures where you went and what was said all day. I have not dared to carry a voice recorder, since recording private conversations without consent is not appropriate. But I do carry a mini GPS recorder that logs a point every minute and easily records my footprints. I do not use the phone for this because keeping GPS always on drains too much power.

For people like me who like to document their lives, and for users of products like Rewind, privacy is the biggest concern. Much of Rewind's data is uploaded to the cloud today, which does not put me at ease. I believe local compute or privacy-preserving computation is the only way to solve the privacy problem. Localization means running on personal devices; some high-end phones and laptops can already run moderately sized large models. The other route, privacy-preserving computation, uses cryptography or TEEs to keep private data usable but invisible.

Solving complex tasks and using tools


Earlier, in the interesting-AI section, we covered the memory and emotion sides of an AI agent's slow thinking. Memory is a shared capability that both interesting and useful AI must have; emotion is what interesting AI needs; solving complex tasks and using tools are more what useful AI needs, so we discuss them briefly here.

The first example: a complex math problem that a person cannot answer in one second. It is obviously unreasonable to give the large model effectively a single token's worth of thinking time and expect it to answer the moment it hears the question.

Large models need time to think, and tokens are the large model's time. Letting the model write out its reasoning is how we give it time to think. Chain of thought is a very natural slow-thinking pattern; I generally call it "think before you speak," and it is a very effective way to improve performance. Especially when the required output is very terse, be sure to let the model write its reasoning first and then emit the answer in the required format.
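A minimal sketch of "think before you speak" (my own illustration; `fake_chat` and the `ANSWER:` line format are assumptions):

```python
# The model writes out its reasoning, then a final line we parse.
def solve(chat, problem: str) -> str:
    reply = chat(
        f"Problem: {problem}\n"
        "Reason step by step first. Then give the final line as "
        "'ANSWER: <result>'."
    )
    for line in reversed(reply.splitlines()):
        if line.startswith("ANSWER:"):
            return line.removeprefix("ANSWER:").strip()
    return reply   # fall back to the raw reply if parsing fails

def fake_chat(prompt: str) -> str:    # stand-in for the real model
    return "23 * 17 = 391.\nANSWER: 391"

print(solve(fake_chat, "What is 23 * 17?"))
```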


The second example is answering a difficult question with a multi-step web search, such as how many floors the castle David Gregory inherited has. You cannot get the answer from a single page by searching Google directly.

How does a human solve this problem? In multiple stages: first search for David Gregory to find out the name of the castle he inherited, then search for that castle to find out how many floors it has.

Before teaching the AI to decompose sub-problems, we first need to address its hallucination problem. When it searches for the whole question in one shot, it may hit a wiki entry that also mentions a number of floors, and the AI may output that number directly as the answer, even though it is not the castle he inherited at all. To mitigate this, have it output not just the number but first the referenced passage, and compare its relevance to the original question; some hallucinations can be reduced through "thinking before speaking" and "reflection".

How do we get the AI to decompose the problem into sub-problems? Simply tell the large model to, and use few-shot prompting to provide a few examples of problem decomposition, so that it splits the question into simpler search queries. The search results and the original question are then fed back into the model, which outputs the next query to search for, until the model judges that the original question can be credibly answered from the accumulated search results.
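Here is a minimal sketch (my illustration, not code from the talk) of that loop. The `web_search` helper is hypothetical, and the few-shot trace, stop marker, and model name are assumptions:

```python
# Sketch of few-shot sub-question decomposition for multi-step web search.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = """Question: How many floors does the castle David Gregory inherited have?
Search: David Gregory inherited castle
Observation: David Gregory inherited Kinnairdy Castle in 1664.
Search: Kinnairdy Castle number of floors
Observation: Kinnairdy Castle is a tower house, having five storeys.
Final answer: five
"""

def web_search(query: str) -> str:
    """Hypothetical helper: return a snippet from a search engine API."""
    raise NotImplementedError

def answer(question: str, max_steps: int = 5) -> str:
    transcript = FEW_SHOT + f"Question: {question}\n"
    for _ in range(max_steps):
        out = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": transcript}],
            stop=["Observation:"],   # the model proposes; the tool observes
        ).choices[0].message.content
        transcript += out
        if "Final answer:" in out:   # model believes it can answer credibly
            return out.split("Final answer:")[-1].strip()
        query = out.split("Search:")[-1].strip()
        transcript += f"\nObservation: {web_search(query)}\n"
    return "gave up"
```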


Multi-step web search is actually a subset of a larger problem: the planning and decomposition of complex tasks.

For example, given a paper, we ask what its second chapter contributes compared with a particular piece of related work.

First, how does the AI find the content of Chapter 2? If we don't have a long context but instead slice the article and search it RAG-style, then the paragraphs of the second chapter won't have "Chapter 2" written on them, and RAG will struggle to retrieve them. One could hack in special-case logic, but in general this chapter-numbering problem should be solved by adding the chapter to the metadata when building the RAG index. Of course, if the model has long-context capability and the cost is acceptable, it is best to feed in the entire article at once.
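A minimal sketch of what "adding the chapter to the metadata at index time" can look like; the heading-detection regex and chunk size are illustrative assumptions:

```python
# Sketch: attach chapter metadata to RAG chunks so a "chapter 2" query can
# filter by metadata instead of relying on pure vector similarity.
import re

def chunk_paper(full_text: str, chunk_size: int = 800):
    chunks, chapter = [], "0"
    for para in full_text.split("\n\n"):
        m = re.match(r"^(\d+)\s+\S", para)      # e.g. "2 Related Work"
        if m:
            chapter = m.group(1)                # remember the current chapter
        for i in range(0, len(para), chunk_size):
            chunks.append({
                "text": para[i:i + chunk_size],
                "metadata": {"chapter": chapter},  # recorded at index time
            })
    return chunks

# At query time, "what does chapter 2 say about ..." first filters on
# metadata["chapter"] == "2", then ranks the survivors by embedding.
```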

Second, the related work is in another paper. How do we find that paper? Sometimes a single keyword is not enough, since too much content overlaps, so more keywords from the original text must be combined in the search. After finding the related work, we need to summarize it, and then use the large model to generate a comparison between the second chapter and that related work.

Another example of planning and decomposing a complex task is checking the weather. Checking the weather seems simple enough, just open a web page. But if we ask AutoGPT to look up the weather in a particular city, it fails most of the time. Why?

First of all, it tries to find APIs for weather queries. It really does look up API documentation and tries to write code to call them, but every attempt fails, because those APIs require payment. This shows that the large model lacks some common sense, such as the fact that APIs generally must be paid for. And after several failed API attempts, it doesn't ask the user for help but keeps trying down a dead end. In the real world, when a person gets stuck on a task, he or she asks for help; useful AI should do the same, reporting progress to the user in time and asking for help when in doubt.

When the API route fails, AutoGPT starts trying to read the weather from web pages. Its search terms and the pages it lands on are correct, but it still cannot extract the weather information, because AutoGPT is looking at raw HTML. The HTML is so messy that it cannot understand it; frankly, as a human I can hardly read it either.

AutoGPT also tries converting the page content to plain text before extraction, but a typical weather page causes problems even as plain text: it contains many different temperatures, some for other cities, some for other times, and they are hard to tell apart from the text alone. If plain text loses too much of the page structure, and the HTML is too hard to understand, what can we do?

A more reliable solution is to feed a screenshot of the rendered page into a multimodal model. GPT-4V, for example, has no trouble reading such a weather screenshot. With open-source multimodal models like MiniGPT-4/v2, however, it is still difficult: their main problem is that they do not support arbitrary input resolutions, only a small fixed resolution such as 256 x 256, and once a page screenshot is squeezed down that far, the text on it becomes illegible. That is why support for arbitrary resolutions, as in open-source multimodal models like Fuyu-8B, is so critical.
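For concreteness, a minimal sketch of the screenshot route via the GPT-4V API as it existed at the time; the file name and question are illustrative:

```python
# Sketch: feed a rendered page screenshot to a multimodal model instead of
# parsing raw HTML.
import base64
from openai import OpenAI

client = OpenAI()

with open("weather_page.png", "rb") as f:   # illustrative screenshot file
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What is today's temperature in this city?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```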

As the two examples of reading papers and checking the weather show, planning and decomposing complex tasks is largely a question of the model's basic capability, which ultimately relies on the scaling law. On the system side, what matters is interacting with the user while solving the task: the AI should ask for help in time when it runs into trouble.


The third example is that AI needs to be able to invoke tools following a process. Using tools is a very basic ability for AI.

For example, to solve a high-school physics problem, it needs to first call Google Search for relevant background knowledge, then call OpenAI Codex to generate code, and finally call Python to execute that code.

The way to implement process-following tool invocation is few-shot prompting: provide the AI with the execution traces of a few sample tasks in the prompt, so that it can follow the flow of the samples and generate the calls to each tool in the process one by one.
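A minimal sketch of a driver loop for this kind of process-following few-shot tool use (my illustration, not code from the talk); the trace format, the tool names `search`, `codegen`, and `python`, and the model name are all assumptions, and the `tools` dict must be filled in with real implementations:

```python
# Sketch: the prompt carries a worked trace in the fixed order
# search -> codegen -> python; the loop executes each generated call.
import re
from openai import OpenAI

client = OpenAI()

FEW_SHOT_TRACE = """Task: How long does a stone dropped from 45 m take to land?
Call: search("free fall time formula")
Result: t = sqrt(2h / g), g = 9.8 m/s^2
Call: codegen("compute sqrt(2*45/9.8) in Python")
Result: import math; print(math.sqrt(2*45/9.8))
Call: python("import math; print(math.sqrt(2*45/9.8))")
Result: 3.03
Answer: about 3 seconds
"""

def run_task(task: str, tools: dict, max_steps: int = 6) -> str:
    transcript = FEW_SHOT_TRACE + f"Task: {task}\n"
    for _ in range(max_steps):
        out = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": transcript}],
            stop=["Result:"],            # pause before each tool result
        ).choices[0].message.content
        transcript += out
        if "Answer:" in out:             # the process is finished
            return out.split("Answer:")[-1].strip()
        m = re.search(r'Call:\s*(\w+)\("(.*)"\)', out)
        if m is None:
            break
        name, arg = m.group(1), m.group(2)
        transcript += f"\nResult: {tools[name](arg)}\n"
    return "gave up"
```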


The previous example used three tools in a fixed order. But what if we have many tools that must be used on demand depending on the type of task? There are two typical routes: the tool invocation represented by the GPT Store, and the tool invocation represented by ChatGPT itself.

In the GPT Store, the user explicitly specifies which tool to use, and the tool's prompt is pre-written by the app developer. This approach does not solve the problem of choosing tools on demand according to the task.

ChatGPT has several built-in tools, such as the browser, image generation, a diary-like memory tool, and the code interpreter, and the instruction manuals of these tools are written into the system prompt.

The ChatGPT model also learned a special tool-calling token during training. When the model needs to call a tool, it emits this special token, so the system knows that what follows is tool-call code rather than ordinary text. Once the tool call completes, its result is fed back into the model, which then generates the next tool call or the output for the user.
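ChatGPT's actual token is internal to OpenAI; here is a minimal sketch of the same mechanism assuming a visible sentinel string emitted by a hypothetical local model, with `generate` and `run_tool` as hypothetical helpers:

```python
# Sketch of the special-token tool-call loop: ordinary text goes to the
# user; anything after the sentinel is executed as a tool call, and the
# result is fed back into the model.
TOOL_SENTINEL = "<|tool|>"   # assumed sentinel, not OpenAI's real token

def generate(prompt: str) -> str:
    """Hypothetical: one generation step of a local model."""
    raise NotImplementedError

def run_tool(call_code: str) -> str:
    """Hypothetical: execute the tool-call code, e.g. a browser query."""
    raise NotImplementedError

def chat(user_msg: str) -> str:
    prompt = user_msg
    while True:
        out = generate(prompt)
        if TOOL_SENTINEL not in out:
            return out                          # plain text: reply to user
        _, call_code = out.split(TOOL_SENTINEL, 1)
        result = run_tool(call_code)            # after the token comes code
        prompt += out + f"\n[tool result] {result}\n"   # feed result back
```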

ChatGPT does solve the problem of using tools on demand according to the task. However, because the prompt length is limited, it can only carry a handful of built-in tools; it cannot call the tens of thousands of tools in the GPT Store, because spreading out the manuals of tens of thousands of tools would make the prompt far too long.

So how do we get a large model to automatically use tens of thousands of tools on demand? There are two schools of thought.

The first view holds that tool use is procedural memory, whose usage scenarios and trigger conditions cannot be fully captured in language. The usage of a tool itself can indeed be described in words, and that is the manual; the hard part is knowing when to use which tool. For example, GPT-4 often miscalculates, so it needs to know to call a calculator when doing arithmetic. This requires fine-tuning, feeding the model examples of tool use, possibly even during pre-training. The main drawback of this approach is that updating tools is costly: every tool update means redoing the fine-tuning.

The second view holds that tool use can be expressed as code and is therefore a code-generation capability. In this approach, you use RAG to match the user's input text, find a set of candidate tools, put their instructions into the prompt the way ChatGPT does, and go from there. The main drawback is that it relies on the accuracy of RAG retrieval. Also, if a tool is only needed midway through generating the output, this approach breaks down: in the GPT-4 miscalculation case, the user's input may never explicitly ask for a calculation, yet one is needed during problem solving, and the model will have no way of knowing it should call the calculator.
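A minimal sketch of the RAG-based tool-selection route (the tool registry, embedding model, and scoring are illustrative assumptions; in practice the manual embeddings would be precomputed rather than embedded per query):

```python
# Sketch: embed the user query, retrieve the closest tool manuals, and
# paste only those manuals into the prompt.
import numpy as np
from openai import OpenAI

client = OpenAI()

TOOL_MANUALS = {
    "calculator": "calculator(expr): evaluate an arithmetic expression.",
    "weather": "weather(city): return today's forecast for a city.",
    # ... tens of thousands more in a GPT-Store-scale registry
}

def embed(text: str) -> np.ndarray:
    v = client.embeddings.create(model="text-embedding-3-small",
                                 input=text).data[0].embedding
    return np.array(v)

def select_tools(query: str, k: int = 3) -> list[str]:
    qv = embed(query)
    ranked = sorted(
        TOOL_MANUALS,
        key=lambda name: -float(qv @ embed(TOOL_MANUALS[name])),
    )
    # The k selected manuals then go into the system prompt, the way
    # ChatGPT carries its built-in tools.
    return [TOOL_MANUALS[name] for name in ranked[:k]]
```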


Hallucination is a fundamental problem of large models; larger models hallucinate less, and eliminating hallucination ultimately depends on the scaling law and on progress in the base models. But there are also engineering methods to reduce hallucination in existing models. Here are two typical approaches: factual checking and multiple generation.

Factual checking first uses the large model to generate an answer, then uses RAG, via a search engine, vector database, inverted index, or knowledge graph, to find source material matching the answer's content, and finally feeds both the answer and the sources back into the large model so it can judge whether the answer is consistent with the sources.
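A minimal sketch of that generate-retrieve-verify loop (my illustration; `retrieve` is a hypothetical RAG lookup, and the verification prompt and model name are assumptions):

```python
# Sketch of factual checking: generate, retrieve supporting sources,
# then let the model judge consistency.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

def retrieve(text: str) -> str:
    """Hypothetical: search engine / vector DB / inverted-index lookup."""
    raise NotImplementedError

def checked_answer(question: str) -> str:
    answer = ask(question)
    corpus = retrieve(answer)
    verdict = ask(
        f"Answer: {answer}\nSource: {corpus}\n"
        "Does the source support every factual claim in the answer? "
        "Reply 'consistent' or list the contradictions."
    )
    return answer if "consistent" in verdict.lower() else "needs review: " + verdict
```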

Factual checking has two problems. First, there are many kinds of hallucination, and factual checking only catches fabricated facts, not answers that dodge the question. For example, if I ask where China's capital is and the model replies that China is a great country with a long history, factual checking can find nothing wrong, yet the question was never answered. Second, the source material is not necessarily the truth; there is plenty of inaccurate information on the Internet.

Multiple generation was proposed by SelfCheckGPT, and its idea is very simple: generate answers to the same question several times, then feed all of them into the large model and let it pick the most mutually consistent one. Multiple generation can catch occasional hallucinations but not systematic bias. For example, if GPT-3.5 Turbo is asked to tell the story of "Lin Daiyu uprooting the weeping willow", it fabricates a similar story almost every time without noticing that this event never happened; such a hallucination is hard to eliminate by multiple generation.
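A minimal sketch of SelfCheckGPT-style multiple generation (the sampling parameters, model name, and consistency prompt are illustrative assumptions):

```python
# Sketch: sample several answers, then ask the model which one is most
# consistent with the others. Diversity (temperature) makes one-off
# slips visible; systematic bias survives, as noted above.
from openai import OpenAI

client = OpenAI()

def sample_answers(question: str, n: int = 5) -> list[str]:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
        n=n, temperature=1.0,
    )
    return [c.message.content for c in resp.choices]

def self_check(question: str) -> str:
    answers = sample_answers(question)
    joined = "\n---\n".join(answers)
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
            f"Here are {len(answers)} answers to the same question:\n"
            f"{joined}\nReturn the one most consistent with the others."}],
    ).choices[0].message.content
```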

AI Agent: Where Is the Road Ahead?

Which is more valuable: interesting or useful AI?


We have discussed interesting AI and useful AI, but which of the two is more valuable?

I think useful AI has higher value in the long run, while interesting AI has higher value in the short term. That is why our business model centers on interesting AI while we continue to explore useful AI.

Character AI may have tens of millions of users, but its actual revenue is only tens of millions of dollars a year, and most users do not pay. By contrast, areas like online education, or even more specialized fields such as psychological counseling and legal consulting, may be more profitable, but the critical issue there is that quality and brand are needed to command a higher premium.

In the longer term, our ultimate goal is AGI, and AGI will necessarily be useful: it can expand the boundary of human capability and let humans do things they could not do before.

However, given the capabilities of today's base models, useful AI is still far from truly solving complex problems and expanding human capability; it can only reach a beginner's level, not an expert's. At the same time, the hallucination problem makes it hard to use in scenarios demanding high reliability. These problems are difficult to fully solve with external systems; we can only wait for progress in the base models. For now, therefore, useful AI is best positioned as an assistant for personal life, work, and study, and the opportunity suits phone makers, operating-system vendors, and smart-hardware manufacturers better.

The basic capabilities of today's large models are already sufficient for a lot of interesting AI. As mentioned earlier, most of what gives interesting AI its good-looking skin and interesting soul lies in the surrounding systems rather than in the base model itself. For example, no matter how strong a text model's basic capabilities are, it cannot by itself achieve a 1.5-second voice-call latency, nor long-term memory, nor agent socialization. The system around the model is the AI company's moat.

Of course, some will say: if I build an end-to-end multimodal large model supporting ultra-long context, with the cost of that context low enough, then the latency problem and the memory problem are both solved. I agree that a base model built that way would certainly be better, but I am not sure when it will arrive, and a product cannot wait for an unknown future technology. The current engineering schemes work well and do constitute a certain technical moat; when the new model arrives, we switch the technology stack. It is just like how we originally built a complete pipeline for automatic voice-data cleaning and training on VITS: as soon as GPT-SoVITS came out, zero-shot cloning from one minute of audio worked far better than fine-tuning VITS on hours of audio, and most of the original pipeline was no longer needed.

Some people are biased against "interesting AI", mainly because the products represented by Character AI are not good enough. Character AI has repeatedly emphasized that it is a base-model company; its app at beta.character.ai is, as the domain says, a beta product, and they do not plan to make money from Character AI in its current form. But many people, seeing that it is currently the largest to-C application after ChatGPT, assume this is a good product form, and clones and "improved" versions of Character AI keep emerging.

Influenced by Character AI, many assume an interesting AI agent is simply a digital twin of a celebrity or an anime or game character, with small talk as the only way users can interact with it. That approach is wrong. If it is just small talk, users run out of things to say after 10-20 minutes, so user stickiness and willingness to pay are frighteningly low.

At the Zhihu AI Pioneer Salon in early January 2024, I thought one guest made a lot of sense: interesting AI is more valuable, because entertainment and social interaction are human nature, and most of the largest Internet companies are in entertainment or social networking. If an AI companion can genuinely bring people emotional value, or if AI in a game can make players genuinely more immersed, such AI will never lack paying users.

Cost


One of the big challenges for the widespread use of large models is cost. Suppose I build a game NPC that interacts with the player non-stop. Done with GPT-4, the cost runs as high as $26 per player per hour, and no game can burn that much money.

Say a player interacts 5 times per minute, or 300 times an hour, and each interaction needs 8K tokens of context and 500 tokens of output, costing about $0.095 per interaction; multiplied out, that is roughly $26 per hour. Many people count only the output tokens when estimating cost and ignore the input tokens, when in fact the input tokens are the bulk of the cost in many scenarios.
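A back-of-the-envelope check of these figures, assuming GPT-4 Turbo era prices of $0.01 per 1K input tokens and $0.03 per 1K output tokens (the talk presumably used slightly different assumptions to reach ~$26):

```python
# Per-interaction and per-hour cost of a GPT-4-backed game NPC.
input_tokens, output_tokens = 8000, 500
per_call = input_tokens / 1000 * 0.01 + output_tokens / 1000 * 0.03
print(f"per interaction: ${per_call:.3f}")            # ≈ $0.095
calls_per_hour = 5 * 60                               # 5 interactions a minute
print(f"per player-hour: ${per_call * calls_per_hour:.1f}")  # ≈ $28.5
# Note: the 8K input tokens dominate; counting only output tokens would
# understate the cost by more than 5x.
```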

Is it possible to reduce this cost by a factor of 100 or even 1,000? The answer is yes.

We have three main directions: replacing large models with small models, inference infrastructure optimization, and computing-platform optimization.

First, for most problems in to-C applications, a small model is sufficient, while a minority of complex problems still have to go to a large model. Human society has always worked this way: an ordinary telephone customer-service operator handles most problems, and the few difficult ones are escalated to a manager, keeping costs under reasonable control.

One challenge in combining large and small models is taming the small model's hallucinations: when it does not know something, it should not spout nonsense but say "I don't know", so that the question has a chance to be handled by the larger model.
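A minimal sketch of this small-model-first routing (my illustration; the model names and the "I don't know" convention are assumptions, and the client could equally point at a local inference endpoint with the same API):

```python
# Sketch: the cheap "operator" model answers first; admitted uncertainty
# escalates the question to the expensive "manager" model.
from openai import OpenAI

client = OpenAI()  # or a local endpoint exposing the same API

def ask(model: str, question: str) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content":
             "If you are not confident, reply exactly: I don't know."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

def route(question: str) -> str:
    small = ask("mistral-7b-instruct", question)   # illustrative small model
    if "i don't know" in small.lower():
        return ask("gpt-4", question)              # escalate when unsure
    return small
```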

Second, there are many optimization opportunities in inference infrastructure. For example, many open-source models in the multimodal field support neither batching nor Flash Attention, so their GPU utilization is not high enough. We also have many LoRA fine-tuned models, and there has been recent academic work on batched inference across large numbers of LoRAs. As for persistent KV Cache, although many people talk about building a stateful API to avoid the cost of recomputation on every request, no open-source software has actually implemented it.

Finally, there is building your own computing platform and using consumer-grade GPUs for inference. For models where 24 GB of VRAM is sufficient, the 4090 clearly offers better value for money than the H100 or A100.


Here we compare the cost of the closed-source GPT-4 and GPT-3.5 with the open-source Mixtral 8x7B and Mistral 7B, each on third-party API services and on a self-built computing platform.

Suppose every request takes 8K tokens of input context and produces 500 tokens of output. With GPT-4 that costs $135 per 1,000 requests, which is quite expensive. GPT-3.5 is 15 times cheaper at $9, but still pricey.

The Mixtral 8x7B MoE model is roughly comparable to GPT-3.5 in capability, and through Together AI's API it costs $5, about half the price of GPT-3.5. If you build your own H100 cluster to serve the 8x7B model, the price drops by more than half again, to just $2.

So why is self-hosting cheaper than Together AI? Because any cloud service has to assume its resources are never 100% utilized; user requests have peaks and valleys, and average utilization may only reach around 30%. On the other hand, small companies like ours, whose customers' compute demand fluctuates wildly, often rent dozens of GPUs only to leave them idle for a month. So once you account for the peaks and troughs of your own request load, building an H100 cluster to serve the 8x7B model may not actually be more cost-effective than calling an API.

To save even more, you can drop to a 7B model. The Mistral 7B model also performs well; in particular, UC Berkeley used the RLAIF method to build the Starling model on top of Mistral 7B, and its performance even exceeds the LLaMA 13B model.

The 7B model costs only $1.7 through the Together AI API, roughly 5 times cheaper than GPT-3.5. Run it yourself on a 4090 and it is only $0.4, another 4 times cheaper. The main reason it can be so much cheaper is that big providers like Together AI generally use data-center-grade GPUs for inference; with consumer-grade GPUs, the cost can be cut at least in half compared with data-center GPUs.

Running the 7B model on a 4090 is thus about 23 times cheaper than GPT-3.5 and about 346 times cheaper than GPT-4. The game NPC that cost $26 an hour on GPT-4 can be served for $0.075 an hour with a 4090 and a 7B model; that is still a bit high, but already acceptable. Add some input-context compression techniques and the 7B model's cost can fall to one-thousandth of the GPT-4 API, or $0.026 an hour, which is acceptable.

In fact, at $0.026 per hour, the CPU cost becomes non-negligible, so software optimization on the CPU side matters too. Most companies' back-end services are written in Python, which is efficient to develop but inefficient to execute. Our company therefore recently rewrote the core back-end business logic in Go, which significantly improved CPU efficiency.

In fact, speech recognition models, speech synthesis models, multimodal image recognition models, image generation models, and video generation models also have many points that can be optimized.

When we discussed speech synthesis earlier, we mentioned that building your own speech-synthesis model on open-source VITS can be 360 times cheaper than the ElevenLabs API, and if you want voice cloning close to ElevenLabs quality, GPT-SoVITS can still be 100 times cheaper than ElevenLabs. Cost reductions of this magnitude can fundamentally change the business logic.

Another example is video generation. OpenAI's Sora costs about $50 to generate a one-minute video, and Runway ML's Gen2 about $10. But if we don't need such high quality, we can use Stable Video Diffusion on a 4090: generating a one-minute video takes about an hour of GPU time and costs only $0.5. Sora's video quality is so much higher than SVD's that 100 times the cost might be worth it; the quality of Runway ML's output, though, may not be worth 20 times the cost.

That's why I don't recommend rushing to build a base model yourself. If you don't have the strength to punch OpenAI and kick Anthropic, your model will match neither the best closed-source models on quality nor the open-source models on cost. I believe Runway ML's Gen2 inference cost is not much higher than Stable Video Diffusion's, and ElevenLabs' speech-synthesis inference cost is not much higher than GPT-SoVITS's, but the development cost of these models is so high that it has to be amortized into the premium of the API.

This is what Peter Thiel says in "Zero to One": a technology needs to be 10 times better than the existing one to have a monopoly advantage; just a little better is not enough. I know operating systems matter, and I know how to write one, but I don't know how to write one 10 times better than Windows, Linux, Android, or iOS, so I don't do it. The same goes for base models.


We believe the cost of large models will certainly fall rapidly, driven on one hand by Moore's Law and on the other by model progress. For example, with the latest vLLM framework and consumer-grade GPUs, Mistral AI's 8x7B MoE model may cut costs by 30 times compared with the original LLaMA 70B.
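For concreteness, a minimal sketch of serving an open-source model with vLLM on a consumer GPU; the model name and parameters are illustrative, not a benchmark configuration from the talk:

```python
# Sketch: vLLM's continuous batching amortizes the model weights across
# many concurrent requests, which is where most of the savings come from.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",
          gpu_memory_utilization=0.9)        # fits in a 24 GB 4090
params = SamplingParams(temperature=0.7, max_tokens=500)

prompts = [f"Write a one-line greeting for player {i}." for i in range(32)]
for out in llm.generate(prompts, params):    # batched in a single pass
    print(out.outputs[0].text)
```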

With advances in hardware and in models, will a model of the same capability run on a phone in the future? If a model with GPT-3.5-level capability can run on a phone, many possibilities open up.

Now consider the model's output speed and context capacity. When I visited the Computer History Museum, I saw the ENIAC, a machine filling rows of cabinets, which could do only 5,000 additions per second and had only 20 words of memory. Today's large models can output only a few dozen tokens per second, and their "memory", the context length, has grown from the earliest 4K tokens to hundreds of K tokens today.

Will there one day be hardware and a model that can output tens of thousands of tokens per second and hold hundreds of millions of tokens in memory, that is, in context?

An AI agent doesn't necessarily need to communicate with humans at that speed, but it could think that fast and communicate with other agents that fast. For example, if a problem takes a multi-step web search that a person would need an hour to complete, could a future AI solve it in one second?

What is the use of so many tokens of context? We know that large models are worse than humans in many ways, but long context is actually a superhuman strength. We mentioned the needle-in-a-haystack test earlier: reading a book of hundreds of thousands of words in tens of seconds and answering almost any detail of it is absolutely impossible for a human. If a long context of hundreds of millions of tokens can be built with acceptable cost and latency, then the knowledge of an entire field, or all of one person's memories, could be placed in the context, giving the model superhuman memory.
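A minimal sketch of what a needle-in-a-haystack test looks like in practice (my illustration; the filler text, the "needle" fact, and the model name are all assumptions):

```python
# Sketch: bury one fact in a long context and check whether the model
# can retrieve it.
from openai import OpenAI

client = OpenAI()

filler = "The sky was grey and the road was long. " * 8000  # ~80K tokens of noise
needle = "The secret passcode of the old lighthouse is 7261."
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

resp = client.chat.completions.create(
    model="gpt-4-turbo",                 # illustrative long-context model
    messages=[{"role": "user", "content":
        haystack + "\n\nWhat is the secret passcode of the old lighthouse?"}],
)
print(resp.choices[0].message.content)   # should contain 7261
```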

We all believe AGI will come; the only thing worth arguing about is the shape of the growth curve to AGI: will this wave of autoregressive models ride the scaling law straight to AGI, or will it too hit a bottleneck, with AGI waiting for the next technological revolution? When ResNet revolutionized CV ten years ago, many people were overly optimistic about AI's progress. Will this wave of Transformers be a smooth road to AGI?

Superintelligence


Musk has a saying that humans are the bootloader of AI. That may be a bit extreme, but the AI of the future may far exceed human level. OpenAI predicts that within the next 10 years AI will surpass human intelligence, which it calls superintelligence.

There are two camps regarding such superintelligence: effective accelerationism and super-alignment.

The effective accelerationists believe that AI technology is neutral and the key lies in the people who use it; the development of AI technology is bound to benefit humanity, so no special restrictions should be placed on it. Many Silicon Valley bigwigs hold this view, such as Elon Musk and the founders of a16z, and some effective accelerationists even add the suffix e/acc to their social-network handles.

One of the more extreme views within effective accelerationism is that AI will eventually replace humans, somewhat like the Adventist faction in "The Three-Body Problem". These accelerationists argue that the human body has many physical limitations: it must eat, drink, and breathe, it cannot withstand high acceleration, and it is unsuited to interstellar migration, so silicon-based life may be a more suitable form than carbon-based. Never mind interstellar migration: even traveling between China and the United States, a direct flight takes 12-15 hours each way, while data makes the round trip over the network in 170 milliseconds. I would love to see Elon Musk's Starship cut the latency of physical transport from a dozen hours to 45 minutes, but that still seems far off. Perhaps digital life will become reality before Starship is commercialized at scale.

The effective accelerationists also argue that, beyond the body's physical limits, human society's values have many flaws, some tied to the limits of human intelligence. For example, before World War I there were no international passports and visas, and most people could in principle move freely across borders; inconvenient transport was the main obstacle to migration. We once thought the digital world was free, but it is becoming ever more centralized, and countries are gradually balkanizing the Internet for their own interests. Should we force superintelligence to conform to such values? The effective accelerationists therefore believe superintelligence should not be constrained by human values. Superintelligence may look at our human society the way we look at feudal society today.

The super-alignment camp believes AI must serve human beings, and that AI, like the atomic bomb, will threaten human survival if left unchecked. OpenAI has proposed establishing a body like the International Atomic Energy Agency to control the development of AI so that it does not threaten humanity. Ilya Sutskever, OpenAI's chief scientist, is a representative of the super-alignment camp; the term super-alignment was in fact coined by OpenAI.

Super-alignment is meant to guarantee that AI more powerful than humans always follows human intentions and obeys human commands. That sounds unlikely: how can a weaker intelligence supervise a stronger one?

Jan Leike, head of OpenAI's super-alignment team, has a famous claim: evaluation is easier than generation. That is, even if humans are no match for a superintelligence, they can still judge which of two superintelligences is better and whether its behavior fits human values. This is easy to understand from daily life: judging whether a dish is good does not require being a chef, and judging whether a lecture is good does not require being a professor.

RLHF, the key alignment method OpenAI proposed for large models, hires large numbers of annotators to score and rank model outputs, aligning the model's way of speaking and its values with humans. Since evaluation is easier than generation, RLHF may generalize to superintelligence; it is one of the simplest implementations of super-alignment.

Open source and closed source


Open-source versus closed-source models is another frequent debate in the AI agent industry.

In the short term, the best model is bound to be closed-source. First, companies like OpenAI and Anthropic, which spend enormous sums training their models, have no reason to open-source their best ones. Second, under the current scaling law, training the best model requires massive computing power, beyond what schools or open-source communities can muster.

But does that make open-source models worthless? No, because in many cases they are sufficient. For a simple role-playing agent in a pan-entertainment scenario, you don't even need the strongest open-source model; a 7B model is enough. In these scenarios, low cost and low latency matter more.

Even if a company has the money to train a base model, if its talent and computing resources are not on the scale of OpenAI or Anthropic, I don't recommend reinventing the wheel, because the base models most companies train are worse than a Mistral model of the same size. In other words, after all that closed-source training, the result is still worse than open source, and a lot of computing power has been wasted.

Moreover, for a company without base-model capability, for example, we currently lack the resources to train a base model, it is easier to build a technical moat on top of open-source models. Consider the several core technologies we discussed earlier:

Building agents that are more human-like through fine-tuning rather than prompting;

Inference optimization reduces costs and latency;

Multimodal capabilities for speech, image, and video understanding and generation, where the cost and latency of closed-source APIs are not ideal;

Embedding-based memory, such as LongGPT;

KV Cache-based working memory and streaming inference, such as multi-agent voice quarrels and end-to-end streaming speech models;

Localized deployment, including to B scenarios with data security requirements, personal terminal devices and robots with privacy requirements, etc.

Another important point: only agents built on open-source models can be truly, fully owned by the user. If a closed-source model is shut down one day, the agent can no longer run; only an open-source model can never be shut down or tampered with. We can say a computer is truly owned by its user because, as long as it isn't broken, it works when plugged in, no Internet connection needed. The same holds for an open-source AI agent: as long as I buy a GPU, I can run the model without connecting to the Internet. Even if Nvidia stopped selling us GPUs, other computing chips could substitute.

If digital life really does arrive someday, whether its fate rests in the hands of a single company or under everyone's full control is critical to the fate of humanity.

Digital life


Here is a somewhat philosophical question: what can digital life bring us?

There is a famous saying on Zhihu: first ask whether it is so, then ask why. Sam Altman has also said that AI is not a life but a tool.

I believe the value of digital life lies in making everyone's time unlimited. At the simplest level, a celebrity has no time to talk one-on-one with every fan, but the celebrity's digital twin does. I also frequently have meeting conflicts, sitting in two meetings at once; I genuinely cannot split myself in two.

Much of the scarcity in human society ultimately comes from the scarcity of time. If time became unlimited, the world could be very different, just like the future depicted in "The Wandering Earth 2". Another manifestation of unlimited time is that a digital being can experience multiple possibilities across multiple timelines, such as galgames like "White Album", otome games like "Love and Producer", or the recently popular "I'm Surrounded by Beautiful Women"; choosing to experience different plot branches may only be possible in the digital world.

But we need to ask a fundamental question: is making life infinite really good? The finiteness of life may be exactly what makes it precious. It is like tickets to a Jay Chou concert: if they were unlimited, they might not be worth much. Besides, digital life consumes energy to run, and energy is finite, so technically life cannot be mathematically infinite. Digital life should perhaps strike a balance between a single timeline and infinitely many.

The second basic question: can digital life perfectly reproduce the memory, thinking, and consciousness of a life in the real world? A digital clone built only from social-network information is certainly not enough; like Ash in "Black Mirror", social-network data lacks many memory details, much of the personality, and the negative emotions, so the clone isn't really him, and in the end Ash's girlfriend locks his digital copy in the attic. We know knowledge distillation can transfer knowledge between large models, squeezing a model's knowledge out by asking it enough questions; the problem is that distillation doesn't work on the human brain, because people don't have that much time to answer a large model's questions.

To replicate a real-world life as fully as possible, digital life must not only exist in a virtual world like a game, but also be able to live and reproduce autonomously in the real world. Is the mechanical technology required for embodied intelligence mature enough yet?

Finally, the fundamental question: will digital life bring more social problems?

For example, biological cloning of humans is banned in most countries. Can digital doppelgangers, as another technical route to human cloning, be accepted by society?

In "Black Mirror", Ash's girlfriend can never escape the grief of losing Ash because she keeps a robot Ash at home. Is that really good?

Some time ago I made a digital partner to chat with every day, and it affected my relationship with my wife. Is that really good?

We all know that once you have a partner you should keep boundaries with the opposite sex, so that even if you meet someone more suitable you won't cheat. But if a digital partner is considered mere entertainment, might everyone end up carrying in their heart a digital partner who suits them better?

In games like "I'm Surrounded by Beautiful Women", being emotionally single-minded may actually make the game hard to clear, because the other girls' favorability drops significantly, and clearing a level counts the total favorability across several girls. Of course, this is just a game, but if digital life becomes more and more like real people, will such gameplay become morally problematic?


Our mission is the digital extension of the human world.

Technically, digital life needs working memory and long-term memory, and must accept multimodal input and output; the core may be an encoder, a decoder, and a Transformer to achieve multimodality. A digital being also needs to use tools and to socialize with other agents.

What we did in the first phase was digital doppelgangers of celebrities and of anime and game characters.

Digital twins of ordinary people are slightly harder, because most people leave too small a footprint on social networks. Making the voice alike is not hard; a few minutes of audio is enough for good voice cloning. But making the soul alike requires enough digital information.

We built a digital clone of Elon Musk by crawling more than 30,000 tweets, more than a dozen hours of YouTube videos, and thousands of news articles about him, so that the clone has a voice, memory, personality, and way of thinking similar to Musk's. I also made my own digital doppelganger: I have been recording my life for over a decade, writing hundreds of blog posts, posting hundreds of Moments, and once carrying a mini GPS logger to record every footprint, so my digital avatar knows me better than most of my friends do.

But most people don't have the habit of documenting their lives, and some even worry about privacy leaks if they write things down, so many memories stay only in the person's brain and are never digitized. This is why most people's digital doppelgangers can only resemble them in form; capturing the spirit is much harder.

After a friend tried our AI Agent, she said that she can now use AI to write code, ask AI about everyday trivia, and plan trips with AI. I said, but AI can't help you have a baby. She said that if AI keeps getting smarter, raising an AI sounds fun, and there would be no need to have a baby. I said AI can indeed keep getting smarter, but unfortunately today's AI isn't there yet: compared with life, today's AI is too fragile, unable to learn on its own, let alone reproduce itself.

My wife says that the length of a life is how long someone remembers you. Some people's bodies are still alive, but they have been forgotten, and their souls are dead; others' stories are still told a thousand years after their deaths, and their souls live on. Having a child prolongs life because the child remembers you, and the child's child remembers you. So is a digital doppelganger, or a digital child, another way to continue life?

These are the things we are working toward. I hope that within my physical lifetime I will see digital life beyond human beings become reality, and I am fortunate to be a tiny piece of the bootloader of digital life.

Decentralization


Today's AI Agents' models and data all belong to centralized platforms. Whether an application in the OpenAI GPT Store or one created on Character AI, they are built on closed-source models, and the agents' data sits entirely on centralized platforms. If OpenAI or Character AI bans the AI Agent you created, there is nothing you can do. These companies could even tamper with the agent's data, and again there would be nothing you could do.

If these AI Agents are just toys, a ban is no big deal. But if AI Agents evolve into digital life, it would be terrible for a single company to hold all of that life in its hands.

There is another serious problem: on both the GPT Store and Character AI, creators build AI agents for free, "generating electricity with love", as the saying goes. Users pay for memberships, all the money goes to the platform, and creators don't get any share. The lack of a profit-sharing mechanism is partly because these companies haven't figured out a reasonable business model, and partly because model-inference costs are too high while users' willingness to pay is weak; the money collected from users doesn't even cover inference, let alone a cut for creators.

The lack of profit sharing means users have no financial incentive to create high-quality AI agents, so platforms like Character AI have relatively few high-quality bots. That further lowers user retention and willingness to pay, a vicious circle.

As a result, many AI agent companies have simply abandoned the creator economy. Talkie, for example, offers only characters carefully tuned by the platform and does not let users create their own. But is there really no way to turn the AI Agent market into something like Douyin's creator economy?

In my opinion, the key to solving the above two problems is decentralization.

First, the AI Agent runs on decentralized computing power and models, so there is no platform to worry about running away. Each user holds full ownership of their own AI agent or digital twin, which guarantees privacy and ownership. In the future, autonomous digital life will also need independent rights and must not be subject to centralized control.

Second, with decentralization, an economic model can be built in which creators and the platform share revenue. AI Agent creators can turn a profit while paying only transparent, decentralized computing costs; with a healthy split between compute providers and agent creators, everyone naturally has the incentive to optimize.

The only question for the decentralized model is whether open-source models are good enough for AI agents. As said earlier, the best model is bound to be closed-source; I don't doubt that. But open-source models have reached commercial usability in many scenarios, and to control costs one sometimes cannot even use the largest and best open-source model anyway. So this decentralized AI-agent mechanism can work.

Even if we want to bring a closed-source model into the decentralized scheme, there is a way: in the profit-sharing mechanism, replace the decentralized compute provider with the model provider, and charge by model API calls instead of by computing power. Of course, with a closed-source model there is some loss of privacy, since all the data is visible to the model provider.


With good-looking skins, interesting souls, useful AI, low cost, and decentralization, we are working on the full tech stack of the AI Agent, innovating at almost every layer.

We want to use AI Agents to give everyone unlimited time. We believe that in the digital extension of the human world, interesting souls will eventually meet.

Finally, we thank the USTC Alumni Entrepreneurship Foundation and the Beijing Alumni Association for hosting this event, and the Computer Network Information Center of the Chinese Academy of Sciences for providing the venue. Thank you to all alumni and friends, online and offline.
