Why is the SenseTime model amazing?

Core points

1. Parameters and data should be considered together to truly represent the capabilities of artificial intelligence; the product of the two is the amount of computation. SenseTime currently has 27,000 GPUs running, with a total computing power of 5,000P, including 500P of localized (domestically produced) computing power, making it one of the largest computing centers in Asia so far. It can simultaneously support the parallel training of 20 models at the 100-billion-parameter scale. SenseTime has also turned its large model capabilities into services for customers, including automated annotation, which is about 400-500 times more efficient than manual annotation. This year's goal is to support the training of trillion-parameter models; training of dense models with 500 billion parameters can already be supported. Incremental training is also available, at a cost reduced by 90%, to 1/10 of the original.

2. The SenseTime model system is named "Daily New" (Ri Ri Xin). It integrates SenseTime's natural language model, text-to-image model, perception models, and incremental model services.

3. The large natural language model is named "Consultation" (SenseChat). It can carry out multi-round interaction, answer questions after reading complex PDFs, and generate stories from human prompts. It also has an AI code assistant offering code completion, expansion, translation, refactoring, correction, commenting, and complexity analysis. The code assistant can improve coding efficiency by 62%, and its first-attempt pass rate on the HumanEval test set is 40.2%. In the medical field, it can assess the likely cause of a patient's condition from the symptoms described.

4. Generative AI applications include Miaohua, Ruying, Qiongyu, Gewu, and others. Miaohua generates pictures from instructions, and the model can be further trained on pictures the user supplies so that its output better meets requirements. With only 5 minutes of video recorded on a mobile phone, Ruying can create a personal digital human that helps enterprises and individuals produce high-quality video content quickly and efficiently. Qiongyu can quickly generate city-level digital twins, improving scanning efficiency by 400% and reducing the original cost by about 95%. Gewu generates high-quality object models in real time.

5. An excellent large model cannot be built without data. The SenseTime OmniObject3D dataset contains more than 6,000 objects in 190 categories, with a large amount of scan data from real objects, and supports tasks such as neural rendering, surface reconstruction, 3D generation, and point cloud recognition. SenseTime provides automatic data annotation services with 12 models, including general-purpose models and specialized models for professional fields. The service annotates data automatically using the SenseTime model and can be called through an API.

A full summary of the press conference is attached

Dr. Xu Li gave a speech

In the era of large AI models, people usually distinguish models by their parameter count once it reaches a certain scale. But I think we should consider the parameters together with the data the model is trained on. Parameters and data must be viewed together to truly represent the capabilities of artificial intelligence.

The product of the two is what we can think of as the amount of computation. This is the new formula we want to propose for algorithms, data, and computing power in the new era: a multiplication. When we measure a model today, we cannot simply look at its parameter count. Under a limited compute budget, resources have to be allocated between parameters and training data, because total computation is finite. As we will see later, the compute requirements of future large models and their data are growing explosively. So when we talk about computation today, we also emphasize the amount of GPU computing and its operating efficiency, that is, the software systems layered on top that provide a highly concurrent, efficient infrastructure.

Let's first look at parameter counts. The neural network, today's protagonist, has been the best-performing algorithm of the past 10 years. Its requirements have risen by an order of magnitude every two years, and more recently we have seen parameter counts grow almost tenfold per year.

Recently we have seen the first models whose parameter counts may reach hundreds of billions or even trillions. But if we humans are higher creatures, in what way are we smarter than other creatures? It may be that the human neural network has more connections than that of other organisms, on the order of 150 trillion.

From this point of view, the computer systems we call artificial neural networks (ANNs) are still little brothers. But we will certainly keep improving today's artificial neural networks; that is a definite research direction. Many people also say that superintelligence will emerge once parameter counts match. A few years ago it was said that carbon-based life may be the bootstrapper of silicon-based life. Whatever happens along the way, at least today we see that as technology iterates, parameter counts keep rising and bring greater changes with them.

Now let's look at the amount of data. The public training data of GPT-3 is about 500 billion words. To put that in perspective: if you listened to speech your whole life, how many words would you hear? About 1 to 2 billion. So the amount of knowledge an artificial neural network can see today already far exceeds the number of words a person hears in a lifetime. How much data has the largest neural network been trained on today? 2 trillion. That may not mean much on its own, so here is a reference figure: by some estimates, the total amount of high-quality human-language text is about 9 trillion words, so we are entering a competitive range. With the largest network already trained on 2 trillion, and the multiple still rising, we will soon face a situation where the high-quality corpus has been used up.
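The ratios behind these figures can be checked with a few lines of arithmetic. The sketch below uses only the approximate numbers quoted in the talk; the variable and function names are illustrative.

```python
# Approximate figures quoted in the talk.
GPT3_TRAINING_WORDS = 500e9      # about 500 billion words
LIFETIME_WORDS_LOW = 1e9         # words heard in a lifetime, low estimate
LIFETIME_WORDS_HIGH = 2e9        # words heard in a lifetime, high estimate
LARGEST_RUN_WORDS = 2e12         # largest training run mentioned: 2 trillion
HIGH_QUALITY_TEXT_WORDS = 9e12   # estimated stock of high-quality text


def lifetimes_of_listening(training_words, lifetime_words):
    """How many human lifetimes of listening the training data corresponds to."""
    return training_words / lifetime_words


def fraction_consumed(training_words, total_words):
    """Share of the available text that a training run has already used."""
    return training_words / total_words


low = lifetimes_of_listening(GPT3_TRAINING_WORDS, LIFETIME_WORDS_HIGH)
high = lifetimes_of_listening(GPT3_TRAINING_WORDS, LIFETIME_WORDS_LOW)
share = fraction_consumed(LARGEST_RUN_WORDS, HIGH_QUALITY_TEXT_WORDS)

print(f"GPT-3 saw roughly {low:.0f} to {high:.0f} lifetimes of speech")
print(f"A 2-trillion-word run uses about {share:.0%} of the high-quality text")
```

On these numbers, a single large run already covers more than a fifth of the estimated high-quality text, which is why a few more multiples of growth would exhaust it.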

So how do humans learn? We humans are visual animals: 80% of the information we take in comes through the eyes. As for the neural connections we just mentioned, the ratio of connections in the human brain devoted to processing vision versus processing language is 10:1. That is, although we have 150 trillion parameters, most of them process vision, and we must have developed vision before we defined language. Language is a compressed expression of the world invented by our ancestors, so we can understand the world faster through language, but more information comes from vision.

For example, to convert the picture on the right into language, you would have to write a lot of text, in long and precise paragraphs. Even when artificial intelligence parses it today, the image can be segmented, for example separating the road and the street, and then labeled. Each box carries a manual annotation of the box and its position, and there are other manually written descriptions that we call image annotations. This is what we at SenseTime have been working on: using a large general-purpose model to turn originally unstructured data into structured data. We have also added many elements to this structuring process. Because many of our annotations were produced through human feedback, we have accumulated a great deal of visual information with human feedback over the years. If this information is fed into a larger network as multimodal input, it may provide a completely different input foundation.

So, taking the product of these two things: if you want to discuss large models in the future, you will want to discuss the amount of computation, which is the model's capability. What is the amount of computation? The horizontal axis is the amount of data processed, the vertical axis is the model's parameter count, and their product, the area covered, represents its capability. You will even find that when computation is limited, you may have to allocate more compute to data rather than only to parameters, because the network may still be undertrained. How to do this is a matter of algorithm design. So the three elements of the new era have undergone a transformation: the further a model sits toward the upper right, the stronger its capability.
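The trade-off described here can be sketched numerically. The example below takes the talk's formulation literally, treating computation as parameter count times data volume, and shows how a fixed budget splits between the two. The budget and model sizes are hypothetical, chosen only for illustration.

```python
def compute_area(parameters, training_tokens):
    """Computation in the talk's sense: parameter count times data volume."""
    return parameters * training_tokens


def tokens_for_budget(budget, parameters):
    """Data volume that exactly spends a fixed budget at a given model size."""
    return budget / parameters


# Hypothetical fixed budget: a 100B-parameter model trained on 1T tokens.
budget = compute_area(100e9, 1e12)

# The same budget spent on a smaller, same-size, and larger model.
for parameters in (50e9, 100e9, 200e9):
    tokens = tokens_for_budget(budget, parameters)
    print(
        f"{parameters / 1e9:.0f}B parameters -> {tokens / 1e12:.1f}T tokens "
        f"({tokens / parameters:.1f} tokens per parameter)"
    )
```

Halving the parameter count doubles the data the same budget can pay for, which is the sense in which a large but undertrained network may be a worse use of compute than a smaller one trained on more data.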

Let's look at how far some of the well-known networks have come, and at the capabilities of the larger networks we may face in the future. The lower left corner is weaker and the upper right corner is stronger. With such powerful general-purpose networks come some possible paths to what we call artificial general intelligence. The industry currently has enormous demand for basic computing power and infrastructure, and as the area just described grows larger, the requirements on operating efficiency become very high as well.

We took everyone on a tour of the SenseTime Lingang AI installation, where we are today. At present there are 27,000 GPUs running, with a total computing power of 5,000P, including 500P of localized (domestically produced) computing power, making it one of the largest computing centers in Asia so far. How much can it support simultaneously? It can support 20 models at the 100-billion-parameter scale training in parallel, using thousand-GPU parallelism. At the same time, we turn our large model capabilities into services for our customers, including automated annotation; our automated annotation is about 400-500 times more efficient than manual annotation.

Then there is inference deployment. What is the heavy cost of large models? The deployment cost. Our inference efficiency has improved by more than 100%. As for parallel training of large models, many people today talk about hundreds of billions of parameters. But I can say that if you connect 4,000 cards to train a dense model, the optimal configuration is around 500 billion parameters. We dare to say this because we have a large installation and have run the experiments. So our goal this year is to support trillion-parameter training; at least from our point of view, we can support training dense models with 500 billion parameters. We also offer incremental training: with the model on the platform, you can do a great deal of incremental training, at a cost reduced by 90%, to 1/10 of the original. We will also open up our models, model supermarket, model tools, and developer tools, to greatly improve development quality and efficiency.

We have been deeply engaged in this industry for many years; in fact, we have been building large models since 2019, and our chief scientist will give you a more detailed introduction later. Today I want to talk about our large model system, which we are launching under the name "Ri Ri Xin" (Daily New). Why this name? Our company name is Shang Tang, and the inscription on Tang's basin reads: "Gou ri xin, ri ri xin, you ri xin" (if you can renew yourself for one day, do so every day, and keep renewing). Calling it "Gou Ri Xin" would not sound good, so it is "Ri Ri Xin". Daily New means renewing every day, and then renewing again, which is very much like a large model: a large model absorbs huge amounts of token data, on the order of 100B tokens, within weeks, so it is in effect updated every day and its capability grows day by day. When I told our colleagues the name, some of them may not have read "The Great Learning", which is where the phrase comes from, and they said it sounds like the name of a supermarket. Fine, then it is a new supermarket.

The supermarket means we can offer everyone a large-model supermarket covering the entire industrial chain. Our latest model system, Daily New, integrates our large natural language model, large text-to-image model, large perception models, and incremental model services. We will introduce these models and systems to you one by one.

On this basis, we warmly welcome our partners to connect to our large model system and iterate on the next stage of large models together.

Let's look at some of the products in each area. First, as you may have heard, we have a natural language model with 100 billion parameters; our self-developed natural language model is called SenseChat. We think the strength of a large natural language model is that you draw out its solution to a problem through interaction with it. It is not simple question and answer but multi-round interaction, so our slogan is "talk it over, and it can be solved". It may not get things right all at once; you have to work through the problem with it step by step.

Some characteristics of our self-developed model, including its logical reasoning and its understanding of long texts, show up in the conversation. You need several rounds of interaction to really draw out its underlying logic and problem-solving ability. We also have a knowledge-update module, which makes its answers clearer.

Without further ado, let's look at a conversation with SenseChat. We have invited the head of SenseChat, Professor Wang Liwei, to come on stage and talk with it.

Good afternoon everyone. I am Wang Liwei, head of SenseTime's language models. I am honored and excited to introduce our language model, SenseChat. We gave it a very down-to-earth name, "Consultation", because we hope everyone will consult with it on everything and solve problems through consultation. Let's see what SenseChat can do. Since today is the launch of our natural language model, let's first ask it to write an advertising slogan for our language model.

It is a very simple question and answer. It says: "When language becomes your strength, the world will open its doors for you." It sounds grand; in other words, if you can talk, you can make a living. But this has nothing to do with SenseChat. So let's tell it that our language model is called SenseChat and ask how the slogan should be written. Let's see whether the slogan changes.

It is more direct this time: "Hand in hand with SenseChat, make language an advantage and open up infinite possibilities." So I decided to use this as the slogan.

We have many customers and partners here as guests today. Can we use it to write an invitation that includes this slogan, and send it out to everyone to see if it works?

These are real scenarios. In practice it is not a one-shot question and answer; the model needs continual input to gradually arrive at good content for the scenario.

It produced a standard template with several interesting points. "Customers and partners" appears in the salutation. It calls the event a new product launch; we call it Tech Day, but since we gave no such input, it inferred a product launch from the conversation above. It also mentions working hand in hand with SenseChat and includes the slogan. There are standard placeholders for time and place, and phrases such as "growing together", with a series of fields left unfilled.

Can we ask the model to fill them in? The time is April 10 at 3 p.m., and the location is the SenseTime Lingang AI installation. I also see that the company name is missing at the end; can it add that too? The prompt says that both the location and the sender are the SenseTime Lingang AI installation, so this tests whether it understands that the two refer to the same thing and fills each in at the right place.

The time and place were filled in, and so was the sender. Interestingly, it also rewrote the first sentence: "I am honored to invite you to participate in the new product launch of the SenseTime Lingang AI installation." It replaced the original "our" with "SenseTime Lingang AI installation". This is the point of real-world interaction: as you keep giving it new input, it understands the meaning behind it.

I trust all our guests have received such an invitation. Let's look at some other applications, which guests can try later. For example, Liwei wants to tell stories to the children at home and sometimes runs out of ideas. Can we use it to create stories? Let's feed it one prompt after another and see whether we can create a children's story, with the model and us alternating sentences. Let's start with "Xiaohua (Little Flower) is a kitten", a standard opening for a children's story.

Given a cat, it immediately thinks of eating fish: the kitten wants to go fishing but can't catch anything, perhaps being a rather small kitten, and then its mother tells it to be patient when fishing and not to rush.

We'll leave it to our colleagues to continue the story while we look at other features. In many settings we need to handle specialized knowledge, for example in finance or law. We need the engine to understand long texts, act as an expert in the field, and read professional documents. Let's switch to SenseChat's document feature.

Let's click on this box and upload our PDF. Here we choose a patent law document; this is the Patent Law published in 2020. Let's scroll down so you can see how long it is.

It runs to 24 pages and 82 articles. It is hard for anyone to read all of it and then give a professional answer. So let's ask a question we all care about: do you hold the patent right as soon as a patent application is filed? Our technical team cares a great deal, because we receive the patent bonus once we hold the patent right.

So everyone pays close attention to exactly when that happens. SenseChat told us directly that the answer is no: it cited Articles 39 and 40 of the law at some length and concluded that the patent right exists only after the State Intellectual Property Office has examined the application, decided to grant it, and registered and announced it. If you look at Articles 39 and 40, one covers invention patent applications and the other covers utility model and design patent applications.

If you read only the statute text, it is hard to tell whether the answer is yes or no, but by abstracting from the document the model gives an answer that ordinary people like us can understand more easily. This version was published in 2020, and our patent colleagues tell us that some patent rules are updated every year. Let's see whether there are any updates in 2023; for example, we asked whether this year's regulations change the application cycle for invention patents.

That information is not in the document itself, so answering requires a knowledge-update module. Let's see how it answered: it said yes. It knew that this year is 2023 and that the State Intellectual Property Office is compressing the patent examination cycle to 16 months, so future invention patent applications could be granted within two years at the earliest, which may be a very welcome change.

So the model can incorporate new knowledge and combine it in a very practical way with its analysis of the long documents we upload.

Let's switch back to the story editor and look at our story. First we said Xiaohua was a kitten; it wanted to fish but couldn't catch anything. We continued by saying it went fishing at the river, but then it stopped fishing and went off to watch the crabs. Its mother was very unhappy and had to teach Xiaohua that only by concentrating can you succeed.

The story might have ended there, but we continued: the kitten kept playing, spotted a bird, and went off to chase birds. Then the story had to pull it back: its mother was even more unhappy, saying, "You were supposed to be fishing today, and you keep running off after crabs and birds." We picked it up and said that Xiaohua regretted it after hearing this, realized the mistake, finally concentrated on one thing, caught a big fish, and was very happy. The story is finished.

The final title of the story is "Xiaohua's Fishing Notes", followed by a summary. You can see that we can guide the model: where it wanted to end, new prompts can extend the story further and further, and you control the content. It gives everyone room to co-create while the human writes very few words.

Let's look at other applications of SenseChat in open domains. We have connected SenseChat to programming. Let's open the programming experience; we have invited our programmer Zhang Tao to walk us through how SenseChat plugs into coding.

Most of us use an IDE such as VS Code, and almost every programmer uses some IDE plugins. We have integrated SenseChat as one. Let's open a new file.

Let's start with something simple, a primary-school math problem that children often meet: we ask it to write a function that computes the greatest common divisor of two numbers. You can see that writing code becomes a matter of natural-language input, and it accepts prompts in Chinese. It started writing right away. If you read the code, it is the Euclidean algorithm (successive division), applied recursively to find the greatest common divisor.
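The code generated on stage is not reproduced in the transcript. A minimal sketch of a recursive Euclidean greatest-common-divisor function of the kind described might look like the following; the function name and the sample inputs are assumptions made for illustration.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor by the recursive Euclidean algorithm."""
    a, b = abs(a), abs(b)
    if b == 0:
        # Base case: gcd(a, 0) is a.
        return a
    # Recursive step: gcd(a, b) equals gcd(b, a mod b).
    return gcd(b, a % b)


print(gcd(48, 18))
```

Each recursive call replaces the larger number with a remainder, so the recursion depth grows only logarithmically with the size of the inputs.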

We can then go on to ask whether two numbers are coprime.

Let's have it generate code that checks whether two numbers are coprime, and then simply call it to see whether the function runs correctly.

Previously, most of our effort went into writing the code itself; now the focus is on how to interact with the model and how to debug, and debugging can also use our engine, as you will see later. The prompt now is: the user enters two numbers, the program determines whether they are coprime, and prints "coprime" if so and "not coprime" otherwise. Such prompts can be written in Chinese and English mixed together without any problem. The model then calls the function on the two numbers and compares. Interestingly, when asked to print "coprime" and "not coprime", it writes the results in Chinese, while the input prompts are in English because most of the training programs are in English. It is a curiosity, but it doesn't matter.

Let's run it and see whether it works. It asks for the first number; enter any large number, then enter another. For these two it is hard for a person to tell whether they are coprime; it says they are, and you can't easily check. So try two even numbers: enter a large even number, and the result should be not coprime. Good, not coprime; we have reason to believe it is right.
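A minimal sketch of the coprime check and the interactive wrapper described in this demo, under the assumption that it builds on a recursive Euclidean gcd. The function names, prompt strings, and output strings are illustrative rather than the code produced on stage.

```python
def gcd(a: int, b: int) -> int:
    """Greatest common divisor by the recursive Euclidean algorithm."""
    a, b = abs(a), abs(b)
    return a if b == 0 else gcd(b, a % b)


def is_coprime(a: int, b: int) -> bool:
    """Two integers are coprime when their greatest common divisor is 1."""
    return gcd(a, b) == 1


def describe(a: int, b: int) -> str:
    """Return the message the demo prints for a pair of numbers."""
    return "coprime" if is_coprime(a, b) else "not coprime"


def main() -> None:
    """Read two integers from the user and report whether they are coprime."""
    first = int(input("First number: "))
    second = int(input("Second number: "))
    print(describe(first, second))
```

Calling `main()` runs the interactive version. The sanity check used on stage, entering two even numbers, works because any two even numbers share the factor 2 and so can never be coprime.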

That was an automatic process with almost no modification. Now let's look at a practical example. We often call interfaces developed in-house. Say we want handwriting OCR: we have a lot of handwritten notes and want to recognize the handwritten text. How do we build a complete system that calls such an interface, and what does the interface's documentation require?

Let's start a new project, have this client call our own interface, and complete a handwriting OCR task.

We've started the project; let's switch back in the meantime. On the right is the interface it is working in, and you can watch it write code. Since that will take a while, let me talk. We think that once SenseChat is connected to natural-language programming, it will change the old 80/20 rule. Today about 20% of code comes from abstracted public libraries and 80% is written by people. In the future, 80% will be machine-generated and 20% will be guided by people through prompts. What is most interesting about this?

What matters most for programmers is that if you fine-tune the model on an industry's code, it helps programmers share their experience: the model you use carries the experience of AI engineers in calling AI programs. This is very important for a company. If we run the model over all of our company's libraries, new employees immediately benefit from the knowledge of existing employees, which raises development capacity. We tested it and found that employees uploaded 62% more code after using the platform, not because they did a lot of useless work, but because the machine substantially improved their efficiency on the job.

On the test set, our accuracy is 40.2%, currently the highest among publicly available tools and higher than Copilot. Of course, GPT-4 is now out and its programming ability is very high, and Copilot will adopt GPT-4 in the future and may improve further. But I think that as our large model capabilities iterate and improve, and as we bring in code from more industry scenarios, this will be very useful for enterprise users: no one wants to share their internal code base with the outside, but we can do incremental training.
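The summary at the top identifies this figure as the first-attempt pass rate on the HumanEval test set. As a minimal sketch of how such a rate is computed, assuming one generated solution per problem and a pass/fail result from each problem's unit tests, the calculation is simply the fraction of problems solved on the first try. The outcomes below are hypothetical and are not SenseTime's results.

```python
def pass_at_1(first_attempt_passed):
    """Fraction of problems whose first generated solution passes every test."""
    results = list(first_attempt_passed)
    if not results:
        raise ValueError("need at least one problem")
    # True counts as 1 and False as 0, so the sum is the number of passes.
    return sum(results) / len(results)


# Hypothetical outcomes for five problems: True means the first sample passed.
outcomes = [True, False, True, False, False]
print(f"pass@1 = {pass_at_1(outcomes):.1%}")
```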

What capabilities do we offer? Code completion, expansion, translation, refactoring, correction, comments, and complexity analysis. Complexity analysis is very important, because the time and space complexity of code is often the core constraint on whether it runs well in practice. The list also includes test cases: test cases are often written incompletely, with edge and corner cases left out.

We also support programming in both Chinese and English. You can try it later; if you want to write a program, you can use our system to complete it end to end.

Moving on: we are bringing natural language and our new ideas to new scenarios, for example healthcare. Medical scenarios are completely different. The open-ended Q&A and storytelling we just saw are free-form chat that anyone can start. Medicine is rigorous and cautious; an answer can be given only after many questions. Let's try it.

Here is how we talk to the large language model for healthcare. Let's switch to our medical section and ask about a common and interesting condition. Everyone stays up late and works hard, and sometimes the eyes turn yellow after staying up late, so we ask what the specific cause is.

The prompt says that overtime has been very hard lately, which plants an overtime scenario, but the main point is that the eyes are yellow, which has nothing to do with overtime.

You see it doesn't answer directly; it asks questions. It asks whether there are other discomforts, which is the kind of rigor medicine requires. We say the eyes are not uncomfortable, just a little yellow, which is a typical symptom. It still does not make a judgment; as a doctor would at this point, it asks how long this has lasted. About a month, it seems, so it is certainly not caused by overtime.

And what did it suggest? Booking a liver function test, along with a reminder to rest and keep a healthy lifestyle. I think this is very important: the medical language model guides you and gives advice. Of course, our colleague still wants to see a doctor and asks which department to visit. It says internal medicine or gastroenterology, and it also suggests an ophthalmology checkup. It is a fairly complete answer. So the medical model does two things: first, it guides the conversation by asking you for more information; second, it recommends follow-up examinations, here including an eye exam.

Let's look at another case, again about which department to register with. For example, someone who sweats easily asks what to do about sweating when nervous.

Many people have this problem. The model is rigorous: it asks whether there are other symptoms, such as sweaty feet, since with both the picture becomes clearer, and it says this is hyperhidrosis. It also asks whether you are taking medication; I believe most people are not. It says it can recommend antiperspirants, which I find interesting. It explains that hyperhidrosis affects quality of life and social interaction, so it recommends seeing a doctor and treating it as a medical condition.

As before, we ask which department to visit, and it says neurology or dermatology. Ordinary patients find it hard to work this out from symptoms, but the answer emerges through rigorous medical dialogue and guidance. That is why it is called "Consultation": you talk it through, because it cannot give you a good answer all at once.

There are other cases, which we won't type out one by one. For example, my physical examinations often show various elevated indicators, and afterward I don't know what to do; the doctor always tells me to come back for a recheck. Now you can tell it which indicator is high and interact with it: what to watch for, what symptoms may be related, and which department to visit for examination. If you are interested, this application scenario is well worth exploring.

Okay, let's cut back. We have actually deployed this scenario as a convenience service at Xinhua Hospital. With natural language access to this kind of Q&A, it gives very rigorous reasoning. As large language capabilities gradually iterate and evolve, it can do more: follow-up, health information, Internet consultation, and assisted diagnosis and treatment. We will further promote launches at Xinhua, Zhengda First Affiliated Hospital, Jiahui Hospital, Ruijin Hospital, West China Hospital, and others.

Okay, in theory it should have finished by now. Please introduce it to everyone: open our program and show which parts you wrote and which parts the machine wrote. Our program follows what we just described: it needs to call the recognition interface of our own AI open platform and then draw the returned response as a result picture. In the third prompt I asked it to write a main function that drives the two functions above. Most of the prompts are prescriptive, plus some information taken from the documentation.
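The three-step structure described here (a recognition call, a function that draws the response, and a main function that drives both) can be sketched as follows. This is only an illustration under stated assumptions: the demo's actual code and the platform's interface are not shown in the talk, so `recognize_text` is a hypothetical local stand-in that returns fixed sample data, and the result is drawn as an SVG string so the sketch runs offline with the standard library.

```python
from xml.sax.saxutils import escape


def recognize_text(image_path):
    """Hypothetical stand-in for the platform's recognition call.

    A real version would send the image to an online interface; here we
    return fixed sample data so the sketch runs without a network.
    """
    return [
        {"text": "hello", "box": (10, 10, 120, 40)},  # (x, y, width, height)
        {"text": "world", "box": (10, 60, 120, 40)},
    ]


def render_result_svg(results, width=200, height=120):
    """Draw each recognized box and its text into an SVG string."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for item in results:
        x, y, w, h = item["box"]
        parts.append(
            f'<rect x="{x}" y="{y}" width="{w}" height="{h}" fill="none" stroke="red"/>'
        )
        parts.append(f'<text x="{x + 4}" y="{y + h - 8}">{escape(item["text"])}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


def main(image_path="input.jpg"):
    """Drive the two functions above, mirroring the third prompt."""
    results = recognize_text(image_path)
    return render_result_svg(results)


if __name__ == "__main__":
    print(main())
```

Changing the input photo, as in the demo, would correspond to passing a different `image_path` to `main`.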

Can you explain which ones you wrote and which ones were written by machines?

After the three-step prompt, I saved the text returned by the AI in full and then made some rough modifications. One kind of change is the paths of resources we prepared beforehand and similar parts specific to our setup. The changes are marked in red, so you can compare right away: for example, some imported libraries were wrong, the picture text was not correct, and there are some changes to the process. Okay, the changes are not big. The figure I mentioned just now was a bit more than 40%, but in fact, with good prompts, many situations work well. Shall we test it?

Because this is text recognition, I just took a photo of some handwritten words and sent it to Zhang Tao on WeChat. That is the input photo. Let's change the name of the input photo in the program, and it will directly call our online interface.

You can see that we can connect to many vertical scenarios, and we hope to enter our customers' various vertical scenarios. Of course, we can also use it to empower many of our existing platforms.

We just mentioned video tools. In fact, there is a whole series of platforms and tools, and today we will also talk about how we put this content onto the big platform. We have text-to-image, which we call Miaohua; Ruying, the digital human; Qiongyu, for generating 3D spaces; and Gewu, for generating 3D objects.

Then this whole set can help everyone quickly complete the production of video content.

First of all, let's look at Miaohua, our AI content creation community platform. What kinds of problems can our large model solve? One is fast inference: generating an image from text is usually too slow, and here a high-definition picture comes out in two seconds.

The second, very important point is training specialized small models on top of our self-developed large model. Often you ask it to generate something the large model has never seen, and it cannot generate it. But the base model is capable enough that you can build a personal small model alongside the large model for inference, and this completes in a few minutes. We will demonstrate it shortly, and downstairs there are machines where everyone can try generating content; it is a lot of fun. About 20 training photos are enough, and none of it requires programming ability; you just drag and drop.

Third is the generation community as a whole. In addition to our own large model, the community has gathered tens of thousands of models, offering different model choices for different prompts and giving creators more complete tools. We hope everyone will build the future ecosystem together and make it bigger and bigger.

Finally, we can also package this as a B-side API, allowing our downstream customers to build their own communities, which is a very different offering. So if any customer wants to use our capabilities to build his own platform community, we will give strong support; after all, what we sell is the computing power of our large device. Well, let's take a look at our platform.

Okay, this is our Miaohua platform. We have many models of our own, plus community models and so on. It is very rich, and each one is different because the styles differ. Scroll down and look at the photos we generated: there are all kinds of styles, characters, evolution, mirroring, and then shadows, lighting, cartoons, futuristic.

Let's go back to the top and click a photo, the one on the far left. Click any photo and you will see a lot of prompt words. To produce such a photo, a few simple sentences are not enough; it takes a fairly complex, complete prompt.

Let's copy the prompt, okay? First of all, look at the light and shadow on this girl: quite beautiful and very realistic. We use our own model to generate it from the descriptors. First choose to generate 4 photos; the other settings can be filled in here. The prompt says French blonde; let's change it to a Chinese woman and leave the rest unchanged.

Shall we have her change clothes? Instead of a suit, can we have her wear Hanfu? In fact, the core of generating such a picture is its prompt, so you can click one of our pictures and copy its prompt, which is very efficient.

After producing a batch in Hanfu, let's take it a step further: we want a Hong Kong style. For example, wearing modern clothes but showing the style of Hong Kong stars in the 80s. Let's see if it can do it.

It produced a few photos that feel a little dated, but they are not really Hong Kong style. What to do? As we just said, if the model has never seen old photos of Hong Kong, how can it produce Hong Kong style? It can't. Let's take a look at our training platform. It is very simple: you don't have to program to train the model. You just enter the model prompt word, such as Hong Kong style, give your model a name, add the prompt 80s Hong Kong style, and then upload photos, about 20 or a few more. We collected a batch of photos of Hong Kong stars and started training. It's as simple as that; you don't need to know anything else. Click to start, and it will be ready in a few minutes. Meanwhile, let's go back and take a look at these photos.

Let's take a look at this parrot. The prompt is a parrot with pearl earrings, in Vermeer style, which I admire, because Vermeer has a painting called Girl with a Pearl Earring, known as the Mona Lisa of the North. So they spoofed a famous painting, but you can see that the light and shadow are still quite fun, right? You can spoof other things too; later I'll generate a pair of cacti with pearl earrings.

Then look at the next picture. The prompt on it says standing, sunlight, portrait photography: a 20-year-old Chinese woman on the streets of Shanghai in 2023 who looks like Audrey Hepburn. And this is interesting: the prompt specifies Kodak's Portra 800 and a 105mm F1 lens. This lens has a large aperture, so a large-aperture effect is generated here, and you can look closely at the strands of hair. Its ability to render light actually shows its understanding of the physical world, which is very interesting.

Okay, let's look at one more. This one is called half-porcelain, half-cyborg fractal armor. I can hardly describe it, and I could not write such a prompt myself, but you can see that it is quite delicate, with detail and depth of field overall.

I think this one is quite good-looking. The prompt describes a cute Chinese woman with fairy wings. Notice a small detail: the hairpins on her head are rendered in front and behind, which is consistent with the physical depth-of-field effect.

The training we started is about done. After training, a photo is chosen as the cover; click it and you can generate with this model, again 4 photos. Everyone can go downstairs to try it.

Since the model is now online, let's generate with exactly the same descriptor. After giving the Hong Kong style model so many stars of the 80s, what will it produce? In the last one, the eyebrows look a bit like a Hong Kong star of the 80s, right? It's fun. So this is our text-to-image product. You can see that it has mastered a lot, not only realistic light and shadow. All the pictures in our PPT were generated by this platform. Good technical resources are key, but the core is that once we connect the power of the entire community, the possibilities are practically infinite.

So let's take a look at our digital human generation platform, called Ruying. Why the name Ruying? This year we asked our model to come up with a name: I said I am going to do digital humans, and we want it to be a cloud service. From these two keywords, it proposed the name Ruying. I asked why, and it said the idiom behind Ruying, like a shadow following its form, means that one thing always accompanies another, so it fits your digital human, and the two characters of Ruying feel very light. I thought the name was very good, so I changed the name to Ruying.

So let's take a look at Ruying through a video.

Hello everyone, I am a digital human. The video you are watching is generated entirely by AI: my movements, my expressions, and what I say are all produced by neural networks. I can live in the digital world all the time, and you can too.

With just 5 minutes of video recorded by your phone, you can have a digital person of your own. Every digital person created on SenseTime's platform has undergone strict trusted authentication to ensure that it will not be stolen or tampered with. SenseTime AI Video Platform is a full-stack AI video generation platform developed by SenseTime to help enterprises and individuals produce high-quality video content quickly and efficiently. The platform integrates SenseTime's self-developed AI algorithm, which can be combined with AI copywriting to generate AI documents.

Okay, we have actually seen many of these in live broadcasts. Now let's invite the person in charge of our digital human platform to come on stage.

Hello everyone, I am the person in charge of the Ruying platform, in fact, since I have a workforce, let me introduce it to you together with my digital people. So let's show you the product first.

Our digital humans have two general characteristics. The first is that there are many styles: 2D, 3D, cartoon, and so on; whatever you want, we have. The second is that there are many tools: we have text-to-image materials, we can connect text to real-time Q&A, and there is a series of other materials. These two points make our digital human platform better. So let's hand over to Lina to show you our digital humans.

A common application of our Ruying platform is generating marketing videos for a variety of products. This year we also jointly developed a digital cultural and creative product with the Forbidden City, so we will first generate an introduction video for this product.

Just as before, we give our large model a brief description: we jointly developed cultural and creative products with the Forbidden City, and we ask it to help us generate video copy. As you can see, it generated introductory copy based on what I just described, and we will make simple adjustments to the content of the copy.

We then choose a template that fits the scene, and you can see that the title on the template is automatically updated according to the copy we generated.

What I bring you is the cultural and creative product jointly created by SenseTime and the Forbidden City, which lets you immerse yourself in this magnificent palace in a world where the virtual and the real merge, and truly feel the power of traditional culture. With our large model, we can solve some of the pain points of everyday video creation: no more racking our brains over copy, and no more struggling to find materials.

Let's show you how to create more easily with our large model. Nowadays many parents and children take part in cross-border or cross-cultural exchanges. For example, our children's school is doing a cultural exchange with a school in an Arab country, and they want to share their cultures with each other. I think the Silk Road is a particularly good topic, so let's generate a piece about the Silk Road. Let's have our large model help us generate material for a video about the Silk Road.

Okay, you can see a paragraph on the left that introduces the Silk Road, and based on my content it also generated a background image that fits the scene. I think this digital human could be a little more formal, so let's change to a formal outfit. Then give me a digital human who is more like a storyteller, and let's take a look at the video Ruying synthesized for us. I'll start playing it.

The Silk Road is an ancient trade route connecting the East and the West. Along this road, East and West promoted the continuous integration of different civilizations through trade and cultural exchange. In history, Zhang Qian's mission as an envoy opened the earliest Silk Road. Since then, Silk Road merchants traveled through deserts and mountains again and again, braving wind and sand. Chinese silk, porcelain, and tea, along with Indian Buddhism and Greek philosophy, were passed on and developed along this road.

For international communication, we asked whether the large model could generate an English version. The model generated a fairly idiomatic English version and, in the process, also helped me choose an English voice. Let's take a look at this more international version of the video. I can manage a little English, but with Arabic I really can't, so I'm going to ask Ruying to introduce it in Arabic. You can see that our large model gave us an Arabic version; let's watch again.

As you can see, we have many rich templates here, including many different types of digital humans. We also do a lot of regular training and have a lot of other combined data, including 3D cartoons in various styles, with more coming later. That is the end of my demonstration, thank you.

I don't understand Arabic. Of course, the advantage is that it can indeed be used to generate videos for all kinds of workflows. We just saw that backgrounds can be generated as pictures, but if the background needs to be interactive in 3D, we have to use our 3D platform. Let's take a look at our 3D generation platform, which can restore such spaces.

As everyone knows, building a 100-square-kilometer urban scene by manual modeling is very time-consuming and laborious, with every building waiting its turn. With our Qiongyu system it takes two days, with high fidelity, highly restored scene detail, and centimeter-level realism. Seeing this, many people ask whether such a scene can support real interaction; we will show you later. Compared with traditional algorithms, we can restore more of the original realism. In this process you can look at the resolution of these buildings. It is not very clear on the big screen, but on our panel the difference is very clear, including the buildings in the lower right corner.

You can see the emissions of the building itself. This is our own office building, which we recently used for a presentation; it achieves real-time rendering and interaction for large scenes. Because our algorithm is efficient, we can integrate multiple kinds of data and capture ultra-fine details. For example, pay attention to the text on the wall: the text and the patterns are clearly depicted, so everyone can see how finely the detail is rendered.

With these, we can do real-time interaction, this is an interactive academy, is an Internet celebrity, you can see that this is a 3D printing comparison, and then you can look at the reflection on the wall, pay attention to this kind of lighting on the floor, in fact, if the whole is pure modeling, there is no such realism, or can be interactive, then we can do real-time editing in such a scene, here is a camera, on the position of the camera, so you can automatically learn such a mirror, Camera movement can also become part of the algorithm to complete the entire video.

The scenarios covered by our Qiongyu include not only digital twins of cities and parks, building design, and film and television creation, but also a series of application scenarios in cultural tourism and e-commerce. Let's take a look at a real scene. The scene changes between day and night. The information panels now pull out all the building details to complete a full overlay of the scene. The difficulty is to interact with and render multiple elements in the scene in real time.

Okay, you just saw our 3D generation of entire outdoor scenes. For indoor scenes and object generation, we also have a model. Traditionally, as anyone who knows 3D is aware, there are several difficulties: first, when modeling complex objects, the background often gets entangled with the object; second, glossy objects often cannot be reconstructed, and materials cannot be clearly distinguished. Our system increases overall scanning efficiency by 400%, reduces the original cost by about 95%, and has a good spatial effect.

For example, the one on the left is a pot of flowers. Everyone knows that the branches occlude and stick to each other, which makes it very difficult to build a good model. In the middle is an iron kettle with highlights; the kettle is old, and its luster looks different. On the right is the terracotta army, which is stone, and you can feel the texture of the original stone from the model alone. With these, we can have some new industry applications, such as spatial design: scan things and put them directly into our space, and then we can do home decoration, or embed them in film and television works. Variety shows, videos, or object placement can all be done. Pay attention to how the light changes on the surface of the object, which shows the material of the object itself.

For example, we can cut into a scene, and the person in charge will give an introduction. Embedding objects generated by our Gewu, making cuts, applying our camera-movement techniques, and so on brings editing into a new era of large models. It handles the complex geometry of the environment and realistic lighting effects, and it can be done in real time. If you want to restore a high-definition scene, we can reshoot film and television works in it, and interactive content can be reshot here too. Therefore, 3D assets in some existing film and television works can be produced with neural rendering methods. For example, plush objects, model houses, and e-commerce platforms; you can even point directly at objects in some places to interact with them, and the reflection in the water of the outdoor scene restores the real situation. This is our Qiongyu and Gewu platform.

In fact, the video on the right may be more realistic, the left side is actually enlarged, you can see some real light and shadow effects, and some changes can be fully displayed in front of everyone.

So combined with these content platforms we just had, in fact, we can move all of this to the live broadcast room, and there can be a variety of AIGC-related content products in a live broadcast room, such as our live broadcast room objects, and the scene can be digitally generated.

"Welcome all babies to enter the live broadcast room, follow the anchor not to get lost, and go on the link."

What can I exchange? The role of the person can be changed, right? Can be replaced with real people. In order to make it clearer, it can be seen that the locals should actually have a distribution that is not necessarily assigned. Choice of scene. Dynamic, static, 3D.

We have shared many such platforms. On our large device, in addition to our own self-developed Daily New large model system, we also have 7,000 GPUs serving more than 8 large external customers, who use them to train their own large models with more than 100 billion parameters.

In 1956, the term artificial intelligence was invented, in the same year as another term, the particle collider. Interestingly, the particle collider is today the most important infrastructure device in physics. So today we are making our AI computing power into a large device, and we hope it can become the most important infrastructure in the era of large AI models, empower our industry, and promote the arrival of AGI. Next, we have invited Dr. Wang Xiaogang, the chief scientist and co-founder of SenseTime, to introduce the five models of Daily New to our guests.

Wang Xiaogang, Chief Scientist of SenseTime, gave a speech

In the era of scenario-based models, a number of AI+ scenario applications were born. Each scenario has its own exclusive model, but the research and development cycle is relatively long and the cost is relatively high. With the advent of ChatGPT, general artificial intelligence has set off a new wave of technological revolution. It solves a large number of open-ended tasks in a more efficient way and brings new research paradigms. It is based on a very large, very powerful multimodal base model, and it keeps unlocking new capabilities of the base model through human feedback and reinforcement learning.

So what do we mean by general artificial intelligence? In existing AI systems, we can input multimodal data, and the output can be multiple tasks, but each of these tasks is predetermined. So when we encounter a new task, we have to redesign the AI system, collect new samples, and retrain the model.

Under an AGI system, the input is a prompt, and the output is multimodal data together with various tasks described in natural language. Because we don't need to change the AGI base model and only need to choose a suitable prompt, it can cover a very wide range of open-ended tasks.

This includes a large number of long-tail problems, so it is of great significance for promoting artificial intelligence at scale across a broader range of uses. Let's take the autonomous driving scenario as an example. Given a picture, we want to ask whether an autonomous vehicle should slow down. In the existing AI system, we first need to detect objects and obtain their bounding boxes, then carry out text recognition, and finally make the decision.

In this system, each module in the pipeline is a predetermined task. But in a general artificial intelligence system, given an image, we can ask the system any question, such as what this sign means and what we should do.

The AGI model can then give us the answer along with its reasoning process. For example, it sees that the road sign shows a speed limit of 30 kilometers per hour, that there is a school 100 meters ahead, and that there may be children, so we need to be careful and reduce our speed to below 30 kilometers per hour.
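The contrast can be made concrete with a small sketch. It is only an illustration: the detector and text recognizer are replaced by hand-written sample data, and all names (`Detection`, `decide_speed_kmh`, `build_prompt`) are hypothetical rather than taken from any SenseTime system. The fixed pipeline only handles the sign types someone wrote a stage for, while the prompt-style interface states the task in natural language.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str  # object class from a (hypothetical) detector
    text: str   # text read inside the bounding box, "" if none


def decide_speed_kmh(detections, current_kmh):
    """Last stage of a fixed pipeline: only speed-limit signs are handled."""
    target = current_kmh
    for det in detections:
        if det.label == "speed_limit_sign" and det.text.isdigit():
            target = min(target, float(det.text))
    return target


def build_prompt(question):
    """Prompt-style interface: the task itself is given in natural language."""
    return f"<image>\nQuestion: {question}\nAnswer with your reasoning."


# Sample scene matching the talk's example: a 30 km/h sign and a school sign.
scene = [Detection("speed_limit_sign", "30"), Detection("school_sign", "")]
print(decide_speed_kmh(scene, current_kmh=50.0))
print(build_prompt("What does this sign mean, and what should we do?"))
```

In this sketch the school sign is silently ignored by the pipeline because no stage was written for it; covering it would mean adding code and data, which is the redesign cost the talk describes.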

So in the AGI system, what we want to achieve is interaction between the model and people, which we call human-machine co-governance. In existing AI systems, we have realized the data flywheel: we obtain a large amount of data from the front end and terminals and annotate it to update our model, and the model then feeds back to the front end to obtain higher-quality data.

So in an AGI system, how do people interact with models? People help align the model's output with human intentions, and they give the model better guidance, including some methodology, to help it unlock more skills.

In turn, the model is able to output higher-quality content. We call this process the flywheel of intelligence, and the emergence of human-machine co-governance greatly accelerates the progress of science, technology, and culture in human society.

So what do we mean by a good large model? It is like a very talented athlete. As the coach, humans do not need to demonstrate every move one at a time; we only need to provide some methodology and give guidance in key places. Our very talented athlete is then able to complete new movements and create some things himself, countering each move as it comes and winning without set moves. Therefore, in actual practice, he can keep solving new problems and challenges.

There are also large models with less natural talent. Such a model can still reach a certain level through hard work, but this requires the coach to demonstrate every move one by one, and may even require a talented model to show it every move. That means we need to collect more training data. It can reach a certain level, but it may not be able to bring us more surprises.

But if we use fixed benchmarks to evaluate these two models, we may not see such differences. So developing an excellent large model requires very rich scenarios and very open tasks to test whether the model is really good. At the same time, we also need very rich data and tasks as input to the large model to complete its training.

SenseTime, as an artificial intelligence platform company, has many rich industrial application scenarios and empowers hundreds of industries, and this is where our strengths show.

We have been working intensively on our large models for the past 5 years. In addition to the very strong artificial intelligence infrastructure we just mentioned, SenseTime has also built full-stack large model research and development capabilities.

First of all, we have done a lot of optimization for the underlying training of our large models. These models are very large and do not fit on a single GPU, so we need many kinds of distributed training optimization, including data parallel and model parallel optimization, GPU memory optimization, mixed precision optimization, and so on.
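A rough back-of-the-envelope calculation shows why a model at this scale cannot fit on one card. The sketch below assumes mixed-precision training with the Adam optimizer, for which a commonly used estimate is 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, and 12 for fp32 master weights, momentum, and variance), ignoring activations. The 80 GiB card size is an assumption for illustration; the talk does not name the GPU model.

```python
import math

GIB = 1024 ** 3


def training_memory_gib(num_params, bytes_per_param=16):
    """Weights + gradients + Adam states under mixed precision, in GiB.

    bytes_per_param=16 is the common estimate described above; activations
    and temporary buffers are not included.
    """
    return num_params * bytes_per_param / GIB


def min_gpus(num_params, gpu_mem_gib, bytes_per_param=16):
    """Lower bound on GPUs needed just to hold the model states."""
    return math.ceil(training_memory_gib(num_params, bytes_per_param) / gpu_mem_gib)


params = 100e9  # a 100-billion-parameter model, as mentioned in the talk
print(f"fp16 weights only: {training_memory_gib(params, 2):.0f} GiB")
print(f"full training state: {training_memory_gib(params):.0f} GiB")
print(f"80 GiB cards needed (assumed size): {min_gpus(params, 80)}")
```

Even the fp16 weights alone exceed a single 80 GiB card under these assumptions, which is why the model states have to be split across devices (model parallelism) and kept compact (memory and mixed-precision optimization).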

So on top of that, we have a series of such optimization techniques for our super-large models. This covers the design of the model, the training of the model, the optimization of the model, and the service of the model.

An excellent, high-quality large model is inseparable from data: you need high-quality data that covers rich scenarios. We have also defined many such tasks together with our customers during past deployments. Recently, we also contributed to the community one of the largest multimodal datasets for realistic perception and reconstruction, called OmniObject 3D.

This dataset includes 190 categories, more than 6,000 objects, and a large number of scans of real objects. The quality of the data is also very high: each object has 5 surround videos, shot with different trajectories and lighting conditions, all of them full-view HD videos. It also supports multiple tasks, including neural rendering, surface reconstruction, 3D generation, point cloud recognition, and more. The AIGC work we just showed you, including rendering, reconstruction of large scenes, and reconstruction of objects, is inseparable from the support of such high-quality data.

This dataset also contains rendered multi-view pictures, real videos, 3D point clouds, meshes, and textures, which together form multimodal data. This work was also selected as the best paper by CVPR this year, standing out from nearly 10,000 submissions.

Today we are releasing SenseTime's new large model. In fact, we have been evolving in this direction for the past 5 years. In 2019, we had our first large model at the 1-billion-parameter level, for the field of human faces. In 2022, we had a vision model with 32 billion parameters, which is also the largest one to date.

The natural language large model capabilities we showed at today's conference are based on a model with 100 billion parameters. Not long ago we also released Digital Intelligence 2.5 to the community, which is a multimodal model with 3 billion parameters.

At the end of last year, we already had an AIGC model with 1 billion parameters that could support text-to-image generation. The accumulation across all these aspects and modalities, and their convergence, pushes us to train a more powerful multimodal large model.

In addition, we have a decision-making intelligence model. In StarCraft, our decision-making intelligence model surpassed DeepMind's AlphaStar and also defeated the champion of Greater China, and it has been deployed in the fields of autonomous driving, energy, and finance. In the future, this will also be further integrated into our multimodal large model. Therefore, SenseTime's future general artificial intelligence large model system includes visual perception, language understanding, content generation, and decision reasoning.

Our large models already cover the company's core business: more than 20 scenarios across our 4 major sectors, smart city, smart business, smart car, and smart life, have achieved solid deployment. From the richness of these scenarios and the diversity of our data and tasks, you can see the powerful capabilities and future potential of our large model system.

Next, let's take autonomous driving as an example to see the value that large models bring us. In 2021, building on our perception model, we developed a perception algorithm that won the championship in the challenge by an absolute margin.

This year we have a new work, UniAD, which integrates perception and decision-making in one model with end-to-end optimization. It also won the best paper of CVPR, and we hope to keep advancing our autonomous driving through our multimodal large model in the future.

Here we can see our BEV algorithm from 2021. It takes surround-view cameras as input and uses our Transformer to map the perception data from these multiple cameras directly to the final result. It has had a very good influence in the industry and is in the process of reaching mass production.

In our V2 version we made an upgrade: the model architecture was upgraded to our more powerful 2.5 architecture, it achieves better alignment in the time domain, and it took first place on the relevant leaderboard.

Our UniAD work is the first end-to-end autonomous driving solution that integrates perception and decision-making. Again, surround-view images are mapped to BEV features through our Transformer. At the same time we track targets, build a map online, predict target trajectories, and predict obstacles, and from these we finally produce the driving behavior.

Because we can optimize end to end, we clearly surpass SOTA on various key indicators. For example, compared with SOTA, our multi-target tracking accuracy is 20% higher, our lane line prediction accuracy is 30% higher, our motion prediction error is 38% lower, and our planning error is 28% lower.
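As a small worked example of what these percentages mean, the sketch below converts relative changes into absolute values. It assumes the figures are relative changes against the previous state of the art rather than percentage points, which the talk does not specify, and the baseline values are made up for illustration.

```python
def apply_relative_reduction(baseline_error, reduction_pct):
    """Error that remains after a relative reduction of reduction_pct percent."""
    return baseline_error * (1 - reduction_pct / 100)

def apply_relative_gain(baseline_score, gain_pct):
    """Score reached after a relative gain of gain_pct percent."""
    return baseline_score * (1 + gain_pct / 100)

# Hypothetical baselines (not reported in the talk).
motion_error = apply_relative_reduction(1.00, 38)    # 38% lower motion error
planning_error = apply_relative_reduction(1.00, 28)  # 28% lower planning error
tracking_score = apply_relative_gain(0.50, 20)       # 20% higher tracking accuracy
```

Under these assumptions, a baseline error of 1.00 drops to 0.62 for motion prediction and to 0.72 for planning, while a tracking score of 0.50 rises to 0.60.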

We still have a lot of potential to keep advancing our autonomous driving technology with multimodal large models. For example, we can use AIGC to generate a large number of difficult images, and feed the surround-view perception data and other multimodal data into our large model to integrate perception and decision-making. On the output side, an environment decoder reconstructs the 3D environment, and a behavior decoder predicts the path plan and explains the motivation behind the driving decisions. Driven by large models, I hope that future autonomous driving systems will be safer, more reliable, more explainable, and closer to human driving behavior.

It is also thanks to our large model that we can close the data loop between perception and decision-making, because we can obtain a large amount of data from the vehicle side. Annotating this data manually would be very inefficient. With our large model, the data can be annotated automatically and fed back to improve the vehicle-side model, making it more powerful.
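The data loop described here can be sketched as follows. This is a minimal illustration, assuming the large model returns a label together with a confidence score and that low-confidence samples are routed to manual review. The threshold, the `toy_large_model` stand-in, and all function names are hypothetical and not part of SenseTime's system.

```python
from typing import Callable, List, Tuple

Prediction = Tuple[str, float]  # (label, confidence)

def auto_annotate(
    samples: List[str],
    model: Callable[[str], Prediction],
    threshold: float = 0.8,
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split samples into auto-labeled training data and a manual-review queue."""
    labeled, review = [], []
    for sample in samples:
        label, confidence = model(sample)
        if confidence >= threshold:
            # Confident prediction: accept the label as new training data.
            labeled.append((sample, label))
        else:
            # Uncertain prediction: leave it for a human annotator.
            review.append(sample)
    return labeled, review

def toy_large_model(sample: str) -> Prediction:
    """Stand-in for the large model: guesses a label from the sample name."""
    if "car" in sample:
        return ("vehicle", 0.95)
    if "person" in sample:
        return ("pedestrian", 0.90)
    return ("unknown", 0.30)

# Frames collected from the vehicle side (made-up identifiers).
frames = ["car_001", "person_002", "blurry_003"]
labeled, review = auto_annotate(frames, toy_large_model)
```

The `labeled` pairs would then be used to retrain the smaller vehicle-side model, and the newly deployed model collects further data, which closes the loop.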

Thanks to this new paradigm of artificial intelligence, our data labeling service has also improved greatly. In the AI 1.0 era, we relied on manual annotation, which was slow and expensive. Now we can annotate automatically with our large model, which cuts our costs by a factor of hundreds and allows rapid iterative optimization.

Let's look at the perception capability of our model. Among all open-source models, ours is the only one that exceeds 90% accuracy on the ImageNet classification task. On the well-known COCO dataset, it is also the only one to break 65 on the detection task.

This model has achieved the best results on more than 20 public datasets covering different scenarios, different tasks, and both single-modal and cross-modal settings. Here are some examples. Across these tasks and datasets, the green shows the best previous result on each dataset, whether from a generalist large model or from a model built for that single task, and the red shows what Scholar 2.5 achieves.

Based on our perception model, we offer SenseTime's Bright Eyes, an automatic data annotation service. If you visit our website, you will see 12 models, including general-purpose models and specialized models for professional fields. You choose a model, for example structured detection, and upload an image. Based on our large model, the service labels the data automatically: it detects targets and recognizes some of their attributes, and the information about the data is displayed at the bottom and on the right. Here is a more challenging case with dense vehicles and pedestrians, including some very fine-grained targets, which we can also detect accurately.

We can also see 3D object detection, as well as general target detection. In general-class detection, a single model can detect more than 900 categories, as its output shows.

You can also visit our API website. API access is now open, so you can apply for an API key and try our automatic data annotation service.

To sum up, the service covers more than 1,000 different target categories across 2D and 3D, with more than 10 industry-specific large models, and we are constantly adding new models and new annotation categories. You can also take images generated with Miaohua and annotate them in more detail, which forms a closed loop with a steady stream of new data.

Based on our new large model system, we have opened up our APIs. They mainly include the natural language generation API, the image generation API, and the general visual perception and annotation API.

For example, the image generation API supports text-to-image and image-to-image generation at high speed, with output up to 6K high-definition images. You saw images in different styles just now, and users can also use the API for self-service training according to their needs.

The natural language generation service supports multi-round dialogue in Chinese and the understanding of ultra-long text, and it can keep learning and evolving.

Our annotation service supports 2D and 3D visual perception tasks, greatly improving efficiency and reducing costs.

Finally, we look forward to the arrival of a new technological revolution in artificial intelligence. Its impact will be extremely far-reaching, and it will drive SenseTime to keep investing in infrastructure and to reshape our entire R&D system.

We also look forward to working with our customers, our ecosystem partners, and more aspiring young people to enter the era of general artificial intelligence together with SenseTime."

SenseTime's large model is named "Daily New", which reflects our belief in pushing ourselves to keep breaking through and innovating. Our large model also evolves at a new pace every day. In the days to come, we will continue to present new work to you. Please look forward to it. Thank you.

This article is reproduced from the Computer Renaissance public account, for exchange and learning only.