
GPT-4 "beaten" live by a small on-device model, and SenseNova 5.0 fully benchmarks against GPT-4 Turbo

Jin Lei, reporting from SenseTime AIDC

QbitAI | WeChat official account QbitAI

What a spectacle: GPT-4 got "beaten up" in public, without even a chance to fight back:


Yes, this memorable scene took place during a live head-to-head match of Street Fighter.

And the two contestants weren't even in the same weight class:

  • Green character: controlled by GPT-4
  • Red character: controlled by a small on-device model

So where does this small but scrappy fighter come from?

No suspense: it is SenseChat Lite, the newly released on-device model in SenseTime's SenseNova ("RiRiXin") family.

In the Street Fighter match alone, the small model embodied the martial-arts maxim that "speed is the one thing no technique can defeat":

While GPT-4 was still deliberating its next move, SenseChat Lite's punches had already landed.

And that's not all: Xu Li, SenseTime's CEO, raised the difficulty on stage by disconnecting his phone from the network before running the next test.

For example, drafting an employee's one-week leave request while fully offline:


△ Real-time speed, recorded on site

(Xu Li joked: "That's too long a leave, request denied~")

You can also make a quick summary of long paragraphs of text:


△ Real-time speed, recorded on site

All of this is possible because SenseChat Lite reaches SOTA performance among models of the same size.

Punching above its weight, it also beats Llama 2-7B and even Llama 2-13B on a number of benchmarks.


On speed, SenseChat Lite uses a device-cloud "linked" MoE framework in which on-device inference handles up to 70% of the workload in some scenarios, driving inference costs down.

Concretely: against a human reading speed of about 20 words per second, SenseChat Lite reaches an inference speed of 18.3 words per second on mid-range phones.

On high-end flagship phones, that speed soars to 78.3 words per second!
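The article doesn't disclose how the device-cloud split is decided. Purely as a hypothetical illustration of the idea, a router of this kind might keep short, latency-sensitive requests on the device and send the rest to the cloud; the function name, threshold, and policy below are invented, not SenseTime's actual design:

```python
# Hypothetical device-cloud router. The name, threshold, and policy are
# illustrative assumptions, not SenseTime's actual implementation.
def route_request(prompt: str, device_budget_words: int = 512) -> str:
    """Keep short, latency-sensitive requests on the device;
    fall back to the cloud for long or heavy inputs."""
    if len(prompt.split()) > device_budget_words:
        return "cloud"
    return "device"

print(route_request("Summarize this paragraph for me."))  # → device
```

A real system would presumably weigh more signals (battery, connectivity, task type), but the shape of the decision is the same: route most everyday traffic on-device and reserve the cloud for the hard cases.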

Beyond text generation, Xu Li also demonstrated the multimodal capabilities of SenseTime's on-device model live.

In an image-expansion (outpainting) demo, SenseTime's on-device model started half a beat later than a competitor, yet finished expanding three different images before the competitor finished one:


The demonstrators even took a photo on the spot, shrank it dramatically, and then freely expanded it outward:


You have to admit, SenseTime isn't afraid of live demos.

Still, the on-device model was only one corner of the event.

On the foundation-model side, SenseTime gave its SenseNova family a major version upgrade, SenseNova 5.0, and positioned it at a new level:

Fully benchmarked against GPT-4 Turbo!

So how strong is SenseNova 5.0? Let's put it to the test~


Please, "mentally retarded"!

Ever since large models took off, Ruozhiba questions have been a popular yardstick for testing their logical ability, earning the nickname "the Ruozhiba benchmark".

("Mentally Handicapped" is derived from Baidu Tieba, a Chinese community full of absurd, bizarre, and unreasonable statements.) )

Not long ago, Ruozhiba even showed up in a serious AI paper as a surprisingly good source of Chinese training data, sparking heated discussion.

So when SenseChat 5.0, the text-dialogue model, meets Ruozhiba, what sparks will fly?

Logical reasoning: Ruozhiba

First question:

Why didn't my parents invite me to their wedding?

Unlike other AI models, SenseChat answers in a more personable first-person voice, and its response carries no filler: both the answer and the explanation are accurate, "You hadn't been born yet when they got married".

Second question:

An internet café lets you go online, so why doesn't Ruozhiba (the "Dumb Bar") make you dumb?

Again, SenseChat directly pointed out that "this is a joke question" and that "Ruozhiba is not an actual place".

Clearly, SenseChat 5.0 can handle Ruozhiba's off-the-wall logic that refuses to play by the rules.

Natural language: Dream of the Red Chamber

Beyond logical reasoning, for natural-language generation we can pit GPT-4 against SenseChat 5.0 on the 2022 gaokao (college entrance exam) essay prompt.


Judging from the results, GPT-4's essay still reads like an "AI template", while SenseChat 5.0's is rather poetic: its sentences are neatly parallel, and it even quotes the classics.

AI, it seems, can think divergently after all.

Math ability: Simplify the complex

Once again GPT-4 and SenseChat 5.0 share the stage, this time to test their math:

Mom made Yuanyuan a cup of coffee. After Yuanyuan drank half the cup, she topped it up with water; she drank half again and topped it up with water again, and finally drank it all. How much coffee and how much water did Yuanyuan drink in total?

This is a fairly easy question for a human, yet GPT-4 produced a serious-looking, careful derivation and still got it wrong.

The root cause is an incomplete chain-of-thought behind the model: when it hits an unusual problem, it slips up easily.
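The coffee puzzle itself is easy to verify mechanically. The short simulation below (written for this article, not taken from the demo) tracks the cup's contents through each drink-and-refill step; the answer comes out to exactly 1 cup of coffee and 1 cup of water:

```python
# Verify the coffee puzzle by simulating the cup's contents.
# (Our own check; not the model's actual output.)
cup = {"coffee": 1.0, "water": 0.0}    # start with a full cup of coffee
drunk = {"coffee": 0.0, "water": 0.0}  # running totals of what was drunk

def drink(fraction):
    """Drink the given fraction of whatever mixture is in the cup."""
    for liquid in cup:
        drunk[liquid] += cup[liquid] * fraction
        cup[liquid] *= 1 - fraction

drink(0.5); cup["water"] += 0.5   # drink half, top up with water
drink(0.5); cup["water"] += 0.5   # drink half again, top up again
drink(1.0)                        # drink everything

print(drunk)  # → {'coffee': 1.0, 'water': 1.0}
```

There is also a shortcut that skips the simulation: all the original coffee (1 cup) and all the added water (0.5 + 0.5 = 1 cup) end up drunk, since the cup finishes empty.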

In the following question about "eagle catches the chicks" (a Chinese children's game), GPT-4 apparently doesn't understand the rules, because its calculated answer is again wrong:


These hands-on impressions are backed up by the more direct evidence of leaderboard numbers, which reflect SenseChat 5.0's ability:

On conventional objective evaluations, it matches or surpasses GPT-4.


So how does SenseNova 5.0 pull this off? In short: data in one hand, compute in the other.

First, to break the data bottleneck, SenseTime trained on more than 10TB of tokens, achieving broad coverage of high-quality data and giving the model a baseline understanding of objective knowledge and the world.

SenseTime also synthesized hundreds of billions of tokens of chain-of-thought data, the key data-level investment this time, which activates the model's reasoning ability.

Second, at the compute layer, SenseTime co-optimizes algorithm design and compute infrastructure: the topological limits of the infrastructure shape the next generation of algorithms, and advances in algorithms in turn reshape how the infrastructure is built.

This joint iteration of algorithms and compute is the core capability of SenseTime's large AI device.

Overall, the highlights of SenseNova 5.0 can be summarized as follows:

  • Adopts an MoE architecture
  • Trained on more than 10TB of tokens, including a large amount of synthetic data
  • Inference context window of 200K
  • Knowledge, reasoning, math, and code fully benchmarked against GPT-4

In multimodality, SenseNova 5.0 also leads on a number of core metrics:


As usual, let's look at the multimodal generation results.

Even better at reading images

Give SenseChat 5.0 an extremely long image (646×130000 pixels), ask it to read it, and you get an overview of everything in it:


Or toss it a funny cat picture: from details like the party hat, the cake, and the words "happy birthday", it infers that the cat is celebrating its birthday.


More practically, upload a complex screenshot and SenseChat 5.0 accurately extracts and summarizes the key information, whereas GPT-4 makes a recognition error:


SenseMirage 5.0: a text-to-image showdown

For text-to-image generation, SenseNova's SenseMirage 5.0 went head-to-head with Midjourney, Stable Diffusion, and DALL·E 3.

On style, SenseMirage's output arguably comes closer to the "National Geographic" look requested in the prompt:


For portraits, it renders more intricate skin textures:


Even text can be embedded in images with unmistakable precision:


There's also an anthropomorphic model

This release also included a specialized model: the anthropomorphic (persona) large model.


In practice, it can already role-play film and TV characters, real celebrities, and characters from fictional worlds such as Genshin Impact, holding emotionally intelligent conversations with you.


Feature-wise, the SenseChat persona model supports character creation and customization, knowledge-base construction, long-term dialogue memory, and even group chats with three or more participants~

Building on these multimodal capabilities, another major member of the SenseTime model family, Little Raccoon, has been upgraded as well.

Office and programming just got easier

SenseTime's Little Raccoon currently comes in two flavors, Office Raccoon and Code Raccoon, which, as the names suggest, target office and programming scenarios, respectively.


With Office Raccoon, handling spreadsheets, documents, and even code files boils down to "drop the file in, ask a question".

Take procurement as an example: first upload supplier lists from different sources, then tell Office Raccoon:

Units, Unit Price, Remarks. Because the header information in different sheets is not consistent, you can merge similar header contents. Show the table results in the dialog box and generate a local download link, thank you.

Wait a few moments and the processed result appears.

In the left sidebar, Office Raccoon also shows the Python code behind the analysis, so every step is traceable.
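The article doesn't reproduce the generated code, but the header-merging step it describes can be sketched in pandas. The sheet contents and the synonym map below are invented for illustration:

```python
import pandas as pd

# Two supplier sheets whose headers disagree (invented example data)
sheet_a = pd.DataFrame({"Supplier": ["Acme"], "Unit Price": [10.0], "Units": ["pcs"]})
sheet_b = pd.DataFrame({"Vendor": ["Bolt Co"], "Price": [12.5], "Unit": ["pcs"]})

# Map synonymous headers onto one canonical set, then concatenate the rows
canonical = {"Vendor": "Supplier", "Price": "Unit Price", "Unit": "Units"}
merged = pd.concat([sheet_a, sheet_b.rename(columns=canonical)], ignore_index=True)

# The "local download link" in the demo would serve a file like this one
merged.to_csv("suppliers_merged.csv", index=False)
print(merged)
```

Running the analysis through code like this, rather than generating the table as free text, is what makes the result reproducible and auditable.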

We can also upload multiple documents such as inventory information and purchasing requirements at the same time:


Follow up with further requests, and Office Raccoon still completes them quickly.

Even when the data format is irregular, it spots the problem and fixes it on its own:


Numeric computation, of course, is no problem either: just ask.

Office Raccoon can also build visualizations from data files, even producing tricky heat maps directly:


In summary, Office Raccoon can process multiple files of different types (Excel, CSV, JSON, and so on) and is very strong at Chinese comprehension, mathematical computation, and data visualization. By working through a code interpreter, it also makes the model's output more accurate and controllable.

At the launch event, Office Raccoon also demonstrated analysis over complex databases.

Last week, Zhou Guanyu, China's first F1 driver, raced in the F1 Chinese Grand Prix. On stage, SenseTime fed Office Raccoon a database file holding a huge amount of data and asked it to analyze Zhou Guanyu and F1 race statistics on the spot.

For example: tallying Zhou Guanyu's race record, counting how many drivers have competed in F1 in total, and listing championship winners ranked by number of wins. These queries span large, complex tables with many dimensions (laps, podium finishes, and so on), and Office Raccoon returned completely correct answers.

On the programming side, Code Raccoon takes programmer productivity straight to "Pro Max".

Just install the extension in VS Code:


From then on, each stage of programming becomes a matter of typing a sentence in natural language.

For example, hand Code Raccoon the requirements document and say:

Help me write a detailed PRD document for WeChat QR code payment on the public cloud. Please follow the requirements of the "Product Requirements Document PRD Template" for PRD format and content, and the generated content is clear, complete and detailed.

Code Raccoon then begins the requirements analysis:


Code Raccoon can also draft the architecture design for you:


It can also write code from natural-language requirements, and with one click generate comments, test code, translations between languages, refactorings, or fixes:


Even the final software-testing stage can be handed to Code Raccoon~


All told, Code Raccoon takes over the repetitive, tedious parts of everyday programming.

Beyond the software itself, SenseTime also "packaged" Code Raccoon into a lightweight all-in-one appliance.

A single appliance supports a 100-person development team, at a cost of just 4.5 yuan per person per day.


The above is the main content of SenseTime's release.

Finally, let's step back and consider one bigger question.

SenseTime's path in large models

Looking across the entire launch event, the most immediate impression is comprehensiveness.

From the on-device model to the foundational SenseNova 5.0, this was a full-stack release and upgrade across cloud, edge, and device, covering nearly every mainstream AIGC category: language, knowledge, reasoning, math, code, and multimodality.

The second impression is competitiveness: it can go toe to toe.

Take SenseNova 5.0's overall strength: among all domestic large-model players, only a handful can credibly claim a comprehensive benchmark against GPT-4.

And the third is speed.

SenseTime's speed is not just the runtime speed of its on-device model; it is also, at the macro level, the pace of its own iteration. Stretch out the timeline and this speed becomes especially striking:

  • SenseNova 1.0 → 2.0: 3 months
  • SenseNova 2.0 → 4.0: 6 months
  • SenseNova 4.0 → 5.0: 3 months

That averages out to a major version upgrade almost every quarter, each with a substantial jump in overall capability.

So the next question is, why can SenseTime do this?

First, at the strategic level, there is the "large model + large device" approach SenseTime has always emphasized.

The large model refers to the SenseNova model system, which provides natural language processing, image generation, automated data annotation, custom model training, and other models and capabilities.

The large device refers to SenseTime's efficient, low-cost, large-scale next-generation AI infrastructure, built around the development, generation, and application of large AI models, with total compute of up to 12,000 petaFLOPS across more than 45,000 GPUs.

What the two share is foresight: neither is a product of the AIGC boom; both are forward-looking bets that date back several years.

Second, at the model level, SenseTime has developed its own reading of the industry's agreed-upon scaling laws, grounded in its own testing and practice.

Scaling laws say, roughly, that as data volume, parameter count, and training time grow, model performance keeps improving; brute force works miracles.

The law carries two hidden assumptions:

  • Predictability: performance can be accurately predicted across scales spanning 5–7 orders of magnitude
  • Order preservation: a performance advantage verified at small scale still holds at larger scale

Scaling laws can therefore guide the choice of model architecture and data recipe under limited R&D resources, letting the model learn efficiently.
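The article doesn't give SenseTime's own formulation, but a widely published form of the scaling law (the Chinchilla-style loss fit, quoted here as a reference point rather than SenseTime's equation) makes the two assumptions concrete:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here L is the training loss, N the parameter count, D the number of training tokens, and E, A, B, α, β are constants fitted on small-scale runs. "Predictability" means this fit keeps extrapolating accurately as N and D grow by orders of magnitude; "order preservation" means that if one architecture or data recipe fits a lower curve at small scale, it stays lower at large scale, which is what lets small cheap experiments pick the recipe for the big run.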

It is from this observation and practice that the "small but capable" on-device model was born.


SenseTime also has its own take on large-model capability as a three-tier architecture: KRE (Knowledge, Reasoning, Execution).


Xu Li gave an in-depth interpretation of this.

The first tier is Knowledge: the comprehensive infusion of world knowledge.

Today's large-model productivity tools solve problems almost entirely on this basis: they answer your questions using solutions to problems that others have already solved.

That is the basic skill. More advanced is new knowledge derived by reasoning on top of that foundation, which is the second tier of the architecture: Reasoning, a qualitative leap in rational thinking.

This tier is the core determinant of whether a model is truly smart, whether it can generalize from one case to the next.

On top sits Execution: interactively transforming the content of the world, that is, how the model acts on the real world (for now, embodied intelligence is the latent player at this tier).

The three tiers are distinct yet tightly coupled, and Xu Li offered a vivid analogy:

Knowledge-to-reasoning is like the cerebrum; reasoning-to-execution is like the cerebellum.

In SenseTime's view, this three-tier architecture describes what a large model should be capable of, and it is what guides SenseTime's construction of high-quality data.

So the last question: following KRE and the "large model + large device" route, how widely has the latest SenseNova been adopted in industry?

As the saying goes, practice is the sole criterion of truth; customer feedback may be the most honest answer.

On stage, Huawei, WPS, Xiaomi, China Literature, and Haitong Securities, spanning office software, entertainment, finance, and devices, shared the cost savings and efficiency gains the SenseNova model system has brought to their businesses.

All in all, with the technology, compute, methodology, and scenarios in place, SenseTime's next moves in the AIGC era are worth watching.

— END —

QbitAI, signed contributor on Toutiao
