
Interview with Jiang Daxin: Scaling Law is a necessary but not sufficient condition for the path to AGI

Author: Jiemian News

Jiemian News Reporter | Wu Yangyu

Jiemian News Editor | Song Jianan

There are now six unicorn companies in China's general-purpose large model field. Only one of them, with no public valuation and not even any financing news, has been counted among the "unicorns" by the market by default. That company is StepFun.

Amid the noisy, boiling "war of a hundred models", StepFun kept a near-silent low profile for a year, not surfacing until March this year.

The Step series of large models it released is a complete combination punch: Step-1, a 100-billion-parameter large language model; Step-1V, a 100-billion-parameter multimodal model; and a preview of Step-2, a trillion-parameter MoE (Mixture of Experts) large language model.

Upon release, Step-1V topped the March OpenCompass multimodal large model leaderboard, with Alibaba's Qwen-VL-Max and Google's Gemini Pro Vision in second and third place, and OpenAI's GPT-4V in fourth.

The Step-2 preview marks the first time a domestic large model startup has disclosed a trillion-parameter model. In theory, that parameter scale approaches the level of GPT-4 at its initial release.

Saying little, yet making a splash the moment it takes the stage: that is probably an apt description of StepFun's style.

The person setting that tone behind the scenes is Jiang Daxin. Before founding the company in 2023, Jiang Daxin was a global vice president at Microsoft and chief scientist of Microsoft's Software Technology Center Asia (STCA), where he led Microsoft's search engine Bing as well as the natural language understanding systems behind a series of Microsoft's flagship products, such as the intelligent voice assistant Cortana, Microsoft Azure, and Office 365.

His deep technical background makes him accustomed to rigorous, objective expression. On technical details he gives clear judgments: "If a large language model has trillions of parameters, the MoE architecture is almost a forced choice."

He believes in Scaling Law (the law of scale) and holds that, in the foreseeable future, there are at least the ten-trillion and hundred-trillion parameter scales still to climb. But he does not rule out the possibility that neuroscience will one day find a road to AGI (Artificial General Intelligence) outside of Scaling Law and multimodality.

Some of the company's partners describe him with the word "real". The description sounds contradictory yet consistent, because he will both declare that "our multimodal understanding is the best in China" and admit that he is "still catching up with the GPT-4 released last year". The logic behind both is the technological reality he knows and believes in.

Of his 16 years at Microsoft, Jiang Daxin says the most indestructible conviction he formed was an open mind and a "growth mindset", part of the culture of his former employer and the source of his state of mind when he decided to start a business.

He judged that "the previous generation of search has gone as far as it can". From the "Boosting Tree" to the rise of neural networks, from CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), and LSTM (Long Short-Term Memory network) to BERT (a natural language pre-training technique proposed by the Google team), Jiang Daxin applied each generation of technology to search, transforming it from "horse-drawn carriage" to "automobile".

It wasn't until ChatGPT was born in 2022 that he realized that this was a qualitative change from "running on the ground" to "flying in the sky".

So if you are not sure what to expect from this new startup, expect to see how it defines the next generation of search: few teams, after all, have more standing to speak on the traditional search engine era.

How will StepFun use large models to define the next generation of search, and how will it close the gap with GPT-4? Jiang Daxin gave his own answers.

The following is a transcript of the interview with Jiang Daxin (lightly edited by Jiemian News):

Technology is only a window, not a moat

Jiemian News: You are the first company in China to announce a trillion-parameter MoE model, after keeping a low profile until now. What perception do you want to establish in the industry?

Jiang Daxin: Last year people in China began to speak of a "war of a hundred models", but many companies released large models aimed at particular industries or application scenarios.

We believe the general-purpose large model will go further, along two dimensions. One is Scaling Law, from hundreds of billions to trillions of parameters, and even to tens of trillions; the other is multimodality, from a single modality toward unifying multimodal understanding and generation.

Along these two paths, the company released the preview of the trillion-parameter large language model Step-2 and the multimodal large model Step-1V. They represent two crucial judgments of the post-GPT-3.5 era: first, that models must get larger; second, that unifying multimodal understanding and generation is the necessary road to AGI.

Jiemian News: For example, MiniMax has just released abab 6.5, also a trillion-parameter model, and they have published its results on various open-source test sets. How do you view such comparisons?

Jiang Daxin: That's a very interesting question. At the World Governments Summit in Dubai some time ago, OpenAI CEO Sam Altman made a point that circulated widely, and I think he was right.

He said that GPT-5 will be stronger than GPT-4 in all dimensions. On its face the sentence means "my general capability has become stronger", but read the other way it also means: you, too, can look strong if you polish one particular dimension. By sacrificing some dimensions to augment another, that one dimension can beat GPT-4, just as a junior high school student trained as a welder or fitter can surpass a college student in that single skill. So chasing leaderboard scores is not particularly scientific, because the test questions are all public.

Some claims are outright misleading. Take "90% of GPT-4's all-round capability": it sounds impressive, but if GPT-4 scores 90 points on some ability and you achieve 90% of that, 81 points, it seems respectable. Looked at the other way, though, GPT-4's error rate is 10% while yours is 19%, nearly twice as high.
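To make that arithmetic concrete, here is a minimal illustration in Python; the scores are hypothetical, not real benchmark numbers:

```python
# Illustrative arithmetic for the "90% of GPT-4" claim above.
# All scores are hypothetical accuracy percentages, not real results.
gpt4_score = 90.0                         # GPT-4 scores 90 on some ability
claimed_ratio = 0.90                      # "90% of GPT-4's ability"
your_score = gpt4_score * claimed_ratio   # 81.0 -- sounds close

gpt4_error = 100.0 - gpt4_score           # 10% error rate
your_error = 100.0 - your_score           # 19% error rate

print(f"scores: {gpt4_score} vs {your_score}")            # 90.0 vs 81.0
print(f"errors: {gpt4_error}% vs {your_error}%")          # 10.0% vs 19.0%
print(f"relative error: {your_error / gpt4_error:.1f}x")  # 1.9x
```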

Jiemian News: Measured against GPT-4's general capabilities, what is your assessment of Step-2?

Jiang Daxin: GPT-4 is a moving target, and our current model is still in the final polishing stage. We hope that by the first half of this year, once polishing is complete, it will benchmark against the level GPT-4 had when it first came out last year.

Jiemian News: Are domestic general-purpose large models still collectively catching up to GPT-4 as it was at release?

Jiang Daxin: Yes. It is a relatively pragmatic goal. We don't claim at every turn to have surpassed GPT-4; there is no need to grab attention that way. You can always find some small dimension in which to surpass it, even in minutes, but what's the point?

Jiemian News: The MoE architecture adopted by Step-2 is also drawing attention. It offers faster response and better inference efficiency, but brings problems such as training stability and communication overhead. When and why did StepFun decide to adopt this architecture, and how did you overcome its inherent difficulties?

Jiang Daxin: If you want to expand model parameters to the trillion scale, MoE is almost mandatory. As in scientific research or engineering, many decisions come down to the best balance across dimensions, and MoE is the best choice under the trade-offs among performance, parameter count, training cost, and inference cost.

As for the many challenges it requires solving, I consider these part of OpenAI's core technology, and if we want to keep climbing, those problems will have to be solved sooner or later.

Our self-built data center is a huge advantage, because we control every hardware detail. Our systems team and our algorithms team carry out joint optimization starting from the hardware up.
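For readers unfamiliar with the architecture he refers to, here is a minimal sketch of the top-k routing idea behind MoE, written in PyTorch with made-up dimensions; it is an illustration of the general technique, not StepFun's or OpenAI's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-k Mixture-of-Experts feed-forward layer.

    Only k of n experts run per token, so total parameter count grows
    with n while per-token compute stays close to a single dense expert.
    """
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # the router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.gate(x)                    # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mix the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # dispatch tokens to experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(4, 512)
print(MoELayer()(x).shape)  # torch.Size([4, 512])
```

The training-stability and communication problems he mentions arise precisely from this routing step: in a real distributed system, tokens must be shuffled to whichever devices host their chosen experts.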

Jiemian News: In terms of business model, do you do toB (enterprise) and toC (consumer) together?

Jiang Daxin: No, our main force is still toC. On the toB side we don't take the typical deal-by-deal approach; we only pick a few large industries. For example, we have set up a joint venture with Jiemian Cailian Press, a subsidiary of Shanghai United Media Group: that company undertakes the business, and we provide the algorithms and models.

Jiemian News: How do you view the business prospects of C-end products? Many people think the C-end slips easily into price wars, so how will you establish a business with healthy cash flow?

Jiang Daxin: Although this round has brought great changes in technology, I don't think technology is a moat; technology only gives you a window period. During that window, you must build a moat for the company's products.

I don't think there is anything new here. Business models are built on people's needs, and needs haven't changed in all these years. Only the technology has changed; all you have to do is find a business model that makes the product sell.

Always remember that young people are better than you

Jiemian News: Some say you are the last large model company to emerge in China. What do you think?

Jiang Daxin: I don't think we are too late, and I don't think we are necessarily the last; another one may well pop up someday.

Jiemian News: Why did you choose to register the company in Shanghai?

Jiang Daxin: Shanghai has laid out a complete ecosystem for artificial intelligence, from chips to general-purpose large models to applications across industries, and the planning is very clear. The environment is also well suited to startups; districts like Xuhui and Binjiang are full of young companies.

Jiemian News: Across your 16 years at Microsoft, what are the most indestructible convictions and abilities you accumulated?

Jiang Daxin: An open mind, plus Microsoft's "growth mindset". It means not being confined by past assumptions, keeping an empty-cup mentality, and looking up at the stars.

Past knowledge may help you judge the value of something, but you should also listen to others, absorb their views selectively, and in the end not live in the past. You must always remember that young people are better than you; that is very, very true in our company, where the young people are the strongest.

Jiemian News: Among large model founders, there is Yang Zhilin with his "prodigy" label, Wang Xiaochuan with a successful entrepreneurial track record, and people like you who led important business lines at technology giants. What do you think this generation of industry leaders will ultimately compete on?

Jiang Daxin: Each individual has their own characteristics, and I think that benefits the enterprise. Each company will have its own unique cultural values and organizational genes, and its own path to success.

Jiemian News: What kinds of values or organizational genes might set a company apart or give it an advantage, and in what?

Jiang Daxin: It affects every aspect. It sounds empty when you say it, but it is very real, because it determines how things get done.

For example, after hearing our introduction, some corporate partners described us as "real". I asked, "Do you mean that as a compliment or a criticism (laughs)?" What they meant was that they had visited many companies, and nearly every one claimed to be better than GPT-4 in one way or another; only at our company did someone say we are still catching up with GPT-4 and admit that a gap remains.

Jiemian News: StepFun now has about 150 people. What has the team's growth curve looked like?

Jiang Daxin: We recruited quickly at the start, to train the initial version of the model. There was a relatively slow period in the middle, when the product was still being explored in small steps and headcount barely grew; when the product first started, there were probably 10 people. Later, in the second half of the year, we began expanding the product and engineering teams, and only then did we bubble up to the surface.

Jiemian News: Industry peers such as Baichuan Intelligence and MiniMax have roughly two or three hundred people. How do you think about talent density at large model companies?

Jiang Daxin: I agree that talent density matters in this field. Two things are involved. One is average talent density. The other is that a company's top people determine the ceiling of its large model; 100 ordinary people may not accomplish what 10 top people can. So we must have top talent across the three dimensions of systems, data, and algorithms.

Recently I've been very pleased that the algorithm team has grown somewhat. From GPT-3.5 to GPT-4, you need algorithmic capability on the one hand and systems that keep up on the other; and to explore the path I described, from unimodal to multimodal, you need many algorithm specialists from different fields.

I have now found top talent in every direction; that is the benefit of my "surfacing".

Jiemian News: Do you personally interview everyone who joins?

Jiang Daxin: I do for the team leads, and those aren't really interviews; they are genuinely chats over meals, and with some of them we have talked and eaten together several times.

Jiemian News: You seem quite cautious about releasing financing news, while other companies make a show of it. Why does StepFun say so little about financing?

Jiang Daxin: I don't think it's necessary. Our ultimate goal is to train the model well, so we know our own pace and manner of fundraising.

Jiemian News: But wouldn't a high financing round or valuation make it easier to establish a front-runner impression in the market?

Jiang Daxin: The advantage now is that there really is a group of investors who understand artificial intelligence very well. They know this is a long-term and rather expensive undertaking, and they are willing to trust our technical strength.

On the road to AGI, ten trillion is not the end

Jiemian News: How would you summarize the AGI you believe in?

Jiang Daxin: There is really no precise definition of AGI at present, and I don't want to give one. Speaking very loosely, achieving human-level intelligence is AGI. And I think the most important letter in it is the "G": general.

Jiemian News: You have charted a path for the company, "unimodal - multimodal - unified multimodal understanding and generation - world model - AGI". Can you estimate how long each stage will take?

Jiang Daxin: It's hard to predict. Before I saw ChatGPT, I would still have said that natural language processing, common sense and reasoning, needed another ten or twenty years, and then it was solved overnight. Some scientific breakthroughs may have been accumulating for a long time, but as with the "emergence" everyone talks about, you only see the sudden "bang" upward, a kind of jump. And on the way toward the goal, as long as that point hasn't been reached, you are at 0.

Jiemian News: What are the specific goals for next year?

Jiang Daxin: We are working toward GPT-4.5/5. One goal is to expand the scale to 10 trillion parameters; the other is to achieve a breakthrough next year in unifying multimodal understanding and generation, so the model can understand and generate at the same time.

We see the whole line clearly, including what stage we are at and what is certain, and the next step will definitely be to advance on the basis of that existing certainty.

Jiemian News: Beyond the trillion-parameter model, what is the hardest part of pushing Scaling Law further?

Jiang Daxin: The hardest part is not one thing but four: what we call computing power, systems, data, and algorithms.

Jiemian News: You strike me as a firm believer in Scaling Law. Does this law have an end point?

Jiang Daxin: That's a good question. In the future visible to the naked eye, I think there are at least two more orders of magnitude. We won't truly reach the trillion scale until around the middle of this year, so 10 trillion is a certainty; we will definitely climb to it.

Jiemian News: Does 10 trillion correspond to GPT-5?

Jiang Daxin: I don't know whether it will be GPT-4.5 or GPT-5; that depends on what OpenAI's next model turns out to be. GPT-4 actually finished training in October 2022 and was released in April last year, because it took another six months of polishing; that is similar to our current state.

Although OpenAI released Sora at the start of the year, we don't know whether its latest-generation large model will be called GPT-4.5 or GPT-5, nor what its parameter scale is. Guessing from its capabilities, its route, and the GPUs it uses, it will be at least at the 10-trillion level.

Jiemian News: Then the second order of magnitude would be 100 trillion.

Jiang Daxin: There is a reference point here: the human brain has about 200 trillion neuronal connections. But I don't think it's an especially meaningful reference, because there is no direct comparison between humans and machines right now; it may just give people a target.


Jiemian News: Besides this path, are there other technical routes that might allow someone to overtake on the curve?

Jiang Daxin: Besides the two routes just mentioned, there is a third path, deciphering how the human brain works: neuroscience.

The analogy between today's so-called neural networks and the brain's real neural networks is, I think, quite far-fetched: we liken the structure to neurons, dendrites, axons and so on, but the human brain probably doesn't work that way at all.

But recently there have been interesting findings suggesting that some laws of the human brain seem to be reflected in our latest large model designs. For example, human intelligence arises from the cerebral cortex, which is one simple structure repeated over and over, and that corresponds to a feature of the Transformer architecture. The same line of theory holds that human intelligence comes from modeling the world through what it calls reference frames, which seems to correspond logically to our current work of producing intelligence by taking in, compressing, and modeling the world.
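That structural echo is easy to see in code: a Transformer encoder is literally one simple block repeated N times. A loose illustration using standard PyTorch modules, with arbitrary dimensions:

```python
import torch.nn as nn

# A Transformer is one simple block stacked N times -- loosely echoing
# the cerebral cortex's repetition of a uniform structure.
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.TransformerEncoder(block, num_layers=24)  # 24 identical layers

# Parameter count scales linearly with the number of repeated blocks.
print(sum(p.numel() for p in model.parameters()))
```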

But from a physiological standpoint, what is the process by which human neurons produce electrical discharges and chemical neurotransmitters? We now burn enormous amounts of energy to train intelligence at the level of a junior high school student, while the human brain weighs only two or three jin (roughly 1 to 1.5 kilograms) and consumes next to nothing. There are many of nature's mysteries in this, and they can genuinely feed back into our work.

Either you come to understand it, or you draw inspiration from it to improve the large model. And maybe at some point the machine will be smart enough to read it and tell you, or to silently modify itself, which would be terrifying (laughs).

Jiemian News: You mentioned earlier that "the previous generation of search has gone as far as it can". Do you have a preliminary definition of search for this era? What shape is search likely to take in the future, and whom will it replace?

Jiang Daxin: That's a good question. I think it will evolve in stages.

First, it will improve the user's search experience. The previous generation of search engines forced users into the habit of asking questions with keywords only, because users knew that a natural language query would not be understood. With large models, queries can not only be posed in natural language but can also span multiple rounds of dialogue.

The large model will first read the first few dozen documents and web pages for you, then synthesize the information and supply the sources. The results are far better than before, a move from simple information retrieval to knowledge acquisition. So the first stage of search in the large model era solves the difficulty of asking questions and the efficiency of reading answers.

Search itself is not an end-to-end task; most people search in order to finish some job. So how search can integrate external search, local search, and knowledge of the working environment, and embed itself into an end-to-end workflow: that, I think, may be a direction for the future of search.
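A minimal sketch of the first-stage, aggregated-search pattern he describes; `web_search` and `llm_summarize` are hypothetical stand-ins for a real search API and a large model call:

```python
# Hypothetical stand-ins: swap in a real search API and model call.
def web_search(query: str, top_k: int) -> list[dict]:
    return [{"title": "Example page", "url": "https://example.com",
             "text": "..."} for _ in range(top_k)]

def llm_summarize(prompt: str) -> str:
    return "(model-written summary citing sources as [n])"

def aggregated_search(query: str, top_k: int = 20) -> str:
    """Large-model-era search: a conventional engine narrows a trillion
    pages to top_k results; the model reads only those and writes a
    summary that keeps the sources visible."""
    pages = web_search(query, top_k)
    context = "\n\n".join(
        f"[{i + 1}] {p['title']} ({p['url']})\n{p['text']}"
        for i, p in enumerate(pages)
    )
    prompt = ("Answer the question using only the sources below, "
              f"citing them as [n].\n\nQuestion: {query}\n\n"
              f"Sources:\n{context}")
    return llm_summarize(prompt)

print(aggregated_search("what is a mixture-of-experts model?"))
```

Note how the division of labor matches his point below: the engine does the trillion-to-twenty narrowing, and the model only reads the twenty pages it is handed.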

Jiemian News: Setting aside the fact that Google itself will evolve, is the current general-purpose large model in effect replacing the old Google?

Jiang Daxin: On the question of replacement, I think the deduction is very interesting.

Many people are asking what will become of search engine companies once large models arrive. Many AI products, including our own Yuewen, are aggregated search: built on top of existing search engines, they consolidate the top 10 or top 20 results into a single page.

So I think search engine companies like Google are here to stay for a long time. The pipeline that takes your query and feeds back the top 20 results out of a trillion web pages will always be needed.

As for which company provides it, I don't know, but the thing itself is not going away, because a large model cannot read all trillion web pages in a very short time; it can only work with the 20 pages you hand it.

In terms of business model, the situation is very tricky for search engine companies: follow the trend, and advertising revenue may be lost; don't follow it, and they watch users go elsewhere.

So one solution might be to show ads alongside the generated answers, and there is a way to test whether that business model holds: pay, and you see no ads; don't pay and, sorry, you watch the ads.

Jiemian News: Some time ago the industry argued a great deal about R&D, applications, commercialization, and so on. Having read so many of those views, which do you find closest to your own thinking?

Jiang Daxin: Actually, no one is wrong. Each is simply saying something different from their own standpoint; when the statements are artificially mashed together, the views look sharply opposed, but I don't think there is any real contradiction.

Jiemian News: As a founder, where do you lean between R&D and application?

Jiang Daxin: We have always said that the model and the application should be done by the same company.

Jiemian News: So you believe the dual-drive model holds?

Jiang Daxin: Yes. From day one, our company decided to do both the model and the application, because the model needs applications as traction and as a supplementary source of data. And when you work on a specific application, it must be deeply bound to the general model for the application to be pushed to the extreme.

The reverse also holds: I don't think a company that only builds applications can push them to the extreme without a model deeply bound to them.

Jiemian News: But that means enormous expenditure for a startup, and it will also test founders' ability to raise money in later stages.

Jiang Daxin: Agreed. In this round of large model entrepreneurship everyone draws the analogy to chips: it is a game of heavy bets played with open cards. There is no way around it.

Jiemian News: What are your predictions for the first and second echelons of the global large model industry?

Jiang Daxin: The first echelon today is OpenAI's GPT, Anthropic's Claude, and Google's Gemini, each with a major cloud behind it.

After a year of competition, unless OpenAI immediately puts out a GPT-4.5/5 that stuns everyone again, the three models are at roughly the same level; GPT-4 and Claude 3 certainly are, and judging by Gemini Pro, it is close.

I think there are only two in the second echelon: xAI's Grok and Meta's Llama. Both companies have plenty of money, GPUs, and talent, but they were slow, and all they can do now is stand ready to open-source at any time.

My judgment is that open source cannot catch up with closed source unless closed source hits a major setback, say GPT-4.5/5 failing to ship, and I consider that a low-probability event. Closed source will keep moving forward at no slow pace, at least to 10 trillion and even 100 trillion parameters, and it will continue at that rate.

Jiemian News: Benchmarked against the world's first and second echelons, where does the overall level of domestic large models stand?

Jiang Daxin: Second echelon. Domestic progress has been very fast, but a definite gap remains with GPT-4 and Claude 3. Beyond the models themselves there is also a gap in understanding; OpenAI's accumulation runs deep and long.

Jiemian News: What level will domestic large models reach next year?

Jiang Daxin: It depends on two factors. One is how fast they (the first echelon) run; the biggest variable now is what OpenAI's new model looks like, which will shape next year's landscape. The other is how many domestic companies can reach GPT-4 level by the end of this year.