
ZOMI-chan: From an art student to a large-scale model training expert

Technical Forrest Gump is on the run.

Author |

Editor|Wang Bo

In Siberia in January, the wind is as cold as a knife.

ZOMI-chan stood at the edge of an ice hole on Lake Baikal, took a deep breath of the cold air to fill his chest with courage, and then dived into the ancient water.

Underwater, only the sounds of heartbeat and breathing remained. The silence and cold made it feel like another world, and sunlight filtering through the ice dyed the lake a dreamy blue – a color unique to Baikal.

"It felt like an out-of-body experience. ZOMI-chan said to Koko Lightyear, "It's like my life, every choice I make is like an ice dive, and every ice dive is a metamorphosis." ”


ZOMI-chan (first from right) at Lake Baikal. Image source: provided by the interviewee

ZOMI-chan lives at the intersection of code and art. He began studying art at the age of four and trained as an art student, then switched, almost by accident, to computer science when filling out his college entrance exam choices. Today he is an expert in Ascend large model training and an AI popular-science video UP master on Bilibili, with a rather "two-dimensional" handle: ZOMI-chan. "I chose it because it sounds like my real name, even if it does sound a bit anime, haha," he explained.

Just three days after Sora's release, he produced a video based on the 32 papers behind Sora titled "The Most Detailed on the Whole Network: Analysis of the Principles of the Sora Video Generation Model," which sparked heated discussion in the industry. The video went up at 3:42 a.m. on February 20. One netizen commented below it: "Don't you sleep?"

"I went home after working overtime in the evening, and then I started preparing for classes and making videos. Every word of ZOMI-chan's answer reads "roll".


ZOMI-chan's Bilibili homepage. Image source: Bilibili

ZOMI-chan always seems to be seeking a balance amid cross-domain contradictions, and to be transformed within that balance. In his own words, "take things as they come."

From AI chips, AI compilers, and AI frameworks to large models, his videos cover nearly every layer of AI. Because his explanations of AI systems are both professional and funny, and his PPT design and animations are quite polished, some fans call him "the little godfather of AI." To such praise, ZOMI-chan says, "I really don't dare accept that. Joking aside, everyone can take something away from my videos."

35-year-old ZOMI-chan has a strong "sense of youth", but he is actually an "AI veteran".

He plunged headlong into AI as soon as he entered graduate school. Later, while working at a smartphone manufacturer, ZOMI-chan tried to push an on-device AI situational-awareness project but got no support from the company. His persistence even earned him the lowest rating, a "D", in the year-end evaluation.

ZOMI-chan describes that experience as making him "question life." Somewhat ironically, the idea he insisted on a few years ago but which was rejected is now an important strategic direction for that smartphone maker.

"Sometimes ideas that are too far ahead can be discouraged. ZOMI-CHAN SAID.

After a few months of depression, ZOMI-chan regained his bearings. He joined Huawei in June 2019, initially responsible for developing AI inference engines. At the time the technology was still a novelty, and there were almost no domestic AI frameworks other than PaddlePaddle. After finishing the inference engine work, he devoted himself to the development of the AI framework MindSpore.

"In the past, I mainly worked on algorithms, but after I came to Huawei, I switched to AI Infra (artificial intelligence infrastructure). In the past five or six years, I have basically explored the entire field of AI Infra. ZOMI-CHAN SIGHED, "THERE HAVE BEEN UPS AND DOWNS ALONG THE WAY. Now that I've figured it out, I just have to move forward, but do good deeds, don't ask about the future. ”

This state of mind resembles his icebreaker ride across Lake Baikal: as the fog cleared, the waterway opened up, the shore receded farther and farther, and his worries faded away like a cell phone signal.

1. "I don't sell classes"

Jiazi Lightyear: Tell us, why did you want to become an UP master making AI popular-science videos?

ZOMI-chan: Haha, it's definitely not that my workload isn't saturated. I sometimes work very late. I used to unwind by watching short videos, and then I'd feel "so guilty," so I wanted to do something more productive – to put it bluntly, I wanted to keep "rolling" (grinding) myself.

My own work is related to AI Infra, so I drew up an outline and started working through it topic by topic. The cold start for hardcore tech content is quite difficult; after all, the audience isn't large. Making videos is also a process of self-improvement, and I didn't expect to accidentally win everyone's affection.

I've been sharing my knowledge about AI Infra, from AI frameworks, AI compilers, to large models. It's all out of my passion for technology, and I don't deliberately pursue the occasional hot spot.


ZOMI-chan's popular-science video outline. Image source: provided by the interviewee

Jiazi Lightyear: Large model expert and UP master, how do you balance these two roles?

ZOMI-chan: I don't think of myself as a professional UP master; I just occasionally share some technical content. I don't think I have a fandom either – most people follow me for the knowledge, not for me personally. Being an UP master feels as ordinary as drinking water; nothing special.

As for balancing the two roles, I try to keep work and personal hobbies separate: most of my time goes to work, and when I'm working I concentrate on work. It's tiring at times, but I will keep doing technical sharing. I hope to bring AI System knowledge to those who need it, and everyone is welcome to join the open-source project.

Jiazi Lightyear: Is there any gain from doing AI science popularization?

ZOMI-chan: Yes, I've made a lot of friends who are interested in AI. At first, the content I shared was more "hardcore", and I found that most of the people who followed me were college students who needed to learn related knowledge. For example, when I shared content about AI compilers, it happened to attract a lot of students who were taking the "Principles of Compilation" course, and my videos became their "extracurricular tutorial materials".

After Sora came along, I made a detailed analysis video about it, which introduced ZOMI-chan to many friends outside the tech circle and also motivated me to keep sharing.

Jiazi Lightyear: Do you have a team behind you?

ZOMI-chan: It's just me. Production time depends on how familiar I am with the topic. If I'm not familiar with it, it might take a week or two to go through the relevant literature and materials and organize them into a script, plus a few more hours to record and edit the video. If I am familiar with it, a single weekend is enough to prepare the PPT and make the video.

At the time, Li Yizhou was very popular, and someone invited me to sell courses with them. But I want long-term career goals, not short-term financial gain, so I won't sell classes.

2. "The data problem is a 'big mountain' in the field of video generation in China"

Jiazi Lightyear: What were you doing when Sora was released?

ZOMI-chan: It was still Chinese New Year. I woke up in the morning to find my WeChat Moments full of talk about Sora – it could generate a one-minute-long video, and the results were realistic and amazing. The indoor-surfing video in particular showed a scene I couldn't have imagined; it was quite shocking to me.

Jiazi Lightyear: We consulted some industry experts, and at first they didn't understand the technical principles behind Sora either. How did you manage to produce a Sora analysis video so quickly?

ZOMI-chan: I don't think I was especially fast. It's just that at the time I wondered why no one online had done a technical interpretation yet. OpenAI's official website didn't publish many technical details, so I mainly analyzed the official information and related literature in detail, and then pieced together the technical route Sora likely used.

Jiazi Lightyear: Why did you predict that the number of parameters for Sora was about 10B, and not higher?

ZOMI-chan: My understanding is relatively straightforward. Sora is a video generation model based on a large visual model. If we simplify the model and ignore the time dimension, we actually go back to the image model, like ViT (Vision Transformer).

At present, most image models with excellent results have no more than 10B parameters, usually in the 3B to 5B range. When the time dimension is added, the parameter count does rise, but according to the Scaling Law this increase is not exponential.

Given the practical feasibility of handling very large models and the evolutionary trend of GPT models, I predict that the number of parameters in Sora will grow gradually, i.e., from 2B/3B to 5B/10B, rather than a sudden big jump.
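For a sense of where such estimates come from, a Transformer block contributes roughly 12 × d_model² parameters (attention plus MLP, ignoring embeddings), so a quick back-of-the-envelope count shows how 3B- and 10B-scale backbones arise. The configurations in this sketch are hypothetical, not numbers cited in the interview:

```python
# Rough Transformer parameter count: params_per_layer ≈ 12 * d_model^2
# (attention + MLP, ignoring embeddings and norms). Configs are illustrative only.

def transformer_params(n_layers: int, d_model: int) -> float:
    return 12 * n_layers * d_model ** 2

configs = {
    "image-scale backbone": (40, 2560),   # hypothetical ViT-like config, ~3B
    "video-scale backbone": (56, 3840),   # hypothetical larger config, ~10B
}
for name, (layers, width) in configs.items():
    print(f"{name}: ~{transformer_params(layers, width) / 1e9:.1f}B parameters")
```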

Jiazi Lightyear: Compared with the two phenomenal products of Sora and ChatGPT, which one shocked you more?

ZOMI-chan: ChatGPT had the bigger impact on me – not only because of its technical breakthroughs, but also because it made the industry take large models seriously. With the release of ChatGPT, the distributed parallel computing we had been pushing was no longer a castle in the air, and both industry trends and product design changed.

Sora does prove that the Scaling Law route behind GPT is reliable, and that it helps AI handle not just language but also images and video. But to me it wasn't quite as exciting as ChatGPT.

Jiazi Lightyear: How far do you think Sora is from landing on the C-side (consumer) and B-side (business)?

ZOMI-chan: I've discussed this with an engineer friend at Google. For the C-side, we think Sora isn't far from opening up – perhaps within 3 to 5 months. Generation is currently slow and consumes a lot of computing resources, which isn't sustainable from a business perspective; once those problems are solved, opening Sora to ordinary users should be quick. Reaching broad commercial application and solving the more fundamental technical challenges may take longer. And for the B-side, especially in a professional field like film and television, it may take longer still.

Jiazi Lightyear: How has the launch of Sora influenced your team's technology research direction?

ZOMI-chan: Sora's launch really did influence our research direction. We began exploring in more depth the differences and connections between multimodal large models and traditional large language models. It wasn't only the technical level that changed; research methods and product goals were rethought as well, so that we now pay more attention to improving the computing power and efficiency of AI clusters to serve a wider range of application scenarios.

Jiazi Lightyear: Sora's framework is DiT (Diffusion + Transformer). The prototype of the diffusion model in deep learning comes from the paper by Sohl-Dickstein et al., and the Transformer comes from a Google team's paper. So what are OpenAI's original contributions to the Sora project?

ZOMI-chan: OpenAI made two main original contributions to the Sora project. One is their insistence on the Scaling Law; the other is video compression. In particular, the video compression technology solves the problem of converting long videos into easy-to-process spacetime patches, which will have a positive impact on how AI handles video processing and content creation in the future.
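To make "spacetime patches" concrete, the sketch below slices a compressed video latent into spatio-temporal tokens the way a patch-based Transformer would consume them. The shapes and patch sizes are invented for illustration and are not taken from OpenAI's technical report:

```python
import numpy as np

def to_spacetime_patches(latent, pt=2, ph=2, pw=2):
    """Cut a (T, H, W, C) video latent into flattened spacetime patch tokens."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # group the patch dimensions together
    return x.reshape(-1, pt * ph * pw * C)      # one row per spacetime patch

latent = np.random.randn(16, 32, 32, 8)         # stand-in for a video VAE's output
tokens = to_spacetime_patches(latent)
print(tokens.shape)                              # (2048, 64): 2048 tokens of 64 dims
```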

Jiazi Lightyear: After the emergence of Sora, what will happen to startups such as Runway and Pika?

ZOMI-chan: The emergence of Sora has pushed them to innovate faster, and at the same time it tests their adaptability and technical strength. They have to accelerate their technology and find strategies to compete with big companies like OpenAI.

Jiazi Lightyear: Who do you think is most likely to make a Chinese version of Sora?

ZOMI-chan: It's hard to say, but it's only a matter of time before China makes its own version of Sora. Key success factors include having top talent, enough data, and powerful computing power.

We have no shortage of talent, but we still fall short on data accumulation. Data silos and the lack of high-quality Chinese datasets are major challenges. In addition, effective use of computing power and the construction of computing infrastructure are also crucial. Whoever can lead in these three areas is most likely to succeed.

However, the problem of data is a "big mountain" faced by the domestic video generation field. The quality and availability of data directly affect the training effect and progress speed of the algorithm. Many teams are secretive about the source of the data, and there is a lack of open-source, high-quality datasets for the development of AI technology.

Jiazi Lightyear: Is the development path of the DiT technical route clear?

ZOMI-chan: There are still a lot of unclear points about this route at the moment. There have been attempts to replicate or recreate Sora-like models, but there are still many details that have not been addressed.

For example, there is no unified solution for how to effectively compress the raw video through a VAE encoder, or how to choose the best way to process video patches. And although the DiT model structure matters, there is still much to explore about how to accelerate sampling in the diffusion process.
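One common direction for the sampling-acceleration question is to denoise on a much shorter timestep schedule than the one used in training, as DDIM-style samplers do. The sketch below is purely conceptual: denoise_step is a placeholder for a trained DiT, and the numbers are illustrative rather than drawn from any Sora reproduction:

```python
import numpy as np

def denoise_step(x, t):
    # placeholder for one model-predicted denoising update from a trained DiT
    return x * 0.98

T_train = 1000                                   # timesteps used during training
num_sampling_steps = 50                          # accelerated, strided schedule
schedule = np.linspace(T_train - 1, 0, num_sampling_steps, dtype=int)

x = np.random.randn(2048, 64)                    # start from pure noise over patch tokens
for t in schedule:
    x = denoise_step(x, t)
print(x.shape)
```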

Jiazi Lightyear: Can Sora and Gemini be considered simulators of the physical world, and how feasible are they?

ZOMI-chan: To judge whether these projects can serve as simulators of the physical world, we first need to clarify what a physical simulator is. The physics engines we talk about today are based more on simulating intuitive physics than on rigorous scientific computation. AI models such as Gemini, V-JEPA, or Sora generate content by learning rules from data rather than from underlying physical laws. So for now these models are closer to data-driven generators than to true physical-world simulators.

Jiazi Lightyear: If you had to define a "physical world simulator", how would you define it?

ZOMI-chan: Based on my research in reinforcement learning, I think a physical-world simulator should be defined along the lines of Google Gemini's direction, following the reinforcement-learning framework. In this framework there is an environment and one or more agents: the agents interact with the environment, the environment gives feedback in the form of rewards and new states, and the model takes its next actions based on that feedback.

A true physical world simulator should be able to simulate this complex process of interaction, reflecting the causality of each of our activities in the real world and the different outcomes it may produce. It's more of a complex system of interactions than just a tool for generating physical world phenomena.
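That interaction loop maps directly onto the standard reinforcement-learning interface. A minimal sketch, assuming the Gymnasium library and a random policy in place of a learned agent (neither is mentioned in the interview):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")                    # the "environment"
obs, info = env.reset(seed=0)

for _ in range(200):
    action = env.action_space.sample()           # stand-in for a learned agent's policy
    obs, reward, terminated, truncated, info = env.step(action)  # feedback: reward + new state
    if terminated or truncated:
        obs, info = env.reset()                  # start a new episode
env.close()
```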

Just as the extended-range car is a transitional solution on the road to full electrification, Sora may not be the ultimate answer to a fully accurate simulation of the physical world, but it represents an important step towards that goal. While this metaphor isn't entirely accurate, it illustrates where Sora stands: a phase of innovation that allows us to feel the change in the field over time, even if the change is not a one-step process.

Jiazi Lightyear: Some people say that the appearance of Sora is a "GPT-3 moment", do you agree?

ZOMI-chan: I think it's the turning point from GPT-2 to GPT-3. Its results aren't yet at the level of GPT-4 or ChatGPT, but Sora does prove to us that the Scaling Law that carried GPT-1 to GPT-3 works – not only for text, but also for multimodal content such as images and video. I don't think Sora has completely disrupted the industry, though.

Jiazi Lightyear: Which is more difficult for domestic teams to catch up with ChatGPT or Sora?

ZOMI-chan: Catching up with ChatGPT is harder. Some people claim they can catch up with ChatGPT or Sora in a short time, but in fact, since ChatGPT's release we've watched companies at home and abroad try, and GPT-4-level results are not something you close the gap on in a day. By contrast, people are relatively tolerant of video content, so it may be comparatively easy to catch up there. The NLP field has clear evaluation criteria, while evaluating video content is more subjective.

3. "In the face of doubt, the best response is to speak with technology and products"

Jiazi Lightyear: Many people regard Scaling Law as a guideline, how do you understand it?

ZOMI-chan: We often assume the Scaling Law means the bigger the model, the better the results, but that's not quite right. Take ChatGPT's development: from GPT-2 to GPT-3, growing model parameters from billions to hundreds of billions did bring a leap in results, but it also ran into the so-called "grokking" phenomenon.

The core problem is that once a model grows past a certain size, if there isn't enough data to match the growth in parameters, the parameters inside the model can't be learned effectively. Compare Google's PaLM with GPT-3: PaLM has 540 billion parameters and GPT-3 has 175 billion, but more parameters does not automatically mean better results. So the Scaling Law isn't just about making the model bigger; it's about matching data, computing power, and the model to one another.
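One widely cited reference point for "data must keep up with parameters" comes from DeepMind's Chinchilla work (not from this interview): roughly 20 training tokens per parameter for compute-optimal training. A minimal sketch of what that implies for the two models mentioned above:

```python
# Chinchilla-style rule of thumb: ~20 training tokens per parameter.
def compute_optimal_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    return params * tokens_per_param

for name, params in {"GPT-3 (175B)": 175e9, "PaLM (540B)": 540e9}.items():
    tokens = compute_optimal_tokens(params)
    print(f"{name}: ~{tokens / 1e12:.1f}T tokens for compute-optimal training")
```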

Jiazi Lightyear: So some problems can't be solved just by piling up computing power.

ZOMI-chan: Right. The key to AI infrastructure is improving the utilization and stability of the whole computing cluster. Even with enormous computing power, if utilization is low that power goes to waste. Another challenge in large model training is interruption: if a single card fails, a node has a problem, or the network gets congested, the entire job can be brought down. So stable computing power means not just powerful hardware, but the stability and efficient utilization of the whole AI platform.
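One standard defense against such interruptions is periodic checkpointing, so a job resumes from its last saved state instead of restarting from scratch. A minimal single-process sketch, assuming PyTorch; the model, paths, and intervals are placeholders rather than details of any real Ascend or Huawei setup:

```python
import os
import torch

model = torch.nn.Linear(1024, 1024)                      # stand-in for a large model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
ckpt_path = "checkpoint.pt"

start_step = 0
if os.path.exists(ckpt_path):                            # resume after a failure
    state = torch.load(ckpt_path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_step = state["step"] + 1

for step in range(start_step, 10_000):
    loss = model(torch.randn(8, 1024)).pow(2).mean()     # stand-in for the real loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 500 == 0:                                  # periodic save bounds the lost work
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, ckpt_path)
```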

Network topology, software drivers, and monitoring and early warning of hardware modules are also key aspects to ensure stability. Each layer needs to ensure its own stability and maintainability, from the hardware to the software to the algorithm level.

Jiazi Lightyear: How big is the challenge of maintaining this stability?

ZOMI-chan: The challenge is actually quite large, involving every level from computer room construction, network layout, to hardware and software. For example, network congestion and device temperature control can cause interruptions in the training process. Virtually every part needs to be precisely controlled and monitored in order to prevent and respond quickly to possible problems.

Jiazi Lightyear: Are there any differences in the methods of different vendors in maintaining the stability of cluster computing power?

ZOMI-chan: There isn't much public information on what Nvidia and Huawei are doing in this regard, but it's safe to say that each vendor has its own strategy and considerations. Basically, everyone's goal is to improve the utilization rate of cluster computing power and ensure the stability of the training process, but the specific technical details and O&M strategies will be different. Each vendor adapts and optimizes based on its own technology stack and product features.

Jiazi Lightyear: How to calculate the computing power required for large model training?

ZOMI-chan: There is a specific formula. It was originally proposed by NVIDIA in the Megatron-LM paper and has since been adapted in practice. I've also shared a video on Bilibili about the computing power and memory consumption of large model training and inference. These calculations help us decide how big a model to train, how much computing power is needed, and how large the cluster should be. But even once the required computing power is worked out, that can't simply be equated with being able to train a high-quality model like OpenAI's.
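A common form of that estimate treats total training compute as roughly 6 × parameters × tokens (the approximation popularized alongside the Megatron-LM and scaling-law analyses), then divides by the cluster's sustained throughput. Every number in this sketch is a hypothetical placeholder, not a figure from the interview:

```python
def training_days(params, tokens, num_gpus,
                  flops_per_gpu=3e14, utilization=0.4):
    """Estimate wall-clock training time from the ~6*N*D FLOPs approximation."""
    total_flops = 6 * params * tokens
    sustained = num_gpus * flops_per_gpu * utilization   # FLOP/s the cluster really delivers
    return total_flops / sustained / 86_400              # seconds -> days

# Hypothetical example: a 70B-parameter model, 2T tokens, 1,024 accelerators
print(f"~{training_days(70e9, 2e12, 1024):.0f} days")
```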

Jiazi Lightyear: How high are the requirements for computing power for AI models, especially multimodal models?

ZOMI-chan: The demand for computing power for video generation models is indeed increasing, but it does not require AI clusters at the level of 10,000 cards. Text-to-video may require dedicated codecs or more CPU for processing, especially for video and image encoding and decoding. At the same time, video generation may also rely on large language models to enrich and sharpen the generated content, which could further increase the demand for computing power.

In short, the increase in computing power demand is not only reflected in GPUs/NPUs, but also in the role of CPUs and dedicated processors.


NVIDIA's latest GB200 Superchip integrates 2 Blackwell GPUs and 1 Grace CPU, image source: NVIDIA GTC

Jiazi Lightyear: How to solve the problem of shortage of computing resources in China?

ZOMI-chan: Mainly by starting from the computing power level itself. Domestic computing power, Huawei's in particular, has provided important support for easing the shortage. More efficient algorithms can also be explored at the model level to reduce computing power requirements, but improving and optimizing the computing power itself is the key to resolving this contradiction.


The whole process of Ascend AI, image source: Ascend

Jiazi Lightyear: What do you think of the U.S. chip export restrictions to China?

ZOMI-chan: Personally, I see it as good news, because the situation can accelerate the development of independent computing power and chip technology in China. There are concerns, of course, but this kind of competition and restriction actually creates a window of opportunity for domestic technology accumulation and for AI chip makers to grow, giving us more room to research and deepen the technology rather than relying entirely on imported computing solutions.

Jiazi Lightyear: Has the current domestic computing power ushered in spring?

ZOMI-chan: I think the spring of domestic computing power is on the way.

Jiazi Lightyear: Do you have anything to say to the voices that question domestic computing power?

ZOMI-chan: Nothing to say. In the face of doubt, the best response is to let the technology and products speak. Let practical applications and performance show the strength of domestic chips, and let the market and users judge.

Jiazi Lightyear: What are your main areas of focus in your current work?

ZOMI-chan: My work focuses on large model analysis, cluster linearity, and utilization improvement. It involves distributed algorithm optimization to keep the cluster running efficiently and to support the training and computing needs of large-scale models. By improving the cluster's computing power utilization and hardware-software synergy, we are striving to maximize the potential of the domestic computing platform.

Jiazi Lightyear: What do you expect to happen in 2024 in the field of AI large models?

ZOMI-chan: I feel that there will be two significant trends in 2024: one is the maturity and expansion of multimodality, and the other is the ongoing discussion and controversy about AGI and the world model. Advances in multimodality will not be limited to audiovisual media, but will also encompass more novel modal combinations, while the concepts of AGI and world models will lead to new research and papers that will further advance the frontiers of technology.

In addition, there is an opinion that we are gradually approaching the realization of the so-called GPT Zero, which is the stage where artificial intelligence is trained using artificial intelligence. This ability to iterate on oneself represents a leap toward greater automation and autonomy in the field of artificial intelligence, so it is also an important trend to keep an eye on.

4. "I'm Forrest Gump Techno"

Jiazi Lightyear: Regarding the large model, a recent interview with Zhu Xiaohu has attracted a lot of attention. He is not optimistic about China's large model companies, citing a lack of application scenarios and data. As a front-line expert in large model training, what do you think of this view?

ZOMI-chan: I think there are two main camps on the topic of large models: the technology camp and the market camp. Zhu Xiaohu's view is probably more market-oriented, while I'm on the technology side – that is, closer to Sam Altman's camp.

The development and research of large models is necessary because the mastery of core technologies is crucial. Although domestic companies have not yet been able to fully match the performance of top foreign large models, if we do not invest in research and development, we will never be able to catch up. This is somewhat similar to the situation of Lenovo and Huawei 20 years ago, where companies that stick to technology research and development can eventually accumulate valuable technology assets and market competitiveness.

Jiazi Lightyear: In the short term, the market camp may feel this doesn't pay off and isn't worth doing. But the technology camp are long-termists.

ZOMI-chan: Yes.

Jiazi Lightyear: You studied art from childhood. Was going into technology a deliberate choice?

ZOMI-chan: It was more of an accident. Since it was hard to find a job with an art degree, I chose to switch to computer science when applying to university. At the time computer science was a lukewarm major; the most popular was civil engineering. Later I found computers fascinating too, and I went all the way to a Ph.D., doing some research on automation.

Jiazi Lightyear: You said you want to keep "rolling" yourself, why?

ZOMI-chan: As soon as I stop, I feel anxious.

Jiazi Lightyear: Is it because technology is developing too fast?

ZOMI-chan: Not really. It's about both technology and life. To be honest, I don't have a clear picture of my future goals, and I'll never figure it out lying around scrolling on my phone.

I like "Forrest Gump" very much, and I have watched it more than ten times. In the movie, when Jenny leaves Forrest Gump, Forrest Gump doesn't know what to do, he keeps running. Later, more and more people followed him, until he found what he wanted to do, and he stopped running and went to do what he wanted to do.

I'm like that too – I feel like a technical Forrest Gump. I'm always on the road, learning different technologies, seeing different landscapes, and meeting different people. See, that's how I got to know "Jiazi Lightyear" too.


Image source: The movie "Forrest Gump"

Jiazi Lightyear: Running is hard. Do you have any ways to release the pressure?

ZOMI-chan: I like going to the mountains and the sea – hiking, surfing, snorkeling. Over New Year's this year I went to Lake Baikal and tried ice diving. The moment I jumped into the water, kept descending, and felt I could barely breathe, I felt reborn.

Jiazi Lightyear: If you had to describe yourself in three words, what would you use?

ZOMI-chan: This feels like a job interview, haha. Three words is too many; let me use two. The first is "youthful," and the second is "humble." Youthful means always keeping an optimistic, curious attitude toward technology and everything around it; humble means always keeping an open, learning attitude toward my field and my knowledge. That's me.

(Cover picture source: provided by the interviewee)
