
Why use foreign videos for domestic AI training?

Author: 酷玩实验室Coollabs

It has been more than a year since the wave of generative AI first swept in.

If there is a "crown jewel" among the models of this AI wave, it has to be the text-to-video model.

From a technical point of view, the core value of video models such as Sora and Vidu lies in the fact that they synthesize and create across media, forming a "grand unification" of modalities such as text, images, and video.

And this kind of "grand unification" may be the key on humanity's path to AGI.


Under this "grand unification" framework, data is no longer confined to a single modality, but is understood and used as a synthesis of multi-dimensional information.

As Turing Award winner Yann LeCun, one of the "Big Three" of AI, argued when proposing his "world model" theory, today's LLMs are trained only on text, so their understanding of the world is only superficial.

Even though LLMs can demonstrate excellent text comprehension thanks to enormous parameter counts and massive training data, in essence they only capture the statistical regularities of text; they do not really understand what the text means in the real world.


Yann LeCun, one of the "Big Three" of AI

If a model can instead use richer sensory signals, such as vision, to learn how the world works, it can understand reality more deeply, and perceive laws and phenomena that words alone cannot convey.

Seen this way, whoever first lets AI grasp the laws of real physics through multimodal world models may be the first to break through the limits of text and semantics and take a big step up the road to AGI.

That's why OpenAI is currently so focused on Sora.

Some time ago, the debut of Vidu did domestic video technology proud and let it stand tall next to an industry heavyweight like Sora. But even as people celebrated, a closer look at Vidu's demo videos revealed something curious: they are full of foreign faces.


That detail got people thinking; it inadvertently exposed a weak spot in our video data collection: the lack of high-quality data.

The data dilemma

If there are hard thresholds restricting the development of video generation models at this stage, they are none other than computing power, algorithms, and data.

The first two can, in fact, be solved with enough money and talent. Data is different: once you fall behind, catching up takes real effort; like height, a gap is hard to close once it opens.

Truth be told, although in absolute terms there is plenty of video content on the Chinese Internet, the high-quality data that can actually be used for AI training is not as rich as on the global Internet.


For example, in video object detection, the YouTube-VIS dataset contains 2,904 video sequences with more than 250,000 annotated object instances. Domestic video object detection datasets, such as Huawei's OTB-88, contain only 88 video sequences.

In behavior recognition, the internationally renowned HACS dataset contains 1.4 million video clips covering 200 categories of everyday human actions. By contrast, Alibaba Cloud's Tianchi behavior recognition dataset contains only 200,000 clips, though it too covers 200 action categories.


From the perspective of the video ecosystem, this gap exists mainly because China's mainstream video sites, such as iQiyi, Youku, and Tencent Video, mostly publish TV dramas, variety shows, and other entertainment content.

Meanwhile the highest-traffic short-video platforms, Douyin and Kuaishou, are full of funny skits and life hacks that are short to begin with, plus plenty of re-edited, re-uploaded, and plagiarized clips.

With that on the menu, it is genuinely hard for AI to find a "proper meal" to eat.


For video AI training, such videos are either too concentrated in a few genres and lack diverse everyday scenes, or too short and lacking in depth and coherent narrative, which makes it hard for AI to learn the continuity, story logic, and causality of long sequences.

In contrast, professionally produced content such as movies and documentaries is often exactly the high-quality data video AI requires.


Such material is rich in variety, long enough, and full of detail, which helps AI models capture differences in lighting changes and object materials, and thus improve the accuracy of what they generate.

When it comes to video data, we not only lack high-quality content but also face another headache, data labeling, which is a hard bone to gnaw. However good a video is, if you throw it at the AI raw, it cannot tell what the things in it are.

So after the video data is collected, someone has to patiently tell the AI, frame by frame: "See, this stream is moving traffic, and that two-legged figure is a pedestrian."
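What "telling the AI" looks like concretely is a structured record per frame. Here is a minimal sketch; the field names are purely illustrative (loosely in the spirit of COCO/YouTube-VIS-style labels), not the exact schema of any particular dataset or tool:

```python
# A minimal sketch of frame-level video annotation data.
# Field names are illustrative, not any specific tool's schema.
annotation = {
    "video_id": "street_0001",
    "frame_index": 42,
    "objects": [
        {
            "track_id": 7,                # same object identity across frames
            "category": "pedestrian",
            "bbox": [312, 180, 64, 128],  # x, y, width, height in pixels
        },
        {
            "track_id": 8,
            "category": "car",
            "bbox": [520, 210, 180, 90],
        },
    ],
}

# Thousands of such records, one per frame per video, are what an
# annotator ultimately has to produce or verify by hand.
print(annotation["objects"][0]["category"])  # -> "pedestrian"
```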


Getting through such hard, massive labeling work takes some tooling muscle. To improve annotation efficiency, a number of interactive video annotation tools have emerged abroad, such as CVAT and iMerit. These tools integrate algorithms such as automatic tracking and interpolation, which can greatly reduce the workload of manual annotation.
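The interpolation trick behind such tools is simple in principle: an annotator draws boxes on a few keyframes, and the tool fills in every frame in between. A minimal sketch of linear bounding-box interpolation (the principle only, not CVAT's actual implementation):

```python
def interpolate_bbox(kf_a, kf_b, frame):
    """Linearly interpolate a bounding box between two annotated keyframes.

    kf_a, kf_b: (frame_index, [x, y, w, h]) for the surrounding keyframes.
    frame:      the in-between frame to fill in automatically.
    """
    (fa, box_a), (fb, box_b) = kf_a, kf_b
    t = (frame - fa) / (fb - fa)  # 0.0 at kf_a, 1.0 at kf_b
    return [a + t * (b - a) for a, b in zip(box_a, box_b)]

# Annotator labels frames 10 and 20 by hand; frames 11-19 come for free.
box = interpolate_bbox((10, [100, 50, 40, 80]), (20, [200, 60, 40, 80]), 15)
print(box)  # -> [150.0, 55.0, 40.0, 80.0]
```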

In China, by contrast, automatic labeling tools are less widespread, so most of the work still relies on human-wave tactics: large annotation teams grinding through labels by hand, overtime included.

This way the labeling volume goes up, but a problem comes with it: this temporary army works without unified, objective standards and with inadequate training, judging right and wrong by personal feel. Uneven data quality becomes the norm; some parts are labeled well, others only so-so.


What makes it an even bigger headache is that the work is boring, tiring, and poorly paid; who would be willing to do it for long?

According to feedback from a number of video data annotation companies, most annotators earn 3,000-5,000 yuan a month, and the annual turnover rate in the domestic video annotation industry generally runs between 30% and 50%, reaching as high as 80% at some companies.

With personnel cycling through like a revolving door, companies have to keep recruiting and training new people, which directly undermines the quality and stability of data annotation.


Frankly, with total data volume, diversity, and the annotation pipeline all lagging behind the global Internet, how can domestic video AI get over the data hurdle if it wants to rise?

Synthetic data

If high-quality data is hard to find, is synthetic data a viable path, "feeding" AI with artificial material? Truth be told, people were doing this before Sora came out; NVIDIA's Omniverse Replicator, released in 2021, is one example.


To put it bluntly, Omniverse Replicator is a platform for synthesizing data, specializing in hyper-realistic 3D scenes. It is seriously impressive: in the video data it creates, every detail follows the laws of physics, as if plucked straight from the real world.

Who benefits most from this? Quite a few fields: autonomous driving, robot training, anywhere you need AI to accurately understand physical dynamics.


To synthesize data, Omniverse Replicator first pulls 3D models, textures, and real-world materials into its platform, then assembles them like building blocks into scenes: city streets, factory workshops, busy roads.


Next, to keep the output from being "rigid" and "monotonous", Replicator offers a powerful feature: it lets users define many factors of variation. Where an object is placed, which way it faces, its appearance, its color, its surface texture, even how the lights fall on it can all be randomized automatically.

The big advantage is variety: the final data covers all kinds of situations, so the AI gets to see them all. This technique, known as domain randomization, is a crucial step in AI data synthesis.
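As a rough illustration of the idea, here is a domain-randomization draw in plain Python. This is a toy sketch of the concept only; Omniverse Replicator exposes it through its own API, which looks quite different:

```python
import random

def randomize_scene():
    """One domain-randomization draw: sample scene parameters before a render.

    Toy illustration in plain Python, not the actual Replicator API.
    """
    return {
        "object_position": [random.uniform(-5, 5), random.uniform(-5, 5), 0.0],
        "object_yaw_deg": random.uniform(0, 360),
        "object_color_rgb": [random.random() for _ in range(3)],
        "surface_roughness": random.uniform(0.0, 1.0),
        "light_intensity": random.uniform(500, 5000),  # arbitrary units
        "light_angle_deg": random.uniform(0, 90),
    }

# Each rendered frame gets its own draw, so no two training samples
# share exactly the same placement, color, material, or lighting.
for frame in range(3):
    print(frame, randomize_scene())
```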


Then, to accurately simulate real-world physical interactions, the physics engine inside Omniverse Replicator, NVIDIA PhysX, computes how objects' motion changes on collision or contact, velocity, acceleration, rotation, friction, according to laws such as Newtonian mechanics.

Constraints such as gravity, elasticity, friction, fluid resistance, and more are added to bring the simulation closer to reality.
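At its core, such an engine numerically integrates Newton's equations step by step. Here is a bare-bones sketch of that principle, a far cry from what PhysX actually does: a ball falling under gravity and bouncing with a restitution coefficient:

```python
def simulate_bounce(steps=200, dt=0.01):
    """Bare-bones Newtonian integration: a ball falling and bouncing.

    A toy sketch of what a physics engine does at its core; real engines
    like PhysX add rigid-body rotation, friction, contact solvers, etc.
    """
    g = -9.81          # gravity, m/s^2
    restitution = 0.8  # fraction of speed kept after each bounce
    y, vy = 2.0, 0.0   # start 2 m up, at rest

    for _ in range(steps):
        vy += g * dt                # Newton's second law: dv = a * dt
        y += vy * dt                # integrate position
        if y < 0.0:                 # hit the ground
            y = 0.0
            vy = -vy * restitution  # bounce with energy loss
    return y, vy

print(simulate_bounce())  # final height and velocity after 2 simulated seconds
```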


While Omniverse Replicator can generate high-quality, dynamic 3D scenes, what it does best is things that follow the laws of physics, like making a virtual ball bounce the right way. Abstract content with coherent logic and narrative is beyond its competence.

For example, to show a happy person in a video, the AI first has to learn what a "smile" is, and that is not something a physics simulation can teach.


Another example: after drinking, if the cup is not disposable, people usually put it back where it was rather than throwing it away, behavior that follows human common sense more than pure physical law.

In theory, then, Omniverse Replicator cannot by itself generate all the data needed to train a video model like Sora. High-level semantic understanding, coherent storytelling, highly abstract concepts, and complex human emotions and social interactions all fall outside its current design and capabilities.

Take a different path

In fact, in addition to Omniverse Replicator, using Unreal Engine 5 to generate relevant data is an alternative.

In videos OpenAI released for Sora, people noticed that some clips look a little different from its usual photorealistic style, more like a certain "3D render" look, such as the little white dragon with big eyes, long eyelashes, and frosty breath.


OpenAI has never officially confirmed it, but sharp-eyed netizens saw at a glance the shadow of Unreal Engine 5 in those clips!

But even if the speculation is true, what Unreal Engine 5 can provide is most likely only simulated data about lighting, scenes, 3D geometry, and physical interactions, essentially the same "hard" physical simulation that Omniverse Replicator offers.

To build a world-class, everything-included video dataset, new tricks are needed.

One of the most extreme approaches is to let the AI be its own supplier, generating its own videos to train itself on. But there is a pitfall: if AI-made videos make up too much of the training material, there is a risk of "model autophagy".

In other words, the model gets worse and worse at the very thing it generates.


In extreme cases, continued training on self-generated data can cause model performance to plummet, or even fail completely, because each generation of the AI may magnify the flaws of the generation before it.

Last year, teams at Rice University and Stanford found that feeding models nothing but AI-generated content leads to steady performance degradation.

The researchers named the phenomenon "Model Autophagy Disorder" (MAD).

The study found that by about the fifth iteration of training on AI-generated data, the model develops MAD.


Training an AI model on synthetic data will gradually amplify artifacts
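The collapse is easy to reproduce in miniature. In the toy sketch below, the "generative model" is just a Gaussian refitted to its own samples each generation; the 0.95 factor stands in for the mild bias toward "typical" samples that quality filtering introduces. This is a caricature of the setup, not the Rice/Stanford experiment itself:

```python
import random
import statistics

# Toy "self-consuming" loop: each generation fits a Gaussian to samples
# drawn from the previous generation's Gaussian, with a slight bias
# toward typical samples (the * 0.95). Not the actual MAD experiment.
mu, sigma = 0.0, 1.0  # generation 0 = the "real" data distribution

for gen in range(1, 6):
    samples = [random.gauss(mu, sigma) for _ in range(500)]
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples) * 0.95  # quality-filtering bias
    print(f"generation {gen}: mu={mu:+.3f}, sigma={sigma:.3f}")

# sigma drifts steadily toward zero: the "model" samples an ever narrower
# slice of reality, mirroring the diversity collapse behind MAD.
```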

The mechanism closely resembles the way "inbreeding" produces biological defects in offspring.

Just as inbreeding limits genetic diversity by shrinking the gene pool, over-reliance on AI-generated data limits the diversity of what a model learns, because that data reflects the prior models' internalized understanding rather than the original diversity of the real world.

To compare a model to a person: no matter how high the data quality, every model has blind spots and underrepresented content, just as even the best genome carries a few defective genes.

These "defects" may be inconspicuous or tolerable in earlier generations of a model, but iterative training can amplify them, especially in the absence of external diversity.


The study also found a trade-off: improving the quality of synthetic data comes at the expense of its diversity.

For large models to show better generalization, that is, the ability to extrapolate from known cases to new ones, they must constantly adapt to new data and scenarios, meet new challenges, and distill new rules and associations.

That's why data diversity is so important to models.

With high-quality data on the Chinese Internet in short supply, and the synthetic-data route technically fraught, what other path can domestic video models take to surpass Sora?

Self-evolution

Wouldn't it be nice if a model could generate its own data and evolve on its own, without falling into the "autophagy" vortex?

Some domestic AI companies have in fact taken this path, for example with Awaker 1.0, a new multimodal large model developed by the Zhizi Engine team.


Simply put, Awaker 1.0 can break through the earlier data bottleneck mainly thanks to three distinctive capabilities: automatic data generation, self-reflection, and continuous updating.

First, for automatic data generation, Awaker 1.0 collects data through two channels, the Internet and the physical world: it not only browses the web, watches news, and reads articles, but can also work with smart devices in the real world, using cameras to see, picking up sounds, and making sense of what is happening around it.


Unlike simple data crawling, though, Awaker 1.0 can understand and digest the multimodal data it collects and use it to generate new content, text, images, even video, and then keep optimizing and updating itself on the basis of this "ruminated" content.

The strengthened model can then generate new data of higher quality and greater creativity, and so on, forming a closed loop of self-training.


In other words, this is dynamic data synthesis: external data only supplies the "seeds", and by continually feeding itself, the model keeps amplifying and expanding that initial data, generating fresh training data for itself.

It is like a range-extender engine: through a cyclic amplification process, a small amount of fuel (data) is leveraged into an output far beyond what that fuel alone would seem to allow.


Meanwhile, to correct biases that may creep into this closed loop, Awaker 1.0 not only scores and reflects on the quality of the data it generates, filtering out low-quality samples, but also keeps its data current and accurate through continuous online learning and iteration grounded in new external data and feedback.

In this way, the model is neither constrained by limited external data sources nor trapped in the "model autophagy" that purely synthetic data can cause.
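Piecing together what the article describes, the loop is roughly: generate, score and filter, blend in fresh external data, update, repeat. The sketch below is hypothetical, with placeholder classes and names rather than anything from the real Awaker 1.0:

```python
import random

class ToyModel:
    """Stand-in for a multimodal model; all logic here is placeholder."""
    def __init__(self):
        self.knowledge = []

    def generate(self):
        # Pretend-generate a sample with a random quality attached.
        return {"content": f"sample-{random.randint(0, 9999)}",
                "quality": random.random()}

    def score_quality(self, sample):
        return sample["quality"]  # a real system would use a learned critic

    def update(self, batch):
        self.knowledge.extend(batch)

def fetch_external(limit):
    # Stand-in for fresh real-world data (web, sensors, user feedback).
    return [{"content": f"external-{i}", "quality": 1.0} for i in range(limit)]

def self_training_round(model, quality_threshold=0.7):
    candidates = [model.generate() for _ in range(100)]      # 1. generate
    kept = [s for s in candidates
            if model.score_quality(s) >= quality_threshold]  # 2. filter
    fresh = fetch_external(limit=len(kept))                  # 3. mix in real data
    model.update(kept + fresh)                               # 4. continuous update
    return len(kept), len(fresh)

model = ToyModel()
for round_no in range(3):
    print(round_no, self_training_round(model))
```

The key design point is step 3: because every round blends in data the model did not produce, diversity never depends solely on the model's own (narrowing) distribution, which is what keeps the loop out of autophagy territory.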

This mechanism of self-feedback and learning in fact points to a broader idea in AI: unifying the understanding side with the generation side.

Since Sora's debut, more and more voices have argued that reaching AGI requires a "grand unification of understanding and generation".

This is because the essence of human intelligence is to understand and to create the world, whereas today's AI tends to specialize in either comprehension tasks (such as classification and detection) or generative tasks (such as language models and image generation). True intelligence needs understanding and generation to close the loop.


Put bluntly, AI needs to mimic how the human brain learns: seeing and thinking at the same time, while reflecting on its own output and adjusting to a changing reality.

In the words of Chinese philosophy, this is the unity of knowledge and action.

For AI to do this, it needs to be able to generate data on its own to train itself, and grow from it, evolving over time.

Only then can AI respond flexibly, and even create the way humans do, when facing situations it has never seen before, an important step toward realizing AGI.
