
Why use foreign videos for domestic AI training?

Author: 酷玩实验室Coollabs

It has been more than a year since the wave of generative AI first swept in.

If there is a "crown jewel" among the models of this AI wave, it has to be the text-to-video model.

From a technical point of view, the core value of video models such as Sora and Vidu lies in the fact that they synthesize and create across media, forming a "grand unification" of modalities such as text, images, and video.

And this kind of "grand unification" may be the key on humanity's path to AGI.


Under this "grand unification" framework, data is no longer confined to a single modality, but is understood and used as a synthesis of multi-dimensional information.

As Turing Award winner Yann LeCun, one of the "Big Three" of AI, argued when proposing his "world model" theory, today's LLMs are trained only on text, so their understanding of the world is only superficial.

Even though LLMs can demonstrate excellent text comprehension thanks to enormous parameter counts and massive training data, in essence they only capture the statistical regularities of text; they do not really understand what the text means in the real world.


Yann LeCun, one of the "Big Three" of AI

If a model can instead use richer sensory signals, such as vision, to learn how the world works, it can understand reality more deeply, and perceive laws and phenomena that words alone cannot convey.

Seen this way, whoever first lets AI grasp the laws of real physics through multimodal world models may be the first to break through the limits of text and semantics and take a big step up the road to AGI.

That's why OpenAI is currently so focused on Sora.

Some time ago, the debut of Vidu did domestic video technology proud and let it stand tall next to an industry heavyweight like Sora. But even as people celebrated, a closer look at Vidu's demo videos revealed something curious: they are full of foreign faces.


That detail got people thinking; it inadvertently exposed a weak spot in our video data collection: the lack of high-quality data.

The data dilemma

If there are hard thresholds restricting the development of video generation models at this stage, they are none other than computing power, algorithms, and data.

The first two can, in fact, be solved with enough money and talent. Data is different: once you fall behind, catching up takes real effort; like height, a gap is hard to close once it opens.

Truth be told, although in absolute terms there is plenty of video content on the Chinese Internet, the high-quality data that can actually be used for AI training is not as rich as on the global Internet.


For example, in video object detection, the YouTube-VIS dataset contains 2,904 video sequences with more than 250,000 annotated object instances. Domestic video object detection datasets, such as Huawei's OTB-88, contain only 88 video sequences.

In behavior recognition, the internationally renowned HACS dataset contains 1.4 million video clips covering 200 categories of everyday human actions. By contrast, Alibaba Cloud's Tianchi behavior recognition dataset contains only 200,000 clips, though it too covers 200 action categories.


From the perspective of the video ecosystem, this gap exists mainly because China's mainstream video sites, such as iQiyi, Youku, and Tencent Video, mostly publish TV dramas, variety shows, and other entertainment content.

Meanwhile the highest-traffic short-video platforms, Douyin and Kuaishou, are full of funny skits and life hacks that are short to begin with, plus plenty of re-edited, re-uploaded, and plagiarized clips.

With that on the menu, it is genuinely hard for AI to find a "proper meal" to eat.


For video AI training, such videos are either too concentrated in a few genres and lack diverse everyday scenes, or too short and lacking in depth and coherent narrative, which makes it hard for AI to learn the continuity, story logic, and causality of long sequences.

In contrast, professionally produced content such as movies and documentaries is often exactly the high-quality data video AI requires.


Such material is rich in variety, long enough, and full of detail, which helps AI models capture differences in lighting changes and object materials, and thus improve the accuracy of what they generate.

When it comes to video data, we not only lack high-quality content but also face another headache, data labeling, which is a hard bone to gnaw. However good a video is, if you throw it at the AI raw, it cannot tell what the things in it are.

So after the video data is collected, someone has to patiently tell the AI, frame by frame: "See, this stream is moving traffic, and that two-legged figure is a pedestrian."
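What "telling the AI" looks like concretely is a structured record per frame. Here is a minimal sketch; the field names are purely illustrative (loosely in the spirit of COCO/YouTube-VIS-style labels), not the exact schema of any particular dataset or tool:

```python
# A minimal sketch of frame-level video annotation data.
# Field names are illustrative, not any specific tool's schema.
annotation = {
    "video_id": "street_0001",
    "frame_index": 42,
    "objects": [
        {
            "track_id": 7,                # same object identity across frames
            "category": "pedestrian",
            "bbox": [312, 180, 64, 128],  # x, y, width, height in pixels
        },
        {
            "track_id": 8,
            "category": "car",
            "bbox": [520, 210, 180, 90],
        },
    ],
}

# Thousands of such records, one per frame per video, are what an
# annotator ultimately has to produce or verify by hand.
print(annotation["objects"][0]["category"])  # -> "pedestrian"
```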


Getting through such hard, massive labeling work takes some tooling muscle. To improve annotation efficiency, a number of interactive video annotation tools have emerged abroad, such as CVAT and iMerit. These tools integrate algorithms such as automatic tracking and interpolation, which can greatly reduce the workload of manual annotation.
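The interpolation trick behind such tools is simple in principle: an annotator draws boxes on a few keyframes, and the tool fills in every frame in between. A minimal sketch of linear bounding-box interpolation (the principle only, not CVAT's actual implementation):

```python
def interpolate_bbox(kf_a, kf_b, frame):
    """Linearly interpolate a bounding box between two annotated keyframes.

    kf_a, kf_b: (frame_index, [x, y, w, h]) for the surrounding keyframes.
    frame:      the in-between frame to fill in automatically.
    """
    (fa, box_a), (fb, box_b) = kf_a, kf_b
    t = (frame - fa) / (fb - fa)  # 0.0 at kf_a, 1.0 at kf_b
    return [a + t * (b - a) for a, b in zip(box_a, box_b)]

# Annotator labels frames 10 and 20 by hand; frames 11-19 come for free.
box = interpolate_bbox((10, [100, 50, 40, 80]), (20, [200, 60, 40, 80]), 15)
print(box)  # -> [150.0, 55.0, 40.0, 80.0]
```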

In China, by contrast, automatic labeling tools are less widespread, so most of the work still relies on human-wave tactics: large annotation teams grinding through labels by hand, overtime included.

This way the labeling volume goes up, but a problem comes with it: this temporary army works without unified, objective standards and with inadequate training, judging right and wrong by personal feel. Uneven data quality becomes the norm; some parts are labeled well, others only so-so.


What makes it an even bigger headache is that the work is boring, tiring, and poorly paid; who would be willing to do it for long?

According to feedback from a number of video data annotation companies, most annotators earn 3,000-5,000 yuan a month, and the annual turnover rate in the domestic video annotation industry generally runs between 30% and 50%, reaching as high as 80% at some companies.

With personnel cycling through like a revolving door, companies have to keep recruiting and training new people, which directly undermines the quality and stability of data annotation.


Frankly, with total data volume, diversity, and the annotation pipeline all lagging behind the global Internet, how can domestic video AI get over the data hurdle if it wants to rise?

Synthetic data

If high-quality data is hard to find, is synthetic data a viable path, "feeding" AI with artificial material? Truth be told, people were doing this before Sora came out; NVIDIA's Omniverse Replicator, released in 2021, is one example.


To put it bluntly, Omniverse Replicator is a platform for synthesizing data, specializing in hyper-realistic 3D scenes. It is seriously impressive: in the video data it creates, every detail follows the laws of physics, as if plucked straight from the real world.

Who benefits most from this? Quite a few fields: autonomous driving, robot training, anywhere you need AI to accurately understand physical dynamics.


To synthesize data, Omniverse Replicator first pulls 3D models, textures, and real-world materials into its platform, then assembles them like building blocks into scenes: city streets, factory workshops, busy roads.


Next, to keep the output from being "rigid" and "monotonous", Replicator offers a powerful feature: it lets users define many factors of variation. Where an object is placed, which way it faces, its appearance, its color, its surface texture, even how the lights fall on it can all be randomized automatically.

The big advantage is variety: the final data covers all kinds of situations, so the AI gets to see them all. This technique, known as domain randomization, is a crucial step in AI data synthesis.
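As a rough illustration of the idea, here is a domain-randomization draw in plain Python. This is a toy sketch of the concept only; Omniverse Replicator exposes it through its own API, which looks quite different:

```python
import random

def randomize_scene():
    """One domain-randomization draw: sample scene parameters before a render.

    Toy illustration in plain Python, not the actual Replicator API.
    """
    return {
        "object_position": [random.uniform(-5, 5), random.uniform(-5, 5), 0.0],
        "object_yaw_deg": random.uniform(0, 360),
        "object_color_rgb": [random.random() for _ in range(3)],
        "surface_roughness": random.uniform(0.0, 1.0),
        "light_intensity": random.uniform(500, 5000),  # arbitrary units
        "light_angle_deg": random.uniform(0, 90),
    }

# Each rendered frame gets its own draw, so no two training samples
# share exactly the same placement, color, material, or lighting.
for frame in range(3):
    print(frame, randomize_scene())
```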


Then, to accurately simulate real-world physical interactions, the physics engine inside Omniverse Replicator, NVIDIA PhysX, computes how objects' motion changes on collision or contact, velocity, acceleration, rotation, friction, according to laws such as Newtonian mechanics.

Constraints such as gravity, elasticity, friction, fluid resistance, and more are added to bring the simulation closer to reality.
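At its core, such an engine numerically integrates Newton's equations step by step. Here is a bare-bones sketch of that principle, a far cry from what PhysX actually does: a ball falling under gravity and bouncing with a restitution coefficient:

```python
def simulate_bounce(steps=200, dt=0.01):
    """Bare-bones Newtonian integration: a ball falling and bouncing.

    A toy sketch of what a physics engine does at its core; real engines
    like PhysX add rigid-body rotation, friction, contact solvers, etc.
    """
    g = -9.81          # gravity, m/s^2
    restitution = 0.8  # fraction of speed kept after each bounce
    y, vy = 2.0, 0.0   # start 2 m up, at rest

    for _ in range(steps):
        vy += g * dt                # Newton's second law: dv = a * dt
        y += vy * dt                # integrate position
        if y < 0.0:                 # hit the ground
            y = 0.0
            vy = -vy * restitution  # bounce with energy loss
    return y, vy

print(simulate_bounce())  # final height and velocity after 2 simulated seconds
```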


While Omniverse Replicator can generate high-quality, dynamic 3D scenes, what it does best is things that follow the laws of physics, like making a virtual ball bounce the right way. Abstract content with coherent logic and narrative is beyond its competence.

For example, to show a happy person in a video, the AI first has to learn what a "smile" is, and that is not something a physics simulation can teach.


Another example: after drinking, if the cup is not disposable, people usually put it back where it was rather than throwing it away, behavior that follows human common sense more than pure physical law.

In theory, then, Omniverse Replicator cannot by itself generate all the data needed to train a video model like Sora. High-level semantic understanding, coherent storytelling, highly abstract concepts, and complex human emotions and social interactions all fall outside its current design and capabilities.

Take a different path

In fact, in addition to Omniverse Replicator, using Unreal Engine 5 to generate relevant data is an alternative.

In videos OpenAI released for Sora, people noticed that some clips look a little different from its usual photorealistic style, more like a certain "3D render" look, such as the little white dragon with big eyes, long eyelashes, and frosty breath.


OpenAI has never officially confirmed it, but sharp-eyed netizens saw at a glance the shadow of Unreal Engine 5 in those clips!

But even if the speculation is true, what Unreal Engine 5 can provide is most likely only simulated data about lighting, scenes, 3D geometry, and physical interactions, essentially the same "hard" physical simulation that Omniverse Replicator offers.

To build a world-class, everything-included video dataset, new tricks are needed.

One of the most extreme approaches is to let the AI be its own supplier, generating its own videos to train itself on. But there is a pitfall: if AI-made videos make up too much of the training material, there is a risk of "model autophagy".

In other words, the model gets worse and worse at the very thing it generates.


In extreme cases, continued training on self-generated data can cause model performance to plummet, or even fail completely, because each generation of the AI may magnify the flaws of the generation before it.

Last year, teams at Rice University and Stanford found that feeding models nothing but AI-generated content leads to steady performance degradation.

The researchers named the phenomenon "Model Autophagy Disorder" (MAD).

The study found that by about the fifth iteration of training on AI-generated data, the model develops MAD.


Training an AI model on synthetic data will gradually amplify artifacts
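The collapse is easy to reproduce in miniature. In the toy sketch below, the "generative model" is just a Gaussian refitted to its own samples each generation; the 0.95 factor stands in for the mild bias toward "typical" samples that quality filtering introduces. This is a caricature of the setup, not the Rice/Stanford experiment itself:

```python
import random
import statistics

# Toy "self-consuming" loop: each generation fits a Gaussian to samples
# drawn from the previous generation's Gaussian, with a slight bias
# toward typical samples (the * 0.95). Not the actual MAD experiment.
mu, sigma = 0.0, 1.0  # generation 0 = the "real" data distribution

for gen in range(1, 6):
    samples = [random.gauss(mu, sigma) for _ in range(500)]
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples) * 0.95  # quality-filtering bias
    print(f"generation {gen}: mu={mu:+.3f}, sigma={sigma:.3f}")

# sigma drifts steadily toward zero: the "model" samples an ever narrower
# slice of reality, mirroring the diversity collapse behind MAD.
```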

The mechanism closely resembles the way "inbreeding" produces biological defects in offspring.

Just as inbreeding limits genetic diversity by shrinking the gene pool, over-reliance on AI-generated data limits the diversity of what a model learns, because that data reflects the prior models' internalized understanding rather than the original diversity of the real world.

To compare a model to a person: no matter how high the data quality, every model has blind spots and underrepresented content, just as even the best genome carries a few defective genes.

These "defects" may be inconspicuous or tolerable in earlier generations of a model, but iterative training can amplify them, especially in the absence of external diversity.


The study also found a trade-off: improving the quality of synthetic data comes at the expense of its diversity.

For large models to show better generalization, that is, the ability to extrapolate from known cases to new ones, they must constantly adapt to new data and scenarios, meet new challenges, and distill new rules and associations.

That's why data diversity is so important to models.

With high-quality data on the Chinese Internet in short supply, and the synthetic-data route technically fraught, what other path can domestic video models take to surpass Sora?

Self-evolution

Wouldn't it be nice if a model could generate its own data and evolve on its own, without falling into the "autophagy" vortex?

Some domestic AI companies have in fact taken this path, for example with Awaker 1.0, a new multimodal large model developed by the Zhizi Engine team.


Simply put, Awaker 1.0 can break through the earlier data bottleneck mainly thanks to three distinctive capabilities: automatic data generation, self-reflection, and continuous updating.

First, for automatic data generation, Awaker 1.0 collects data through two channels, the Internet and the physical world: it not only browses the web, watches news, and reads articles, but can also work with smart devices in the real world, using cameras to see, picking up sounds, and making sense of what is happening around it.


Unlike simple data crawling, though, Awaker 1.0 can understand and digest the multimodal data it collects and use it to generate new content, text, images, even video, and then keep optimizing and updating itself on the basis of this "ruminated" content.

The strengthened model can then generate new data of higher quality and greater creativity, and so on, forming a closed loop of self-training.


In other words, this is dynamic data synthesis: external data only supplies the "seeds", and by continually feeding itself, the model keeps amplifying and expanding that initial data, generating fresh training data for itself.

It is like a range-extender engine: through a cyclic amplification process, a small amount of fuel (data) is leveraged into an output far beyond what that fuel alone would seem to allow.


Meanwhile, to correct biases that may creep into this closed loop, Awaker 1.0 not only scores and reflects on the quality of the data it generates, filtering out low-quality samples, but also keeps its data current and accurate through continuous online learning and iteration grounded in new external data and feedback.

In this way, the model is neither constrained by limited external data sources nor trapped in the "model autophagy" that purely synthetic data can cause.
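Piecing together what the article describes, the loop is roughly: generate, score and filter, blend in fresh external data, update, repeat. The sketch below is hypothetical, with placeholder classes and names rather than anything from the real Awaker 1.0:

```python
import random

class ToyModel:
    """Stand-in for a multimodal model; all logic here is placeholder."""
    def __init__(self):
        self.knowledge = []

    def generate(self):
        # Pretend-generate a sample with a random quality attached.
        return {"content": f"sample-{random.randint(0, 9999)}",
                "quality": random.random()}

    def score_quality(self, sample):
        return sample["quality"]  # a real system would use a learned critic

    def update(self, batch):
        self.knowledge.extend(batch)

def fetch_external(limit):
    # Stand-in for fresh real-world data (web, sensors, user feedback).
    return [{"content": f"external-{i}", "quality": 1.0} for i in range(limit)]

def self_training_round(model, quality_threshold=0.7):
    candidates = [model.generate() for _ in range(100)]      # 1. generate
    kept = [s for s in candidates
            if model.score_quality(s) >= quality_threshold]  # 2. filter
    fresh = fetch_external(limit=len(kept))                  # 3. mix in real data
    model.update(kept + fresh)                               # 4. continuous update
    return len(kept), len(fresh)

model = ToyModel()
for round_no in range(3):
    print(round_no, self_training_round(model))
```

The key design point is step 3: because every round blends in data the model did not produce, diversity never depends solely on the model's own (narrowing) distribution, which is what keeps the loop out of autophagy territory.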

This mechanism of self-feedback and learning in fact points to a broader idea in AI: unifying the understanding side with the generation side.

Since Sora's debut, more and more voices have argued that reaching AGI requires a "grand unification of understanding and generation".

This is because the essence of human intelligence is to understand and to create the world, whereas today's AI tends to specialize in either comprehension tasks (such as classification and detection) or generative tasks (such as language models and image generation). True intelligence needs understanding and generation to close the loop.


Put bluntly, AI needs to mimic how the human brain learns: seeing and thinking at the same time, while reflecting on its own output and adjusting to a changing reality.

In the words of Chinese philosophy, this is the unity of knowledge and action.

For AI to do this, it needs to be able to generate data on its own to train itself, and grow from it, evolving over time.

Only then can AI respond flexibly, and even create the way humans do, when facing situations it has never seen before, an important step toward realizing AGI.
