
How did OpenAI's video model become so strong?


OpenAI has released a new text-to-video model called "Sora".


The Sora model can generate up to 60 seconds of high-definition video, and the resulting footage captures the lighting and shadow in a scene, physical occlusion and collisions between objects, and smooth, fluid camera movement.

You have probably already seen plenty of articles sharing OpenAI's official demo videos. Because of content-safety concerns, Sora has not yet been opened for public testing, so no differentiated information is available beyond the official material; the Zhiwei editorial department will therefore not rehash the model's output here.

Instead, we'd like to focus on why the Sora model looks so much better than the other text-to-video models on the market. What did OpenAI actually do?

First of all, in the text-to-video field, the more mature approaches include recurrent neural networks (RNNs), generative adversarial networks (GANs), and diffusion models. Sora is a diffusion model.

While GAN models were popular in the past, image and video generation are now dominated by diffusion models.

Diffusion models have real advantages here: compared with GANs, they offer better generation diversity and more stable training. Most importantly, diffusion models have a higher ceiling for image and video generation, because a GAN is, in principle, essentially a machine imitating humans, while a diffusion model is more like a machine learning to become one.

This may sound abstract, so here is a loose but intuitive analogy:

The GAN model is like a diligent but hard-to-control painter. The painter (the generator) keeps copying from the originals (the training data) while a teacher (the discriminator) keeps grading the results. After countless rounds of this back-and-forth, both painter and teacher improve dramatically, until the painter can produce convincingly realistic paintings. But the whole process is hard to steer: training often goes off the rails, producing output no one can understand. And because the painter improves essentially by imitating existing works, he lacks creativity, which may cap his ceiling.

The diffusion model, on the other hand, is a diligent and intelligent painter. Rather than imitating mechanically, he studies a large body of earlier works and learns the relationship between an image and what it expresses: he comes to know roughly what "beauty" should look like in a picture, or what a given "style" entails. He is closer to actually thinking, and he is a more promising painter than the GAN.
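To make that analogy concrete, here is a minimal sketch of one diffusion training step in PyTorch, assuming a toy MLP denoiser and a simplified linear corruption schedule; it illustrates the noise-prediction objective, not anything about Sora's actual network.

```python
import torch
import torch.nn as nn

# Minimal sketch of one diffusion training step (noise prediction), under a
# simplified linear corruption schedule. The tiny MLP denoiser is a toy
# stand-in for illustration; it bears no resemblance to Sora's network.
denoiser = nn.Sequential(nn.Linear(65, 256), nn.ReLU(), nn.Linear(256, 64))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

x0 = torch.randn(32, 64)          # a batch of "clean" training samples
t = torch.rand(32, 1)             # a random noise level per sample, in [0, 1)
noise = torch.randn_like(x0)      # the Gaussian noise we will add

# Corrupt the data by blending each sample toward pure noise.
xt = (1 - t) * x0 + t * noise

# The model is trained to predict the noise that was added. By learning to
# undo corruption at every noise level, it absorbs what the data "should"
# look like, instead of copying any single training example.
pred = denoiser(torch.cat([xt, t], dim=1))
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
optimizer.step()
```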

In other words, by choosing the diffusion model for its text-to-video effort, OpenAI made a good start: it picked the more promising painter to cultivate.

This raises another question: since the superiority of diffusion models is common knowledge, and plenty of competitors besides OpenAI are building them, why does OpenAI's look so much more impressive?

Because OpenAI thought along these lines: we have already achieved great success with large language models, so can we use that experience to succeed somewhere new?

The answer is yes.

OpenAI attributes the earlier success of its large language models to tokens, which elegantly unify code, mathematics, and all the different natural languages into one representation that is convenient for large-scale training. So for video, they created an analogous concept, the "patch": a small tile of visual data that plays the same role for Sora that the token plays for GPT.
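As an illustration of the token half of that analogy, the sketch below uses the open-source tiktoken library to show how code, math, and Chinese all reduce to the same kind of integer sequence; the choice of the "cl100k_base" encoding is our assumption for demonstration purposes.

```python
import tiktoken

# Illustration of how tokens give wildly different kinds of text one uniform
# representation. Uses the open-source tiktoken library; "cl100k_base" is the
# encoding used by recent OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["def f(x): return x ** 2",   # code
             "E = mc^2",                   # math
             "你好，世界"]:                 # natural language (Chinese)
    print(repr(text), "->", enc.encode(text))

# Sora's patches play the same role for video: every clip, whatever its size
# or length, becomes one uniform sequence of spacetime patches.
```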


In fact, tokens work so well in large language models partly because of the Transformer architecture they are paired with. Accordingly, Sora, as a video-generation diffusion model, adopts the Transformer architecture, unlike the mainstream video diffusion models, most of which use a U-Net.
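To show what "a diffusion model with a Transformer backbone" means in the simplest possible terms, here is a toy PyTorch sketch with arbitrary sizes of our choosing; the only point is that the denoiser consumes a flat sequence of patch tokens rather than a fixed pixel grid.

```python
import torch
import torch.nn as nn

# Minimal sketch of the "diffusion transformer" idea: the denoiser is a plain
# transformer over a sequence of patch tokens, instead of a U-Net over a
# pixel grid. All sizes here are arbitrary toy values, not Sora's.
d_model = 128
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
denoiser = nn.TransformerEncoder(layer, num_layers=4)

noisy_patches = torch.randn(2, 200, d_model)   # (batch, patch sequence, dim)
denoised = denoiser(noisy_patches)             # same shape out

# Because the model only ever sees a token sequence, the same weights can
# handle videos of different resolutions and durations: they simply yield
# shorter or longer patch sequences.
print(denoised.shape)  # torch.Size([2, 200, 128])
```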

In other words, OpenAI won on its accumulated experience and its choice of technical route.

But this "success formula", the Transformer architecture, is well known and already mainstream in text and image generation. Why did nobody else think to use it for video generation before OpenAI did?

This stems from another problem: the memory requirement of the full attention mechanism in the Transformer architecture grows quadratically with the length of the input sequence, so the computational cost of processing a high-dimensional signal such as video is extremely high.

In plain terms: a Transformer works well, but the computing resources it demands are frightening, and using it naively is uneconomical.
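Some rough, purely illustrative arithmetic makes the problem concrete; the sequence lengths and fp16 assumption below are ours, not OpenAI's.

```python
# Back-of-the-envelope arithmetic for the quadratic cost of full attention:
# the attention score matrix holds one entry per pair of tokens, so memory
# grows with the square of the sequence length. The lengths below are
# illustrative assumptions, not Sora's real numbers.
def attn_matrix_bytes(seq_len: int, bytes_per_entry: int = 2) -> int:
    return seq_len ** 2 * bytes_per_entry        # fp16: 2 bytes per entry

text_len = 2_000       # a long text prompt, in tokens
video_len = 100_000    # an uncompressed video could easily yield this many patches

print(attn_matrix_bytes(text_len) / 1e6, "MB")   # 8.0 MB
print(attn_matrix_bytes(video_len) / 1e9, "GB")  # 20.0 GB, per layer per head
```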

Of course, even though OpenAI has raised round after round of funding, it is not so wealthy that it can simply throw hardware at the problem, so it found another way to bring the computing cost down.

Here we first need the concept of the "latent", a kind of dimensionality reduction or compression that aims to express the essence of the information with less information. A loose but intuitive example: we can record the structure of a simple three-dimensional object with a set of two-dimensional views (a three-view drawing), rather than storing the 3D model itself.

To this end, OpenAI built a video compression network that first reduces the video into the latent space and then cuts the compressed representation into patches. This shrinks the input and substantially relieves the computational pressure of the Transformer architecture.
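The following back-of-the-envelope sketch shows why this helps, assuming illustrative 4x temporal and 8x spatial compression factors and a patch size of 2; none of these are OpenAI's published numbers.

```python
# Sketch of the compression idea: shrink the video into a latent grid first,
# then cut the latent into patches. The 4x temporal and 8x spatial reduction
# factors are illustrative assumptions, not OpenAI's published numbers.
T, H, W = 64, 512, 512                        # raw clip: frames x height x width
lat_T, lat_H, lat_W = T // 4, H // 8, W // 8  # latent grid after compression
patch = 2                                     # patch edge length in the latent grid

pixel_patches = (T // patch) * (H // patch) * (W // patch)
latent_patches = (lat_T // patch) * (lat_H // patch) * (lat_W // patch)

print(pixel_patches)    # 2,097,152 patches straight from pixels
print(latent_patches)   # 8,192 patches from the latent: 256x fewer tokens,
                        # and with quadratic attention, ~65,536x less memory
```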

With that, most of the problems were solved: OpenAI successfully folded text-to-video into the same paradigm that made its large language models so successful, and on that foundation it would have been hard for the results not to be good.

Beyond this, OpenAI's training choices also differ slightly from the industry's. It trains on videos at their original size and duration, rather than first cropping them to a preset standard size and duration, as has been common practice.

This choice brings Sora a number of benefits:

(1) the duration of the generated video can be customized more flexibly;

(2) the size of the generated video can be customized more flexibly;

(3) the generated video has better framing and composition.

The first two points are easy to understand. For the third, OpenAI gives an example, comparing a model trained on cropped-size videos with one trained on original-size videos:

[Comparison image from OpenAI's report: on the left, a video generated by the model trained on cropped-size videos; on the right, a video generated by the model trained on original-size videos.]
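A small sketch of the difference, using made-up clip sizes and patch dimensions: training at native size lets each clip contribute its own patch count, while crop-to-standard discards everything outside the crop window.

```python
# Sketch of the "original size" idea: keep each clip's own resolution and
# duration and let it produce its own number of patches, instead of cropping
# everything to one standard shape. All sizes are made-up examples.
def n_patches(frames, h, w, pt=4, ps=16):   # patch = pt frames x (ps x ps) pixels
    return (frames // pt) * (h // ps) * (w // ps)

clips = [(120, 1080, 1920), (48, 720, 720), (300, 480, 854)]  # (frames, H, W)

# Crop-to-standard training would force all three clips down to, say,
# 64 x 512 x 512, discarding everything outside the crop window, which is
# exactly where framing and composition get lost.
standard = n_patches(64, 512, 512)

for frames, h, w in clips:
    print((frames, h, w), "->", n_patches(frames, h, w),
          "patches at native size, vs", standard, "after cropping")
```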

In addition, to better understand user intent and achieve better generation results, OpenAI added a couple of clever touches to the Sora pipeline.

First, training a text-to-video model like Sora requires a huge volume of video material with text descriptions, so OpenAI uses the re-captioning technique from its own DALL·E 3 to add high-quality text descriptions to all training footage; OpenAI says this improves the overall quality of the output videos.

Second, Sora leans on GPT: when a user enters a prompt, GPT first expands it into an accurate, detailed description, and the expanded prompt is then handed to Sora. This helps Sora follow the prompt and generate a more faithful video.
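A hedged sketch of what that handoff could look like in code: Sora has no public API, so the generate_video call below is purely hypothetical, while the prompt expansion uses the standard OpenAI Python client (the model name is also an assumption).

```python
from openai import OpenAI

# Hedged sketch of the prompt-expansion step. Sora has no public API, so
# generate_video() below is purely hypothetical; the chat call uses the
# standard OpenAI Python client, and the model name is an assumption.
client = OpenAI()

def expand_prompt(user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": ("Rewrite the user's idea as one detailed, concrete "
                         "video description: subjects, setting, lighting, "
                         "camera movement.")},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content

detailed = expand_prompt("a corgi running on the beach at sunset")
# video = generate_video(detailed)   # hypothetical handoff to the video model
```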

And that is the end of our brief analysis of why the Sora model looks so strong.

Looking at the whole picture, you will find that Sora's success is no accident. Its stunning results owe much to OpenAI's past work, including GPT and DALL·E: some of it is called directly, and some of it lends its ideas.

Perhaps we can say that OpenAI first became a giant itself, then stood on its own giant's shoulders to become a new one.

Correspondingly, competitors at home and abroad whose text-to-text and text-to-image technology lags behind may find themselves left even further behind in what comes next.

The talk of "overtaking on the curve" and "the gap is only X months" may turn out not to exist at all, but to be mere self-comfort.
