Sora Takes Off and Musk Gets Anxious: "Tesla Has the Best Video Generation Technology"

Has OpenAI's new release, Sora, been flooding your feed for the past two days?

A bustling Spring Festival celebration in the Year of the Dragon, with many characters on screen, each doing its own thing at the same time:

The streets of Tokyo after rain, with light, shadow, and reflections handled convincingly:

Even an extreme close-up of a lizard, full of detail:

All of the above are from OpenAI's first video generation model, Sora.

Enter a text prompt and it can generate up to one minute of high-definition video. It is already being seen as a trump-card technology that will rewrite the entire field of video generation.

It not only caused a sensation in academia, but also left Musk, the tech world's veteran, unable to sit still.

He tweeted bluntly: Tesla has the world's best real-world simulation and video generation capabilities!

Uh-oh, it looks like a fight is brewing.

01 Musk responds to Sora

As soon as Sora was released, its results stunned the internet.

However, unlike ChatGPT, only a small number of people have access to Sora for now.

Plenty of people still wanted to try it for themselves, so OpenAI CEO Sam Altman seized the opportunity to show it off and, right after the release, started taking requests on Twitter.

If you posted a prompt and tagged Sam (@sama), or replied to one of his tweets, you might get back a video generated by Sora.

Some people replied in earnest, and others took the opportunity to stir up trouble.

Dogecoin's graphic designer DogeDesigner replied to Sam's tweet with the following prompt:

One person turns an open-source, non-profit company into a closed-source, for-profit company.

With a description like that, they might as well have named Sam outright.

And Musk popped up directly under this reply.

For one thing, Dogecoin is his favorite digital currency, and he often interacts with this user on Twitter. More importantly, Musk and OpenAI have a long history of bad blood.

Although Musk was a co-founder, he later left the board of directors, and after OpenAI restructured into a for-profit company he repeatedly and publicly accused it of abandoning its original mission in pursuit of profit.

Musk then reposted another piece of OpenAI-related content, adding a monocle emoji as if to say he was puzzled.

The post says that Sam owns an OpenAI venture capital fund, which had $175 million in commitments as of last year.

The fund is not managed by OpenAI; it is only "temporarily" held in Sam's name.

As is well known, Sam holds no direct equity in OpenAI, and he has called his indirect stake through a YC fund "unimportant", saying he founded OpenAI because he loves AI.

So when news broke that Sam owns the OpenAI venture capital fund, Musk voiced his doubts, perhaps to imply that Sam does want to profit from OpenAI and is not as indifferent to fame and fortune as he has made out.

One might have thought Musk would stop after two jabs, but after a user posted a tweet comparing Sora with Tesla FSD V12, Musk replied:

Tesla was able to generate real-world videos about a year ago, and they were physically accurate.

But it's not very interesting, because all the training data comes from the car, so the video also looks like it comes from the cameras on a Tesla vehicle, although it is a dynamically generated world rather than a recorded one.

So let's take a look: how do Sora's capabilities compare with Tesla's?

02 What is Sora

Sora is OpenAI's first video generation model, that is, a text-to-video model.

It is essentially a diffusion model, trained on videos and images of varying durations, resolutions, and aspect ratios.

OpenAI has disclosed only a few technical details. The key ones are patches, the latent space, and the choice of training approach.

As the counterpart to tokens in large language models, OpenAI uses patches: a video is compressed into a low-dimensional latent space and then decomposed into spacetime latent patches, which unifies the representation of different kinds of visual data.

In other words, just as tokens simplify and unify different natural languages, patches unify videos and images with different resolutions, durations, and aspect ratios.

The video compression network was trained by OpenAI specifically to reduce the dimensionality of visual data. Sora itself is trained in this compressed latent space, which eases the computational load.
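
The patch idea above can be illustrated with a minimal NumPy sketch. OpenAI has not published Sora's code, so everything here is an assumption made for illustration: the function name spacetime_patches, the (T, H, W, C) latent layout, and the patch sizes are all hypothetical, and the compression network is skipped entirely (the input stands in for an already-compressed latent). The point is only that latents of different sizes become sequences of different lengths whose entries all have the same width, much like token sequences.

```python
import numpy as np

def spacetime_patches(latent, pt, ph, pw):
    """Cut a latent video of shape (T, H, W, C) into flattened spacetime patches.

    Each patch covers pt frames, ph rows and pw columns.
    Returns an array of shape (num_patches, pt * ph * pw * C).
    Hypothetical helper for illustration; not OpenAI's implementation.
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Split every axis into (number of blocks, size of block).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the three block-index axes to the front, block contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch.
    return x.reshape(-1, pt * ph * pw * C)

# Two latents with different durations and aspect ratios (made-up sizes).
square_clip = np.zeros((8, 16, 16, 4))
wide_clip = np.zeros((4, 8, 24, 4))

print(spacetime_patches(square_clip, 2, 4, 4).shape)  # (64, 128)
print(spacetime_patches(wide_clip, 2, 4, 4).shape)    # (24, 128)
```

Both clips end up as rows of width 128; only the number of rows differs, which is what lets a single Transformer-style model consume them without cropping or resizing.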

In addition, because Sora is trained directly on video data at its native size, unlike other models, it can handle videos of various resolutions, durations, aspect ratios, viewing angles, and so on.

Composition and framing are also improved. For example, comparable models in the industry crop videos to a square, so the subject is often only partially in frame, whereas Sora captures the complete scene.

In addition, Sora builds on OpenAI's earlier work on DALL·E 3, as well as accumulated experience and breakthroughs with diffusion Transformers.

The Sora that was ultimately demonstrated understands not only what the prompt asks for, but also how those objects exist in the physical world.

It understands that paper airplanes would collide as they fly through a forest, and that light and shadow change as they do.

A flock of paper airplanes flutter through the dense jungle and weave through the woods like migratory birds.

It can also create multiple shots within a single video, relying on a deep understanding of language to interpret prompts accurately while preserving characters and visual style.

Beautiful, snowy Tokyo is bustling. The camera travels through the busy city streets, following a few people enjoying a beautiful snowy day and shopping at nearby stalls. Gorgeous cherry blossom petals flutter in the wind along with snowflakes.

Sora isn't perfect right now, though. OpenAI notes that it may struggle to accurately simulate the physics of complex scenarios and may not be able to understand cause and effect.

For example, with the prompt "five gray wolf pups frolicking and chasing each other on a remote gravel road", the number of wolves changes, with some appearing or disappearing out of thin air.

It may also confuse spatial details in a prompt, such as left and right, and may struggle to depict events that unfold over time, such as following a particular camera trajectory.

For example, with the prompt "The basketball passes through the basket and then explodes", the basketball is not properly blocked by the hoop.

But these shortcomings have not made the big names stingy with their praise. Saining Xie, an assistant professor at New York University and an author of ResNeXt, said bluntly that Sora will rewrite the entire field of video generation.

Those are Sora's current capabilities and the technology behind them. So what about Tesla's?

03 Tesla's video generation capabilities

In July last year, Ashok Elluswamy, Tesla's director of self-driving software, mentioned in a CVPR 2023 talk that Tesla is building a foundational world model for its AI technology.

According to him, the model is a neural network that predicts the future conditioned on past video and other inputs.

The model predicts not just the view from a single camera, but the views from all eight cameras (seven are shown).

For example, given the same video, the model can predict how the car's surroundings will evolve under "continue straight" versus "change lanes to the right".

This is effectively the ability to generate different videos from textual instructions.

At the same time, details such as the color of surrounding vehicles stay consistent across camera views, meaning the output is consistent with the motion of 3D objects.

Tesla also emphasized that it did not specifically train the model for 3D capabilities or require it to exhibit them, which suggests the neural network has picked up physical concepts such as depth and motion on its own.

Moreover, Tesla's model is not limited to RGB output; it can also work with semantic or geometric representations.

In short, based on past video, the model can predict different possible futures and generate the corresponding video, either when given a vehicle action prompt or with no prompt at all.
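
As a rough illustration of what "predicting different futures from the same past" means, here is a toy sketch of an action-conditioned rollout. It is not Tesla's model: the real system is a neural network over multi-camera video, whereas toy_predict_next below is a hand-written, hypothetical stand-in that simply shifts the scene sideways. The action names, function names, and frame sizes are all assumptions made for the example.

```python
import numpy as np

# Hypothetical action vocabulary, loosely echoing the two cases from the talk.
# The value is the number of pixels the car moves to the right per step.
ACTIONS = {"straight": 0, "lane_change_right": 1}

def toy_predict_next(past_frames, action):
    """Stand-in for a learned predictor: next frame from past frames and an action.

    When the car moves right, the scene appears to move left in the image,
    so the last frame is rolled left by the action's pixel offset.
    """
    last = past_frames[-1]
    return np.roll(last, -ACTIONS[action], axis=1)

def rollout(past_frames, action, steps):
    """Autoregressive rollout: every predicted frame is fed back in as context."""
    frames = list(past_frames)
    future = []
    for _ in range(steps):
        nxt = toy_predict_next(frames, action)
        frames.append(nxt)
        future.append(nxt)
    return np.stack(future)

# One observed frame: a 4x6 image with a bright "lane marking" in column 3.
observed = np.zeros((4, 6))
observed[:, 3] = 1.0

same_lane = rollout([observed], "straight", steps=2)
move_right = rollout([observed], "lane_change_right", steps=2)

print(np.argmax(same_lane[-1][0]))   # 3: the marking stays where it was
print(np.argmax(move_right[-1][0]))  # 1: the marking has drifted left in view
```

The same observed frame yields two different futures depending only on the action, which is the behavior the talk describes; a real world model would replace the hand-written shift with a learned network.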

So if Tesla has such a powerful model, why hasn't it received much attention before?

Because when he introduced it, Ashok said candidly that it was still a "work in progress". Its key value, he said, is as a neural network simulator that can roll out different future outcomes and track every moving object on the road.

In addition, when Musk showed off Tesla's video generation capabilities this time, he admitted that compute for FSD training is currently constrained, so video generated by the model has not been used for training.

However, Musk also said Tesla can train on it, and will start later this year once the company has spare compute.

At this point, the similarities between Tesla's world model and Sora are clear: both let AI understand, and even simulate, the real physical world through vision.

The difference is that OpenAI released Sora in the midst of its exploration and gave the world a jolt, while Tesla has applied this capability to autonomous driving. With a pure-vision approach and an end-to-end neural network trained on video data, FSD V12 can already hold its own against experienced drivers.

FSD and Sora are therefore two fruits of the same idea, AI understanding the world through visual perception: FSD is ultimately used to act, and Sora to generate video.

Different means to the same end.

Musk's insight is indeed impressive.

Sora Portal: https://openai.com/sora

[Smart Car Reference] Original content. Reprinting without this account's authorization is prohibited.
