OpenAI's Sora arrives: another ChatGPT moment on the road to AGI, and GPT-4 may be killed too

OpenAI's text-to-video model Sora has flooded everyone's feeds.

The last time feeds were flooded this wildly was probably when people first saw ChatGPT. Just a few hours before Sora, Google launched its strongest LLM, Gemini 1.5, and tried to claim that it had finally killed GPT-4. Apparently no one is paying attention to that now.

Because after reading about Sora, you may find that OpenAI itself wants to use it to kill GPT-4 first.

Everyone can create their own world

Let's start with Sora.

People have been looking forward to GPT-5, but Sora has caused no less of a sensation than a GPT-5 release would.

As OpenAI's first text-to-video model, Sora can generate up to one minute of video from text instructions or still images, including intricate scenes, vivid character expressions, and complex camera movements. It can also extend existing videos or fill in missing frames.

The 60 seconds of video per prompt compares with 3 seconds for Pika Labs, 4 seconds for Meta's Emu Video, and 18 seconds for Runway's Gen-2. And judging from the official demos, Sora's results are quite amazing in both smoothness and detail.

Take, for example, the 14-second video of a snowy Tokyo scene in the official tweet.

Prompt: Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.

A fashionably dressed woman strolls through the neon-lit streets of Tokyo, with puddles on the ground reflecting the lights.

The facial features and skin are rendered very realistically, especially the acne scars and nasolabial folds. The detail is amazing.

Mammoths trudge slowly across a glacial snowfield, with snow mist rising behind them.

An innocent, mischievous 3D-animated little monster beside a candle flame, with full marks for lighting, expression, and fur detail:

A close-up of a 24-year-old woman's eye, realistic enough to pass for real footage.

A drone view of waves crashing against the coastal cliffs at Garay Point in Big Sur, as the setting sun casts a golden glow.

Time-lapse video of flowers blooming on the windowsill:

People take to the streets with a dragon dance to celebrate Chinese New Year.

A cute kitten is on hand to soothe morning grumpiness.

Happy puppy running on the street at night.

Two miniature pirate ships face off in a cup of coffee.

Rare "historical footage" from California's gold rush era has surfaced. Does it look real?

Sora is still in testing and is open only to a select group of evaluators, visual artists, designers, and filmmakers. Those who have been granted access are already going wild with it.

Sam Altman reposted a "golden zoo" video that netizens had made with Sora, riffing on his own "what" meme:

He also invited everyone to suggest prompts they wanted to see turned into Sora videos, which the team would generate on the spot. The post drew more than 8,000 replies.

Netizens let their imaginations run wild, asking to see sea creatures compete in a cycling open.

Two golden retrievers are on a podcast in the mountains with headphones on.

Of course, people did not forget to bring up Ilya, whose whereabouts remain a mystery, asking for a video of "the real world through Ilya's eyes."

However, OpenAI also said that while Sora has a deep understanding of language, interprets prompts accurately, generates expressive content, and can create multiple shots within one video while keeping characters and visual style consistent, it still has some unavoidable weaknesses.

For example, it may struggle to accurately simulate the physics of a complex scene and may not understand specific instances of cause and effect: "a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark."

The model may also confuse the spatial details of a prompt, such as mixing up left and right, and may struggle to represent events that unfold over time, such as following a specific camera trajectory.

Sora also uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model follows the user's text instructions more faithfully in the generated video.

It can generate an entire video at once, or extend an already generated video to make it longer. By giving the model foresight of many frames at a time, OpenAI says it solved the challenging problem of keeping a subject consistent even when it temporarily goes out of view.

Regarding safety, OpenAI said it is working with experts in areas such as misinformation, hateful content, and bias to adversarially test the model. It is also developing tools to help detect misleading content and identify whether a video was generated by Sora. Text prompts that violate the usage policy, such as requests for violence, hateful content, or infringement of others' intellectual property, will be rejected.

In addition, the existing safety methods built for products that use DALL·E 3 apply to Sora as well.

"Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That's why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time."

OpenAI is confident in Sora. It sees the model as laying the foundation for models that understand and simulate the real world, a capability it calls "an important milestone for achieving AGI."

Netizens, for the umpteenth time, mourned the companies working in this space:

"OpenAI just can't stop killing startups."

"Oh my God, now we have to figure out what's true and what's fake."

"My job is gone."

"The entire stock footage industry has been wiped out, rest in peace."

A world model that can kill GPT-4?

As always, OpenAI did not give a very detailed technical explanation, but even a few words are enough to set the imagination running.

What caught our attention most was how the data is processed.

Sora is a diffusion model that uses a GPT-like Transformer architecture. To unify text data and video data during training, OpenAI said that when processing images and videos it splits them into small units called patches, which play the role that tokens, the basic unit, play in an LLM.

This is an important technical detail. Using patches as the basic unit of processing lets the model handle a wide range of visual data more efficiently, covering different durations, resolutions, and aspect ratios.
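The patch idea can be illustrated with a minimal sketch. OpenAI has not published Sora's implementation, so everything here is an assumption made for illustration: the function name `video_to_patches`, the patch sizes, and the choice to cut raw grayscale frames directly (a real system would more likely work on compressed representations and use an array library). The point is only that clips of different lengths and aspect ratios all turn into sequences of same-sized patches, much as texts of different lengths turn into sequences of tokens.

```python
def video_to_patches(video, pt=2, ph=16, pw=16):
    """Cut a video into flattened spacetime patches.

    `video` is a nested list indexed as video[frame][row][column].
    `pt`, `ph`, `pw` are hypothetical patch sizes (frames, rows, columns).
    Returns a list of patches, each a flat list of pt * ph * pw pixel values.
    """
    t, h, w = len(video), len(video[0]), len(video[0][0])
    if t % pt or h % ph or w % pw:
        raise ValueError("video dimensions must be divisible by the patch sizes")

    patches = []
    # Walk over the grid of patch origins in time, height, and width.
    for t0 in range(0, t, pt):
        for y0 in range(0, h, ph):
            for x0 in range(0, w, pw):
                # Flatten one pt x ph x pw block into a single "token".
                patch = [
                    video[ti][yi][xi]
                    for ti in range(t0, t0 + pt)
                    for yi in range(y0, y0 + ph)
                    for xi in range(x0, x0 + pw)
                ]
                patches.append(patch)
    return patches


def make_video(frames, height, width):
    """Build a dummy grayscale clip whose pixel value encodes its position."""
    return [
        [[float(f * height * width + y * width + x) for x in range(width)]
         for y in range(height)]
        for f in range(frames)
    ]


# Two clips with different durations and aspect ratios.
wide = make_video(frames=8, height=64, width=128)
tall = make_video(frames=4, height=128, width=64)

wide_tokens = video_to_patches(wide)
tall_tokens = video_to_patches(tall)

# The sequence lengths differ, but every patch has the same size.
print(len(wide_tokens), len(wide_tokens[0]))
print(len(tall_tokens), len(tall_tokens[0]))
```

In this sketch the two clips yield sequences of different lengths whose elements all have the same size, which is what lets a single Transformer consume both without cropping or resizing them to a common shape.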

Judging from the stunning final results, it is hard not to conclude that the ability to understand language can be transferred to the understanding of more forms of data.

The quality of the earlier DALL·E 3 is widely credited, to a large extent, to the language ability OpenAI accumulated through GPT, which is generations ahead of the field. Even for a model that outputs images, better language ability is crucial. The same goes for today's video model.

As for how it does this, many industry experts have made the same guess: its training data includes output from Unreal Engine 5, the most cutting-edge physics engine in gaming. Put simply and crudely, once language ability is strong enough, the generalization it brings lets the model learn directly from the images and video generated by the engine and the patterns they embody. It can then use what it has learned to give instructions, in the form the engine understands best, to visual modules built on the engine's technology, producing the realistic video we see, which appears to reflect an "understanding" of the physical world.

In light of this speculation, one sentence in OpenAI's short introduction seems even more important:

"Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI."

Understanding, reality, the world.

Isn't this exactly the world model people have been arguing about, the one thing with the potential to "kill" GPT-4? Now OpenAI has built a prototype and put it in front of you.

The model appears to have learned 3D geometry and consistency. This was not built in by the OpenAI training team; it emerged naturally from observing large amounts of data. Tim Brooks, an OpenAI scientist responsible for Sora's training, said AGI would be able to simulate the physical world, and Sora is a key step in that direction.

Obviously, in OpenAI's eyes, it is not just a text-to-video model but something bigger.

To push the argument one step further: language is the basis for understanding everything, and only once video is understood will the world model arrive.

Maybe this is what is more terrifying than today's flooded feeds and the feeling that "reality no longer exists." This could be another ChatGPT moment for humanity on the road to AGI.
