
OpenAI's first AI video model explodes onto the scene, threatening an entire industry's livelihood


[New Zhiyuan Guide] Just now, OpenAI released its first AI video model, Sora: 60-second videos in a single continuous shot, with jaw-dropping generation quality. Netizens are exclaiming that AI video is about to change forever.


New Zhiyuan reports

Editor: Editorial Department

In the space of a dozen hours, OpenAI and Google released bombshell results one after another.

For the night owls in China who stayed up, it was a rollercoaster of a night.

Just now, OpenAI suddenly released its first text-to-video model: Sora. Simply put, AI video is about to change!

Not only can it create realistic and imaginative scenes from text instructions, it can also generate videos up to a full minute long, all in a single continuous shot.

AI video tools such as Runway Gen-2 and Pika are still struggling to stay coherent for a few seconds, while OpenAI has set an epic record.

Across the full 60-second shot, the protagonist and the background characters remain astonishingly consistent; the camera cuts freely between angles and the characters stay uncannily stable.

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

According to the official website: "By providing the model with foresight of many frames at a time, we solved a challenging problem."

Clearly, this technology is revolutionary, and even Sam Altman is hooked on it!

Not only did he tweet about it enthusiastically, he personally took requests from netizens: you supply the prompts, and I'll generate the videos one by one.


A wizard in a pointed hat and a blue robe embroidered with white stars casts a spell, shooting lightning from one hand while holding an ancient book in the other.


In a rustic Tuscan kitchen with cinematic lighting, a social-media grandma teaches you how to make delicious homemade gnocchi.


We'll take you on a street tour of a futuristic city, where high-tech and nature coexist in harmony in a unique cyberpunk style.

The city is immaculate, full of state-of-the-art futuristic trams, gorgeous fountains, giant holographic projections, and robots patrolling the streets.

Imagine a human guide from the future leading a group of curious alien visitors to show them the culmination of human ingenuity – an incomparably fascinating futuristic city.

Record-breaking on multiple fronts

With a deep understanding of language, Sora accurately interprets the needs expressed in the user's instructions and grasps how the requested elements exist in the real world.

Because of this, Sora can create characters that express vivid emotions!

It produces complex scenes that include not only multiple characters, but also specific types of movements, as well as precise and detailed depictions of objects and backgrounds.

Look at the pupils, eyelashes, and skin texture of the character below: so realistic you can't spot a single flaw, with no trace of AI about it.

From now on, what is the difference between video and reality?!


Prompt: Extreme close up of a 24 year old woman’s eye blinking, standing in Marrakech during magic hour, cinematic film shot in 70mm, depth of field, vivid colors, cinematic

In addition, Sora was able to design multiple shots in the same video while maintaining a consistent character and visual style.

Bear in mind that previous AI videos were all generated as a single shot.

This time, OpenAI maintains object consistency across multi-angle shot changes, which is nothing short of a miracle!

This level of multi-shot consistency is completely out of reach for both Gen-2 and Pika......


Prompt: A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.

For example: "Tokyo is bustling after the snow. The camera travels through the busy streets, following a few people enjoying the beautiful snow and shopping at the nearby stalls. Beautiful cherry blossom petals flutter in the wind along with snowflakes."

Based on this hint, Sora presents a dreamy scene in Tokyo in winter.

The drone footage follows a couple's leisurely stroll through the streets, with vehicles driving along the riverside road on the left and shoppers moving between a row of small shops on the right.


Prompt: Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.

It's fair to say Sora's results have leapt ahead to a terrifying level, leaving the cold-weapons era of AI video behind and completely outclassing its rivals.

Has the world model come true?

Most strikingly, Sora already shows the prototype of a world model.

By observing a large amount of data, it has learned a lot about the physical laws of the world.

Here's an impressive clip: the prompt describes "an animated scene of a short fluffy monster kneeling beside a red candle," specifying both the monster's movements and the mood of the video.

Sora then created a Pixar-like creature that seems to fuse the DNA of a Furby, a Gremlin, and Sulley from Monsters, Inc.

Shockingly, Sora's understanding of the physics of hair textures is jaw-droppingly accurate!

Back when "Monsters, Inc." was made, Pixar went to great lengths to render ultra-complex fur motion on its monsters; the technical team spent months on it.

Sora does this effortlessly, and no one ever taught it how!

"It learned about 3D geometry and consistency," says Tim Brooks, a research scientist on the project.

"It's not something we pre-set – it's learned entirely by looking at a lot of data."


Prompt: Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.

Thanks to the diffusion model used by DALL·E 3, together with GPT-4's Transformer engine, Sora can not only generate videos that meet specific requirements, but also demonstrates a spontaneous grasp of the grammar of filmmaking.

This ability is reflected in its unique talent for storytelling.

For example, in a video titled "a coral reef world full of colorful fish and marine life, carefully built from paper craft," project researcher Bill Peebles noted that Sora managed to advance the story through its camera angles and timing.

"There are actually a lot of camera transitions in the video – the shots aren't stitched together in post, they're generated by the model in one go," he explains. "We didn't specifically tell it to do that, but it did it automatically."


Prompt: A gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.

The current model isn't perfect, though. It can struggle to simulate the physics of complex scenes and sometimes fails to grasp cause and effect in a given situation. For example, after someone bites a cookie, the cookie may still appear intact.


In addition, models can make mistakes in handling spatial details, such as distinguishing between left and right, and may not be accurate in describing events over time, such as specific camera movements.


Fortunately, it's not perfect.

Otherwise, how could we still tell the virtual from the real?


Isn't this reality?

But there's no denying the unsettling truth before us: a model that can understand and simulate the real world means that AGI is not far off.

"The only real video generation job"

Industry insider Zhang Qixuan commented: "Sora is the only real video generation work I've seen that goes beyond generating empty B-roll."

In his view, there is a generational gap between Sora and the likes of Pika and Runway, and the field of video generation now belongs to OpenAI. Perhaps one day the field of 3D video will feel the same fear.

Netizens were shocked speechless: "The next decade is going to be a crazy decade."


"It's over, I'm going to lose my job."


"The entire stock-footage industry will die with this release......"


OpenAI just can't stop killing startups, can it?


"A nuclear explosion is about to occur in Hollywood."


AI filmmakers and their current projects.


Technical introduction

Sora is a diffusion model: it starts from what looks like static noise and progressively generates a video through a multi-step denoising process.

Sora is not only capable of generating a full video in one go, but also extending the video that has already been generated.

By enabling the model to anticipate multiple frames, the team overcame the challenge of ensuring consistency of the subject in the video, even if it temporarily disappeared.
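The denoising loop and joint multi-frame generation described above can be sketched roughly as follows. All names, shapes, and the trivial denoiser here are illustrative assumptions; Sora's actual architecture and weights have not been published:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t):
    """Hypothetical stand-in for the learned denoiser. A real model
    would predict and subtract noise conditioned on the text prompt
    and step t; here we simply shrink x so the loop is runnable."""
    return x * 0.5

def generate_video(num_frames=16, height=8, width=8, channels=3, steps=50):
    # Start from pure Gaussian noise covering the WHOLE clip, so all
    # frames are denoised jointly. Generating many frames at once is
    # what lets the model keep a subject consistent even when it
    # temporarily leaves the frame.
    x = rng.standard_normal((num_frames, height, width, channels))
    for t in reversed(range(steps)):
        x = denoise_step(x, t)
    return x

video = generate_video()
print(video.shape)  # (16, 8, 8, 3)
```

Extending an already generated clip would amount to re-running the same loop with the existing frames held fixed as conditioning, denoising only the new frames.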

Similar to the GPT model, Sora employs a Transformer architecture, which enables superior performance scaling.

OpenAI breaks down videos and images into smaller data units called "patches", each of which is equivalent to a "token" in GPT.

This unified data representation enables the training of diffusion transformers on a wider range of visual data, covering different durations, resolutions, and aspect ratios.
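As a toy illustration of the patch idea (the patch sizes and tensor shapes below are arbitrary assumptions, not OpenAI's actual values), a video tensor can be cut into spacetime patches and flattened into a token sequence like this:

```python
import numpy as np

def patchify(video, pt=2, ph=4, pw=4):
    """Split a video of shape (T, H, W, C) into spacetime patches of
    size pt x ph x pw, flattened into a sequence of tokens with shape
    (num_patches, patch_dim), analogous to tokens in GPT."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch axes together
    return x.reshape(-1, pt * ph * pw * C)  # one flat token per patch

video = np.zeros((16, 32, 32, 3))
tokens = patchify(video)
print(tokens.shape)  # (512, 96): 8*8*8 patches, each 2*4*4*3 values
```

Because any duration, resolution, or aspect ratio reduces to the same kind of token sequence, one Transformer can be trained on all of them.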

Sora builds on research from the DALL·E and GPT models. It uses DALL·E 3's recaptioning technique, which generates detailed descriptive captions for the visual training data, enabling the model to follow the user's text instructions more faithfully when generating video.

In addition to generating video based on text commands, the model can also convert existing static images into videos, animating the content in the images with precision and detail. The model can also extend existing video or complete missing frames.

Sora lays the groundwork for models that understand and simulate the real world, which OpenAI sees as an important step toward artificial general intelligence (AGI).

Selected works

A fascinating view from the window as a train passes through the suburbs of Tokyo.


Prompt: Reflections in the window of a train traveling through the Tokyo suburbs.

In the snowy grasslands, several huge woolly mammoths slowly move forward, their long furs fluttering gently in the breeze. In the distance, snow-covered trees and majestic snow-capped mountains are visible, and the afternoon sun pierces through the thin clouds, adding a warm glow to the scene. The low-angle shots make these huge furry animals particularly spectacular, and the depth-of-field effect is captivating.


Prompt: Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.

The drone looks down from the air over the rugged cliffs along Big Sur's Garay Point beach, where the waves crash against the rocks to form white crests and the golden glow of the setting sun illuminates the rocky shore. In the distance there is a small island with a lighthouse, and the cliff's edge is covered with green vegetation. The steep drop from the road down to the beach and the cliff edges jutting out over the sea showcase the raw beauty of the coast and the rugged scenery of the Pacific Coast Highway.


Prompt: Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff’s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff’s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway.

Aerial view of Santorini in the blue hour, showing the stunning architecture of the white Cycladic buildings and the blue dome. The view of the caldera is breathtaking, and the lights create a beautiful and serene atmosphere.


Prompt: Aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes. The caldera views are breathtaking, and the lighting creates a beautiful, serene atmosphere.

A young man in his 20s sits on a cloud in the sky, immersed in a book.


Prompt: A young man at his 20s is sitting on a piece of cloud in the sky, reading a book.

A group of lively Golden Retriever puppies frolicking in the silvery white snow, their curious little heads poking out of the snow from time to time, adorned with snowflakes.

Prompt: A litter of golden retriever puppies playing in the snow. Their heads pop out of the snow, covered in snow.

In the rows of brightly colored buildings in Burano, Italy, a cute Dalmatian is looking curiously out through a window. At the same time, the streets are bustling with people, some on foot and some on bicycles.


Prompt: The camera directly faces colorful buildings in burano italy. An adorable dalmation looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings.

A tilt-shift photograph of a construction site filled with workers, equipment, and heavy machinery.


Prompt: Tiltshift of a construction site filled with workers, equipment, and heavy machinery.

In a petri dish, a bamboo forest grows, in which red pandas run happily.


Prompt: A petri dish with a bamboo forest growing within it that has tiny red pandas running around.

A cartoon kangaroo is dancing on the disco dance floor.


Prompt: A cartoon kangaroo disco dances.

In the middle of a cup of coffee, two pirate ships engage in a fierce battle, a hyper-realistic close-up video.


Prompt: Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee.

Expert speculation: is a game engine involved?

Soumith Chintala, co-founder of PyTorch, speculated: "Based on all the user-requested videos Sam Altman posted, it appears that Sora is powered by a game engine, generating the content and parameters for that engine."

Jim Fan, a senior scientist at NVIDIA, expressed some of his opinions on the new Sora model:

Sora is a data-driven physics engine. It is a simulation of many worlds, both real and imaginary. The simulator learns complex rendering, "intuitive" physics, long-term reasoning, and semantic understanding through denoising and gradient learning.

I wouldn't be surprised if Sora had been trained on a lot of synthetic data using Unreal Engine 5. It has to be!


Similarly, Yao Fu, a PhD student at the University of Edinburgh, said: "Generative models learn the algorithms that generate data, rather than memorizing the data itself. Just as a language model encodes the algorithm that generates language (the one in your brain), a video model encodes the physics engine that generates the video stream. A language model can be seen as an approximation of the human brain, while a video model approximates the physical world."


Reinventing the video industry

Admittedly, it may be a long time before text-to-video technology threatens traditional filmmaking:

You can't simply stitch together 120 one-minute videos generated by Sora to make a coherent movie, because these models don't ensure continuity of content.


However, this does not prevent Sora and similar programs from revolutionizing social platforms such as TikTok.

"Making a professional film requires a lot of expensive equipment. This model will make it possible for the average person to produce high-quality video content for social media."

Resources:

https://twitter.com/OpenAI/status/1758192957386342435

https://openai.com/sora
