
OpenAI's first video generation model is here: smooth, high-definition videos up to 1 minute long, with stunning results

Edited by: Bi Luming

According to OpenAI's official website, OpenAI has released its first video generation model, Sora, which inherits DALL·E 3's image quality and instruction-following capabilities and produces HD video up to one minute long.


In an AI-imagined Spring Festival for the Year of the Dragon, crowds throng a street lined with red flags.

Children follow the dragon dance team and look up curiously, while many onlookers take out their phones to film as they walk along; the many characters in the scene each behave independently.


A stylish woman strolls through the streets of Tokyo, surrounded by warm glowing neon and animated city signage.


A movie trailer follows the adventures of a 30-year-old spaceman in a red knitted motorcycle helmet as he travels between blue skies, white clouds, and a salt desert; the footage has a distinctive cinematic style, shot on 35mm film in vivid colors.


In a vertical, extreme close-up shot, the lizard appears in rich detail.


OpenAI says it is teaching artificial intelligence to understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring real-world interaction. To that end, it is introducing Sora, a text-to-video model that can generate videos up to a minute long while maintaining visual quality and adhering to the user's prompt.

For now, Sora is open only to select testers, who will assess potential harms and risks in critical areas. OpenAI has also invited a group of visual artists, designers, and filmmakers to provide feedback on how to advance the model and better serve creative professionals. OpenAI says it is sharing its research progress early in order to collaborate with and get feedback from people outside the company, and to give the public a sense of the AI capabilities on the horizon.

Sora can generate complex scenes with multiple characters, specific types of motion, and accurate details of subjects and backgrounds. The model understands not only what the user asks for in the prompt, but also how those things exist in the physical world. Its deep understanding of language lets it interpret prompts accurately and generate compelling characters that express vivid emotions. Sora can also create multiple shots within a single generated video while keeping characters and visual style consistent.

For example, as a swarm of paper airplanes flies through the woods, Sora knows what should happen when they collide and renders the resulting changes in light and shadow.

A flock of paper airplanes flutters through a dense jungle, weaving among the trees like migrating birds.


Sora can also create multiple shots in a single video, relying on its deep understanding of language to interpret prompts accurately while preserving characters and visual style across shots.

OpenAI is also candid about Sora's current weaknesses: the model may struggle to accurately simulate the physics of complex scenes, and may not understand specific instances of cause and effect. For example, given "five gray wolf pups frolicking and chasing each other on a remote gravel road," the number of wolves changes over the clip, with some appearing or disappearing out of thin air.

In addition, the model can confuse spatial details in a prompt, such as left and right, and may struggle with precise descriptions of events that unfold over time, such as following a specific camera trajectory.

For example, with the prompt "the basketball goes through the hoop and then explodes," the ball is not properly stopped by the rim.


On the technical side, OpenAI has so far revealed little; a brief summary follows:

Sora is a diffusion model: it starts from noise and can generate an entire video at once or extend a video's length.

The key is predicting multiple frames at a time, which keeps subjects consistent even when they temporarily leave the frame.
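As a rough intuition for how a diffusion sampler works, the toy loop below starts from pure noise and repeatedly subtracts predicted noise until a clean signal emerges. Everything here (the 1-D "signal", the perfect noise predictor, the step schedule) is invented for illustration and is not Sora's actual architecture:

```python
import random

# Toy 1-D "denoising" loop illustrating the general idea behind diffusion
# sampling: begin with Gaussian noise and iteratively remove predicted noise.
# A real model learns the noise predictor from data; here we cheat and
# compute it exactly, so the loop provably converges to the target.

def toy_denoise(target, steps=50, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for t in range(steps):
        # An oracle noise predictor: the "noise" is whatever separates the
        # current sample from the clean signal.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        alpha = 1.0 / (steps - t)  # schedule: larger corrections near the end
        x = [xi - alpha * ni for xi, ni in zip(x, predicted_noise)]
    return x

sample = toy_denoise([0.2, -0.5, 1.0])
print([round(v, 3) for v in sample])  # → [0.2, -0.5, 1.0]
```

Generating a video "all at once" corresponds to denoising every frame jointly rather than frame by frame, which is why subjects can stay consistent even when briefly occluded.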

Like the GPT models, Sora uses a Transformer architecture, which gives it strong scaling properties.

For data, OpenAI represents videos and images as collections of smaller units called patches, analogous to tokens in GPT.

This unified representation lets models be trained on a wider range of visual data than before, spanning different durations, resolutions, and aspect ratios.
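The arithmetic of this tokenization is easy to sketch. The patch sizes below (2 frames × 16 × 16 pixels) are assumptions for illustration only; OpenAI has not published Sora's actual values:

```python
# Back-of-the-envelope sketch of turning a video into "spacetime patch"
# tokens, analogous to text tokens in GPT. Patch dimensions are assumed,
# not Sora's published values.

def count_patches(frames, height, width, channels=3, pt=2, ph=16, pw=16):
    """Return (number of patch tokens, flattened size of each patch)."""
    assert frames % pt == 0 and height % ph == 0 and width % pw == 0
    n_tokens = (frames // pt) * (height // ph) * (width // pw)
    patch_dim = pt * ph * pw * channels
    return n_tokens, patch_dim

# A 16-frame 256x256 RGB clip under these assumed patch sizes:
tokens, dim = count_patches(frames=16, height=256, width=256)
print(tokens, dim)  # → 2048 1536
```

Because any duration, resolution, or aspect ratio that divides evenly into patches yields the same kind of token sequence, one model can train on heterogeneous visual data.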

Sora builds on past research into the DALL·E and GPT models. It uses DALL·E 3's recaptioning technique, which generates highly descriptive captions for the visual training data, allowing the model to follow the user's text instructions more faithfully.

Beyond generating videos from text instructions alone, the model can also animate an existing still image, accurately bringing its contents to life while attending to small details.

The model can also take an existing video and extend it or fill in missing frames; see the technical paper (to be released later) for more information.

Sora serves as a foundation for models that can understand and simulate the real world, a capability OpenAI believes will be an important milestone on the path to AGI.

Compiled by National Business Daily from OpenAI's official website.
