Debunking Sora's viral short film: the "best release in history" was largely human-made

Author: New List

Sora has stumbled again.

Remember the seven videos OpenAI created with visual artists, filmmakers, designers, and other creative professionals to showcase Sora?

Among them, the short film "Air Head" (hereinafter "Balloon Man"), created by multimedia production company Shy Kids, sparked widespread discussion thanks to its complete plot and strong narrative. On major Chinese platforms, netizens praised the work without reservation; some even hailed it as "the best release in Sora's history".

On April 26, X blogger Bilawal Sidhu posted that "Balloon Man" was not generated by Sora in one click: the actual production relied heavily on rotoscoping and manually created visual effects.

To date, this Sora "debunking post" has exceeded 1.9 million views on X.

American comedian and animator Sway Molina commented that he now has a trust problem with OpenAI.

Some netizens pointed out that when OpenAI released these videos, it did not disclose that they had been edited in post, which arguably misled the audience.

Sora is OpenAI's first text-to-video model. Since its release in February this year, it has attracted widespread attention and discussion across the industry.

From operating official social media accounts to partnering with professional creators, industry KOLs, and even well-known institutions such as TED, OpenAI releases creative videos generated by Sora every so often to keep the topic hot and whet everyone's appetite, according to observations by "Number One AI Player".

OpenAI's officially operated TikTok account

However, beyond OpenAI's official technical report and demo videos, most people have had no opportunity to actually try Sora. Recently, the Balloon Man production team was interviewed by online media outlet fxguide, revealing Sora's limitations in video generation based on their firsthand experience.

With the hype squeezed out, what is Sora really like to use, and how do front-line AIGC creators view Sora's stumbles?

Interacting with Sora is a "gacha" draw that demands detailed prompts

The production team for Balloon Man consisted of three members: Sidney Leeder as producer, Walter Woodman as writer and director, and Patrick Cederberg handling post-production. All three are from Shy Kids, a multimedia production company that has been nominated for an Emmy and an Oscar.

Even in the hands of such a professional team, working with Sora was full of twists and turns.

To maintain consistency across the short film, Shy Kids' workflow had roughly two parts: first interact with Sora to generate raw footage, then use professional film production tools such as After Effects (AE) for post-editing and retouching.

Users interact with Sora mainly through text prompts; ChatGPT converts the user's input into longer strings that trigger video generation. As of mid-April, Sora did not yet support multimodal inputs.

The first claim to be debunked: Sora does not show the superhuman ability to maintain subject consistency seen in the promotional videos.

Patrick, who handled post-production, revealed that the team's workaround for Balloon Man was to "describe the objects in as much detail as possible in the text prompts", such as the character's costume and the type of balloon.

Patrick. Image source: fxguide.com

Since Sora doesn't offer any feature for controlling content consistency across shots, the team's overall experience was still "gacha": even with the same prompt, the first and second runs produced very different results.

The reason is that when an AI model such as Sora generates a video, it does not simply copy an existing image or clip; it synthesizes new content from object features learned from the training data.

These features constitute the object's "latent space". In deep learning, a latent space is a compressed, abstract representation of a concept.

Patrick gave an example.

Suppose you ask Sora to generate a long shot of a kitchen with a banana on the table. The AI needs an implicit understanding of the features a "banana" may have, such as "yellow", "curved", and "dark at the ends".

Because the latent space is compressed, it is much smaller than the collection of all banana images that could actually exist. This lets the AI generate banana images efficiently without maintaining a huge "banana stock library".

Each time the AI runs, it interprets or samples the latent space differently, which is why, with the same prompt, the resulting banana can differ each time.

Therefore, providing detailed and explicit instructions, i.e., "describing the object in as much detail as possible in the text prompt", helps the AI understand what kind of picture you need.
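
The sampling behavior described above can be sketched in a few lines. This is a toy illustration, not Sora's actual architecture: `generate_banana_features` and its attribute names are invented for the example. The point is only that each run draws a fresh random latent sample, so the same prompt yields different outputs.

```python
import numpy as np

# Toy illustration (NOT Sora's architecture): a diffusion-style generator
# starts from a random latent vector, so the same prompt can yield
# different outputs on different runs.
def generate_banana_features(prompt: str, seed: int) -> dict:
    rng = np.random.default_rng(seed)       # each run samples the latent space anew
    z = rng.standard_normal(8)              # random point in a tiny "latent space"
    # Decode the latent into made-up visual attributes for illustration.
    return {
        "prompt": prompt,
        "hue": 50 + 10 * z[0],              # around banana-yellow
        "curvature": abs(z[1]),             # how bent the banana is
        "tip_darkness": 1 / (1 + np.exp(-z[2])),  # 0..1, darkness of the ends
    }

a = generate_banana_features("a banana on a kitchen table", seed=1)
b = generate_banana_features("a banana on a kitchen table", seed=2)
print(a != b)  # True: same prompt, different latent sample, different image
```

Detailed prompts narrow the region of latent space the model samples from, which is why verbose descriptions improve consistency even though they cannot eliminate randomness.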

Post-production only added work: the three of them spent nearly two weeks completing "Balloon Man"

Shy Kids' methodology was to edit like a documentary: first generate a large number of shots around the script, then weave a new story from that material rather than follow the script strictly.

Patrick estimates they generated hundreds of clips, each about 10 to 20 seconds long; the ratio of raw footage to what ended up in the final cut was roughly 300:1.

What the AI couldn't do had to be patched by hand.

1. Character consistency

Sora couldn't ensure the yellow balloon head stayed the same in every shot: even when the prompt specified a yellow balloon, the result would come back the wrong color, or a face would appear on the balloon.

Raw frame output by Sora

Raw frame output by Sora

Since many balloons in real images come with strings, Sora also associated strings with balloons, so the generated balloon man often had a string on his chest, which did not match the team's vision of the character.

All of these "image blemishes" had to be removed in post-production.

2. Long render times, with resolution upscaled manually in post

Balloon Man uses Sora-generated footage, but much of it was graded and reprocessed. For efficiency and quality, the team generated initial footage at low resolution, then used the AI tool Topaz to upscale it.

Patrick explained that Sora supports output up to 720p, with 1080p available but at longer render times. To speed things up, they generated all of Balloon Man's content at 480p.
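
Topaz's AI upscaler is proprietary, so as a rough stand-in the sketch below shows the simplest possible upscale (nearest-neighbor interpolation) of a 480p-sized frame. It only illustrates the resize step in the pipeline, not the learned detail-reconstruction Topaz performs; the frame dimensions are illustrative.

```python
import numpy as np

# Nearest-neighbor upscale: each output pixel copies its nearest source
# pixel. AI upscalers like Topaz instead hallucinate plausible detail,
# but the geometry of the resize is the same.
def upscale_nearest(frame: np.ndarray, factor: float) -> np.ndarray:
    h, w = frame.shape[:2]
    new_h, new_w = int(h * factor), int(w * factor)
    rows = (np.arange(new_h) / factor).astype(int)  # source row per output row
    cols = (np.arange(new_w) / factor).astype(int)  # source col per output col
    return frame[rows][:, cols]

frame_480p = np.zeros((480, 854, 3), dtype=np.uint8)  # dummy 854x480 frame
frame_up = upscale_nearest(frame_480p, 1.5)
print(frame_up.shape)  # (720, 1281, 3): roughly 720p height
```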

Sora supports rendering shots of different durations, such as 3, 5, 10, or 20 seconds, up to one minute. Render times vary with duration and cloud usage demand.

Patrick mentioned that each render generally took about 10 to 20 minutes. The team tended to render full 20-second shots to leave more room for trimming and editing in post, increasing the odds of a usable picture.

3. Camera movement is a blind spot for AI

Beyond resolution, Sora also lets users choose an aspect ratio, such as portrait or landscape. This feature was used in the key shot that reveals the protagonist's true identity. But Sora couldn't natively render camera moves like a pan, so the team rendered the shot in portrait mode and then manually created an upward pan by cropping it in post.
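
The cropping workaround reduces to simple arithmetic: render tall, then slide a landscape crop window from the bottom of the frame to the top over the length of the shot. The function and frame sizes below are illustrative assumptions, not the team's actual numbers.

```python
# Fake an upward camera pan: crop a landscape window out of a portrait
# frame and move the window from the bottom of the frame to the top.
def pan_crop_top(frame_idx: int, total_frames: int,
                 src_height: int, crop_height: int) -> int:
    """Top edge of the crop window for a given output frame."""
    travel = src_height - crop_height           # pixels the window can move
    progress = frame_idx / (total_frames - 1)   # 0.0 -> 1.0 across the shot
    return round(travel * (1 - progress))       # start at bottom, end at top

# A 1080x1920 portrait frame cropped to a 1080x608 landscape window,
# panned over 120 frames (5 seconds at 24 fps):
tops = [pan_crop_top(i, 120, src_height=1920, crop_height=608) for i in range(120)]
print(tops[0], tops[-1])  # 1312 0
```

In practice this per-frame offset would drive a crop in a compositor such as AE; the same idea also works with FFmpeg's time-expression crop filter.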

For generative AI tools, the metadata accompanying training data is a valuable source of information. For example, when training on still photos, camera metadata provides focal length, aperture, and much other key information needed for model training.

However, concepts in film footage such as "tracking", "panning", "tilting", or "pushing in" cannot be captured from metadata.

Patrick points out that earlier versions of Sora generated camera angles rather randomly, and that typing a "camera pan" prompt had only about a 60% chance of producing the right result.

"Nine different people on a movie set could have nine different ways to describe a shot, and OpenAI's researchers didn't really think like filmmakers before inviting artists to use the tool," Patrick added.

Sora is not alone in failing to understand the jargon of video production; nearly all major AI video companies face the same challenge. Runway, an AI video company, is further ahead in offering a user interface for describing camera moves, but its footage falls short of Sora's in quality and length.

4. Lighting and color grading: post-production effects do the heavy lifting

Shy Kids used the term "35mm film" in their prompts and found it produced a more consistent picture.

In addition, by prompting "High Contrast" or "Key Light", Sora can also generate the corresponding visual effect.

The short film's overall visual style is based on Sora's footage, with grain and flicker effects added in post to mimic the look of traditional film stock. At this step, Sora provides no extra channel options, such as mattes or depth channels.
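
A minimal sketch of that kind of post effect, assuming NumPy and made-up parameter values: overlay per-pixel noise for grain, and vary whole-frame brightness per frame for flicker. A real grade in AE would be far more involved; this only shows the principle.

```python
import numpy as np

# Mimic film stock: per-pixel Gaussian noise ("grain") plus a small
# random per-frame brightness multiplier ("flicker").
def filmify(frame: np.ndarray, rng: np.random.Generator,
            grain_strength: float = 8.0, flicker: float = 0.05) -> np.ndarray:
    grain = rng.normal(0.0, grain_strength, size=frame.shape)  # per-pixel noise
    gain = 1.0 + rng.uniform(-flicker, flicker)                # whole-frame flicker
    out = frame.astype(np.float64) * gain + grain
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
frame = np.full((480, 854, 3), 128, dtype=np.uint8)  # dummy mid-gray frame
grainy = filmify(frame, rng)
print(grainy.shape == frame.shape)  # True: same size, noisier pixels
```

Applying `filmify` with the same `rng` across successive frames gives each frame different grain and a slightly different brightness, which is what reads as "film" to the eye.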

5. Prompt restrictions due to copyright

Sora will not generate content that infringes copyright or appears to infringe likeness rights. For example, a prompt such as "A futuristic spaceship from a 35mm movie, a man walks forward with a lightsaber" will be refused because it is too similar to Star Wars. Even "Hitchcock zoom", now a standard filmmaking term, was rejected by Sora over copyright concerns.

6. Adjusting shot speed

During the production of Balloon Man, one unexpected phenomenon was that many of Sora's raw shots came out in slow motion. The exact reason is unknown, but the production team had to adjust the speed of these shots.

Patrick said, "A lot of shots were generated at 50 to 75 percent speed, and we spent a lot of time retiming them so the whole film wouldn't feel like one big slow-motion project."
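
Retiming like this amounts to resampling frame indices. The sketch below is a generic illustration, not the team's actual tooling: playing a clip at 2x skips every other source frame, restoring normal speed to footage that was generated at roughly half speed.

```python
# Retime a clip by sampling source-frame indices: speed > 1 speeds the
# clip up (fewer output frames), speed < 1 slows it down.
def retime_indices(n_src_frames: int, speed: float) -> list[int]:
    """Source-frame index for each output frame when playing at `speed`x."""
    n_out = int(n_src_frames / speed)
    return [min(int(i * speed), n_src_frames - 1) for i in range(n_out)]

# A 100-frame shot generated at half speed, restored by playing at 2x:
idx = retime_indices(100, speed=2.0)
print(len(idx), idx[:5])  # 50 [0, 2, 4, 6, 8]
```

Professional tools interpolate between frames instead of dropping them, but the index mapping above is the core of any speed change.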

7. Sound effects and narration

Beyond the visuals, the short film's background music, "The Wind", is an original work by the Shy Kids team, and the narration was recorded by Patrick himself. He adds, "Sometimes I write an extra line of script to change the pace of the film, then record it and use Sora to generate the matching shots." This is another powerful use of the tool in post: when you need to fill a gap or spark ideas, Sora can generate content quickly.

Reportedly, the three-person Shy Kids team took about 1.5 to 2 weeks to complete "Balloon Man". They are currently working on a sequel.

Rather than relying on Sora to generate a film outright, the team's next direction is a more "technical" use of Sora: as an auxiliary visual-effects tool, combined with traditional production methods such as live-action shots and AE compositing.

How far have AI video tools actually come?

In fact, this is not the first time Sora has stumbled.

In February this year, shortly after Sora's release, a number of external test videos leaked. Bloomberg, among the first to gain test access, published an article saying Sora does not understand the laws of physics, and that its generation speed and results fell short of expectations, far from amazing.

Bloomberg's test: a monkey grows a parrot's tail

Past criticism of Sora mostly targeted individual clips, but the behind-the-scenes production of "Balloon Man" exposes the limitations of current AI video tools, Sora included, from a film-production perspective.

"It's still early days, the cost is considerable, and professional users' traditional skills are the underlying support. There's a lot of post-production work here, which proves once again that these advanced tools are not something ordinary creators can control," said AIGC artist Tuto. Judging from the Shy Kids team's account, he believes Sora has not yet reached the height of the so-called world simulator: there are still many flaws in its output, and it remains a long way from real commercial or film and television production.

In his view, Balloon Man is more of an experimental exploration by professional players. "The decisive factor in content quality is a professional, complete production team. The technology is still early, so topic selection and creation still revolve around showcasing Sora's potential; it has not truly reached the stage where technology serves the content."

Drawing on front-line experience, experimental filmmaker and AIGC artist Hai Xin believes AI video tools are not reserved for professional creators: "At this stage, traditional film and advertising practitioners may be better placed to commercialize it, but more and more creators without film backgrounds are also using AI video to express themselves."

On the waste rate of AI-made video at this stage, Hai Xin said that even certain specific shots, such as empty scenery shots of the moon turning or flowers blooming, already require many "gacha" draws; shots involving character performance are harder still, and both the number of draws and the discard rate climb further.

Walter, the director of Balloon Man, has said that Sora is good at creating things that look real, but what excites the team is its ability to make the completely surreal.

For professional creators, the traditional workflow finds randomness within controllability; the new AI-infused workflow finds controllability within randomness.

For ordinary users without a film-production background, finding controllability in randomness remains a big challenge. This may be why Sora has been slow to open to the public, choosing instead to partner first with creative software giant Adobe.

At present, AI video generation technology is still iterating rapidly. Perhaps, as OpenAI researcher Jason Wei put it, Sora is the GPT-2 moment of video generation, and its appearance will inspire a series of subsequent models.

Many new players have recently emerged at home and abroad to compete with Sora. For example, "Vidu", billed as the first Sora-level video model and launched by Shengshu Technology with Tsinghua University, supports one-click generation of videos up to 16 seconds at 1080p, and has opened internal-test applications to partners.

From "toy" to real productivity tool, Sora still has a long way to go. But it is foreseeable that, as the underlying models develop, AI video generation is expected to break through its current limits in the second half of this year, gradually improving in generation duration and character consistency.