
The creator of the Sora short film explains the benefits and limitations of AI-generated videos

Author: Bullwhip

April 28 — OpenAI's video generation tool Sora stunned the AI community in February; its smooth, realistic videos seemed far ahead of the competition. But the carefully staged debut left out a lot of details, details that have now been filled in by a filmmaking team given early access to Sora to make a short film.

Shy Kids, a Toronto-based digital production team, was one of a handful of groups OpenAI chose to produce short films, primarily for OpenAI's promotional purposes, though they were given considerable creative freedom in creating "Air Head," the story of a man with a balloon for a head.

In an interview with the visual effects news outlet fxguide, post-production artist Patrick Cederberg described what it was actually like to use Sora as part of the team's work.

Perhaps the most important takeaway for most people is this: while OpenAI's posts highlighting these short films let readers assume they were more or less entirely made by Sora, the truth is that they were professional productions, complete with robust storyboarding, editing, color correction, and post-production work such as motion tracking and VFX.


In the same way that Apple's "Shot on iPhone" campaign doesn't show the studio setup, professional lighting, and color work done afterwards, OpenAI's Sora posts only talk about what the tool lets people do, not how they actually do it.

The following is the full text:

In February, we ran our first story on Sora, when OpenAI had just released the first Sora clips; we described it at the time as a video version of DALL·E. Sora is a diffusion model that produces longer and more cohesive videos than any of its competitors. By giving the model foresight of many frames at a time, the team solved the challenging problem of keeping a subject consistent even when it temporarily goes out of view. Sora can generate an entire video in one pass, up to a minute in length. At the time, OpenAI also released a technical note indicating that it could (in the future) extend generated videos to make them longer, or seamlessly blend two videos together.

Over the past few weeks, several select production teams have been granted limited access to Sora. One of the most notable is Shy Kids, who produced the Sora short film Air Head. Sidney Leeder served as the film's producer, Walter Woodman as writer and director, and Patrick Cederberg handled post-production. The Toronto team has been dubbed "Punk Rock Pixar," and their work has been nominated for an Emmy Award and shortlisted for an Academy Award.

We sat down with Patrick this week for a long conversation about the current state of Sora.

Shy Kids is a Canadian production company known for its eclectic, innovative approach to media. Formed by creatives from different disciplines, film, music, and television, Shy Kids has been recognized for a distinctive narrative style and engaging content. The company often explores the complexities of adolescence, social anxiety, and modern life while maintaining a whimsical yet sincere tone. Their work shows a keen eye for visual storytelling and is often tightly coupled with original music, making it resonant and memorable. Shy Kids has carved out a niche by embracing new AI technologies creatively.

Sora: as of mid-April 2024

Sora is in development and is actively improving through feedback from teams like Shy Kids, but at the moment it works as follows. It is important to recognize that Sora is effectively still pre-alpha: it has not been released, and it is not even at a beta stage.

"It's a lot of fun to play. Patrick commented. "It's a very, very powerful tool, and we're already thinking of all the ways it can fit into our existing processes. But I think any generative AI tool, at the moment, control is still the most desirable and the most elusive. 」

UI

The user interface lets artists enter a text prompt, which OpenAI's ChatGPT then expands into a longer string that triggers clip generation. At present there are no other inputs. This matters because, while Sora is rightly praised for its object consistency within a single shot, nothing helps make anything in a first shot match a second shot. Even running the same prompt a second time produces different results.
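As an illustration only: the article doesn't document Sora's interface or API, but the ChatGPT-driven prompt expansion it describes can be sketched with OpenAI's public chat API. The model name and the expansion instructions below are assumptions, not anything Shy Kids reported using.

```python
# Hypothetical sketch of the prompt-expansion step described above: a short
# artist prompt is expanded by ChatGPT into a longer, more detailed string.
# Only the chat call is a real, public API; the Sora side is not public.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_prompt(short_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; the article doesn't name one
        messages=[
            {
                "role": "system",
                "content": "Expand this video idea into one detailed, "
                           "cinematic shot description.",
            },
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a man with a yellow balloon for a head walks downtown"))
```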

"The closest we can get is over-describing in the prompt. Patrick explained. "Explaining the costumes of the characters as well as the type of balloons is our solution to consistency, because shot-by-shot/generation-by-generation, there is not yet a proper feature set to fully control consistency. 」

The individual clips are extraordinary and jaw-dropping for the technology they represent, but using them depends on understanding that shot generation is implicit rather than explicit. Say you ask Sora for a wide shot of a banana on a kitchen table. It relies on an implicit understanding of "banana-ness" to generate a video showing a banana. Through its training data it has learned the hidden aspects of a banana: "yellow," "curved," "dark at the ends," and so on. It does not hold actual recorded images of bananas; there is no "banana repository" database, just a much smaller compressed hidden space, or "latent space." Each time it runs, it renders one alternative interpretation of that latent space. Your prompt plays against an implicit understanding of banana-ness.
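A toy illustration (emphatically not Sora's real architecture) of why every run differs: a diffusion-style generator starts from freshly sampled noise in latent space, so the same prompt decodes to a different sample each time.

```python
# Toy illustration, not Sora's pipeline: each run samples a fresh latent
# vector, so an identical prompt "decodes" to a different interpretation.
import numpy as np

rng = np.random.default_rng()  # unseeded: new noise on every run
LATENT_DIM = 8

def fake_decode(prompt: str, z: np.ndarray) -> str:
    # Stand-in for a decoder: mixes the prompt with the random latent.
    score = float(np.tanh(z).mean())
    return f"{prompt!r} -> interpretation {score:+.3f}"

for _ in range(3):
    z = rng.standard_normal(LATENT_DIM)  # z ~ N(0, I)
    print(fake_decode("a banana on a kitchen table", z))
```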

Prompting the right thing

For Air Head, the scenes were created by generating multiple clips from an approximate script, but there was no reliable way to make the actual yellow balloon head the same in every shot. Sometimes, even when the team prompted for a yellow balloon, it wasn't yellow. Other times it had a face embedded in it, or what looked like a face drawn on the front of the balloon. And since so many balloons have strings, the "Air Head" character, the balloon man Sonny, frequently appeared with a string dangling down the front of his shirt. Because the model implicitly links strings with balloons, these strings had to be removed in post.

Resolution

Air Head uses only Sora-generated footage, but most of it was graded, treated, and stabilized, and all of it was upscaled. The clips the team used were generated at lower resolutions and then upscaled using AI tools outside Sora or OpenAI. "You can get up to 720p (resolution)," Patrick explains. "I believe 1080 is already available as a feature, but it takes a while (to render). We did all of Air Head at 480p and then used Topaz to upscale."
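Topaz's video upscaler is a commercial GUI tool, so as a rough stand-in, here is a plain ffmpeg upscale from 480p to 1080p driven from Python. It illustrates the pipeline step, not the AI upscaler Shy Kids actually used; the filenames are placeholders.

```python
# Stand-in for the upscaling pass: ffmpeg's Lanczos scaler instead of the
# AI-based Topaz tool, purely to show the 480p -> 1080p step.
import subprocess

def upscale(src: str, dst: str, width: int = 1920, height: int = 1080) -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-vf", f"scale={width}:{height}:flags=lanczos",
            "-c:v", "libx264", "-crf", "18",  # high-quality H.264 output
            dst,
        ],
        check=True,
    )

upscale("sora_clip_480p.mp4", "sora_clip_1080p.mp4")
```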

Hint "Time": Slot machines.

The original prompt is expanded automatically, and it also appears along a timeline. "You can go into those big keyframes and start adjusting the information based on the changes you want in the generation," Patrick said. "There's a little bit of temporal control over where these different actions happen in the actual generation, but it's not exact... it's kind of a slot machine: there's no way to know for sure whether it will actually hit those marks." Of course, Shy Kids was working on the earliest prototype, and Sora is still in development.

In addition to choosing a resolution, Sora lets users choose an aspect ratio, such as portrait or landscape (or square). This came in handy for a shot tilting up from Sonny's jeans to his balloon head. Unfortunately, Sora itself wouldn't render such a move, always wanting the main focus of the shot, the balloon head, to be in frame. So the team rendered the shot in portrait and then created the move in post by animating a crop.
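A minimal sketch of that post move: crop a landscape window out of a portrait render and slide it upward over time. ffmpeg's crop filter re-evaluates its x/y expressions per frame (t is the timestamp in seconds); the five-second duration and filenames are assumptions.

```python
# Illustrative "post tilt": animate a landscape crop window from the bottom
# of a portrait frame to the top over five seconds.
import subprocess

def post_tilt(src: str, dst: str, seconds: float = 5.0) -> None:
    # Crop the full width and a third of the height; the y expression moves
    # the window from (ih - oh) at t=0 up to 0 at t=seconds. The comma
    # inside min() must be escaped as \, within an ffmpeg filtergraph.
    y_expr = f"(ih-oh)*(1-min(t/{seconds}\\,1))"
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-vf", f"crop=iw:ih/3:0:{y_expr}",
            "-c:v", "libx264", "-crf", "18",
            dst,
        ],
        check=True,
    )

post_tilt("sonny_portrait.mp4", "sonny_tilt_up.mp4")
```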

Prompting camera direction

For many genAI tools, a valuable source of information is the metadata that accompanies the training data, such as camera metadata. If you train on still photos, for example, the camera metadata provides the focal length, aperture, and much other key information for the model to train on. For cinematic footage, though, notions like "tracking," "panning," "tilting," or "pushing in" are not terms or concepts that metadata captures. And while object permanence is essential to building a shot, being able to describe the shot is just as important; Patrick points out that this was not initially the case in Sora. "Nine different people will have nine different ideas of how to describe a shot on a film set. The (OpenAI) researchers weren't really thinking like filmmakers before they got artists using the tool." Shy Kids knew they were in early, but "the initial versions were a bit random about camera angles." Whether Sora actually registers a prompted camera request, or understands it, is unclear, as the researchers had been focused on image generation; Shy Kids recall that OpenAI seemed almost surprised by the request. "But I guess when you're a researcher, you're just not thinking about how a storyteller is going to use it... Sora is improving, but I'd still say the controls aren't quite there yet. You can put in a 'camera pan' and I think you'll get it six times out of ten." This is not a problem unique to Sora; almost all the major video genAI companies face it. Runway may be state of the art in providing a UI for describing camera moves, but Runway's image quality and clip lengths are not on par with Sora's.

Render time

Clips can be rendered at different durations, such as 3, 5, 10, or 20 seconds, up to one minute. Render times vary depending on the time of day and cloud demand. "Typically, each render takes about 10 to 20 minutes," Patrick recalls. "In my experience, the duration I chose to render had very little impact on render time; whether it was 3 seconds or 20, it tended to stay in that same 10-to-20-minute range. So we usually rendered long: if you have the full 20 seconds, you have more opportunities to split and edit the content, and a better chance of getting something that looks good."

While all the imagery was generated in Sora, the balloon still required a lot of post-production. Besides isolating the balloon for recoloring, Sonny's balloon sometimes appeared with a face on it, as if one had been drawn on with a marker; these were removed in After Effects. Other artifacts like this were routinely painted out.

Editing: a 300:1 shooting ratio

The approach to "Shy Kids" is post-production and editing like a documentary, with a lot of shots that you can weave a story from these materials, rather than shooting strictly according to the script. The short film was scripted, but the team needed to be flexible. "Just getting a whole bunch of footage and trying to clip it to the narrator in a fun way," Patrick recalled.

Patrick estimates that for the roughly minute and a half of footage in the final film, "each 10 to 20 seconds probably took hundreds of generations," adding, "I'm bad at math, but I'd guess the amount of source material was probably 300:1 compared to the final."
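Taking those estimates at face value (they are the article's rough figures, not production numbers), the back-of-envelope arithmetic looks like this:

```python
# Back-of-envelope check of the article's own estimates; nothing here is
# an exact production figure.
final_seconds = 90            # "a minute and a half" of finished film
ratio = 300                   # Patrick's estimated 300:1 shooting ratio
avg_clip_seconds = 15         # midpoint of the 10-20 second renders

generated_seconds = final_seconds * ratio           # 27,000 s generated
generated_hours = generated_seconds / 3600          # ~7.5 hours
clip_count = generated_seconds / avg_clip_seconds   # ~1,800 generations

print(f"~{generated_hours:.1f} hours of footage across ~{clip_count:.0f} clips")
```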

Compositing multiple clips, and retiming

In Air Head, the team didn't composite multiple generations together. The shot of the balloon floating across the velodrome, for example, was generated as a single shot, exactly as seen. For their new film, however, they are experimenting with mixing and compositing multiple generations into a single clip.

Interestingly, many of the Air Head clips came out in slow motion, even though that was never requested in the prompts. Why this happens is unknown, so many clips had to be retimed to look as if they were shot in real time. Obviously this is easier than the opposite, slowing down fast motion, but it seems a strange trait to have been inferred from the training data. "I don't know why, but it really did look like a lot of the clips came in at 50 to 75 percent speed," he added. "So a lot of retiming was needed to keep the whole thing from feeling like one big slow-motion project."
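A minimal retiming sketch under the article's numbers: a clip that came out at 50 percent speed is brought back to real time by halving its presentation timestamps. The article doesn't say which tool Shy Kids used for this; ffmpeg here is just one way to show the operation.

```python
# Illustrative retime: speed a slow-motion clip back up to real time by
# dividing its presentation timestamps (PTS).
import subprocess

def retime(src: str, dst: str, observed_speed: float) -> None:
    # observed_speed=0.5 means the clip plays at half real-time speed,
    # so we divide PTS by 1/0.5 = 2 to double the playback rate.
    factor = 1.0 / observed_speed
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-vf", f"setpts=PTS/{factor}",
            "-an",  # Sora clips are silent; drop any stray audio track
            dst,
        ],
        check=True,
    )

retime("sonny_slowmo.mp4", "sonny_realtime.mp4", observed_speed=0.5)
```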

Lighting and grading

Shy Kids used "35mm film" as a keyword in their prompts, and generally found that prompting "35mm" gave them the level of consistency they were looking for. "If we needed high contrast, we could say 'high contrast,' and saying 'key lighting' usually got us something close," Patrick said. "We still needed to give it a full color grade; we built our own digital film look, and we applied grain and flicker over everything to blend it all together." There are no options for other channels, such as mattes or depth channels.
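The article doesn't name the grading tools, so as a hedged illustration, here is one way to apply a LUT-based film look plus temporal grain with ffmpeg's lut3d and noise filters. The LUT file name is a placeholder.

```python
# Illustrative grade-and-grain pass: a 3D LUT supplies the "film look",
# and temporal noise stands in for grain. "film_look.cube" is a placeholder.
import subprocess

def grade_and_grain(src: str, dst: str, lut: str = "film_look.cube") -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            # lut3d applies the color look; noise adds grain on all planes
            # (alls=8 strength) that changes every frame (allf=t, temporal).
            "-vf", f"lut3d={lut},noise=alls=8:allf=t",
            "-c:v", "libx264", "-crf", "18",
            dst,
        ],
        check=True,
    )

grade_and_grain("air_head_cut.mp4", "air_head_graded.mp4")
```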

Copyright

OpenAI tries to be respectful of rights and does not allow the generation of copyright-infringing material, or of imagery that appears to belong to someone else. For example, if you prompt something like "35mm film, a futuristic spaceship, a man walking forward with a lightsaber," Sora won't generate the clip, because it's too close to Star Wars. Shy Kids ran into this unexpectedly during early tests. Patrick recalls that when they first sat down just to test Sora, "we were doing a behind-the-scenes shot; it was kind of an Aronofsky-type following shot. I think it was just my stupid brain because I was tired, but I put 'Aronofsky-type shot' in the prompt and got hit with a 'can't do that.'" The "Hitchcock zoom" is another term that has seeped into general technical jargon, but Sora rejected that prompt on copyright grounds as well.

Sound

Shy Kids are known for their audio skills as much as their visual ones. The music in the short is their own. "We decided on the song almost immediately, because the song is called 'The Wind,'" Patrick said. "We all loved it."

Patrick himself voiced Sonny. "Sometimes we'd feel the film needed another beat. So I'd write another line, record it, and kick off more Sora generations. That's another powerful use of the tool in post: when you're backed into a corner and need to fill a gap, it's a great way to start brainstorming, spit out clips, and see what you can do with the pacing."

Wrapping up

Sora is extraordinary; the Shy Kids team produced Air Head in about one and a half to two weeks with just three people. The team is already working on a wonderful, self-aware, perhaps ironic sequel. "The follow-up is a news report on Sonny the balloon man, his reaction to fame, and his subsequent run-ins with the world," Patrick said. "We're exploring new techniques!" The team wants to get more technical in these experiments, compositing Sora elements into real-world live-action footage in After Effects and using Sora as a complementary VFX tool.

Sora is so new that even some of the basic capabilities OpenAI sketched and demonstrated, such as extending or blending clips, were not yet available to the early testers. It's doubtful that Sora in its current form will be released anytime soon, but it is an incredible step forward for this kind of implicit image generation. For high-end projects, it may take a while to reach the level of specificity a director demands. For many others, it's "close enough" while delivering stunning imagery. Air Head still needed a lot of editing and human craft to become the engaging, entertaining story film it is. "I just feel like people do have to make Sora an actual part of their process. But if they don't want it to be part of something like that, that's okay too."
