Editor: Editorial Department
Google's new video generation model VideoPoet takes the lead again: it generates ultra-long 10-second videos that far surpass Gen-2, and it can also handle audio generation and style transfer.
AI video generation may be the next frontier in 2024.
Looking back over the past few months, Runway's Gen-2, Pika Labs' Pika 1.0, and offerings from Chinese vendors have appeared one after another, each iterating and upgrading continuously.
Just this morning, Runway announced that Gen-2 now supports text-to-speech, which can create voiceovers for videos.
Google, of course, is not far behind in video generation: it first co-released W.A.L.T with Fei-Fei Li's team at Stanford, whose photorealistic Transformer-generated videos attracted plenty of attention.
Today, the Google team released VideoPoet, a new large language model for zero-shot video generation.
Address: https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html
Most impressively, VideoPoet can generate a coherent 10-second video with large motions in one pass, far outclassing Gen-2, whose generated videos show only small movements.
Moreover, unlike most leading models, VideoPoet is not based on diffusion: it is a multimodal large language model with capabilities such as text-to-video (T2V) and video-to-audio (V2A), and it may become the mainstream approach to video generation.
After watching the demos, netizens were "shocked" and flooded the comments.
Let's take a look at the results.
Text-to-video
In text-to-video conversion, the resulting video is variable in length and capable of exhibiting a variety of actions and styles depending on the text content.
For example, Panda plays cards:
Two pandas playing cards
Pumpkin Explosion:
A pumpkin exploding, slow motion
Astronauts ride horses:
An astronaut riding a galloping horse
Image-to-video
VideoPoet can also convert the input image into an animation based on the given prompts.
Left: A ship sails on rough seas, surrounded by thunder and lightning, presented in a dynamic oil painting style
Middle: Fly past a nebula full of twinkling stars
Right: A traveler on a cane stands on the edge of a cliff, gazing at the sea fog churning in the wind
Video stylization
For video stylization, VideoPoet first predicts optical flow and depth information, then feeds them into the model along with additional text.
Left: Wombats wearing sunglasses and holding a beach ball on a sunny beach
Middle: Teddy bear skating on clear ice
Right: A metal lion roars in the light of a furnace
From left to right: Photorealism, digital art, pencil art, ink, double exposure, 360-degree panorama
Video to audio
VideoPoet can also generate audio.
In the example below, the model first generates 2-second video clips and then tries to predict the audio without any text guidance. This makes it possible to generate both video and audio from a single model.
By default, VideoPoet generates videos in portrait orientation to match short-form video output.
Google has also made a short movie made up of many short clips generated by VideoPoet.
For the script, the researchers asked Bard to write a short story about a traveling raccoon, complete with a scene breakdown and a list of prompts. They then generated a video clip for each prompt and stitched all the clips together into the final video below.
Video storytelling
With prompts that change over time, a visual story can be told.
Input: A walking person made of water
Extension: A walking person made of water. There is lightning in the background, while purple smoke is emitted from the person
Input: Two raccoons riding a motorbike on a mountain road surrounded by pine trees, 8k
Extension: Two raccoons riding motorbikes. A meteor shower falls behind the raccoons, hitting the ground and causing explosions
An LLM that generates video in seconds
Currently, Gen-2 and Pika 1.0 produce impressive videos, but unfortunately neither shines when it comes to coherent, large motions.
Often, when they produce larger movements, noticeable artifacts appear in the video.
In response, Google researchers have proposed VideoPoet, which is capable of performing a variety of video generation tasks, including text-to-video, image-to-video, video stylization, video repair/extension, and video-to-audio.
Unlike other models, Google's approach seamlessly integrates multiple video generation capabilities into a single large language model, rather than relying on specialized components trained separately for each task.
Specifically, VideoPoet mainly consists of the following components:
- The pre-trained MAGVIT V2 video tokenizer and SoundStream audio tokenizer can convert images, videos, and audio clips of different lengths into discrete code sequences in a unified vocabulary. These codes are compatible with text-based language models and can be easily combined with other modalities such as text.
- An autoregressive language model learns across the video, image, audio, and text modalities, predicting the next video or audio token in the sequence in an autoregressive fashion.
- A variety of multimodal generative learning objectives are introduced into the large language model training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting/outpainting, video stylization, and video-to-audio. In addition, these tasks can be combined with each other to achieve additional zero-shot capabilities (e.g., text-to-audio).
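The components above can be summarized as: every modality is mapped to discrete tokens in one shared vocabulary, and a single decoder-only model is trained to predict the next token. Below is a toy sketch (not Google's code; all special-token names and ids are hypothetical) of how one multimodal example might be flattened into a single token sequence with task and boundary tokens:

```python
# Toy illustration of a unified multimodal token sequence.
# Special-token ids below are made up for illustration only.
SPECIAL = {
    "<task:text_to_video>": 0,
    "<task:image_to_video>": 1,
    "<bos_text>": 2, "<eos_text>": 3,    # text boundary tokens
    "<bos_video>": 4, "<eos_video>": 5,  # video boundary tokens
}

def build_sequence(task, text_tokens, video_tokens=()):
    """Flatten a multimodal example into one discrete token sequence.

    A single autoregressive model is then trained to predict the next
    token over such sequences, so one set of weights serves every task.
    """
    seq = [SPECIAL[f"<task:{task}>"]]
    seq += [SPECIAL["<bos_text>"], *text_tokens, SPECIAL["<eos_text>"]]
    seq += [SPECIAL["<bos_video>"], *video_tokens, SPECIAL["<eos_video>"]]
    return seq

# "Two pandas playing cards" as made-up text token ids, with an empty
# video span that the model would fill in during generation.
seq = build_sequence("text_to_video", text_tokens=[101, 102, 103])
print(seq)  # [0, 2, 101, 102, 103, 3, 4, 5]
```

Because the task is itself a token, combining tasks (the zero-shot compositions mentioned above) amounts to arranging token spans differently rather than training a new component.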
VideoPoet can multitask across a wide variety of video-centric inputs and outputs. The LLM can optionally take text as input to guide generation for the text-to-video, image-to-video, video-to-audio, stylization, and outpainting tasks.
A key advantage of using LLMs for training is that many of the scalable efficiency improvements introduced in existing LLM training infrastructure can be reused.
However, LLMs run on discrete tokens, which can create challenges for video generation.
Fortunately, video and audio tokenizers can encode video and audio clips into discrete token sequences (i.e., integer indexes) and convert them back to their original representations.
VideoPoet trains an autoregressive language model to learn across the video, image, audio, and text modalities by using multiple tokenizers (MAGVIT V2 for video and images, SoundStream for audio).
Once the model has generated tokens conditioned on the context, these tokens can be converted back into a viewable representation using the tokenizer decoders.
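The encode/decode round trip described above can be illustrated with a toy vector quantizer (MAGVIT V2 and SoundStream are far more sophisticated; this tiny codebook is purely illustrative): continuous signal values are mapped to the index of their nearest codebook entry, producing the integer tokens the LLM models, and the decoder maps indexes back to approximate values.

```python
# A toy quantization round trip: continuous values -> integer tokens -> values.
CODEBOOK = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical tiny codebook

def encode(values):
    """Map each value to the index of its nearest codebook entry (a token)."""
    return [min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - v))
            for v in values]

def decode(tokens):
    """Map integer tokens back to an approximate reconstruction."""
    return [CODEBOOK[t] for t in tokens]

signal = [0.9, -0.45, 0.1]
tokens = encode(signal)
print(tokens)          # [4, 1, 2]
print(decode(tokens))  # [1.0, -0.5, 0.0]
```

The key point is that the token sequence consists of plain integer indexes, which is exactly the kind of input existing LLM infrastructure already handles.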
VideoPoet task design: different modalities are converted to and from tokens through tokenizer encoders and decoders. Each modality is wrapped in boundary tokens, and a task token indicates the type of task to perform
Three major advantages
In summary, VideoPoet has the following three advantages over video generation models such as Gen-2.
Longer videos
VideoPoet can generate longer videos by conditioning on the last 1 second of the video and predicting the next 1 second.
By repeating this loop, VideoPoet not only extends the video well but also faithfully preserves the appearance of all objects across many iterations.
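The extension loop just described can be sketched as follows, assuming a black-box model that maps the tokens of the last second of video to tokens for the next second (`predict_next_second` and the token counts are hypothetical stand-ins, not VideoPoet's API):

```python
# Sketch of long-video generation by repeated 1-second extension.
def extend_video(video_tokens, predict_next_second, tokens_per_second, extra_seconds):
    """Grow a video by repeatedly conditioning on its final second."""
    for _ in range(extra_seconds):
        last_second = video_tokens[-tokens_per_second:]  # conditioning window
        video_tokens = video_tokens + predict_next_second(last_second)
    return video_tokens

# Toy stand-in "model": echoes the conditioning tokens shifted by 1.
fake_model = lambda last: [t + 1 for t in last]
clip = [0, 0, 0, 0]  # 1 second of video at 4 tokens/second (made-up rate)
long_clip = extend_video(clip, fake_model, tokens_per_second=4, extra_seconds=2)
print(long_clip)  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
```

Because each step conditions on real generated content rather than starting fresh, object appearance can stay consistent across iterations.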
Here are two examples of how VideoPoet generates a long video from text input:
Left: Astronauts dancing on Mars with colorful fireworks in the background
Right: A drone shot of a very sharp elven stone city in the jungle with a clear blue river, waterfall and steep vertical cliffs
Compared to other models that can only generate 3-4 seconds of video, VideoPoet can generate up to 10 seconds of video at a time.
Autumn view of the castle taken by a drone
Precise control
One of the most important capabilities of video generation applications is how much control the user has over the resulting animation.
This will largely determine whether the model can be used to produce long, complex and coherent videos.
VideoPoet can not only add dynamic effects to the input image through text descriptions, but also adjust the content through text prompts to achieve the desired effect.
Left: Turning to look at the camera, Right: Yawning
In addition to supporting video editing of input images, video input can also be precisely controlled through text.
For the leftmost raccoon dancing video, users can make it dance different dances by describing different dance postures.
Generation (left): Do the robot dance
Generation (middle): Do the griddy dance
Generation (right): Do a freestyle dance
Similarly, existing video clips generated by VideoPoet can be edited interactively.
Given an input video, we can change an object's motion so it performs different actions. Object manipulation can be centered on the first or an intermediate frame, allowing a high degree of editing control.
For example, you can randomly generate a few clips from the input video and select the next clip you want.
As shown, the leftmost video in the figure is used as conditioning to generate four videos from the initial prompt:
"Close-up of a cute rusty, dilapidated steampunk robot covered in moss and sprouts, surrounded by tall grass."
For the first three outputs, the motion is predicted autonomously with no prompted action. For the last video, "Activate, smoke in the background" was appended to the prompt to guide the motion generation.
Techniques of camera movement
VideoPoet can also precisely control camera movement by appending the desired type of camera motion to the text prompt.
For example, the researchers generated an image from the model with the prompt "adventure game concept art of a sunrise over snowy mountains with a clear river". The examples below append the listed camera-motion suffix to that prompt.
From left to right: zoom out, dolly zoom, pan left, arc shot, crane shot, FPV drone shot
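The camera controls above are just text: a motion suffix appended to the base prompt. A trivial sketch of composing such prompts (the exact suffix wording here is illustrative, not an official API):

```python
# Compose camera-motion prompt variants by appending a text suffix.
BASE = "Adventure game concept art of a sunrise over snowy mountains with a clear river"
CAMERA_MOVES = ["zoom out", "dolly zoom", "pan left",
                "arc shot", "crane shot", "FPV drone shot"]

prompts = [f"{BASE}, {move}" for move in CAMERA_MOVES]
for p in prompts:
    print(p)
```

Since the model was trained on a unified text-and-video token sequence, such suffixes steer generation the same way any other prompt text does.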
Evaluate the results
Finally, how did VideoPoet perform in specific experimental tests?
To keep the assessment objective, the Google researchers ran all models on a variety of prompts and had human raters score their preferences.
The figures below show the percentage of cases in which VideoPoet was preferred (shown in green) for the following questions.
Text Fidelity:
A user preference rating for text fidelity, which is the percentage of videos that are preferred in terms of accurately following prompts
Fun Action:
A user preference rating for interesting motion, i.e., the percentage of videos preferred for generating interesting motion
In summary, on average 24-35% of raters judged that VideoPoet's examples followed the prompts better than those of other models, compared with 8-11% for the alternatives.
In addition, 41-54% of evaluators found VideoPoet's example motions more interesting, versus only 11-21% for the other models.
As for future research directions, Google researchers said that the VideoPoet framework will enable "any-to-any" generation, such as extended text-to-audio, audio-to-video, and video subtitles.
Netizens can't help wondering whether Runway and Pika can withstand the coming text-to-video innovations from Google and OpenAI.
Resources:
https://sites.research.google/videopoet/
https://blog.research.google/2023/12/videopoet-large-language-model-for-zero.html