The effect surpasses Gen-2! Byte's latest video generation model puts VR glasses on the Hulk with a single sentence

Bai Jiao, from Aofei Temple

量子位 | QbitAI official account

With a single sentence, you can put VR glasses on the Hulk.

In 4K picture quality, no less.

Or send a panda on a fantastical drift~

This is ByteDance's latest AI video generation model, MagicVideo-V2, which can bring all kinds of whimsical ideas to life. It not only supports 4K and 8K ultra-high resolutions, but also handles a wide variety of art styles with ease.

△ From left to right: oil painting style, cyber style, design style

Its performance exceeds that of Gen-2, Pika, and other existing AI video generation tools.

Less than 24 hours after its release, it had already drawn a crowd of onlookers; one tweet alone racked up nearly 200,000 views.

Many netizens were surprised by the results, with some saying outright that it is better than Runway and Pika.

"Even better than runway and pika"

The researchers did run head-to-head comparisons. The contenders: MagicVideo-V2, Stability AI's SVD-XT, rising player Pika 1.0, and Runway's Gen-2.

Round 1: Light and shadow effects.

As the sun sets, a traveler walks alone through a misty forest.

(From left to right: MagicVideo-V2 and SVD-XT; upper right: Pika; lower right: Gen-2. Same layout below.)

As you can see, MagicVideo-V2, Gen-2, and Pika all render clear light and shadow. However, Pika's output gives no sense of a traveler at all, while MagicVideo-V2 has a richer color palette.

Round 2: Conveying a scene and story.

A 1910s sitcom about everyday life and trivialities in society

In this round, MagicVideo-V2 and Gen-2 clearly come out ahead. The mid-shot composition from SVD-XT reflects the era but is not expressive enough.

Round 3: Realism.

A little boy rides his bicycle on a path in the park with the wheels crunching on the gravel.

This time the contrast is even starker. Both MagicVideo-V2 and SVD-XT capture the meaning of the prompt, but only MagicVideo-V2 shows the detail of the boy's pedaling motion.

Beyond these examples, the researchers also ran one-on-one human evaluations of MagicVideo-V2 against the most advanced methods currently available.

The results show that evaluators judged MagicVideo-V2's output to be better than that of the other methods.

(The green, gray, and pink bars indicate cases where MagicVideo-V2 was rated better than, equivalent to, or worse than the other method, respectively.)

How does it work?

To put it simply, MagicVideo-V2 is a video generation pipeline that integrates a text-to-image model, a video motion generator, a reference image embedding module, and an interpolation module.

First, a T2I module generates a 1024×1024 image from the text prompt. An I2V module then animates this static image into a 600×600×32 frame sequence, a V2V module enhances those frames to improve the video content, and finally an interpolation module extends the sequence to 94 frames.

In this way, both high fidelity and temporal continuity are ensured.
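To make the stage order concrete, here is a minimal Python sketch of that four-stage cascade. The functions below are dummy stand-ins (random arrays and a simple linear blend), not ByteDance's actual models or API; only the stage order, resolutions, and frame counts are taken from the description above.

```python
import numpy as np

def t2i(prompt: str) -> np.ndarray:
    # Stand-in for the T2I module: returns a 1024x1024 RGB "keyframe".
    return np.random.rand(1024, 1024, 3)

def i2v(keyframe: np.ndarray, prompt: str, num_frames: int = 32) -> np.ndarray:
    # Stand-in for the I2V module: animates the keyframe into a
    # 600x600 sequence of `num_frames` frames.
    return np.random.rand(num_frames, 600, 600, 3)

def v2v(frames: np.ndarray, prompt: str) -> np.ndarray:
    # Stand-in for the V2V enhancement module (the real one refines
    # the frames and improves the video content).
    return frames

def interpolate(frames: np.ndarray, target: int = 94) -> np.ndarray:
    # Stand-in frame interpolation: linear blend between neighbouring
    # frames to stretch the clip to `target` frames.
    idx = np.linspace(0, len(frames) - 1, target)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, len(frames) - 1)
    frac = (idx - lo)[:, None, None, None]
    return (1 - frac) * frames[lo] + frac * frames[hi]

def magicvideo_v2_pipeline(prompt: str) -> np.ndarray:
    keyframe = t2i(prompt)                       # 1. text -> 1024x1024 image
    clip = i2v(keyframe, prompt, num_frames=32)  # 2. image -> 600x600x32 frames
    clip = v2v(clip, prompt)                     # 3. enhance the video content
    return interpolate(clip, target=94)          # 4. expand to 94 frames

video = magicvideo_v2_pipeline("In a word, let the Hulk put on VR glasses")
print(video.shape)  # (94, 600, 600, 3)
```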

In fact, as early as November 2022, ByteDance had already released the first version, MagicVideo V1.

At that time, however, the emphasis was more on efficiency: it generated 256×256-resolution video on a single GPU card.

Reference Links:

https://twitter.com/arankomatsuzaki/status/1744918551415443768?s=20

Project Links:

https://magicvideov2.github.io/

Paper Links:

https://arxiv.org/abs/2401.04468

https://arxiv.org/abs/2211.11018

— END —
