Chinese researchers blow up the field again! An 8× super-resolution model surpassing SOTA is released, and you can finally see spider silk clearly!

Author: 51CTO

Produced by | 51CTO Technology Stack (WeChat ID: blog51cto)

Sora has driven research on "video consistency", but temporal consistency alone can no longer satisfy the industry's appetite for high-fidelity video. And now, Chinese researchers have blown up the field once again!

Recently, a video model called VideoGigaGAN has taken the industry by storm. For super-resolution cinematic footage, don't wait for Sora!

According to the paper, there are currently two major challenges in the field of VSR (video super-resolution): the first is maintaining temporal consistency across output frames; the second is generating high-frequency details in the upsampled frames. This paper focuses on the second challenge, and in tackling it, the effectiveness of GANs (generative adversarial networks) has once again been demonstrated.

1. Restoring realistic detail to blurry video: 8× upsampling that surpasses SOTA

For example, on the car footage, previous VSR methods such as BasicVSR++ lack detail, while the image GigaGAN produces sharper results with richer details; however, its videos suffer from artifacts such as temporal flickering and aliasing (note the building footage in the video).

The newly proposed VideoGigaGAN generates video results with both high-frequency detail and temporal consistency, while significantly mitigating artifacts such as aliasing.

VideoGigaGAN is a generative video super-resolution model that upsamples video with rich high-frequency detail while maintaining temporal consistency. Compared with existing VSR methods, VideoGigaGAN generates temporally consistent videos with far more fine-grained appearance details.

The study shows that VideoGigaGAN performs very well on public datasets, demonstrating 8× super-resolution video results that surpass the current state-of-the-art VSR models.

Let's start with a few comparison videos; you may not believe your eyes at how stunning this video technology is!

The time has come to witness the miracle:

The research team released a comparison video of enoki mushrooms being cooked in hot pot. A digression: first author Xu is himself a cooking enthusiast.

You may still remember the earlier Sora-style videos in which, after the bird takes flight, a layer of ghosting always lingers; VideoGigaGAN has solved this problem.

The animal world is full of wonder, but if you can't clearly see the web behind the spider, or how the tabby cat plays with the rope, the shot loses some of its beauty.

2. How is it done? The answer lies in the model details

Next, let's take a look at where this model's power comes from.

First, the video super-resolution (VSR) model is built on top of the image GigaGAN upsampler, which uses an asymmetric U-Net architecture.

Second, to enhance temporal consistency, the team inflated the image upsampler into a video upsampler by adding temporal attention layers to the decoder blocks.
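
To make this concrete, here is a minimal sketch (PyTorch; the class and parameter names are illustrative, not from the paper) of a temporal attention layer of the kind that can be added to a decoder block. Attention runs along the time axis independently at each spatial position, so it mixes information across frames without touching the spatial layout:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention along the time axis, applied independently at each pixel."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention only mixes frames.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed, need_weights=False)
        tokens = tokens + out  # residual keeps the pretrained image prior intact
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```

Because the layer is residual, it can be dropped into a pretrained image upsampler without destroying its behavior, then fine-tuned on video.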

Then, another trick: consistency is further enhanced by integrating features from a flow-guided propagation module.

Next, to suppress aliasing artifacts, the team applied anti-aliasing blocks in the encoder's downsampling layers.
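
This is the BlurPool idea from anti-aliased CNNs: blur with a small low-pass filter before striding, so high frequencies don't alias into the downsampled signal. A minimal sketch, assuming a fixed 3×3 binomial kernel (our choice for illustration; the paper does not specify these exact details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Anti-aliased downsampling: depthwise low-pass blur, then stride-2 subsampling."""
    def __init__(self, channels: int):
        super().__init__()
        k = torch.tensor([1.0, 2.0, 1.0])
        k = torch.outer(k, k)                                  # 3x3 binomial kernel
        k = (k / k.sum()).view(1, 1, 3, 3).repeat(channels, 1, 1, 1)
        self.register_buffer("kernel", k)
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")             # pad to preserve size
        # Depthwise conv blurs each channel separately; stride 2 halves resolution.
        return F.conv2d(x, self.kernel, stride=2, groups=self.channels)
```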

Finally, Xu et al. pass high-frequency features directly to the decoder layers through skip connections, to compensate for the detail lost in the BlurPool (blur-then-downsample) process.
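
One simple way to realize this high-frequency "shuttle" (our own illustrative decomposition, not necessarily the paper's exact formulation) is to keep, at each encoder scale, the residual that blurring removes and hand it to the matching decoder layer:

```python
import torch
import torch.nn.functional as F

def downsample_with_hf_skip(x: torch.Tensor, blurpool) -> tuple:
    """Split features into an anti-aliased low-res path and a high-frequency residual."""
    low = blurpool(x)                             # anti-aliased, half resolution
    restored = F.interpolate(low, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)
    high = x - restored                           # exactly the detail lost to blurring
    return low, high                              # `high` skips ahead to the decoder
```

In the decoder, `high` can simply be added back to the upsampled features at the corresponding resolution.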

One thing to note here: because the spatial window size of temporal attention is limited, Xu et al. introduced flow-guided feature propagation into the inflated GigaGAN, so that features from different frames can be better aligned based on optical-flow information.
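
The core operation in flow-guided propagation is backward warping: features from a neighboring frame are resampled along the optical flow (estimated by an off-the-shelf flow network) so that they line up with the current frame before being fused. A minimal sketch of the standard grid-sample warp:

```python
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp features (B, C, H, W) using optical flow (B, 2, H, W)."""
    b, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates, (x, y) order.
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()   # (2, H, W)
    coords = grid.unsqueeze(0) + flow             # shift each pixel by its flow
    # Normalize coordinates to [-1, 1], as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)
```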

There is also the anti-aliasing processing, which further mitigates the temporal flicker caused by the downsampling blocks in the GigaGAN encoder, while high-frequency detail is preserved by shuttling the high-frequency features directly to the decoder blocks.
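
Putting the pieces together, a toy forward pass might look like the sketch below, which reuses the helpers sketched above (a schematic of the ideas only, not the paper's actual architecture; it upsamples 2× rather than 8×):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVideoUpsampler(nn.Module):
    """Toy composition of the ideas above; assumes TemporalAttention,
    BlurPool2d, downsample_with_hf_skip, and flow_warp are in scope."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.embed = nn.Conv2d(3, ch, 3, padding=1)
        self.down = BlurPool2d(ch)                       # anti-aliased encoder step
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)  # merge warped features
        self.temporal = TemporalAttention(ch)            # temporal mixing in decoder
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, frames: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W); flows: (B, T, 2, H, W), frame t-1 -> t
        b, t, _, h, w = frames.shape
        feats, highs, prev = [], [], None
        for i in range(t):
            f = self.embed(frames[:, i])
            low, high = downsample_with_hf_skip(f, self.down)
            if prev is not None:                          # flow-guided propagation
                fl = F.interpolate(flows[:, i], size=low.shape[-2:],
                                   mode="bilinear") * 0.5  # rescale flow to half-res
                low = self.fuse(torch.cat([low, flow_warp(prev, fl)], dim=1))
            prev = low
            feats.append(low)
            highs.append(high)
        x = self.temporal(torch.stack(feats, dim=1))      # (B, T, C, H/2, W/2)
        outs = []
        for i in range(t):
            up = F.interpolate(x[:, i], size=(h, w), mode="bilinear") + highs[i]
            outs.append(self.to_rgb(F.interpolate(up, scale_factor=2.0,
                                                  mode="bilinear")))
        return torch.stack(outs, dim=1)                   # (B, T, 3, 2H, 2W)
```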

Of course, these ideas were also validated by the final experimental results. In short, these model design choices matter a great deal.

3. The first author behind it: Xu Yiran, who loves to cook

That's right, Xu Yiran is yet another Chinese scholar. He earned his bachelor's degree from South China University of Technology and is now a PhD student at the University of Maryland, College Park. Xu's current research interests include generative models and their applications, and he has also done research on scene understanding in the field of autonomous driving.

As mentioned earlier, Xu's personal hobbies are quite distinctive: photography, hiking, and cooking.

4. Netizens' heated debate: the quality is good but the duration is too short; we need 200+ frames (at least 9 seconds)

Shot length became the focus of the discussion. One Hacker News user commented: "The video quality looks good, but it has a lot of limitations," pointing to the paper's own admission that the model struggles with extremely long videos of 200 frames or more. He therefore believes more research is needed before it can be used in real-world settings.

Another netizen voiced a similar view: "To a certain extent, I compulsively count the seconds of shots. Once I know a show or movie has several shots longer than 9 seconds, it earns my trust and I can relax."

According to another Hacker News user, the average shot length in a modern movie is about 2.5 seconds, and about 15 seconds for animation. At the 30 fps used in this study, 200 frames is not enough: it works out to less than 7 seconds (200 / 30 ≈ 6.7 s).

All in all, if the method can be scaled beyond 200 frames, we very much look forward to the results.

5. One More Thing: Don't forget the AI label

In addition, the release of these research results has once again raised concerns about AI misuse. As one commenter put it: "This is great for entertainment, but overly realistic, crisp images can still be passed off as 'evidence' of any kind, and people have no idea how these hallucinated details come about, so such videos still need to be prominently labeled."

The sobering counterpoint is that quite a few software tools, and video/photo features on smartphones, already use proprietary algorithms to "infer" whether details are fake, and the scale of such checking will only grow.

Returning to this study, the most fascinating thing is still its almost magical ability to restore detail. Think of the many images in television and film, especially precious footage from a decade or more ago: with this technology, "enhancing" low-resolution footage until it is clear will no longer be difficult!

Source: 51CTO Technology Stack
