
How does "AI" popularize science Sora?"Long Video Generation": Challenges, Methods and Prospects

Author: Chinese Society of Artificial Intelligence

Reposted from Zhuanzhi

How does "AI" popularize science Sora?"Long Video Generation": Challenges, Methods and Prospects

Video generation is a rapidly growing field of study that has gained significant attention due to its wide range of applications. A key aspect of this space is the generation of long-duration videos, which presents unique challenges and opportunities. This article presents the first review of recent advances in long video generation and summarizes them under two key paradigms: divide and conquer and temporal autoregression.

We delve into the models commonly used in each paradigm, covering aspects of network design and conditioning techniques. In addition, we provide a comprehensive overview and categorization of datasets and evaluation metrics, which are essential for advancing long video generation research.

We conclude with a summary of existing research and a discussion of the challenges emerging in this dynamic area and promising directions forward. We hope this review will serve as an important reference for researchers and practitioners in the field of long video generation.

https://www.zhuanzhi.ai/paper/6fcdf09712b06f301551fccf2dc693f8

How does "AI" popularize science Sora?"Long Video Generation": Challenges, Methods and Prospects

The fields of computer vision and artificial intelligence have experienced transformative growth, especially in video generation. Recently, there has been a proliferation of algorithms capable of producing high-quality, realistic video sequences. Notably, the generation of long videos, characterized by their extended duration and complex content, presents new challenges for the community and stimulates new research directions.

Still, there are gaps in research on long video generation. One gap in current research is the lack of a standard definition of a long video. The distinction between long and short videos often relies on relative metrics that vary across works, such as the number of frames (e.g., 512, 1024, or 3376 frames) or the duration (e.g., 3 or 5 minutes), compared with shorter videos (e.g., 30, 48, or 64 frames). Considering the diversity of research criteria, Figure 1 summarizes the video lengths claimed as "long" in existing studies, and on this basis we propose a definition of a long video. Specifically, a video is classified as "long" if its duration exceeds 10 seconds, assuming a standard frame rate of 10 fps, or equivalently, if it contains more than 100 frames. This definition is intended to provide a clear benchmark for identifying long videos in a variety of research contexts.
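As a small illustration of this definition, the following sketch (a hypothetical helper, not part of the surveyed work) classifies a clip by frame count under the assumed 10 fps standard frame rate:

```python
def is_long_video(num_frames: int, fps: float = 10.0) -> bool:
    """Classify a clip as 'long' per the definition above: duration over
    10 seconds at an assumed 10 fps, i.e. more than 100 frames."""
    duration_seconds = num_frames / fps
    return duration_seconds > 10.0  # equivalently, num_frames > 100 at 10 fps


# A 64-frame clip at 10 fps is short; a 1024-frame clip is long.
print(is_long_video(64))    # False
print(is_long_video(1024))  # True
```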

According to this definition, there has been significant progress in the length of generated long videos. Yin et al. (2023) proposed a divide-and-conquer diffusion architecture trained specifically for long videos to bridge the gap between training and inference, successfully generating videos of up to 1024 frames. Zhuang et al. (2024) harness the power of large language models (LLMs) to expand input text into scripts that guide the generation of minute-long videos. More recently, Sora (OpenAI, 2024) has achieved high-fidelity, seamless generation of long videos up to a minute long, supporting multiple resolutions and camera transitions. In addition, many prominent studies have introduced new structures and ideas on top of existing video generation models, paving the way for long video generation.

Even so, long video generation still faces many challenges. At its core, the inherent multidimensional complexity of long videos places heavy demands on hardware resources for processing and generation, resulting in significantly higher training and generation costs in time and resources. This raises the challenge of generating long videos within the constraints of existing resources. In addition, the scarcity of long video datasets cannot meet training requirements, preventing researchers from directly obtaining optimal parameters for long video generation models. Under these conditions, when the length of the generated video exceeds certain thresholds, it becomes difficult to maintain temporal consistency, continuity, and diversity. Moreover, current research surfaces several phenomena that deviate from established real-world physical laws, presenting unforeseen challenges that existing methods do not yet understand or directly control. Therefore, research on long video generation is still in its early stages, with many challenges remaining that require further exploration and development.

In this review, we conduct a comprehensive survey of existing research on long video generation, aiming to provide a clear overview of the current state of development and contribute to its future progress. The organization of the rest of this article is shown in Figure 2. We first define long video duration in Section 1. Section 2 discusses four different types of video generation models and control signals. Building on Sections 1 and 2, Sections 3.1 and 3.2 introduce two common paradigms for simplifying long video generation tasks: divide and conquer and temporal autoregression, respectively. Sections 4 and 5 discuss video quality improvements and hardware requirements. Finally, the article concludes with a summary of long video generation and a discussion of emerging trends and opportunities.

How does "AI" popularize science Sora?"Long Video Generation": Challenges, Methods and Prospects

We detail four popular types of video generation models: diffusion models, autoregressive models, generative adversarial networks (GANs), and mask modeling.

Diffusion models for video generation adapt the iterative refinement process of traditional diffusion techniques, originally designed for static images (Ho et al., 2020), to the dynamic domain of video. At the heart of these models, a sample of random noise is progressively denoised through a series of steps to produce a coherent video sequence. Each step is guided by learned gradients that predict the denoising based on the spatial content of individual frames and the temporal relationships between successive frames. This approach ensures that each frame in the resulting video is not only visually consistent with its predecessor but also contributes to the smoothness of the entire sequence.
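To make the iterative refinement concrete, here is a minimal, framework-agnostic sketch of the sampling loop such a model might run; the denoiser network, the noise schedule, and the simplified update rule are hypothetical placeholders rather than any specific published method.

```python
import numpy as np

def sample_video(denoiser, num_frames=16, height=64, width=64, channels=3, num_steps=50):
    """Reverse-diffusion sketch: start from pure noise over all frames and
    repeatedly apply a learned denoiser that sees the whole clip at once,
    so each update can use both spatial content and temporal relationships."""
    video = np.random.randn(num_frames, height, width, channels)  # initial noise
    for step in reversed(range(num_steps)):
        t = step / num_steps                          # normalized timestep
        predicted_noise = denoiser(video, t)          # placeholder network call
        video = video - predicted_noise / num_steps   # deliberately simplified update
    return video

# Any callable works as a stand-in denoiser for illustration, e.g.:
# clip = sample_video(lambda v, t: 0.1 * v)
```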

In video generation, spatial autoregressive models (Graves, 2013) synthesize content patch by patch, with the creation of each patch relying on its spatial relationship with previously generated patches. This process resembles a recursive algorithm, generating one patch at a time and thereby building the video frame by frame until it is complete. Within this framework, the spatial relationship between patches is crucial, as each subsequent patch must align seamlessly with its neighbors to ensure visual coherence throughout the frame. This approach exploits the spatial dependencies inherent in video content, ensuring that as the video progresses, each frame remains consistent and continuous with its predecessor, not only temporally but also spatially.
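A minimal sketch of this patch-by-patch process, assuming a hypothetical predict_patch model that returns the next patch conditioned on everything generated so far in the frame:

```python
import numpy as np

def generate_frame_autoregressively(predict_patch, grid=(4, 4), patch_size=16, channels=3):
    """Build one frame patch by patch in raster order; each new patch is
    conditioned on all previously generated patches of the same frame."""
    rows, cols = grid
    frame = np.zeros((rows * patch_size, cols * patch_size, channels))
    for r in range(rows):
        for c in range(cols):
            context = frame.copy()                   # everything generated so far
            patch = predict_patch(context, (r, c))   # placeholder model call
            frame[r * patch_size:(r + 1) * patch_size,
                  c * patch_size:(c + 1) * patch_size] = patch
    return frame
```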

Generative adversarial networks (GANs) (Creswell et al., 2020) generate video starting with a generator that converts a simple noise pattern into a series of video frames. This essentially random noise serves as a blank initial state for video production. Through layers of neural networks, the generator gradually shapes this noise into images that look like video frames, ensuring that each frame logically follows the previous one to create smooth motion and a believable narrative.

This evolution from noise to video is refined by feedback from a discriminator, a component that judges whether the resulting video looks real or fake. The generator learns from this judgment, improving its ability to produce more realistic videos over time. The ultimate goal is to generate a video that is indistinguishable from real footage and exhibits natural motion and transitions.
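The generator/discriminator interplay described above can be compressed into the following hypothetical training step; the two networks and their update calls are placeholders, not a specific published architecture.

```python
import numpy as np

def gan_training_step(generator, discriminator, real_clip, noise_dim=128):
    """One conceptual GAN step for video: the generator maps noise to a clip,
    the discriminator scores real versus generated clips, and each score
    drives the corresponding (placeholder) parameter update."""
    noise = np.random.randn(noise_dim)
    fake_clip = generator(noise)                   # noise -> sequence of frames
    real_score = discriminator(real_clip)          # should be judged "real"
    fake_score = discriminator(fake_clip)          # should be judged "fake"
    discriminator.update(real_score, fake_score)   # get better at telling them apart
    generator.update(fake_score)                   # get better at fooling the critic
    return fake_clip
```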

In video generation, mask modeling uses the concept of selectively masking parts of video frames to enhance the model's learning process. This technique starts by applying masks to certain segments of the video, effectively hiding them during training. The model then learns to predict these occluded portions based on the visible context and the temporal flow of the video. This process not only forces the model to understand the basic structure and dynamics of the video content but also improves its ability to generate coherent and continuous video sequences. By iteratively training on partially visible data, the model becomes adept at filling in missing information, ensuring that the generated video maintains the natural progression of scenes and actions.
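A sketch of this masking idea, with a hypothetical model that reconstructs hidden frames from the visible context; the loss is measured only on the frames that were masked out:

```python
import numpy as np

def masked_video_training_step(model, video, mask_ratio=0.5, rng=None):
    """Hide a random subset of frames, ask the model to reconstruct them from
    the visible frames, and score the reconstruction on the hidden frames only."""
    rng = rng or np.random.default_rng()
    num_frames = video.shape[0]
    hidden = rng.random(num_frames) < mask_ratio    # True where a frame is masked
    visible = video.copy()
    visible[hidden] = 0.0                           # blank out the masked frames
    reconstruction = model(visible, hidden)         # placeholder model call
    loss = np.mean((reconstruction[hidden] - video[hidden]) ** 2)
    return loss
```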

Long video generation paradigms

In the field of long video generation, the challenge of limited computing resources and the insufficient ability of existing models to directly generate videos of significant duration have led to the proposal of two different paradigms: divide and conquer and temporal autoregression, as shown in Figure 3. These paradigms aim to deconstruct the complex task of long video generation into a more manageable process, focusing on creating individual frames or short segments that can be logically assembled to complete the generation of long videos.

The divide-and-conquer paradigm begins by identifying keyframes that outline the main narrative, and then generates the frames in between to weave a coherent long video. The temporal autoregression paradigm, also known simply as autoregression, instead takes a sequential approach, generating short video segments conditioned on what precedes them. This paradigm is designed to ensure smooth transitions between segments, resulting in a continuous long video narrative. Unlike the hierarchical divide-and-conquer approach, which distinguishes storyline keyframes from supplemental filler frames, temporal autoregression abandons the hierarchy and focuses on directly generating detailed segments guided by information from the preceding frames.
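The contrast between the two paradigms can be summarized in the following purely illustrative sketch; the keyframe generator, frame interpolator, and segment generator are hypothetical placeholders.

```python
def divide_and_conquer(generate_keyframes, fill_between, prompt, num_keyframes=8):
    """Hierarchical paradigm: first generate storyline keyframes, then fill in
    the frames between each consecutive pair to obtain the full long video."""
    keyframes = generate_keyframes(prompt, num_keyframes)
    video = []
    for start, end in zip(keyframes[:-1], keyframes[1:]):
        video.extend(fill_between(start, end))
    return video


def temporal_autoregression(generate_segment, prompt, num_segments=8):
    """Sequential paradigm: generate short segments one after another, each
    conditioned on the previous segment, then concatenate them."""
    video, previous = [], None
    for _ in range(num_segments):
        segment = generate_segment(prompt, previous)
        video.extend(segment)
        previous = segment
    return video
```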

This part focuses on these two paradigms and examines how current research strategically simplifies long video generation into smaller, more manageable tasks. It also highlights how existing models are used to generate these shorter outputs, which are then assembled into a complete long video narrative.

How does "AI" popularize science Sora?"Long Video Generation": Challenges, Methods and Prospects

Conclusion and future directions

This article provides a comprehensive review of the latest research progress in long video generation. We systematically review four video generation models and delve into the paradigms built on them for generating long videos, classifying them into two main types: divide and conquer and autoregression. In addition, our work includes a comprehensive summary of the quality characteristics of long video generation, with a detailed explanation of existing studies aimed at enhancing these qualities, and a discussion of research focused on addressing resource requirements. To further advance the field, we identify several promising directions for future development.

Expansion of data resources

When training long video generation models, existing methods face a shortage of long video datasets, which prevents researchers from obtaining optimal model parameters from the training data. This in turn leads to issues such as incoherent long video generation and duplicated content. To address this problem, Gu et al. (2023) proposed a method that uses large language models and transforms existing video content to expand the dataset, effectively alleviating data scarcity. Future research can explore more efficient ways to enrich long video datasets.

Development of a unified generation paradigm

Existing methods for long video generation fall into two broad categories: divide and conquer and autoregression. While both can generate long videos from existing models, each has its drawbacks. Specifically, divide and conquer suffers from the scarcity of long video training datasets, requires significant generation time, and faces the challenge of predicting keyframes over long spans, where keyframe quality significantly affects the quality of the filled-in frames. Autoregression tends to accumulate errors and suffers from content degradation after many inference steps. Overall, each paradigm has its strengths and weaknesses. Future research may aim to develop a high-quality unified paradigm that integrates the strengths of both to address their respective limitations.

Generation with flexible lengths and aspect ratios

Current research primarily focuses on training and creating long video content with predetermined dimensions. However, the growing demand for diverse video content and simulations of real-world situations requires generating videos with variable lengths and aspect ratios. Sora (OpenAI, 2024) and FiT (Lu et al., 2024) have made progress in this area, with Sora enabling the generation of flexible video sizes and FiT demonstrating adaptability across both dimensions in image generation. Future research is likely to emphasize improving the flexibility of video generation, aiming to improve the applicability of generative models in real-world settings and further stimulate innovation in the use of video content.

Generation of ultra-long videos

In the survey presented in Figure 1, the maximum duration of long videos in existing studies is 1 hour (Skorokhodov et al., 2022). However, in real applications such as movies and driving simulations, video durations are usually 90 minutes or even longer. We refer to these as "ultra-long videos". Future research can therefore focus on generating ultra-long videos and on addressing the challenges of perspective shifts, character and scene development, and action and plot enrichment that arise with longer durations.

Enhanced controllability and real-world simulation

In long video generation, current models operate like black boxes during generation, making it challenging to understand the cause of errors, such as the violations of the laws of physics observed with Sora (OpenAI, 2024). Existing solutions offer little insight into the origin of such problems and lack intuitive, controllable ways to correct them. Therefore, new methods and techniques are needed to enhance our understanding and control of generative models and make them more suitable for real-world applications.

【Disclaimer】Reprinted for non-commercial educational and scientific research purposes and for the dissemination of academic information only. The copyright belongs to the original author; in case of any infringement, please contact us and we will delete the content promptly.
