
AIGC's next stop: anticipation and vigilance pervade the world of AI video editing

Late last month, a Reddit user named chaindrop shared an AI-generated video on the r/StableDiffusion subreddit, sparking considerable controversy in the industry.

In the video, an ugly, deformed AI-generated "Will Smith" shovels handfuls of spaghetti into his mouth with terrifying enthusiasm. The "hellish" clip quickly spread to other social media platforms, with digital media outlet and broadcaster Vice saying it would "stay with you for the rest of your life" and The A.V. Club calling it "the natural end point for AI development." On Twitter alone, the video has racked up more than 8 million views.

The GIF below is an excerpt; each frame shows a simulated scene of Will Smith gobbling up pasta from a different angle.


Since the video of Will Smith eating pasta went viral, follow-up videos of Scarlett Johansson and Joe Biden eating pasta have appeared online, and there are even clips of Smith eating meatballs. Although these unnerving videos are becoming the internet's favorite nightmare material, text-to-video, like every form of AI-generated content before it, is accelerating into our lives.

Text-to-video: you write the script, I'll make the video

The workflow for creating the "Will Smith Eating Spaghetti" video with the open-source AI tool ModelScope is fairly simple: just supply the prompt "Will Smith eating spaghetti" and generate the result at 24 frames per second (FPS).
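For readers who want to try something similar, here is a minimal sketch. It assumes the ModelScope weights are accessible through the Hugging Face diffusers library under the model id damo-vilab/text-to-video-ms-1.7b and that a CUDA GPU is available; exact argument names and return types can differ between diffusers versions.

```python
# A minimal sketch, assuming the ModelScope text-to-video weights are available
# through Hugging Face diffusers as "damo-vilab/text-to-video-ms-1.7b" and that
# a CUDA GPU is present; arguments and return types vary between versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

result = pipe("Will Smith eating spaghetti", num_frames=24)
frames = result.frames  # newer diffusers releases may require result.frames[0]
export_to_video(frames, "will_smith_spaghetti.mp4", fps=24)
```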

ModelScope is a text-to-video diffusion model trained to create new videos from user prompts by analyzing millions of images and thousands of videos drawn from the LAION-5B, ImageNet, and WebVid datasets. The training data includes videos from Shutterstock, which is why a ghostly "Shutterstock" watermark appears on its output, as seen in the clip.

Major companies and research institutions at home and abroad are already quietly competing on the text-to-video track. Back on September 29 last year, Meta released Make-A-Video; on the initial announcement page, Meta showed sample videos generated from text, including "a young couple walking in heavy rain" and "a teddy bear painting a portrait".


Make-A-Video can also take static source images and animate them. For example, a still photo of a turtle, once processed by the AI model, can be made to look as if it is swimming.

Less than a week after Meta launched Make-A-Video, Google released Imagen Video, which generates 1280×768 high-definition video at 24 frames per second from written prompts. Imagen Video includes several notable stylistic capabilities, such as generating videos in the style of famous painters like Van Gogh, generating rotating 3D objects while preserving object structure, and rendering text in multiple animation styles. Google hopes this video synthesis model will "significantly reduce the difficulty of generating high-quality content."


Google subsequently introduced another video model, Phenaki. Unlike Imagen Video, which focuses on video quality, Phenaki primarily tackles video length: it can create longer videos from detailed prompts, producing clips that have both a story and length. Its ability to generate video of arbitrary duration comes from a new codec, C-ViViT, a model built on techniques honed in Imagen, Google's earlier text-to-image system, combined with a set of new components that turn static frames into smooth motion.

On February 6 of this year, Runway, one of the startups behind Stable Diffusion, unveiled a video-generation AI, the Gen-1 model, which can transform existing videos into new ones, changing their visual style according to a text prompt or any style specified by a reference image. On March 21, Runway released the Gen-2 model, which focuses on generating videos from scratch: it can apply the composition and style of an image or a text prompt to the structure of a source video (video-to-video), or generate video from text alone (text-to-video).


Standing on the shoulders of text-to-image

The key technology behind text-to-video models like Make-A-Video, and the reason they arrived sooner than some experts expected, is that they stand on the shoulders of the text-to-image giants.

According to Meta, instead of training the Make-A-Video model on labeled video data (e.g., captioned descriptions of the actions shown), the team took image synthesis data (still images paired with captions) and added unlabeled video data, so that the model learns a sense of where a text or image prompt might sit in time and space. It can then predict what comes after an image and show the scene in motion for a short period.
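Conceptually, one common way to stretch an image network over time, in the spirit of Make-A-Video's pseudo-3D layers, is to factorize each convolution into a pretrained spatial part plus a temporal part that starts out as an identity. The PyTorch sketch below is an illustrative toy under that assumption, not Meta's actual code.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Toy sketch of a factorized (2D spatial + 1D temporal) convolution,
    illustrating how an image model can be extended to video; not Meta's code."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Spatial conv: could be initialized from a pretrained text-to-image model.
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # Temporal conv: initialized as an identity, so the video model starts out
        # behaving like the image model applied to each frame independently.
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width).
        b, c, f, h, w = x.shape
        # Spatial conv on every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)
        x = x.reshape(b, f, c, h, w).permute(0, 2, 1, 3, 4)
        # Temporal conv across frames at every spatial location.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, f)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)
        return x

# Quick shape check on a dummy 8-frame latent "video".
video = torch.randn(1, 64, 8, 32, 32)
print(Pseudo3DConv(64)(video).shape)  # torch.Size([1, 64, 8, 32, 32])
```

Because the temporal convolution is initialized as an identity, such a model initially reproduces the image model frame by frame, and only the temporal components need to learn motion from unlabeled clips.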

From Stable Diffusion to Midjourney to DALL·E 2, text-to-image models have become enormously popular and reached an ever wider audience. As multimodal models and generative AI research continue to expand, recent industry work has sought to extend text-to-image diffusion models to text-to-video generation and editing by reusing them in the video domain, so that users can get a complete video just by giving a prompt.

Early text-to-image methods relied on techniques such as template-based generation and feature matching, but their ability to produce realistic and diverse images was limited. After the success of GANs, several deep-learning-based text-to-image methods were proposed, including StackGAN, AttnGAN, and MirrorGAN, which further improved image quality and diversity by introducing new architectures and attention mechanisms.

Later, as transformers advanced, new text-to-image approaches emerged. DALL·E, for example, is a 12-billion-parameter transformer model: it first converts images into tokens, then combines them with text tokens for joint autoregressive training. After that, Parti proposed a way to generate content-rich images containing multiple objects, and Make-A-Scene added a control mechanism via segmentation masks generated from text. Current methods build on diffusion models, which have taken text-to-image synthesis quality to a new level: GLIDE improved on DALL·E, and DALL·E 2 later made use of the contrastive model CLIP, mapping CLIP text embeddings to image embeddings through a diffusion process and then decoding them into images with a diffusion decoder.
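To make the autoregressive recipe concrete, here is a toy sketch of the general idea: text tokens and already-discretized image tokens are concatenated into one sequence, and a causal transformer learns to predict the next token. The vocabulary sizes and dimensions are made-up toy values; this illustrates the approach, not OpenAI's 12-billion-parameter model.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, DIM = 1000, 8192, 256  # toy sizes, nothing like the real model

class ToyTextToImageAR(nn.Module):
    """Decoder-style transformer over one sequence of text tokens + image tokens."""

    def __init__(self):
        super().__init__()
        # One shared embedding table covering the text vocabulary and image codebook.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len); a causal mask makes this an autoregressive model.
        seq_len = tokens.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.backbone(self.embed(tokens), mask=causal)
        return self.head(hidden)  # next-token logits over the text + image vocabulary

# Text prompt tokens followed by image tokens (in practice produced by a discrete
# image tokenizer such as a dVAE; random integers stand in for them here).
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (1, 64))
logits = ToyTextToImageAR()(torch.cat([text, image], dim=1))
print(logits.shape)  # torch.Size([1, 80, 9192])
```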


Because these models can produce high-quality images, researchers have set their sights on building models that generate video in the same way. Text-to-video, however, is still a relatively new research direction, and existing methods attempt generation with autoregressive transformers and diffusion processes.

For example, NUWA introduced a 3D transformer encoder-decoder framework that supports both text-to-image and text-to-video generation. Phenaki introduces a bidirectional masked transformer with a causal attention mechanism, allowing videos of arbitrary length to be generated from a sequence of text prompts; CogVideo fine-tunes the CogView2 text-to-image model with a multi-frame-rate hierarchical training strategy to better align text and video clips; and VDM jointly trains on image and video data, naturally extending the text-to-image diffusion model to video.

Imagen Video, shown earlier, builds a cascade of video diffusion models and uses spatial and temporal super-resolution models to generate high-resolution, temporally consistent video. Make-A-Video builds on a text-to-image synthesis model and leverages video data in an unsupervised way. Gen-1 extends Stable Diffusion and proposes a structure- and content-guided approach to video editing driven by descriptions of the desired output.

Today, more and more text-to-video models are iterating rapidly, and 2023 is shaping up to be the year of text-to-video.

Generative AI's Next Stop: Need for Improvement, Need for Vigilance

Even though text-to-image techniques and training sets can be reused, applying diffusion models to video is not simple; in particular, their probabilistic generation process makes temporal consistency hard to guarantee. The main subject tends to look slightly different from frame to frame and the background shifts, so the finished video looks as if everything is constantly in flux and lacks realism. At the same time, most methods require large amounts of labeled data and extensive training, which is prohibitively expensive.

Recently, the Picsart AI Research (PAIR) team introduced a novel zero-shot text-to-video generation task, along with a low-cost method that applies existing text-to-image synthesis models such as Stable Diffusion to the video domain. The study makes two key modifications: first, it enriches the latent codes of the generated frames with motion dynamics to keep the global scene and background consistent over time; second, it uses a new cross-frame attention mechanism that reprograms each frame's frame-level self-attention to attend to the first frame, preserving the context, appearance, and identity of foreground objects.
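The cross-frame attention trick can be illustrated in a few lines: each frame keeps its own queries but borrows the keys and values of the first frame, so every frame is rendered against the same anchor appearance. The sketch below is a simplified illustration of that idea, not the PAIR team's implementation.

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (frames, tokens, dim) projections from a frame-wise attention layer.
    Every frame attends with its own queries to the FIRST frame's keys and values,
    anchoring appearance and identity across the clip."""
    k_first = k[:1].expand_as(k)  # reuse frame 0's keys for all frames
    v_first = v[:1].expand_as(v)  # reuse frame 0's values for all frames
    return F.scaled_dot_product_attention(q, k_first, v_first)

# Dummy projections for an 8-frame clip with 64 spatial tokens of width 32.
frames, tokens, dim = 8, 64, 32
q, k, v = (torch.randn(frames, tokens, dim) for _ in range(3))
print(cross_frame_attention(q, k, v).shape)  # torch.Size([8, 64, 32])
```

In the full method this replacement happens inside the diffusion U-Net's self-attention layers, while the motion dynamics are added to the initial latent codes before denoising.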


Text2Video-Zero achieves zero-shot video generation from (i) text prompts (see rows 1 and 2), (ii) prompts combined with pose or edge guidance (see bottom right), and (iii) Video Instruct-Pix2Pix, i.e., instruction-guided video editing (see bottom left). The results are temporally consistent and closely follow the guidance and text prompts.

The significance of this approach is its low overhead combined with high-quality, reasonably consistent video. Moreover, the method is suitable not only for text-to-video synthesis but also for other tasks such as conditional and content-specialized video generation, as well as instruction-guided video editing in the style of Instruct-Pix2Pix.

Experiments show that even though it is not trained on additional video data, the method performs comparably to recent approaches, and in some cases even better. The technology could be used to create animations, commercials, and short films, saving cost and time, and it could supply visual material for education, making learning more lively and engaging.

As the technology keeps iterating, however, these text-to-video AI models will become more accurate, realistic, and controllable. As the gruesome "Smith eats pasta" video hints, such tools are likely to be used to generate false, hateful, explicit, or otherwise harmful content, and issues of trust and safety are already emerging.

Google says Imagen Video's training data comes from the publicly available LAION-400M image-text dataset along with "14 million video-text pairs and 60 million image-text pairs." Although Google filtered out "questionable data" during training, the data may still contain pornographic and violent material, as well as social stereotypes and cultural biases.

Meta also acknowledges that creating realistic videos on demand can bring certain social harms. At the bottom of the announcement page, Meta said that all AI-generated video content from Make-A-Video includes a watermark to "help ensure viewers know that the video is generated with AI, not captured." However, competing open-source text-to-video models are likely to follow, which could render Meta's watermark protection moot.

Philip Isola, a professor of artificial intelligence at the Massachusetts Institute of Technology, has said that when people see high-resolution video, they are likely to believe it. Some experts point out that with the advent of AI voice imitation and the increasingly accessible ability to alter and create realistic video, falsifying the words and deeds of public figures and ordinary people alike could cause immeasurable harm. Still, "Pandora's box has been opened": as the next stop for generative AI, text-to-video technology needs to keep improving, while we remain vigilant about its safety and ethical risks.
