
AIGC's next stop: anticipation and vigilance pervade the world of AI video editing

Late last month, a Reddit user named chaindrop shared an AI-generated video on the r/StableDiffusion subreddit, sparking considerable controversy in the industry.

In the video, an ugly, deformed AI-generated "Will Smith" shovels handfuls of spaghetti into his mouth with terrifying enthusiasm. The "hellish" clip quickly spread to other social media platforms, with digital media outlet and broadcaster Vice saying it would "stay with you for the rest of your life" and The A.V. Club calling it "the natural end point for AI development." On Twitter alone, the video has racked up more than 8 million views.

The GIF below is an excerpt; each frame shows a simulated scene of Will Smith gobbling up pasta from a different angle.


Since the video of Will Smith eating pasta went viral, follow-up videos of Scarlett Johansson and Joe Biden eating pasta have appeared online, and there are even clips of Smith eating meatballs. Although these unnerving videos are becoming the internet's favorite nightmare material, text-to-video, like every form of AI-generated content before it, is accelerating into our lives.

Text-to-video: you write the script, I'll make the video

The workflow for creating the "Will Smith Eating Spaghetti" video with the open-source AI tool ModelScope is fairly simple: just supply the prompt "Will Smith eating spaghetti" and generate the result at 24 frames per second (FPS).
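For readers who want to try something similar, here is a minimal sketch. It assumes the ModelScope weights are accessible through the Hugging Face diffusers library under the model id damo-vilab/text-to-video-ms-1.7b and that a CUDA GPU is available; exact argument names and return types can differ between diffusers versions.

```python
# A minimal sketch, assuming the ModelScope text-to-video weights are available
# through Hugging Face diffusers as "damo-vilab/text-to-video-ms-1.7b" and that
# a CUDA GPU is present; arguments and return types vary between versions.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

result = pipe("Will Smith eating spaghetti", num_frames=24)
frames = result.frames  # newer diffusers releases may require result.frames[0]
export_to_video(frames, "will_smith_spaghetti.mp4", fps=24)
```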

ModelScope is a text-to-video diffusion model trained to create new videos from user prompts by analyzing millions of images and thousands of videos drawn from the LAION-5B, ImageNet, and WebVid datasets. The training data includes videos from Shutterstock, which is why a ghostly "Shutterstock" watermark appears on its output, as seen in the clip.

Major companies and research institutions at home and abroad are already quietly competing on the text-to-video track. Back on September 29 last year, Meta released Make-A-Video; on the initial announcement page, Meta showed sample videos generated from text, including "a young couple walking in heavy rain" and "a teddy bear painting a portrait".


Make-A-Video can also take static source images and animate them. For example, a still photo of a turtle, once processed by the AI model, can be made to look as if it is swimming.

Less than a week after Meta launched Make-A-Video, Google released Imagen Video, which generates 1280×768 high-definition video at 24 frames per second from written prompts. Imagen Video includes several notable stylistic capabilities, such as generating videos in the style of famous painters like Van Gogh, generating rotating 3D objects while preserving object structure, and rendering text in multiple animation styles. Google hopes this video synthesis model will "significantly reduce the difficulty of generating high-quality content."


Google subsequently introduced another video model, Phenaki. Unlike Imagen Video, which focuses on video quality, Phenaki primarily tackles video length: it can create longer videos from detailed prompts, producing clips that have both a story and length. Its ability to generate video of arbitrary duration comes from a new codec, C-ViViT, a model built on techniques honed in Imagen, Google's earlier text-to-image system, combined with a set of new components that turn static frames into smooth motion.

On February 6 of this year, Runway, one of the startups behind Stable Diffusion, unveiled a video-generation AI, the Gen-1 model, which can transform existing videos into new ones, changing their visual style according to a text prompt or any style specified by a reference image. On March 21, Runway released the Gen-2 model, which focuses on generating videos from scratch: it can apply the composition and style of an image or a text prompt to the structure of a source video (video-to-video), or generate video from text alone (text-to-video).


Standing on the shoulders of text-to-image

The key technology behind text-to-video models like Make-A-Video, and the reason they arrived sooner than some experts expected, is that they stand on the shoulders of the text-to-image giants.

According to Meta, instead of training the Make-A-Video model on labeled video data (e.g., captioned descriptions of the actions shown), the team took image synthesis data (still images paired with captions) and added unlabeled video data, so that the model learns a sense of where a text or image prompt might sit in time and space. It can then predict what comes after an image and show the scene in motion for a short period.
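Conceptually, one common way to stretch an image network over time, in the spirit of Make-A-Video's pseudo-3D layers, is to factorize each convolution into a pretrained spatial part plus a temporal part that starts out as an identity. The PyTorch sketch below is an illustrative toy under that assumption, not Meta's actual code.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Toy sketch of a factorized (2D spatial + 1D temporal) convolution,
    illustrating how an image model can be extended to video; not Meta's code."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Spatial conv: could be initialized from a pretrained text-to-image model.
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # Temporal conv: initialized as an identity, so the video model starts out
        # behaving like the image model applied to each frame independently.
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width).
        b, c, f, h, w = x.shape
        # Spatial conv on every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)
        x = x.reshape(b, f, c, h, w).permute(0, 2, 1, 3, 4)
        # Temporal conv across frames at every spatial location.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, f)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)
        return x

# Quick shape check on a dummy 8-frame latent "video".
video = torch.randn(1, 64, 8, 32, 32)
print(Pseudo3DConv(64)(video).shape)  # torch.Size([1, 64, 8, 32, 32])
```

Because the temporal convolution is initialized as an identity, such a model initially reproduces the image model frame by frame, and only the temporal components need to learn motion from unlabeled clips.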

From Stable Diffusion to Midjourney to DALL·E 2, text-to-image models have become enormously popular and reached an ever wider audience. As multimodal models and generative AI research continue to expand, recent industry work has sought to extend text-to-image diffusion models to text-to-video generation and editing by reusing them in the video domain, so that users can get a complete video just by giving a prompt.

Early text-to-image methods relied on techniques such as template-based generation and feature matching, but their ability to produce realistic and diverse images was limited. After the success of GANs, several deep-learning-based text-to-image methods were proposed, including StackGAN, AttnGAN, and MirrorGAN, which further improved image quality and diversity by introducing new architectures and attention mechanisms.

Later, as transformers advanced, new text-to-image approaches emerged. DALL·E, for example, is a 12-billion-parameter transformer model: it first converts images into tokens, then combines them with text tokens for joint autoregressive training. After that, Parti proposed a way to generate content-rich images containing multiple objects, and Make-A-Scene added a control mechanism via segmentation masks generated from text. Current methods build on diffusion models, which have taken text-to-image synthesis quality to a new level: GLIDE improved on DALL·E, and DALL·E 2 later made use of the contrastive model CLIP, mapping CLIP text embeddings to image embeddings through a diffusion process and then decoding them into images with a diffusion decoder.
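To make the autoregressive recipe concrete, here is a toy sketch of the general idea: text tokens and already-discretized image tokens are concatenated into one sequence, and a causal transformer learns to predict the next token. The vocabulary sizes and dimensions are made-up toy values; this illustrates the approach, not OpenAI's 12-billion-parameter model.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, DIM = 1000, 8192, 256  # toy sizes, nothing like the real model

class ToyTextToImageAR(nn.Module):
    """Decoder-style transformer over one sequence of text tokens + image tokens."""

    def __init__(self):
        super().__init__()
        # One shared embedding table covering the text vocabulary and image codebook.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len); a causal mask makes this an autoregressive model.
        seq_len = tokens.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.backbone(self.embed(tokens), mask=causal)
        return self.head(hidden)  # next-token logits over the text + image vocabulary

# Text prompt tokens followed by image tokens (in practice produced by a discrete
# image tokenizer such as a dVAE; random integers stand in for them here).
text = torch.randint(0, TEXT_VOCAB, (1, 16))
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (1, 64))
logits = ToyTextToImageAR()(torch.cat([text, image], dim=1))
print(logits.shape)  # torch.Size([1, 80, 9192])
```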


Because these models can produce high-quality images, researchers have set their sights on building models that generate video in the same way. Text-to-video, however, is still a relatively new research direction, and existing methods attempt generation with autoregressive transformers and diffusion processes.

For example, NUWA introduced a 3D transformer encoder-decoder framework that supports both text-to-image and text-to-video generation. Phenaki introduces a bidirectional masked transformer with a causal attention mechanism, allowing videos of arbitrary length to be generated from a sequence of text prompts; CogVideo fine-tunes the CogView2 text-to-image model with a multi-frame-rate hierarchical training strategy to better align text and video clips; and VDM jointly trains on image and video data, naturally extending the text-to-image diffusion model to video.

Imagen Video, shown earlier, builds a cascade of video diffusion models and uses spatial and temporal super-resolution models to generate high-resolution, temporally consistent video. Make-A-Video builds on a text-to-image synthesis model and leverages video data in an unsupervised way. Gen-1 extends Stable Diffusion and proposes a structure- and content-guided approach to video editing driven by descriptions of the desired output.

Today, more and more text-to-video models are iterating rapidly, and 2023 is shaping up to be the year of text-to-video.

Generative AI's Next Stop: Need for Improvement, Need for Vigilance

Even though text-to-image techniques and training sets can be reused, applying diffusion models to video is not simple; in particular, their probabilistic generation process makes temporal consistency hard to guarantee. The main subject tends to look slightly different from frame to frame and the background shifts, so the finished video looks as if everything is constantly in flux and lacks realism. At the same time, most methods require large amounts of labeled data and extensive training, which is prohibitively expensive.

Recently, the Picsart AI Research (PAIR) team introduced a novel zero-shot text-to-video generation task, along with a low-cost method that applies existing text-to-image synthesis models such as Stable Diffusion to the video domain. The study makes two key modifications: first, it enriches the latent codes of the generated frames with motion dynamics to keep the global scene and background consistent over time; second, it uses a new cross-frame attention mechanism that reprograms each frame's frame-level self-attention to attend to the first frame, preserving the context, appearance, and identity of foreground objects.
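The cross-frame attention trick can be illustrated in a few lines: each frame keeps its own queries but borrows the keys and values of the first frame, so every frame is rendered against the same anchor appearance. The sketch below is a simplified illustration of that idea, not the PAIR team's implementation.

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (frames, tokens, dim) projections from a frame-wise attention layer.
    Every frame attends with its own queries to the FIRST frame's keys and values,
    anchoring appearance and identity across the clip."""
    k_first = k[:1].expand_as(k)  # reuse frame 0's keys for all frames
    v_first = v[:1].expand_as(v)  # reuse frame 0's values for all frames
    return F.scaled_dot_product_attention(q, k_first, v_first)

# Dummy projections for an 8-frame clip with 64 spatial tokens of width 32.
frames, tokens, dim = 8, 64, 32
q, k, v = (torch.randn(frames, tokens, dim) for _ in range(3))
print(cross_frame_attention(q, k, v).shape)  # torch.Size([8, 64, 32])
```

In the full method this replacement happens inside the diffusion U-Net's self-attention layers, while the motion dynamics are added to the initial latent codes before denoising.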


Text2Video-Zero achieves zero-shot video generation from (i) text prompts (see rows 1 and 2), (ii) prompts combined with pose or edge guidance (see bottom right), and (iii) Video Instruct-Pix2Pix, i.e., instruction-guided video editing (see bottom left). The results are temporally consistent and closely follow the guidance and text prompts.

The significance of this approach is its low overhead combined with high-quality, reasonably consistent video. Moreover, the method is suitable not only for text-to-video synthesis but also for other tasks such as conditional and content-specialized video generation, as well as instruction-guided video editing in the style of Instruct-Pix2Pix.

Experiments show that even though it is not trained on additional video data, the method performs comparably to recent approaches, and in some cases even better. The technology could be used to create animations, commercials, and short films, saving cost and time, and it could supply visual material for education, making learning more lively and engaging.

As the technology keeps iterating, however, these text-to-video AI models will become more accurate, realistic, and controllable. As the gruesome "Smith eats pasta" video hints, such tools are likely to be used to generate false, hateful, explicit, or otherwise harmful content, and issues of trust and safety are already emerging.

Google says Imagen Video's training data comes from the publicly available LAION-400M image-text dataset along with "14 million video-text pairs and 60 million image-text pairs." Although Google filtered out "questionable data" during training, the data may still contain pornographic and violent material, as well as social stereotypes and cultural biases.

Meta also acknowledges that creating realistic videos on demand can bring certain social harms. At the bottom of the announcement page, Meta said that all AI-generated video content from Make-A-Video includes a watermark to "help ensure viewers know that the video is generated with AI, not captured." However, competing open-source text-to-video models are likely to follow, which could render Meta's watermark protection moot.

Philip Isola, a professor of artificial intelligence at the Massachusetts Institute of Technology, has said that when people see high-resolution video, they are likely to believe it. Some experts point out that with the advent of AI voice imitation and the increasingly accessible ability to alter and create realistic video, falsifying the words and deeds of public figures and ordinary people alike could cause immeasurable harm. Still, "Pandora's box has been opened": as the next stop for generative AI, text-to-video technology needs to keep improving, while we remain vigilant about its safety and ethical risks.
