
If you don't understand this word this year, you risk losing your job


What would it be like for two golden retrievers to record a podcast on the top of a mountain?

On February 16, Beijing time, a netizen posted this text prompt for the AI model Sora on social media. OpenAI CEO Sam Altman typed it into Sora, generated a roughly 10-second high-definition video, and posted the result. In the clip, two golden retrievers wearing headphones sit relaxed on a red-and-white picnic blanket on a mountaintop, two microphones set up in front of them. The dogs' glossy coats and the surrounding alpine scenery are so lifelike that the video is almost indistinguishable from a TV documentary.


Video generated by Sora from the prompt "two golden retrievers recording a podcast on the top of a mountain." Source: video screenshot

Following AI-generated text and images, OpenAI has now officially entered the field of video generation. In the early morning of February 16, Beijing time, OpenAI released Sora, its first text-to-video AI model, which lets users generate a video of up to one minute from a written description of a scene. So far OpenAI has published only a few dozen Sora videos on its official website; the feature is not yet open to the public and is limited to security red-teamers, with access also planned for selected artists and designers. Even so, the videos' extraordinarily faithful rendering of their text prompts quickly set the internet ablaze. Some netizens marveled that "Sora is going to upend the film and television industry" and that it "will bring video content into the era of 'zero-based creation'," while others worried that with the boundary between AI and reality so hard to distinguish, "reality no longer exists."

Wang Shuai, an engineer at Nvidia, exclaimed after Sora's release that it was "another ChatGPT moment." In an interview with China Newsweek, he said that Sora has significantly raised the ceiling of AI text-to-video capability, which is undoubtedly the industry consensus. But there is still disagreement about what Sora's product and commercialization path will be, and how the product will generate value. "A leap in technical capability doesn't mean it can solve every problem. We're still a long way from helping Hollywood directors make movies without a camera."

Why does Sora outperform other models?

Even people who don't follow large-model technology will have noticed the 59-second video circulating widely on social networks over the past two days: a woman in sunglasses, a red dress, a leather jacket, and leather boots strolls through the streets of Tokyo as the camera pushes in smoothly on her face, close enough to make out her freckles and skin texture. This single-take video required no human shooting or editing; a text description was fed into Sora, and the AI generated the rest.


Screenshot of a video generated by Sora. The prompt reads: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about. Source: OpenAI official website

One minute is not long for a video, but it is a huge leap for AI-generated video. Over the past year or so, phenomenally popular applications such as ChatGPT and Midjourney have emerged, and the rapid progress of AI text and image generation has been exhilarating. In December 2023, a Google team released VideoPoet, a video-generation model that could produce coherent, large-motion videos up to 10 seconds long in a single pass, surpassing other generative models stuck at 3 to 4 seconds, which at the time was enough to excite the industry.

Nie Zaiqing, chief researcher at Tsinghua University's Institute for AI Industry Research, explained to China Newsweek that a main reason text-to-video clips were so short was that the AI did not know what would happen next, and therefore did not know what content to generate.

Sora's videos are longer, and viewers can clearly feel that they are more logically coherent: the model "demonstrates" a degree of understanding of the real world. In one video publicly released by OpenAI, a vintage SUV drives up a steep mountain road, the body bounces naturally, and the tires kick up dust, which makes the clip all the more convincing. OpenAI calls this capability "an early prototype of a world model." Jim Fan, a senior research scientist at Nvidia, likewise marveled on social media that Sora is not just a creative toy but a data-driven physics engine that can simulate real or virtual worlds.


Screenshot of a video generated by Sora. The prompt reads: The camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road on a hillside surrounded by pine trees. Its tires kick up dust, and sunlight shines on the SUV as it drives along the dirt road, casting a warm glow over the scene. The dirt road curves gently into the distance, with no other cars or vehicles in sight. The road is lined with redwood trees, scattered with patches of greenery. Seen from behind, the car follows the curves with ease and appears to be driving on rugged terrain. Steep hills and mountains surround the road, under a clear blue sky with wispy clouds.

Nie Zaiqing said a world model can be understood simply as AI modeling the real world, restoring human-like understanding of the people and things in it: "Take a paper cup, for example. The AI 'knows' it is very light, and that if the cup were made of iron it would be heavy; it knows that if a person drives the wrong way, other vehicles will slow down or swerve in alarm."

Sora's ability to accurately understand the meaning of a prompt and render realistic footage rests on the same logic as ChatGPT: brute-force scale works miracles. Nie Zaiqing noted that a past obstacle for text-to-video was that training videos had to be normalized to a uniform resolution, aspect ratio, and duration, which was cumbersome. Sora instead proposes spacetime patches, which convert heterogeneous video data into a unified visual representation, the equivalent of the tokens (the smallest units of text) used in ChatGPT's training. According to OpenAI's official introduction, Sora can sample widescreen 1920x1080 video, vertical 1080x1920 video, and everything in between. The more flexible sampling expands the amount of usable video data.
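OpenAI has not published how its patching is implemented; as a rough sketch of the idea, the snippet below cuts a video array into flattened spacetime patches so that clips of any resolution or orientation become one uniform token sequence. The patch sizes and function name are illustrative assumptions, not OpenAI's actual code.

```python
import numpy as np

def video_to_spacetime_patches(video, t_size=4, h_size=16, w_size=16):
    """Split a video array of shape (T, H, W, C) into flattened spacetime patches.

    Each patch spans t_size frames and an h_size x w_size pixel region, so
    videos of any duration or aspect ratio map to one sequence of vectors.
    """
    T, H, W, C = video.shape
    # Trim so every dimension divides evenly into whole patches.
    T, H, W = T - T % t_size, H - H % h_size, W - W % w_size
    v = video[:T, :H, :W]
    # Group axes into (patch grid, within-patch) pairs, then flatten each patch.
    v = v.reshape(T // t_size, t_size, H // h_size, h_size, W // w_size, w_size, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)                # (nT, nH, nW, t, h, w, C)
    return v.reshape(-1, t_size * h_size * w_size * C)  # (num_patches, patch_dim)

# A widescreen clip and a vertical clip both become plain patch sequences
# of the same shape, with no resizing to a canonical format.
wide = np.zeros((8, 144, 256, 3), dtype=np.float32)
tall = np.zeros((8, 256, 144, 3), dtype=np.float32)
print(video_to_spacetime_patches(wide).shape)  # (288, 3072)
print(video_to_spacetime_patches(tall).shape)  # (288, 3072)
```

The point of the trick is visible in the output: two clips with different aspect ratios yield identically shaped patch sequences, which is what lets a single model train on mixed-format video.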

In addition, training a text-to-video model requires a large volume of video data paired with captions. OpenAI leverages its DALL·E 3 and GPT models to generate captions for the training video set, which improves both the text fidelity and the overall quality of the generated video.

But in Wang Shuai's view, the model techniques are essentially an open secret; what makes Sora's results so astonishing is the data OpenAI feeds the model. "How much data do they use? How do they select it? These are just bullet points in OpenAI's report, with almost no detail, but industry insiders know this is the key."

Saining Xie, an assistant professor of computer science at New York University and a well-known machine-learning scholar, is one of the lead authors of an influential paper on diffusion transformers. Sora is a diffusion model that combines diffusion with the Transformer architecture underlying ChatGPT to achieve a breakthrough in vision. Xie also said bluntly on social media that OpenAI did not discuss the source or construction of its data at all, which may imply that data is the most critical factor in Sora's success. He speculated that OpenAI may have used game-engine data along with movies, documentaries, and long cinematic shots, and that the quality of the data matters enormously.
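Sora's internals are unpublished, but the general shape of a diffusion model is well known: corrupt data with noise during training, then generate by starting from pure noise and removing predicted noise step by step. The sketch below shows a generic DDPM-style sampling loop with a stand-in `predict_eps` function where the trained network (in Sora's case, reportedly a Transformer over spacetime patches) would go; it is an illustration of diffusion in general, not Sora's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, alpha_bar):
    """Forward process: blend clean data x0 with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps, eps

def sample(predict_eps, shape, betas):
    """Reverse process: start from pure noise and denoise step by step.

    predict_eps(x_t, t) stands in for the trained noise-prediction network.
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)              # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predict_eps(x, t)
        # DDPM mean update: subtract the predicted noise component.
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                               # re-inject noise except at the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Toy run with a dummy predictor that always guesses zero noise.
betas = np.linspace(1e-4, 0.02, 50)
out = sample(lambda x, t: np.zeros_like(x), (4, 4), betas)
print(out.shape)  # (4, 4)
```

With a real trained predictor, the same loop is what turns random noise into a coherent image or video frame; the "data is key" argument above is about what that predictor is trained on.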

Sora still has clear weaknesses

"You can draw exactly what you imagine and then bring it to life," said Tim Brooks, an OpenAI researcher who worked on Sora. Sora lowers the technical barrier to video production, but it raises the demands on storytelling skill. People can't help worrying that Sora will deal a heavy blow to the Hollywood film industry, putting large numbers of directors, cinematographers, makeup artists, prop masters, editors, and voice actors out of work.

An AI industry researcher who asked not to be named told China Newsweek that a video-generation tool is not the same as a director or screenwriter who can tell stories, just as the printing press could not replace Li Bai and Du Fu. Mass generation only lowers the threshold and cost of producing video at scale; amid the competition of an ever-larger flood of works, the creativity, storytelling, and artistry of a video will face ever higher demands.

At this stage, Sora still has obvious weaknesses. On social media, Tim Brooks posted a video Sora generated from the prompt "People were relaxing at the beach, and then a shark jumped out of the water and surprised everyone." In the video, a woman turns her head to call for help after seeing the shark, but the angle of the turn is so extreme that netizens mocked it as "an Exorcist-style 180-degree head rotation." OpenAI has also openly acknowledged Sora's current limitations: it does not always accurately simulate the physics of interactions in the real world. In videos it has generated, a person runs backward on a treadmill, people or animals spontaneously appear in some scenes, and the model even treats a chair as a deformable object.


OpenAI scientist Tim Brooks posted a screenshot of a Sora-generated video on social media; because the woman in the video turns her head at too extreme an angle, netizens mocked it as "an Exorcist-style 180-degree head rotation." Source: video screenshot

In Nie Zaiqing's view, the videos people are seeing now have been hand-picked by OpenAI. Everyone finds them amazing, but how many flawed outputs lie behind them cannot be determined; the model's true performance will not be clear until more videos are released.

Unlike some technology practitioners' optimism and excitement, Wang Shuai is more sober about Sora, and more concerned with how models like it will be put to use. He was recently asked how the model controls each object in a Sora video, making the people and vehicles look so natural. In reality, the model's mode of operation is nothing like human thinking: it does not even know that objects exist; big data tells it what each part of the frame should be. Video generation relies on enormous amounts of data, and with enough data the videos come out well. But future video-editing tasks may fall outside the training distribution, and where the data does not cover a case, the results may be unsatisfactory.

Wang Shuai further explained that what matters is not only the scale of the data but also its granularity. "Generating a good 60-second video is not the end of the story. If people later want to edit videos, say, removing the sunglasses from the woman walking through Tokyo, and the training data contains no video of that action, or very little of it, the model may fail at the instruction or handle it poorly." A great deal of debugging is therefore needed during testing, which is time- and labor-intensive. Wang Shuai noted that Sora's learning logic is the same as ChatGPT's: the model has seen enough data to grasp the regularities within it. But there is still a gap to artificial general intelligence, because Sora is essentially imitating the videos in its training data.

"A single text prompt might correspond to tens of millions of possible videos. Right now OpenAI shows you one that looks good enough, and everyone concludes the model is powerful, but you don't know whether it can generate the others well," Wang Shuai said. It is like a question with 100 valid answers: producing one correct answer does not prove the model has mastered the question. Only if it can also give the remaining 99 does it show that its understanding is sound.

As for whether Sora will upend the film and television industry, Wang Shuai believes everyone is only guessing at broad directions, and the answer is likely to be unexpected. "When Google was founded, no one thought it would make its money from advertising. People initially hoped Facebook would transform how we socialize, never anticipating the later scandals over leaked user data. The same goes for AI. Many engineers think everything is settled once the technical problems are solved, but that is not the case. Technology's impact at the business and social level is an extremely complex system that cannot be understood through technical logic alone."

(Wang Shuai is a pseudonym.)

Reporter: Yang Zhijie

Editor: Du Wei
