
Interview with the OpenAI Sora team leads: 20 questions digging into the details of R&D; Sora is still in its GPT-1 period

Author | Zhidong

Compiled by | A pen

Edited by | Yun Peng

Zhidong reported on April 26 that AI-generated video is not merely an upgraded image generator but a key step toward AGI (Artificial General Intelligence). On the podcast "No Priors", Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles joined the hosts to discuss OpenAI's recently announced generative video model, Sora, which can generate realistic, visually coherent, high-definition video clips of up to one minute from text prompts.

In the interview, the three leads discuss Sora's development process and share their views on potential applications such as education, entertainment, and digital identity. For now, however, the team's focus remains on the fundamentals of the technology rather than specific downstream applications. Brooks said that while the idea of digital avatars makes sense, the team has not yet explored it; in his view, Sora is still in the GPT-1 era of AI video models.

In addition, Ramesh said that Sora's visual aesthetic is compelling, but that aesthetic is not deeply embedded in the model. On safety, Sora also faces challenges such as misinformation and offensive content. In response, the team will take all feasible safety measures to prevent deepfakes and misinformation from being generated while ensuring the model provides real value to users, and will open up the technology gradually while respecting users' right to free expression.

Peebles discussed how to make the technology more widely available, including reducing costs and dealing with potential misinformation and the associated risks. He noted that the team must take safety considerations seriously and proactively address those risks, which has become one of the important tasks on its research path.

The following are the 20 Q&As from the interview with the Sora team leads, lightly edited for readability while staying as close to the original intent as possible:

1. From text to video, from AI to AGI, how did you start researching this field?

Peebles: We strongly believe that models like Sora are a critical step toward AGI. A good example is the scene of a group of people walking through Tokyo in winter, an extremely complex environment: you can imagine a camera flying over the scene, many people interacting with each other, talking, holding hands, with vendors nearby. This example shows how Sora can model extremely complex environments and worlds within the weights of a neural network.


To generate a truly realistic video, it's essential to learn how people work, how they interact, and ultimately their thought process. This includes not only humans, but also animals and other objects that need to be modeled. So as we continue to scale models like Sora, I'm sure we'll be able to build something like a world simulator.

This means anyone could run their own simulator and interact with the characters inside it. That kind of interaction is one of the paths to AGI, and as we scale Sora up in the future, we will see AGI come to fruition.

2. What do you need to do before using Sora more broadly?

Brooks: We really want to talk to people outside of OpenAI and think about how Sora is going to impact the world and how it can help people. At this point, we don't have an immediate product plan, or even a clear timeline for one. But we have started giving access to Sora to a small group of artists and red-team members to begin exploring the impact it could have.

We've received feedback from artists on how to make Sora the most useful tool for them, and feedback from red-team members to help us keep it safe and think about how to present it to the public. This feedback will shape the roadmap for our future research and inform whether we eventually launch a product, and on what timeline.


3. Can you share the feedback you got?

Ramesh: We've opened up access to Sora to a small group of artists and creators for early feedback. We believe that the most important thing is controllability. Currently, the model only accepts text as input. While this feature is already quite useful, it's still limited by the need to describe exactly what you want. Therefore, we are looking at how we can extend the functionality of the model in the future to accept input beyond text.

4. Have you seen a favorite thing that artists or others have made with it, or a favorite video, or something that you find inspiring?

Brooks: It's amazing to see how artists are using this model. We had a few ideas of our own, but people who make a career of producing creative content are incredibly inventive. For example, Shy Kids made a really cool short film, Air Head, about a character with a balloon for a head, a story they love. It's great to see Sora unlock that and make the story easier to tell. For me it's less about Sora producing a particular clip or video and more about the stories these artists want to tell and share, which Sora can help make happen.

Peebles: My personal favorite sample is our Bling Zoo, which I tweeted about the day we announced Sora. It's essentially a multi-shot scene of a New York zoo crossed with a jewelry store: you can see a saber-toothed tiger in this glittering environment as if it were an ornament, which is very surreal.


I love these samples because, as someone who loves to create content but doesn't actually have the creative skills, I was able to play with this model, easily generate a whole bunch of ideas, and end up with some great work. And iterating on prompts takes far less time than it would take to actually produce that content by hand.

So for me, it's very interesting to operate this model and get what you want out of it. I'm glad to see that artists also enjoy using this model and get creative inspiration from it.

5. When will we see content produced with Sora or other models by professionals become part of the broader media landscape?

Brooks: Good question. I don't have a prediction of the exact timeline, but I'm very interested in what people might use it for in addition to traditional films. In the next few years, we may see people making more and more films, but I think people will also find new ways to use these models that are completely different from what we're used to in the current media. When you tell these models what you want to see, and they're able to respond in a very different way, that makes for a very different paradigm.

Maybe a whole new mode of interaction will emerge, similar to how truly creative artists interact with content. So, I'm really excited about the new ideas that people are going to try. It's really interesting because it's different from what we currently have.

6. When you think about the capabilities of a world-simulation model, do you see it becoming a physics engine for simulation? Would people actually simulate wind tunnels with it? Is it a foundation for robotics?

Peebles: I think you're right on point. For an application like robotics, you can learn a lot of things from video that you don't necessarily learn from other modalities that companies like OpenAI have invested heavily in before, such as language: the way arms and joints move through space, and things like that.

Back in that Tokyo scene, you can observe how people's legs move and how they make contact with the ground in a physically accurate way, so you can learn a lot about the physical world during training. We believe that raw video is essential to the development of things like physical embodiment.

7. Can you explain what a diffusion Transformer is for a broad technical audience?

Brooks: Sora builds on OpenAI's DALL-E and GPT models. Diffusion is a process for generating data, in our case video. The process starts from pure noise and removes a little of it over many repeated steps, until enough noise has been removed that only a clean sample remains. That's how we generate videos: we start with a noisy video and progressively remove the noise.

From an architectural perspective, it's critical that our models are scalable: they need to be able to learn from large amounts of data and capture the very complex and challenging relationships in video. We therefore adopted an architecture similar to the GPT models, the Transformer. In the technical report we published on Sora, we show results obtained from the same prompt using small, medium, and large amounts of compute.


We believe that as we continue to add compute resources and data, we will continue to improve these models so that they can handle more tasks, such as better simulations and longer-term generation.
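To make the denoising idea above concrete, here is a minimal sketch of a DDPM-style sampling loop in Python. It is illustrative only: the linear noise schedule, step count, and the `denoiser(x, t)` model are assumptions for the example, not details of Sora's actual implementation.

```python
import torch

def sample(denoiser, shape=(8, 3, 32, 32), num_steps=1000):
    """Toy DDPM-style sampling loop: start from pure noise and repeatedly
    remove a little of it until only a clean sample remains.
    `denoiser(x, t)` is assumed to predict the noise present in x at step t."""
    betas = torch.linspace(1e-4, 0.02, num_steps)      # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                             # begin with a fully noisy sample
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t)                           # model's estimate of the noise
        x = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                      # re-inject a bit of noise except on the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                           # the generated (denoised) sample
```

In Sora's case, the denoising model is a Transformer, the "diffusion Transformer" named in the question, which is what lets this process scale with data and compute.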

8. Can you explain to us what the scaling law of this model is?

Peebles: That's a great question. As Tim mentioned, one of the advantages of using Transformers is that they inherit all the nice properties we've seen in other domains, such as language. As a result, you can start to derive scaling laws for video rather than language.

This is something our team is actively working on: we're not just building the model, we're making it better, meaning we can get better results from the same amount of training compute without radically increasing the compute required. These are some of the questions our research team tackles every day to advance Sora and future models.
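For context, and purely as an illustration rather than anything the team reports for Sora, scaling-law studies on language models typically summarize this relationship as a power law between validation loss and training compute, and the premise here is that an analogous law can be measured for video:

```latex
% Generic compute scaling law (illustrative form; constants are fit per model family)
L(C) \approx L_{\infty} + a \cdot C^{-b}
% L(C): validation loss at training compute C
% L_inf: irreducible loss;  a, b > 0: fitted constants
```

Fitting these constants on smaller runs is what lets a team predict how much a larger training run should improve the model before spending the compute.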

9. One of the challenges of applying Transformers in this field is tokenization. Also, who came up with the name? A sci-fi-sounding term like "spacetime patches" is really great. Can you explain what it is and why it matters here?

Brooks: I don't think we coined the name; it's more of a descriptive term. One of the key successes of the LLM paradigm is the concept of tokens. When you browse the internet, you find all kinds of text data, including books, code, math, and more. The beauty of language models is that they have a single notion of a token, which lets them train on that whole range of data. In the past, however, visual generative models lacked a similar concept. Before Sora, you would train an image generation model on 256×256 images, or a video generation model on 256×256 videos that were exactly four seconds long.

So in Sora we introduced the concept of "spacetime patches". You can think of them as a unified representation of visual data: whether it's an image, a long video, a tall vertical clip, or a widescreen one, you can extract little cubes of space and time from it. As a result, Sora can generate not only 720p videos but also vertical videos, widescreen videos, and even images. This makes Sora the first visual generation model with the same kind of breadth that language models have, and that's the real reason we moved in this direction.
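To show what this means in practice, here is a rough sketch of extracting spacetime patches from a video tensor. The patch sizes, tensor layout, and shapes are illustrative assumptions, not Sora's actual configuration.

```python
import torch

def spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video tensor into space-time cubes, the visual analogue of
    text tokens described above.
    video: (T, C, H, W) with T, H, W divisible by pt, ph, pw."""
    T, C, H, W = video.shape
    cubes = video.reshape(T // pt, pt, C, H // ph, ph, W // pw, pw)
    cubes = cubes.permute(0, 3, 5, 1, 4, 6, 2)     # -> (T', H', W', pt, ph, pw, C)
    return cubes.reshape(-1, pt * ph * pw * C)     # one row per patch "token"

# A 16-frame 128x128 clip becomes 256 patch tokens of dimension 3072; clips with other
# resolutions or durations (divisible by the patch sizes) simply yield a different count.
tokens = spacetime_patches(torch.randn(16, 3, 128, 128))
print(tokens.shape)  # torch.Size([256, 3072])
```

Because every image or video, regardless of resolution, aspect ratio, or duration, reduces to the same kind of patch sequence, a single Transformer can be trained on all of it, which is the breadth Brooks describes.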

10. How do you apply end-to-end deep learning to video?

Brooks: Before Sora, a lot of work on video models focused on extending image generation models, where a great deal of progress had been made. Many people took an image generator and extended it a bit so it could produce short videos that were a little more than a static image.

But what really matters for Sora is the difference in architecture. We didn't start with an image generator and then try to expand it into a video generator. Instead, we started from a bigger problem: how to make a one-minute HD video clip. That was our goal, and once we set it, we realized we couldn't just rely on scaling up image generators.

To make high-definition footage, we needed something scalable: a way to break the data down into a very simple representation so that we could use scalable models. So I think it's really an architectural evolution from image generators to Sora, and it's a very interesting framework because we believe it applies not only to video generation but to many other areas as well.

Of course, we weren't trying to be the first to ship a video generator in the shortest possible time; a lot of people have made impressive strides in video generation. We wanted to aim much further out, to pick a point in the future and spend a year working toward it, even though there's pressure to move fast because AI is moving so fast.

11. One of the striking aspects of Sora is its visuals and aesthetics. Can you talk about how you tweaked or shaped Sora's aesthetic?

Ramesh: With Sora we didn't put much deliberate effort into aesthetics; maybe the best answer is that the world itself is beautiful. In fact, Sora's language understanding lets users guide it in a more direct way than other models allow. Users can provide a variety of prompts and visual cues to steer the model toward the kind of content they want, and this interactivity lets them communicate with the model more flexibly and get results that better match their expectations.


I think future models will understand personal aesthetics. Many of the artists and creators we've reached out to want to upload their entire body of work into the model, so they can draw on all of it when writing prompts and have the model understand the terminology their design studio has accumulated over decades. So I think personalization, and how to combine it with aesthetics, will be a cool thing to explore.

12. Could we end up with an entertainment paradigm very different from today's?

Brooks: I feel like the evolution of video models is going to lead to new ways of entertaining, educating and communicating. Entertainment is an important part of this, but on a deeper level, these models promise to give us a deeper understanding of the world and our lives, and how to experience them visually. Not only can they provide us with entertainment, but they can also be a powerful tool for education.

Sometimes, customized educational videos can be the best way to learn something new, and making videos to explain ideas can be the most effective way to communicate with others. So, I think there's a broader potential application for video models.

13. Have you tried applying these techniques to things like digital identity, or does that not work well because Sora is really a text-to-video prompt?

Brooks: So far, our focus has been on Sora's core technology rather than on the specific application side. While the idea of including digital avatars makes sense, we haven't explored the issue yet. I think it would be cool to try these ideas, but I think where we are now in Sora's trajectory is like GPT-1 for this new visual model paradigm.

14. What safety issues do you think video models raise, and how do you prevent deepfakes, impersonation, and similar problems?

Ramesh: That's a very complex question. I think we can learn a lot from DALL-E 3, for example in how we deal with pornography or gore. But new safety issues are bound to arise, such as misinformation, or whether users should be allowed to generate offensive content.


A key question is how much responsibility should be taken on by the company deploying this technology. For example, should the company inform users that the content they see may not come from a trusted source? And how much responsibility should be borne by the user? This is a tricky question that we need to think about carefully to find the best solution.

15. In the past, people used Photoshop to manipulate images and publish them, making all sorts of claims, yet no one says the makers of Photoshop are responsible for misuse of the technology. What do you think of those precedents?

Ramesh: That's an important point. We want people to be able to express themselves freely and do what they want, but also responsibly. Releasing the technology gradually and guiding people to adapt to it is a sensible approach; it helps ensure the technology is used responsibly while respecting users' right to free expression.

16. Can you tell us about your next steps or some features you are developing?

Brooks: I'm really excited about how people are going to create something new with our products. I think there are a lot of talented, creative people who have their own things that they want to create. But sometimes it's really hard to do that because they may lack the necessary resources, tools, or other things. This technology has the potential to allow many talented and creative people to create what they want. I'm really looking forward to what amazing stuff they're going to make and how this technology will help them.

17. Aside from the obvious issues like length, can you describe the limitations you are trying to solve?

Peebles: There are several factors we need to consider to make this technology more ubiquitous. One important factor is reducing cost so that more people can afford it. In video generation, the exact settings have a big impact: the resolution and duration of the video you're creating matter, and generation isn't instantaneous; it can take a few minutes of waiting, especially for longer videos.

So we are actively working to bring costs down to enable wider adoption. We must also take safety into account, especially in an election year: we treat potential misinformation with great caution and take proactive steps to address the surrounding risks. Solving these problems has become one of the important tasks on our research path.

18. What would you like to say about the future research direction of Sora?

Brooks: We hope Sora will learn from all this visual data to build a deeper understanding of the world, even in 3D. This is very exciting because we don't feed 3D information into it directly; it learns on its own by watching video. It understands the 3D structure present in video, and it knows, for example, that when you bite into a burger, you leave bite marks.

As a result, it gives us a deeper understanding of our world. When we interact with the world, most of the information we take in is visual, and much of what we learn is visual information. So we do believe that getting AI models to understand the world the way we do is important to making them smarter and better. Our world is full of complexity, with so much about how people interact, how things happen, and how past events affect future events, and learning this can lead to AI models that are far smarter and far more broadly capable than video generation alone.

Much of human intelligence is actually related to our modeling of the world. Whenever we think about how to act, we imagine scenarios in our minds and use our imagination to imagine all possible scenarios. Before we actually act, we think, "What happens if I do this, what happens if I do that?" So we have a model of the world, and we build Sora as a model of the world, very similar to most of the intelligence that humans have.

19. How do you get Sora to have a model of the world that resembles a human's, rather than something perfectly accurate like a physics engine?

Peebles: We know that human cognition isn't always accurate, so we can't expect complete precision either. When we need to dig into a very narrow set of physical laws and make long-horizon predictions, there are other systems we can use to improve our understanding.

Even so, we are optimistic about Sora's prospects and believe it will be able to take on this capability someday. In the long run, we hope it can play the role of a world model even better than humans can. At the same time, we recognize that this ability is not always necessary for other kinds of intelligence. Either way, there is plenty of room for Sora and other models to improve.

20. Do you think there are any misconceptions about video models among the public?

Ramesh: The release of Sora is probably the biggest update for the public. Internally, as Bill and Tim said, we've been comparing Sora to the GPT model. When GPT-1 and GPT-2 came out, it became increasingly clear that simply scaling up these models would give them amazing capabilities.


At the time, it wasn't obvious that scaling up next-token prediction would produce a language model that could help you write code. For us, it was clear that applying the same approach to a video model would likewise yield amazing capabilities. I think the release of Sora proves that, and we're now standing at a critical point on the scaling curve. We're very excited about it.

Peebles: As Tim and Aditya hinted, we do feel that video models are at their GPT-1 moment, but these models are going to get better very quickly. We're very excited about that, because we believe it will bring incredible benefits to the creative world.

While it will take time to reach AGI, we are working to make sure safety issues are fully considered and a strong technical foundation is built, so that society truly benefits while potential negative impacts are mitigated. Despite the challenges we face, it's an exciting time to see what future models can achieve.

Source: No Priors

This article is from the WeChat official account "Zhidong" (ID: zhidxcom), author: Xia Qiuli; republished with authorization by 36Kr.
