Sora: OpenAI’s Video Creator
The GenAI model creates video from text, but the outcomes are far from perfect for now
It was only a matter of time before OpenAI followed the path taken by much larger tech giants such as Google and Meta and stepped into the world of video generation. Called Sora, the GenAI model uses a brief or detailed description, or even a still image, to create 1080p movie-like scenes with multiple characters, different types of motion and background details.
OpenAI claims that Sora can also extend existing video clips by filling in the missing details. The company notes in a blog post: “Sora has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.”
You can check out the Sora demo page right here and, as you would observe, there’s a lot of blah blah around the idea itself. Sample this: “Sora is an AI model that can create realistic and imaginative scenes from text instructions.” Or this: “Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.”
How does it do what it does?
According to the research note on the web page, Sora is a diffusion model, which generates videos by starting off with one that looks like static noise and then transforming it by removing the noise over several steps. “Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily,” it says.
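For the curious, here is a minimal Python sketch of that denoising loop in spirit. The tensor shapes, the step count and the stand-in denoiser below are our own assumptions purely for illustration; OpenAI has not published Sora’s internals.

# Minimal sketch of the diffusion idea described in the note: start from
# static noise and repeatedly remove a little of it. Shapes, step count and
# the denoiser are illustrative assumptions, not Sora's actual design.
import numpy as np

FRAMES, HEIGHT, WIDTH, CHANNELS = 16, 64, 64, 3   # toy "video" dimensions
STEPS = 50                                         # number of denoising steps

def denoise_step(video: np.ndarray, step: int) -> np.ndarray:
    # Stand-in for a learned denoiser: a real diffusion model would use a
    # neural network to predict the noise present at this step and subtract it.
    predicted_noise = video * (1.0 / (STEPS - step + 1))
    return video - predicted_noise

video = np.random.randn(FRAMES, HEIGHT, WIDTH, CHANNELS)  # pure static noise
for step in range(STEPS):
    video = denoise_step(video, step)

print("final video tensor shape:", video.shape)

The point of the sketch is simply the shape of the procedure: the model never draws a frame directly, it keeps cleaning up noise until a coherent clip remains.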
Sora is similar to the GPT models in that it uses a transformer architecture, which unlocks superior scaling performance. “We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios,” says the note.
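To make the “patches” idea a little more concrete, here is a rough numpy sketch of how a video tensor could be carved into spacetime patches and flattened into a token-like sequence. The patch sizes and layout are assumptions for illustration, not Sora’s actual tokenizer.

# Rough sketch of "patches as tokens": split a (frames, height, width,
# channels) video into small spacetime patches and flatten each one into a
# vector, giving a sequence a transformer can attend over.
import numpy as np

def video_to_patches(video: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    f, h, w, c = video.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0, "dims must divide evenly"
    patches = (
        video.reshape(f // pt, pt, h // ph, ph, w // pw, pw, c)
             .transpose(0, 2, 4, 1, 3, 5, 6)      # group the patch grid first
             .reshape(-1, pt * ph * pw * c)       # one row per spacetime patch
    )
    return patches  # shape: (num_patches, patch_dim), a token-like sequence

clip = np.random.randn(16, 64, 64, 3)             # toy clip
tokens = video_to_patches(clip, pt=4, ph=16, pw=16)
print(tokens.shape)                                # (64, 3072)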
Sora builds on past research on the DALL·E and GPT models and uses the re-captioning technique from DALL·E 3, which involves generating detailed descriptions for the visual training data. “As a result, the model is able to follow the user’s text instructions in the generated video more faithfully,” the note says.
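Conceptually, re-captioning is just a pre-processing pass over the training data. A toy sketch might look like the following, where describe_video and the file names are hypothetical stand-ins for the real vision-language captioner and dataset:

# Toy sketch of re-captioning: generate a detailed description for each
# training clip and train on (caption, video) pairs instead of sparse labels.
# describe_video and the file names below are hypothetical.
def describe_video(video_path: str) -> str:
    # A real system would run a vision-language captioning model here.
    return f"A detailed description of the contents of {video_path}"

raw_dataset = ["clip_001.mp4", "clip_002.mp4"]
recaptioned = [(describe_video(path), path) for path in raw_dataset]

for caption, path in recaptioned:
    print(path, "->", caption)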
Not perfect, but quite an achievement
While the samples on the page do pass muster in terms of quality metrics for such videos, one cannot miss the video-game-like quality about them. Maybe it is because there isn’t much happening in the background, and there is also a certain AI quality (or is it a lack of it?) that creeps in. In other words, it’s quite easy to make out that the videos are AI-generated.
Having said that, one must acknowledge that Sora can generate videos across styles such as animated, black-and-white and photorealistic. And these can be up to sixty seconds long, which is well beyond what text-to-video models have delivered till date. Given that this is a first release, one can forgive the AI facade that seems to come through.
Of course, OpenAI is the first to admit that the model is far from perfect. The Sora page states: “The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. It may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.”
OpenAI paper details the next steps
The company, part-owned by Microsoft, is quite clearly positioning Sora as a research preview, though later in the day it did share a technical paper titled “Video generation models as world simulators”. OpenAI also noted that it would build tools to detect whether a video was generated by Sora and, in the case of a public-facing product, would ensure that provenance metadata is included in the generated outputs.
“We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time,” OpenAI says.
Now, coming to the research note co-authored by several OpenAI researchers: it reveals that Sora can generate videos at arbitrary resolutions and aspect ratios. It can also perform several image and video editing tasks, such as creating looping videos, extending videos forward or backwards in time, and changing the backgrounds of existing videos.
There is also a reference to Sora’s ability to simulate digital worlds: OpenAI fed Sora prompts containing the word “Minecraft” and got it to render Minecraft-like gameplay, complete with a HUD and the game’s dynamics and physics, while also controlling the player character. Could this mean the end of the road for game developers? Let’s wait till the bombast dies down!
According to experts, this is made possible by the fact that Sora is, in effect, a data-driven physics engine rather than just a creative tool. It isn’t merely generating an image or a video but determining the physics of each object in an environment and then rendering the scene based on those calculations. “These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them,” says the note.