How OpenAI’s Sora is Changing the Game: An Insight into Its Core Technologies

A masterpiece of state-of-the-art technologies

Ryota Kiuchi, Ph.D.
Towards Data Science


Photo by Kaushik Panchal on Unsplash

On February 15, 2024, OpenAI, which had astonished the world by announcing ChatGPT in late 2022, did it again with the unveiling of Sora. This technology, capable of creating videos up to a minute long from a single text prompt, is undeniably a breakthrough.

In this blog post, I will introduce the underlying methodologies and research behind this astonishing technology, based on the technical report released by OpenAI.

Incidentally, “Sora” means “sky” in Japanese. Although it has not been officially announced whether this naming was intentional, it is speculated to be so, given that their official release tweet featured a video themed around Tokyo.

OpenAI unveils Sora to the world via X


About Sora

Sora is an advanced text-to-video model developed by OpenAI, and its capabilities and range of applications open a new horizon for modern AI. The model is not limited to generating a few seconds of footage; it can create videos up to one minute long while maintaining high visual quality and faithfully following the user's instructions. It is as if it brings dreams to life.

OpenAI Sora’s demo via X

Generating Complex Scenes Based on the Real World

Sora understands how the elements described in a prompt exist and behave within the physical world. This allows the model to accurately represent user-intended movements and actions within videos. For example, it can realistically recreate the sight of a person running or the movement of natural phenomena. Furthermore, it precisely reproduces multiple characters, their types of movement, and the details of subjects and backgrounds.

Previously, video generation with generative AI faced the difficult challenge of maintaining consistency and reproducibility across scenes. When each scene or frame is generated individually, it is hard for a model to fully retain the preceding context and details and carry them over to the next scene. Sora, however, maintains narrative consistency by combining a deep understanding of language with visual context and by interpreting prompts accurately. It can also pick up the emotions and personalities of characters from the prompt and portray them expressively within the video.

The post by Bill Peebles (OpenAI) via X

What kind of technology and research is behind it?

Photo by Markus Spiske on Unsplash

Sora is built upon a foundation of prior research in generative modeling of visual data. Previous work has employed various methods such as recurrent networks, Generative Adversarial Networks (GANs), autoregressive transformers, and diffusion models, but has often focused on a narrow category of visual data, shorter videos, or videos of a fixed size. Sora surpasses these limitations, generating videos across diverse durations, aspect ratios, and resolutions. In this section, I will introduce the core technologies that support these advancements.

1. Transformer

Vaswani et al. (2017), “Attention is all you need.”

The Transformer is a neural network architecture that revolutionized the field of natural language processing (NLP). It was first proposed by Vaswani et al. in 2017. The model overcame key challenges faced by traditional Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), and as an innovative method it underpins many of today's breakthrough technologies.

Figure 1: The Transformer — model architecture. |Vaswani et al. (2017)

Issues with RNNs:

  • The problem of long-term dependencies: Although RNNs theoretically can transmit information through time, they struggle to capture dependencies over long durations in practice.
  • Limitations on parallelization: Since the computation at each step in an RNN depends on the output of the previous step, sequential processing (e.g., processing words or sentences in a text one by one, in order) is mandatory, preventing the utilization of parallel processing advantages offered by modern computer architectures. This made training on large datasets inefficient.

Issues with CNNs:

  • Fixed receptive field size: While CNNs excel at extracting local features, their fixed receptive field size limits their ability to capture long-distance dependencies throughout the context.
  • Difficulty in modeling the hierarchical structure of natural language: It’s challenging to directly model the hierarchical structure of language, which can be insufficient for deep contextual understanding.

New features of the Transformer:

  • Attention mechanism: Models dependencies between any two positions in the sequence directly, allowing long-range dependencies and broad context to be captured.
  • Realization of parallelization: Since the input data is processed as a whole at once, a high degree of parallelization in computation is achieved, significantly accelerating training on large datasets.
  • Variable receptive field: The attention mechanism allows the model to dynamically adjust the “receptive field” size as needed. This means the model can naturally focus on local information for certain tasks or data, and consider broader context in other cases.
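
To make the attention mechanism above concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy. It follows the textbook formulation from Vaswani et al. (2017); the toy shapes and variable names are my own illustrative choices, not anything specific to Sora.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Every position attends to every other position in the sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                               # weighted mix of all values

# toy example: a sequence of 5 tokens, each a 16-dimensional embedding
x = np.random.randn(5, 16)
W_q, W_k, W_v = (np.random.randn(16, 16) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (5, 16): same sequence length, but every token now carries global context
```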

For more detailed technical explanations about Transformer:

2. Vision Transformer (ViT)

Dosovitskiy, et al. (2020), “An image is worth 16x16 words: Transformers for image recognition at scale.”

In this study, the principles of the Transformer, which revolutionized natural language processing (NLP), are applied to image recognition, opening up new horizons.

Token and Patch

In the original Transformer paper, tokens primarily represent parts of words or sentences, and analyzing the relationships between these tokens allows for a deep understanding of the sentence's meaning. To apply this concept of tokens to visual data, this study divides images into small 16x16-pixel sections (patches) and treats each patch as a "token" within the Transformer. This lets the model learn how each patch relates to the others across the entire image, and to recognize and understand the image on that basis. It surpasses the fixed receptive field size of traditional CNN models used in image recognition, allowing relationships between any positions in an image to be captured flexibly.

Figure 1: Model overview. |Dosovitskiy, et al. (2020)
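
To illustrate the patch-as-token idea, the following sketch cuts a 224x224 RGB image into 16x16 patches and flattens each one into a vector, which is the preprocessing step before the Transformer sees the image. It is an illustrative NumPy snippet, not the authors' implementation.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches ("tokens")."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    x = image.reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
    x = x.transpose(0, 2, 1, 3, 4)                        # (nH, nW, p, p, C)
    return x.reshape(-1, patch_size * patch_size * C)     # (num_tokens, token_dim)

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14 x 14 patches, each a 16*16*3 = 768-dim "visual word"
```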

For more detailed technical explanations about ViT:

https://machinelearningmastery.com/the-vision-transformer-model/

3. Video Vision Transformer (ViViT)

Arnab, et al. (2021), “Vivit: A video vision transformer.”

ViViT further extends the concept of the Vision Transformer, applying it to the multidimensional data of videos. Video data is more complex as it contains both static image information (spatial elements) and dynamic information that changes over time (temporal elements). ViViT decomposes videos into spatiotemporal patches, treating these as tokens within the Transformer model. With the introduction of spatiotemporal patches, ViViT is able to simultaneously capture both static and dynamic elements within a video and model the complex relationships between them.

Figure 3: Tubelet (the spatio-temporal input volume) embedding image. |Arnab, et al. (2021)
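
Here is a hedged sketch of the tubelet idea: the video tensor is cut along time as well as space, so each token covers a small spatio-temporal volume. The tubelet size (2, 16, 16) below is an arbitrary illustrative choice, not a value reported for ViViT or Sora.

```python
import numpy as np

def tubelet_embed(video, t=2, p=16):
    """Split a (T, H, W, C) video into flattened spatio-temporal patches (tubelets)."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    v = video.reshape(T // t, t, H // p, p, W // p, p, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)       # (nT, nH, nW, t, p, p, C)
    return v.reshape(-1, t * p * p * C)        # each token spans space *and* time

tokens = tubelet_embed(np.zeros((16, 224, 224, 3)))
print(tokens.shape)  # (1568, 1536): 8 x 14 x 14 tubelets, each covering 2 frames
```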

For more detailed technical explanations about ViViT:

4. Masked Autoencoders (MAE)

He, et al. (2022), “Masked autoencoders are scalable vision learners.”

Using a self-supervised pre-training method called the Masked Autoencoder, this study dramatically reduced the high computational cost and inefficiency of training on large visual datasets, which stem from their high dimensionality and sheer volume of information.

Specifically, by masking parts of the input image and training the network to predict the hidden parts, the model learns the important features and structures of the image more efficiently and acquires rich representations of visual data. This makes the compression and representation learning of data more efficient, reduces computational cost, and enhances versatility across different types of visual data and tasks.
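
The masking step itself is easy to sketch: a large random fraction of the patch tokens is hidden, only the visible ones are encoded, and the decoder is trained to reconstruct the rest. The 75% mask ratio below follows He et al. (2022); everything else is an illustrative assumption of mine.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, seed=0):
    """Keep a random subset of patch tokens; the rest must be reconstructed."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    order = rng.permutation(n)
    keep_idx, masked_idx = order[:n_keep], order[n_keep:]
    return tokens[keep_idx], keep_idx, masked_idx

visible, keep_idx, masked_idx = random_masking(np.zeros((196, 768)))
print(visible.shape)  # (49, 768): only 25% of the patches go through the encoder,
                      # which is where much of the training-cost saving comes from
```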

The approach of this study is also closely related to the evolution of language models by BERT (Bidirectional Encoder Representations from Transformers). While BERT enabled a deep contextual understanding of text data through Masked Language Modeling (MLM), He et al. have applied a similar masking technique to visual data, achieving a deeper understanding and representation of images.

Figure 1: Masked Autoencoders Image. |He, et al. (2022)

For more detailed technical explanations about MAE:

5. Native Resolution Vision Transformer (NaViT)

Dehghani, et al. (2023), “Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution.”

This study proposed the Native Resolution Vision Transformer (NaViT), a model designed to further expand the applicability of the Vision Transformer (ViT) to images of any aspect ratio or resolution.

Challenges of Traditional ViT

The Vision Transformer introduced a groundbreaking approach by dividing images into fixed-size patches and treating these patches as tokens, applying the transformer model to image recognition tasks. However, this approach assumed models optimized for specific resolutions or aspect ratios, requiring model readjustment for images of different sizes or shapes. This was a significant constraint, as real-world applications often need to handle images of diverse sizes and aspect ratios.

Innovations of NaViT

NaViT is designed to efficiently process images of any aspect ratio or resolution, allowing them to be fed directly into the model without prior resizing. Sora applies this flexibility to video as well, seamlessly handling videos and images of various sizes and shapes and thereby gaining significant adaptability.

Figure 2:|Dehghani, et al. (2023)
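
Conceptually, "Patch n' Pack" concatenates the patch tokens of differently sized images into one long sequence and records which token belongs to which example, so that attention can later be masked across examples. The snippet below only illustrates that packing step under my own simplified assumptions; it is not NaViT's actual code.

```python
import numpy as np

def patchify(image, p=16):
    """Flatten an (H, W, C) image into (num_patches, p*p*C) tokens."""
    H, W, C = image.shape
    x = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * C)

def pack_examples(images, p=16):
    """Pack patch tokens from differently sized images into one sequence."""
    tokens, example_ids = [], []
    for i, img in enumerate(images):        # inputs may have different H and W
        t = patchify(img, p)
        tokens.append(t)
        example_ids.extend([i] * len(t))    # ownership labels, used to mask attention
    return np.concatenate(tokens), np.array(example_ids)

seq, ids = pack_examples([np.zeros((224, 224, 3)),   # square image -> 196 patches
                          np.zeros((96, 160, 3))])   # small widescreen -> 60 patches
print(seq.shape, np.bincount(ids))                   # (256, 768) [196  60]
```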

6. Diffusion Models

Sohl-Dickstein, et al. (2015), “Deep unsupervised learning using nonequilibrium thermodynamics.”

Alongside the Transformer, diffusion models form the backbone technology supporting Sora. This research laid the theoretical foundation for diffusion models, a class of deep generative models inspired by non-equilibrium thermodynamics. It introduced the idea of a gradual diffusion process and its reverse: generation starts with random noise (data without any pattern) and step by step removes that noise until the result resembles actual images or videos.

For instance, imagine starting with mere random dots, which gradually transform into videos of beautiful landscapes or people. This approach was later applied to the generation of complex data such as images and sounds, contributing to the development of high-quality generative models.

Image of denoising process|Image Credit (OpenAI)

Ho et al. (2020), “Denoising diffusion probabilistic models.”

Nichol and Dhariwal (2021), “Improved denoising diffusion probabilistic models.”

Building on the theoretical framework of Sohl-Dickstein et al. (2015), practical generative models known as Denoising Diffusion Probabilistic Models (DDPM) were developed. These models have shown particularly notable results in high-quality image generation, demonstrating the effectiveness of diffusion models.
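
Here is a minimal sketch of the two halves of a DDPM, using standard textbook notation rather than anything Sora-specific: the forward process blends the data with Gaussian noise according to a schedule, and training asks a network to predict that noise. The schedule values and shapes are illustrative assumptions.

```python
import numpy as np

# a simple linear noise schedule (the exact values are illustrative)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, rng):
    """q(x_t | x_0): blend the clean sample with Gaussian noise at timestep t."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

def ddpm_loss(denoiser, x0, t, rng=np.random.default_rng(0)):
    """Training objective: the network tries to predict the noise that was added."""
    xt, noise = forward_diffuse(x0, t, rng)
    predicted = denoiser(xt, t)              # any denoising network (U-Net, DiT, ...)
    return np.mean((predicted - noise) ** 2)

# toy usage with a "denoiser" that just predicts zeros
print(ddpm_loss(lambda xt, t: np.zeros_like(xt), x0=np.ones((4, 4)), t=500))
# roughly 1.0: the zero-predictor is off by exactly the injected unit-variance noise
```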

Impact of Diffusion Models on Sora

Typically, to train machine learning models, a lot of labeled data is needed (for example, being told “This is an image of a cat”). However, diffusion models can learn from unlabeled data as well, allowing them to utilize the vast amount of visual content available on the internet to generate various types of videos. In other words, Sora can observe different videos and images and learn “this is what a normal video looks like.”

For more detailed technical explanations about Diffusion Models:

7. Latent Diffusion Models

Rombach, et al. (2022), “High-resolution image synthesis with latent diffusion models.”

This research made a significant contribution to the field of high-resolution image synthesis using diffusion models. It proposes a method that maintains quality while significantly reducing computational cost compared with generating high-resolution images directly, by running the diffusion process in a latent space. In other words, instead of manipulating images directly, the data is first encoded into a latent space (a lower-dimensional space holding compressed representations of images) and the diffusion process is applied there, which makes comparable results achievable with far fewer computational resources.

Sora applies this technology to video data, compressing the temporal and spatial information of videos into a lower-dimensional latent space and then decomposing that representation into spatiotemporal patches. This efficient processing and generation in the latent space plays a crucial role in enabling Sora to generate higher-quality visual content more rapidly.

Image of visual encoding|Image Credit (OpenAI)
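
In outline, the latent diffusion pipeline has three stages: start from noise in the latent space, denoise iteratively there, and decode back to pixels only at the end. The sketch below runs on toy stand-ins; `decode` and `denoise_step` are placeholders I made up so the outline executes, not real APIs.

```python
import numpy as np

# Toy stand-ins; in a real system these are a learned video encoder/decoder (e.g. a VAE)
# and a learned denoising network conditioned on the text prompt.
latent_shape = (8, 14, 14, 4)      # (time, height, width, channels) in latent space

def decode(z):
    """Fake decoder: upsample the latent back toward pixel resolution."""
    return np.repeat(np.repeat(z, 16, axis=1), 16, axis=2)

def denoise_step(z, t, condition):
    """Fake denoiser: a real one would be a trained network using the prompt."""
    return z * 0.98

def generate_video(prompt, num_steps=50):
    z = np.random.default_rng(0).standard_normal(latent_shape)  # start from latent noise
    for t in reversed(range(num_steps)):                         # iterative denoising loop
        z = denoise_step(z, t, condition=prompt)
    return decode(z)                                             # back to pixels only at the end

video = generate_video("a corgi surfing at sunset")
print(video.shape)  # (8, 224, 224, 4): the expensive diffusion loop never touches pixel scale
```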

For more detailed technical explanations about Latent Diffusion Models:

8. Diffusion Transformer (DiT)

Peebles and Xie (2023), “Scalable diffusion models with transformers.”

This research might be the most crucial in realizing Sora. As mentioned in the technical report published by OpenAI, Sora employs not a vanilla (normal) transformer but a diffusion transformer (DiT).

Importantly, Sora is a diffusion transformer. (via OpenAI Sora technical report)

The study introduced a new model that replaces the U-Net backbone commonly used in diffusion models with a Transformer. The Transformer operates on latent patches, realizing a latent diffusion model in which image patches are handled more efficiently, so high-quality images can be generated while making effective use of computational resources. This Transformer-based design, which differs from the U-Net-based Stable Diffusion announced by Stability AI in 2022, is considered to contribute to more natural video generation.

Figure 1: Generated images by the Diffusion Transformers|Peebles and Xie. (2023)
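
Under the hood, a DiT block is essentially a standard Transformer block applied to latent patch tokens, with the diffusion timestep (and other conditioning) injected, for example through adaptive layer-norm modulation. The PyTorch sketch below is my own simplification of Peebles and Xie's design; the dimensions are arbitrary and it is not Sora's architecture.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One simplified DiT block: self-attention + MLP over latent patch tokens,
    modulated by a conditioning embedding (adaptive layer-norm style)."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 4 * dim)   # timestep/label embedding -> scales and shifts

    def forward(self, x, cond):
        s1, b1, s2, b2 = self.ada(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1                    # condition-dependent normalization
        x = x + self.attn(h, h, h, need_weights=False)[0]    # global attention over patches
        h = self.norm2(x) * (1 + s2) + b2
        return x + self.mlp(h)

tokens = torch.randn(2, 1568, 384)       # latent spacetime patches as a token sequence
t_emb = torch.randn(2, 384)              # embedding of the diffusion timestep
print(DiTBlock()(tokens, t_emb).shape)   # torch.Size([2, 1568, 384])
```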

Furthermore, it’s important to note that their validation results demonstrate the scalability of DiT, significantly contributing to the realization of Sora. Being scalable means that the model’s performance improves with an increase in the transformer’s depth/width (making the model more complex) or the number of input tokens.

Figure 8 & 9: Scalability of the Diffusion Transformers|Peebles and Xie. (2023)
  • Gflops (computational cost): One giga (billion) floating-point operations. In this paper, network complexity is measured by the Gflops required for a single forward pass, so higher Gflops means a larger, more compute-intensive model.
  • FID (Fréchet Inception Distance): One of the evaluation metrics for image generation, where a lower value indicates higher accuracy. It quantitatively assesses the quality of generated images by measuring the distance between the feature vectors of generated images and real images.
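
For reference, FID compares the Gaussian statistics (mean and covariance) of Inception features extracted from real and generated images. Assuming those feature vectors have already been extracted, a rough sketch of the computation looks like this; the feature dimensions in the example are arbitrary.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, gen_feats):
    """Fréchet Inception Distance between two sets of Inception feature vectors."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical error can leave tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

real = np.random.default_rng(0).standard_normal((1000, 64))
fake = np.random.default_rng(1).standard_normal((1000, 64)) + 0.5   # shifted distribution
print(fid(real, real.copy()), fid(real, fake))   # ~0 versus a clearly larger value
```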

Such scaling behavior has already been observed in natural language processing, as confirmed by Kaplan et al. (2020) and Brown et al. (2020), and it is one of the crucial properties behind the innovative success of ChatGPT.

Kaplan et al. (2020), “Scaling Laws for Neural Language Models.”

Brown, et al. (2020), “Language models are few-shot learners.”

Thanks to the Transformer, DiT not only generates high-quality images at a lower computational cost than traditional diffusion models; its scalability also means that even higher-quality images can be produced by investing more computational resources. Sora applies this technology to video generation.

Scalability of video generation|Image Credit (OpenAI)

For more detailed technical explanations about DiT:

A review of the DiT paper by hu-po via YouTube

The capabilities these research efforts enable in Sora

Variable durations, resolutions, aspect ratios

Primarily thanks to NaViT, Sora can sample widescreen 1920x1080 videos, vertical 1080x1920 videos, and everything in between. This means it can create visuals for various device types at their native aspect ratios and resolutions.

Prompting with images and videos

Currently, the videos generated by Sora, as demonstrated, are created in a text-to-video format, where instructions are given through text prompts. However, as can be easily anticipated from the prior research, it is also possible to use existing images or videos as inputs, not just text. This allows Sora to animate still images, or to imagine and render the past or future of an existing video.

3D consistency

While it’s not clear how the aforementioned research is directly involved, Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.

The future of Sora

In this blog post, I have explained the technologies behind Sora, OpenAI's video generation model, which has already shocked the world. Once it becomes publicly available and accessible to a wider audience, it is bound to make an even more significant impact worldwide.

The impact of this breakthrough is expected to span many aspects of video creation, and it may well evolve from video toward further advances in 3D modeling. If that happens, not only video content but also the visuals of virtual spaces such as the metaverse could soon be generated easily by AI.

The arrival of such a future has already been implied as below:

Martin Nebelong’s post about Micael Rublof’s product via X

Currently, Sora is perceived as "merely" a video generation model, but Jim Fan from Nvidia has implied it might be a data-driven physics engine. This suggests that, from a vast amount of real-world video and (though not explicitly confirmed) physically simulated footage such as that produced with Unreal Engine, the model might be learning physical laws and phenomena. If so, the emergence of text-to-3D in the near future is also highly probable.

Jim Fan’s intriguing post via X

Thank you so much for reading this article. A clap for this article and a subscription to my newsletter would motivate me a lot!
