
While 2023 was a year dominated by Large Language Models (LLMs) and a surge in Image Generation technologies, Video Generation received comparatively little attention. While researching this topic, I found it quite challenging to keep up with the latest developments and the overall architectural designs, as they span a diverse array of models.
In this post, I aim to share how Video Generation has evolved in recent years, how the architectures of models have developed, and what outstanding questions we now face.
While I was writing this article, OpenAI released Sora – a video generation model with stunning capabilities. Although its architecture is not disclosed, I hope you will get some insight into what it might be.
Let’s Dive into the Timeline
Consider this timeline as a journey to observe the evolution of proposed models for Video Generation. This will help us understand why the models are designed the way they are today and provide insights for future research and applied works.
Each model is supplemented with a unified graphical representation of its architecture and pipeline. Treat it as a simplified graphical summary, rather than an in-depth model schema.
So, let’s start with a not-so-early point in time – 2022…
The Dawn
📖 Video Diffusion Model [1] | 🗓️ April 2022
To understand the rapid dynamics of improvement in this area, I propose starting with Google’s pioneering work on utilising Diffusion Models for video generation. While the paper doesn’t go into much detail, it was one of the first attempts to create a T2V (text-to-video) diffusion model.

**What’s inside?** This model represents one of the initial attempts to utilize Diffusion Models for generating videos from text descriptions. The architecture of VDM leverages a 3D U-Net with temporal attention to generate consistent images across frames. It doesn’t feature latent diffusion or a cascade of diffusion models (you will see why those are useful in later works). The model can be trained jointly on both videos and images: for image inputs, the attention operation inside each temporal attention block is effectively removed by fixing the attention matrix to the identity.
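For intuition, here is a minimal sketch of what such a masked temporal attention block could look like, assuming channel-last feature maps and PyTorch's built-in multi-head attention; the class and argument names are mine, not the paper's.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal attention over the frame axis. For image-only batches the
    attention matrix is fixed to the identity (each frame attends only to
    itself), which is what allows joint training on videos and images."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, is_image_batch: bool = False) -> torch.Tensor:
        # x: (batch, frames, height, width, channels)
        b, t, h, w, c = x.shape
        tokens = x.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)

        mask = None
        if is_image_batch:
            # Disallow attending to other frames -> identity attention matrix.
            mask = ~torch.eye(t, dtype=torch.bool, device=x.device)
        out, _ = self.attn(tokens, tokens, tokens, attn_mask=mask)

        out = out.reshape(b, h, w, t, c).permute(0, 3, 1, 2, 4)
        return x + out  # residual connection
```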
**What’s it trained on?** An undisclosed dataset of 10 million captioned videos.
Given the brief style of the published paper, this work reads more like a proof of concept than a real-world system. But now it’s another corporation’s turn…
The first cascade
📖 Make-a-Video [2] | 🗓️ September 2022
Meta’s pioneering work on T2V (text-to-video). Most notably, it was trained on open-source datasets without paired text-video data.

**What’s inside?** Like the previous model, it extends Diffusion Text-to-Image (T2I) models by integrating temporal layers to achieve text-to-video generation. But the biggest change you may notice in the schema above is the cascade: the authors used a cascade of Spatiotemporal and Spatial Super-resolution diffusion models to increase resolution and frame rate. Quoting the article:
"The different components of the cascade are trained independently. Initially, the decoder, prior, and two super-resolution modules are trained using images only, without any text. The base T2I model is the only part that gets text input and is trained with text-image pairs without further training on videos."
Another big step is the 2+1D you may have noticed in the schema. The model utilizes a "pseudo-3D" convolution approach for more efficient integration of temporal information. In a nutshell, pseudo-3D approaches aim to simulate the effect of full 3D convolutions and attention mechanisms (which operate directly on videos as 3D data) with a more computationally efficient strategy: first apply a standard 2D convolution to each frame separately, then a 1D convolution along the temporal axis to share information between frames.
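A minimal sketch of such a (2+1)D layer is below; the class and reshaping details are my own, not the paper's implementation (the paper also initialises the temporal 1D convolution as an identity, so training starts from pure per-frame behaviour).

```python
import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """(2+1)D "pseudo-3D" convolution: a 2D conv applied to every frame,
    followed by a 1D conv along the temporal axis."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel, padding=kernel // 2)
        self.temporal = nn.Conv1d(out_ch, out_ch, kernel, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial(x)                                     # per-frame 2D conv
        c2, h2, w2 = x.shape[1:]
        x = x.reshape(b, t, c2, h2, w2).permute(0, 3, 4, 2, 1)  # (b, h, w, c, t)
        x = x.reshape(b * h2 * w2, c2, t)
        x = self.temporal(x)                                    # mix across frames
        x = x.reshape(b, h2, w2, c2, t).permute(0, 3, 4, 1, 2)  # (b, c, t, h, w)
        return x
```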
**What’s it trained on?** A curated mix of open-source datasets: a 2.3B English-text subset of LAION-5B, a 10M subset of HD-VILA-100M, and WebVid-10M.
The introduced approaches of cascading and pseudo-3D convolutions are not unique to this model, but combined with the use of open-source datasets they made this a foundational paper that many later methods cite and compare against.
Let’s make it BIG
📖 Imagen Video [3] | 🗓️ October 2022
If, after the previous work, you wondered "If we have a cascade, can we make a bigger cascade?", here is your answer. Meet Google’s response with an even bigger cascade.

**What’s inside?** There is a big set of 7 models inside: a text encoder, a base model (Video U-Net), 3 spatial (SSR) and 2 temporal (TSR) super-resolution models. All 7 models can be trained in parallel. As in the previous Make-a-Video [2], the SSR models increase spatial resolution for all input frames, whereas the TSR models increase temporal resolution by filling in intermediate frames between input frames.
It is worth mentioning that all the models in the cascade have an added temporal dimension. This means they are all adapted to the video format, so the super-resolution (SSR) models can account for the video’s temporal aspect and do not produce "flickering artefacts". For computational efficiency, only the base model uses temporal attention across all frames, while the SSR and TSR models use temporal convolution (it’s "cheaper" to compute).
As for the exact numbers, quoting the article:
… starting with a base 16 frame video at a low resolution, then upsampling to generate a final 128 frame video at 1280×768 resolution and 24 frames per second.
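To make the quoted numbers concrete, here is a toy walk-through of how 2 TSR and 3 SSR stages could grow the base output into the final video. Only the endpoints (16 low-resolution frames in, 128 frames at 1280×768 and 24 fps out) come from the quote; the base resolution and per-stage factors are my own assumptions.

```python
# Back-of-envelope cascade walk-through (assumed base size and factors).
frames, width, height, fps = 16, 40, 24, 3

stages = [("TSR", 2), ("SSR", 4), ("TSR", 4), ("SSR", 4), ("SSR", 2)]
for kind, factor in stages:
    if kind == "TSR":
        frames, fps = frames * factor, fps * factor       # fill in new frames
    else:
        width, height = width * factor, height * factor   # upscale every frame
    print(f"{kind} x{factor}: {frames} frames @ {width}x{height}, {fps} fps")

# Last line printed: SSR x2: 128 frames @ 1280x768, 24 fps
```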
**What’s it trained on?** An internal dataset 😒 of 14 million video-text pairs and 60 million image-text pairs, plus LAION-400M.
While it may look like the newest models should have even bigger cascades – they don’t. We strive for simplicity, and while some parts of the cascade are still with us (like SSR), almost all new works in this area focus on creating one model "to rule them all" and ditching the cascade.
How long can it be?
📖 NUWA-XL [4] | 🗓️ March 2023
A somewhat forgotten yet very interesting work that demonstrated "diffusion over diffusion" for generating extremely long clips with relatively good temporal consistency. Instead of a cascade that interpolates frames, this model uses recursion.

It is also one of the earliest works here to use a latent representation for images.
**What’s inside?** The core concept here is "Diffusion over Diffusion" – a hierarchical, coarse-to-fine approach to video generation, starting with a global model to establish keyframes and progressively filling in details through local diffusion models. This enables the parallel generation of video segments.
The authors report a reduction in generation time from 7.55 minutes to 26 seconds for 1024 frames (though the GPU configuration is not disclosed).
The key module here is Mask Temporal Diffusion (MTD), but don’t be afraid of the terminology. The word "mask" means that it handles both the global generation process (which lacks initial/ending video frame references) and the local refinement processes (which use existing frames as guidance).
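For intuition, here is a hedged sketch of the coarse-to-fine recursion; `global_diffusion` and `local_diffusion` are placeholder callables standing in for the global and local MTD models, not the paper's actual interfaces.

```python
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")  # frames are treated as opaque objects here

def generate_long_video(
    prompt: str,
    global_diffusion: Callable[..., List[Frame]],  # text -> sparse keyframes
    local_diffusion: Callable[..., List[Frame]],   # text + 2 frames -> segment
    depth: int = 2,
    frames_per_segment: int = 16,
) -> List[Frame]:
    """Coarse-to-fine recursion: keyframes first, then fill every gap."""
    # Global pass: no first/last-frame guidance (the "masked" case of MTD).
    frames = global_diffusion(prompt, num_frames=frames_per_segment)

    for _ in range(depth):
        refined = [frames[0]]
        # Each gap is independent, so segments can be generated in parallel.
        for first, last in zip(frames[:-1], frames[1:]):
            # Local pass: existing frames act as guidance (the "unmasked" case).
            # Assume the returned segment includes both boundary frames.
            segment = local_diffusion(
                prompt,
                first_frame=first,
                last_frame=last,
                num_frames=frames_per_segment,
            )
            refined.extend(segment[1:])  # skip the duplicated boundary frame
        frames = refined
    return frames
```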
Remember I mentioned "latent representation"? The authors apply what they call a T-KLVAE latent encoder. In detail, quoting the article:
T-KLVAE utilizes pre-trained image KLVAE (latent encoder) with added temporal convolution and attention layers, preserving spatial information while incorporating temporal dynamics.
In a nutshell, T-KLVAE encodes videos into a compact latent representation, which reduces computational complexity. Among other notable claims, the authors argue that theirs is the first model directly trained on long films (up to 3376 frames).
**What’s it trained on?** FlintstonesHD – roughly 6M frames from 166 cartoon episodes in HD quality.
This work stands out for showing entire generated cartoon episodes in one run, yet almost one year after its publication, modern T2V models still struggle with consistently long videos. This can be partially explained by limited resources or by the race for image quality over length.
The rise of latent representation
📖 Video LDM [5] | 🗓️ April 2023
Now it’s time for SD (Stable Diffusion). While we saw a trick with latent space in NUWA-XL, this work takes the beloved Stable Diffusion and converts it into a T2V model. LDMs (Latent Diffusion Models) were already the preferred architecture for image generation; now it’s their time to shine in video generation.

**What’s inside?** Video LDM extends the latent space of traditional LDMs with a temporal dimension. The pipeline is straightforward: (i) pre-train an LDM (Stable Diffusion) on images only; (ii) introduce a temporal dimension into the latent space and fine-tune on a video dataset; (iii) optionally, fine-tune an image upsampler into a video super-resolution model. Using an LDM brings even greater computational efficiency, which opened the way to generating high-resolution videos (up to 1280 × 2048 pixels).
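As a rough illustration (not the authors' code), here is how a frozen spatial block from the image LDM could be interleaved with a trainable temporal layer, with a learned factor blending the image-only and video behaviours; the exact mixing scheme in the paper differs in detail.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Frozen image-LDM block followed by a trainable temporal layer.
    Only the temporal part (and the mixing factor) is trained on video."""

    def __init__(self, spatial_layer: nn.Module, dim: int):
        super().__init__()
        self.spatial = spatial_layer.requires_grad_(False)  # frozen T2I weights
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.alpha = nn.Parameter(torch.tensor(1.0))        # learned mix factor

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, frames, channels, height, width) in latent space
        b, t, c, h, w = z.shape
        # The spatial layer sees frames independently, like an image model.
        zs = self.spatial(z.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # The temporal layer mixes information across frames per location.
        zt = zs.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        zt = self.temporal(zt).reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        # Blend image behaviour and video behaviour.
        a = torch.sigmoid(self.alpha)
        return a * zs + (1 - a) * zt
```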
**What’s it trained on?** A closed-source 😒 dataset of real driving-scene videos containing 683,060 clips of 8 seconds at 512 × 1024 resolution, plus WebVid-10M.
It was exciting to see SD in T2V for many reasons. The authors showed that DreamBooth can be incorporated to "insert" specific objects or styles while preserving their original appearance.
In addition, with fine-tuning, this approach can be applied to any SD T2I model to convert it into T2V. But what if we could take any SD T2I model and turn it into an animation model without any fine-tuning? That is where we are going with the next work…
Can we animate anything?
📖 AnimateDiff [6] | 🗓️ June 2023
This is one of the most interesting applications of video-pre-trained LDMs. The idea is simple – the model learns a motion prior from videos and uses it to animate static images. This motion prior goes on top of any Stable Diffusion model, so you can plug in personalised SD models without retraining.

**What’s inside?** At the core of AnimateDiff is a Motion Modeling Module, a spatio-temporal transformer trained on video datasets. The framework integrates this module into frozen T2I models (like Stable Diffusion) through a process known as inflation, enabling the original model to process 5D video tensors (batch × channels × frames × height × width). Going into details and quoting the paper:
"Temporal Transformer" consists of several self-attention blocks along the temporal axis, with sinusoidal position encoding to encode the location of each frame in the animation.
5D video tensors: this is achieved by transforming each 2D convolution and attention layer into a spatial-only pseudo-3D layer that takes a 5D video tensor of shape batch × channels × frames × height × width as input.
This inflation of motion priors allows us to replace the frozen T2I component with custom models of the same architecture during inference, removing the need for a separate pre-training step when converting a T2I model into T2V.
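Here is a hedged sketch of such a motion module: temporal self-attention over the frame axis with a sinusoidal frame-position encoding and a zero-initialised output projection, so the frozen base T2I is untouched before training. The class layout and names are mine, not the repo's code.

```python
import math
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    """Temporal transformer block trained once on video, then inserted into
    any frozen SD-style T2I model of the same layout. `dim` assumed even."""

    def __init__(self, dim: int, heads: int = 8, max_frames: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)   # no-op at init: the base T2I's
        nn.init.zeros_(self.proj.bias)     # behaviour is preserved initially
        pos = torch.arange(max_frames).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_frames, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)     # sinusoidal frame-position encoding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) feature map
        b, t, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        tokens = self.norm(tokens + self.pe[:t])
        out, _ = self.attn(tokens, tokens, tokens)
        out = self.proj(out).reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + out
```

At inference time the frozen spatial weights can come from any personalised SD checkpoint with the same architecture; these motion modules are the only part carrying the learned motion prior.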
**What’s it trained on?** The motion prior (Temporal Transformer) was trained on WebVid-10M.
It is truly amazing work, with a well-maintained repo that keeps getting updates, for example support for SDXL or a Domain Adapter LoRA. Given the vast number of personalised T2I SD models out there, it opens a big field for creativity.
The hybrid
📖 Show-1 [7] | 🗓️ September 2023
Can we merge latent and pixel representations? After all, why not?

Meet Show-1 – a hybrid model utilising both pixel-based and latent-based diffusion models.
**What’s inside?** It has a cascade structure of 3 pixel-based diffusion models (DeepFloyd as the base T2I model for keyframes, plus one temporal-interpolation and one resolution-interpolation model) and 1 LDM as a super-resolution model. Quoting the paper:
The model starts with pixel-based video diffusion models (VDMs) for creating low-resolution videos closely aligned with text prompts. After, it employs latent-based VDMs to upscale the low-resolution output to high-resolution videos.
The authors argue that pixel-based diffusion is good at motion, while latent diffusion is a good expert for super-resolution. They support this argument by showing superior evaluation metrics compared to Video LDM and Make-a-Video [2].
**What’s it trained on?** The well-loved WebVid-10M.
This paper raises a really interesting question: is a latent representation good for video? Towards the end, you will see an unexpected answer.
Open-Source Milestone
📖 Stable Video Diffusion [8] | 🗓️ November 2023
The most notable open-source T2V model as of writing this post. Despite sharing many similarities with Video LDM, the biggest value of this paper is data curation: the authors describe in detail how they curated a large video dataset.

Treat this work not as a new model, but rather as an answer to the question of how all those closed-source datasets are created and curated.
**What’s inside?** SVD shares the same architecture as Video LDM [5]: (i) the model begins by training SD 2.1 on image-text pairs; (ii) temporal convolution and attention layers are inserted to adapt the model for video generation, and it is trained on large amounts of video data; (iii) the model is fine-tuned on a smaller subset of high-quality videos.
The main focus of this paper is the data processing needed to create well-curated video-text pairs. It starts with a cut-detection pipeline to prevent abrupt cuts and fades from affecting synthesised videos. Each video clip is then annotated using three synthetic captioning methods:
- CoCa (Image Captioner): Annotates the mid-frame of each clip.
- V-BLIP: Provides a video-based caption.
- LLM-based Summarization: Combines the first two captions to create a concise description of the clip.
Finally, they filter out static scenes by measuring the average optical flow and use OCR to remove clips with an excessive amount of text.
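As a hedged illustration of these two filters (not SVD's actual pipeline), here is what the static-scene and on-screen-text checks could look like with OpenCV and pytesseract; the thresholds are arbitrary placeholders.

```python
import cv2
import numpy as np
import pytesseract

def mean_optical_flow(frames: list[np.ndarray]) -> float:
    """Average Farneback flow magnitude -- low values suggest a static clip."""
    mags = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
        prev = gray
    return float(np.mean(mags))

def keep_clip(frames: list[np.ndarray],
              min_flow: float = 0.5, max_text_chars: int = 20) -> bool:
    if mean_optical_flow(frames) < min_flow:
        return False                              # static scene -> drop
    mid_frame = frames[len(frames) // 2]
    text = pytesseract.image_to_string(mid_frame)
    if len(text.strip()) > max_text_chars:
        return False                              # too much on-screen text -> drop
    return True
```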
**What’s it trained on?** A closed-source large video dataset. But at least they showed how to create such a dataset 🙂
Don’t own a private video dataset? Not a problem!
📖 VideoCrafter2 [9] | 🗓️ December 2023
Another T2V model derived from Stable Diffusion. The main interest here is the well-detailed training procedure and the description of how the authors overcome the limitation of having only low-quality videos available by combining them with high-quality generated images.

**What’s inside?** It utilizes an architecture similar to VideoCrafter1 and other T2V LDMs, incorporating spatial modules with weights initialized from SD 2.1 and temporal modules initialized to zeros. It is a very straightforward structure with no frame interpolation or upsampling; because of that and hardware constraints, the model can generate videos of at most about 2 seconds.
Initially, a fully trained video model is obtained. Then, only the spatial modules of this model are fine-tuned with high-quality generated images. In the end, you get a T2V model with quality that holds up against the earlier SVD [8], but without using private datasets. Special thanks for the well-detailed experiment settings and the description of different approaches to fine-tuning T2I models on videos.
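A minimal sketch of that split, assuming the temporal modules can be identified by name (a convention I'm assuming, not VideoCrafter2's actual code):

```python
import torch

def spatial_parameters(model: torch.nn.Module):
    """Yield only the spatial parameters for the image fine-tuning stage;
    temporal parameters stay frozen. Name matching is illustrative."""
    for name, param in model.named_parameters():
        is_temporal = "temporal" in name
        param.requires_grad_(not is_temporal)
        if not is_temporal:
            yield param

# optimizer = torch.optim.AdamW(spatial_parameters(video_model), lr=1e-5)
# ...then fine-tune on (high-quality synthetic image, caption) pairs.
```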
**What’s it trained on?** Main video training on WebVid-10M and LAION-COCO. For fine-tuning the spatial modules, the authors used the JourneyDB [Junting Pan et al.] dataset of high-quality generated images.
Moving on. Tired of diffusion models? Understandably so. Fortunately, human ingenuity never ceases, leading us to the next innovation, a.k.a. –
"Let’s put LLM everywhere and for everything"
LLM enters the chat
📖 VideoPoet [10] | 🗓️ December 2023
The most distinct paper out of all of them. Unlike traditional methods that rely on diffusion, VideoPoet utilises an autoregressive LLM to generate videos and even sound.

**What’s inside?** The internal parts of this particular model are exciting but will feel familiar to people who have worked with multi-modal LLMs. The authors utilize a decoder-only LLM architecture capable of admitting image, video, and audio modalities as discrete tokens.
To create such tokens, the model employs MAGVIT-v2 for joint image and video tokenization and the SoundStream tokenizer for audio, encoding the first frame and subsequent 4-frame chunks into tokens.
Thanks to this capability of handling inputs across different modalities, the authors state that the model can perform a wide range of video generation tasks with zero-shot capabilities: text-to-video, image-to-video, video stylization, and video-to-audio. In addition, it can generate long videos; quoting the article:
The model can generate longer videos by conditioning the last second of the video to predict the next second, allowing for the creation of videos of any desired duration with strong object identity preservation.
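For intuition, here is a hedged sketch of the autoregressive decoding loop (greedy decoding for brevity); `model` is a placeholder for a decoder-only transformer and the tokenizers are assumed, not VideoPoet's real interfaces.

```python
import torch

@torch.no_grad()
def generate_video_tokens(model, text_tokens: torch.Tensor,
                          num_video_tokens: int) -> torch.Tensor:
    """Greedy autoregressive decoding of discrete visual tokens.
    `model` maps a token sequence (1, len) to logits (1, len, vocab)."""
    seq = text_tokens                               # text prompt as the prefix
    for _ in range(num_video_tokens):
        logits = model(seq)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_token], dim=1)   # append and continue
    return seq[:, text_tokens.shape[1]:]            # visual tokens only

# A MAGVIT-v2-style tokenizer would then decode these discrete tokens back
# into frames. To extend a clip, tokenize its last second and prepend those
# tokens as extra context before generating the next second.
```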
**What’s it trained on?** An undisclosed dataset of 1B image-text pairs and ∼270M videos.
Similar to Show-1, this model raises another question – can we really apply LLMs here? Should we really use diffusion? While this discussion remains open, we can probably answer the previous one.
The latest development
📖 Lumiere [11] | 🗓️ January 2024
Do you remember the question "Is a latent representation good for video?" In this latest work, Google suggests that pixel-based diffusion is the way to go, if you have the resources.

The latest development in pixel-based T2V diffusion models, and the latest one with a detailed technical paper (hello, Sora 🙂). It’s interesting that while most recent works have focused on latent diffusion, this work rethinks cascade models in some sense.
**What’s inside?** Overall, the model consists of a base model (STUNet) that generates all frames in one run and a temporally-aware spatial super-resolution (SSR) model. There is no frame interpolation. The first and main novelty is the Space-Time U-Net (STUNet). Quoting the paper:
The architecture extends an Imagen T2I model and now downsamples the input signal both spatially and temporally. This includes interleaving temporal blocks within the text-to-image architecture and inserting temporal down- and up-sampling modules
Another distinct feature is how it rethinks cascade models: the super-resolution model runs on overlapping windows, and MultiDiffusion blends the overlapping sections to enhance the resolution to 1024×1024. All this empowers the model to show SOTA quality on a range of tasks, including image-to-video, video inpainting, and stylized generation.
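To give a feel for the window blending, here is a hedged sketch with `ssr` as a placeholder callable; note this is a simplification, since real MultiDiffusion reconciles the windows inside the denoising loop at every step, not as one post-hoc pass.

```python
import torch

def blend_overlapping_windows(frames: torch.Tensor, ssr, window: int = 16,
                              stride: int = 8) -> torch.Tensor:
    """Run `ssr` on overlapping temporal windows and average the overlaps.
    frames: (num_frames, channels, height, width) low-resolution input."""
    n = frames.shape[0]
    starts = list(range(0, max(n - window, 0) + 1, stride))
    if starts[-1] + window < n:
        starts.append(n - window)                 # make sure the tail is covered

    out_sum, counts = None, None
    for start in starts:
        chunk = ssr(frames[start:start + window])  # upscaled window
        if out_sum is None:
            c, h, w = chunk.shape[1:]
            out_sum = torch.zeros(n, c, h, w)
            counts = torch.zeros(n, 1, 1, 1)
        out_sum[start:start + chunk.shape[0]] += chunk
        counts[start:start + chunk.shape[0]] += 1
    return out_sum / counts                        # average the overlaps
```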
The authors argue that the ideas behind the Space-Time U-Net and MultiDiffusion can also be applied to LDMs. Let’s see if and how this will work in the future.
**What’s it trained on?** An undisclosed dataset of 30 million videos with text captions, each video being 80 frames long at 16 fps.
Feel like you have open questions after reading up to this point? Well, you are not alone…
❓ Outstanding questions ❓
How is this possible?
Also questioning reality after OpenAI released Sora? Unfortunately, the technical paper reads like a blog post and doesn’t disclose a detailed architecture. We can only speculate about some sort of fusion of an LLM with a diffusion model, trained on a huge amount of data.
LLM vs Diffusion vs Other?
Is diffusion the best option? We saw that most current video models are diffusion-based: the basic idea is to produce frames and then create a temporally consistent animation between them. But we also saw an LLM that generates tokens which are then decoded into images and even sound. Will we see a new breakthrough architecture this year?
Temporal consistency
Only a few works here target long video generation, mainly because diffusion models lack an understanding of the "changing view" – when the camera hops from one view to another. That’s why most models are trained on data filtered to remove such changes, to avoid "flickering" – a change of image appearance mid-video that shows up when a model is trained on unfiltered data.
Where to get Data?
The main concern now is where to get quality data. It’s mostly about annotated data, since gathering videos and annotating them costs more money than the majority of labs can afford. However, we already saw how some models overcome this issue using generated image datasets. Will we see a new "holy grail" video dataset this year?
What do you think is the biggest question yet to solve for Video Generation models?
Citations
[1] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, D. J. Fleet, "Video Diffusion Models" (2022), arXiv:2204.03458.
[2] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, Y. Taigman, "Make-a-Video: Text-to-Video Generation without Text-Video Data" (2022), arXiv:2209.14792.
[3] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, T. Salimans, "Imagen Video: High-definition Video Generation with Diffusion Models" (2022), arXiv:2210.02303.
[4] S. Yin, C. Wu, H. Yang, J. Wang, X. Wang, M. Ni, Z. Yang, L. Li, S. Liu, F. Yang, J. Fu, G. Ming, L. Wang, Z. Liu, H. Li, N. Duan, "NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation" (2023), arXiv:2303.12346.
[5] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, K. Kreis, "Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models" (2023), arXiv:2304.08818.
[6] Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, B. Dai, "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning" (2023), arXiv:2307.04725.
[7] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, M. Z. Shou, "Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation" (2023), arXiv:2309.15818.
[8] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, V. Jampani, R. Rombach, "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets" (2023), arXiv:2311.15127.
[9] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, Y. Shan, "VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models" (2024), arXiv:2401.09047.
[10] D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, R. Hornung, H. Adam, H. Akbari, Y. Alon, V. Birodkar, Y. Cheng, M.-C. Chiu, J. Dillon, I. Essa, A. Gupta, M. Hahn, A. Hauth, D. Hendon, A. Martinez, D. Minnen, D. Ross, G. Schindler, M. Sirotenko, K. Sohn, K. Somandepalli, H. Wang, J. Yan, M.-H. Yang, X. Yang, B. Seybold, L. Jiang, "VideoPoet: A Large Language Model for Zero-Shot Video Generation" (2023), arXiv:2312.14125.
[11] O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, G. Liu, A. Raj, Y. Li, M. Rubinstein, T. Michaeli, O. Wang, D. Sun, T. Dekel, I. Mosseri, "Lumiere: A Space-Time Diffusion Model for Video Generation" (2024), arXiv:2401.12945.