
Introduction
After shaking up NLP and moving into computer vision with the Vision Transformer (ViT) and its successors, transformers are now entering the field of image generation. They are gradually becoming an alternative to the U-Net, the convolutional architecture upon which all the early Diffusion Models were built. This article looks into the Diffusion Transformer (DiT), introduced by William Peebles and Saining Xie in their paper "Scalable Diffusion Models with Transformers."
DiT has influenced the development of other transformer-based diffusion models like PIXART-α, Sora (OpenAI’s astonishing text-to-video model), and, as I write this article, Stable Diffusion 3. Let’s start exploring this emerging class of architectures that are contributing to the evolution of diffusion models.
Preliminaries
Given that this is an advanced topic, I’ll have to assume a certain familiarity with recurring concepts in AI and, in particular, in Image Generation. If you’re already familiar with this field, this section will help refresh these concepts, providing you with further references for a deeper understanding.
If you want an extensive overview of this world before reading this article, I recommend reading my previous article below, where I cover many diffusion models and related techniques, some of which we’ll revisit here.
Comparing and Explaining Diffusion Models in HuggingFace Diffusers
Diffusion formulation

At an intuitive level, diffusion models work by taking images, adding noise (usually Gaussian), and then training a neural network to reverse this noise-adding process by predicting the noise that was added and, in some cases, its covariance matrix. The degree of noise introduced is controlled by a timestep variable t: at t=0, x_0 is the original image, while at t=1000, x_1000 is nearly pure noise.
In practice, for each timestep t, we sample x_t, conditioned on x_{t-1}, from a Gaussian distribution:

q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) · x_{t-1}, β_t · I)
This can be reparameterized in the following way:

x_t = √(1 − β_t) · x_{t-1} + √(β_t) · ϵ_t

where ϵ_t ~ N(0, I). The β_t terms denote a predetermined variance schedule. To generate x_t given x_{t-1}, we sample ϵ_t from a multivariate standard normal distribution and apply the equation above. This step-by-step addition of noise is known as the forward process.
Fortunately, to generate x_t it is not necessary to first generate all the previous x_1, …, x_{t-1}; it can be shown that we can use the following formula directly:

x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ϵ

where:

ᾱ_t = ∏_{s=1}^{t} α_s,  with α_s = 1 − β_s
Once we establish the noise addition methodology, we train a model to predict the added noise. During training, we sample a batch of images and a corresponding t value for each, add noise based on t, and feed these noised images with their t values into the model. The model is trained to predict the noise by minimizing a loss function, typically the mean squared error between the actual added noise and the noise predicted by the model:

L(θ) = E_{x_0, ϵ, t} [ ‖ϵ − ϵ_θ(x_t, t)‖² ]

Note that here ϵ_θ is our neural network; it takes both the noised image x_t and the timestep t as input.
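To make this concrete, here is a minimal training-step sketch in PyTorch. It assumes a generic noise-prediction network model(x_t, t) and a precomputed tensor alphas_cumprod holding the ᾱ_t values; both names are mine, not from the paper.

import torch
import torch.nn.functional as F

def training_step(model, x0, alphas_cumprod):
    # Sample a random timestep for each image in the batch
    batch_size = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (batch_size,), device=x0.device)
    noise = torch.randn_like(x0)
    # Forward process in one shot: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # The network predicts the added noise from (x_t, t); MSE against the true noise
    noise_pred = model(x_t, t)
    return F.mse_loss(noise_pred, noise)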
For image generation, we perform the reverse process, starting from pure noise and iteratively sampling using the following conditional distribution:

p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))
where:

μ_θ(x_t, t) = (1 / √(α_t)) · (x_t − (β_t / √(1 − ᾱ_t)) · ϵ_θ(x_t, t))

and Σ_θ(x_t, t) is set to a diagonal matrix. In the context of Denoising Diffusion Probabilistic Models (DDPM), this matrix is fixed, whereas in Improved Denoising Diffusion Probabilistic Models (iDDPM) it is learned. The paper we are analyzing uses the iDDPM formulation.
With the trained model, we could start from pure noise and attempt to remove all of it in a single step to obtain a sample resembling those from the training dataset. However, a gradual iterative process, where noise is partially removed and sometimes a bit of "fresh" noise is reintroduced (as in Langevin dynamics), yields better results. The specifics of the reverse process are dictated by the sampling strategy employed.
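As a rough illustration of a single reverse step, here is a sketch using the fixed-variance DDPM sampler (the learned-variance iDDPM case only changes how Σ is obtained). The function and argument names are mine.

import torch

@torch.no_grad()
def ddpm_step(model, x_t, t, betas, alphas_cumprod):
    # Estimate the noise present in x_t at integer timestep t
    beta_t = betas[t]
    a_bar_t = alphas_cumprod[t]
    eps = model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device))
    # Posterior mean: mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - a_bar_t) * eps) / sqrt(alpha_t)
    mean = (x_t - beta_t / (1 - a_bar_t).sqrt() * eps) / (1 - beta_t).sqrt()
    if t == 0:
        return mean                                      # last step: no fresh noise
    return mean + beta_t.sqrt() * torch.randn_like(x_t)  # fixed variance sigma_t^2 = beta_t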
Delving further into the technical details would extend beyond the scope of this summary. For those interested in exploring the subject in-depth, I recommend consulting the following seminal papers on diffusion models:
Denoising Diffusion Probabilistic Models
Improved Denoising Diffusion Probabilistic Models
Elucidating the Design Space of Diffusion-Based Generative Models
Classifier-free guidance
In practice, we rarely aim to generate images without any form of control. The most popular way to guide the final outcome is through a textual prompt, essentially a description of what we want the image to depict, which is known as text-to-image generation. A simpler version of this is class conditioning, where a class label c serves as the prompt. This second type of prompt is the one used by the Diffusion Transformer.
Unfortunately, models sometimes tend to overlook our prompts, whether textual or otherwise. To counter this, a technique known as classifier-free guidance is employed to ensure the model adheres more closely to our instructions.
You might wonder why it’s termed "classifier-free" guidance. This approach is distinguished from classifier guidance, which relies on the gradient with respect to x_t of log p(c | x_t, t), where p(c | x_t, t) represents the likelihood that the image x_t at timestep t belongs to class c and is estimated by a classifier. The gradient of a function points in the direction of the steepest ascent. Therefore, by adding this gradient with a certain weight, the chances that the generated image aligns with the desired class increase, at least according to our classifier.
One major challenge with this method is the necessity of a classifier. If no pre-trained classifier is available for our dataset’s classes, we must first train one. More problematic is that even if a suitable classifier exists, it’s unlikely to be tailored for our kind of images, since pre-trained classifiers are designed to classify "clean" images, not those obscured by Gaussian noise. And what about when we want to condition our generation not on a class label but on a textual prompt like "an illustration of a baby daikon radish in a tutu walking a dog", or even a non-textual conditioning?
Classifier-free guidance effectively addresses these hurdles. It operates by occasionally substituting the prompt embedding with a learnable embedding representing the absence of a prompt during training. During inference, this allows us to compute two noise estimates: one with the prompt and one without. We can now apply the following formula:

ϵ̂_θ(x_t, c) = ϵ_θ(x_t, ∅) + s · (ϵ_θ(x_t, c) − ϵ_θ(x_t, ∅))

From the formula, we see that setting the guidance scale s to 1 recovers the standard conditional noise estimate ϵ_θ(x_t, c). Increasing s beyond 1 pushes the estimate further away from the unconditional one ϵ_θ(x_t, ∅), moving in the direction of ϵ_θ(x_t, c) − ϵ_θ(x_t, ∅), i.e. toward the prompt.
Although classifier-free guidance typically results in improved samples, it also necessitates making two noise predictions for every evaluation – one with the prompt and one without – effectively doubling the computational effort. Additionally, there’s a trade-off between the quality and variety of the generated images: as s increases, the visual fidelity improves, but the diversity of the samples diminishes. In their paper, the authors use a guidance scale of s=4 when generating images conditioned on a specific class label.
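In code, classifier-free guidance is just a weighted combination of two forward passes. A minimal sketch, assuming a class-conditional noise predictor model(x_t, t, label) and a reserved null_label index for the learned "no prompt" embedding (both names are assumptions of mine):

import torch

def cfg_noise_estimate(model, x_t, t, class_label, null_label, s=4.0):
    # Two predictions per step: one with the class prompt, one with the prompt dropped
    eps_cond = model(x_t, t, class_label)
    eps_uncond = model(x_t, t, null_label)
    # eps_hat = eps_uncond + s * (eps_cond - eps_uncond); s = 1 recovers eps_cond
    return eps_uncond + s * (eps_cond - eps_uncond)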
Latent diffusion models

I’ve already written an entire article on latent diffusion:
Paper Explained – High-Resolution Image Synthesis with Latent Diffusion Models
Given that, I’ll only provide the main intuition here. If you want to delve deeper, I recommend reading my article above.
The diffusion formulation mentioned earlier operates in the image space. Unlike a classification model, which can work at relatively low resolution, here the resolution is constrained by the resolution of the image we intend to generate, for example 1024 x 1024. Moreover, as we’ve seen, the reverse process is iterative, meaning that to generate a single image we have to perform inference with a computationally heavy model multiple times. This problem is further accentuated by using a transformer instead of a convolutional network like the U-Net, since the attention mechanism scales quadratically rather than linearly. Furthermore, as we will see, the number of tokens generated from an image scales quadratically with its side length: if we move from a resolution of 256 x 256 to 512 x 512, a token count of T becomes 4T, and the cost of the attention layers grows from O(T²) to O(16T²).
There are fundamentally two solutions to this: the first, adopted by models like DALL·E 2 (OpenAI) or Imagen (Google), is to use a relatively low resolution for the diffusion process and then apply one or more super-resolution models. The second is to abandon the image space and work in a latent space.
What does this last statement mean in practice? We can view a latent space as a sort of compression of the image that maintains its semantic content, possibly discarding only some marginal details. A model capable of compressing and then decompressing an image in this way is the Variational Autoencoder (VAE).
During training, each image is compressed through the VAE encoder; during inference, a noise-only tensor z is simply generated. If the desired resolution is, for example, 256 x 256 x 3, z might be 32 x 32 x 4. Note that, since we are no longer working in the image space, the number of channels is not constrained to remain at 3. During generation, starting from z, noise is gradually removed, and in the final step the VAE decoder is used to decompress the "clean" z into the final image.
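To give a sense of the shapes involved, here is a sketch using a pre-trained VAE from the HuggingFace Diffusers library (the checkpoint name is just an example; in practice the latents are also rescaled by a fixed factor before diffusion):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")

# Training: compress a 256 x 256 x 3 image into a 32 x 32 x 4 latent
image = torch.randn(1, 3, 256, 256)             # stand-in for a real, normalized image
with torch.no_grad():
    z = vae.encode(image).latent_dist.sample()  # shape: (1, 4, 32, 32)

# Generation: diffusion operates entirely on z; only the final step touches pixels
with torch.no_grad():
    decoded = vae.decode(z).sample              # shape: (1, 3, 256, 256)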
This part was just a recap. If you want to learn more, I suggest giving my articles mentioned above a read.
Diffusion Transformer Design Space

Patchify

In this section, I will assume you are familiar with the standard self-attention mechanism and with sinusoidal positional encoding, both concepts related to the Transformer architecture. If they are new to you or you wish to review them, I recommend this article:
Transformers are models that take as input a sequence, or rather a set (the order in the sequence is given only by the addition of positional encodings/embeddings). For this reason, the first operation we must perform is to transform z, our latent tensor, which in this case is 32 × 32 × 4, into a sequence of tokens. Patchify is very simple: it divides z into a (32 / p) × (32 / p) grid, where each p × p × 4 patch is then linearly projected to a 1 × d vector, with d a hyperparameter. In practice, this can be done by applying a convolution with kernel size and stride equal to p and number of output channels equal to d, obtaining, for each batch element, a tensor of shape (32 / p) × (32 / p) × d. If we define T = 32² / p², by rearranging the elements we have, for a batch of size N, a tensor of shape N × T × d. Halving p quadruples T, and thus at least quadruples the total transformer Gflops. In their work, the authors try p = 2, 4, 8; smaller values of p produce better results but are more computationally demanding.
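Here is a minimal patchify sketch; the values of p and d are just illustrative (1152 is the hidden size used by the largest DiT variant):

import torch
import torch.nn as nn

p, d = 2, 1152                                   # patch size and hidden size (illustrative)
z = torch.randn(8, 4, 32, 32)                    # a batch of latents: N x 4 x 32 x 32
proj = nn.Conv2d(in_channels=4, out_channels=d, kernel_size=p, stride=p)
tokens = proj(z)                                 # N x d x (32/p) x (32/p)
tokens = tokens.flatten(2).transpose(1, 2)       # N x T x d, with T = (32/p)**2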
At this point, recalling that the attention mechanism itself is unaware of the order of elements, we add positional encodings to all input tokens.
Note: Although the paper refers to them as embeddings, this term is typically reserved for representations learned during training; in this case, a static encoding is used.
Assuming you are already familiar with 1D sinusoidal positional encoding, we encode the x coordinates of the grid for each token in the standard 1D manner. We apply the same process to the y coordinates and then concatenate these two encodings for each token. This approach is preferred over applying standard 1D positional encoding to the "unrolled" image because the latter would cause two nearby positions in the grid to have significantly different encodings. By acknowledging the 2D nature of the input, we ensure that adjacent tokens have similar encodings.
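A compact sketch of this 2D sine-cosine scheme (the function names are mine; the official repository builds a similar table):

import torch

def sincos_1d(positions, dim):
    # Standard 1D sinusoidal encoding for a set of (float) positions
    omega = 1.0 / (10000 ** (torch.arange(dim // 2) / (dim // 2)))
    angles = positions[:, None] * omega[None, :]                        # (T, dim/2)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)     # (T, dim)

def sincos_2d(grid_size, dim):
    # Encode x and y grid coordinates separately, then concatenate per token
    ys, xs = torch.meshgrid(torch.arange(grid_size), torch.arange(grid_size), indexing="ij")
    emb_x = sincos_1d(xs.flatten().float(), dim // 2)
    emb_y = sincos_1d(ys.flatten().float(), dim // 2)
    return torch.cat([emb_x, emb_y], dim=1)                             # (grid_size**2, dim)

pos_embed = sincos_2d(grid_size=16, dim=1152)   # e.g. p = 2 on a 32 x 32 latent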
DiT block design
Once again, I’ll assume familiarity with the Transformer architecture. Therefore, I won’t delve into standard transformations like Layer Norm, Multi-Head Self-Attention, or Pointwise Feedforward, which are well-documented in the linked article. However, there’s a specific aspect of DiT we haven’t discussed yet: how do we implement class conditioning?
In this article, I’ll focus on the approach that the authors found most effective in their experiments: adaLN-Zero.
First, we need to transform t and c into two embeddings. For the class label c, this involves simply initializing an embedding vector for each class, learned during training. For t, it’s a bit more complex: a sinusoidal encoding is applied to t, which is then further transformed by a small MLP consisting of a linear layer, a SiLU activation, and another linear layer. These two embeddings are then summed to form our conditioning vector.
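A sketch of how this conditioning vector could be built (the class and attribute names are mine; the dimensions are simplified so that everything lives in d):

import math
import torch
import torch.nn as nn

class ConditioningEmbedder(nn.Module):
    # Turns (t, c) into a single d-dimensional conditioning vector
    def __init__(self, num_classes, d):
        super().__init__()
        self.class_table = nn.Embedding(num_classes, d)   # learned class embeddings
        self.t_mlp = nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d))
        self.d = d

    def timestep_encoding(self, t):
        # Sinusoidal encoding of the integer timestep t, of dimension d
        half = self.d // 2
        freqs = torch.exp(-math.log(10000) * torch.arange(half, device=t.device) / half)
        args = t[:, None].float() * freqs[None, :]
        return torch.cat([torch.cos(args), torch.sin(args)], dim=1)

    def forward(self, t, c):
        return self.t_mlp(self.timestep_encoding(t)) + self.class_table(c)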
Now let’s describe adaLN. The conditioning is first transformed with SiLU and then linearly projected; I’ll refer to this transformation as adaLN_modulation, for consistency with the official repository’s code. The output of adaLN_modulation is divided into six chunks, each of dimension d (the same as our conditioning). These vectors represent γ₁, β₁, α₁, γ₂, β₂, α₂, used to scale and shift the input in different parts of the DiT block.
The sequence of transformations is as follows:
import torch.nn as nn

def modulate(x, shift, scale):
    # Scale and shift each token with the conditioning-derived parameters
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DiTBlock(nn.Module):
    ...
    def forward(self, x, c):
        # Project the conditioning c into the six modulation vectors
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.adaLN_modulation(c).chunk(6, dim=1)
        # Modulated Layer Norm -> self-attention -> gated residual connection
        x = x + gate_msa.unsqueeze(1) * self.attn(modulate(self.norm1(x), shift_msa, scale_msa))
        # Modulated Layer Norm -> pointwise feedforward -> gated residual connection
        x = x + gate_mlp.unsqueeze(1) * self.mlp(modulate(self.norm2(x), shift_mlp, scale_mlp))
        return x
- Input tokens x are normalized with Layer Norm: self.norm1(x).
- Using modulate(...), the output is scaled and shifted, where shift_msa and scale_msa represent γ₁ and β₁ respectively.
- Multi-Head Self-Attention is applied: self.attn(...).
- The output is scaled with gate_msa.unsqueeze(1), representing α₁.
- The result is added back to the original x, thus employing a residual connection (similar to ResNet).
- A similar process is repeated, but with the Pointwise Feedforward self.mlp(...) instead of self.attn(...).
adaLN-Zero differs from adaLN by initializing the weights of adaLN_modulation to zero. This means that initially, as we can see in the forward method above, x is not altered in any way, and only gradually does the network learn the optimal scaling and shifting parameters.
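Concretely, the zero initialization might look like this, a sketch along the lines of the official repository, where the last linear layer of adaLN_modulation is zeroed out:

import torch.nn as nn

d = 1152  # example hidden size
# The final linear layer producing the six modulation vectors starts at zero,
# so every gate is zero and each DiT block initially leaves x untouched.
adaLN_modulation = nn.Sequential(nn.SiLU(), nn.Linear(d, 6 * d))
nn.init.zeros_(adaLN_modulation[-1].weight)
nn.init.zeros_(adaLN_modulation[-1].bias)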
Transformer decoder
Finally, the decoder consists of a linear layer plus normalizations analogous to those we have already seen. This linear layer transforms the final output x of shape N × T × d into N × T × (p² · 2C), where C is the number of input channels. Note that it is 2C only if we aim to predict the diagonal covariance matrix Σ in addition to the noise; otherwise, it is just C.
The output is then "unpatchified" (reshaped) to yield either the predicted noise or the noise combined with the predicted Σ.
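A sketch of what this "unpatchify" step could look like (the shapes follow the description above; the helper itself is mine):

import torch

def unpatchify(x, p, C, grid_size):
    # x: (N, T, p*p*2C) -> (N, 2C, grid_size*p, grid_size*p)
    N = x.shape[0]
    x = x.reshape(N, grid_size, grid_size, p, p, 2 * C)
    x = torch.einsum("nhwpqc->nchpwq", x)                    # put patches back in place
    out = x.reshape(N, 2 * C, grid_size * p, grid_size * p)
    noise_pred, sigma_pred = out.chunk(2, dim=1)             # split predicted noise and Sigma
    return noise_pred, sigma_pred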
Conclusions
In this article, we have introduced the essential parts of the paper "Scalable Diffusion Models with Transformers," also providing some useful background. To conclude with a fun anecdote, this work, which is becoming increasingly important as it underlies the most modern approaches, was initially rejected at CVPR 2023,
reminding us that it is very difficult even for experts to predict what will have an impact and that the review process in scientific conferences is far from perfect.
Thank you for taking the time to read this article, and please feel free to leave a comment or connect with me to share your thoughts or ask any questions. To stay updated on my latest articles, you can follow me on Medium, LinkedIn or Twitter.