
The internet is abuzz with Stable Diffusion – it’s slowly becoming a must-have tool for creating beautiful visuals. And if you want to take your visuals up a notch, then animations made with Stable Diffusion are the way to go.
No longer do you have to spend hours creating traditional animations – with Stable Diffusion, you can make your visuals come to life in no time. So, let’s unpack this magical tool and see how you can use it to create stunning animations!
In this post, I’m going to walk through some of the techniques used to create animations with Stable Diffusion.
Some Background Information
There are a few concepts we need to familiarize ourselves with before things start moving.
The first one is how Diffusion models work. To put it simply, these models turn noise into images. During training, these models learn to reverse a noising process that progressively adds noise to an image via a scheduler function for a fixed number of steps until the image is indistinguishable from random noise.
Once trained, we can generate an image by passing an image-shaped noise tensor into the model and iteratively denoising it until an image emerges.

In the figure above, the density function q refers to the forward noising process. What this means is that the forward process produces a noisy sample at timestep t based on the sample at timestep t-1 and the value of the noise scheduler function at timestep t.
The density function p refers to the reverse process, which removes a single step's worth of noise from a noisy image. This process is repeated until a clean image is produced.
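As a minimal sketch, here is what the closed-form forward noising step looks like, assuming a simple linear beta schedule (the actual schedule used during training may differ):
import torch
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)       # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)
def q_sample(x0, t, noise):
    """Sample a noisy version of x0 at timestep t, i.e. a draw from q(x_t | x_0)."""
    a = alphas_cumprod[t].sqrt()
    b = (1.0 - alphas_cumprod[t]).sqrt()
    return a * x0 + b * noise
Reversing this process, one step at a time, is what the trained model learns to do.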
The next concept we need to get a handle on is Latent Diffusion. This is what powers the Stable Diffusion model.
In a vanilla Diffusion model, the forward and reverse processes operate on a noise tensor that has the same shape as the image tensor. Latent Diffusion uses a VAE model to first encode an image into a latent code before performing the forward and reverse process.
This makes the process faster since the model operates on a lower dimensional latent code rather than the full image shaped tensor.
The other thing about Stable Diffusion is that it is text-guided. This means that the image generation process is guided by a text embedding produced by a CLIP encoder. Information from the text embedding is incorporated into the diffusion process through Cross Attention.
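As a rough sketch of how these pieces fit together, assuming the Hugging Face diffusers StableDiffusionPipeline (the exact API may vary between versions):
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Text prompt -> CLIP embedding that guides the diffusion process via cross attention
tokens = pipe.tokenizer(
    ["a picture of a corgi"],
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
)
text_embedding = pipe.text_encoder(tokens.input_ids)[0]
# Image -> latent code via the VAE encoder (0.18215 is Stable Diffusion's scaling factor)
image_tensor = torch.zeros(1, 3, 512, 512)  # placeholder for a real image scaled to [-1, 1]
latent = pipe.vae.encode(image_tensor).latent_dist.sample() * 0.18215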
Now that we’ve got that covered, let’s move on to making some animations.
Transitioning between two Images

The most basic animation we can make is a transition between two images. In the animation above, we're creating a transition between the prompts "a picture of a corgi" and "a picture of a lion". To create this transition, we're going to need to define a couple of values.
- num_frames: the number of frames to generate between the two images.
- fps: the frame rate of the animation, defined in frames per second.
The number of frames tells us how many in-between points we're expected to generate, while the frame rate determines how fast the animation plays.
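For example, with a hypothetical num_frames of 60 and an fps of 10, the resulting animation plays for six seconds:
num_frames = 60  # in-between frames to generate
fps = 10         # playback speed in frames per second
duration_seconds = num_frames / fps  # 6.0 seconds of animation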
We already covered that latent diffusion works by running the diffusion process on a latent code of the image. To generate a new image, we create a noise tensor with the same shape as our latent code and pass it through our model along with our CLIP text embedding. This results in a latent code that we can pass into our decoder model to create the image.
Here’s some pseudocode to demonstrate this.
import torch
seed = 42
generator = torch.Generator(device=pipe.device)
# Sample an image-shaped noise tensor in latent space (1/8th of the image resolution)
latent = torch.randn(
    (1, pipe.unet.in_channels, height // 8, width // 8),
    device=pipe.device,
    generator=generator.manual_seed(seed),
)
text_embedding = get_text_embedding(["a picture of a corgi"])
# Run the reverse diffusion process and decode the resulting latent into an image
output_latent = diffuse(text_embedding, latent)
image = decode_latent(output_latent)
Both the text embeddings and latent are tensors. We will create these tensors for our start and end frames, and use interpolation to fill in the in between frames. Interpolation is a method for creating new data points in between two known data points.
To create our specified number of frames between our start and end frame, we will use Spherical Linear Interpolation, as it accounts for the geometry of the latent space. You can think of Spherical Linear Interpolation as generating vectors along the arc between two unit vectors.
The interpolation function takes in two tensors and an interpolation weight that determines how close the resulting tensor is to the start and end values. An interpolation weight of 0.1 results in an interpolated tensor that is more similar to the starting tensor, while an interpolation weight of 0.9 results in a tensor that is more similar to the ending tensor.
import numpy as np
import torch
def slerp(t, v0, v1, DOT_THRESHOLD=0.9995):
    """Helper function to spherically interpolate two arrays v0 and v1."""
    # from https://gist.github.com/nateraw/c989468b74c616ebbc6474aa8cdd9e53
    inputs_are_torch = False
    if not isinstance(v0, np.ndarray):
        inputs_are_torch = True
        input_device = v0.device
        v0 = v0.cpu().numpy()
        v1 = v1.cpu().numpy()
    dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))
    if np.abs(dot) > DOT_THRESHOLD:
        # Vectors are nearly parallel: fall back to linear interpolation
        v2 = (1 - t) * v0 + t * v1
    else:
        theta_0 = np.arccos(dot)
        sin_theta_0 = np.sin(theta_0)
        theta_t = theta_0 * t
        sin_theta_t = np.sin(theta_t)
        s0 = np.sin(theta_0 - theta_t) / sin_theta_0
        s1 = sin_theta_t / sin_theta_0
        v2 = s0 * v0 + s1 * v1
    if inputs_are_torch:
        v2 = torch.from_numpy(v2).to(input_device)
    return v2
generator = torch.Generator()
start_seed = 42
latent_start = torch.randn(
    (1, 4, 64, 64),
    generator=generator.manual_seed(start_seed),
)
end_seed = 43
latent_end = torch.randn(
    (1, 4, 64, 64),
    generator=generator.manual_seed(end_seed),
)
num_frames = 60
# Evenly spaced interpolation weights between 0.0 and 1.0
schedule = np.linspace(0, 1, num_frames)
in_between_latents = []
for t in schedule:
    in_between_latents.append(slerp(float(t), latent_start, latent_end))
We apply the same interpolation to our text embeddings as well.
start_text_embedding = get_text_embedding(["a picture of a corgi"])
end_text_embedding = get_text_embedding(["a picture of a lion"])
in_between_embeddings = []
for t in schedule:
    in_between_embeddings.append(slerp(float(t), start_text_embedding, end_text_embedding))
This results in smooth transitions between generated images.
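Putting the pieces together, here's a minimal sketch of the frame-generation loop, reusing the diffuse and decode_latent helpers from the pseudocode above and assuming decode_latent returns a PIL image:
frames = []
for latent, text_embedding in zip(in_between_latents, in_between_embeddings):
    # Denoise the interpolated latent, guided by the interpolated text embedding
    output_latent = diffuse(text_embedding, latent)
    frames.append(decode_latent(output_latent))
# Save the frames as a GIF; duration is the per-frame display time in milliseconds
fps = 10
frames[0].save(
    "transition.gif",
    save_all=True,
    append_images=frames[1:],
    duration=int(1000 / fps),
    loop=0,
)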
Preserving the Subject in an Animation
You might have noticed in the earlier animation that the picture of our corgi changes quite dramatically over the course of the animation. If we’re trying to create a focal point in our animation, this can be quite undesirable.
To circumvent this, we can use Composable Diffusion. This technique allows combining prompts in a way that preserves the individual components.
In order to do this, Composable Diffusion splits the input prompt on a separator character, such as |:
'a red house | a house in a lightning storm'
This prompt will be split into two prompts at the separator, and each one is fed into the diffusion model with the same latent-shaped noise tensor. The latents produced by the model are then averaged into a single tensor and sent to the decoder model.

Notice that the house in the animation above does not change as much over the course of the animation. This is because the individual components of the prompt are combined via composition.
Pseudocode for this approach:
prompt = 'a red house | a house in a lightning storm'
# Split the prompt into its individual components
prompts = [p.strip() for p in prompt.split('|')]
text_embeddings = [get_text_embedding(p) for p in prompts]
latent_noise_tensor = torch.randn(
    (1, 4, 64, 64),
    generator=generator.manual_seed(start_seed),
)
# Diffuse each component with the same noise tensor, then average the outputs
output_latents = [diffuse(text_embedding, latent_noise_tensor) for text_embedding in text_embeddings]
final_latent = torch.mean(torch.stack(output_latents), dim=0)
image = decode_latent(final_latent)
Audio Reactive Animations
So far we have discussed how to transition between images using Spherical Linear Interpolation with an evenly spaced weight schedule. This produces interpolation points that are equidistant from each other across the individual frames.

Notice how each step in the X direction results in an equal move in the Y direction.
When trying to create audio reactive animations, we want to produce dramatic changes in our visuals when there is a particular type of audio event, such as a drum beat, and keep our changes relatively constant otherwise.
To produce such changes, we need to create an interpolation weight schedule that changes rapidly when frames contain an audio event, and remains relatively constant when they don’t.
Using an audio analysis library like librosa can help us detect these audio events.
import librosa
audio_input = "my_audio_file.mp3"
# Load the audio file and compute an onset strength envelope
audio_array, sr = librosa.load(audio_input)
onset_env = librosa.onset.onset_strength(y=audio_array, sr=sr)
# Normalize the envelope so the values lie between 0.0 and 1.0
onset_env = librosa.util.normalize(onset_env)
This snippet loads an audio file and extracts an onset strength envelope that shows where the audio events occur. Additionally, we normalize the envelope so that the values are constrained to lie between 0.0 and 1.0.

The peaks in the chart correspond to events in the audio file over time. Using this array directly as our interpolation weight schedule would result in very jittery motion between the two prompts, since the weights constantly oscillate between 0.0 and 1.0. This is no doubt interesting, but for more complex audio it might just end up looking very noisy. We want to drive transitions from one frame to the next in a way that lets us clearly see a transition between prompts, so we need an interpolation schedule that increases monotonically from 0.0 to 1.0.
Using a normalized cumulative sum of this array will produce the effect we need.
import numpy as np
# Cumulative sum of the onset envelope, rescaled so the schedule ends at 1.0
cumulative = np.cumsum(onset_env)
cumulative = cumulative / max(cumulative)

The interpolation schedule now increases over time as we move from start to end. We need to resize this array to match the number of frames we want to generate.
import numpy as np
def resize(onset_env, num_frames):
    """Resample the schedule to match the number of frames we want to generate."""
    x = np.linspace(0, len(onset_env), len(onset_env))
    resized_schedule = np.linspace(0, len(onset_env), num_frames)
    return np.interp(resized_schedule, x, onset_env)
resized = resize(cumulative, num_frames)

Now when we interpolate between our prompts, we get rapid transitions during audio onset events and smooth transitions otherwise. We can further improve the result by decomposing the audio signal into percussive and harmonic components and building our interpolation weights from just one of these components.
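As a sketch of that idea, librosa's librosa.effects.hpss can separate the harmonic and percussive components, and we can build the schedule from just the percussive part, reusing the resize helper and num_frames from above:
import librosa
import numpy as np
audio_array, sr = librosa.load("my_audio_file.mp3")
# Separate the signal into harmonic and percussive components
harmonic, percussive = librosa.effects.hpss(audio_array)
# Build the onset envelope from just the percussive component
onset_env = librosa.onset.onset_strength(y=percussive, sr=sr)
onset_env = librosa.util.normalize(onset_env)
# Same cumulative-sum and resize steps as before
cumulative = np.cumsum(onset_env)
cumulative = cumulative / max(cumulative)
resized = resize(cumulative, num_frames)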
Creating Animations using an Initial Video Input
Up until this point, we have been creating animations using noise tensors as our input. But there might also be cases where we want to leverage an existing video as the base for our animations.
Stable Diffusion supports this workflow through Image to Image translation.
Instead of using a randomly sampled noise tensor, the Image to Image workflow first encodes an initial image (or video frame) into a latent code. We add noise to this latent code and use it as our input noise tensor, while running a slightly modified diffusion process.
We modify the diffusion process by introducing a new parameter: strength.
The strength parameter controls how much of the original image content is preserved during the diffusion process, and is bounded between 0.0 and 1.0. Lower values (up to about 0.5) tend to preserve more of the original image content, while higher values allow the output to deviate further from the original image's semantics.

So what is the strength parameter doing exactly?
The denoising process is iterative and runs for a fixed number of steps based on the scheduler. This means that we are free to drop our noise tensor in at any point in this iterative process. The strength parameter controls how late in this process we drop in our noised latent.
Say we are running the denoising process for 50 inference steps. If we set the strength value to 0.8, we would add 40 steps' worth of noise to our initial latent and then denoise it for 40 steps.
A higher value runs the denoising process for more steps, causing more of the original image semantics to disappear. A lower value runs denoising for fewer steps and preserves more of the original image.
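Here's a rough sketch of that bookkeeping. It mirrors how common img2img implementations map strength to steps, though the exact details vary by library:
num_inference_steps = 50
strength = 0.8
# Number of denoising steps that will actually run
init_timestep = int(num_inference_steps * strength)  # 40
# The first (num_inference_steps - init_timestep) steps are skipped,
# because the noised initial latent already stands in for them
t_start = num_inference_steps - init_timestep  # start denoising at step 10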
Parting Notes
If you’re interested in applying these techniques, try out this project that I’ve been working on for the past few weeks, Giffusion.
Giffusion applies all the techniques I’ve described in this post and provides a simple WebUI that you can run in Google Colab to create animated GIFs and Videos with Stable Diffusion.
If you're curious about how I created the animations in this post, you can check out my Comet project with all the prompts, parameters and generated animations.
So, there you have it! With Stable Diffusion, you can create stunning animations and bring your projects to life. Now that you know the power of this magical tool, it’s time to fire up your creativity and make something beautiful!
References
- Introduction to Diffusion Models for Machine Learning
- Stable Diffusion with Diffusers
- Composable Diffusion
- Stable Diffusion Videos for Audio Reactivity
- How Image to Image Works
- Giffusion: A Project to generate Animations with Stable Diffusion
- Comet Project with all my prompts and generated Images