
Diffusion Models: How Do They Diffuse?

Understanding the Core Processes Behind Generative AI

Diffusion Models

The name "diffusion models" in machine learning was derived from the statistical concept of diffusion processes.

What is that statistical concept?

In natural sciences, diffusion is the process by which particles spread from areas of high concentration to areas of low concentration over time, often described by the diffusion equation in physics and mathematics.
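For reference, this spreading is captured by the classic diffusion equation, ∂φ/∂t = D ∇²φ, where φ is the concentration of the diffusing substance and D is the diffusion coefficient.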

Reaction-diffusion is an excellent example of this.

Reaction-Diffusion

Reaction-diffusion is quite a complicated process; if you want to read through the mathematical logic, you can visit RD master Karl Sims’ website:

Reaction-Diffusion Tutorial

Let’s start with a simple analogy:

Reaction-diffusion systems are a way to describe how things change and move around, especially when you’re talking about chemicals. Imagine you have a couple of different paints on a piece of paper, and they start to mix and create new colors – that’s like the "reaction" part. The paint blots don’t just stay in one spot; they spread out and blend – that spreading is like the "diffusion" part.

So, these systems are just a set of rules that tell us how these processes happen: how the chemicals react with each other to make new stuff and how they move around or spread out.

This can describe a lot of different things in nature, such as how patterns form on animal skins, how pollution spreads in the environment, and lots of other situations where stuff is reacting and moving at the same time!

The reaction-diffusion algorithm is quite skillful in generating appealing and functional patterns – image by the author.

Why is this useful?

Well, you can design shoes with them! 😉

New Balance Two WXY featuring Reaction-Diffusion pattern – not my design, but the Computational Design Team introduced RD to designers in the office back in 2018 – www.newbalance.com

Well, that is only one use case –

Reaction-diffusion systems can be described and simulated through a math formula, which in turn helps us understand the diffusion event.

These formulas mainly explain how the amounts of one or more chemicals change over time and move around in a space.

This involves chemical reactions, where the chemicals can turn into other chemicals, and diffusion, which is the process that makes these chemicals spread across an area.
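For readers who like to see the rules written out, here is a minimal sketch of the Gray-Scott model, one of the best-known reaction-diffusion systems. It is not taken from Karl Sims’ tutorial; the feed rate f, kill rate k, and diffusion rates are just typical illustrative values.

import numpy as np

def laplacian(Z):
    """Discrete Laplacian with periodic boundaries (how a chemical 'spreads')."""
    return (np.roll(Z, 1, 0) + np.roll(Z, -1, 0) +
            np.roll(Z, 1, 1) + np.roll(Z, -1, 1) - 4 * Z)

def gray_scott(n=200, steps=5000, Du=0.16, Dv=0.08, f=0.035, k=0.065):
    """Simulate two 'paints' U and V reacting (U + 2V -> 3V) while diffusing."""
    U = np.ones((n, n))
    V = np.zeros((n, n))
    # Seed a small square of V in the middle so patterns can start growing.
    V[n//2-10:n//2+10, n//2-10:n//2+10] = 1.0
    for _ in range(steps):
        uvv = U * V * V                              # the "reaction" part
        U += Du * laplacian(U) - uvv + f * (1 - U)   # diffusion + feed
        V += Dv * laplacian(V) + uvv - (f + k) * V   # diffusion + kill
    return V  # the resulting values form the spots and stripes you can map to an image

pattern = gray_scott()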

Diffusion Models in AI and ML

In the realm of Generative AI, diffusion models work in a somewhat analogous way by modeling the gradual process of adding noise to data and then learning to reverse this process.

However, it is essential not to confuse the patterns generated by Reaction-Diffusion with the noise that is added to images in ML systems!

Diffusion models take a signal (such as an image) and gradually add noise until the original signal is wholly obscured. Again, the noise added does not need to have a specific shape (as opposed to the beautiful patterns generated by reaction-diffusion!).

This is conceptually similar to the physical process of a substance diffusing through a medium.
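To make "gradually add noise" concrete, here is a minimal sketch of the forward (noising) step as used in DDPM-style diffusion models. The 1,000 steps and the linear beta schedule are common defaults I am assuming here, not values prescribed by this article.

import torch

T = 1000                                    # number of noising steps (a typical choice)
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (a common default)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Jump straight to step t of the forward process: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t]
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    return x_t, noise

# An image tensor (batch, channels, height, width); at t close to T-1 it is almost pure noise.
image = torch.rand(1, 3, 64, 64)
noisy, eps = add_noise(image, t=999)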

The training of diffusion models involves learning the reverse process: starting from a noisy signal and progressively removing the noise to recover the original signal.

This procedure is reminiscent of reversing the diffusion process in physics, where the particles’ positions are traced back from a state of equilibrium (completely diffused) to their original state (concentrated).

Hence, the name "diffusion models" captures the essence of this reverse process from noise to a clean, structured signal in the generative models of machine learning.
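And here is a heavily simplified sketch of that reverse process: start from pure noise and repeatedly ask a trained noise predictor to remove a little of it. It reuses T, betas, and alphas_cumprod from the previous snippet, and model is a placeholder for a trained denoising network (such as the UNET discussed below); the update rule follows the standard DDPM sampling formula.

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64)):
    """Walk the diffusion backwards: pure noise -> slightly cleaner -> ... -> image."""
    x = torch.randn(shape)                        # start fully "diffused"
    for t in reversed(range(T)):
        beta, a_bar = betas[t], alphas_cumprod[t]
        eps_hat = model(x, t)                     # placeholder: predicts the noise in x
        # Remove the predicted noise (DDPM mean update)...
        x = (x - beta / (1.0 - a_bar).sqrt() * eps_hat) / (1.0 - beta).sqrt()
        if t > 0:                                 # ...then re-inject a bit of fresh noise
            x = x + beta.sqrt() * torch.randn_like(x)
    return x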

I have seen many diagrams explaining diffusion models, but I either got lost in them, or they were too oversimplified to explain the actual process.

I also wanted to develop a custom diagram for designers: people who understand pseudocode and flowcharts well but do not necessarily know the math that lies beneath (with exceptions, of course):

Diagram showing the diffusion process – drawn by the author

This diagram is neither super detailed nor fully watered down. I am hoping the complexity is "just right" for more people to understand the diffusion process.

In this diagram, I feel the urge to explain the UNET and the subsequent "ADD NOISE" nodes, as they constitute the "magic" in the diffusion process.

Let’s start –

UNET

The UNET is a convolutional network architecture for fast and precise segmentation of images.

Turns out it was first developed for biomedical image segmentation.

The name UNET is derived from its U-shaped architecture!

Some visual communicators were in play in the derivation of the name "UNET", I would guess!

Before I give a more democratized description of UNET (one that I can also understand!), here is a more line-by-line technical description that I found on Medium (well done!):

UNet Line by Line Explanation

UNET became popular in other applications, such as Geographic Information Systems (GIS). The technique of segmentation, in this respect, can aid in delineating shorelines from aerial or satellite data, allowing for precise coastal mapping. Similarly, it can be employed to identify and extract the outlines of large-scale structures, such as skyscrapers or industrial complexes, from high-resolution imagery.

So UNET’s use slowly spread to various image-to-image translation tasks beyond its original application, including image denoising, super-resolution, and, finally, generative models such as Stable Diffusion.

In the training of models for Stable Diffusion, which includes tasks like generating images from textual descriptions (text-to-image) or transforming one image into another (image-to-image) with certain stylistic changes, the UNET plays a crucial role. Here’s a simplified explanation of its use:

1. Encoder-Decoder Structure

Think of the UNET as a machine with two main parts. The first part is the "encoder," which acts like a compact camera, zooming in on the most essential parts of an image and breaking it into smaller pieces – so it’s easier to work with. The second part is the "decoder," which works like an artist who takes that compacted version and redraws the whole picture with all the details.

2. Skip Connections

Typically, when you make something smaller (like in the encoder), you lose some details (obviously!). But the UNET has a clever trick: it uses "skip connections," shortcuts that allow it to remember and bring back the details that might have been lost when the image was made smaller. This helps the "artist" part ensure the final picture is as detailed as the original.
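To make the encoder, the decoder, and the skip connections tangible, here is a toy U-Net in PyTorch. It is a bare-bones sketch of the idea, far smaller than the networks used in real diffusion models (which also take the timestep and a text prompt as inputs):

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A toy U-Net: the encoder compresses, the decoder redraws, the skips restore detail."""
    def __init__(self, ch=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(ch, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                        # "zooming in" / compressing
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec1 = nn.Sequential(nn.Conv2d(64 + 32, 32, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(32, ch, 3, padding=1)

    def forward(self, x):
        s1 = self.enc1(x)                # full-resolution features (kept for the skip)
        s2 = self.enc2(self.down(s1))    # compressed "essence" of the image
        u = self.up(s2)                  # decoder starts redrawing at full resolution
        u = torch.cat([u, s1], dim=1)    # skip connection: bring back the lost detail
        return self.out(self.dec1(u))

x = torch.rand(1, 3, 64, 64)
print(TinyUNet()(x).shape)   # torch.Size([1, 3, 64, 64])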

3. Latent Space Manipulation

Imagine the compact version of the image in the encoder as a recipe for baking a cake, which we call the "latent space." This recipe includes all the ingredients and steps needed to bake the cake a certain way. In Stable Diffusion, you can change this recipe to alter the final cake. Want to turn a vanilla cake into a chocolate one? You adjust the recipe – this is like tweaking the "latent space." You’re not starting from scratch; you’re making minor changes to the ingredients (or image characteristics) to end up with a different but related result (and this is also a useful hint for how custom training works). If the basic recipe is the sketch, the added flavors and decorations are the colors and textures in a painting.
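In code, "tweaking the recipe" can be as simple as nudging one latent toward another before decoding. The encoder and decoder below are placeholders for whatever model provides the latent space (in Stable Diffusion, a separate autoencoder handles this):

def tweak_recipe(encoder, decoder, image_a, image_b, amount=0.3):
    """Blend image A's 'recipe' (latent) with a pinch of image B's and redraw it."""
    z_a = encoder(image_a)                      # compact recipe for image A
    z_b = encoder(image_b)                      # compact recipe for image B
    z_mix = (1 - amount) * z_a + amount * z_b   # small change to the ingredients
    return decoder(z_mix)                       # a different but related result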

4. Training with Noise

Think of driving in a car during a heavy rainstorm. At first, the rain is pelting down your windshield, making it hard to see where you’re going. But your windshield wipers swipe away the water, giving you brief moments of clarity. Training a UNET with noise is like improving the quality of your windshield wipers. Initially, they might be flawed, and your vision is still blurry with each swipe. But as you "train" your wipers – adjusting their speed and ensuring they contact the glass correctly – they get better at clearing the rain from your view. In the same way, the U-Net starts by trying to see through the "rain" of noise on images. Over time, it gets better at swiping away the noise until it can consistently provide a clear "view" of the underlying image, just as effective wipers eventually give you a clear view of the road ahead – potentially not the best analogy, but you get the point…

If you are not in favor of such analogies, here is a more direct description:

ADDING AND REMOVING NOISE

· Begin with a clean input image and introduce noise, transforming it into a noisy image.

· Feed the noisy image into the UNET, a denoising autoencoder, during training.

· Train the UNET to predict and identify the specific noise added to the image.

· Through iterative training, improve the UNET’s ability to predict the noise accurately.

· As the UNET’s proficiency grows, it becomes more skilled at generating clean images from noisy data.

· Utilize the UNET’s noise handling and manipulation capabilities to enhance the diffusion process, which systematically adds and removes noise to craft new images (a simplified training loop is sketched below).
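Putting those bullet points into code, here is a minimal sketch of the training loop, assuming the add_noise helper and noise schedule from the earlier snippet and a small noise-predicting network like the toy U-Net above. A real diffusion UNET also receives the timestep and text conditioning, which I omit here to keep the sketch short.

import torch
import torch.nn.functional as F

model = TinyUNet()                                   # stand-in for the real denoiser
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(clean_images):
    """One step: noise a clean image, then ask the network to predict that exact noise."""
    t = torch.randint(0, T, (1,)).item()             # pick a random noising step
    noisy, true_noise = add_noise(clean_images, t)   # forward diffusion (from earlier)
    predicted_noise = model(noisy)                   # UNET guesses the added noise (real models also get t)
    loss = F.mse_loss(predicted_noise, true_noise)   # how wrong was the guess?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with a dummy batch of "clean" images:
loss = training_step(torch.rand(4, 3, 64, 64))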

This gradual diffusion process is what produces the high-quality synthetic images that have taken over the internet and our daily lives – and the impressive results of models like Midjourney, Stable Diffusion, DALL-E, and others in the domain of AI-driven image generation are undeniable.


There you have it. After reading this, I hope you have a better grasp of diffusion models.


Get to know the author

Want to learn more or get in touch? Here are my LinkedIn and YouTube accounts. I share vibrant imagery and news over my Instagram account, as well.


References

How U-net works? | ArcGIS API for Python

Convolutional Neural Networks, Explained

