Image composition with pre-trained diffusion models

A technique to increase control over the images generated by pre-trained text-to-image diffusion models

Gabriele Sgroi, PhD
Towards Data Science


An image generated with Stable Diffusion using the method described in the post. Image by the author.

Text-to-image diffusion models have achieved stunning performance in generating photorealistic images adhering to natural language description prompts. The release of open-source pre-trained models, such as Stable Diffusion, has contributed to the democratization of these techniques. Pre-trained diffusion models allow anyone to create amazing images without the need for a huge amount of computing power or a long training process.

Despite the level of control offered by text-guided image generation, obtaining an image with a predetermined composition is often tricky, even with extensive prompting. In fact, standard text-to-image diffusion models offer little control over the various elements that will be depicted in the generated image.

In this post, I will explain a recent technique, based on the paper MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation, that makes it possible to obtain greater control when placing elements in an image generated by a text-guided diffusion model. The method presented in the paper is more general and allows for other applications, such as generating panoramic images, but I will focus here on image composition with region-based text prompts. The main advantage of this method is that it can be used with out-of-the-box pre-trained diffusion models without the need for expensive retraining or fine-tuning.

To complement this post with code, I have prepared a simple Colab notebook and a GitHub repository with the code implementation I used to generate the images in this post. The code is based on the Stable Diffusion pipeline contained in the diffusers library by Hugging Face, but it implements only the parts necessary for this method, making it simpler and easier to read.

Diffusion models

In this section, I will recall some basic facts about diffusion models. Diffusion models are generative models that generate new data by inverting a diffusion process that maps the data distribution to an isotropic Gaussian distribution. More specifically, given an image, the diffusion process consists of a series of steps each adding a small amount of Gaussian noise to that image. In the limit of an infinite number of steps, the noised image will be indistinguishable from pure noise sampled from an isotropic Gaussian distribution.
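As a rough sketch, a single jump from a clean image to its noised version at step t can be written in closed form as follows (the function and variable names here are purely illustrative, following the usual DDPM formulation):

```python
import torch

def add_noise(x0, t, betas):
    """Noise a clean image x0 directly to diffusion step t (DDPM closed form)."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention
    noise = torch.randn_like(x0)                     # isotropic Gaussian noise
    # The larger t is, the closer the result gets to pure Gaussian noise.
    return torch.sqrt(alphas_bar[t]) * x0 + torch.sqrt(1.0 - alphas_bar[t]) * noise
```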

The goal of the diffusion model is to invert this process by trying to guess the noised image at step t-1 in the diffusion process given the noised image at step t. This can be done, for instance, by training a neural network to predict the noise added at that step and subtracting it from the noised image.

Once we have trained such a model, we can generate new images by sampling noise from an isotropic Gaussian distribution and using the model to invert the diffusion process, gradually removing the noise.
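Putting the two ingredients together, the generation loop looks roughly like this minimal sketch, where `model` is a hypothetical network that predicts the noise added at a given step and `betas` is the same illustrative noise schedule as above:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    """Generate an image by starting from pure noise and inverting the diffusion."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                            # start from isotropic Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x, t)                             # predicted noise at step t
        coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])  # estimated mean of x at step t-1
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # stochastic part of the step
    return x
```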

The goal of the diffusion model is to learn the reverse conditional probability p(x(t-1)|x(t)) for all time steps t. Image from the paper: Denoising Diffusion Probabilistic Models.

Text-to-image diffusion models invert the diffusion process while trying to reach an image that corresponds to the description in a text prompt. This is usually done by a neural network that, at each step t, predicts the noised image at step t-1 conditioned not only on the noised image at step t but also on a text prompt describing the image it is trying to reconstruct.
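In code, assuming a `tokenizer`, a `text_encoder`, and a U-Net `unet` loaded from a pre-trained checkpoint (for instance through the diffusers library), the text-conditioned noise prediction can be sketched as follows:

```python
import torch

@torch.no_grad()
def predict_noise(noisy_sample, t, prompt, tokenizer, text_encoder, unet):
    """Predict the noise at step t conditioned on the noisy sample and a text prompt."""
    tokens = tokenizer(
        prompt, padding="max_length",
        max_length=tokenizer.model_max_length, return_tensors="pt",
    )
    text_embeddings = text_encoder(tokens.input_ids)[0]   # embedding of the prompt
    # The U-Net receives the noisy sample, the timestep, and the text embedding.
    return unet(noisy_sample, t, encoder_hidden_states=text_embeddings).sample
```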

Many image diffusion models, including Stable Diffusion, don’t operate in the original image space but rather in a smaller, learned latent space. In this way, it is possible to reduce the required computational resources with minimal quality loss. The latent space is usually learned through a variational autoencoder. The diffusion process in latent space works exactly as before, making it possible to generate new latent vectors from Gaussian noise. From these, a newly generated image is obtained using the decoder of the variational autoencoder.
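For instance, with the variational autoencoder of a Stable Diffusion checkpoint loaded through diffusers (the model id below is an assumption, and device handling is omitted), decoding the final latents looks like this:

```python
import torch
from diffusers import AutoencoderKL

# VAE of Stable Diffusion 2 (checkpoint name assumed available on the Hugging Face Hub).
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-2-base", subfolder="vae")

@torch.no_grad()
def latents_to_image(latents):
    """Map denoised latent vectors back to image space with the VAE decoder."""
    latents = latents / vae.config.scaling_factor   # undo the scaling applied to latents
    image = vae.decode(latents).sample              # pixel values roughly in [-1, 1]
    return (image / 2 + 0.5).clamp(0, 1)            # rescale to [0, 1] for viewing
```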

Image composition with MultiDiffusion

Let us now turn to how controllable image composition can be obtained with the MultiDiffusion method. The goal is to gain better control over the elements generated in an image by a pre-trained text-to-image diffusion model. More specifically, given a general description of the image (e.g. a living room, as in the cover image), we want a series of elements, specified through text prompts, to appear at specific locations (e.g. a red couch in the center, a house plant on the left, and a painting on the top right). This is achieved by providing a set of text prompts describing the desired elements and a set of binary masks specifying the regions inside which those elements must be depicted. As an example, the image below contains the bounding boxes for the elements of the cover image.

Bounding boxes and prompts used to generate the cover image. Image by the author.

The core idea of MultiDiffusion for controllable image generation is to combine multiple diffusion processes, each associated with a different prompt, into a coherent, smooth image that shows the content of each prompt in a pre-determined region. The region associated with each prompt is specified through a binary mask with the same dimensions as the image: the pixels of the mask are set to 1 where the prompt has to be depicted and to 0 otherwise.
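In practice, the inputs reduce to a list of prompts and one binary mask per prompt. The sketch below builds such masks from bounding boxes; the helper name, the boxes, and the 512×512 image mapped to a 64×64 latent grid are all illustrative choices, not values taken from the paper or the repository:

```python
import torch

def box_to_mask(box, image_size=512, latent_scale=8):
    """Build a binary mask at latent resolution from a bounding box in image coordinates."""
    size = image_size // latent_scale
    mask = torch.zeros(1, 1, size, size)
    x0, y0, x1, y1 = (c // latent_scale for c in box)
    mask[:, :, y0:y1, x0:x1] = 1.0            # 1 inside the region, 0 everywhere else
    return mask

# One prompt per region, plus a global prompt covering the whole image.
prompts = [
    "a photo of a living room",               # global / background description
    "a red couch",
    "a house plant",
    "a painting hanging on a wall",
]
masks = [
    torch.ones(1, 1, 64, 64),                 # the background prompt covers everything
    box_to_mask((100, 250, 420, 470)),        # example boxes, chosen arbitrarily
    box_to_mask((10, 170, 130, 500)),
    box_to_mask((330, 30, 500, 180)),
]
```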

More specifically, let us denote by t a generic step in a diffusion process operating in latent space. Given the noisy latent vectors at timestep t, the model predicts one noise estimate for each specified text prompt. From these predicted noises, we obtain a set of latent vectors at timestep t-1 (one for each prompt) by removing each predicted noise from the shared latent vectors at timestep t. To get the input for the next step of the diffusion process, we need to combine these different vectors together. This is done by multiplying each latent vector by the corresponding prompt mask and then taking a per-pixel average weighted by the masks. Following this procedure, in the region specified by a particular mask, the latent vectors follow the trajectory of the diffusion process guided by the corresponding local prompt. Combining the latent vectors at each step, before predicting the noise for the next one, ensures global cohesion of the generated image as well as smooth transitions between the different masked regions.
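A single fused step can be sketched as follows, where `noise_preds` holds one noise prediction per prompt (obtained, for instance, as in the earlier sketch), the scheduler follows the diffusers scheduler interface, and the all-ones background mask guarantees that every pixel is covered by at least one mask:

```python
import torch

def multidiffusion_step(latents, noise_preds, masks, scheduler, t):
    """Fuse the per-prompt denoising results into the latents for step t-1."""
    fused = torch.zeros_like(latents)
    weights = torch.zeros_like(latents)
    for noise_pred, mask in zip(noise_preds, masks):
        # Latents at step t-1 along the diffusion path guided by this prompt.
        denoised = scheduler.step(noise_pred, t, latents).prev_sample
        fused += mask * denoised
        weights += mask
    # Per-pixel average weighted by the masks (regions are allowed to overlap).
    return fused / weights
```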

For better adherence to tight masks, MultiDiffusion introduces a bootstrapping phase at the beginning of the diffusion process. During these initial steps, the denoised latent vectors corresponding to the different prompts are not combined with each other; instead, outside its mask, each of them is combined with noised latent vectors corresponding to a constant-color background. Since the layout is generally determined early in the diffusion process, this lets the model focus initially on the masked region alone, which yields a better match with the specified masks.
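Sticking to the same illustrative names, the bootstrapping variant of the combination can be sketched as below, where `background_latents` is assumed to be a constant-color image encoded to latent space and noised to the current timestep:

```python
import torch

def bootstrap_latents(denoised_latents, masks, background_latents):
    """During the first steps, keep each prompt's latents only inside its own mask
    and fill the rest with the noised constant-color background."""
    return [
        mask * latents + (1.0 - mask) * background_latents
        for latents, mask in zip(denoised_latents, masks)
    ]
```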

Examples

In this section, I will show some applications of the method. I have used the pre-trained Stable Diffusion 2 model hosted by Hugging Face to create all the images in this post, including the cover image.
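For reference, here is a minimal sketch of how the pre-trained components can be loaded with diffusers; the checkpoint name refers to the base Stable Diffusion 2 model, and half precision on GPU is just one possible configuration:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pre-trained Stable Diffusion 2 weights from the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")

# The individual components used by the region-based generation procedure.
unet, vae = pipe.unet, pipe.vae
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder
scheduler = pipe.scheduler
```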

As discussed, a straightforward application of the method is to obtain an image containing elements generated in pre-defined locations.

Bounding boxes. Image by the author.
Image generated using the above bounding boxes. Image by the author.

The method makes it possible to specify the style, or some other property, of each individual element to be depicted. This can be used, for example, to obtain a sharp subject on a blurred background.

Bounding boxes for blurred background. Image by the author.
Image generated using the above bounding boxes. Image by the author.

The styles of the elements can also be very different, leading to stunning visual results. As an example, the image below is obtained by mixing a high-quality photo style with a van Gogh-style painting.

Bounding boxes with different styles. Image by the author.
Image generated using the above bounding boxes. Image by the author.
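Concretely, mixing styles only requires wording the region prompts differently; the prompts below are illustrative examples, not the exact ones used for the image above:

```python
prompts = [
    "a high quality photo of a countryside landscape",      # background rendered as a photo
    "a painting of a starry sky in the style of van Gogh",  # region rendered as a painting
]
```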

Conclusion

In this post, we have explored a method that combines different diffusion processes to improve control over the images generated by text-conditioned diffusion models. The method increases control over the locations in which the elements of the image are generated and also makes it possible to seamlessly combine elements depicted in different styles.

One of the main advantages of the described procedure is that it can be used with pre-trained text-to-image diffusion models without the need for fine-tuning, which is generally an expensive process. Another strong point is that controllable image generation is achieved through binary masks, which are simpler to specify and handle than more complex forms of conditioning.

The main drawback of this technique is that, at each diffusion step, it needs one neural network pass per prompt to predict the corresponding noise. Fortunately, these passes can be performed in batches to reduce the inference time overhead, at the cost of larger GPU memory utilization, as sketched below. Furthermore, some of the prompts (especially the ones confined to a small portion of the image) are sometimes neglected, or they cover a bigger area than the one specified by the corresponding mask. While this can be mitigated with bootstrapping steps, an excessive number of them can reduce the overall image quality quite significantly, as fewer steps remain available to harmonize the elements together.
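A sketch of the batched variant, reusing the illustrative names from the earlier snippets (with `text_embeddings` holding one encoded prompt per row):

```python
import torch

@torch.no_grad()
def predict_noises_batched(latents, t, text_embeddings, unet):
    """Predict the noise for all prompts in a single U-Net forward pass."""
    num_prompts = text_embeddings.shape[0]
    # Repeat the shared latents (batch size 1) once per prompt along the batch dimension.
    latents_batch = latents.expand(num_prompts, -1, -1, -1)
    noise_preds = unet(latents_batch, t, encoder_hidden_states=text_embeddings).sample
    return noise_preds.chunk(num_prompts)     # one prediction per prompt
```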

It is worth mentioning that the idea of combining different diffusion processes is not limited to what is described in this post; it can also be used for further applications, such as panoramic image generation, as described in the paper MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation.

I hope you liked this article. If you want to dig deeper into the technical details, you can check this Colab notebook and the GitHub repository with the code implementation.
