Paper Review: A Deep Dive into Imagen

A critical analysis of Google's impressive new text-to-image generation tool

Photo by Amanda Dalbjörn on Unsplash

Text-to-image synthesis is a research direction within the field of multimodal learning which has been the subject of many recent advancements [1–4]. This review will focus on the article, ‘Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding’ [1].

Here the authors attempt to achieve state-of-the-art photorealism and provide insights into a deeper level of language understanding within text-to-image synthesis. The main output of this paper is a model named ‘Imagen’ which improves upon previous text-to-image synthesis models in the literature [2–4].

You can see and find out more about Imagen here!

What is a diffusion model?

As the title of the paper suggests, Imagen is a diffusion model.

Very briefly, diffusion models are a class of generative AI model that take an input x⁰ and gradually add Gaussian noise at each step t until a pure-noise representation xᵀ is reached, where T is the final step.

This is inspired by non-equilibrium thermodynamics, whereby states evolve by diffusion to be homogeneous given a long enough time frame.
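
To make this concrete, here is a minimal sketch of one forward (noising) step in PyTorch, assuming a simple linear variance schedule βₜ, which the article does not specify:

```python
import torch

# Minimal sketch of the forward (noising) process with an assumed linear
# variance schedule beta_t (the schedule is not specified in this article).
T = 1000                                  # total number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)     # beta_1 ... beta_T

def forward_step(x_prev: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x^t ~ N(sqrt(1 - beta_t) * x^(t-1), beta_t * I)."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

x = torch.rand(3, 64, 64)                 # x^0: a toy 64 x 64 RGB "image"
for t in range(T):
    x = forward_step(x, t)                # x drifts towards pure Gaussian noise
```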

Diagram displaying the forward and backward diffusion processes. Image from [16].

Diffusion models learn to reverse this process in an attempt to regenerate the original x⁰ from xᵀ (where in this case x⁰ is an image). See the figure above for a visual aid.

The goal of the model is to parameterise the conditional probability describing the reverse diffusion process at each step t:

pθ(xᵗ⁻¹ | xᵗ) = N(xᵗ⁻¹; μθ(xᵗ, t), Σθ(xᵗ, t))

Equation describing the reverse diffusion process.

where xᵗ⁻¹ (the representation at the previous time step) is drawn from a Gaussian distribution characterised by a mean μθ and covariance Σθ, both parameterised by the model weights θ.
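
As a rough sketch of how this is used at sampling time, a single reverse step can be written as below; the `model` callable is a stand-in for the learned network, and in practice the covariance is often fixed to a simple σₜ²I rather than learned:

```python
import torch

def reverse_step(x_t, t, model, sigmas):
    """One reverse step: draw x^(t-1) from N(mu_theta(x^t, t), sigma_t^2 * I).

    `model` stands in for the learned network predicting the mean; real
    DDPM-style models usually predict the added noise and derive the mean.
    """
    mu = model(x_t, t)                                  # mu_theta(x^t, t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mu + sigmas[t] * noise

# Toy usage: a dummy "model" that simply returns its input unchanged.
dummy_model = lambda x, t: x
sigmas = torch.full((1000,), 0.01)
x = torch.randn(3, 64, 64)                              # start from noise x^T
for t in reversed(range(1000)):
    x = reverse_step(x, t, dummy_model, sigmas)
```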

Because the diffusion process keeps the intermediate representation in image space at every denoising step, there is a more intimate connection between the data and the prediction than in other, non-diffusion-based text-to-image generators [4–7]. The upshot is, generally, more photorealistic output from diffusion-based models [1–3].

Once a base diffusion model is used to construct a 64 × 64 pixel image, Imagen then makes use of two further super resolution diffusion models to perform the upsampling 64 × 64 → 256 × 256 → 1024 × 1024. The eventual result is therefore a high resolution 1024 × 1024 pixel image such as the one below!
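
In code, the cascade might look something like the sketch below; the three callables are illustrative stubs, not Imagen's real interface:

```python
import torch

# Placeholder callables standing in for the three diffusion models.
def base_model(text_embedding: torch.Tensor) -> torch.Tensor:
    return torch.rand(3, 64, 64)          # text -> 64 x 64 image

def super_res_256(image: torch.Tensor, text_embedding: torch.Tensor) -> torch.Tensor:
    return torch.rand(3, 256, 256)        # 64 x 64 -> 256 x 256

def super_res_1024(image: torch.Tensor, text_embedding: torch.Tensor) -> torch.Tensor:
    return torch.rand(3, 1024, 1024)      # 256 x 256 -> 1024 x 1024

text_embedding = torch.rand(77, 1024)     # e.g. a frozen-LM text embedding
img_64 = base_model(text_embedding)
img_256 = super_res_256(img_64, text_embedding)
img_1024 = super_res_1024(img_256, text_embedding)
```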

Note that this image is actually from DALL-E 2 [2] as Google has some restrictions on Imagen! The idea is the same but please make sure you check out the Imagen paper for the actual images.

An example output from DALL-E 2 with the text prompt "A teddy bear on a skateboard in times square". Image from [2].

This review will first provide a brief outline of previous work. I will then compile the main contributions and results presented by the authors, before discussing these contributions and offering my personal opinions on the work.

Previous Work

It has been possible to realise images from text for a number of years; however, early work struggled to combine multiple textual concepts realistically into a single image [5–7].

Addressing these shortcomings, OpenAI released DALL-E [4], which is able to combine multiple seemingly unrelated concepts into a single image, generated row by row given a text prompt and the start (the first rows of pixels) of an image.

Less than 12 months later, OpenAI reformulated their approach to text-to-image synthesis around diffusion models with GLIDE [3]. The authors showed that GLIDE was preferred by human evaluators for photorealism and caption similarity in a variety of settings, thereby establishing the dominance of diffusion models in text-to-image generation.

Finally, DALL-E 2 [2] further improves upon GLIDE by first predicting an image embedding from the textual prompt and then generating the image conditioned on that embedding.

Note that other advancements were also made in this time frame, however I have focussed primarily on three main contributions which form the foundation for Imagen [1].

Main Contributions

Architecture

Similar to GLIDE [3] and DALL-E 2 [2], Imagen is a diffusion model, and its architecture appears very close to GLIDE's (i.e. it takes a text embedding as input and generates images from noise). However, a key difference in Imagen is that the text embeddings come from large, off-the-shelf language models (LMs).

One of the main findings of [1] is that incorporating large frozen LMs, trained on text-only data, proves extremely useful for obtaining text representations for text-to-image synthesis.

Further to this, the authors explore scaling the text encoder and find that increasing the size of the LM improves results significantly more than increasing the size of the diffusion model. The leftmost plot in Figure 4a of [1] summarises this result, showing that the T5-XXL LM [8] achieves higher-quality images (↓ FID score) and better caption compatibility (↑ CLIP score).
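
To give a feel for what 'off-the-shelf' means in practice, the sketch below extracts frozen text embeddings with the Hugging Face transformers library; `t5-small` is used purely so the snippet runs on modest hardware, whereas the paper itself uses the far larger T5-XXL encoder:

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# "t5-small" is a small stand-in; the paper uses the far larger T5-XXL.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.eval()                            # the language model stays frozen

prompt = "A teddy bear on a skateboard in times square"
tokens = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    # Shape: (batch, sequence_length, hidden_size); these token-level
    # embeddings are what the diffusion model would be conditioned on.
    text_embeddings = encoder(**tokens).last_hidden_state
```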

The authors also incorporate a new technique to avoid saturated pixels in image generation with classifier-free guidance.

Classifier guidance was introduced to improve the quality of generated images by using a separately trained classifier whose gradients push the output at sampling time to be more faithful to the conditioning input [9].

Classifier-free guidance [10] avoids the need for this extra model: at each sampling step, the diffusion model produces two predictions from the same input, one with and one without text conditioning.

By taking the difference between these two predictions, it is possible to isolate the effect of the text on the image generation. Scaling this textual effect by a guidance weight w then pushes the generation towards better image-text alignment.
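
A minimal sketch of this combination step is shown below; note that conventions for the guidance weight w differ slightly between papers:

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             w: float) -> torch.Tensor:
    """Combine conditional and unconditional noise predictions.

    With this parameterisation, w = 1 recovers the ordinary conditional
    prediction, while w > 1 extrapolates away from the unconditional one
    (stronger guidance).
    """
    return eps_uncond + w * (eps_cond - eps_uncond)
```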

So far, none of this is new. However, one issue with this guidance is that when w is large, pixels can become saturated and image fidelity is damaged in exchange for better image-text alignment. The authors therefore introduce dynamic thresholding, whereby saturated pixels are pushed back towards [-1, 1] by a varying amount determined at each sampling step xᵗ (hence 'dynamic'). The authors claim significant improvements in photorealism and image-text alignment at high guidance weights.
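
Roughly, dynamic thresholding can be sketched as follows; the 99.5th-percentile default follows the description in [1], but treat the exact details here as an approximation:

```python
import torch

def dynamic_threshold(x0_pred: torch.Tensor, percentile: float = 0.995) -> torch.Tensor:
    """Push saturated pixel values back towards [-1, 1].

    s is a per-image percentile of the absolute pixel values; when s > 1
    the prediction is clipped to [-s, s] and rescaled by s. Applied to the
    predicted x^0 at each sampling step. Expects input of shape (B, C, H, W).
    """
    s = torch.quantile(x0_pred.abs().flatten(1), percentile, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)
    return torch.clamp(x0_pred, -s, s) / s
```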

Finally, on the architecture side, the authors propose a new variant of the U-Net [11] that is simpler and more efficient than previous iterations. From what I can tell, the key modification is the removal of the self-attention layers from the super-resolution U-Net models of [11–12].

DrawBench

Another important contribution to future research in text-to-image synthesis is the release of DrawBench.

DrawBench is a collection of 'challenging' evaluation benchmark text prompts that probe a model's capacity to handle complex concepts such as compositionality, cardinality and spatial relations.

The idea behind this release is to provide an evaluation benchmark containing some very strange text prompts, so that the requested images are unlikely to have ever existed before. In theory this should push models to the limits of their imagination and their ability to generate complex images.
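
For illustration only, prompts in the spirit of DrawBench's categories might look something like the following (these are not the actual benchmark prompts; see [1] for the real set):

```python
# Hypothetical prompts in the spirit of DrawBench's categories.
drawbench_style_prompts = {
    "cardinality": "Three cats and two dogs sitting on the grass.",
    "spatial relations": "A red cube on top of a blue sphere, to the left of a yellow cone.",
    "compositionality": "A hedgehog wearing a top hat reading a newspaper on the moon.",
}

for category, prompt in drawbench_style_prompts.items():
    print(f"{category}: {prompt}")
```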

Quantitative Results

Photo by Maxim Hopman on Unsplash

The quantitative results presented by the authors in [1] compare and contrast different models on COCO [15] and DrawBench text prompts.

The authors find that human evaluation results on DrawBench show a strong preference for Imagen when analysing pairwise comparisons with DALL-E 2 [2], GLIDE [3], Latent Diffusion [14] and CLIP-guided VQ-GAN [13] models (Figure 3 in [1]). These results are provided as a measure of caption alignment and fidelity.

Meanwhile, the results on the COCO validation set seem to not show as much of a difference between different models – which is potentially why the authors do not dwell on these for too long.

However, an interesting observation on the COCO dataset is that Imagen has a limited ability to generate photorealistic people, although the authors do not provide any qualitative examples of this weakness.

Discussion

In the introduction, the authors of [1] include the claim:

[Imagen delivers] an unprecedented degree of photorealism and a deep level of language understanding in text-to-image synthesis.

Investigating the first half of this claim, the authors present several qualitative comparisons between Imagen and DALL-E 2 generated images. They also provide results from human evaluation experiments where people were asked to choose the most photorealistic image from a single text prompt or caption.

Even before considering any results, the authors have immediately introduced a degree of subjectivity into their analysis that is inherent in human evaluation experiments. Therefore the results shown in [1] must be considered with care and a healthy level of skepticism.

An example output from DALL-E 2 with the text prompt "A high quality photo of a dog playing in a green field next to a lake". Image from [2].

To provide some context to these results, the authors select some example comparisons shown to human raters and include these in the Appendix (definitely take a look at these – for motivation, I’ve added an example from DALL-E 2 above).

However, even with these examples, I find it difficult to make a clear judgement over which image should be preferred. Considering the copied examples shown in the figure above, personally I believe that some of DALL-E 2’s generated images are more photorealistic than Imagen’s, which demonstrates the issues of subjectivity when collecting results such as these.

The authors choose to ask human raters ‘which image is more photorealistic?’ and whether each ‘caption accurately describes the image’ during the evaluation process. However the discontinuous nature of assessing these metrics is rather worrying to me.

For example, suppose two cartoon images (which are presumably not very realistic) appear in the same batch and a rater is asked to choose one. As far as the photorealism metric is concerned, the chosen image registers the same level of realism as a much more realistic (i.e. non-cartoon) image chosen from a separate batch.

Clearly there is some interplay between the caption for a batch of images and the level of photorealism that can be achieved. Therefore it would be interesting to explore weighting certain text prompts based on difficulty, in an attempt to create a more continuous metric which can be aggregated more reliably.
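
As a purely hypothetical sketch of that weighting idea (the difficulty weights and scores below are made up, not taken from [1]):

```python
from typing import List

def weighted_photorealism_score(scores: List[float],
                                difficulties: List[float]) -> float:
    """Aggregate per-prompt preference scores, up-weighting harder prompts."""
    total_weight = sum(difficulties)
    return sum(s * d for s, d in zip(scores, difficulties)) / total_weight

# Three prompts rated easy (1.0), medium (2.0) and hard (3.0); the scores
# and weights are illustrative only.
print(weighted_photorealism_score([0.9, 0.7, 0.4], [1.0, 2.0, 3.0]))  # ~0.58
```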

Similarly, in the case of caption alignment, raters choose between three categorical options (yes, somewhat, no) to indicate whether the caption is aligned with the generated image. These experimental results are intended to back up the second half of the quote above (claiming a deep level of language understanding).

It is true that for caption alignment, one can claim there is a more definitive answer as to whether the relationships and concepts in the text prompts have been captured in the image generation (i.e. less subjectivity than for photorealism).

However, I would argue once again that a more continuous metric should be used here, such as a 1–10 rating of alignment. Following from the discussion above, presumably the varying levels of difficulty across all captions would also manifest in lower caption alignment. Potentially asking raters to rate the difficulty of the caption or text prompt during evaluation would be interesting to explore and help standardise the dataset and metrics.

Photo by Mitchell Luo on Unsplash

As this line of research develops and generated images become even more impressive and creative, this method of evaluation will naturally become less reliable (of course this is a good problem to have). Therefore it would have been great to see the authors discuss the potential for asking raters more specific questions to assess the levels of creativity, compositionality, cardinality and spatial relations captured by models.

In the event that two generated images are equally impressive, asking the rater more specific questions could help distinguish model performance at this very high level.

As an example, one of the applications for text-to-image generation is to aid in generating illustrations. Therefore there is surely some justification to assess the level of creativity and variation when interpreting a text prompt.

In the examples shown earlier, DALL-E 2 [2] interprets 'glasses' in more ways than Imagen does, so could one argue that DALL-E 2 is the more creative model?

When viewing the results in this way, a major critique of the paper would be that the chosen metrics play to Imagen’s strengths too much. The best indication (metric) of a well-performing model in different applications will presumably be different depending on the application (i.e. there is no free lunch!).

Because of this, I would be interested to hear the authors' thoughts on how to rigorously evaluate these models on more than just fidelity and caption alignment.

Photo by Dragos Gontariu on Unsplash

The release of DrawBench is justified in [1] as a necessary contribution to the text-to-image research field because it provides a comprehensive set of challenging text prompt scenarios.

While I agree with most of this, based on the discussion surrounding this argument I am yet to be convinced that it is a comprehensive benchmark. Exploring DrawBench a little deeper, it includes only around 200 text prompts/captions across 11 categories, which seems quite small at first glance.

This concern is only deepened when comparing to the COCO dataset [15], which includes 330K images with 5 captions per image across a much wider variety of categories. Personally, I think it would be good for the authors to discuss their reasoning as to why they claim this is a comprehensive set.

Further to this, with the rapid advancement in text-to-image synthesis, I would argue that DrawBench is a moving target in the field. Therefore it would be nice to see the possibility of adapting or adding to these captions being discussed.

Also, since DrawBench is presented alongside Imagen, one might reasonably wonder whether there was any selectiveness in choosing the 200 prompts in order to obtain preferential results for Imagen.

Once again, comparing the difference in results between Imagen and the baseline models when evaluated on COCO [15] and DrawBench, the results for COCO seem much closer between models than those for DrawBench (where Imagen is seemingly far above all baseline models).

This could be because DrawBench is a naturally harder set of prompts that Imagen is able to handle thanks to its pre-trained LM, or it could be that DrawBench is biased towards Imagen's strengths. Indeed, the authors admit to some bias when constructing DrawBench by not including any people within the image generation prompts.

Finally, it is easy to critique research when the model (or code) is not released, especially when there is the overwhelming potential for financial gain (which the authors do not mention).

However, I believe the social and ethical reasoning behind this is one of the best contributions from the paper, and one which highlights the need for some sort of governance when releasing powerful open-source AI software.

Photo by Михаил Секацкий on Unsplash

In a broader sense, generative models naturally hold a mirror up to society which may be beneficial for social research groups or even governments, if they are given access to unfiltered versions of models.

Conclusion

To summarise, the authors have made significant contributions to the rapidly growing successes within text-to-image synthesis.

While not currently available to the public (for social and ethical reasons), the resulting model 'Imagen' incorporates novel techniques such as off-the-shelf text encoders, dynamic thresholding and a more efficient U-Net architecture for the base and super resolution models.

Personally I enjoyed reading this paper and I believe the contributions made are exciting and interesting developments in the field.

Photo by Arnold Francisca on Unsplash

However, while the results are impressive, when digging deeper it becomes apparent to me that the authors tend to oversell Imagen and DrawBench. It will therefore be interesting to observe (perhaps in a future publication, or from a select contingent of researchers given access to Imagen) a more extensive evaluation of text-to-image generation models.

References

[1] – Chitwan Saharia, et al. Photorealistic text-to-image diffusion models with deep language understanding, arXiv:2205.11487, (2022).

[2] – Aditya Ramesh, et al. Hierarchical text-conditional image generation with CLIP latents, arXiv:2204.06125, (2022).

[3] – Alex Nichol, et al. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, arXiv:2112.10741, (2021).

[4] – Aditya Ramesh, et al. Zero-shot text-to-image generation, ICML, 8821–8831, PMLR, (2021).

[5] – Han Zhang, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947–1962, (2018).

[6] – Tero Karras, et al. Analyzing and improving the image quality of StyleGAN, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8110–8119, (2020).

[7] – Mark Chen, et al. Generative pretraining from pixels, ICML, 1691–1703, PMLR, (2020).

[8] – Colin Raffel, et al. Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv:1910.10683, (2019).

[9] – Prafulla Dhariwal and Alexander Nichol, Diffusion models beat GANs on image synthesis, NeurIPS, 34, (2021).

[10] – Jonathan Ho and Tim Salimans, Classifier-free diffusion guidance, In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, (2021).

[11] – Alex Nichol and Prafulla Dhariwal, Improved denoising diffusion probabilistic models, ICML, 8162–8171, PMLR, (2021).

[12] – Chitwan Saharia, et al. Palette: Image-to-image diffusion models, arXiv:2111.05826, (2021).

[13] – Katherine Crowson, et al. VQGAN-CLIP: Open domain image generation and editing with natural language guidance, arXiv:2204.08583, (2022).

[14] – Robin Rombach, et al. High-resolution image synthesis with latent diffusion models, arXiv:2112.10752, (2021).

[15] – Tsung-Yi Lin, et al. Microsoft COCO: Common objects in context, In European Conference on Computer Vision, 740–755, Springer, (2014).

[16] – Calvin Luo, Understanding diffusion models: A unified perspective, arXiv:2208.11970, (2022).

