A Synthetic Image Dataset Generated with Stable Diffusion

Testing the capabilities of Stable Diffusion with Wordnet

Andreas Stöckl
Towards Data Science


Image created by the author with Stable Diffusion

Current models for the generation of synthetic images from text input are not only able to generate very realistic-looking photos but also to handle a large number of different objects. In the paper “Evaluating a Synthetic Image Dataset Generated with Stable Diffusion”, we use the “Stable Diffusion” model to investigate which objects and types are represented so realistically that a subsequent image classification correctly assigns them. This gives us an assessment of the model in terms of realistic representation.

The photo above uses the example of a soccer ball to show that not only are very realistic photos generated, but also that, starting from the exact same text prompt, very different representations of the object are created.

Generation of the Data

As a basis for image generation, we use the “Stable Diffusion” 1.4 model with the implementation from the Hugging Face Diffusers library. This model allows the creation and modification of images based on text prompts. It is a latent diffusion model trained on a subset (LAION-Aesthetics) of the LAION-5B text-to-image dataset.
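With the Diffusers library, generating images for a prompt takes only a few lines. The following is a minimal sketch based on the standard Diffusers API; the model id, device handling, and output handling are assumptions, not the authors' exact script.

```python
# Sketch of image generation with the Hugging Face Diffusers library.
# The heavy imports happen inside the function so the sketch can be
# read (and loaded) without the libraries installed.

def generate_images(prompt, n=10, model_id="CompVis/stable-diffusion-v1-4"):
    """Generate n images for one text prompt with Stable Diffusion 1.4."""
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    return [pipe(prompt).images[0] for _ in range(n)]

# Example (downloads the model weights on first use):
# images = generate_images("Haflinger horse with short legs standing in water", n=1)
# images[0].save("horse_0.png")
```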

The following figure shows an example of an image generated from the text prompt

“Haflinger horse with short legs standing in water”.

The example shows that the generator model can represent different concepts with different attributes and also combine them in one setting.

Image created by the author with Stable Diffusion

We created a dataset that contains images of a variety of different concepts. For the text input, we use the information contained in “Wordnet”. Wordnet organizes concepts into so-called “synsets”, each of which corresponds to a meaning shared by one or more words. A word with several meanings therefore belongs to several synsets. For example, the word “apple” denotes both a fruit and a computer brand, and there is a synset for each of these meanings.

Starting from the Wordnet synset ‘object.n.01’, a list of 26,204 noun synsets was created by recursively collecting the “hyponyms” (words with a more specific meaning than the general or superordinate term applicable to them). For each of these nouns, we use the synset’s description in Wordnet as the prompt for image generation.
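The recursive collection of hyponyms can be sketched as follows. The traversal itself works on any object exposing a `.hyponyms()` method; with NLTK it is applied to `wn.synset("object.n.01")` (the NLTK usage is shown in comments, since it requires the WordNet corpus download).

```python
# Sketch: collecting all noun synsets below 'object.n.01' by recursing
# over WordNet's hyponym relation.

def collect_hyponyms(synset, seen=None):
    """Return the set of all transitive hyponym synsets of `synset`."""
    if seen is None:
        seen = set()
    for child in synset.hyponyms():
        if child not in seen:
            seen.add(child)
            collect_hyponyms(child, seen)
    return seen

# With NLTK (requires the 'wordnet' corpus to be downloaded first):
# from nltk.corpus import wordnet as wn
# synsets = collect_hyponyms(wn.synset("object.n.01"))   # ~26,000 noun synsets
# prompts = {s.name(): s.definition() for s in synsets}  # definitions as prompts
```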

An example of such a prompt (the synset for dogs) is:

“a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds”

For each synset, 10 images were generated and stored under the name of the synset with the number appended. This results in a total number of 262,040 images for our data set.

Together with the 10 images per synset, a text file is saved that contains the prompt used, the name of the synset (e.g. “dog.n.01”), and the Wordnet ID (e.g. “n12345678”). The dataset can be downloaded from Kaggle.
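The on-disk layout described above can be sketched as below. The function name and the exact file format are illustrative assumptions; only the naming scheme (synset name plus appended number, and a sidecar text file with prompt, synset name, and Wordnet ID) comes from the description.

```python
from pathlib import Path

def save_record(out_dir, synset_name, wn_id, prompt, images):
    """Save images as <synset>_<i>.png plus a metadata text file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, img in enumerate(images):
        # images are PIL.Image objects returned by the diffusion pipeline
        img.save(out / f"{synset_name}_{i}.png")
    # one sidecar file per synset: prompt, synset name, Wordnet ID
    (out / f"{synset_name}.txt").write_text(f"{prompt}\n{synset_name}\n{wn_id}\n")
```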

License: Creative Commons — https://creativecommons.org/licenses/by/4.0/ Citation: https://arxiv.org/abs/2211.01777

Evaluation of the Data

To perform systematic evaluations on a subset of our dataset, we use the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset.
We use the PyTorch implementation of the Vision Transformer model, which achieves a top-1 accuracy of 88.55% and a top-5 accuracy of 98.69% on the ImageNet data, to verify that the generated images can be correctly classified.
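The per-image check reduces to a top-k test on the classifier's scores. A minimal sketch (the function name and list-based scores are illustrative; in practice the scores are the Vision Transformer's logits):

```python
def topk_correct(scores, true_idx, k=5):
    """Return True if the true class is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_idx in ranked[:k]

# With torchvision, the pretrained model would be loaded along the lines of:
# from torchvision.models import vit_h_14, ViT_H_14_Weights
# model = vit_h_14(weights=ViT_H_14_Weights.IMAGENET1K_SWAG_E2E_V1).eval()
```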

A review of all 8610 images from the considered subset yields an average correct classification of 4.16 images per class (maximum 10) with an average standard deviation of 3.74 across all classes. The histogram below shows the large spread in the number of correct classifications. The black images produced by the NSFW filter are part of the statistics.
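The statistics above boil down to the mean and spread of the per-class counts of correctly classified images. A small sketch with the standard library (the function name and dict layout are assumptions):

```python
from statistics import mean, stdev

def per_class_stats(correct_counts):
    """correct_counts maps class name -> correctly classified images (0..10).
    Returns the mean and the standard deviation over all classes."""
    vals = list(correct_counts.values())
    return mean(vals), stdev(vals)
```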

Image by the author

It can be seen that while at least one correctly recognized image was generated for a majority of the classes (73%), all 10 images were recognized for only 14% of the classes. This also reflects the observation made at the beginning of the article that the generated images of a class differ greatly. This complicates the task for the classification procedure.

Let us now consider the recognition rate of some groups of objects. Using the Wordnet hierarchy, we grouped the associated classes for several groups of terms and determined the average recognition rate for each group. The following table shows the results.
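The grouping step can be sketched as a simple aggregation (mock data structures; the real grouping follows the Wordnet hypernym hierarchy):

```python
def group_rates(rates, groups):
    """Average recognition rate per group.
    rates: class name -> recognition rate (0..1)
    groups: group name -> list of class names in that group"""
    return {g: sum(rates[c] for c in cs) / len(cs) for g, cs in groups.items()}
```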

Recognition rates for different object classes

The good recognition rates for buildings are remarkable. The figure below shows the images for “Greenhouse”, all 10 of which were correctly recognized.

”Greenhouse” — Image created by the author with Stable Diffusion

The class “animal” shows below-average classification rates. If we look at this group a bit more closely, we see that for 162 animal classes not a single image was recognized. Looking at specific examples, such as the following for the terms “black-footed ferret” and “leafhopper”, Stable Diffusion obviously has significant deficiencies in depicting animals.

”Black-footed ferret” — Image created by the author with Stable Diffusion
”Leafhopper” — Image created by the author with Stable Diffusion

To create a “map” of the terms showing which of the images generated by Stable Diffusion are correctly recognized by the Vision Transformer model, and how good the recognition rate is in each case, we place the terms by semantic meaning in 2D and color them by subgroups. The size of a circle indicates the number of correctly classified images. To determine the positions on this map, we use word embeddings for the names of the classes.
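The 2D placement can be sketched as a projection of the class-name embeddings down to two dimensions. PCA is used here as one plausible choice; the article does not specify the exact projection method, and the embeddings are assumed to be precomputed.

```python
import numpy as np

def project_2d(embeddings):
    """Project an (n, d) embedding matrix to (n, 2) via PCA (SVD-based)."""
    X = embeddings - embeddings.mean(axis=0)     # center the embeddings
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # coordinates along the top 2 components
```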

“Map” of classification rates — Image by the author

Here, too, the many small red dots of animal classes that were not correctly recognized are noticeable.

Similar Projects

One project that provides access to synthetic image data generated with Stable Diffusion is “Lexica”. It is a search engine that returns results for a term from over 10 million images. However, the entire database here cannot be downloaded, and there is no categorization.

Lexica — Screenshot by the author

A large database with 2 million images, which can also be downloaded and used as open source, is offered by the “DiffusionDB” project.

In addition to the images, the “DiffusionDB” dataset also contains the text prompt used to generate each image. The data collection was created by the authors by crawling Stable Diffusion’s Discord server and extracting the images including the prompt.


University of Applied Sciences Upper Austria / School of Informatics, Communications and Media http://www.stoeckl.ai/profil/