
MAGnet: Modern Art Generator using Deep Neural Networks

How I used CLIP, SWAGAN, and a custom genetic algorithm to create modern paintings from text descriptions

Hands-on Tutorials

Sample Output from MAGnet, Image by Author

My latest project in using AI for creative endeavors is called MAGnet. I built a custom Genetic Algorithm (GA) to drive the creation of modern paintings using a Generative Adversarial Network (GAN) from a textual description over several generations. MAGnet uses the CLIP model from OpenAI [1] and a variant of StyleGAN2-ADA [2] from Nvidia called SWAGAN [3], which uses wavelets to create images.

All of the source code for this project is available [here](https://colab.research.google.com/github/robgon-art/MAGnet/blob/main/5_MAGnet_Generate_Modern_Paintings.ipynb), a Google Colab that you can use to create your own paintings.

Prior Work

A similar system called CLIP-GLaSS was released earlier in 2021. In their paper, "Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search" [4], Federico Galatolo et al. built an image generation system that uses StyleGAN2, CLIP, and a GA called NSGA-II [5]. The authors made the system available in their Google Colab here.

Here are the top results from CLIP-GLaSS from the query "a church with a white steeple" after 50 generations.

Images generated by CLIP-GLaSS using the prompt "a church with a white steeple," Image by Author

The system appears to work fairly well, but the images are a bit small, at 256 x 256 pixels. And many of the source images appear to have come from the stock footage site Shutterstock. Notice how the GAN recreated the Shutterstock watermark in each of the images.

MAGnet Overview

Here is a brief overview of the components used in MAGnet. I’ll get into the details of each component later in the article. Note that the steps of gathering, filtering, and training a GAN with images from WikiArt are similar to my GANscapes project, which generates impressionist landscape paintings.

GANscapes: Using AI to Create New Impressionist Paintings

Below is a diagram that shows the main system components of MAGnet.

MAGnet Components, Diagram by Author

I gathered images of Modern paintings from WikiArt.org [6] and processed the images to pull out one or more square images. I then used the CLIP model to filter the images to keep 10,000 that most closely matched the term "modern painting."

I used these images to train SWAGAN, which has a generator and discriminator network. The generator creates new images starting with random "latent" vectors for form and style and tries to fool the discriminator into thinking the output images are real. Before the real and the generated images are fed into the discriminator, they are modified slightly with the Adaptive Discriminator Augmentation (ADA) module that creates visual diversity in the pictures.

I used CLIP again as part of my GA to steer SWAGAN to create an image over a specified number of generations that best matches a text description supplied by the user. As the final step, I post-processed the final image to perform a mild contrast adjustment and display the result full-sized.

Check out the image gallery in the appendix at the end of the article to see more results from MAGnet.


MAGnet System Details

This section will discuss the processes and systems I used to build MAGnet in more detail.

Gathering Images

I started by scraping modern paintings from WikiArt.org using a custom Python script in the Colab here. The script goes through each artist on the site alphabetically and checks whether the artist is tagged as part of the "modern" art movement, was born after 1800, and died before 1950. It then loops through each qualifying artist's paintings, looking for ones that are available in the public domain, and downloads each qualified image using the artist and painting name as the filename, i.e., henri-matisse_still-life-with-fruit-1896.jpg. The system found about 4,000 paintings that matched these criteria. Note that I use Python's ThreadPoolExecutor to gather the images with 8 parallel threads to speed up the process.
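The actual scraping script is in the linked Colab; the sketch below shows the gist of the filter and the threaded download. The artist-record fields (`movements`, `birth`, `death`) and the entry tuples are placeholders for whatever the scraper parses from the site, not WikiArt's real API.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def qualifies(artist):
    """Keep artists tagged with the 'modern' movement who were born
    after 1800 and died before 1950 (field names are hypothetical)."""
    return ("modern" in artist.get("movements", [])
            and artist.get("birth", 0) > 1800
            and artist.get("death", 9999) < 1950)

def download_painting(entry):
    """Download one painting, named artist_title.jpg as in the article."""
    artist, title, url = entry
    filename = f"{artist}_{title}.jpg"
    urlretrieve(url, filename)
    return filename

def download_all(entries, workers=8):
    # 8 parallel threads speed up the I/O-bound downloads
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download_painting, entries))
```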

Here is a sample of images that met the criteria.

Sample of Modern Paintings from WikiArt.org, Images by Vilmos Aba and Vincenzo Abbati

Note that the term "modern" is fairly broad, so the script gathered both abstract and representational paintings.

Pulling Square Images for Training

GANs work efficiently with square images with pixel sizes that are powers of two, i.e., 256 x 256, 512 x 512, 1024 x 1024, etc. For MAGnet, I decided to use images with a size of 512 x 512. Because the source paintings have various aspect ratios, I chose to extract three cutouts from each image.

For example, Henri Matisse’s painting Still Life with Fruit has a landscape orientation. Here are the three cutouts showing the left, center, and right portions of the painting.

Still Life with Fruit Cropped Three Ways by Henri Matisse, Source Image from Wikiart.org

The second example is Pablo Picasso’s Harlequin. The portrait is cropped to get the top, center, and bottom portions.

Harlequin Cropped Three Ways by Pablo Picasso, Source Image from Wikiart.org

The top-cropped version of Harlequin seems to be the best because the face is intact. However, it’s OK to use all three croppings when training the GAN as I will be using CLIP later to filter and direct the generation of "good" paintings.
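The cropping logic can be sketched as a small helper that computes the three crop boxes. This is an illustrative reconstruction, not the Colab's exact code; after scaling the painting so its short side is 512 pixels, each box can be passed to `PIL.Image.crop`.

```python
def square_crop_boxes(width, height, size=512):
    """Return three (left, top, right, bottom) boxes covering the
    left/center/right of a landscape image, or the top/center/bottom
    of a portrait one, after scaling the short side to `size`."""
    scale = size / min(width, height)
    w, h = round(width * scale), round(height * scale)
    boxes = []
    for frac in (0.0, 0.5, 1.0):
        if w >= h:  # landscape: slide the square along the width
            left = round(frac * (w - size))
            boxes.append((left, 0, left + size, size))
        else:       # portrait: slide the square along the height
            top = round(frac * (h - size))
            boxes.append((0, top, size, top + size))
    return boxes
```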

The source code to process the training images is in the Colab here.

Using CLIP to Filter the Images for Training

After cropping the images, I ended up having over 12,000 paintings to work with. That’s enough to train a GAN, but not all of the paintings are good. Just because a painter is tagged on WikiArt as being "modern" doesn’t mean that all of their works are good examples of modern painting. I used CLIP to filter the images as I did in my GANscapes project to winnow down the dataset.

OpenAI designed CLIP's two models, an image encoder and a text encoder, to perform semantic searches. They trained the two models on a dataset of images with corresponding phrases. The goal of the models is to have the encoded images match the encoded phrases.

Once trained, the image encoder converts images to embeddings, lists of 512 floating-point numbers that capture the images' general features. The text encoder converts a text phrase to a similar embedding that can be compared to image embeddings for a semantic search.

For MAGnet, I compare the embedding from the phrase "modern painting" to the embeddings from the paintings to find the top 10,000 images that best match the phrase. The source code is in the Colab here.
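Assuming the image and text embeddings have already been computed with CLIP's encoders, the ranking step itself is just a cosine-similarity sort. A minimal sketch of that step (the embedding arrays here are synthetic stand-ins):

```python
import numpy as np

def rank_by_similarity(image_embeddings, text_embedding, keep):
    """Rank images by cosine similarity between their CLIP embeddings
    and a text embedding; return indices of the `keep` best matches."""
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embedding / np.linalg.norm(text_embedding)
    scores = imgs @ txt              # cosine similarity per image
    return np.argsort(-scores)[:keep]  # descending order
```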

Here are some examples that didn’t make the cut as modern paintings. The first two probably scored lower because they don’t have a lot of contrast. The third seems more like a drawing than a painting.

Ship in Harbour by Christopher Wood, Still Life by Alfred William Finch, Illustration to Katharine Brush by Rockwell Kent, Source Wikiart.org

SWAGAN

For this project, I am using a variation of StyleGAN2 called SWAGAN, created by Rinon Gal et al. from Tel-Aviv University. The new model is described in their paper, "SWAGAN: A Style-based WAvelet-driven Generative Model" [3]. Here's a description of their approach.

… we present a novel general-purpose Style and WAvelet based GAN (SWAGAN) that implements progressive generation in the frequency domain. SWAGAN incorporates wavelets throughout its generator and discriminator architectures, enforcing a frequency-aware latent representation at every step of the way. This approach yields enhancements in the visual quality of the generated images, and considerably increases computational performance. – Rinon Gal, et al.

A wavelet is a wave-like oscillation with an amplitude that begins at zero, increases, and then decreases back to zero. The study of wavelets is based on the work of Alfréd Haar, a Hungarian mathematician. The paper’s authors show that by adopting a GAN architecture to work directly in a wavelet-based space, they achieve improved visual quality and more realistic content in the high-frequency range [3].
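As a toy illustration of the frequency split that wavelets provide: a one-level 1D Haar transform separates a signal into pairwise averages (low frequency) and pairwise differences (high frequency), and is exactly invertible. SWAGAN itself applies 2D wavelet decompositions to feature maps; this sketch only shows the underlying idea.

```python
import numpy as np

def haar_step(signal):
    """One level of the Haar wavelet transform: pairwise averages
    (coarse approximation) and pairwise differences (fine detail)."""
    s = np.asarray(signal, dtype=float)
    low = (s[0::2] + s[1::2]) / 2.0
    high = (s[0::2] - s[1::2]) / 2.0
    return low, high

def haar_inverse(low, high):
    """Reconstruct the original signal exactly from the two bands."""
    out = np.empty(2 * len(low))
    out[0::2] = low + high
    out[1::2] = low - high
    return out
```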

Here is a comparison between the outputs of StyleGAN2 and SWAGAN from the paper.

Comparison of StyleGAN2 and SWAGAN, from the SWAGAN paper by Rinon Gal, et al.

The authors note that the interpolated StyleGAN2 images show considerable blur around high-frequency regions, such as the hair, while the SWAGAN images do not.

Training SWAGAN

I trained the system for two weeks using this Google Colab. Here is the command to run the training.

python /content/stylegan2-pytorch/train.py --data_dir /content/drive/MyDrive/modern_results/ \
  --augment --arch swagan --size 512 /content/drive/MyDrive/modern_art_processed_512/

Note that the architecture is set to SWAGAN (the other choice is StyleGAN2). You can see some unfiltered results here.

Sample results from SWAGAN trained on modern paintings, Image by Author

You can see that there is an interesting mix of abstract and representational images. The styles of the representational images seem to range from Impressionism to Cubism. And there is a nice mix of color palettes.

GAN Implementation Notes

The version of SWAGAN that I’m using was written by independent developer Seonghyeon Kim, known on GitHub as rosinality. He has an implementation of both StyleGAN2 and SWAGAN here.

Most of rosinality’s source code is released under the MIT Open Source license, which is very permissive. However, his code includes two operations, fused_act() and upfirdn2d(), written as low-level CUDA functions. These functions were written by Nvidia and are released under their non-commercial open source license, which means they can only be used for research or evaluation purposes.

After searching on GitHub, I found copies of the fused_act() and upfirdn2d() functions that were reimplemented and released under the MIT Open Source license by Or Patashnik et al. for their StyleCLIP project [7]. Their functions work just fine with rosinality’s implementation of SWAGAN. You can find the merged code in my GitHub project here. I am releasing the source code under the permissive CC BY-SA 4.0 License. And yes, you can use my source code for commercial purposes 💲💲💲 . But please give me a shout-out if you use it. 📣

Genetic Algorithms

In computer science, a Genetic Algorithm (GA) is a metaheuristic inspired by the process of natural selection that relies on biologically inspired operators such as mutation, crossover, and selection. GAs are commonly used to generate high-quality solutions to optimization and search problems [8]. You can read more about GAs in Vijini Mallawaarachchi’s article on Medium here.

I created a custom GA that uses CLIP to steer SWAGAN towards producing images that match a text description given by the user. Here is an overview of the algorithm.

1. Generate the initial population
2. For each generation:
   - Selection – select the top 4 matching images
   - Crossover – cross each of the images with the others
   - Mutation – add a little noise to the latent vectors

Here is the Pythonish pseudocode for the GA:

# get the features from the text query using CLIP
text_query = 'an abstract painting with orange triangles'
query_features = clip.encode_text(text_query)

# get an initial population of latent vectors, e.g. 200
latents = random(num_initial_samples)

# run the genetic algorithm for some number of generations, e.g. 5
for g in range(num_generations):
  # generate the images
  images = swagan.generate_images(latents)

  # get the image features using CLIP and rank them to get the top 4
  image_features = clip.encode_image(images)
  ranked_indices = get_top_N(query_features, image_features, 4)

  # start with a blank slate
  new_latents = []

  # copy the best four latents (A, B, C, D)
  for i in range(4):
    new_latents.append(latents[ranked_indices[i]])

  # loop through all 16 combinations
  for j in range(4):
    for i in range(4):
      # combine latent i with latent j
      offspring = ((1 - recombination_amount) * new_latents[i]
                   + recombination_amount * new_latents[j])

      # add a little mutation
      offspring += random() * mutation_amount

      # add it to the batch
      new_latents.append(offspring)

  # use the new latents for the next generation
  latents = new_latents
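For a self-contained illustration, here is a runnable toy version of the same loop. It swaps CLIP's image/text match score for a stand-in fitness function (negative distance to a fixed target vector) and skips image generation entirely, so the selection/crossover/mutation mechanics can be exercised without a GPU. The fitness function and dimensions are my own choices for the demo, not part of MAGnet.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_fitness(latents, target):
    """Stand-in for the CLIP match score: negative distance to a
    fixed target vector, so higher is better."""
    return -np.linalg.norm(latents - target, axis=1)

def run_ga(target, dim=16, num_initial_samples=200, num_generations=5,
           recombination_amount=0.333, mutation_amount=0.25):
    latents = rng.normal(size=(num_initial_samples, dim))
    for g in range(num_generations):
        # selection: keep the 4 best-scoring latents
        top = latents[np.argsort(-toy_fitness(latents, target))[:4]]
        new_latents = list(top)
        # crossover over all 16 ordered pairs of the top 4, plus mutation
        for j in range(4):
            for i in range(4):
                offspring = ((1 - recombination_amount) * top[i]
                             + recombination_amount * top[j])
                offspring = offspring + rng.normal(size=dim) * mutation_amount
                new_latents.append(offspring)
        latents = np.array(new_latents)
    # return the single best latent found
    return latents[np.argmax(toy_fitness(latents, target))]
```

Because the top four latents are carried over unchanged each generation, the best fitness never gets worse, and the population drifts toward the target over the five generations.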

OK, let’s see how this works with a real example.

The parameters are set like this:

text_query = 'an abstract painting with orange triangles'
num_initial_samples = 200
recombination_amount = 0.333
mutation_amount = 0.25
num_generations = 5

After generating 200 images, here are the top four that match the query.

Top 4 Images that match "an abstract painting with orange triangles," Image by Author

The images definitely show a lot of orange colors, and some triangles can be seen. Here’s what the first generation produces. The letter "x" indicates crossover, and the symbol ‘ indicates mutation.

First Generation of Images from MAGnet, Image by Author

The original four images are at the top, and the combinations with mutations are below. This system will look at all 20 images (the original 4 plus the 16 combinations) and use CLIP to determine the top 4 for the next generation.

I ran the GA for five generations. If you are curious, you can see all of the intermediate results in the Colab here. Below is the final result.

MAGnet result for "abstract painting with orange triangles," Image by Author

Not bad! You can see that the final result is mostly based on the initial top two images, but the system managed to find some additional orange triangles during the run of the genetic algorithm.

Up next, I’ll share a technique that I developed that improves the results when using GAs with GANs.

Crossover Technique for GANs

In the literature for Genetic Algorithms [8], the technique used for the crossover operation is described as follows:

  1. Choose a random "crossover point" from 0 to the length of the vector
  2. Copy the values from the first parent up to the crossover point
  3. Starting at the crossover point, copy the values from the second parent

For example, if there were 10 numbers in the vectors and the recombination amount was 40%, here’s what the offspring would look like if the A vector was set to all ones and the B vector was set to all zeros. The crossover point is indicated by the bold line.

Traditional Crossover Method, Image by Author

You can see that the first six values from A were copied into the AxB offspring, and the next four values were copied in from B. The BxA offspring has an inverse contribution from the parents.

Here’s what the Python code looks like for the traditional crossover method:

crossover_point = int(vector_length * (1 - recombination_amount))
offspring_axb = empty(vector_length)
offspring_axb[:crossover_point] = a[:crossover_point]
offspring_axb[crossover_point:] = b[crossover_point:]
offspring_bxa = empty(vector_length)
offspring_bxa[:crossover_point] = b[:crossover_point]
offspring_bxa[crossover_point:] = a[crossover_point:]
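The worked example above (A all ones, B all zeros, a 40% recombination amount) can be reproduced with NumPy:

```python
import numpy as np

def traditional_crossover(a, b, recombination_amount):
    """Single-point crossover: the first parent contributes the values
    before the crossover point, the second parent the values after it."""
    n = len(a)
    point = int(n * (1 - recombination_amount))  # parent A's share
    axb = np.concatenate([a[:point], b[point:]])
    bxa = np.concatenate([b[:point], a[point:]])
    return axb, bxa

a = np.ones(10)
b = np.zeros(10)
axb, bxa = traditional_crossover(a, b, 0.4)
# axb: six values from A, then four from B; bxa is the inverse
```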

Although this method is roughly based on the way it works in biology, I found that it is not the best technique for working with GANs. For example, below is a blend between the two best images that match the description of "an abstract painting with red squares" using the traditional crossover technique.

Traditional Crossover Method, Image by Author

The first image is the parent A, the last image is the parent B, and the three images in between are gradual blends between the two parents using the traditional crossover method. Note that it takes some odd turns going between the parents.

For MAGnet, I am using a linear interpolation between the two parent vectors for the blend. This is a simple weighted average between vectors. Here is what it looks like for the same parent images using the linear crossover method.

Linear Crossover Method, Image by Author

This seems to be a better method for crossover as all of the offspring seem to be variations of abstract paintings with red squares, while the traditional crossover method seemed to traverse through some images that were not abstract paintings.

Going back to the example where the A vector was set to all ones and the B vector was set to all zeros, the linear crossover method blends each number in the offspring by the recombination amount and its inverse.

Linear Crossover Method, Image by Author

You can see that every value in the AxB offspring is a 60/40 blend of the A and B parents, respectively, and the BxA offspring is a 40/60 blend.

Here is the code for the linear crossover method.

offspring = (1-recombination_amount) * a + recombination_amount * b
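Applied to the same ones-and-zeros example, the linear method gives uniform blends rather than a hard split:

```python
import numpy as np

# the worked example: A all ones, B all zeros, 40% recombination amount
a, b = np.ones(10), np.zeros(10)
recombination_amount = 0.4
axb = (1 - recombination_amount) * a + recombination_amount * b  # 60% A, 40% B
bxa = (1 - recombination_amount) * b + recombination_amount * a  # 40% A, 60% B
```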

Post-processing the Final Image

Similar to the post-processing of images in my GANscapes project, I bump up the contrast for the final images, like the Auto Levels feature in Photoshop. Here is an example image before and after the contrast adjustment.
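The exact adjustment is in the linked Colab; a common way to implement an Auto Levels-style stretch, sketched here as an assumption rather than the article's exact code, is to map low and high percentiles of each channel to the full 0-255 range:

```python
import numpy as np

def auto_contrast(img, clip_percent=1.0):
    """Mild auto-levels: per channel, stretch the 1st..99th percentile
    range of pixel values to cover 0..255."""
    img = img.astype(float)
    out = np.empty_like(img)
    for c in range(img.shape[2]):
        lo = np.percentile(img[..., c], clip_percent)
        hi = np.percentile(img[..., c], 100 - clip_percent)
        out[..., c] = np.clip((img[..., c] - lo) / max(hi - lo, 1e-6) * 255, 0, 255)
    return out.astype(np.uint8)
```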

Before and After Contrast Adjustment, Images by Author

You can see how the contrast adjustment "punches up" the colors. The source code for adjusting the image contrast in Python is here.

Results

Here are some results of running MAGnet with various text queries. Be sure to check out the appendix to see even more results.

MAGnet rendering of "an abstract painting with circles," Image by Author
MAGnet rendering of "a painting of a landscape," Image by Author
MAGnet rendering of "a cubist painting," Image by Author

Discussion and Future Work

The MAGnet system works fairly well. It seems to do a better job generating abstract paintings than representational paintings. It converges on a solution fairly quickly, within five generations.

Although the results of MAGnet are pretty good, if you play with it, you may see the same visual themes recur occasionally. This may be due to the limited number of source images used during the training (about 4,000 initial images, augmented to over 12,000 via cropping, and filtered to 10,000). Training with more images will probably reduce the frequency of repeated visual themes.

Another area that could be improved is the textures within the images. There seem to be some "computery" artifacts in the flat areas. This could be due to the reimplementation of the CUDA operations.

I noticed that rosinality is currently working on a new open source project on GitHub called alias-free-gan-pytorch. Sounds promising!

Source Code

All of the source code for this project is available on GitHub. The source code is released under the CC BY-SA license.

Attribution-ShareAlike

Acknowledgments

I want to thank Jennifer Lim and Oliver Strimpel for their help with this article.

References

[1] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning Transferable Visual Models From Natural Language Supervision (2021)

[2] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, "Training Generative Adversarial Networks with Limited Data," https://arxiv.org/pdf/2006.06676.pdf (2020)

[3] R. Gal, D. Cohen, A. Bermano, and D. Cohen-Or, "SWAGAN: A Style-based WAvelet-driven Generative Model," https://arxiv.org/pdf/2102.06108.pdf (2021)

[4] F. A. Galatolo, M. G. C. A. Cimino, and G. Vaglini, "Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search," https://arxiv.org/pdf/2102.01645.pdf (2021)

[5] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation (2002)

[6] WikiArt, https://www.wikiart.org, (2008–2021)

[7] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, "StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery," https://arxiv.org/pdf/2103.17249.pdf (2021)

[8] M. Mitchell, An Introduction to Genetic Algorithms, Cambridge, MA: MIT Press (1996)

Appendix – Gallery of MAGnet Images

Here is a sampling of images created by MAGnet. You can click on each image to see a larger version.

Sample Results from MAGnet, Images by Author

Here are some more examples with text queries suggested by one of my reviewers, Oliver Strimpel.

MAGnet rendering of "industrial wasteland," Image by Author
MAGnet rendering of "rolling farmland," Image by Author
MAGnet rendering of "traditional European landscape," Image by Author
MAGnet rendering of "stormy seas," Image by Author
MAGnet rendering of "painting of a placid seaside," Image by Author
MAGnet rendering of "lake in pointillist style," Image by Author
