An Intuitive Understanding of Neural Style Transfer

Eddie Huang
Towards Data Science
5 min read · May 8, 2019


Neural style transfer is a machine learning technique that merges the “content” of one image with the “style” of another

Creating a rainbow, mosaic hummingbird with neural style transfer
  • Content: High level features describing objects and their arrangement in the image. Examples of content are people, animals, and shapes
  • Style: The texture of the image. Examples of styles are roughness, color, and sharpness
Start by guessing with a white noise image

Given your content image and your style image, we want to create a third image that is a blend of the two. Let’s start off with a simple white noise image. We could start off with the content image or style image for optimization efficiency, but let’s not worry about that now.

Now imagine we have two loss functions, content loss and style loss. These functions both take in our generated image and return a loss value that represents how dissimilar the content and style of our generated image are from those of the given content image and style image, respectively.

The objective is to minimize a weighted sum of the content and style losses for our generated image

Just as with any other neural network model, we use an optimization algorithm such as stochastic gradient descent, but instead of optimizing the parameters of the model, we optimize the pixel values of our image.
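To make this concrete, here is a minimal sketch in PyTorch (the library whose tutorial is recommended at the end of this article). The content and style targets below are stand-in tensors purely to show the mechanics of optimizing pixels; the real losses, built from a CNN, are defined in the next section, and the learning rate, weights, and step count are illustrative assumptions.

```python
import torch

# Rough sketch: the generated image's pixels are the thing being optimized.
# The "targets" below are stand-ins just to show the mechanics; the real
# content and style losses come from a CNN, as defined later in the article.
target_content = torch.rand(1, 3, 128, 128)
target_style = torch.rand(1, 3, 128, 128)

generated = torch.rand(1, 3, 128, 128, requires_grad=True)  # start from white noise
optimizer = torch.optim.Adam([generated], lr=0.05)          # optimizes pixels, not weights

alpha, beta = 1.0, 1e3  # relative weights of the content and style losses (tunable)

for step in range(600):
    optimizer.zero_grad()
    content_loss = ((generated - target_content) ** 2).mean()  # stand-in content loss
    style_loss = ((generated - target_style) ** 2).mean()      # stand-in style loss
    (alpha * content_loss + beta * style_loss).backward()      # weighted-sum objective
    optimizer.step()
```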

Minimizing a weighted sum of style loss and content loss over the generated image. This is a progression of optimization across 600 steps

While style and content are not completely independent of each other, neural style transfer has shown that we can separate the two fairly well most of the time. To resolve any conflict between style and content, one can adjust the weights of the style and content losses to show more of the style or more of the content in the generated image.

This concludes our high level explanation of neural style transfer.

Acquiring the content and style loss functions

This section assumes basic knowledge of machine learning and convolutional neural networks.

We use a trained convolutional neural network (CNN) model such as VGG19 to acquire the content and style loss functions.
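As a sketch of that setup (assuming a recent version of torchvision), we load the pretrained VGG19 and keep only its convolutional feature extractor, frozen in evaluation mode, since its weights are never trained during style transfer:

```python
from torchvision import models

# Load the pretrained VGG19 and keep only its convolutional stack ("features").
# The network's weights stay frozen; they are never updated during style transfer.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)
```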

Content

Recall that content refers to the high-level features that describe objects and their arrangement in an image. An image classification model needs to learn content well in order to accurately label an image as “dog” or “car”. A convolutional neural network (CNN) is designed to extract these high-level features. A CNN is essentially a sequence of mappings from one image to another, and each successive mapping captures more of the high-level features. Therefore, we are interested in the image mappings further into the CNN because they capture more of the content.

For more detailed information on how CNNs capture high level features, I recommend this article by Google. It has a lot of great visuals

VGG19 is a traditional CNN model, which sequentially maps images to images via convolutional filters (blue). The red blocks are pooling functions that simply shrink the images to reduce the number of parameters in training. Licensed under CC BY 4.0 — https://www.researchgate.net/figure/The-figure-maps-of-VGG19-in-our-approach-Blue-ones-are-got-from-convolutional-layer-and_fig4_321232691

We define the content representation of an image as an image mapping that’s relatively further into the CNN model.

We chose the fifth image mapping as our content representation. In more technical terms, we would say that our content representation is the feature response of the fifth convolutional layer.
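As a sketch (using the frozen VGG19 feature extractor loaded above; the content_representation helper and the choice of the fifth convolutional layer are illustrative assumptions), the content representation can be extracted like this:

```python
import torch

def content_representation(image, vgg, target_conv=5):
    # Walk through VGG19's layers and return the feature response
    # ("image mapping") of the target_conv'th convolutional layer.
    conv_count = 0
    x = image
    for layer in vgg:
        x = layer(x)
        if isinstance(layer, torch.nn.Conv2d):
            conv_count += 1
            if conv_count == target_conv:
                return x
    return x
```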

We can visualize the content representation by using only the content representation to reconstruct the image. We start off by naively guessing with a white noise image and iteratively optimize it until the image’s content representation converges to that of the original image. This technique is called content reconstruction.

Content reconstructions: The first image uses the content representation from the first image mapping, the second image uses the fifth image mapping, and the third image uses the ninth mapping. With each successive image, there is more noise, but the high-level structure of the bird is still retained. This indicates that the deeper mappings keep the high-level features while discarding low-level detail.

Style

For style loss, we want to capture the textural information rather than the structural information of the image. Recall that an image contains multiple channels, and in CNN models, we usually map images to images with more channels so that we can capture more features.

Let’s use the word layer to describe a slice of an image at one particular channel. We define the style representation as the correlation between the layers in an image mapping.

More specifically, for each pair of layers within an image mapping, we take the sum of their element-wise product, which is simply the dot product between the two flattened layers. These values are collected in a Gram matrix: for a given image mapping, the element at the i’th row and j’th column of the Gram matrix contains the dot product between the flattened i’th layer and the flattened j’th layer.
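In code, this might look like the following sketch, assuming feature maps shaped as batch × channels × height × width; the normalization by the number of elements is a common convention rather than part of the definition:

```python
import torch

def gram_matrix(feature_map):
    # feature_map: (batch, channels, height, width). Each channel is one "layer".
    b, c, h, w = feature_map.size()
    flat = feature_map.view(b, c, h * w)          # flatten every layer into a vector
    gram = torch.bmm(flat, flat.transpose(1, 2))  # entry (i, j) = dot(layer_i, layer_j)
    return gram / (c * h * w)                     # normalization is a common convention
```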

We chose the first image mapping as our style representation. The style representation is the correlation between all layers within an image mapping. While we only used one image mapping here, we usually use multiple image mappings for the style representation (e.g. the first five image mappings)

Just like in content reconstruction, we can visualize the style representation by using only the style representation to reconstruct the image. We start off by naively guessing with a white noise image and iteratively optimize it until the image’s style representation converges to that of the original image. This technique is called style reconstruction.

Style reconstructions: The first image uses the style representation from the first image mapping, the second image uses the style representations from the first two image mappings, and so on. With each successive image mapping, the textures we obtain become more high level.

Now we can define the content loss and style loss functions.

Content loss is the squared error between the content representation of the output image and that of the content image

Style loss is the squared error between the style representation of the output image and that of the style image
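Translated into a code sketch (reusing the gram_matrix helper from above, and using the mean squared error, which differs from a raw sum of squares only by a constant scale):

```python
import torch.nn.functional as F

def content_loss(gen_content_rep, target_content_rep):
    # Squared error between the two content representations (feature responses)
    return F.mse_loss(gen_content_rep, target_content_rep)

def style_loss(gen_feature_maps, target_feature_maps):
    # Squared error between Gram matrices, summed over the chosen image mappings
    return sum(F.mse_loss(gram_matrix(g), gram_matrix(t))
               for g, t in zip(gen_feature_maps, target_feature_maps))
```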

High level procedure for neural style transfer

  1. Decide which image mappings of the CNN model you want for the content representations and style representations, and then calculate the content representations from the content image and style representations from the style image. For all of these images, I used the fifth image mapping for the content representation and the first five image mappings for the style representation
  2. Initialize your output image (this can be a white noise image or a copy of the style image or content image; it doesn’t really matter)
  3. Compute the content representation and style representation of your output image from the same image mappings of the same CNN model
  4. Compute the content loss and style loss
  5. Optimize your output image over a weighted sum of the content loss and style loss. This is often achieved by stochastic gradient descent
  6. Repeat steps 3–5 until you are satisfied with the results (a rough end-to-end sketch of these steps follows below)
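Below is a rough end-to-end sketch of these steps in PyTorch. It reuses the gram_matrix helper from earlier, assumes content_img and style_img are already preprocessed (1, 3, H, W) tensors, and uses illustrative hyperparameters throughout; it is a sketch of the procedure, not a polished implementation.

```python
import torch
from torchvision import models

def collect_features(image, vgg, conv_indices):
    # Gather the responses ("image mappings") of the chosen convolutional layers.
    feats, x, conv_count = [], image, 0
    for layer in vgg:
        x = layer(x)
        if isinstance(layer, torch.nn.Conv2d):
            conv_count += 1
            if conv_count in conv_indices:
                feats.append(x)
    return feats

def style_transfer(content_img, style_img, steps=600, alpha=1.0, beta=1e4):
    vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
    for p in vgg.parameters():
        p.requires_grad_(False)

    # Step 1: target representations (5th mapping for content, first five for style)
    content_target = collect_features(content_img, vgg, {5})[0].detach()
    style_targets = [gram_matrix(f).detach()
                     for f in collect_features(style_img, vgg, {1, 2, 3, 4, 5})]

    # Step 2: initialize the output image (a copy of the content image also works)
    generated = torch.rand_like(content_img).requires_grad_(True)
    optimizer = torch.optim.Adam([generated], lr=0.02)

    # Steps 3-6: recompute representations, compute losses, optimize the pixels
    for _ in range(steps):
        optimizer.zero_grad()
        feats = collect_features(generated, vgg, {1, 2, 3, 4, 5})
        c_loss = ((feats[4] - content_target) ** 2).mean()
        s_loss = sum(((gram_matrix(f) - t) ** 2).mean()
                     for f, t in zip(feats, style_targets))
        (alpha * c_loss + beta * s_loss).backward()
        optimizer.step()
    return generated.detach()
```

In practice, people often initialize from the content image instead of white noise and tune alpha and beta until the balance between content and style looks right.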

More examples of neural style transfer

Neural style transfer is an amazing machine learning technique that can be widely used by artists. Here are some more examples of neural style transfer. All images are licensed under the Pixabay license.

Implementation

There are plenty of code bases out there for neural style transfer. I recommend checking out the one from PyTorch’s tutorial series.
