An Intuitive Understanding of Neural Style Transfer
Neural style transfer is a machine learning technique that merges the “content” of one image with the “style” of another.
- Content: High-level features describing objects and their arrangement in the image. Examples of content are people, animals, and shapes.
- Style: The texture of the image. Examples of style are roughness, color, and sharpness.
Given a content image and a style image, we want to create a third image that blends the two. Let’s start off with a simple white noise image. (We could instead start from the content image or style image for optimization efficiency, but let’s not worry about that for now.)
Now imagine we have two loss functions: content loss and style loss. Each takes in our generated image and returns a loss value representing how dissimilar the generated image’s content or style is from that of the given content image or style image, respectively.
The objective is to minimize a weighted sum of the content loss and style loss for our generated image.
Just like with any other neural network model, we use an optimization algorithm such as stochastic gradient descent, but instead of optimizing the parameters of a neural network, we optimize the pixel values of our image.
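The idea of “optimizing the pixels” can be sketched in a few lines. This is a toy stand-in, not the real algorithm: each “image” here is just a flat list of pixel values, and both losses are plain squared errors against the content and style images (real neural style transfer compares CNN feature representations, not raw pixels).

```python
import random

def transfer(content, style, alpha=1.0, beta=0.5, lr=0.1, steps=200):
    """Toy pixel-space style transfer: gradient descent on the pixels
    of x to minimize alpha * content_loss + beta * style_loss."""
    random.seed(0)
    x = [random.random() for _ in content]  # start from white noise
    for _ in range(steps):
        # d/dx of alpha*(x - c)^2 + beta*(x - s)^2 for each pixel
        grad = [2 * alpha * (xi - ci) + 2 * beta * (xi - si)
                for xi, ci, si in zip(x, content, style)]
        # descend on the pixel values, not on any model parameters
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

blended = transfer(content=[1.0, 0.0, 1.0], style=[0.0, 1.0, 0.0])
```

Because the toy losses both live in pixel space, the result converges to a weighted average of the two images; with real feature-based losses the blend is far more interesting, but the optimization loop has the same shape.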
While style and content are not completely independent of each other, neural style transfer has shown that we can separate the two fairly well most of the time. When style and content conflict, one can adjust the weights of the style and content losses to show more of the style or the content in the generated image.
This concludes our high level explanation of neural style transfer.
Acquiring the content and style loss functions
This section assumes basic knowledge of machine learning and convolutional neural networks.
We use a trained convolutional neural network (CNN) model such as VGG19 to acquire the content and style loss functions.
Content
Recall that content consists of the high-level features that describe objects and their arrangement in the image. An image classification model needs to learn content well in order to accurately label an image as “dog” or “car”. A CNN is designed to extract the high-level features of an image. CNNs are essentially sequences of mappings from one image to another, and each successive mapping extracts more of the high-level features. Therefore, we are interested in the image mappings further into the CNN, because they capture more of the content.
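To make “a mapping from one image to another” concrete, here is a minimal sketch of a single convolutional filtering step, the building block a CNN stacks many times. The image and the vertical-edge kernel are hand-picked toy values for illustration; in a trained CNN the kernels are learned.

```python
def convolve2d(image, kernel):
    """Slide a small kernel over a 2D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# A 4x4 image: dark on the left half, bright on the right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# This kernel responds where intensity increases left-to-right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = convolve2d(image, vertical_edge)
```

The output is large only along the middle column, exactly where the edge sits; deeper layers of a CNN combine many such responses into progressively higher-level features.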
For more detailed information on how CNNs capture high-level features, I recommend this article by Google. It has a lot of great visuals.
We define the content representation of an image as an image mapping that is relatively deep into the CNN model.
We can visualize the content representation by using only the content representation to reconstruct the image. We start off by naively guessing with a white noise image and iteratively optimize it until the image’s content representation converges to that of the original image. This technique is called content reconstruction.
Style
For style loss, we want to capture the textural information rather than the structural information of the image. Recall that an image contains multiple channels, and CNN models usually map images to images with more channels in order to capture more features.
Let’s use the word layer to describe a slice of an image at one particular channel. We define the style representation as the correlations between the layers of an image mapping.
More specifically, for each pair of layers within an image mapping, we flatten both layers and take their dot product (the sum of their element-wise product). These correlations are collected in a Gram matrix: for a given image mapping, the element at the i’th row and j’th column of the Gram matrix contains the dot product between the flattened i’th layer and the flattened j’th layer.
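The Gram matrix is short to write down. Below is a minimal sketch; the two 2x2 layers are hypothetical activations standing in for a real CNN image mapping, which would have many more and much larger layers.

```python
def gram_matrix(feature_maps):
    """Gram matrix of an image mapping given as a list of layers
    (one 2D grid of activations per channel)."""
    # Flatten each layer into a single vector.
    flat = [[v for row in layer for v in row] for layer in feature_maps]
    n = len(flat)
    # Entry (i, j) is the dot product of flattened layers i and j.
    return [[sum(a * b for a, b in zip(flat[i], flat[j]))
             for j in range(n)]
            for i in range(n)]

# Two hypothetical 2x2 layers of one image mapping.
layers = [
    [[1.0, 0.0],
     [0.0, 1.0]],
    [[0.0, 1.0],
     [1.0, 0.0]],
]
G = gram_matrix(layers)
```

Here the two layers never activate at the same positions, so the off-diagonal correlations are zero; the diagonal entries measure how strongly each layer activates overall. Note that the Gram matrix discards where activations occur, which is exactly why it captures texture rather than structure.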
Just like in content reconstruction, we can visualize the style representation by using only the style representation to reconstruct the image. We start off by naively guessing with a white noise image and iteratively optimize it until the image’s style representation converges to that of the original image. This technique is called style reconstruction.
Now we can define the content loss and style loss functions.
Content loss is the squared error between the output image’s and the content image’s content representations.
Style loss is the squared error between the output image’s and the style image’s style representations.
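In code, both losses are the same squared-error computation applied to different representations. This is a bare-bones sketch: common implementations also normalize each term by the representation’s size and sum the style loss over several image mappings, which is omitted here.

```python
def squared_error(a, b):
    """Sum of squared differences over two (possibly nested) lists."""
    if isinstance(a, list):
        return sum(squared_error(x, y) for x, y in zip(a, b))
    return (a - b) ** 2

def content_loss(output_repr, content_repr):
    # Compare the content representations (feature maps) directly.
    return squared_error(output_repr, content_repr)

def style_loss(output_gram, style_gram):
    # Compare the style representations (Gram matrices).
    return squared_error(output_gram, style_gram)
```

For example, content representations `[[1.0, 2.0]]` and `[[0.0, 2.0]]` give a content loss of 1.0, since only the first entry differs.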
High level procedure for neural style transfer
1. Decide which image mappings of the CNN model you want for the content representation and style representation, then calculate the content representation from the content image and the style representation from the style image. For all of the images in this article, I used the fifth image mapping for the content representation and the first five image mappings for the style representation.
2. Initialize your output image (it can be a white noise image or a copy of the style image or content image; it doesn’t really matter).
3. Compute the content representation and style representation of your output image from the same image mappings of the same CNN model.
4. Compute the content loss and style loss.
5. Optimize your output image over a weighted sum of the content loss and style loss. This is often achieved by stochastic gradient descent.
6. Repeat steps 3–5 until you are satisfied with the results.
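The whole procedure above can be sketched as a runnable toy, with loud simplifications: the “image mapping” is just the identity on a flat 4-pixel image (a real implementation would use VGG19 feature maps), there is a single layer so the Gram matrix degenerates to one number, and gradients come from finite differences instead of backpropagation.

```python
def gram(x):
    # 1x1 "Gram matrix" of a single flattened layer.
    return sum(v * v for v in x)

def total_loss(x, content, style_gram, alpha=1.0, beta=0.01):
    c_loss = sum((xi - ci) ** 2 for xi, ci in zip(x, content))
    s_loss = (gram(x) - style_gram) ** 2
    return alpha * c_loss + beta * s_loss  # weighted sum (step 5)

def style_transfer(content, style, steps=500, lr=0.01, eps=1e-5):
    style_gram = gram(style)           # step 1: target representations
    x = [0.5] * len(content)           # step 2: initialize output image
    for _ in range(steps):             # repeat steps 3-5
        base = total_loss(x, content, style_gram)
        grad = []
        for i in range(len(x)):        # finite-difference gradient
            bumped = x[:]
            bumped[i] += eps
            grad.append((total_loss(bumped, content, style_gram) - base) / eps)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

out = style_transfer(content=[1.0, 0.0, 0.0, 1.0],
                     style=[0.0, 2.0, 2.0, 0.0])
```

The output is pulled toward the content image’s bright corners while the style term nudges its overall activation statistics toward the style image’s; raising beta would shift the balance toward style, exactly as described earlier.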
More examples of neural style transfer
Neural style transfer is an amazing machine learning technique that can be widely used by artists. Here are some more examples of neural style transfer. All images are licensed under the Pixabay license.
Implementation
There are plenty of codebases out there for neural style transfer. I recommend checking out the one from PyTorch’s tutorial series.