A brief introduction to Neural Style Transfer

Shubham Jha
Towards Data Science
7 min read · Aug 28, 2017


In August 2015, a paper came out titled ‘A Neural Algorithm of Artistic Style’. Back then, I was just getting started with deep learning, and I attempted to read the paper. I couldn’t make head or tail of it, so I gave up. A few months later, an app called Prisma was released, and people just went crazy about it. In case you don’t know what Prisma is, it’s basically an app that lets you apply the painting styles of famous painters to your own photos, and the results are quite visually pleasing. It isn’t like Instagram filters, where some transformation is applied to the picture only in the color space. It is much more elaborate, and the results are even more interesting. Here’s an interesting photo I found on the internet.

Now, a year later, I’ve learnt quite a lot about deep learning, and I decided to give the paper another read. This time I understood the entire paper. It’s actually quite an easy read if you’re familiar with the methods of deep learning. In this blog post I present my take on the paper and try to explain neural style transfer in simpler terms to someone who is in the same position I was a year ago, i.e., a beginner in the field of deep learning. I’m positive that once you see the results of neural style transfer and understand how it works, you’ll be even more excited about the future prospects and the power of deep neural networks.

Basically, in Neural Style Transfer we have two images: a style image and a content image. We need to copy the style from the style image and apply it to the content image. By style, we basically mean the patterns, the brushstrokes, and so on.

For this, we use a pretrained VGG-16 net. The original paper uses a VGG-19 net, but I couldn’t find VGG-19 weights for TensorFlow, so I went with VGG-16. In practice, using VGG-16 instead of VGG-19 doesn’t have much impact on the final results; the resulting images are nearly identical.
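If you want to follow along in code, here’s a minimal sketch of how you might grab the pretrained net and the intermediate layers we’ll need. It uses the VGG-16 that ships with tf.keras rather than the standalone weight file I used, and the layer choices are just reasonable defaults, not the exact ones from my experiments:

```python
import tensorflow as tf

# Pretrained VGG-16 without the classifier head; the weights stay frozen.
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
vgg.trainable = False

# Deeper layers for content, shallower layers for style (illustrative choices).
content_layers = ["block4_conv2"]
style_layers = ["block1_conv1", "block2_conv1", "block3_conv1", "block4_conv1"]

outputs = [vgg.get_layer(name).output for name in content_layers + style_layers]
feature_extractor = tf.keras.Model(inputs=vgg.input, outputs=outputs)
```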

So intuitively, how does this work?

To understand this, I’ll give you a basic intro to how ConvNets work.

ConvNets work on the basic principle of convolution. Say, for example, we have an image and a filter. We slide the filter over the image and take as output the weighted sum of the inputs covered by the filter, transformed by a nonlinearity such as sigmoid, ReLU or tanh. Every filter has its own set of weights, which do not change during the convolution operation. This is depicted beautifully in the GIF below-

Here, the blue grid is the input. You can see the 3x3 region covered by the filter slide across the input (dark blue region). The result of this convolution is called a feature map, which is depicted by the green grid.
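To make the sliding-window idea concrete, here’s a tiny NumPy sketch (not from the original implementation) of a single convolution with a 3x3 filter, followed by a ReLU:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (strictly speaking cross-correlation, as in most deep learning libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the patch currently covered by the filter
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

image = np.random.rand(5, 5)                        # the blue 5x5 grid
kernel = np.random.rand(3, 3)                       # one 3x3 filter with fixed weights
feature_map = np.maximum(conv2d(image, kernel), 0)  # ReLU nonlinearity
print(feature_map.shape)                            # (3, 3) -- the green grid
```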

Here are graphs of ReLU and tanh activation functions-

ReLU activation function
Tanh activation function

So, in a ConvNet, the input image is convolved with several filters and feature maps are generated. These feature maps are then convolved with some more filters and some more feature maps are generated. This is illustrated beautifully through the diagram below.

You can also see the term maxpool in the above image. A maxpool layer is mainly used for the purpose of dimensionality reduction. In a maxpool operation, we simply slide a window, say of size 2x2, across the image and take as output the maximum of the values covered by the window. Here’s an example-
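In code, a 2x2 maxpool with stride 2 might look like this rough NumPy sketch (again, not from the original implementation):

```python
import numpy as np

def maxpool2d(feature_map, size=2, stride=2):
    """2x2 max pooling: keep only the largest value inside each window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            pooled[i, j] = window.max()
    return pooled

x = np.array([[1, 3, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(maxpool2d(x))  # [[6. 8.]
                     #  [3. 4.]]
```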

Now here’s something really cool. Look at the image below carefully and examine the feature maps at different layers.

What you must have noticed is that the maps in the lower layers look for low-level features such as lines or blobs (Gabor-like filters). As we go to the higher layers, our features become increasingly complex. Intuitively, we can think of it this way: the lower layers capture low-level features such as lines and blobs, the layer above that builds on these low-level features and computes slightly more complex features, and so on…

Thus, we can conclude that ConvNets develop a hierarchical representation of features.

This property is the basis of style transfer.

Now remember: while doing style transfer, we are not training a neural network. Rather, we start from an image filled with random pixel values, and we optimize a cost function by changing the pixel values of that image. In simple terms, we start with a blank canvas and a cost function. Then we iteratively modify each pixel so as to minimize our cost function. To put it another way, while training neural networks we update our weights and biases, but in style transfer, we keep the weights and biases constant and instead update our image.

For this to work, it is important that our cost function correctly represents the problem.
The cost function has two terms: a style loss term and a content loss term, both of which are explained below.

Content loss

This is based on the intuition that images with similar content will have similar representation in the higher layers of the network.
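In the notation of the original paper (p is the content image, x is the generated image, and l is the chosen layer), the content loss is simply the squared error between the two feature representations:

L_{content}(p, x, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2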

Here, P^l is the representation of the content image and F^l is the representation of the generated image in the feature maps of layer l.

Style loss
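In the original paper, the style of a layer is captured by its Gram matrix G^l, whose entries are G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}. The style loss is then a weighted sum of per-layer terms:

E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2, \qquad L_{style} = \sum_l w_l E_l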

Here, A^l is the Gram matrix of the style image and G^l is the Gram matrix of the generated image in layer l. N_l is the number of feature maps and M_l is the size of a flattened feature map in layer l. w_l is the weight given to the style loss of layer l.

By style, we basically mean to capture the brush strokes and patterns. So we mainly use the lower layers, which capture low-level features. Also note the use of the Gram matrix here. The Gram matrix of a matrix X (here, the matrix of flattened feature maps) is X·X^T. The intuition behind using the Gram matrix is that we’re trying to capture the statistics of the lower layers.
You don’t necessarily have to use a Gram matrix, though. Other statistics (such as the mean) have been tried and have worked pretty well too.
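Here’s roughly what the Gram matrix looks like in TensorFlow, for a single layer’s activations of shape (height, width, channels); this is my sketch, not code from the paper:

```python
import tensorflow as tf

def gram_matrix(feature_map):
    """Gram matrix of one layer's activations, shape (channels, channels).

    Entry (i, j) is the inner product between flattened feature maps i and j,
    i.e. how strongly those two filters fire together across the image.
    """
    channels = int(feature_map.shape[-1])
    x = tf.reshape(feature_map, [-1, channels])  # (H*W, C)
    return tf.matmul(x, x, transpose_a=True)     # (C, C)
```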

Total Loss
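Putting the two together, the total loss from the paper is

L_{total} = \alpha \, L_{content} + \beta \, L_{style}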

where alpha and beta are weights for content and style, respectively. They can be tweaked to alter our final result.

So our total loss function basically represents our problem: the content of the final image should be similar to the content of the content image, and the style of the final image should be similar to the style of the style image.

Now all we have to do is minimize this loss. We minimize it by changing the input to the network itself. We basically start with a blank grey canvas and start altering the pixel values so as to minimize the loss. Any optimizer can be used to minimize this loss. Here, for the sake of simplicity, I’ve used simple gradient descent. But people have used Adam and L-BFGS and obtained quite good results on this task.
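Here’s a rough sketch of what that loop can look like in TensorFlow. The names content_image, extract_features, content_loss and style_loss are placeholders for the pieces described above, not functions from any particular library:

```python
import tensorflow as tf

# Placeholders (illustrative): content_image is the preprocessed content photo;
# extract_features / content_loss / style_loss implement the formulas above.
generated = tf.Variable(tf.random.uniform(content_image.shape))  # the canvas we optimize
optimizer = tf.optimizers.SGD(learning_rate=1.0)                 # plain gradient descent

alpha, beta = 1e-3, 1.0  # content/style weights; tune to taste

for step in range(1000):
    with tf.GradientTape() as tape:
        content_feats, style_feats = extract_features(generated)
        loss = alpha * content_loss(content_feats) + beta * style_loss(style_feats)
    # Gradient with respect to the image pixels, not the network weights
    grad = tape.gradient(loss, generated)
    optimizer.apply_gradients([(grad, generated)])
    generated.assign(tf.clip_by_value(generated, 0.0, 1.0))  # keep pixel values valid
```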

This process can be seen clearly in the following GIF.

Here are the results of some of my experiments. I ran these images (512x512 px) for 1000 iterations, and this process takes about 25 minutes on my laptop with a 2GB Nvidia GTX 940MX. It’ll take much longer if you’re running it on a CPU, but a lot less if you have a better GPU. I’ve heard that 1000 iterations take only about 2.5 minutes on a GTX Titan.

If you wish to gain some in-depth knowledge about Convolutional Neural Networks, check out the Stanford CS231n course.

So there you have it- your own Neural Style Transfer. Go ahead, implement it, and have fun!!
Until next time!

If you’d like to know more about Neural Style Transfer and its use cases, check out Fritz AI’s excellent blog post on Neural Style Transfer. The blog also contains additional resources and tutorials to help you get started with your first Neural Style Transfer project.
