Neural Networks Intuitions: 2. Dot product, Gram Matrix and Neural Style Transfer

Raghul Asokan
Towards Data Science
6 min read · Jan 24, 2019


Hello everyone!

Today we are going to have a look at one of the interesting problems that has been solved using neural networks: "Image Style Transfer". The task is to take two images, extract the content from one and the style (texture) from the other, and seamlessly merge them into one final image that looks realistic. This blog post is an explanation of the article A Neural Algorithm of Artistic Style by Gatys et al. (https://arxiv.org/abs/1508.06576).

Let's look at an example to make things clear.

Top left is the content image, bottom left is the style image, result is the one on the right

Interesting, right? Let's take a look at how to solve it.

Overview: here is a brief outline of the solution:

  1. Create a random input image
  2. Pass the input through a pretrained backbone architecture, say VGG or ResNet (note that this backbone will not be trained during backpropagation).
  3. Calculate the loss and compute the gradients w.r.t. the input image pixels. Hence only the input pixels are adjusted, while the network weights remain constant.

The objective is to change the input image so that it represents the content and the style from the respective images.
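To make these steps concrete, here is a minimal PyTorch sketch of the setup (the image size, optimizer and learning rate are illustrative choices on my part, not values from the original code; the content and style losses that actually drive the optimization are built up in the rest of the post):

```python
import torch
from torchvision import models

# 1. A random image is the quantity being optimized, not the network weights.
input_img = torch.rand(1, 3, 224, 224, requires_grad=True)

# 2. A pretrained backbone (vgg16_bn, as used later in the post), kept frozen.
#    pretrained=True is the 2019-era torchvision API; newer versions use weights=...
vgg = models.vgg16_bn(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# 3. The optimizer updates only the input pixels; gradients of the loss are
#    computed w.r.t. the image, never w.r.t. the backbone weights.
optimizer = torch.optim.Adam([input_img], lr=0.05)
```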

This problem consists of two sub-problems: 1. generating the content and 2. generating the style.

Problem — 1. Generate Content: The task is to produce an image whose content matches that of the content image.

One point to note here is that the generated image should only capture the content (a rough layout of the content image, not its texture, since the output should carry the style of the style image).

Solution: The answer should be pretty straightforward. Use an MSE loss (or any similarity measure such as SSIM or PSNR) between the input and the target. But what is the target here? Well, if there were no constraint on the style of the output, then a simple MSE between the input and the content image would be sufficient. So how do we get an image's content without copying its style?

Use feature maps.

Convolutional feature maps are generally a very good representation of an input image's features. They capture the spatial information of an image without its style information (when a feature map is used as it is), which is exactly what we need. And this is the reason we keep the backbone weights fixed during backpropagation.

Therefore, an MSE loss between the input image's features and the content image's features will work!

Content Loss
Backbone used is vgg16_bn

Feature maps of early conv layers represent the content much better, as they are closer to the input; hence features of conv2, conv4 and conv7 are used.

SaveFeatures is used to save the activations of conv layers (for the content/style/input images) during a forward pass
MSE between input and content features
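The full code is shown as screenshots in the original post; a rough sketch of the same idea, continuing from the setup above, could look like this (the SaveFeatures class is reconstructed from the caption, the layer indices follow the conv2/conv4/conv7 choice mentioned above, and content_img stands in for the real content image tensor):

```python
import torch
import torch.nn.functional as F

class SaveFeatures:
    """Forward hook that stores a layer's activations on every forward pass."""
    def __init__(self, module):
        self.hook = module.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, inputs, output):
        self.features = output
    def close(self):
        self.hook.remove()

# Positions of the Conv2d layers inside vgg16_bn.features, so that "conv2" etc.
# can be addressed by their 1-based conv-layer number.
conv_idxs = [i for i, m in enumerate(vgg) if isinstance(m, torch.nn.Conv2d)]

# Hook conv2, conv4 and conv7, as mentioned above.
content_hooks = [SaveFeatures(vgg[conv_idxs[n - 1]]) for n in (2, 4, 7)]

content_img = torch.rand(1, 3, 224, 224)  # stand-in for the real content image tensor

# One forward pass with the content image records the target activations.
with torch.no_grad():
    vgg(content_img)
content_targets = [h.features.clone() for h in content_hooks]

# After a forward pass with the input image, the content loss is the MSE
# between the hooked activations and the stored targets.
vgg(input_img)
content_loss = sum(F.mse_loss(h.features, t)
                   for h, t in zip(content_hooks, content_targets))
```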

Problem — 2. Generate Style: The task is to produce an image whose style matches that of the style image.

Solution: To extract the style of an image (or more specifically, to compute the style loss), we need something called a Gram matrix. Wait, what is a Gram matrix?

Before talking about how to compute the style loss, let me talk about some math basics.

Dot product:

Dot product of two vectors can be written as:

(a) a · b = |a| |b| cos θ

or

(b) a · b = a₁b₁ + a₂b₂ + … + aₙbₙ

A dot product of two vectors is the sum of products of respective coordinates.
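As a quick worked example, take a = (1, 2, 3) and b = (4, 0, 2): a · b = 1·4 + 2·0 + 3·2 = 10, and the geometric form |a||b| cos θ from (a) evaluates to the same number.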

In 3blue1brown's words, the dot product can be viewed as "the length of the projection of vector a onto vector b, times the length of vector b".

Or in Khan Academy's words, "it can be viewed as the length of vector a going in the same direction as vector b, times the length of vector b".

The term |a| cos θ in fig (a) (rewriting |a||b| cos θ as (|a| cos θ)|b|) is exactly the length of the adjacent side of the triangle, i.e. the projection of vector a onto b (the "projected vector a" in 3blue1brown's phrasing). So the dot product boils down to the length of this projection times the length of vector b.

So what does this mean?

Intuitively, the dot product can be seen as a measure of how similar two vectors are. The more similar they are, the smaller the angle between them as in fig (a), or the closer their respective coordinates as in fig (b). In both cases the result is large. So the more similar two vectors are, the larger their dot product gets.

But what does this have to do with a neural network?

Consider two vectors (more specifically, two flattened feature vectors taken from a convolutional feature map of depth C) representing features of the input. Their dot product gives us information about the relation between them: the smaller the product, the more different the learned features are; the larger the product, the more correlated they are. In other words, the smaller the product, the less the two features co-occur, and the larger it is, the more they occur together. This in a sense captures information about an image's style (texture) and zero information about its spatial structure, since each feature map is flattened before the dot product is taken.

Now take all C feature vectors (flattened channels) from a convolutional feature map of depth C and compute the dot product of each with every other one (including with itself). The result is the Gram matrix (of size C×C).

Gram matrix
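As a concrete sketch (my own formulation rather than the post's exact code), the Gram matrix of a feature map can be computed by flattening each channel into a vector and multiplying the resulting matrix by its own transpose:

```python
import torch

def gram_matrix(fmap):
    """fmap: feature map of shape (B, C, H, W). Returns a (B, C, C) Gram matrix."""
    b, c, h, w = fmap.shape
    flat = fmap.view(b, c, h * w)                 # each of the C channels becomes a flat vector
    gram = torch.bmm(flat, flat.transpose(1, 2))  # all pairwise dot products between channels
    return gram / (c * h * w)                     # normalize so layers of different sizes are comparable
```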

There you go!

Compute the MSE loss between the Gram matrices of the input image and the style image, and you can generate an input image with the required style.

Summing MSE of Gram matrices for all layers, normalizing and computing a weighted sum in the end
Saving feature maps for style image

In practice (based on experimentation), it is better to consider feature maps from several convolutional layers when extracting style information. In the above code, convolutional layers 2, 4, 7, 10 and 13 are used.

Computing gram matrix and style loss
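The corresponding code is again a screenshot in the original post; a sketch in the same spirit, reusing SaveFeatures, conv_idxs and gram_matrix from the earlier blocks (style_img stands in for the real style image tensor, and the equal per-layer weights are an illustrative choice):

```python
# Hook the conv layers used for style (2, 4, 7, 10 and 13, as stated above).
style_hooks = [SaveFeatures(vgg[conv_idxs[n - 1]]) for n in (2, 4, 7, 10, 13)]

style_img = torch.rand(1, 3, 224, 224)  # stand-in for the real style image tensor

# One forward pass with the style image records the target Gram matrices.
with torch.no_grad():
    vgg(style_img)
style_targets = [gram_matrix(h.features) for h in style_hooks]

# Equal weight per layer; the weighting scheme is a tunable choice.
layer_weights = [1.0 / len(style_hooks)] * len(style_hooks)

# After a forward pass with the input image, compare Gram matrices layer by layer.
vgg(input_img)
style_loss = sum(w * F.mse_loss(gram_matrix(h.features), t)
                 for w, h, t in zip(layer_weights, style_hooks, style_targets))
```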

Finally, add both the content and style losses before backpropagating to get an output image which has the content from the content image and the style from the style image.

A plain sum of the two losses may not work when they are on different scales, so a weighted sum of the content and style losses is used instead.

Total loss
compute weighted sum of both losses and then backprop
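Tying it all together, one possible version of the optimization loop, reusing the objects defined in the sketches above (alpha and beta are placeholder weights that need tuning per image pair):

```python
alpha, beta = 1.0, 1e3  # content/style weights; these exact values are a guess

for step in range(500):
    optimizer.zero_grad()
    vgg(input_img)  # one forward pass refreshes both the content and the style hooks
    c_loss = sum(F.mse_loss(h.features, t)
                 for h, t in zip(content_hooks, content_targets))
    s_loss = sum(w * F.mse_loss(gram_matrix(h.features), t)
                 for w, h, t in zip(layer_weights, style_hooks, style_targets))
    total_loss = alpha * c_loss + beta * s_loss
    total_loss.backward()
    optimizer.step()
    with torch.no_grad():
        input_img.clamp_(0, 1)  # keep pixel values in a valid image range
```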

Note that the input image can be any random tensor whose values are in the same range as the content and style images.

A few of the results from my implementation:

Check out my GitHub repo https://github.com/mailcorahul/deep_learning/tree/master/papers/neural_style_transfer for a PyTorch implementation of the paper Image Style Transfer Using Convolutional Neural Networks by Gatys et al.

Cheers :)
