Introduction
Lately, the term "Generative AI" has become a trending topic around the world thanks to the release of the publicly available AI models, like ChatGPT, Gemini, Claude, etc. As we all know, their capabilities were initially limited to understanding and generating texts, but soon after, they got their ability to perform the same thing on images as well. Talking more specifically about generative models for image data, there are actually plenty number of model variations we can use, in which every single of those has their own purpose. So far, I already got some of my articles about generative AI for image data published in Medium, such as Autoencoder and Variational Autoencoder (VAE). In today’s article, I am going to talk about another fascinating generative algorithm: The Neural Style Transfer.
NST was first introduced in the paper "A Neural Algorithm of Artistic Style" by Gatys et al. back in 2015 [1]. The paper explains that the main objective is to transfer the artistic style of one image (typically a painting) onto another image, hence the name "Style Transfer." Look at the examples in Figure 1 below, where the authors restyled the picture on the top left with different paintings.
![Figure 1. Applying NST to the original image (top left) using styles from The Shipwreck of the Minotaur by J.M.W. Turner (top right), The Starry Night by Vincent van Gogh (bottom left), and Der Schrei by Edvard Munch (bottom right) [1].](https://towardsdatascience.com/wp-content/uploads/2024/12/145OyICh9_-vKr-3SMJNpOw.png)
The Idea Behind NST
The authors of this research explain that the content and the style of an image can be separated by a CNN. This essentially implies that if we have two images, we can take the content from the first and the artistic style from the second. By combining them, we obtain a new image that retains the content of the first image yet is painted in the style of the second. This separation is possible because the shallower layers of a CNN typically focus on extracting low-level features such as edges, corners, and textures, while the deeper layers capture higher-level features, i.e., patterns that resemble specific objects. We can therefore think of the low-level features as the style of an image and the higher-level ones as its content.
In order to exploit this behavior, we need three images: a content image, a style image, and a generated image. The content image is the one whose style will be replaced with the artistic pattern from the style image. Neither the content nor the style image is actually modified in the process, since these two images act as the ground truths. The generated image, on the other hand, is the one we are going to modify based on the content information from the content image and the style information from the style image. Initially, the generated image can either be random noise or a clone of the content image. Later in the training process, we gradually update its pixel values so that it minimizes its difference from both the content and the style image.
NST Architecture
According to the paper, the backbone of NST is the VGG-19 model. The flow of the three images in the network can be seen in Figure 2 below.
![Figure 2. The flow of the generated, style, and content images in the pretrained VGG19 model [2].](https://towardsdatascience.com/wp-content/uploads/2024/12/1WO4dYspwIu1l8ksZl_i_xg.png)
The VGG-19 network above works by accepting our content, style, and generated images simultaneously. The content image (blue) is processed from the beginning of the network all the way to the conv4_2 layer. The style image (green) is also passed in from the input layer, but for this one we take the feature maps from conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1. Similarly, the generated image (orange) is passed through the network, and we extract the feature maps from the same layers used for both the content and the style image. Additionally, we can also see in the figure that the layers after conv5_1 do not need to be implemented, since our images will never go through them.
Content & Style Loss
There are two loss functions implemented in NST, namely content loss and style loss. As the name suggests, content loss is employed to calculate the difference between the content image and the generated image. By minimizing this loss, we are able to preserve the content information of the content image within the generated image. In Figure 2 above, content loss is applied to the feature maps produced by the blue and the corresponding orange arrow (the two arrows coming out of the conv4_2 layer). Meanwhile, style loss is applied to compute the difference between the feature maps from the style image and the generated image, i.e., between the green and the corresponding orange arrows. With the style loss minimized, our generated image should look similar to the style image in terms of the artistic patterns.
Mathematically speaking, content loss can be defined using the equation displayed in Figure 3. In the equation, P represents the feature map corresponding to the content image p, while F is the feature map obtained from the generated image x. The parameter l indicates that feature maps P and F are taken from the same layer, which in this case refers to conv4_2. By the way, if you often work with regression models, this equation should look familiar, since it is essentially just an MSE (Mean Squared Error).
![Figure 3. The mathematical definition of content loss [1].](https://towardsdatascience.com/wp-content/uploads/2024/12/1LOnFYlVB67lerZK1JRvlDA.png)
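For reference, the content loss in Figure 3 written out in the paper's notation [1] is:

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2}\sum_{i,j}\left(F^{l}_{ij} - P^{l}_{ij}\right)^{2}$$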
As for the style loss, we can calculate it using the equation in Figure 4. This equation sums the style loss E at each layer l with a weighting factor w.
![Figure 4. Equation for calculating the overall style loss [1].](https://towardsdatascience.com/wp-content/uploads/2024/12/1KzDbYRzCHKBW6bqrUH3nAA.png)
The style loss of each layer itself is defined in the equation in Figure 5, which is actually just another MSE-style function, in this case computing the difference between the Gram matrix A of the feature map from the style image and the Gram matrix G of the feature map from the generated image. Don’t worry if you’re not yet familiar with the Gram matrix, as I’ll talk about it in the next section.
![Figure 5. Equation for computing the style loss from a single feature map [1].](https://towardsdatascience.com/wp-content/uploads/2024/12/1tvsxj9ajV8t1B9Ex8-5ilw.png)
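Written out in the paper's notation [1], the overall style loss in Figure 4 and the per-layer term in Figure 5 are:

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_{l}\, E_{l}, \qquad E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}}\sum_{i,j}\left(G^{l}_{ij} - A^{l}_{ij}\right)^{2}$$

where $N_l$ is the number of channels and $M_l$ the number of spatial positions (height × width) of the feature map at layer $l$.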
Now that we know how to compute the content and style loss, we can combine them to form the total loss. You can see in Figure 6 that the summation of content and style loss is done with the weighting parameters α and β. These two coefficients allow us to control the emphasis of the loss function: if we want to emphasize the content, we can increase α, and if we want the style to be more dominant, we can use a higher value for β.
![Figure 6. Equation for computing the total loss [1].](https://towardsdatascience.com/wp-content/uploads/2024/12/1PFjn0aofp95vq5r7Q9rLEA.png)
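In the paper's notation [1], this total loss is simply:

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha\,\mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta\,\mathcal{L}_{style}(\vec{a}, \vec{x})$$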
Later in the training phase, the weights of the VGG-based network will be frozen, which means that we will not train the model any further. Instead, the value from our loss function is used to update the pixel values of the generated image. For this reason, the term "training" is not the most accurate way to describe this process, since the network itself does not undergo training. A better term would be "optimization," since our goal is to optimize the generated image. So, from now on, I will use the term "optimization" to refer to this process.
Gram Matrix
In the previous section I mentioned that the MSE for the style loss is computed on Gram matrices rather than on the plain feature maps. The reason we compute the Gram matrix is that it is an effective way to capture the correlations between the channels of a feature map. Look at Figures 7 and 8 to see how a Gram matrix is constructed. In this illustration, I assume that our feature map has 8 channels, each with a spatial dimension of 4×4. The first thing we need to do is flatten the spatial dimension and stack the channels vertically as shown below.
![Figure 7. Flattening the spatial dimension [2].](https://towardsdatascience.com/wp-content/uploads/2024/12/1Gc1sHtdJPSLGiYPGttXU_w.png)
Afterwards, the resulting array is multiplied by its transpose to construct the actual Gram matrix, which has the size C×C (in this case 8×8). This matrix multiplication causes the feature map to lose its spatial information, but in return it captures the correlations between channels, representing the textures and patterns that correspond to the artistic style. Hence, it should now make a lot of sense why we use Gram matrices for computing the style loss.
![Figure 8. Gram matrix is obtained by multiplying the spatially-flattened feature map with its transpose [2].](https://towardsdatascience.com/wp-content/uploads/2024/12/174U-Wefyp2pwG25OMFWO9A.png)
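To make the shapes concrete, here is a minimal sketch of the operation illustrated in Figures 7 and 8, using a random tensor with the same 8-channel, 4×4 shape. This is just an illustration; the actual implementation used for NST comes later in Codeblock 12.

import torch

feature_map = torch.randn(8, 4, 4)               # 8 channels, each with a 4×4 spatial dimension
flattened = feature_map.view(8, 4*4)             # flatten the spatial dimension -> (8, 16)
gram = torch.matmul(flattened, flattened.t())    # (8, 16) @ (16, 8) -> (8, 8)
print(gram.shape)                                # torch.Size([8, 8])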
Implementing NST from Scratch
Now that we understand the underlying theory behind NST, it is time to get our hands dirty with the code. The very first thing we need to do is import all the required modules.
# Codeblock 1
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models
import matplotlib.pyplot as plt
from PIL import Image
from tqdm import tqdm
from torchvision import transforms
from torchvision.models import VGG19_Weights
from torchvision.utils import save_image
These modules are pretty standard. They should not confuse you, especially if you have experience training PyTorch models. But don’t worry if you’re not familiar with them yet, since you’ll definitely understand their use as we go.
Next, we are going to check whether our computer has a GPU installed. If it does, the code will automatically assign 'cuda' to the device variable. Even though this NST implementation can work without a GPU, I highly recommend not doing that, because the NST optimization is computationally very expensive.
# Codeblock 2
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Parameter Initialization
There are several parameters we need to configure for this optimization task, the details of which can be seen in Codeblock 3 below.
# Codeblock 3
IMAGE_SIZE = 224 #(1)
EPOCHS = 20001 #(2)
LEARNING_RATE = 0.001 #(3)
ALPHA = 1 #(4)
BETA = 1000 #(5)
Here I set IMAGE_SIZE to 224 as shown at line #(1). The reason I choose this number is simply that it matches the original VGG input size. It is technically possible to use a larger size if you want your image to have a higher resolution, but keep in mind that it will make the optimization process take longer.
Next, I set the EPOCHS to 20,001 (#(2)), yes, with that extra 1. I admit this number is a bit strange, but it is just a technical detail that allows me to get the result at epoch 20,000; you’ll see why later. One important thing to note about EPOCHS is that a higher number doesn’t necessarily mean a better result for everyone. This is essentially due to the nature of generative AI, where at some point it is just a matter of preference. Later in the optimization process, even though I use a large value for EPOCHS, I will save the generated image at certain intervals so that I can choose the result I like the most.
As for the LEARNING_RATE (#(3)), 0.001 is basically just the number I often use for this parameter; theoretically speaking, changing it should affect the speed of the optimization process. Lastly, for ALPHA (#(4)) and BETA (#(5)), I configure them such that they have a ratio of 1/1000. It is mentioned in the paper that a smaller ratio (i.e., setting BETA even higher) makes the artistic style too dominant, leaving the content of the image less visible. Look at Figure 9 below to see how different α/β ratios affect the resulting image.
![Figure 9. The generated image created with different alpha/beta ratio [1]. The style image appears to be extremely dominant (leftmost) when the ratio is set to 1/100,000. Additionally, the artistic style is getting more complex as we move towards the deeper layer.](https://towardsdatascience.com/wp-content/uploads/2024/12/12OIQrwcREfgIQvBuz3eJ5Q.png)
Image Loading & Preprocessing
After the parameters have been initialized, we will now continue with the image loading and preprocessing function. See the implementation in Codeblock 4 below.
# Codeblock 4
def load_image(filename):
    transform = transforms.Compose([
        transforms.Resize(IMAGE_SIZE),    #(1)
        transforms.ToTensor(),            #(2)
        transforms.Normalize(mean=[0.485, 0.456, 0.406],    #(3)
                             std=[0.229, 0.224, 0.225])
    ])

    image = Image.open(filename)    #(4)
    image = transform(image)        #(5)
    image = image.unsqueeze(0)      #(6)

    return image
This function accepts the name of the image file to be loaded. Before actually loading the image, the first thing we do inside the function is define the preprocessing steps using transforms.Compose(), which consists of resizing (#(1)), conversion to a PyTorch tensor (#(2)), and normalization (#(3)). The normalization parameters I use here are the mean and the standard deviation of ImageNet, i.e., the dataset the pretrained VGG-19 was trained on. By using the same configuration, we allow the pretrained model to perform at its best.
The image itself is loaded using the Image.open() function from PIL (#(4)). Then, we directly preprocess it with the transformation steps we just defined (#(5)). Lastly, we apply the unsqueeze() method to create the batch dimension (#(6)). Even though in this case we only have a single image in each batch, it is still necessary to add this dimension because PyTorch models are designed to process a batch of images.
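If you want a quick sanity check of what the function returns, something like the following should do; the file name here is just a placeholder for any image you have on disk.

# Hypothetical quick check; replace the file name with your own image.
sample = load_image('some_image.jpg')
print(sample.shape)    # (1, 3, H, W); the shorter side is resized to 224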
Here we are going to use the picture of Victoria Library and the Starry Night painting. The two images in their unprocessed form are shown in Figure 10 below.
![Figure 10. The picture of Victoria Library which I took back in 2015 (left) [2] will be used as both the content and generated image, while the Starry Night painting by Vincent van Gogh (right) [3] will act as the style image.](https://towardsdatascience.com/wp-content/uploads/2024/12/1Fy1L8g3J-v9otifB1UyBeQ.png)
Now we will load these images using the load_image() function we defined above. See Codeblock 5 for the details.
# Codeblock 5
content_image = load_image('Victoria Library.jpg').to(device) #(1)
style_image = load_image('Starry Night.jpg').to(device) #(2)
gen_image = content_image.clone().requires_grad_(True) #(3)
Here I’m using the picture of the Victoria Library as the content image (#(1)), while the painting serves as the style image (#(2)). In this case, the same Victoria Library picture is also used for the generated image (#(3)). As I mentioned earlier, it is possible to use random noise for it instead. However, I decided not to do so because, based on my experiments, the information from the content image did not transfer properly to the generated image for some reason. We also need to apply requires_grad_(True) to the generated image in order to allow its pixel values to be updated by backpropagation.
We can check whether the images have been loaded and preprocessed properly by running the following code. You can see in the resulting output that both images now have a height of 224 pixels, which is exactly what we set earlier. When given a single number, transforms.Resize() resizes the shorter side to that value and scales the other side to maintain the aspect ratio, ensuring the images look proportional. Additionally, you may also notice that their colors become darker, which is caused by the normalization process.
# Codeblock 6
plt.imshow(content_image.permute(0, 2, 3, 1).squeeze().to('cpu'))
plt.show()
plt.imshow(style_image.permute(0, 2, 3, 1).squeeze().to('cpu'))
plt.show()
![Figure 11. Both images have been successfully loaded and preprocessed (output from Codeblock 6) [2].](https://towardsdatascience.com/wp-content/uploads/2024/12/1jXBBZOlRR9bNtY9uuN2CXA.png)
Modifying the VGG-19 Model
In PyTorch, the VGG-19 architecture can easily be loaded using models.vgg19(). Since we want to utilize its pretrained version, we need to pass VGG19_Weights.IMAGENET1K_V1 to the weights parameter. If this is your first time running the code, it will automatically start downloading the weights, which are around 550 MB.
# Codeblock 7
models.vgg19(weights=VGG19_Weights.IMAGENET1K_V1)
Before we actually modify the architecture, I want you to see its complete version below.
# Codeblock 7 output
VGG(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace=True)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace=True)
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace=True)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace=True)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace=True)
(16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(17): ReLU(inplace=True)
(18): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(20): ReLU(inplace=True)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace=True)
(23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(24): ReLU(inplace=True)
(25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(26): ReLU(inplace=True)
(27): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(29): ReLU(inplace=True)
(30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(31): ReLU(inplace=True)
(32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(33): ReLU(inplace=True)
(34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(35): ReLU(inplace=True)
(36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
(classifier): Sequential(
(0): Linear(in_features=25088, out_features=4096, bias=True)
(1): ReLU(inplace=True)
(2): Dropout(p=0.5, inplace=False)
(3): Linear(in_features=4096, out_features=4096, bias=True)
(4): ReLU(inplace=True)
(5): Dropout(p=0.5, inplace=False)
(6): Linear(in_features=4096, out_features=1000, bias=True)
)
)
I need to admit that the VGG-19 architecture I illustrated in Figure 2 is a bit oversimplified. However, the idea is the same: we take the output of the conv4_2 layer for the content image, and of conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 for the style image. In the Codeblock 7 output above, conv4_2 corresponds to layer number 21, whereas the five layers for the style image correspond to layers 0, 5, 10, 19, and 28, respectively. We are going to modify the pretrained model based on this requirement, which I do in the ModifiedVGG()
class shown below.
# Codeblock 8
class ModifiedVGG(nn.Module):
    def __init__(self):
        super().__init__()

        self.layer_content_idx = [21]                #(1)
        self.layer_style_idx = [0, 5, 10, 19, 28]    #(2)

        #(3)
        self.model = models.vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:29]

    def forward(self, x):
        content_features = []    #(4)
        style_features = []      #(5)

        for layer_idx, layer in enumerate(self.model):
            x = layer(x)    #(6)
            if layer_idx in self.layer_content_idx:
                content_features.append(x)    #(7)
            if layer_idx in self.layer_style_idx:
                style_features.append(x)      #(8)

        return content_features, style_features    #(9)
The first thing we do inside the class is create the __init__() method. Here we specify the indices of the layers from which the feature maps are going to be extracted, as shown at lines #(1) and #(2). The pretrained VGG-19 model itself is initialized at line #(3). Notice that here I use [:29] to take all layers from the beginning up to layer number 28 only. This is done because flowing the tensors all the way to the end of the network is not necessary for this NST task.
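If you want to double-check that these indices really point to the layers we discussed, you can print them straight from the pretrained model. This is just an optional check, not part of the NST pipeline.

# Optional check: print the layers behind the indices used above.
vgg_features = models.vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features
for idx in [0, 5, 10, 19, 21, 28]:
    print(idx, vgg_features[idx])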
Next, inside the forward() method we first allocate two lists, one for storing the feature maps from the content image (#(4)) and another for the feature maps from the style image (#(5)). Since the VGG architecture only consists of sequential layers, we can do the forward propagation using a simple for loop, where the feature map from the previous layer is directly fed into the subsequent one (#(6)). The content_features (#(7)) and style_features (#(8)) lists are appended with a feature map whenever the corresponding if statement returns True. It is worth noting that the if statement for the content image will only append once, since we only want to keep the feature map from layer 21. Despite this behavior, I keep the list-and-loop structure anyway for the sake of flexibility, so that you can take content feature maps from multiple layers if you want.
Both the content_features
and style_features
lists will be the return values of our forward()
method (#(9)
). Later on, if you feed the content image into the network, you can just take the first output. If you pass the style image into it, then you can take the second output. And you will need to take both outputs whenever you pass the generated image into the network.
Now we can check if our ModifiedVGG()
class works properly by passing content_image
and style_image
through it. See the details in Codeblock 9 below.
# Codeblock 9
modified_vgg = ModifiedVGG().to(device).eval()    #(1)

content_features = modified_vgg(content_image)[0]    #(2)
style_features = modified_vgg(style_image)[1]        #(3)

print('content_features length\t:', len(content_features))
print('style_features length\t:', len(style_features))
# Codeblock 9 output
content_features length : 1
style_features length : 5
The first thing we do in the above code is initialize the model we just created (#(1)). Remember that since we won't train the network any further, we put it in evaluation mode with the eval() method; its weights stay frozen anyway because they will never be passed to the optimizer. Next, we can forward-propagate the content (#(2)) and the style image (#(3)). If we print out the number of elements in both outputs, we can see that content_features consists of only a single element, whereas style_features contains 5 elements, each corresponding to the feature map from one of the selected layers.
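If you are curious, you can also print the shape of each element in style_features; the channel count grows (64, 128, 256, 512, 512) while the spatial size shrinks as we move to deeper layers. This is just an optional check.

# Optional check: inspect the shape of each extracted style feature map.
for fmap in style_features:
    print(fmap.shape)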
Just to make the underlying process clearer, I would like to display the feature maps stored in the two lists. To do so, there is some technical stuff we need to follow; this is actually something we need to do every time we want to display an image processed with PyTorch. As seen in Codeblock 10, since PyTorch places the channel dimension of a tensor at the 1st axis, we need to swap it with the last axis using the permute() method in order to allow Matplotlib to display it. Next, we also need to use squeeze() to drop the batch dimension. Since the conv4_2 layer implements 512 kernels, our content image is now represented as a feature map with 512 channels, each storing different information about the content of the image. For the sake of simplicity, I will only display the first 5 channels, which can be done with simple indexing.
# Codeblock 10
plt.imshow(content_features[0].permute(0, 2, 3, 1).squeeze()[:,:,0].to('cpu').detach())
plt.show()
plt.imshow(content_features[0].permute(0, 2, 3, 1).squeeze()[:,:,1].to('cpu').detach())
plt.show()
plt.imshow(content_features[0].permute(0, 2, 3, 1).squeeze()[:,:,2].to('cpu').detach())
plt.show()
plt.imshow(content_features[0].permute(0, 2, 3, 1).squeeze()[:,:,3].to('cpu').detach())
plt.show()
plt.imshow(content_features[0].permute(0, 2, 3, 1).squeeze()[:,:,4].to('cpu').detach())
plt.show()
And below is what the Victoria Library looks like after being processed by the VGG network from its input layer up to the conv4_2 layer. Even though these representations are abstract and difficult to interpret visually, they contain the important information that the network uses to reconstruct the content.
![Figure 12. Visualization of the content image after being processed through all layers of the VGG-19 network up to conv4_2 layer, showing channels 0, 1, 2, 3, and 4, respectively from left to right (output from Codeblock 10) [2].](https://towardsdatascience.com/wp-content/uploads/2024/12/1M3zIdfNO_culNIsrUs4P3A.png)
With the same mechanism, we can also display the style image after being processed from the input layer up to each of the five selected layers. If you check the original VGG paper [4], you will see that the feature maps produced by conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 have 64, 128, 256, 512, and 512 channels, respectively. In the code below, I arbitrarily pick one channel from each feature map to be displayed.
# Codeblock 11
plt.imshow(style_features[0].permute(0, 2, 3, 1).squeeze()[:,:,60].to('cpu').detach())
plt.show()
plt.imshow(style_features[1].permute(0, 2, 3, 1).squeeze()[:,:,12].to('cpu').detach())
plt.show()
plt.imshow(style_features[2].permute(0, 2, 3, 1).squeeze()[:,:,71].to('cpu').detach())
plt.show()
plt.imshow(style_features[3].permute(0, 2, 3, 1).squeeze()[:,:,152].to('cpu').detach())
plt.show()
plt.imshow(style_features[4].permute(0, 2, 3, 1).squeeze()[:,:,76].to('cpu').detach())
plt.show()
You can see in the resulting output that the style image appears very clear in the initial layers, indicating that the feature maps from these layers are useful for preserving the style information. However, it is also worth noting that taking style information from deeper layers is important in order to preserve higher-order artistic style. This is actually shown in Figure 9, where the artistic style appears more complex at layer conv3_1 than at layer conv1_1.
![Figure 13. The style image after being processed sequentially through the VGG-19 network up to conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 layers, respectively from left to right (output from Codeblock 11) [2].](https://towardsdatascience.com/wp-content/uploads/2024/12/1jRBnA938rRceoy5ebGPvfA.png)
Creating the Gram Matrix
Both the feature maps from the style and the generated image will be converted to Gram matrices before the loss is computed using MSE. The Gram matrix computation previously illustrated in Figures 7 and 8 is implemented in the compute_gram_matrix() function below. The way this function works is pretty straightforward: it first flattens the spatial dimension (#(1)), then the resulting tensor is matrix-multiplied with its transpose (#(2)).
# Codeblock 12
def compute_gram_matrix(feature_map):
    batch_size, num_channels, height, width = feature_map.shape
    feature_map_flat = feature_map.view(num_channels, height*width)       #(1)
    gram_matrix = torch.matmul(feature_map_flat, feature_map_flat.t())    #(2)

    return gram_matrix
Now I am going to apply this function to compute the Gram matrices of the style image feature maps that we stored earlier in the style_features list. Additionally, I will also visualize them so that you can get a better understanding of this matrix. Look at Codeblock 13 below to see how I do it.
# Codeblock 13
style_features_0 = compute_gram_matrix(style_features[0])
style_features_1 = compute_gram_matrix(style_features[1])
style_features_2 = compute_gram_matrix(style_features[2])
style_features_3 = compute_gram_matrix(style_features[3])
style_features_4 = compute_gram_matrix(style_features[4])
plt.imshow(style_features_0.to('cpu').detach())
plt.show()
plt.imshow(style_features_1.to('cpu').detach())
plt.show()
plt.imshow(style_features_2.to('cpu').detach())
plt.show()
plt.imshow(style_features_3.to('cpu').detach())
plt.show()
plt.imshow(style_features_4.to('cpu').detach())
plt.show()
![Figure 14. The Gram matrices of different feature maps from style image (output from Codeblock 13) [2].](https://towardsdatascience.com/wp-content/uploads/2024/12/1uHm3L5W7qMOfZVE2foMTrw.png)
The output shown in Figure 14 aligns with the illustration in Figure 8, where the size of each matrix matches the number of channels in the corresponding feature map. The colors inside these matrices indicate the correlation scores between two channels, with higher values represented by lighter colors. There is actually not much we can interpret from these matrices; just keep in mind that they contain the style information of an image. The one thing we can see is the subtle diagonal line spanning from the top left all the way to the bottom right. This pattern makes sense because the correlation between a channel and itself (the diagonal elements) is typically higher than the correlation between different channels (the off-diagonal elements).
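As a quick, optional sanity check for that diagonal pattern, you can compare the average of the diagonal entries with the average of the entire matrix; for these feature maps the former should typically come out larger.

# Optional check: self-correlations (diagonal) vs. the overall average.
print(style_features_0.diagonal().mean().item())
print(style_features_0.mean().item())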
Loss Function & Optimizer
The pixel intensity values of the generated image will be updated based on the weighted sum of the content and style loss. As I mentioned earlier, these two loss functions are essentially the same: the Mean Squared Error. For this reason, we don't need to create separate functions for them. As for the optimizer, many sources out there suggest using the L-BFGS optimizer for NST. However, I didn't find any explicit requirement for it in the paper, so I think it's completely fine to go with any optimizer. In this case, I will just use Adam.
In the following codeblock, I implement the MSE loss from scratch and initialize the Adam optimizer from the PyTorch module. One thing you need to pay attention to is that we pass our generated image to the params parameter, not the weights of the model. This way, each optimization step updates the pixel values of gen_image while keeping the model weights unchanged.
# Codeblock 14
def MSE(tensor_0, tensor_1):
    return torch.mean((tensor_0-tensor_1)**2)

optimizer = optim.Adam(params=[gen_image], lr=LEARNING_RATE)
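In case you do want to try the L-BFGS variant that many tutorials use, a rough sketch of my own (not something prescribed by the paper, and not used in the rest of this article) could look something like this:

# Rough, optional sketch of an L-BFGS setup. L-BFGS in PyTorch requires a
# closure that re-evaluates the loss on every call.
lbfgs = optim.LBFGS(params=[gen_image], lr=1)

def closure():
    lbfgs.zero_grad()
    content_feats = modified_vgg(content_image)[0]
    style_feats = modified_vgg(style_image)[1]
    gen_content, gen_style = modified_vgg(gen_image)
    content_loss = sum(MSE(c, g) for c, g in zip(content_feats, gen_content))
    style_loss = sum(MSE(compute_gram_matrix(s), compute_gram_matrix(g))
                     for s, g in zip(style_feats, gen_style))
    loss = ALPHA*content_loss + BETA*style_loss
    loss.backward()
    return loss

# Each call to step() may evaluate the closure several times internally,
# so far fewer iterations are typically needed than with Adam:
# for _ in range(200):
#     lbfgs.step(closure)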
Denormalization Function
If we go back to Figure 11, you will notice that the coloration of the content and the style image became strange after normalization. Hence, it is necessary to apply a so-called denormalization process to the resulting generated image so that the colors return to their original state. We implement this mechanism inside the denormalize() function below. The mean (#(1)) and std (#(2)) values are the same ones used in the normalization step in Codeblock 4. Using these two values, we apply the operation at line #(3), which rescales the pixel values from being centered around 0 back to their original range.
# Codeblock 15
def denormalize(gen_image):
    mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)    #(1)
    std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)     #(2)
    gen_image = gen_image*std + mean    #(3)

    return gen_image
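As a quick example of how this function can be used for previewing (optional, and not part of the optimization itself), you can denormalize the generated image, clamp it to the [0, 1] range, and display it with Matplotlib:

# Optional preview: clamp to [0, 1] so Matplotlib renders the colors properly.
preview = denormalize(gen_image).clamp(0, 1)
plt.imshow(preview.permute(0, 2, 3, 1).squeeze().to('cpu').detach())
plt.show()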
The NST Optimization
Now that we have all the necessary components prepared, we will compile them into a single function, which I name optimize(). See Codeblocks 16a, 16b, and 16c below for the details.
# Codeblock 16a
def optimize():
    #(1)
    content_losses = []
    style_losses = []
    total_losses = []

    for epoch in tqdm(range(EPOCHS)):
        content_features = modified_vgg(content_image)[0]    #(2)
        style_features = modified_vgg(style_image)[1]        #(3)

        gen_features = modified_vgg(gen_image)
        gen_features_content, gen_features_style = gen_features    #(4)
This function initially works by allocating 3 empty lists, each for keeping track of the content, style, and total loss (#(1)). In each epoch, we pass the content, style, and generated image through the modified VGG network we created earlier. Remember that for the content image, we only extract the content features (#(2)), while for the style image, we take its style features only (#(3)). This is basically the reason I use the indexers [0] and [1] for the two, respectively. As for the generated image, we need both its content and style features, so we store them separately in gen_features_content and gen_features_style (#(4)).
Previously I mentioned that our three input images are processed simultaneously. However, in the above code I feed them in one by one instead. Don't worry about this difference in the implementation; it's only a technical detail, and I do it this way for the sake of simplicity so you can better understand the entire NST optimization algorithm.
# Codeblock 16b
        content_loss = 0    #(1)
        style_loss = 0      #(2)

        for content_feature, gen_feature_content in zip(content_features, gen_features_content):
            content_loss += MSE(content_feature, gen_feature_content)    #(3)

        for style_feature, gen_feature_style in zip(style_features, gen_features_style):
            style_gram = compute_gram_matrix(style_feature)      #(4)
            gen_gram = compute_gram_matrix(gen_feature_style)    #(5)
            style_loss += MSE(style_gram, gen_gram)              #(6)

        total_loss = ALPHA*content_loss + BETA*style_loss    #(7)
Still inside the same loop, we set the content and style loss to 0, as shown at lines #(1) and #(2) in Codeblock 16b. Afterwards, we iterate through the content features of the content image and the generated image to calculate the MSE (#(3)). Again, remember that this loop only iterates once. We create a similar loop for the style features, where in this case we compute the Gram matrix of each style feature from both the style image (#(4)) and the generated image (#(5)) before computing the MSE (#(6)) and accumulating it in style_loss. After content_loss and style_loss are obtained, we weight them with the ALPHA and BETA coefficients, which we previously set to 1 and 1000 (#(7)).
The optimize() function isn't finished yet; we will continue it with Codeblock 16c below. The following code only implements the standard procedure for training PyTorch models. Here, we use the zero_grad() method to clear the gradients tracked by the optimizer (#(1)) before computing the new ones for the current epoch (#(2)). Then, we update the trainable parameters based on the gradient values using the step() method (#(3)), where in our case these trainable parameters are the pixel intensities of the generated image.
# Codeblock 16c
        optimizer.zero_grad()    #(1)
        total_loss.backward()    #(2)
        optimizer.step()         #(3)

        #(4)
        content_losses.append(content_loss.item())
        style_losses.append(style_loss.item())
        total_losses.append(total_loss.item())

        #(5)
        if epoch % 200 == 0:
            gen_denormalized = denormalize(gen_image)
            save_image(gen_denormalized, f'gen_image{epoch}.png')

    return content_losses, style_losses, total_losses
Afterwards, we append all the loss values obtained in the current epoch to the lists we initialized earlier (#(4)). This step is not mandatory, but I do it anyway since I want to display how the loss values change throughout the optimization process. Finally, we denormalize and save the generated image every 200 epochs so that we can choose the result we prefer the most (#(5)).
As the optimization function is complete, we can now run it using the code below. Here I store the loss values in the content_losses, style_losses, and total_losses lists. Sit back and relax while the GPU blends the content and style images. In my case, I am using a Kaggle Notebook with an Nvidia P100 GPU enabled, and it takes around 15 minutes to complete the 20,001 optimization steps.
# Codeblock 17
losses = optimize()
content_losses, style_losses, total_losses = losses
Finally, after the process is done, we successfully get the Victoria Library picture redrawn in the style of Van Gogh's Starry Night painting. You can see in the following figure that the effect of the style image becomes more apparent in later epochs.
![Figure 15. The NST optimization result at epoch 0, 5000, 10000, 15000 and 20000, respectively from left to right [2].](https://towardsdatascience.com/wp-content/uploads/2024/12/1JHzrDcfwOm4OlqgQ7oxDJA.png)
Talking about the training progress in Figure 16, the vertical axis of the three plots represents the loss value, whereas the horizontal axis denotes the epoch. And you might notice something unusual here: when we train a deep learning model, the loss typically decreases as training progresses, and this is indeed the case for the style and total loss. What makes things strange is that the content loss increases instead.
Such a phenomenon occurs because our generated image was initialized as a clone of the content image, which means the initial content loss is 0. As the training progresses, the artistic style from the style image is gradually infused into the generated image, causing the style loss to decrease but, in return, the content loss to increase. This absolutely makes sense, because the generated image slowly evolves away from the content image. Theoretically speaking, if we initialized the generated image with random noise, we could expect high initial values for both the content and style loss, which would then decrease over the subsequent epochs.
# Codeblock 18
plt.title('content_losses')
plt.plot(content_losses)
plt.show()
plt.title('style_losses')
plt.plot(style_losses)
plt.show()
plt.title('total_losses')
plt.plot(total_losses)
plt.show()

That’s pretty much everything I can explain to you about the theory and the implementation of NST. Feel free to comment if you have any thoughts about this article. Thanks for reading, and have a nice day!
_P.S. You can find the code used in this article in my GitHub repo as well. Here’s the link to it._
References
[1] Leon A. Gatys, Alexander S. Ecker, Matthias Bethge. A Neural Algorithm of Artistic Style. Arxiv. https://arxiv.org/pdf/1508.06576 [Accessed October 6, 2024].
[2] Image created originally by author.
[3] Van Gogh – Starry Night – Google Art Project. Wikimedia Commons. https://commons.wikimedia.org/wiki/File:VanGogh-_StarryNight-_Google_Art_Project.jpg [Accessed October 7, 2024].
[4] Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Arxiv. https://arxiv.org/pdf/1409.1556 [Accessed October 11, 2024].