Neural Style Transfer, Evolution

Introduction
The seminal work of Gatys et al. [R1] showed that deep neural networks (DNNs) encode not only the content but also the style information of an image. Moreover, the image style and content are somewhat separable: it is possible to change the style of an image while preserving its content. Their approach is flexible enough to combine content and style of arbitrary images. However, it relies on an optimization process that is prohibitively slow.
Fast approximations [R2, R3] with feed-forward neural networks have been proposed to speed up Neural Style Transfer. Unfortunately, the speed improvement comes at a cost: the network is either restricted to a single style, or the network is tied to a finite set of styles.
Huang and Belongie [R4] resolve this fundamental flexibility-speed dilemma. It has been known that the convolutional feature statistics of a CNN can capture the style of an image. While Gatys et al. [R1] use the second-order statistics as their optimization objective, Li et al. [R5] showed that matching many other statistics, including the channel-wise mean and variance, is also effective for style transfer. Hence, we can argue that instance normalization performs a form of style normalization by normalizing the feature statistics, namely the mean and variance.
Why not Batch Normalization?
Since BN normalizes the feature statistics of a batch of samples instead of a single sample, it can be intuitively understood as normalizing a batch of samples to be centred around a single style, although different target styles are desired.
On the other hand, IN can normalize the style of each individual sample to the target style: different affine parameters can normalize the feature statistics to different values, thereby normalizing the output image to different styles.
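To make the distinction concrete, here is a minimal PyTorch sketch of the statistics each layer normalizes over, assuming the usual 4-D feature tensor of shape N×C×H×W (the tensor sizes are illustrative only):

```python
import torch

# Feature maps of shape (N, C, H, W): batch, channels, height, width.
x = torch.randn(8, 64, 32, 32)

# Batch Normalization: one mean/variance per channel, shared across the batch,
# i.e. computed over the (N, H, W) dimensions.
bn_mean = x.mean(dim=(0, 2, 3), keepdim=True)      # shape (1, C, 1, 1)
bn_var = x.var(dim=(0, 2, 3), keepdim=True)

# Instance Normalization: one mean/variance per sample AND per channel,
# i.e. computed over (H, W) only, so each image keeps its own feature statistics.
in_mean = x.mean(dim=(2, 3), keepdim=True)         # shape (N, C, 1, 1)
in_var = x.var(dim=(2, 3), keepdim=True)

x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)   # whole batch pulled towards one "style"
x_in = (x - in_mean) / torch.sqrt(in_var + 1e-5)   # each sample normalized individually
```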
Adaptive Instance Normalization
AdaIN receives a content input x and a style input y, and simply aligns the channel-wise mean and variance of x to match those of y. Unlike BN, IN, or CIN (Conditional Instance Normalization), AdaIN has no learnable affine parameters. Instead, it adaptively computes the affine parameters from the style input.

Intuitively, let us consider a feature channel that detects brushstrokes of a certain style. A style image with this kind of stroke will produce a high average activation for this feature. Moreover, the subtle style information for this particular brushstroke would be captured by the variance. Since AdaIN only scales and shifts the activations, the spatial information of the content image is preserved.
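Concretely, AdaIN normalizes the content features with their own statistics and then re-scales and re-shifts them with the style statistics. A minimal PyTorch sketch (the function name adain and the epsilon value are my own choices, not taken from the paper's code):

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Align the channel-wise mean and std of content_feat to those of style_feat.

    Both tensors have shape (N, C, H, W). There are no learnable affine
    parameters: the scale and shift come from the style features themselves.
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps   # avoid division by zero
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)

    # Normalize the content features, then scale/shift with the style statistics.
    return s_std * (content_feat - c_mean) / c_std + s_mean
```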
Style Transfer Network
The AdaIN Style Transfer network T (Fig 2) takes a content image c and an arbitrary style image s as inputs, and synthesizes an output image T(c, s) that recombines the content and style of the respective input images. The network adopts a simple encoder-decoder architecture, in which the encoder f is fixed to the first few layers of a pre-trained VGG-19. After encoding the content and style images in the feature space, both the feature maps are fed to an AdaIN layer that aligns the mean and variance of the content feature maps to those of the style feature maps, producing the target feature maps t. A randomly initialized decoder g is trained to invert t back to the image space, generating the stylized image T(c, s).
![Fig 2. AdaIN based Style Transfer Network. Image taken from "[R4] Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization". Self-annotations in blue.](https://towardsdatascience.com/wp-content/uploads/2020/09/1va7XsIY2V-h3Iv0TeoIhHA.png)
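Putting the pieces together, the whole forward pass is simply encode, re-normalize with AdaIN, decode. A rough PyTorch sketch, where vgg_encoder and decoder stand in for the actual layer definitions and adain is the function sketched above:

```python
import torch.nn as nn

class StyleTransferNet(nn.Module):
    """Encoder (fixed VGG-19 slice) -> AdaIN -> decoder, following Fig 2."""

    def __init__(self, vgg_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = vgg_encoder        # f: first layers of a pre-trained VGG-19, frozen
        self.decoder = decoder            # g: randomly initialized, trained to invert t
        for p in self.encoder.parameters():
            p.requires_grad_(False)

    def forward(self, content, style):
        f_c = self.encoder(content)       # content features f(c)
        f_s = self.encoder(style)         # style features f(s)
        t = adain(f_c, f_s)               # target features: content structure, style statistics
        return self.decoder(t), t         # stylized image T(c, s), plus t for the losses
```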
Normalization Layers in the Decoder?
Apart from using nearest up-sampling to reduce checker-board effects, and using reflection padding in both f and g to avoid border artifacts, one key architectural choice is to not use normalization layers in the decoder. Since IN normalizes each sample to a single style while BN normalizes a batch of samples to be centred around a single style, both are undesirable when we want the decoder to generate images in vastly different styles.
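A sketch of what one decoder stage might look like under these choices; the channel counts here are illustrative, not the paper's exact architecture:

```python
import torch.nn as nn

# One up-sampling stage of the decoder: reflection padding instead of zero padding,
# nearest-neighbour up-sampling instead of strided transposed convolutions, and
# no BN/IN layer anywhere, so the output is not pulled towards a single style.
decoder_stage = nn.Sequential(
    nn.ReflectionPad2d(1),
    nn.Conv2d(512, 256, kernel_size=3),
    nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=2, mode="nearest"),
)
```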
Loss Functions
The style transfer network T is trained using a weighted combination of the content loss function Lc and the style loss function Ls.

The content loss is the Euclidean distance between the target features t and the features of the output image f(g(t)). The AdaIN output t is used as the content target, instead of the commonly used feature responses of the content image, since it aligns with the goal of inverting the AdaIN output t.
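As a sketch, with encoder standing for the fixed VGG slice f and g_t for the decoded image g(t), and with mean-squared error used as a stand-in for the squared Euclidean distance:

```python
import torch.nn.functional as F

def content_loss(encoder, g_t, t):
    # Re-encode the output image and compare it with the AdaIN target t,
    # rather than with the features of the original content image.
    return F.mse_loss(encoder(g_t), t)
```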
Since the AdaIN layer only transfers the mean and standard deviation of the style features, the style loss only matches these statistics of feature activations of the style image s and the output image g(t). Style loss is averaged over multiple layers (i=1 to L) of the VGG-19.
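A corresponding sketch of the style loss; encoder_layers is an assumed list of feature extractors phi_i for the chosen VGG-19 layers, and mean-squared error again stands in for the Euclidean distance:

```python
import torch.nn.functional as F

def style_loss(encoder_layers, g_t, s):
    # Match the channel-wise mean and std of the output and style features at
    # each chosen VGG-19 layer phi_i, then average over the L layers.
    loss = 0.0
    for phi in encoder_layers:
        f_out, f_style = phi(g_t), phi(s)
        loss += F.mse_loss(f_out.mean(dim=(2, 3)), f_style.mean(dim=(2, 3)))
        loss += F.mse_loss(f_out.std(dim=(2, 3)), f_style.std(dim=(2, 3)))
    return loss / len(encoder_layers)
```

The full training objective is then the weighted combination described above, content_loss plus a style weight times style_loss, with the weight controlling the degree of stylization.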
Conclusion
In essence, the AdaIN Style Transfer Network described above provides the flexibility of combining arbitrary content and style images in real-time.
References
- [R1] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image Style Transfer Using Convolutional Neural Networks. In CVPR, 2016.
- [R2] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In ECCV, 2016.
- [R3] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A Learned Representation For Artistic Style. In ICLR, 2017.
- [R4] Xun Huang and Serge Belongie. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. In ICCV, 2017.
- [R5] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying Neural Style Transfer. In IJCAI, 2017.