
Intro
Much of the research effort made by the machine learning community in the past few years is, in my opinion, too narrowly focused. Researchers take well-known sample datasets, spend years training models on them, and publish papers reporting marginal improvements over the state-of-the-art benchmarks. For those readers coming from the machine-learning community, think of how often you’ve seen mentions of CIFAR, MNIST, LibriSpeech, ImageNet, IMDB Reviews, or the Wikipedia corpus; these datasets are used all too often. We don’t see enough research done on original data.
There is a good reason behind this: when comparing new machine learning models, we need common data on which to compare them with existing models. Benchmark datasets make this easy; without them, it would be akin to comparing apples with oranges. But joining the race to exceed state-of-the-art performance on the same old datasets is just boring. I always prefer spending my time on more original work.
A recent theme that is becoming prominent in the machine learning world is called Data-centric AI. As opposed to the approach detailed above, in which we strive to create better models on the same dataset, we train existing models with better data to improve their inference or generation capabilities. The analogy in teaching humans is that, assuming most humans have a similar ability to learn, better examples and better explanations lead to a better understanding of a problem statement and how to solve it.
Data-centric AI fits very well with generative models. When training a model to create new samples in an output domain (e.g. audio, video, images…) the training data has a significant effect on the resulting outputs during inference time. Once researchers publish a new model, we can initialize the same design as a blank slate (or do some transfer learning, for those familiar with the process), and train it with our own data to get our own results.
During my travels around the world, I had the time to take many photos. Looking through them made me think it might be possible to make some drawings that resemble them. Browsing through the technical literature, I found a recent research paper (2019) that might be up to the job. The paper, titled "TraVeLGAN: Image-to-Image Translation by Transformation Vector Learning", describes a method to do just that. Using the model presented in the paper, I hoped a data-centric approach could be used to train it to draw watercolor paintings based on photos I took on my trip to Japan.
Travel
So, what exactly is Transformation Vector Learning (TraVeL for short)? In 2014 a researcher called Ian Goodfellow pioneered a technique called Generative Adversarial Networks (GANs). Back then machine learning models were making steady improvements on classification tasks but were still very limited when it came to generating content. GANs approach the content generation problem from a new angle: Instead of training a single neural network to generate content, we train two networks together to solve the problem.
The first network, called the "Generator", is trained to create the desired content based on a sample from a well-defined input domain (e.g., recordings of Frank Sinatra, pictures of raccoons, randomly initialized noise vector, etc…). The second network, called the "Discriminator", acts as a classifier on samples from the output domain and tries to determine if the input it was given is a genuine member of the output domain. The two networks train side by side, each improving the other. A better discriminator can better distinguish between real samples in the output domain and artificial samples made by the generator. This in turn forces the generator to output content better resembling true samples from the output domain to successfully fool the discriminator.
To distill the knowledge learned by the discriminator and use it to improve the generator, we back-propagate from the discriminator loss function and update the weight tensors present within the generator. This results in a feedback loop in which both networks compete: Lower discriminator losses result in higher generator losses and vice versa. We pit them against each other so that they may reach an equilibrium, naming them "adversarial networks" as per their rivaling nature.

Generators can be trained to create samples in an output space given a sample from an input space; they can even be trained to make realistic samples given random noise as input. To perform a style transfer (as in transforming photographs into paintings), we want to preserve some of the content of the input sample within the output sample. Naïve attempts using pixel-wise difference loss functions (mean squared error) have resulted in some initial gains but fail to generalize when the input and output domains are more distinct from each other.
Let’s ask ourselves why this happens: notice that minimizing the pixel-wise difference between images is the same as minimizing the Euclidean distance between pixels at matching indices. Formalizing the pixel-wise loss, let x be a sample in the input space, x ∈ ℝ^(M×N×3). We want the generator to learn a function G yielding samples in an output space of the same dimensions: G(x) ∈ ℝ^(M×N×3). Viewing each pixel as a vector p ∈ ℝ³, we iterate over the same indices in the input and output images. For each index [m,n], 1≤m≤M, 1≤n≤N, we compare the pixels at that index across the two images. Let p₁ = x[m,n] and p₂ = G(x)[m,n]. The squared difference between the pixels is L(p₁, p₂) = ‖p₁−p₂‖² = (p₁−p₂)·(p₁−p₂), where · is the dot product between two vectors and L is a metric function L: ℝ³×ℝ³ ⟶ ℝ. The pixel-wise loss is just the mean of these squared differences over all indices.
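As a concrete reference point, here is a minimal NumPy sketch of the pixel-wise loss just described. It is my own illustration, not taken from the project’s code; the function name and the averaging convention are assumptions.

    import numpy as np

    def pixelwise_mse(x, g_x):
        # x, g_x: images of shape (M, N, 3)
        diff = x.astype(np.float64) - g_x.astype(np.float64)
        squared = np.sum(diff ** 2, axis=-1)   # (p1 - p2) . (p1 - p2) for each pixel
        return squared.mean()                  # mean over all [m, n] indices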

Mean squared error (MSE) is minimized when the source image and the generated image are identical. This leads to a few problems. An output sample with a perfect MSE score will be easily classified as artificial by the discriminator, simply because it does not belong to the output domain. MSE is also very sensitive to translations, rotations, skews, and other such transformations; even for a sample with a low MSE loss, a simple horizontal flip would drastically increase the MSE value, simply because the metric is calculated pixel-wise between images. Lastly, MSE averages over the pixels of the input and output samples to calculate the loss, and averaging techniques generally fail to produce high-quality samples. In a way, this also makes sense: we do not want to optimize a network to converge to a sample averaged from the training set, but rather to create many unique samples based on the variations present in the training set.
TraVeL tackles the problem of image style transfer by training the generator to learn a function G that minimizes a different loss function. Instead of minimizing the distance between pixels in the image space ℝ^(M×N×3), we would like to examine the higher-level features preserved in the output space and minimize their difference from the same features in the input space. However, we do not know how to track high-level features in the image space, let alone once they are transformed by the generator function G. How do we tackle this problem?
Looking at this from the researchers’ perspective, they had to experiment and try something out. What they tried was to preserve a suitable invariant: an invariant is a property of objects that remains unchanged even when a function is applied to them. Formally, let V and W be spaces of some kind, and let f: V⟶W be a function mapping V to W. A quantity L is an invariant of f if L(x) = L(f(x)) for every x ∈ V.

They hoped that by teaching the generator network to keep a suitable invariant of the function G, the function would preserve some higher-level features from the input space within the output space. However, maintaining lower-level metrics such as distance or angle as invariants would not help solve the task because, just as with the MSE metric, a perfect loss value from the generator model can only be achieved when G is the identity transform (that is, x = G(x) ∀x ∈ V ⇔ L(x, G(x)) = 0 ∀x ∈ V). Knowing this, which invariant metric do we train the generator network to keep?
Luckily, they recalled a technique first used in the domain of natural language processing called Latent-Space Encoding. When treating V as a vector space, any subspace V* ⊂ V can be considered a latent space of V. Latent spaces are useful to us for two reasons: due to the properties of the mapping functions between V and V*, and due to the relationships between vectors within them.
Images of the same size can be seen as vectors within a vector space called Image Space. Let V be a vector space such that V ≅ ℝ^(M×N×3). V can be spanned by M×N×3 standard basis vectors, each representing a single subpixel of the image vectors present in the space. This means that each image is a linear combination of the subpixel basis vectors spanning the Image Space V, but it also means those basis vectors are linearly independent: each one encodes just one subpixel, with no relationship whatsoever between the subpixels of the same image.
If we can "compress" this representation we could establish a relationship between the components of each image. We could map the images using a mapping S: V ⟶ V* to any subspace of V, allowing us to span the same images using a smaller basis within that subspace, establishing a one-to-many relationship between each basis vector in the latent space and the images in the original space. This mapping can also be seen as "discarding", because any mapping to a subspace of a smaller dimension, by definition, loses information on the vectors embedded within that subspace.
Methods for finding such transformations have been known for decades, the most notable of which is Principal Component Analysis. The recent research from the field of natural language processing emerged after realizing that encoding to a latent space can also be done using gradient-based methods such as neural networks: instead of slaving for years trying to find a direct mapping that minimizes an arbitrary metric, we can use a non-linear function learned by a neural network, optimized on that same metric.
Because we can learn a mapping that optimizes any metric, we can specifically learn a mapping that optimizes metrics within the latent space. For example, picking two images x₁, x₂ ∈ V, the difference x₁ − x₂ is just the pixel-wise difference between the images. But because the mapping S discards information about the images, the difference S(x₁) − S(x₂) means something entirely different. The smaller the dimension of the latent space V*, the more information we discard, and the more significant the information we retain. Of course, discarding most or all of the information would yield meaningless latent-space encodings, and we never know in advance which features will be learned (because neural networks are generally non-explainable, and we cannot plan in advance which minimum the network will converge towards). But as a rule of thumb, the more information we discard, the higher-level the features we maintain within the latent space.
Armed with this knowledge, we can train the generator network to optimize G, while adding a third network to optimize S. The invariant we train G to maintain is L(S(x₁) – S(x₂)) = L(S∘G(x₁) – S∘G(x₂)). The vector S(x₁) – S(x₂) is named the Transformation Vector between S(x₁) and S(x₂), hence the term Transformation Vector Learning. The metric L is likewise called the TraVeL metric.
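To make the idea concrete, here is a tiny NumPy sketch of what a transformation vector is and what the generator is asked to preserve. This is my own illustration, not the paper’s code; the random vectors simply stand in for encodings produced by the encoder S.

    import numpy as np

    def transformation_vector(s_a, s_b):
        # The transformation vector between two latent encodings S(a) and S(b).
        return s_a - s_b

    # Stand-ins for S(x1), S(x2) and for S(G(x1)), S(G(x2)):
    s_x1, s_x2 = np.random.randn(128), np.random.randn(128)
    s_gx1, s_gx2 = np.random.randn(128), np.random.randn(128)

    # TraVeL trains G so that this gap shrinks across the mini-batch:
    v_source = transformation_vector(s_x1, s_x2)
    v_target = transformation_vector(s_gx1, s_gx2)
    invariance_gap = np.linalg.norm(v_source - v_target)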

Model
I admit to deviating from the original design when writing the code for this model. The specification in the paper calls for a three-part model, with each part optimizing a different loss function: a generator, trying to draw the paintings; a discriminator, trying to distinguish between real drawings and artificial ones; and an encoder network, trying to maintain the TraVeL metric as an invariant of the generator function.
The deep dive into the model internals might be worth a post of its own one day, but from a bird’s-eye perspective it can be described as follows: the discriminator network is a fully convolutional network made of five convolutional layers, each layer having twice as many filters as the one before it. The output is then flattened and passed through a fully connected layer, yielding a single logit that decides whether a sample is original or artificial, based on a loss function we will describe further on. The encoder network has the same design, except that its fully connected layer outputs an embedding of the image within a latent space (a vector, versus a single logit for the discriminator). The generator uses a well-known design called U-Net, which is essentially a glorified convolutional auto-encoder: the image is passed down through successive convolution and pooling layers, then upscaled by the same factors in reverse order. The design is very good at learning both lower-level and higher-level feature maps for the convolution kernels and uses skip connections to prevent "forgetting" features extracted earlier in the network when producing the output image.
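For readers who want something more tangible than the verbal description, below is a rough Keras sketch of the discriminator and encoder shapes described above. The filter counts, kernel sizes, strides, activations, and latent dimension are my own assumptions rather than the exact values used in the project, and the U-Net generator is omitted for brevity.

    import tensorflow as tf
    from tensorflow.keras import layers

    def conv_tower(x, base_filters=64, n_layers=5):
        # Five convolutional layers, each with twice the filters of the previous one.
        for i in range(n_layers):
            x = layers.Conv2D(base_filters * (2 ** i), kernel_size=4,
                              strides=2, padding="same")(x)
            x = layers.LeakyReLU(0.2)(x)
        return layers.Flatten()(x)

    def build_discriminator(img_shape=(256, 256, 3)):
        inp = layers.Input(shape=img_shape)
        out = layers.Dense(1)(conv_tower(inp))          # single real/artificial logit
        return tf.keras.Model(inp, out, name="discriminator")

    def build_encoder(img_shape=(256, 256, 3), latent_dim=128):
        inp = layers.Input(shape=img_shape)
        out = layers.Dense(latent_dim)(conv_tower(inp))  # latent embedding instead of a logit
        return tf.keras.Model(inp, out, name="encoder")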
The above is what you might call a "standard design" for a GAN (plus the encoder network as an addition), but if you are new to GANs and want a better understanding there are many great sources available that explain their internal workings in detail. I do however wish to cover two aspects of the design in greater detail: the loss functions and the backpropagation.
Loss
TraVeLGAN has three different loss terms which it attempts to optimize. The first is the "standard loss" used for GAN discriminators: a binary cross-entropy loss (also called minimax loss in this context). Let G: ℝ^(M×N×3) ⟶ ℝ^(M×N×3) be the function learned by the generator and let D: ℝ^(M×N×3) ⟶ ℝ be the function learned by the discriminator. We denote the minimax loss as L_adv, defined as follows:

L_adv = 𝔼_y[log D(y)] + 𝔼_x[log(1 − D(G(x)))]

where y ranges over real samples from the output domain and x over photos from the input domain; the discriminator tries to maximize this quantity while the generator tries to minimize it.
The model, as designed and tuned in the research paper, was calibrated for a different dataset. When I started training my own implementation, the model would not converge. To fix this, I decided to change the adversarial loss function to the Wasserstein loss. The Wasserstein loss calculates a metric called the Wasserstein distance, which can be seen as measuring the "minimum" distance between two probability distributions. It is defined as:

W(ℙᵣ, ℙg) = inf_{γ ∈ Π(ℙᵣ, ℙg)} 𝔼_{(x,y)∼γ}[‖x − y‖]
The formula states that we consider every coupling γ between ℙᵣ and ℙg, that is, every joint distribution in the set Π(ℙᵣ, ℙg) whose marginals are ℙᵣ and ℙg. Each coupling can be read as a transport plan specifying how much probability mass to move from every point x of ℙᵣ to every point y of ℙg. We then pick the coupling γ that minimizes the expected distance ‖x − y‖ between paired samples; the Wasserstein metric itself is that minimal expected transport cost.
While a distance metric between probability distributions is much more robust than a distance metric between individual samples, and GANs optimizing a distance between distributions have historically reached better results than those optimizing a distance between samples, calculating the Wasserstein distance in the form shown above is intractable. Even when training with discrete samples drawn from the input space, searching over the possible couplings γ between the samples of ℙᵣ and ℙg within each mini-batch amounts to solving a costly optimal-transport problem, repeated for every update.
However, a rather obscure theorem called the Kantorovich-Rubinstein duality can be used to calculate the Wasserstein distance more feasibly. The theorem views the Wasserstein distance as the solution to an optimization problem. Such problems can be stated in two equivalent forms: as a minimization (the primal form) or as a maximization of a related objective (the dual form). The Kantorovich-Rubinstein duality establishes the dual form of the Wasserstein distance, providing an easier expression to estimate. In dual form, it can be written as:

W(ℙᵣ, ℙg) = sup_{‖f‖_L ≤ 1} ( 𝔼_{x∼ℙᵣ}[f(x)] − 𝔼_{x∼ℙg}[f(x)] )

where the supremum is taken over all 1-Lipschitz functions f.
In the context of our problem, the critic D plays the role of f: it is applied both to real samples drawn from ℙᵣ and to generated samples G(x) drawn from ℙg. Assuming the gradient descent process makes D converge towards the optimal function f, we can rewrite the Wasserstein loss as:

L_adv = 𝔼_{y∼ℙᵣ}[D(y)] − 𝔼_{x̃∼ℙg}[D(x̃)], where x̃ = G(x) for photos x from the input domain.
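As a concrete illustration, here is how the two Wasserstein loss terms might look in TensorFlow, written as quantities to be minimized. The sign conventions and function names are my own and not necessarily those of the project’s code.

    import tensorflow as tf

    def critic_wasserstein_loss(d_real, d_fake):
        # The critic maximizes E[D(real)] - E[D(fake)], i.e. minimizes the negation.
        return tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)

    def generator_wasserstein_loss(d_fake):
        # The generator tries to make D score its outputs like real samples.
        return -tf.reduce_mean(d_fake)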
The solution comes with a caveat: for the result to hold, the function f over which we take the supremum, approximated here by the critic D, must satisfy a constraint called Lipschitz continuity (specifically, 1-Lipschitz continuity). Because images in ℙg are produced by G before being scored by D, both G and D are constrained to be Lipschitz continuous in this implementation. By definition, a function f: ℝ ⟶ ℝ is K-Lipschitz continuous if there exists a constant K that satisfies:

|f(x₁) − f(x₂)| ≤ K·|x₁ − x₂| for every x₁, x₂ ∈ ℝ
The researchers who pioneered the use of the Wasserstein distance as a loss metric for GANs struggled to enforce the Lipschitz constraint on their model, resorting to methods such as weight clipping, which severely restricted the learning capacity of their models. However, a more recent insight regarding normalization methods suggests a solution in the form of a technique called Spectral Normalization. The spectral norm is the matrix norm induced by the vector L₂ norm; it measures the largest factor by which a matrix can stretch a vector. It can be proven that for any matrix W, the spectral norm of W is equal to the maximum singular value of W, denoted σ(W):

‖W‖₂ = max_{h≠0} ‖Wh‖₂ / ‖h‖₂ = σ(W)
Spectral normalization is simply the act of normalizing a matrix by its spectral norm:

W_SN = W / σ(W)
According to that insight, Lipschitz continuity can be established by spectrally normalizing all the weight matrices in the layers of G and D. Furthermore, a technique called "Power Iteration" can be used to calculate a fast approximation of the spectral norm: repeatedly multiplying a vector by W and Wᵀ converges towards the dominant singular vectors of W, so the approximation approaches the true norm over the course of training. By performing spectral normalization on the weights of the model after each mini-batch, the Lipschitz continuity is enforced, allowing us to compute the Wasserstein metric using its dual form.
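Here is a minimal NumPy sketch of power iteration for approximating the spectral norm; the iteration count, the random starting vector, and the flattening of higher-rank weight tensors are my own simplifications.

    import numpy as np

    def spectral_norm(w, n_iters=20):
        # Power iteration: alternate multiplying by W and W^T to find the dominant
        # singular direction; u^T W v then approximates sigma(W).
        w = w.reshape(w.shape[0], -1)      # flatten higher-rank tensors to a 2-D matrix
        v = np.random.randn(w.shape[1])
        for _ in range(n_iters):
            u = w @ v
            u /= np.linalg.norm(u)
            v = w.T @ u
            v /= np.linalg.norm(v)
        return u @ (w @ v)                 # approximation of the largest singular value

    def spectrally_normalize(w):
        return w / spectral_norm(w)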
The second loss term is called Margin Contrastive Loss in the machine-learning literature and is used with latent-space encoders to ensure they do not degenerate into learning the zero function (i.e., S(x) = 0 for every x). We choose a constant δ representing the minimum desired distance between vectors in the latent space, and for every pair of encodings S(x₁), S(x₂) we calculate the following loss term:

max(0, δ − ‖S(x₁) − S(x₂)‖₂)
Notice that the term S(x₁) − S(x₂) is the transformation vector between S(x₁) and S(x₂). Denoting that vector as v, we can rewrite the above term as:

max(0, δ − ‖v‖₂)
If the distance is greater than or equal to δ, the loss for the pair S(x₁), S(x₂) equals 0. Otherwise, the margin contrastive loss accrues additional value (up to δ for each pair of vectors), penalizing the encoder network in the process. This loss function is also called a Siamese loss because it iterates over pairs of vectors from the mini-batch. We denote the Siamese loss as L_S and formulate it as:

L_S = Σ_{i≠j} max(0, δ − ‖S(xᵢ) − S(xⱼ)‖₂)
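A small NumPy sketch of this Siamese term, written for clarity rather than speed; the batch layout and the value of δ are assumptions of mine.

    import numpy as np

    def siamese_loss(encodings, delta=1.0):
        # encodings: array of shape (batch, latent_dim), i.e. S(x_i) for the mini-batch.
        loss = 0.0
        n = len(encodings)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dist = np.linalg.norm(encodings[i] - encodings[j])
                loss += max(0.0, delta - dist)   # penalize pairs that collapse together
        return loss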
The third and final loss term is called the TraVeL loss, and it is the invariant metric we discussed in the previous section. Denoted L_TraVeL, the original formulation allows any distance metric to be used; I opted for a combination of cosine distance and Euclidean distance, preserving both angle and length. Like the Siamese loss, it sums over every pair of vectors in the mini-batch. Writing vᵢⱼ = S(xᵢ) − S(xⱼ) and v′ᵢⱼ = S(G(xᵢ)) − S(G(xⱼ)), it is formulated as:

L_TraVeL = Σ_{i≠j} [ (1 − cos(vᵢⱼ, v′ᵢⱼ)) + ‖vᵢⱼ − v′ᵢⱼ‖₂ ]
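And a matching NumPy sketch of the TraVeL term as I describe it above (cosine plus Euclidean distance between transformation vectors); the equal weighting of the two parts is an assumption.

    import numpy as np

    def travel_loss(enc_src, enc_gen):
        # enc_src[i] = S(x_i), enc_gen[i] = S(G(x_i)) for a mini-batch of images.
        loss = 0.0
        n = len(enc_src)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                v = enc_src[i] - enc_src[j]        # transformation vector in the source
                v_prime = enc_gen[i] - enc_gen[j]  # transformation vector after G
                cos = np.dot(v, v_prime) / (np.linalg.norm(v) * np.linalg.norm(v_prime) + 1e-8)
                loss += (1.0 - cos) + np.linalg.norm(v - v_prime)
        return loss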
Propagation
In standard GAN designs, the loss function is shared between the generator and the discriminator. The discriminator tries to minimize its loss function when classifying samples from the output domain, while the generator tries to maximize the same loss (or equivalently, minimize the negated loss).
When performing back-propagation in a GAN, the two networks are interconnected, meaning that gradients flowing out of the discriminator are back-propagated into the generator. Formally, we can view G as a composition of its individual layer functions. Let G_L be the number of layers the generator is composed of and let Gᵢ be the i-th layer within G (1 ≤ i ≤ G_L). G can then be rewritten as:

G = G_{G_L} ∘ G_{G_L−1} ∘ … ∘ G₂ ∘ G₁
Viewing D in a similar manner (with D_L layers Dⱼ), we can rewrite D∘G as:

D ∘ G = D_{D_L} ∘ … ∘ D₂ ∘ D₁ ∘ G_{G_L} ∘ … ∘ G₂ ∘ G₁
The back-propagation through the discriminator layers is done by the standard procedure, using the discriminator loss. When propagating into the generator, we must first propagate through the discriminator layers, but with respect to the negated loss. Denoting the discriminator loss L_D and the generator loss L_G = −L_D, we can use the chain rule for derivatives to calculate the gradient for a given generator layer Gᵢ:

∂L_G/∂Gᵢ = (∂L_G/∂D_{D_L}) · (∂D_{D_L}/∂D_{D_L−1}) · … · (∂D₁/∂G_{G_L}) · (∂G_{G_L}/∂G_{G_L−1}) · … · (∂G_{i+1}/∂Gᵢ)
Then, we discard the gradients computed for the discriminator layers (those are applied with respect to L_D, not L_G) and apply only the generator-layer gradients to the generator’s weight matrices.

The encoder network undergoes a process similar to that of the discriminator. However, because the generator also needs to optimize G according to the TraVeL loss computed through the encoder function S, we perform a second update to the generator weights on every mini-batch. Denoting the encoder layers similarly (S = S_{S_L} ∘ … ∘ S₁), we get:

∂L_TraVeL/∂Gᵢ = (∂L_TraVeL/∂S_{S_L}) · (∂S_{S_L}/∂S_{S_L−1}) · … · (∂S₁/∂G_{G_L}) · (∂G_{G_L}/∂G_{G_L−1}) · … · (∂G_{i+1}/∂Gᵢ)

We thus apply two gradient updates to the generator for each mini-batch. Viewed geometrically, the gradient descent step follows a path between the direction descending towards the minima of the classification (adversarial) loss and the direction descending towards the minima of the invariance (TraVeL) loss. To adhere to the intended learning rate, it is common to divide the combined gradient by two (i.e., calculate the mean gradient), which is easy to do in the common deep-learning frameworks (TensorFlow, PyTorch, etc.).
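To tie the propagation story together, here is a simplified TensorFlow training step showing the two generator updates per mini-batch. The optimizer wiring, loss weights, and update order are assumptions of mine rather than a faithful copy of the project’s code, and travel_loss_tf stands for a hypothetical TensorFlow version of the TraVeL loss sketched earlier (the Siamese term and the per-batch spectral normalization are omitted for brevity).

    import tensorflow as tf

    @tf.function
    def train_step(photos, paintings, G, D, S, g_opt, d_opt, s_opt):
        # 1) Critic update (Wasserstein loss, written as a quantity to minimize).
        with tf.GradientTape() as tape:
            fake = G(photos, training=True)
            d_loss = tf.reduce_mean(D(fake, training=True)) - tf.reduce_mean(D(paintings, training=True))
        d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                                  D.trainable_variables))

        # 2) First generator update, w.r.t. the negated critic loss.
        with tf.GradientTape() as tape:
            g_adv_loss = -tf.reduce_mean(D(G(photos, training=True), training=True))
        g_opt.apply_gradients(zip(tape.gradient(g_adv_loss, G.trainable_variables),
                                  G.trainable_variables))

        # 3) Encoder update plus the second generator update, w.r.t. the TraVeL loss.
        with tf.GradientTape(persistent=True) as tape:
            fake = G(photos, training=True)
            travel = travel_loss_tf(S(photos, training=True), S(fake, training=True))
        g_opt.apply_gradients(zip(tape.gradient(travel, G.trainable_variables),
                                  G.trainable_variables))
        s_opt.apply_gradients(zip(tape.gradient(travel, S.trainable_variables),
                                  S.trainable_variables))
        del tape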

Data
Adversarial networks are trained using supervised learning. That is – we need to provide data for both the source and target domains and labels for calculating the classification loss. Because we are using a GAN architecture, the labels are provided by the model itself (we know which images we give the discriminator, thus we can provide matching labels). The images, however, must be collected beforehand.
Because I wanted to convert my photos into watercolor paintings, I had to collect both landscape photos and paintings. For that purpose, I built two web crawlers that scan selected websites for matching images. The first crawler scanned Flickr, which has many high-resolution landscape photos available for download. The second crawler scanned the Ukiyo-e Database, which hosts over 200k scans of Japanese woodblock prints made over the course of the last three centuries. After some optimizations to both crawlers, I collected 10k images from each domain and saved them to my cloud storage.
Each image was normalized to the (-1, 1) range. Due to scanning problems with the prints (many of them had annotations, borders, artifacts, gridlines, and other unwanted elements), I determined empirically that the least interference was present at the center of each image. Therefore, I cropped each image, keeping the central 256×256 pixels and discarding the rest. The images were then grouped into batches of 32.
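A rough sketch of this preprocessing in TensorFlow; the file format, shuffle buffer, and exact cropping call are assumptions, since the actual pipeline lives in the linked code.

    import tensorflow as tf

    def preprocess(path):
        img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        img = tf.image.resize_with_crop_or_pad(img, 256, 256)   # keep the 256x256 center
        img = tf.cast(img, tf.float32) / 127.5 - 1.0            # normalize to (-1, 1)
        return img

    def make_dataset(paths, batch_size=32):
        return (tf.data.Dataset.from_tensor_slices(paths)
                .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
                .shuffle(1000)
                .batch(batch_size))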
Training
Four different instances of the model were deployed using Google Colaboratory on the available GPUs (I was usually allocated Tesla P100s). The reason behind the redundant deployment was that initial experiments showed the model tended to converge towards random local minima depending on the random initialization of the weights. These minima were stable in the color palette they represented, but most palettes were completely off, resulting in unnatural colors in the output-domain images. After 30 epochs, when the color palettes had stabilized, only the two best models were kept and trained for a total of 100 epochs.
After training, a small test set of images collected from my private albums was used to judge the quality of the paintings outputted by the generator. Each image was sliced into squares measuring 256×256 pixels each. The squares were batched together and provided as input for the generator. The output was reassembled, resulting in a new continuous image in the output domain.
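The slicing and reassembly can be done with a few reshapes. This NumPy sketch assumes the photo’s height and width are exact multiples of 256; in practice the edges would need padding or cropping first.

    import numpy as np

    def to_tiles(img, tile=256):
        h, w, c = img.shape
        tiles = (img.reshape(h // tile, tile, w // tile, tile, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, tile, tile, c))
        return tiles, (h // tile, w // tile)

    def from_tiles(tiles, grid, tile=256):
        rows, cols = grid
        c = tiles.shape[-1]
        return (tiles.reshape(rows, cols, tile, tile, c)
                     .transpose(0, 2, 1, 3, 4)
                     .reshape(rows * tile, cols * tile, c))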

During initial testing, many artifacts were present around the borders between the square patches, resulting in a visible grid segmenting the output. The effect was corrected using two methods: cropping the training-set images around the center, and a small modification to the generator’s training. Each image provided as input to the generator was split into four patches, and the generated output patches were concatenated in the same order before being fed to the discriminator. The discriminator learned to classify these stitched images as artificial samples, thus teaching the generator to somewhat correct the discontinuities between patches.
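A minimal TensorFlow sketch of that stitching trick; the 2×2 patch layout is my own reading of the description above.

    import tensorflow as tf

    def generate_stitched(G, images):
        # Split each image into four quadrants, translate each one separately,
        # then stitch the outputs back together before showing them to D.
        top, bottom = tf.split(images, 2, axis=1)
        tl, tr = tf.split(top, 2, axis=2)
        bl, br = tf.split(bottom, 2, axis=2)
        out = [G(p, training=True) for p in (tl, tr, bl, br)]
        stitched_top = tf.concat([out[0], out[1]], axis=2)
        stitched_bottom = tf.concat([out[2], out[3]], axis=2)
        return tf.concat([stitched_top, stitched_bottom], axis=1)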
Results
The two resulting networks can be compared to different artists, each with a slightly different style. The "best" network (as in, most faithful to the color scheme of the input domain) ended up drawing very accurate paintings of the images it was fed. The "second best" was drawing similarly accurate paintings, but with a color palette that puts more emphasis on red and pink tones.

Both networks generated very accurate watercolor drawings when fed photos from a distribution matching the input domain, that is, landscape and still-nature images (vistas, plants, trees, and the like). When fed photos from different distributions (animals, people, buildings, etc.), the results were less accurate, and the network could not cancel out the border artifacts between image patches. This behavior was to be expected, as the network was trained on a dataset of landscape images, similar in theme to the watercolor paintings from the target domain. In a sense, it would be like asking Rembrandt to recreate the art of Jackson Pollock: a painting could be made, but it’s unlikely anybody would like it.
Ultimately, I chose a photo of the Shinkyo bridge. Crossing the river that flows through the town of Nikko in northern Japan, it is considered one of the finest examples of Japanese bridge-building. The striking contrast between the red bridge, the blue river, and the green forest around them was beautifully captured by the model, painted in simple strokes that still preserve much of the intricate detail in the original picture.

The photo was printed on Japanese Washi paper with dimensions 188×110 cm, in a way intended to resemble the art of pressing a woodblock against a sheet of traditional Japanese paper. As they say, beauty is in the eye of the beholder. But if I may boast, it is one of the finest pieces of art I have managed to make so far.

Code
The code and the data for training the network are available at the following link: https://drive.google.com/drive/folders/1SMQEbd7HUzoOxb5kxMhtF1lDajwoFWNB