Looking for a stable equilibrium — Photo by Tyler Milligan on Unsplash

Generating synthetic financial time series with WGANs

A first experiment with Pytorch code

Mirko Savasta
8 min read · Jun 19, 2020

--

Introduction

Overfitting is one of the problems researchers encounter when they apply machine learning techniques to time series. It occurs because we train our models on the only time series path we know: the realized history. After all, we have only ever seen one state of the world: boring, isn’t it? =)

In particular, not every possible market regime or event is well represented in our data. Very unlikely events, such as periods of unusually high volatility, are normally underrepresented, and this prevents models from being trained well enough. So, when these trained algorithms face new scenarios they may fail, especially if we keep them in production long enough without any precautions.

Now, what if we could teach a model to generate new data for the same asset? What if we had a tool that produced alternative, realistic time series sharing the same statistical properties as the original ones? With such a tool we could augment our training set with data about unlikely events or rare market regimes; researchers would then be able to build models that generalize better, and practitioners could run superior simulations and trading backtests. In this regard, the authors of a recently published paper in the Journal of Financial Data Science show how training deep models on synthetic data mitigates overfitting and improves performance.

It turns out that Generative Adversarial Networks (GANs) could do the trick. GANs are a machine learning framework in which two neural nets, a generator (G) and a discriminator (D), play a game against each other: the former tries to trick the latter by generating fake data that resembles the real thing. When I say “game” I am referring to game theory here!

No player should win, however: GAN researchers are always looking for a Nash equilibrium between G and D where, hopefully (GANs can be hard to train), the generator has learnt how to produce fake but realistic samples and the discriminator’s best choice is to guess randomly.

Objective of this article

My purpose here is not to dive into GAN theory (there are great papers and resources online) but rather to briefly touch on the basics and share the results of the experiments I am carrying out with my teammate and friend Andrea Politano.

Though our final objective is to use one or more trained generators to produce many time series at once, we choose to start simple and proceed gradually. We first check whether GANs can learn series whose data generating process (DGP) we fully understand. Of course, few, if any, understand the DGP of financial asset returns; hence we start by generating fake sine waves.

Our model

We first code a Pytorch dataset to produce different sine functions. Pytorch datasets are convenient utilities which make data loading easier and improve code readability. Check them out here.

Pytorch dataset to generate sines
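The original gist is embedded on Medium; a minimal sketch of such a dataset could look like the one below. The frequency, phase and amplitude ranges are illustrative assumptions, not necessarily the ones we used.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SineDataset(Dataset):
    """Each item is a sine wave with random frequency, phase and amplitude."""

    def __init__(self, n_samples=1000, seq_len=100):
        self.n_samples = n_samples
        self.seq_len = seq_len

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        t = np.linspace(0, 2 * np.pi, self.seq_len)
        freq = np.random.uniform(1.0, 3.0)     # assumed range
        phase = np.random.uniform(0, 2 * np.pi)
        amp = np.random.uniform(0.5, 1.0)      # assumed range
        wave = amp * np.sin(freq * t + phase)
        return torch.tensor(wave, dtype=torch.float32)
```

Wrapped in a `DataLoader`, this yields batches of "real" samples for the critic.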

We then opt for a Wasserstein GAN (WGAN), as this particular class of models has solid theoretical foundations and significantly improves training stability; in addition, the loss correlates with the generator’s convergence and sample quality, which is extremely useful because researchers do not need to constantly inspect the generated samples to see whether the model is improving. Finally, WGANs are more robust to mode collapse and architectural changes than standard GANs.

How do WGANs differ from GANs?

The loss function in a standard GAN quantifies the distance between the training and the generated data distributions. In fact, GANs are based on the idea that as the training advances, the model’s parameters are gradually updated and the distribution learnt by G converges to the real data distribution. This convergence must be as smooth as possible and it is essential to remember that the way this sequence of distributions converges depends on how we compute the distance between each pair of distributions.

Now, WGANs differ from standard GANs in how they quantify this distance. Regular GANs use the Jensen–Shannon (JS) divergence, whereas WGANs use the Wasserstein distance, which has nicer properties. In particular, the Wasserstein metric is continuous and has well-behaved gradients everywhere, and therefore allows a smoother convergence in distribution even when the supports of the true and generated distributions lie on non-overlapping low-dimensional manifolds. Check out this good article to explore more details.

Image taken directly from the original WGAN paper. It shows how the discriminator of a standard GAN saturates and results in vanishing gradients, whereas the WGAN critic provides very clean gradients on all parts of the space.

Lipschitz constraint

The plain Wasserstein distance is rather intractable; hence the need for a smart trick, the Kantorovich–Rubinstein duality, to overcome the obstacle and obtain the final form of our problem. As the theory goes, the function f_w in the new form of our Wasserstein metric must be K-Lipschitz continuous. f_w is learnt by our critic and belongs to a family of parametrized functions; w represents the set of parameters, i.e. the weights.

Final form of the objective function, the Wasserstein metric. The first expectation is computed with respect to the real distribution, whereas the second is computed with respect to the noise distribution. z is the latent tensor and g_theta(z) represents the fake data produced by G. This objective function shows that the critic does not directly output a probability. Instead, it is trained to learn a K-Lipschitz continuous function that helps compute the final form of our Wasserstein distance.
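Written out, the dual form described in the caption reads (up to the multiplicative constant K given by the Kantorovich–Rubinstein duality):

```latex
W(P_r, P_\theta) \;=\; \sup_{\lVert f_w \rVert_L \le K}\;
\mathbb{E}_{x \sim P_r}\!\left[ f_w(x) \right]
\;-\;
\mathbb{E}_{z \sim p(z)}\!\left[ f_w\!\left( g_\theta(z) \right) \right]
```

Here P_r is the real data distribution, P_theta the generator's distribution, and the supremum runs over K-Lipschitz functions, which is exactly the constraint the critic must respect.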

We say that a differentiable function is K-Lipschitz continuous if and only if it has gradients with norm at most K everywhere. K is called the Lipschitz constant.

Lipschitz continuity is a promising technique for improving GAN training. Unfortunately, its implementation remains challenging; this is an active area of research and there are several ways to enforce the constraint. The original WGAN paper proposes weight clipping, but we adopt a “gradient penalty” (GP) approach, as weight clipping can cause capacity problems and requires additional parameters to define the space where the weights lie. The GP approach, on the other hand, enables more stable training with almost no hyper-parameter tuning.
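Concretely, the GP term from the WGAN-GP paper penalizes deviations of the critic's gradient norm from 1 on random interpolations between real and fake batches. A common Pytorch implementation looks roughly like this (function name and defaults are mine, not from our codebase):

```python
import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    """WGAN-GP penalty: push the critic's gradient norm towards 1
    on random interpolations between real and fake samples."""
    batch_size = real.size(0)
    # One random interpolation coefficient per sample
    eps = torch.rand(batch_size, 1, device=device).expand_as(real)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    grads = grads.view(batch_size, -1)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```

The penalty (scaled by a weight, typically 10) is added to the critic's loss; `create_graph=True` is what lets gradients flow through the penalty during the critic's backward pass.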

A first architecture

Our first architecture comes from this thesis which, among other things, presents a WGAN-GP architecture to produce univariate synthetic financial time series (it’s always a good idea not to start from scratch). The proposed architecture is a mix of linear and convolutional layers in both G and D, and it works out of the box. Unfortunately, with this setting the training does not look very stable, and D is implemented with batch normalization (BN) even though the original WGAN-GP paper clearly prescribes “no critic batch normalization”.

So we get rid of BN and go for spectral normalization (SN). In simple terms, SN ensures that D is K-Lipschitz continuous: it acts on each layer of the critic to bound its Lipschitz constant. Refer to this great article and this paper for further details.
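SN is available in Pytorch out of the box; a minimal illustration on a single layer:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# spectral_norm divides the layer's weight by an estimate of its largest
# singular value on every forward pass (one power-iteration step), which
# bounds the layer's Lipschitz constant to roughly 1.
layer = spectral_norm(nn.Linear(64, 64))

x = torch.randn(8, 64)
y = layer(x)  # forward pass uses the normalized weight
```

Wrapping every linear/convolutional layer this way bounds the Lipschitz constant of the whole network, since the composition of K-Lipschitz functions is itself Lipschitz.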

Though with SN one could theoretically drop the GP term in the loss (SN and GP should be considered alternative ways to enforce Lipschitz continuity), our tests did not support this proposition. Nevertheless, SN improves training stability and speeds up convergence. We therefore adopt it in both G and D.

Finally, the original WGAN-GP paper suggests using Adam optimizers for both G and D, but we empirically found that RMSprop better suited our needs.

Pytorch code for Critic and Generator
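Our actual networks mix linear and convolutional layers; the sketch below is a simplified, linear-only stand-in (layer sizes are illustrative) that conveys the overall shape: SN on every layer, LeakyReLU activations, and no batch norm in the critic.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class Generator(nn.Module):
    """Maps a latent vector to a time series of length seq_len."""

    def __init__(self, latent_dim=100, seq_len=100):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(latent_dim, 128)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(128, 256)),
            nn.LeakyReLU(0.2),
            nn.Linear(256, seq_len),
        )

    def forward(self, z):
        return self.net(z)

class Critic(nn.Module):
    """Scores a series with a single real number; no batch norm."""

    def __init__(self, seq_len=100):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(seq_len, 256)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(256, 128)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(128, 1)),
        )

    def forward(self, x):
        return self.net(x)
```

Note that the critic's output is an unbounded score, not a probability, so there is no sigmoid at the end.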

To give it a try you will also need a piece of code that computes the losses, performs the backward pass, updates the model weights, saves logs and training samples, etc. You will find it below. If you need more details about how to use this code, please refer to this repo.

Trainer Code
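The full trainer gist is embedded on Medium; a compact sketch of one WGAN-GP training iteration (gradient penalty inlined, hyper-parameters illustrative) might look like this:

```python
import torch

def critic_loss(D, real, fake, gp_weight=10.0):
    """WGAN-GP critic loss (to minimize): E[D(fake)] - E[D(real)] + GP."""
    eps = torch.rand(real.size(0), 1, device=real.device).expand_as(real)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    gp = ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
    return D(fake).mean() - D(real).mean() + gp_weight * gp

def train_step(G, D, real, opt_g, opt_d, latent_dim=100, n_critic=5):
    """One WGAN-GP iteration: n_critic critic updates, then one G update."""
    for _ in range(n_critic):
        z = torch.randn(real.size(0), latent_dim, device=real.device)
        loss_d = critic_loss(D, real, G(z).detach())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    z = torch.randn(real.size(0), latent_dim, device=real.device)
    loss_g = -D(G(z)).mean()  # G maximizes the critic's score on fakes
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

A real trainer would also handle logging, checkpointing and sample saving, which is what the repo code adds on top of this loop.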

Results

The good news is that our model learns to generate realistic sine samples; the bad news is that sines are not asset returns! =) So, what’s the next step to increase our trust in this model?

Different realistic sine waves generated with a trained model

Why don’t we get rid of simple sines and feed the GAN with samples coming from an ARMA process whose parameters we know?!

We choose a simple ARMA(1, 1) process with AR parameter p=0.7 and MA parameter q=0.2, generate real samples with a new Pytorch dataset and train the model.

Pytorch ARMA dataset
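A sketch of such a dataset (the burn-in length and other details are assumptions of mine):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ARMADataset(Dataset):
    """Samples from an ARMA(1, 1) process:
    x_t = p * x_{t-1} + e_t + q * e_{t-1}, with e_t ~ N(0, 1)."""

    def __init__(self, n_samples=1000, seq_len=100, p=0.7, q=0.2, burn_in=50):
        self.n_samples, self.seq_len = n_samples, seq_len
        self.p, self.q, self.burn_in = p, q, burn_in

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        n = self.seq_len + self.burn_in
        e = np.random.randn(n)
        x = np.zeros(n)
        for t in range(1, n):
            x[t] = self.p * x[t - 1] + e[t] + self.q * e[t - 1]
        # Drop the burn-in so the returned series starts near stationarity
        return torch.tensor(x[self.burn_in:], dtype=torch.float32)
```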

We now generate a hundred fake samples, estimate p and q, and have a look at the results below. p and q are the only parameters of our DGP.
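In practice one would fit both parameters with a maximum-likelihood routine such as the ARIMA model in statsmodels. As a self-contained illustration, here is a simple method-of-moments estimate of p alone, using the fact that for an ARMA(1, 1) process the autocorrelations satisfy rho(k) = p * rho(k - 1) for k >= 2 (this helper is mine, not from our codebase):

```python
import numpy as np

def estimate_arma11_ar(x):
    """Moment estimate of the AR coefficient p of an ARMA(1, 1) series.
    Since rho(2) = p * rho(1), p can be read off as rho(2) / rho(1)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    # Sample autocorrelations at lags 1 and 2
    rho1 = (np.dot(x[:-1], x[1:]) / len(x)) / var
    rho2 = (np.dot(x[:-2], x[2:]) / len(x)) / var
    return rho2 / rho1
```

Applying an estimator like this (or the full MLE) to each of the hundred generated series yields the distribution of parameter estimates shown below.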

Synthetic ARMA(1, 1) samples generated with a trained model
Distributions of the parameter estimates on the synthetic ARMA samples. As can be seen, the model learns the DGP quite well: the modes of these distributions are near the true parameters of the DGP, p=0.7 and q=0.2.

These results strengthen our trust and encourage further research.

Bonus Section

How should losses look?

During our first experiments, we kept asking ourselves what to expect from our losses. Of course, everything depends on the training data, the optimization algorithm and the learning rate you choose, but we empirically found that successful training runs are characterized by losses that, though unstable in the beginning, then converge gradually towards lower values. All other things being equal, lowering the learning rate may stabilize training.

Generator’s loss example
Critic’s loss example

How to check for mode collapse?

To check for mode collapse, we use a different latent tensor each time we generate fake samples during training. This lets us check what happens when we sample different portions of the noise space. If you sample different random tensors and G keeps producing the same series, you are experiencing mode collapse. =(
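A quick heuristic along these lines can be automated; the sketch below (function name and threshold are arbitrary choices of mine) flags a generator whose outputs barely vary across latent tensors:

```python
import torch

def looks_collapsed(G, latent_dim=100, n_probes=8, tol=1e-3):
    """Sample several different latent tensors; if G maps them all to
    (almost) the same series, mode collapse is likely."""
    G.eval()
    with torch.no_grad():
        z = torch.randn(n_probes, latent_dim)
        fakes = G(z)
    # Mean pairwise distance between generated series; ~0 means every
    # latent tensor is mapped to (almost) the same output
    dists = torch.cdist(fakes, fakes)
    off_diag = dists.sum() / (n_probes * (n_probes - 1))
    return off_diag.item() < tol
```

Calling this periodically during training gives an automatic early warning alongside visual inspection of the samples.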

Do GANs provide an advantage over other generating mechanisms?

This is an important question I would like to answer with further experiments: GANs’ complexity must be justified by significantly better performance to be tolerated.

GP Limits

We have to mention that, according to the authors of this paper, the standard GP approach may not be the best implementation of Lipschitz regularization. In addition, spectral normalization may unnecessarily constrain the critic and impair its ability to provide sound gradients to G. The authors propose an alternative method to use when the standard approaches fail. We will experiment with their suggestions soon.

Why should we train D more than G?

A well trained D is vital in a WGAN setting because the critic estimates the Wasserstein distance between the real and fake distributions. An optimal critic will provide good estimates of our distance metric and this, in turn, will cause the gradients to be healthy!

Bibliography

  • [2019] Towards Efficient and Unbiased Implementation of Lipschitz Continuity in GANs — Zhou, Shen et al.
  • [2019] Enriching Financial Datasets with Generative Adversarial Networks — de Meer Pardo
  • [2018] Spectral Normalization for Generative Adversarial Networks — Miyato, Kataoka et al.
  • [2017] Improved Training of Wasserstein GANs — Gulrajani, Ahmed et al.
  • [2017] Wasserstein GAN — Arjovsky, Chintala et al.
