cCorrGAN: Conditional Correlation GAN for Learning Empirical Conditional Distributions in the…

Generative Modelling in Finance: Motivation and Challenges

Sharing financial data in financial services organisations is severely limited due to privacy, regulatory and business requirements, both internally (between functions of the organisation, or using data before access is granted) and externally (with research organisations) [6]. Methods that generate realistic synthetic data are therefore essential, especially where there is a lack of historical data, and where data anonymisation is paramount. For example, there is a dearth of stock market time series data for a variety of market conditions, essentially because there is only one ‘history’. This makes it difficult to perform stress-testing of portfolios subject to different scenarios.

Generating realistic datasets requires defining an appropriate distance between datasets and measuring how close they are. Financial data can be broadly categorised into retail banking data, which are typically tabular, and market microstructure data, which are typically time series. Here we focus on generating market microstructure data approximated as empirical financial correlation matrices (CMs). Generating multivariate time series data is a much more challenging problem, as it involves capturing the dependence structure between univariate time series as well as other features such as autocorrelation and distributional properties [1]. Empirical CMs are central to the study of multivariate financial time series and are important for managing risk in asset allocation. CMs have many applications in finance, such as testing the robustness of trading strategies, stress-testing portfolios, or objectively comparing empirical methods. Synthetic CMs should also exhibit the empirical properties of real data, known as stylised facts (SFs). So far, SFs have only been evaluated qualitatively and are yet to be condensed into a single metric.

Mathematical Intuition of Financial Correlation Matrices

We shall gain an intuition of financial correlation matrices by turning to Markowitz’s theory of optimal portfolios. This theory suggests that the optimal portfolio can be found by minimising its risk for a given total return by optimising the weight of each asset (i.e. the amount of capital invested in that asset). Formally [4]:

Image by Author

Using the total variance to describe the total risk of a portfolio [4]:

Image by Author
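As a reference point, the portfolio return and variance described here can be written in a standard form. The notation is an assumption and may differ from the original figures: $w_i$ is the weight of asset $i$, $R_i$ its expected return, $C_{ij}$ the covariance between assets $i$ and $j$, and $\lambda_k$, $\mathbf{v}_k$ the eigenvalues and orthonormal eigenvectors of $C$.

```latex
R_P = \sum_{i=1}^{N} w_i R_i,
\qquad
\sigma_P^2 = \sum_{i,j=1}^{N} w_i \, C_{ij} \, w_j
           = \sum_{k=1}^{N} \lambda_k \left( \mathbf{w} \cdot \mathbf{v}_k \right)^2
```

The last equality shows why a low-variance portfolio places most of its weight along eigenvectors with small eigenvalues: each direction $\mathbf{v}_k$ contributes to the total variance in proportion to $\lambda_k$.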

The optimal balance of assets in the portfolio is the one that minimises the variance for a fixed portfolio return. The least risky portfolio has a large weight on the eigenvectors of the covariance matrix that have the smallest eigenvalues. This can be interpreted as investing the most capital in those of the N uncorrelated portfolios (the eigenvectors of the covariance matrix) that have the smallest variances (the associated eigenvalues) [4].

Random Matrix Theory (RMT) is a branch of probabilistic analysis applied to matrices of random variables, and many of its results are useful when applied to financial CMs. In [4], the authors use RMT to clean the empirical CMs by partially removing the noise. The correlations are more likely to be noisy, and less likely to contain useful information, if the ratio of T to N is small, i.e. if the length T of the time series is small compared to the number N of assets in the portfolio. Obtaining correlation matrices this way requires computing them and then applying a ‘cleaning’ procedure. Uncleaned CMs underestimate the true risk, and while cleaned CMs account for risk more accurately in optimised portfolios, the true risk is always greater than the estimated risk.

Sampling correlation matrices can instead be a more effective way of generating realistic CMs. Generative Adversarial Networks (GANs) have been successfully and extensively applied to computer vision tasks (image and video generation) and to language and audio processing (text, speech, and music generation). However, until very recently, very little had been explored in the way of generative deep learning in finance. CorrGAN [2] is a recent attempt to address this gap: a GAN trained to learn the distribution of correlation matrices, so that realistic financial CMs can be sampled.
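The RMT cleaning idea mentioned above can be sketched as eigenvalue clipping: eigenvalues below the upper edge of the Marchenko-Pastur spectrum are treated as noise and flattened to their mean. This is a minimal illustration, not the exact procedure of [4]; the function names and the one-factor toy returns are assumptions made for the example.

```python
import numpy as np

def marchenko_pastur_upper_edge(n_assets, n_obs):
    """Upper edge of the Marchenko-Pastur spectrum for unit-variance i.i.d. data."""
    return (1.0 + np.sqrt(n_assets / n_obs)) ** 2

def clip_correlation(corr, n_obs):
    """Flatten eigenvalues below the MP edge to their mean, preserving the trace."""
    n_assets = corr.shape[0]
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    noisy = eigvals < marchenko_pastur_upper_edge(n_assets, n_obs)
    cleaned = eigvals.copy()
    if noisy.any():
        cleaned[noisy] = eigvals[noisy].mean()
    rebuilt = (eigvecs * cleaned) @ eigvecs.T
    # Rescale so that the diagonal is exactly one again.
    scale = 1.0 / np.sqrt(np.diag(rebuilt))
    return rebuilt * np.outer(scale, scale)

rng = np.random.default_rng(0)
n_assets, n_obs = 50, 100  # small T/N ratio, so the estimate is noisy
# Hypothetical returns: one common "market" factor plus idiosyncratic noise.
market = rng.standard_normal((n_obs, 1))
returns = 0.5 * market + rng.standard_normal((n_obs, n_assets))

empirical = np.corrcoef(returns, rowvar=False)
cleaned = clip_correlation(empirical, n_obs)

edge = marchenko_pastur_upper_edge(n_assets, n_obs)
n_signal = int((np.linalg.eigvalsh(empirical) > edge).sum())
print(f"MP upper edge: {edge:.2f}, eigenvalues above it: {n_signal}")
```

Only the eigenvalues above the edge, here dominated by the market factor, are kept as signal; the rest of the spectrum is replaced by a single average value.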

RMT and related work on financial correlation matrices do, however, provide SFs that describe properties of financial CMs and help us understand them better. These are [2, 3]:

  1. The distribution of pairwise correlations is significantly shifted to the positive.
  2. Eigenvalues follow the Marchenko-Pastur distribution, except for a very large first eigenvalue (corresponding to the ‘market’), and approximately 5% of other large eigenvalues (corresponding to natural clusters within the market such as industries).
  3. Perron-Frobenius property: The first eigenvector has positive entries.
  4. The correlations follow a hierarchical structure, as shown by Mantegna [5].
  5. The Minimum Spanning Tree (MST) corresponding to the CM is scale-free.

These stylised facts allow for a qualitative evaluation of the similarity between synthetic CMs and empirical CMs. This list is not exhaustive, and examining the latent space learned by a GAN, for example, could reveal more characteristics of financial correlations that were previously unknown.
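Some of these stylised facts are easy to check numerically on a single matrix. The sketch below looks at the mean pairwise correlation, the share of the first eigenvalue, the sign of the first eigenvector, and the node degrees of the Minimum Spanning Tree. The function name, the distance sqrt(2(1 - rho)) and the one-factor toy returns are assumptions made for illustration; real market data would be needed to reproduce the facts themselves.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def stylised_fact_summary(corr):
    """Summarise a few stylised facts for one correlation matrix."""
    n_assets = corr.shape[0]
    pairwise = corr[np.triu_indices(n_assets, k=1)]

    eigvals, eigvecs = np.linalg.eigh(corr)  # ascending order
    first_vec = eigvecs[:, -1]
    if first_vec.sum() < 0:  # the sign of an eigenvector is arbitrary
        first_vec = -first_vec

    # Correlation-based distance; zero entries are read as "no edge" by SciPy,
    # so the diagonal is set to exactly zero.
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    mst = minimum_spanning_tree(dist).toarray()
    degrees = ((mst + mst.T) > 0).sum(axis=1)

    return {
        "mean_pairwise_correlation": float(pairwise.mean()),
        "top_eigenvalue_share": float(eigvals[-1] / eigvals.sum()),
        "first_eigenvector_positive": bool((first_vec > 0).all()),
        "mst_degrees": degrees,
    }

rng = np.random.default_rng(1)
n_obs, n_assets = 252, 30
# Hypothetical returns driven by a single common factor.
market = rng.standard_normal((n_obs, 1))
returns = market + rng.standard_normal((n_obs, n_assets))

summary = stylised_fact_summary(np.corrcoef(returns, rowvar=False))
for name, value in summary.items():
    print(name, value)
```

Applying the same summary to empirical and synthetic matrices gives a rough, side-by-side check of facts 1, 2, 3 and 5.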

CorrGAN

CorrGAN is the first model that demonstrates the results of sampling financial CMs using GANs and verifies how realistic they are. It is possible to generate univariate time series using conditional GANs, but these ignore the dependence between multiple assets, so they would be unsuitable for portfolio management, where multivariate correlations need to be considered. In [2], the mathematical setup of the elliptope (the set of correlation matrices) is described:

Image by Author
Image by Author
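For reference, the elliptope has a standard definition as the set of symmetric, positive semi-definite matrices with unit diagonal. The notation below is an assumption and may differ from that used in [2]; in the 3 by 3 case, $a$, $b$ and $c$ denote the three off-diagonal coefficients.

```latex
\mathcal{E}_N = \left\{ C \in \mathbb{R}^{N \times N} \;:\;
    C = C^{\top},\quad C \succeq 0,\quad C_{ii} = 1 \ \text{for } i = 1, \dots, N \right\}

% 3 x 3 case: a point (a, b, c) in [-1, 1]^3 lies in the elliptope if and only if
\det \begin{pmatrix} 1 & a & b \\ a & 1 & c \\ b & c & 1 \end{pmatrix}
  = 1 + 2abc - a^2 - b^2 - c^2 \;\geq\; 0
```

In the 3 by 3 case the elliptope is therefore a convex body in three dimensions, which is what makes it possible to plot it directly.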

The N by N case is both statistically and computationally much more difficult, and requires High Performance Computing (HPC) infrastructure before it can become a staple of the quantitative finance toolbox. It is harder to assess the quality of the generated matrices qualitatively, and harder to train because of data inefficiency. The data inefficiency arises because the order of the stocks is irrelevant: for N assets, up to N! matrices describe the same correlation structure. A GAN requires diversity in its training data to ensure full coverage of the data distribution. Enforcing permutation invariance enables more variety per training update by avoiding the redundancy induced by matrix equivalence. In the CorrGAN paper, this is engineered by using a hierarchical clustering algorithm to induce a permutation on the CM:

Image by Author
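The idea of using hierarchical clustering to pick a representative ordering can be sketched as follows. This is an illustration under assumptions rather than the exact procedure of [2]: the function name `canonical_order`, the average linkage method and the distance sqrt(2(1 - rho)) are choices made here, and the block-structured matrix is toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def canonical_order(corr, method="average"):
    """Reorder a correlation matrix by the leaf order of a hierarchical clustering."""
    dist = np.sqrt(np.clip(2.0 * (1.0 - corr), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # condensed form expected by linkage
    order = leaves_list(linkage(condensed, method=method))
    return corr[np.ix_(order, order)], order

# Toy matrix with two clusters of four assets each.
labels = np.array([0] * 4 + [1] * 4)
block = np.where(labels[:, None] == labels[None, :], 0.8, 0.1)
np.fill_diagonal(block, 1.0)

# Present the assets in a random order, as they might arrive in practice.
rng = np.random.default_rng(2)
shuffle = rng.permutation(len(labels))
shuffled = block[np.ix_(shuffle, shuffle)]

sorted_corr, order = canonical_order(shuffled)
recovered_labels = labels[shuffle][order]
print(recovered_labels)  # members of the same cluster end up next to each other
```

Whatever order the assets arrive in, the clustering groups similar assets together, so the many permuted versions of a matrix are mapped to a much smaller set of representatives before training.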

The architecture follows the Deep Convolutional GAN (DCGAN), chosen because CNNs exhibit attributes that also work well for financial CMs according to the SFs. These are local shift invariance, locality, and hierarchical compositionality (ability to learn hierarchical structure of the correlations).

Figure 1: Sampling 3×3 correlations from a GAN. The figure is from [2].

In Figure 1, the blue points represent 10,000 points sampled uniformly from the three-dimensional elliptope using the onion method. The orange points are true empirical matrices computed from daily returns over 252 business days in one year, by randomly sampling three stocks without replacement from the S&P 500. The green points are synthetic matrices generated by sampling from the trained CorrGAN. Already, it is possible to see that the synthetic distribution closely matches the empirical one, visually verifying one of the stylised facts: that financial CMs concentrate around high, positive values.

The results from [2] show that CorrGAN generates realistic samples according to the SFs, by plotting the first eigenvector’s entries, the distribution of correlations, visual representations of randomly selected CMs, a log-log plot of the degree distribution of nodes in the Minimum Spanning Trees (MSTs), and the distribution of eigenvalues. However, the generated samples are not strictly correlation matrices, as the leading diagonal elements are not exactly equal to 1. The matrices are also not exactly positive semi-definite, having small negative eigenvalues. Post-processing projection methods are required to find the nearest true correlation matrix with respect to the Frobenius norm.
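One well-known projection of this kind is Higham's alternating projections method, which alternates between the positive semi-definite cone and the set of unit-diagonal matrices, with a Dykstra-style correction. The source does not say which projection is used alongside CorrGAN, so the sketch below is an assumption for illustration, and the 3 by 3 input is a made-up example of an almost-valid matrix.

```python
import numpy as np

def project_psd(mat):
    """Project a symmetric matrix onto the positive semi-definite cone."""
    sym = (mat + mat.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

def nearest_correlation(mat, max_iter=500, tol=1e-9):
    """Approximate the Frobenius-nearest correlation matrix by alternating projections."""
    y = (mat + mat.T) / 2.0
    correction = np.zeros_like(y)  # Dykstra correction term
    for _ in range(max_iter):
        r = y - correction
        x = project_psd(r)          # step 1: remove negative eigenvalues
        correction = x - r
        y_next = x.copy()
        np.fill_diagonal(y_next, 1.0)  # step 2: restore the unit diagonal
        change = np.linalg.norm(y_next - y, ord="fro")
        y = y_next
        if change < tol * max(1.0, np.linalg.norm(y, ord="fro")):
            break
    return y

# Hypothetical generator output: diagonal slightly off one, slightly indefinite.
raw = np.array([
    [1.01, 0.95, 0.30],
    [0.95, 0.99, 0.60],
    [0.30, 0.60, 1.00],
])
fixed = nearest_correlation(raw)

print("smallest eigenvalue before:", np.linalg.eigvalsh(raw).min())
print("smallest eigenvalue after: ", np.linalg.eigvalsh(fixed).min())
```

The result has an exact unit diagonal and is positive semi-definite up to the chosen tolerance, while staying close to the raw sample.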

CorrGAN’s limitations are that the empirical and synthetic distributions do not match perfectly and, more importantly, that results are unstable across trained models. This is a well-known defect of GANs: they are not good at capturing the different modes of the distribution. This means that synthetic matrices may not span the whole subspace of realistic financial CMs, but only a restricted subspace, due to mode collapse during training.

Conditional CorrGAN: cCorrGAN

Earlier this year, the authors of [2] published cCorrGAN [3], an improvement over CorrGAN which implements a conditional GAN given three broad regimes of market stability: Stressed, Steady, Rallying. Using this newly labelled training set, the performance can be evaluated qualitatively by checking the samples against the SFs, and with PCA projections to check for well-defined clusters and a visual match between the empirical and synthetic distributions. Quantitatively, one can use the Wasserstein metric to compare the empirical and synthetic distributions. Finally, the authors of [3] show how cCorrGAN can be applied to test the robustness of investment strategies and portfolio allocation methods, and draw insights on which of these methods benefit more from certain market conditions (i.e. the market regime).
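As a rough illustration of such a quantitative comparison, the sketch below computes the one-dimensional Wasserstein distance between the distributions of pairwise correlation coefficients in two sets of matrices. This is a simplification chosen for the example; [3] may apply the metric differently. The helper names and the one-factor toy data standing in for empirical and synthetic sets are assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def pairwise_coefficients(matrices):
    """Collect the upper-triangular coefficients of a batch of correlation matrices."""
    n_assets = matrices.shape[1]
    rows, cols = np.triu_indices(n_assets, k=1)
    return matrices[:, rows, cols].ravel()

def sample_matrices(loading, n_matrices, rng, n_obs=252, n_assets=10):
    """Hypothetical stand-in for a set of CMs: one-factor returns with a given loading."""
    mats = []
    for _ in range(n_matrices):
        market = rng.standard_normal((n_obs, 1))
        returns = loading * market + rng.standard_normal((n_obs, n_assets))
        mats.append(np.corrcoef(returns, rowvar=False))
    return np.stack(mats)

rng = np.random.default_rng(3)
empirical = pairwise_coefficients(sample_matrices(0.5, 200, rng))
close_synthetic = pairwise_coefficients(sample_matrices(0.55, 200, rng))
far_synthetic = pairwise_coefficients(sample_matrices(1.5, 200, rng))

d_close = wasserstein_distance(empirical, close_synthetic)
d_far = wasserstein_distance(empirical, far_synthetic)
print(f"close generator: {d_close:.3f}, far generator: {d_far:.3f}")
```

A generator whose samples resemble the reference set yields a smaller distance than one producing much more strongly correlated matrices, which gives a single number to track alongside the qualitative checks.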

For future research, generating correlation matrices in a fully end-to-end manner is a desirable goal and a common practice in deep learning, and would remove the need for post-training projections. Equally, while cCorrGAN generates financial CMs, generating realistic multivariate financial time series remains an open challenge for researchers.

Notes

  1. All inline mathematics are written by the author, and adapted from the referenced articles.

References:

[1] Markowitz’ Theory of Optimal Portfolios

E. J. Elton and M. J. Gruber, Modern Portfolio Theory and Investment Analysis (1995), J. Wiley and Sons, New York; H. Markowitz, Portfolio Selection: Efficient Diversification of Investments (1959), J. Wiley and Sons, New York.

[2] CorrGAN

G. Marti, CorrGAN: sampling realistic financial correlation matrices using generative adversarial networks (2020), ICASSP 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8459–8463.

[3] cCorrGAN

V. Goubet, G. Marti and F. Nielsen, cCorrGAN: Conditional Correlation GAN for Learning Empirical Conditional Distributions in the Elliptope (2021), Geometric Science of Information, pp. 613–620, Springer International Publishing.

[4] Random Matrix Theory and Financial Correlations

L. Laloux, P. Cizeau, M. Potters, and J-P. Bouchaud, Random matrix theory and financial correlations (2000), International Journal of Theoretical and Applied Finance, vol. 3, no. 03, pp. 391–397.

[5] Hierarchical Structure of Financial Correlations

R. N. Mantegna, Hierarchical structure in financial markets (1999), The European Physical Journal B-Condensed Matter and Complex Systems, vol. 11, no. 1, pp. 193–197.

[6] Generating Synthetic Data in Finance

S. Assefa, Generating synthetic data in finance: opportunities, challenges and pitfalls (2020), InfoSciRN: Data Protection (Topic).
