
Synthesis of Tabular Data in Finance using Generative Algorithms

Gaurav Shekhar
Towards Data Science
9 min read · Nov 27, 2020


1. Introduction

Within the finance domain, a large amount of complex data is captured across digital channels and stored primarily in structured formats (Excel, CSV, relational databases, etc.). Regulatory norms like GDPR, FERPA, and HIPAA have put many restrictions in place to protect the privacy of the end customer.

At the same time, organizations need to build a competitive advantage by analyzing and acting on insights generated from this stored data, and there are many situations in which the data is not accessible when required for analysis.

Below are some familiar conversations we often come across that highlight the challenges in securing data access:

“We want to move data from on-premises to cloud environment but do not get the necessary approvals from Risk and Security teams.”

“This space has so much potential for application of AI and ML solutions but we do not have sufficient data with us to build complex models.”

Synthetic Data Generation can help us solve these and many other challenges effectively. It is fast emerging as the go-to approach when firms want to use their data for analysis without privacy concerns or violations of regulatory norms.

The diagram below captures a few of the potential applications of Synthetic Data Generation.

Source: Image by Author

2. Focus Area

Synthetic data is generated programmatically by mimicking real-world phenomena and learning the properties of the real data. It is not a new concept, and there are many research papers that have studied the subject and proposed innovative approaches to generate synthetic data.

Recently, however, there has been renewed interest in this space, largely due to research on Generative Algorithms based on deep learning techniques like Variational Autoencoders and Generative Adversarial Networks. A majority of this research has been carried out on image data.

However, as highlighted above, a major chunk of the data in financial firms is stored as structured data in Excel files, relational databases, etc. In this article, we will focus on the work done on generating synthetic financial tabular data using Generative Algorithms.

3. Key Challenges

Before digging deeper into Synthetic Data Generation, let's first understand some of the key challenges it poses for tabular data.

The diagram below identifies the key challenges across four dimensions: Data Complexity, Preserving Correlations, Enforcing Constraints, and Maintaining Referential Integrity. Any solution should be judged by its effectiveness in meeting these challenges.

Source: Image by Author

4. Solution Approach for Synthetic Data Generation

The different solution approaches can be broadly classified as i) Statistical and ii) Generative Algorithm based. Statistical techniques dominated this field for a long time; however, recent advances in Deep Learning have led Generative Algorithms to dominate synthetic data generation through the superior quality of the data they produce.

I. Statistical Approach

The Statistical approach solves the Synthetic Data Generation problem by fitting a joint multivariate probability distribution to the real data and then drawing samples from that distribution as required. While this works for simple datasets that are easily explained by standard probability distributions, most real datasets are very complex, and the approach is limited by the availability of distributions flexible enough to describe them. Popular choices here include Gaussian Copulas, Mixture Models, etc.
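To make the idea concrete, below is a minimal sketch of Gaussian Copula sampling: map each marginal to normal scores through its empirical CDF, capture the dependence with a correlation matrix, sample, and invert back through the empirical quantiles. The DataFrame and its columns are purely illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy "real" table; in practice this would be your actual data.
real = pd.DataFrame({
    "income": np.random.lognormal(10, 0.5, 1000),
    "balance": np.random.gamma(2.0, 500.0, 1000),
})

# 1. Map each marginal to standard-normal scores via its empirical CDF.
ranks = real.rank() / (len(real) + 1)          # ranks kept in (0, 1)
scores = stats.norm.ppf(ranks)

# 2. The dependence structure is just the correlation of the scores.
corr = np.corrcoef(scores, rowvar=False)

# 3. Sample from the fitted multivariate normal ...
z = stats.multivariate_normal(cov=corr, allow_singular=True).rvs(500)

# 4. ... and invert each marginal back through its empirical quantiles.
synthetic = pd.DataFrame({
    col: np.quantile(real[col], stats.norm.cdf(z[:, i]))
    for i, col in enumerate(real.columns)
})
```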

II. Bayesian Networks

A Bayesian network is a graphical model of the joint probability distribution of a set of variables, and it can be used to create multiple synthetic datasets. Informative prior information is generally needed to assign appropriate weights to each network if the synthetic data are to have good inferential properties. While this limits its application in some domains, Bayesian Networks remain a popular choice in fields like Healthcare.
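As a toy illustration of how a Bayesian network produces rows, the sketch below ancestrally samples a two-node network (Segment → Default); the structure and probability tables are entirely made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# P(Segment): the root node of the network.
p_segment = {"retail": 0.7, "institutional": 0.3}
# P(Default = 1 | Segment): conditional probability table of the child.
p_default = {"retail": 0.10, "institutional": 0.02}

def sample_row():
    # Ancestral sampling: draw parents first, then children given parents.
    segment = rng.choice(list(p_segment), p=list(p_segment.values()))
    default = rng.random() < p_default[segment]
    return {"segment": segment, "default": int(default)}

synthetic = [sample_row() for _ in range(1000)]
```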

III. Variational Autoencoders

Variational Autoencoders (VAEs) belong to the family of Generative Models, and their architecture is based on two neural networks: i) an Encoder and ii) a Decoder.

The VAE first feeds real objects to the Encoder, which learns the best way to represent each object as a vector belonging to a latent distribution. The vector generated by the Encoder is then fed to the Decoder, which reconstructs the object that was originally fed to the Encoder. After training is complete, the Encoder is discarded and new objects are produced from the Decoder alone. The variational part of the autoencoder brings randomness to this process.

VAEs are widely used for generating synthetic data, along with GANs, which we discuss next. A key distinction is that the Encoder of a VAE sees the real data directly, whereas in a GAN only the discriminator is shown real samples; the generator never sees them.

Source: Image by Author
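Below is a minimal VAE sketch in PyTorch for a table of numeric columns; the layer sizes, latent dimension, and loss weighting are illustrative rather than tuned.

```python
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    def __init__(self, n_features, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)      # mean of latent Gaussian
        self.logvar = nn.Linear(64, latent_dim)  # log-variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# After training, discard the encoder and decode noise into new rows:
# model = TabularVAE(n_features=10)
# new_rows = model.decoder(torch.randn(100, 8))
```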

IV. Generative Adversarial Networks (GANs)

GANs also belong to the family of Generative Algorithms and have been very successful at generating synthetic data. The GAN architecture comprises two main components, a generator G and a discriminator D, both of which are deep neural networks. The generator is a generative model that outputs data from a noise input, and the discriminator is a classification model that classifies input data as real or fake. In very simple terms, the goal of the generator is to fool the discriminator, while the discriminator tries to get better at telling real samples from fake ones.

Source: Image by Author
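A bare-bones GAN training step might look like the sketch below; the layer sizes and optimizer settings are illustrative, and `real_batch` is assumed to be a float tensor of rows.

```python
import torch
import torch.nn as nn

noise_dim, n_features = 16, 10
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(),
                  nn.Linear(64, n_features))              # generator
D = nn.Sequential(nn.Linear(n_features, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1))                        # discriminator

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_batch):
    n = len(real_batch)
    fake = G(torch.randn(n, noise_dim))
    # Discriminator: label real rows 1 and generated rows 0.
    d_loss = (bce(D(real_batch), torch.ones(n, 1)) +
              bce(D(fake.detach()), torch.zeros(n, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: try to make the discriminator say 1 on fakes.
    g_loss = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```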

5. Evolution of GAN architecture over time

The GAN architecture has attracted the interest of the research community, and this has resulted in many architectural improvements over the last few years. Below we give a very high-level overview of the evolution of the GAN architecture, with a primary focus on improvements relevant to tabular data synthesis.

The Vanilla GAN architecture was based on the original paper published by Ian Goodfellow. The WGAN architecture introduced a different loss function that solved some of the major challenges, like the Mode Collapse and Vanishing Gradient problems, which plagued the original GAN architecture. The Conditional GAN paper enabled GANs to produce output conditioned on a chosen class.
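To see how small the WGAN loss change is relative to the BCE objective sketched earlier, here are its critic and generator objectives; `critic` is assumed to be any network producing an unbounded score, and the required Lipschitz constraint (weight clipping or a gradient penalty) is omitted.

```python
def wgan_losses(critic, real_batch, fake_batch):
    # The critic maximizes the score gap between real and fake rows ...
    critic_loss = -(critic(real_batch).mean() - critic(fake_batch).mean())
    # ... while the generator maximizes the critic's score on fakes.
    generator_loss = -critic(fake_batch).mean()
    return critic_loss, generator_loss
```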

A variant of GAN, CTGAN, was first proposed at the NeurIPS 2019 conference and has established itself as a popular choice for generating tabular data. The paper forms the core of our understanding and is listed in the References section.

Source: Image by Author

6. Synthetic Tabular Data Generation using CTGAN

CTGAN is a popular approach that builds on the GAN architecture to model tabular data and sample rows conditionally from the trained model to create synthetic data. At the moment it can be considered cutting edge and has been able to surpass the performance of other approaches on this problem.

Some of the key innovations introduced in the CTGAN paper are detailed below:

I. Representing Multimodal Distributions in Continuous Variables

A mode in a data distribution is an area with a high concentration of data. A multimodal distribution can be thought of as a distribution having many peaks in its probability density curve. A Vanilla GAN model is not able to account for all the modes of a continuous variable, and applying standard normalization techniques leads to the Mode Collapse problem, where the generator produces samples around a single mode, greatly inhibiting its learning.

How CTGAN handles multimodal distributions in continuous variables:

The CTGAN architecture introduces a Mode-Specific Normalization technique to solve this challenge. Below is a high-level overview of the approach.

Mode Specific Normalization, Source: (https://arxiv.org/abs/1907.00503)

Each continuous value is finally represented using a one-hot vector indicating the mode and a normalized scalar indicating the value within that mode.
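A rough sketch of this encoding follows, using scikit-learn; note the paper fits a variational Gaussian mixture per column, while a plain GaussianMixture is used here for brevity and the number of components is arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

values = np.random.lognormal(0, 1, 1000).reshape(-1, 1)  # toy column

# Fit a mixture so each component corresponds to one mode of the column.
gm = GaussianMixture(n_components=3, random_state=0).fit(values)
modes = gm.predict(values)                    # mode index for each value
means = gm.means_.ravel()[modes]
stds = np.sqrt(gm.covariances_).ravel()[modes]

one_hot = np.eye(3)[modes]                      # which mode the value is in
scalar = (values.ravel() - means) / (4 * stds)  # value within that mode
encoded = np.column_stack([scalar, one_hot])    # representation the GAN sees
```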

II. Handling Imbalanced Categorical Variables in Tabular Data

Discrete variables in real-world tabular data are often highly imbalanced, with a few major categories occurring in a majority of the instances. If we sample directly, the minor classes will be ignored due to their small proportion, and the overall statistical distribution won't show much impact either. This leads to missing out on the important insights present in the minor category values.

How CTGAN handles imbalanced data in Categorical Variables

CTGAN introduces a Training-by-Sampling approach with a Conditional Generator to resample the training data so that all the categories of the discrete variables get a fair chance to be included in the samples from which the GAN learns.

Below is a high-level overview of the approach:

Training by Sampling and Conditional Generators, Source: https://arxiv.org/abs/1907.00503
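An illustrative piece of this idea is how the conditioning category is drawn: sampling by log frequency rather than raw frequency boosts the rare categories, in line with the paper's Training-by-Sampling scheme. The column and counts below are made up.

```python
import numpy as np

counts = {"equity": 9000, "bond": 900, "crypto": 100}  # imbalanced column

# Raw frequencies would pick "equity" 90% of the time; log frequencies
# flatten the distribution so minor categories are seen during training.
log_freq = np.log(np.array(list(counts.values())))
probs = log_freq / log_freq.sum()              # ~[0.44, 0.33, 0.22]

rng = np.random.default_rng(0)
condition = rng.choice(list(counts), p=probs)  # category fed to the generator
```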

III. Preserving Linear and Non-Linear Correlations between Attributes

Complex linear and non-linear correlations present between variables need to be captured and preserved in the synthetic data.

How CTGAN captures linear and non-linear correlations between variables

Fully connected hidden layers are present in both the Generator and the Critic to capture correlations between attributes. Apart from the hidden layers, the architecture uses a mix of activation functions, like Leaky ReLU, to capture non-linear interactions between variables, and Dropout to prevent overfitting.
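As a sketch in the spirit of that description (not CTGAN's exact layer layout), such a block might look like this:

```python
import torch.nn as nn

# Fully connected block mixing non-linear activations and regularization.
hidden_block = nn.Sequential(
    nn.Linear(128, 256),
    nn.LeakyReLU(0.2),   # non-linear interactions between attributes
    nn.Dropout(0.5),     # guards against overfitting the training rows
    nn.Linear(256, 256),
)
```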

7. Framework to Evaluate Synthetic Data Generation

There is a lack of a well-established framework for comparing the results generated by synthetic data models. From a technical perspective, the focus so far has largely been on measuring the distance between the real and synthetic distributions through various fit metrics.

Another approach to evaluating synthetic data is to split the data into train and test sets and use only the train set for generating the synthetic data. Once the synthetic data is generated, standard ML models are developed on the original training data as well as on the synthetic data. These models are then evaluated on the test data and their performance compared.
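The sketch below captures this comparison with scikit-learn; the model choice and metric are placeholders, assuming a binary classification task.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def compare(X_train, y_train, X_synth, y_synth, X_test, y_test):
    # One model per training source, identical everywhere else.
    real_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    synth_model = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)
    # Both are scored on the same held-out real test set.
    return {
        "auc_trained_on_real": roc_auc_score(
            y_test, real_model.predict_proba(X_test)[:, 1]),
        "auc_trained_on_synthetic": roc_auc_score(
            y_test, synth_model.predict_proba(X_test)[:, 1]),
    }
```

The closer the two scores, the better the synthetic data has preserved the signal relevant to the downstream task.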

8. Practical Implementation

The researchers who introduced the original CTGAN paper at NeurIPS 2019 have released the Synthetic Data Vault, a set of open-source tools meant to expand data access without compromising privacy. The code integrates with Python and is available at https://sdv.dev/CTGAN/.
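Usage follows the pattern in the project's documentation, roughly as below; note the class name has varied across releases (early versions expose CTGANSynthesizer instead of CTGAN), and the file and column names here are hypothetical.

```python
import pandas as pd
from ctgan import CTGAN

real_data = pd.read_csv("customers.csv")       # hypothetical dataset
discrete_columns = ["segment", "region"]       # hypothetical categoricals

model = CTGAN(epochs=300)
model.fit(real_data, discrete_columns)

synthetic_data = model.sample(1000)            # 1,000 synthetic rows
```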

9. Future Challenges

In this article we have seen how the GAN architecture has evolved over time to effectively address some of the challenges in generating synthetic tabular data, including representing multimodal distributions of continuous variables and handling imbalanced data.

However, many requirements still need to be addressed for the solution to receive more widespread adoption. Below we identify some of the future challenges in this space:

i. Generating Synthetic Data for Multiple Tables: Most of the study of tabular synthetic data has been conducted on single tables. But real-world data is normalized and split into multiple tables with primary-key/foreign-key relationships. While we have seen some attempts at understanding the metadata for a group of tables, the results are far from satisfactory, and more study is needed in this field.

ii. Identify and Preserve Unknown Constraints in Data: Real-world data has many constraints intrinsically built into it. These can be as simple as enforcing positive values in an age column, or ensuring that total portfolio sales equal the sum of sales across the different products. At present such rules can be built in as user-defined constraints (see the sketch after this list); future solutions will need to get smarter at capturing these unknown constraints automatically.

iii. Need for Robust Evaluation Frameworks: There is a clear lack of a well-established framework for comparing the success of synthetic data work. Evaluation metrics are needed on the functional side that can assess how effectively business constraints and hidden relationships are preserved in the synthetic data.

iv. Increase End-User Acceptance: Despite all the innovations, end-user acceptance, especially from the business side, remains a concern. There has to be better awareness of this emerging field, more POCs conducted, and success stories discussed in detail to increase end-user acceptance.
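Returning to point ii above, one simple form a user-defined constraint can take is a post-hoc filter over the sampled rows; `synthetic_data` is assumed to be a DataFrame of generated rows and all column names are hypothetical. Library-level constraint APIs (e.g. in the Synthetic Data Vault) can enforce such rules during sampling instead.

```python
def satisfies_constraints(row):
    age_ok = row["age"] > 0
    sales_ok = abs(row["total_sales"]
                   - (row["equity_sales"] + row["bond_sales"])) < 1e-6
    return age_ok and sales_ok

# Keep only the generated rows that respect the known business rules.
valid_rows = synthetic_data[synthetic_data.apply(satisfies_constraints, axis=1)]
```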

10. References

1. Modeling Tabular Data using Conditional GAN https://arxiv.org/pdf/1907.00503.pdf

2. Generating synthetic data in finance: opportunities, challenges and pitfalls https://www.jpmorgan.com/content/dam/jpm/cib/complex/content/technology/ai-research-publications/pdf-8.pdf

3. The Synthetic Data Vault https://github.com/sdv-dev/CTGAN

Disclaimer: The opinions shared in this article are my own and do not necessarily reflect those of Fidelity International or any of its affiliated parties.
