Synthetic data to the rescue!

Don’t have enough data? Don’t worry! Synthetic data is here!

Published in Towards Data Science, Jan 16, 2021

Data shortage has become an important problem in industries such as healthcare, sports, manufacturing, and law, where data is scarce or access to it is restricted by privacy and confidentiality requirements. I’d love to connect on LinkedIn.

Problem Statement: To generate synthetic data

Today, great advancements are being made in many sectors of society by leveraging machine learning techniques for tasks such as prediction, classification, and segmentation. The impact has been high in some sectors, but in others, notably healthcare and sports, machine learning has not been used to its full potential because data is scarce. With too little data, machine learning models do not yield the best possible results, which is unacceptable in cases where the margin for error is small. This is where generating synthetic data becomes a must.

Synthetic data acts as an imitation of real data, mimicking its properties and increasing the volume of data available to feed machine learning models. It also plays an important role in simulating situations that may arise in the future, so that they can be examined and the necessary actions and precautions taken beforehand.

There are two types of synthetic data: fully synthetic and partially synthetic.

If a dataset contains no original data, it is a fully synthetic dataset. If a newly generated dataset still contains some original data, it is a partially synthetic dataset. In a partially synthetic dataset, only the sensitive information is regenerated using synthetic data generation techniques.

Recognizing the need for sufficient data to train models with higher accuracy and efficiency, we proposed a methodology that generates synthetic data with statistical qualities similar to those of real datasets, using iterative regression analysis on a set of random numbers. The FIFA video games dataset from data.world was used to support this methodology and show its use case.

We began by studying the data in depth and examining the correlation between its attributes. The five most highly correlated attributes were selected to be mimicked in the artificial data: ball control, dribbling, special, short passing, and long passing.
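As a rough illustration of this selection step, here is a minimal sketch that ranks columns by their average absolute Pearson correlation with the other columns and keeps the top few. The ranking rule, the toy data, and the column names are assumptions made for the example; the exact selection criterion used in the paper is not described here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the player table. The column names and values are
# hypothetical and are not taken from the FIFA dataset.
names = ["ball_control", "dribbling", "short_passing", "age", "height"]
n = 500
skill = rng.normal(60, 10, n)  # shared factor that makes three columns move together
data = np.column_stack([
    skill + rng.normal(0, 3, n),   # ball_control
    skill + rng.normal(0, 3, n),   # dribbling
    skill + rng.normal(0, 4, n),   # short_passing
    rng.normal(25, 4, n),          # age (unrelated to skill here)
    rng.normal(180, 7, n),         # height (unrelated to skill here)
])

def top_correlated(data, names, k):
    """Return the k column names with the highest mean |Pearson r| to the other columns."""
    corr = np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(corr, 0.0)  # ignore each column's correlation with itself
    score = corr.sum(axis=0) / (len(names) - 1)
    order = np.argsort(score)[::-1][:k]
    return [names[i] for i in order]

selected = top_correlated(data, names, k=3)
print(selected)
```

On the real dataset, `k=5` would correspond to the five attributes listed above.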

Next, we preprocessed the data by checking for missing values, duplicates, and outliers.
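These checks might look something like the following with pandas. This is only a sketch: the tiny table is made up, and the 1.5 × IQR rule for outliers is an assumed choice, since the outlier criterion is not specified here.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table; the real dataset has many more rows and columns.
df = pd.DataFrame({
    "ball_control": [70.0, 65.0, 65.0, np.nan, 72.0, 68.0, 300.0],
    "dribbling":    [68.0, 60.0, 60.0, 55.0, 71.0, 66.0, 64.0],
})

n_missing = int(df.isna().sum().sum())      # cells with no value
n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row

clean = df.dropna().drop_duplicates()

# Keep only rows whose values all lie within 1.5 * IQR of the quartiles
# (an assumed outlier rule, chosen for illustration).
q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
iqr = q3 - q1
inside = ((clean >= q1 - 1.5 * iqr) & (clean <= q3 + 1.5 * iqr)).all(axis=1)
clean = clean[inside].reset_index(drop=True)

print(n_missing, n_duplicates, len(clean))
```

Whether to drop, cap, or keep flagged rows depends on the data; dropping is just the simplest option to show.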

Let’s dive into an overview of the methodology, architecture, and results.

First, a model was used to generate one M-dimensional vector of synthetic data from one M-dimensional vector of random integers. The same process was then repeated, with the data generated in each step added to the input for the next step, until all the new columns were generated.
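To make the loop more concrete, here is one possible reading of it in code. It is a sketch under several assumptions, not the paper's implementation: the model is ordinary least squares, the rows and the random integers are both sorted so the first regression has a relationship to learn, and residual-sized noise is added to each generated column. The function name `iterative_regression` and the toy data are hypothetical.

```python
import numpy as np

def iterative_regression(real, rng):
    """Generate one synthetic column per real column, feeding each new column back in as input."""
    m, k = real.shape
    # Assumption: sort rows by the first column and sort the random integers,
    # so the first fit sees a monotone relationship.
    real = real[np.argsort(real[:, 0])]
    seed = np.sort(rng.integers(0, 100, size=m)).astype(float)

    features = seed.reshape(-1, 1)  # step 1 input: the random vector only
    synthetic = []
    for j in range(k):
        design = np.column_stack([np.ones(m), features])  # intercept + current inputs
        coef, *_ = np.linalg.lstsq(design, real[:, j], rcond=None)
        fitted = design @ coef
        # Assumption: add noise with the spread of the residuals so the new
        # column is not an exact linear function of its inputs.
        resid_std = np.std(real[:, j] - fitted)
        new_col = fitted + rng.normal(0, resid_std, m)
        synthetic.append(new_col)
        features = np.column_stack([features, new_col])   # feed the new column forward
    return np.column_stack(synthetic)

# Toy "real" data: three correlated columns (values are made up).
rng = np.random.default_rng(42)
skill = rng.normal(60, 10, 500)
real = np.column_stack([
    skill + rng.normal(0, 3, 500),
    skill + rng.normal(0, 3, 500),
    skill + rng.normal(0, 4, 500),
])

synthetic = iterative_regression(real, rng)
print(real.mean(axis=0).round(1), synthetic.mean(axis=0).round(1))
```

Each pass through the loop produces one new column and widens the input for the next pass, which is the iterative part of the method.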

To analyze the results, we proposed the following metrics to measure the accuracy of the regression model:

1. Mean Absolute Error (MAE)

2. Residual sum of squares (RSS)

3. R2-Score
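For reference, all three metrics follow from the residuals between the real and predicted values. The helper below is a small sketch using their standard definitions; the function name and the sample numbers are made up for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RSS, and R2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))              # average size of the errors
    rss = np.sum(residuals ** 2)                  # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - rss / tss                          # share of variance explained
    return {"mae": mae, "rss": rss, "r2": r2}

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics)
```

Lower MAE and RSS mean the predictions sit closer to the real values, while an R2 score closer to 1 means the model explains more of the variance.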

Lastly, I would like to note that this is a brief overview of my recent paper, which I presented at the 4th ICECA, 2020, and which was published in IEEE Xplore.

The aim was to generate synthetic data using a simple yet effective technique called Iterative Regression Analysis.

Thank you for taking the time to read this brief overview of my work. If you would like to know more about the project or discuss it, please feel free to reach out to me on LinkedIn.

If you would be interested in reading the paper, please check it out here.

I worked on the paper with my colleagues Jil Kothari and Sanskar Shah.

References for the article:

https://www.simerse.com/synthetic-data

Written by Darshan Gandhi