Modeling and Generating Time-Series Data using TimeGAN

Generating time-series data using a library with a high-level implementation of TimeGAN

Archit Yadav
Towards Data Science



1. Background

In a previous article, we explored the idea of generating artificial or synthetic data starting from a limited dataset. The data there was tabular, the kind of regular dataset we usually encounter. In this article, however, we will look at time-series data and explore a way to generate synthetic time-series data.

2. Timeseries data — a brief overview

So how does time-series data differ from regular tabular data? A time-series dataset has one extra dimension: time. Think of it as a 3D dataset. Say we have a dataset with 5 features and 5 instances of input.


A time-series dataset is then basically an extension of this table into the 3rd dimension, where each new table is just another copy of the dataset at a new timestep.


Beyond the extra dimension, a practical difference is that time-series data typically contains far more data points than a tabular dataset, since every instance unfolds over many timesteps.
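To make the shapes concrete, here is a small NumPy sketch (the 5×5 table matches the toy example above; the 10 timesteps are an arbitrary choice for illustration):

```python
import numpy as np

# A regular tabular dataset: 5 instances (rows) x 5 features (columns)
tabular = np.zeros((5, 5))

# A time-series dataset adds a time axis: one such table per timestep.
# Stacking 10 timesteps gives a 3D array.
timeseries = np.stack([np.zeros((5, 5)) for _ in range(10)])

print(tabular.shape)     # (5, 5)
print(timeseries.shape)  # (10, 5, 5): (timesteps, instances, features)
```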

3. Case study: Energy Dataset

If we look at the energy dataset (available here and originally taken from here), it actually looks like a regular tabular dataset, with every row representing a new timestep and holding its corresponding data points in the form of features. Each entry is recorded at 10-minute intervals, as per the date column.

But we saw in the previous section that it is 'expected' to look like a 3D dataset, right? This is where we can use a clever way of sampling data points to create the 3rd dimension.

We take a window of size 24 and slide it along the rows of the dataset, shifting by one position at a time, thereby obtaining a number of 2D matrices, each 24 rows long and containing all our column features.

Depiction of sampling done under the hood

In the energy dataset, there are 19736 rows. Sliding the 24-row window one step at a time gives us 19712 windows, each with 24 rows and 28 features. We can then shuffle them randomly so that the training samples are closer to Independent and Identically Distributed (IID). Essentially, we get a dataset of dimensions (19712, 24, 28), where each of the 19712 instances has 24 rows (aka timesteps) and 28 features. This implementation can be found here.
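The sliding-window sampling can be sketched in a few lines of NumPy. The function name is my own and the data below is a random stand-in with the energy dataset's dimensions; the article's actual implementation is linked above:

```python
import numpy as np

def sliding_windows(data: np.ndarray, seq_len: int = 24) -> np.ndarray:
    """Shift a seq_len-row window one row at a time down the array,
    collecting one (seq_len, features) matrix per position."""
    return np.array([data[i:i + seq_len] for i in range(len(data) - seq_len)])

# Random stand-in with the energy dataset's shape: 19736 rows, 28 features
data = np.random.rand(19736, 28)
windows = sliding_windows(data)
print(windows.shape)  # (19712, 24, 28)

# Shuffle the windows so the training samples are closer to IID
np.random.shuffle(windows)
```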

4. Generating time-series data using TimeGAN

TimeGAN (Time-series Generative Adversarial Network) is an architecture for generating synthetic time-series data, introduced in the paper of the same name. YData's ydata-synthetic library provides an easy-to-use implementation of TimeGAN, which can be installed from the Python Package Index (PyPI). In this article, we will use version 0.3.0, the latest at the time of writing.

pip install ydata-synthetic==0.3.0

More details can be found in the ydata-synthetic repository. In this section, we will generate a time-series dataset using the energy dataset as the input source.

We first read the energy dataset and then apply some pre-processing in the form of data transformation: the data is scaled to the range [0, 1], and the sliding-window sampling from the previous section is applied.

Generating the actual synthetic data from this time-series data (energy_data) is now the simplest part. We train the TimeGAN model on our energy_data and then use the trained model to generate more.

The parameters fed to the TimeGAN constructor have to be defined according to our requirements. We have n_seq defined as 28 (features) and seq_len defined as 24 (timesteps). The rest of the parameters are defined as follows:
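A sketch of these definitions is below. The names and values follow the library's example notebooks; treat them as a starting point rather than tuned settings:

```python
# Parameters for the TimeGAN constructor and training loop (illustrative
# values, patterned on the ydata-synthetic example notebooks)
seq_len = 24          # timesteps per window (rows of each 2D sample)
n_seq = 28            # number of features in the energy dataset
hidden_dim = 24       # units in the recurrent hidden layers
gamma = 1             # weight of TimeGAN's supervised loss

noise_dim = 32        # dimension of the generator's input noise
dim = 128             # layer dimension
batch_size = 128
learning_rate = 5e-4
train_steps = 50000   # reduce drastically for a quick smoke run
```

With these in place, the flow is roughly: construct the model with `TimeGAN(...)`, call its `train` method on `energy_data`, then call `sample` to draw synthetic windows. The exact constructor signature varies between ydata-synthetic versions, so check the repository's examples for the version you install.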

Now that we have our generated synth_data, let's see how it compares visually to our original data. We can plot each of the 28 features and see how it varies across timesteps.


While mildly interesting, these plots are not very useful for comparison. It's a given that synthetic data will differ from the original (real) data; otherwise there would be no point in generating it. And with so many features, it is difficult to visualize and interpret them all together intuitively. What would be more interesting and helpful is to visualize (and compare) the generated data in a space that is more comprehensible and intuitive to us.

5. Evaluation and Visualization

We can make use of the following two well-known visualization techniques:

  1. PCA — Principal Component Analysis
  2. t-SNE — t-Distributed Stochastic Neighbor Embedding

The essential idea behind these techniques is to apply dimensionality reduction in order to visualize datasets that have a large number of dimensions, i.e., lots of features. Both PCA and t-SNE achieve this, with the main difference being that PCA tries to preserve the global structure of the data (it looks for directions along which the dataset's variance is retained, globally, across the entire dataset), whereas t-SNE tries to preserve the local structure (by ensuring that points which are neighbours in the original data remain close together in the reduced-dimensional space). An excellent answer detailing the difference can be found here.

For our use case, we will use the PCA and TSNE objects from sklearn.

Now that our data to plot is prepared, we can use matplotlib to plot both original and synthetic transformations. pca_real and pca_synth together give us the PCA results, and tsne_results contain both original and synthetic (due to concatenation) t-SNE transformations.
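The reduction step can be sketched as follows. The array names mirror the ones above, but random arrays stand in for the real and synthetic windows, and the sample size of 100 is an arbitrary choice:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Toy stand-ins for 100 real and 100 synthetic windows of shape (24, 28)
rng = np.random.default_rng(0)
real_sample = rng.random((100, 24, 28))
synth_sample = rng.random((100, 24, 28))

# Flatten each window to one row so the sklearn estimators can consume it
real_flat = real_sample.reshape(100, -1)
synth_flat = synth_sample.reshape(100, -1)

# PCA: fit on the real data, then project both sets into the same 2D space
pca = PCA(n_components=2).fit(real_flat)
pca_real = pca.transform(real_flat)
pca_synth = pca.transform(synth_flat)

# t-SNE has no separate transform step, so reduce both sets together;
# the first 100 rows of the result are real, the rest synthetic
tsne_results = TSNE(n_components=2, perplexity=30).fit_transform(
    np.vstack([real_flat, synth_flat]))

print(pca_real.shape, pca_synth.shape, tsne_results.shape)
```

The two columns of each result then serve as the x and y coordinates for the matplotlib scatter plots.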


What do these graphs tell us? They show how the whole dataset would look if it were reduced to just two features (the two axes). The PCA plot alone may not be sufficient to draw a firm conclusion, but the t-SNE plot suggests that the original data (black) and the synthetic data (red) follow a similar distribution. A dataset-specific observation is that the data falls into 7 visible groups (clusters), the data points of which are similar to each other and hence plotted close together.

6. Conclusion

We looked briefly at time-series data and how it differs from tabular data. We then used the TimeGAN architecture, through the ydata-synthetic library, to generate more time-series data. The complete implementation in a notebook can be found here.
