Notes from Industry

Benchmarking time series datasets with style

How to cross-check/verify large time-series datasets?

Published in

Towards Data Science

4 min read · Oct 20, 2021

I work on energy system modelling, oil and gas well modelling, and well integrity management systems. In this work it is often essential to benchmark datasets, or to spot differences between time series that come from different sources or from multiple model runs.

A time series is a sequence of data points indexed in time order.

Being able to benchmark large datasets and spot differences quickly, and in style, frees up time for deriving insights.

For example, in a global energy dataset covering over 100 countries and 60 sectors, it would be tedious to plot every series to benchmark two data sources, or to compare forecasts after running a model under different scenarios.

In the BP Statistical Review of World Energy, different energy scenarios are quantified, showing BP's views on the future of energy under different settings (Business as Usual Scenario and Net Zero Scenario).

The data is presented for the main consuming countries and regions for the years 2025, 2030, 2035, 2040, 2045 and 2050.

How can we spot differences between time series forecasts (between 2 energy scenarios) without plotting them? How can we avoid plotting 60 time series for each of the 2 energy scenarios (120 lines in total) just to spot where the major changes are?

Where did we get the data from?

The data used to build this case comes from the BP Statistical Review of World Energy 2020; previous work showed how to extract it in clean form using Python.

The conventional method to spot differences between time series data

The goal is to compare the forecasted demand for different fuels (Oil, Gas, Renewables, Nuclear, Coal) between the Business as Usual Scenario and the Net Zero Scenario for several countries/regions (United States, EU, China, India, ...).

The conventional way is to draw a line plot comparing the two scenarios.

Line plot to spot the difference between two time series data. Image by Author
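As a rough illustration of the conventional approach, here is a minimal matplotlib sketch for a single country and fuel. The numbers are made up for illustration and are not taken from the BP data; the variable names and the output file name are assumptions as well.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt

years = [2025, 2030, 2035, 2040, 2045, 2050]

# Made-up demand values for one fuel in one region (illustration only).
business_as_usual = [5.0, 5.2, 5.0, 4.8, 4.5, 4.2]
net_zero = [4.8, 4.0, 2.0, 1.5, 1.0, 0.6]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(years, business_as_usual, marker="o", label="Business as Usual")
ax.plot(years, net_zero, marker="o", label="Net Zero")
ax.set_xlabel("Year")
ax.set_ylabel("Demand (arbitrary units)")
ax.legend()
fig.tight_layout()
fig.savefig("line_plot_example.png")
```

One such figure is needed for every country and fuel combination, which is where this approach stops scaling.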

The problem with the conventional approach

It would require plotting 60 time series for each of the 2 energy scenarios (120 lines) to spot where the major changes are. With a larger model or more features, plotting becomes even less practical.

How to spot differences faster and in style?

Seeing the big picture of where your time series differ, whether between modelling scenarios, between data updates, or during a large-scale benchmarking exercise, leaves more time for high-quality analysis and delivering insights.

The steps followed to benchmark the 2 energy scenarios results (time series data) are:

1. Extract the data.
2. Restructure the data in long format with two columns, one for each time series (energy scenario) you want to benchmark.
3. Calculate the dimensionless difference between the two time series datasets.

For example, if coal consumption in Africa in 2035 is 5 in one scenario and 2 in the other, the dimensionless difference is (5 - 2) / 5 = 0.6.
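Steps 2 and 3 can be sketched in pandas as follows. This is a minimal example under assumptions: the column names (`country`, `sector`, `year`, `business_as_usual`, `net_zero`, `diff`) are hypothetical, and the values are made up apart from the 5 and 2 used in the coal example above. The difference is taken relative to the first scenario, as in that example.

```python
import pandas as pd

# Hypothetical long-format data: one row per country, sector and year,
# with one column per scenario. Values are made up for illustration.
df = pd.DataFrame({
    "country": ["Africa", "Africa", "Africa", "Africa"],
    "sector": ["Coal", "Coal", "Gas", "Gas"],
    "year": [2030, 2035, 2030, 2035],
    "business_as_usual": [5.0, 5.0, 4.0, 6.0],
    "net_zero": [4.0, 2.0, 4.0, 3.0],
})

# Dimensionless difference, relative to the first scenario.
# Note: a zero in the reference column would produce inf or NaN here,
# so such rows need separate handling.
df["diff"] = (df["business_as_usual"] - df["net_zero"]) / df["business_as_usual"]

print(df)
```

Because the difference is divided by the reference value, sectors of very different sizes end up on a comparable scale.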

The structured clean dataset in long format. Image by Author

4. Sum the dimensionless differences between the two time series for each sector in each country.

Sum of difference between the two energy scenario datasets (time series). Image by Author
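A minimal pandas sketch of this summation, assuming a long-format frame with hypothetical column names `country`, `sector`, `year` and `diff` (the per-year dimensionless difference). The values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-year dimensionless differences (illustration only).
long_df = pd.DataFrame({
    "country": ["Africa", "Africa", "Africa", "Africa", "India", "India"],
    "sector": ["Coal", "Coal", "Gas", "Gas", "Coal", "Coal"],
    "year": [2030, 2035, 2030, 2035, 2030, 2035],
    "diff": [0.2, 0.6, 0.0, 0.5, 0.1, 0.3],
})

# One number per (country, sector): the sum of the differences over all years.
summed = (
    long_df.groupby(["country", "sector"], as_index=False)["diff"]
    .sum()
    .rename(columns={"diff": "sum_diff"})
)

print(summed)
```

A plain sum keeps the sign, so positive and negative differences in different years can partly cancel; whether that is acceptable depends on the benchmarking question.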

5. Restructure the dataframe so that the two most important variables become the index and the column names. For example, country as the index and sector as the columns.
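One way to do this restructuring in pandas is `DataFrame.pivot`. The sketch below assumes a frame with one row per country and sector; the column names (`country`, `sector`, `sum_diff`) and the values are hypothetical.

```python
import pandas as pd

# Hypothetical summed differences, one row per (country, sector).
summed = pd.DataFrame({
    "country": ["Africa", "Africa", "India", "India"],
    "sector": ["Coal", "Gas", "Coal", "Gas"],
    "sum_diff": [0.8, 0.5, 0.4, 0.1],
})

# Countries become the index and sectors the columns.
wide = summed.pivot(index="country", columns="sector", values="sum_diff")

print(wide)
```

`pivot` raises an error if a (country, sector) pair appears more than once, which doubles as a check that the previous aggregation left exactly one value per cell.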

6. Use a heatmap to visualize the results.

Heatmap to spot differences between (benchmark) two energy scenarios (time series data). Image by Author
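A heatmap along these lines can be sketched with matplotlib's `imshow`. This is an illustration under assumptions, not the exact plotting code behind the figure: the values are made up, and the diverging colormap centred on zero is a choice made here so that cells where the two scenarios match appear light.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical country-by-sector table of summed dimensionless differences.
wide = pd.DataFrame(
    [[0.8, 0.5, 0.0], [0.4, 0.1, -0.2]],
    index=["Africa", "India"],
    columns=["Coal", "Gas", "Hydro"],
)

# Symmetric colour limits keep zero (matching scenarios) at the light centre.
limit = float(wide.abs().to_numpy().max())

fig, ax = plt.subplots(figsize=(6, 3))
image = ax.imshow(wide.to_numpy(), cmap="RdBu_r", vmin=-limit, vmax=limit, aspect="auto")

# Label the axes with the sector and country names.
ax.set_xticks(range(wide.shape[1]))
ax.set_xticklabels(wide.columns)
ax.set_yticks(range(wide.shape[0]))
ax.set_yticklabels(wide.index)

fig.colorbar(image, ax=ax, label="Summed dimensionless difference")
fig.tight_layout()
fig.savefig("benchmark_heatmap.png")
```

With this colormap, strongly coloured cells point to the country and sector combinations worth plotting in detail.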

In the heatmap, light colours mark areas where the two datasets match, meaning the two energy scenarios gave similar results. An example is Hydro consumption in the United States. In contrast, Renewable consumption in the Middle East varies greatly between the Business as Usual Scenario and the Net Zero Scenario.

More examples of how the resulting heatmap helps to spot differences between time series are shown below.

Summary of benchmarking time series data. Image by Author

The whole work is available on GitHub.