Go Big by Being Small

A method to extrapolate a small time-series dataset.

Nicholas Satyahadi
Towards Data Science



When I was in my final year as a university student, I was collecting the datasets I needed for my final-year research project. I was just casually scrolling through the internet and voila! It didn’t take me long to gather everything. But just when I thought it was all smooth sailing, a Kraken appeared. Not the sea monster, of course, but a problem that demanded tons of brainstorming sessions: the dataset I had collected was too small to work with. I’m talking 20 to 30 periodic observations, yikes. You may ask, why didn’t I realize it was insufficient just by looking at the number of observations? Well, to be frank, I did feel a little worried when I saw the “handful” of observations. But it only hit me when I realized it wasn’t enough for the model I was researching.

After quite a few hours, a book, and a glass of coffee, I finally found the inspiration for how to work with these small datasets: extrapolate them, appropriately. At first, I genuinely thought the idea was going to introduce quite a lot of error into the model, but thankfully it went well and I finished my paper. So in this article, I want to share the method I used for a univariate dataset and a new method I developed for a multivariate dataset.

Let’s start easy: Univariate Dataset with Margin of Error (MOE)

A dataset with a provided MOE is very useful for this extrapolation method, because the MOE is one of the key factors in how accurate the extrapolated values will be. In this case, I’ll be using the US Annual Mean Income, gathered from the United States Census Bureau, Table S1901. With the MOE on board, we can easily get the minimum and maximum values of the mean income for each year. Knowing these bounds, we extrapolate the annual values by generating random variates from the Uniform(0,1) distribution to represent standardized values of the mean income. Then we convert the standardized values back to actual values using the minimum and the maximum, like so

Image by Author
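The formula in the image isn’t reproduced here, but from the description it should amount to ordinary min-max rescaling, with the bounds built from the MOE: taking x_min = mean − MOE and x_max = mean + MOE, each extrapolated value is

x = x_min + u · (x_max − x_min), where u ~ Uniform(0, 1).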

Say I want to extrapolate the dataset to recreate monthly mean income; then I need 12 random uniform variates to be converted for each year. Here’s a side-by-side plot comparison of the real and the extrapolated datasets.

Extrapolated (Left), Real (Right), Image by Author

As we can see, the increasing trend is still there; it’s just noisier, since the series now has monthly instead of annual values. And if we check the difference between the statistical properties

Percentage Difference of the Statistical Properties, Image by Author

it doesn’t differ much :)
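For illustration, here is a minimal Python sketch of that generation step. The column names (year, mean_income, moe) are hypothetical, not the ones from Table S1901; the idea is simply 12 uniform draws per year, rescaled into the MOE band:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def extrapolate_monthly(df):
    # Turn each annual row into 12 monthly values drawn uniformly
    # between (mean - MOE) and (mean + MOE).
    rows = []
    for _, r in df.iterrows():
        lo = r["mean_income"] - r["moe"]
        hi = r["mean_income"] + r["moe"]
        u = rng.uniform(size=12)          # 12 standardized Uniform(0,1) draws
        for month, value in enumerate(lo + u * (hi - lo), start=1):
            rows.append({"year": r["year"], "month": month, "income": value})
    return pd.DataFrame(rows)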

Univariate Dataset without MOE

Now, this condition was the problem I mentioned before. I was confused about how I was supposed to get any information on the periodic variance of the data I was working with. Luckily, the solution only requires two main ingredients: a time-series model that fits the dataset well and some randomized standardized values.

In this example, I’m going to use the monthly sunspots dataset, which you can acquire here. And yes, it’s already a huge dataset, so no need for extrapolation, am I right? But let’s say you’re only given the last 3 years of observations and are told to generate daily values for those 3 years based on them.

Monthly Sunspots from 1981–1984, Image by Author

Now let’s pick the model. From the beginning, we know that this is a monthly dataset, so why don’t we pick something simple? We’re going to fit a linear seasonal regression model to the dataset. Here’s the result:

Image by Author

That’s quite a good fit. Now we’re going to use the estimates and the standard errors from this result to extrapolate the data. In other words, looking back at the previous example, we can use the estimates and standard errors as the “mean income” and MOE, respectively. Since we’re generating daily values, each month produces as many values as it has days, based on that month’s estimate and standard error; I’m using a confidence level of 95% from this point on. Here are the extrapolated daily values:

Image by Author

One thing that immediately feels off is that the extrapolated values lack the decreasing trend of the original dataset. I did this on purpose, to show how important it is to pick a model appropriate to the dataset we’re working with. From this result, we can conclude that the linear seasonal regression model is not the perfect fit for this dataset. Moreover, by using a regression we implicitly assume the dataset is stationary, which causes the extrapolated values to look like a stationary time series.
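To make the procedure concrete, here is a minimal sketch of both steps under some assumptions: the seasonal regression is implemented as an OLS fit on month-of-year dummies (one common way to set it up), and the daily draws come from each month’s 95% band, estimate ± 1.96 · SE. The variable names are mine, not the author’s exact code:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

def fit_seasonal(y):
    # Regress a monthly series (with a DatetimeIndex) on 12 month dummies,
    # so each coefficient is that month's level and bse its standard error.
    X = pd.get_dummies(y.index.month, prefix="m", dtype=float)
    X.index = y.index
    return sm.OLS(y.values, X.values).fit()

def daily_from_monthly(y, z=1.96):
    # For each observed month, draw one value per calendar day,
    # uniformly inside [estimate - z*SE, estimate + z*SE].
    res = fit_seasonal(y)
    values = []
    for ts in y.index:
        est, se = res.params[ts.month - 1], res.bse[ts.month - 1]
        lo, hi = est - z * se, est + z * se
        u = rng.uniform(size=ts.days_in_month)
        values.extend(lo + u * (hi - lo))
    return np.array(values)

Because the month dummies ignore any trend, this sketch reproduces exactly the stationarity limitation discussed above.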

Multivariate Dataset

Down to the last example: it took me quite a while to think of a way to extrapolate a multivariate dataset. Nevertheless, here’s one method of doing it. In this last example, I’m using the New Delhi Climate Training Dataset from Kaggle.

Likewise, let’s investigate the dataset first. Since I was expecting correlation between the variables, I’ll start with the scatterplots between them.

Image by Author

Now my eyes immediately make their way to the pressure section, despite the apparent negative correlation between the temperature and humidity. Something feels off with that plot, and I immediately realize it must be outliers, since some values differ greatly from the rest. I’m no expert in this climate section of knowledge, so I called our best friend and jack-of-all-trades, Google, to help me find the normal values for air pressure, and it sent me here. It turns out the values should be around 1013.25 millibars. Hence, based on the dataset and the website, pressure values that lie between 990 and 1024 will be considered normal, and the outliers will be replaced according to the distribution of the dataset.
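As a sketch of that replacement step, one simple option is to resample the out-of-range readings from the empirical distribution of the in-range ones (meanpressure is the column name in the Kaggle file; the rest is my own illustration):

import numpy as np

rng = np.random.default_rng(1)

def replace_pressure_outliers(df, lo=990.0, hi=1024.0):
    # Flag pressures outside the normal band and replace them with
    # random draws from the values that fall inside it.
    mask = df["meanpressure"].between(lo, hi)
    n_out = (~mask).sum()
    df.loc[~mask, "meanpressure"] = rng.choice(
        df.loc[mask, "meanpressure"].values, size=n_out
    )
    return df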

You might be wondering: there must be a twist to this example, since there are already a lot of observations. YOU GUESSED IT RIGHT! (Really sorry for my corny jokes trying to get your attention back, lol.)

The twist here is that you’re actually given only the monthly average of each variable, and you need to convert it back to daily values. Now, based on the last two examples, please answer this question:

Is it going to work? Is it possible to do so?

Save your answer until the end of this article, and let’s see.

First, as we did earlier, let’s take a look at the scatterplots between the variables.

Image by Author

Well, it seems like our variables are correlated with each other. Here’s what I can see from this plot:

  • The most definite relationship is between temperature and pressure: a negative correlation.
  • The rest show at most a moderate correlation, and some pairs look like they might fit a quadratic model.

With these in mind, I decided to fit a linear and a quadratic regression model for every possible pair of variables, then compare their R-squared and adjusted R-squared values. I’m also going to fit a linear seasonal regression model to each variable, since each one clearly has a seasonal pattern based on the plots below.

Image by Author

Before running the regressions, it’s best to standardize the values, since the variables are on very different scales. Here’s the result of the model fitting:

Image by Author
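For reference, a comparison table like the one above can be produced with a loop along these lines (a sketch, assuming the standardized variables sit in a DataFrame df_std):

import itertools
import numpy as np
import statsmodels.api as sm

def compare_fits(df_std):
    # Fit linear and quadratic OLS models for every ordered pair of
    # variables and collect their R-squared / adjusted R-squared.
    results = []
    for x_col, y_col in itertools.permutations(df_std.columns, 2):
        x, y = df_std[x_col].values, df_std[y_col].values
        for degree in (1, 2):
            X = sm.add_constant(np.column_stack([x**d for d in range(1, degree + 1)]))
            fit = sm.OLS(y, X).fit()
            results.append({"x": x_col, "y": y_col, "degree": degree,
                            "r2": fit.rsquared, "adj_r2": fit.rsquared_adj})
    return results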

Let’s focus on the relationships between the variables. Excluding the seasonal regression results (rows 1–4), the highest R-squared value belongs to the quadratic model with pressure as the independent variable and temperature as the dependent one. The other models don’t seem to fit well, even though the scatterplots showed an indication of correlation. Fortunately, the seasonal model is a great fit for all variables. With these in mind, here’s my plan:

Image by Author

And now, the moment you’ve been waiting for: the comparison of the real versus the extrapolated values (the blue line is the extrapolated one).

Image by Author

Each extrapolated series fits the actual values well, and even the temperature is not so bad. But our million-dollar question hasn’t been answered yet. To convert the values back to daily values, we’re going to need a little bit of math here.

Image by Author

in which n is the number of samples. Then, we can acquire the variance of the monthly averages, which is

Image by Author

in which Ȳjˢ is the standardized version of the monthly averages. Finally, we derive the standard error of the daily values with this set of equations:

Yay! :), Image by Author
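The exact equations live in the images above, but the logic should reduce to the standard relationship between a sample mean and the values it averages: if a month’s average Ȳ is computed from n daily values treated as independent draws with common variance σ², then

Var(Ȳ) = σ² / n, so σ = √(n · Var(Ȳ)).

In other words, the daily standard error is the monthly-average standard deviation scaled up by √n, which is what lets us widen the monthly band back out to a daily one.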

Aaaand without further ado, let’s see how the daily extrapolated values turned out.

Image by Author

My first reaction was, “What kind of noisy time series is this? This is nuts!” I don’t think we need much explanation to answer the question: it’s a definite no, at least with this method. The extrapolated values become too noisy and are only effective in the short term, since we’re using extrapolated data to extrapolate (#extrapo-ception). Moreover, the monthly average values don’t carry the “jumps” that the daily values do, so the extrapolated daily values are unable to capture them.

Conclusion

  • This extrapolation method can only create values according to the dataset used in the calculations, and the generated values will follow its characteristics.
  • The stationarity assumption is likely behind the inability to detect “jumps”. Therefore, a more appropriate model might be the solution for generating better-fitting extrapolated values.
  • Even if the extrapolated values fit perfectly, that doesn’t mean they’re a perfect representation of the population. Nevertheless, they still give an estimated depiction of what the population might look like.

What’s next?

I may not be an expert in this, but I did learn to work creatively with a time-series dataset. Even so, I would love to hear any suggestions that might improve this method even more. Below is my GitHub repo for this time-series extrapolation method. I will definitely post more data science and actuarial science projects in the near future, so stay tuned!
