Filling Missing Wind Speed Data Using Various Regression Techniques

Another method besides using the mean, median, and mode of the data

Muhammad Ryan
Towards Data Science


Photo by Irina on Unsplash

Missing data is very common when you collect data, and it becomes a problem in the data analysis phase. A common practice, and the best one at least for me, is to simply ignore the missing data. Why? Because no matter how good your method for filling the missing pieces is, the method always introduces some error. The filler data also can’t restore a pattern that is missing from the data, because the method predicts the missing values from the patterns that are already available in the data.

The most common approach to filling missing data is to use the mean of the data if your data have a normal distribution. If not, use the median or mode. Most of the time, people use the mean because the data are usually assumed to have a normal distribution (I think this is why it’s called normal: normally, the data will have this distribution). Why is that? Because we are trying to guess the missing value, and in a normal distribution most values lie around the mean, so the best guess is the mean of the data. That doesn’t hold for data with a non-normal distribution, where the best guess is the “center” value of the data or the most frequent value. That is why we choose the median or mode.
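As a small illustration of these three fillers, here is a minimal sketch that uses only Python's standard library. The function name `fill_missing` and the sample values are made up for this example; `None` is assumed to mark a missing observation.

```python
from statistics import mean, median, mode

def fill_missing(values, strategy="mean"):
    """Return a copy of `values` with every None replaced by one statistic."""
    observed = [v for v in values if v is not None]
    stat = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [stat if v is None else v for v in values]

# Hypothetical wind speeds; None marks a missing observation.
speeds = [0, 2, None, 3, 0, 12, None, 4]

print(fill_missing(speeds, "mean"))    # gaps filled with 3.5
print(fill_missing(speeds, "median"))  # gaps filled with 2.5
print(fill_missing(speeds, "mode"))    # gaps filled with 0
```

Note how the single large value (12) pulls the mean up, while the median and mode stay closer to the bulk of the sample.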

Another approach is to use a regression technique: we try to predict the missing value from its surrounding data. This is what we will discuss in this post. We will use Ordinary Linear Regression (OLR), Ridge Linear Regression (RLR), Lasso Linear Regression (LLR), Bayesian Linear Regression (BLR), Random Forest (RF), Support Vector Regression (SVR), Gradient Boosting (GB), and AdaBoost (AB). As benchmark techniques, we will use the mean and the median of the data.

The data we will use are the wind speed values in the METAR reports of the Lombok meteorological station. You can query the METAR data of any meteorological station in Indonesia with http://aviation.bmkg.go.id/latest/metar.php?i=<ICAO_code>&y=<year>&m=<month>. So, if I want to get the METAR data of Lombok (ICAO code WADL) for January 2020, the link is http://aviation.bmkg.go.id/latest/metar.php?i=WADL&y=2020&m=1
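If you want to build these links for several months, a small helper like the one below may be handy. It only assembles the URL from the pattern above and does not download anything; the function name `metar_url` is made up for this sketch.

```python
from urllib.parse import urlencode

BASE_URL = "http://aviation.bmkg.go.id/latest/metar.php"

def metar_url(station_code, year, month):
    """Build the query URL for one station and one month."""
    query = urlencode({"i": station_code, "y": year, "m": month})
    return f"{BASE_URL}?{query}"

# Lombok, January to March 2020
for month in (1, 2, 3):
    print(metar_url("WADL", 2020, month))
```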

The output of the example link. Image by Author

Our wind speed data are still encoded in METAR code, so we must decode the METAR first to extract the wind speed. Here is my code to extract the wind speed data.
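A minimal sketch of this decoding step could look like the following. It is not the exact script used for this post. It assumes the wind group follows the usual METAR layout (three digits of direction or `VRB`, two or three digits of speed, an optional gust part, then `KT` or `MPS`), and the sample reports are invented for illustration.

```python
import re

# Direction (ddd or VRB), speed (captured), optional gust, then the unit.
WIND_GROUP = re.compile(r"\b(?:\d{3}|VRB)(\d{2,3})(?:G\d{2,3})?(KT|MPS)\b")

def wind_speed(metar):
    """Return the sustained wind speed of one METAR report, or None.

    The value is returned in the unit used by the report (KT or MPS);
    no unit conversion is done here.
    """
    match = WIND_GROUP.search(metar)
    return int(match.group(1)) if match else None

# Invented sample reports, not real observations.
reports = [
    "METAR WADL 010000Z 27005KT 9999 FEW020 28/24 Q1009 NOSIG",
    "METAR WADL 010030Z VRB02KT 9999 SCT020 29/24 Q1009",
    "METAR WADL 010100Z 30012G22KT 8000 -RA BKN018 27/24 Q1008",
    "METAR WADL 010130Z NIL",  # report without a wind group
]

print([wind_speed(r) for r in reports])  # [5, 2, 12, None]
```

Returning `None` for reports without a wind group keeps the real gaps visible instead of silently turning them into zeros.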

In this case, I retrieve three months of Lombok’s wind speed data (January, February, and March). Let’s split the data to simulate missing data. We group the data into sequences of five consecutive values and treat the value at index 3 as the missing data (label) and the rest as the parameters (features).
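The grouping can be sketched as below. Two details are assumptions on my side for this sketch: the index is counted from zero (so index 3 is the fourth value), and the groups overlap by sliding one step at a time. Set `step=5` if you prefer non-overlapping groups. The name `make_windows` is made up for the example.

```python
def make_windows(series, size=5, label_index=3, step=1):
    """Split a sequence into windows; one value per window becomes the label."""
    features, labels = [], []
    for start in range(0, len(series) - size + 1, step):
        window = list(series[start:start + size])
        labels.append(window[label_index])  # the simulated missing value
        features.append(window[:label_index] + window[label_index + 1:])
    return features, labels

speeds = [3, 4, 6, 5, 2, 0, 1]  # hypothetical wind speeds
X, y = make_windows(speeds)
print(X)  # [[3, 4, 6, 2], [4, 6, 5, 0], [6, 5, 2, 1]]
print(y)  # [5, 2, 0]
```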

Our data is ready. Now, as the first step in analyzing the data, let’s look at its distribution. Here is the code.

And its output is
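If you just want a quick look without a plotting library, a text-only sketch such as the one below gives a rough picture. It is a simplified stand-in, not the plotting code behind the figure in this post: it prints a frequency bar per value and computes the skewness, which is close to 0 for symmetric data and positive when there is a long right tail. The sample values are invented.

```python
from collections import Counter
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: 0 for symmetric data, > 0 for a long right tail."""
    mu, sigma = mean(values), pstdev(values)
    return sum((v - mu) ** 3 for v in values) / (len(values) * sigma ** 3)

speeds = [0, 0, 0, 2, 2, 3, 4, 5, 8, 12]  # hypothetical wind speeds

# One bar per distinct value, sorted by value.
for value, count in sorted(Counter(speeds).items()):
    print(f"{value:>3} | {'#' * count}")

print(f"skewness = {skewness(speeds):.2f}")
```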

Histogram and distribution of the data. Image by Author

As you can see, our data have a non-normal distribution. To test our theory about which filler works best under which condition, let’s compute the mean, median, and mode of this data and check the Mean Absolute Error (MAE) of each against our “missing data” (label).
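The check itself is short. Here is a minimal sketch with invented numbers (so the values it prints are not the ones from this post): each statistic is used as a constant guess for every label, and the MAE is the average absolute difference between that guess and the labels.

```python
from statistics import mean, median, mode

def mae(prediction, labels):
    """Mean absolute error of one constant prediction against all labels."""
    return mean([abs(label - prediction) for label in labels])

observed = [0, 0, 0, 2, 2, 3, 4, 5, 8, 12]  # hypothetical known values
labels = [0, 1, 2, 3, 9]                    # hypothetical "missing" values

for name, stat in (("mean", mean), ("median", median), ("mode", mode)):
    filler = stat(observed)
    print(f"{name:>6}: filler = {filler}, MAE = {mae(filler, labels):.2f}")
```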

Image by Author

It’s 0 for the mode, 3.53 for the mean, and 2 for the median. Let’s see which one has the lowest MAE against “the missing data”.

Image by Author

As shown in the above image, the MAE is 3.23 for the mean, 3.13 for the median, and 3.52 for the mode. So the best filler for the missing data is the median. The MAE of the median will be used as the benchmark for the upcoming regression models.

Here is the code to train and test the various regression models on the missing wind speed data. In this code, we use the last 600 samples as the test data and the rest as the training data.
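A sketch of such a comparison with scikit-learn could look like this. Several things here are assumptions for the sake of a runnable example: the series is synthetic (a reflected random walk, not the Lombok data), every model uses scikit-learn's default hyperparameters, and the windows use the value at zero-based index 3 as the label. The MAE values it prints will therefore differ from the ones reported in this post.

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.svm import SVR

# Synthetic, non-negative stand-in for the wind speed series.
rng = np.random.default_rng(0)
series = np.abs(np.cumsum(rng.normal(0, 1, 1500)))

# Windows of five values; index 3 is the label, the other four are features.
windows = np.array([series[i:i + 5] for i in range(len(series) - 4)])
y = windows[:, 3]
X = np.delete(windows, 3, axis=1)

# The last 600 samples are the test data, the rest is the training data.
X_train, X_test = X[:-600], X[-600:]
y_train, y_test = y[:-600], y[-600:]

models = {
    "OLR": LinearRegression(),
    "RLR": Ridge(),
    "LLR": Lasso(),
    "BLR": BayesianRidge(),
    "RF": RandomForestRegressor(random_state=0),
    "SVR": SVR(),
    "GB": GradientBoostingRegressor(random_state=0),
    "AB": AdaBoostRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

# Benchmark: fill every test label with the median of the training labels.
benchmark = mean_absolute_error(y_test, np.full_like(y_test, np.median(y_train)))

print(f"median benchmark: {benchmark:.3f}")
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:>4}: {score:.3f}")
```

Keeping the test set as the last block of samples, rather than a random shuffle, respects the time order of the observations.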

and the result is

Image by Author

All the models beat our benchmark MAE. As you can see, the linear models (OLR, RLR, LLR, BLR) generally give better results than the non-linear models (RF, SVR, GB, AB), and the best model is OLR. The wind speed fluctuations in the METAR data we extracted here seem to be linear and predictable.

OK, that’s it: a real example of how to use regression models to fill the missing data in your dataset. In summary, the method you choose for generating filler data should be based on the characteristics of your data.
