Weather forecasting with Machine Learning, using Python

A simple yet powerful application of Machine Learning to weather forecasting

Published in Towards Data Science · 5 min read · Apr 18, 2021

Physicists define climate as a “complex system”. The term has many interpretations, but in this specific case we can take “complex” to mean “unsolvable in analytical ways”.

This may seem discouraging, but it actually paves the way to a wide range of numerical algorithms that aim to solve climate challenges. With the computational developments of recent years, Machine Learning algorithms are certainly among them.

The challenge I want to discuss is forecasting the average temperature using a traditional machine learning approach: AutoRegressive Integrated Moving Average (ARIMA) models.

While this post doesn’t aim to cover the theoretical background in detail, it does aim to be a step-by-step guide on how to use these models in Python and apply them to real-world data.

So let’s start by describing the Python framework.

0. The Libraries

The libraries used are the most popular ones for data analysis, plotting and mathematical operations (pandas, matplotlib, numpy). On top of these, there are some for advanced data visualization (like folium) and some that are specific to ARIMA models (like statsmodels). Here is the code for the import:

1. The Dataset/Dataset Exploration

The Dataset is open source and can be found here.

If you want to know the cities in your dataset, select them by using this line of pandas:
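As a rough sketch, assuming the city names live in a column called City (the real file is replaced here by a tiny hand-made DataFrame with illustrative values), the selection could look like this:

```python
import pandas as pd

# Tiny stand-in for the real file; column names and values are illustrative assumptions
df = pd.DataFrame({
    "dt": ["2013-01-01", "2013-01-01", "2013-02-01", "2013-02-01"],
    "City": ["Chicago", "Rome", "Chicago", "Rome"],
    "AverageTemperature": [-3.1, 7.4, -2.0, 8.2],
})

# unique() returns each city once, in order of first appearance
cities = df["City"].unique()
print(cities)
```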

If we want to plot these cities on a world map, we first need to slightly reformat the latitude and longitude columns. In order to do that, let’s use these few lines of code:
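The original lines are missing here. A plausible sketch, assuming the coordinates are stored as strings with a hemisphere suffix (such as "41.85N" or "87.62W") in columns named Latitude and Longitude, is to convert them to signed decimal degrees. The helper name to_signed_degrees and the sample values are hypothetical.

```python
import pandas as pd

def to_signed_degrees(coord: str) -> float:
    """Convert a string such as '41.85N' or '87.62W' to a signed float."""
    value, hemisphere = float(coord[:-1]), coord[-1].upper()
    # South and West are negative in the usual decimal-degree convention
    return -value if hemisphere in ("S", "W") else value

# Illustrative rows only; the format of the real columns is an assumption
city_coords = pd.DataFrame({
    "City": ["Chicago", "Sydney"],
    "Latitude": ["42.59N", "34.56S"],
    "Longitude": ["87.27W", "151.78E"],
})
city_coords["lat"] = city_coords["Latitude"].map(to_signed_degrees)
city_coords["lon"] = city_coords["Longitude"].map(to_signed_degrees)
print(city_coords[["City", "lat", "lon"]])
```

Signed floats like these are what most mapping libraries expect for marker positions.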

And display the cities:

2. Preprocessing, Advanced Visualization, Stationarity

I’ve chosen to isolate Chicago and consider the data of that city to be my dataset. There are no special reasons to do that… I just like Chicago :) . Of course you can use your own city and follow the next steps with your own dataset.

The target is the AverageTemperature column, which holds the average temperature for each month. We have data from 1743 to 2013.

With this line we identify the NaN values and display them with a pie chart:
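A minimal sketch of that step, using a few made-up values in place of the Chicago data, could be:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values standing in for the Chicago AverageTemperature column
chicago = pd.DataFrame(
    {"AverageTemperature": [1.2, np.nan, 3.4, 5.0, np.nan, 9.1, 14.3, 20.0]}
)

# isna() flags the missing entries; summing the flags counts them
n_missing = int(chicago["AverageTemperature"].isna().sum())
n_present = len(chicago) - n_missing

fig, ax = plt.subplots()
ax.pie([n_present, n_missing], labels=["Valid", "NaN"], autopct="%1.1f%%")
ax.set_title("Missing values in AverageTemperature")
```

Calling plt.show() afterwards displays the chart in an interactive session.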

As they make up only a small part of the dataset, I’ve decided to fill each missing value with the previous valid one (a forward fill). I did the same for the Average Temperature Uncertainty.
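Filling with the previous value corresponds to a forward fill in pandas. A small sketch on made-up numbers, assuming the uncertainty column is named AverageTemperatureUncertainty:

```python
import numpy as np
import pandas as pd

# Made-up values; the second column name is an assumption about the dataset
chicago = pd.DataFrame({
    "AverageTemperature": [1.2, np.nan, 3.4, np.nan],
    "AverageTemperatureUncertainty": [0.5, 0.4, np.nan, 0.3],
})

cols = ["AverageTemperature", "AverageTemperatureUncertainty"]
# ffill() propagates the last valid observation forward into each gap
chicago[cols] = chicago[cols].ffill()
print(chicago)
```

Note that a forward fill cannot repair a gap in the very first row, since there is no earlier value to copy.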

The ‘dt’ column is the one that identifies the year and the month. For the next operations, it is handier to convert this column into a datetime object and to explicitly identify the year and the month in two different columns. We can do that by using the following lines:
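One way to do this with pandas is sketched below. The new column names Year and Month are assumptions, and the two rows are illustrative only.

```python
import pandas as pd

# Two illustrative rows; 'dt' starts out as plain strings
chicago = pd.DataFrame({
    "dt": ["1743-11-01", "2013-09-01"],
    "AverageTemperature": [5.4, 19.9],
})

# Parse the strings into datetime64 values, then split out year and month
chicago["dt"] = pd.to_datetime(chicago["dt"])
chicago["Year"] = chicago["dt"].dt.year
chicago["Month"] = chicago["dt"].dt.month
print(chicago)
```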

Using this dataset it is possible to obtain a scatter plot like this one:

But it is not easy to read, so we should do something better.

Now let’s describe three super-basic functions I created:

get_timeseries(start_year,end_year) extracts the portion of the dataset between the two years
plot_timeseries(start_year,end_year) plots the timeseries extracted by get_timeseries in a readable way
plot_from_data(data, time, display_options) plots the data (AverageTemperature) with respect to the time (dt) in a readable way. The display options let you display the ticks, change the colors, set the label …
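The bodies of these functions are not shown in this text, so here is only a hedged sketch of the first two. It assumes a Year column, treats both years as inclusive, and adds a hypothetical data parameter so that the functions do not depend on a global variable. Synthetic monthly data stands in for the Chicago subset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic monthly data standing in for the Chicago subset
dates = pd.date_range("1990-01-01", "1999-12-01", freq="MS")
chicago = pd.DataFrame({
    "dt": dates,
    "Year": dates.year.to_numpy(),
    "AverageTemperature": 10 + 12 * np.sin(2 * np.pi * (dates.month.to_numpy() - 4) / 12),
})

def get_timeseries(start_year, end_year, data=chicago):
    """Return the rows whose Year lies in [start_year, end_year] (inclusive)."""
    mask = (data["Year"] >= start_year) & (data["Year"] <= end_year)
    return data.loc[mask].reset_index(drop=True)

def plot_timeseries(start_year, end_year, data=chicago):
    """Line plot of the extracted window; returns the Axes for further styling."""
    window = get_timeseries(start_year, end_year, data)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(window["dt"], window["AverageTemperature"], marker=".")
    ax.set_xlabel("Date")
    ax.set_ylabel("Average temperature")
    ax.set_title(f"{start_year}-{end_year}")
    return ax

window = get_timeseries(1992, 1994)
ax = plot_timeseries(1992, 1994)
```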

Once that is done, we can make plots like this one:

When we use ARIMA models, we should be considering stationary time series. In order to check if the timeseries we are considering is stationary, we can check the correlation and autocorrelation plots:

These plots suggest that the timeseries is not stationary. Nonetheless, if we perform the Augmented Dickey-Fuller (ADF) test on the entire dataset, it tells us that the dataset is stationary.

But this holds only because we are looking at the entire dataset. In fact, if we analyze a single decade, it is clear that the dataset is not stationary over that shorter period.

In order to account for this non-stationarity, a differencing term will be considered in the ARIMA models.
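In an ARIMA(p, d, q) model, d = 1 means the model is fitted on the differences between consecutive observations rather than on the raw values. A tiny illustration with made-up numbers:

```python
import pandas as pd

# Made-up temperatures for five consecutive months
temps = pd.Series([2.0, 5.0, 9.0, 14.0, 12.0])

# First difference: each value minus the previous one (the first entry has no predecessor)
first_diff = temps.diff().dropna()
print(first_diff.tolist())
```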

3. Machine Learning Algorithms

Let’s consider the 1992–2013 period and plot it:

Performing the train/test split:
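For time series, the split has to be chronological: the most recent observations are held out for testing. The sketch below uses synthetic data, and the size of the test window (24 months) is an assumption, since the original split is not shown.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series covering 1992-2013, standing in for the Chicago data
rng = np.random.default_rng(0)
dates = pd.date_range("1992-01-01", "2013-09-01", freq="MS")
series = pd.Series(
    10 + 12 * np.sin(2 * np.pi * dates.month.to_numpy() / 12)
    + rng.normal(0, 1, len(dates)),
    index=dates,
    name="AverageTemperature",
)

test_size = 24  # hypothetical: hold out the last two years
train, test = series.iloc[:-test_size], series.iloc[-test_size:]
print(len(train), len(test))
```

No shuffling is applied, so every test observation comes after every training observation.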

Plotting the split:

The Machine Learning algorithms are the ARIMA models. Their parameters are estimated by maximizing the likelihood function.

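The zero-differenced (d = 0) ARIMA models are fitted and compared using the Akaike Information Criterion (AIC), where a lower value indicates a better trade-off between fit and model complexity.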

The first-differenced (d = 1) models are evaluated using these lines:

The overall summary is produced by this function; it shows that the (2, 1, 5) and (2, 1, 6) models are the best ones.

As you can see, the statistical summary values are almost identical:

The same holds for the statistical plots:

Model (2, 1, 5)

Model (2, 1, 6)

Nonetheless, it is preferable to use lower-order models, both to avoid overfitting and to reduce the computational load on your computer. For this reason, the (2, 1, 5) model has been chosen.

4. Forecasting

Let’s plot the results of the forecasting operation:

And now let’s consider the specific predicted zone with the corresponding uncertainty (the one given by the dataset) and the confidence interval (given by the algorithm):

Finally, let’s consider a more readable version of the plot:

5. Conclusions

These methods are extremely easy to adopt because, unlike Deep Learning methods (RNN, CNN …), they don’t require any particular computational power.

Nonetheless, the predictions fall within the uncertainty range given by the dataset itself. It is important to keep in mind that we have only examined monthly average values; it may be interesting to consider daily values too and produce daily predictions.

Ciao! :)