Rainfall Time Series Analysis and Forecasting

Banten, Indonesia 2019–2020 Rainfall forecasting using “R” Language

Published in

Towards Data Science

12 min readMay 17, 2020

A forecast is calculation or estimation of future events, especially for financial trends or coming weather. Until this year, forecasting was very helpful as a foundation to create any action or policy before facing any events. As an example, in the tropics region which several countries only had two seasons in a year (dry season and rainy season), many countries especially country which relies so much on agricultural commodities will need to forecast rainfall in term to decide the best time to start planting their products and maximizing their harvest. Another example is forecast can be used for a company to predict raw material prices movements and arrange the best strategy to maximize profit from it.

In this article, we will try to do Rainfall forecasting in Banten Province located in Indonesia (One of the tropical country which relies on their agriculture commodity), we have 2006–2018 historical rainfall data¹ and will try to forecast using “R” Language.

A simple workflow will be used during this process:

Fig-1. Process Workflow

Import and Selecting Data

This data set contains Banten Province, Indonesia, rainfall historical data from January 2005 until December 2018. For this forecast, I will drop 2005 and start from 2006–2018 as a foundation for our forecast.

#Import Necessary Library
library(lubridate)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(astsa)
library(forecast)
library(readxl)
library(urca)
library(ggfortify)
library(tsutils)

#Import Data
hujan <- read_excel("../input/hujan-update/Hujan_Update.xlsx", 
 sheet = 'Sheet1')
hujan <- hujan %>% gather(key = "Tahun", value = "Curah_Hujan", 
 -Bulan)#Converting To Time Series
hujan_ts <- ts(data = hujan[,3], frequency = 12, start = c(2005,1))
#Selecting Data 
hujan_ts <- window(hujan_ts, start=c(2006,1))hujan_ts

After running those code, we will get this following time series data:

Exploratory Time Series Data Analysis

The first step on exploratory data analysis for any time series data is to visualize the value against the time. We will visualize our rainfall data into time series plot (Line chart, values against time) with this following code:

#Plot Time Series Dataautoplot(hujan_ts) + ylab("Rainfall (mm2)") + xlab("Datetime") + 
  scale_x_date(date_labels = '%b - %Y', breaks = '1 year', minor_breaks = '2 month') +
  theme_bw() + ggtitle("Banten Rainfall 2006 - 2018")

Time series plot visualizes that rainfall has seasonality pattern without any trends occurred; rainfall will reach its higher value at the end of the years until January (Rainy Season) and decreased start from March to August (Dry Season). This pattern will always be repeated from year to year during 2006–2018 periods.

We will decompose our time series data into more detail based on Trend, Seasonality, and Remainder component. Using this decomposition result, we hope to gain more precise insight into rainfall behavior during 2006–2018 periods.

Decomposition will be done using stl() function and will automatically divide the time series into three components (Trend, Seasonality, Remainder).

#Decomposition using stl()
decomp <- stl(hujan_ts[,1], s.window = 'periodic')#Plot decomposition
autoplot(decomp) + theme_bw() + scale_x_date(date_labels = '%b - %Y', breaks = '1 year', minor_breaks = '2 month') +
    ggtitle("Remainder")

The trend cycle and the seasonal plot shows there’s seasonal fluctuation occurred with no specific trend and fairly random remainder/residual.

There’s a calculation to measure trend and seasonality strength:

Ft: Trend Strength

Fs: Seasonal Strength

The strength of the trend and seasonal measured between 0 and 1, while “1” means there’s very strong of trend and seasonal occurred.

Tt <- trendcycle(decomp)
St <- seasonal(decomp)
Rt <- remainder(decomp)#Trend Strength Calculation
Ft <- round(max(0,1 - (var(Rt)/var(Tt + Rt))),1)#Seasonal Strength Calculation
Fs <- round(max(0,1 - (var(Rt)/var(St + Rt))),1)

data.frame('Trend Strength' = Ft , 'Seasonal Strength' = Fs)

Fig-7. Trend/Seasonal Strength Calculation Result

By using the formula for measuring both trend and seasonal strength, we’re proving that our data has a seasonality pattern (Seasonal strength: 0.6) with no trend occurred (Trend Strength: 0.2).

Seasonality Analysis

We know that our data has a seasonality pattern. So, to explore more about our rainfall data seasonality; seasonal plot, seasonal-subseries plot, and seasonal boxplot will provide a much more insightful explanation about our data.

#Seasonal Plot
seasonplot(hujan_ts, year.labels = TRUE, col = 1:13, 
   main =  "Seasonal Plot", ylab= "Rainfall (mm2)")

Fig-8. Rainfall Time Series Seasonal Plot

Seasonal plot indeed shows a seasonal pattern that occurred each year. Still, due to variances on several years during the period, we can’t see the pattern with only using this plot. Further exploration will use Seasonal Boxplot and Subseries plot to gain more in-depth analysis and insight from our data.

#Seasonal Sub-Series Plot
seasplot(hujan_ts, outplot = 3, trend = FALSE, 
      main = "Seasonal Subseries Plot", ylab= "Rainfall (mm2)")#Seasonal Boxplot
seasplot(hujan_ts, outplot = 2, trend = FALSE, 
     main = "Seasonal Box Plot", ylab= "Rainfall (mm2)")

Using seasonal boxplot and sub-series plot, we can more clearly see the data pattern. The horizontal lines indicate rainfall value means grouped by month, with using this information we’ve got the insight that Rainfall will start to decrease from April and reach its lowest point in August and September. Rainfall will begin to climb again after September and reach its peak in January.

With this, we can assign Dry Season on April-September period and Rainy Season on October-March. It can be a beneficial insight for the country which relies on agriculture commodity like Indonesia. Dry and Rainy season prediction can be used to determine the right time to start planting agriculture commodities and maximize its output.

Also, this information can help the government to prepare any policy as a prevention method against a flood that occurred due to heavy rain on the rainy season or against drought on dry season.

2019–2020 Rainfall Forecasting

The first step in forecasting is to choose the right model. To do so, we need to split our time series data set into the train and test set. The train set will be used to train several models, and further, this model should be tested on the test set.

Split Train/Test Set

Train set: We will use all of the data until December-2017 as our training set

Test set: 2018 Period (January-December) will act as our test set

#Create Train Set
hujan_train <- window(hujan_ts, end = c(2017,12))#Create Test Set 
hujan_test <- window(hujan_ts, start = c(2018,1))

Stationary Test

Train set data should be checked about its stationary before starting to build an ARIMA model. A stationary test can be done using Kwiatkowski–Phillips–Schmidt–Shin Test (KPSS) and Dickey-Fuller Test (D-F Test) from URCA package.

#Kwiatkowski–Phillips–Schmidt–Shin Test
summary(ur.kpss(hujan_train)) #Dickey-Fuller Test
summary(ur.df(hujan_train)) ####################### 
# KPSS Unit Root Test # 
####################### 

Test is of type: mu with 4 lags. 

Value of test-statistic is: 0.0531 

Critical value for a significance level of: 
                10pct  5pct 2.5pct  1pct
critical values 0.347 0.463  0.574 0.739############################################### 
# Augmented Dickey-Fuller Test Unit Root Test # 
############################################### 

Test regression none 


Call:
lm(formula = z.diff ~ z.lag.1 - 1 + z.diff.lag)

Residuals:
    Min      1Q  Median      3Q     Max 
-176.15  -43.66    3.80   68.21  355.75 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
z.lag.1    -0.15081    0.05230  -2.883 0.004555 ** 
z.diff.lag -0.28241    0.08111  -3.482 0.000664 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 96.77 on 140 degrees of freedom
Multiple R-squared:  0.1755,	Adjusted R-squared:  0.1637 
F-statistic:  14.9 on 2 and 140 DF,  p-value: 1.36e-06


Value of test-statistic is: -2.8835 

Critical values for test statistics: 
      1pct  5pct 10pct
tau1 -2.58 -1.95 -1.62

Using 95% as confidence level, the null hypothesis (ho) for both of test defined as:

KPSS Test: Data are stationary
D-F Test: Data are non-stationary

So, for KPSS Test we want p-value > 0.5 which we can accept null hypothesis and for D-F Test we want p-value < 0.05 to reject its null hypothesis. Based on the test which been done before, we can comfortably say that our training data is stationary.

Model Assignment

Forecasting will be done using both of ARIMA and ETS model, the comparison between those models also will be evaluated using some parameters against the test set.

ARIMA Model

The first step in building the ARIMA model is to create an autocorrelation plot on stationary time series data. Sometimes to have stationary data, we need to do differencing; for our case, we already have a stationary set. Therefore the number of differences (d, D) on our model can be set as zero.

Our dataset has seasonality, so we need to build ARIMA (p,d,q)(P, D, Q)m, to get (p, P,q, Q) we will see autocorrelation plot (ACF/PACF) and derived those parameters from the plot.

acf2(hujan_train)

ACF Plot is used to get MA parameter (q, Q), there’s a significant spike at lag 2 and the sinusoidal curve indicates annual seasonality (m = 12). PACF Plot is used to get AR parameter (p, P), there’s a significant spike at lag 1 for AR parameter.

This ACF/PACF plot suggests that the appropriate model might be ARIMA(1,0,2)(1,0,2). To make sure about this model, we will set other model based on our suggestion with modifying (AR) and (MA) component by ±1.

Model-1 : ARIMA(1,0,2)(1,0,2)
Model-2 : ARIMA(2,0,2)(2,0,2)
Model-3 : ARIMA(1,0,1)(1,0,1)
Model-4 : ARIMA(2,0,1)(2,0,1)
Model-5 : ARIMA(0,0,2)(0,0,2)
Model-6 : auto.arima()

we will also set auto.arima() as another comparison for our model and expecting to find a better fit for our time series.

fit1 <- Arima(hujan_train, order = c(1,0,2), seasonal = c(1,0,2))
fit2 <- Arima(hujan_train, order = c(2,0,2), seasonal = c(2,0,2))
fit3 <- Arima(hujan_train, order = c(1,0,1), seasonal = c(1,0,1))
fit4 <- Arima(hujan_train, order = c(2,0,1), seasonal = c(2,0,1))
fit5 <- Arima(hujan_train, order = c(0,0,2), seasonal = c(0,0,2))
fit6 <- auto.arima(hujan_train, stepwise = FALSE, 
        approximation = FALSE)

To choose the best fit among all of the ARIMA models for our data, we will compare AICc value between those models. The model with minimum AICc often is the best model for forecasting.

data.frame('Model-1' = fit1$aicc, 'Model-2' = fit2$aicc, 
     'Model-3' = fit3$aicc,
     'Model-4' = fit4$aicc, 
     'Model-5' =  fit5$aicc,'Auto.Arima'= fit6$aicc,
      row.names =   "AICc Value")

AICc value of Model-1 is the lowest among other models, that’s why we will choose this model as our ARIMA model for forecasting. But, we also need to have residuals checked for this model to make sure this model will be appropriate for our time series forecasting.

checkresiduals(fit1)	Ljung-Box test

data:  Residuals from ARIMA(1,0,2)(1,0,2)[12] with non-zero mean
Q* = 10.187, df = 17, p-value = 0.8956

Model df: 7.   Total lags used: 24

Based on the Ljung-Box test and ACF plot of model residuals, we can conclude that this model is appropriate for forecasting since its residuals show white noise behavior and uncorrelated against each other.

ETS Model

We will build ETS model and compares its model with our chosen ARIMA model to see which model is better against our Test Set.

#ETS Model
fit_ets <- ets(hujan_train, damped =TRUE)

Similar to the ARIMA model, we also need to check its residuals behavior to make sure this model will work well for forecasting.

checkresiduals(fit_ets)Ljung-Box test

data:  Residuals from ETS(A,Ad,A)
Q* = 40.433, df = 7, p-value = 1.04e-06

Model df: 17.   Total lags used: 24

After a residual check, ACF Plot shows ETS Model residuals have little correlation between each other on several lag, but most of the residuals are still within the limits and we will stay using this model as a comparison with our chosen ARIMA model.

ARIMA vs ETS Model on Test Set

Better models for our time series data can be checked using the test set. We will use both of ARIMA and ETS models to predict and see their accuracy against the test set (2018, Jan-Dec).

In the first step, we need to plot visualization between ARIMA Model, ETS Model, and our actual 2018 data. But since ggfortify package doesn’t fit nicely with the other packages, we should little modify our code to show beautiful visualization.

note: if you didn’t load ggfortify package, you can directly use : autoplot(actual data) + autolayer(forecast_data) , to do visualization.

#Modifying Data For ggplotmodel_1 <- forecast(fit1, h=12) 
model_ets <- forecast(fit_ets, h=12)

model_1 <- as.data.frame(model_1$mean)
model_ets <- as.data.frame(model_ets$mean)

colnames(model_1) <- "Curah_Hujan"
colnames(model_ets) <- "Curah_Hujan"

hujan_train_df <- as.data.frame(hujan_train)

model_1_plot <- rbind(hujan_train_df,model_1)
model_ets_plot <- rbind(hujan_train_df, model_ets)

model_1_plot <- model_1_plot %>% 
  mutate('Date' = seq(from = as.Date("2006-01-01", '%Y-%m-%d'), to = as.Date("2018-12-31",'%Y-%m-%d'),by = 'month'))

model_ets_plot <- model_ets_plot %>% 
  mutate('Date' = seq(from = as.Date("2006-01-01", '%Y-%m-%d'), to = as.Date("2018-12-31",'%Y-%m-%d'),by = 'month'))

hujan_ts_df <- as.data.frame(hujan_ts)

hujan_ts_df <- hujan_ts_df %>% 
  mutate('Date' = seq(from = as.Date("2006-01-01", '%Y-%m-%d'), to = as.Date("2018-12-31",'%Y-%m-%d'),by = 'month'))

hujan_train_df <- hujan_train_df %>% 
  mutate('Date' = seq(from = as.Date("2006-01-01", '%Y-%m-%d'), to = as.Date("2017-12-31",'%Y-%m-%d'),by = 'month'))

colors <- c("ARIMA Model Forecast 2018" = "blue", "ETS Model Forecast 2018" = "red", "Actual Data" = "black")


#Creating Plotggplot() + geom_line(model_1_plot,
   mapping = aes(x=Date, y=Curah_Hujan, 
   color= "ARIMA Model Forecast 2018"),lty = 2) +
  geom_line(model_ets_plot,
  mapping = aes(x=Date, y=Curah_Hujan, 
  color= "ETS Model Forecast 2018"),lty= 2) +
  geom_line(hujan_ts_df,mapping = aes(x=Date, y=Curah_Hujan, 
  color= "Actual Data"), lty = 1, show.legend = TRUE) +
  ylab("Rainfall (mm2)") + xlab("Datetime") + 
  scale_x_date(date_labels = '%b - %Y', breaks = '1 year', 
  minor_breaks = '2 month') +
  theme_bw() + ggtitle("Banten Rainfall 2006 - 2018") + 
  scale_color_manual(values=colors)

Fig-15. Actual Data vs ARIMA & ETS Forecasting

Even though both ARIMA and ETS models are not exactly fit the same value with actual data, but surely both of them plotting a quite similar movement against it.

In numbers, we can calculate accuracy between those model with actual data and decide which one is most accurate with our data:

#ARIMA Model Accuracy
accuracy(forecast(fit1, h=12), hujan_test)

#ETS Model Accuracy
accuracy(forecast(fit_ets, h=12), hujan_test)

based on the accuracy, ETS Model doing better in both training and test set compared to ARIMA Model.

2019–2020 Rainfall Forecasting

Using the same parameter with the model that created using our train set, we will forecast 2019–2020 rainfall forecasting (h=24).

Forecasting and Plot:

#Create Model
ARIMA_Model <- Arima(hujan_ts, order = c(1,0,2), 
   seasonal = c(1,0,2))
ETS_Model <- ets(hujan_ts, damped = TRUE, model = "AAA")

#ARIMA Model Forecast
autoplot(forecast(ARIMA_Model, h=24)) + theme_bw() + 
  ylab("Rainfall (mm2)") + xlab("Datetime") + 
  scale_x_date(date_labels = '%b - %Y', 
  breaks = '1 year', minor_breaks = '2 month') +
  theme_bw() + ggtitle("Banten Rainfall Forecast 2019-2020 
  ARIMA Model")
#ETS Model Forecast
autoplot(forecast(ETS_Model, h=24)) + theme_bw() + 
  ylab("Rainfall (mm2)") + xlab("Datetime") + 
  scale_x_date(date_labels = '%b - %Y', breaks = '1 year', 
  minor_breaks = '2 month') +
  theme_bw() + ggtitle("Banten Rainfall Forecast 2019-2020 
  ETS Model")

Fig-18. ARIMA Model Forecasting 2019–2020

Forecasting was done using both of the models, and they share similar movement based on the plot with the lowest value of rainfall will occur during August on both of 2019 and 2020.

Our prediction can be useful for a farmer who wants to know which the best month to start planting and also for the government who need to prepare any policy for preventing flood on rainy season & drought on dry season. The most important thing is that this forecasting is based only on the historical trend, the more accurate prediction must be combined using meteorological data and some expertise from climate experts.

[1]banten.bps.go.id.Accessed on May,17th 2020

[2]Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2.Accessed on May,17th 2020.

Rainfall Time Series Analysis and Forecasting

Banten, Indonesia 2019–2020 Rainfall forecasting using “R” Language

Import and Selecting Data

Exploratory Time Series Data Analysis

2019–2020 Rainfall Forecasting

Written by Dery Kurniawan