
What is Time-Series Analysis?
One of the key concepts in data science is time-series analysis, which uses a statistical model to predict future values of a time series (e.g., financial prices, weather, COVID-19 positive cases/deaths) based on past values. Some components that might be seen in a time series are:
- Trend : Shows a general direction of time series data over a period of time – trends can be increasing (upward), decreasing (downward), or horizontal (stationary).
- Seasonality : A pattern that repeats at regular intervals with respect to timing, magnitude, and direction, such as the increase in ice cream sales during the summer months or the increase in subway riders during colder months.
- Cyclical Component : A pattern of ups and downs that has no fixed period of repetition. Cycles are mostly seen in business cycles and, unlike seasonality, do not repeat on a set schedule.
- Irregular Variation : Fluctuations in time-series data that are erratic and unpredictable, and that may or may not be random.
Time-series analysis can be either univariate or multivariate. Univariate analysis is used when only one variable is observed over time, whereas multivariate analysis is used when two or more variables are observed over time.
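To make these components concrete, here is a minimal sketch that builds a synthetic monthly series from a trend, a seasonal pattern, and random noise, then estimates the trend with a 12-month rolling mean. The series, its coefficients, and the variable names are made up for illustration; this is not the airline data used later in the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.date_range("2000-01-01", periods=120, freq="MS")
t = np.arange(120)

trend = 100 + 2.0 * t                        # steady upward trend
seasonal = 15 * np.sin(2 * np.pi * t / 12)   # pattern repeating every 12 months
irregular = rng.normal(0, 3, size=120)       # erratic, unpredictable variation

series = pd.Series(trend + seasonal + irregular, index=months, name="value")

# Averaging over a full 12-month window cancels the seasonal pattern,
# leaving an estimate of the trend (plus some smoothed noise).
trend_estimate = series.rolling(window=12).mean().dropna()
print(trend_estimate.head())
```

Since this series has a single column of values indexed by date, it is also an example of a univariate time series.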
What is ARIMA? Why Use Pmdarima?
ARIMA stands for AutoRegressive Integrated Moving Average. It is a way of modeling time-series data for forecasting and is specified by three order parameters (p,d,q):
- AR (p): the number of lagged observations used, which accounts for the pattern of growth/decline in the data
- I (d): the number of times the data is differenced, which accounts for the rate of change of the growth/decline
- MA (q): the number of lagged forecast errors used, which accounts for the noise between time points
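As a rough illustration of what the p and q orders refer to, the sketch below simulates an AR(1) and an MA(1) process with NumPy and compares how quickly their autocorrelation dies out. The coefficients (0.8 and 0.6), the sample size, and the helper name `lag_corr` are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
noise = rng.normal(size=n)

# AR(1): each value depends on the previous value plus new noise (p = 1).
ar = np.zeros(n)
for i in range(1, n):
    ar[i] = 0.8 * ar[i - 1] + noise[i]

# MA(1): each value depends on the current and previous noise terms (q = 1).
ma = noise.copy()
ma[1:] += 0.6 * noise[:-1]

def lag_corr(x, lag):
    """Sample correlation between a series and itself shifted by `lag`."""
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# The AR(1) series stays correlated across several lags, while the MA(1)
# correlation is confined to lag 1 and is close to zero afterwards.
print("AR(1):", round(lag_corr(ar, 1), 2), round(lag_corr(ar, 3), 2))
print("MA(1):", round(lag_corr(ma, 1), 2), round(lag_corr(ma, 3), 2))
```

This difference in how the correlations decay is what the ACF and PACF plots are used to read off in the traditional workflow.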
There are three types of ARIMA models: ARIMA, SARIMA (which adds seasonal terms), and SARIMAX (which adds seasonal terms and exogenous variables).
Pmdarima’s auto_arima function is extremely useful when building an ARIMA model, as it helps us identify the optimal p,d,q parameters and returns a fitted ARIMA model.
As a newcomer to data science, I took the "long" way when conducting time-series analysis before coming across pmdarima’s auto_arima function to build a high-performance time-series model. For this article, I will focus on univariate time-series analysis to forecast the number of airline passengers (using a dataset from Kaggle) and walk through the traditional ARIMA implementation versus the more efficient auto_arima way.
The general steps to implement an ARIMA model:
- Load and prepare data
- Check for stationarity (make data stationary if necessary) and determine d value
- Create ACF and PACF plots to determine p and q values
- Fit ARIMA model
- Predict values on test set
- Calculate r²
First, I loaded and prepared the data by changing the date to a datetime object, setting the date to index using the set_index method, and checking for null values.
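import pandas as pd

# Load the data and give the columns simpler names
df=pd.read_csv('AirPassengers.csv')
df=df.rename(columns={'#Passengers':'passengers','Month':'date'})
# Convert the date column to datetime and use it as the index
df['date'] = pd.to_datetime(df['date'])
df=df.set_index('date')
# Check for null values
print(df.isnull().sum())
df.head()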

I then took a preliminary look at the average monthly number of airline passengers, which revealed that the data was not stationary. This was further confirmed by a Dickey-Fuller test, which is a unit root test for stationarity, as shown in the image below:


After differencing our data twice, our p-value was less than our alpha (0.05), so we were able to reject the null hypothesis and accept the alternative hypothesis that the data is stationary. We then modeled our time-series data by setting the d parameter to 2. Next, I looked at our ACF/PACF plots using the differenced data to visualize the lags that will likely be influential when modeling the number of passengers.
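The differencing step can be sketched with the pandas diff method. The series below is a made-up quadratic trend, chosen because it needs exactly two rounds of differencing to flatten out; on the airline data the same calls would be applied to the passengers column.

```python
import numpy as np
import pandas as pd

t = np.arange(10)
series = pd.Series(3 * t**2 + 2 * t + 5, dtype="float64")  # curved upward trend

first_diff = series.diff().dropna()       # d = 1: values still increase over time
second_diff = first_diff.diff().dropna()  # d = 2: constant, so the trend is removed

print(first_diff.tolist())
print(second_diff.tolist())
```

Each round of differencing drops one observation, which is why dropna() is called after every diff().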


From our visualizations, I determined that our p parameter is 0 and our q parameter is 2, so our p,d,q parameters will be (0,2,2) for the ARIMA model. After splitting the data into training and testing groups and fitting the ARIMA model on the training set to predict the test set, we obtained an r² value of -1.52. A negative r² means the forecasts fit the test data worse than simply predicting the mean of the test data, so the model did not follow the trend of the data at all.
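To see how r² can go negative, here is a minimal sketch with made-up numbers. The function name `r_squared` and both forecasts are hypothetical; they are not the values from my model.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [10, 12, 14, 16, 18]

good_forecast = [11, 12, 13, 16, 19]  # close to the actual values
bad_forecast = [18, 16, 14, 12, 10]   # trends in the wrong direction

print(r_squared(actual, good_forecast))  # close to 1
print(r_squared(actual, bad_forecast))   # below 0
```

The r2_score function in scikit-learn computes the same quantity, so it can be used on the test values and predictions directly.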

I most likely chose the p,d,q values incorrectly, which caused the r² value to be negative, but in the meantime let’s try to build another ARIMA model using pmdarima.
Using pmdarima for Auto ARIMA model
In the previous method, checking for stationarity, making the data stationary if necessary, and determining the values of p and q using the ACF/PACF plots can be time-consuming. Using pmdarima’s auto_arima() function makes this task easier by eliminating steps 2 and 3 of implementing an ARIMA model. Let’s try it with the current dataset.
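After loading and preparing the data, we can use pmdarima’s ADFTest class to conduct a Dickey-Fuller test. Its should_diff() method returns the p-value and a boolean indicating whether the test recommends differencing.
from pmdarima.arima import ADFTest

adf_test=ADFTest(alpha=0.05)
adf_test.should_diff(df)
# Output
(0.01, False)
Taken on its own, this result (a p-value of 0.01, which is below alpha, and a should_diff value of False) says that this particular test does not call for differencing. However, the preliminary look at the data showed that it is not stationary, so we still use the "Integrated (I)" concept (d parameter) to make the data stationary while building the Auto ARIMA model.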
Next, I split the dataset into training and test sets (roughly 80%/20%) to build the Auto ARIMA model on the training set and forecast over the test period.
import matplotlib.pyplot as plt

# First 114 months for training, last 30 months for testing
train=df[:114]
test=df[-30:]
plt.plot(train)
plt.plot(test)
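Because order matters in a time series, the split is chronological rather than random. Below is a small sketch of a split helper, using a made-up 144-row monthly series as a stand-in for the airline data (the function name `chronological_split` is hypothetical).

```python
import pandas as pd

def chronological_split(data, n_test):
    """Hold out the last `n_test` rows as the test set, keeping time order."""
    return data.iloc[:-n_test], data.iloc[-n_test:]

# Stand-in for the airline data: 144 monthly observations.
index = pd.date_range("1949-01-01", periods=144, freq="MS")
data = pd.DataFrame({"passengers": range(144)}, index=index)

train, test = chronological_split(data, n_test=30)
print(len(train), len(test))                 # 114 30
print(train.index.max() < test.index.min())  # True
```

Every training date comes before every test date, so the model never sees the period it is asked to forecast.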

Then, we build the Auto ARIMA model using pmdarima’s auto_arima() function. The function takes lowercase p,d,q values, which represent the non-seasonal components, and uppercase P,D,Q values, which represent the seasonal components. Like other hyperparameter tuning methods, auto_arima() tries different combinations to find the optimal p,d,q values. The final values are chosen by minimizing an information criterion such as AIC or BIC (lower is better).
from pmdarima import auto_arima

# m=12 sets a 12-month seasonal period; trace=True prints each model tried
model=auto_arima(train,start_p=0,d=1,start_q=0,
                 max_p=5,max_d=5,max_q=5, start_P=0,
                 D=1, start_Q=0, max_P=5,max_D=5,
                 max_Q=5, m=12, seasonal=True,
                 error_action='warn',trace=True,
                 suppress_warnings=True,stepwise=True,
                 random_state=20,n_fits=50)
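For reference, AIC and BIC both reward goodness of fit (a higher log-likelihood) and penalize the number of estimated parameters. The sketch below uses the standard definitions with made-up log-likelihoods for two hypothetical candidate models; it does not reproduce pmdarima's internal search.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical candidates fitted on 114 observations:
# (label, log-likelihood, number of estimated parameters)
candidates = [
    ("small model", -380.0, 2),
    ("large model", -379.5, 6),
]

for label, ll, k in candidates:
    print(label, round(aic(ll, k), 1), round(bic(ll, k, 114), 1))

# The larger model fits slightly better but pays for its extra parameters.
best = min(candidates, key=lambda c: aic(c[1], c[2]))
print("lowest AIC:", best[0])
```

With trace=True, auto_arima() prints the criterion value for each candidate it tries, so the same comparison can be read from its output.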

We can view the model summary with model.summary():

Next, we can use the trained model to forecast the number of airline passengers over the test period and create a visualization.
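import numpy as np

# Forecast 30 periods ahead; np.asarray keeps only the values so they line up with the test dates
prediction = pd.DataFrame(np.asarray(model.predict(n_periods = 30)),index=test.index)
prediction.columns = ['predicted_passengers']
# Plot the training data, test data, and forecast together
plt.figure(figsize=(8,5))
plt.plot(train,label="Training")
plt.plot(test,label="Test")
plt.plot(prediction,label="Predicted")
plt.legend(loc = 'upper left')
plt.savefig('SecondPrection.jpg')
plt.show()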

The Auto ARIMA model gave us an r² value of 0.65, so this model did a much better job of capturing the trend in the data than my first implementation of the ARIMA model.
In this article, I compared the traditional implementation of an ARIMA model with the Auto ARIMA model built using auto_arima(). While the traditional ARIMA implementation requires differencing the data and reading ACF and PACF plots by hand, the Auto ARIMA model using pmdarima’s auto_arima() function is more efficient in determining the optimal p,d,q values.
For more information about pmdarima’s auto_arima() function, please see the following documentation:
pmdarima: ARIMA estimators for Python – pmdarima 1.8.0 documentation
Thank you for reading! All code is available on my GitHub 🙂