There are many definitions of time series data, all of which express the same idea in different ways. A straightforward one is that time series data consists of data points attached to sequential time stamps.
The sources of time series data are periodic measurements or observations. Just to give a few examples:
- Stock prices over time
- Daily, weekly, monthly sales
- Periodic measurements in a process
- Power or gas consumption rates over time
Pandas was created by Wes McKinney to provide an efficient and flexible tool for working with financial data, which makes it a very good choice for working with time series.
In this post, we will cover some of the functions and techniques that are used to analyze, manipulate, and visualize time series data.
We will be using pandas for data analysis and manipulation and matplotlib to create visualizations. Let’s start by importing these libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
There are various ways to obtain stock price data. A simple one is the pandas-datareader module. It creates a dataframe that contains 6 different pieces of information about a stock during a given period.
The following syntax creates two dataframes that contain the stock price data of Google and Apple from January, 2019 to December, 2020.
from pandas_datareader import data
apple = data.DataReader("AAPL", start='2019-1-1', end='2020-12-1',
data_source='yahoo')
google = data.DataReader("GOOG", start='2019-1-1', end='2020-12-1',
data_source='yahoo')
We just need to pass the stock symbol, the start and end dates, and the data source to the DataReader function. Here are the first 5 rows of the returned dataframe.
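To follow along without a network connection, a synthetic stand-in with the same layout can be enough. The sketch below is an assumption-heavy placeholder: the six column names (High, Low, Open, Close, Volume, Adj Close) are what we assume the Yahoo source returns, the name fake_stock is hypothetical, and the values are a random walk, not real prices.

```python
import numpy as np
import pandas as pd

# Business-day index covering the same period as the article.
dates = pd.date_range("2019-01-01", "2020-12-01", freq="B")

rng = np.random.default_rng(0)
# Random-walk closing price; purely synthetic, not real market data.
close = 100 + rng.normal(0, 1, len(dates)).cumsum()

# Column names are an assumption about the layout of the real dataframe.
fake_stock = pd.DataFrame(
    {
        "High": close + 1,
        "Low": close - 1,
        "Open": close,
        "Close": close,
        "Volume": rng.integers(1_000_000, 5_000_000, len(dates)),
        "Adj Close": close,
    },
    index=dates,
)
fake_stock.index.name = "Date"

print(fake_stock.head())
```

Any of the plotting and resampling steps in this post should work on such a frame, because they only rely on a date index and the Close and Volume columns.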
The most fundamental tool used in time series analysis is the line plot. It shows how prices change over time.
Note: When working with time series, it is convenient to keep the dates or times as index. It makes both the analysis and creating plots easier.
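As a small illustration of why a date index is convenient, here is a sketch with a hypothetical dataframe (the names raw and prices, and the values, are made up) whose dates start out as a plain string column.

```python
import pandas as pd

# Hypothetical raw data where the dates arrive as a plain string column.
raw = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03", "2020-01-06", "2020-02-03"],
    "close": [10.0, 10.5, 10.2, 11.0],
})

# Parse the strings into timestamps and move them into the index.
prices = raw.assign(date=pd.to_datetime(raw["date"])).set_index("date")

# With a DatetimeIndex, a partial date string selects a whole period.
january = prices.loc["2020-01"]
print(january)
```

Once the dates are in the index, selecting a month or a year is a one-liner, and plot uses the dates for the x-axis automatically.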
Let’s create a simple line plot of the closing price of Apple stock.
plt.figure(figsize=(12,6))
apple['Close'].plot()
We observe an increasing trend with a few exceptions. The biggest downward movement occurs around April 2020, which is probably due to the global coronavirus pandemic.
We can compare the stock prices of Google and Apple by plotting them on the same figure. The subplots function of matplotlib can be used as follows:
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(12,6))
apple['Close'].plot(ax=ax[0], title="Apple Stock", legend=False)
google['Close'].plot(ax=ax[1], title="Google Stock", legend=False)
The subplots function creates a grid of axes objects on a figure. The number of axes objects and their arrangement are determined by the nrows and ncols parameters.
After creating the figure and axes, we specify the position of each plot in the grid by using the ax parameter.
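To make the grid idea concrete, here is a minimal sketch on two made-up series (a and b are hypothetical). It shows that the object returned next to the figure is an array of axes whose shape follows nrows and ncols. The Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

idx = pd.date_range("2020-01-01", periods=30, freq="D")
a = pd.Series(np.arange(30, dtype=float), index=idx, name="A")
b = pd.Series(np.arange(30, dtype=float)[::-1], index=idx, name="B")

# nrows=2, ncols=1 returns a 1-D array holding two axes objects.
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, figsize=(12, 6))

# The ax parameter decides which cell of the grid each plot lands in.
a.plot(ax=ax[0], title="Series A")
b.plot(ax=ax[1], title="Series B")

# With more than one row and column the array is 2-D: grid[row, col].
fig2, grid = plt.subplots(nrows=2, ncols=2)
print(ax.shape, grid.shape)
```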
From the visualization above, we can conclude that the stock prices of Google and Apple follow similar trends during the given period.
Let’s also compare these two stocks in terms of their daily volumes. All we need to do is change the column name to Volume.
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(12,6))
apple['Volume'].plot(ax=ax[0], title="Apple Stock - Volume", legend=False)
google['Volume'].plot(ax=ax[1], title="Google Stock - Volume", legend=False)
It is hard to compare the volumes based on daily frequencies. In this case, we can resample the time series data.
Resampling basically means representing the data with a different frequency. If we increase the frequency, it is called up-sampling. The opposite is down-sampling which means decreasing the frequency.
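Here is a small sketch of both directions on a made-up daily series (the values are hypothetical). Down-sampling needs an aggregation because several points fall into each new interval, whereas up-sampling needs a fill rule because the new time stamps have no data of their own. Forward fill is just one possible choice here.

```python
import pandas as pd

# Six daily observations (hypothetical values).
daily = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
)

# Down-sampling: daily -> 3-day intervals, each interval summed.
down = daily.resample("3D").sum()

# Up-sampling: 3-day -> daily, gaps filled with the last known value.
up = down.resample("D").ffill()

print(down)
print(up)
```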
In our case, we will down-sample the data to get a better overview for comparison. Pandas provides two functions for changing the frequency: resample and asfreq.
- Resample: Aggregates data based on specified frequency and aggregation function.
- Asfreq: Selects data based on the specified frequency and returns the value found at each time stamp of the new frequency, without any aggregation.
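The difference between the two is easier to see on a tiny example. The sketch below uses a made-up daily series with the values 1 to 9 and converts it to 3-day intervals both ways.

```python
import pandas as pd

# Nine daily observations with the values 1..9 (hypothetical data).
s = pd.Series(
    range(1, 10),
    index=pd.date_range("2020-01-01", periods=9, freq="D"),
)

# resample groups the rows into 3-day bins and aggregates each bin.
aggregated = s.resample("3D").mean()

# asfreq only picks the rows that sit on the new 3-day time stamps.
selected = s.asfreq("3D")

print(aggregated)
print(selected)
```

Both results share the same index (January 1, 4, and 7), but resample summarizes every row in each interval, while asfreq discards the rows in between.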
I will use the resample function to down-sample the volume data to 7-day periods.
After selecting the data to be plotted, we will call the resample function along with the aggregation (mean in our case). The resampled data will be plotted instead of the original data.
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(12,6))
# Wrap each chain in parentheses so that it can span two lines.
(apple['Volume'].resample('7D').mean()
 .plot(ax=ax[0], title="Apple Stock - Volume", legend=False))
(google['Volume'].resample('7D').mean()
 .plot(ax=ax[1], title="Google Stock - Volume", legend=False))
The resample function is quite flexible in terms of defining the frequency. The frequency is passed as a string that can include a multiplier, so '7D' means seven days, whereas 'W' means calendar weeks.
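As a quick sketch of how the frequency string changes the result, the snippet below counts the rows per interval for a made-up daily series that starts on a Wednesday. '7D' counts seven days from the first time stamp, while 'W' uses calendar weeks that end on Sunday by default.

```python
import pandas as pd

# 28 days of ones starting on Wednesday, 2020-01-01 (hypothetical data),
# so the sum of each interval equals the number of days it contains.
ones = pd.Series(
    1,
    index=pd.date_range("2020-01-01", periods=28, freq="D"),
)

# Fixed 7-day intervals that start at the first time stamp.
by_7d = ones.resample("7D").sum()

# Calendar weeks, each labeled by the Sunday that closes it.
by_week = ones.resample("W").sum()

print(by_7d)
print(by_week)
```

The '7D' version gives four full intervals, whereas the 'W' version gives five, with partial weeks at both ends.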
The down-sampled data provides a clearer picture for comparison. The trends in volume for Google and Apple stocks seem to be similar, with some exceptions.
Another useful and commonly used operation on time series data is rolling. The idea is to create a window of a specific size that rolls through the data. While the window is rolling, some kind of calculation is done on the data inside the window.
The figure below explains the concept of rolling.
It is important to note that the calculation starts when the whole window is in the data. In other words, if the size of the window is three, the first aggregation is done at the third row.
Rolling is similar to down-sampling in some sense. In both techniques, we divide the data into smaller chunks and do some kind of aggregation. The difference is that each data point is used once in down-sampling. When we do rolling, each data point is used as many times as the size of the window (except for the points near either end of the series).
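Here is a minimal sketch of that comparison on a made-up daily series with the values 1 to 6, using a window of three rows on one side and 3-day intervals on the other.

```python
import pandas as pd

# Six daily observations with the values 1..6 (hypothetical data).
s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
)

# Rolling: overlapping windows of 3 rows, one result per row.
# The first two rows are NaN because the window is not full yet.
rolled = s.rolling(3).mean()

# Down-sampling: non-overlapping 3-day intervals, one result per interval.
resampled = s.resample("3D").mean()

print(rolled)
print(resampled)
```

Rolling keeps the length of the original series, which is why it retains more detail, whereas down-sampling reduces six rows to two.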
Let’s create the volume plots using a rolling window of size 7. We can then compare it to the previous one created by down-sampling to 7-day periods.
fig, ax = plt.subplots(nrows=2, sharex=True, figsize=(12,6))
# Wrap each chain in parentheses so that it can span two lines.
(apple['Volume'].rolling(7).mean()
 .plot(ax=ax[0], title="Apple Stock - Volume", legend=False))
(google['Volume'].rolling(7).mean()
 .plot(ax=ax[1], title="Google Stock - Volume", legend=False))
The only difference in syntax is that we have used the rolling function with 7, instead of using the resample function with ‘7D’.
The plots are quite similar as expected. We can easily observe the same trend. However, the plot created by rolling is not as smooth as the one created by down-sampling. In that sense, rolling carries more detail than down-sampling.
Conclusion
Predictive analytics is highly valuable in the Data Science field and time series data is at the core of many problems that predictive analytics aims to solve.
There are many tools for working with time series data. Pandas is one of the most efficient and common ones. Hence, if you plan to do time series analysis, I suggest getting familiar with Pandas.
What we have covered in this article can be considered the basics of time series analysis. Once you are comfortable with the basics, you can easily build up your knowledge.
Thank you for reading. Please let me know if you have any feedback.