The world’s leading publication for data science, AI, and ML professionals.

Mastering Pandas

Data Manipulation with Pandas

Photo by Hitesh Choudhary on Pexels
Photo by Hitesh Choudhary on Pexels

Pandas is a popular python library used for Data manipulation and analysis. It enables a variety of reading functions for a wide range of data formats. It also provides a variety of tools for data manipulation such as merging, joining and concatenation. In this post, we will focus on using pandas’ computational tools for statistical analysis, windowing, and generating rolling statistics. The documentation for pandas can be found [here](https://pandas.pydata.org/docs/user_guide/computation.html) and the pandas computational tools documentation can be found here.

Let’s get started!

We will be using ‘TSLA’ and ‘AAPL’ historical stock prices (past 5 years of data) to demonstrate these methods in pandas. The data for ‘TSLA’ can be found [here](https://finance.yahoo.com/quote/AAPL/history?p=AAPL) and the data for ‘AAPL’ can be found here.

First, let’s import the data using pandas:

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df_TSLA = pd.read_csv("TSLA.csv")

Let’s print the first five rows:

print(df_TSLA.head())

The first thing we can do is use the ‘pct_change’ method which is available for both pandas series and data frames. Let’s try this out on the ‘Open’ column:

print(df_TSLA['Open'].pct_change().head())

We can also do the same for ‘Open’, ‘High’, ‘Low’, and ‘Close’:

print(df_TSLA[["Open", "High", "Low", "Close"]].pct_change().head())

You can also specify a period in the ‘pct_change’ method. For example, if we want to calculate the percent change in each column every 3 records:

print(df_TSLA[["Open", "High", "Low", "Close"]].pct_change(periods = 3).head(6))

You can also calculate the covariance between columns. Let’s calculate the covariance between Open ‘TSLA’ prices and Open ‘AAPL’ prices. First, let’s read in the ‘AAPL’ data:

df_AAPL = pd.read_csv("AAPL.csv")

Let’s specify two series, one for Open ‘TSLA" prices and one for Open "AAPL" prices:

Series_TSLA = df_TSLA['Open']
Series_AAPL = df_AAPL['Open']

Now let’s print the covariance:

print("Covariance Between TSLA and AAPL:", Series_TSLA.cov(Series_AAPL))

You can also calculate pairwise covariances among the series in the data frame. For ‘TSLA’ we get:

print(df_TSLA.cov().head())

And for ‘AAPL’:

print(df_AAPL.cov().head())

We can do the same with correlation:

print("Correlation Between TSLA and AAPL:", Series_TSLA.corr(Series_AAPL))

And for pairwise correlation for series within the ‘TSLA’ data frame:

print(df_TSLA.corr().head())

And for ‘AAPL’:

print(df_AAPL.corr().head())

There are several methods in pandas provided for computing windowing and rolling statistics. This includes sum, mean, median, correlation and more. We will demonstrate a few of these below.

Let’s calculate the cumulative sum of the Open ‘AAPL’ price:

print(df_AAPL['Open'].cumsum().head())

We can also calculate the rolling mean with a window size of our choice. Let’s calculate the rolling mean Open price for ‘AAPL’ with a window size of 10:

print(df_AAPL['Open'].rolling(window = 10).mean().head(20))

We can also visualize these results. First, let’s convert the dates into date-time objects, set the date-time to be the data frame index, and select the ‘Open’ column:

df_AAPL = df_AAPL[['Date', 'Open']]
df_AAPL['Date'] = pd.to_datetime(df_AAPL['Date'])
df_AAPL.set_index('Date', inplace = True)
df_AAPL = df_AAPL['Open']

Now let’s plot using matplotlib and seaborn:

import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
plt.xlabel('Date')
plt.ylabel('Price')
df_AAPL.plot(style = 'k--')
df_AAPL.rolling(window = 10).mean().plot(style = 'k')

I will stop here but feel free to play around with some of the methods I went over in this post. Try finding additional data sets where these methods may give useful insights into trends in data. I also encourage you to explore some of the additional computation tools available in pandas such as custom function applying, time aware windowing, aggregation and much more.

To recap, in this post I discussed some computational tools available in the Python pandas library. I went over how to generate summary statistics like covariance and correlation between pandas series and generating rolling statistics like cumulative sum and rolling mean. I hope you found this post useful. The code from this post is available on GitHub. Thank you for reading!


Related Articles