ESSENTIAL PANDAS

Following the acquisition of raw data, data wrangling is the most essential step in transforming raw data into a more functional form for data analysis, model building, and data visualization. It involves preprocessing, restructuring, and cleaning operations, and the end product is a dataset in a readily accessible format, ready to be analyzed and visualized.
The aim of this article is to present some key operations for performing these tasks in pandas, a Python-based tool for data manipulation. The idea here is not to be exhaustive in describing all available pandas operations, but to give beginner and intermediate level users a handy guide to the essential features.
The contents of this article are divided into the following sections.
A. INTRODUCTION TO DATAFRAMES
- Creating a DataFrame
- Retrieving data from a DataFrame
- Summarizing data
B. DATA WRANGLING
- Combining DataFrames
- Reshaping DataFrames
- Dealing with null values
Note: All images except for those at the very top are by the author, including the downloadable cheatsheet at the very end and the code/code results.
INTRODUCTION TO DATAFRAMES
A DataFrame is the most commonly used structured API in pandas for storing data in rows (observations) and columns (variables), just like in a table. The row labels are referred to as the 'index' while the column labels are referred to as 'columns'. We begin by discussing the two most common ways of defining a dataframe. While the discussion in this article is limited to two-dimensional datasets, the methods discussed here are fully capable of handling higher-dimensional data.
Creating a DataFrame
Let us begin by defining a pandas dataframe as shown below.

This dataframe consists of three columns (containing three variables labeled col1, col2, and col3) and three rows (containing three observations indexed 1, 2, and 3). The easiest way to define a dataframe is to provide a list of observations that are then arranged as rows in the dataframe. Alternatively, we can provide the data as a dictionary, with each entry acting as a column in the dataframe. Below is the code for both of these methods.
# Defining a DataFrame by specifying a list of observations
import pandas as pd
from pandas import DataFrame

df_1 = DataFrame([['a', 'b', 'c'],
                  ['d', 'e', 'f'],
                  ['g', 'h', 'i']],
                 index=[1, 2, 3], columns=['col1', 'col2', 'col3'])

# Defining a DataFrame by specifying a dictionary of columns
df_2 = DataFrame({'col1': ['a', 'd', 'g'],
                  'col2': ['b', 'e', 'h'],
                  'col3': ['c', 'f', 'i']},
                 index=[1, 2, 3])
Both df_1 and df_2 defined above are identical.

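We can confirm this with the equals method:
# Returns True: both construction methods yield the same dataframe
df_1.equals(df_2)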
Retrieving Data from a DataFrame
The index labels (1, 2, 3), column labels (col1, col2, col3), and data values (a through i) stored in the dataframe df_1 can be retrieved using df_1.index, df_1.columns, and df_1.values respectively.
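As a quick sketch, using the df_1 defined above:
# Accessing the building blocks of df_1
print(df_1.index)    # the row labels: 1, 2, 3
print(df_1.columns)  # the column labels: col1, col2, col3
print(df_1.values)   # the underlying 3x3 array of values 'a' through 'i'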
For much of this article we will be using the Iris dataset instead of the 3x3 dataframe we defined above. This publicly available dataset contains data on three different varieties of the iris flower (Setosa, Versicolour, and Virginica) in the form of four different parameters/variables (sepal length, sepal width, petal length, and petal width). We can import and save it as a pandas dataframe using iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

loc is a label-based function for accessing data at particular index and/or column labels; it returns all the entries stored for that index/column. iris.loc[[0,1]] will return the rows indexed 0 and 1:

If instead we only want some specific columns of the rows labeled 0 and 1, we can use iris.loc[[0,1], ['sepal_width', 'petal_length']] to specify column labels along with row labels.

iloc is the second option for slicing dataframes. It does essentially the same thing as loc but uses integer-based positions to access data instead of index and column labels. If we use iris.iloc[1:4, 0:3], the slice 1:4 will return the rows at positions 1, 2, and 3 (positions are 0-based and the right endpoint is excluded) and 0:3 will return the columns at positions 0, 1, and 2.

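Here is a short side-by-side sketch of the two approaches, assuming the iris dataframe loaded above:
# label-based: rows labeled 0 and 1, two columns picked by name
iris.loc[[0, 1], ['sepal_width', 'petal_length']]
# position-based: rows at positions 1 to 3, columns at positions 0 to 2
iris.iloc[1:4, 0:3]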
While loc and iloc will suffice for slicing dataframes most of the time, a complete list of indexing and slicing methods can be found in the pandas documentation.
Summarizing Data
This section lists some of the operations commonly used to inspect a dataframe and gain a better understanding of its contents.
iris.head(n) and iris.tail(n) return the top and bottom n rows of the dataframe respectively. iris.describe() returns a summary of all the numerical variables (columns) in the dataset.

iris['species'].value_counts() will return the number of rows that contain each of the unique values of the variable species.

iris['sepal_length'].nunique() will return the number of unique values in this column, which happens to be 35. Numerical operations that can be called on columns or rows include min() (minimum value), max() (maximum value), var() (variance), std() (standard deviation), sum() (addition), etc.
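As a quick sketch of these summaries in action (the counts in the comments come from the iris dataset itself):
# Common one-line summaries on the iris dataframe
iris.head(5)                     # first five rows
iris['species'].value_counts()   # 50 rows for each of the three species
iris['sepal_length'].nunique()   # 35 unique values
iris['sepal_length'].min()       # smallest sepal length in the dataset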
DATA WRANGLING
Combining DataFrames
In this section we discuss two very important methods for combining dataframes:
- merge, to combine observations from different dataframes using a common variable as reference
- groupby, to group data within a single dataframe based on the entries of a reference variable
Let us begin by defining two dataframes extracted from the iris dataframe discussed earlier using the iloc operation for data slicing. First we use df_1 = iris.iloc[[1,51],3:] to slice the last two columns of rows 1 and 51.

We then define df_2 using iris.iloc[[1,51,101],[0,2,4]], which gives us the first, third, and fifth columns of the iris dataframe for the rows indexed 1, 51, and 101.

The merge operation can combine these two dataframes using species as a reference variable: pd.merge(df_1, df_2, how = 'outer', on = 'species'). Since not all the species mentioned in df_2 are found in df_1, we need to specify which species should be included in the merged dataframe using the how argument. how = 'outer' means all species from either of the two datasets must be included in the merged dataset. Since there is no entry for the species virginica in df_1, its value for the variable petal_width is NaN in the merged dataframe.

If however we set how = 'inner', only those species present in both df_1 and df_2 will be included in the merged dataframe. pd.merge(df_1, df_2, how = 'inner', on = 'species') will give:

The Venn diagrams below visualize the use of the how argument. Each of the two circles represents a dataframe and the shaded area shows the observations that will make it into the merged dataframe.

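Putting the whole example together as a single runnable sketch, using the iris dataframe and the slices defined above:
# Two overlapping slices of iris: rows 1 and 51 vs rows 1, 51 and 101
df_1 = iris.iloc[[1, 51], 3:]
df_2 = iris.iloc[[1, 51, 101], [0, 2, 4]]
# 'outer' keeps every species found in either dataframe (missing data becomes NaN)
pd.merge(df_1, df_2, how='outer', on='species')
# 'inner' keeps only the species present in both dataframes
pd.merge(df_1, df_2, how='inner', on='species')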
Let us now retrieve another dataframe from the original iris dataframe, this time made up of 8 rows: df_3 = iris.iloc[[1,2,3,4,51,52,53,54],[0,2,4]]

The groupby operation can be used to split and combine data for the two species (setosa & versicolor) in df_3 using a = df_3.groupby('species'). We can then access the data for each of the two species separately using get_group:
a.get_group('versicolor')

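In practice, groupby is usually followed by an aggregation rather than get_group. A minimal sketch, using the per-species mean of sepal_length as an illustrative (assumed) choice:
# Split df_3 by species, then aggregate within each group
a = df_3.groupby('species')
a['sepal_length'].mean()   # average sepal length per species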
Reshaping DataFrames
In this section we discuss key operations for reshaping data. Let's begin by defining a dataframe: df_4 = iris.iloc[1:5,0:3]

Adding a new column to df_4 requires specifying the variable name and the associated data: df_4['Species'] = ['versicolor', 'versicolor', 'setosa', 'setosa']. Note that the length of the list must match the number of rows in the existing dataframe.

Now, to add a row to this dataframe, we make use of the loc operation discussed earlier. df_4.loc[5] = [4.9, 1.4, 1.3, 'setosa'] will return:

For deleting rows and columns we make use of the drop operation. df_4.drop('Species', axis=1) will delete the Species column, and df_4.drop(2, axis=0) will delete the row indexed 2. Note that we are specifying the index and column labels here rather than the corresponding 0-based positions. The axis argument is used to distinguish between rows and columns: 0, the default value, is for rows; 1 is for columns.
We can also drop rows and columns simultaneously by specifying index and column labels, as in df_4.drop(index = 1, columns = ['Species']).
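One caveat worth noting: drop returns a new dataframe and leaves the original untouched unless you assign the result back (or pass inplace=True). A quick sketch, with df_4_trimmed as an illustrative name:
# df_4 itself is unchanged; the trimmed copy is a new object
df_4_trimmed = df_4.drop(index=1, columns=['Species'])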
To make a dataframe comprising two or more copies of an original dataframe, we can use the concat function. pd.concat([df_4, df_4]) will join two copies of df_4 as shown below:

Notice how the index labels of the original dataframe have been retained as the two copies of df_4 are joined. Passing the additional argument ignore_index=True will result in index labels ranging from 0 to 9 instead.
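As a one-line sketch of that variant:
# Fresh 0-9 index instead of the repeated original labels
pd.concat([df_4, df_4], ignore_index=True)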
For the next operation we will use a dataframe that I have stored as an Excel file and imported using df = pd.read_excel("weather.xlsx")

Notice the structure of this dataframe. There is a single variable here, temperature. Day and city together identify each observation, as the temperature is recorded for five days (Monday to Friday) in two different cities (Boston and NYC). We can restructure this dataframe using the pivot operation, df.pivot(index = 'city', columns = 'Day'), to give:

city is the only index in this restructured dataframe and the Day variable has been transformed into columns. The dataframe in this format is much easier to read: repetition of the Day variable has been avoided, and the temperature values for the two cities are easy to compare as they sit next to each other.
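Since weather.xlsx is not bundled with this article, here is a hedged sketch that builds an equivalent long-format dataframe in code (the column names follow the text; the temperature values are made up for illustration) and applies the same pivot:
# Long format: one row per (Day, city) pair, as in weather.xlsx
df = pd.DataFrame({
    'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'] * 2,
    'city': ['Boston'] * 5 + ['NYC'] * 5,
    'temperature': [30, 32, 31, 29, 28, 35, 36, 34, 33, 37],  # made-up values
})
# One row per city, one column per Day
df.pivot(index='city', columns='Day')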
To introduce the next reshaping operation, let's import another Excel file as a dataframe: df_1 = pd.read_excel('city_weather.xlsx')

This dataframe has the exact same data as the df dataframe we discussed earlier, though the format is quite different. Using the melt operation, we can transform this dataframe to look like the one we imported from an Excel file earlier.
pd.melt(df_1, id_vars = ['day'], var_name = 'city', value_name = 'temperature')

melt therefore does the opposite of what pivot did in terms of dataframe restructuring.
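Likewise, city_weather.xlsx is not included, so here is a sketch of a wide-format dataframe matching the melt call above (the column names are inferred from id_vars and var_name; the temperatures are made up):
# Wide format: one temperature column per city, as in city_weather.xlsx
df_1 = pd.DataFrame({
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
    'Boston': [30, 32, 31, 29, 28],   # made-up values
    'NYC': [35, 36, 34, 33, 37],
})
# Back to long format: one (day, city, temperature) row per observation
pd.melt(df_1, id_vars=['day'], var_name='city', value_name='temperature')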
For the final data restructuring operation that we will discuss, let us define a custom dataframe as follows:
import numpy as np

df_2 = DataFrame(np.arange(16).reshape(4, 4),
                 index = [['A', 'A', 'B', 'B'], ['a', 'b', 'c', 'd']],
                 columns = [[1, 1, 2, 2], [11, 11, 22, 22]])
df_2.index.names = ['Index1', 'Index2']
df_2.columns.names = ['City', 'Town']
Here is what the above code does: it defines a dataframe composed of integers ranging from 0 to 15, arranged in a 4x4 format. Two sets of index labels (['A', 'A', 'B', 'B'] & ['a', 'b', 'c', 'd']) and two sets of column labels ([1, 1, 2, 2] & [11, 11, 22, 22]) are specified, along with a name for each of these levels. The resulting dataframe looks as shown below:

Hierarchical index and column labels like these are not uncommon in complex datasets, especially when describing relationships between many different variables. We can now use the stack operation to restructure this dataframe. df_2.stack(level = 0) will restructure the dataframe at the City level to give:

Similarly, df_2.stack(level = 1) will transform the Town variable as shown below:

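stack also has a counterpart, unstack, which moves a row index level back into the columns. A quick sketch using df_2 (note that the restored column levels may come back in a different order):
# stack(level=0) moves the 'City' column level into the row index
stacked = df_2.stack(level=0)
# unstack() moves the innermost row level ('City') back to the columns
stacked.unstack()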
Dealing with null values
Often raw datasets contain some null values (NaN: not a number) owing to a lack of available data. These null values need to be accounted for and replaced before proceeding with data modeling and visualization. Let us begin by defining the df_na_1 dataframe, which contains 5 null values.
npn = np.nan
df_na_1 = DataFrame([[1, 2, 3, 4],
                     [4, 5, 6, npn],
                     [6, npn, 7, 8],
                     [9, npn, npn, npn]])

Calling the isnull operation on this dataframe (df_na_1.isnull()) will return True for all the null values.

We can then use the dropna operation, df_na_1.dropna(), to get rid of all rows/observations containing even a single null value, leaving just the first row.

If we instead want to drop the columns containing a null value, we set axis = 1 (df_na_1.dropna(axis=1)), and now we are left with only the first column.

Instead of dropping the null values, we can replace them with values of our choice. Using df_na_1.fillna({1: 100, 2: 200, 3: 300}) we can replace null values with different values for different columns: all null values in the column labeled 1 are replaced by 100, those in the column labeled 2 are replaced by 200, and so forth.

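A dictionary is not required; passing a scalar replaces every null value with the same value. A minimal sketch:
# Replace every NaN in the dataframe with 0
df_na_1.fillna(0)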
Let us define another dataframe, df_na_2:
df_na_2 = DataFrame([[100, 100, 100, 100],
                     [200, 200, 200, 200],
                     [300, 300, 300, 300],
                     [400, 400, 400, 400]])

Using df_na_1.combine_first(df_na_2) we can combine the two dataframes such that wherever a value in df_na_1 is null, it is replaced by the corresponding value from df_na_2.

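A quick sketch of the call, with the result worked out from the two dataframes above:
# Nulls in df_na_1 are filled from the same positions in df_na_2:
# row 1 gains a 200, row 2 gains a 300, and row 3 gains three 400s
df_na_1.combine_first(df_na_2)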
Downloading Cheatsheet

The cheatsheet shown above contains most of the key concepts discussed in this article and can be downloaded here. For brevity, detailed explanations are not possible in the cheatsheet, so please refer to the text of this article for those details. Thanks for reading.