
Pandas is a Python library for analyzing, transforming, and generating statistics from data. In this post, we will discuss several useful Pandas methods for data wrangling and exploration. For our purposes, we will be using the Medical Cost Personal Datasets from Kaggle.
Let’s get started!
Reading Data
To begin, let’s import the pandas library:
import pandas as pd
Next, let’s read the data into a Pandas data frame. A Pandas data frame is essentially a tabular data structure with labeled rows and columns. To read in the data we can use the ‘.read_csv()’ method:
df = pd.read_csv("insurance.csv")
Next, we can use the ‘.head()’ and ‘.tail()’ methods to look at the first and last five rows of data respectively:
print(df.head())

print(df.tail())

We can also look at the column names:
print(df.columns)

This is particularly useful for data sets with a significant number of columns.
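If we want more than just the names, the ‘.info()’ method summarizes each column’s dtype and non-null count in one call. Here is a minimal sketch using a small hand-made frame with the same columns as the insurance data (the values are made up for illustration):

```python
import pandas as pd

# A tiny stand-in for the insurance data (values are made up)
df = pd.DataFrame({
    "age": [19, 33, 45],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 22.7, 30.1],
    "children": [0, 1, 2],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [16884.92, 1725.55, 8240.59],
})

print(df.columns.tolist())  # column names as a plain Python list
df.info()                   # dtypes, non-null counts, and memory usage
```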
Cleaning Data
Oftentimes, when working with real data, we have to deal with missing values in the data columns. Using Pandas, we can quickly get an idea of how sparse the data is with the ‘.isnull()’ method:
print(df.isnull().sum())

We see that this data doesn’t contain any missing values. Let’s artificially add missing values for the ‘children’ and ‘region’ columns, to demonstrate how we go about removing these values:
import numpy as np

df.loc[df.region == 'southwest', 'region'] = np.nan
df.loc[df.children == 1, 'children'] = np.nan
Now let’s count the number of missing values we’ve added:
print(df.isnull().sum())

We can use the ‘.dropna()’ method to remove these missing values. This can either be done in place, or we can assign the returned data frame to a new variable. To drop missing values in place we do the following:
df.dropna(inplace=True)
print(df.isnull().sum())
Here, the df variable has been modified in place. Alternatively we can do the following:
df = df.dropna()
Let’s print the missing values once more:
print(df.isnull().sum())

We can check the length of the data frame before and after dropping the missing values:
print("Length Before:", len(df))
df.dropna(inplace=True)
print("Length After:", len(df))
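By default, ‘.dropna()’ removes a row if any column is missing. If we only care about missing values in particular columns, the ‘subset’ parameter restricts the check. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up sample with missing values in two columns
df = pd.DataFrame({
    "children": [0, np.nan, 2, np.nan],
    "region": ["southwest", "northeast", np.nan, "southeast"],
    "charges": [16884.92, 1725.55, 8240.59, 4500.00],
})

# Drop rows only when 'region' is missing; rows with missing
# 'children' values survive
df_clean = df.dropna(subset=["region"])
print(len(df_clean))  # 3 rows remain
```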

Additionally, if you would like to fill missing values in a data frame you can use the ‘.fillna()’ method. To do so in place:
df.fillna(0, inplace=True)
and alternatively:
df = df.fillna(0)
Since we are imputing missing values, the length of the data frame should not change:
print("Length Before:", len(df))
df.fillna(0, inplace=True)
print("Length After:", len(df))
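Filling every missing value with 0 is a blunt instrument; ‘.fillna()’ also accepts a dictionary mapping column names to fill values, so each column can get its own replacement (for example, a column mean). A sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Made-up sample with missing values in two columns
df = pd.DataFrame({
    "children": [0, np.nan, 2, np.nan],
    "bmi": [27.9, np.nan, 30.1, 25.0],
})

# Fill each column with its own value:
# 0 for 'children', the column mean for 'bmi'
df_filled = df.fillna({"children": 0, "bmi": df["bmi"].mean()})
print(df_filled.isnull().sum().sum())  # 0 missing values remain
```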

Filtering Data
We can easily filter data frames based on column values. For example, if we want records corresponding to patients younger than 30 years old we can write:
df_young = df[df['age'] < 30]
print(df_young.head())

If we want records corresponding to charges greater than $10,000 we can write:
df_high_charges = df[df['charges'] > 10000]
print(df_high_charges.head())

We can also filter the data frame to only include smokers:
df_smokers = df[df['smoker'] == 'yes']
Let’s print the first five rows:
print(df_smokers.head())

Notice that the original row labels have been kept, leaving gaps in the index where rows were removed. We can fix this with the ‘.reset_index()’ method:
df_smokers = df_smokers.reset_index(drop=True)
print(df_smokers.head())

We can also filter with multiple conditions. Suppose we want to pull data corresponding to female smokers. We can use the ‘.loc[]’ method in the following way:
df_female_smokers = df.loc[(df.sex == 'female') & (df.smoker == 'yes')]
df_female_smokers = df_female_smokers.reset_index(drop=True)
print(df_female_smokers.head())

We can even combine more than two conditions. Let’s filter for female smokers aged 50 or older:
df_older_female_smokers = df.loc[(df.sex == 'female') & (df.smoker == 'yes') & (df.age >= 50)]
df_older_female_smokers = df_older_female_smokers.reset_index(drop=True)
print(df_older_female_smokers.head())

Selecting Rows and Columns
Now we will discuss how to use the ‘.iloc[]’ method to select rows by integer position. To select the first, second, and last rows in the data set we do the following:
print(df.head())
print("---------------------First---------------------")
print(df.iloc[0])
print("---------------------Second---------------------")
print(df.iloc[1])
print("---------------------Last---------------------")
print(df.iloc[-1])

You can do something similar with ‘.loc[]’, which selects by row label and column name. To select the ‘sex’ value of the first and second rows we do the following:
print("---------------------First---------------------")
print(df.loc[0, 'sex'])
print("---------------------Second---------------------")
print(df.loc[1, 'sex'])

We can also select multiple rows within a column:
print("---------------------First---------------------")
print(df.loc[0:3, 'sex'])
print("---------------------Second---------------------")
print(df.loc[3:6, 'sex'])
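One subtlety worth noting: unlike ordinary Python slicing, ‘.loc[]’ slices by label and is inclusive of both endpoints, and it also accepts a list of column names. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "age": [19, 33, 45, 27, 52],
})

# .loc label slices are inclusive on both ends:
# 0:3 selects four rows (labels 0, 1, 2, and 3)
subset = df.loc[0:3, ["sex", "age"]]
print(len(subset))  # 4
```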

Aggregating Data & Generating Statistics
Now we will discuss how to generate statistics from the data in our data frame. We can create separate data frames for specific categories and generate statistics from the resulting data frames. Let’s create separate data frames for male and female records:
df_female = df[df['sex'] == 'female']
df_male = df[df['sex'] == 'male']
Let’s get the average charge for females:
print("Female charges: ", df_female['charges'].mean())

Let’s also get the average charge for males:
print("Male charges: ", df_male['charges'].mean())

We can also find the maximum value for any numerical column using the ‘.max()’ method. Let’s do this for the full data set:
print("Maximum Value: ", df['charges'].max())

We can also find the minimum value:
print("Minimum Value: ", df['charges'].min())

We can apply these methods to other numerical columns. Let’s do so for the ‘age’ column:
print("Maximum Value: ", df['age'].max())
print("Minimum Value: ", df['age'].min())

Let’s do the same for the ‘bmi’ column:
print("Maximum Value: ", df['bmi'].max())
print("Minimum Value: ", df['bmi'].min())
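Rather than calling ‘.max()’ and ‘.min()’ column by column, the ‘.describe()’ method summarizes every numeric column at once. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [19, 33, 45, 27],
    "bmi": [27.9, 22.7, 30.1, 25.0],
    "charges": [16884.92, 1725.55, 8240.59, 4500.00],
})

# One call summarizes count, mean, std, min, quartiles, and max
# for every numeric column
summary = df.describe()
print(summary)
print("Max age:", summary.loc["max", "age"])  # 45.0
```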

Another useful method is ‘.groupby()’, which can be used for aggregating data. Let’s say we want to know the number of male and female smokers:
df_yes = df[df['smoker'] == 'yes']
df_yes = df_yes.groupby(['sex'])['smoker'].count()
print(df_yes.head())

We see that the number of male smokers is greater than the number of female smokers. We can also look at the grouped statistics for non-smokers:
df_no = df[df['smoker'] == 'no']
df_no = df_no.groupby(['sex'])['smoker'].count()
print(df_no.head())
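The two grouped counts above can also be produced in a single table with ‘pd.crosstab()’, which tallies every sex/smoker combination at once. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "smoker": ["yes", "no", "yes", "no", "yes"],
})

# Counts for every sex/smoker combination in one table
counts = pd.crosstab(df["sex"], df["smoker"])
print(counts)
```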

We can also use the ‘.groupby()’ method to pull the average medical cost across category types. Earlier we looked at the average medical cost for males and females. We can generate those statistics again with ‘.groupby()’:
df_sex_mean = df.groupby(['sex'])['charges'].mean()
print(df_sex_mean.head())

We can also generate these statistics for each region group:
df_region_mean = df.groupby(['region'])['charges'].mean()
print(df_region_mean.head())

It would also be interesting to look at the average medical cost for each smoker group:
df_smoker_mean = df.groupby(['smoker'])['charges'].mean()
print(df_smoker_mean.head())

As we’d expect, smokers have significantly higher medical costs than non-smokers. We can also group by smoker and sex:
df_smoker_sex_mean = df.groupby(['smoker', 'sex'])['charges'].mean()
print(df_smoker_sex_mean.head())

We can also look at other statistics, like the standard deviation in charges across categories. The standard deviation measures the amount of dispersion in a set of values; here, it captures how spread out the charges are within each group. Let’s look at the standard deviation in charges across sexes:
df_sex_std = df.groupby(['sex'])['charges'].std()
print(df_sex_std.head())

We can also look at the standard deviation in charges across regions:
df_region_std = df.groupby(['region'])['charges'].std()
print(df_region_std.head())

We can also apply ‘.groupby()’ across multiple columns. Let’s calculate the standard deviation in charges across region/sex groups:
df_region_sex_std = df.groupby(['region', 'sex'])['charges'].std()
print(df_region_sex_std.head())

Next, let’s calculate the standard deviation in charges across smoker/sex groups:
df_smoker_sex_std = df.groupby(['smoker', 'sex'])['charges'].std()
print(df_smoker_sex_std.head())
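When we want several statistics per group, ‘.agg()’ computes them in one pass instead of separate ‘.mean()’ and ‘.std()’ calls. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "charges": [16884.92, 1725.55, 8240.59, 4500.00],
})

# Compute the mean and standard deviation per group in one pass
stats = df.groupby("sex")["charges"].agg(["mean", "std"])
print(stats)
```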

Iterating over Data Frames
Next, we will discuss how to iterate over data frame rows. We can use the ‘.iterrows()’ method, which yields each index and row as a pair:
for index, row in df.iterrows():
    print(index, row)

We can also print specific columns as we iterate. Let’s do so for the ‘sex’, ‘charges’ and ‘smoker’ columns:
for index, row in df.iterrows():
    print('sex:', row['sex'], 'charges:', row['charges'], 'smoker:', row['smoker'])

We can also create new columns conditioned on the values of other columns. Suppose we want to create a new column that specifies whether or not a record corresponds to a female smoker. We can use the ‘.iterrows()’ and ‘.at[]’ methods to label female smokers with boolean values:
for index, row in df.iterrows():
    if (row.sex == 'female') and (row.smoker == 'yes'):
        df.at[index, 'female_smoker'] = True
    else:
        df.at[index, 'female_smoker'] = False
Let’s print the first five rows of our modified data frame:
print(df.head())

We can perform even more complicated labeling with ‘.iterrows()’ and ‘.at[]’. Suppose we want to create a column of boolean values corresponding to male smokers over the age of 50 with children:
for index, row in df.iterrows():
    if (row.sex == 'male') and (row.smoker == 'yes') and (row.age > 50) and (row.children > 0):
        df.at[index, 'male_smoker_with_children'] = True
    else:
        df.at[index, 'male_smoker_with_children'] = False
Let’s print the first five rows of the resulting data frame:
print(df.head())
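For columns like these, an explicit loop isn’t strictly necessary: the same boolean flags can be built with vectorized comparisons, which is the more idiomatic (and much faster) approach in Pandas. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "smoker": ["yes", "yes", "no"],
    "age": [52, 55, 30],
    "children": [1, 2, 0],
})

# Vectorized boolean logic replaces the explicit .iterrows() loop
df["female_smoker"] = (df["sex"] == "female") & (df["smoker"] == "yes")
df["male_smoker_with_children"] = (
    (df["sex"] == "male") & (df["smoker"] == "yes")
    & (df["age"] > 50) & (df["children"] > 0)
)
print(df["female_smoker"].sum())              # 1
print(df["male_smoker_with_children"].sum())  # 1
```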

Another thing we can do is use the ‘Counter’ class from the collections module to get an idea of the distribution of boolean values in our new columns. Let’s apply ‘Counter’ to the ‘female_smoker’ column:
from collections import Counter
print(Counter(df['female_smoker']))

This corresponds to 115 records of female smokers. Let’s apply ‘Counter’ to the male smokers over 50 with children column:
print(Counter(df['male_smoker_with_children']))

This corresponds to 21 records of male smokers over 50 with children.
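Pandas also has a built-in equivalent of ‘Counter’: the ‘.value_counts()’ method tallies the values of a column directly. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"female_smoker": [True, False, False, True, False]})

# .value_counts() gives the same tallies as collections.Counter,
# returned as a pandas Series
counts = df["female_smoker"].value_counts()
print(counts)
```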
Writing to Files
Finally, if we have altered our data frame enough that we want to save it to a separate file, we can use the ‘.to_csv()’ method:
df.to_csv("insurance_edit.csv")
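One detail to be aware of: by default ‘.to_csv()’ writes the row index as an extra column, which shows up as ‘Unnamed: 0’ when the file is read back. Passing index=False avoids this. A sketch that writes to a temporary directory (the file name is just for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"age": [19, 33], "charges": [16884.92, 1725.55]})

# index=False keeps the row index out of the file, so reading it back
# does not produce a spurious 'Unnamed: 0' column
path = os.path.join(tempfile.gettempdir(), "insurance_edit.csv")
df.to_csv(path, index=False)

df_back = pd.read_csv(path)
print(df_back.columns.tolist())  # ['age', 'charges']
```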
Conclusions
To summarize, in this post we discussed several methods in Pandas. We discussed how to read, clean, and filter data using Pandas methods. We also discussed how to generate aggregate statistics, iterate over data frames and write data to a new file. I hope this was helpful. The code in this post is available on GitHub. Thank you for reading!