
Practical EDA Guide with Pandas

An analysis of student performances on different tests.

Photo by Jeswin Thomas on Unsplash

Pandas, a widely used data analysis library, provides numerous functions and methods for working with tabular data. Its rich selection of easy-to-use functions makes the exploratory data analysis (EDA) process fairly easy.

In this post, we will explore the student performance dataset available on Kaggle. The dataset contains some personal information about students and their performance on certain tests.

Let’s start by reading the dataset into a Pandas dataframe.

import numpy as np
import pandas as pd
df = pd.read_csv("/content/StudentsPerformance.csv")
df.shape
(1000,8)
df.head()
(image by author)

There are 5 categorical features and scores of 3 different tests. The goal is to check how these features affect the test scores.

We can start by checking the distribution of test scores. The plot function of pandas can be used to create a kernel density plot (KDE).

df['reading score'].plot(kind='kde', figsize=(10,6), title='Distribution of Reading Score')
(image by author)

The reading scores are approximately normally distributed, and the other two tests show similar distributions.
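To compare the three tests directly, their KDEs can be overlaid on a single axis. A minimal sketch, assuming df is the dataframe loaded above (the helper name is my own):

```python
import matplotlib.pyplot as plt

def plot_score_distributions(df):
    """Overlay the KDEs of all three tests on one axis for comparison."""
    score_cols = ['math score', 'reading score', 'writing score']
    # Plotting a multi-column selection draws one KDE line per column
    ax = df[score_cols].plot(kind='kde', figsize=(10, 6),
                             title='Distribution of Test Scores')
    ax.set_xlabel('score')
    plt.show()
    return ax
```

With one line per test on the same axis, any shift or spread difference between the scores is immediately visible.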

The "race/ethnicity" column contains 5 groups. I want to see the mean test scores for each group, and the groupby function can perform this task. It is also useful to count the number of students in each group.

df[['race/ethnicity','math score','reading score','writing score']].groupby('race/ethnicity').agg({'math score':'mean', 'reading score':'mean', 'writing score':['mean','count']}).round(1)
(image by author)

If you want to apply different aggregate functions to different columns, you can pass a dictionary to the agg function. Since the count is the same for all tests, it is enough to apply the count function to only one column.

The results show that the average scores steadily increase from group A to group E.

We can use the same logic to get an overview of the relationship between a categorical variable and the test scores.
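That logic can be wrapped in a small helper so it works for any categorical column. A sketch, not from the original post (the function name is my own):

```python
def mean_scores_by(df, category):
    """Average test scores per category of `category`, plus group sizes."""
    score_cols = ['math score', 'reading score', 'writing score']
    agg_spec = {col: 'mean' for col in score_cols}
    agg_spec['math score'] = ['mean', 'count']  # counting once is enough
    return df.groupby(category)[score_cols].agg(agg_spec).round(1)
```

For instance, mean_scores_by(df, 'test preparation course') would summarize the scores by course status in one call.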

To take it one step further, nested groups can be created. Let’s do an example using the "gender", "test preparation course", and "math score" columns. The students will be grouped by gender and by whether they took the preparation course. Then, the average math score will be calculated for each group.

df[['gender','test preparation course','math score']].groupby(['gender','test preparation course']).agg(['mean','count']).round(1)
(image by author)

The test preparation course has a positive effect on math scores for both females and males. In general, males score higher than females on the math test.

You may have noticed the round function at the end of the code. It simply rounds the floating-point numbers to the desired number of decimal places.

We can also check how the groups in two categorical variables are related. Consider the "race/ethnicity" and "parental level of education" columns. The education level distribution might be different for each ethnicity group.

We will first create a dataframe that contains the number of people for each ethnicity group-education level combination.

education = df[['race/ethnicity','parental level of education','lunch']].groupby(['race/ethnicity','parental level of education']).count().reset_index()
education.rename(columns={'lunch':'count'}, inplace=True)
education.head()
(image by author)

The third column can be any column; it is only used to count the observations (i.e. rows) that belong to each ethnicity-education level combination, which is why we renamed it from lunch to count.
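As an alternative that avoids the placeholder column entirely, groupby's size method counts the rows of each group directly. A sketch (the helper name is my own):

```python
def combination_counts(df, cols):
    """Count rows per combination of `cols` without a placeholder column."""
    # size() counts the members of each group directly,
    # so no third column or rename trick is needed
    return df.groupby(list(cols)).size().reset_index(name='count')
```

Calling combination_counts(df, ['race/ethnicity', 'parental level of education']) would produce the same education dataframe as above.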

The education level of "group A" is close to evenly distributed except for "master’s degree". There are many ways to visualize the education dataframe we just created. I will use the polar line plot from the Plotly library.

import plotly.express as px
fig = px.line_polar(education, r='count', theta='parental level of education',color='race/ethnicity', line_close=True, width=800, height=500)
fig.show()
(image by author)

It gives us an overview of the education level distribution of different groups. It is important to note that this polar plot cannot be used to directly compare the education levels of different groups because the number of students in each group is not the same.

(image by author)

Group C contains the most students. Thus, we use this polar plot to check the education level distribution within each group.
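Because the group sizes differ, converting the raw counts into within-group shares makes the distributions comparable across groups. A sketch, not from the original post (the helper name is my own):

```python
def within_group_shares(education):
    """Convert raw counts into shares that sum to 1 within each group."""
    out = education.copy()
    # Total students per ethnicity group, broadcast back to each row
    totals = out.groupby('race/ethnicity')['count'].transform('sum')
    out['share'] = (out['count'] / totals).round(3)
    return out
```

Plotting the share column instead of count would put all groups on the same scale in the polar plot.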

For instance, "associate’s degree" is the most dominant education level in group C. We can confirm the results using the value_counts function.
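A minimal sketch of that check with value_counts (the helper name is my own):

```python
def education_counts(df, group):
    """Frequency of each parental education level within one ethnicity group."""
    subset = df[df['race/ethnicity'] == group]
    # value_counts sorts from most to least frequent by default
    return subset['parental level of education'].value_counts()
```

education_counts(df, 'group C') would list the education levels of group C from most to least common.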

(image by author)

Another handy function for comparing categories is pivot_table. It creates a cross table whose rows and columns are category combinations. Let’s do an example using the "lunch" and "parental level of education" columns on the "writing score".

pd.pivot_table(df, values='writing score', index='parental level of education', columns='lunch', aggfunc='mean')
(image by author)

The categorical variables are passed to the index and columns parameters, and the numerical variable is passed to the values parameter. The cell values are then computed with the selected aggregate function (the aggfunc parameter).

The pivot table suggests that students with a standard lunch tend to score higher on the writing test. Similarly, students tend to score better as the level of parental education increases.
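pivot_table also accepts a margins parameter that appends the overall aggregates as an extra row and column. A hedged sketch building on the example above (the function name and the margins_name label are mine):

```python
import pandas as pd

def writing_score_pivot(df):
    """Mean writing score per education level / lunch combination,
    with overall means appended via margins."""
    return pd.pivot_table(df, values='writing score',
                          index='parental level of education',
                          columns='lunch', aggfunc='mean',
                          margins=True, margins_name='overall').round(1)
```

The "overall" row and column give the grand means, which make it easier to see how far each combination deviates from the average.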


Conclusion

We have discovered some insights into what affects student performance on tests. There are, of course, many other measures you can check, but the techniques are quite similar.

Depending on what you look for, the techniques you use lean toward a particular direction. However, once you are comfortable working with Pandas, you can accomplish pretty much any task on tabular data.

Thank you for reading. Please let me know if you have any feedback.

