
Pandas, a widely used data analysis library, provides numerous functions and methods for working with tabular data. Its rich selection of easy-to-use functions makes the exploratory data analysis (EDA) process fairly easy.
In this post, we will explore the student performance dataset available on Kaggle. The dataset contains some personal information about students and their performance on certain tests.
Let’s start by reading the dataset into a Pandas dataframe.
import numpy as np
import pandas as pd
df = pd.read_csv("/content/StudentsPerformance.csv")
df.shape
(1000,8)
df.head()

There are 5 categorical features and scores of 3 different tests. The goal is to check how these features affect the test scores.
We can start by checking the distribution of test scores. The plot function of pandas can be used to create a kernel density estimate (KDE) plot.
df['reading score'].plot(kind='kde', figsize=(10,6), title='Distribution of Reading Score')

The scores of the reading test are approximately normally distributed. The distributions of the other test scores are similar to that of the reading test.
The "race/ethnicity" column contains 5 groups. I want to see the mean test scores for each group. The groupby function can perform this task. It is also useful to count the number of students in each group.
df[['race/ethnicity','math score','reading score','writing score']].groupby('race/ethnicity').agg({'math score':'mean', 'reading score':'mean', 'writing score':['mean','count']}).round(1)

If you want to apply different aggregate functions to different columns, you can pass a dictionary to the agg function. Since the count is the same for all tests, it is enough to apply the count function to only one column.
The results show that the average scores steadily increase from group A to group E.
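The dictionary form of agg can be tried on a small made-up table. This is only a sketch: the toy dataframe, its "group" column, and its values are invented for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Toy data standing in for the real dataset (all values are made up).
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "math score": [60, 70, 80, 90, 100],
    "reading score": [65, 75, 70, 80, 90],
})

# One aggregation for "math score", two for "reading score".
summary = toy.groupby("group").agg(
    {"math score": "mean", "reading score": ["mean", "count"]}
)

# The columns come back with two levels, e.g. ("reading score", "count").
print(summary)
```

Because the column labels have two levels, a single value is selected with a tuple, such as summary.loc["B", ("reading score", "count")].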
We can use the same logic to get an overview of the relationship between a categorical variable and the test scores.
To take it one step further, nested groups can be created. Let’s do an example using the "gender", "test preparation course", and "math score" columns. The students will be grouped by gender and by whether they took the preparation course. Then, the average math score will be calculated for each group.
df[['gender','test preparation course','math score']].groupby(['gender','test preparation course']).agg(['mean','count']).round(1)

Completing the test preparation course is associated with higher math scores for both females and males. In general, males score higher than females on the math test.
You may have noticed the round function at the end of the code. It rounds the floating-point numbers to the desired number of decimal places.
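A quick way to see what round does is to apply it to a few numbers. The values below are made up for this sketch.

```python
import pandas as pd

# Made-up values, chosen so that none sits exactly halfway between two results.
values = pd.Series([71.24, 71.26, 68.88])

# round(1) keeps one decimal place and rounds to the nearest value,
# so a number can move down as well as up.
rounded = values.round(1)
print(rounded.tolist())
```

Here 71.24 becomes 71.2 while 71.26 becomes 71.3, which shows that the function rounds to the nearest value rather than always rounding up.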
We can also check how the groups in two categorical variables are related. Consider the "race/ethnicity" and "parental level of education" columns. The education level distribution might be different for each ethnicity group.
We will first create a dataframe that contains the number of people for each ethnicity group-education level combination.
education = df[['race/ethnicity','parental level of education','lunch']].groupby(['race/ethnicity','parental level of education']).count().reset_index()
education.rename(columns={'lunch':'count'}, inplace=True)
education.head()

The third column can be any column because it is just used to count the observations (i.e. rows) that belong to a particular ethnicity-education level combination. That is the reason why we changed the column name from lunch to count.
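As an alternative sketch that avoids the helper column and the rename step, the size function of a groupby object counts the rows in each group directly. Note that count only counts non-missing values in a column, whereas size counts rows. The toy dataframe below is made up for illustration.

```python
import pandas as pd

# Toy data with the same two column names as the real dataset (values are made up).
toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group A", "group B", "group B", "group B"],
    "parental level of education": [
        "high school", "high school", "some college", "high school", "some college",
    ],
})

# size() returns the number of rows per combination;
# reset_index(name="count") turns the result into a dataframe with a "count" column.
counts = (
    toy.groupby(["race/ethnicity", "parental level of education"])
    .size()
    .reset_index(name="count")
)
print(counts)
```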
The education level of "group A" is close to being evenly distributed except for the "master’s degree". There are many options to visualize the education dataframe we just created. I will use the polar line plot from the plotly library.
import plotly.express as px
fig = px.line_polar(education, r='count', theta='parental level of education',color='race/ethnicity', line_close=True, width=800, height=500)
fig.show()

It gives us an overview of the education level distribution of different groups. It is important to note that this polar plot cannot be used to directly compare the education levels of different groups because the number of students in each group is not the same.

Group C contains the most students. Thus, we use this polar plot only to check the education level distribution within each group.
For instance, "associate’s degree" is the most dominant education level in group C. We can confirm the results using the value_counts function.
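A check with value_counts might look like the sketch below. It uses a small made-up dataframe rather than the real data, so the numbers it prints are illustrative only. The second part, which uses normalize=True, is an addition to the original analysis: it converts counts to within-group shares, which stay comparable when group sizes differ.

```python
import pandas as pd

# Toy data (made up): four students in group C, two in group A.
toy = pd.DataFrame({
    "race/ethnicity": ["group C"] * 4 + ["group A"] * 2,
    "parental level of education": [
        "associate's degree", "associate's degree", "high school", "some college",
        "high school", "high school",
    ],
})

# Count education levels within a single group, most frequent first.
group_c = toy[toy["race/ethnicity"] == "group C"]
level_counts = group_c["parental level of education"].value_counts()
print(level_counts)

# Share of each education level within every group.
shares = (
    toy.groupby("race/ethnicity")["parental level of education"]
    .value_counts(normalize=True)
)
print(shares)
```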

Another handy function for comparing categories is pivot_table. It creates a cross table that consists of category combinations. Let’s do an example using the "lunch" and "parental level of education" columns, with "writing score" as the value to aggregate.
pd.pivot_table(df, values='writing score',index='parental level of education', columns='lunch',aggfunc='mean')

The categorical variables (i.e. columns) are passed to the index and columns parameters and the numerical variable is passed to the values parameter. The values in the table are then calculated with the selected aggregate function (the aggfunc parameter).
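To make the role of each parameter concrete, the sketch below builds a pivot table on a small made-up dataframe and compares it with the equivalent groupby followed by unstack. The toy values are invented for illustration.

```python
import pandas as pd

# Toy data (made up) with the three columns used in the pivot table.
toy = pd.DataFrame({
    "parental level of education": [
        "high school", "high school", "high school", "some college", "some college",
    ],
    "lunch": ["standard", "standard", "free/reduced", "standard", "free/reduced"],
    "writing score": [70, 80, 60, 85, 72],
})

# Rows come from index, columns from columns, cell values from values + aggfunc.
pivot = pd.pivot_table(
    toy,
    values="writing score",
    index="parental level of education",
    columns="lunch",
    aggfunc="mean",
)
print(pivot)

# The same numbers via groupby: average per combination, then move "lunch" to columns.
same = (
    toy.groupby(["parental level of education", "lunch"])["writing score"]
    .mean()
    .unstack()
)
print(same)
```

Both approaches give the same cross table here; pivot_table is mainly a more compact way to express it.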
The pivot table shows that students with a standard lunch tend to have higher writing scores. Similarly, as the level of parental education increases, the students tend to score better.
Conclusion
We have discovered some insights into what affects student performance on tests. There are, of course, many different measures you can check. However, the techniques are pretty similar.
Depending on what you are looking for, the techniques you use will lean in a particular direction. However, once you are comfortable working with Pandas, you can accomplish pretty much any task on tabular data.
Thank you for reading. Please let me know if you have any feedback.