
EDA (Exploratory Data Analysis) is a crucial part of the Data Science journey – it allows us to understand the given data and guides us in our modeling decisions. Getting a "feel" for your data before making any assumptions can be critical in formulating valuable insights. The bad news? It can often be time-consuming.
My last project used data provided by the Austin Animal Shelter through the City of Austin Open Data Portal. I built a predictive model to determine whether an animal would be adopted based on its intake information (animal type, age, breed, color, month of intake, etc.). With over 150,000 animals in the shelter, it was important to explore these distributions to get an understanding of the intakes in Austin.
As I started exploring and creating bar plots, I found myself writing the same code over and over again. Often when this happens I ask myself, "Should I create a function instead?" Ultimately, creating this function saved me a lot of time. Work smarter, not harder. Here’s the function I created:

    import matplotlib.pyplot as plt
    import seaborn as sns


    def initial_eda(df):
        # List of categorical columns
        cat_cols = df.select_dtypes('object').columns

        for col in cat_cols:
            # Formatting
            column_name = col.title().replace('_', ' ')
            title = 'Distribution of ' + column_name

            # Unique values <= 12 to avoid overcrowding
            if len(df[col].value_counts()) <= 12:
                plt.figure(figsize=(8, 6))
                # Bars are ordered from most to least frequent category
                sns.countplot(x=col,
                              data=df,
                              palette="Paired",
                              order=df[col].value_counts().index)
                plt.title(title, fontsize=18, pad=12)
                plt.xlabel(column_name, fontsize=15)
                plt.xticks(rotation=20)
                plt.ylabel("Frequency", fontsize=15)
                plt.show()
            else:
                print(f'{column_name} has {len(df[col].value_counts())} unique values. Alternative EDA should be considered.')

Running the function on my data produced the following output:

[Count plots for each categorical column with 12 or fewer unique values]

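Notice that no plots were created for the color and breed columns. Without the unique-value check in my function (the line `if len(df[col].value_counts())<=12:`), these two visuals would have been created:

[Count plots for the color and breed columns]
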
Yikes! Not nearly as informative, and clearly not a visual I would want to include in any presentation. This is why I limited the function to plotting only columns with fewer than 13 unique values. This maximum can be adjusted, however. Instead of these ‘not-so-pretty’ plots, the number of unique values in each of these columns was printed to guide further exploration of these columns.
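Since the maximum can be adjusted, one option is to turn it into a parameter. The following is only a minimal sketch, not the function from my project: `split_by_cardinality` and `max_unique` are hypothetical names, and the toy data is made up for illustration. It leaves the plotting out and simply sorts the text columns into those that would be plotted and those that would only be reported.

```python
import pandas as pd


def split_by_cardinality(df, max_unique=12):
    """Sort text columns into plottable ones and ones with too many categories.

    Hypothetical helper for illustration; max_unique plays the role of the
    hard-coded 12 in the original function.
    """
    plottable, too_many = [], []
    for col in df.select_dtypes(include="object").columns:
        # nunique() ignores missing values, as value_counts() does by default
        if df[col].nunique() <= max_unique:
            plottable.append(col)
        else:
            too_many.append(col)
    return plottable, too_many


# Toy data, made up for illustration only
toy = pd.DataFrame(
    {
        "animal_type": ["Dog", "Cat", "Dog", "Bird"],
        "color": ["Black", "White", "Brown", "Tan"],
    },
    dtype=object,  # keep text as object dtype so select_dtypes finds it
)

plottable, too_many = split_by_cardinality(toy, max_unique=3)
print(plottable)  # columns with at most 3 unique values
print(too_many)   # columns that need a different kind of plot
```

With a threshold of 3, the toy `animal_type` column (three unique values) would be plotted, while `color` (four unique values) would only be reported.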
For the number of visuals that were created, little coding was needed. The function can be refined further in pursuit of perfecting these graphs, but for initial EDA it can save a lot of time before entering the modeling phase. I plan on using this function in future projects, and I hope others find it useful as well!
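As one possible direction for that further exploration, a column with many unique values, such as breed or color, could be reduced to its most frequent values before plotting. This is a rough sketch under my own assumptions: `top_n_counts` is a hypothetical helper that is not part of the original function, and the breed values below are made up.

```python
import pandas as pd


def top_n_counts(series, n=10, other_label="Other"):
    """Count the n most frequent values and lump the rest into one bucket."""
    counts = series.value_counts()  # sorted from most to least frequent
    top = counts.iloc[:n]
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        # Append a single row holding the combined count of all other values
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top


# Made-up values for illustration only
breeds = pd.Series([
    "Domestic Shorthair", "Pit Bull", "Domestic Shorthair", "Labrador",
    "Chihuahua", "Pit Bull", "Domestic Shorthair",
])

summary = top_n_counts(breeds, n=2)
print(summary)
```

The resulting Series could then be passed to a bar plot, for example with `summary.plot(kind="bar")`, so that the chart stays readable no matter how many distinct values the column holds.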
To read more about this project, you can view it on my personal GitHub here.