
Four Useful Functions For Exploring Data in Python

Exploring and Visualizing Data in Python

While exploring data, I often find myself repeatedly writing similar Python logic to carry out simple analytical tasks. For example, I often calculate the mean and standard deviation of a numerical column for specific categories within the data. I also often analyze the frequency of categorical values. To save time, I’ve written a few functions that let me do this type of analysis without rewriting much code.

In this post, I will share four useful functions I frequently use during the exploratory data analysis step of model building. I will then show how we can use these functions in order to explore the Wine Reviews data set. The data set can be found here. The folder contains three ‘.csv’ files. I will be using the one titled ‘winemag-data_first150k.csv’.

Let’s get started!

  1. COUNTER

The first function I will discuss allows us to look at how frequently categorical values appear in a data set. It takes as input a dataframe, column name, and limit. When called it prints a dictionary of categorical values and how frequently they appear:

def return_counter(data_frame, column_name, limit):
    from collections import Counter
    print(dict(Counter(data_frame[column_name].values).most_common(limit)))

Let’s print the first five rows of the data set:

import pandas as pd
df = pd.read_csv('winemag-data_first150k.csv')
print(df.head())

We can see that there are several categorical columns. Let’s apply our function to the ‘country’ column and limit our results to the five most common countries:

return_counter(df, 'country', 5)

We can see that the US is the most common country of origin in the wine records.

Let’s apply our function to the ‘variety’ column:

return_counter(df, 'variety', 5)

The most common varieties are Chardonnay and Pinot Noir. This is a useful quick check for significant imbalance in the data, which often needs to be dealt with during model building.
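As a cross-check on the counter function, pandas also has a built-in `value_counts` method that produces a similar frequency table. The following is a minimal sketch, not part of the original code: the function name `return_value_counts` and the `toy` data frame are made up for illustration, and the rows are not taken from the Wine Reviews data.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from the Wine Reviews file.
toy = pd.DataFrame({'country': ['US', 'US', 'Italy', 'France', 'US', 'Italy']})

def return_value_counts(data_frame, column_name, limit):
    # value_counts sorts by frequency in descending order;
    # head keeps only the `limit` most common values.
    return data_frame[column_name].value_counts().head(limit).to_dict()

top_countries = return_value_counts(toy, 'country', 2)
print(top_countries)
```

One difference to keep in mind: `value_counts` excludes missing values by default, so pass `dropna=False` if you want them counted as well.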

  2. SUMMARY STATISTICS

The next function is a summary statistics function (somewhat similar to df.describe()). It takes a dataframe, a categorical column and a numerical column. The mean and standard deviation of the numerical column for each category are stored in a data frame, which is sorted in descending order by the mean. Categories with a missing mean or standard deviation (for example, categories with a single record) are dropped. This is useful if you want to quickly see whether certain categories have higher or lower mean and/or standard deviation values for a particular numerical column.

def return_statistics(data_frame, categorical_column, numerical_column):
    mean = []
    std = []
    field = []
    for i in set(list(data_frame[categorical_column].values)):
        new_data = data_frame[data_frame[categorical_column] == i]
        field.append(i)
        mean.append(new_data[numerical_column].mean())
        std.append(new_data[numerical_column].std())
    df = pd.DataFrame({'{}'.format(categorical_column): field, 'mean {}'.format(numerical_column): mean, 'std in {}'.format(numerical_column): std})
    df.sort_values('mean {}'.format(numerical_column), inplace = True, ascending = False)
    df.dropna(inplace = True)
    return df

We can look at summary statistics for ‘variety’ and ‘price’:

stats = return_statistics(df, 'variety', 'price')
print(stats.head())

The variety Muscadel has the highest mean price.

We can do the same for countries:

stats = return_statistics(df, 'country', 'price')
print(stats.head())

England has the highest mean price.
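The same per-category mean and standard deviation can also be computed with pandas `groupby`, which avoids the explicit loop. This is a sketch under stated assumptions rather than the function used above: the name `return_statistics_groupby` and the `toy` data are hypothetical, and the output columns are simply called 'mean' and 'std'.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from the Wine Reviews file.
toy = pd.DataFrame({
    'country': ['US', 'US', 'Italy', 'Italy', 'France'],
    'price': [10.0, 30.0, 15.0, 35.0, 50.0],
})

def return_statistics_groupby(data_frame, categorical_column, numerical_column):
    stats = (
        data_frame.groupby(categorical_column)[numerical_column]
        .agg(['mean', 'std'])                  # one row per category
        .dropna()                              # drops categories whose std is undefined
        .sort_values('mean', ascending=False)  # highest mean first
        .reset_index()
    )
    return stats

toy_stats = return_statistics_groupby(toy, 'country', 'price')
print(toy_stats)
```

In the toy data, France has a single record, so its standard deviation is undefined and the row is dropped, which mirrors the `dropna` call in the loop-based version.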

  3. BOXPLOT

The next function is the boxplot function. We use boxplots to visualize the distribution in numeric values based on the minimum, maximum, median, first quartile, and third quartile. If you are unfamiliar with them take a look at the article Understanding Boxplots.

Similar to the summary statistics function, this function takes a data frame, a categorical column and a numerical column. It also takes a limit, and displays boxplots for that many of the most common categories:

def get_boxplot_of_categories(data_frame, categorical_column, numerical_column, limit):
    from collections import Counter
    import seaborn as sns
    import matplotlib.pyplot as plt
    # Collect the `limit` most common categories
    keys = [key for key, _ in Counter(data_frame[categorical_column].values).most_common(limit)]
    print(keys)

    # Keep only the rows that belong to those categories
    df_new = data_frame[data_frame[categorical_column].isin(keys)]
    sns.boxplot(x = df_new[categorical_column], y = df_new[numerical_column])
    plt.show()

Let’s generate boxplots for wine prices in the 5 most commonly occurring countries:

get_boxplot_of_categories(df, 'country', 'price', 5)

As we can see, in all five country categories the wine prices have significant outliers. We can do the same for ‘variety’. I limited the category values to three varieties for better visualization:

get_boxplot_of_categories(df, 'variety', 'price', 3)
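To read the numbers behind a boxplot, the minimum, quartiles, median and maximum can be computed per category with pandas `describe`. The sketch below is an illustration only: `five_number_summary` is a hypothetical helper and the `toy` prices are made up, not taken from the Wine Reviews data.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from the Wine Reviews file.
toy = pd.DataFrame({
    'country': ['US'] * 5 + ['Italy'] * 5,
    'price': [10, 20, 30, 40, 50, 5, 10, 15, 20, 100],
})

def five_number_summary(data_frame, categorical_column, numerical_column):
    # describe() returns count, mean, std, min, 25%, 50%, 75% and max per group;
    # keep only the five values a boxplot is built from.
    described = data_frame.groupby(categorical_column)[numerical_column].describe()
    return described[['min', '25%', '50%', '75%', 'max']]

summary = five_number_summary(toy, 'country', 'price')
print(summary)
```

Note that seaborn's default whiskers extend only to points within 1.5 times the interquartile range of the box, so a value such as the toy Italian price of 100 would be drawn as an individual outlier rather than as the end of a whisker.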

  4. SCATTERPLOT

The last function is the scatterplot function. This function takes a data frame, categorical column, categorical value and two numerical columns as input and displays a scatterplot:

def get_scatter_plot_category(data_frame, categorical_column, categorical_value, numerical_column_one, numerical_column_two):
    import matplotlib.pyplot as plt
    import seaborn as sns
    df_new = data_frame[data_frame[categorical_column] == categorical_value]
    sns.set()
    plt.scatter(x= df_new[numerical_column_one], y = df_new[numerical_column_two])
    plt.xlabel(numerical_column_one)
    plt.ylabel(numerical_column_two)

Let’s generate a scatterplot of points vs price for wines in the US:

get_scatter_plot_category(df, 'country', 'US', 'points', 'price')

There seems to be a slight positive relationship between price and points awarded. I will stop here but please feel free to play around with the data and code yourself.
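One way to put a number on a relationship seen in a scatterplot is the Pearson correlation coefficient, which pandas exposes as `Series.corr`. This is a rough sketch, not part of the original analysis: `get_correlation_for_category` is a hypothetical helper, and the `toy` values are invented, so the result says nothing about the actual wine data.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from the Wine Reviews file.
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'US', 'Italy'],
    'points': [85, 88, 90, 95, 92],
    'price': [12.0, 20.0, 35.0, 80.0, 40.0],
})

def get_correlation_for_category(data_frame, categorical_column, categorical_value,
                                 numerical_column_one, numerical_column_two):
    # Same filtering step as the scatterplot function
    df_new = data_frame[data_frame[categorical_column] == categorical_value]
    # Series.corr computes the Pearson correlation by default and skips missing pairs
    return df_new[numerical_column_one].corr(df_new[numerical_column_two])

r = get_correlation_for_category(toy, 'country', 'US', 'points', 'price')
print(r)
```

A value close to 1 indicates a strong positive linear relationship, a value near 0 indicates little linear relationship, and a negative value indicates an inverse one.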

To recap, in this post I went over four useful functions that I often use in the process of exploratory data analysis. I went over methods for analyzing data including visualizing the data with box plots and scatterplots. I also defined functions for generating summary statistics like the mean, standard deviation and counts for categorical values. I hope this post was helpful.

The code from this post is available on GitHub. Thank you for reading!

