Pandas is a useful Python library that can be used for a variety of data tasks, including statistical analysis, data imputation, data wrangling and much more. In this post, we will go over three useful custom functions that allow us to generate statistics from data.
Let’s get started!
For our purposes, we will be working with the Wine Reviews data set, which can be found here.
To start, let’s import the pandas and numpy packages:
import pandas as pd
import numpy as np
Next, let’s set the maximum number of display columns and rows to ‘None’:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
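The two calls above change the display settings for the rest of the session. If we only want the limits lifted for a single print, one option is the pandas `option_context` context manager. The sketch below uses a small made-up frame (`toy`, not the wine data) purely for illustration:

```python
import pandas as pd

# Toy frame standing in for the wine data (values are made up).
toy = pd.DataFrame({"country": ["Italy", "France"], "price": [20.0, 30.0]})

# Remember the current setting so we can compare afterwards.
default_rows = pd.get_option("display.max_rows")

# Inside the block, no row or column limit applies to printed frames.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    rows_inside = pd.get_option("display.max_rows")
    print(toy)

# After the block, the previous setting is restored automatically.
rows_after = pd.get_option("display.max_rows")
```

For a large frame, this keeps a single full printout from turning into a global change.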
Now, let’s read in our data:
df = pd.read_csv('winemag-data-130k-v2.csv')
Next, we will print the first five rows of data to get an idea of the column types and their values (column results are truncated):
print(df.head())
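Besides eyeballing the first rows, we can ask pandas which columns are numerical and which are not, since the functions in this post need one of each. Here is a minimal sketch using `select_dtypes` on a small made-up frame (`toy` and its values are assumptions for illustration; the column names mirror the ones used in this post):

```python
import pandas as pd

# Toy frame with the same kind of columns as the wine data (values are made up).
toy = pd.DataFrame({
    "country": ["Italy", "France", "Italy"],
    "variety": ["Pinot Noir", "Merlot", "Sangiovese"],
    "price": [20.0, 30.0, 40.0],
})

# Columns with a numeric dtype are candidates for the "numerical column" argument.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated here as a candidate categorical column.
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numeric_cols)
print("categorical:", other_cols)
```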

An interesting function we can consider is one that generates the average value for a numerical column for a given category in a categorical column. This function will take a categorical column name, a categorical value for that categorical column, and a numerical column name:
def get_category_mean(categorical_column, categorical_value, numerical_column):
Within the function we need to filter the dataframe for that categorical value and take the mean of the numerical values for that category:
def get_category_mean(categorical_column, categorical_value, numerical_column):
    df_mean = df[df[categorical_column] == categorical_value]
    mean_value = df_mean[numerical_column].mean()
Finally, we can use Python f-strings to print the result:
def get_category_mean(categorical_column, categorical_value, numerical_column):
    # Keep only the rows that match the requested category value
    df_mean = df[df[categorical_column] == categorical_value]
    # Average the numerical column over those rows (NaN values are skipped)
    mean_value = df_mean[numerical_column].mean()
    print(f"Mean {numerical_column} for {categorical_value}:", np.round(mean_value, 2))
Now, let’s call the function with the categorical column ‘country’, the country value ‘Italy’, and the numerical column ‘price’:
get_category_mean('country', 'Italy', 'price')

We can also call the function with the category ‘variety’, the variety value ‘Pinot Noir’ and the numerical column ‘price’:
get_category_mean('variety', 'Pinot Noir', 'price')
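The function above reads the global `df` and prints its result. If we want to reuse the number later, a variant that takes the data frame as an argument and returns the mean may be handier. The sketch below is one way to write it; the name `category_mean`, the `data` parameter and the toy values are made up for illustration:

```python
import pandas as pd

def category_mean(data, categorical_column, categorical_value, numerical_column):
    """Return the mean of numerical_column over rows where
    categorical_column equals categorical_value."""
    subset = data[data[categorical_column] == categorical_value]
    return subset[numerical_column].mean()

# Toy frame standing in for the wine data (values are made up).
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "France"],
    "price": [10.0, 30.0, 50.0],
})

italy_mean = category_mean(toy, "country", "Italy", "price")
print(f"Mean price for Italy: {italy_mean:.2f}")

# A category value that never appears leaves an empty selection, so the mean is NaN.
missing_mean = category_mean(toy, "country", "Spain", "price")
```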

Another function we can consider is one that generates the mean of a numerical column for each categorical value in a categorical column. We can use the Pandas ‘groupby’ method to achieve this:
def groupby_category_mean(categorical_column, numerical_column):
    df_groupby = df.groupby(categorical_column)[numerical_column].mean()
    df_groupby = np.round(df_groupby, 2)
    print(df_groupby)
If we call our function with ‘country’ and ‘price’, we will print the mean price for each country (the results are truncated for clarity):
groupby_category_mean('country', 'price')

As a sanity check, we see that we get the same mean price for Italy as we did with the previous function. We can also apply this function to wine variety and price (the results are truncated for clarity):
groupby_category_mean('variety', 'price')
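The sanity check mentioned earlier, comparing the grouped mean with the filtered mean, can also be done in code rather than by eye. Here is a minimal sketch on a small made-up frame (`toy` and its values are assumptions, not the wine data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are made up).
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France", "France"],
    "price": [10.0, 30.0, 20.0, 40.0, np.nan],
})

# Mean price per country; missing prices are skipped by default.
grouped = toy.groupby("country")["price"].mean()

# Mean price for a single country, computed by filtering first.
filtered = toy.loc[toy["country"] == "Italy", "price"].mean()

# Both routes should agree for the same category.
assert np.isclose(grouped["Italy"], filtered)
print(grouped)
```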

The last function we will discuss is one that generates the mode of a categorical column for each value of another categorical column. We can use the Pandas ‘groupby’ method to achieve this as well:
def groupby_category_mode(categorical_column1, categorical_column2):
    df_groupby = df.groupby(categorical_column1)[categorical_column2].agg(pd.Series.mode)
    print(df_groupby)
Let’s find the mode of wine variety for each country. This will tell us the most frequently appearing wine variety for each country:
groupby_category_mode('country', 'variety')
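One thing to keep in mind is that a group can have more than one mode when several values are tied for most frequent. If we want exactly one value per group, one option is to pick the first mode explicitly. The sketch below does that on a small made-up frame; the helper name `first_mode` and the toy values are assumptions for illustration:

```python
import pandas as pd

def first_mode(values):
    """Return a single mode for a group, or None if the group has no values."""
    modes = values.mode()  # sorted, with missing values dropped
    return modes.iloc[0] if not modes.empty else None

# Toy frame standing in for the wine data (values are made up).
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "France", "France"],
    "variety": ["Sangiovese", "Pinot Noir", "Sangiovese", "Merlot", "Chardonnay"],
})

# France has a tie between Merlot and Chardonnay; first_mode keeps the first
# one in sorted order.
top_variety = toy.groupby("country")["variety"].agg(first_mode)
print(top_variety)
```

Taking the first mode is an arbitrary tie-break, so it is worth checking the tied groups separately if the distinction matters.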

I’ll stop here but feel free to play around with the code and data yourself.
CONCLUSIONS
To summarize, in this post we discussed how to define three custom functions using Pandas to generate statistical insights from data. First, we showed how to define a function that calculates the mean of a numerical column given a categorical column and category value. We then showed how to use the ‘groupby’ method to generate the mean value for a numerical column for each category in a categorical column. Finally, we showed how to generate the mode of a categorical column for each value of another categorical column. If you are interested in learning more about data manipulation with Pandas, machine learning or even just some of the basics of Python programming, check out _Python for Data Science and Machine Learning: Python Programming, Pandas and Scikit-learn Tutorials for Beginners_. I hope you found this post useful and interesting. The code in this post is available on GitHub. Thank you for reading!