Working with Data Frames and Visualization Using Basic Python Libraries

This article introduces fundamental Python approaches to interacting with structured data, using a case study from the payment services industry.

Dexter Nguyen
Towards Data Science



Context

Last year, my boss stopped by my desk and gave me a small assignment: analyze the AS-IS online payment industry. Motivated by my passion for Python at the time, I thought, why not try completing this exercise in that language? For context, Excel, SQL, and Tableau were the three most popular tools in my job, given their convenience and simplicity for our straightforward analyses; Python, meanwhile, was reserved for more predictive and advanced analytics. So I spent the next two days researching possible methodologies for using Python to do the analysis and share my work with non-Python audiences.

What this article will cover

Built around a case study in the payment services industry, this article combines technical and business processes: I use basic data frame analysis and visualization to answer business questions from the perspective of a payment solution provider. I also include the relevant Python code and libraries, alongside findings presented in non-technical formats.

What this article will not cover

First, I agree that many other tools can reach the same result, sometimes more efficiently. For example, people can detect data errors quickly in Excel, merge data tables with shorter code in SQL, or easily visualize numbers in Tableau. R is another great tool for working with data frames. What I am experimenting with here is Python. Second, Python is very powerful in advanced analytics areas such as Predictive Analytics, Machine Learning, and Neural Networks, but those applications are not covered in this article.

Case study and Data

Ok, so let’s talk about my “small assignment.” Working in a market research firm, my team was supporting a client with an initial analysis of one of the current trends in the financial services industry: Buy Now Pay Later. Insights into both sides of this business model, Seller and Buyer, needed to be addressed, such as: “Which industries are adopting this new payment solution?” and “Which customer segments are being attracted by these new payment solution providers?”
I modified the raw data to prepare for this article, so the resulting analysis is not “real” and serves only a simulation and educational purpose. Our focus here should be the methodologies themselves. Please don’t use the case study data and analysis (some of which are censored) for any practical purpose.
(1) Funnel data: Data about events occurring on the payment service providers’ checkout product (try it out at a merchant like Casper to get a sense of the flow). It includes the merchants that use the service, the user — Buyer, and the user’s current status in the checkout funnel.
(2) Loan data: Data on each loan from the ‘Completed Checkout’ action. It includes more detailed information about users’ loans, like amount, interest rate, and length. It also covers data about the User/Buyer: personal info, demographic, credit score, etc.
(3) Merchant data: Data on each merchant that integrates the payment solution provider’s product (industry, company information).
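
For orientation, here is a minimal sketch of how the three tables could be loaded; the CSV file names are hypothetical placeholders, not the original files:

# Hypothetical file names; replace with your own paths
import pandas as pd

funneldata = pd.read_csv('funnel_data.csv')       # checkout events
loansdata = pd.read_csv('loan_data.csv')          # one row per completed loan
merchantsdata = pd.read_csv('merchant_data.csv')  # one row per merchant

# Quick sanity check on shapes
for name, df in [('funnel', funneldata), ('loans', loansdata), ('merchants', merchantsdata)]:
    print(name, df.shape)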

Analysis

So I will cover three parts in my case study analysis: (1) Data cleaning; (2) Detecting data anomalies; and (3) Data analysis. I used these libraries for this analysis:

import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

# libraries for displaying images
from IPython.display import Image
from IPython.core.display import HTML

# Transforming JSON files into a pandas dataframe
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated in newer pandas

# libraries for visualization
import matplotlib.pyplot as plt # Library for 2D plots of arrays
from mpl_toolkits.mplot3d import Axes3D # Library for 3D plotting
import matplotlib as mpl # Library for creating static, animated, and interactive visualizations
import seaborn as sns # Library based on matplotlib
import plotly.express as px # Contains functions that can create entire figures at once

(1) Data cleaning

As the very first step in any data project, data cleaning strongly affects the efficiency of all later work. For example, after briefly reviewing the Loan data table, I did some basic cleaning and transformation, such as creating/removing columns and consolidating information by grouping column(s) based on conditions:

# Create a new column age by using current year - user_dob_year
loansdata['age'] = (2021 - loansdata['user_dob_year'])
# Create a list of our conditions
conditions = [
(loansdata['age'] < 30),
(loansdata['age'] >= 30) & (loansdata['age'] <= 44),
(loansdata['age'] >= 45) & (loansdata['age'] <= 64),
(loansdata['age'] >= 65)
]

# create a list of the values we want to assign for each condition
values = ['Younger than 30', 'Age 30-44', 'Age 45-64', '65 and older']

# create a new column age_group and use np.select to assign values to it using our lists as arguments
loansdata['age_group'] = np.select(conditions, values)

(2) Detecting data anomalies

There are a few ways to detect data anomalies or outliers, such as identifying missing values, using frequency counts, or visualizing histograms/boxplots. I started with a simple count of missing values in each column of our three data tables and presented them as percentages:

# Check missing data in the funnel table
funnelmissing = pd.DataFrame({
    'Total missing values': funneldata.isnull().sum(),
    '% missing values': round(funneldata.isnull().sum() * 100 / len(funneldata), 2)
})
funnelmissing.sort_values(by=['% missing values'], ascending=False)

However, this may not work for every case, since zeros can overlap with genuinely missing values. There is no way to resolve this ambiguity other than using your domain knowledge to research each variable. I then ran frequency counts on our data columns to get another view of missing values. We can see that there are 67202 users with a value of 0, which may in fact be missing values.

# Frequency counts of all columns in funnel data
for col in funneldata.columns:
    try:
        print('Frequency Count of ', col)
        print(funneldata[col].value_counts())
        print('')
    except ValueError:
        print('Frequency count not available for column', col)

Another option for detecting data anomalies is a summary stats table, which lets us observe different metrics for the numerical variables. For example, in the Loan data we can see some potential anomalies: the highest age is 122, some buyers checked out with a fico score of 0, and there are unusually high down payment and loan amounts.

# Looking at the histograms, we may see potential outliers in most of the features. Let's use summary stats for more info:
loansdata[['age', 'apr', 'down_payment_amount',
'fico_score', 'loan_amount','loan_length_months',
'loan_return_percentage', 'mdr', 'user_dob_year']].describe()

Last, I used some visualization to dig deeper into potential data problems. These graphs provide a visual view of the insights covered in the prior analysis (frequency counts and stats summary). The most popular options are histograms and boxplots, which are very good at quickly surfacing outliers.
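As a minimal sketch using the libraries imported above (the column choices are just examples from the Loan data):

# Histogram: a long right tail reveals unusually high loan amounts
loansdata['loan_amount'].plot(kind='hist', bins=50)
plt.xlabel('loan_amount')
plt.show()

# Boxplot: outliers appear as points beyond the whiskers (e.g., a fico_score of 0)
sns.boxplot(x=loansdata['fico_score'])
plt.show()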

Applying these layers to the other data tables in our analysis will improve our detection of potential data problems. I didn’t include solutions for missing values and outliers here, since those require domain knowledge and an understanding of the business situation on top of the technical methods that are widely discussed elsewhere. Knowing how to detect data problems, however, works in almost every case.

(3) Data analysis

Let’s dive into the business insights in this case study. This section covers different interactions with data frames, including adding new columns, removing irrelevant columns, creating calculations/metrics, grouping data, and creating customized pivot tables.
To answer the first question, “Which merchant industry to focus on?” we are going to use two metrics to evaluate the merchant industries: revenue performance and checkout funnel.
First, computing the revenue performance index requires three steps: merge the loans and merchants datasets; add a new column that calculates the revenue; and group by merchant category to see the total revenue performance by category/industry.

# The first step is to merge loansdata and merchantsdata
loansmerchants = pd.merge(loansdata, merchantsdata,
on='merchant_id',
how='left')
# Use the formula: (mdr + loan_return_percentage) * loan_amount
# We have two revenue sources: returns from consumers and transaction charges from merchants

loansmerchants['revenue'] = round((loansmerchants['mdr'] + loansmerchants['loan_return_percentage']) * loansmerchants['loan_amount'])
# Group by merchant category to see the total revenue performance by category
loansmerchants2 = loansmerchants.groupby(['category']).sum(numeric_only=True)
# Drop irrelevant columns that might be incorrectly calculated using sum; keep the revenue column
# We have the total revenue by merchant category
columns = ['revenue']
loansmerchantsrevenue = loansmerchants2[columns]

Even though the total revenue by merchant category decreases in the order Furniture, Apparel, Music, Jewelry, we may also want to consider the total number of customers in each merchant category. This is where domain knowledge comes in alongside our technical skills. It seems more reasonable to use a new metric: revenue per user/customer for each of the four categories.

# Count unique values by merchant category to get the number of users per category
loansmerchantscount = loansmerchants.groupby(['category']).nunique()
# Keep only the user count column
# We have the total number of users by merchant category
columns = ['user_id']
loansmerchantscount_user = loansmerchantscount[columns]
loansmerchantscount_user.rename(columns={"user_id": "number_of_customers"}, inplace=True)
# Dive deeper into the analysis by computing the revenue per user/customer for each of the four categories
# Merge the two tables
loansmerchants_revenuesummary = pd.merge(loansmerchantsrevenue, loansmerchantscount_user,
on='category',
how='left')
# Create revenue per user column
loansmerchants_revenuesummary['revenue_per_customer'] = (round(loansmerchants_revenuesummary['revenue'] / loansmerchants_revenuesummary['number_of_customers']))

So we can see that, based on revenue per customer, the Furniture category looks even more promising. Also, we should not let Jewelry’s high revenue per customer distract our analysis, given its small market size. Now let’s build a visualization to summarize our analysis.
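A simple bar chart works well here; this is a sketch, with a reset_index() guard in case 'category' ended up in the index after the merge:

# Bar chart of revenue per customer by merchant category
plotdata = loansmerchants_revenuesummary.reset_index()
sns.barplot(data=plotdata, x='category', y='revenue_per_customer')
plt.title('Revenue per customer by merchant category')
plt.show()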

Second, we want to examine the current checkout funnel using a combination of the funnel data and the merchants’ data. We calculate the checkout funnel by merchant id using the Funnel data: group by merchant_id and action, and count the checkout_id for each merchant and each type of action. Then we pivot the action values into separate columns.

# Now we come back and calculate the current checkout funnel by merchant id using the first dataset: funneldata
# Group by merchant_id and action and count the checkout_id for each merchant and each type of action
# (reset_index makes the group keys plain columns for the pivot below)
funnelcount2 = funneldata.groupby(['merchant_id','action']).count().reset_index()
# Pivot the action values into separate columns
funnelcountpivot2 = pd.pivot_table(funnelcount2, index='merchant_id', columns='action', values='checkout_id')
# Reorder the columns and rename them to match our data structure
funneldata3 = funnelcountpivot2[['Checkout Loaded', 'Loan Terms Run', 'Loan Terms Approved', 'Checkout Completed']]
funneldata3 = funneldata3.rename(columns={"Checkout Loaded": "num_loaded",
                                          "Loan Terms Run": "num_applied",
                                          "Loan Terms Approved": "num_approved",
                                          "Checkout Completed": "num_confirmed"})

However, we need to merge this funneldata3 with the third dataset, merchants, to get information about each merchant id, before grouping by merchant category to see the funnel performance by category. Last, we add three calculated columns to the resulting data table. Again, summarizing our analysis with a visualization (sketched after the code below) lends additional support to the Furniture category as a potential focus for the payment solution provider.

# Merge this funneldata3 with the third dataset: merchants to get information about merchant id
funnelmerchants = pd.merge(funneldata3, merchantsdata,
on='merchant_id',
how='left')
# Group by merchant category to see the funnel performance by category
funnelmerchantsperformance = funnelmerchants.groupby(['category']).sum(numeric_only=True).reset_index()
# Add three calculation columns to the funnelmerchantsperformance
# We end up with the current funnel performance by category
funnelmerchantsperformance['application_rate'] = round(funnelmerchantsperformance['num_applied'] / funnelmerchantsperformance['num_loaded'], 2)
funnelmerchantsperformance['approval_rate'] = round(funnelmerchantsperformance['num_approved'] / funnelmerchantsperformance['num_applied'], 2)
funnelmerchantsperformance['confirmation_rate'] = round(funnelmerchantsperformance['num_confirmed'] / funnelmerchantsperformance['num_approved'], 2)
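
To visualize these rates, one option is a grouped bar chart; a sketch using the data frame we just built:

# Grouped bar chart of the three funnel rates by merchant category
rates = funnelmerchantsperformance.set_index('category')[
    ['application_rate', 'approval_rate', 'confirmation_rate']]
rates.plot(kind='bar', figsize=(8, 4))
plt.ylabel('rate')
plt.title('Checkout funnel performance by merchant category')
plt.show()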

Now, moving to the second question, “Which user demographic to focus on?”, we can consider age and FICO score. Let’s start with the age groups by calculating the revenue performance index: we group the data by age group to see the revenue performance of each.

# Group by age_group to see the revenue performance by age_group
loansmerchantsage = loansmerchants.groupby(['age_group']).sum(numeric_only=True).reset_index()
# Keep only the relevant columns
# We have the total revenue by age group
columnsage = ['age_group','revenue']
loansagerevenue = loansmerchantsage[columnsage]

Even though the total revenue by age group decreases in the order Age 45–64, Age 30–44, 65 and older, Younger than 30, we may also want to look at the total number of customers in each age group. We use the same approach to compute a new metric: revenue per user/customer for each of the four age groups.

# Count unique values by age group to get the number of users per group
loansagecount = loansmerchants.groupby(['age_group']).nunique()
# Keep only the user count column
# We have the total number of users by age group
columns = ['user_id']
loansagecount_user = loansagecount[columns]
loansagecount_user.rename(columns={"user_id": "number_of_customers"}, inplace=True)
# Dive deeper into the analysis by computing the revenue per user/customer for each of the four age groups
# Merge the two tables
loansage_revenuesummary = pd.merge(loansagerevenue, loansagecount_user,
on='age_group',
how='left')
# Create revenue per user column
loansage_revenuesummary['revenue_per_customer'] = (round(loansage_revenuesummary['revenue'] / loansage_revenuesummary['number_of_customers']))

Second, we may want to examine the relationships between age group, loan amount, and loan length. The graph below shows that, since loan lengths are similar across groups, the provider should keep targeting the young-to-middle-age segments, given their greater spread toward large loan amounts.
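A sketch of one way to draw that comparison, using side-by-side boxplots on the merged loansmerchants table:

# Boxplots of loan amount and loan length by age group
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(data=loansmerchants, x='age_group', y='loan_amount', ax=axes[0])
sns.boxplot(data=loansmerchants, x='age_group', y='loan_length_months', ax=axes[1])
plt.tight_layout()
plt.show()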

To support our analysis of user demographics, I also took buyers’ FICO scores into account. Using similar technical steps on the relevant data frames, we can reach an analogous revenue-per-customer comparison across FICO bands.
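As a hedged sketch of those steps (the FICO cut points below are illustrative assumptions, not from the original analysis):

# Bucket FICO scores into illustrative bands (cut points are assumptions)
ficobins = [0, 580, 670, 740, 800, 900]
ficolabels = ['<580', '580-669', '670-739', '740-799', '800+']
loansmerchants['fico_band'] = pd.cut(loansmerchants['fico_score'],
                                     bins=ficobins, labels=ficolabels)
# Revenue and unique customers per band, then revenue per customer
ficosummary = pd.DataFrame({
    'revenue': loansmerchants.groupby('fico_band')['revenue'].sum(),
    'number_of_customers': loansmerchants.groupby('fico_band')['user_id'].nunique()
})
ficosummary['revenue_per_customer'] = round(ficosummary['revenue'] / ficosummary['number_of_customers'])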

However, since we may want to see the relationship between fico score and funnel conversion rates, I used scatter plots for better visualization.

# Group by fico_score to see the funnel performance by fico score
# (funnelloans is assumed to be the funnel data merged with the loan data, analogous to funnelmerchants above)
funnelloansperformance = funnelloans.groupby(['fico_score']).mean(numeric_only=True).reset_index()
# Add three calculated columns to funnelloansperformance
# We end up with the current funnel performance by fico_score
funnelloansperformance['application_rate'] = round(funnelloansperformance['num_applied'] / funnelloansperformance['num_loaded'], 2)
funnelloansperformance['approval_rate'] = round(funnelloansperformance['num_approved'] / funnelloansperformance['num_applied'], 2)
funnelloansperformance['confirmation_rate'] = round(funnelloansperformance['num_confirmed'] / funnelloansperformance['num_approved'], 2)
# 2D visualization (seaborn and matplotlib are already imported above)
sns.scatterplot(data=funnelloansperformance, x="fico_score", y="approval_rate")
plt.show()
sns.scatterplot(data=funnelloansperformance, x="fico_score", y="confirmation_rate")
plt.show()

From a business perspective, we could pursue additional research like “Where should we find our users?” and “How can we increase our user retention rate?” by requesting more data on checkout locations/platforms, or time-series data to capture user lifetime value. Answering these questions would open doors to optimizing promotions and partnerships, pushing user spending, and increasing retention in high-growth segments.

Final thoughts

Whether we are data scientists or business/financial analysts, we are spending an increasing amount of time working with data, and that data is becoming messier, more complicated, and more inconsistent, full of missing and noisy values. Even though many excellent tools can help diagnose, clean, and analyze datasets, Python deserves consideration for the breadth and cohesion of its ecosystem. For those reasons, it is well worth learning some basic Python libraries, such as Pandas, which provides a fast and efficient way to manage and explore data.
