
Hands-on Tutorial
The basic theory of K-Prototype
One of the conventional clustering methods commonly used for large data is the K-Means algorithm. However, it is not suitable for data that contains categorical variables, because its cost function is computed with the Euclidean distance, which is only meaningful for numerical data. K-Modes, on the other hand, is only suitable for categorical data, not for mixed data types.
To address these problems, Huang proposed an algorithm called K-Prototype, designed to handle clustering with mixed data types (numerical and categorical variables). K-Prototype is a partitioning-based clustering method, and its algorithm is an improvement of the K-Means and K-Modes clustering algorithms for mixed data types.
Read the full description of the K-Prototype clustering algorithm HERE.
It’s important to understand the measurement scales of the variables in the data.
Note: K-Prototype has an advantage because it is not too complex, can handle large data, and scales better than hierarchical-based algorithms.
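To make the idea concrete, below is a minimal sketch (not the kmodes implementation) of the mixed dissimilarity that K-Prototype minimizes: squared Euclidean distance on the numerical part plus a weight gamma times the number of categorical mismatches. The feature values in the example are hypothetical.
# A minimal sketch of the mixed dissimilarity used by K-Prototype:
# squared Euclidean distance on numerical features plus
# gamma times the number of mismatches on categorical features
import numpy as np
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma = 1.0):
    num_part = np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2)
    cat_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return num_part + gamma * cat_part
# Hypothetical example: one observation compared with one prototype
print(mixed_dissimilarity([1200, 9.33], ['Asia', 'Online'], [1000, 9.33], ['Asia', 'Offline'], gamma = 0.5))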


The application of K-Prototype
In this part, we will demonstrate the implementation of K-Prototype using Python. Before that, it’s important to install the kmodes module first using the terminal or Anaconda prompt.
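For example, assuming pip is available, the module can be installed with:
pip install kmodes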
There are a few modules used for the demonstration: pandas for data manipulation, numpy for linear algebra calculations, plotnine for data visualization, and kmodes for the K-Prototype clustering algorithm.
# Import module for data manipulation
import pandas as pd
# Import module for linear algebra
import numpy as np
# Import module for data visualization
from plotnine import *
import plotnine
# Import module for k-prototype cluster
from kmodes.kprototypes import KPrototypes
# Ignore warnings
import warnings
warnings.filterwarnings('ignore', category = FutureWarning)
# Format scientific notation from Pandas
pd.set_option('display.float_format', lambda x: '%.3f' % x)
The data can be downloaded here or you can easily generate this data by visiting this website. It’s totally free of charge. Enjoy!
# Load the data
df = pd.read_csv('data/10000 Sales Records.csv')
# The dimension of data
print('Dimension data: {} rows and {} columns'.format(len(df), len(df.columns)))
# Print the first 5 rows
df.head()
The data is the Country Sales Data. It has 10,000 rows and 14 columns with mixed data types (numerical and categorical) and records sales transactions by country around the world.

To make sure the data type of each column is mapped properly, we can inspect the types using the df.info() command. If we find any column mapped to the wrong type, we should correct it to the right data type.
# Inspect the data type
df.info()
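If a column were mis-typed, a hypothetical correction could look like the following (not needed for this dataset):
# Hypothetical example (not needed here): cast a column stored as text to a numeric type
df['Units Sold'] = pd.to_numeric(df['Units Sold'])
# A date column stored as text could similarly be parsed with pd.to_datetime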

Luckily, all the columns have the right data type. Please ignore Order ID; we will not use it and it will be removed later.
There are seven categorical variables in the dataset. Country (185 unique values), Order Date (2,691 unique values), and Ship Date (2,719 unique values) will be removed from the cluster analysis because they have too many unique values. The remaining categorical columns will be kept: Region (7 unique values), Item Type (12 unique values), Sales Channel (2 unique values), and Order Priority (4 unique values).
# Inspect the categorical variables
df.select_dtypes('object').nunique()

The rest of the columns are numerical variables: Order ID, Units Sold, Unit Price, Units Cost, Total Revenue, Total Cost, and Total Profit.
# Inspect the numerical variables
df.describe()

The last task before going to data exploration and analysis is to make sure the data doesn’t contain missing values.
# Check missing value
df.isna().sum()

Before going to cluster analysis, we should explore the data with a descriptive analysis. It aims to surface interesting points that are useful for reporting and for understanding patterns in the data.
We have the hypothesis that the number of purchases in each region has a strong linear correlation with the total profit. To test this, we have two options: descriptive analysis and inferential analysis. For this section, we will choose the first option. Let’s see!
# The distribution of sales each region
df_region = df['Region'].value_counts().rename_axis('Region').reset_index(name = 'Total')
df_region['Percentage'] = df_region['Total'] / df_region['Total'].sum()
df_region = df_region.sort_values('Total', ascending = True).reset_index(drop = True)
# The dataframe
df_region = df.groupby('Region').agg({
    'Region': 'count',
    'Units Sold': 'mean',
    'Total Revenue': 'mean',
    'Total Cost': 'mean',
    'Total Profit': 'mean'
}).rename(columns = {'Region': 'Total'}).reset_index().sort_values('Total', ascending = True)

From the above result, we can conclude that North America is the region with the lowest number of sales transactions, but it outperforms all other regions on columns such as Units Sold, Total Revenue, Total Cost, and Total Profit. Unlike other regions, North America buys many units per transaction. Meanwhile, Europe, which has the highest number of purchases, does not contribute to the total profit significantly. It means that the number of purchases does not have a strong linear correlation with the total profit.
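As a quick, optional sanity check (a hedged sketch, not part of the original analysis), we could also compute the Pearson correlation between the number of purchases and the mean total profit per region:
# Optional check: linear correlation between purchase counts and mean total profit per region
print(df_region[['Total', 'Total Profit']].corr(method = 'pearson'))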
# Data viz
plotnine.options.figure_size = (8, 4.8)
(
ggplot(data = df_region)+
geom_bar(aes(x = 'Region',
y = 'Total'),
fill = np.where(df_region['Region'] == 'Asia', '#981220', '#80797c'),
stat = 'identity')+
geom_text(aes(x = 'Region',
y = 'Total',
label = 'Total'),
size = 10,
nudge_y = 120)+
labs(title = 'Region that has the highest purchases')+
xlab('Region')+
ylab('Frequency')+
scale_x_discrete(limits = df_region['Region'].tolist())+
theme_minimal()+
coord_flip()
)

For further data exploration, we will create a cross-tabulation between Region and Item Type to look for any patterns.
# Order the index of cross tabulation
order_region = df_region['Region'].to_list()
order_region.append('All')
# distribution of item type
df_item = pd.crosstab(df['Region'], df['Item Type'], margins = True).reindex(order_region, axis = 0).reset_index()
# Remove index name
df_item.columns.name = None

Data pre-processing aims to remove the unused columns, which are Country, Order Date, Order ID, and Ship Date. They are irrelevant to the K-Prototype clustering algorithm. The reasons for removing these columns are as follows:
Country – it has a lot of unique values that add to the computational load; the information is too granular to be meaningful
Order Date and Ship Date – the clustering algorithm assumes that the rows in the data represent unique observations from a single time period
Order ID – it carries no meaningful information for the cluster analysis
# Data pre-processing
df.drop(['Country', 'Order Date', 'Order ID', 'Ship Date'], axis = 1, inplace = True)
# Show the data after pre-processing
print('Dimension data: {} rows and {} columns'.format(len(df), len(df.columns)))
df.head()

The K-Prototype clustering algorithm in the kmodes module needs the positions of the categorical columns in the data. This task saves them in the variable catColumnsPos, which will be passed to the clustering step later. The categorical columns are the first four columns in the data.
# Get the position of categorical columns
catColumnsPos = [df.columns.get_loc(col) for col in list(df.select_dtypes('object').columns)]
print('Categorical columns : {}'.format(list(df.select_dtypes('object').columns)))
print('Categorical columns position : {}'.format(catColumnsPos))

Important! In a real case, the numerical columns will have different scales, so you must normalize them using appropriate techniques, such as min-max normalization, z-score normalization, etc.
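For illustration, a minimal sketch of min-max normalization of the numerical columns might look like the following (not applied in this tutorial, since the demo data is used as-is):
# Hypothetical example (not applied here): min-max normalization of the numerical columns
numCols = df.select_dtypes(include = 'number').columns
df_scaled = df.copy()
df_scaled[numCols] = (df[numCols] - df[numCols].min()) / (df[numCols].max() - df[numCols].min())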
Next, convert the data frame to a matrix. It helps the kmodes module run the K-Prototype clustering algorithm. Save the data matrix to the variable dfMatrix.
# Convert dataframe to matrix
dfMatrix = df.to_numpy()

We are using the Elbow method to determine the optimal number of clusters for K-Prototype. Instead of calculating the within-cluster sum of squared errors (WSSE) with the Euclidean distance alone, K-Prototype provides a cost function that combines the calculations for numerical and categorical variables. We can look at the elbow of this cost curve to determine the optimal number of clusters.
# Choose optimal K using Elbow method
cost = []
for cluster in range(1, 10):
try:
kprototype = KPrototypes(n_jobs = -1, n_clusters = cluster, init = 'Huang', random_state = 0)
kprototype.fit_predict(dfMatrix, categorical = catColumnsPos)
cost.append(kprototype.cost_)
print('Cluster initiation: {}'.format(cluster))
except:
break
# Converting the results into a dataframe and plotting them
df_cost = pd.DataFrame({'Cluster': range(1, len(cost) + 1), 'Cost': cost})
# Data viz
plotnine.options.figure_size = (8, 4.8)
(
ggplot(data = df_cost)+
geom_line(aes(x = 'Cluster',
y = 'Cost'))+
geom_point(aes(x = 'Cluster',
y = 'Cost'))+
geom_label(aes(x = 'Cluster',
y = 'Cost',
label = 'Cluster'),
size = 10,
nudge_y = 1000) +
labs(title = 'Optimal number of cluster with Elbow Method')+
xlab('Number of Clusters k')+
ylab('Cost')+
theme_minimal()
)

According to the scree plot of the cost function above, we choose the number of clusters k = 3. It will be the optimal number of clusters for the K-Prototype cluster analysis. Read more about the Elbow method HERE.
# Fit the cluster
kprototype = KPrototypes(n_jobs = -1, n_clusters = 3, init = 'Huang', random_state = 0)
kprototype.fit_predict(dfMatrix, categorical = catColumnsPos)
The algorithm took 7 iterations to converge and has a cost of 4,960,713,581,025,175.0 (quite big, right?!). We can print the cluster centroids using kprototype.cluster_centroids_. For the numerical variables the centroid is the average, while for the categorical variables it is the mode.
# Cluster centroid
kprototype.cluster_centroids_
# Check the iteration of the clusters created
kprototype.n_iter_
# Check the cost of the clusters created
kprototype.cost_

Next, the clusters need to be interpreted. The interpretation uses the centroids of each cluster. To do so, we append the cluster labels to the raw data. Ordering the cluster labels is helpful for arranging the interpretation by cluster label.
# Add the cluster to the dataframe
df['Cluster Labels'] = kprototype.labels_
df['Segment'] = df['Cluster Labels'].map({0:'First', 1:'Second', 2:'Third'})
# Order the cluster
df['Segment'] = df['Segment'].astype('category')
df['Segment'] = df['Segment'].cat.reorder_categories(['First','Second','Third'])

To interpret the clusters, the numerical variables are summarized with the average while the categorical variables use the mode. Other methods can also be used, such as the median or percentiles for numerical variables, or the value composition for categorical variables.
# Cluster interpretation
df.rename(columns = {'Cluster Labels':'Total'}, inplace = True)
df.groupby('Segment').agg(
{
'Total':'count',
'Region': lambda x: x.value_counts().index[0],
'Item Type': lambda x: x.value_counts().index[0],
'Sales Channel': lambda x: x.value_counts().index[0],
'Order Priority': lambda x: x.value_counts().index[0],
'Units Sold': 'mean',
'Unit Price': 'mean',
'Total Revenue': 'mean',
'Total Cost': 'mean',
'Total Profit': 'mean'
}
).reset_index()
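As an illustration of the alternatives mentioned above, the sketch below (hypothetical, not part of the original walkthrough) summarizes a numerical column with the median and a categorical column with its value composition per segment:
# Hypothetical example: median of a numerical column per segment
print(df.groupby('Segment')['Total Profit'].median())
# Hypothetical example: value composition (share of each category) of Sales Channel per segment
print(df.groupby('Segment')['Sales Channel'].value_counts(normalize = True))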

The complete example is listed below.
Conclusion
K-Prototype is a clustering algorithm that combines K-Means and K-Modes, developed by Huang. To implement the algorithm, the researcher needs to filter the columns carefully, especially the categorical variables: they must be relevant to the analysis and must not carry meaningless information. Besides that, the quality of the input data affects the clustering result (starting from cluster initialization) and how the algorithm processes the data to reach convergence. Finally, for the interpretation step, we need to choose appropriate summary metrics for both the numerical and categorical variables.
References
[1] Z. Huang. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values (1998). Data Mining and Knowledge Discovery. 2(3): 283–304.