K-Prototype as a Clustering Algorithm for Mixed Data Types (Categorical and Numerical)

An explanation of the theory and its application to real problems

Photo by Nareeta Martin on Unsplash

Hands-on Tutorial

The basic theory of K-Prototype

One of the conventional clustering methods, commonly used and efficient for large data, is the K-Means algorithm. However, it is not suitable for data that contains categorical variables. The problem is that the cost function in K-Means is calculated using the Euclidean distance, which is only meaningful for numerical data. K-Mode, meanwhile, is suitable for categorical data only, not for mixed data types.

Facing these problems, Huang proposed an algorithm called K-Prototype, created to handle clustering with mixed data types (numerical and categorical variables). K-Prototype is a partition-based clustering method. The algorithm is an improvement of the K-Means and K-Mode clustering algorithms, built to handle clustering with mixed data types.

Read the full K-Prototype clustering algorithm in Huang's original paper [1].

It's important to understand the scale of measurement of each variable in the data.

Note: K-Prototype has an advantage because it's not too complex, it is able to handle large data, and it performs better than hierarchy-based algorithms.

The mathematics formula for the K-Prototype clustering algorithm (Image by Author)
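For reference, the dissimilarity measure defined in Huang's paper [1] (which should correspond to the formula in the image above) combines the squared Euclidean distance over the numerical attributes with a weighted simple-matching distance over the categorical ones:

d(X_i, Q_l) = \sum_{j=1}^{p} (x_{ij} - q_{lj})^2 + \gamma_l \sum_{j=p+1}^{m} \delta(x_{ij}, q_{lj}), \quad \text{where } \delta(a, b) = 0 \text{ if } a = b \text{ and } 1 \text{ otherwise}

Here the first p attributes are numerical, attributes p+1 through m are categorical, and \gamma_l is a weight that balances the influence of the categorical part against the numerical part. The K-Prototype cost function is the sum of these dissimilarities over all objects and their assigned cluster prototypes.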

The application of K-Prototype

In this part, we will demonstrate the implementation of K-Prototype using Python. Before that, it's important to install the kmodes module using the terminal or Anaconda prompt.
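The kmodes package is distributed through PyPI, so a minimal installation (assuming a standard Python environment) is a one-liner:

pip install kmodes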

There are a few modules used in the demonstration: pandas for data manipulation, numpy for linear algebra calculations, plotnine for data visualization, and kmodes for the K-Prototype clustering algorithm.

# Import module for data manipulation
import pandas as pd
# Import module for linear algebra
import numpy as np
# Import module for data visualization
from plotnine import *
import plotnine
# Import module for k-prototype cluster
from kmodes.kprototypes import KPrototypes
# Ignore warnings
import warnings
warnings.filterwarnings('ignore', category = FutureWarning)
# Format float display in Pandas (suppress scientific notation)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

The data can be downloaded here or you can easily generate this data by visiting this website. It’s totally free of charge. Enjoy!

# Load the data
df = pd.read_csv('data/10000 Sales Records.csv')
# The dimension of data
print('Dimension data: {} rows and {} columns'.format(len(df), len(df.columns)))
# Print the first 5 rows
df.head()

The data is the Country Sales Data. It has 10,000 rows and 14 columns with mixed data types (numerical and categorical), and it records sales transactions by country around the world.

The country sales data generated by MS Excel for K-Prototype (Image by Author)

To make sure the data type of each column is mapped properly, we inspect the types using the df.info() command. If any column is mapped incorrectly, we should convert it to the right data type.

# Inspect the data type
df.info()
The data type for columns of country sales data (Image by Author)

Luckily, all the columns have the right data type. Please ignore Order ID; we will not use it, and it will be removed later.
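If a column had been mapped incorrectly, the fix is usually a one-line conversion. As a hypothetical illustration (nothing actually needs fixing here), parsing Order Date as a date would look like this:

# Hypothetical fix: parse a date column that was read as a plain string
df['Order Date'] = pd.to_datetime(df['Order Date'])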

There are seven categorical variables in the dataset. Country (185 unique values), Order Date (2,691 unique values), and Ship Date (2,719 unique values) will be removed from the cluster analysis because they have too many unique values. The rest of the columns will be kept: Region (7 unique values), Item Type (12 unique values), Sales Channel (2 unique values), and Order Priority (4 unique values).

# Inspect the categorical variables
df.select_dtypes('object').nunique()
Categorical variables in the country sales data (Image by Author)

The rest of the columns are numerical variables: Order ID, Units Sold, Unit Price, Unit Cost, Total Revenue, Total Cost, and Total Profit.

# Inspect the numerical variables
df.describe()
The summary statistics for the numerical columns of the country sales data (Image by Author)

The last task before going to data exploration and analysis is to make sure the data doesn't contain any missing values.

# Check missing value
df.isna().sum()
The number of missing values in the country sales data (Image by Author)

Before going to the cluster analysis, we should explore the data for descriptive analysis. The aim is to find interesting points that can be useful for reporting and for capturing phenomena in the data.

We have the hypothesis that the number of purchases in each region has a strong linear correlation with the total profit. To test this, we have two options: descriptive analysis and inferential analysis. For this section, we will choose the first option. Let's see!

# The distribution of sales in each region
df_region = df.groupby('Region').agg({
    'Region': 'count',
    'Units Sold': 'mean',
    'Total Revenue': 'mean',
    'Total Cost': 'mean',
    'Total Profit': 'mean'
    }
).rename(columns = {'Region': 'Total'}).reset_index().sort_values('Total', ascending = True).reset_index(drop = True)
# Add the share of total sales per region
df_region['Percentage'] = df_region['Total'] / df_region['Total'].sum()
The sales distribution in each region of the country sales data (Image by Author)

From the above result, we can conclude that North America is the region with the lowest number of sales, yet it outperforms all other regions in columns such as Units Sold, Total Revenue, Total Cost, and Total Profit. Unlike other regions, North America makes many purchases at the same time. Meanwhile, Europe, which has the highest number of purchases, doesn't contribute significantly to the total profit. This means the number of purchases does not have a strong linear correlation with the total profit.
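As a quick numeric companion to this reading (a small addition beyond the original walkthrough), we can correlate the per-region purchase counts with the mean total profit directly:

# Correlation between the number of purchases per region ('Total')
# and the mean total profit per region
df_region[['Total', 'Total Profit']].corr()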

# Data viz
plotnine.options.figure_size = (8, 4.8)
(
    ggplot(data = df_region)+
    geom_bar(aes(x = 'Region',
                 y = 'Total'),
             fill = np.where(df_region['Region'] == 'Europe', '#981220', '#80797c'),
             stat = 'identity')+
    geom_text(aes(x = 'Region',
                   y = 'Total',
                   label = 'Total'),
               size = 10,
               nudge_y = 120)+
    labs(title = 'Region that has the highest purchases')+
    xlab('Region')+
    ylab('Frequency')+
    scale_x_discrete(limits = df_region['Region'].tolist())+
    theme_minimal()+
    coord_flip()
)
The number of sales in each region of the country sales data (Image by Author)

For further data exploration, we will create a cross-tabulation between Region and Item Type to look for patterns.

# Order the index of cross tabulation
order_region = df_region['Region'].to_list()
order_region.append('All')
# distribution of item type
df_item = pd.crosstab(df['Region'], df['Item Type'], margins = True).reindex(order_region, axis = 0).reset_index()
# Remove index name
df_item.columns.name = None
The distribution of item type purchased by each region (Image by Author)
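Because the regions differ in size, a row-normalized variant of the cross-tabulation (an optional addition, not part of the original walkthrough) can make the comparison fairer:

# Share of each item type within a region (each row sums to 1)
pd.crosstab(df['Region'], df['Item Type'], normalize = 'index').round(3)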

Data pre-processing aims to remove the unused columns: Country, Order Date, Order ID, and Ship Date. They are irrelevant to the K-Prototype clustering algorithm. The reasons for removing these columns are as follows:

  • Country – it has a lot of unique values that add to the computational load; the information is too granular to process and becomes meaningless
  • Order Date and Ship Date – the clustering algorithm assumes that the rows in the data represent unique observations observed in a certain time period
  • Order ID – it carries no meaningful information for the cluster analysis
# Data pre-processing
df.drop(['Country', 'Order Date', 'Order ID', 'Ship Date'], axis = 1, inplace = True)
# Show the data after pre-processing
print('Dimension data: {} rows and {} columns'.format(len(df), len(df.columns)))
df.head()
The country sales data without certain columns (Image by Author)

The K-Prototype clustering algorithm in the kmodes module needs the positions of the categorical columns in the data. This task saves those positions in the variable catColumnsPos, which will be passed to the cluster analysis in the next step. The categorical columns are the first four columns in the data.

# Get the position of categorical columns
catColumnsPos = [df.columns.get_loc(col) for col in list(df.select_dtypes('object').columns)]
print('Categorical columns           : {}'.format(list(df.select_dtypes('object').columns)))
print('Categorical columns position  : {}'.format(catColumnsPos))
The position of categorical variables or columns in the country sales data (Image by Author)

Important! In a real case, the numerical columns will have different scales, so you must normalize them using appropriate techniques such as min-max normalization, Z-score normalization, etc.
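For instance, a minimal Z-score normalization sketch might look like the following (an illustration only; this tutorial proceeds with the generated data as-is):

# Z-score normalization sketch for the numerical columns (not applied here)
numCols = df.select_dtypes(exclude = 'object').columns
df_scaled = df.copy()
df_scaled[numCols] = (df[numCols] - df[numCols].mean()) / df[numCols].std()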

Next, convert the data frame to a matrix, which is what the kmodes module expects when running the K-Prototype clustering algorithm. Save the data matrix to the variable dfMatrix.

# Convert dataframe to matrix
dfMatrix = df.to_numpy()
The matrix of country sales data for the K-Prototype clustering algorithm (Image by Author)

We use the Elbow method to determine the optimal number of clusters for K-Prototype. Instead of calculating the within-cluster sum of squared errors (WSSE) with the Euclidean distance, K-Prototype provides a cost function that combines the calculations for numerical and categorical variables. We look at the elbow of the cost curve to determine the optimal number of clusters.

# Choose optimal K using the Elbow method
cost = []
for cluster in range(1, 10):
    kprototype = KPrototypes(n_jobs = -1, n_clusters = cluster, init = 'Huang', random_state = 0)
    kprototype.fit_predict(dfMatrix, categorical = catColumnsPos)
    cost.append(kprototype.cost_)
    print('Cluster initiation: {}'.format(cluster))
# Convert the results into a dataframe for plotting
df_cost = pd.DataFrame({'Cluster': range(1, len(cost) + 1), 'Cost': cost})
# Data viz
plotnine.options.figure_size = (8, 4.8)
(
    ggplot(data = df_cost)+
    geom_line(aes(x = 'Cluster',
                  y = 'Cost'))+
    geom_point(aes(x = 'Cluster',
                   y = 'Cost'))+
    geom_label(aes(x = 'Cluster',
                   y = 'Cost',
                   label = 'Cluster'),
               size = 10,
               nudge_y = 1000) +
    labs(title = 'Optimal number of cluster with Elbow Method')+
    xlab('Number of Clusters k')+
    ylab('Cost')+
    theme_minimal()
)
The scree plot of the cost function using the Elbow method (Image by Author)

According to the scree plot of the cost function above, we choose the number of clusters k = 3. It will be the optimal number of clusters for the K-Prototype cluster analysis.
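If you prefer a numeric hint over eyeballing the plot (a small optional check, not from the original article), the successive cost drops show where the curve flattens:

# Successive cost differences; the drop shrinks sharply after the elbow
df_cost['Cost Drop'] = df_cost['Cost'].diff()
df_cost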

# Fit the cluster
kprototype = KPrototypes(n_jobs = -1, n_clusters = 3, init = 'Huang', random_state = 0)
kprototype.fit_predict(dfMatrix, categorical = catColumnsPos)

The algorithm takes 7 iterations to converge, and it has a cost of 4,960,713,581,025,175.0 (quite big, right?!). We can print the cluster centroids using kprototype.cluster_centroids_. For the numerical variables, the centroid uses the average, while the categorical variables use the mode.

# Cluster centroid
kprototype.cluster_centroids_
# Check the iteration of the clusters created
kprototype.n_iter_
# Check the cost of the clusters created
kprototype.cost_
Cluster centroids by K-Prototype (Image by Author)

Next, the clusters need to be interpreted, using the centroids of each cluster. To do so, we append the cluster labels to the raw data. Ordering the cluster labels is helpful for arranging the interpretation by cluster label.

# Add the cluster to the dataframe
df['Cluster Labels'] = kprototype.labels_
df['Segment'] = df['Cluster Labels'].map({0:'First', 1:'Second', 2:'Third'})
# Order the cluster
df['Segment'] = df['Segment'].astype('category')
df['Segment'] = df['Segment'].cat.reorder_categories(['First','Second','Third'])
The country sales data with cluster information (Image by Author)

To interpret the clusters, the numerical variables use the average while the categorical variables use the mode. Other methods can also be implemented, such as using the median or percentiles for numerical variables, or the value composition for categorical variables.

# Cluster interpretation
df.rename(columns = {'Cluster Labels':'Total'}, inplace = True)
df.groupby('Segment').agg(
    {
        'Total':'count',
        'Region': lambda x: x.value_counts().index[0],
        'Item Type': lambda x: x.value_counts().index[0],
        'Sales Channel': lambda x: x.value_counts().index[0],
        'Order Priority': lambda x: x.value_counts().index[0],
        'Units Sold': 'mean',
        'Unit Price': 'mean',
        'Total Revenue': 'mean',
        'Total Cost': 'mean',
        'Total Profit': 'mean'
    }
).reset_index()
The centroid of clusters in the country sales data (Image by Author)

Conclusion

K-Prototype is a clustering algorithm that combines K-Means and K-Mode, developed by Huang [1]. To implement it, the researcher needs to filter the columns carefully, especially the categorical variables: they must be relevant to the analysis and must not carry meaningless information. Besides that, the quality of the input data affects the clustering result (cluster initialization) and how the algorithm processes the data to reach a converged result. Finally, for the interpretation, we need to choose suitable metrics for both the numerical and categorical variables.

References

[1] Z. Huang. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values (1998). Data Mining and Knowledge Discovery. 2(3): 283–304.


Related Articles