
A Little-Known Trick in Hierarchical Clustering: Weights

Recalibrate your dataset to answer a business question

By Huey Fern Tay

With Greg Page

_Above: Photo by Mark König on Unsplash_

When data is partitioned into meaningful groups, marketing analysts can see patterns in the way consumers behave and set the stage for the company to create targeted promotional campaigns. But clustering applications extend beyond marketing scenarios. For instance, scientists who study gene sequencing can use clustering to analyse evolutionary relationships among sequences in a group; property analysts can use this technique to understand how certain types of real estate should be priced relative to similar ones in the area.

In this article, I will demonstrate how a popular segmentation method, hierarchical clustering, can be used to create customer personas. I will also illustrate how weights can be used to recalibrate the dataset to reflect your current business priorities.

My dataset? The fictional, lovable Maine amusement park: Lobster Land.

Picture this scenario: With vaccination rates rising in many parts of the United States, Lobster Land wants to win customers back after putting the business on hold due to the COVID-19 pandemic. How can the park’s management target their customers?

Step 1: Lose the categorical variables

The first step is to drop the categorical variables ‘householdID’ and ‘homestate’. ‘householdID’ is just a unique identifier, arbitrarily assigned to each household in the dataset. ‘homestate’ is categorical, so it is not suitable for use in this model, which will be based on Euclidean distance.

The drop() function from pandas will help us to accomplish this:
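As a minimal sketch of this step, the snippet below builds a tiny stand-in for the Lobster Land data and drops the two columns. Only ‘householdID’, ‘homestate’ and ‘visits2019’ are named in this article; the `est_income` column and all of the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the Lobster Land data. 'est_income' is a hypothetical
# column name and every value here is invented for illustration.
households = pd.DataFrame({
    "householdID": [101, 102, 103],
    "homestate": ["ME", "NH", "MA"],
    "visits2019": [12, 5, 18],
    "est_income": [54000, 120000, 87000],
})

# Drop the identifier and the categorical column, keeping only the
# numeric inputs that a Euclidean-distance model can work with.
numeric_df = households.drop(columns=["householdID", "homestate"])
print(numeric_df.columns.tolist())
```

Passing `columns=` avoids having to remember the `axis` argument, and `drop()` returns a new data frame, so the original stays intact.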

Step 2: Normalise the data

Our dataset contains variables that are measured in different units, and on completely different scales. For instance, we have annual household income that ranges from $16,000 to $213,000, while the average email open rate per household ranges from 11% to 34%.

If we do not place this data on a ‘level playing field’, our clustering model would be dominated by household income simply because those numbers are larger. We need to transform the data in a way that makes the variables directly comparable with one another. Converting the values to z-scores, so that each variable has a mean of 0 and a standard deviation of 1, will do the trick.
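One way to compute z-scores is with plain pandas arithmetic, sketched below on made-up values. The column names `est_income` and `email_open_rate` are assumptions for illustration, and the choice of the population standard deviation is mine rather than something stated in this article.

```python
import pandas as pd

# Invented values on very different scales; 'est_income' and
# 'email_open_rate' are hypothetical column names.
numeric_df = pd.DataFrame({
    "visits2019": [12, 5, 18, 9],
    "est_income": [54000, 120000, 87000, 16000],
    "email_open_rate": [0.11, 0.34, 0.22, 0.18],
})

# z-score: subtract each column's mean, then divide by its standard
# deviation. ddof=0 (population std) matches scikit-learn's StandardScaler.
z_df = (numeric_df - numeric_df.mean()) / numeric_df.std(ddof=0)
print(z_df.round(2))
```

After this transformation, a one-unit difference means "one standard deviation" for every variable, whether it started out in dollars or in percentages.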

Step 3: Plot the dendrogram, decide the number of clusters, and create the clusters

The dendrogram, which places the records on one axis and distances on the other, enables an analyst to see the way a nested clustering model forms. Dendrograms can be helpful for someone determining how many clusters to make. Keep in mind, however, that the chart is just a decision-making tool – there is no ‘right’ answer to the "how many clusters?" question, as this is an unsupervised learning process.
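A dendrogram of this kind can be drawn with SciPy, as in the rough sketch below. It uses random numbers in place of the standardised Lobster Land data, and it assumes Ward linkage, which the model names used later in this article suggest but which is still my assumption here.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Random stand-in for the standardised data: 30 records, 4 variables.
rng = np.random.default_rng(0)
z_scores = rng.standard_normal((30, 4))

# Ward linkage merges, at each step, the pair of clusters whose union
# gives the smallest increase in within-cluster variance.
merges = linkage(z_scores, method="ward")

# Records go on the x-axis, merge distances on the y-axis.
fig, ax = plt.subplots(figsize=(8, 4))
tree = dendrogram(merges, ax=ax, no_labels=True)
ax.set_ylabel("Merge distance")
# In a notebook, the figure renders automatically; otherwise call plt.show().
```

Tall vertical gaps between merges are the natural places to consider cutting the tree.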

After some iteration and experimentation, I decided to cut the dendrogram at the y=17 mark to generate a model with five distinct clusters.

The code cell below contains instructions for setting up an agglomerative (bottom-up) clustering model with a distance threshold of 17.
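A setup along these lines can be sketched with scikit-learn as follows. The input is random stand-in data, so the number of clusters it yields will not match the five found in the real dataset; the name `ward_cluster1` follows the model name used later in this article, and Ward linkage is an assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Random stand-in for the standardised data: 200 households, 5 variables.
rng = np.random.default_rng(0)
z_scores = rng.standard_normal((200, 5))

ward_cluster1 = AgglomerativeClustering(
    n_clusters=None,        # must be None when distance_threshold is set
    distance_threshold=17,  # the cut height chosen from the dendrogram
    linkage="ward",
)

# fit_predict returns one integer cluster label per row.
labels = ward_cluster1.fit_predict(z_scores)
print(labels[:10])
print("clusters found:", ward_cluster1.n_clusters_)
```

Setting a distance threshold instead of a cluster count mirrors the act of drawing a horizontal line across the dendrogram.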

Notice the result above is a series of numbers ranging from 0 to 4 with square brackets on each end. Those numbers indicate the cluster assignment for each of our observations. We will need to convert these cluster assignments into a data frame, so that we can attach the numbers to the last column of our original dataset. This will make it easier for us to compare the clusters later.
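Here is a small sketch of that conversion, using a hand-written array in place of the real cluster assignments. The column name `cluster` and the `est_income` values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy households and a stand-in for the array of cluster assignments.
households = pd.DataFrame({
    "visits2019": [12, 5, 18, 9],
    "est_income": [54000, 120000, 87000, 16000],  # hypothetical column
})
labels = np.array([0, 1, 0, 2])

# Build a one-column data frame that shares the original index, then
# attach it as the last column of the dataset.
cluster_col = pd.DataFrame({"cluster": labels}, index=households.index)
households_labelled = pd.concat([households, cluster_col], axis=1)
print(households_labelled)
```

Reusing the original index matters: if rows were filtered earlier, a fresh 0..n index on the labels would misalign the join.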

Step 4: Apply weights to each variable

Up to this point, every variable in the clustering model is of equal importance – however, that does not reflect our business priorities. To make an educated guess about which groups of customers can be more easily convinced to return, we need to place more importance on certain factors such as the number of times they visited the park during the 2019 season.

With this objective in mind, this is how I re-weighted the data:

This re-weighting is a straightforward process. It is performed by simply multiplying each variable’s z-score values by a constant. A modeler using this process can easily adjust these weights in order to experiment with different outcomes.
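The multiplication can be sketched in a couple of lines. In the example below only the multiplier of 50 for ‘visits2019’ comes from this article; the other column names and weights are placeholders chosen for illustration.

```python
import pandas as pd

# Toy z-scores; 'est_income' and 'email_open_rate' are hypothetical columns.
z_df = pd.DataFrame({
    "visits2019": [0.5, -1.0, 1.5],
    "est_income": [-0.5, 1.0, 0.25],
    "email_open_rate": [1.0, 0.0, -1.0],
})

# One constant per variable. Only the 50 for 'visits2019' is taken from
# the article; the other two weights are made up.
weights = pd.Series({"visits2019": 50, "est_income": 5, "email_open_rate": 10})

# Multiplying a DataFrame by a Series aligns on column names, so each
# column is scaled by its own weight.
weighted_df = z_df * weights
print(weighted_df)
```

Because the distance is Euclidean, a weight of 50 stretches that axis fifty-fold, so differences in visits count far more when records are compared.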

Step 5: Re-cluster the data

Applying different weights to variables changes the way some households are grouped. Placing heavy importance on ‘visits2019’ has concentrated most of our frequent visitor households into just two particular clusters (Cluster 3 and Cluster 4). But wait… there’s more. This is just a glimpse of the changes that have happened.
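To see how memberships shift, one option is to cluster the unweighted and weighted data separately and cross-tabulate the two sets of labels. The sketch below does this on random stand-in data. It fixes the number of clusters at five for both models, which is a simplification of my own; the weights other than 50 and the non-visit column names are likewise assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Random stand-in for the standardised data.
rng = np.random.default_rng(1)
z_df = pd.DataFrame(
    rng.standard_normal((150, 3)),
    columns=["visits2019", "est_income", "email_open_rate"],
)
# Only the weight of 50 comes from the article; the rest are placeholders.
weights = pd.Series({"visits2019": 50, "est_income": 5, "email_open_rate": 10})

def ward_labels(frame, k=5):
    """Fit a Ward agglomerative model and return one label per row."""
    return AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(frame)

# Rows: unweighted cluster, columns: weighted cluster, cells: household counts.
comparison = pd.crosstab(
    pd.Series(ward_labels(z_df), name="unweighted"),
    pd.Series(ward_labels(z_df * weights), name="weighted"),
)
print(comparison)
```

If the weights made no difference, each row of the table would have a single non-zero cell; households spread across several columns are the ones that changed groups.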

Step 6: Examine the differences caused by the weighted data

When a multiplier of 50x is applied to the number of visits, it causes the most frequent visitors (and the least frequent visitors) to be more clearly separated from the rest of the pack. The large multiplier magnifies the differences in visits.

The tables below demonstrate this difference.

In the ward_cluster1 model (unweighted), the range in average visits across the clusters is just slightly greater than 7. We can see two frequent-visitor clusters with means above 16, while the other clusters hover around an average of 10. The standard deviation of these mean values is 3.33.

With the weighted clustering model, by contrast, the range of cluster averages becomes significantly bigger – for the ward_cluster2 model, the range of intra-cluster means jumps to more than 12. The standard deviation of mean visits per cluster jumps by nearly 44 percent, as the weighted model brings us higher highs, and lower lows, for our variable of greatest interest.
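The two summary figures used here, the range and the standard deviation of the cluster means, can be computed as in the sketch below. The data are invented so the numbers will not match those quoted above, the helper name `cluster_mean_spread` is my own, and the use of the sample standard deviation (pandas' default) is an assumption.

```python
import pandas as pd

# Toy data: six households with labels from two hypothetical models.
df = pd.DataFrame({
    "visits2019": [4, 6, 10, 12, 16, 18],
    "ward_cluster1": [0, 1, 0, 1, 2, 2],  # unweighted: visits are mixed
    "ward_cluster2": [0, 0, 1, 1, 2, 2],  # weighted: visits are separated
})

def cluster_mean_spread(frame, label_col, value_col="visits2019"):
    """Return (range, standard deviation) of the per-cluster means."""
    means = frame.groupby(label_col)[value_col].mean()
    return means.max() - means.min(), means.std()

unweighted_spread = cluster_mean_spread(df, "ward_cluster1")
weighted_spread = cluster_mean_spread(df, "ward_cluster2")
print("unweighted (range, std):", unweighted_spread)
print("weighted   (range, std):", weighted_spread)
```

A larger range and standard deviation of the cluster means indicate that the clusters differ more from one another on that variable.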

Step 7: Visualise the differences between the unweighted and weighted data

Before we can colour code the charts based on the clusters, we need to convert the cluster labels to categorical data.
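A minimal sketch of that conversion with pandas is shown below, on toy values. The column name `ward_cluster2` follows the model name used in this article; whether the original data frame used the same column name is an assumption.

```python
import pandas as pd

# Toy data with integer cluster labels.
df = pd.DataFrame({
    "visits2019": [18, 5, 16, 6],
    "ward_cluster2": [4, 3, 4, 3],
})

# Integer labels would be treated as a continuous scale by most plotting
# libraries; a categorical dtype gives each cluster its own distinct colour.
df["ward_cluster2"] = df["ward_cluster2"].astype("category")
print(df.dtypes)
```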

Notice how the weighted data in the second density plot does not show much overlap between groups. In this second plot, the ‘super fans’ in the purple group (Cluster 4) are clearly separated from most of the dataset. The same applies to the pink group (Cluster 3) which represents the lower end of the ‘fan index’.

_Above: A distribution plot of ‘visits2019’ in the unweighted dataset. Notice some clusters are clumped together._

_Above: A distribution plot of ‘visits2019’ in the weighted dataset. Notice there is less overlap between clusters and that the super fans in the purple cluster are much more distinct. The group in the red cluster stands out as well._

Step 8: Examine the weighted clusters to create customer personas

Up until this point, our demonstration has focused on ‘visits2019’, which carries the heaviest weight (the 50x multiplier). While those results are telling on their own, we need to consider the effect the weights have on the remaining variables as well to get a clearer picture of Lobster Land’s clientele.

When we look at the average of all variables, we see visitor profiles that will help our co-workers in the marketing department create targeted campaigns. Here is the low-down on the groups that this model has identified among Lobster Land’s visitors:

· Die-hard fans: These families are the darlings of Lobster Land. Even though the average household lives 85.73 miles (around 137.8 km) away from Lobster Land, they do not mind the ‘hike’ down to the park. In 2019, these super families visited the park 18.03 times on average. They were the biggest spenders on online merchandise and were the most likely to open our marketing emails.

· Cashed-up fans: The richest in the group. They too are strong supporters of Lobster Land, with an impressive average of nearly 13 visits throughout the 2019 season.

· Shopaholic loyalists: These customers should be commended for their dedication to Lobster Land because even though they live the furthest from the park, they averaged more than 15 visits over the season. This group of households has the second highest annual income on average. They are one of two clusters known for high online merchandise spending, and they have the second best email open rate among the groups.

· Moderate fans: This group has the second-lowest visit average among the five clusters.

· The distant admirers: This group of families is probably the hardest to convince to come back. Even though they are likely to open marketing emails, they visited the park only 5.69 times in 2019. These households are relatively restrained in their spending as well.
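Profiles like the ones above come from averaging every variable within each cluster. A rough sketch on toy data follows; the `cluster` and `est_income` column names and all of the values are invented for illustration.

```python
import pandas as pd

# Toy labelled data: two clusters of two households each.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "visits2019": [18, 16, 5, 7],
    "est_income": [90000, 110000, 40000, 60000],  # hypothetical column
})

# One row per cluster, one column per variable: the cluster 'persona'.
profile = df.groupby("cluster").mean().round(2)

# Cluster sizes help judge how much weight each persona deserves.
sizes = df["cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

The averages should be computed on the original, unscaled values so that the personas read in real units such as visits and dollars.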

What does this mean for Lobster Land?

Lobster Land management should pat themselves on the back for cultivating a strong bond with families. Prior to the COVID-19 pandemic, the average household visited the park every month even though the park is at least 84.47 miles (around 135.94 km) away – not exactly around the corner from where most people live.

Now that vaccination rates have risen, the good news is that it should not be very difficult to persuade people to revisit Lobster Land as people clearly have a deep sense of affection for the park.

The ‘die-hard fans’ will be the easiest to nudge and should be the first group of households the marketing department targets. These families are the most likely to open marketing emails, have strong purchasing power, and clearly enjoy the Lobster Land experience. They are therefore the easiest to push down the marketing funnel for any purpose: day trips to the park, the latest Larry the Lobster soft toys, staycations, birthday parties – you name it.

The ‘cashed-up fans’ are also ardent fans who will not take much convincing to come back. They could be enticed to spend big at Lobster Land through activities like staycations because their estimated annual household income is the highest on average.

As previously pointed out, the ‘distant admirers’ will be the hardest to convince.

Conclusion:

A theme park in a place like Maine only has three months to make money every year before it becomes too cold. Given the short window of opportunity, it is imperative for these businesses to have hyper-focused marketing to ensure they maximise revenue for the year. At the same time, theme parks need to ensure they remain relevant to their audience by creating new rides and experiences.

The rebalancing example illustrated above is just the first step in getting people to return to the park after the COVID-19 disruption. To keep their finger on the pulse of the consumer, Lobster Land management needs to assess market conditions and consumer sentiment on a regular basis. They could conduct focus group discussions with their ‘super fans’ to test ideas for new attractions and analyse their overall marketing mix to determine whether they are getting the most bang for their buck. In addition, A/B tests could be applied to email newsletters to see what type of content works better as a ‘hook’.

Either way, tracking, testing, and iterating are must-haves if Lobster Land is to thrive for future generations.

Data created by:

Prof. Gregory Page, Senior Lecturer, Boston University

Data source:

Page, Gregory. (2021). Lobster Land Previous Passholders [comeback.csv]. Retrieved from https://drive.google.com/file/d/13bEScSEG95XluoSfbJ-H191oZZbgf49h/view?usp=sharing

License: CC0

