
Sharpen Your Machine Learning Skills with This Real-World Housing Market Cluster Analysis

A Hands-On Project That Combines PCA, Hierarchical Clustering, and K-means to Deliver an Optimal Clustering Solution

Photo by Jessica Bryant from Pexels

Whether you are an experienced data scientist or someone who has just started your analytics journey, believe me: at some point in your career you have encountered, or will encounter, at least one project that involves segmentation/cluster analysis. It is one of the most popular and important machine learning skills to learn and understand as a data scientist.

In this tutorial, I’ll walk you through a cluster analysis I recently did, driven entirely by my own curiosity as a data scientist about North Carolina’s housing market (why NC? Well, that’s where I live, and it has been a booming area in recent years with strong housing market momentum 😉).

In this hands-on project, you will analyze housing market data collected from 162 neighborhoods in North Carolina and identify clusters of neighborhoods based on their similarities and differences across a few key metrics, using unsupervised ML models. The raw data was sourced from Redfin’s Data Center and is free to download.

Are you ready? Let’s go!


A Gentle Introduction to Cluster Analysis

Before we jump into coding, let’s refresh our knowledge of cluster analysis. Cluster analysis is a very popular machine learning technique that is widely used in applications such as customer segmentation, image processing, and recommendation systems, just to name a few.

In general, cluster analysis falls under the umbrella of unsupervised machine learning models. An unsupervised ML model, as its name suggests, is not supervised using a labeled training dataset. In other words, the algorithm is not provided with any pre-assigned labels (e.g., cluster labels) for the training data. Instead, the model finds hidden patterns and insights in the data itself. It is therefore very exploratory and ‘curious’ in nature.

An Example of Clustering Analysis (Image Source: Wikipedia)

There are different types of clustering methods, such as connectivity-based, centroid-based, density-based, and distribution-based methods, each of which has its own advantages and disadvantages and is suited to different purposes and use cases.

In this project, we focus on two of the clustering techniques mentioned above: hierarchical clustering (connectivity-based) and k-means (centroid-based). Both unsupervised ML techniques are based on proximity (using distance measures). They are simple yet very effective and powerful in many clustering tasks.


Download, Read and Prepare the Data

First, let’s go to Redfin’s data center, scroll down to the ‘How it Works’ section, and download the region data at the ‘neighborhoods’ level. This is an open dataset (.gz file) that is free to download and use.

Source: Redfin Housing Market Data (https://www.redfin.com/news/data-center/)

Next, let’s open the Jupyter notebook, import all the necessary libraries, and read in the data. This is a pretty big dataset at neighborhood level, so it may take some time to read into Python.
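A minimal sketch of that step is below. The file name is an assumption based on how the Redfin download is typically named; adjust it to match whatever your download is called. The file is gzipped and tab-separated.

```python
import pandas as pd

# File name is an assumption -- rename to match your download.
# The Redfin export is a gzipped, tab-separated file.
df = pd.read_csv(
    'neighborhood_market_tracker.tsv000.gz',
    sep='\t',
    compression='gzip',
    low_memory=False,  # large file with mixed dtypes
)
print(df.shape)
```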

This is a huge dataset, so for simplicity of analysis and ease of demonstration, let’s focus only on ‘single-family residential’ properties sold in NC in the most recent three-month period (July–September 2021). We will also filter out any neighborhoods that sold fewer than 20 properties in this period.
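Here is a hedged sketch of that filter. The column names (‘property_type’, ‘state_code’, ‘period_begin’, ‘homes_sold’) follow Redfin’s data dictionary but should be verified against your download, and you may need to aggregate multiple monthly rows into one row per neighborhood.

```python
# Column names are assumptions -- verify against your file.
mask = (
    (df['property_type'] == 'Single Family Residential')
    & (df['state_code'] == 'NC')
    & (df['period_begin'] >= '2021-07-01')
    & (df['period_begin'] <= '2021-09-30')
)
df_NC = df.loc[mask].copy()

# Drop neighborhoods with fewer than 20 homes sold in the period
df_NC = df_NC[df_NC['homes_sold'] >= 20]
```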

The raw data looks like the table below with a total of 58 columns. The column ‘region’ is the neighborhood name.

Image Provided by Author

For our clustering analysis, we don’t need all these columns. We are going to create clusters based on five features, so let’s keep only the fields of interest in our data frame using the following code.
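A sketch of the selection step; the five feature names below are inferred from the columns discussed later in this article.

```python
# Five clustering features (inferred from the analysis that follows),
# plus 'region' as the neighborhood identifier
features = [
    'median_sale_price_yoy',
    'median_dom',
    'avg_sale_to_list',
    'homes_sold_yoy',
    'new_listings_yoy',
]
df_NC = df_NC[['region'] + features]
df_NC.head()
```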

Data frame for cluster analysis with five features (Image Provided by Author)

Data Cleaning (missing values, outliers, etc.)

Let’s quickly check whether there are any missing values or outliers that need to be treated before we carry on with the analysis.
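Something like the following covers both checks:

```python
# Count missing values per column, then eyeball the summary statistics
print(df_NC.isnull().sum())
df_NC.describe()
```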

Image Provided by Author

The data already seems to be in decent shape. There are no missing values in any of the columns. The features ‘median_dom’ and ‘homes_sold_yoy’ may have some large outliers, but the rest of the features seem to fall within a reasonable range.
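The box plots below can be reproduced with a sketch like this:

```python
import matplotlib.pyplot as plt

# One box plot per feature to spot outliers visually
fig, axes = plt.subplots(1, len(features), figsize=(16, 4))
for ax, col in zip(axes, features):
    ax.boxplot(df_NC[col].dropna())
    ax.set_title(col, fontsize=8)
plt.tight_layout()
plt.show()
```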

Check for outliers using box plots (Image Provided by Author)

Let’s treat the outliers for ‘median_dom’ and ‘homes_sold_yoy’ by capping the maximum value at the 99th percentile using the following code.
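A minimal sketch using pandas’ quantile and clip:

```python
# Cap each skewed feature at its own 99th percentile
for col in ['median_dom', 'homes_sold_yoy']:
    cap = df_NC[col].quantile(0.99)
    df_NC[col] = df_NC[col].clip(upper=cap)
```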

Data frame after outlier treatment (Image Provided by Author)

Perform PCA (Principal Component Analysis)

Before we feed our features into a clustering algorithm, I always recommend that you perform PCA on your data first. PCA is a dimensionality reduction technique that allows you to only focus on a few principal components that are able to represent the majority of information and variance in your data.

In the context of cluster analysis, particularly this one where we don’t have a large number of features, the main advantage of performing PCA is visualization. PCA helps us identify the top 2 or 3 principal components so that we can visualize our clusters in a 2- or 3-dimensional chart.

To perform PCA, let’s first normalize all of our features to the same scale, because PCA and distance-based clustering methods are sensitive to outliers and to differences in scales/units.
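A sketch of the scaling step with scikit-learn’s StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

# Standardize the five features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_NC[features])
```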

Next, let’s perform PCA and show the top principal components that explain the majority of the variance in our data using the following code.
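A sketch of the PCA fit and the explained-variance check:

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on all five scaled features, then inspect how much
# variance each principal component explains
pca = PCA()
pca.fit(X_scaled)
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
```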

You can see that the top 3 principal components combined explain ~80% of the variance in the data. Therefore, instead of using all 5 features in our model, we can use just the top 3 principal components for clustering.

Image Provided by Author

Finally, let’s look at the weights (loadings) of each of the original variables on these principal components, so that we understand which variables have the most influence on which principal components; in other words, so we can attach some meaning to these principal components.
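The loadings table can be pulled straight from the fitted PCA object, for example:

```python
import pandas as pd

# Rows = original features, columns = principal components
loadings = pd.DataFrame(
    pca.components_.T,
    index=features,
    columns=[f'PC{i + 1}' for i in range(len(features))],
)
print(loadings.round(2))
```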

Image Provided by Author

This is quite interesting to look at! PC1 is largely influenced by ‘median_dom’ (median days on market) so this principal component represents how fast the properties sell in a neighborhood. We can label it as ‘speed of selling’.

PC2 is largely influenced by ‘new_listings_yoy’ (increase in # of new listings compared to same period last year) so this one represents the inventory volume or new listings change year-over-year. We can label it as ‘supply/demand/new listings’.

PC3 is largely influenced by ‘avg_sale_to_list’ (the ratio of sale price to listing price) and ‘median_sale_price_yoy’ (the change in median sale price YoY), both of which are related to price. We can label PC3 as ‘sales price change’.


Perform Hierarchical Clustering

We are ready to feed our data into a clustering algorithm! First we’ll try hierarchical clustering. The nicest thing about hierarchical clustering is that it builds a cluster tree (called a ‘dendrogram’) that visualizes the clustering steps. You can then slice the tree horizontally to select the number of clusters that makes the most sense given the tree structure.
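A sketch of that step with SciPy; Ward linkage is an assumption here (a common default that merges the pair of clusters that least increases within-cluster variance, which also pairs naturally with k-means later):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Cluster on the top 3 principal components
X_pca = pca.transform(X_scaled)[:, :3]

# Ward linkage is an assumed choice
Z = linkage(X_pca, method='ward')

plt.figure(figsize=(12, 5))
dendrogram(Z)
plt.title('Hierarchical clustering of NC neighborhoods')
plt.show()
```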

Image Provided by Author

In the dendrogram above, we can see that depending on where you place your horizontal line, you get a different number of clusters. It looks like a 4-cluster or a 5-cluster solution would both be reasonable. Let’s plot the 4-cluster solution in a scatter plot and see how it looks.
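Cutting the tree into four clusters and plotting the first two principal components might look like this:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster

# Slice the dendrogram into exactly 4 clusters
hc_labels = fcluster(Z, t=4, criterion='maxclust')

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=hc_labels, cmap='viridis')
plt.xlabel('PC1 (speed of selling)')
plt.ylabel('PC2 (supply/new listings)')
plt.show()
```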

Image Provided by Author

This is not bad: it gives us a basic understanding of how many clusters may exist in this data and what those clusters look like in a scatter plot. However, the hierarchical clustering results are not ideal, as we can see some ‘questionable’ data points that may have been assigned to the wrong clusters.

Hierarchical clustering, though very helpful for visualizing the clustering process via a tree structure, does have some drawbacks. For example, it makes only one pass through the data, which means records that are assigned incorrectly early in the process cannot be reassigned later.

Therefore, we rarely stop our analysis at the hierarchical clustering solution. Instead, we use it to get a ‘visual’ understanding of what the clusters might look like, and then use an iterative method (e.g., k-means) to improve and refine the solution.


Perform K-means Clustering

K-means is a very popular clustering method in which we specify a desired number of clusters, k, and assign each record to one of the k clusters so as to minimize a measure of dispersion within the clusters.

This is an iterative process that allocates and reallocates data points to minimize the distance of each record to its cluster’s centroid. Each iteration yields some improvement, and the process repeats until the improvement becomes negligible.

To perform k-means clustering, we need to first specify k (the number of clusters). We can do this in a few ways:

  1. Specify k using our domain knowledge or past experience
  2. Perform hierarchical clustering first to get an initial sense of what may be a good number of clusters to try
  3. Use the ‘elbow’ method to determine k

Well, from the dendrogram in the previous step, we already know that k=4 (or k=5) seems to work pretty well. We can also use the ‘elbow’ method to double-check. The code below plots the ‘elbow’ chart, which shows us where the turning point is for the optimal number of clusters.
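A sketch of the elbow chart, plotting k-means inertia (the within-cluster sum of squares) for k from 1 to 10:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for a range of k and record the inertia for each
inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_pca)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()
```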

Image Provided by Author

It seems the results are pretty consistent with what we saw in the dendrogram: both 4 and 5 are reasonable numbers of clusters to try, so let’s try k=5. The code below generates the 3-D scatter plot for the 5 clusters.
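A sketch of the final fit and the 3-D plot (random_state is an arbitrary seed added for reproducibility):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Final model: k-means with k=5 on the top 3 principal components
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_pca)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=labels, cmap='viridis')
ax.set_xlabel('PC1 (speed of selling)')
ax.set_ylabel('PC2 (supply/new listings)')
ax.set_zlabel('PC3 (sales price change)')
plt.show()
```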

Image Provided by Author

Describe Clusters and Derive Insights

Now that we have our clusters, let’s add the cluster labels back to our original data frame (df_NC), export it to a .csv file, and derive insights from it.
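A sketch of that step; the output file name is a placeholder:

```python
# Attach the k-means labels and export for use in Tableau
df_NC['cluster'] = labels
df_NC.to_csv('nc_neighborhood_clusters.csv', index=False)
```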

I used my favorite dashboarding tool, Tableau, to visualize these clusters in terms of their characteristics, similarities, and differences. I plotted the clusters in a 2-D scatter plot, just like we did in Python.

The x-axis is the first principal component, which we labeled ‘selling speed’. The y-axis is the second principal component, which we labeled ‘supply/new listings’. The size of each circle represents the third principal component, the ‘price change/sale-to-list ratio’. In this dashboard, you can also hover over each circle to see which neighborhoods are in the hottest markets, with more details provided in the tooltip.

Image Provided by Author

Congratulations on following along with the tutorial and leveling up your machine learning skills with this hands-on project! I hope you enjoyed this article. Happy learning!


You can support me by signing up for Medium through this referral link. If you sign up through this link, I will receive a portion of your membership fee. Thank you!

