
Objectives
- Use the K-Means clustering algorithm in R
- Determine the right number of clusters
- Create tables and visualizations of the clusters
- Download, extract, and load complex Excel files from the web into R
- Clean, wrangle, and filter the data efficiently
Introduction
K-Means clustering is an unsupervised machine learning technique that groups observations into a chosen number of clusters based on their distance to the cluster centers of the variables in the data set [1]. A couple of classic examples are clustering different types of customers in company loyalty programs and separating medical patients into low, medium, high, and extreme risk categories [1].
In this article, we will be using the 2020 Index of Economic Freedom data set, freely available from The Heritage Foundation here [2]. We will also use an R Markdown file (.Rmd) available for download on my GitHub at this link. I will assume that you can download and install R and RStudio on your machine of choice.
The structure of this article is:
- Load required libraries
- Import the data
- Clean the data & fix formatting issues
- Show a couple of different ways to create K-Means clusters
- Show several useful visualizations that you can adapt for your projects as needed
Let’s jump into it!
The Code
Load Libraries
First, we need to load tidyverse, FactoMineR, factoextra, and readxl into our R environment. If you get an error about the packages not being recognized, uncomment (remove the # in front of) the install.packages() line(s) you need.
# Install if needed by removing the #
# install.packages("tidyverse")
# install.packages("readxl")
# install.packages("FactoMineR")
# install.packages("factoextra")
# Load Libraries
library(tidyverse)
library(readxl)
library(FactoMineR)
library(factoextra)
It is also worth setting your working directory to a folder that contains your file and has room for the downloaded .xls file. That can be done with the setwd() command, or by using RStudio's Files pane to navigate to the folder and setting the working directory from the More drop-down menu, as shown below:

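If you would rather do this in code, setwd() takes the folder path directly. A minimal example, with a placeholder path you should swap for your own folder:
# Set working directory (placeholder path: replace with your own folder)
setwd("~/projects/efi-clustering")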
Download the Data
Now we need to download the data. The link to the web page can be found here [2] or in the .Rmd file from my GitHub if you want to explore The Heritage Foundation’s website a bit more and learn about the data. Click the "Download Raw Data" button at the top of the page and you should get a file named index2020_data.xls that we will use for the rest of the project.
Import the Data
The easiest thing to do is to navigate to the folder with the data in the Files pane and click on the file, which automatically pulls up the readxl graphical import interface. I went ahead and renamed my data to EFI_data for simplicity and sanity. These screenshots should help:


After changing the name and hitting the Import button, RStudio will automatically generate some code in your console. It’s a good idea to copy/paste the first line, without the ">", into your file so you can quickly load the data again later. Here’s what that looks like:

The code should look like this:
# Import Data
EFI_data <- read_excel("index2020_data.xls")

Initial Look at the Data
When possible, it is always a good idea to take a quick look through the data to see if there are any obvious problems that need to be solved before we do the fun stuff.
Here’s a screenshot of some of the data in the middle. Let’s see what jumps out.

What jumps out at me is that missing data is labeled "N/A", which I know R will not like. The numbers also look strange: some are short and sweet while others have far too many decimal places. The keen among us will also have noticed in the data import phase that all of the columns that are supposed to be numeric are actually characters. We will have to resolve the data type issues as well.
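If you would rather confirm the column types in the console than in the import preview, a quick look with glimpse() from dplyr (loaded with the tidyverse) shows every column's type and a few sample values:
# Inspect column types and sample values
glimpse(EFI_data)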
Clean the Data
First, let’s get rid of the "N/A" issue. There are many ways to go about doing this, but I am going to rely on a trick of R and "trust the magic" a little bit. To avoid complications in the following steps, just trust me that we should save the original column names from EFI_data before moving on. Let’s use this code:
# Get original column names
EFI_data_col_names <- names(EFI_data)
Next, we are going to use the lapply() function, with the gsub() function wrapped inside a custom function(), to find any occurrence of the string "N/A" and replace it with an empty string. Structuring the call this way lets lapply() apply the replacement across every column at once, which is generally faster and tidier than writing an explicit for loop in R [1]. In a later step, we will rely on R to coerce the now-empty strings into the proper NA format. Scroll through the data to see the result for yourself.
Here’s the code:
# Fix "N/A" in the data set
EFI_data <- data.frame(lapply(EFI_data, function(x) {
  gsub("N/A", "", x)
}))
While this does what we want to the data itself, the data.frame() call also thinks it is a good idea to automatically change the column names (spaces become periods, for example). It’s annoying and messy, so let’s put the original names back with this code:
# Put the right column names back
names(EFI_data) <- EFI_data_col_names
Time for a custom function!
Let’s make one that changes all of those character columns to numeric and rounds them to two decimal places while we’re at it. We will plug this custom function into the next step.
Here’s the code:
# Create custom function to fix data types and round
to_numeric_and_round_func <- function(x){
  round(as.numeric(as.character(x)), 2)
}
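Before plugging it in, a quick sanity check in the console shows the function doing both jobs at once:
# Sanity check: a character value comes back as a rounded numeric
to_numeric_and_round_func("79.12345")  # returns 79.12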
With our custom function created, we will use the mutate_at() function to change all the columns except the four that are supposed to remain text.
Here’s the code:
# Mutate the columns to proper data type
EFI_data <- EFI_data %>%
  mutate_at(vars(-one_of("Country Name", "WEBNAME", "Region", "Country")), to_numeric_and_round_func)
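As a side note, if you are on dplyr 1.0 or later, where mutate_at() is superseded, an equivalent version using across() might look like this (same columns, same custom function):
# Equivalent with across() on newer dplyr versions
EFI_data <- EFI_data %>%
  mutate(across(-c(`Country Name`, WEBNAME, Region, Country), to_numeric_and_round_func))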
Realistically, we do not need the Country or WEBNAME columns for the actual analysis, so we will remove them now by setting them to NULL with this code:
# Remove Country and WEBNAME
EFI_data$Country <- NULL
EFI_data$WEBNAME <- NULL
K-Means Clustering
There are two main ways to do a K-Means analysis: the basic way and the fancy way.
Basic K-Means
In the basic way, we will run a simple kmeans() call, guess a number of clusters (5 is usually a good place to start), then effectively duct tape the cluster numbers to each row of data and call it a day. We will have to get rid of any missing data first, which can be done with this code:
# create clean data with no NA
clean_data <- EFI_data %>%
  drop_na()
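If you are curious how many rows drop_na() removed (any country missing any value is excluded), a quick comparison tells you:
# How many rows were lost to missing data?
nrow(EFI_data) - nrow(clean_data)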
We also want to set the seed so that we ensure reproducibility with this code:
# Set seed
set.seed(1234)
Here is the code for the basic K-Means method:
# Cluster Analysis - kmeans (columns 7:32 hold the numeric indicator variables)
kmeans_basic <- kmeans(clean_data[, 7:32], centers = 5)
kmeans_basic_table <- data.frame(kmeans_basic$size, kmeans_basic$centers)
kmeans_basic_df <- data.frame(Cluster = kmeans_basic$cluster, clean_data)
# head of df
head(kmeans_basic_df)
Here’s the output:

This shows the cluster number added to each row of the data.
We can also look at the properties (centers, size) of each cluster with the kmeans_basic_table object here:

We can also explore each row of the data with the cluster number in the kmeans_basic_df object like this:

For many, the basic way will get you as far as you need to go. From here, you can build graphs by region, summarize the data for each cluster, or do anything else you want; for example, the quick cluster-by-region count sketched below.
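A one-line table of how many countries land in each cluster, split by region, only needs dplyr's count():
# Countries per cluster, broken down by region
kmeans_basic_df %>%
  count(Cluster, Region)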
I also created a quick ggplot() example breaking down the count of each cluster by region. We could do dozens of different plots, but this is a good, simple demonstration.
Here’s the code:
# Example ggplot
ggplot(data = kmeans_basic_df, aes(y = Cluster)) +
  geom_bar(aes(fill = Region)) +
  ggtitle("Count of Clusters by Region") +
  theme(plot.title = element_text(hjust = 0.5))
Here’s the output:

Fancy K-Means
The first task is to figure out the right number of clusters. This is done with a scree plot. Essentially, the goal is to find where the curve starts to flatten significantly [1]. Because the K-Means algorithm minimizes the distance between each data point and its cluster center, the total within-cluster variation drops steeply at first and then levels off as clusters are added, producing the characteristic curve that starts steep and asymptotically approaches a flat line [1]. While not strictly required, it is generally a good idea to scale the data before clustering, with the scale() function or another normalization technique, to get more accurate results [1].
We will use the fviz_nbclust() function to create a scree plot with this code:
# Fancy K-Means
fviz_nbclust(scale(clean_data[, 7:32]), kmeans, nstart = 100, method = "wss") +
  geom_vline(xintercept = 5, linetype = 1)
Here’s the output:

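As a side note, fviz_nbclust() supports other selection methods besides the elbow. If you want a second opinion, the average silhouette width method picks the cluster count that maximizes the curve instead:
# Alternative: average silhouette width (look for the maximum)
fviz_nbclust(scale(clean_data[, 7:32]), kmeans, method = "silhouette")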
Actually creating the fancy K-Means clusters is very similar to the basic version. We will just scale the data, make 5 clusters (our optimal number from the scree plot), and set nstart to 100 so the algorithm tries many random starting points and keeps the best result.
Here’s the code:
# Fancy kmeans
kmeans_fancy <- kmeans(scale(clean_data[, 7:32]), 5, nstart = 100)
# plot the clusters
fviz_cluster(kmeans_fancy, data = scale(clean_data[, 7:32]), geom = c("point"), ellipse.type = "euclid")
Here’s the output:

This last plot really helps visualize how similar the countries within each cluster are to one another. It is interesting how little code it takes to get some very interesting results.
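If you want to see which country landed in which cluster, you can attach the assignments back to the rows exactly as we did in the basic method:
# Attach fancy cluster assignments to the data
kmeans_fancy_df <- data.frame(Cluster = kmeans_fancy$cluster, clean_data)
head(kmeans_fancy_df)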
Conclusion
This project was made to show how to use and visualize K-Means clustering. There is more than one way to do it, and countless plots can be made. Like most tasks in data science, most of the time is actually spent finding data and cleaning it enough to do something useful with it. Hopefully, by this point you feel much better about using K-Means clustering and can see different applications for the techniques we used here.
References
[1] R. Kabacoff, R in Action (2nd ed.) (2015), Shelter Island, NY: Manning Publications Co.
[2] The Heritage Foundation, 2020 Index of Economic Freedom (2020), https://www.heritage.org/index/download