Market Segmentation with R (PCA & K-means Clustering) — Part 1

A data science approach to the market research method that’s been around for decades

Rebecca Yiu
Towards Data Science


What is Market Segmentation?

For those who are new to the marketing field, here's a convenient Wikipedia-style explanation: market segmentation is a process used in marketing to divide customers into different groups (also called segments) according to their characteristics (demographics, shopping behavior, preferences, etc.). Customers in the same market segment tend to respond to marketing strategies similarly. Therefore, the segmentation process can help companies understand their customer groups, target the right ones, and tailor effective marketing strategies for each target group.

Case Study

This article will demonstrate a data science approach to market segmentation, using a sample survey dataset in R. In this example, ABC company, a portable phone charger maker, wants to understand its market segments, so it collects data from portable charger users through a survey study. The survey questions are of four types: 1) Attitudinal 2) Demographic 3) Purchase process & Usage behavior 4) Brand awareness. In this case, we will only work with the attitudinal data for segmenting. In reality, decision-makers choose different types of input variables (demographic, geographic, behavioral, etc.) for segmentation based on their individual cases. Nonetheless, the idea is the same regardless of which inputs you choose!

(Note: Thomas W. Miller raised a good point about using sales transaction data as inputs for segmentation in his book Marketing Data Science: Modeling Techniques in Predictive Analytics with R and Python. In short, he warns against segmenting with sales transaction data because information about sales is only available for current customers. When you have a new customer, it’s hard to utilize the insights you obtained without any of his/her sales data.)

Before we dive into the methods and models, remember that as a responsible data analyst, always understand your data first!

Checking Data

# Importing and checking data
raw <- read.csv("Chargers.csv")
str(raw)
head(raw)
Data Structure
A snippet of the data

Each row in our data represents a respondent and each column represents his/her answer for the corresponding survey question. There are 2,500 respondents and 24 attitudinal questions. All are rating questions that ask the respondents about their opinions towards a given statement. Answers are in the 1–5 range. Here’s an example:

Please indicate how much you agree or disagree with the following statements (1 = strongly disagree, 5 = strongly agree).

I value style the most when it comes to purchasing a portable phone charger.

Understanding the nature of the questions, we can move on to verifying the data in our dataset. A few simple function calls often do the trick:

# Verifying data
library(psych)              # describe() comes from the psych package (Hmisc has one too)
describe(raw)
colSums(is.na(raw))         # checking for NAs
table(unlist(raw) %in% 1:5) # simple test: every answer should fall in the 1-5 range

The validate package in R is also a handy tool for verifying data. It allows you to test your data against a set of rules you create. However, I find it less convenient when dealing with large datasets. I am still looking into alternative methods (preferably systems) that verify data quality effectively, and I would greatly appreciate any suggestions.
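If you do want to try it, here is a minimal sketch of what a rule set could look like (the rules and the Q1/Q2 column names are illustrative assumptions; validator(), confront(), and in_range() are part of the validate package):

# Illustrative rules with the validate package (adjust column names to your data)
library(validate)
rules <- validator(
  in_range(Q1, min = 1, max = 5), # ratings must stay on the 1-5 scale
  in_range(Q2, min = 1, max = 5),
  !is.na(Q1)                      # no missing answers for Q1
)
out <- confront(raw, rules) # test the data against the rules
summary(out)                # pass/fail counts per rule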

Now that we have validated our data and are confident about it, let's move on to the more fun stuff!

Principal Component Analysis (PCA)

The term “dimension reduction” used to freak me out. However, it is not as complicated as it sounds: it’s simply the process of extracting the essence from a myriad of data, so the new, smaller dataset can represent the unique features of the original data without losing too much useful information. Think of it as Picasso’s Cubism paintings where he elegantly captures the essence of an object with a few lines and shapes, forgoing many details. For me, I always like to think of his Guitar. If you have another artwork in mind, please COMMENT!!

Guitar 1914 by Pablo Picasso

PCA is a form of dimension reduction. This video by StatQuest (shout out to my favorite Statistics/Data Science video channel) explains the concept quite intuitively. I strongly recommend watching it if this is your first time hearing of PCA. In short, PCA allows you to take a dataset with a high number of dimensions and compress it into one with fewer dimensions that still captures most of the variance within the original data.

Why is PCA helpful for dividing customers into different groups, you ask? Imagine that you need to separate customers based on their answers to these survey questions. The first problem you encounter is how to differentiate them based on their inputs on 24 variables. Sure, you can try to come up with a few main themes to summarize these questions, assign each respondent a "score" for each theme, then group them based on the scores. But how can you be SURE that the themes you propose are truly effective in dividing people? How do you decide what weight to give each question? Furthermore, what will you do if you have 5,000 variables instead of 24? A human brain simply can't process that much information in a short period of time. At least mine can't, for sure.

Photo by ME.ME on Facebook

This is where PCA can step in and do the task for you. By performing PCA on our data, R can transform the 24 correlated variables into a smaller number of uncorrelated variables called principal components. With the smaller, compressed set of variables, we can perform further computation with ease, and we can investigate hidden patterns within the data that were hard to discover at first.

While there are abundant literature/videos/articles out there that provide thorough explanations of PCA, I hope to present a few high-level points about PCA for people who find those materials too technical:

  • Variability makes data useful. Imagine a dataset with 10,000 uniform values. It does not tell you much, and it’s boring. 😑
  • Again, PCA’s function is to create a smaller subset of variables (principal components) that capture the variability within the original, much larger dataset.
  • Each principal component is a linear combination of the initial variables.
  • The principal components are orthogonal to each other. That means they are uncorrelated.
  • The first principal component (PC1) captures the most variability within the data. The second principal component (PC2) captures the second most. The third principal component (PC3) captures the third most…and so on.

In addition, here are a couple of terms you should know if you are planning to run PCA for your project:

  • Loading describes the relationship between the original variables and the new principal component. Specifically, it describes the weight given to an original variable when calculating a new principal component.
  • Score describes the relationship between the original data and the newly generated axis. In other words, score is the new value for a data row in the principal component space.
  • Proportion of Variance indicates the share of the total data variability each principal component accounts for. It is often used with Cumulative Proportion to evaluate the usefulness of a principal component.
  • Cumulative Proportion represents the cumulative proportion of variance explained by consecutive principal components. The cumulative proportion explained by all principal components equals 1 (100% of data variability is explained).
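For reference, here is where each of these quantities lives in the output of R's prcomp() (a quick sketch, assuming the fitted object pr_out that we create in the next section):

pr_out$rotation # loadings: weight of each original variable in each PC
pr_out$x        # scores: each respondent's coordinates in PC space
pr_out$sdev^2   # variance captured by each PC
summary(pr_out) # reports Proportion of Variance and Cumulative Proportion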

Running PCA in R

Before you run a PCA, you should take a look at the correlations in your data. If your variables are not highly correlated, you might not need a PCA at all!

# Creating a correlation plot
library(ggcorrplot)
cormat <- round(cor(raw), 2)
ggcorrplot(cormat, hc.order = TRUE, type = "lower", outline.color = "white")
Correlation Plot

As the graph shows, our variables are quite correlated. We can happily proceed to PCA ✌️.

# PCA
pr_out <- prcomp(raw, center = TRUE, scale. = TRUE) # scaling data before PCA is usually advisable!
summary(pr_out)
PCA Summary

There are 24 new principal components because we had 24 variables in the first place. The first principal component accounts for 28% of the data variance. The second principal component accounts for 8.8%. The third accounts for 7.6%…We can use a scree plot to visualize this:

# Scree plot
pr_var <- pr_out$sdev ^ 2   # variance explained by each PC
pve <- pr_var / sum(pr_var) # proportion of variance explained (PVE)
plot(pve, xlab = "Principal Component", ylab = "Proportion of Variance Explained", ylim = c(0,1), type = 'b')
Scree plot

The x-axis shows the principal component number, and the y-axis shows the proportion of variance explained (PVE) by each. The variance explained drops drastically after PC2. This spot is often called the elbow point, indicating the number of PCs that should be used for the analysis.

# Cumulative PVE plot
plot(cumsum(pve), xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained", ylim =c(0,1), type = 'b')
Cumulative Proportion of Variance

If we choose only 2 principal components, they capture less than 40% of the total variance in the data. That is probably not enough.
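You can read this figure straight off the cumulative PVE vector we computed above:

cumsum(pve)[2] # cumulative PVE of the first two PCs; below 0.40 here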

Another rule for choosing the number of PCs is to keep those with eigenvalues greater than 1. This is called the Kaiser rule, and it is controversial; you can find many debates on this topic online.
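Because we standardized the data before running PCA, each PC's eigenvalue is simply its squared standard deviation from prcomp(), so checking the Kaiser rule takes just a couple of lines (a quick sketch):

eigenvalues <- pr_out$sdev^2 # eigenvalues of the correlation matrix
sum(eigenvalues > 1)         # how many PCs the Kaiser rule would keep
which(eigenvalues > 1)       # which PCs those are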

Basically, there isn't a single best way to decide the number of PCs. People use PCA for different purposes, and it is always important to think about what you want to get out of your PCA before making the decision. In our case, since we are using PCA to find meaningful and actionable market segments, one criterion we should definitely consider is whether the PCs we settle on make sense in real-world business settings.

Interpreting Results

Let's pick the first 5 PCs for now: five components are not too hard to work with, and the choice follows the Kaiser rule.

Next, we want to make meaning out of these PCs. Remember that loadings describe the weight given to each raw variable when calculating a new principal component? They are key to interpreting the PCA results. Since working directly with the PCA loadings can be tricky and confusing, we can rotate them to make interpretation easier.

There are multiple rotation methods out there, and we will use one called "varimax". (Note: this rotation step is NOT part of the PCA itself. It simply helps us interpret the results. Here is a good thread on the topic.)

# Rotating the loadings of the first 5 PCs for easier interpretation
rot_loading <- varimax(pr_out$rotation[, 1:5])
rot_loading
Varimax-rotated loadings up to Q12

Here's a portion of the varimax-rotated loadings, up to Q12. The numbers in the table correspond to the relationships between our questions (raw variables) and the selected components. If a number is positive, the variable contributes positively to the component; if it's negative, they are negatively related. The larger the absolute value, the stronger the relationship.

With these loadings, we can refer back to our questionnaire to get some ideas about what each PC is about. Let's look at PC1, for example. I noticed that Q10, Q3 & Q7 contribute negatively to PC1. On the other hand, Q8 & Q11 contribute positively. Checking the questionnaire, I realized that Q10, Q3 & Q7 are questions related to the style of the charger, while Q8 & Q11 focus on the functionality of the product. Therefore, we can make a tentative conclusion that PC1 describes people's preference for the product's functionality. It makes sense that people who value functionality more might not care too much about style.
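To make this kind of inspection easier, you can sort a PC's rotated loadings and read off its strongest contributors at a glance (a small sketch using the $loadings component of the varimax() output):

pc1 <- sort(rot_loading$loadings[, 1]) # PC1's rotated loadings, low to high
head(pc1) # most negative contributors (here, the style questions)
tail(pc1) # most positive contributors (here, the functionality questions)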

Then you can move on to PC2 and follow the same procedure to interpret each PC. I will not go through the complete process here; I hope you get the idea. Once you have gone through all the PCs and feel that each describes unique, logically coherent traits that make business sense, you're ready for the next step. However, if you feel some information is missing or repetitive within the PCs, you can go back and include more PCs, or eliminate some. You might have to go through several iterations until you get a satisfying result.

We’re done!!

Just kidding. But you are halfway there. You’ve walked through the process of compressing a large dataset to a smaller one with a few variables that can help you identify different customer groups out there using PCA. In the next post, I will introduce how to segment our customers based on the PCs we obtained using a clustering method.

Lastly, #HappyInternationalWomensDay to all the amazing superwomen out there 👯👧 💁 👭!

Thanks for reading! 💚 Feel free to connect with me on Linkedin!
