Applying customer segmentation to generate insights for marketing campaigns

Customer segmentation is a process that separates a company's customers into groups based on certain traits (age, gender, income, etc.), and an effective analysis can help a company make better marketing decisions. In this post, I will share a customer segmentation analysis based on data provided by Arvato, a supply chain solutions company. The dataset I used contains demographics data for the customers of one of Arvato's client companies. I also built a model to predict the response rate of mail-out advertisements.
Below I will walk you through the details of my analysis using both unsupervised and supervised learning methods. Two datasets, customer data from a client company of Arvato and general demographics data from Germany, were analyzed to answer the following questions:
- Who are the loyal customers of the client company, and with a change in marketing strategy to expand customer demographics, who are the potential customers to target?
- When the client company sends out a mail-out offer, can we predict the response rate?
Part I. Customer Segmentation
Who are the loyal customers of the client company, and with a change in marketing strategy to expand customer demographics, who are the potential customers to target?
Cluster segmentation helps to map the demographics of the client company’s existing customers to the general population in Germany. Here I apply unsupervised learning to identify what segment of the general population represents the loyal customers of the client company, and what segment represents a pool of potential new customers that they might target.
Data Exploration
It is often said that a data scientist spends 80% of their time cleaning data.
In this project, roughly 90% of my time went into exploring and preprocessing the data. The following datasets used in customer segmentation contain an enormous amount of raw data:
- Azdias – Demographics data for the general population of Germany (891211 x 366).
- Customers – Demographics data for the client company (191652 x 369).
In the general population data, 273 columns contain missing values. I decided to drop columns with a missing rate above 30% (shown in red), since a large amount of missing data is harmful for statistical analysis and model construction (e.g., ALTER_KIND1 and EXTSEL992). For the remaining columns (shown in blue), I imputed the missing data with the most frequent value.

There are 890919 rows containing missing values. The distribution of missing ratios shows that most rows have less than 10% missing values, so I set the row drop threshold at 50%, as sketched below.
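The threshold-based dropping can be sketched in a few lines of pandas; this assumes the raw data has already been loaded into a DataFrame named azdias:

# Fraction of missing values per column; drop columns above 30%
col_missing = azdias.isnull().mean()
azdias = azdias.drop(columns=col_missing[col_missing > 0.3].index)
# Fraction of missing values per row; drop rows above 50%
row_missing = azdias.isnull().mean(axis=1)
azdias_drop_missing = azdias[row_missing <= 0.5]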

Data Preprocessing

Data preprocessing involved the following six steps:
- Re-encode missing values: I manually created a feature description file, feature_summary.csv, containing each attribute's name, feature type, and the values that denote missing or unknown data. According to this file, missing values in many features were represented by -1 and 0 (e.g., in the feature 'AGER_TYP'). I re-encoded all such values into NaN for each attribute (see below).
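A minimal sketch of the re-encoding step, assuming feature_summary.csv has an 'attribute' column and a 'missing_or_unknown' column holding a comma-separated list of codes such as "-1,0" (the exact column names are assumptions):

import numpy as np
import pandas as pd

feat_info = pd.read_csv('feature_summary.csv')
# Replace each attribute's coded missing/unknown values with NaN
for _, row in feat_info.iterrows():
    attr = row['attribute']
    if attr in azdias.columns:
        codes = [int(c) for c in str(row['missing_or_unknown']).strip('[]').split(',')
                 if c.strip().lstrip('-').isdigit()]
        azdias[attr] = azdias[attr].replace(codes, np.nan)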

- Drop missing values: Note that the distribution of missing ratios per column shifted toward higher percentages after the re-encoding. All columns with over 30% missing values and all rows with over 50% missing values were removed from the two datasets.

- Drop highly correlated features: Pearson's correlation coefficients were computed for each pair of features. Since a correlation can be either positive or negative, I stored its absolute value. Features exceeding a threshold of 0.95, such as KBA13_BAUMAX, KBA13_FAB_SONSTIGE, and LP_FAMILIE_GROB, were removed.
import numpy as np

# Calculate Pearson's correlation matrix (absolute values)
corr_df = azdias_drop_missing.corr().abs()
# Create and apply a mask to keep only the upper triangle
mask = np.triu(np.ones_like(corr_df, dtype=bool))
tri_df = corr_df.mask(mask)
# Find the features that exceed the threshold
to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]

- One-hot encode categorical features: Multilevel categorical features were identified according to the feature description file, and each was one-hot encoded into dummy variables, as sketched below.
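A one-line sketch of this step, where cat_features is a hypothetical list of the multilevel categorical column names taken from the feature description file:

# One-hot encode the multilevel categorical features into dummy variables
azdias = pd.get_dummies(azdias, columns=cat_features)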
- Translate and transform mixed-type variables: Some features encoded multiple attributes and were split into new independent features. For example, 'CAMEO_INTL_2015' was split into 'wealth' and 'life stage' attributes (see the sketch below).
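As a sketch, assuming the two-digit 'CAMEO_INTL_2015' code stores wealth in the tens digit and life stage in the units digit:

# Coerce non-numeric codes (e.g., 'XX') to NaN, then split the two digits
cameo = pd.to_numeric(azdias['CAMEO_INTL_2015'], errors='coerce')
azdias['WEALTH'] = cameo // 10
azdias['LIFE_STAGE'] = cameo % 10
azdias = azdias.drop(columns=['CAMEO_INTL_2015'])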
- Impute and scale data: Impute the remaining missing values with the most frequent value and scale the data with StandardScaler.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
imputer = SimpleImputer(strategy='most_frequent')
scaler = StandardScaler()
After data preprocessing, the Azdias dataset was a 791244 x 395 matrix, and the Customers dataset was a 140866 x 395 matrix.
Principal component analysis (PCA)
A large number of features complicates data analysis, so it is necessary to reduce the dimensionality of the data before cluster segmentation. In this project, I performed a principal component analysis (PCA) to decompose the original data into uncorrelated principal components, keeping only 200 of the 395 components, which together explain around 88% of the cumulative variance of the original dataset.
from sklearn.decomposition import PCA

# Keep 200 principal components (~88% of cumulative variance)
n_components = 200
pca = PCA(n_components=n_components, random_state=42)
azdias_pca = pca.fit_transform(azdias_scaled)
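A quick sanity check on the retained variance:

# Cumulative variance explained by the 200 retained components (~0.88)
print(pca.explained_variance_ratio_.sum())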

Clustering and Feature Decomposition
I applied the k-means algorithm to group the dataset into clusters and used the elbow method to find the right number of clusters. The idea of the elbow method is to run k-means clustering on the dataset for a range of k values and, for each k, calculate the sum of squared errors (i.e., the k-means score). In the plot of k against the k-means score, the "elbow" of the curve marks the best number of clusters for the dataset. Based on the elbow method results below, I grouped the dataset into eight clusters using the k-means algorithm.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Number of different cluster counts
n_clusters = np.arange(2, 21)
# Run k-means clustering on the data and compute the average within-cluster distances
scores = [MiniBatchKMeans(n_clusters=i).fit(azdias_pca).score(azdias_pca) for i in n_clusters]
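Since score() returns the negative of the within-cluster sum of squares, plotting its absolute value against k reveals the elbow; a minimal sketch of how such a plot can be produced:

import matplotlib.pyplot as plt

# Plot SSE against k and look for the "elbow"
plt.plot(n_clusters, [abs(s) for s in scores], marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Sum of squared distances (SSE)')
plt.show()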

With the customer data mapped onto the demographic clusters, we can now compare cluster distributions between the general population and the client company's customer base. The results suggest that Cluster 7 is the company's major customer base, since it has the highest customer proportion of all clusters; it can be considered the "loyal customers." In contrast, Cluster 2 shows the lowest customer percentage and the largest negative difference between the general population and customer proportions. This suggests that, under the current marketing strategy, only a very small portion of the population in Cluster 2 is likely to become a customer of the client company. The opportunity here is that tailoring the marketing strategy toward the population of Cluster 2 could attract a new segment of customers.
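A minimal sketch of this comparison, assuming customers_pca holds the customer data pushed through the same cleaning, scaling, and PCA pipeline:

# Fit k-means with 8 clusters on the general population, then map the customers onto it
kmeans = MiniBatchKMeans(n_clusters=8, random_state=42).fit(azdias_pca)
azdias_clusters = kmeans.predict(azdias_pca)
customers_clusters = kmeans.predict(customers_pca)
# Difference in cluster proportions: customers vs. general population
pop_share = pd.Series(azdias_clusters).value_counts(normalize=True).sort_index()
cust_share = pd.Series(customers_clusters).value_counts(normalize=True).sort_index()
print((cust_share - pop_share).round(3))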

Decomposition of the cluster information reveals which types of individuals make up Cluster 2 and Cluster 7. First, I identified which principal components were most significant for each cluster; then I evaluated which features explained the most variance in those components, while recovering the original feature values by inverting the PCA transformation and the data scaling. Together, these steps reveal which types of individuals belong to Cluster 2 and Cluster 7.
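The decomposition can be sketched as follows, where feature_names is a hypothetical list of the column names of the preprocessed data:

# Map a cluster centroid back to the original feature space:
# undo the PCA projection, then undo the scaling
centroid_pca = kmeans.cluster_centers_[7].reshape(1, -1)  # e.g., Cluster 7
centroid_original = scaler.inverse_transform(pca.inverse_transform(centroid_pca))
# Most positive / most negative feature weights of a principal component
weights = pd.Series(pca.components_[0], index=feature_names)
print(weights.nlargest(10))
print(weights.nsmallest(10))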
Loyal Customers: Cluster 7 (Middle-class)
Below, I list the ten most positively correlated features (in red) and the ten most negatively correlated features (in blue).

Loyal customers of the client company are mostly between 46 and 60 years old, middle class, and have small families. These customers have a stable lifestyle, little interest in financial investment, and are avid online shoppers.


Potential customers: Cluster 2 (Rich upper class)

"Cars, cars, cars!" It is all about luxury cars when we look at the key features of Cluster 2. The individuals in Cluster 2 are lovers of expensive German cars, and besides that, most of them trace their ancestry to old-time West Germany.


Takeaway: Middle-class individuals with small families and lively online shopping habits are the core customers of the client company. Upper-class individuals who are fascinated by German luxury cars and have roots in former West Germany are potential marketing targets if the client company wants to expand its customer demographics.
Part II. Supervised Learning
When the client company sends out a mail-out offer, can we predict the responding rate?
This is a typical classification problem, and building a supervised machine learning model is a good approach to the solution. The main focus of this part is to train a model on the training dataset and use it to predict the "RESPONSE" attribute, i.e., to determine whether individuals in the test dataset would respond to a mail-out offer or not.
Data Exploration
The training data and test data provided by Arvato include the following:
- Mailout_train – Demographics data for individuals who were targets of a marketing campaign (42982 x 367).
- Mailout_test – Demographics data for individuals who were targets of a marketing campaign (42833 x 366).
As shown in the response distribution graph below, the target attribute "RESPONSE" is highly imbalanced in the Mailout_train dataset: 42430 negative responses and only 532 positive responses, a response rate of just 1.2%. This imbalance affects the training of classification models and makes common metrics such as accuracy misleading.
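A quick check of the imbalance, assuming the training data is loaded as mailout_train:

# Response distribution: 42430 negatives vs. only 532 positives (~1.2%)
print(mailout_train['RESPONSE'].value_counts())
print('Response rate: {:.1%}'.format(mailout_train['RESPONSE'].mean()))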

Data Preprocessing
The data preprocessing procedure from Part I has been implemented as a clean function, which can be found on my GitHub. Here, I reproduced the same steps to preprocess the data in Mailout_train and Mailout_test.

Implementation and Refinement
Due to the highly imbalanced data in 'RESPONSE', I took the approach of computing the probability that an individual will reply to the offer, analyzing the collective features of individuals who respond, and using this information to suggest individuals with a high response probability. The area under the ROC curve (false positive rate vs. true positive rate) was used to evaluate the performance of the machine learning models.
I considered the following five classification algorithms:
- AB: AdaBoostClassifier
- GB: GradientBoostingClassifier
- LR: LogisticRegression
- RF: RandomForestClassifier
- XGB: XGBClassifier
To evaluate the performance of these five algorithms, I created a function that calculates the ROC AUC score for a specified algorithm and parameter grid.
from sklearn.model_selection import GridSearchCV

def grid_search(estimator, param_grid, X=X, y=y):
    """
    Grid search a classifier by ROC AUC score, using 5-fold cross-validation.
    INPUT:
    - estimator (classifier): classifier instance
    - param_grid (dict): parameters used with GridSearchCV
    - X (DataFrame): features from the cleaned mailout_train dataframe
    - y (Series): 'RESPONSE' from the cleaned mailout_train dataframe
    OUTPUT:
    - best_estimator: fitted classifier with the best parameters
    - best_score: mean cross-validated ROC AUC of the best estimator
    """
    grid = GridSearchCV(estimator=estimator, param_grid=param_grid,
                        scoring='roc_auc', cv=5)
    grid.fit(X, y)
    return grid.best_estimator_, grid.best_score_
Because of the highly imbalanced response attribute, I split the Mailout_train dataset into training and validation sets via 5-fold cross-validation, then fitted the five algorithms with default hyperparameters.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

# Choose five classifiers: LR, RF, AB, XGB, and GB
models = [
    ('LR', LogisticRegression()),
    ('RF', RandomForestClassifier()),
    ('AB', AdaBoostClassifier()),
    ('XGB', XGBClassifier()),
    ('GB', GradientBoostingClassifier()),
]
# Evaluate the ROC AUC score of each classifier with default parameters
for name, model in models:
    score = grid_search(model, {})[1]
    print('ROC AUC score for {}: {}'.format(name, score))
AdaBoostClassifier (AB) and GradientBoostingClassifier (GB) were the best performing models, with ROC AUC scores of 0.7172 and 0.7564, respectively. From this point, I used grid search to tune the hyperparameters of the AB and GB models, increasing their ROC AUC scores to 0.7531 and 0.7570, respectively (see the red bars below).
# Tune the hyperparameters of the AB model
param_AB = {
    "learning_rate": [0.1, 0.2],
    "n_estimators": [100, 200, 500]}
AB_tuned, AB_score = grid_search(AdaBoostClassifier(), param_AB)
# Tune the hyperparameters of the GB model
param_GB = {
    "learning_rate": [0.1, 0.2],
    "max_depth": [3, 4],
    "n_estimators": [100, 200],
    "min_samples_split": [2, 4]}
GB_tuned, GB_score = grid_search(GradientBoostingClassifier(), param_GB)

I then looked at the feature importances of the AB and GB models to see whether they verified the conclusions drawn from my customer segmentation analysis in Part I. Unfortunately, the single feature that stood out for both models was 'D19_SOZIALES' – a feature that has no description and was not identified in Part I as relevant to loyal customers. A bold guess is that 'D19_SOZIALES' relates to transaction history and therefore points to socioeconomic status.
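The inspection itself is straightforward; a sketch, assuming X holds the cleaned Mailout_train features and GB_tuned is the fitted estimator returned by grid_search:

# Rank features by importance in the tuned GB model
importances = pd.Series(GB_tuned.feature_importances_, index=X.columns)
print(importances.nlargest(10))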

Kaggle competition
The AB and GB models were applied to predict 'RESPONSE' in the Mailout_test dataset. The AB model achieved a score of 0.781, and the GB model achieved a score of 0.777. These results were submitted to Kaggle.
Takeaway: A supervised machine learning model is an efficient way to solve this kind of classification problem. I constructed two models, using the AB and GB algorithms, and both were able to predict response rates of mail-out offers with a score of around 0.78.
Conclusion
Some say that 80–85% of data science projects fail before completion; I feel lucky that this one falls in the other 15–20%. In this project, I practiced and applied different types of data science skills to solve a real-life problem: finding the characteristic features of loyal and potential new customers, and predicting whether an individual would respond to a mail-out campaign.
There is room for improvement. Principal component analysis (PCA) is strongly influenced by differences in the scale of the features, so scaling the data with an alternative approach like MinMaxScaler may give better performance. In the supervised learning part, the synthetic minority oversampling technique (SMOTE) might be a better option for handling the imbalanced classification issue, since it synthesizes new minority-class samples to produce a more balanced dataset (a sketch follows below).
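For illustration, a minimal sketch of SMOTE using the imbalanced-learn package, applied to the cleaned training features X and labels y (not part of my original pipeline):

from imblearn.over_sampling import SMOTE

# Synthesize minority-class samples until the classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(y_resampled.value_counts())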
Reference and Credit
All of the code used in this article can be found on my GitHub and Tableau.
Data and resource provided by Udacity and Bertelsmann&Arvato.