
Customer segmentation, one of the most important processes in the marketing industry, allows companies to focus their efforts on specific subsets of customers with similar characteristics. This gives companies the ability to market effectively and appropriately to each audience group. According to Shopify, customer segmentation helps companies:
- Create targeted communication methods that resonate with specific groups of customers, but not with others.
- Identify opportunities to improve existing products or to offer new products and services.
- Focus on the most profitable customers.
This capstone project focuses on that process using German demographics data. Specifically, it entails analyzing demographics data for the general population of Germany as well as customer data from a mail-order company. The goal of this analysis is to use unsupervised learning methods, including K-Means clustering, to determine how the customers of the mail-order company compare to the general population at large.
After segmenting the customers and interpreting the groups, the next step is to implement and tune a few supervised learning algorithms using a separate training dataset to predict the probability that an individual becomes a customer of the company.
After finding the optimal predictive model, the final step is to use it on a testing data set to predict the probability that each individual will respond to the campaign. These predicted probabilities are then uploaded to a Kaggle competition, which scores the predictions against other entries.
This project is one of several options for the Udacity Data Science nanodegree capstone. The Bertelsmann Arvato project became my top choice because I had worked on another version of it in the machine learning nanodegree program months earlier. That version consisted only of step 1 – from cleaning the data to clustering the customers. This is an opportunity to go beyond that step while applying new skills in data mining and machine learning.
The Data
Each of the three steps uses a different dataset with the same features. The demographic datasets contain around 366 features describing the personalities and behaviors of thousands of people. Each row represents a single individual and also includes information beyond that individual, i.e. household, building, and neighborhood data.
The German demographics data and the mail-order company’s customer data will be used for the unsupervised learning part of this analysis. The former contains 366 features and 891,211 samples, while the latter contains 369 features and 191,652 samples. The three extra features – CUSTOMER_GROUP, ONLINE_PURCHASE, and PRODUCT_GROUP – belong to the customer data and provide additional information about the customers.
The mail-out training and testing data sets contain the same features as the demographics data. Each has almost 43,000 samples, and each row represents an individual who was targeted by a marketing campaign. The outcome variable in the training data is RESPONSE, which indicates whether a targeted individual responded to the campaign and became a customer of the company. The testing data set contains all of the features except RESPONSE and will be used to make the final predictions with the optimal model in the last part of this project.
Part 1 – Data Preprocessing
Like every data science problem, this project requires data preprocessing and cleaning before any analysis or prediction. Since the four data sets share the same structure, the cleaning is organized into an ETL pipeline that recodes missing values, determines which features and rows to drop, handles categorical/mixed features, and creates dummy variables.
Missing Values
Due to the large size of the demographics data, I loaded only the first 100,000 samples to perform the initial cleaning faster. The original data contains several features whose missing values are coded with values other than NaN, e.g. -1, 0, X, or XX. Fortunately, a separate Excel file indexes the values and their descriptions for each feature. The values whose description is "unknown" are really missing values, so they were collected into a dictionary. After replacing all of the original "unknown" encodings with NaN, I proceeded to determine which features could be excluded.
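As a rough sketch, the recoding step could look like the following. The file name, the 100,000-row limit, and the example dictionary entries are assumptions; the real "unknown" codes come from the attribute-description Excel file.

import numpy as np
import pandas as pd

# Load only the first 100,000 rows for the initial cleaning (file name assumed).
azdias = pd.read_csv('Udacity_AZDIAS_052018.csv', sep=';', nrows=100000)

# Hypothetical entries: map each feature to the codes that mean "unknown".
unknown_codes = {
    'AGER_TYP': [-1, 0],
    'CAMEO_DEU_2015': ['X', 'XX'],
    # ... one entry per feature with special "unknown" encodings
}

def replace_unknowns(df, codes):
    # Return a copy of df with the coded "unknown" values set to NaN.
    df = df.copy()
    for col, values in codes.items():
        if col in df.columns:
            df[col] = df[col].replace(values, np.nan)
    return df

azdias_clean = replace_unknowns(azdias, unknown_codes)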

From figure 1, a sensible threshold for missing values per feature is 40%, since very few features exceed it. As a result, nine features were discarded, including AGER_TYP – best-ager typology, KBA05_BAUMAX – most common building type within the cell, and TITEL_KZ – whether the person has an academic title.

After dropping the mostly-missing columns, the next step was a similar process for the rows. From figure 2, rows with more than 50% missing values become much less frequent, and rows with around 70% missing values contribute nothing to the analysis. Hence, any row with more than 50% of its values missing was discarded.
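A minimal sketch of these threshold-based drops, reusing azdias_clean from the sketch above:

# Fraction of missing values per feature; drop anything above the 40% threshold.
col_missing = azdias_clean.isnull().mean()
cols_to_drop = col_missing[col_missing > 0.40].index
azdias_cols = azdias_clean.drop(columns=cols_to_drop)

# Fraction of missing values per row; keep rows at or below the 50% threshold.
row_missing = azdias_cols.isnull().mean(axis=1)
azdias_rows = azdias_cols[row_missing <= 0.50]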
Categorical Variables
A check of the feature data types showed that several features are categorical with many levels or mix multiple pieces of information. For example, CAMEO_DEU_2015 and CAMEO_INTL_2015 were each split into two features – wealth and life stage. CAMEO_DEUG_2015 was converted to a numerical variable since each of its values has a single meaning.
Two features were created from PRAEGENDE_JUGENDJAHRE. This original feature combines information on three dimensions – decade, movement, and nation. Based on its description in the Excel file, its values were converted into DECADE and MAINSTREAM; the nation dimension is already represented by OST_WEST_KZ.
After dropping some categorical variables that are not useful, the remaining ones were converted to dummy variables. This is the last step of the data cleaning process.
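The mixed-feature engineering and dummy encoding could be sketched roughly as follows. The value mappings shown are illustrative placeholders, not the exact codes from the Excel file.

import pandas as pd

def engineer_mixed_features(df):
    df = df.copy()

    # CAMEO_INTL_2015 combines wealth (tens digit) and life stage (ones digit).
    cameo = pd.to_numeric(df['CAMEO_INTL_2015'], errors='coerce')
    df['WEALTH'] = cameo // 10
    df['LIFE_STAGE'] = cameo % 10

    # PRAEGENDE_JUGENDJAHRE combines decade and mainstream/avantgarde movement;
    # the real mappings would be built from the Excel descriptions.
    decade_map = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60}       # illustrative only
    mainstream_map = {1: 1, 2: 0, 3: 1, 4: 0, 5: 1}        # illustrative only
    df['DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
    df['MAINSTREAM'] = df['PRAEGENDE_JUGENDJAHRE'].map(mainstream_map)

    df = df.drop(columns=['CAMEO_INTL_2015', 'PRAEGENDE_JUGENDJAHRE'])
    return pd.get_dummies(df)   # dummy-encode the remaining categorical columns

azdias_final = engineer_mixed_features(azdias_rows)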
All of the steps above are applied to each of the four data sets before their respective analyses.
Part 2 – Customer Segmentation
With the data cleaning pipeline in place, both the German demographics and the customer data sets are ready for segmentation analysis. The workflow comprises imputation of the remaining missing values, standardization, dimensionality reduction with principal component analysis, and K-Means clustering for the segmentation itself.
Once the clusters are generated on both the demographics and customer data sets, they will be compared to determine which ones represent the target audience for the mail-order company – the clusters that get a higher customer proportion than their population counterparts. These clusters will be interpreted according to dominant features in the highest principal components, giving details on the types of people targeted for the marketing campaign.
Principal Component Analysis
The first step after cleaning is finding the optimal number of components and clusters for the segmentation analysis. For the former, I wrote a function that, given an explained-variance threshold, selects the number of components and returns a bar plot of the features with the largest weights for any chosen principal component. For the demographics data, figure 3 indicates that about 154 principal components explain 85% of the variation in the data.
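A sketch of how the component count might be chosen, assuming median imputation (the actual imputation strategy may differ) and the cleaned DataFrame from the earlier sketches:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Impute and standardize before PCA.
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()
X = scaler.fit_transform(imputer.fit_transform(azdias_final))

# Fit a full PCA and count the components needed for 85% explained variance.
pca_full = PCA().fit(X)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.85)) + 1   # ~154 per figure 3

pca = PCA(n_components=n_components).fit(X)
X_pca = pca.transform(X)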


From figure 4, the features with the largest positive weights in the first principal component include movement patterns, the number of single/double households, social status, and the share of cars; these characteristics are positively associated with the component. The most negative weights, on the other hand, include the share of 6–10 family houses, car ownership per 6–10 family houses, wealth/life stage, and household net income. In other words, the first principal component is negatively associated with large households.

From figure 5, the largest weights in the second principal component show that it is positively associated with online affinity, transaction frequency, a money-saving financial typology, and age group. Conversely, this component is negatively associated with financial preparedness and the recency of financial transactions.
K-Means Clustering
With hundreds of thousands of individuals in the demographics data, the number of clusters should likely be higher than 5. To decide how many clusters to use, the elbow method looks for an elbow-like change of direction in the plot of the K-Means score as the number of clusters increases. With a data set this large, a quick way to loop over candidate cluster counts and collect scores is mini-batch K-Means. Regular K-Means would take much longer to fit and score in a loop, especially for larger numbers of clusters. The mini-batch method, on the other hand, reduces the time and memory cost of the K-Means algorithm by fitting on small random batches of data of a fixed size. The scores from this method are illustrated in figure 6.
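A minimal sketch of the elbow search with mini-batch K-Means; the batch size and random state are assumptions:

from sklearn.cluster import MiniBatchKMeans

# Collect the (negated) K-Means score for each candidate number of clusters.
scores = []
for k in range(2, 26):
    mbk = MiniBatchKMeans(n_clusters=k, batch_size=1000, random_state=0)
    mbk.fit(X_pca)
    scores.append(-mbk.score(X_pca))   # score() is the negative inertia

The negated score is the within-cluster sum of squares, so the elbow appears where its decrease starts to flatten as k grows.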

After fitting up to 25 clusters, the curve shows several twists and turns between 8 and 21 clusters. Since large data sets are often better described by a somewhat high number of clusters, the clearest elbow in the plot appears to be at k = 8 clusters.
To compare the demographics data with the customer data, the latter is pushed through the same fitted sklearn objects as the former, from imputation through clustering. Specifically, the customers are projected onto the 154 principal components and segmented into the 8 clusters. The cluster distribution for both data sets is illustrated in figure 7.
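A sketch of how the fitted objects could be reused on the customer data, assuming customers_final has been run through the same cleaning steps and aligned to the same feature set:

import pandas as pd
from sklearn.cluster import KMeans

# Final clustering on the demographics projection.
kmeans = KMeans(n_clusters=8, random_state=0).fit(X_pca)

# Push the customers through the already-fitted imputer, scaler, and PCA.
customers_X = scaler.transform(imputer.transform(customers_final))
customers_pca = pca.transform(customers_X)

population_clusters = kmeans.predict(X_pca)
customer_clusters = kmeans.predict(customers_pca)

# Cluster proportions behind figures 7 and 8.
pop_prop = pd.Series(population_clusters).value_counts(normalize=True).sort_index()
cust_prop = pd.Series(customer_clusters).value_counts(normalize=True).sort_index()
difference = cust_prop - pop_prop   # positive values mark target-audience clusters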

From both of the plots above, cluster 7 is the most represented in the German population at about 16%, while cluster 8 is the most represented among the company’s customers at almost 20%. Conversely, cluster 6 is the least represented in the demographics at about 8%, while cluster 5 is the least represented among the customers at about 7.5%. Interestingly, cluster 6 is one of the most represented among the customers, whereas cluster 7 – one of the most represented in the demographics – is one of the least represented among the customers.
Cluster Interpretation

To identify which clusters represent the target audiences for the mail-order company, the population cluster proportions are subtracted from their respective customer cluster proportions. Target audience clusters are those whose customer proportion exceeds its population counterpart, while non-target audience clusters are those whose customer proportion falls short of it. Based on figure 8, the mail-order company’s target audiences are in clusters 8 and 6; cluster 8 has the largest positive margin, while cluster 7 has the largest negative margin.

An important variable to understand here is the age group. The ALTERSKATEGORIE_FEIN feature – age category, fine scale – is not explained in any of the attachments for this project. However, GEBURTSJAHR – the year of birth – is directly related to it. Building a dictionary that links the two features reveals the meaning of ALTERSKATEGORIE_FEIN. Table 1 lists the 5-year birth-year interval associated with each age category value.
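A possible way to build that lookup, assuming both columns survive the cleaning steps and that rows with a missing or zero birth year are excluded:

# Birth-year range observed for each fine age category (the basis for table 1).
valid = azdias_final[azdias_final['GEBURTSJAHR'] > 0]
age_lookup = valid.groupby('ALTERSKATEGORIE_FEIN')['GEBURTSJAHR'].agg(['min', 'max'])
print(age_lookup)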

The heat map in figure 9 displays the proportion of each age category across the clusters to be analyzed. In both target audience clusters, the highest proportions fall among people born between 1930 and 1949. The non-target audience in cluster 3 shows almost the same pattern as the two target groups, whereas cluster 7 has higher proportions of people born between 1945 and 1969. Beyond age group alone, the cluster centroids give a broader insight into the similarities and differences between these clusters.
The cluster analysis and interpretation consist of taking the 8 cluster centroids and inverse-transforming them back to the original feature scales. However, since there are over 300 features, the interpretations are limited to the 10 features with the highest positive weights in the first two principal components – illustrated in figures 4 and 5.
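A sketch of the inverse transformation, reusing the fitted objects from the earlier sketches:

import pandas as pd

# Map the 8 centroids from PCA space back to the original feature scale.
centroids_scaled = pca.inverse_transform(kmeans.cluster_centers_)
centroids_original = scaler.inverse_transform(centroids_scaled)
centroids_df = pd.DataFrame(centroids_original, columns=azdias_final.columns)

# Inspect only the 10 features with the largest positive weights in PC1.
top_pc1 = pd.Series(pca.components_[0], index=azdias_final.columns).nlargest(10).index
print(centroids_df[top_pc1].round(1))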
Target Audiences

At first glance, the centroid values for the top first-principal-component features are similar across all audience clusters except cluster 7. For the top second-principal-component features, on the other hand, most of cluster 7’s centroid values are similar to those of the target audience clusters.

In cluster 6, customers tend to be new homeowners born in the 1940s. They live in single/double family homes in large communities and have low social mobility. Given their high scores for money saving and investing, they likely have substantial retirement funds. Although this group consists of average earners, they have had little to no financial transaction activity over the past two years. Furthermore, they have a high share of cars and moderate online affinity.

Similar to cluster 6, audiences in cluster 8 are typically born around the early 1940s, live in single/double family homes in large communities, and have low social mobility. Given their high money-saving scores, they likely have substantial retirement funds. Although this group consists of average earners, they have had little to no financial transaction activity over the past two years and show moderate online affinity. Unlike cluster 6, however, these customers are more experienced homeowners and have higher investment activity.
Non-Target Audiences

Similar to the target audiences, individuals in cluster 3 are homeowners born around the early 1940s who live in large communities and have low social mobility. With very high money-saving scores, they likely have substantial retirement funds. Although this group consists of average earners, they have had little to no financial transaction activity over the past two years. The key differences are that this group has only an average share of single/double family homes, a low share of car ownership, at least some transaction activity, and higher online affinity.

Cluster 7 displays noticeably different values from the other clusters. Like the target groups, it comprises homeowners living in single/double family homes in large communities, and their very high money-saving scores suggest substantial retirement funds. However, these individuals are younger, have more financial transaction activity over the past two years, and show a high level of online affinity.
To conclude the unsupervised analysis, the target audience for the marketing campaign consists of people born between 1935 and 1949 whose financial typology leans toward investing and money-saving. They also have little to no overall or mail-order transaction activity, even though they have moderate online affinity.
Part 3 – Supervised Learning
After performing cluster analysis on the German demographics and customer data sets to determine which individuals are most likely to be customers of the mail-order company, the next stage is to apply supervised learning to a separate data file. This data set is like the previous two, except that every individual in it was targeted by a mail-out campaign, and it includes a response variable indicating whether an individual became a customer of the mail-order company after the campaign. It therefore serves as a training set for building an optimal predictive model, which will then be applied to a separate testing set for the Kaggle competition.

Although the training data set has almost 43,000 samples, figure 10 shows that the response variable is heavily imbalanced: only about 1.24% of the targeted individuals became customers of the company. Since this is the target variable the prediction models will learn from, classifying whether an individual in the testing data will become a customer is a challenge. A way to remedy this is to apply resampling methods in addition to data preprocessing and training/validation/testing splits before training the predictive models.
Data Preprocessing
Once the roughly 300 predictors and the response variable were separated, the mail-out training data was split into training, validation, and testing sets using a validation ratio of 0.1 and a test ratio of 0.2. These three splits are used to assess the candidate models before retraining the best one on the entire mail-out data.
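A sketch of the split; stratifying on the rare positive class and the variable names X_mailout / y_mailout are assumptions:

from sklearn.model_selection import train_test_split

# 20% held out for testing, then 12.5% of the remainder (~10% of the full data)
# held out for validation; stratification keeps the ~1.24% response rate in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X_mailout, y_mailout, test_size=0.20, stratify=y_mailout, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.125, stratify=y_train, random_state=0)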
Following the split, the training, validation, and testing sets were each transformed with imputation, standardization, and PCA reduction to 154 components. These processed data sets are used to assess three prediction algorithms: logistic regression, AdaBoost, and extreme gradient boosting classifiers. The models are later cross-validated using the ROC AUC score rather than the f1-score, but not before an additional modification is made to the training data.
Resampling – ADASYN
Before fitting models on such imbalanced data, a resampling method is essential for an accurate model. The minority response – RESPONSE = 1 – is so rare that an out-of-the-box model would learn close to nothing about it, resulting in a weak classifier. To give the positive response more weight, the models need to see more of it; that is where oversampling comes in.
The two main approaches to resampling are undersampling and oversampling. Undersampling balances the data by randomly reducing the majority class – RESPONSE = 0 – to the size of the minority class. For this data, as in many applications, the major downside is that dropping so many majority-class observations loses information that could be crucial to the model’s learning. Oversampling, on the other hand, duplicates minority-class samples until they match the size of the majority class instead of omitting majority samples. However, this tactic can easily overfit the data because of the heavy repetition of minority samples¹.
Between the two tactics, oversampling is preferable here because it keeps information rather than dropping it². Even so, the imblearn package offers several enhanced oversampling methods that reduce the downsides of naive oversampling.

The technique used here is ADASYN – the Adaptive Synthetic Sampling Approach. It oversamples the minority class by generating synthetic versions of its samples. Consider a minority sample x_i from which a new sample x_new is generated using the original sample’s k-nearest neighbors. One of those neighbors, x_zi, is selected and the new observation is calculated as
x_new = x_i + λ (x_zi − x_i),
where λ is a random number between 0 and 1, so the new sample lies on the line segment between x_i and x_zi³. Figure 11 illustrates this process. Furthermore, ADASYN uses a density distribution as the criterion for deciding how many synthetic samples to generate for each minority sample, adaptively changing the weights of the different minority samples to account for skewed distributions⁴.
The ADASYN resampling was fit on the training split, increasing it from roughly 30,000 samples to over 61,044.
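A minimal sketch of the resampling step, assuming X_train_pca is the imputed, scaled, and PCA-reduced training split:

import numpy as np
from imblearn.over_sampling import ADASYN

adasyn = ADASYN(random_state=0)
X_train_res, y_train_res = adasyn.fit_resample(X_train_pca, y_train)

print(X_train_res.shape[0])                                # roughly doubles the split
print(np.bincount(np.asarray(y_train_res).astype(int)))    # classes roughly balanced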
Logistic Regression

With a default logistic regression model on the resampled training data, the validation and testing scores are similar to one another. For the positive responses, the recall is 0.43 and the precision is 0.02: 43% of the individuals who actually became customers were classified correctly, while only 2% of the individuals predicted as customers actually were. The validation ROC score is 0.53.
For the testing data, 55% of actual customers were predicted correctly and 2% of the predicted customers were actual customers. The ROC score is 0.59, higher than that of the validation set.
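These recall, precision, and ROC figures come from a validation workflow roughly like the following sketch; the variable names and the max_iter setting are assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Fit the default model on the resampled training split, then score the validation split.
clf = LogisticRegression(max_iter=500).fit(X_train_res, y_train_res)
print(classification_report(y_val, clf.predict(X_val_pca)))
print(roc_auc_score(y_val, clf.predict_proba(X_val_pca)[:, 1]))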
AdaBoost Classifier

With a default AdaBoost model on the resampled training data, the validation and testing scores are again similar. For the positive responses, the recall is 0.26 and the precision is 0.02: 26% of the individuals who actually became customers were predicted correctly, and 2% of the individuals predicted as customers actually were. The validation ROC score is 0.54.
For the testing data, 26% of actual customers were predicted correctly and 2% of the predicted customers were actual customers. The ROC score is 0.54, roughly the same as (but marginally higher than) the validation score.
Extreme Gradient Boosting Classifier

With a default extreme gradient boosting model on the resampled training data, the validation and testing scores are also similar. For the positive responses, the recall is 0.07 and the precision is 0.01: 7% of the individuals who actually became customers were predicted correctly, and 1% of the individuals predicted as customers actually were. The validation ROC score is 0.47, lower than for the other two models.
For the testing data, 10% of actual customers were predicted correctly and 1% of the predicted customers were actual customers. The ROC score is 0.49, higher than that of the validation set.
Model Selection
Based on the preliminary results of the default models above, if one model had to be picked, the best one moving forward would be logistic regression. Since the precision scores are essentially the same for all three models in validation and testing, the deciding metrics are the recall and ROC scores. Focusing on the positive response, even though it is the minority class, logistic regression has the highest recall and ROC scores on both the validation and testing sets.
However, the goal here is not simply to pick the model that predicts best out of the box. Although extreme gradient boosting could in principle achieve a higher ROC score than logistic regression, its default score turned out to be the lowest of the three, significantly below that of logistic regression. Does this mean logistic regression will also score higher than extreme gradient boosting on the real testing set? To find out, all three models need to be cross-validated and tuned, and then applied to the mail-out testing data to see whether logistic regression still beats the two ensemble models.
Cross-Validation Model Tuning
Logistic Regression
Once a generic logistic regression model is initialized, it is tuned with grid search cross-validation to find the optimal parameters on the entire training data. The model is defined inside a pipeline containing the three data transformations and the ADASYN oversampling, followed by the model itself. The parameter grid for this pipeline consists of three separate instances.
lr_parameters = [
    {'model__penalty': ['l1'],
     'model__C': np.arange(0.1, 1.1, 0.1),
     'model__max_iter': [300, 500, 800, 1300],
     'model__solver': ['saga']},
    {'model__penalty': ['l2'],
     'model__C': np.arange(0.1, 1.1, 0.1),
     'model__max_iter': [100, 200, 300, 500, 800],
     'model__solver': ['newton-cg', 'lbfgs']},
    {'model__penalty': ['l2'],
     'model__C': np.arange(0.1, 1.1, 0.1),
     'model__max_iter': [300, 500, 800, 1300],
     'model__solver': ['sag']}
]
Logistic regression accepts four different penalties, but only two are used here – l2 (the default) and l1. The parameter grid consists of one l1 set and two slightly different l2 sets. Each contains the C coefficient ranging from 0.1 to 1.1 in steps of 0.1, maximum iterations ranging from 100 to 1300 depending on the penalty, and different solvers. The first instance corresponds to l1, which only accepts the ‘saga’ or ‘liblinear’ solvers; since the model will be applied to a large data set, only ‘saga’ is used. The second instance is l2 with ‘newton-cg’ and ‘lbfgs’. The third instance is l2 with the ‘sag’ solver, which handles large data sets but needs at least 300 iterations here. After the grid search was complete, logistic regression with C = 0.8, 1300 maximum iterations, and the ‘sag’ solver yielded a ROC score of 0.63.
LogisticRegression(C=0.8, max_iter=1300, solver='sag')
As a result, applying this model to the mail-out testing set predicted 37.6% of individuals as customers, and the final submission scored 0.609.
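For reference, a hedged sketch of how the pipeline and grid search described above might be assembled with imblearn and scikit-learn; the step names, imputation strategy, and cross-validation settings are assumptions:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import ADASYN
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=154)),
    ('adasyn', ADASYN(random_state=0)),
    ('model', LogisticRegression())
])

lr_search = GridSearchCV(pipeline, lr_parameters, scoring='roc_auc', cv=5, n_jobs=-1)
lr_search.fit(X_mailout, y_mailout)   # full mail-out predictors and RESPONSE
print(lr_search.best_params_, lr_search.best_score_)

Keeping ADASYN inside the imblearn pipeline means the oversampling is applied only to the training folds during cross-validation, never to the folds used for scoring.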
AdaBoost Classifier
Unlike logistic regression, the AdaBoost parameter search is conducted with a randomized search. Although there are only two parameters to search over, each fit can take a long time. The learning-rate values follow a Fibonacci-like sequence from 0.1 to 0.8, plus the default value of 1. The numbers of estimators are laid out similarly from 100 to 300, plus the value of 50.
ab_parameters = {'model__learning_rate': [0.1, 0.2, 0.3, 0.5, 0.8, 1],
                 'model__n_estimators': [50, 100, 200, 300]}
The randomized search on the AdaBoost classifier consists of 12 iterations; the model with the highest score has a learning rate of 0.2 and 200 estimators.
AdaBoostClassifier(learning_rate=0.2, n_estimators=200,
random_state=0)
Applying this model to the mail-out testing set predicted 22% of individuals as customers, and the final submission scored 0.579.
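The randomized search could be wired up along these lines, reusing the pipeline from the logistic regression sketch; the cross-validation settings are again assumptions:

from sklearn.base import clone
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Same preprocessing and oversampling steps, with AdaBoost as the final estimator.
ab_pipeline = clone(pipeline).set_params(model=AdaBoostClassifier(random_state=0))
ab_search = RandomizedSearchCV(ab_pipeline, ab_parameters, n_iter=12,
                               scoring='roc_auc', cv=5, random_state=0, n_jobs=-1)
ab_search.fit(X_mailout, y_mailout)
print(ab_search.best_params_)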
Extreme Gradient Boosting Classifier
The parameter search for this model differs from the other two, since extreme gradient boosting is more complex than AdaBoost in both its parameters and its capabilities. The pipeline is defined the same way as for the other two models. Out of the 27 different parameters for this algorithm, I chose the five most important ones – maximum depth, alpha, subsample, number of trees, and learning rate.
xgb_parameters = {
    'model__max_depth': np.arange(3, 13, 1),
    'model__alpha': np.arange(0.1, 0.8, 0.05),
    'model__subsample': [None, 0.1, 0.25, 0.5, 0.75, 1],
    'model__n_estimators': np.arange(10, 200, 10),
    'model__learning_rate': np.linspace(0.01, 0.8, 10)
}
After defining the parameters, the cross-validation is again performed with a randomized search instead of a grid search. Exhaustively searching the XGBoost grid defined here would cost a great deal of time and computation, so a limited, randomized search with a set number of iterations addresses both issues. For this search, 45 iterations were used to increase the chances of finding a near-optimal model, since the full grid contains hundreds of combinations that could take days to evaluate. In the end, the optimal XGBoost classifier has the following parameters, in addition to the defaults that were not part of the search.
XGBClassifier(alpha=0.7, base_score=0.5, booster=None, colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1,
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints=None,
learning_rate=0.09777777777777778,
max_delta_step=0, max_depth=3,
min_child_weight=1, missing=nan,
monotone_constraints=None, n_estimators=90,
n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0.699999988, reg_lambda=1,
scale_pos_weight=1, subsample=1,
tree_method=None, validate_parameters=False,
verbosity=None)
Applying this model to the mail-out testing set predicted 20% of individuals as customers, and the final submission scored 0.587.
Out of the three models, the one that attained the highest score in the Kaggle competition turned out to be logistic regression – just as the basic classification reports suggested.

Conclusion
The goals of this capstone project involved both unsupervised and supervised learning on related data sets from Arvato Financial Solutions. The unsupervised analysis used principal component analysis and K-Means clustering to describe which individuals are most likely to become customers of a mail-order company. The supervised analysis focused on scaling and oversampling a training set so that a predictive model could accurately predict whether an individual targeted by the campaign becomes a customer of the company. This involved initializing and tuning three different models to determine which one is most accurate with the applied scaling and resampling.
Future Work
The supervised results could be improved with further tuning of the extreme gradient boosting algorithm, either by adding more parameters to the search or by increasing the number of iterations. Even though the default XGBoost validation and testing results were the weakest of the three, its ROC score improved by almost 10% after tuning. The imbalance in the training data also suggests that only a small percentage of individuals in the testing data are customers. With that in mind, a more thorough randomized search or an alternative parameter search could produce an XGBoost model that surpasses the logistic regression score, considering how close the two final scores are and that XGBoost made the lowest percentage of customer predictions.
For the full analysis, see my GitHub repository here.