Customer Segmentation Report for Arvato Financial Solutions

Elena Ivanova
Towards Data Science
15 min read · Dec 3, 2018


Capstone Project for Udacity Data Scientist NanoDegree

Introduction

In this project, supervised and unsupervised learning techniques are used to analyze demographic data of customers of a mail-order sales company in Germany and compare it against demographic information for the general population. The goal of the project is to characterize the segments of the population that correspond to the company's customers and to build a model that can predict likely customers for Arvato Financial Solutions.

The data for this project is provided by Udacity's partners at Bertelsmann Arvato Analytics and represents a real-life data science task. It includes a general population dataset, a customer segment dataset, a mailout campaign dataset with responses, and a test dataset for which predictions need to be made.

Problem Statement

There are four main parts of the project:

  1. Data Preprocessing

In this part we need to preprocess the data for further analysis.

Missing values will be analysed by columns and rows, and the data will be divided by type, followed by the appropriate transformations.

2. Customer Segmentation

In this part we need to analyze the general population and customer segment datasets and use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company.

I will use principal component analysis (PCA) for dimensionality reduction. Then, an elbow curve will be used to identify the best number of clusters for the KMeans algorithm. Finally, I will apply KMeans to segment the population and the customers and describe the target cluster for the company.

3. Supervised Learning Model

In this part we need to build a machine learning model using the responses from a marketing campaign and use the model to predict which individuals are most likely to convert into customers for the company.

I will try several machine learning classifiers, choose the best one by analysing learning curves, and then tune the model's parameters and make predictions.

4. Kaggle Competition

The results of this part need to be submitted to the Kaggle competition.

Metrics

The area under the receiver operating characteristic curve (ROC AUC), computed from predicted probabilities, will be used to evaluate the performance of the models. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random target person higher than a random non-target person. Thus, ROC analysis provides a tool for selecting possibly optimal models for the customer prediction task.
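As a quick illustration, here is a minimal sketch of how this metric can be computed with scikit-learn; the classifier model and the validation split X_val, y_val are placeholder names for this sketch, not objects from the project code.

from sklearn.metrics import roc_auc_score

# Placeholder names: `model` is any fitted binary classifier, X_val/y_val a held-out split.
# ROC AUC is computed from the predicted probability of the positive class,
# not from hard 0/1 predictions.
proba = model.predict_proba(X_val)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_val, proba):.4f}")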

Results and Discussion

Data Preprocessing

There are two data description Excel spreadsheets and four data files associated with this project:

  • AZDIAS: Demographics data for the general population of Germany; 891211 persons (rows) and 366 features (columns).
Descriptive statistics for the first few attributes of the AZDIAS dataset
  • CUSTOMERS: Demographics data for customers of a mail-order company; 191652 persons and 369 features.
Descriptive statistics for the first few attributes of the CUSTOMERS dataset
  • MAILOUT_TRAIN: Demographics data for individuals who were targets of a marketing campaign; 42982 persons and 367 features, including the people's responses.
  • MAILOUT_TEST: Demographics data for individuals who were targets of a marketing campaign; 42833 persons and 366 features.

Unfortunately, there are a lot of missing values in these datasets, and not all of the listed features are explained in the provided Excel spreadsheets.

Analysis of columns and rows with missing values

First, I created a Python dictionary of missing-value codes, where each key is an attribute name and each value is the list of missing-value codes parsed from DIAS Attributs — Values 2017.xlsx.

Interestingly, there are only 275 out of 366 attributes in the dictionary, which means that many features are not listed in the attribute description file; in addition, for some attributes the missing-value codes are simply not entered, and the missing entries appear in the dataset as numpy not-a-number values (np.nan).

First three items from missing keys dictionary

Next, the values that correspond to missing-value codes in the AZDIAS dataset were converted to np.nan, and the final number of missing values was analyzed for each attribute.
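A minimal sketch of how such a dictionary can be built and applied, assuming the description spreadsheet has Attribute, Value and Meaning columns and that azdias is the loaded AZDIAS dataframe; the column names and header offset are assumptions, not taken from the project code.

import numpy as np
import pandas as pd

attr_values = pd.read_excel('DIAS Attributs - Values 2017.xlsx', header=1)
attr_values['Attribute'] = attr_values['Attribute'].ffill()  # attribute name spans several rows

# Rows whose meaning mentions "unknown" hold the missing-value codes
unknown = attr_values[attr_values['Meaning'].astype(str).str.contains('unknown', case=False)]

missing_codes = {}
for attr, raw in zip(unknown['Attribute'], unknown['Value']):
    codes = []
    for code in str(raw).split(','):
        code = code.strip()
        codes.append(int(code) if code.lstrip('-').isdigit() else code)
    missing_codes[attr] = codes

# Convert coded missing values to np.nan in the demographics data
for attr, codes in missing_codes.items():
    if attr in azdias.columns:
        azdias[attr] = azdias[attr].replace(codes, np.nan)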

The analysis demonstrates that most of the columns have less than 30% missing data, while 41 attributes have more than 30% missing data (see the distribution of missing values in these columns below). These 41 attributes were dropped from the analysis.

Attributes with more than 30% of missing values dropped from analysis

Additionally, other columns were dropped for the following reasons:

  • the identifier column with unique values, LNR;
  • categorical columns with more than 10 categories, to avoid creating many additional attributes after one-hot encoding (CAMEO_INTL_2015 is an exception);
  • columns that repeat information from another feature (e.g. fein vs grob);
  • some attributes for which no description was given and whose meaning and type (categorical vs ordinal vs mixed) are hard to infer.

The assessment of missing values by rows demonstrates that the maximum number of missing entries in a row is 233 out of the 303 attributes left after dropping columns. The distribution of the amount of missing data per row shows that most rows have fewer than 25 missing attributes. So, the data was divided into two subsets: rows with <=25 missing attributes (737235 rows) and rows with >25 missing attributes (153986 rows). A comparison of the value distributions for 6 randomly chosen columns shows that the two subsets have similar distributions (see the bar plots for 6 different attributes, few-NaN vs many-NaN, below).

Comparison of distribution of values between dataset with few missing values (blue) vs dataset with many missing values (orange)
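A short sketch of this split, assuming azdias already has its missing-value codes converted to np.nan and the high-missing columns dropped:

row_nan_counts = azdias.isnull().sum(axis=1)

azdias_few_nan = azdias[row_nan_counts <= 25]   # kept for the analysis (737235 rows)
azdias_many_nan = azdias[row_nan_counts > 25]   # set aside (153986 rows)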

Assigning types of attributes

Part of the manually created table with attribute types and actions.

In order to proceed to the data engineering and transformation step, all attributes should be assigned one of the following types: categorical, numerical, ordinal, binary or mixed. For this step, I manually created an attribute_types.csv file with columns for attribute name, type, action (keep, drop, one-hot or engineering) and drop reason.

Since not all of the columns were provided with a description, I had to find similarly named columns and guess the type. Such columns are listed with a question mark in front of the type (e.g. ?categorical). Attributes without a description that are not similar to any listed attribute are treated as categorical.

The OST_WEST_KZ feature was encoded as 0 for the Ost and 1 for the West movement pattern. The time feature EIGEFUEGT_AM was reduced to just the year. Four mixed features were re-engineered as follows (a sketch of this step follows the list):

  • PRAEGENDE_JUGENDJAHRE —> MOVEMENT (1: Mainstream, 2: Avantgarde) and GENERATION_DECADE (4: 40s, 5: 50s, 6: 60s, 7: 70s, 8: 80s, 9: 90s)
  • CAMEO_INTL_2015 —> WEALTH (1: Wealthy Household, 2: Prosperous Households, 3: Comfortable Households, 4: Less Affluent Households, 5: Poorer Households) and LIFE_AGE (1: Pre-Family Couples & Singles, 2: Young Couples With Children, 3: Families With School Age Children, 4: Older Families & Mature Couples, 5: Elders In Retirement)
  • WOHNLAGE —> RURAL_NEIGBORHOOD (0: Not Rural, 1: Rural)
  • PLZ8_BAUMAX —> PLZ8_BAUMAX_FAMILY (0: 0 families, 1: mainly 1–2 family homes, 2: mainly 3–5 family homes, 3: mainly 6–10 family homes, 4: mainly 10+ family homes) and PLZ8_BAUMAX_BUSINESS (0: Not Business, 1: Business)
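Below is a minimal sketch of this re-engineering for three of the features, assuming the cleaned dataframe is called df; the PRAEGENDE_JUGENDJAHRE mappings shown here are abbreviated and illustrative, not the full dictionaries used in the project.

import numpy as np
import pandas as pd

def engineer_mixed_features(df):
    """Split selected mixed-type attributes into simpler engineered features (sketch)."""
    df = df.copy()

    # CAMEO_INTL_2015 is a two-digit code: tens digit = wealth, ones digit = life stage
    cameo = pd.to_numeric(df['CAMEO_INTL_2015'], errors='coerce')
    df['WEALTH'] = cameo // 10
    df['LIFE_AGE'] = cameo % 10

    # WOHNLAGE: codes 7 and 8 correspond to rural neighbourhoods
    df['RURAL_NEIGBORHOOD'] = df['WOHNLAGE'].isin([7, 8]).astype(float)
    df.loc[df['WOHNLAGE'].isnull(), 'RURAL_NEIGBORHOOD'] = np.nan

    # PRAEGENDE_JUGENDJAHRE: decade and movement are looked up in mapping dictionaries
    # parsed from the attribute description (abbreviated here; PLZ8_BAUMAX is handled similarly)
    decade_map = {1: 4, 2: 4, 3: 5, 4: 5, 5: 6}       # ...continued for all codes
    movement_map = {1: 1, 2: 2, 3: 1, 4: 2, 5: 1}     # 1: Mainstream, 2: Avantgarde
    df['GENERATION_DECADE'] = df['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
    df['MOVEMENT'] = df['PRAEGENDE_JUGENDJAHRE'].map(movement_map)

    return df.drop(columns=['CAMEO_INTL_2015', 'WOHNLAGE', 'PRAEGENDE_JUGENDJAHRE'])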

Standardization and dealing with NaNs

Before applying dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale between features. For this step, all missing (np.nan) data must be dropped or imputed. There are several strategies for dealing with NaNs. The simple option of dropping data points with NaNs would result in a large loss of information (~30%). A better strategy is imputation of missing values with the mean or the most frequent value. Imputed values will not be exactly right in most cases and may lie systematically above or below the actual values, but imputation usually gives more accurate results than dropping data points or columns entirely.

Therefore, I chose to impute binary data with the most frequent value and numerical data with the median value of the column. Categorical missing values are assigned zeros after transformation to dummy columns. Standardization was then used to scale the features. The advantage of standardization is that it does not bound values to a specific range and is much less affected by outliers.

Data transformation pipeline

Distribution of skewed data for ANZ_HAUSHALTE_AKTIV attribute (skew=8.3)

First, I identified skewed numerical continuous attributes using the pandas skew method with a skew threshold of 1.0. These attributes were assigned to log_attributes, and a Pipeline was created for them with the following steps: natural logarithmic transformation, imputation by median and standard scaling. Binary attributes (bin_attributes) underwent only imputation by the most frequent value. The DummiesTransformation class was created for the transformation of categorical attributes (cat_attributes); in this transformation no redundant dummy columns were dropped, in order to keep the hidden information about missing values encoded by zeros. All other numerical and ordinal attributes (num_attributes) were combined into a Pipeline with imputation by median and standard scaling. Lastly, a ColumnTransformer object was created that combines the feature transformation Pipelines into a single transformer. This object was further utilized for the data transformations in all parts of the project.

Column transformation scheme for pipeline column transformer

import sklearn.compose

transformers = [('log', log_pipeline, log_attributes),
                ('binary', binary_pipeline, bin_attributes),
                ('cat', cat_pipeline, cat_attributes),
                ('num', num_pipeline, num_attributes)]

ct = sklearn.compose.ColumnTransformer(transformers=transformers)
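For completeness, here is a sketch of how the individual pipelines above might look; the DummiesTransformation class shown here is a simplified stand-in for the custom transformer described in the text (not the project's exact implementation), and np.log1p stands in for the natural logarithmic transformation.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

class DummiesTransformation(BaseEstimator, TransformerMixin):
    """Simplified stand-in: one-hot encodes categorical columns and keeps every dummy
    column, so rows with missing values are encoded as all zeros."""
    def fit(self, X, y=None):
        self.columns_ = pd.get_dummies(pd.DataFrame(X).astype('object')).columns
        return self

    def transform(self, X):
        dummies = pd.get_dummies(pd.DataFrame(X).astype('object'))
        return dummies.reindex(columns=self.columns_, fill_value=0).values

log_pipeline = Pipeline([
    ('log', FunctionTransformer(np.log1p)),            # log(1 + x) transform of skewed features
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

binary_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
])

cat_pipeline = Pipeline([
    ('dummies', DummiesTransformation()),
])

num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])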

Quick note about outliers

I used the Tukey rule to detect outliers, i.e. points lying more than 1.5 times the interquartile range (IQR = Q3 − Q1) below Q1 or above Q3. However, removing the outliers of just the six attributes that underwent log transformation would result in a loss of about ~30% of the data, so I decided to keep those points.
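A sketch of this check, assuming azdias_few_nan and the log_attributes list from the earlier steps:

def tukey_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Fraction of rows flagged as an outlier in at least one log-transformed attribute
outlier_rows = azdias_few_nan[log_attributes].apply(tukey_outlier_mask).any(axis=1)
print(f"Rows flagged by the Tukey rule: {outlier_rows.mean():.1%}")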

Data preprocessing summary

In summary, all datasets undergo a cleaning procedure (the clean_data function) that: (1) drops columns and rows, (2) re-engineers the 4 mixed-type features and (3) applies the ColumnTransformer consisting of the feature transformation pipelines. It is noteworthy that using a pipeline greatly facilitates the transformation and the building of the machine learning model: it prevents data leakage that could otherwise lead to an overfit algorithm, improves code organization and reduces the risk of mixing up columns between the training and testing sets.

Overall, after the extract-transform-load (ETL) step, the AZDIAS data was transformed into a 737235 rows x 410 columns dataset and the CUSTOMERS data into a 134245 rows x 410 columns dataset.

Customer Segmentation

PCA

Principal component analysis (PCA) was applied to the data for dimensionality reduction. I plotted the cumulative explained variance against the number of principal components (shown by the line) along with the explained variance bar plot (see below). The explained variance visibly flattens after ~175 components, and this number of transformed features retains 92% of the explained variance. So, 175 transformed features were kept for the clustering part of the project.

The plot of fraction of explained variance in the data represented by each principal component vs number of principal components.
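A minimal sketch of this step, assuming azdias_transformed holds the output of the ColumnTransformer on the cleaned AZDIAS data:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(azdias_transformed)

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

plt.bar(range(1, len(explained) + 1), explained)
plt.plot(range(1, len(cumulative) + 1), cumulative, color='red')
plt.xlabel('Number of principal components')
plt.ylabel('Fraction of explained variance')
plt.show()

print(f"Variance explained by 175 components: {cumulative[174]:.2%}")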

The first, second and third principal components explain 8.0%, 6.0% and 5.0% of the variance, respectively, and relate to the following three latent features (see the sketch after this list for how component weights can be mapped back to the original features):

  1. The most prominent positive features are CAMEO_DEUG_2015, HH_EINKOMMEN_SCORE and WEALTH, and the most prominent negative features are MOBI_REGIO, KBA13_ANTG1 and KBA05_ANTG1. So, the first principal component is associated with family size, wealth and household type: higher component values correspond to larger families with higher income and wealth, while the negative weights mean the component is inversely related to the probability of sharing a building with other families and to low mobility.
  2. The most prominent positive features are KBA13_HERST_BMW_BENZ, KBA13_SEG_OBEREMITTELKLASSE, KBA13_MERCEDES and KBA13_BMW; the most prominent negative feature is KBA13_SITZE_5. This component is associated with ownership of expensive cars.
  3. The most prominent positive features are GENERATION_DECADE, CJT_TYP_1, CJT_TYP_2 and FINANZ_SPARER; the most prominent negative features are KOMBIALTER and CJT_TYP_5. This component is associated with age and with customer and financial typology: higher values relate to a higher probability of people being older and money savers.
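A sketch of how the component weights can be inspected, assuming feature_names lists the columns of the ColumnTransformer output in order (an assumption about how the transformed feature names are tracked):

import pandas as pd

def top_component_features(pca, feature_names, component, n=6):
    """Strongest positive and negative feature weights of one principal component."""
    weights = pd.Series(pca.components_[component], index=feature_names).sort_values()
    return weights.tail(n)[::-1], weights.head(n)

positive, negative = top_component_features(pca, feature_names, component=0)
print(positive)   # most prominent positive features of the first component
print(negative)   # most prominent negative features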

K-means clustering

Average sum of squared errors (SSE) vs number of clusters

The elbow method was used to identify an appropriate number of clusters for k-means clustering on the PCA-transformed data. The average within-cluster sum of squared errors (SSE) was plotted against the number of clusters, from 1 to 30. For this plot, the clustering was created using the MiniBatchKMeans method with batch_size=40000.

The plot demonstrates that the score decreases rapidly over the first 16 clusters, rises slightly at 17 clusters, and then continues to decrease more slowly for higher numbers of clusters. So, 16 clusters were selected as the number of clusters for k-means.
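A sketch of the elbow computation, assuming azdias_pca holds the 175-component PCA projection of the AZDIAS data (the random_state is added here only for reproducibility):

import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans

sse = []
cluster_range = range(1, 31)
for k in cluster_range:
    km = MiniBatchKMeans(n_clusters=k, batch_size=40000, random_state=42)
    km.fit(azdias_pca)
    sse.append(km.inertia_ / len(azdias_pca))   # average within-cluster SSE

plt.plot(cluster_range, sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Average SSE')
plt.show()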

Unsupervised machine learning pipeline

An unsupervised learning pipeline was created, consisting of the following steps: data transformation, PCA and KMeans (see below).

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

cluster_pipeline = Pipeline([
    ('transform', ct),
    ('pca', PCA(n_components=175)),
    ('kmeans', KMeans(n_clusters=16)),
])

cluster_pipeline.fit(azdias_cleaned)
general_predictions = cluster_pipeline.predict(azdias_cleaned)
customers_predictions = cluster_pipeline.predict(customers_cleaned)

The fit and predict methods were applied to the AZDIAS data, and the predict method was applied to the CUSTOMERS data.

Comparison of CUSTOMERS data to AZDIAS data

The clustering results for the general population (AZDIAS) data and the customer (CUSTOMERS) data were compared to each other using the proportion of people in each cluster.

Proportion of people for the general population and the customer data
Difference in proportion between customers and general audience: positive is overrepresented and negative is underrepresented

The comparison of the proportion of people in each cluster and the difference in proportion between the customer and general audiences (customers_ratio − general_ratio) demonstrates that there are clusters with both overrepresentation and underrepresentation of customers. The clusters with the highest positive difference are overrepresented in the customer data (clusters of interest #6 and #8), while the clusters with the highest negative difference are underrepresented in the customer data (no-interest clusters #0 and #7). Their cluster centers, which live in the latent feature space, were mapped back to the original features to identify the corresponding types of people.
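A sketch of how these comparisons can be made from the pipeline above; note that the inverse-transformed center is still in the scaled feature space, so undoing the scaler would be an additional step.

import pandas as pd

# Cluster proportions in the general population and among customers
general_ratio = pd.Series(general_predictions).value_counts(normalize=True).sort_index()
customers_ratio = pd.Series(customers_predictions).value_counts(normalize=True).sort_index()
difference = (customers_ratio - general_ratio).sort_values(ascending=False)

print(difference.head(2))   # most overrepresented clusters
print(difference.tail(2))   # most underrepresented clusters

# Map one cluster center from the latent (PCA) space back to the transformed features
pca_step = cluster_pipeline.named_steps['pca']
kmeans_step = cluster_pipeline.named_steps['kmeans']
center_in_feature_space = pca_step.inverse_transform(kmeans_step.cluster_centers_[6])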

Comparison table of attribute values for overrepresented and underrepresented clusters.

So, the people who use the mail-order company (#6 and #8) are wealthy, 45–60 years old (ALTERSKATEGORIE_GROB ≈ 3.4), belong on average to the 60s generation decade (GENERATION_DECADE = 6.1 and 6.6, respectively) and are money savers or investors with high probability (FINANZ_SPARER = 1.3 and 1.6, respectively). They are high earners (LP_STATUS_GROB = 4), in contrast to the low earners (1.2 and 2.3) in the underrepresented clusters. These people have a low movement pattern (MOBI_REGIO = 4.2) and are religious and traditional-minded (SEMIO features).

There are also some differences between the two overrepresented clusters in categories such as D19_GESAMT_DATUM (8.2 vs 2.8), D19_KINDERARTIKEL (0.6 vs 2.2), D19_VERSAND_DATUM (9.0 vs 3.3) and D19_KONSUMTYP_MAX (6.1 vs 2.1). Not all of the listed features come with a description, but overall the D19_* features are related to the frequency of using certain product groups. Thus, the method distinguished two groups with high and low transaction activity.

On the other hand, the people who are underrepresented among the company's customers are up to 45 years old (ALTERSKATEGORIE_GROB = 1.8 and 1.9). They were born in the 80s and 90s (GENERATION_DECADE = 8.6) and have low financial interest (FINANZ_MINIMALIST = 1 and 2). These people are poor and are money savers with very low probability (FINANZ_SPARER > 3.9). One of the clusters shows a high movement pattern (MOBI_REGIO = 1.9), meaning these people do not own homes, while the other shows a middle movement pattern (MOBI_REGIO = 3.8). These people are sensual-minded with higher probability, and religious and traditional-minded with very low probability.

Comparison bar plot for SEMIO_REL, GENERATION_DECADE, WEALTH and FINANZ_SPARER feature in four clusters

Supervised Learning Model

Now we are ready to analyze the MAILOUT dataset and build a supervised learning model that predicts whether a person becomes a customer. Each row in the MAILOUT data files represents an individual who was targeted by a mailout campaign, with the outcome stored in the RESPONSE column. It is noteworthy that only ~1.2% of the MAILOUT_TRAIN dataset are customers. Here, a similar cleaning and transformation pipeline was applied, except that no rows were dropped, to avoid losing information (using the test_set=True argument of the clean_data function).

Generally, to build a supervised model we need to split the data into training and testing sets, build the model on the training set and make predictions on the testing set. In our case, we are already given the MAILOUT_TEST dataset and can evaluate the model's performance by submitting predictions to the Kaggle competition. Therefore, there are two possible approaches for this part:

  • split the dataset into training and validation sets
  • use a cross-validation technique

I decided to work with cross-validation because there are only 532 customers (1.2%). If the dataset were split, this number would be reduced by a further 20%. So, I used 5-fold cross-validation (the default in scikit-learn 0.22) to obtain learning curves and to parameterize the model with GridSearchCV.

Supervised learning pipeline

A supervised learning pipeline was created consisting of two steps: the column transformer introduced before and a classifier.

pipeline = Pipeline([
    ('transform', ct),
    ('classifier', classifier),
])

# The RESPONSE column of MAILOUT_TRAIN is used as the target
pipeline.fit(mailout_train_cleaned, mailout_train_response)
predictions = pipeline.predict_proba(mailout_train_cleaned)[:, 1]   # probability of RESPONSE = 1

Classifiers

Several ensemble methods were tested with default parameters to choose the best classifier. Ensemble learning builds a model on top of other models. The Random Forest, AdaBoost and Gradient Boosting classifiers in scikit-learn are all ensemble classifiers built on top of decision trees (by default).
These ensemble models work well both for classification and regression problems and in general have the following benefits:

  • they improve the stability and accuracy of machine learning algorithms;
  • they reduce the variance of a single estimate by combining estimates from several models;
  • they help to avoid overfitting.

Classifier evaluation
For model evaluation, the learning curve method was used. A learning curve shows the validation and training score of an estimator for varying numbers of training samples (from 10% to 100% of the training data). It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

Learning curves for three different classifiers, score = ROC AUC
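A sketch of how these curves can be produced, assuming mailout_train_cleaned and mailout_train_response hold the cleaned features and the RESPONSE target (variable names are illustrative):

import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline

classifiers = {
    'RandomForest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'GradientBoosting': GradientBoostingClassifier(),
}

for name, clf in classifiers.items():
    pipe = Pipeline([('transform', ct), ('classifier', clf)])
    sizes, train_scores, val_scores = learning_curve(
        pipe, mailout_train_cleaned, mailout_train_response,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='roc_auc')
    print(name,
          'train:', np.round(train_scores.mean(axis=1), 3),
          'validation:', np.round(val_scores.mean(axis=1), 3))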

Analysis of the learning curves shows that the Random Forest Classifier performs poorly, with a ROC AUC score of only 53% on both the training and the validation set. The two curves have already converged, so adding more data will not improve this classifier; this is a high-bias model.

The AdaBoost Classifier demonstrates better performance: as the training set grows, the training ROC AUC score decreases to 84% while the validation score increases to 72%. The two curves have almost converged, and the score would not improve substantially with more training points.

The validation score of the Gradient Boosting Classifier increases to 76%, while the training score decreases to 92%. The two curves have not yet converged, so adding more points could improve the score further. Gradient boosting is therefore the optimal model and will be used in the grid search parametrization.

Classifier parametrization

The parametrization of the Gradient Boosting Classifier was performed using grid search. Grid search tests all possible combinations of the specified hyperparameters using cross-validation and then chooses the model with the maximum score (ROC AUC) on the validation set. The initialization of the gradient boosting pipeline and the hyperparameters tested in the grid search are shown below.

gbc_pipeline = Pipeline([('transform', ct),
                         ('classifier', GradientBoostingClassifier())])

parameters = {'classifier__learning_rate': [0.1, 0.2],
              'classifier__n_estimators': [100],
              'classifier__max_depth': [3, 5],
              'classifier__min_samples_split': [2, 4]}

grid_obj = GridSearchCV(gbc_pipeline, parameters, scoring='roc_auc')

The best classifier has the following hyperparameters: learning_rate=0.1, n_estimators=100, decision tree max_depth=3 and decision tree min_samples_split=4 (default=2). The score calculated over the training set using the parameterized model is insignificantly higher than that of the default model (0.8976 vs 0.8968).

The feature importances plot demonstrates that the most important feature is D19_SOZIALES, for which a description is unfortunately not available. The next most important features relate to consumption type (D19_KONSUMTYP_MAX) and the year of movement, EINGESOGENAM_HH_JAHR (description not available).
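A sketch of how the importances can be read from the tuned pipeline, assuming feature_names matches the ColumnTransformer output order:

import pandas as pd

best_pipeline = grid_obj.best_estimator_               # fitted pipeline chosen by the grid search
importances = best_pipeline.named_steps['classifier'].feature_importances_

feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(feature_importances.head(10))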

Feature importances plot

Kaggle Competition

Finally, the supervised model was applied to the MAILOUT_TEST data. The obtained probabilities of being a customer were submitted to the Kaggle competition. The final score is 0.789, which is only 0.009 lower than the current winning score.
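A sketch of the submission step; mailout_test_cleaned and mailout_test_lnr (the LNR identifiers kept aside before cleaning) are illustrative names, and the column layout of the submission file is an assumption about the competition format.

import pandas as pd

test_proba = grid_obj.best_estimator_.predict_proba(mailout_test_cleaned)[:, 1]

submission = pd.DataFrame({'LNR': mailout_test_lnr, 'RESPONSE': test_proba})
submission.to_csv('kaggle_submission.csv', index=False)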

Conclusions

In this project, provided by Udacity's partners at Bertelsmann Arvato Analytics, real-life demographic data of the German population and of a customer segment was analyzed.

  • In the first part, the assessment and preprocessing of the data was performed. This was one of the most difficult steps, because there were 366 columns to analyze and not all of them had a description. A lot of missing values and missing attribute information were identified. A column transformation pipeline was created and further utilized in the supervised and unsupervised parts.
  • In the unsupervised part, dimensionality reduction with PCA was performed down to 175 latent features that describe 92% of the explained variance. KMeans clustering into 16 clusters identified 2 clusters that represent the target customers of the company. These are traditionally minded, wealthy people of 45–60 years of age.
  • Lastly, a Gradient Boosting Classifier was selected and parameterized to build the supervised model and make predictions on the test dataset for Kaggle. The resulting performance of the supervised learning algorithm is 78.9%.

This project could potentially be improved. For example, there are other ways to preprocess the data: choosing another threshold for dropping rows and columns, choosing different transformations for the columns, applying a MinMax scaler instead of the standard scaler, or imputing data in another way.

The supervised model could also be tested with PCA dimensionality reduction. We could also choose the attributes that differ most between the overrepresented and underrepresented clusters and build the supervised model using only these attributes.

Lastly, a trick that I would like to try is to add data from the CUSTOMERS dataset to the MAILOUT_TRAIN dataset so that there are equal numbers of customers and non-customers. Since only 1.2% of this dataset are customers, a model trained on a 50/50 dataset could be a much better predictor.
