
Bertelsmann Arvato Financial Solutions Customer Segmentation

Nowadays, with big data becoming a reality, people focus on how to use data to realize commercial value. One of the more mature areas is profiling potential customers and predicting customer behavior in order to target the market, or individual customers, more precisely.

Problem statement

Bertelsmann Arvato Financial Solutions provided a real-world challenge on Udacity, together with four demographic datasets:

  1. Demographics data for the general population of Germany; 891,211 persons (rows) x 366 features (columns), referred to as azdias.
  2. Demographics data for customers of a mail-order company; 191,652 persons (rows) x 369 features (columns), referred to as customers.
  3. Demographics data for individuals who were targets of a marketing campaign, with responses; 42,982 persons (rows) x 367 columns, referred to as mailout_train.
  4. Demographics data for individuals who were targets of a marketing campaign, without responses; 42,833 persons (rows) x 366 columns, referred to as mailout_test.

With the data,

  1. Can you perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company? With this, marketing can target customers precisely, improve the customer conversion ratio and reduce cost.
  2. Can you develop a model to predict which individuals will respond to a campaign? This can improve the advertising conversion rate and reduce campaign cost.

Metrics

Considering the characteristics of the two problems, two types of solutions are provided:

For question 1, since there is no label, unsupervised machine learning is used. The dataset is high-dimensional, so PCA (Principal Component Analysis) is first used to reduce the dimensions; the Elbow method is then used to choose the best number of clusters K for KMeans clustering; finally, the clusters are combined with the original data for analysis.

For question 2, there is a label in the training set and the target is to predict the test dataset, so supervised machine learning is adopted. There is a significant class imbalance in the training dataset: only 532 responses out of 42,962 customers, i.e. roughly 1% positive values.

Which metric should be chosen to evaluate the model? Let’s have a look at the confusion matrix.

Because of the large class imbalance, where most individuals did not respond to the mailout, predicting individual classes and using accuracy is not an appropriate way to evaluate performance. Instead, the project uses AUC, the Area Under the ROC (Receiver Operating Characteristic) Curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

Some definitions for your reference:

Sensitivity = TP / (TP + FN), Specificity = TN / (FP + TN)

TPR = Sensitivity = TP / (TP + FN), FPR = 1 - Specificity = FP / (FP + TN)

(TP, TN, FP, FN: please refer to Pics 1, the confusion matrix.)
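As a minimal sketch of how this metric is computed in practice with scikit-learn (y_true and y_score below are hypothetical labels and predicted probabilities, not data from the project):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.1, 0.3, 0.8, 0.2, 0.6, 0.4, 0.05, 0.9])

# Points of the ROC curve: FPR and TPR at different thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve, the evaluation metric used in this project
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")
```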

1. EDA and Data Cleaning

For such a big file with more than 300 features, it is very difficult to check the data column by column or row by row. Exploratory data analysis helps us understand the data and clean it.

Pics 3 shows the column-wise null-value percentages for the general population; the median (50th percentile) null-value percentage is 0.12.

Pics 5 shows the column-wise null-value percentages for the customer dataset; there the median null-value percentage is 0.27.

After investigation, the four datasets turn out to have almost the same features, apart from a few additional columns, so the same cleaning process is applied to all four datasets.

The data cleaning process is summarized as follows (a code sketch follows the list):

  1. Convert unknown values into NaN. Unknowns are encoded differently depending on the source, e.g. the numerical values -1 and 0, or the strings X and XX.
  2. Delete the columns that have many missing values. The threshold is set at 90%: if more than 90% of a column's values are null, the column is deleted.
  3. Delete unnecessary columns. For example, CAMEO_DEU_2015 is the detailed version of CAMEO_DEUG_2015 and can be represented by it, so it is deleted; D19_LETZTER_KAUF_BRANCHE summarizes information from other columns and is also deleted.
  4. Encode categorical data. Here only OST_WEST_KZ needs encoding.
  5. Extract information from datetime data. Only the year is extracted from EINGEFUEGT_AM.
  6. Replace the remaining null values with the most frequent value of each column.
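Here is a minimal sketch of this cleaning process in pandas. The unknown_map dictionary (column name mapped to the values that mean "unknown") is a hypothetical helper that would be built from the provided attributes file:

```python
import numpy as np
import pandas as pd

def clean_data(df, unknown_map, null_threshold=0.9):
    """Sketch of the cleaning steps described above (not the exact project code)."""
    df = df.copy()

    # 1. Convert the various "unknown" encodings (-1, 0, 'X', 'XX', ...) into NaN
    for col, unknown_values in unknown_map.items():
        if col in df.columns:
            df[col] = df[col].replace(unknown_values, np.nan)

    # 2. Drop columns with more than 90% missing values
    null_ratio = df.isna().mean()
    df = df.drop(columns=null_ratio[null_ratio > null_threshold].index)

    # 3. Drop redundant columns (detailed / summary duplicates)
    df = df.drop(columns=['CAMEO_DEU_2015', 'D19_LETZTER_KAUF_BRANCHE'], errors='ignore')

    # 4. Encode the only remaining categorical column, OST_WEST_KZ
    if 'OST_WEST_KZ' in df.columns:
        df['OST_WEST_KZ'] = df['OST_WEST_KZ'].map({'W': 1, 'O': 0})

    # 5. Keep only the year from the datetime column EINGEFUEGT_AM
    if 'EINGEFUEGT_AM' in df.columns:
        df['EINGEFUEGT_AM'] = pd.to_datetime(df['EINGEFUEGT_AM']).dt.year

    # 6. Fill the remaining NaNs with the most frequent value of each column
    df = df.fillna(df.mode().iloc[0])
    return df
```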

2. Customer Clustering

From the first step, 358 features have been retained. Now PCA (Principal Component Analysis) is used to decompose the features; after the principal components have been extracted, KMeans is used to cluster the data.

For PCA plus KMeans, there are two important parameters to decide: the number of PCA components and the number of clusters K for KMeans.

Choose the parameters for PCA and KMeans

PCA

The scree plots below show the explained variance with 358 components and with 15 components.

From Pics 6, we can see that 200 components would cover 90% of the variance, which should theoretically be a good choice. In practice, however, in a single notebook run, fitting PCA with 200 components takes 6 min 56 s, and the subsequent KMeans computation still had no result after more than an hour. Considering this cost/benefit trade-off, and since Pics 7 shows that after component 15 each remaining component explains less than 1% of the variance, I decided to set n_components to 15.
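A sketch of how the explained variance can be inspected to support this choice, assuming azdias_clean is the cleaned (fully numeric) general-population data from the previous step:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed: azdias_clean is the cleaned general-population DataFrame
X = StandardScaler().fit_transform(azdias_clean)

pca = PCA(n_components=15)   # chosen after inspecting the scree plots
X_pca = pca.fit_transform(X)

# Variance explained per component and cumulatively
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
```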

KMeans

Below is the plot produced by the elbow method. From Pics 8, I cannot see a very clear elbow point; balancing what is practical for the later analysis against the data, I chose the number of centroids, i.e. clusters, to be 10.

The chosen parameters are therefore: n_components=15, n_clusters=10.
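A sketch of the elbow computation, assuming X_pca is the PCA-reduced general-population data from the sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Elbow method: fit KMeans for a range of K and record the inertia
inertias = []
k_values = range(2, 21)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_pca)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()
```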

The fitted model is then used to predict the clusters of the customer data, so we now have an overview of both the general population and the customers with their cluster assignments.

Cluster segmentation analysis

Now let’s come back to principal component 0, which explains 8.37% of the variance, as shown in Pics 7.

Its top 10 features are shown in Pics 9, from different projections. With these, we can characterize the clusters through concrete features.

  1. Cluster comparison between general population and customers:

From Pics 10, we can see that, compared with the general population, clusters 1, 9, 3 and 6 show significant increases in their share among customers, which indicates that people in these clusters have a higher potential to become customers. A sketch of how this comparison can be produced is shown below.
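This sketch assumes X_pca is the PCA-reduced general population and customers_pca is the customer data transformed with the same scaler and PCA fitted on the general population:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Fit KMeans on the PCA-reduced general population and assign customers to the same clusters
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
population_clusters = kmeans.fit_predict(X_pca)
customer_clusters = kmeans.predict(customers_pca)

# Share of each cluster in both datasets, and the difference between them
comparison = pd.DataFrame({
    'population_pct': pd.Series(population_clusters).value_counts(normalize=True),
    'customer_pct': pd.Series(customer_clusters).value_counts(normalize=True),
})
comparison['difference'] = comparison['customer_pct'] - comparison['population_pct']
print(comparison.sort_values('difference', ascending=False))
```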

2. Cluster with top features

Now we check cluster 1, which shows the most significant difference between the general population and the customers.

We compare the mean values for customers and for the general population on the top 5 and bottom 5 variables of component 0 within cluster 1. As CAMEO_INTL_2015 has a very different value range, it is displayed separately.

The differences for the other top 5 and bottom 4 features are shown in Pics 11:

The mean values for CAMEO_INTL_2015 are shown in Pics 12:

The top 5 and bottom 5 features from component 0 have been listed previously. Now I would like to rank them by absolute weight.

  1. MOBI_RASTER: the mean value is 1.02 for customers and 1.35 for the general population. This item has no explanation in the provided attributes file, but MOBI relates to mobility, and a MOBI_RASTER value around 1 indicates higher potential.
  2. KBA13_ANTG3: 1.00 vs 2.2, customers vs the general population. There is no corresponding attribute description, but the KBA13 features relate to the level of car sharing.
  3. PLZ8_ANTG3: 1.00 vs 2.09, customers vs the general population. It shows the number of 6–10 family houses in the PLZ8 region.
  4. PLZ8_ANTG1: the mean value is 2.00 for customers and 1.82 for the general population. It shows the number of 1–2 family houses in the PLZ8 region.
  5. CAMEO_DEUG_2015: 2.02 vs 7.14. A value of 2 indicates the upper middle class, while 7 indicates the lower middle class.
  6. KBA13_ANTG1: 2.00 vs 1.74, customers vs the general population. There is no corresponding attribute description, but the KBA13 features relate to the level of car sharing.
  7. MOBI_REGIO: 4.99 vs 1.91, customers vs the general population. It describes a moving pattern; values around 4.99 indicate more potential than values around 1.91.
  8. CAMEO_INTL_2015: 14.13 vs 45.49. A value of 14 means wealthy households (Older Families & Mature Couples), while 45 means less affluent households (elders in retirement).
  9. LP_STATUS_GROB: 2.21 vs 1.62, customers vs the general population. It describes social status; a status around 2.21 indicates higher potential than 1.62.
  10. VK_DHT4A: 1.11 vs 7.81. It does not exist in the attributes table.

So, the picture of the target population is: MOBI_RASTER around 1, KBA13_ANTG3 around 1.0, PLZ8_ANTG3 around 1.0, PLZ8_ANTG1 around 2, CAMEO_DEUG_2015 around 2 (upper middle class), KBA13_ANTG1 around 2, MOBI_REGIO around 5, CAMEO_INTL_2015 around 14 (wealthy households, Older Families & Mature Couples), LP_STATUS_GROB around 2.21 and VK_DHT4A around 1.

3. Customer Labeling

This part develops a model to predict which individuals will respond to a campaign. Supervised learning is used.

Ensemble Models

Three ensemble estimators are compared: RandomForestClassifier(n_estimators=1000), XGBClassifier(n_estimators=1000) and LGBMClassifier(learning_rate=0.001, n_estimators=1000, lambda_l2=0.1).
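A sketch of such a comparison, assuming X_train and y_train are the cleaned mailout_train features and RESPONSE labels, and using cross-validated AUC as the score (the project's actual evaluation may have used a single train/validation split instead):

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Assumed: X_train, y_train come from the cleaned mailout_train dataset
estimators = {
    'RandomForestClassifier': RandomForestClassifier(n_estimators=1000, random_state=42),
    'XGBClassifier': XGBClassifier(n_estimators=1000, random_state=42),
    'LGBMClassifier': LGBMClassifier(learning_rate=0.001, n_estimators=1000,
                                     reg_lambda=0.1, random_state=42),  # reg_lambda is lambda_l2
}

for name, est in estimators.items():
    # 5-fold cross-validated AUC, the metric chosen for this imbalanced problem
    scores = cross_val_score(est, X_train, y_train, cv=5, scoring='roc_auc')
    print(f'{name}: mean AUC = {scores.mean():.3f}')
```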

The ROC curves are shown below in Pics 13, 14 and 15:

The scores and wall times for the three estimators are as follows:

  - RandomForestClassifier: roc_auc_score 0.66, wall time 2 min 46 s
  - XGBClassifier: roc_auc_score 0.74, wall time 3 min 58 s
  - LGBMClassifier: roc_auc_score 0.75, wall time 49.9 s

Based on these scores, I choose LGBMClassifier as the estimator for further model tuning.

Model Tuning

Taking LGBMClassifier as the estimator, a grid search with 5-split StratifiedKFold cross-validation, learning rates of 0.01 and 0.001, and n_estimators of 500, 1000 and 2000 is used to find the best estimator.
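A sketch of this grid search, again assuming X_train and y_train are the cleaned mailout_train data:

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    'learning_rate': [0.01, 0.001],
    'n_estimators': [500, 1000, 2000],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    estimator=LGBMClassifier(random_state=42),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=cv,
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print(grid.best_score_)
print(grid.best_estimator_)
```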

The best score is as below:

The best estimator is:

The grid search result is as follows:

Using the best estimator to predict the positive-class probability on the test dataset yields a Kaggle score of 0.79976.
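A sketch of the prediction and submission step, assuming X_test is the cleaned mailout_test data and that LNR is its ID column:

```python
import pandas as pd

# Predict the positive-class probability with the tuned model from the grid search
best_model = grid.best_estimator_
test_proba = best_model.predict_proba(X_test)[:, 1]

# Build the Kaggle submission file ('LNR' assumed to be the ID column of mailout_test)
submission = pd.DataFrame({'LNR': mailout_test['LNR'], 'RESPONSE': test_proba})
submission.to_csv('submission.csv', index=False)
```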

Summary:

The work implemented in this project is summarized below:

  1. Explored the demographic data: the general population of Germany, customers of a mail-order company, and the marketing campaign train and test datasets.
  2. Cleaned the datasets and selected features.
  3. Used PCA and KMeans to cluster the population.
  4. Analyzed the top 10 features of the higher-potential clusters to get a relatively clear picture of the target population.
  5. Compared ensemble models to select the best-performing estimator, then tuned the model through grid search cross-validation.

Possible further tests and improvements:

This is an iterative process: I kept learning and finding new problems along the way, and had to revisit earlier steps from time to time. There are still improvements that could be made:

  1. Optimize the feature selection process, for example by using statistical techniques to select features, checking for outliers, and testing different threshold values for dropping columns (currently 0.9 column-wise).
  2. Check whether the PCA output can be used as input to the supervised model.
  3. Check how the performance changes with resampling.

Part of the code can be found here.

Thanks to Arvato and Udacity for providing such an interesting project, and thank you for your time to read it.

