
Your (imaginary) first day as a Data Analyst

Finish your first project with the help of your prior (online course) knowledge.

Have you ever asked yourself what a successful first day as a data analyst looks like? As part of earning my Data Science Nanodegree, I will walk you through a simple scenario.

You will learn:

  1. What cross-selling is.
  2. How to analyze your customers.
  3. How to use machine learning for a cross-sell prediction.

Scenario

I got a dataset from Kaggle with the name "Health Insurance Cross Sell Prediction 🏠 🏥". Imagine that you are a data analyst at a big insurance company. As your first job, your boss asks you to analyze a particular customer dataset. He gives you a customer-info.csv file and the following three questions:

  1. What could be a typical customer of our company?
  2. What factors determine whether a customer wants to be insured by us or not?
  3. Which customers will respond to our marketing campaign?

Unfortunately, as a budding data scientist, you don’t know what cross-selling is, so you google it right away:

"Cross-selling is the action or practice of selling an additional product or service to an existing customer." – Wikipedia 2020

Now you know. You work for a large insurance company that offers comprehensive health protection. The insurance company now wants to expand its product portfolio. Therefore it is your job to find out which customers might be interested in car insurance.

What could be a typical customer of our company?

Now you want to use your Python knowledge to analyze the data. To analyze and visualize it, you use the libraries pandas and seaborn. When you are finished, you define two customer personas.

Ann is 24 years old and lives in the state with the excellent title "28". She has a driver's licence, and her car is two years old and already damaged. So far, she has no interest in car insurance.

Kevin is 28 years old and also lives in the state with the excellent title "28". He has a driver's licence as well. Because he is really into cars, his car is just a few months old, and it has no damage at all. So far, he has no interest in car insurance.

When you show the personas to your colleagues, they are impressed. However, they recommend that you add a few more diagrams. Otherwise, the boss will wonder if you made it all up. That is no problem at all, you think. After all, you didn't make the personas up. Quite the opposite: the data analysis is the basis for them. First, you created a graph of the gender distribution.

The shares of men and women differ by only a few percentage points, so you decided to have one man and one woman. The next step was to look at the age distribution. You also wanted to check that men and women are about equally represented in each age group, so you coloured the age distribution by gender. You can see on your chart that most customers are between their mid-20s and early 30s.

So you decided on two people aged 24 and 28. Even if you don't have any more precise information about this, you trust your instincts here. In your graph of the regions, you saw that most people came from state 28 and that everyone had a driver's licence.
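The two checks described above can be sketched with pandas alone. This is only an illustration: the rows are made up, and the column names `Gender` and `Age` are assumptions about how the customer file is laid out.

```python
import pandas as pd

# Made-up rows for illustration only; the column names are assumptions.
customers = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female",
               "Male", "Female", "Male", "Female"],
    "Age": [24, 28, 31, 24, 45, 52, 28, 24],
})

# Share of each gender among all customers.
gender_share = customers["Gender"].value_counts(normalize=True)

# Age groups (left-closed bins such as [20, 30)), split by gender, to check
# that both genders are represented similarly in each group.
age_group = pd.cut(customers["Age"], bins=[20, 30, 40, 50, 60], right=False)
age_by_gender = pd.crosstab(age_group, customers["Gender"])

print(gender_share)
print(age_by_gender)
```

The same table could be drawn as a two-coloured histogram with a plotting library; the numbers behind the chart are what matter for the personas.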

Finally, you decided to have a closer look at the cars of your customers. You have noticed that most customers have a vehicle that is either between 1 and 2 years old or even younger than one year old. You think this is important information and of course, you want to include it in your personas.
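One simple way to turn such charts into a persona is to take the most frequent value of each column. The sketch below assumes column names (`Age`, `Region_Code`, `Driving_License`, `Vehicle_Age`) and category labels that may differ from the real file, and it uses made-up rows.

```python
import pandas as pd

# Made-up rows for illustration only; names and labels are assumptions.
customers = pd.DataFrame({
    "Age": [24, 28, 24, 35, 28, 24],
    "Region_Code": [28, 28, 8, 28, 46, 28],
    "Driving_License": [1, 1, 1, 1, 1, 1],
    "Vehicle_Age": ["1-2 Year", "< 1 Year", "1-2 Year",
                    "> 2 Years", "< 1 Year", "1-2 Year"],
})

# The most frequent value per column is a rough starting point for a persona.
typical = customers.mode().iloc[0]
print(typical)
```

A persona built this way is a caricature of the data, not a real customer, so it still needs the sanity checks from the charts.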

What influences whether a customer responds to our cross-sell approach or not?

Your boss is impressed by your analysis. But he has something to criticize as well: neither of your personas is interested in our product. He wants to know which factors would make Ann and Kevin more likely to buy car insurance from his company.

So you got back to work and decided to look for correlations between the response feature and the other features.

You added the new information which you gained from a heatmap to the personas:

Kevin might change his mind if his car were damaged. This is based on a modest correlation (0.35) between the response feature and the vehicle damage feature. You think it would be funny to suggest to your boss that we could damage Kevin's car, but you decide against it. Irony probably doesn't go down well on the first day.

Ann may change her mind as she gets older (a correlation of just 0.11). If older people are somewhat more interested in insurance, our marketing campaign could focus on people who have been customers for a very long time but whom we haven't yet asked to buy vehicle insurance.
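The correlations behind such a heatmap could be computed roughly as follows. The rows are made up, so the resulting numbers are not the 0.35 and 0.11 quoted above, and the column names and the Yes/No encoding of `Vehicle_Damage` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
customers = pd.DataFrame({
    "Age": [24, 28, 45, 52, 31, 60, 23, 40],
    "Vehicle_Damage": ["No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
    "Response": [0, 0, 1, 0, 0, 1, 0, 1],
})

# Correlation needs numbers, so encode the Yes/No column as 1/0 first.
encoded = customers.assign(
    Vehicle_Damage=customers["Vehicle_Damage"].map({"Yes": 1, "No": 0})
)

# Pearson correlation matrix; keep only the column for the target.
corr = encoded.corr()
response_corr = corr["Response"].drop("Response").sort_values(ascending=False)
print(response_corr)
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would give the kind of chart described here. Keep in mind that a correlation shows an association, not that damaging a car causes anyone to buy insurance.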

Which customers will respond to our marketing campaign?

When you presented your final results to your boss, he was impressed, and he now asks you to use some fancy machine learning for the task. Because you have prepared your data quite well, this is not a big deal: you know that preprocessing is already 80% of the job.

After a quick look at the scikit-learn documentation, you know what to do. Your problem is a binary classification (a customer will or will not respond to our cross-sell approach), so you can use a classifier. You decided to go with state-of-the-art methods such as LightGBM or XGBoost.

At the same time, you want to use an older but well-known model so that you can show the advantages of the new fancy classifiers. For an evaluation metric, you googled again. Your research suggests that a well-balanced classifier (one that balances precision and recall) is what you want. So you made sure that this is the case.
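To make "balance between precision and recall" concrete, here is a small pure-Python sketch of the two metrics and their harmonic mean, the F1 score. The function name and the labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels, 1 = responds."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Precision: how many predicted responders really respond.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: how many real responders the model finds.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean, high only when both values are high.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

A model that predicts "responds" for everyone gets perfect recall but poor precision, which is why looking at both (or at F1) is more informative than accuracy when few customers respond.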

As the results show, the LightGBM classifier works quite well compared to the others and is also balanced. You will use it to predict which customers are most likely to respond to our marketing campaigns.

After you sent the final results to your boss, you felt total happiness. All the work and effort you put into your online courses made sure that you could succeed at your first task in your new position! In the words of your favourite machine learning and statistics coach:

Triple BAM! As prospective data analysts, we stood our ground on our first day. Tomorrow we will get a real job so that we can test the cross-industry standard process for data mining on real data. We are curious.

You will find the code for this article in my GitHub repository 💻. Thanks for reading! If you liked it, please give it a clap, check out my website 🌎, and don't hesitate to email me with feedback 📩.

