Lead Prediction with Tableau and Scikit-Learn

A Case Study with Kaggle Data

Andreas Stöckl
Towards Data Science

--

Photo: www.pexels.com Creative Commons CC0

The Task

With the data that a potential customer (“lead”) leaves on a website, important insights into customer behavior can be gained. Machine learning is then used to build a prediction model from this data. The case study presented here shows that such a forecast model can reach an accuracy of about 90%.

An education company sells online courses to industry professionals. On any given day, thanks to marketing efforts, many professionals interested in the courses land on the company’s website and browse for courses. Campaigns on social media channels, websites, and search engines such as Google attract these new prospects.

Once people land on the site, they can browse the courses, fill out a form for a course, or watch some videos. If they leave the site without completing the desired activity, retargeting campaigns bring them back. When these people then fill out a form and provide their email address or phone number, they are classified as leads.

Once these leads are acquired, sales team members start making calls, writing emails, etc. Through this process, some of the leads are converted, while most are not. The typical lead conversion rate is around 30%.

For example, if they receive 100 leads in a day, only about 30 of them will convert. To make this process more efficient, the company wants to identify the most promising leads, also known as “hot leads”. If these can be identified reliably, the lead conversion rate should increase, because the sales team can focus on communicating with the promising leads instead of calling all of them. This not only makes the sales process faster and more successful but also saves personnel costs.

Our task is to create a model that assigns each lead a lead score. Customers with a higher lead score have a higher chance of conversion; customers with a lower lead score have a lower one.

The Data

We use data from Kaggle in this demo: https://www.kaggle.com/ashydv/leads-dataset

The data set contains information about:

  • Whether the lead became a customer — the column is called “Converted”.
  • Behavior on the website, such as time spent and what content was viewed
  • Information provided in forms
  • How the visitor came to the website (search engine, referrer, direct)

The following table shows an extract from the data. Records of 9240 persons, each described by 37 features, are available. Some features are numeric, such as the time spent on the website, but there are also many categorical features, such as demographic information or entries from web forms.

The data can be imported into Tableau via the CSV file provided by Kaggle.
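For the modeling steps later on, the same CSV can also be loaded with pandas. A minimal sketch, assuming the file keeps its Kaggle name “Leads.csv”:

    import pandas as pd

    # Load the CSV file provided by Kaggle
    # (assuming it is named "Leads.csv" as in the download).
    df = pd.read_csv("Leads.csv")

    print(df.shape)  # expected: (9240, 37)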

The first lines of the record

Data Cleansing

As is often the case in practice, incomplete records are a problem that must be solved. Simply removing all incomplete records is usually not viable because too many records would be affected; a more detailed analysis is necessary.

We delete features with more than 40 percent missing values, since too little information remains to work with. For features with fewer missing values, the gaps are filled with the most frequent value (mode) of the feature.

Additionally, many columns contain the value “Select”. It appears when a customer did not pick an option from a drop-down list in an input form (presumably because the field was not mandatory). These “Select” values are as good as NULLs, so they are replaced with NULLs.
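In pandas, these cleaning steps can be sketched as follows, continuing with the DataFrame df loaded above; the 40 percent threshold and the mode imputation follow the description:

    import numpy as np

    # Treat the "Select" placeholder from the web forms as missing.
    df = df.replace("Select", np.nan)

    # Drop features with more than 40 percent missing values.
    missing_share = df.isnull().mean()
    df = df.drop(columns=missing_share[missing_share > 0.4].index)

    # Fill the remaining gaps with the most frequent value (mode)
    # of each affected column.
    for col in df.columns[df.isnull().any()]:
        df[col] = df[col].fillna(df[col].mode()[0])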

A detailed description of how to perform these steps can be found at https://www.kaggle.com/danofer/lead-scoring

Based on the analysis, we found that many features do not add information to the model, so we remove them from further analysis. 16 features remain in the dataset.

Exploratory Data Analysis with Tableau

Let’s first look at how many converted and non-converted leads there are by creating a simple bar chart. For this, “Converted” is dragged as a dimension to “Rows” and the number of leads to “Columns”.

The ‘Converted’ feature indicates whether a lead was successfully converted (1) or not (0). Approximately 38% of the customers in our dataset were won.
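Since “Converted” is coded as 0/1, this share can also be checked directly on the DataFrame; a quick sketch:

    # Share of converted leads in the dataset.
    print(f"Conversion rate: {df['Converted'].mean():.1%}")  # roughly 38%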

Now let’s look at the regional distribution of the data by displaying a map with colored countries according to the number of records.

Map lead origin

To do this, we use the “Country” dimension and the number of leads along with the appropriate chart type. Tableau then automatically generates the Latitude and Longitude data and places them in “Rows” and “Columns”.

Much of the data comes from India, as shown in the figure.

We now consider the numerical features “Total Visits”, “Total Time Spent on Website” and “Page Views Per Visit”. We use “box-and-whisker” plots to examine the distribution of the data.

Box-and-whisker plots of the numerical features

The median number of website visits is the same for converted and non-converted leads, so the total number of visits alone does not allow a conclusive statement.

People who spend more time on the website are more likely to convert. It is therefore advisable to improve the website so that it is more helpful for users and keeps them engaged.

The median number of page views per visit is likewise the same for converted and non-converted leads, so this feature by itself says little about lead conversion.
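The same distributions can be reproduced outside Tableau, for example with seaborn. A sketch, assuming the cleaned DataFrame df from the steps above and the column names as they appear in the Kaggle file:

    import matplotlib.pyplot as plt
    import seaborn as sns

    numeric_features = ["TotalVisits", "Total Time Spent on Website",
                        "Page Views Per Visit"]

    # One box plot per numeric feature, split by conversion outcome.
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    for ax, feature in zip(axes, numeric_features):
        sns.boxplot(data=df, x="Converted", y=feature, ax=ax)
    plt.tight_layout()
    plt.show()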

Now we consider categorical features.

From which source was the lead generated?

Origin of the leads

A bar chart with two dimensions (Lead Origin and Converted) is used to show the lead counts.

  • ‘Lead Add Form’ has a very high conversion rate, but the number of leads is not very high.
  • ‘API’ and ‘Landing Page Submission’ bring a higher number of leads.
  • ‘Lead Import’ generates very few leads.

To improve the overall lead conversion rate, we need to focus on improving the conversion of leads originating from ‘API’ and ‘Landing Page Submission’ and on generating more leads via the ‘Lead Add Form’.
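The same breakdown can be computed directly on the DataFrame; a sketch, assuming the “Lead Origin” column name from the Kaggle CSV:

    import pandas as pd

    # Lead counts per origin and outcome.
    print(pd.crosstab(df["Lead Origin"], df["Converted"]))

    # Conversion rate per origin.
    print(df.groupby("Lead Origin")["Converted"].mean())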

Contact info

We see that people who do not want to receive emails have a low conversion rate.

As a last example, let us look at the users’ last activity.

Last activity

For most leads, the last recorded activity was opening an email. “SMS Sent” shows a high conversion rate among leads for whom it was the last activity.

Prediction model

We now create a prediction model that forecasts, for each lead, whether it will convert in the period under consideration. To evaluate the model, only 70% of the data is used for training. The remaining 30% is used to test the model by comparing its forecasts with the real conversion outcomes.

This provides both a scoring method for the future, which calculates the conversion probability from the lead characteristics, and an estimate of how good this forecast is: how often is it wrong on average?

To build the predictive model, we use logistic regression, a statistical model for a binary dependent variable; in our case, whether the lead converted or not.

There are many features in the dataset, but many of them contribute nothing to the quality of the model. We therefore carry out feature selection: from the multitude of available features, we want to use only the most important ones for model building, in order to obtain a simpler and also more stable model.

The goal of recursive feature elimination (RFE) is to gradually narrow the feature set. First, a model is trained on the original set of features and the importance of each feature is calculated. Then the least important features are removed from the current set. This procedure is repeated recursively on the reduced set until the desired number of features remains. We then train a new model with this reduced feature set.

In code, training such a model with Scikit-Learn looks roughly as follows.
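A minimal sketch, assuming the cleaned DataFrame df from above; the one-hot encoding, the target of 15 features for RFE, and the variable names are illustrative choices:

    import pandas as pd
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Drop identifier columns (names as in the Kaggle file),
    # one-hot encode the categorical features, separate the target.
    X = pd.get_dummies(
        df.drop(columns=["Converted", "Prospect ID", "Lead Number"]),
        drop_first=True)
    y = df["Converted"]

    # 70% of the data for training, 30% held back for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Recursive feature elimination: repeatedly drop the least
    # important features until the desired number remains.
    rfe = RFE(estimator=LogisticRegression(max_iter=1000),
              n_features_to_select=15)
    rfe.fit(X_train, y_train)
    selected = X_train.columns[rfe.support_]

    # Refit the logistic regression on the selected features only.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[selected], y_train)

    # The predicted conversion probability serves as the lead score.
    lead_scores = model.predict_proba(X_test[selected])[:, 1]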

Evaluation with the 30% test data

With this, we can now determine the accuracy of the prediction on the test data. It is about 90.5%.
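Continuing the sketch above, the accuracy on the held-out data can be computed as follows:

    from sklearn.metrics import accuracy_score

    y_pred = model.predict(X_test[selected])
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.1%}")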

Confusion matrix

With the confusion matrix, we look at how often the model made which kind of error during evaluation. In 5.7% of the predictions, a conversion is erroneously predicted although none occurred. This value is quite acceptable for practical use.

16% of the conversion cases are not detected. This is also a workable value.
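Both rates can be read off the confusion matrix; a sketch continuing the evaluation above:

    from sklearn.metrics import confusion_matrix

    # Layout: [[tn, fp], [fn, tp]] with rows = actual class.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    # Share of all predictions that falsely announce a conversion.
    print(f"False alarms: {fp / len(y_test):.1%}")

    # Share of actual conversions the model fails to detect.
    print(f"Missed conversions: {fn / (fn + tp):.1%}")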

Summary

We have gained the following insights on the data from the analysis:

  • The time spent on the website is a good indicator of successful completion.
  • The “lead sources” had large differences in terms of number of leads and conversion rates.
  • The last recorded activities provided important information for prediction.

This information was used to create a predictive model for new leads that, when evaluated, showed a predictive accuracy of just over 90%.

--

University of Applied Sciences Upper Austria / School of Informatics, Communications and Media http://www.stoeckl.ai/profil/