Learning Artificial Neural Networks by predicting visitor purchase intention

Building ANN using Keras and Tensorflow

While taking a Deep Learning course on Udemy, I decided to put my knowledge to use and try to predict whether a visitor would make a purchase (generate revenue) or not. The dataset comes from the UCI Machine Learning Repository. You can find the complete code in the repo below:

Import libraries

The first step is to import the necessary libraries. Apart from the regular data science libraries, including numpy, pandas and matplotlib, I import the machine learning library sklearn and the deep learning library keras. I will use keras to develop my Artificial Neural Network with tensorflow as the backend.

Import dataset

I’ll import the dataset and get a basic overview of the data. There are 1,908 visitors that led to revenue generation, while 10,422 visitors did not.

There are 18 columns: 17 features and 1 target variable (Revenue). There are no missing values. I also ran dataset.describe(). The column means differ widely from one another, so scaling should help.

Taking Revenue as the target column, I'll split the dataset into 80% training data and 20% testing data.
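To make the split concrete, here is a minimal sketch using sklearn's train_test_split on a small toy DataFrame instead of the real dataset. The toy values, the random_state and the stratify=y argument are my assumptions for illustration; stratifying simply keeps the share of revenue-generating rows the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: 100 rows, 15 of them with Revenue=True.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "PageValues": rng.exponential(5.0, size=100),
    "Administrative_Duration": rng.exponential(80.0, size=100),
    "Revenue": np.array([True] * 15 + [False] * 85),
})

X = toy.drop(columns=["Revenue"])
y = toy["Revenue"]

# 80/20 split; stratify=y (an assumption) preserves the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print("positives in train:", int(y_train.sum()), "| in test:", int(y_test.sum()))
```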

Data analysis and visualization

Target column

I began by creating a bar plot between visitors that generated revenue and those that didn’t.

Target column bar chart

As is clear from the bar plot above, the majority of rows in the dataset resulted in no revenue generation. The dataset is highly imbalanced, so the model has to distinguish between the two classes accurately despite the skew.
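A quick bit of arithmetic shows how strong the imbalance is. The sketch below uses only the two class counts reported earlier and computes the accuracy of a trivial model that always predicts "no revenue". That number is a useful reference point when judging any accuracy figure on this dataset.

```python
# Class counts reported in the dataset overview.
n_revenue = 1908
n_no_revenue = 10422
total = n_revenue + n_no_revenue

# Share of visitors that generated revenue.
positive_share = n_revenue / total

# Accuracy of a trivial model that always predicts "no revenue".
baseline_accuracy = n_no_revenue / total

print(f"total visitors:    {total}")
print(f"revenue share:     {positive_share:.2%}")
print(f"majority baseline: {baseline_accuracy:.2%}")
```

The baseline comes out at roughly 84.5%, so accuracy alone can look good even when the minority class is predicted poorly.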

Correlation matrix

I created the correlation matrix and then coloured it based on the level of correlation.

Correlation Matrix (Part 1)
Correlation Matrix (Part 2)

It appears that PageValues is the feature most linearly correlated with our target value. Also, features such as OperatingSystems, Region and TrafficType have a correlation with the target between -0.02 and 0.02, so I'll drop these columns.

Administrative and Administrative_Duration, Informational and Informational_Duration, and ProductRelated and ProductRelated_Duration appear to be very highly correlated, as can be seen in the correlation matrix. This is expected, as the time spent on a type of page is surely influenced by the number of pages of that type visited. Since each pair carries largely redundant information, we can remove the number of pages visited of each type and keep the durations.
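The selection rule can be sketched with pandas as follows. The data here is synthetic (one informative column and one pure-noise column that I generate myself), so only the mechanics carry over to the real dataset: compute each feature's correlation with Revenue and flag those whose absolute value falls below 0.02.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Synthetic data: PageValues drives the target, Region is pure noise.
page_values = rng.exponential(5.0, size=n)
region_noise = rng.normal(size=n)
revenue = (page_values + rng.normal(scale=3.0, size=n) > 8.0).astype(int)

toy = pd.DataFrame({
    "PageValues": page_values,
    "Region": region_noise,
    "Revenue": revenue,
})

# Pearson correlation of every feature with the target column.
corr_with_target = toy.corr()["Revenue"].drop("Revenue")

# Flag features whose correlation lies between -0.02 and 0.02.
threshold = 0.02
weak_columns = corr_with_target[corr_with_target.abs() < threshold].index.tolist()

print(corr_with_target.sort_values(ascending=False))
print("candidates to drop:", weak_columns)
```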

Data engineering

Remove columns

I’ll now remove all the columns that are not needed: Administrative, Informational, ProductRelated, OperatingSystems, Region and TrafficType.
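In pandas this is a single drop call. The one-row frame below is a toy stand-in with made-up values; only the column names come from this article.

```python
import pandas as pd

# One-row toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Administrative": [2], "Administrative_Duration": [64.0],
    "Informational": [0], "Informational_Duration": [0.0],
    "ProductRelated": [10], "ProductRelated_Duration": [627.5],
    "OperatingSystems": [2], "Region": [1], "TrafficType": [3],
    "PageValues": [12.2], "Revenue": [True],
})

columns_to_drop = [
    "Administrative", "Informational", "ProductRelated",
    "OperatingSystems", "Region", "TrafficType",
]

# drop returns a new DataFrame; the original is left untouched.
reduced = toy.drop(columns=columns_to_drop)
print(list(reduced.columns))
```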

Encoding categorical columns

I’ll encode the categorical columns using LabelEncoder. I’ll then use OneHotEncoder to turn each category into its own column and append these columns to the dataset.
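Here is a small sketch of both encoders on a toy column. The column name VisitorType and its values are assumptions chosen for illustration. Recent sklearn versions let OneHotEncoder work on string columns directly, so the LabelEncoder step is shown mainly to make the difference between integer codes and one-hot columns visible.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy frame with one categorical column (name and values are assumptions).
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor", "Other"],
    "PageValues": [0.0, 12.5, 3.1, 0.0],
})

# LabelEncoder maps each category to an integer code (classes are sorted).
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(toy["VisitorType"])
print(dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))

# OneHotEncoder expands the column into one 0/1 column per category.
one_hot = OneHotEncoder()
encoded = one_hot.fit_transform(toy[["VisitorType"]]).toarray()
new_columns = [f"VisitorType_{category}" for category in one_hot.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=new_columns, index=toy.index)

# Append the new columns and drop the original text column.
result = pd.concat([toy.drop(columns=["VisitorType"]), encoded_df], axis=1)
print(result)
```

One-hot columns avoid implying an order between categories, which plain integer codes would do.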

Scaling the data

Next, I’ll rescale the data so that each column has a mean of around 0.
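One common way to do this is sklearn's StandardScaler, which also brings each column to unit variance. I am assuming that scaler here for illustration, with tiny made-up arrays. The important detail is that the scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up arrays with two columns on very different scales.
X_train = np.array([[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]])
X_test = np.array([[2.0, 700.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("train means:", X_train_scaled.mean(axis=0))
print("train stds: ", X_train_scaled.std(axis=0))
print("scaled test row:", X_test_scaled)
```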

Model generation

I’ll use Keras with Tensorflow as its backend to build an Artificial Neural Network. There are 32 input nodes, followed by 4 hidden layers and 1 output layer. The architecture was designed to increase validation accuracy while preventing overfitting on the training set.

Artificial Neural Network

I developed a Sequential model for the ANN. Each layer is either a Dense layer or a Dropout layer. Each Dense layer specifies its number of neurons as units and uses relu as the activation function; the first layer also takes the input_dim of the inputs. The activation for the output layer is sigmoid. The Dropout layers help reduce overfitting on the training data.
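The description above can be sketched roughly as follows. The number of units per layer, the dropout rate, the optimizer and the loss are my assumptions for illustration, not the exact values from my notebook. I also use keras.Input here to declare the 32 inputs, which plays the same role as passing input_dim to the first Dense layer.

```python
from tensorflow import keras


def build_model(n_features: int = 32) -> keras.Model:
    """Binary classifier with 4 hidden Dense layers and a sigmoid output."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),            # same role as input_dim=32
        keras.layers.Dense(64, activation="relu"),   # unit counts are illustrative
        keras.layers.Dropout(0.2),                   # dropout rate is an assumption
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),  # probability of revenue
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_model()
model.summary()
```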

Training and evaluating the model

I trained the model for a total of 50 epochs with a 90% train and 10% validation split. I achieved a training accuracy of 90.44% and a validation accuracy of 89.06%.

Finally, I made predictions on the test data. I then used the predicted and actual values to create a confusion matrix and calculate the test accuracy, which came to 88.77%.
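In Keras, a split like this can be requested with the validation_split argument of fit. The sketch below trains a deliberately tiny model on random synthetic data, so the accuracies it prints have nothing to do with the figures above. The model size, batch size and data are assumptions made only to keep the example fast.

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in for the scaled training data (200 rows, 32 features).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 32)).astype("float32")
y_train = (X_train[:, 0] > 0.5).astype("float32")

# Deliberately small model so the example runs quickly.
model = keras.Sequential([
    keras.Input(shape=(32,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# validation_split=0.1 holds out the last 10% of the rows for validation.
history = model.fit(
    X_train,
    y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.1,
    verbose=0,
)

print(f"train accuracy: {history.history['accuracy'][-1]:.4f}")
print(f"val accuracy:   {history.history['val_accuracy'][-1]:.4f}")
```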
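As a minimal sketch of this step, the example below turns sigmoid outputs into class labels and builds the confusion matrix with sklearn. The ten labels and probabilities are made up, and the 0.5 cut-off is an assumption. On an imbalanced dataset it is worth reading recall for the revenue class alongside accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and sigmoid outputs for ten test visitors.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.05, 0.10, 0.20, 0.60, 0.15, 0.30, 0.80, 0.40, 0.90, 0.02])

# Convert probabilities to class labels with a 0.5 cut-off (an assumption).
y_pred = (y_prob > 0.5).astype(int)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
recall_revenue = tp / (tp + fn)  # share of actual buyers that were caught

print(cm)
print(f"accuracy: {accuracy:.2f}")
print(f"recall for the revenue class: {recall_revenue:.2f}")
```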

Result analysis

The confusion matrix reveals that we are able to identify both types of visitors: those that are going to generate revenue and those that are not. We can use this information as follows:

  1. Once we identify that someone is going to generate revenue, we do not need to provide any coupons. Instead, we can give these visitors special reward points which they can use the next time they visit.
  2. The visitors that are unlikely to make a purchase can be provided with discount coupons so that they are more likely to make a purchase.
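The two rules above could be wired to the model's output along these lines. The function name choose_incentive, the returned labels and the 0.5 threshold are all hypothetical; in practice the threshold would be tuned against the cost of a coupon versus a lost sale.

```python
def choose_incentive(purchase_probability: float, threshold: float = 0.5) -> str:
    """Map a predicted purchase probability to an incentive (hypothetical rule)."""
    if not 0.0 <= purchase_probability <= 1.0:
        raise ValueError("purchase_probability must be between 0 and 1")
    if purchase_probability >= threshold:
        # Likely buyer: no coupon needed, reward points encourage a return visit.
        return "reward_points"
    # Unlikely buyer: a discount coupon may tip the decision.
    return "discount_coupon"


for probability in (0.92, 0.50, 0.08):
    print(probability, "->", choose_incentive(probability))
```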