Learning Artificial Neural Networks by predicting visitor purchase intention
Building an ANN using Keras and TensorFlow
As I am taking a course on Udemy on Deep Learning, I decided to put my knowledge to use and try to predict whether a visitor would make a purchase (generate revenue) or not. The dataset is taken from the UCI Machine Learning Repository. You can find the complete code in the repo below:
Import libraries
The first step is to import the necessary libraries. Apart from the regular data science libraries, including numpy, pandas and matplotlib, I import the machine learning library sklearn and the deep learning library keras. I will use keras to develop my Artificial Neural Network with tensorflow as the backend.
Import dataset
I’ll import the dataset and get a basic overview of the data. There are 1,908 visitors that led to revenue generation, while 10,422 visitors did not.
There are 18 columns: 17 features and 1 target variable (Revenue). There are no missing values. I also ran dataset.describe(). The column means differ widely from one another, so scaling should help.
Taking Revenue as the target column, I'll split the dataset into train and test sets: 80% training data and 20% testing data.
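The split described above could look roughly like the sketch below. It runs on a small synthetic DataFrame rather than the real file, and the stratify argument and random_state value are my assumptions, not something the original code is known to use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption): two features and a
# boolean Revenue target with roughly 15% positive rows.
rng = np.random.default_rng(0)
n_rows = 1000
dataset = pd.DataFrame({
    "PageValues": rng.exponential(5.0, n_rows),
    "ExitRates": rng.uniform(0.0, 0.2, n_rows),
    "Revenue": rng.random(n_rows) < 0.15,
})

# Separate the features from the target column
X = dataset.drop(columns=["Revenue"])
y = dataset["Revenue"]

# 80% train, 20% test. stratify=y (an assumption) keeps the class ratio
# similar in both sets, which matters when one class is rare.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```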
Data analysis and visualization
Target column
I began by creating a bar plot comparing the visitors that generated revenue with those that didn’t.
As is clear from the bar plot above, the majority of rows in the dataset resulted in no revenue generation. The dataset is highly imbalanced, so the challenge is to create a model that can still distinguish between the two classes accurately.
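To put a number on the imbalance, here is a quick calculation using the two class counts reported earlier. The "always predict no revenue" baseline is my own illustration; it shows the accuracy any model has to beat to be useful.

```python
# Class counts reported in the dataset overview
revenue = 1908
no_revenue = 10422
total = revenue + no_revenue

# Share of visitors that generated revenue
positive_share = revenue / total

# Accuracy of a trivial model that always predicts "no revenue"
baseline_accuracy = no_revenue / total

print(f"Total visitors: {total}")
print(f"Revenue share: {positive_share:.2%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```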
Correlation matrix
I created the correlation matrix and then coloured it based on the level of correlation.
It appears that PageValues is the feature most linearly correlated with our target value. Also, features such as OperatingSystems, Region and TrafficType have a correlation with the target between -0.02 and 0.02, so I'll drop these columns.
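As a sketch of this filtering step, the snippet below computes each feature's correlation with the target and collects the columns whose absolute correlation falls under 0.02. The toy DataFrame is hand-built for illustration (it is not the real data), and the variable names are hypothetical.

```python
import pandas as pd

# Hand-built toy data (assumption): Region is uncorrelated with Revenue,
# PageValues is strongly correlated with it.
toy = pd.DataFrame({
    "PageValues": [0, 0, 10, 20, 0, 1, 15, 30],
    "Region": [1, 2, 1, 2, 1, 2, 1, 2],
    "Revenue": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Correlation of every feature with the target column
correlations = toy.corr()["Revenue"].drop("Revenue")

# Keep only features whose absolute correlation reaches the threshold
threshold = 0.02
weak_columns = correlations[correlations.abs() < threshold].index.tolist()
reduced = toy.drop(columns=weak_columns)

print(correlations)
print("Dropped:", weak_columns)
```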
The pairs Administrative and Administrative_Duration, Informational and Informational_Duration, and ProductRelated and ProductRelated_Duration appear to be very highly correlated, as can be seen in the correlation matrix. This is expected, as the time spent on a type of page is surely influenced by the number of pages of that type visited. Thus, we can remove the page-count column of each pair and keep the duration.
Data Engineering
Remove columns
I’ll now remove all the columns that are not needed: Administrative, Informational, ProductRelated, OperatingSystems, Region and TrafficType.
Encoding categorical columns
I’ll encode the categorical columns using LabelEncoder. I’ll then use OneHotEncoder to encode the column classes and append the resulting columns to the dataset.
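The encoding step might be sketched as follows. This version applies OneHotEncoder directly to a string column (recent scikit-learn versions accept strings, so the LabelEncoder pass is skipped here), and the toy rows and column values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy rows (assumption) with one categorical and one boolean column
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor", "Other"],
    "Weekend": [False, True, False, False],
    "PageValues": [0.0, 12.5, 3.1, 0.0],
})
categorical = ["VisitorType"]

# One-hot encode the categorical columns; the default output is sparse,
# so convert it to a dense array before building a DataFrame.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy[categorical]).toarray()

# Build readable column names from the categories the encoder learned
new_columns = [
    f"{column}_{value}"
    for column, values in zip(categorical, encoder.categories_)
    for value in values
]
encoded_df = pd.DataFrame(encoded, columns=new_columns, index=toy.index)

# Replace the original categorical columns with the encoded ones
result = pd.concat([toy.drop(columns=categorical), encoded_df], axis=1)

# Boolean columns only need a cast to 0/1
result["Weekend"] = result["Weekend"].astype(int)

print(result)
```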
Scaling the data
Next, I’ll rescale the data so that each column has its mean around 0.
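One common way to do this is sklearn's StandardScaler, which I am assuming here; it centres each column at 0 and also rescales it to unit variance. The tiny arrays are made up for illustration. Note that the scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data (assumption): two columns on very different scales
X_train = np.array([[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]])
X_test = np.array([[2.0, 200.0]])

# Learn the mean and standard deviation from the training set only,
# then apply the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))
print(X_test_scaled)
```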
Model generation
I’ll use Keras with TensorFlow as its backend to build an Artificial Neural Network. There are 32 input nodes, followed by 4 hidden layers and 1 output layer. The model architecture was designed to increase the validation set accuracy while preventing overfitting on the train set.
I developed a Sequential model for the ANN. Each layer is either a Dense layer or a Dropout layer. Each Dense layer specifies the number of neurons as units and the activation function as relu, and the first layer also includes the input_dim of the inputs. The activation for the output layer is sigmoid. Dropout helps reduce overfitting on the training data.
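A minimal sketch of such an architecture is shown below. Only the 32 inputs, the 4 hidden layers, the relu and sigmoid activations and the use of Dropout come from the description above; the unit counts, dropout rates, optimizer and loss are my assumptions. I also use keras.Input instead of input_dim, which is the form newer Keras versions prefer.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32,)),               # 32 input features
    layers.Dense(64, activation="relu"),    # hidden layer 1 (units assumed)
    layers.Dropout(0.3),                    # dropout rate assumed
    layers.Dense(32, activation="relu"),    # hidden layer 2
    layers.Dropout(0.3),
    layers.Dense(16, activation="relu"),    # hidden layer 3
    layers.Dropout(0.2),
    layers.Dense(8, activation="relu"),     # hidden layer 4
    layers.Dense(1, activation="sigmoid"),  # output: probability of revenue
])

# Binary classification setup (optimizer and loss are assumptions)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.summary()
```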
Training and evaluating the model
I trained the model for a total of 50 epochs with a 90% train and 10% validation split. I achieved a training accuracy of 90.44% and a validation accuracy of 89.06%.
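The training call could look something like the sketch below. To keep it quick and self-contained, it uses random synthetic data, a deliberately tiny model and only 3 epochs, all of which are assumptions; the validation_split=0.1 argument is what holds out 10% of the training data. Keras takes that 10% from the end of the arrays, before any shuffling.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic data (assumption): 200 rows with 32 features each
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 32)).astype("float32")
y_train = (X_train[:, 0] > 0).astype("float32")

# Tiny stand-in model, not the architecture from the article
model = keras.Sequential([
    keras.Input(shape=(32,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hold out the last 10% of the training data for validation
history = model.fit(
    X_train,
    y_train,
    epochs=3,          # the article trains for 50
    batch_size=32,     # batch size assumed
    validation_split=0.1,
    verbose=0,
)

print("train accuracy:", history.history["accuracy"][-1])
print("validation accuracy:", history.history["val_accuracy"][-1])
```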
Finally, I made predictions on the test data. I then used the predicted and actual values to create a confusion matrix and calculate the test accuracy, which came to 88.77%.
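Here is a small sketch of that evaluation step. The labels and probabilities are made up, and the 0.5 cut-off for turning sigmoid outputs into class predictions is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predicted probabilities (assumption)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
probabilities = np.array([0.1, 0.2, 0.7, 0.05, 0.3, 0.4, 0.9, 0.35, 0.8, 0.15])

# Convert probabilities to class labels with a 0.5 threshold (assumed)
y_pred = (probabilities > 0.5).astype(int)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy: {accuracy:.2%}")
```

With imbalanced classes, it is worth reading the individual cells and not just the accuracy, since a model can score well overall while missing many of the rare revenue-generating visitors.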
Result analysis
The confusion matrix reveals that we are able to identify both types of visitors: those that are going to generate revenue and those that are not. We can use this information as follows:
- Once we are able to identify that someone is going to generate revenue, we do not need to provide any coupons. Instead, we can give these visitors special reward points which they can use the next time they visit.
- The visitors that are unlikely to make a purchase can be provided with discount coupons so that they are more likely to make a purchase.
Thanks for reading. Do share your thoughts, ideas and suggestions. You can reach out to me on LinkedIn too.