
The Complete Guide to Neural Network multi-class Classification from scratch

What on earth are neural networks? This article will give you a full and complete introduction to writing neural networks from scratch and using them for multinomial classification. Includes the Python source code.

Photo by author: Mountain biking with friends 2018

Neural networks reflect the behavior of the human brain. They allow programs to recognise patterns and solve common problems in machine learning, and they are another option for performing classification as an alternative to logistic regression. At Rapidtrade, we use neural networks to classify data and run regression scenarios. The source code for this article is available on GitHub.

We will be working with a dataset from Kaggle, which you can download here. The data we will be working with in this article is visualised below. We will use it to train the network to categorise our customers according to column J, using the 3 highlighted features to classify them. I needed 3 features to fit my neural network and these were the best 3 available.

Figure 1: Our Dataset

Just keep in mind, we will convert all the alpha string values to numerics. After all, we can’t plug strings into equations 😉

This is quite a long article and is broken up into 2 sections:

  • Introduction
  • Putting it all together

Good luck 😉


Introduction

Neural networks are always made up of layers, as seen in figure 2. It all looks complicated, but let’s unpack this to make it more understandable.

Figure 2: Neural networks

A neural network has 6 important concepts, which I will explain briefly here, but cover in detail in this series of articles.

  • Weights: these are like the thetas we would use in other algorithms
  • Layers: our network will have 3 layers
  • Forward propagation: use the features/weights to get Z and A
  • Back propagation: use the results of forward propagation/weights to get S
  • Calculating the cost/gradient of each weight
  • Gradient descent: find the best weight/hypothesis

In this series, we will be building a neural network with 3 layers. Let's discuss these layers quickly before we get into the thick of it.

– Input Layer

Refer to figure 2 above; we will refer to the result of this layer as A1. The size (number of units) of this layer depends on the number of features in our dataset.

Building our input layer is not difficult: you simply copy X into A1, but add what is called a bias column, which defaults to 1.

  • Col 1: bias column, defaults to 1
  • Col 2: "Ever married", our 1st feature, re-labeled to 1/2
  • Col 3: "Graduated", our 2nd feature, re-labeled to 1/2
  • Col 4: "Family size", our 3rd feature

Figure 3: Visualizing A1 - input layer

– Hidden layer

Refer to figure 2 above: we only have 1 hidden layer, but you could have a hidden layer per feature. If you had more hidden layers, then you would replicate the calculations I mention below for each hidden layer.

The size (number of units) is up to you; we have chosen #features * 2, i.e. 6 units.

This layer is calculated during forward and backward propagation. After running both these steps, we calculate Z2, A2 and S2 for each unit. See below for the outputs once each of these steps is run.

Forward propagation

Refer to figure 1; in this step, we calculate Z2 and then A2.

  • Z2 contains the results of our hypothesis calculation for each of the 6 units in our hidden layer.
  • A2 also includes the bias column (col 1) and has the sigmoid function applied to each of the cells from Z2.

Hence Z2 has 6 columns and A2 has 7 columns as per figure 4.

Figure 4: Visualizing Z2 and A2 - hidden layer
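If you want to convince yourself of those shapes before we get to the real code, here is a tiny standalone sketch with dummy data (the variable names mirror the article, but the numbers are made up and this is not part of the pipeline):

import numpy as np

m = 5                                     # pretend we have 5 customers
a1 = np.ones((m, 4))                      # 3 features + bias column
Theta1 = np.random.rand(6, 4)             # 6 hidden units x (3 features + bias)
z2 = np.matmul(a1, Theta1.transpose())    # m x 6
a2 = np.insert(1 / (1 + np.exp(-z2)), 0, 1, axis=1)  # m x 7 after adding the bias column
print(z2.shape, a2.shape)                 # (5, 6) (5, 7)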

Back propagation

So, after forward propagation has run through all the layers, we perform the back propagation step to calculate S2. S2 is referred to as the delta of each unit's hypothesis calculation. It is used to figure out the gradient for that theta and, later on, combined with the cost of this unit, it helps gradient descent figure out the best theta/weight.

Figure 5: Visualizing gradients in S2

– Output layer

Our output layer gives us the result of our hypothesis, i.e. if these thetas were applied, what would our best guess be in classifying these customers. The size (number of units) is derived from the number of labels for Y. As can be seen in figure 1, there are 7 labels, thus the size of the output layer is 7.

As with the hidden layer, this result is calculated during the 2 steps of forward and backward propagation. After running both these steps, here are the results:

Forward propagation

During forward prop, we will calculate Z3 and A3 for the output layer, as we did for the hidden layer. Refer to figure 1 above: no bias column is needed this time, and you can see the results of Z3 and A3 below.

Figure 6: Visualising Z3 and A3

Back propagation

Now that we have Z3 and A3 (referring to figure 1), let's calculate S3. As it turns out, S3 is simply a basic cost calculation, subtracting Y from A3. We will explore the equations in the upcoming articles, but we can nonetheless see the result below.


Putting it all together

So, the above is a little awkward as it only visualises the outputs in each layer. Our main focus in neural networks is a function to compute the cost of our neural network. Coding this function will take the following steps.

  1. Prepare the data
  2. Setup neural network
  3. Initialise a set of weights/thetas
  4. Create our cost function, which will:
     4.1 Perform forward propagation
     4.2 Calculate the cost of forward propagation
     4.3 Perform backward propagation
     4.4 Calculate the deltas and then gradients from backward prop

  5. Perform cost optimisation, which:
     5.1 Validates our cost function
     5.2 Performs gradient descent on steps (4.1) to (4.4) until it finds the best weight/theta to use for predictions

  6. Predict results to check accuracy

1. Prepare the data

To begin this exploratory analysis, first import the libraries we will use for loading, preparing and plotting the data.


from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from scipy import optimize as opt
import seaborn as sns
import matplotlib.pyplot as plt  # plotting
import numpy as np               # linear algebra
import os                        # accessing directory structure
import pandas as pd              # data processing, CSV file I/O (e.g. pd.read_csv)

Now, let’s read our data and have a quick look.

df = pd.read_csv('customertrain.csv')
df.head()
Figure 4: Visualising df

Running df.info(), we can see we have some work to do with null values, as well as some object fields to convert to numerics.

df.info()
Figure 8

So, let’s transform our object fields to numerics and drop the columns we do not need.

columns = ["Gender","Ever_Married","Graduated","Profession","Spending_Score"]
for feature in columns:
  le = LabelEncoder()
  df[feature] = le.fit_transform(df[feature])
df = df.drop(["ID","Gender","Age","Profession","Work_Experience","Spending_Score"], axis=1)
df.dropna(subset=['Var_1'], inplace=True)
df.head()
Figure 9: Result of DF after preparing data

Use fit_transform to encode our multinomial categories into numbers we can work with.

yle = LabelEncoder()
df["Var_1"] = yle.fit_transform(df["Var_1"])
df.head()
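A small aside: if you later want to map numeric predictions back to the original category names, the yle encoder defined above can invert the mapping.

print(yle.classes_)                       # the original Var_1 category names
print(yle.inverse_transform([0, 1, 2]))   # numeric labels back to category names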

Fill in missing features

An important part of regression is understanding which features are missing. We can choose to ignore all rows with missing values, or fill them in with either the mode, median or mean.

  • Mode = most common value
  • Median = middle value
  • Mean = average

Here is a handy function you can call which will fill in the missing features by your desired method. We will choose to fill in values with the average.

After running the code below, you should see 7992 rows with no null values.

def fillmissing(df, feature, method):
  if method == "mode":
    df[feature] = df[feature].fillna(df[feature].mode()[0])
  elif method == "median":
    df[feature] = df[feature].fillna(df[feature].median())
  else:
    df[feature] = df[feature].fillna(df[feature].mean())

features_missing = df.columns[df.isna().any()]
for feature in features_missing:
  fillmissing(df, feature=feature, method="mean")
df.info()

Extract Y

Let’s extract our Y column into a separate array and remove it from the dataframe.

Y = df["Var_1"]
df = df.drop(["Var_1"], axis=1)

Now copy out our X and y columns into matrices for easier matrix manipulation later.

X = df.to_numpy() # np.matrix(df.to_numpy())
y = Y.to_numpy().transpose() # np.matrix(Y.to_numpy()).transpose()
m,n = X.shape

Normalize features

Now, let's normalise X so all the features lie in a similar range. The goal of standardization is to bring all the features onto a common scale without distorting the differences in the range of the values: we rescale each feature so that it has a mean of 0 and a variance of 1.
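As a minimal sketch of that rescaling (I'm assuming the normalised matrix is stored in xn, the name used later when we call the cost function), you can do it by hand with NumPy or with the StandardScaler we imported earlier:

# Standardize each feature: subtract the column mean, divide by the column std
mu = X.mean(axis=0)
sigma = X.std(axis=0)
xn = (X - mu) / sigma
# equivalent, using scikit-learn:
# xn = StandardScaler().fit_transform(X)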

2. Setup neural network

Now we can set up the sizes of our neural network; first, below is the neural network we want to put together.

The initialisations below ensure the above network is achieved. So, now you are asking "What are reasonable numbers to set these to?"

  • Input layer = set to the number of features (dimensions)
  • Hidden layers = set to input_layer * 2
  • Output layer = set to the size of the labels of Y. In our case, this is 7 categories
input_layer_size = n                      # Dimension of features
hidden_layer_size = input_layer_size*2    # number of units in hidden layer
output_layer_size = len(yle.classes_)     # number of labels

3. Initialise weights (thetas)

As it turns out, this is quite an important topic for gradient descent. If you have not dealt with gradient descent, then check this article first. We can see above that we need 2 sets of weights (signified by ø).

We often still call these weights thetas; they mean the same thing.

We need one set of thetas for level 2 and a 2nd set for level 3. Each theta is a matrix of size(L) * (size(L-1) + 1), the extra column accounting for the bias unit. Thus for above:

  • Theta1 = 6×4 matrix
  • Theta2 = 7×7 matrix

We now have to guess at which initial thetas should be our starting point. Here, epsilon comes to the rescue, and below is the Python code to easily generate some random small numbers for our initial weights.

def initializeWeights(L_in, L_out):
  epsilon_init = 0.12
  # random values in the range [-epsilon_init, epsilon_init]
  W = np.random.rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init
  return W
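Here is how we might call it for our two weight matrices and unroll the result into the single nn_params vector that the optimiser in step 5 expects (the names initial_Theta1 and initial_Theta2 are my own):

initial_Theta1 = initializeWeights(input_layer_size, hidden_layer_size)   # 6 x 4
initial_Theta2 = initializeWeights(hidden_layer_size, output_layer_size)  # 7 x 7
# unroll both matrices into one flat parameter vector
nn_params = np.concatenate((initial_Theta1.flatten(),
                            initial_Theta2.flatten()), axis=None)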

After running the above function with our sizes for each theta as mentioned above, we will get some good small random initial values as in figure 7. For figure 1 above, the weights we mention would refer to row 1 in the matrices below.

Figure 7: Initial thetas

4. The cost function

We need a function which implements the cost function for our two layer neural network performing classification.

In the GitHub code (checknn.py), our cost function, called nnCostFunction, will return:

  • grad: an "unrolled" vector of the partial derivatives of the neural network
  • J: the final cost of this set of weights

Our cost function will need to perform the following:

  • Reshape nn_params back into the parameters Theta1 and Theta2, the weight matrices for our 2 layer neural network
  • Perform forward propagation to calculate (a) and (z)
  • Perform backward propagation, using (a) to calculate (s)

First up, our cost function needs to reshape our thetas back into a theta matrix for the hidden and output layers.

# Reshape nn_params back into the parameters Theta1 and Theta2, 
# the weight matrices for our 2 layer neural network
Theta1 = nn_params[:hidden_layer_size * 
   (input_layer_size + 1)].reshape( 
   (hidden_layer_size, input_layer_size + 1))
Theta2 = nn_params[hidden_layer_size * 
   (input_layer_size + 1):].reshape( 
   (num_labels, hidden_layer_size + 1))
# Setup some useful variables
m = X.shape[0]

4.1 Forward propagation

Forward propagation is an important part of neural networks. It's not as hard as it sounds.

In figure 7, we can see our network diagram with much of the details removed. We will focus on one unit in level 2 and one unit in level 3. This understanding can then be copied to all units. Take note of the matrix multiplication we can do (in blue in figure 7) to perform forward propagation.

I am showing the details for one unit in each layer, but you can repeat the logic for all layers.

Figure 7: Neural network forward propagation

Before we show the forward prop code, let's talk a little about the 2 concepts we need during forward prop.

4.1.1 Sigmoid functions

Since we are doing classification, we will use sigmoid to evaluate our predictions. A sigmoid function is a mathematical function with a characteristic "S"-shaped curve. A common example is the logistic function, defined by the formula g(z) = 1 / (1 + e^(-z)).

On GitHub, in checknn.py, the following handy functions are created:

  • sigmoid computes the sigmoid of the input parameter z
  • sigmoidGradient computes the gradient of the sigmoid function evaluated at z. This should work regardless of whether z is a matrix or a vector.
def sigmoid(z):
  g = np.frompyfunc(lambda x: 1 / (1 + np.exp(-x)), 1, 1)
  return g(z).astype(z.dtype)

def sigmoidGradient(z):
  return sigmoid(z) * (1 - sigmoid(z))
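As a quick sanity check (my own example), sigmoid should map 0 to 0.5, and its gradient is largest there at 0.25:

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))          # approx [0.0067, 0.5, 0.9933]
print(sigmoidGradient(z))  # approx [0.0066, 0.25, 0.0066]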

4.1.2 Regularization

We will implement regularization because one of the most common problems data science professionals face is overfitting. Overfitting is a situation where your model performs exceptionally well on training data but is not able to predict test data. Neural networks are complex, which makes them more prone to overfitting. Regularization is a technique which makes slight modifications to the learning algorithm so that the model generalizes better, which in turn improves the model's performance on unseen data.

If you have studied the concept of regularization in Machine Learning, you will have a fair idea that regularization penalizes the coefficients. In deep learning, it actually penalizes the weight matrices of the nodes.

We implement regularization in nnCostFunction by passing in a lambda which is used to penalise both the gradients and the costs that are calculated.
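As a tiny illustration with made-up numbers, the penalty sums the squares of every weight except the bias column (the first column), scaled by lambda:

Theta_example = np.array([[0.5, 1.0, -2.0],
                          [0.1, 3.0,  0.5]])
lam_example, m_example = 1.0, 10
# skip column 0 (the bias weights) when penalising
penalty = lam_example / (2 * m_example) * np.sum(Theta_example[:, 1:] ** 2)
print(penalty)  # 0.05 * (1 + 4 + 9 + 0.25) = 0.7125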

4.1.3 Implementing forward prop

As per figure 1, let's calculate A1. You can see that it is pretty much our X features, with the bias column, hard-coded to 1, added in front. Here is the Python code to do this:

# Add ones to the X data matrix
a1 = np.insert(X, 0, 1, axis=1)

This will now give you the results for A1 shown in figure 3. Take special note of the bias column of 1s added at the front.

Great, that's A1 done; let's move on to A2. Before we get A2, we will first run a hypothesis calculation to get Z2. Once you have the hypothesis, you can run it through the sigmoid function to get A2. Again, as per figure 1, add the bias column to the front.

# Perform forward propagation for layer 2
z2 = np.matmul(a1, Theta1.transpose())
a2 = sigmoid(z2)
a2 = np.insert(a2, 0, 1, axis=1)

OK, so we are almost there. Now on to A3: let's do the same as with A2, but this time we don't add the bias column.

z3 = np.matmul(a2, Theta2.transpose())
a3 = sigmoid(z3)

You may be asking, "Why do we keep Z2 and Z3?" Well, we will need those in back propagation, so we may as well keep them handy ;-).

4.2 Calculate the cost of forward prop

Before we continue, recall our Y column (figure 9), which contains the labels used to categorise our customers. To calculate the cost, we need to reformat Y into a matrix whose columns correspond to the labels. In our case we have 7 categories for our customers.

Figure 8 shows how Y is converted to a matrix y_one_hot, where each label is now indicated as a binary 1 in the appropriate column.

# turn Y into a matrix with a new column for each category and marked with 1
y_one_hot = np.zeros_like(a3)
for i in range(m):
  y_one_hot[i, y[i] - 1] = 1
Figure 8: Mapping Y from vector to matrix y_one_hot

Now that we have Y in a matrix format, let's have a look at the equation to calculate the cost.
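Since the code below mirrors the regularized cross-entropy cost taught in Andrew Ng's course, here it is written out (my own transcription):

J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y_k^{(i)}\log\big(a3_k^{(i)}\big) + \big(1 - y_k^{(i)}\big)\log\big(1 - a3_k^{(i)}\big)\Big] + \frac{\lambda}{2m}\Big(\sum_{j,k}\big(\Theta^{(1)}_{j,k}\big)^2 + \sum_{j,k}\big(\Theta^{(2)}_{j,k}\big)^2\Big)

where K = 7 is the number of labels and the regularization sums skip the bias column of each theta (which is what the Theta[:, 1:] slices below do).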

Well, that's all very complicated, but the good news is that with some matrix manipulation, we can do it in a few lines of Python code, as below.

# Calculate the cost of our forward prop
ones = np.ones_like(a3)
A = (np.matmul(y_one_hot.transpose(), np.log(a3)) +
     np.matmul((ones - y_one_hot).transpose(), np.log(ones - a3)))
# trace() sums the diagonal of A, which is the double sum over all
# examples and labels in the cost equation
J = -1 / m * A.trace()
# add the regularization term, skipping the bias column of each theta
J += (lambda_ / (2 * m) *
      (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2)))

4.3 Perform backward propagation

So, we have simplified our neural network in figure 1 to only show the details needed to:

  • Subtract Y from A3 to calculate S3
  • Thereafter multiply S3 by the thetas mentioned below (and by the sigmoid gradient) to calculate S2

Since a picture paints 1000 words, figure 9 should explain what we use to calculate S3 and thereafter S2 (marked in red).

Figure 9: Backward propagation

From step (3) we understand how our weights (thetas) were initialised; to visualise the weights (ø) that figure 9 is referring to, see figure 10 below.

Figure 9: Weights used in Backward propagation

So again, with matrix manipulation to the rescue, backward propagation is not a difficult task in Python.

# Perform backward propagation to calculate deltas
s3 = a3 - y_one_hot
s2 = (np.matmul(s3, Theta2) *
      sigmoidGradient(np.insert(z2, 0, 1, axis=1)))
# remove the bias column from s2
s2 = s2[:, 1:]

4.4 Calculate gradients from backward prop

We need to return the gradients as part of our cost function. They are needed because gradient descent continuously re-samples the gradient of the model's parameters and moves in the opposite direction based on the weights w, updating consistently until we reach the global minimum of the function J(w).

Equation for backward prop

To put it simply, we use gradient descent to minimize the cost function, J(w).

Figure 10

And again, matrix manipulation to the rescue makes it just a few lines of code.

Our first step is to calculate a penalty which can be used to regularise our cost. If you want an explanation of regularisation, then have a look at this article.

# calculate regularized penalty, replace 1st column with zeros
p1 = (lambda_/m) * np.insert(Theta1[:, 1:], 0, 0, axis=1)
p2 = (lambda_/m) * np.insert(Theta2[:, 1:], 0, 0, axis=1)

For cost optimisation, we need to feed back the gradient of this particular set of weights. Figure 2 indicates what a gradient is once it has been plotted. For the set of weights being fed to our cost function, this will be the gradient of the plotted line.

# delta_1 and delta_2 are the error terms accumulated during back prop
# (s2 and s3 multiplied by the activations; see the full code on GitHub)
# gradients / partial derivatives
Theta1_grad = delta_1 / m + p1
Theta2_grad = delta_2 / m + p2

However, the cost optimisation functions don't know how to work with 2 separate thetas, so let's unroll them into a single vector, with the result shown in figure 5.

grad = np.concatenate((Theta1_grad.flatten(),  
   Theta2_grad.flatten()), axis=None)

OK, wow, that's been a lot of info, but our cost function is done. Let's move on to running gradient descent and cost optimization.

5. Perform cost optimization

5.1 Validating our cost function

One difficult thing to understand is whether our cost function is performing well. A good method to check this is to run a function called checknn.

It creates a small neural network to check the backpropagation gradients: it outputs the analytical gradients produced by your backprop code and the numerical gradients (computed using computeNumericalGradient). These two gradient computations should result in very similar values.

If you want to delve more into the theory behind this technique, it is taught in Andrew Ng's machine learning course, week 4.

You do not need to run this every time, just when you have setup your cost function for the first time.

I won't put the code here, but check the GitHub project in checknn.py for the following functions:

  • checkNNGradients
  • debugInitializeWeights
  • computeNumericalGradient
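To give a flavour of the idea behind computeNumericalGradient (this is my own minimal sketch, not the GitHub implementation), each parameter is perturbed by a small epsilon and the resulting slope is compared with the analytical gradient:

def numericalGradient(costFunc, theta, eps=1e-4):
  # central-difference approximation of dJ/dtheta for each parameter
  grad = np.zeros_like(theta)
  for i in range(theta.size):
    perturb = np.zeros_like(theta)
    perturb[i] = eps
    grad[i] = (costFunc(theta + perturb) - costFunc(theta - perturb)) / (2 * eps)
  return grad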

After running checknn, you should get the following results:

Figure 11: Result of validating our cost function

5.2 Gradient descent

Gradient descent is an optimization algorithm which is mainly used to find the minimum of a function. In machine learning, gradient descent is used to update parameters in a model. Parameters can vary according to the algorithms, such as coefficients in Linear Regression and weights in Neural Networks. We will use SciPy optimize modules to run our gradient descent.

from scipy import optimize as opt
print('Training Neural Network... ')
#  Change the MaxIter to a larger value to see how more 
#  training helps.
options = {'maxiter': 50, 'disp': True}
# You should also try different values of lambda
lambda_ = 1
# Create cost function shortcuts to be minimized
fun = lambda nn_params: nnCostFunction2(nn_params, input_layer_size, hidden_layer_size, output_layer_size, xn, y, lambda_)[0]
jac = lambda nn_params: nnCostFunction2(nn_params, input_layer_size, hidden_layer_size, output_layer_size, xn, y, lambda_)[1]
# Now, costFunction is a function that takes in only one 
# argument (the neural network parameters)
res = opt.minimize(fun, nn_params, method='CG', jac=jac, options=options)
nn_params = res.x
cost = res.fun
print(res.message)
print(cost)
Figure 12: Result of running gradient descent

Get our thetas back for each layer by using a reshape

# Obtain Theta1 and Theta2 back from nn_params
Theta1 = nn_params[:hidden_layer_size * (input_layer_size + 
   1)].reshape((hidden_layer_size, input_layer_size + 1))
Theta2 = nn_params[hidden_layer_size * (input_layer_size + 
   1):].reshape((output_layer_size, hidden_layer_size + 1))

6. Predict results to check accuracy

Now that we have our best weights (thetas), let’s use them to make a prediction to check for accuracy.

pred = predict(Theta1, Theta2, X)
print(f'Training Set Accuracy: {(pred == y).mean() * 100:f}')

You should get an accuracy of 65.427928%. Yes, it's a little low, but that's the dataset we are working with. I have tried this dataset with logistic regression and SVM and got similar results.
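For reference, here is a minimal sketch of what predict does (the real implementation is in the GitHub code): it runs forward propagation with the trained weights and picks the label with the highest activation. I add 1 at the end to match the y[i] - 1 convention used when building y_one_hot; adjust the offset if your labels are encoded differently.

def predict(Theta1, Theta2, X):
  # forward propagation with the trained weights
  a1 = np.insert(X, 0, 1, axis=1)
  a2 = np.insert(sigmoid(np.matmul(a1, Theta1.transpose())), 0, 1, axis=1)
  a3 = sigmoid(np.matmul(a2, Theta2.transpose()))
  # most likely label per row; + 1 undoes the y[i] - 1 shift used for y_one_hot
  return np.argmax(a3, axis=1) + 1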

Conclusion

I hope this article gives you a deep level of understanding of neural networks and how you can use them to classify data. Let me know how you go…

