Special thanks to: Alex Scriven

Contents:
- Introduction
- Background
- Semantics
- Set Up
- Get Data
- Check Data
- Prepare the Data
- Instantiate the Network
- Initialise the Network
- Forward Propagation
- Calculate the Cost
- Backward Propagation
- Update Model Parameters
- Run the Model End-to-End
- Create Prediction
- Conclusion
- Post Script
1. Introduction
Modern-day Data Science techniques frequently use robust frameworks for designing and building machine learning solutions. In the [R](https://www.r-project.org/)
community, packages such as the [tidyverse](https://www.tidyverse.org/)
and the [caret](http://caret.r-forge.r-project.org/)
packages are frequently referenced; and within [Python](https://www.python.org/)
, packages such as [numpy](https://numpy.org/)
, [pandas](https://pandas.pydata.org/)
, [scikit-learn](https://scikit-learn.org/)
are frequently referenced. There are even some packages that have been built to be used in either language, such as [keras](https://keras.io/)
, [pytorch](https://pytorch.org/)
, [tensorflow](https://www.tensorflow.org/)
. However, the limitation of using these packages is the ‘black box’ phenomenon, where users do not understand what is happening behind the scenes (or ‘under the hood’, if you will). Users know how to use the functions, and can interpret the results, but don’t necessarily know how the packages were able to achieve those results.
The purpose of this paper is to create a ‘back to basics‘ approach to designing Deep Learning solutions. The intention is not to create the most predictive model, nor is it to use the latest and greatest techniques (such as convolution or recursion); but the intention is to create a basic neural network, from scratch, using no frameworks, and to walk through the methodology.
Note: The word ‘Vanilla’ in ‘Vanilla Neural Networks’ simply refers to the fact that the network is built from scratch, and does not use any pre-existing frameworks in its construction.
2. Background
2.1. Context
There are already many websites and blogs which explain how this process is done, such as Jason Brownlee’s article How to Code a Neural Network with Backpropagation In Python (from scratch), and DeepLearning.ai’s notebook dnn_app_utils_v2.py (on the Microsoft Azure Notebooks network). However, these sources are all written in Python. That is fine, if that’s what is needed, and there are some very legitimate reasons to use Python over other languages; but this paper will be written in R.
The R language was chosen for two reasons:
- I am familiar with the language. I can speak Python (along with other languages); I just chose R to show how it can be achieved using this language.
- To prove that there are many different ways to achieve the same outcome. So, while there are sometimes legitimate constraints for choosing one language over another (business legacy, technological availability, system performance, etc), sometimes one language is chosen simply because it is stylistically more preferable.
Therefore, let’s see how to architect and construct a Vanilla Neural Network in R.
2.2. What It’s Not
This article does not cover the latest and greatest Deep Learning architectures (like Convolution or Recursion). As such the final performance may not be as good as it could be, if these other architectures were used.
This article does not teach readers about the theoretical mathematical concepts behind how Neural Networks operate. There are plenty of other lectures which teach this information (eg. The Math Behind Neural Networks). In fact, this article assumes a lot of knowledge from the Reader about programming, about calculus, and about the basics behind what a Neural Network conceptually is.
This article does not cover why Neural Networks work the way they do, and the conceptual understanding behind a Feed-Forward architecture. There are plenty of other blogs (eg. A Beginner Intro to Neural Networks) and videos (eg. Neural Networks series) which cover such information.
This article does not point the reader to other packages and applications which may already have this functionality set up and working; packages like [tensorflow](https://www.rdocumentation.org/packages/tensorflow) and [nnet](https://www.rdocumentation.org/packages/nnet) already have this covered.
What this article actually is, is a functional walk-through, how-to piece, for creating a Vanilla Neural Network (a Feed-Forward Network), from scratch, step-by-step, in the R
programming language. It contains lots of code, and lots of technical details.
3. Semantics
3.1. Layout
This article is laid out in such a way as to describe how a Neural Network is built from the ground up. It will walk through the steps to:
- Access and check the data
- Instantiate and Initialise the network
- Run forward propagation
- Compute the cost
- Run backward propagation
- Update the model
- Set up a training method to loop through everything
- Predict and assess the performance
In the interest of brevity, the functions defined here will not include all the commentary and validations that should be included in a typical function. They will only include basic steps and prompts. However, the source code for this article (located here) does contain all the appropriate function docstrings and assertions.
3.2. Syntax
For the most part, the syntax in this article is kept to the [dplyr](https://www.rdocumentation.org/packages/dplyr)
‘pipe’ method (which uses the %>%
symbol). However, in certain sections the R [base](https://www.rdocumentation.org/packages/base)
syntax is used (for example, in function declaration lines).
Throughout the article, many custom functions are written. Each of these is prefixed with the word get, let or set. The definitions of each are given below.
- get_*(): Gets certain attributes or meta-data from the objects which are passed to this function, or uses the information passed to this function to derive and get other values or parameters.
- set_*(): Sets (or ‘updates’) the objects which are passed to this function. Usually used for updating the network during the forward and backward propagation processes.
- let_*(): Similar to get, in that it takes values passed to this function to derive an outcome; however, it will let this value be utilised by another object or function. Used mainly for the initialisation and activation functions.
4. Set Up
4.1. Load Packages
The first step is to import the relevant packages. The list below includes the main packages used throughout this process, and the main purpose of each is also noted.
Note that, despite what is stated above about not using existing Deep Learning packages, the tensorflow package is included. Why? Well, it is used only for accessing the data, which will be covered in the next section. The tensorflow package is not used for building or training any networks.
library(tensorflow) #<-- Only used for getting the data
library(tidyverse) #<-- Used for accessing various tools
library(magrittr) #<-- Extends the `dplyr` syntax
library(grDevices) #<-- For plotting the images
library(assertthat) #<-- Function assertions
library(roxygen2) #<-- Documentation is important
library(caret) #<-- Doing data partitioning
library(stringi) #<-- Some string manipulation parts
library(DescTools) #<-- To properly check `is.integer`
library(tictoc) #<-- Time how long different processes take
library(docstring) #<-- Makes viewing the documentation easier
library(roperators) #<-- Conveniently using functions like %+=%
library(plotROC) #<-- For plotting predictions
5. Get Data
5.1. Download Data
The dataset to be used is the CIFAR-10 dataset. It’s chosen for a number of reasons, including:
- The data is on images, which is ideal for Deep Learning purposes;
- There are a decent number of images included (60,000 images in total);
- All images are the same size (32×32 pixels);
- The images have been categorised in to 10 different classes; and
- It’s easily accessible through the
TensorFlow
package.
The following code chunk performs these process steps:
- Get the data:
  - In order to import the data, it is accessed through the keras element, which contains the suite of datasets, including the cifar10 part.
  - The load_data() function retrieves the data from the online GitHub repository.
- Extract the second element:
  - The load_data() call returns two different objects: the Train dataset (containing 50,000 images) and the Test dataset (containing 10,000 images).
  - The second element is extracted (by using the extract2(2) function) because only 10,000 images are needed for these purposes.
  - This article is to show the process of creating Vanilla Neural Networks; if more data is needed at a later stage, it can easily be accessed here.
- Name the parts:
  - The data as downloaded contains two further elements: the images themselves (in the form of a 4-Dimensional array), and the image labels (in the form of a 2-Dimensional, single-column array).
  - This data does not have any names, therefore the names are set by using the set_names() function.
# Download Data
# NOTE:
# - The first time you run this function, it downloads everything.
# - Next time you run it, TensorFlow will load from Cache.
cifar <- tf$keras$datasets$cifar10$load_data() %>%
extract2(2) %>%
set_names(c("images","classes"))
5.2. Get Class Definitions
One of the challenges behind accessing this data from the TensorFlow
package is that the classes are only numeric values (0
to 9
) for each type of image. The definitions for these classes can be found on GitHub (GitHub > EN10 > CIFAR). These classes are defined in the following code chunk.
# Define classes
ClassList <- c(
"0" = "airplane",
"1" = "automobile",
"2" = "bird",
"3" = "cat",
"4" = "deer",
"5" = "dog",
"6" = "frog",
"7" = "horse",
"8" = "ship",
"9" = "truck"
)
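As a quick usage note, a class definition can then be looked up by its character key, which is exactly how the plotting function in the Check Data section uses it:
# Look up class definitions by their (character) keys
ClassList[["1"]]               #<-- "automobile"
ClassList[[as.character(7)]]   #<-- "horse"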
6. Check Data
6.1. Check Objects
It is important to check the data, to ensure that it has been generated correctly, and all the information looks okay. For this, a custom-function is written (get_ObjectAttributes()
), the source code for which can be found here. As seen by the following code chunk, the images
object is a 4-Dimensional numeric array, with 10,000
images, each 32
x32
pixels, and 3
colour channels. The entire object is over 117 Mb in size.
# Check Images
cifar %>%
extract2("images") %>%
get_ObjectAttributes("cifar$images") %>%
cat()
Which prints:
Name : cifar$images
- Size : 117.2 Mb
- Clas : array
- Type : integer
- Mode : numeric
- Dims : 10000x32x32x3
When checking the classes
object, it is a 2-Dimensional numeric array (with only 1 column), but with the same number of images as the images
object (which is to be expected), and each class label occurs exactly 1,000 times. The total size is less than 40 Kb.
# Check classes
cifar %>%
extract2("classes") %>%
get_ObjectAttributes(name="cifar$classes", print_freq=TRUE) %>%
cat()
Which prints:
Name : cifar$classes
- Size : 39.3 Kb
- Clas : matrix,array
- Type : integer
- Mode : numeric
- Dims : 10000x1
- Freq :
label Freq
1 0 1000
2 1 1000
3 2 1000
4 3 1000
5 4 1000
6 5 1000
7 6 1000
8 7 1000
9 8 1000
10 9 1000
6.2. Check Images
After having gained an appreciation of the size of the objects in memory, it is then worthwhile to check the actual images themselves. As humans, we understand the actual images and the colours better than we understand the numbers.
In order to visualise the images, two custom functions are written, as shown in the following code chunk. These functions take in the data (as a 4-Dimensional array), and visualise the images as a plot.
set_MakeImage <- function(image, index=1) {
# Extract elements
image.r <- image[,,1]
image.g <- image[,,2]
image.b <- image[,,3]
# Make rgb
image.rgb <- rgb(
image.r,
image.g,
image.b,
maxColorValue=255
)
# Fix dimensions
dim(image.rgb) <- dim(image.r)
# Return
return(image.rgb)
}
plt_PlotImage <- function(images, classes, class_list, index=1) {
# Slice images
image <- images[index,,,]
image %<>% set_MakeImage(index)
lbl <- classes %>%
extract(index) %>%
as.character() %>%
class_list[[.]]
# Create plot
plot <- ggplot() +
ggtitle(lbl) +
cowplot::draw_image(image, interpolate=FALSE) #<-- draw_image() is provided by the cowplot package
# Return
return(plot)
}
When the function is run on the top 16 images, the following is displayed. As shown, the images are extremely pixelated (which is expected, as they are only 32x32 pixels each), and you can see how each of the images is categorised.
# Set list
lst <- list()
# Loop 16 images
for (index in 1:16) {
lst[[index]] <- plt_PlotImage(
cifar$images,
cifar$classes,
ClassList,
index)
}
# View images
plt <- gridExtra::grid.arrange(grobs=lst, ncol=4)

7. Prepare the Data
There are four steps to preparing the data:
- Reclassify
- Split
- Reshape
- Standardise
7.1. Reclassify
For the purposes of this paper, let’s assume that we are trying to predict whether or not the picture is a car. This will require transforming the data in to a binary classification problem, where the Neural Network will be looking to predict a 1 or a 0 from the data. This means that the model output will be a probability distribution of scores, which can easily be classified by changing the cutoff value.
The first step is to reclassify the data so that all the cars have the value 1, and everything else is 0. We know from the classes defined before that the cars already have the value 1, which means the transformation only needs to occur for all the other classes.
# Implement within a pipe
cifar[["classes"]] <- cifar %>%
extract2("classes") %>%
(function(classes){
# View initial classes
classes %>% as.vector %>% head(40) %>% print
# Reclassify
classes <- ifelse(classes==1,1,0)
# View reclassified classes
classes %>% as.vector %>% head(40) %>% print
# Return
return(classes)
})
Which prints:
[1] 3 8 8 0 6 6 1 6 3 1 0 9 5 7 9 8 5 7 8 6 7 0 4 9 5 2 4 0 9 6
[1] 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7.2. Split Data
The next task is to split the data in to a training and a testing set. The reason for doing this is covered elsewhere (on sites such as Wikipedia’s Cross-Validation page, and Google’s Machine Learning Crash Course: Training and Test Sets: Splitting Data).
Firstly, to get an appreciation of the current split of the data, the following code chunk visualises this data using the ggplot2
package. As shown, the data is currently distributed with 90% of images having the class 0, and the remaining 10% having the class 1.
# Print Plot
cifar %>%
extract("classes") %>%
table(dnn="classes") %>%
data.frame() %>%
ggplot(aes(classes, Freq, fill=classes)) +
geom_col(colour="black") +
geom_label(
aes(label=Freq),
show.legend=FALSE
) +
scale_y_continuous(breaks=seq(0,10000,1000)) +
theme(panel.grid.minor.y=element_blank()) +
labs(
title="Count of Each Class"
)

To implement this data splitting, we are using the caret::createDataPartition()
function. This creates a partition
object, which is then used to segregate the cifar
data accordingly. The proportion of the split was arbitrarily chosen at 70% for training, with the remainder for testing. However, it could be justified for this to be 80%; this is a hyperparameter which can be tweaked at a later stage.
# Set seed for reproducibility
set.seed(1234)
# Create partition
partition <- createDataPartition(cifar$classes, p=0.7, list=FALSE)
# Split data
trn_img <- cifar$images[partition,,,]
tst_img <- cifar$images[-partition,,,]
trn_cls <- cifar$classes[partition]
tst_cls <- cifar$classes[-partition]
After splitting, the data is re-plotted, and the train/test split is easily seen to have achieved an even 70/30 distribution within each of the two classes.
# Print Plot
rbind(
trn_cls %>% table,
tst_cls %>% table
) %>%
set_rownames(c("train","test")) %>%
data.frame() %>%
rename_all(str_remove_all, "X") %>%
rownames_to_column("data") %>%
pivot_longer(2:3, names_to="classes", values_to="Freq") %>%
mutate(
label=paste(data, classes, sep=": "),
data=factor(data, levels=c("train","test"))
) %>%
ggplot(aes(classes, Freq, fill=data), position="dodge") +
geom_col(colour="black", position="dodge") +
geom_label(
aes(label=Freq),
position=position_dodge(width=0.9),
show.legend=FALSE
) +
scale_y_continuous(breaks=seq(0,10000,1000)) +
theme(panel.grid.minor.y=element_blank()) +
labs(
title="Count of Each Class",
subtitle="Split by Train/Test"
)

Another method for checking that the data has been split properly, is to again run the get_ObjectAttributes()
function, as shown in the following code chunk. The information shown here is consistent with the plot above. It is also interesting to note that the Training Image array is 82 Mb
in size, which will be important to know later, when checking the performance of forward propagation.
for (name in c("trn_img","tst_img","trn_cls","tst_cls")) {
name %>%
get() %>%
get_ObjectAttributes(
name,
if (name %in% c("trn_cls","tst_cls")) TRUE else FALSE
) %>%
cat()
if (name != "tst_cls") cat("\n")
}
Which prints:
Name : trn_img
- Size : 82 Mb
- Clas : array
- Type : integer
- Mode : numeric
- Dims : 7000x32x32x3
Name : tst_img
- Size : 35.2 Mb
- Clas : array
- Type : integer
- Mode : numeric
- Dims : 3000x32x32x3
Name : trn_cls
- Size : 54.7 Kb
- Clas : numeric
- Type : double
- Mode : numeric
- Dims : 7000
- Freq :
label Freq
1 0 6309
2 1 691
Name : tst_cls
- Size : 23.5 Kb
- Clas : numeric
- Type : double
- Mode : numeric
- Dims : 3000
- Freq :
label Freq
1 0 2691
2 1 309
7.3. Reshape Data
For the first input
layer of our Neural Network, we want it to be a single dimension of nodes. Therefore, it is necessary to reshape the data from a 4-Dimensional array in to a 2-Dimensional array. This process is called Flattening, and more information can be found here: Memory Layout of Multi-Dimensional Arrays.
The implementation of this method can be done quite easily with the array()
function, as it has the dim=
argument which can be used to specify the dimensions required.
The desired matrix should have each image in a new row, and each pixel in a different column. Because the pixels comprise the 2nd, 3rd and 4th dimensions, we need to take the product of these three dimensions, and use that to specify the desired number of columns. Effectively, we are running the equation 32×32×3, which equates to having 3072 columns. This equation is implemented programmatically in-line, in the next code chunk.
# Reshape data
trn_img %<>% array(dim=c(
dim(.) %>% extract(1),
dim(.) %>% extract(2:4) %>% prod()
))
tst_img %<>% array(dim=c(
dim(.) %>% extract(1),
dim(.) %>% extract(2:4) %>% prod()
))
trn_cls %<>% array(dim=c(
length(.),
1
))
tst_cls %<>% array(dim=c(
length(.),
1
))
When checking the object attributes once again, you will see that the image data has been manipulated correctly to have the number of rows as the number of images, and the number of columns as the number of pixels.
for (name in c("trn_img","tst_img","trn_cls","tst_cls")) {
name %>%
get() %>%
get_ObjectAttributes(name, FALSE) %>%
cat()
if (name != "tst_cls") cat("\n")
}
Which prints:
Name : trn_img
- Size : 82 Mb
- Clas : matrix,array
- Type : integer
- Mode : numeric
- Dims : 7000x3072
Name : tst_img
- Size : 35.2 Mb
- Clas : matrix,array
- Type : integer
- Mode : numeric
- Dims : 3000x3072
Name : trn_cls
- Size : 54.9 Kb
- Clas : matrix,array
- Type : double
- Mode : numeric
- Dims : 7000x1
Name : tst_cls
- Size : 23.6 Kb
- Clas : matrix,array
- Type : double
- Mode : numeric
- Dims : 3000x1
7.4. Standardise Data
The final step in preparing the data is to standardise it so that all the elements are between 0 and 1. The reason for this is to prevent exploding and vanishing gradients in later steps, which would be caused by the Neural Network trying to fit to all the peaks and troughs of data in a 0 to 255 value range.
As documented on the TensorFlow website, the CIFAR10 data set consists of RGB Image data. And, as documented on Wikipedia, RGB data are all values between 0
and 255
.
Therefore, all that is necessary is to divide every element by 255, and the result will always be a value between 0 and 1. As the image data is currently in an array, the division in the below code chunk runs as a vectorised operation over the entire array, dividing all elements by 255 accordingly.
trn_img <- trn_img/255
tst_img <- tst_img/255
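As a quick sanity check (assuming the objects above are still in memory), the range of the standardised data can be confirmed to now sit between 0 and 1:
# Confirm all pixel values now fall between 0 and 1
range(trn_img)   #<-- both values should be within [0, 1]
range(tst_img)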
The data is now ready for the network. Next step is to build the network.
8. Instantiate the Network
8.1. Define the Architecture
Some quick notes on what the Network will actually be:
- The overall architecture is a list.
- Each element of the master list is another list, and these form each of the layers of the overall network.
- The first layer will always be the input layer.
- The last layer will always be the output layer.
- Each of the layers in between will be the hidden layers, and these layers are simply named with numbers.
- Each element of each layer will be labelled the same way, defined as follows:
  - nodz: Number of nodes in this layer.
  - inpt: Input matrix. Aka A_prev. This is a duplicate of the activation of the previous layer, so for large networks this needs to be taken in to consideration.
  - wgts: Weights matrix. Aka W.
  - bias: Bias vector. Aka b.
  - linr: Linear matrix. Aka Z. This is the result of the linear algebra between inpt, wgts and bias.
  - acti: Activation matrix. Aka A. The result of applying an activation function to the linr matrix.
  - acti_func: The activation function used.
  - cost: The overall cost of the model. This is a single value (the overall cost of the model), but is copied to each layer of the model.
  - back_cost: Gradient of the cost vector. Aka dA_cost.
  - back_acti: Gradient of the Activation matrix. Aka dA. The result of differentiation, with a given cost function, after having applied back propagation.
  - back_linr: Gradient of the Linear algebra matrix. Aka dZ. The result of the backwards linear differentiation during back propagation.
  - back_wgts: Gradient of the Weights matrix. Aka dW. Also the result of back-prop.
  - back_bias: Gradient of the Bias vector. Aka db. Also the result of back-prop.
8.2. Set the Instantiation Function
For the below code chunk, the function set_InstantiateNetwork()
is defined. It only has three input arguments, which specify the number of nodes to be used in each layer. Based on this information, the model will be instantiated and returned, ready for initialisation, in the next step.
set_InstantiateNetwork <- function(
input=50,
hidden=c(30,20,10),
output=1
) {
# Set up
model = list()
layers = c(
"input",
1:length(hidden),
"output"
)
# Loop
for (layer in layers) {
# Make layer
model[[layer]] <- list(
"nodz" = "",
"inpt" = "",
"wgts" = "",
"bias" = "",
"linr" = "",
"acti" = "",
"acti_func" = "",
"cost" = "",
"back_cost" = "",
"back_acti" = "",
"back_linr" = "",
"back_wgts" = "",
"back_bias" = ""
)
# Set nodes
if (layer=="input") {
model[[layer]][["nodz"]] <- input
} else if (layer=="output") {
model[[layer]][["nodz"]] <- output
} else {
layer_index <- layer %>% as.numeric()
model[[layer]][["nodz"]] <- hidden[layer_index]
}
}
# Return
return(model)
}
8.3. Create the Network
The below code chunk instantiates the network. The model will be set up with the following number of nodes in each layer:
- The input layer will have 3072 nodes, the same number calculated above.
- Each hidden layer will have a decreasing number of nodes, from 100 down to 20 nodes.
- The output layer will have 1 node, because this one node will be a floating-point number between 0 and 1, which will be used to predict whether the relevant image is a car or not.
network_model <- set_InstantiateNetwork(
input=3072,
hidden=c(100,75,50,30,20),
output=1
)
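As a quick check of the instantiated (but not yet initialised) network, the number of nodes in each layer can be pulled straight from the list structure. A minimal sketch, assuming the network_model object created above:
# Check the number of nodes defined in each layer
sapply(network_model, function(layer) layer[["nodz"]])
#<-- expected: input=3072, 1=100, 2=75, 3=50, 4=30, 5=20, output=1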
8.4. Visualise the Network
There is a very good website available which allows Neural Networks to be visualised: http://alexlenail.me/NN-SVG/AlexNet.html. The below image is a representation of the network that has just been created.
Further visualisation can be conducted once the network is fully initialised and forward propagated. See section Check Model Shapes for more details.

9. Initialise the Network
There are four steps to initialising the network:
- Set the Weight Initialisation functions
- Set the Layer Initialisation function
- Set the Model Initialisation function
- Run the Initialisation
9.1. Weight Initialisation
At its core, weight initialisation is simply generating a random normal number (with μ=0 and σ=1). However, when only using this randomly generated number, the model gradients are found to explode or vanish when attempting to train deeper neural networks. Therefore, these weights need to be scaled after they are initialised, in order to be robust enough to continue to be trained at deeper layers.
There are many algorithms that can be used for weight initialisation. Two of the more common ones are the Xavier algorithm, and the He algorithm. Some good resources for understanding the details behind these algorithms include:
- Weight Initialization in Neural Networks
- Understanding the difficulty of training deep feedforward neural networks
- Surpassing Human-Level Performance on ImageNet Classification
9.1.1. Xavier Algorithm
The equation for the Xavier initialisation scaling factor is:
scale = √( 6 / (nᵢ + nᵢ₊₁) )
Where:
- nᵢ is the number of nodes coming in to this layer. Also known as "fan-in".
- nᵢ₊₁ is the number of nodes going out of this layer. Also known as "fan-out".
9.1.2. He Algorithm
The equation for the He initialisation scaling factor is:
scale = √( 2 / nᵢ )
Where:
- nᵢ is the number of nodes coming in to this layer.
9.2. Initialisation Functions
For programmatic purposes, these functions are written with the order
value as part of the function arguments. This means that the numerator of the equation can be altered at a later stage and used as a hyperparameter.
let_InitialiseXavier <- function(nodes_in, nodes_out, order=6) {
# Do work
numer <- sqrt(order)
denom <- sqrt(nodes_in + nodes_out)
output <- numer/denom
# Return
return(output)
}
let_InitialiseHe <- function(nodes_in, nodes_out, order=2) {
# Do work
numer <- order
denom <- nodes_in
output <- sqrt(numer/denom)
# Return
return(output)
}
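For example, for the first hidden layer of the network defined above (3,072 nodes in, 100 nodes out), the two algorithms produce the following scaling factors (approximate values shown in the comments):
# Scaling factors for the first hidden layer (3,072 nodes in, 100 nodes out)
let_InitialiseXavier(nodes_in=3072, nodes_out=100)   #<-- approx 0.0435
let_InitialiseHe(nodes_in=3072, nodes_out=100)       #<-- approx 0.0255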
9.3. Layer Initialisation
The next step is to construct a function that will initialise all the relevant aspects of an individual layer. This step is important because it is where the weight matrices are created, and these must be constructed in a certain manner so as to ensure that the dimensions allow successful forward propagation.
The steps for the layer construction are as follows:
- Determine the layer names of the current layer (layer) and of the previous layer (layer_prev). These are used to access the relevant configurations from the network_model list.
- Determine the number of nodes feeding in to this layer (nodes_in), and the number of nodes feeding out of the current layer (nodes_out). These are passed in to the Initialisation Algorithms.
- Create the Weight matrix. The number of rows is the number of nodes in the previous layer, and the number of columns is the number of nodes in the current layer. Each element is created using the rnorm() function, which generates a random number on the normal curve, with μ=0 and σ=1.
- Determine the Initialisation Algorithm to be used. The value passed in to this function is a lower-case word associated with the relevant algorithm. It is coerced to Title capitalisation, and then given the prefix let_Initialise. The result is then passed in to the get() function, which calls the relevant function and executes it with the parameters passed to it. This is a programmatic way to flexibly call different algorithms, based on the values passed to the function.
- Scale the Weight matrix, by multiplying each element by the Initialisation Algorithm's scaling factor.
- Create the Bias matrix. The number of rows is the number of nodes in the current layer, and there is only one column. Every single element has the value 0.
- Apply the Weight and Bias matrices back on to the network_model.
- Return the updated network_model object.
In order to achieve this, the arguments for the function include:
- The network_model itself.
- The layer_index of the layer (where 1 is the input layer, and each increasing number is each subsequent layer).
- The initialisation_algorithm, which is either the value NA, or the value xavier or he, corresponding to the relevant algorithm.
- The initialisation_order, which is an integer value to be used as the numerator in the relevant algorithm.
set_InitialiseLayer <- function(
network_model,
layer_index,
initialisation_algorithm=NA,
initialisation_order=6
) {
# Get layer names
layer_prev <- names(network_model)[layer_index-1]
layer <- names(network_model)[layer_index]
# Get number of nodes
if (layer_index == 1) {
nodes_in <- 0 #First layer is 'input'
} else {
nodes_in <- network_model %>%
extract2(layer_prev) %>%
extract2("nodz")
}
nodes_out <- network_model %>%
extract2(layer) %>%
extract2("nodz")
# Set the seed of reproducibility
set.seed(1234)
# Initialise weight matrix
w_matrix <- matrix(
data=rnorm(nodes_in * nodes_out),
nrow=nodes_in,
ncol=nodes_out
)
# Get initialisation algorithm
if (!is.na(initialisation_algorithm)) {
algorithm <- paste0(
"let_Initialise",
str_to_title(initialisation_algorithm)
)
}
# Scale weights
if (layer_index != 1) {
if (is.na(initialisation_algorithm)) {
w_matrix <- w_matrix
} else {
w_matrix <- w_matrix *
get(algorithm)(
nodes_in=nodes_in,
nodes_out=nodes_out,
order=initialisation_order
)
}
}
# Initialise bias matrix
b_matrix <- matrix(
data=network_model %>%
extract2(layer) %>%
extract2("nodz") %>%
replicate(0),
nrow=network_model %>%
extract2(layer) %>%
extract2("nodz"),
ncol=1
)
# Place data back in to the model
network_model[[layer]][["wgts"]] <- w_matrix
network_model[[layer]][["bias"]] <- b_matrix
# Return
return(network_model)
}
9.4. Model Initialisation
The purpose of the set_InitialiseModel()
function is to loop through each layer in the network_model
object, and to initialise each in accordance with the parameters set within the model itself, taking the number of nodes that was set by the set_InstantiateNetwork() function.
The function will take only three input parameters:
- The network_model: The Network, as instantiated by the set_InstantiateNetwork() function.
- The initialisation_algorithm: The algorithm to be used for the network initialisation. This should be one of the values NA, xavier, or he.
- The initialisation_order: The order value to be used for the numerator of the equation. This could be a numeric value, or the string layers, which signifies that the order should be the number of hidden layers in the network.
set_InitialiseModel <- function(
network_model,
initialisation_algorithm="xavier",
initialisation_order="layers"
) {
# Redefine 'initialisation_order'
if (initialisation_order == "layers") {
initialisation_order <- get_CountOfElementsWithCondition(
names(network_model),
function(x){is.integer(as.numeric(x))}
)
}
# Initialise each layer
for (layer_index in 1:length(names(network_model))) {
network_model <- set_InitialiseLayer(
network_model=network_model,
layer_index=layer_index,
initialisation_algorithm=initialisation_algorithm,
initialisation_order=initialisation_order
)
}
# Return
return(network_model)
}
Notice, this function uses the user-defined function get_CountOfElementsWithCondition()
. This function allows for the counting of the number of hidden layers in the model. The source code for the function can be found here.
9.5. Network Initialisation
The below code chunk initialises the network using the function defined above. The method utilised is %<>%
, which is defined in the magrittr
package; its documentation states that it is used to update a value on the left-hand side by first piping it in to the first argument position on the right-hand side, and then assigning the result back to the object on the left-hand side.
network_model %<>% set_InitialiseModel()
9.6. Check Model Parameters
For a quick sanity-check, the below code chunk counts the number of parameters in the defined model. This again uses a custom function, get_ModelParametersCount()
, the definition for which can be found in the source code for this article (located here).
network_model %>%
get_ModelParametersCount() %>%
format(big.mark=",")
Which prints:
[1] "320,846"
There are over 320,000
parameters in this neural network. Every single one of these parameters will need to be trained, and each of them will have an influence on the final outcome. We will continue to walk through how this can be achieved in the following sections.
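Before moving on, as a minimal sketch of the arithmetic behind this count (each layer contributes nodes_in × nodes_out weights plus nodes_out biases), the same figure can be reproduced as follows:
# Sketch: weights plus biases for each layer, summed over the network
nodes <- c(3072, 100, 75, 50, 30, 20, 1)
sum(
  head(nodes, -1) * tail(nodes, -1) +  #<-- weights per layer
  tail(nodes, -1)                      #<-- biases per layer
)   #<-- 320846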
10. Forward Propagation
10.1. The Theory
There is a very good website which outlines what actually happens during a matrix multiplication: Matrix Multiplication.
The theory for the forward propagation method is as follows:
- Apply matrix multiplication:
  - The first matrix is the activation matrix from the previous layer.
  - The second matrix is the weight matrix for the current layer.
  - The third matrix is the (first) linear activation matrix for the current layer.
- Apply the bias matrix:
  - The first matrix is the (first) linear activation matrix of the current layer.
  - The second matrix is the bias matrix for the current layer.
  - The third matrix is the (second) linear activation matrix for the current layer.
- Apply the activation algorithm:
  - The first matrix is the (second) linear activation matrix for the current layer.
  - The activation function is determined by the user at the time the function is run. It can be the relu activation, the sigmoid activation, or any of the other activation functions.
  - The third matrix is the activation matrix for the current layer.
10.1.1. Step One
To illustrate, the following matrices show the first step of the forward propagation, using fictitious numbers.
The below code shows how this process is implemented programmatically.
# Declare First matrix
matrix_input <- matrix(
data=1:24,
nrow=8,
ncol=3,
byrow=TRUE
)
# Declare Weight matrix
matrix_weight <- matrix(
data=1:15,
nrow=3,
ncol=5,
byrow=TRUE
)
# Apply matrix manipulation
matrix_layer <- matrix_input %*% matrix_weight
Which prints:
matrix_input:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[4,] 10 11 12
[5,] 13 14 15
[6,] 16 17 18
[7,] 19 20 21
[8,] 22 23 24
matrix_weight:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
matrix_layer:
[,1] [,2] [,3] [,4] [,5]
[1,] 46 52 58 64 70
[2,] 100 115 130 145 160
[3,] 154 178 202 226 250
[4,] 208 241 274 307 340
[5,] 262 304 346 388 430
[6,] 316 367 418 469 520
[7,] 370 430 490 550 610
[8,] 424 493 562 631 700
10.1.2. Step Two
The next diagram shows how the bias matrix is applied. As shown, each element of the bias vector is applied horizontally across the corresponding row of the initial matrix.
The below code shows how this process is implemented programmatically.
# Declare Bias matrix
vector_bias <- matrix(1:8, 8, 1)
# Apply Bias matrix
matrix_biased <- sweep(matrix_layer, 1, vector_bias, "+")
Which prints:
matrix_layer:
[,1] [,2] [,3] [,4] [,5]
[1,] 46 52 58 64 70
[2,] 100 115 130 145 160
[3,] 154 178 202 226 250
[4,] 208 241 274 307 340
[5,] 262 304 346 388 430
[6,] 316 367 418 469 520
[7,] 370 430 490 550 610
[8,] 424 493 562 631 700
vector_bias:
[,1]
[1,] 1
[2,] 2
[3,] 3
[4,] 4
[5,] 5
[6,] 6
[7,] 7
[8,] 8
matrix_biased:
[,1] [,2] [,3] [,4] [,5]
[1,] 47 53 59 65 71
[2,] 102 117 132 147 162
[3,] 157 181 205 229 253
[4,] 212 245 278 311 344
[5,] 267 309 351 393 435
[6,] 322 373 424 475 526
[7,] 377 437 497 557 617
[8,] 432 501 570 639 708
10.1.3. Step Three
The activation function is one which is defined at the time of running the function, and the algorithm is applied to each element of the initial matrix. In this instance, a simple scaled multiplication of the matrix with itself (x × 0.01x) is used as an illustration.
The below code shows how this process is implemented programmatically.
# Apply Activation function
matrix_output <- matrix_biased * (0.01 * matrix_biased)
Which prints:
matrix_biased:
[,1] [,2] [,3] [,4] [,5]
[1,] 47 53 59 65 71
[2,] 102 117 132 147 162
[3,] 157 181 205 229 253
[4,] 212 245 278 311 344
[5,] 267 309 351 393 435
[6,] 322 373 424 475 526
[7,] 377 437 497 557 617
[8,] 432 501 570 639 708
matrix_output:
[,1] [,2] [,3] [,4] [,5]
[1,] 22.09 28.09 34.81 42.25 50.41
[2,] 104.04 136.89 174.24 216.09 262.44
[3,] 246.49 327.61 420.25 524.41 640.09
[4,] 449.44 600.25 772.84 967.21 1183.36
[5,] 712.89 954.81 1232.01 1544.49 1892.25
[6,] 1036.84 1391.29 1797.76 2256.25 2766.76
[7,] 1421.29 1909.69 2470.09 3102.49 3806.89
[8,] 1866.24 2510.01 3249.00 4083.21 5012.64
10.2. Linear Component
When combined together, the linear algebraic function is implemented in just three lines of code, as shown in the below code chunk in the set_LinearForward()
function.
set_LinearForward <- function(inpt, wgts, bias) {
# Perform matrix multiplication
linr <- inpt %*% wgts
# Add bias
linr <- sweep(linr, 2, bias, "+")
# Return
return(linr)
}
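As a minimal usage sketch (with small, made-up dimensions), a 4×3 input, a 3×2 weight matrix and a 2-element bias produce a 4×2 linear matrix:
# Small example: 4 samples, 3 input nodes, 2 output nodes
inpt <- matrix(1:12, nrow=4, ncol=3)
wgts <- matrix(1:6, nrow=3, ncol=2)
bias <- matrix(c(10, 20), nrow=2, ncol=1)
dim(set_LinearForward(inpt, wgts, bias))   #<-- 4 2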
10.3. Non-Linear Component
The real power of Neural Networks comes from their Activation functions. This is how the network is able to capture non-linear aspects of the data, and this is what gives them their predictive power.
The Activation function can be one of many different algorithms, depending on the purpose of the network. The choice of Activation can be a hyperparameter which is chosen at a later stage. The Desmos website provides an excellent interactive way of reviewing the different types of Activations: Activation Functions.
Each of the Activation functions is defined individually, and each takes only one argument, which is the matrix to which the activation will be applied. For the purposes of brevity, four of the more popular activations are provided here; but there are many, many others which can be used. The equations these functions compute are:
- sigmoid: sigmoid(z) = 1 / (1 + exp(-z))
- relu: relu(z) = max(0, z)
- softmax: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
- swish: swish(z) = z × sigmoid(β × z)
These Activation Functions are defined programmatically as follows:
let_ActivateSigmoid <- function(linr) {
# Do work
acti <- 1/(1+exp(-linr))
# Return
return(acti)
}
let_ActivateRelu <- function(linr) {
# Do work
acti <- sapply(linr, max, 0) %>%
structure(dim=dim(linr))
# Return
return(acti)
}
let_ActivateSoftmax <- function(linr) {
# Do work
expo <- exp(linr)
expo_sum <- sum(exp(linr))
acti <- expo/expo_sum
# Return
return(acti)
}
let_ActivateSwish <- function(linr, beta=0.1) {
# Do work: swish is the input multiplied by the sigmoid of the scaled input
acti <- linr * (1 / (1 + exp(-beta * linr)))
# Return
return(acti)
}
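As a quick illustration of the behaviour of these functions (approximate values shown in the comments, assuming the functions above are defined):
# Sigmoid squashes values in to the (0, 1) range
let_ActivateSigmoid(c(-2, 0, 2))   #<-- approx 0.12, 0.50, 0.88
# Relu clips negative values to zero, preserving the input dimensions
let_ActivateRelu(matrix(c(-2, 0, 2), nrow=1))   #<-- 0, 0, 2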
10.4. Set Forward Propagation Function
The set_ForwardProp()
function pulls together all of the components mentioned above. It implements the following steps:
- Loop through each layer of the network_model.
- Get the layer name for the current layer.
- Implement a ‘pass-thru’ process for the input layer.
- Extract the relevant information, including:
  - The layer name of the previous layer;
  - The activation matrix of the previous layer;
  - The weights matrix of the current layer;
  - The bias matrix of the current layer.
- Apply the linear algebra component.
- Apply the non-linear activation component.
- Note the difference between the activation of the hidden layers and that of the final layer.
- Apply the relevant information back on to the network.
- Return the network_model object.
In order to implement this process, the function takes only four arguments:
- network_model: The network model to be updated.
- data_in: The 2-Dimensional array (matrix) of training images, as reshaped above.
- activation_hidden: The Activation function to be applied to the hidden layers.
- activation_final: The Activation function to be applied to the final (output) layer.
set_ForwardProp <- function(
network_model,
data_in,
activation_hidden="relu",
activation_final="sigmoid"
) {
# Do work
for (index in network_model %>% names() %>% length() %>% 1:.) {
# Define layer name
layr <- network_model %>%
names() %>%
extract(index)
if (layr=="input") {
# Pass-thru for 'input' layer
network_model[[layr]][["inpt"]] <- data_in
network_model[[layr]][["acti"]] <- data_in
} else {
# Extract data
prev <- names(network_model)[index-1]
inpt <- network_model %>%
extract2(prev) %>%
extract2("acti")
wgts <- network_model %>%
extract2(layr) %>%
extract2("wgts")
bias <- network_model %>%
extract2(layr) %>%
extract2("bias")
# Calculate
linr <- set_LinearForward(inpt, wgts, bias)
# Activate
if (layr=="output") {
func <- activation_final %>%
str_to_title() %>%
paste0("let_Activate", .) %>%
get()
acti <- func(linr)
network_model[[layr]][["acti_func"]] <- activation_final
} else {
func <- activation_hidden %>%
str_to_title() %>%
paste0("let_Activate", .) %>%
get()
acti <- func(linr)
network_model[[layr]][["acti_func"]] <- activation_hidden
}
# Apply back to our model
network_model[[layr]][["inpt"]] <- inpt
network_model[[layr]][["linr"]] <- linr
network_model[[layr]][["acti"]] <- acti
}
}
# Return
return(network_model)
}
10.5. Run Forward Propagation
The last step of the Forward Propagation process is to actually run the function. In the below code chunk, the tic()
and toc()
functions are implemented to time how long a process takes to run.
tic()
network_model %<>% set_ForwardProp(trn_img, "relu", "sigmoid")
toc()
Which prints:
7.05 sec elapsed
As mentioned above in the Split Data section, the trn_img
object is over 82 Mb
in size, and the model has over 320,000
parameters. The entire end-to-end process for forward propagation took only 7 seconds
to run; which is quite impressive, and is a testament to the power of mathematics.
10.6. Check Model Shapes
The Network can now be printed using the custom function get_PrintNetwork()
. This function returns each layer in an individual box, containing key information like the relevant shapes of the matrices and the activation functions used. The source code for the function can be found here.
# Print the Network
network_model %>%
get_PrintNetwork() %>%
cat()
Which prints:
+--------------------------------+
| Layer : input |
| Nodes : 3,072 |
| Inpt Shape : 7,000 x 3,072 |
| Wgts Shape : 0 x 3,072 |
| Outp Shape : 7,000 x 3,072 |
| Activation : |
+--------------------------------+
|
V
+--------------------------------+
| Layer : 1 |
| Nodes : 100 |
| Inpt Shape : 7,000 x 3,072 |
| Wgts Shape : 3,072 x 100 |
| Outp Shape : 7,000 x 100 |
| Activation : relu |
+--------------------------------+
|
V
+--------------------------------+
| Layer : 2 |
| Nodes : 75 |
| Inpt Shape : 7,000 x 100 |
| Wgts Shape : 100 x 75 |
| Outp Shape : 7,000 x 75 |
| Activation : relu |
+--------------------------------+
|
V
+--------------------------------+
| Layer : 3 |
| Nodes : 50 |
| Inpt Shape : 7,000 x 75 |
| Wgts Shape : 75 x 50 |
| Outp Shape : 7,000 x 50 |
| Activation : relu |
+--------------------------------+
|
V
+--------------------------------+
| Layer : 4 |
| Nodes : 30 |
| Inpt Shape : 7,000 x 50 |
| Wgts Shape : 50 x 30 |
| Outp Shape : 7,000 x 30 |
| Activation : relu |
+--------------------------------+
|
V
+--------------------------------+
| Layer : 5 |
| Nodes : 20 |
| Inpt Shape : 7,000 x 30 |
| Wgts Shape : 30 x 20 |
| Outp Shape : 7,000 x 20 |
| Activation : relu |
+--------------------------------+
|
V
+--------------------------------+
| Layer : output |
| Nodes : 1 |
| Inpt Shape : 7,000 x 20 |
| Wgts Shape : 20 x 1 |
| Outp Shape : 7,000 x 1 |
| Activation : sigmoid |
+--------------------------------+
11. Calculate the Cost
Once the forward-propagation part has been completed, it’s then necessary to get a measure for how wrong the model is. This will then be used to update the model parameters during the backward propagation steps.
11.1. Set Cost Function
The first step is to write the functions to be used for getting the cost of the model. Ultimately, it does not matter how bad the results are for the first round of training; remember, the model was initialised with random numbers. What’s important is that the cost function determines a single value for the cost of the network, and this single value is what the derivative functions will work from in the backwards propagation steps.
Note here that there is a very small epsilon value used (epsi), which effectively adjusts for a perfect prediction made by the model. The reason for this is not that we don’t want the model to predict a perfect value; it’s that we want the model to predict the probability of a perfect value. Furthermore, it’s impossible to take the logarithm of 0, so it’s necessary to adjust such predictions to be just a little bit off zero.
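For reference, the function below implements the (binary) cross-entropy cost which, for m samples with true labels y and predictions ŷ, is:
cost = −(1/m) × Σᵢ [ yᵢ × log(ŷᵢ) + (1 − yᵢ) × log(1 − ŷᵢ) ]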
get_ComputeCost <- function(pred, true, epsi=1e-10) {
# Get number of samples
samp <- length(true)
# Instantiate totals
total_cost <- 0
# Loop for each prediction
for (i in 1:samp) {
# Adjust for perfect predictions
if (pred[i]==1) {pred[i] %<>% subtract(epsi)}
if (pred[i]==0) {pred[i] %<>% add(epsi)}
# Calculate totals
total_cost <- total_cost -
(
true[i] * log(pred[i])
+
(1-true[i]) * log(1-pred[i])
)
}
# Take an average
cost <- (1/samp) * total_cost
# Return
return(cost)
}
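The loop above is easy to follow; as a side note on design, the same cost can also be computed in a vectorised way, which is typically faster in R. The following is a minimal sketch (the helper name get_ComputeCostVectorised() is hypothetical, and it clamps all predictions by the same epsilon rather than adjusting only exact 0s and 1s):
get_ComputeCostVectorised <- function(pred, true, epsi=1e-10) {
  # Clamp predictions away from exactly 0 and 1
  pred <- pmin(pmax(pred, epsi), 1-epsi)
  # Average the binary cross-entropy over all samples
  mean(-(true*log(pred) + (1-true)*log(1-pred)))
}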
Then, the cost must be applied back on to the network. For this, the exact same value is applied to each layer of the network.
set_ApplyCost <- function(network_model, cost) {
# Apply back to the model
for (layer in network_model %>% names) {
network_model[[layer]][["cost"]] <- cost
}
# Return
return(network_model)
}
11.2. Run the Cost Function
The below code chunk runs the cost functions, utilising the functions defined above.
network_model %<>% set_ApplyCost(
get_ComputeCost(network_model[["output"]][["acti"]], trn_cls)
)
12. Backward Propagation
The function of back-propagation is to take the cost of the model, and then differentiate each of the Weights and Biases matrices in the network to determine to what extent each and every parameter in the network contributed to that final cost value. To do this, the process works backwards from the end to the start, following these steps:
- Differentiate the final cost value
- Differentiate the activation matrices
- Differentiate the linear algebra matrices
- Continue to the previous layer.
Each step in the back-propagation process requires calculus to derive the values, and the final function is implemented here. As this paper is not intended to show how to derive the equations, but more how to run the functions, the necessary algebraic steps are not included here.
12.1. Differentiate Cost
Similar to the functions used to get the cost, the cost differential values are first calculated, and then applied to each layer of the network.
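For reference, the derivative of the cross-entropy cost with respect to the predictions, which the function below implements element-wise, is:
dA_cost = −( y / ŷ − (1 − y) / (1 − ŷ) )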
get_DifferentiateCost <- function(pred, true) {
# Do work
diff_cost <- -(
divide_by(true, pred) - divide_by(1-true, 1-pred)
)
# Return
return(diff_cost)
}
set_ApplyDifferentiateCost <- function(
network_model,
cost_differential
) {
# Do work
for (layer in names(network_model)) {
network_model[[layer]][["back_cost"]] <- cost_differential
if (layer=="output") {
network_model[[layer]][["back_acti"]] <- network_model %>%
extract2(layer) %>%
extract2("back_cost") %>%
t()
}
}
# Return
return(network_model)
}
12.2. Differentiate Activation
As each of the activations has its own function, each also has a derivative which can be calculated using calculus.
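For reference, the derivatives applied in the functions below are: relu′(z) = 1 when z > 0 and 0 otherwise; and sigmoid′(z) = σ(z) × (1 − σ(z)), where σ is the sigmoid function itself.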
let_BackwardActivateRelu <- function(diff_acti_curr, linr_curr) {
# Do work
diff_linr_curr <- diff_acti_curr
diff_linr_curr[linr_curr<=0] <- 0
# Return
return(diff_linr_curr)
}
let_BackwardActivateSigmoid <- function(diff_acti_curr, linr_curr) {
# Do work
temp <- 1/(1+exp(-linr_curr))
diff_linr_curr <- t(diff_acti_curr) * temp * (1-temp)
# Return
return(t(diff_linr_curr))
}
12.3. Differentiate Linear
Having defined all of this, the next step is to combine it in to one single function, which can be used once per layer, and which runs the necessary back-propagation differentiation functions.
Notice here that the output is actually a list of three elements. This is a key difference between R and Python: in R, functions can only return a single object, so the three results are bundled in to a list; Python, by comparison, is able to return multiple values from a single function.
get_DifferentiateLinear <- function(
back_linr_curr,
acti_prev,
wgts,
bias
) {
# get number of samples
samp <- acti_prev %>% dim %>% extract(2)
# Differentiate weights
diff_wgts <- 1/samp * (back_linr_curr %*% acti_prev)
# Differentiate bias
diff_bias <- 1/samp * rowSums(back_linr_curr, dims=1)
# Differentiate activation
diff_acti_prev <- wgts %*% back_linr_curr
# Consolidate in to one list
list_linr <- list(
diff_acti_prev,
diff_wgts,
diff_bias
)
# Return
return(list_linr)
}
12.4. Run Backward Propagation
After having defined the differentiation functions, it is next necessary to combine these individual functions in to a single, combined function, which can be run once per layer.
First, we will set up the function, and then we will run it.
12.4.1. Set Backward Propagation Function
The backward propagation function to be defined (set_BackwardProp()
) must be designed to run through the following steps:
- Run through each layer in the reverse direction. This is necessary because, quite logically, the backward propagation function needs to be run in a backwards direction.
- Extract the layer name.
- Skip the input layer. Again, quite logical, because this layer is at the beginning of the network, and does not need to be back-propagated.
- Extract the name of the previous layer.
- Extract the relevant matrices to be used for subsequent calculations.
- Extract the relevant activation function for this particular layer.
- Set up some empty matrices, which will house the relevant differentiated matrices.
- Differentiate the activation matrix of the current layer.
- Differentiate the linear matrices, including the weight matrix of the current layer, the bias matrix of the current layer, and the activation matrix of the previous layer.
- Apply the information back in to the relevant places in the network_model.
- Return the updated network_model.
set_BackwardProp <- function(network_model) {
# Loop through each layer in reverse order
for (layr_indx in network_model %>% names() %>% length() %>% 1:. %>% rev) {
# Get the layer name
layr_curr <- network_model %>%
names() %>%
extract(layr_indx)
# Skip the 'input' layer
if (layr_curr == "input") next
# Get the previous layer name
layr_prev <- network_model %>%
names %>%
extract(layr_indx-1)
# Set up the existing matrices
linr_curr <- network_model %>%
extract2(layr_curr) %>%
extract2("linr")
wgts_curr <- network_model %>%
extract2(layr_curr) %>%
extract2("wgts")
bias_curr <- network_model %>%
extract2(layr_curr) %>%
extract2("bias")
acti_prev <- network_model %>%
extract2(layr_prev) %>%
extract2("acti")
diff_acti_curr <- network_model %>%
extract2(layr_curr) %>%
extract2("back_acti")
# Get the activation function
acti_func_back <- network_model %>%
extract2(layr_curr) %>%
extract2("acti_func") %>%
str_to_title %>%
paste0("let_BackwardActivate", .)
# Set up the empty matrices
diff_linr_curr <- matrix()
diff_acti_prev <- matrix()
diff_wgts_curr <- matrix()
diff_bias_curr <- matrix()
# Differentiate activation
diff_linr_curr <- get(acti_func_back)(
diff_acti_curr,
linr_curr
)
# Differentiate linear
list_linr <- get_DifferentiateLinear(
back_linr_curr=diff_linr_curr,
acti_prev=acti_prev,
wgts=wgts_curr,
bias=bias_curr
)
diff_acti_prev <- list_linr %>% extract2(1)
diff_wgts_curr <- list_linr %>% extract2(2)
diff_bias_curr <- list_linr %>% extract2(3)
# Apply back to model
network_model[[layr_prev]][["back_acti"]] <- diff_acti_prev
network_model[[layr_curr]][["back_linr"]] <- diff_linr_curr
network_model[[layr_curr]][["back_wgts"]] <- diff_wgts_curr
network_model[[layr_curr]][["back_bias"]] <- diff_bias_curr
}
# Return
return(network_model)
}
12.4.2. Run Backward Propagation Function
Having defined this function, the next step is to run the function. The below code chunk is wrapped with the tic()
and toc()
functions, in order to determine how much time it took for the function to run.
tic()
network_model %<>% set_BackwardProp()
toc()
Which prints:
8.84 sec elapsed
As shown, it took approximately 9 seconds to run this function. This is quite impressive, considering that there are over 320,000
parameters to be updated (see Check Model Parameters section).
13. Update Model Parameters
13.1. Context
After the model parameters have been differentiated, it is then necessary to update the relevant parameters of the network (the weights and biases) before re-training the network by re-running the forward propagation function.
The method used to update these parameters is called Stochastic Gradient Descent. For more information, see Stochastic Gradient Learning in Neural Networks or Backpropagation and Stochastic Gradient Descent Method.
Of course, there are other methods of implementing neural network optimisation, and a lot of research effort has been spent in this area. There are algorithms such as RMSProp and Adam, which implement intelligent methods to achieve faster convergence and a more accurate final result. Some good sources for this include An Empirical Comparison of Optimizers for Machine Learning Models and Overview of different Optimizers for Neural Networks. For further enhancement and optimisation, it is worth exploring these options.
13.2. Process
Nonetheless, the process of implementing Stochastic Gradient Descent is really quite simple:
- Specify a learning rate between 0 and 1 (usually a very small number, for example 0.001).
- Take the differentiated weight and bias matrices, and multiply them by the learning rate in a negative direction.
- Add these scaled gradients to the original weight and bias matrices.
- Repeat this process layer-by-layer.
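In equation form, the update applied to each layer is simply (with the transpose in the code below aligning the gradient to the weight matrix):
wgts ← wgts − learning_rate × back_wgts
bias ← bias − learning_rate × back_bias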
13.3. Set the Update Model Function
To set the function for updating the model parameters, the following steps are used:
- Specify the learning_rate as an argument for the function.
- Loop through each layer in the network in a forward direction.
- Extract the layer name.
- Skip the input layer (as it does not need updating).
- Define the gradient steps for the back_wgts and back_bias matrices.
- Apply the gradient steps to the original wgts and bias matrices.
- Return the updated model.
set_UpdateModel <- function(network_model, learning_rate=0.001) {
# Do work
for (index in network_model %>% names() %>% length() %>% 1:.) {
# Get layer name
layr <- network_model %>%
names() %>%
extract(index)
# Skip 'input' layer
if (layr=="input") next
# Define gradient steps for the weight
grad_step_wgts <- -1 *
(
learning_rate * network_model[[layr]][["back_wgts"]]
)
# Define gradient steps for the bias
grad_step_bias <- -1 *
(
learning_rate * network_model[[layr]][["back_bias"]]
)
# Take steps
network_model[[layr]][["wgts"]] <- network_model %>%
extract2(layr) %>%
extract2("wgts") %>%
add(t(grad_step_wgts))
network_model[[layr]][["bias"]] <- network_model %>%
extract2(layr) %>%
extract2("bias") %>%
add(grad_step_bias)
}
# Return
return(network_model)
}
13.4. Run the Update Model Function
Having defined the function, we then run it on our network.
network_model %<>% set_UpdateModel(0.01)
14. Run the Model End-to-End
Now, it’s time to bring it all together. Running the model end-to-end essentially means:
- Applying forward propagation.
- Calculate the cost.
- Run backward propagation.
- Update the model parameters.
- Repeat…
Each time this is repeated is called an Epoch. It is quite typical for networks to be updated over many, many Epochs; sometimes hundreds, sometimes thousands. The exact number of Epochs to run is a discretionary decision by the Data Scientist, based on numerous variables.
There is one additional step to be added in here, and that is batching the data. Within each Epoch, the data is split up in to equally sized groups (batches), which are then used for subsequent processing. In this instance, the model parameters will be updated after each batch; and when the full amount of data has been run through the model in its entirety, this is considered one Epoch.
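For example (purely for illustration), if the 7,000 training images were split in to 10 batches, each batch would contain 700 images, the parameters would be updated 10 times per Epoch, and one full pass over all 10 batches would constitute one Epoch.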
14.1. Set up the Train Model Function
To illustrate this programmatically, the following function is written.
Notice here, that there are many custom functions included (the source code for each of which can be found here). These functions include:
get_Modulus()
get_BatchIndexes()
get_VerbosityValues()
get_TimeDifference()
plt_PlotLearningCurve()
The steps for this function include:
- Begin the timer.
- Declare what information will be returned.
- Instantiate the network.
- Initialise the network.
- Loop through each Epoch.
- Loop through each batch.
- Subset the data for this particular batch.
- Run forward propagation.
- Calculate the cost.
- Apply the cost to the network.
- Differentiate the cost.
- Run backward propagation.
- Update the model parameters.
- Run through next batch.
- Save the overall cost at the end of the Epoch.
- Print the update at the relevant Epoch number (from the verbosity argument).
- Run through the next Epoch.
- Save the updated network.
- Print the learning curve.
- Return the output.
let_TrainModel <- function(
x_train,
y_train,
input_nodes=dim(x_train)[2],
hidden_nodes=c(100, 50, 10),
output_nodes=1,
initialisation_algorithm="xavier",
initialisation_order="layers",
activation_hidden="relu",
activation_final="sigmoid",
batches=get_Modulus(dim(x_train)[1])[4],
epochs=500,
learning_rate=0.001,
verbosity=NA,
print_learning_curve=TRUE
) {
# Begin the timer
time_begin <- Sys.time()
# Set return values
output <- list(
network_model=list(), # placeholder; replaced with the trained network at the end
results=list(
cost=vector()
# Can add more, such as accuracy or specificity.
)
)
# Instantiate
network_model <- set_InstantiateNetwork(
input=input_nodes,
hidden=hidden_nodes,
output=output_nodes
)
# Initialise
network_model <- set_InitialiseModel(
network_model=network_model,
initialisation_algorithm=initialisation_algorithm,
initialisation_order=initialisation_order
)
# Loop each epoch
for (epoch in 1:epochs) {
# Loop each batch
for (batch in 1:batches) {
# Set indices
batch_indexes <- get_BatchIndexes(
vector=1:dim(x_train)[1],
batches=batches,
batch=batch,
seed=1234
)
# Set data
x_train_batch <- x_train[batch_indexes,]
y_train_batch <- y_train[batch_indexes]
# Forward Prop
network_model <- set_ForwardProp(
network_model = network_model,
data_in = x_train_batch,
activation_hidden = activation_hidden,
activation_final = activation_final
)
# Get cost
cost <- get_ComputeCost(
pred = network_model[["output"]][["acti"]],
true = y_train_batch,
epsi = 1e-10
)
# Apply cost
network_model <- set_ApplyCost(
network_model = network_model,
cost = cost
)
# Differentiate cost
network_model <- set_ApplyDifferentiateCost(
network_model = network_model,
cost_differential = get_DifferentiateCost(network_model[["output"]][["acti"]], y_train_batch)
)
# Backprop
network_model <- set_BackwardProp(network_model)
# Update parameters
network_model <- set_UpdateModel(
network_model = network_model,
learning_rate = learning_rate
)
}
# Save cost
output[["results"]][["cost"]] %<>% c(cost)
# Print update
if (!is.na(verbosity)) {
if (epoch %in% get_VerbosityValues(epochs, verbosity)){
if (epoch == verbosity) {
"Learning rate: {}n" %>%
str_Format(learning_rate) %>%
cat()
}
"Epoch {}, Cost: {}, Time: {}n" %>%
str_Format(
epoch,
round(cost, 5),
get_TimeDifference(time_begin)
) %>%
cat()
}
}
}
# Re-apply back to the output list
output[["network_model"]] <- network_model
# Print the results
if (print_learning_curve == TRUE) {
tryCatch(
expr={
output %>%
extract2("results") %>%
extract2("cost") %>%
plt_PlotLearningCurve(
input_nodes=input_nodes,
hidden_nodes=hidden_nodes,
output_nodes=output_nodes,
initialisation_algorithm=
initialisation_algorithm,
initialisation_order=initialisation_order,
activation_hidden=activation_hidden,
activation_final=activation_final,
epochs=epochs,
learning_rate=learning_rate,
verbosity=verbosity,
run_time=get_TimeDifference(time_begin)
) %>%
print()
},
warning=function(message){
writeLines("A Warning occurred:")
writeLines(conditionMessage(message))
return(invisible(NA))
},
error=function(message){
writeLines("An Error occurred:")
writeLines(conditionMessage(message))
return(invisible(NA))
},
finally={
#Do nothing...
}
)
}
# Return
return(output)
}
14.2. Run the Train Model Function
Having set up the training function, let's now run it.
Notice here that:
- Only the training data is used (see Split Data section).
- The He initialisation algorithm is used, with an order of 2.
- The relu activation algorithm is used for the hidden layers, with the sigmoid activation algorithm for the output layer.
- There are 56 batches, and 50 epochs.
- The model results are printed every 10 epochs.
- The learning curve is printed at the end.
training_output <- let_TrainModel(
x_train=trn_img,
y_train=trn_cls,
input_nodes=dim(trn_img)[2],
hidden_nodes=c(100,75,50,30,20),
output_nodes=1,
initialisation_algorithm="he",
initialisation_order=2,
activation_hidden="relu",
activation_final="sigmoid",
batches=56, epochs=50,
learning_rate=0.01,
verbosity=10,
print_learning_curve=TRUE
)
Which prints:
Learning rate: 0.01
Epoch 10, Cost: 0.32237, Time: 4.86 mins
Epoch 20, Cost: 0.30458, Time: 9.58 mins
Epoch 30, Cost: 0.28903, Time: 12.23 mins
Epoch 40, Cost: 0.29525, Time: 14.87 mins
Epoch 50, Cost: 0.29884, Time: 17.43 mins

It works! This plot shows that the model is training, and that it continues to learn and improve its performance over time.
Notice that the cost line begins at approximately 0.4, and quickly decreases to around 0.3 within the first 20 Epochs. Completing the full 50 Epochs took approximately 17 minutes (as shown in the log above). This can be seen as a success.
14.3. Further Experimentation
Due to the way these functions have been set up, it is very easy to experiment further and optimise.
It may be reasonable to try:
- A different number of hidden layers, or a different number of nodes per layer (change the hidden_nodes parameter).
- A different initialisation algorithm, or initialisation order (change the initialisation_algorithm or initialisation_order parameter).
- A different activation function on the hidden layers (change the activation_hidden parameter).
- A different number of batches per Epoch (change the batches parameter).
- A different number of epochs (change the epochs parameter).
- A different learning rate (change the learning_rate parameter).
- More data (recall that in the Download Data section, only 10,000 images were downloaded; there are another 50,000 images which can be used for further training).
Here is an experiment trying some different parameters; notice how the results also change. This next round of training took 34 minutes to run.
training_output <- let_TrainModel(
x_train=trn_img,
y_train=trn_cls,
input_nodes=dim(trn_img)[2],
hidden_nodes=c(100,75,50,30,20),
output_nodes=1,
initialisation_algorithm="xavier",
initialisation_order="layers",
activation_hidden="relu",
activation_final="sigmoid",
batches=56, epochs=100,
learning_rate=0.001,
verbosity=20,
print_learning_curve=TRUE
)
Which prints:
Learning rate: 0.001
Epoch 20, Cost: 0.33354, Time: 5.4 mins
Epoch 40, Cost: 0.29891, Time: 10.26 mins
Epoch 60, Cost: 0.30255, Time: 15.2 mins
Epoch 80, Cost: 0.29968, Time: 20.84 mins
Epoch 100, Cost: 0.29655, Time: 26.09 mins

It is then the job of the Data Scientist to identify the optimal configuration of these parameters, in order to achieve better performance. See Hyperparameter Optimization in Machine Learning Models for more details.
However, the aim of this article is not to achieve the best possible results, but to show how such results are achieved, and the processes and methodologies involved. Therefore, further experimentation will not be conducted in this paper, although a sketch of how such a comparison could be organised is given below.
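Should the Reader wish to take the experimentation further, one way to organise it is a small manual grid over a couple of the parameters of let_TrainModel(), keeping the final cost of each run. The sketch below is illustrative only: the candidate values and the grid bookkeeping are arbitrary choices, and each run would take a comparable amount of time to the runs shown above.
# Sketch: a small manual grid over two of the parameters listed above.
# The candidate values and the 'grid' bookkeeping are illustrative choices only.
grid <- expand.grid(
  learning_rate = c(0.01, 0.001),
  epochs        = c(50, 100)
)
grid$final_cost <- NA_real_
for (row in seq_len(nrow(grid))) {
  run <- let_TrainModel(
    x_train       = trn_img,
    y_train       = trn_cls,
    hidden_nodes  = c(100, 75, 50, 30, 20),
    batches       = 56,
    epochs        = grid$epochs[row],
    learning_rate = grid$learning_rate[row],
    verbosity     = NA,
    print_learning_curve = FALSE
  )
  grid$final_cost[row] <- tail(run$results$cost, 1)   # cost at the last epoch of this run
}
grid   # compare the final cost of each configuration
From there, the configuration with the lowest final cost (or whichever metric is preferred) could be carried forward to the prediction step.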
15. Create Prediction
Having trained the network, the next step is to apply it to the test data, to see how accurately it can predict unseen data.
15.1. Set Up Prediction
The first step is to write two custom functions. The first takes in the test data and the trained model, and returns a data frame containing the predicted values and the truth values. The second pretty-prints the Confusion Matrix as a ggplot-style plot.
get_Prediction <- function(
x_test,
y_test,
network_model,
threshold=0.5
) {
# Create prediction
predic <- set_ForwardProp(
network_model=network_model,
data_in=x_test,
activation_hidden="relu",
activation_final="sigmoid"
)
# Extract probabilities
probas <- predic %>%
extract2("output") %>%
extract2("acti")
# Define results
result <- data.frame(
probs=probas,
truth=y_test
)
# Add prdic
result %<>%
mutate(prdic=ifelse(probas>threshold, 1, 0))
# Return
return(result)
}
plt_ConfusionMatrix <- function(confusion_matrix) {
# Do work
plot <- confusion_matrix %>%
extract("table") %>%
as.data.frame() %>%
rename_all(str_remove_all, "table.") %>%
rename("Prediction"=1, "Reference"=2) %>%
mutate(
goodbad = ifelse(
Prediction == Reference,
"good",
"bad"
)
) %>%
group_by(Reference) %>%
mutate(prop = Freq/sum(Freq)) %>%
ungroup() %>%
{
ggplot(.,
aes(
x = Reference,
y = Prediction,
fill = goodbad,
alpha = prop
)) +
geom_tile() +
geom_text(
aes(label = Freq),
vjust = .5,
fontface = "bold",
alpha = 1
) +
scale_fill_manual(
values=c(good="green", bad="red")
) +
scale_x_discrete(
limits=levels(.$Reference),
position="top"
) +
scale_y_discrete(
limits=rev(levels(.$Prediction))
) +
labs(
title="Confusion Matrix",
subtitle=paste0(
"For: '",
.$Prediction[1],
"' vs '",
.$Prediction[2],
"'"
)
)
}
# Return
return(plot)
}
15.2. Run Prediction
The next step is to actually run the prediction. This step is quite self-explanatory: the parameters for this function include the test data, the trained model, and the classification threshold.
# Create prediction
Prediction <- get_Prediction(
tst_img,
tst_cls,
training_output[["network_model"]],
0.1
)
15.3. View Prediction
Having created this prediction, it’s then necessary to visualise the output. In the same way as the Check Images section, the following code chunk visualises the top 16 images, and returns a label for each.
As can be seen, the model correctly recognises some of the images, but it also makes mistakes on others.
# Define classes
ClassList <- c("0"="Not", "1"="Car")
# Set list
lst <- list()
# Loop 16 images
for (image in 1:16) {
lst[[image]] <- plt_PlotImage(
cifar$images[-partition,,,],
Prediction[["prdic"]],
ClassList,
image
)
}
# View images
gridExtra::grid.arrange(grobs=lst, ncol=4)

15.4. Test the Prediction
The next step is to statistically test the predictions, to see how accurate they are. For this, the confusionMatrix() function (from the caret package) is used. From its output, it is possible to select an appropriate metric for further optimisation of the neural network.
Since this article is only about showing the method, no further optimisation will be performed here.
# Set Up
ConfusionMatrix <- Prediction %>%
mutate_at(c("truth","prdic"), ~ifelse(.==1,"car","not")) %>%
select(prdic,truth) %>%
table %>%
caret::confusionMatrix()
# Print
ConfusionMatrix %>% print()
Which prints:
Confusion Matrix and Statistics
truth
prdic car not
car 188 1118
not 121 1573
Accuracy : 0.587
95% CI : (0.569, 0.605)
No Information Rate : 0.897
P-Value [Acc > NIR] : 1
Kappa : 0.079
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.6084
Specificity : 0.5845
Pos Pred Value : 0.1440
Neg Pred Value : 0.9286
Prevalence : 0.1030
Detection Rate : 0.0627
Detection Prevalence : 0.4353
Balanced Accuracy : 0.5965
'Positive' Class : car
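As a quick sanity check on the arithmetic, the headline accuracy can be reproduced directly from the four cells of the table above.
# Reproduce the accuracy from the confusion matrix cells printed above
correct <- 188 + 1573                  # correctly classified: (car, car) + (not, not)
total   <- 188 + 1118 + 121 + 1573     # all test images
correct / total                        # 0.587, matching the caret output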
The next step is to plot the confusion matrix. A good source for understanding and interpreting confusion matrices can be found here: Confusion Matrix. That site also includes an image (copied below) which gives a very good visual intuition.

When we visualise the confusion matrix of our model, we can see that it correctly classified a (narrow) majority of the images. However, the largest group of misclassifications is where it predicted that an image was a car when it actually was not.
# Plot Confusion Matrix
ConfusionMatrix %>%
plt_ConfusionMatrix()

Another good analytical and plotting tool is to use the ROC Curve (receiver operating characteristics). An ROC Curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied (see Receiver Operating Characteristic).
Essentially, a model that creates a perfect prediction would have the curve perfectly ‘hugging’ the top-left of this plot. As our model is not doing this, it is clear that further training and optimisation are necessary. For this, it is necessary to continue experimenting, as mentioned in the Further Experimentation section.
# Print ROC
Prediction %>%
ggplot() +
plotROC::geom_roc(aes(m=probs, d=truth), n.cuts=0) +
plotROC::style_roc(
theme=theme_grey,
ylab="True Positive Rate",
xlab="False Positive Rate"
) +
theme(
plot.title = element_text(hjust=0.5),
plot.subtitle = element_text(hjust=0.5)
) +
labs(
title="ROC Curve",
y="True Positive Rate",
x="False Positive Rate"
)

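A common single-number summary of the ROC curve is the area under it (the AUC). It is not computed in this article, but a minimal base-R sketch, using the rank-based (Mann-Whitney) formulation, could look like the following; get_AUC is a hypothetical helper name.
# Sketch only: area under the ROC curve via the rank (Mann-Whitney) formulation.
# 'get_AUC' is a hypothetical helper name, not part of the article's code base.
get_AUC <- function(probs, truth) {
  r     <- rank(probs)                 # ranks of the predicted probabilities
  n_pos <- sum(truth == 1)             # number of positive (car) cases
  n_neg <- sum(truth == 0)             # number of negative (not-car) cases
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
# get_AUC(Prediction$probs, Prediction$truth)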
16. Conclusion
This has been an effective demonstration of how to build Vanilla Neural Networks in R. Here, it has been shown how to:
- Access and check the data
- Instantiate and Initialise the network
- Run forward propagation
- Compute the cost
- Run backward propagation
- Update the model
- Set up a training method to loop through everything
- Predict and assess the performance
It has also been shown how to do this entirely in R, and without using any pre-defined deep learning framework. For sure, there are other packages that can perform all of the steps, and perhaps in a more computationally efficient manner. But the use of these packages is not the target of this article. Here, the aim is to dispel the ‘blackbox’ phenomenon, and to show how to create these deep learning frameworks from the ground up.
As shown, the architecture of these feed-forward neural networks effectively uses matrix multiplication and differential calculus to adjust the ‘weights’ of a network and increase its predictive power. It requires lots of data, and a long time to train and optimise. Furthermore, there are many (many, many) different architectural choices that can be made along the way, all of which are at the discretion of the Data Scientist applying this methodology.
At the end of the day, this article has proven that deep learning can definitely be achieved in R, thereby dispelling the myth that ‘in order to do Deep Learning, you must use Python’. Certainly, there are many successful implementations of Deep Learning in Python; for example: Building a Feedforward Neural Network from Scratch in Python. There are also examples of achieving the same in Swift (Deep Neural Networks in Swift, lessons learned), in C++ (Neural Network Implementation in C++ From Scratch), and in Java (Implementing an Artificial Neural Network in Pure Java (No external dependencies)). Therefore, the specific language used is effectively irrelevant. What matters is the use case, the environment, the business artifacts, and the language that the Data Scientist feels comfortable in.
17. Post Script
Acknowledgements: This report was compiled with some assistance from others. Acknowledgements go to:
Publications: This report is also published on the following sites:
- RPubs: RPubs/chrimaho/VanillaNeuralNetworksInR
- GitHub: GitHub/chrimaho/VanillaNeuralNetworksInR
- Medium: Medium/chrimaho/VanillaNeuralNetworksInR
Change Log: This publication was modified on the following dates:
- 02/Nov/2020: Original Publication date.