
Vanilla Neural Networks in R

Take a Look Under The Hood of Neural Network Architecture: Design and Build a Neural Network, from Scratch, in R, without using any Deep…

Special thanks to: Alex Scriven

Image source: GitHub

Contents:

  1. Introduction
  2. Background
  3. Semantics
  4. Set Up
  5. Get Data
  6. Check Data
  7. Prepare the Data
  8. Instantiate the Network
  9. Initialise the Network
  10. Forward Propagation
  11. Calculate the Cost
  12. Backward Propagation
  13. Update Model Parameters
  14. Run the Model End-to-End
  15. Create Prediction
  16. Conclusion
  17. Post Script

1. Introduction

Modern-day Data Science techniques frequently use robust frameworks for designing and building machine learning solutions. In the [R](https://www.r-project.org/) community, packages such as the [tidyverse](https://www.tidyverse.org/) and [caret](http://caret.r-forge.r-project.org/) are frequently referenced; within [Python](https://www.python.org/), packages such as [numpy](https://numpy.org/), [pandas](https://pandas.pydata.org/), and [scikit-learn](https://scikit-learn.org/) are frequently referenced. There are even some packages that have been built to be used in either language, such as [keras](https://keras.io/), [pytorch](https://pytorch.org/), and [tensorflow](https://www.tensorflow.org/). However, the limitation of using these packages is the ‘blackbox‘ phenomenon, where users do not understand what is happening behind the scenes (or ‘under the hood’, if you will). Users know how to use the functions, and can interpret the results, but don’t necessarily know how the packages achieve those results.

The purpose of this paper is to create a ‘back to basics‘ approach to designing Deep Learning solutions. The intention is not to create the most predictive model, nor is it to use the latest and greatest techniques (such as convolution or recursion); but the intention is to create a basic neural network, from scratch, using no frameworks, and to walk through the methodology.

Note: The word ‘Vanilla‘ in ‘Vanilla Neural Networks’ simply refers to the fact it is built from scratch, and does not use any pre-existing frameworks in its construction.

2. Background

2.1. Context

There are already many websites and blogs which explain how this process is done, such as Jason Brownlee‘s article How to Code a Neural Network with Backpropagation In Python (from scratch), and DeepLearning.ai‘s notebook dnn_app_utils_v2.py (on the Microsoft Azure Notebooks network). However, these sources are all written in Python, which is fine if that’s what is needed, and there are some very legitimate reasons to use Python over other languages. But this paper will be written in R.

The R language was chosen for two reasons:

  1. I am familiar with the language. I can speak Python (along with other languages); I just chose R to show how it can be achieved using this language.
  2. To prove that there are many different ways to achieve the same outcome. So, while there are sometimes legitimate constraints for choosing one language over another (business legacy, technological availability, system performance, etc), sometimes one language is chosen simply because it is stylistically more preferable.

Therefore, let’s see how to architect and construct a Vanilla Neural Network in R.

2.2. What It’s Not

This article does not cover the latest and greatest Deep Learning architectures (like Convolution or Recursion). As such the final performance may not be as good as it could be, if these other architectures were used.

This article does not teach readers about the theoretical mathematical concepts behind how Neural Networks operate. There are plenty of other lectures which teach this information (eg. The Math Behind Neural Networks). In fact, this article assumes a lot of knowledge from the Reader about programming, about calculus, and about the basics behind what a Neural Network conceptually is.

This article does not cover why Neural Networks work the way they do, and the conceptual understanding behind a Feed-Forward architecture. There are plenty of other blogs (eg. A Beginner Intro to Neural Networks) and videos (eg. Neural Networks series) which cover such information.

This article also doesn’t simply point the reader to other packages and applications which may already have this functionality set up and working; packages like [tensorflow](https://www.rdocumentation.org/packages/tensorflow) and [nnet](https://www.rdocumentation.org/packages/nnet) already have this covered.

What this article actually is, is a functional, how-to walk-through for creating a Vanilla Neural Network (a Feed-Forward Network), from scratch, step-by-step, in the R programming language. It contains lots of code, and lots of technical details.

3. Semantics

3.1. Layout

This article is laid out in such a way as to describe how a Neural Network is built from the ground up. It will walk through the steps to:

  1. Access and check the data
  2. Instantiate and Initialise the network
  3. Run forward propagation
  4. Compute the cost
  5. Run backward propagation
  6. Update the model
  7. Set up a training method to loop through everything
  8. Predict and assess the performance

In the interest of brevity, the functions defined here will not include all the commentary and validations that should be included in a typical function. They will only include basic steps and prompts. However, the source code for this article (located here) does contain all the appropriate function docstrings and assertions.

3.2. Syntax

For the most part, the syntax in this article is kept to the [dplyr](https://www.rdocumentation.org/packages/dplyr) ‘pipe’ method (which uses the %>% symbol). However, in certain sections the R [base](https://www.rdocumentation.org/packages/base) syntax is used (for example, in function declaration lines).
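
As a trivial illustration (not taken from the article’s source code), the pipe simply passes the object on its left-hand side in as the first argument of the function on its right-hand side, so the following two calls are equivalent.

# Base syntax and pipe syntax: both sort the same vector
x <- c(3, 1, 2)
sort(x)       # base syntax
x %>% sort()  # pipe syntax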

Throughout the article, many custom functions are written. Each of these are prefixed with the words get, let and set. The definitions of each are given below.

  • get_*():

    • It will get certain attributes of meta-data from the objects which are passed to this function.
    • Or it will use the information passed to this function to derive and get other values or parameters.
  • set_*():

    • It will set (or ‘update’) the objects which are passed to this function.
    • Is usually used for updating the network during forward and backward propagation processes.
  • let_*():

    • Similar to get, in that it takes values passed to this function to derive an outcome; however, it will let this value be utilised by another object or function.
    • Used mainly for the initialisation and activation functions.

4. Set Up

4.1. Load Packages

The first step is to import the relevant packages. This list includes the main packages used throughout this process, and the main purpose of each is also listed.

Note that, despite what is stated above about not using existing Deep Learning packages, the tensorflow package is included. Why? Well, it is only used for accessing the data, which will be covered in the next section. The tensorflow package is not used for building or training any networks.

library(tensorflow)  #<-- Only used for getting the data
library(tidyverse)   #<-- Used for accessing various tools
library(magrittr)    #<-- Extends the `dplyr` syntax
library(grDevices)   #<-- For plotting the images
library(assertthat)  #<-- Function assertions
library(roxygen2)    #<-- Documentation is important
library(caret)       #<-- Doing data partitioning
library(stringi)     #<-- Some string manipulation parts
library(DescTools)   #<-- To properly check `is.integer`
library(tictoc)      #<-- Time how long different processes take
library(docstring)   #<-- Makes viewing the documentation easier
library(roperators)  #<-- Conveniently using functions like %+=%
library(plotROC)     #<-- For plotting predictions

5. Get Data

5.1. Download Data

The dataset to be used is the CIFAR-10 dataset. It’s chosen for a number of reasons, including:

  1. The data is on images, which is ideal for Deep Learning purposes;
  2. There are a decent number of images included (60,000 images in total);
  3. All images are the same size (32×32 pixels);
  4. The images have been categorised in to 10 different classes; and
  5. It’s easily accessible through the TensorFlow package.

The following code chunk has the following process steps:

  1. Get the data

    • In order to import the data, it is accessed through the keras element, which contains the suite of datasets, including the cifar10 part.
    • The load_data() function retrieves the data from the online GitHub repository.
  2. Extract the second element

    • The load_data() function here returns two different objects:
        1. The Train dataset (containing 50,000 images);
        2. The Test dataset (containing 10,000 images).
    • The second element is extracted (by using the extract2(2) function) because only 10,000 images are needed for these purposes.
    • This article is to show the process of creating Vanilla Neural Networks; if more data is needed at a later stage, it can easily be accessed here.
  3. Name the parts

    • The data as downloaded contains two further elements:
        1. The images themselves (in the form of a 4-Dimensional array);
        2. The image labels (in the form of a 2-Dimensional, single column array).
    • This data does not have any names, so therefore the names are set by using the set_names() function.
# Download Data
# NOTE:
# - The first time you run this function, it will download everything.
# - Next time you run it, TensorFlow will load from Cache.
cifar <- tf$keras$datasets$cifar10$load_data() %>% 
    extract2(2) %>% 
    set_names(c("images","classes"))

5.2. Get Class Definitions

One of the challenges behind accessing this data from the TensorFlow package is that the classes are only numeric values (0 to 9) for each type of image. The definitions for these images can be found on GitHub (GitHub > EN10 > CIFAR). These classes are defined in the following code chunk.

# Define classes
ClassList <- c(
    "0" = "airplane",
    "1" = "automobile",
    "2" = "bird",
    "3" = "cat",
    "4" = "deer",
    "5" = "dog",
    "6" = "frog",
    "7" = "horse",
    "8" = "ship",
    "9" = "truck" 
)

6. Check Data

6.1. Check Objects

It is important to check the data, to ensure that it has been generated correctly and that all the information looks okay. For this, a custom function is written (get_ObjectAttributes()), the source code for which can be found here. As seen in the following code chunk, the images object is a 4-Dimensional numeric array, with 10,000 images, each 32x32 pixels, and 3 colour channels. The entire object is over 117 Mb in size.
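
For context, the attributes reported by this helper are all available through standard base R calls; the below is a rough sketch of the sort of calls it could wrap (the real get_ObjectAttributes(), with its formatted printing and frequency table, lives in the linked source code and may differ).

# A rough base-R sketch of the kind of checks get_ObjectAttributes() performs
obj <- cifar$images
format(object.size(obj), units="Mb")  #<-- memory footprint
class(obj)                            #<-- object class
typeof(obj)                           #<-- storage type
mode(obj)                             #<-- storage mode
dim(obj)                              #<-- dimensions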

# Check Images
cifar %>% 
    extract2("images") %>% 
    get_ObjectAttributes("cifar$images") %>% 
    cat()

Which prints:

Name : cifar$images
 - Size : 117.2 Mb
 - Clas : array
 - Type : integer
 - Mode : numeric
 - Dims : 10000x32x32x3

When checking the classes object, it is a 2-Dimensional numeric array (with only 1 column), with the same number of rows as there are images in the images object (which is to be expected), and each class label appears exactly 1,000 times. The total size is less than 40 Kb.

# Check classes
cifar %>% 
    extract2("classes") %>% 
    get_ObjectAttributes(name="cifar$classes", print_freq=TRUE) %>% 
    cat()

Which prints:

Name : cifar$classes
 - Size : 39.3 Kb
 - Clas : matrix,array
 - Type : integer
 - Mode : numeric
 - Dims : 10000x1
 - Freq :
      label Freq
   1  0     1000
   2  1     1000
   3  2     1000
   4  3     1000
   5  4     1000
   6  5     1000
   7  6     1000
   8  7     1000
   9  8     1000
   10 9     1000

6.2. Check Images

After having gained an appreciation of the size of the objects in memory, it is then worthwhile to check the actual images themselves. As humans, we understand the actual images and colours better than we understand the numbers.

In order to visualise the images, two custom functions are written, as shown in the following code chunk. These functions take in the data (as a 4-Dimensional array), and visualise the images as a plot.

set_MakeImage <- function(image, index=1) {

    # Extract elements
    image.r <- image[,,1]
    image.g <- image[,,2]
    image.b <- image[,,3]

    # Make rgb
    image.rgb <- rgb(
        image.r, 
        image.g, 
        image.b, 
        maxColorValue=255
    )

    # Fix dimensions
    dim(image.rgb) <- dim(image.r)

    # Return
    return(image.rgb)

}
plt_PlotImage <- function(images, classes, class_list, index=1) {

    # Slice images
    image <- images[index,,,]
    image %<>% set_MakeImage(index)
    lbl <- classes %>% 
        extract(index) %>% 
        as.character() %>% 
        class_list[[.]]

    # Create plot
    # NOTE: draw_image() comes from the cowplot package, which is not in the library list above
    plot <- ggplot() + 
        ggtitle(lbl) +
        draw_image(image, interpolate=FALSE)

    # Return
    return(plot)

}

When the function is run on the first 16 images, the following is displayed. As shown, the images are extremely pixelated (which is expected, as they are only 32x32 pixels each), and you can see how each of the images is categorised.

# Set list
lst <- list()
# Loop 16 images
for (index in 1:16) {
    lst[[index]] <- plt_PlotImage(
        cifar$images, 
        cifar$classes, 
        ClassList,
        index)
    }
# View images
plt <- gridExtra::grid.arrange(grobs=lst, ncol=4)
Figure 1: Initial Images

7. Prepare the Data

There are four steps to preparing the data:

  1. Reclassify
  2. Split
  3. Reshape
  4. Standardise

7.1. Reclassify

For the purposes of this paper, let’s assume that we are trying to predict whether or not the picture is a car. This requires transforming the data into a binary classification problem, where the Neural Network will be looking to predict a 1 or a 0 from the data. This means that the model output will be a probability score for each image, which can easily be classified by adjusting the cutoff threshold.

The first step is to reclassify the data so that all the cars have the value 1, and everything else is 0. We know from the classes defined before that the cars already have the value 1, which means the transformation only needs to occur for all the other classes.

# Implement within a pipe
cifar[["classes"]] <- cifar %>%
    extract2("classes") %>% 
    (function(classes){

        # View initial classes
        classes %>% as.vector %>% head(40) %>% print

        # Reclassify
        classes <- ifelse(classes==1,1,0)

        # View reclassified classes
        classes %>% as.vector %>% head(40) %>% print

        # Return
        return(classes)
    })

Which prints:

[1] 3 8 8 0 6 6 1 6 3 1 0 9 5 7 9 8 5 7 8 6 7 0 4 9 5 2 4 0 9 6
[1] 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

7.2. Split Data

The next task is to split the data into a training and a testing set. The reasons for doing this are covered elsewhere (on sites such as Wikipedia’s Cross-Validation article, and Google’s Machine Learning Crash Course: Training and Test Sets, Splitting Data).

Firstly, to get an appreciation of the current split of the data, the following code chunk visualises this data using the ggplot2 package. As shown, the data is currently distributed with 90% of images in class 0, and the remaining 10% in class 1.

# Print Plot
cifar %>% 
    extract("classes") %>% 
    table(dnn="classes") %>% 
    data.frame() %>% 
    ggplot(aes(classes, Freq, fill=classes)) + 
        geom_col(colour="black") +
        geom_label(
            aes(label=Freq),
            show.legend=FALSE
        ) +
        scale_y_continuous(breaks=seq(0,10000,1000)) +
        theme(panel.grid.minor.y=element_blank()) +
        labs(
            title="Count of Each Class"
        )
Figure 2: Count of Each Class

To implement this data splitting, we are using the caret::createDataPartition() function. This creates a partition object, which is then used to segregate the cifar data accordingly. The proportion of the split was arbitrarily chosen at 70% for training, and the remainder for testing. However, 80% could also be justified; this is a hyperparameter which can be tweaked at a later stage.

# Set seed for reproducibility
set.seed(1234)
# Create partition
partition <- createDataPartition(cifar$classes, p=0.7, list=FALSE)
# Split data
trn_img <- cifar$images[partition,,,]
tst_img <- cifar$images[-partition,,,]
trn_cls <- cifar$classes[partition]
tst_cls <- cifar$classes[-partition]

After splitting, the data is re-plotted, and the train/test split is easily seen to have achieved an even 70/30 distribution across both classes.

# Print Plot
rbind(
    trn_cls %>% table,
    tst_cls %>% table
) %>% 
    set_rownames(c("train","test")) %>% 
    data.frame() %>% 
    rename_all(str_remove_all, "X") %>% 
    rownames_to_column("data") %>% 
    pivot_longer(2:3, names_to="classes", values_to="Freq") %>% 
    mutate(
        label=paste(data, classes, sep=": "),
        data=factor(data, levels=c("train","test"))
    ) %>% 
    ggplot(aes(classes, Freq, fill=data), position="dodge") + 
        geom_col(colour="black", position="dodge") +
        geom_label(
            aes(label=Freq), 
            position=position_dodge(width=0.9),
            show.legend=FALSE
        ) +
        scale_y_continuous(breaks=seq(0,10000,1000)) +
        theme(panel.grid.minor.y=element_blank()) +
        labs(
            title="Count of Each Class",
            subtitle="Split by Train/Test"
        )
Figure 3: Count of Each Class, Split by Train/Test

Another method for checking that the data has been split properly is to again run the get_ObjectAttributes() function, as shown in the following code chunk. The information shown here is consistent with the plot above. It is also interesting to note that the training image array is 82 Mb in size, which will be important later when checking the performance of forward propagation.

for (name in c("trn_img","tst_img","trn_cls","tst_cls")) {
    name %>% 
        get() %>% 
        get_ObjectAttributes(
            name, 
            if (name %in% c("trn_cls","tst_cls")) TRUE else FALSE
        ) %>% 
        cat()
    if (name != "tst_cls") cat("n")
}

Which prints:

Name : trn_img
 - Size : 82 Mb
 - Clas : array
 - Type : integer
 - Mode : numeric
 - Dims : 7000x32x32x3
Name : tst_img
 - Size : 35.2 Mb
 - Clas : array
 - Type : integer
 - Mode : numeric
 - Dims : 3000x32x32x3
Name : trn_cls
 - Size : 54.7 Kb
 - Clas : numeric
 - Type : double
 - Mode : numeric
 - Dims : 7000
 - Freq :
     label Freq
   1 0     6309
   2 1      691
Name : tst_cls
 - Size : 23.5 Kb
 - Clas : numeric
 - Type : double
 - Mode : numeric
 - Dims : 3000
 - Freq :
     label Freq
   1 0     2691
   2 1      309

7.3. Reshape Data

For the first input layer of our Neural Network, we want the data to be a single dimension of nodes per image. Therefore, it is necessary to reshape the data from a 4-Dimensional array into a 2-Dimensional array. This process is called Flattening, and more information can be found here: Memory Layout of Multi-Dimensional Arrays.

The implementation of this method can be done quite easily with the array() function, as it has the dim= argument which can be used to specify the dimensions required.

The desired matrix should have each image on a new row, and each pixel value in a different column. Because the pixel values are spread across the 2nd, 3rd and 4th dimensions, we need to take the product of these three numbers and use that to specify the desired number of columns. Effectively, we are computing 32×32×3, which equates to 3072 columns. This calculation is implemented programmatically in-line, in the next code chunk.

# Reshape data
trn_img %<>% array(dim=c(
    dim(.) %>% extract(1),
    dim(.) %>% extract(2:4) %>% prod()
))
tst_img %<>% array(dim=c(
    dim(.) %>% extract(1),
    dim(.) %>% extract(2:4) %>% prod()
))
trn_cls %<>% array(dim=c(
    length(.),
    1
))
tst_cls %<>% array(dim=c(
    length(.),
    1
))

When checking the object attributes once again, you will see that the image data has been manipulated correctly, with the number of rows equal to the number of images, and the number of columns equal to the number of pixel values.

for (name in c("trn_img","tst_img","trn_cls","tst_cls")) {
    name %>% 
        get() %>% 
        get_ObjectAttributes(name, FALSE) %>% 
        cat()
    if (name != "tst_cls") cat("n")
}

Which prints:

Name : trn_img
 - Size : 82 Mb
 - Clas : matrix,array
 - Type : integer
 - Mode : numeric
 - Dims : 7000x3072
Name : tst_img
 - Size : 35.2 Mb
 - Clas : matrix,array
 - Type : integer
 - Mode : numeric
 - Dims : 3000x3072
Name : trn_cls
 - Size : 54.9 Kb
 - Clas : matrix,array
 - Type : double
 - Mode : numeric
 - Dims : 7000x1
Name : tst_cls
 - Size : 23.6 Kb
 - Clas : matrix,array
 - Type : double
 - Mode : numeric
 - Dims : 3000x1

7.4. Standardise Data

The final step in preparing the data is to standardise it so that all the elements are between 0 and 1. The reason for this is to help prevent exploding and vanishing gradients in later steps, which can be caused by the Neural Network trying to fit to the peaks and troughs of data spread across a 0 to 255 value range.

As documented on the TensorFlow website, the CIFAR10 data set consists of RGB Image data. And, as documented on Wikipedia, RGB data are all values between 0 and 255.

Therefore, all that is necessary is to divide every element by 255, which will always result in a value between 0 and 1. As the image data is currently in an array, the division in the below code chunk runs as a vectorised operation over the entire array.

trn_img <- trn_img/255
tst_img <- tst_img/255
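
As a quick sanity check (a one-liner not in the original walk-through), the value range can be confirmed after the scaling:

# All pixel values should now lie within [0, 1]
range(trn_img)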

The data is now ready for the network. Next step is to build the network.

8. Instantiate the Network

8.1. Define the Architecture

Some quick notes on what the Network will actually be:

  • Overall architecture is a list.
  • Each element of the master list is another list, and these are going to form each of the layers of the overall network.
  • The first layer will always be the input layer.
  • The last layer will always be the output layer.
  • Each of the layers in between will be the hidden layers, and the names for these layers are simply named with numbers.
  • Each element of each layer will all be labelled the same, defined as follows:
    • nodz : Number of nodes in this layer.
    • inpt : Input matrix. Aka A_prev. This is a duplicate of the activation of the previous layer, so for large networks this needs to be taken in to consideration.
    • wgts : Weights matrix. Aka W.
    • bias : Bias vector. Aka b.
    • linr : Linear matrix. Aka Z. This is the result of the linear algebra between inpt, wgts and bias.
    • acti : Activation matrix. Aka A. The result of applying an activation function to the linr matrix.
    • acti_func : The activation function used.
    • cost : The overall cost of the model. This is a single value, but it is copied to each layer of the model.
    • back_cost : Gradient of the cost vector. Aka dA_cost.
    • back_acti : Gradient of the Activation matrix. Aka dA. The result of differentiation after having applied back propagation. With a given cost function.
    • back_linr : Gradient of the Linear algebra matrix. Aka dZ. The result of backwards linear differentiation back propagation.
    • back_wgts : Gradient of the Weights matrix. Aka dW. Also the result of back-prop.
    • back_bias : Gradient of the Bias vector. Aka db. Also the result of back-prop.
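
To make this structure concrete, the following is a small sketch of how individual elements would be accessed once the network has been instantiated in the next step (the layer names shown are the ones created by set_InstantiateNetwork(); hidden layers are simply named "1", "2", and so on).

# Hypothetical access patterns for the nested-list structure described above
# network_model[["input"]][["nodz"]]   # number of nodes in the input layer
# network_model[["1"]][["wgts"]]       # weight matrix of the first hidden layer
# network_model[["output"]][["acti"]]  # activation matrix of the output layer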

8.2. Set the Instantiation Function

In the below code chunk, the function set_InstantiateNetwork() is defined. It has only three input arguments, which specify the number of nodes to be used in each layer. Based on this information, the model will be instantiated and returned, ready for initialisation in the next step.

set_InstantiateNetwork <- function(
    input=50, 
    hidden=c(30,20,10), 
    output=1
    ) {
    # Set up
    model = list()
    layers = c(
        "input",
        1:length(hidden),
        "output"
    )

    # Loop
    for (layer in layers) {

        # Make layer
        model[[layer]] <- list(
            "nodz"      = "",
            "inpt"      = "",
            "wgts"      = "",
            "bias"      = "",
            "linr"      = "",
            "acti"      = "",
            "acti_func" = "",
            "cost"      = "",
            "back_cost" = "",
            "back_acti" = "",
            "back_linr" = "",
            "back_wgts" = "",
            "back_bias" = "" 
        )

        # Set nodes
        if (layer=="input") {
            model[[layer]][["nodz"]] <- input
        } else if (layer=="output") {
            model[[layer]][["nodz"]] <- output
        } else {
            layer_index <- layer %>% as.numeric()
            model[[layer]][["nodz"]] <- hidden[layer_index]
        }

    }

    # Return
    return(model)

}

8.3. Create the Network

The below code chunk instantiates the network. The model will be set up with the following number of nodes in each layer:

  • The input layer will have 3072 nodes, the same number calculated above.
  • Each hidden layer will have a decreasing number of nodes, from 100 down to 20 nodes.
  • The output layer will have 1 node, because this one node will be a floating point number between 0 and 1, which will be used to predict whether the relevant image is a car or not.
network_model <- set_InstantiateNetwork(
    input=3072, 
    hidden=c(100,75,50,30,20), 
    output=1
)

8.4. Visualise the Network

There is a very good website available which allows Neural Networks to be visualised: http://alexlenail.me/NN-SVG/AlexNet.html. The below image is a representation of the network that has just been created.

Further visualisation can be conducted once the network is fully initialised and forward propagated. See section Check Model Shapes for more details.

Figure 4: Visualisation of the Network

9. Initialise the Network

There are four steps to initialising the network:

  1. Set the Weight Initialisation functions
  2. Set the Layer Initialisation function
  3. Set the Model Initialisation function
  4. Run the Initialisation

9.1. Weight Initialisation

At its core, weight initialisation is simply generating a random normal number (with μ=0 and σ=1). However, if only this randomly generated number is used, the model gradients tend to explode or vanish when attempting to train deeper neural networks. Therefore, these weights need to be scaled after they are initialised, in order to be robust enough to continue to be trained at deeper layers.

There are many algorithms that can be used for weight initialisation. Two of the more common ones are the Xavier algorithm and the He algorithm, both of which are described below.

9.1.1. Xavier Algorithm

The equation for the Xavier initialisation is:

scale = √( 6 / (nᵢ + nᵢ₊₁) )
Equation 1: Xavier Initialisation Algorithm

Where:

  • nᵢ is the number of nodes coming in to this layer. Also known as "fan-in".
  • nᵢ₊₁ is the number of nodes going out of this layer. Also known as "fan-out".

9.1.2. He Algorithm

The equation for the He initialisation is:

scale = √( 2 / nᵢ )
Equation 2: He Initialisation Algorithm

Where:

  • nᵢ is the number of nodes coming in to this layer.

9.2. Initialisation Functions

For programmatic purposes, these functions are written with the order value as part of the function arguments. This means that the numerator of the equation can be altered at a later stage and used as a hyperparameter.

let_InitialiseXavier <- function(nodes_in, nodes_out, order=6) {

    # Do work
    numer <- sqrt(order)
    denom <- sqrt(nodes_in + nodes_out)
    output <- numer/denom

    # Return
    return(output)

}
let_InitialiseHe <- function(nodes_in, nodes_out, order=2) {

    # Do work
    numer <- order
    denom <- nodes_in
    output <- sqrt(numer/denom)

    # Return
    return(output)

}

9.3. Layer Initialisation

The next step is to construct a function that will initialise all the relevant aspects of an individual layer. This step is important because it is where the weight matrices are created, and these must be constructed in a certain manner so as to ensure that the dimensions allow successful forward propagation.

The steps for the layer construction are as follows:

  1. Determine the layer names of current layer (layer) and of the previous layer (layer_prev).

    • Used to access the relevant configurations from the network_model list.
  2. Determine the number of nodes feeding in to this layer (nodes_in), and the number of nodes feeding out of the current layer (nodes_out).

    • Used for passing in to the Initialisation Algorithms.
  3. Create the Weight matrix.

    • The dimensions of which are as follows:
        1. Number of Rows is the number of nodes in the previous layer.
        2. Number of Columns is the number of nodes in the current layer.
    • Each element is created using the rnorm() function, which generates a random number on the normal curve, with μ=0 and σ=1.
  4. Determine the Initialisation Algorithm to be used.

    • The value passed in to this function is a lower-case word associated with the relevant algorithm.
    • It is coerced to Title capitalisation, and then given the prefix let_Initialise.
    • The resulting name is then passed to the get() function, which retrieves the function, and the parameters are then passed to it.
    • This is a programmatic way to flexibly call different algorithms, based on values passed to the function.
  5. Scale the Weight matrix.

    • By multiplying each element by the Initialisation Algorithm.
  6. Create the Bias matrix.

    • The dimensions of which are as follows:
        1. The number of rows is the number of nodes in the current layer.
        2. There is only one column.
    • Every single element has value 0.
  7. Apply the Weight and Bias matrix back on to the network_model.
  8. Return the updated network_model object.

In order to achieve this, the arguments used for the function include:

  1. The network_model itself.
  2. The layer_index of the layer (where 1 is the input layer, and each increasing number is each subsequent layer).
  3. The initialisation_algorithm, which is either the value NA, or one of the values xavier or he, naming the relevant algorithm.
  4. The initialisation_order, which is an integer value to be used as the numerator in the relevant algorithm.
set_InitialiseLayer <- function(
    network_model,
    layer_index,
    initialisation_algorithm=NA,
    initialisation_order=6
    ) {

    # Get layer names
    layer_prev <- names(network_model)[layer_index-1]
    layer <- names(network_model)[layer_index]

    # Get number of nodes
    if (layer_index == 1) {
        nodes_in <- 0 #First layer is 'input'
    } else {
        nodes_in <- network_model %>% 
            extract2(layer_prev) %>% 
            extract2("nodz")
    }
    nodes_out <- network_model %>% 
        extract2(layer) %>% 
        extract2("nodz")

    # Set the seed of reproducibility
    set.seed(1234)

    # Initialise weight matrix
    w_matrix <- matrix(
        data=rnorm(nodes_in * nodes_out), 
        nrow=nodes_in,
        ncol=nodes_out
    )

    # Get initialisation algorithm
    if (!is.na(initialisation_algorithm)) {
        algorithm <- paste0(
            "let_Initialise", 
            str_to_title(initialisation_algorithm)
        )
    }

    # Scale weights
    if (layer_index != 1) {
        if (is.na(initialisation_algorithm)) {
            w_matrix <- w_matrix
        } else {
            w_matrix <- w_matrix * 
                get(algorithm)(
                    nodes_in=nodes_in, 
                    nodes_out=nodes_out, 
                    order=initialisation_order
                )
        }
    }

    # Initialise bias matrix
    b_matrix <- matrix(
        data=network_model %>% 
            extract2(layer) %>% 
            extract2("nodz") %>% 
            replicate(0),
        nrow=network_model %>% 
            extract2(layer) %>% 
            extract2("nodz"),
        ncol=1
    )

    # Place data back in to the model
    network_model[[layer]][["wgts"]] <- w_matrix
    network_model[[layer]][["bias"]] <- b_matrix

    # Return
    return(network_model)

}

9.4. Model Initialisation

The purpose of the set_InitialiseModel() function is to loop through each layer in the network_model object, initialising each in accordance with the parameters set within the model itself. It uses the number of nodes already set by the set_InstantiateNetwork() function.

The function will take only three input parameters:

  1. The network_model:

    • The Network, as instantiated by the set_InstantiateNetwork() function.
  2. The initialisation_algorithm:

    • The algorithm to be used for the network initialisation.
    • This should be the values NA, xavier, or he.
  3. The initialisation_order:

    • The order value to be used for the numerator of the equation.
    • This could be a numeric value, or the string layers which signifies that the order should be the number of hidden layers in the network.
set_InitialiseModel <- function(
    network_model, 
    initialisation_algorithm="xavier", 
    initialisation_order="layers"
    ) {

    # Redefine 'initialisation_order'
    if (initialisation_order == "layers") {
        initialisation_order <- get_CountOfElementsWithCondition(
            names(network_model), 
            function(x){is.integer(as.numeric(x))}
        )
    }

    # Initialise each layer
    for (layer_index in 1:length(names(network_model))) {
        network_model <- set_InitialiseLayer(
            network_model=network_model, 
            layer_index=layer_index, 
            initialisation_algorithm=initialisation_algorithm,
            initialisation_order=initialisation_order
        )
    }

    # Return
    return(network_model)

}

Notice, this function uses the user-defined function get_CountOfElementsWithCondition(). This function allows for the counting of the number of hidden layers in the model. The source code for the function can be found here.

9.5. Network Initialisation

The below code chunk initialises the network using the function defined above. The operator utilised is %<>%, which is defined in the magrittr package; its documentation states that it is used to update a value on the left-hand side by first piping it into the first argument position on the right-hand side, and then assigning the result back to the object on the left-hand side.
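
As a trivial illustration (not from the article’s source code), the following statements are equivalent:

vec <- c("b", "a", "c")
vec <- vec %>% toupper()  # pipe, then explicitly re-assign
vec %<>% toupper()        # the compound assignment pipe does both in one step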

network_model %<>% set_InitialiseModel()

9.6. Check Model Parameters

For a quick sanity-check, the below code chunk checks for the number of parameters for the defined model. This again uses a custom function get_ModelParametersCount(), the definition for which can be found in the source code for this article (located here).

network_model %>% 
    get_ModelParametersCount() %>% 
    format(big.mark=",")

Which prints:

[1] "320,846"

There are over 320,000 parameters in this neural network. Every single one of these parameters will need to be trained, and each of them will have an influence on the final outcome. We will continue to walk through how this can be achieved in the following sections.
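
For intuition, the count can be reproduced by hand: every layer other than the input layer contributes one weight element per connection plus one bias element per node. The below is a minimal sketch of such a count; get_ParamsSketch() is a hypothetical helper written only for illustration, and the article’s own get_ModelParametersCount() (in the linked source code) may be implemented differently.

get_ParamsSketch <- function(network_model) {
    # Sum the weight and bias elements of every trainable layer
    count <- 0
    for (layer in names(network_model)) {
        if (layer == "input") next  # the input layer holds no trainable parameters
        count <- count +
            prod(dim(network_model[[layer]][["wgts"]])) +
            length(network_model[[layer]][["bias"]])
    }
    return(count)
}
# network_model %>% get_ParamsSketch()  #> 320846 (320,570 weights + 276 biases)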

10. Forward Propagation

10.1. The Theory

There is a very good website which outlines what actually happens during a matrix multiplication: Matrix Multiplication.

The theory for the forward propagation method is as follows:

  1. Apply matrix multiplication

      1. The first matrix is the activation matrix from the previous layer.
      2. The second matrix is the weight matrix for the current layer.
      3. The third matrix is the (first) linear activation matrix for the current layer.
  2. Apply bias matrix

      1. The first matrix is the (first) linear activation of the current layer.
      2. The second matrix is the bias matrix for the current layer.
      3. The third matrix is the (second) linear activation matrix for the current layer.
  3. Apply activation algorithm

      1. The first matrix is the (second) linear activation matrix for the current layer.
      2. The activation function is determined by the user at the time the function is run; it can be the relu activation, the sigmoid activation, or any of the other activation functions.
      3. The third matrix is the activation matrix for the current layer.

10.1.1. Step One

To illustrate, the following matrices show the first step of the forward propagation, using fictitious numbers.

Figure 5: Forward Prop, Step One

The below code shows how this process is implemented programmatically.

# Declare First matrix
matrix_input <- matrix(
    data=1:24, 
    nrow=8, 
    ncol=3, 
    byrow=TRUE
)
# Declare Weight matrix
matrix_weight <- matrix(
    data=1:15, 
    nrow=3, 
    ncol=5, 
    byrow=TRUE
)
# Apply matrix manipulation
matrix_layer <- matrix_input %*% matrix_weight

Which prints:

matrix_input:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
[5,]   13   14   15
[6,]   16   17   18
[7,]   19   20   21
[8,]   22   23   24
matrix_weight:
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
matrix_layer:
     [,1] [,2] [,3] [,4] [,5]
[1,]   46   52   58   64   70
[2,]  100  115  130  145  160
[3,]  154  178  202  226  250
[4,]  208  241  274  307  340
[5,]  262  304  346  388  430
[6,]  316  367  418  469  520
[7,]  370  430  490  550  610
[8,]  424  493  562  631  700

10.1.2. Step Two

The next diagram shows how the bias matrix is applied. As shown, each element of the bias matrix is added horizontally across the corresponding row of the initial matrix.

Figure 6: Forward Prop, Step Two

The below code shows how this process is implemented programmatically.

# Declare Bias matrix
vector_bias <- matrix(1:8, 8, 1)
# Apply Bias matrix
matrix_biased <- sweep(matrix_layer, 1, vector_bias, "+")

Which prints:

matrix_layer:
     [,1] [,2] [,3] [,4] [,5]
[1,]   46   52   58   64   70
[2,]  100  115  130  145  160
[3,]  154  178  202  226  250
[4,]  208  241  274  307  340
[5,]  262  304  346  388  430
[6,]  316  367  418  469  520
[7,]  370  430  490  550  610
[8,]  424  493  562  631  700
vector_bias:
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5
[6,]    6
[7,]    7
[8,]    8
matrix_biased:
     [,1] [,2] [,3] [,4] [,5]
[1,]   47   53   59   65   71
[2,]  102  117  132  147  162
[3,]  157  181  205  229  253
[4,]  212  245  278  311  344
[5,]  267  309  351  393  435
[6,]  322  373  424  475  526
[7,]  377  437  497  557  617
[8,]  432  501  570  639  708

10.1.3. Step Three

The activation function is one which is defined at the time of running the function. The algorithm is applied to each element of the initial matrix. In this instance, a simple element-wise scaling is used, where each element is multiplied by 0.01 times itself.

Figure 7: Forward Prop, Step Three

The below code shows how this process is implemented programmatically.

# Apply Activation function
matrix_output <- matrix_biased * (0.01 * matrix_biased)

Which prints:

matrix_biased:
     [,1] [,2] [,3] [,4] [,5]
[1,]   47   53   59   65   71
[2,]  102  117  132  147  162
[3,]  157  181  205  229  253
[4,]  212  245  278  311  344
[5,]  267  309  351  393  435
[6,]  322  373  424  475  526
[7,]  377  437  497  557  617
[8,]  432  501  570  639  708
matrix_output:
        [,1]    [,2]    [,3]    [,4]    [,5]
[1,]   22.09   28.09   34.81   42.25   50.41
[2,]  104.04  136.89  174.24  216.09  262.44
[3,]  246.49  327.61  420.25  524.41  640.09
[4,]  449.44  600.25  772.84  967.21 1183.36
[5,]  712.89  954.81 1232.01 1544.49 1892.25
[6,] 1036.84 1391.29 1797.76 2256.25 2766.76
[7,] 1421.29 1909.69 2470.09 3102.49 3806.89
[8,] 1866.24 2510.01 3249.00 4083.21 5012.64

10.2. Linear Component

When combined together, the linear algebraic function is implemented in just three lines of code, as shown in the below code chunk in the set_LinearForward() function.

set_LinearForward <- function(inpt, wgts, bias) {

    # Perform matrix multiplication
    linr <- inpt %*% wgts

    # Add bias
    linr <- sweep(linr, 2, bias, "+")

    # Return
    return(linr)

}
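
As a quick usage sketch, the function can be run on the demo matrices from Step One. Note that, inside the network, the bias has one element per node (that is, one per column of the linear matrix), so a hypothetical 5-element bias is used here rather than the 8-element one from the Step Two illustration.

demo_bias <- matrix(1:5, nrow=5, ncol=1)
demo_linr <- set_LinearForward(matrix_input, matrix_weight, demo_bias)
dim(demo_linr)  #> 8 5 (one row per sample, one column per node)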

10.3. Non-Linear Component

The real power of Neural Networks comes from their Activation functions. This is how the network is able to capture the non-linear aspects of the data, and this is what gives them their predictive power.

The Activation function can be one of many different algorithms, depending on the purpose of the network. The choice of Activation can be a hyperparameter which is chosen at a later stage. The Desmos website provides an excellent interactive way of reviewing the different types of Activations: Activation Functions.

Each of the Activation functions is defined individually, and each takes as its argument the matrix to which the activation will be applied (the swish activation also takes a beta parameter). For the purposes of brevity, four of the more popular activations are provided here; but there are many, many others which could be used. The resources and equations for how to compute these functions are also provided:

sigmoid:

sigmoid(z) = 1 / (1 + exp(−z))
Equation 3: Sigmoid Activation

Sources:

  1. How to calculate a logistic sigmoid function in Python
  2. Implement sigmoid function using Numpy

relu:

relu(z) = max(0, z)
Equation 4: Relu Activation

Sources:

  1. A beginner’s guide to NumPy with Sigmoid, ReLu and Softmax activation functions

softmax:

softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)
Equation 5: Softmax Activation

Sources:

  1. Softmax Activation Function Explained

swish:

swish(z) = z · sigmoid(βz)
Equation 6: Swish Activation

Sources:

  1. Searching for Activation Functions
  2. Implementing Swish Activation Function in Keras

These Activation Functions are defined programmatically as follows:

let_ActivateSigmoid <- function(linr) {

    # Do work
    acti <- 1/(1+exp(-linr))

    # Return
    return(acti)

}
let_ActivateRelu <- function(linr) {

    # Do work
    acti <- sapply(linr, max, 0) %>% 
        structure(dim=dim(linr))

    # Return
    return(acti)

}
let_ActivateSoftmax <- function(linr) {

    # Do work
    # NOTE: the exponentials here are normalised by the sum over the entire matrix
    expo <- exp(linr)
    expo_sum <- sum(exp(linr))
    acti <- expo/expo_sum

    # Return
    return(acti)

}
let_ActivateSwish <- function(linr, beta=0.1) {

    # Do work (swish: multiply the input by the sigmoid of beta times the input)
    acti <- linr * (1/(1+exp(-beta*linr)))

    # Return
    return(acti)

}
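
As a quick usage sketch (with made-up numbers), each activation is applied element-wise and preserves the dimensions of whatever matrix it is given:

demo_linr <- matrix(c(-2, -1, 0, 1, 2, 3), nrow=2)
let_ActivateRelu(demo_linr)     # negative values become 0
let_ActivateSigmoid(demo_linr)  # every value is squashed into (0, 1)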

10.4. Set Forward Propagation Function

The set_ForwardProp() function pulls together all of the components mentioned above. It implements the following steps:

  1. Loop through each layer of the network_model.
  2. Get the layer name for the current layer.
  3. Implement a ‘pass-thru’ process for the input layer.
  4. Extract the relevant information, including:

      1. Layer name of the previous layer
      2. Activation matrix of the previous layer
      3. Weights matrix of the current layer
      4. Bias matrix for the current layer
  5. Apply linear algebra component.
  6. Apply the non-linear activation component.
  7. Note the difference between the activation of the hidden layers and that of the final layer.
  8. Apply the relevant information back on to the network.
  9. Return the network_model object.

In order to implement this process, the function takes only four arguments:

  1. network_model: The network model to be updated.
  2. data_in: The 2-Dimensional matrix of flattened training images, as prepared above.
  3. activation_hidden: The Activation function to be applied to the hidden layers.
  4. activation_final: The Activation function to be applied to the final (output) layer.
set_ForwardProp <- function(
    network_model, 
    data_in, 
    activation_hidden="relu", 
    activation_final="sigmoid"
    ) {

    # Do work
    for (index in network_model %>% names() %>% length() %>% 1:.) {

        # Define layer name
        layr <- network_model %>% 
            names() %>% 
            extract(index)

        if (layr=="input") {

            # Pass-thru for 'input' layer
            network_model[[layr]][["inpt"]] <- data_in
            network_model[[layr]][["acti"]] <- data_in

        } else {

            # Extract data
            prev <- names(network_model)[index-1]
            inpt <- network_model %>% 
                extract2(prev) %>% 
                extract2("acti")
            wgts <- network_model %>% 
                extract2(layr) %>% 
                extract2("wgts")
            bias <- network_model %>% 
                extract2(layr) %>% 
                extract2("bias")

            # Calculate
            linr <- set_LinearForward(inpt, wgts, bias)

            # Activate
            if (layr=="output") {
                func <- activation_final %>% 
                    str_to_title() %>% 
                    paste0("let_Activate", .) %>% 
                    get()
                acti <- func(linr)
                network_model[[layr]][["acti_func"]] <- activation_final
            } else {
                func <- activation_hidden %>% 
                    str_to_title() %>% 
                    paste0("let_Activate", .) %>% 
                    get()
                acti <- func(linr)
                network_model[[layr]][["acti_func"]] <- activation_hidden
            }

            # Apply back to our model
            network_model[[layr]][["inpt"]] <- inpt
            network_model[[layr]][["linr"]] <- linr
            network_model[[layr]][["acti"]] <- acti

        }

    }

    # Return
    return(network_model)

}

10.5. Run Forward Propagation

The last step of the Forward Propagation process is to actually run the function. In the below code chunk, the tic() and toc() functions are implemented to time how long a process takes to run.

tic()
network_model %<>% set_ForwardProp(trn_img, "relu", "sigmoid")
toc()

Which prints:

7.05 sec elapsed

As mentioned above in the Split Data section, the trn_img object is over 82 Mb in size, and the model has over 320,000 parameters. The entire end-to-end process for forward propagation took only 7 seconds to run, which is quite impressive, and is a testament to the power of mathematics.

10.6. Check Model Shapes

The Network can now be printed using the custom function get_PrintNetwork(). This function returns each layer in an individual box, containing key information like the relevant shapes of the matrices and the activation functions used. The source code for the function can be found here.

# Print the Network
network_model %>% 
    get_PrintNetwork() %>% 
    cat()

Which prints:

+--------------------------------+
|      Layer : input             |
|      Nodes : 3,072             |
| Inpt Shape : 7,000 x 3,072     |
| Wgts Shape :     0 x 3,072     |
| Outp Shape : 7,000 x 3,072     |
| Activation :                   |
+--------------------------------+
               |
               V
+--------------------------------+
|      Layer : 1                 |
|      Nodes : 100               |
| Inpt Shape : 7,000 x 3,072     |
| Wgts Shape : 3,072 x   100     |
| Outp Shape : 7,000 x   100     |
| Activation : relu              |
+--------------------------------+
               |
               V
+--------------------------------+
|      Layer : 2                 |
|      Nodes : 75                |
| Inpt Shape : 7,000 x   100     |
| Wgts Shape : 100   x    75     |
| Outp Shape : 7,000 x    75     |
| Activation : relu              |
+--------------------------------+
               |
               V
+--------------------------------+
|      Layer : 3                 |
|      Nodes : 50                |
| Inpt Shape : 7,000 x    75     |
| Wgts Shape : 75    x    50     |
| Outp Shape : 7,000 x    50     |
| Activation : relu              |
+--------------------------------+
               |
               V
+--------------------------------+
|      Layer : 4                 |
|      Nodes : 30                |
| Inpt Shape : 7,000 x    50     |
| Wgts Shape : 50    x    30     |
| Outp Shape : 7,000 x    30     |
| Activation : relu              |
+--------------------------------+
               |
               V
+--------------------------------+
|      Layer : 5                 |
|      Nodes : 20                |
| Inpt Shape : 7,000 x    30     |
| Wgts Shape : 30    x    20     |
| Outp Shape : 7,000 x    20     |
| Activation : relu              |
+--------------------------------+
               |
               V
+--------------------------------+
|      Layer : output            |
|      Nodes : 1                 |
| Inpt Shape : 7,000 x    20     |
| Wgts Shape : 20    x     1     |
| Outp Shape : 7,000 x     1     |
| Activation : sigmoid           |
+--------------------------------+

11. Calculate the Cost

Once the forward-propagation part has been completed, it’s then necessary to get a measure for how wrong the model is. This will then be used to update the model parameters during the backward propagation steps.

11.1. Set Cost Function

The first step is to write the functions to be used for getting the cost of the model. Ultimately, it does not matter how bad the results are for the first round of training; remember, the model was initialised with random numbers. What’s important is that the cost function distils the performance of the network down to a single value, and it is this value that the derivative functions in the backward propagation steps will work from.

Note here that there is a very small epsilon value used (epsi), which effectively adjusts for a perfect prediction made by the model. The reason for this is not that we don’t want the model to predict a perfect value; it’s that we want the model to predict the probability of a perfect value. Furthermore, it’s impossible to take the logarithm of 0, so it’s necessary to adjust such predictions to be just a little bit off zero.

get_ComputeCost <- function(pred, true, epsi=1e-10) {

    # Get number of samples
    samp <- length(true)

    # Instantiate totals
    total_cost <- 0

    # Loop for each prediction
    for (i in 1:samp) {

        # Adjust for perfect predictions
        if (pred[i]==1) {pred[i] %<>% subtract(epsi)}
        if (pred[i]==0) {pred[i] %<>% add(epsi)}

        # Calculate totals
        total_cost <- total_cost - 
            (
                true[i] * log(pred[i]) 
                + 
                (1-true[i]) * log(1-pred[i])
            )

    }

    # Take an average
    cost <- (1/samp) * total_cost

    # Return
    return(cost)

}
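
For reference, the quantity computed by this loop is the standard binary cross-entropy, and it could equally be written in vectorised form. The below is a sketch only (not the article’s function); pmin() and pmax() play the same clipping role as the epsilon adjustments above.

get_ComputeCostVectorised <- function(pred, true, epsi=1e-10) {
    # Clip predictions away from exactly 0 and 1, then average the log-loss
    pred <- pmin(pmax(pred, epsi), 1 - epsi)
    cost <- -mean(true * log(pred) + (1 - true) * log(1 - pred))
    return(cost)
}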

Then, the cost must be applied back on to the network. For this, the exact same value is applied to each layer of the network.

set_ApplyCost <- function(network_model, cost) {

    # Apply back to the model
    for (layer in network_model %>% names) {
        network_model[[layer]][["cost"]] <- cost
    }

    # Return
    return(network_model)
}

11.2. Run the Cost Function

The below code chunk runs the cost functions, utilising the functions defined above.

network_model %<>% set_ApplyCost(
    get_ComputeCost(network_model[["output"]][["acti"]], trn_cls)
)

12. Backward Propagation

The purpose of back-propagation is to take the cost of the model, and then differentiate each of the Weights and Biases matrices in the network to determine the extent to which every single parameter in the network contributed to that final cost value. To do this, the process works backwards from the end to the start, through the following steps:

  1. Differentiate the final cost value
  2. Differentiate the activation matrices
  3. Differentiate the linear algebra matrices
  4. Continue to the previous layer.

Each step in the back-propagation process requires calculus to derive the values, and the final function is implemented here. As this paper is not intended to show how to derive the equations, but more how to run the functions, the necessary algebraic steps are not included here.

12.1. Differentiate Cost

Similar to the functions used to get the cost, the cost differential values are first calculated, then applied to each layer of the network.

get_DifferentiateCost <- function(pred, true) {

    # Do work
    diff_cost <- -(
        divide_by(true, pred) - divide_by(1-true, 1-pred)
    )

    # Return
    return(diff_cost)

}
set_ApplyDifferentiateCost <- function(
    network_model, 
    cost_differential
    ) {

    # Do work
    for (layer in names(network_model)) {
        network_model[[layer]][["back_cost"]] <- cost_differential
        if (layer=="output") {
            network_model[[layer]][["back_acti"]] <- network_model %>% 
                extract2(layer) %>% 
                extract2("back_cost") %>% 
                t()
        }
    }

    # Return
    return(network_model)

}

12.2. Differentiate Activation

As each of the activations has its own function, each also has a derivative which can be calculated using calculus.

let_BackwardActivateRelu <- function(diff_acti_curr, linr_curr) {

    # Do work
    diff_linr_curr <- diff_acti_curr
    diff_linr_curr[linr_curr<=0] <- 0

    # Return
    return(diff_linr_curr)

}
let_BackwardActivateSigmoid <- function(diff_acti_curr, linr_curr) {

    # Do work
    temp <- 1/(1+exp(-linr_curr))
    diff_linr_curr <- t(diff_acti_curr) * temp * (1-temp)

    # Return
    return(t(diff_linr_curr))

}
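
For reference, the derivatives being applied in these two functions follow directly from the activation definitions, with the chain rule giving dZ = dA ⊙ g′(Z) (applied element-wise):

  • relu′(z) = 1 if z > 0, otherwise 0
  • sigmoid′(z) = sigmoid(z) · (1 − sigmoid(z))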

12.3. Differentiate Linear

Having defined all of this so far, the next step is to combine the linear differentiation steps into a single function, which can be used once per layer to run the necessary back-propagation differentiation.

Notice here that the output is actually a list of three elements. This is a key difference between R and Python: in R, functions can only return a single object (hence the list), whereas Python is able to return multiple values from a function.

get_DifferentiateLinear <- function(
    back_linr_curr,
    acti_prev,
    wgts,
    bias
    ) {

    # get number of samples
    samp <- acti_prev %>% dim %>% extract(2)

    # Differentiate weights
    diff_wgts <- 1/samp * (back_linr_curr %*% acti_prev)

    # Differentiate bias
    diff_bias <- 1/samp * rowSums(back_linr_curr, dims=1)

    # Differentiate activation
    diff_acti_prev <- wgts %*% back_linr_curr

    # Consolidate in to one list
    list_linr <- list(
        diff_acti_prev, 
        diff_wgts, 
        diff_bias
    )

    # Return
    return(list_linr)

}

12.4. Run Backward Propagation

After having defined the differentiation functions, it is next necessary to combine these individual functions into a single, combined function, which can be run once per layer.

First, we will set up the function, and secondly we will run it.

12.4.1. Set Backward Propagation Function

The backward propagation function to be defined (set_BackwardProp()) must be designed to run through the following steps:

  1. Run through each layer in reverse direction. This is necessary because, quite logically, the backward propagation function needs to be run in a backwards direction.
  2. Extract the layer name.
  3. Skip the input layer. Again, quite logical, because this layer is at the beginning of the network, and does not need to be back-propagated.
  4. Extract the name of the previous layer.
  5. Extract the relevant matrices to be used for subsequent calculations.
  6. Extract the relevant activation function for this particular layer.
  7. Set up some empty matrices, which will house the relevant differentiated matrices.
  8. Differentiate the activation matrix of the current layer
  9. Differentiate the linear matrices, including:

    • weight matrix of the current layer.
    • bias matrix of the current layer.
    • activation matrix of the previous layer.
  10. Apply information back in to the relevant places in the network_model.
  11. Return the updated network_model.
set_BackwardProp <- function(network_model) {

    # Loop through each layer in reverse order
    for (layr_indx in network_model %>% names() %>% length() %>% 1:. %>% rev) {

        # Get the layer name
        layr_curr <- network_model %>% 
            names() %>% 
            extract(layr_indx)

        # Skip the 'input' layer
        if (layr_curr == "input") next

        # Get the previous layer name
        layr_prev <- network_model %>% 
            names %>% 
            extract(layr_indx-1)

        # Set up the existing matrices
        linr_curr <- network_model %>% 
            extract2(layr_curr) %>% 
            extract2("linr")
        wgts_curr <- network_model %>% 
            extract2(layr_curr) %>% 
            extract2("wgts")
        bias_curr <- network_model %>% 
            extract2(layr_curr) %>% 
            extract2("bias")
        acti_prev <- network_model %>% 
            extract2(layr_prev) %>% 
            extract2("acti")
        diff_acti_curr <- network_model %>% 
            extract2(layr_curr) %>% 
            extract2("back_acti")

        # Get the activation function
        acti_func_back <- network_model %>% 
            extract2(layr_curr) %>% 
            extract2("acti_func") %>%
            str_to_title %>% 
            paste0("let_BackwardActivate", .)

        # Set up the empty matrices
        diff_linr_curr <- matrix()
        diff_acti_prev <- matrix()
        diff_wgts_curr <- matrix()
        diff_bias_curr <- matrix()

        # Differentiate activation
        diff_linr_curr <- get(acti_func_back)(
            diff_acti_curr,
            linr_curr
        )

        # Differentiate linear
        list_linr <- get_DifferentiateLinear(
            back_linr_curr=diff_linr_curr,
            acti_prev=acti_prev,
            wgts=wgts_curr,
            bias=bias_curr
        )
        diff_acti_prev <- list_linr %>% extract2(1)
        diff_wgts_curr <- list_linr %>% extract2(2)
        diff_bias_curr <- list_linr %>% extract2(3)

        # Apply back to model
        network_model[[layr_prev]][["back_acti"]] <- diff_acti_prev
        network_model[[layr_curr]][["back_linr"]] <- diff_linr_curr
        network_model[[layr_curr]][["back_wgts"]] <- diff_wgts_curr
        network_model[[layr_curr]][["back_bias"]] <- diff_bias_curr

    }

    # Return
    return(network_model)

}

12.4.2. Run Backward Propagation Function

Having defined this function, the next step is to run the function. The below code chunk is wrapped with the tic() and toc() functions, in order to determine how much time it took for the function to run.

tic()
network_model %<>% set_BackwardProp()
toc()

Which prints:

8.84 sec elapsed

As shown, it took approximately 9 seconds to run this function. This is quite impressive, considering that gradients are computed for over 320,000 parameters (see the Check Model Parameters section).

13. Update Model Parameters

13.1. Context

After the gradients of the model parameters have been calculated, it’s then necessary to update the relevant parameters of the network (the weights and biases) before the network is trained again by re-running the forward propagation function.

The method used to update these parameters is called Stochastic Gradient Descent. For more information, see Stochastic Gradient Learning in Neural Networks or Backpropagation and Stochastic Gradient Descent Method.

For sure, there are other methods of implementing neural network optimisation, and a lot of research effort has gone into this area. There are algorithms such as RMSProp and Adam, which use more intelligent methods to achieve faster convergence and a more accurate final result. Some good sources for this include An Empirical Comparison of Optimizers for Machine Learning Models and Overview of different Optimizers for Neural Networks. For further enhancement and optimisation, it is worth exploring these options; a small sketch of one such alternative follows.
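
To give a flavour of these alternatives, here is a minimal, hypothetical sketch of an RMSProp-style update for a single weight matrix. It is not part of the article's network (which uses plain Stochastic Gradient Descent, as implemented below); the names rmsprop_step, cache, decay and epsilon are illustrative only.

# Hypothetical RMSProp-style update for one weight matrix.
# An exponentially-decaying average of squared gradients scales each step,
# so parameters with consistently large gradients take smaller steps.
rmsprop_step <- function(wgts, grad, cache, learning_rate=0.001,
                         decay=0.9, epsilon=1e-8) {
    cache <- decay * cache + (1 - decay) * grad^2
    wgts  <- wgts - learning_rate * grad / (sqrt(cache) + epsilon)
    list(wgts=wgts, cache=cache)
}

# Toy usage: one step on a 2x2 weight matrix (illustrative values only)
wgts  <- matrix(c(0.1, -0.2, 0.3, 0.4), nrow=2)
grad  <- matrix(c(0.5, -0.1, 0.2, -0.3), nrow=2)
cache <- matrix(0, nrow=2, ncol=2)
step  <- rmsprop_step(wgts, grad, cache)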

13.2. Process

Nonetheless, the process of implementing Stochastic Gradient Descent is really quite simple (a small worked example follows this list):

  1. Specify a learning rate between 0 and 1 (usually a very small number, for example 0.001).
  2. Multiply the gradient of each weight and bias matrix by the learning rate, in the negative direction.
  3. Add these scaled gradient steps to the original weight and bias matrices.
  4. Repeat this process layer-by-layer.
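
As a quick worked example (with made-up numbers, not values from the model): a single weight of 0.25, a gradient of 0.8, and a learning rate of 0.001 give a gradient step of -0.0008, so the updated weight becomes 0.2492.

learning_rate <- 0.001
wgt  <- 0.25
grad <- 0.80

grad_step <- -1 * (learning_rate * grad)   # -0.0008
wgt       <- wgt + grad_step               # 0.25 - 0.0008 = 0.2492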

13.3. Set the Update Model Function

To set the function for updating the model parameters, the following steps are used:

  1. Specify the learning_rate as an argument for the function.
  2. Loop through each layer in the network in a forward direction.
  3. Extract the layer name.
  4. Skip the input layer (as it does not need updating).
  5. Define the gradient steps for the back_wgts and back_bias matrices.
  6. Apply the gradient steps to the original wgts and bias matrices.
  7. Return the updated model.

set_UpdateModel <- function(network_model, learning_rate=0.001) {

    # Loop through each layer in a forward direction
    for (index in network_model %>% names() %>% length() %>% 1:.) {

        # Get layer name
        layr <- network_model %>% 
            names() %>% 
            extract(index)

        # Skip 'input' layer
        if (layr=="input") next

        # Define gradient steps for the weight
        grad_step_wgts <- -1 * 
            (
                learning_rate * network_model[[layr]][["back_wgts"]]
            )
        # Define gradient steps for the bias
        grad_step_bias <- -1 * 
            (
                learning_rate * network_model[[layr]][["back_bias"]]
            )

        # Take steps
        network_model[[layr]][["wgts"]] <- network_model %>% 
            extract2(layr) %>% 
            extract2("wgts") %>% 
            add(t(grad_step_wgts))
        network_model[[layr]][["bias"]] <- network_model %>% 
            extract2(layr) %>% 
            extract2("bias") %>% 
            add(grad_step_bias)

    }

    # Return
    return(network_model)

}

13.4. Run the Update Model Function

Having defined the function, we then run it on our network.

network_model %<>% set_UpdateModel(0.01)

14. Run the Model End-to-End

Now, it’s time to bring it all together. Running the model end-to-end essentially means:

  1. Run forward propagation.
  2. Calculate the cost.
  3. Run backward propagation.
  4. Update the model parameters.
  5. Repeat…

Each complete repetition of this cycle is called an Epoch. It is quite typical for networks to be trained over many, many Epochs (sometimes hundreds, sometimes thousands); the exact number of Epochs to run is a discretionary decision by the Data Scientist, based on numerous variables.

There is one additional step to be added here, and that is batching the data. Within each Epoch, the data is split into (roughly) equally sized groups, which are then processed one at a time, and the model parameters are updated after each batch. When the full data set has been run through the model in its entirety, that constitutes one Epoch.
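
As an aside, here is a minimal sketch of how such batching could work. This is a hypothetical helper for illustration only; it is not the article's own get_BatchIndexes() function, whose actual source code is linked in the next section.

# Shuffle the row indices (reproducibly), then return the subset
# belonging to the requested batch.
get_ToyBatchIndexes <- function(vector, batches, batch, seed=1234) {
    set.seed(seed)
    shuffled <- sample(vector)
    groups   <- cut(seq_along(shuffled), breaks=batches, labels=FALSE)
    shuffled[groups == batch]
}

# For example: the 2nd of 4 batches over 10 rows
get_ToyBatchIndexes(1:10, batches=4, batch=2)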

14.1. Set up the Train Model Function

To illustrate this programmatically, the following function is written.

Notice here, that there are many custom functions included (the source code for each of which can be found here). These functions include:

  • get_Modulus()
  • get_BatchIndexes()
  • get_VerbosityValues()
  • get_TimeDifference()
  • plt_PlotLearningCurve()

The steps for this function include:

  1. Begin the timer.
  2. Declare what information will be returned.
  3. Instantiate the network.
  4. Initialise the network.
  5. Loop through each Epoch.
  6. Loop through each batch.
  7. Subset the data for this particular batch.
  8. Run forward propagation.
  9. Calculate the cost.
  10. Apply the cost to the network.
  11. Differentiate the cost.
  12. Run backward propagation.
  13. Update the model parameters.
  14. Run through next batch.
  15. Save the overall cost at the end of the Epoch.
  16. Print the update at the relevant Epoch number (from the verbosity argument).
  17. Run through the next Epoch.
  18. Save the updated network.
  19. Print the learning curve.
  20. Return the output.

let_TrainModel <- function(
    x_train, 
    y_train,
    input_nodes=dim(x_train)[2], 
    hidden_nodes=c(100, 50, 10), 
    output_nodes=1,
    initialisation_algorithm="xavier",
    initialisation_order="layers",
    activation_hidden="relu", 
    activation_final="sigmoid",
    batches=get_Modulus(dim(x_train)[1])[4], 
    epochs=500, 
    learning_rate=0.001,
    verbosity=NA,
    print_learning_curve=TRUE
    ) {

    # Begin the timer
    time_begin <- Sys.time()

    # Set return values
    # (network_model is a placeholder here; it is re-applied after training)
    output <- list(
        network_model=list(),
        results=list(
            cost=vector()
            # Can add more, such as accuracy or specificity.
        )
    )

    # Instantiate
    network_model <- set_InstantiateNetwork(
        input=input_nodes,
        hidden=hidden_nodes, 
        output=output_nodes
    )

    # Initialise
    network_model <- set_InitialiseModel(
        network_model=network_model, 
        initialisation_algorithm=initialisation_algorithm, 
        initialisation_order=initialisation_order
    )

    # Loop each epoch
    for (epoch in 1:epochs) {

        # Loop each batch
        for (batch in 1:batches) {

            # Set indices
            batch_indexes <- get_BatchIndexes(
                vector=1:dim(x_train)[1], 
                batches=batches, 
                batch=batch,
                seed=1234
            )

            # Set data
            x_train_batch <- x_train[batch_indexes,]
            y_train_batch <- y_train[batch_indexes]

            # Forward Prop
            network_model <- set_ForwardProp(
                network_model = network_model, 
                data_in = x_train_batch, 
                activation_hidden = activation_hidden, 
                activation_final = activation_final
            )

            # Get cost
            cost <- get_ComputeCost(
                pred = network_model[["output"]][["acti"]], 
                true = y_train_batch, 
                epsi = 1e-10
            )

            # Apply cost
            network_model <- set_ApplyCost(
                network_model = network_model, 
                cost = cost
            )

            # Differentiate cost
            network_model <- set_ApplyDifferentiateCost(
                network_model = network_model, 
                cost_differential = get_DifferentiateCost(network_model[["output"]][["acti"]], y_train_batch)
            )

            # Backprop
            network_model <- set_BackwardProp(network_model)

            # Update parameters
            network_model <- set_UpdateModel(
                network_model = network_model, 
                learning_rate = learning_rate
            )

        }

        # Save cost
        output[["results"]][["cost"]] %<>% c(cost)

        # Print update
        if (!is.na(verbosity)) {
            if (epoch %in% get_VerbosityValues(epochs, verbosity)){
                if (epoch == verbosity) {
                    "Learning rate: {}n" %>% 
                        str_Format(learning_rate) %>% 
                        cat()
                }
                "Epoch {}, Cost: {}, Time: {}n" %>% 
                    str_Format(
                        epoch, 
                        round(cost, 5), 
                        get_TimeDifference(time_begin)
                    ) %>% 
                    cat()
            }
        }

    }

    # Re-apply back to the output list
    output[["network_model"]] <- network_model

    # Print the results
    if (print_learning_curve == TRUE) {

        tryCatch(
            expr={
                output %>% 
                    extract2("results") %>% 
                    extract2("cost") %>% 
                    plt_PlotLearningCurve(
                        input_nodes=input_nodes, 
                        hidden_nodes=hidden_nodes, 
                        output_nodes=output_nodes,
                        initialisation_algorithm=
                            initialisation_algorithm, 
                        initialisation_order=initialisation_order,
                        activation_hidden=activation_hidden, 
                        activation_final=activation_final,
                        epochs=epochs, 
                        learning_rate=learning_rate, 
                        verbosity=verbosity,
                        run_time=get_TimeDifference(time_begin)
                    ) %>% 
                    print()
            },
            warning=function(message){
                writeLines("A Warning occurred:")
                writeLines(conditionMessage(message))
                return(invisible(NA))
            },
            error=function(message){
                writeLines("An Error occurred:")
                writeLines(conditionMessage(message))
                return(invisible(NA))
            },
            finally={
                #Do nothing...
            }
        )

    }

    # Return
    return(output)

}

14.2. Run the Train Model Function

Having set up the training function, let’s now run it.

Notice here that:

  1. Only the training data is used (see Split Data section).
  2. The He initialisation algorithm is used, with an order of 2.
  3. The relu activation algorithm is used for the hidden layers, with the sigmoid activation algorithm for the output layer.
  4. There are 56 batches, and 50 epochs.
  5. The model results are printed every 10 epochs.
  6. The learning curve is printed at the end.
training_output <- let_TrainModel(
    x_train=trn_img, 
    y_train=trn_cls,
    input_nodes=dim(trn_img)[2], 
    hidden_nodes=c(100,75,50,30,20), 
    output_nodes=1,
    initialisation_algorithm="he", 
    initialisation_order=2,
    activation_hidden="relu", 
    activation_final="sigmoid",
    batches=56, epochs=50, 
    learning_rate=0.01,
    verbosity=10, 
    print_learning_curve=TRUE
)

Which prints:

Learning rate: 0.01
Epoch 10, Cost: 0.32237, Time: 4.86 mins
Epoch 20, Cost: 0.30458, Time: 9.58 mins
Epoch 30, Cost: 0.28903, Time: 12.23 mins
Epoch 40, Cost: 0.29525, Time: 14.87 mins
Epoch 50, Cost: 0.29884, Time: 17.43 mins
Figure 8: Learning Curve for Neural Network

It works! This plot proves that the model is training, and it is continuing to learn and improve its performance over time.

Notice here that the cost line begins at approximately 0.4, and quickly decreases to around 0.3 after 20 Epochs. Completing the full 50 Epochs took approximately 17 minutes (as shown in the printed output above). This can be seen as a success.

14.3. Further Experimentation

Due to the way these functions have been set up, further experimentation and optimisation is very easy.

It may be reasonable to try:

  1. A different number of hidden layers, or number of nodes per layer (change the hidden_nodes parameter),
  2. A different initialisation algorithm, or initialisation order (change the initialisation_algorithm or initialisation_order parameter),
  3. A different activation function on the hidden layers (change the activation_hidden parameter),
  4. A different number of batches per Epoch (change the batches parameter),
  5. A different number of epochs (change the epochs parameter),
  6. A different learning rate (change the learning_rate parameter),
  7. Get more data (recall in the Download Data section, only 10,000 images were downloaded; but there are another 50,000 images which can be used for further training).

Here is an experiment with some different parameters. Notice how the results also change. This next round of training took 34 minutes to run.

training_output <- let_TrainModel(
    x_train=trn_img, 
    y_train=trn_cls,
    input_nodes=dim(trn_img)[2],
    hidden_nodes=c(100,75,50,30,20), 
    output_nodes=1,
    initialisation_algorithm="xavier", 
    initialisation_order="layers",
    activation_hidden="relu", 
    activation_final="sigmoid",
    batches=56, epochs=100, 
    learning_rate=0.001,
    verbosity=20, 
    print_learning_curve=TRUE
)

Which prints:

Learning rate: 0.001
Epoch 20, Cost: 0.33354, Time: 5.4 mins
Epoch 40, Cost: 0.29891, Time: 10.26 mins
Epoch 60, Cost: 0.30255, Time: 15.2 mins
Epoch 80, Cost: 0.29968, Time: 20.84 mins
Epoch 100, Cost: 0.29655, Time: 26.09 mins
Figure 9: Second Learning Curve for Neural Network

It is then the job of the Data Scientist to identify the optimal configuration of these parameters, in order to achieve a better performance rate. See Hyperparameter Optimization in Machine Learning Models for more details.

However, the aim of this article is not to achieve the best possible results, but to show how such results can be achieved, and the processes and methodologies involved. Therefore, further experimentation will not be conducted in this paper.

15. Create Prediction

After having trained the network, the next step is to apply it to the test data, to see how accurately it can predict unseen data.

15.1. Set Up Prediction

The first step is to write two custom functions. The first takes in the test data and the trained model, and returns a data frame containing the predicted values and the true values. The second pretty-prints the Confusion Matrix in a ggplot-style output.

get_Prediction <- function(
    x_test, 
    y_test, 
    network_model, 
    threshold=0.5
    ) {

    # Create prediction
    predic <- set_ForwardProp(
        network_model=network_model, 
        data_in=x_test, 
        activation_hidden="relu", 
        activation_final="sigmoid"
    )

    # Extract probabilities
    probas <- predic %>% 
        extract2("output") %>% 
        extract2("acti")

    # Define results
    result <- data.frame(
        probs=probas,
        truth=y_test
    )

    # Add prdic
    result %<>% 
        mutate(prdic=ifelse(probas>threshold, 1, 0))

    # Return
    return(result)

}
plt_ConfusionMatrix <- function(confusion_matrix) {
    # Do work
    plot <- confusion_matrix %>% 
        extract("table") %>% 
        as.data.frame() %>% 
        rename_all(str_remove_all, "table.") %>% 
        rename("Prediction"=1, "Reference"=2) %>% 
        mutate(
            goodbad = ifelse(
                Prediction == Reference, 
                "good", 
                "bad"
            )
        ) %>%
        group_by(Reference) %>% 
        mutate(prop = Freq/sum(Freq)) %>% 
        ungroup() %>% 
        {
            ggplot(., 
                aes(
                    x = Reference, 
                    y = Prediction, 
                    fill = goodbad, 
                    alpha = prop
                )) +
                geom_tile() +
                geom_text(
                    aes(label = Freq), 
                    vjust = .5, 
                    fontface  = "bold", 
                    alpha = 1
                ) +
                scale_fill_manual(
                    values=c(good="green", bad="red")
                ) +
                scale_x_discrete(
                    limits=levels(.$Reference), 
                    position="top"
                ) +
                scale_y_discrete(
                    limits=rev(levels(.$Prediction))
                ) +
                labs(
                    title="Confusion Matrix",
                    subtitle=paste0(
                        "For: '", 
                        .$Prediction[1], 
                        "' vs '", 
                        .$Prediction[2], 
                        "'"
                    )
                )
        }

    # Return
    return(plot)

}

15.2. Run Prediction

The next step is to actually run the prediction. This step is quite self-explanatory: the arguments for this function are the test data, the trained model, and the classification threshold (here set to 0.1).

# Create prediction
Prediction <- get_Prediction(
    tst_img, 
    tst_cls, 
    training_output[["network_model"]], 
    0.1
)

15.3. View Prediction

Having created this prediction, it’s then necessary to visualise the output. In the same way as the Check Images section, the following code chunk visualises the first 16 images, and returns a label for each.

As seen, the model is clearly able to recognise some of the images, but it also makes mistakes on others.

# Define classes
ClassList <- c("0"="Not", "1"="Car")
# Set list
lst <- list()
# Loop 16 images
for (image in 1:16) {
    lst[[image]] <- plt_PlotImage(
        cifar$images[-partition,,,], 
        Prediction[["prdic"]], 
        ClassList,
        image
    )
}
# View images
gridExtra::grid.arrange(grobs=lst, ncol=4)
Figure 10: Predicted Images

15.4. Test the Prediction

The next step is to statistically test the predictions, to see how accurate they are. For this, the confusionMatrix() function (from the caret package) is used. From this output, it is possible to select the appropriate metric for further optimisation of the neural network.

Since this article is only about showing the method, we will not progress any further with optimisation here.

# Set Up
ConfusionMatrix <- Prediction %>% 
    mutate_at(c("truth","prdic"), ~ifelse(.==1,"car","not")) %>% 
    select(prdic,truth) %>% 
    table %>% 
    caret::confusionMatrix()
# Print
ConfusionMatrix %>% print()

Which prints:

Confusion Matrix and Statistics
truth
prdic  car  not
  car  188 1118
  not  121 1573

               Accuracy : 0.587              
                 95% CI : (0.569, 0.605)     
    No Information Rate : 0.897              
    P-Value [Acc > NIR] : 1                  

                  Kappa : 0.079              

 Mcnemar's Test P-Value : <0.0000000000000002

            Sensitivity : 0.6084             
            Specificity : 0.5845             
         Pos Pred Value : 0.1440             
         Neg Pred Value : 0.9286             
             Prevalence : 0.1030             
         Detection Rate : 0.0627             
   Detection Prevalence : 0.4353             
      Balanced Accuracy : 0.5965             

       'Positive' Class : car
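
As a quick sanity check (using only the counts printed above, with 'car' as the positive class), two of these statistics can be reproduced by hand:

sensitivity    <- 188 / (188 + 121)    # 0.6084: true cars correctly identified
pos_pred_value <- 188 / (188 + 1118)   # 0.1440: predicted cars that really are cars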

The next step is to plot the confusion matrix. A good source for understanding and interpreting confusion matrices can be found here: Confusion Matrix. That website also includes an image (copied below), which gives a very good visual understanding.

Figure 11: Confusion Matrix Metrics

When we visualise the confusion matrix of our model, we can see that it successfully predicted the majority of the images. However, the largest group of misclassifications is where it predicted that an image was a car when it actually was not.

# Plot Confusion Matrix
ConfusionMatrix %>% 
    plt_ConfusionMatrix()
Figure 12: Generated Confusion Matrix

Another good analytical and plotting tool is to use the ROC Curve (receiver operating characteristics). An ROC Curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied (see Receiver Operating Characteristic).

Essentially, a model that makes perfect predictions would have the curve ‘hugging’ the top-left of this plot. As our model is not doing this, it is clear that further training and optimisation is necessary. For this, it is necessary to continue experimenting, as mentioned in the Further Experimentation section.

# Print ROC
Prediction %>% 
    ggplot() +
    plotROC::geom_roc(aes(m=probs, d=truth), n.cuts=0) +
    plotROC::style_roc(
        theme=theme_grey,
        ylab="True Positive Rate",
        xlab="False Positive Rate"
    ) +
    theme(
        plot.title = element_text(hjust=0.5),
        plot.subtitle = element_text(hjust=0.5)
    ) + 
    labs(
        title="ROC Curve",
        y="True Positive Rate",
        x="False Positive Rate"
    )
Figure 13: ROC Curve
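
As a small extension (not part of the article's original code), the area under this curve (AUC) could also be calculated. The following is a minimal sketch, assuming the plotROC package's calc_auc() helper, which extracts the AUC from a ggplot object containing a geom_roc layer:

# Build the ROC plot object, then extract the AUC from it
roc_plot <- Prediction %>% 
    ggplot() +
    plotROC::geom_roc(aes(m=probs, d=truth), n.cuts=0)
plotROC::calc_auc(roc_plot)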

16. Conclusion

This has been an effective demonstration of how to build Vanilla Neural Networks in R. Here, it has been shown how to:

  1. Access and check the data
  2. Instantiate and Initialise the network
  3. Run forward propagation
  4. Compute the cost
  5. Run backward propagation
  6. Update the model
  7. Set up a training method to loop through everything
  8. Predict and assess the performance

It has also been shown how to do this entirely in R, and without using any pre-defined deep learning framework. For sure, there are other packages that can perform all of the steps, and perhaps in a more computationally efficient manner. But the use of these packages is not the target of this article. Here, the aim is to dispel the ‘blackbox’ phenomenon, and to show how to create these deep learning frameworks from the ground up.

As shown, the architecture of these feed-forward neural networks effectively uses matrix multiplication and differential calculus to adjust the ‘weights’ of a network and increase its predictive power. It does require lots of data, and a long time to train and optimise. Furthermore, there are many (many, many) different architectural choices that can be made along the way, all of which are at the discretion of the Data Scientist applying this methodology.

At the end of the day, this article has proven that deep learning can definitely be achieved in R, thereby dispelling the myth that ‘in order to do Deep Learning, you must use Python‘. Certainly, there are many successful implementations of Deep Learning in Python; for example: Building a Feedforward Neural Network from Scratch in Python. There are also examples of achieving the same in Swift (Deep Neural Networks in Swift, lessons learned), in C++ (Neural Network Implementation in C++ From Scratch), and in Java (Implementing an Artificial Neural Network in Pure Java (No external dependencies)). Therefore, the specific language used is effectively irrelevant. What matters is the use case, the environment, the business artifacts, and the language that the Data Scientist feels comfortable in.

17. Post Script

Acknowledgements: This report was compiled with some assistance from others. Acknowledgements go to:

Publications: This report is also published on the following sites:

Change Log: This publication was modified on the following dates:

  • 02/Nov/2020: Original Publication date.
