If you’re seeking to understand artificial neural networks at a high level and see them implemented for computer vision, this post is for you.
Computer vision: First off, it’s worth briefly explaining what is meant by computer vision.
Computer vision can be broadly thought of as the ability of computers to interpret digital images or videos.
You may have seen it in action with face recognition software and automatic photo tagging on social media. It is often used in combination with other machine learning tasks. For example, it’s used in series with text-to-speech models to provide audio descriptions that help blind people comprehend video content.
Challenge
I wish to take you through an entry-level computer vision challenge. Using neural networks, we will build a programme to interpret handwritten digits.
The challenge is available as an open Kaggle competition. So, if you’re new to computer vision I encourage you to enter the competition, build your own network and submit your score. With any type of machine learning, reading supplemented with practice will lead to mastery in the long term.
Pre-requisites
I will assume that the reader is already somewhat familiar with machine learning as a concept. You should understand how the machine learning process works, what is meant by training and testing a model, and how they’re commonly evaluated. If you’re not at all familiar with machine learning, please read this as a start.
Data
We have been given two CSVs from MNIST (Modified National Institute of Standards and Technology) containing handwritten digits in a digital format. The digits (from 0 to 9) are represented by a one-dimensional array of size 784. This is essentially a 28 by 28 pixel image flattened out.
Each position in the array represents a pixel and is assigned an integer in the range 0 to 255 expressing the intensity of that pixel.
As is commonly the case with Kaggle machine learning tasks, one of the files is our training data and the other our testing data. The training data is labeled and the testing data is not. I have further split the training data into a train and validation set following machine learning best practices.
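To make this concrete, here’s a minimal sketch of the loading and splitting step. It assumes the standard Kaggle competition file names (train.csv and test.csv) and uses scikit-learn’s train_test_split; the split ratio shown is a choice, not something dictated by the competition.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the two Kaggle CSVs (file names follow the competition download)
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Separate the label column from the 784 pixel columns
y = train_df["label"].values
X = train_df.drop(columns=["label"]).values

# Hold out part of the labelled training data as a validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```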
Image by author: the first 10 entries in the MNIST training data provided by Kaggle
Tools
We will be using the Keras API with TensorFlow as a backend. There are many frameworks for computer vision; however, I find Keras to be extremely beginner friendly and its code easy to interpret and implement.
What are Neural Networks?
I won’t cover neural networks extensively here, but I’ll provide a brief intuition.
Many books use analogies of human brain cells as a way of describing and introducing neural networks. Following this, they often proceed to overwhelm the reader with heavy mathematical proofs. In my early days, I found this approach to be confusing because I could not reconcile the mathematics with the analogy. I will take a slightly different approach and attempt to demystify artificial neural networks by leaving out the analogies.
I like to think of neural networks as mathematical functions that have incredible flexibility. Because of this, they can identify patterns in data that more rudimentary functions cannot.
Note: a mathematical function is simply something that takes an input, performs an operation, and returns an output.
The flexibility of artificial neural networks can be understood by looking at the following properties:
Layers: The basic structure of a neural network is three layers: input, hidden and output. Deep neural networks have multiple hidden layers. Adding more layers can increase the flexibility of a network, up to a point.
Architecture: The architecture of an artificial neural network refers to the way the input, hidden and output layers connect with each other. There are many neural network architectures available, each suited to specific tasks. For example, recurrent neural networks are well suited to predictive text tasks.
Learning: Artificial neural networks learn by the process of back-propagation. This is where weights in the network are tweaked over several iterations (epochs) to minimise the error between the value predicted by the network and the true label value. In general, more weights lead to more flexibility in predicting complex patterns.
Activation functions: These are non-linear mathematical functions that enable neural networks to detect patterns. They are set at the hidden and output layers, and their type depends on the prediction task. For example, binary classification typically calls for a sigmoid activation at the output (similar to logistic regression).
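To tie this back to the idea of neural networks as flexible mathematical functions, here is a quick illustrative sketch of two common activation functions in plain NumPy. This is for intuition only; Keras provides these for us.

```python
import numpy as np

def relu(z):
    # ReLU: keeps positive values, zeroes out negatives
    return np.maximum(0, z)

def sigmoid(z):
    # Sigmoid: squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # approximately [0.119 0.5 0.953]
```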
Architecture & Parameters
For our digit recognition task, I’ve constructed two simple artificial neural networks. Both share the following structure and parameters:
Layers: One input layer which is defined as a one-dimensional array of size 784. Two hidden layers, the first having 300 neurons and the second having 100 neurons. An output layer of ten neurons for predicting any one of the ten handwritten digits.
Activation functions: The activation functions for the hidden layer neurons are ReLU, which is generally easy to train. The output layer activation function is Softmax. The rationale for this is that we have a multi-class classification problem.
Loss function: This is the objective function that we need to minimise to train our network. For our task we will use sparse categorical cross-entropy. This is the most suitable choice because the target variable (label) is an integer from 0 to 9 and the classes are mutually exclusive. Stochastic gradient descent is the algorithm used to minimise this loss function.
Epochs: This refers to the number of complete passes over the training data that we make while back-propagating. I will arbitrarily set this to 30 for each architecture.
Network 1: Sequential Neural Network
This is the simplest kind of artificial neural network that can be built in Keras. The structure consists of layers connected to each other sequentially from input to output.
Image by author: sequential neural network schema
Building the network: Keras provides a template for the sequential neural net. It’s very intuitive to simply add the layers you want from there. The summary() function provides you with details of the network’s architecture including the number of weights in each layer.
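As a sketch, a network matching the architecture described above can be defined like this (the exact import style may vary with your TensorFlow version):

```python
from tensorflow import keras

# Sequential template: layers are stacked from input to output
model = keras.models.Sequential([
    keras.Input(shape=(784,)),                    # flattened 28 x 28 image
    keras.layers.Dense(300, activation="relu"),   # first hidden layer
    keras.layers.Dense(100, activation="relu"),   # second hidden layer
    keras.layers.Dense(10, activation="softmax")  # one neuron per digit
])

model.summary()  # prints each layer and its number of weights
```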
For our network, the summary reports 266,610 parameters (weights) in total.
Compiling: Implemented in just a few lines of code.
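A sketch of the compile step, wiring up the loss function and optimiser described earlier:

```python
model.compile(
    loss="sparse_categorical_crossentropy",  # labels are exclusive integers 0-9
    optimizer="sgd",                         # stochastic gradient descent
    metrics=["accuracy"]
)
```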
Fitting: And fitting too…
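And a sketch of the fit step, assuming the train/validation split created earlier:

```python
# Train for 30 epochs, scoring against the validation set after each one
history = model.fit(
    X_train, y_train,
    epochs=30,
    validation_data=(X_valid, y_valid)
)
```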
Results: At the 30th epoch the sequential neural network has an accuracy of 99.82% on the training data and 97.64% on the validation data. This isn’t a bad performance given minimal hyperparameter tuning.
Note – it’s expected that our network performs slightly worse on the validation data, but we should be concerned about overfitting when it performs significantly worse.
Image by author: accuracy score (y-axis) against epochs (x-axis)
Network 2: Deep & Wide Neural Network
This network has an architecture that enables it to learn both complex and simple patterns: the input passes through the full stack of hidden layers and also connects directly to the output layer.
Image by author: deep & wide neural network
Building: Keras doesn’t provide a template for the deep and wide network so we have to construct this manually.
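A sketch of one way to wire this up with the Keras functional API; the layer sizes mirror the sequential network, though the author’s precise wiring may differ:

```python
from tensorflow import keras

# Functional API: connect the layers explicitly
input_ = keras.Input(shape=(784,))
hidden1 = keras.layers.Dense(300, activation="relu")(input_)
hidden2 = keras.layers.Dense(100, activation="relu")(hidden1)

# Wide path: the raw input skips the hidden layers and
# joins the deep path just before the output layer
concat = keras.layers.concatenate([input_, hidden2])
output = keras.layers.Dense(10, activation="softmax")(concat)

model = keras.Model(inputs=[input_], outputs=[output])
model.summary()
```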
Note – the code for compiling and fitting our deep and wide network is exactly the same as for our sequential network.
There are an additional 8,850 parameters in our deep and wide neural net compared with our sequential version.
Results: At the 30th epoch our deep and wide network has an accuracy of 99.69% on the training set and 97.62% on the validation set. This is pretty much an identical performance to the sequential network. Why do you think this might be?
Test Submission
The sequential neural network scored 97.21% accuracy on the testing data. This places the model in the top 70% on the Kaggle leaderboard.
There are many things you could try to improve on this score. For example, a convolutional neural network architecture might lead to a higher accuracy.
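As a starting point only, a minimal convolutional sketch might look like the following. The layer sizes here are illustrative rather than tuned, and the flattened rows must first be reshaped back into 28 by 28 images:

```python
from tensorflow import keras

# Reshape the flattened rows back into 28 x 28 single-channel images first,
# e.g. X_train = X_train.reshape(-1, 28, 28, 1)
cnn = keras.models.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # learn local features
    keras.layers.MaxPooling2D(pool_size=2),                     # downsample
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax")
])
cnn.compile(loss="sparse_categorical_crossentropy",
            optimizer="sgd", metrics=["accuracy"])
```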
Try playing around with the number of neurons in each layer and the number of epochs, and see if that leads to a better score. As always, the code is available for your use via my GitHub page.