
Flux.jl on MNIST — Variations of a theme

Flux.jl is an ML stack offering lightweight components to create models and train them. Using the MNIST dataset, we will see how easy it is to build different approaches for classifying this dataset just by plugging such components together.

Roland Schätzle
Towards Data Science
12 min read · Jun 23, 2022


Overview

Flux.jl

Flux.jl is a package written in 100% Julia. It’s aimed at building models which are typically trained using an iterative approach based on automatic differentiation. The most common class of this sort of model is probably neural networks (NN), which are trained using a variant of the gradient descent algorithm (GD).

In contrast to some other ML libraries which consist of a ready-to-use “machinery” to build and train models, Flux rather offers a set of lightweight lego bricks you may snap together according to your specific needs.

In this article I will show how this lego-like approach can be applied to build several NN models (multilayer perceptrons, to be exact) for classifying images and to implement different variants of the GD algorithm to train them.

MNIST dataset

For this purpose, we will use the well-known MNIST dataset which consists of 70,000 images of handwritten (and labeled) digits. 60,000 of these images serve as training data and the remaining 10,000 instances will be used to test the models. Each image consists of 28x28 gray-scale pixels.

The whole dataset can easily be loaded using the MLDatasets package:
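
Using the v0.7 API of MLDatasets, a minimal version of this step looks like this:

using MLDatasets

# load the training and the test split (the data is downloaded on first use)
train_x_raw, train_y_raw = MNIST(split = :train)[:]
test_x_raw, test_y_raw = MNIST(split = :test)[:]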

The first call automatically fetches the 60,000 training images with the corresponding labels, and the second one, using split = :test, the remaining 10,000 for testing. So train_x_raw is a 28x28x60000-element array and test_x_raw a 28x28x10000-element array after this operation. The labels are stored in train_y_raw and test_y_raw, which are arrays of size 60,000 and 10,000 respectively, containing integer numbers between 0 and 9.

The function convert2image allows us to display such an image. We get e.g. the 10th training image by calling convert2image(MNIST, train_x_raw[:,:,10]):

Gray-scale image of handwritten digit 4 [image by author]

The corresponding label (4) is in train_y_raw[10].

The data underlying this image (in train_x_raw[:,:,10]) is just a 28x28-element matrix containing numbers between 0 and 1, each representing a shade of grey.

Models

As stated above, our aim is to create different models which are able to classify such images. I.e. the input to our models will be an image of a handwritten digit (in the form of a 28x28 matrix) and the output a number between 0 and 9 telling us which digit the image contains.

Basic processing pipeline [image by author]

Data

Unfortunately, NNs neither process matrices directly nor output digits in the way we want. So we have to adapt our data a bit.

The input for a NN has to be a (column) vector containing all values. Therefore we have to transform our 28x28 matrices and stack the 28 columns of each image on top of each other. The result is a 784-element vector (28 x 28 = 784). The Flux function flatten() does exactly this:

train_x = Flux.flatten(train_x_raw)
test_x = Flux.flatten(test_x_raw)

This results in a 784x60000 and a 784x10000-element array respectively.

The output of a NN for such a classification problem is typically a so-called one-hot vector. This is a bit vector (in our case with 10 elements as we have 10 different classes) where exactly one bit is 1 and all remaining bits are 0. The element with value 1 represents the respective class, i.e. if the first bit is 1 this represents “digit 0”, if the 4th bit is 1 it represents “digit 3” etc.

The Flux function onehotbatch() does this conversion for a whole array of labels:

train_y = Flux.onehotbatch(train_y_raw, 0:9)
test_y = Flux.onehotbatch(test_y_raw, 0:9)

The argument 0:9 specifies the range of values which have to be represented by the resulting one-hot vectors. The results of these statements are a 10x60000 and a 10x10000 bit array respectively.

So our adapted processing pipeline for image prediction looks like this:

Adapted processing pipeline [image by author]

Note: In a real-world application a NN will hardly ever produce a “genuine” one-hot vector. Instead, a vector containing probabilities will be produced, and if the NN works well, one of these values will be close to 1 and all other values will be close to 0.

Multilayer Perceptron Networks

The models we want to use are so-called multilayer perceptron networks (MLP). These are “classical” neural networks, consisting of several layers of neurons, where each neuron of one layer is connected to all neurons of the next layer (also called fully connected or dense networks):

A 3-Layer-Perceptron [image by author]

Different MLPs differ in

  • the number of layers,
  • the number of neurons in each layer and
  • the so-called activation function which is applied to the result of a neuron before it is passed to the next layer; this activation function may differ from layer to layer.

Models for MNIST classification

In our case the number of input values in the first layer is already fixed by the input data (to 784) and the last layer must have 10 neurons (resulting in 10 output values) because we have 10 classes. But we are free in choosing the remaining characteristics of our models.

For the sake of simplicity, I have chosen three models which can be found elsewhere on the internet: One comes from Grant Sanderson’s fantastic videos about neural networks on his YouTube-channel “3Blue1Brown” (here you can find the first one with its wonderful visualizations), a second (adapted) one from this in-depth treatment of our topic and the third one from the Flux documentation.

In detail we have the following models (which I have named according to their main characteristics):

  • 4LS: A 4-layer model using 16 nodes in the inner layers and the sigmoid activation function
  • 3LS: A 3-layer model using 60 nodes in the inner layers and the sigmoid activation function
  • 2LR: A 2-layer model using 32 nodes in the inner layer and the relu activation function

Sidenote: I counted neither the input nor the output as a separate layer (as is done elsewhere), in order to stay closer to the following model definitions in Flux.

Using the means of Flux we can define these three models as follows (in the comments you can see the number of parameters each model has):
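
One way to write these definitions (a sketch: the sizes of the inner layers follow the characteristics listed above, and the parameter counts in the comments refer to exactly these architectures):

using Flux

model4LS = Chain(               # 4 dense layers, sigmoid activation
    Dense(784, 16, sigmoid),
    Dense(16, 16, sigmoid),
    Dense(16, 16, sigmoid),
    Dense(16, 10, sigmoid))     # 13,274 parameters

model3LS = Chain(               # 3 dense layers, sigmoid activation
    Dense(784, 60, sigmoid),
    Dense(60, 60, sigmoid),
    Dense(60, 10, sigmoid))     # 51,370 parameters

model2LR = Chain(               # 2 dense layers, relu + softmax
    Dense(784, 32, relu),
    Dense(32, 10),
    softmax)                    # 25,450 parameters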

The third model (model2LR) uses the relu activation function only on the first layer. On the second layer no such function is specified. This means that the Flux default is used (identity). As a consequence, this layer may produce values in the range (-∞, ∞), which is not what we want. Therefore the softmax function is applied to the results, normalizing them to a range from 0 to 1 and ensuring that they sum up to 1.

The first two models, which use the sigmoid function for activation, are rather “traditional” NNs; nowadays sigmoid has been mostly replaced by relu.

Training with Gradient Descent

Training such a model means finding optimal parameter values so that the model is able to make predictions with a high degree of accuracy.

Cost function

In order to measure how well a specific set of parameters performs with respect to this aim, we need a so-called cost function. This cost function C compares the predictions ŷ made with a specific parameter set to the actual classes y in the training data and calculates from this an indicator of their “distance” (= C(ŷ, y)). I.e. the smaller the result of the cost function, the better the chosen parameters. So our aim is to find a set of parameters which minimizes the cost function.

Typical cost functions used in this context are:
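
  • the mean squared error (mse): C(ŷ, y) = 1/n · Σᵢ (ŷᵢ – yᵢ)²
  • the cross-entropy (ce): C(ŷ, y) = –Σᵢ yᵢ · log(ŷᵢ)

Here ŷ denotes the prediction of the model for one training instance (a vector with n = 10 components in our case) and the results are averaged over all instances considered.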

The mse is typically used with models which have the sigmoid function for activation, whereas the ce is typically used in conjunction with the relu activation function. The rationale behind these pairings is that in these combinations the cost function can be guaranteed to be convex, i.e. it has exactly one minimum. Therefore an algorithm for finding the (global) minimum will not be distracted by other (local) minima.

This being said, I will not conceal that “the presence of local minima in the loss function does not seem to be a major problem in practice” (which is “somewhat of a theoretical mystery”) as LeCun et al. write in their paper on “Gradient-Based Learning Applied to Document Recognition” (Proc. of the IEEE, Nov. 1998). So, other pairings are used as well.

Gradient Descent

The basic idea of the gradient descent algorithm to train a model (i.e. to find a set of parameters for the model which minimizes the cost function C) is as follows:

  1. Start with an arbitrarily chosen set of parameters W₀
  2. Compute a new set of “improved” parameters W₁ by moving a little step (α) in the opposite direction of the gradient of C at position W₀ (∇C(W₀)):
    W₁ = W₀ – α ∇C(W₀)

As the gradient of C (a vector) at position W₀ points in the direction where C is steepest, we have to move in the opposite direction to get to the minimum (therefore the minus in the formula).

Step 2 is repeated iteratively until C converges towards the minimum (or until we have a set of parameters Wᵢ which delivers sufficient accuracy for the application we have in mind):

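Wᵢ₊₁ = Wᵢ – α ∇C(Wᵢ)
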
Iteration step of the Gradient Descent algorithm

The step size α is called the learning rate. It is chosen heuristically. If it is too large, the algorithm may “overshoot”. If it is too small, the algorithm converges very slowly. In practice it is a good idea to experiment with values like 0.1, 0.01 and 0.001 and adapt them according to the results.

Implementation in Flux.jl

So, how to implement these concepts in Julia using Flux.jl? The models (model4LS, model3LS and model2LR) are already defined and the training data is ready to use (in train_x and train_y).

What remains is the definition of a cost function (which Flux calls a loss function) and the implementation of the gradient descent algorithm. At the core of each GD implementation in Flux is the Flux.train!() function. It is called on each iteration of GD and computes a (slightly) improved version of the model parameters (as explained above). The following arguments have to be passed to train!():

  • the cost function loss
  • the model parameters params (Wᵢ from the last iteration step or the initial parameters on the first run)
  • the training data data (in train_x and train_y)
  • a so-called “optimizer” opt; this is the formula used to produce a new (improved) version of the model parameters as shown above (“Iteration step of the Gradient Descent algorithm”). The formula above is the most basic variant. In recent years more sophisticated variants have been developed. The main difference to the basic variant is that they dynamically adapt the learning rate α on each iteration step.
    Apart from the basic variant, which we get in Flux with Descent(α), we will apply one of the most advanced versions: ADAM (= ADAptive Moment estimation), which is available in Flux with ADAM(α).
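
Both optimizers take the learning rate α as an argument when they are constructed, e.g.:

opt_basic = Descent(0.1)   # basic GD with a fixed learning rate of 0.1
opt_adam = ADAM(0.001)     # ADAM with an initial learning rate of 0.001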

We can access the parameters of each model using the Flux.params() function:

params4LS = Flux.params(model4LS)
params3LS = Flux.params(model3LS)
params2LR = Flux.params(model2LR)

The cost functions mse and ce are also pre-defined in Flux (as Flux.Losses.mse() and Flux.Losses.crossentropy()). train!() expects a cost function to which the training data (train_x and train_y) can be passed directly, whereas the two pre-defined loss functions expect the model predictions for train_x as their first argument. Therefore we have to define some “translations”:

loss4LSmse(x,y) = Flux.Losses.mse(model4LS(x),y)
loss4LSce(x,y) = Flux.Losses.crossentropy(model4LS(x),y)
loss3LSmse(x,y) = Flux.Losses.mse(model3LS(x),y)
loss3LSce(x,y) = Flux.Losses.crossentropy(model3LS(x),y)
loss2LRce(x,y) = Flux.Losses.crossentropy(model2LR(x),y)

As the mse cost function doesn’t work well with a NN using relu, we define only the ce-based loss function for the 2LR model.

Now we can check how large the loss of our models with their initial (random) parameters is by calling these functions. E.g. loss4LSmse(train_x, train_y) delivers in my environment a value of 0.2304102. But as the initial parameters are randomly chosen, you will get another value if you try that on your computer. So this value serves just for (relative) comparison, in order to see how much it will have changed after some iterations of GD.

Implementing variants of Gradient Descent

Batch Gradient Descent

Then we are ready to implement a first version of GD using Julia and Flux.jl:
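
A minimal version of it might look like this:

function train_batch(X, y, loss, opt, params, epochs)
    data = [(X, y)]       # the whole training set, wrapped as a tuple in an array
    for i in 1:epochs
        Flux.train!(loss, params, data, opt)
    end
end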

Our first version of gradient descent train_batch() takes as arguments the training data X and y, the loss function loss, the optimizer opt, the model’s parameters params and the number of iterations we want to train (epochs). This implementation of GD is essentially a for-loop calling train!() on each iteration.

As train!() expects the training data to be wrapped inside a tuple in an array, we prepare this structure at the beginning of the function. That’s it!

To train e.g. our 4LS model using the mse cost function and the standard GD formula with a learning rate of 0.1 on 100 iterations, we have to write:

train_batch(train_x, train_y, loss4LSmse, Descent(0.1), params4LS, 100)

After executing this statement, we should see an improvement in the loss value. In my environment, I get a value of 0.12874180 by calling loss4LSmse(train_x, train_y). So we could cut the loss almost in half within these 100 iterations.

This variant of GD is called batch gradient descent as in each iteration the gradient of the loss function is evaluated on the complete training dataset. I.e. all 60,000 images are used to compute that value (on each iteration!).

Obviously, this is computationally quite expensive! Therefore other variants of GD have been developed.

Stochastic Gradient Descent

The stochastic gradient descent algorithm uses only a single (randomly chosen) instance of the training data (i.e. one single image) on each iteration to evaluate the gradient of the loss function. This is clearly less costly.

Of course, considering just one instance (instead of 60,000) won’t produce the actual gradient of the loss function. I.e. the resulting vector won’t point exactly in the direction where that function is steepest. But in practice the resulting vector isn’t actually that bad.

Using this approach, we won’t get the most direct path to the minimum, but rather a zig-zag way to (or near) that point. Even if we don’t get directly to the minimum, the result is typically close enough and we get there at considerably lower cost. So this is often the preferred variant.

The implementation looks as follows:
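
A minimal sketch, drawing one random training instance per iteration:

function train_stochastic(X, y, loss, opt, params, epochs)
    for i in 1:epochs
        idx = rand(1:size(X, 2))              # pick one random training instance
        data = [(X[:, idx], y[:, idx])]
        Flux.train!(loss, params, data, opt)
    end
end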

Mini-batch Gradient Descent

The third variant of GD is a good compromise between the first two variants we saw. It works neither on the whole dataset nor on just a single instance on each iteration, but selects a random sample of e.g. 100 instances and calculates the gradient on that basis. This sample is called a mini-batch.

The size of this mini-batch can freely be chosen. Therefore we need an additional parameter batchsize for this purpose:
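
A minimal sketch of this variant:

function train_minibatch(X, y, loss, opt, params, epochs, batchsize)
    for i in 1:epochs
        start = rand(1:size(X, 2) - batchsize + 1)   # random start position
        batch_x = X[:, start:start+batchsize-1]      # extract a mini-batch ...
        batch_y = y[:, start:start+batchsize-1]      # ... of batchsize instances
        Flux.train!(loss, params, [(batch_x, batch_y)], opt)
    end
end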

We implemented here a simple strategy to extract the mini-batch: A random start position within the whole dataset is chosen and then we extract a mini-batch of size batchsize beginning at that position.

A more advanced strategy would be to choose all instances of the mini-batch randomly. That can be achieved by using a Flux DataLoader.
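
Such a variant might look like this (the function name is just for illustration; note that train!() now processes all mini-batches of the DataLoader on each pass of the loop):

function train_minibatch_shuffled(X, y, loss, opt, params, epochs, batchsize)
    data = Flux.DataLoader((X, y), batchsize = batchsize, shuffle = true)
    for i in 1:epochs
        Flux.train!(loss, params, data, opt)   # one full pass over all mini-batches
    end
end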

This third variant of GD is of course the most generic implementation, and we can express the first two variants in terms of train_minibatch() as follows (so we actually only need the implementation of the mini-batch GD):

train_batch(X, y, loss, opt, params, epochs) =
    train_minibatch(X, y, loss, opt, params, epochs, size(X)[2])

train_stochastic(X, y, loss, opt, params, epochs) =
    train_minibatch(X, y, loss, opt, params, epochs, 1)

Conclusion

We have seen how Flux.jl can be used to pre-process data and to define models. Furthermore, using building blocks like

  • loss functions
  • optimizers
  • the train!()-function

we could implement different variations of the GD algorithm quite easily.

In a follow-up article I will evaluate the models presented here in combination with different loss functions, optimizers and other parameters to see how much effort is necessary to train them and what accuracy can be achieved for the recognition of handwritten digits from the MNIST dataset. So, stay tuned!


