Fastai — Exploring the Training Process — the Pixel Similarity approach

Starting to unravel the training process with the help of a simple image classification task

Yash Prakash
Towards Data Science


Photo by Alexandru Acea on Unsplash

In my previous article, I talked about the quickest way to get started with Fastai by building an image classification model in just a few lines of code and a very limited amount of time. Here, I try to look under the hood a little and start exploring the inner workings of the library by constructing a loss function for the model from scratch.

Here’s the first part in this series of articles in which I’m documenting my journey of learning fastai. Happy reading!

I’ll be using the same dataset: the Rock, Paper, Scissors dataset from Kaggle, which is a multi-class image classification problem. Since Fastai is built on top of Pytorch, we’ll have quite a few Pytorch-specific functions popping up in our journey to construct an appropriate loss function for our task.

If you’d like to jump the gun and view the code all at once, here’s the repo I’ve kept it in: https://github.com/yashprakash13/RockPaperScissorsFastAI

The first approach described by the authors of the library is the Pixel Similarity approach. Essentially, it is the most basic way of measuring how similar two images are in their pixel values, and hence of deciding whether the image we’re testing belongs to a particular class or not.

Here, we’re first going to look at this basic method; in a future article, I’ll go about building an actual loss function and a stochastic gradient descent training loop from scratch using some Pytorch functions.

Let’s get on with it then!

What exactly is the Pixel Similarity approach?

Let’s define this particular method first, shall we?

We calculate the average pixel value for all of the rock, paper and scissors images respectively. This gives us something like an ideal pixel value for what an image belonging to each specific class should look like. Then, we’ll move on to the test images and determine which of those ideal images each one’s pixels look most similar to.

In the example shown in the fastai book, the authors achieved a hefty 90% accuracy at classifying images with this naive approach. Naturally, I was curious to see how it would perform on a different dataset and with RGB images, so here we are.

Like in the previous article, we import the library and set the dataset path like this:

from fastai.vision.all import *
DATASET_PATH = Path('RockPaperScissors/data')
DATASET_PATH.ls()
'''OUTPUT:
[Path('RockPaperScissors/data/valid'),Path('RockPaperScissors/data/.DS_Store'),Path('RockPaperScissors/data/Rock-Paper-Scissors'),Path('RockPaperScissors/data/train'),Path('RockPaperScissors/data/test2')]
'''

We confirm that our train and valid folders are now accessible via the path variable we made.

Now, we define the class-specific dataset paths for our training folders.

rock_train = (DATASET_PATH/'train'/'rock').ls().sorted()
paper_train = (DATASET_PATH/'train'/'paper').ls().sorted()
scissors_train = (DATASET_PATH/'train'/'scissors').ls().sorted()

The output shows us the list of images in each folder, something like this:

OUTPUT:
((#840) [Path('RockPaperScissors/data/train/rock/rock01-000.png'),Path('RockPaperScissors/data/train/rock/rock01-001.png'),Path('RockPaperScissors/data/train/rock/rock01-002.png'),Path('RockPaperScissors/data/train/rock/rock01-003.png'),Path('RockPaperScissors/data/train/rock/rock01-004.png'),
and so on.

Did you notice the little 840 at the beginning? It’s the number of images we have in the train folder for each class.
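If you’d rather count than squint at that output, the lengths of those lists give the same number:

# number of training images per class (should print 840 for each, as we just saw)
len(rock_train), len(paper_train), len(scissors_train)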

Introduction to Pytorch tensors

Building and manipulating tensors for the first time taught me quite a few new things about how images are represented in the library when we import them, and I’ll try to document most of it as we move ahead.

Let’s first pick a random image to view:

rock_img = rock_train[3]
img = Image.open(rock_img)
img
[Image: a random rock image from the training set]

The same image can now be represented as a Pytorch tensor in memory via a single line of code.

tensor(img)

If you’re following along (which I recommend, by the way), go ahead and print it out!

OUTPUT:
tensor([[[255, 255, 255, 255],
[255, 255, 255, 255],
[255, 255, 255, 255]],

[[254, 254, 254, 255],
[255, 255, 255, 255],
...so on.

…which is a collection of the pixel values of the image. An important thing to look at here is the shape of the tensor.

tensor(img).shape
'''OUTPUT:
torch.Size([300, 300, 4])
'''

Two of the three dimensions above are easy to recognise: the height and width of the image, which is 300 x 300. So what does the 4 represent?

I spent quite a while understanding how image channels work, and it’s a topic very much worth learning if you’re planning to progress to building more vision models! I’ll explain it briefly here, though, and link an article at the end that I found most helpful for learning about them.

Basically, the number of channels tells us how the colour information of an image is stored: 1 channel for a black and white (grayscale) image, 3 channels for a proper RGB (colour) image. So why 4 here? The PNG files in this dataset carry an extra alpha channel on top of the usual RGB ones, so when we open them the resulting tensors have 4 channels. That fourth, alpha component doesn’t hold colour at all; it is used to define areas of a photo, such as transparency, or to preserve a saved selection (if you have some experience with using Photoshop, you might know the term already :)).

Pixels aren’t just limited to defining colour the way the RGB channels do. An alpha channel defines which pixels are fully selected, which are partially selected, and which aren’t selected at all in our image.

I know this all sounds very confusing, and I also had a bit of a hard time trying to wrap my mind around it (I’m still not sure I get it completely :P). Hopefully, the article I’m linking at the end will give you a better understanding of image channels in a little more detail.
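If you’d like to see this for yourself, a quick check with PIL (the same Image we’ve been using above) makes the channel situation concrete. This is just a sketch of the idea; the exact mode string depends on how each PNG was saved:

# inspect the colour mode PIL assigned to the PNG ('RGBA' means 4 channels)
Image.open(rock_img).mode

# dropping the alpha channel gives us a plain 3-channel RGB tensor
tensor(Image.open(rock_img).convert('RGB')).shape
# expected: torch.Size([300, 300, 3])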

Now that we have seen what an image tensor looks like, we can go ahead and convert our whole train+validation data to tensors as well!

Stacks of training and validation Pytorch tensors

# collect all the images of the three classes as Pytorch tensors
rock_tensors = [tensor(Image.open(o)) for o in rock_train]
paper_tensors = [tensor(Image.open(o)) for o in paper_train]
scissors_tensors = [tensor(Image.open(o)) for o in scissors_train]
stacked_rock = torch.stack(rock_tensors).float()/255
stacked_paper = torch.stack(paper_tensors).float()/255
stacked_scissors = torch.stack(scissors_tensors).float()/255

stacked_rock.shape, stacked_paper.shape, stacked_scissors.shape
'''OUTPUT:
(torch.Size([840, 300, 300, 4]),
 torch.Size([840, 300, 300, 4]),
 torch.Size([840, 300, 300, 4]))
'''

# do the same for the validation folders

So what do we see here? The last part of the tensor dimensions is pretty much clear, is it not? 300 x 300 x 4 is the shape of a single image, as we saw earlier, and now that we have stacked all the images in the train folder via the stack function, we have 840 of them for each class.
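Before moving on, it’s worth actually writing out that validation step, since we’ll need the validation stacks later. Here’s a minimal sketch, assuming the valid folder mirrors the train folder’s class subfolders; stacked_rock_val is the name we’ll use further down, and I’ve named the other two to match:

# build the validation stacks exactly like the training ones
rock_val = (DATASET_PATH/'valid'/'rock').ls().sorted()
paper_val = (DATASET_PATH/'valid'/'paper').ls().sorted()
scissors_val = (DATASET_PATH/'valid'/'scissors').ls().sorted()

stacked_rock_val = torch.stack([tensor(Image.open(o)) for o in rock_val]).float()/255
stacked_paper_val = torch.stack([tensor(Image.open(o)) for o in paper_val]).float()/255
stacked_scissors_val = torch.stack([tensor(Image.open(o)) for o in scissors_val]).float()/255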

Now we get to the part of visualising what the ‘mean’ rock, paper and scissors images look like.

Rock mean

rock_mean = stacked_rock.mean(0)
show_image(rock_mean)
[Image: the mean rock image]

This is the image we get after averaging all the 840 training rock images.

Go ahead and do the same for the paper and scissors training images.
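In case you’d like it written out, it’s just the same two lines again; paper_mean and scissors_mean are the names we’ll rely on later:

# mean image for each of the other two classes, averaged over the stack dimension
paper_mean = stacked_paper.mean(0)
scissors_mean = stacked_scissors.mean(0)
show_image(paper_mean)
show_image(scissors_mean)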

Here are the results:

[Image: the mean paper image]
[Image: the mean scissors image]

Notice how the mean images, especially the scissors and rock ones, look very similar to each other? To the naive approach we’re using, they practically are!

We’ll also get to see how poorly this approach performs compared to actual learning approaches, such as stochastic gradient descent.

Now we can move on to calculating those pixel-similarity distances between images that I mentioned earlier.

Two methods of measuring pixel similarity

They are the mean absolute error (MAE) method and the root mean squared error (RMSE) method. Let’s see both of them in action.
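As formulas, with a and b being the two images treated as N pixel values each, the two distances look like this:

\text{MAE}(a, b) = \frac{1}{N}\sum_{i=1}^{N} \lvert a_i - b_i \rvert
\qquad
\text{RMSE}(a, b) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (a_i - b_i)^2}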

#taking a random rock tensor from our stacked rock tensors
rock_rand = stacked_rock[7]

Now calculate the distance between that random rock image and the rock mean:

mean_abs_rock_rock = (rock_rand - rock_mean).abs().mean()
mean_abs_rock_rock
'''OUTPUT:
tensor(0.0942)
'''

Similarly, we can do the RMS error as well:

#doing the same with rms
rms_rock_rock = ((rock_rand - rock_mean) ** 2).mean().sqrt()
rms_rock_rock
'''OUTPUT:
tensor(0.1697)
'''

For later use, I made a custom function for the same:

def get_pixel_diff(a, b):
    return (a - b).abs().mean((-1, -2, -3))

get_pixel_diff(rock_rand, rock_mean)

What’s that mean((-1, -2, -3)) doing in here? Those are the dimensions over which we take the mean: the image channels (4) and the height and width of the image (300, 300), i.e. the last three dimensions of the tensor.

As expected, we get the same result as before:

OUTPUT:
tensor(0.0942)
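A nice consequence of averaging over only the last three dimensions is that the very same function also works on a whole stack of images at once, thanks to Pytorch broadcasting. A quick sketch using the validation stack from earlier:

# [N, 300, 300, 4] - [300, 300, 4] broadcasts to [N, 300, 300, 4];
# averaging over the last three dims then leaves one distance per image
valid_rock_dists = get_pixel_diff(stacked_rock_val, rock_mean)
valid_rock_dists.shape
# expected: torch.Size([N]), one distance per validation rock image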

The moment of truth — why this naive approach is actually bad and does not work in real life

We have seen that the pixel similarity approach does help us, to our naked eye at least, distinguish between the three classes of images. BUT, does it hold up on a separate set of images that we test it on?

I define a function that tells us whether an image passed to it belongs to a given class or not. For this example, I write the is_rock function:

def is_rock(x):
    return (get_pixel_diff(x, rock_mean) < get_pixel_diff(x, paper_mean)) \
        & (get_pixel_diff(x, rock_mean) < get_pixel_diff(x, scissors_mean))

If the passed image is closer to the rock mean than to the other two class means, meaning it is most similar to the mean rock image we defined earlier, the function returns True and we can conclude that the algorithm is working as expected.
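As a quick sanity check before running it over the whole validation set, we can try it on the single random rock image from before; if everything is wired up correctly, it should come back True:

# the random rock image should be closest to the rock mean
is_rock(rock_rand)
# expected: tensor(True)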

We can now run it through on the whole validation folder and see the results!

valid_rock_dist = get_pixel_diff(rock_mean, stacked_rock_val)
accuracy_rock = is_rock(stacked_rock_val).float().mean()
accuracy_rock
'''OUTPUT:
tensor(0.1371)
'''

13%?! Such bad accuracy! This is clearly not feasible in practice!

Let’s do it again with an is_paper function this time:
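I haven’t shown is_paper above, but it mirrors is_rock exactly; here’s a sketch of how it could look:

def is_paper(x):
    # an image is 'paper' if it is closer to the paper mean than to the other two means
    return (get_pixel_diff(x, paper_mean) < get_pixel_diff(x, rock_mean)) \
        & (get_pixel_diff(x, paper_mean) < get_pixel_diff(x, scissors_mean))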

accuracy_paper = is_paper(stacked_paper_val).float().mean()
accuracy_paper
'''OUTPUT:
tensor(0.5536)
'''

This doesn’t seem too bad, does it, considering we aren’t performing any actual training at all?

Concluding…

With this experience, perhaps it is time to try a method that does some real learning on the images, that is, one that can automatically modify itself to improve its performance. Next, we need to build an actual training pipeline, so that a model can look at our images and learn from every prediction it gets right or wrong.

That journey starts with learning about SGD, the stochastic gradient descent algorithm.

This seems like something to ponder over, doesn’t it? If you’ve made it this far, I applaud you, and I hope you’ll stay tuned for my next article as I continue documenting what I learn with fastai and Pytorch.

The reading resources I talked about:

  1. Get familiar with Pytorch basics and tensors: https://johnwlambert.github.io/pytorch-tutorial/

Pro tip: Only read till the tensors basics section and not the following convolution layers section, if you’re just starting out like me :)

  2. Some background on image channels: https://www.cs.virginia.edu/~vicente/recognition/notebooks/image_processing_lab.html

Happy reading! 😁

Do you want to get one free, clean email from me every week or two weeks containing the best of my curated articles and tutorials that I publish? Join my Codecast!

Connect with me on Twitter and LinkedIn!
