Computer Vision: Part 1

An introduction to image data processing and feature extraction

"Computer Vision" is an area of Machine Learning that deals with image recognition and classification. Computer Vision models can be developed to accomplish tasks like facial recognition, identifying which breed a dog belongs to and even identifying a tumor from CT scans: the possibilities are endless.

In a series of articles on the topic, I will explore some of the key concepts surrounding Computer Vision. In this article, I will provide some intuition around how computers process images, and how objects can be recognised. Subsequent articles will deal with the actual implementation of a Deep Learning model that will learn to classify an image into one of several categories – all in fewer than a hundred lines of code.


How do computers "see" images?

Computers don’t see images the same way humans do – they can only understand numbers. So the first step in any computer vision problem is to convert the information that an image contains into a machine-readable form. Fortunately, this is very easy to do once we break an image down into its building blocks.

Images are made up of a grid of pixels – each pixel is like a tiny box that covers a very small part of an image, as shown below:

Original image from Pexels

Each pixel can be seen as a "spot" of a single colour. The more pixels you have, the more granularly you can represent each part of the image (higher resolution images have more pixels).

Now we know what the building block of each image is, but we still need to figure out how to convert this into a series of numbers that exactly describes the image. Let’s think of a high definition image, one that has 1920 * 1080 pixels. We have established that this image can be broken down into tiny boxes. In this case, we will have 2,073,600 of these tiny boxes, and each box can be represented by its colour. So we effectively have ~2 million pieces of information – each describing the colour in a specific part of the image.

Luckily, there is a way to describe colours in numerical form. Every possible colour can be described by a unique 3-number code – its RGB co-ordinates (you can see the RGB codes for any colour here). So we can now represent any image numerically by breaking it down into pixels and representing each pixel by a set of 3 numbers. We have now broken our HD image down into ~6 million numbers (2,073,600 pixels * 3 values each) that our computer can understand. Phew!
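To make this concrete, here is a minimal sketch of the image-to-numbers conversion in Python, assuming the Pillow and NumPy libraries are available; the file name is just a placeholder:

```python
from PIL import Image
import numpy as np

# Load an image and view it as a grid of numbers.
img = Image.open("photo.jpg")   # placeholder file name
pixels = np.array(img)

print(pixels.shape)   # (height, width, 3), e.g. (1080, 1920, 3) for an HD image
print(pixels[0, 0])   # the 3-number RGB code of the top-left pixel
```

Running this on an HD photo gives an array of 1080 * 1920 * 3 ≈ 6 million numbers – exactly the representation described above.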


How do we recognise objects?

Think about how humans recognise objects: we can tell a cat apart from a dog even if we see only their outlines, i.e. we don’t have to conduct a thorough visual inspection and process every detail of their form to be able to tell one from the other. This is because our brain associates a few key "features" (such as size and shape) with each object. We can then focus only on these features and our brain will still be able to recognise what it is.

It stands to reason then that a Computer Vision model should do the same thing – extract features from images, and associate each feature with a certain category of objects. Then when it is given a new image, it will try to identify it by matching its features to the correct category. But how do we train a computer to extract these features from a series of numbers? We do this through a process called "convolution".


Convolutions and feature extraction

Let’s understand how convolutions work with a simple example:

Convolution: Image by author

To transform a raw grid into a "convolved" grid, we need a convolution matrix. The purpose of this matrix is to specify which "features" to extract. We then apply this matrix to the original grid of numbers by multiplying its values by the corresponding values of the original grid and adding up the results.

In the example above, we have a 9*9 grid that we wish to convolve with our 3*3 convolution matrix. We split our original grid into 9 smaller 3*3 grids (to match the size of our convolution matrix). We then multiply the first 9 cells (greyed) element-wise by our convolution matrix. In the simplified case here, every product is 0 except for one, which is 14; adding them all up gives 14. This then becomes the first cell of our "convolved" grid. We follow this process for all the other cells in the original grid to populate our convolved grid. In practice there are other parameters to consider for this process, such as the "stride size", i.e. how many places our convolution matrix "moves" to the right after each operation, and "padding" of the edges – but the core principle stays the same.
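Here is a minimal NumPy sketch of that exact procedure – a 9*9 grid, a 3*3 convolution matrix, and a stride of 3 so the patches don’t overlap. The grid and kernel values are only illustrative, not the ones from the figure:

```python
import numpy as np

grid = np.arange(81).reshape(9, 9)   # an illustrative 9*9 grid of numbers
kernel = np.zeros((3, 3))
kernel[1, 1] = 1.0                   # a trivial kernel that picks out each patch's centre

convolved = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        # Take one 3*3 patch, multiply element-wise by the kernel, and sum.
        patch = grid[3 * i:3 * i + 3, 3 * j:3 * j + 3]
        convolved[i, j] = np.sum(patch * kernel)

print(convolved)   # the 9*9 grid has been reduced to a 3*3 grid
```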

So what have we done here? We have reduced a 9*9 grid to a 3*3 grid. If we think of this operation as being performed over the RGB co-ordinates of each pixel in an image, we have effectively "extracted" some features from the original image. This very simple process can produce powerful results, and is the underlying mechanism of every image filter! Not convinced? We can see this in action below:

Feature extraction with convolutions: "outlines": Image by author

(You can try applying convolutions of your own by downloading the GIMP GUI and following the instructions here.)
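For a flavour of what such a filter looks like in code, here is a hedged sketch that applies a classic edge-detection ("outline") kernel using SciPy; the file names are placeholders:

```python
import numpy as np
from PIL import Image
from scipy.ndimage import convolve

# Load an image in greyscale so each pixel is a single number.
img = np.array(Image.open("photo.jpg").convert("L"), dtype=float)

# A classic edge-detection kernel: strong response where neighbouring pixels differ.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

edges = convolve(img, edge_kernel)              # stride 1; SciPy pads the edges by default
edges = np.clip(edges, 0, 255).astype(np.uint8)

Image.fromarray(edges).save("outlines.jpg")     # the "outlines" of the original image
```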

We can now appreciate just how powerful convolutions are in terms of:

  1. Reducing the dimensions of the data (our 9*9 grid became a 3*3 grid)
  2. Extracting features from our images, which our model can then learn to associate with different objects – hence learning to recognise them.

In practice we may also look to "pool" the result of convolutions, further reducing the dimensionality of the image. Indeed, this is the approach we will follow while developing a deep learning model to learn to recognise objects later in this series.
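As a sketch, "max pooling" – the most common variant – simply keeps the largest value in each small block, shrinking the grid further:

```python
import numpy as np

def max_pool_2x2(grid):
    """Downsample a grid by keeping the maximum of each 2*2 block."""
    h, w = grid.shape
    trimmed = grid[:h - h % 2, :w - w % 2]   # drop an odd row/column if needed
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

grid = np.arange(36).reshape(6, 6)
print(max_pool_2x2(grid))   # a 6*6 grid becomes 3*3
```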


Putting it all together

At a very high level, the steps we would follow in order to develop a Computer Vision model are as follows (a rough code sketch follows the list):

  1. Convert images to machine-readable data (numbers)
  2. Set up some convolutions that can extract "features" from the images
  3. Feed this information to a Neural Network, which then learns which features are associated with which category of objects (say "dogs" versus "cats")
  4. Fit and evaluate this model
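
As a hedged preview of where this series is heading, the sketch below strings these four steps together with Keras (assuming a recent TensorFlow); the layer sizes, input shape, and the 10-category output are placeholder choices, not the final model:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Step 1: pixels arrive as numbers; rescale them to the range [0, 1].
    layers.Rescaling(1.0 / 255, input_shape=(64, 64, 3)),
    # Step 2: convolutions extract features, pooling reduces dimensionality.
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Step 3: a Neural Network learns which features map to which category.
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # placeholder: one output per category
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Step 4: fit and evaluate on labelled images, e.g.
# model.fit(train_images, train_labels, epochs=5)
# model.evaluate(test_images, test_labels)
```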
