
Computer vision is one of the hottest topics in machine learning today, providing robust solutions to many problems involving images. In this article, we will create a Convolutional Neural Network (CNN) from scratch using TensorFlow. Then, we will use it to classify images of cats and dogs, one of the most popular image classification tasks.
This article aims to help people new to the field, and to TensorFlow in particular, learn the basic principles of getting the data, processing it, building a model, training it and finally making predictions with it.
What is classification?
The first question we have to answer is what classification is. I will keep these principles as short as possible and provide other valuable articles that cover them more deeply.
Classification is the process of assigning a label to a given input data. For example, we can classify whether an email is spam or not. In this case, for a given image, we will predict if the image is of a cat or a dog.
For a more detailed explanation of what classification is and the most popular algorithms, I strongly recommend reading the following article:
Convolutional Neural Networks (CNN)
The following key concept is the CNN (or ConvNet). A CNN is a class of neural networks that takes an image as input, applies a series of operations to it and returns an output. This output can be a probability, a predicted class for the image, or even a new image; it depends on the network architecture and the problem we are trying to solve.
In this case, we will use a simple CNN to classify each input image as a dog or a cat. It will return the probability that a given image is a dog or a cat.
To better understand the operations this kind of neural network performs on an image, I suggest this article:
A Comprehensive Guide to Convolutional Neural Networks – the ELI5 way
Building our network
(You can find all the code in this repo)
It is time to get down to work and build our classifier. As we said earlier, we will build a cat/dog classifier with a simple CNN architecture, using TensorFlow and Keras.
Load dataset
We will use the Oxford-IIIT Pet Dataset, which contains more than 7,000 images of cats and dogs. This dataset is released under a CC BY-SA 4.0 license, which means we can share and adapt the data for any purpose, as long as we give attribution and share our work under the same license.
First of all, we are going to need the following libraries.
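A minimal set of imports for everything that follows (your exact list may differ depending on how you organize the code):

```python
import os
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
```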
Next, we are going to load the dataset. You can download it from Kaggle and copy the folder into your working directory, or get the data directly from the dataset’s website.
Once we have the dataset, it will contain a single folder called ‘images’ with all the pet images inside.
Each image carries its label in the file name. As we can see in the dataset’s info, the data is organized by breed.

There are 12 cat breeds and 25 dog breeds. The strategy we will follow is to treat all the dog breeds as dog images and all the cat breeds as cat images. We will not consider the breeds, since we want to build a binary classifier.
Let’s see how much data we have.
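A simple way to count the images, assuming the dataset was extracted into an 'images' folder inside the working directory:

```python
# List every .jpg file inside the 'images' folder
image_dir = 'images'
image_names = [name for name in os.listdir(image_dir) if name.endswith('.jpg')]

print(f'There are {len(image_names)} images in the dataset')
```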
There are 7390 images in the dataset
As we can see, there are 7,390 images in total. Now it is time to get rid of the breeds. We will create two lists, one with all the photos of dogs and the other with all the images of cats.
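In this dataset, cat images are named after their breed with a capitalized first letter (e.g. Abyssinian_1.jpg), while dog image names are lowercase (e.g. beagle_1.jpg). A short sketch that uses this naming convention to build the two lists:

```python
# Cat breeds are capitalized, dog breeds are lowercase
cat_images = [name for name in image_names if name[0].isupper()]
dog_images = [name for name in image_names if name[0].islower()]
```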
Now we can check how many images of each animal we have.
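Printing the size of each list:

```python
print(f'There are {len(cat_images)} images of cats')
print(f'There are {len(dog_images)} images of dogs')
```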
There are 2400 images of cats
There are 4990 images of dogs
The dataset has 2,400 cats and 4,990 dogs.
Data partition
For our network, we want to use three different sets:
- train: images to train the model.
- validation: images to test the model during the training process.
- test: images to test the model once it has completed its training.
We will use 70% of the images for training, 10% for validation and the remaining 20% for testing.
Since we have only two lists of images (dogs and cats), we will need to divide these lists into different sets. We will use pandas to do so.
First, we shuffle each list and divide it into three sets following the 70/10/20 distribution.
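One possible way to do it, shuffling each list and slicing it at the 70% and 80% marks (the helper name split_list is just illustrative):

```python
def split_list(images, train_frac=0.7, val_frac=0.1):
    """Shuffle a list of file names and split it into train/validation/test."""
    random.shuffle(images)
    n_train = int(len(images) * train_frac)
    n_val = int(len(images) * (train_frac + val_frac))
    return images[:n_train], images[n_train:n_val], images[n_val:]

cat_train, cat_val, cat_test = split_list(cat_images)
dog_train, dog_val, dog_test = split_list(dog_images)
```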
Next, we have to create the dataframes using pandas.
They will have only two columns, one with the image name and the other with the label ‘cat’ or ‘dog’.
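A sketch of those per-class dataframes; the column names 'filename' and 'label' are a choice we will reuse when feeding the data generator later:

```python
def to_dataframe(images, label):
    """Build a two-column dataframe: image file name and class label."""
    return pd.DataFrame({'filename': images, 'label': label})

cat_train_df = to_dataframe(cat_train, 'cat')
cat_val_df = to_dataframe(cat_val, 'cat')
cat_test_df = to_dataframe(cat_test, 'cat')

dog_train_df = to_dataframe(dog_train, 'dog')
dog_val_df = to_dataframe(dog_val, 'dog')
dog_test_df = to_dataframe(dog_test, 'dog')
```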
Finally, we can concatenate the dataframes.
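Concatenating the cat and dog dataframes of each split (and shuffling the rows so cats and dogs are mixed) gives the final partitions:

```python
train_df = pd.concat([cat_train_df, dog_train_df]).sample(frac=1).reset_index(drop=True)
val_df = pd.concat([cat_val_df, dog_val_df]).sample(frac=1).reset_index(drop=True)
test_df = pd.concat([cat_test_df, dog_test_df]).sample(frac=1).reset_index(drop=True)

print(f'There are {len(train_df)} images for training')
print(f'There are {len(val_df)} images for validation')
print(f'There are {len(test_df)} images for testing')
```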
There are 5173 images for training
There are 739 images for validation
There are 1478 images for testing
Perfect! We have created the three partitions from our dataset.
Data preprocessing
The next step is to preprocess all the images. We want them to have the same dimensions (the CNN needs all its inputs to have the same shape) and their pixel values normalized, which helps the network train faster.
We are going to use the ImageDataGenerator class from Keras.
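A sketch of how the three generators could be created with flow_from_dataframe, assuming the dataframes built above and the 'images' folder; pixels are rescaled to [0, 1] and every image is resized to 224×224:

```python
datagen = ImageDataGenerator(rescale=1./255)   # normalize pixel values

train_generator = datagen.flow_from_dataframe(
    dataframe=train_df,
    directory='images',
    x_col='filename',
    y_col='label',
    target_size=(224, 224),   # resize every image to the same dimensions
    class_mode='binary',      # 0 for cat, 1 for dog
    batch_size=32,
    shuffle=True)

val_generator = datagen.flow_from_dataframe(
    dataframe=val_df,
    directory='images',
    x_col='filename',
    y_col='label',
    target_size=(224, 224),
    class_mode='binary',
    batch_size=32,
    shuffle=True)

test_generator = datagen.flow_from_dataframe(
    dataframe=test_df,
    directory='images',
    x_col='filename',
    y_col='label',
    target_size=(224, 224),
    class_mode='binary',
    batch_size=32,
    shuffle=False)            # keep the order fixed for evaluation
```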
We have just created three new datasets, each with preprocessed images. We also set the shuffle property to randomize the order of the images and batch_size to group them in batches of 32 elements, which means we will feed the CNN 32 images at a time. Smaller batches give the network more frequent (but noisier) weight updates, at the cost of a longer training process. This is a parameter we can play with to see how performance changes.
Visualization
If we want to check how the dataset has turned out, we can run the following checks.
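For instance, we can take one batch from the training generator and print its shape:

```python
images, labels = next(train_generator)

print('Batch shape:', images.shape)
print('Label shape:', labels.shape)
```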
Batch shape: (32, 224, 224, 3)
Label shape: (32,)
We have 32 images with their associated labels. We can also plot one of these images (the fourth one, for example).
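A quick way to display one image of the batch together with its label:

```python
plt.imshow(images[3])   # fourth image of the batch
plt.axis('off')
plt.show()

print('Label:', labels[3])
```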

Label: 1.0
The labels are 1 for dogs and 0 for cats.
Model
Now that we have all the data prepared, it is time to build the model.
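The architecture below is reconstructed from the summary that follows: four Conv2D layers with 3×3 kernels, max pooling in between, a global average pooling layer and a single output neuron. The ReLU activations and 2×2 pooling windows are assumptions consistent with the output shapes; the sigmoid output is confirmed later in the article.

```python
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(256, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(512, (3, 3), activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.summary()
```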
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= conv2d (Conv2D) (None, 222, 222, 64) 1792 max_pooling2d (MaxPooling2D) (None, 111, 111, 64) 0 conv2d_1 (Conv2D) (None, 109, 109, 128) 73856 max_pooling2d_1 (MaxPooling2D) (None, 54, 54, 128) 0 conv2d_2 (Conv2D) (None, 52, 52, 256) 295168 max_pooling2d_2 (MaxPooling2D) (None, 26, 26, 256) 0 conv2d_3 (Conv2D) (None, 24, 24, 512) 1180160 global_average_pooling2d (G (None, 512) 0 lobalAveragePooling2D) dense (Dense) (None, 1) 513
=================================================================
Total params: 1,551,489
Trainable params: 1,551,489
Non-trainable params: 0 _________________________________________________________________
This is a simple CNN with four convolutional layers, a global average pooling layer and a dense output layer with a single neuron.
Let’s move on to the training phase. We need to compile and fit the model. We will use binary cross-entropy as the loss function, since this is a binary problem with integer labels (0 for cat and 1 for dog). The optimizer will be Adam, and we will track accuracy as the metric during training.
We are going to train the network for 15 epochs.
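A minimal sketch of the compile and fit calls under those choices, assuming the generators defined earlier:

```python
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])

history = model.fit(
    train_generator,
    validation_data=val_generator,
    epochs=15)
```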
Now we have trained our first CNN! We can look at the network’s accuracy and loss values during the training process.
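Assuming the history object returned by fit, the curves can be plotted like this:

```python
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
```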

The last step is to evaluate this model. To do that, we are going to use the test dataset.
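Evaluating on the test generator:

```python
loss, accuracy = model.evaluate(test_generator)

print('Loss:', loss)
print('Accuracy:', accuracy)
```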
47/47 [==============================] - 10s 204ms/step - loss: 0.4277 - accuracy: 0.8051
Loss: 0.4276582598686218
Accuracy: 0.8051421046257019
The model achieves an accuracy of about 80% on the test set.
You have trained your first convolutional neural network, congratulations! Now that the model is trained, we can use it to make predictions on real images of cats and dogs, images that have never been seen at any stage of the process.
Predictions
The prediction phase is the most similar scenario to a real-world problem, where once the model has been trained, it has to perform the classification of unseen data.
In our case, we will predict the class for the following image.

For a human, this is clearly an image of a cute cat, but let’s see what our model has to say about that.
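A sketch of the prediction step, assuming the picture is saved as 'cat.jpg' (the file name is just illustrative): we load it, resize it to the network's input size, normalize the pixels and add a batch dimension before calling predict.

```python
from tensorflow.keras.preprocessing import image

# Load the image and resize it to the network's input dimensions
img = image.load_img('cat.jpg', target_size=(224, 224))

# Convert to an array, normalize and add the batch dimension
img_array = image.img_to_array(img) / 255.0
img_array = np.expand_dims(img_array, axis=0)   # shape: (1, 224, 224, 3)

model.predict(img_array)
```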
array([[0.08732735]], dtype=float32)
In this case, we have to reshape the image into the input format of the network and normalize it as we did with all the dataset’s images.
Then we give it to the model, and the output is 0.08. This output can vary from 0 to 1, since the network’s output layer has a sigmoid activation function. Since our classes were 0 for cats and 1 for dogs, our model agrees that this image is of a cat.
You can see the complete code in this repo.
Next steps
We have seen a simple approach to building a CNN from scratch. It performs reasonably well, but there is still plenty of room for improvement. Here, I will list some modifications you can investigate to enhance the network.
- Change the network architecture. You can try new layers, edit the existing ones or change the activation functions. The more you experiment, the more you will learn about CNNs.
- Add Dropout and Batch Normalization layers. They are very useful, especially in bigger models, and will also help the model avoid overfitting.
- Try new optimizers for the training process. There are a lot of different optimizers that you can use to train the network. You can change them and see whether there are improvements or not.
- Do image augmentation, a technique for creating new images from the ones we already have. You can try zooming, rotating or cropping sections of the images to generate more data, as sketched below.
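For example, ImageDataGenerator itself can apply simple augmentations on the fly; the values below are just illustrative starting points:

```python
augmented_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,       # random rotations up to 20 degrees
    zoom_range=0.2,          # random zooms up to 20%
    horizontal_flip=True)    # random horizontal flips
```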
Once you have tried all these points (and any others you can think of), I suggest working on a multiclass classification model. For example, you can use this same dataset, this time taking the breeds into account: create a classifier that differentiates not only between cats and dogs but also between the breeds available in the dataset. It will be a fun challenge!
I hope you have found the tutorial helpful. If you have any questions feel free to leave a comment! 👋🏻