Building a Convolutional Neural Network Model to Understand Scenes

A complete step-by-step tutorial with data augmentation and transfer learning

Table of Contents
· Library
· Dataset
· Exploratory Data Analysis
· Data Preprocessing
· Modeling
  ∘ Simple CNN
  ∘ Deeper CNN
  ∘ Deeper CNN with Pretrained Weights
· Conclusion

Since I began writing on Medium, I have relied heavily on Unsplash, a beautiful collection of high-quality images. But did you know that Unsplash uses machine learning to help tag photos?

For every image uploaded to Unsplash […] we run the image through a series of machine learning algorithms to understand the content of a photo, removing the need for the contributor to manually tag the photo. – Unsplash blog

Labeling photos is an important task that machines can do quickly. Accordingly, we will build a model that extracts information from images and assigns the correct label, so that an image database can be categorized by thematic location. We will predict whether an image depicts "Buildings", "Forest", "Glacier", "Mountain", "Sea", or "Street" using a Convolutional Neural Network (CNN). In other words, this is an image classification problem.

Library

Apart from the recurring libraries we normally use in R, we will also utilize keras. Keras is a high-level neural networks API developed with a focus on enabling fast experimentation.
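If keras has not been set up on your machine before, a one-time installation is needed. A minimal sketch (exact versions and backends vary by machine):

install.packages("keras")
keras::install_keras()   # also installs the TensorFlow backend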

library(keras)        # Deep Learning
library(tidyverse)    # data wrangling
library(imager)       # image manipulation
library(caret)        # model evaluation
library(grid)         # display images in a grid
library(gridExtra)    # display images in a grid
RS <- 42              # random state constant

Note that we created a variable called RS, which is just a number used to make future random processes reproducible.

Dataset

The data consists of images with 6 different labels: "Buildings", "Forest", "Glacier", "Mountain", "Sea", and "Street". Unlike the previous article, in which the image pixel data had already been transformed into columns of a .csv file, this time we read the images directly using a data generator.

To do this, we need to know the image folder structure, which is displayed as follows.

seg_train
└── seg_train
    ├── buildings
    ├── forest
    ├── glacier
    ├── mountain
    ├── sea
    └── street
seg_test
└── seg_test
    ├── buildings
    ├── forest
    ├── glacier
    ├── mountain
    ├── sea
    └── street

Inside each of the buildings, forest, glacier, mountain, sea, and street subfolders, the corresponding images are saved. As the names suggest, we will use seg_train for model training and seg_test for model validation.

Exploratory Data Analysis

First, we need to locate the parent folder path of each category.

folder_list <- list.files("seg_train/seg_train/")
folder_path <- paste0("seg_train/seg_train/", folder_list, "/")
folder_path
#> [1] "seg_train/seg_train/buildings/" "seg_train/seg_train/forest/"    "seg_train/seg_train/glacier/"   "seg_train/seg_train/mountain/" 
#> [5] "seg_train/seg_train/sea/"       "seg_train/seg_train/street/"

Then, make a list of all seg_train image paths under each parent folder.

file_name <- 
  map(folder_path, function(x) paste0(x, list.files(x))) %>% 
  unlist()

We can see below that there are 14034 seg_train images in total.

cat("Number of train images:", length(file_name))
#> Number of train images: 14034

As a sanity check, let’s look at a couple of seg_train images.

set.seed(RS)
sample_image <- sample(file_name, 18)
img <- map(sample_image, load.image)
grobs <- lapply(img, rasterGrob)
grid.arrange(grobs=grobs, ncol=6)

Take the first image.

img <- load.image(file_name[1])
img
#> Image. Width: 150 pix Height: 150 pix Depth: 1 Colour channels: 3

As can be seen below, the dimensions of this image are 150 × 150 × 1 × 3. This means this particular image has a width of 150 pixels, a height of 150 pixels, a depth of 1 (imager reserves this dimension for video frames), and 3 color channels (Red, Green, and Blue, also known as RGB).

dim(img)
#> [1] 150 150   1   3

Now, we will build a function to acquire the width and height of an image and apply the function to all images.

get_dim <- function(x){
  img <- load.image(x) 
  df_img <- data.frame(
    width = width(img),
    height = height(img),
    filename = x
  )
  return(df_img)
}

file_dim <- map_df(file_name, get_dim)
head(file_dim)
#>   width height                                filename
#> 1   150    150     seg_train/seg_train/buildings/0.jpg
#> 2   150    150 seg_train/seg_train/buildings/10006.jpg
#> 3   150    150  seg_train/seg_train/buildings/1001.jpg
#> 4   150    150 seg_train/seg_train/buildings/10014.jpg
#> 5   150    150 seg_train/seg_train/buildings/10018.jpg
#> 6   150    150 seg_train/seg_train/buildings/10029.jpg

We got the following distribution of width and height of images.

hist(file_dim$width, breaks = 20)

hist(file_dim$height, breaks = 20)

summary(file_dim)
#>      width         height        filename        
#>  Min.   :150   Min.   : 76.0   Length:14034      
#>  1st Qu.:150   1st Qu.:150.0   Class :character  
#>  Median :150   Median :150.0   Mode  :character  
#>  Mean   :150   Mean   :149.9                     
#>  3rd Qu.:150   3rd Qu.:150.0                     
#>  Max.   :150   Max.   :150.0

As we can see, the dataset contains images of varying dimensions. All widths are 150 pixels, but the heights range from 76 to 150 pixels. Before being fed into the model, all of these images have to have the same dimensions (see the resizing sketch after this list). This is crucial since:

  1. the input layer of the model, into which the image pixel values are fed, has a fixed number of neurons,
  2. if the image dimensions are too high, training the model might end up taking too long, and
  3. if the image dimensions are too low, then too much information is lost.
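In our pipeline, the data generator will handle this for us through its target_size argument. Purely as an illustration, here is how one could resize a single image manually with imager (a sketch; resize() interpolates the pixels to the requested size):

im_small <- load.image(file_dim$filename[which.min(file_dim$height)])
dim(im_small)     # the shortest image: width 150, height 76
im_resized <- resize(im_small, size_x = 150, size_y = 150)
dim(im_resized)
#> expected: 150 150 1 3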

Data Preprocessing

One problem that may occur with neural network models is that they tend to memorize the images in the seg_train dataset, so that when the new seg_test dataset comes in, they can't recognize them. Data augmentation is one of many techniques to address this problem. Given an image, data augmentation transforms it slightly to create new images, which are then fed into the model. This way, the model sees many versions of the original image and hopefully learns what the image means instead of memorizing it. We will only use some simple transformations, such as:

  1. Randomly flip the image horizontally
  2. Randomly rotate it by up to 10 degrees
  3. Randomly zoom it by up to a factor of 0.1
  4. Randomly shift it horizontally by up to 0.1 fraction of the total width
  5. Randomly shift it vertically by up to 0.1 fraction of the total height

We don't use vertical flips since, in our case, they could change the meaning of the image. This data augmentation can be done using the image_data_generator() function. Save the generator to an object named train_data_gen. Note that train_data_gen is only applied during training; we don't use it when predicting.

In train_data_gen, we also perform normalization to reduce the effect of differences in illumination. Moreover, CNN models converge faster on [0..1] data than on [0..255]. To do this, simply divide each pixel value by 255.

train_data_gen <- image_data_generator(
  rescale = 1/255,            # scale pixel values to [0, 1]
  horizontal_flip = T,        # flip image horizontally
  vertical_flip = F,          # don't flip image vertically
  rotation_range = 10,        # rotate image by up to 10 degrees
  zoom_range = 0.1,           # zoom in or zoom out range
  width_shift_range = 0.1,    # shift horizontally by up to 10% of width
  height_shift_range = 0.1    # shift vertically by up to 10% of height
)
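To get a feel for what these transformations do, we can optionally flow a single image through the generator. A sketch, assuming file_name from the EDA section is still in memory:

# load one image as a 4D array (a batch of one) and flow it through the generator
img <- image_load(file_name[1], target_size = c(150, 150))
img_array <- array_reshape(image_to_array(img), c(1, 150, 150, 3))
aug_flow <- flow_images_from_data(img_array, generator = train_data_gen, batch_size = 1)
aug_img <- generator_next(aug_flow)   # one randomly transformed copy
dim(aug_img)
#> expected: 1 150 150 3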

We will use 150 × 150 pixels as the input images' shape since 150 pixels is the most common width and height among all images (look again at the EDA), and store the size as target_size. We will also train the model in batches of 32 observations, stored as batch_size.

target_size <- c(150, 150)
batch_size <- 32

Now, build generators to flow the train and validation datasets from their respective directories. Fill in the parameters target_size and batch_size. Since we have colored RGB images, set color_mode to "rgb". Lastly, use train_data_gen as the generator to apply the previously created data augmentation.

# for training dataset
train_image_array_gen <- flow_images_from_directory(
  directory = "seg_train/seg_train/",   # folder of the data
  target_size = target_size,   # target of the image dimension
  color_mode = "rgb",          # use RGB color
  batch_size = batch_size,     # number of images in each batch
  seed = RS,                   # set random seed
  generator = train_data_gen   # apply data augmentation
)

# for validation dataset
val_image_array_gen <- flow_images_from_directory(
  directory = "seg_test/seg_test/",
  target_size = target_size, 
  color_mode = "rgb", 
  batch_size = batch_size,
  seed = RS,
  generator = train_data_gen
)
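Before going further, it's good practice to sanity-check one batch from the training generator; a quick sketch (generator_next() pulls the next batch as a list of images and one-hot labels):

batch <- generator_next(train_image_array_gen)
dim(batch[[1]])   # images: expected 32 150 150 3
dim(batch[[2]])   # one-hot labels: expected 32 6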

Next, we will look at the proportion of labels in the target variable to check for class imbalance. If there is any, classifiers tend to produce a biased model with poorer predictive accuracy on the minority classes than on the majority classes. The simplest remedies are upsampling or downsampling the training dataset.

output_n <- n_distinct(train_image_array_gen$classes)
table("Frequency" = factor(train_image_array_gen$classes)) %>% 
  prop.table()
#> Frequency
#>         0         1         2         3         4         5 
#> 0.1561208 0.1618213 0.1712983 0.1789939 0.1620351 0.1697307

Luckily, as can be seen above, all classes are relatively balanced! We don’t need further treatment.
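The numeric labels 0 to 5 above follow the alphabetical order of the subfolder names. As a quick check (and because we will need this mapping later when decoding predictions), the generator exposes it directly:

train_image_array_gen$class_indices
#> expected mapping: buildings = 0, forest = 1, glacier = 2, mountain = 3, sea = 4, street = 5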

Modeling

First, let's save the number of training and validation images we use. We need validation data separate from the training data because we don't want our model to be good only at predicting images it has seen, but also to generalize to unseen images. This need for unseen images is exactly why we also have to measure the model's performance on the validation dataset.

Thus, we can observe below that we have 14034 images for training (as stated before) and 3000 images for validating the model.

train_samples <- train_image_array_gen$n
valid_samples <- val_image_array_gen$n
train_samples
#> [1] 14034
valid_samples
#> [1] 3000

We will build three models progressively from the simplest.

Simple CNN

This model has only 4 hidden layers (including max pooling and flatten) and 1 output layer, as detailed below:

  1. Convolution layer: filter 16, kernel size 3 × 3, same padding, relu activation function
  2. Max pooling layer: pool size 2 × 2
  3. Flatten layer
  4. Dense layer: 16 nodes, relu activation function
  5. Dense layer (output): 6 nodes, softmax activation function

Please note that we use a flatten layer as a bridge from the convolutional part of the network to the dense part. What a flatten layer does is, as the name suggests, reshape the output of the last convolutional layer into a single vector. For example, say a convolution layer outputs a tensor of size (8, 8, 32), where 32 is the number of filters. The flatten layer will reshape this tensor into a vector of size 2048.
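The same arithmetic applies to our model below: after one 2 × 2 max pooling, the 150 × 150 × 16 feature maps become 75 × 75 × 16, so the flatten layer will output a vector of length 90000, which the model summary below confirms.

8 * 8 * 32     # the example above
#> [1] 2048
75 * 75 * 16   # our model's flatten output
#> [1] 90000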

In the output layer, we use the softmax activation function because this is a multiclass classification problem. Lastly, we need to specify the image size for the input layer of the CNN. As stated before, we will use a 150 × 150 pixel image size with 3 RGB channels, as stored in target_size.

Now, we are ready.

# Set Initial Random Weight
tensorflow::tf$random$set_seed(RS)

model <- keras_model_sequential(name = "simple_model") %>% 

  # Convolution Layer
  layer_conv_2d(filters = 16,
                kernel_size = c(3,3),
                padding = "same",
                activation = "relu",
                input_shape = c(target_size, 3) 
                ) %>% 

  # Max Pooling Layer
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 

  # Flattening Layer
  layer_flatten() %>% 

  # Dense Layer
  layer_dense(units = 16,
              activation = "relu") %>% 

  # Output Layer
  layer_dense(units = output_n,
              activation = "softmax",
              name = "Output")

summary(model)
#> Model: "simple_model"
#> _________________________________________________________________
#> Layer (type)                                                  Output Shape                                           Param #              
#> =================================================================
#> conv2d (Conv2D)                                               (None, 150, 150, 16)                                   448                  
#> _________________________________________________________________
#> max_pooling2d (MaxPooling2D)                                  (None, 75, 75, 16)                                     0                    
#> _________________________________________________________________
#> flatten (Flatten)                                             (None, 90000)                                          0                    
#> _________________________________________________________________
#> dense (Dense)                                                 (None, 16)                                             1440016              
#> _________________________________________________________________
#> Output (Dense)                                                (None, 6)                                              102                  
#> =================================================================
#> Total params: 1,440,566
#> Trainable params: 1,440,566
#> Non-trainable params: 0
#> _________________________________________________________________

After building the model, we compile and train it. We use categorical crossentropy as the loss function because, again, this is a multiclass classification problem. We use the adam optimizer with the default learning rate of 0.001 since adam is one of the most efficient optimizers there is. We also use accuracy as a metric for simplicity. More importantly, since we don't prefer one class above the others and every class is balanced, accuracy is favored over precision, sensitivity, or specificity. We will train the model for 10 epochs.

model %>% 
  compile(
    loss = "categorical_crossentropy",
    optimizer = optimizer_adam(lr = 0.001),
    metrics = "accuracy"
  )

# fit data into model
history <- model %>% 
  fit_generator(
    # training data
    train_image_array_gen,

    # training epochs
    steps_per_epoch = as.integer(train_samples / batch_size), 
    epochs = 10, 

    # validation data
    validation_data = val_image_array_gen,
    validation_steps = as.integer(valid_samples / batch_size)
  )
plot(history)

The final training and validation accuracies at the tenth epoch are similar and relatively high, which means no overfitting occurred. Next, we will perform prediction on all images in the validation dataset (instead of per batch as done during training). First, let's tabulate the path of each image and its corresponding class.

val_data <- data.frame(file_name = paste0("seg_test/seg_test/", val_image_array_gen$filenames)) %>% 
  mutate(class = str_extract(file_name, "buildings|forest|glacier|mountain|sea|street"))

head(val_data)
#>                                  file_name     class
#> 1 seg_test/seg_test/buildings/20057.jpg buildings
#> 2 seg_test/seg_test/buildings/20060.jpg buildings
#> 3 seg_test/seg_test/buildings/20061.jpg buildings
#> 4 seg_test/seg_test/buildings/20064.jpg buildings
#> 5 seg_test/seg_test/buildings/20073.jpg buildings
#> 6 seg_test/seg_test/buildings/20074.jpg buildings

Then, we will convert each image into an array. Don’t forget to normalize the pixel values, that is, divide them by 255.

image_prep <- function(x, target_size) {
  arrays <- lapply(x, function(path) {
    # load and resize the image to the target size
    img <- image_load(
      path, 
      target_size = target_size, 
      grayscale = F
    )
    # convert to an array and add a batch dimension
    x <- image_to_array(img)
    x <- array_reshape(x, c(1, dim(x)))
    # normalize pixel values to [0, 1]
    x <- x/255
  })
  # bind all images into a single array along the batch dimension
  do.call(abind::abind, c(arrays, list(along = 1)))
}

test_x <- image_prep(val_data$file_name, target_size)
dim(test_x)
#> [1] 3000  150  150    3

Next, predict.

pred_test <- predict_classes(model, test_x) 
head(pred_test)
#> [1] 4 0 0 0 4 3

Now, decode each prediction into its appropriate class.

decode <- function(x){
  case_when(
    x == 0 ~ "buildings",
    x == 1 ~ "forest",
    x == 2 ~ "glacier",
    x == 3 ~ "mountain",
    x == 4 ~ "sea",
    x == 5 ~ "street"
  )
}

pred_test <- sapply(pred_test, decode)
head(pred_test)
#> [1] "sea"       "buildings" "buildings" "buildings" "sea"       "mountain"

Finally, analyze the confusion matrix.

cm_simple <- confusionMatrix(as.factor(pred_test), as.factor(val_data$class))
acc_simple <- cm_simple$overall['Accuracy']
cm_simple
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  buildings forest glacier mountain sea street
#>   buildings       348     24      14       20  35    106
#>   forest            8    418       3        4   4     19
#>   glacier           7      5     357       53  38      5
#>   mountain         19      6      98      381  61      5
#>   sea              13      1      75       65 363      6
#>   street           42     20       6        2   9    360
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.7423               
#>                  95% CI : (0.7263, 0.7579)     
#>     No Information Rate : 0.1843               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6909               
#>                                                
#>  Mcnemar's Test P-Value : 0.0000000001327      
#> 
#> Statistics by Class:
#> 
#>                      Class: buildings Class: forest Class: glacier Class: mountain Class: sea Class: street
#> Sensitivity                    0.7963        0.8819         0.6456          0.7257     0.7118        0.7186
#> Specificity                    0.9224        0.9850         0.9559          0.9236     0.9357        0.9684
#> Pos Pred Value                 0.6362        0.9167         0.7677          0.6684     0.6941        0.8200
#> Neg Pred Value                 0.9637        0.9780         0.9227          0.9407     0.9407        0.9449
#> Prevalence                     0.1457        0.1580         0.1843          0.1750     0.1700        0.1670
#> Detection Rate                 0.1160        0.1393         0.1190          0.1270     0.1210        0.1200
#> Detection Prevalence           0.1823        0.1520         0.1550          0.1900     0.1743        0.1463
#> Balanced Accuracy              0.8593        0.9334         0.8007          0.8247     0.8238        0.8435

From the confusion matrix, it can be seen that the model has a hard time differentiating between the classes. We got 74% accuracy on the validation dataset. There are 106 street images predicted as buildings, more than 20% of all street images. This makes sense, since many street images also contain buildings.

We can improve the model performance in various ways. But for now, let’s improve it by simply changing the architecture.

Deeper CNN

Now we make a deeper CNN with more convolution layers. Here’s the architecture:

  1. Block 1: 2 convolution layers and 1 max-pooling layer
  2. Block 2: 1 convolution layer and 1 max-pooling layer
  3. Block 3: 1 convolution layer and 1 max-pooling layer
  4. Block 4: 1 convolution layer and 1 max-pooling layer
  5. Flatten layer
  6. One dense layer
  7. Output layer

tensorflow::tf$random$set_seed(RS)

model_big <- keras_model_sequential(name = "model_big") %>%

  # First convolutional layer
  layer_conv_2d(filters = 32,
                kernel_size = c(5,5), # 5 x 5 filters
                padding = "same",
                activation = "relu",
                input_shape = c(target_size, 3)
                ) %>% 

  # Second convolutional layer
  layer_conv_2d(filters = 32,
                kernel_size = c(3,3), # 3 x 3 filters
                padding = "same",
                activation = "relu"
                ) %>% 

  # Max pooling layer
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 

  # Third convolutional layer
  layer_conv_2d(filters = 64,
                kernel_size = c(3,3),
                padding = "same",
                activation = "relu"
                ) %>% 

  # Max pooling layer
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 

  # Fourth convolutional layer
  layer_conv_2d(filters = 128,
                kernel_size = c(3,3),
                padding = "same",
                activation = "relu"
                ) %>% 

  # Max pooling layer
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 

  # Fifth convolutional layer
  layer_conv_2d(filters = 256,
                kernel_size = c(3,3),
                padding = "same",
                activation = "relu"
                ) %>% 

  # Max pooling layer
  layer_max_pooling_2d(pool_size = c(2,2)) %>% 

  # Flattening layer
  layer_flatten() %>% 

  # Dense layer
  layer_dense(units = 64,
              activation = "relu") %>% 

  # Output layer
  layer_dense(name = "Output",
              units = output_n, 
              activation = "softmax")

summary(model_big)
#> Model: "model_big"
#> _________________________________________________________________
#> Layer (type)                                                  Output Shape                                           Param #              
#> =================================================================
#> conv2d_5 (Conv2D)                                             (None, 150, 150, 32)                                   2432                 
#> _________________________________________________________________
#> conv2d_4 (Conv2D)                                             (None, 150, 150, 32)                                   9248                 
#> _________________________________________________________________
#> max_pooling2d_4 (MaxPooling2D)                                (None, 75, 75, 32)                                     0                    
#> _________________________________________________________________
#> conv2d_3 (Conv2D)                                             (None, 75, 75, 64)                                     18496                
#> _________________________________________________________________
#> max_pooling2d_3 (MaxPooling2D)                                (None, 37, 37, 64)                                     0                    
#> _________________________________________________________________
#> conv2d_2 (Conv2D)                                             (None, 37, 37, 128)                                    73856                
#> _________________________________________________________________
#> max_pooling2d_2 (MaxPooling2D)                                (None, 18, 18, 128)                                    0                    
#> _________________________________________________________________
#> conv2d_1 (Conv2D)                                             (None, 18, 18, 256)                                    295168               
#> _________________________________________________________________
#> max_pooling2d_1 (MaxPooling2D)                                (None, 9, 9, 256)                                      0                    
#> _________________________________________________________________
#> flatten_1 (Flatten)                                           (None, 20736)                                          0                    
#> _________________________________________________________________
#> dense_1 (Dense)                                               (None, 64)                                             1327168              
#> _________________________________________________________________
#> Output (Dense)                                                (None, 6)                                              390                  
#> =================================================================
#> Total params: 1,726,758
#> Trainable params: 1,726,758
#> Non-trainable params: 0
#> _________________________________________________________________

The rest is the same as previously done.

model_big %>%
  compile(
    loss = "categorical_crossentropy",
    optimizer = optimizer_adam(lr = 0.001),
    metrics = "accuracy"
  )

history <- model_big %>%
  fit_generator(
    train_image_array_gen,
    steps_per_epoch = as.integer(train_samples / batch_size),
    epochs = 10,
    validation_data = val_image_array_gen,
    validation_steps = as.integer(valid_samples / batch_size)
  )
plot(history)

pred_test <- predict_classes(model_big, test_x)
pred_test <- sapply(pred_test, decode)
cm_big <- confusionMatrix(as.factor(pred_test), as.factor(val_data$class))
acc_big <- cm_big$overall['Accuracy']
cm_big
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  buildings forest glacier mountain sea street
#>   buildings       390      3      24       24  11     34
#>   forest            3    465      11        7   8     11
#>   glacier           2      0     367       35   9      1
#>   mountain          0      2      82      415  17      1
#>   sea               3      1      57       42 461      6
#>   street           39      3      12        2   4    448
#> 
#> Overall Statistics
#>                                                
#>                Accuracy : 0.8487               
#>                  95% CI : (0.8353, 0.8613)     
#>     No Information Rate : 0.1843               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.8185               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#> 
#> Statistics by Class:
#> 
#>                      Class: buildings Class: forest Class: glacier Class: mountain Class: sea Class: street
#> Sensitivity                    0.8924        0.9810         0.6637          0.7905     0.9039        0.8942
#> Specificity                    0.9625        0.9842         0.9808          0.9588     0.9562        0.9760
#> Pos Pred Value                 0.8025        0.9208         0.8865          0.8027     0.8088        0.8819
#> Neg Pred Value                 0.9813        0.9964         0.9281          0.9557     0.9798        0.9787
#> Prevalence                     0.1457        0.1580         0.1843          0.1750     0.1700        0.1670
#> Detection Rate                 0.1300        0.1550         0.1223          0.1383     0.1537        0.1493
#> Detection Prevalence           0.1620        0.1683         0.1380          0.1723     0.1900        0.1693
#> Balanced Accuracy              0.9275        0.9826         0.8222          0.8746     0.9301        0.9351

This result is overall better than the earlier model's, since model_big is more complex and hence able to capture more features. We got 85% accuracy on the validation dataset. While predictions for street images have gotten better, predictions for glacier images are still all over the place.

Deeper CNN with Pretrained Weights

Researchers have already developed many models for image classification problems, from the VGG model family to EfficientNet, one of the latest state-of-the-art models, developed by Google. For the purpose of learning, in this section we will use the VGG16 model, since it is one of the simplest, consisting only of convolution layers, max-pooling layers, and dense layers, as introduced before. This process is called transfer learning: transferring the knowledge of a pretrained model to solve our own problem.

The original VGG16 model was trained on 1000 classes. To make it suitable for our problem, we will exclude the top (dense) layers of the model and plug in our own version of the predictive layers, consisting of one global average pooling layer (as a replacement for the flatten layer), one dense layer with 64 nodes, and one output layer with 6 nodes (for 6 classes).

Let’s see the overall architecture.

# load original model without top layers
input_tensor <- layer_input(shape = c(target_size, 3))
base_model <- application_vgg16(input_tensor = input_tensor, 
                                weights = 'imagenet', 
                                include_top = FALSE)

# add our custom layers
predictions <- base_model$output %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = output_n, activation = 'softmax')

# this is the model we will train
vgg16 <- keras_model(inputs = base_model$input, outputs = predictions)
summary(vgg16)
#> Model: "model"
#> _________________________________________________________________
#> Layer (type)                                                  Output Shape                                           Param #              
#> =================================================================
#> input_1 (InputLayer)                                          [(None, 150, 150, 3)]                                  0                    
#> _________________________________________________________________
#> block1_conv1 (Conv2D)                                         (None, 150, 150, 64)                                   1792                 
#> _________________________________________________________________
#> block1_conv2 (Conv2D)                                         (None, 150, 150, 64)                                   36928                
#> _________________________________________________________________
#> block1_pool (MaxPooling2D)                                    (None, 75, 75, 64)                                     0                    
#> _________________________________________________________________
#> block2_conv1 (Conv2D)                                         (None, 75, 75, 128)                                    73856                
#> _________________________________________________________________
#> block2_conv2 (Conv2D)                                         (None, 75, 75, 128)                                    147584               
#> _________________________________________________________________
#> block2_pool (MaxPooling2D)                                    (None, 37, 37, 128)                                    0                    
#> _________________________________________________________________
#> block3_conv1 (Conv2D)                                         (None, 37, 37, 256)                                    295168               
#> _________________________________________________________________
#> block3_conv2 (Conv2D)                                         (None, 37, 37, 256)                                    590080               
#> _________________________________________________________________
#> block3_conv3 (Conv2D)                                         (None, 37, 37, 256)                                    590080               
#> _________________________________________________________________
#> block3_pool (MaxPooling2D)                                    (None, 18, 18, 256)                                    0                    
#> _________________________________________________________________
#> block4_conv1 (Conv2D)                                         (None, 18, 18, 512)                                    1180160              
#> _________________________________________________________________
#> block4_conv2 (Conv2D)                                         (None, 18, 18, 512)                                    2359808              
#> _________________________________________________________________
#> block4_conv3 (Conv2D)                                         (None, 18, 18, 512)                                    2359808              
#> _________________________________________________________________
#> block4_pool (MaxPooling2D)                                    (None, 9, 9, 512)                                      0                    
#> _________________________________________________________________
#> block5_conv1 (Conv2D)                                         (None, 9, 9, 512)                                      2359808              
#> _________________________________________________________________
#> block5_conv2 (Conv2D)                                         (None, 9, 9, 512)                                      2359808              
#> _________________________________________________________________
#> block5_conv3 (Conv2D)                                         (None, 9, 9, 512)                                      2359808              
#> _________________________________________________________________
#> block5_pool (MaxPooling2D)                                    (None, 4, 4, 512)                                      0                    
#> _________________________________________________________________
#> global_average_pooling2d (GlobalAveragePooling2D)             (None, 512)                                            0                    
#> _________________________________________________________________
#> dense_3 (Dense)                                               (None, 64)                                             32832                
#> _________________________________________________________________
#> dense_2 (Dense)                                               (None, 6)                                              390                  
#> =================================================================
#> Total params: 14,747,910
#> Trainable params: 14,747,910
#> Non-trainable params: 0
#> _________________________________________________________________

Wow, that's a lot of layers! We could use vgg16 directly for training and predicting but, again, for the purpose of learning let's build the VGG16 architecture from scratch ourselves.

model_bigger <- keras_model_sequential(name = "model_bigger") %>% 

  # Block 1
  layer_conv_2d(filters = 64, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                input_shape = c(94, 94, 3),
                name='block1_conv1') %>% 

  layer_conv_2d(filters = 64, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block1_conv2') %>% 

  layer_max_pooling_2d(pool_size = c(2, 2), 
                       strides=c(2, 2), 
                       name='block1_pool') %>% 

  # Block 2
  layer_conv_2d(filters = 128, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block2_conv1') %>% 

  layer_conv_2d(filters = 128, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block2_conv2') %>% 

  layer_max_pooling_2d(pool_size = c(2, 2), 
                       strides=c(2, 2), 
                       name='block2_pool') %>% 

  # Block 3
  layer_conv_2d(filters = 256, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block3_conv1') %>% 

  layer_conv_2d(filters = 256, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block3_conv2') %>% 

  layer_conv_2d(filters = 256, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block3_conv3') %>% 

  layer_max_pooling_2d(pool_size = c(2, 2), 
                       strides=c(2, 2), 
                       name='block3_pool') %>% 

  # Block 4
  layer_conv_2d(filters = 512, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block4_conv1') %>% 

  layer_conv_2d(filters = 512, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block4_conv2') %>% 

  layer_conv_2d(filters = 512, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block4_conv3') %>% 

  layer_max_pooling_2d(pool_size = c(2, 2), 
                       strides=c(2, 2), 
                       name='block4_pool') %>% 

  # Block 5
  layer_conv_2d(filters = 512, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block5_conv1') %>% 

  layer_conv_2d(filters = 512, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block5_conv2') %>% 

  layer_conv_2d(filters = 512, 
                kernel_size = c(3, 3), 
                activation='relu', 
                padding='same', 
                name='block5_conv3') %>% 

  layer_max_pooling_2d(pool_size = c(2, 2), 
                       strides=c(2, 2), 
                       name='block5_pool') %>% 

  # Dense
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = output_n, activation = 'softmax')

model_bigger
#> Model
#> Model: "model_bigger"
#> _________________________________________________________________
#> Layer (type)                                                  Output Shape                                           Param #              
#> =================================================================
#> block1_conv1 (Conv2D)                                         (None, 94, 94, 64)                                     1792                 
#> _________________________________________________________________
#> block1_conv2 (Conv2D)                                         (None, 94, 94, 64)                                     36928                
#> _________________________________________________________________
#> block1_pool (MaxPooling2D)                                    (None, 47, 47, 64)                                     0                    
#> _________________________________________________________________
#> block2_conv1 (Conv2D)                                         (None, 47, 47, 128)                                    73856                
#> _________________________________________________________________
#> block2_conv2 (Conv2D)                                         (None, 47, 47, 128)                                    147584               
#> _________________________________________________________________
#> block2_pool (MaxPooling2D)                                    (None, 23, 23, 128)                                    0                    
#> _________________________________________________________________
#> block3_conv1 (Conv2D)                                         (None, 23, 23, 256)                                    295168               
#> _________________________________________________________________
#> block3_conv2 (Conv2D)                                         (None, 23, 23, 256)                                    590080               
#> _________________________________________________________________
#> block3_conv3 (Conv2D)                                         (None, 23, 23, 256)                                    590080               
#> _________________________________________________________________
#> block3_pool (MaxPooling2D)                                    (None, 11, 11, 256)                                    0                    
#> _________________________________________________________________
#> block4_conv1 (Conv2D)                                         (None, 11, 11, 512)                                    1180160              
#> _________________________________________________________________
#> block4_conv2 (Conv2D)                                         (None, 11, 11, 512)                                    2359808              
#> _________________________________________________________________
#> block4_conv3 (Conv2D)                                         (None, 11, 11, 512)                                    2359808              
#> _________________________________________________________________
#> block4_pool (MaxPooling2D)                                    (None, 5, 5, 512)                                      0                    
#> _________________________________________________________________
#> block5_conv1 (Conv2D)                                         (None, 5, 5, 512)                                      2359808              
#> _________________________________________________________________
#> block5_conv2 (Conv2D)                                         (None, 5, 5, 512)                                      2359808              
#> _________________________________________________________________
#> block5_conv3 (Conv2D)                                         (None, 5, 5, 512)                                      2359808              
#> _________________________________________________________________
#> block5_pool (MaxPooling2D)                                    (None, 2, 2, 512)                                      0                    
#> _________________________________________________________________
#> global_average_pooling2d_1 (GlobalAveragePooling2D)           (None, 512)                                            0                    
#> _________________________________________________________________
#> dense_5 (Dense)                                               (None, 64)                                             32832                
#> _________________________________________________________________
#> dense_4 (Dense)                                               (None, 6)                                              390                  
#> =================================================================
#> Total params: 14,747,910
#> Trainable params: 14,747,910
#> Non-trainable params: 0
#> _________________________________________________________________

Note that model_bigger has exactly the same number of parameters per layer as vgg16. The advantage of transfer learning is that we don't have to start training from random weights; instead, we start from the pretrained weights of the original model. These pretrained weights are already optimized for image classification, and we only need to fine-tune them to match our purpose. Hence the saying:

We are standing on the shoulders of giants.

That being said, let’s assign all the weights of vgg16 to model_bigger.

set_weights(model_bigger, get_weights(vgg16))
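As an optional sanity check, we can confirm the transfer worked by comparing the weights of a single layer across the two models; a sketch:

w_vgg  <- get_weights(get_layer(vgg16, "block1_conv1"))
w_ours <- get_weights(get_layer(model_bigger, "block1_conv1"))
all.equal(w_vgg, w_ours)
#> expected: TRUE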

Here’s the summary of our model_bigger layers:

layers <- model_bigger$layers
for (i in 1:length(layers))
  cat(i, layers[[i]]$name, "\n")
#> 1 block1_conv1 
#> 2 block1_conv2 
#> 3 block1_pool 
#> 4 block2_conv1 
#> 5 block2_conv2 
#> 6 block2_pool 
#> 7 block3_conv1 
#> 8 block3_conv2 
#> 9 block3_conv3 
#> 10 block3_pool 
#> 11 block4_conv1 
#> 12 block4_conv2 
#> 13 block4_conv3 
#> 14 block4_pool 
#> 15 block5_conv1 
#> 16 block5_conv2 
#> 17 block5_conv3 
#> 18 block5_pool 
#> 19 global_average_pooling2d_1 
#> 20 dense_5 
#> 21 dense_4

Note that layers 19–21 still have random weights since we created them ourselves; they don't come from the original model. So, we need to train only these layers first by freezing all the layers before them.

freeze_weights(model_bigger, from = 1, to = 18)
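If we print the model again at this point, the parameter counts should reflect the freeze. Based on the totals above, we would expect:

summary(model_bigger)
#> expected: Total params: 14,747,910
#> expected: Trainable params: 33,222 (the two new dense layers)
#> expected: Non-trainable params: 14,714,688 (the frozen VGG16 base)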

To train those predictive layers, we will simply use the previous settings.

# compile the model (should be done *after* setting base layers to non-trainable)
model_bigger %>% compile(loss = "categorical_crossentropy",
                         optimizer = optimizer_adam(lr = 0.001),
                         metrics = "accuracy")

history <- model_bigger %>%
  fit_generator(
  train_image_array_gen,
  steps_per_epoch = as.integer(train_samples / batch_size),
  epochs = 10,
  validation_data = val_image_array_gen,
  validation_steps = as.integer(valid_samples / batch_size)
)

Now, fine-tune the model. To do this, we apply a low learning rate to the optimizer so the established pretrained weights won't be distorted. We will use a learning rate of 0.00001. Also, to save time, we will train the model for only 4 epochs.

Before fine-tuning, don’t forget to unfreeze the layers to be trained. In our case, we will unfreeze all layers.

unfreeze_weights(model_bigger)

# compile again with low learning rate
model_bigger %>% compile(loss = "categorical_crossentropy",
                         optimizer = optimizer_adam(lr = 0.00001),
                         metrics = "accuracy")

history <- model_bigger %>%
  fit_generator(
  train_image_array_gen,
  steps_per_epoch = as.integer(train_samples / batch_size),
  epochs = 4,
  validation_data = val_image_array_gen,
  validation_steps = as.integer(valid_samples / batch_size)
)
plot(history)

pred_test <- predict_classes(model_bigger, test_x)
pred_test <- sapply(pred_test, decode)
cm_bigger <- confusionMatrix(as.factor(pred_test), as.factor(val_data$class))
acc_bigger <- cm_bigger$overall['Accuracy']
cm_bigger
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  buildings forest glacier mountain sea street
#>   buildings       396      0       2        1   2     13
#>   forest            1    469       2        2   4      0
#>   glacier           1      2     479       61   5      0
#>   mountain          0      0      50      452   4      0
#>   sea               1      1      16        7 492      2
#>   street           38      2       4        2   3    486
#> 
#> Overall Statistics
#>                                               
#>                Accuracy : 0.9247              
#>                  95% CI : (0.9146, 0.9339)    
#>     No Information Rate : 0.1843              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.9095              
#>                                               
#>  Mcnemar's Test P-Value : 0.00281             
#> 
#> Statistics by Class:
#> 
#>                      Class: buildings Class: forest Class: glacier Class: mountain Class: sea Class: street
#> Sensitivity                    0.9062        0.9895         0.8662          0.8610     0.9647        0.9701
#> Specificity                    0.9930        0.9964         0.9718          0.9782     0.9892        0.9804
#> Pos Pred Value                 0.9565        0.9812         0.8741          0.8933     0.9480        0.9084
#> Neg Pred Value                 0.9841        0.9980         0.9698          0.9707     0.9927        0.9939
#> Prevalence                     0.1457        0.1580         0.1843          0.1750     0.1700        0.1670
#> Detection Rate                 0.1320        0.1563         0.1597          0.1507     0.1640        0.1620
#> Detection Prevalence           0.1380        0.1593         0.1827          0.1687     0.1730        0.1783
#> Balanced Accuracy              0.9496        0.9929         0.9190          0.9196     0.9769        0.9752

We've found the winner. model_bigger achieves 92% accuracy on the validation dataset! Still, there are some misclassifications, as no model is perfect. Here's a summary of the predictions:

  1. Some buildings are mistakenly predicted as streets, and vice versa. Again, this is because some images of buildings also contain streets, which confuses the model.
  2. Forests are predicted almost perfectly.
  3. Many glaciers are predicted as mountains or seas, and many mountains are predicted as glaciers. This may come from the similar blue tones in glacier, mountain, and sea images.
  4. Predictions on seas are good.

Conclusion

rbind(
  "Simple CNN" = acc_simple,
  "Deeper CNN" = acc_big,
  "Fine-tuned VGG16" = acc_bigger
)
#>                   Accuracy
#> Simple CNN       0.7423333
#> Deeper CNN       0.8486667
#> Fine-tuned VGG16 0.9246667

We've successfully performed image classification with 6 classes: "Buildings", "Forest", "Glacier", "Mountain", "Sea", and "Street". Since images are unstructured data, this problem can be solved by machine learning using a neural network, which performs feature extraction automatically without human intervention. For better performance, we use a Convolutional Neural Network followed by dense layers for prediction. In the end, the fine-tuned VGG16 model with ImageNet weight initialization achieves 92% accuracy.

With image classification, companies that have websites with large visual databases can easily organize and categorize their database because it allows for automatic classification of images in large quantities. This helps them monetize their visual content without investing countless hours in manual sorting and tagging.

