Data Science in R
Table of Contents
· Library
· Dataset
· Exploratory Data Analysis
· Data Preprocessing
· Modeling
∘ Simple CNN
∘ Deeper CNN
∘ Deeper CNN with Pretrained Weights
· Conclusion
Since I began writing on Medium, I have relied heavily on Unsplash, a wonderful source of high-quality images. But did you know that Unsplash uses machine learning to help tag photos?
For every image uploaded to Unsplash […] we run the image through a series of machine learning algorithms to understand the content of a photo, removing the need for the contributor to manually tag the photo. – Unsplash blog
Labeling photos is an important task that machines can do quickly. Accordingly, we will build a model that extracts information from images and assigns the correct label, so that an image database can be categorized by thematic location. We will predict whether an image shows "Buildings", "Forest", "Glacier", "Mountain", "Sea", or "Street" using a Convolutional Neural Network (CNN). In other words, this is an image classification problem.
Library
Apart from the recurring libraries we normally use in R, we will also utilize keras. Keras is a high-level neural networks API developed with a focus on enabling fast experimentation.
library(keras) # Deep Learning
library(tidyverse) # data wrangling
library(imager) # image manipulation
library(caret) # model evaluation
library(grid) # display images in a grid
library(gridExtra) # display images in a grid
RS <- 42 # random state constant
Note that we created a variable called RS, which is just a number used to make future random processes reproducible.
Dataset
The data consists of images with 6 different labels: "Buildings", "Forest", "Glacier", "Mountain", "Sea", and "Street". Unlike the previous article, in which the image pixel data had already been transformed into columns of a .csv file, this time we read the images directly using a data generator.
To do this, we need to know the image folder structure, which is displayed as follows.
seg_train
└── seg_train
├── buildings
├── forest
├── glacier
├── mountain
├── sea
└── street
seg_test
└── seg_test
├── buildings
├── forest
├── glacier
├── mountain
├── sea
└── street
Inside each buildings, forest, glacier, mountain, sea, and street subfolder, the corresponding images are saved. As the names suggest, we will use seg_train for model training and seg_test for model validation.
Exploratory Data Analysis
First, we need to locate the parent folder address of each category.
folder_list <- list.files("seg_train/seg_train/")
folder_path <- paste0("seg_train/seg_train/", folder_list, "/")
folder_path
#> [1] "seg_train/seg_train/buildings/" "seg_train/seg_train/forest/" "seg_train/seg_train/glacier/" "seg_train/seg_train/mountain/"
#> [5] "seg_train/seg_train/sea/" "seg_train/seg_train/street/"
Then, make a list of all seg_train image paths from each parent folder.
file_name <-
map(folder_path, function(x) paste0(x, list.files(x))) %>%
unlist()
We can see below that there are 14034 seg_train images in total.
cat("Number of train images:", length(file_name))
#> Number of train images: 14034
As a sanity check, let’s look at a couple of seg_train images.
set.seed(RS)
sample_image <- sample(file_name, 18)
img <- map(sample_image, load.image)
grobs <- lapply(img, rasterGrob)
grid.arrange(grobs=grobs, ncol=6)
Take the first image.
img <- load.image(file_name[1])
img
#> Image. Width: 150 pix Height: 150 pix Depth: 1 Colour channels: 3
As can be seen below, the dimension of this image is 150 × 150 × 1 × 3. This particular image has a width of 150 pixels, a height of 150 pixels, a depth of 1 (imager reserves this dimension for video frames; a still image has depth 1), and 3 color channels (Red, Green, and Blue, also known as RGB).
dim(img)
#> [1] 150 150 1 3
Now, we will build a function to acquire the width and height of an image and apply the function to all images.
get_dim <- function(x){
img <- load.image(x)
df_img <- data.frame(
width = width(img),
height = height(img),
filename = x
)
return(df_img)
}
file_dim <- map_df(file_name, get_dim)
head(file_dim)
#> width height filename
#> 1 150 150 seg_train/seg_train/buildings/0.jpg
#> 2 150 150 seg_train/seg_train/buildings/10006.jpg
#> 3 150 150 seg_train/seg_train/buildings/1001.jpg
#> 4 150 150 seg_train/seg_train/buildings/10014.jpg
#> 5 150 150 seg_train/seg_train/buildings/10018.jpg
#> 6 150 150 seg_train/seg_train/buildings/10029.jpg
We get the following distribution of image widths and heights.
hist(file_dim$width, breaks = 20)
hist(file_dim$height, breaks = 20)
summary(file_dim)
#> width height filename
#> Min. :150 Min. : 76.0 Length:14034
#> 1st Qu.:150 1st Qu.:150.0 Class :character
#> Median :150 Median :150.0 Mode :character
#> Mean :150 Mean :149.9
#> 3rd Qu.:150 3rd Qu.:150.0
#> Max. :150 Max. :150.0
As we can see, the images vary in size. All of the widths are 150 pixels, but the maximum and minimum heights are 150 and 76 pixels, respectively. Before being fed into the model, all of these images must have the same dimensions. This is crucial since:
- the input layer of the model, into which each image’s pixel values are fed, has a fixed number of neurons (see the quick check after this list),
- if the image dimensions are too large, training the model might take too long, and
- if the image dimensions are too small, too much information is lost.
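To make the first point concrete, here is a quick check of how many pixel values a single 150 × 150 RGB image feeds into the network:
prod(c(150, 150, 3)) # input values per 150 x 150 RGB image
#> [1] 67500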
Data Preprocessing
One problem that may occur with neural network models is that they tend to memorize the images in the seg_train dataset, so that when the new seg_test dataset comes in, they can’t recognize them. Data augmentation is one of many techniques to address this problem. Given an image, data augmentation transforms it slightly to create new images, which are then fed into the model. This way, the model sees many versions of the original image and hopefully learns what the image means instead of memorizing it. We will use only some simple transformations:
- randomly flip the image horizontally
- randomly rotate by up to 10 degrees
- randomly zoom by a factor of 0.1
- randomly shift horizontally by up to 0.1 fraction of the total width
- randomly shift vertically by up to 0.1 fraction of the total height
We don’t use vertical flips since, in our case, they can change the meaning of an image. This data augmentation is done with the image_data_generator() function. Save the generator to an object named train_data_gen. Note that train_data_gen is applied only during training; we don’t use it when predicting.
In train_data_gen, we also perform normalization to reduce the effect of differences in illumination. Moreover, CNN models converge faster on [0, 1] data than on [0, 255]. To do this, simply divide each pixel value by 255.
train_data_gen <- image_data_generator(
  rescale = 1/255,          # scale pixel values to [0, 1]
  horizontal_flip = TRUE,   # randomly flip images horizontally
  vertical_flip = FALSE,    # no vertical flips
  rotation_range = 10,      # randomly rotate by up to 10 degrees
  zoom_range = 0.1,         # zoom in or out by up to a factor of 0.1
  width_shift_range = 0.1,  # shift horizontally by up to 10% of the width
  height_shift_range = 0.1  # shift vertically by up to 10% of the height
)
We will use 150 × 150 pixels as the input image shape, since 150 pixels is the most common width and height among all images (look again at the EDA), and store this size as target_size. We will also train the model in batches of 32 observations each, stored as batch_size.
target_size <- c(150, 150)
batch_size <- 32
Now, build generators to flow the train and validation datasets from their respective directories. Fill in the parameters target_size and batch_size. Since we have colored RGB images, set color_mode to "rgb". Lastly, pass train_data_gen as the generator to apply the previously created data augmentation.
# for training dataset
train_image_array_gen <- flow_images_from_directory(
  directory = "seg_train/seg_train/", # folder of the data
  target_size = target_size, # target image dimension
  color_mode = "rgb", # use RGB color
  batch_size = batch_size, # number of images in each batch
  seed = RS, # set random seed
  generator = train_data_gen # apply data augmentation
)
# for validation dataset
# (this reuses train_data_gen mainly for its rescaling; a rescale-only
# generator would avoid augmenting the validation images)
val_image_array_gen <- flow_images_from_directory(
  directory = "seg_test/seg_test/",
  target_size = target_size,
  color_mode = "rgb",
  batch_size = batch_size,
  seed = RS,
  generator = train_data_gen
)
Next, we will check the proportion of labels in the target variable for class imbalance. If there is any, classifiers tend to learn biased models with poorer predictive accuracy on the minority classes than on the majority classes. In the simplest case, this issue can be solved by upsampling or downsampling the training dataset.
output_n <- n_distinct(train_image_array_gen$classes)
table("Frequency" = factor(train_image_array_gen$classes)) %>%
prop.table()
#> Frequency
#> 0 1 2 3 4 5
#> 0.1561208 0.1618213 0.1712983 0.1789939 0.1620351 0.1697307
Luckily, as can be seen above, all classes are relatively balanced! We don’t need further treatment.
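Had the classes been imbalanced, upsampling the training file list could look like the following minimal sketch (purely hypothetical here, since our data needs no such treatment; img_class, n_max, and balanced_files are names introduced for illustration):
img_class <- str_extract(file_name, "buildings|forest|glacier|mountain|sea|street")
n_max <- max(table(img_class)) # size of the largest class
balanced_files <- unlist(lapply(split(file_name, img_class), function(f) {
  sample(f, n_max, replace = TRUE) # resample each class up to the majority size
}))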
Modeling
First, let’s save the number of training and validation images we use. We need validation data separate from the training data because we don’t want a model that is only good at predicting images it has seen; it must also generalize to unseen images. This is exactly why we also track the model’s performance on the validation dataset.
Thus, we can observe below that we have 14034 images for training (as stated before) and 3000 images for validating the model.
train_samples <- train_image_array_gen$n
valid_samples <- val_image_array_gen$n
train_samples
#> [1] 14034
valid_samples
#> [1] 3000
We will build three models progressively from the simplest.
Simple CNN
This model has only 4 hidden layers (including max pooling and flatten) and 1 output layer, as detailed below:
- Convolution layer: filter 16, kernel size 3 × 3, same padding, relu activation function
- Max pooling layer: pool size 2 × 2
- Flatten layer
- Dense layer: 16 nodes, relu activation function
- Dense layer (output): 6 nodes, softmax activation function
Please note that we use a flatten layer as a bridge between the convolutional part of the network and the dense part. As the name suggests, the flatten layer reshapes the output of the last convolution layer into a vector. For example, say the last convolution layer outputs a tensor of size (8, 8, 32), where 32 is the number of filters. The flatten layer reshapes this tensor into a vector of size 8 × 8 × 32 = 2048.
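As a quick illustration of that example (a minimal sketch in base R, using a dummy array):
x <- array(rnorm(8 * 8 * 32), dim = c(8, 8, 32)) # a dummy (8, 8, 32) feature map
length(as.vector(x)) # flattening yields a vector of this length
#> [1] 2048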
In the output layer, we use the softmax activation function because this is a multiclass classification problem. Lastly, we need to specify the image size for the input layer of the CNN. As stated before, we will use a 150 × 150 pixel image size with 3 RGB channels, as stored in target_size.
Now, we are ready.
# Set Initial Random Weight
tensorflow::tf$random$set_seed(RS)
model <- keras_model_sequential(name = "simple_model") %>%
# Convolution Layer
layer_conv_2d(filters = 16,
kernel_size = c(3,3),
padding = "same",
activation = "relu",
input_shape = c(target_size, 3)
) %>%
# Max Pooling Layer
layer_max_pooling_2d(pool_size = c(2,2)) %>%
# Flattening Layer
layer_flatten() %>%
# Dense Layer
layer_dense(units = 16,
activation = "relu") %>%
# Output Layer
layer_dense(units = output_n,
activation = "softmax",
name = "Output")
summary(model)
#> Model: "simple_model"
#> _________________________________________________________________
#> Layer (type) Output Shape Param #
#> =================================================================
#> conv2d (Conv2D) (None, 150, 150, 16) 448
#> _________________________________________________________________
#> max_pooling2d (MaxPooling2D) (None, 75, 75, 16) 0
#> _________________________________________________________________
#> flatten (Flatten) (None, 90000) 0
#> _________________________________________________________________
#> dense (Dense) (None, 16) 1440016
#> _________________________________________________________________
#> Output (Dense) (None, 6) 102
#> =================================================================
#> Total params: 1,440,566
#> Trainable params: 1,440,566
#> Non-trainable params: 0
#> _________________________________________________________________
After building, we compile and train the model. We use categorical crossentropy as the loss function because, again, this is a multiclass classification problem. We use the adam optimizer with the default learning rate of 0.001, since adam is one of the most efficient optimizers there is. We also use accuracy as the metric for simplicity. More importantly, since we don’t prefer any class over the others and every class is balanced, accuracy is favored over precision, sensitivity, or specificity. We will train the model for 10 epochs.
model %>%
compile(
loss = "categorical_crossentropy",
optimizer = optimizer_adam(lr = 0.001),
metrics = "accuracy"
)
# fit data into model
history <- model %>%
fit_generator(
# training data
train_image_array_gen,
# training epochs
steps_per_epoch = as.integer(train_samples / batch_size),
epochs = 10,
# validation data
validation_data = val_image_array_gen,
validation_steps = as.integer(valid_samples / batch_size)
)
plot(history)
The final training and validation accuracies at the tenth epoch are similar and relatively high, which means no overfitting occurred. Next, we will predict on all images in the validation dataset (instead of per batch, as done in training). First, let’s tabulate the path of each image and its corresponding class.
val_data <- data.frame(file_name = paste0("seg_test/seg_test/", val_image_array_gen$filenames)) %>%
mutate(class = str_extract(file_name, "buildings|forest|glacier|mountain|sea|street"))
head(val_data)
#> file_name class
#> 1 seg_test/seg_test/buildings/20057.jpg buildings
#> 2 seg_test/seg_test/buildings/20060.jpg buildings
#> 3 seg_test/seg_test/buildings/20061.jpg buildings
#> 4 seg_test/seg_test/buildings/20064.jpg buildings
#> 5 seg_test/seg_test/buildings/20073.jpg buildings
#> 6 seg_test/seg_test/buildings/20074.jpg buildings
Then, we will convert each image into an array. Don’t forget to normalize the pixel values, that is, divide them by 255.
image_prep <- function(x, target_size) {
  arrays <- lapply(x, function(path) {
    img <- image_load(
      path,
      target_size = target_size,
      grayscale = FALSE # keep all 3 RGB channels
    )
    x <- image_to_array(img) # convert to a numeric array
    x <- array_reshape(x, c(1, dim(x))) # add a batch dimension
    x <- x/255 # normalize pixel values
  })
  # stack all images into one (n, height, width, channels) array
  do.call(abind::abind, c(arrays, list(along = 1)))
}
test_x <- image_prep(val_data$file_name, target_size)
dim(test_x)
#> [1] 3000 150 150 3
Next, predict.
pred_test <- predict_classes(model, test_x)
head(pred_test)
#> [1] 4 0 0 0 4 3
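A side note: predict_classes() has been removed from newer releases of the R keras package (treat the exact version cutoff as an assumption). On such versions, an equivalent is the row-wise argmax of the predicted probabilities:
# equivalent to predict_classes() on newer keras/TensorFlow versions
probs <- predict(model, test_x) # class probabilities, one row per image
pred_test <- max.col(probs) - 1 # 0-based class indices, matching the generator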
Now, decode each prediction into its appropriate class.
decode <- function(x){
  case_when(
    x == 0 ~ "buildings",
    x == 1 ~ "forest",
    x == 2 ~ "glacier",
    x == 3 ~ "mountain",
    x == 4 ~ "sea",
    x == 5 ~ "street"
  )
}
pred_test <- sapply(pred_test, decode)
head(pred_test)
#> [1] "sea" "buildings" "buildings" "buildings" "sea" "mountain"
Finally, analyze the confusion matrix.
cm_simple <- confusionMatrix(as.factor(pred_test), as.factor(val_data$class))
acc_simple <- cm_simple$overall['Accuracy']
cm_simple
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction buildings forest glacier mountain sea street
#> buildings 348 24 14 20 35 106
#> forest 8 418 3 4 4 19
#> glacier 7 5 357 53 38 5
#> mountain 19 6 98 381 61 5
#> sea 13 1 75 65 363 6
#> street 42 20 6 2 9 360
#>
#> Overall Statistics
#>
#> Accuracy : 0.7423
#> 95% CI : (0.7263, 0.7579)
#> No Information Rate : 0.1843
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.6909
#>
#> Mcnemar's Test P-Value : 0.0000000001327
#>
#> Statistics by Class:
#>
#> Class: buildings Class: forest Class: glacier Class: mountain Class: sea Class: street
#> Sensitivity 0.7963 0.8819 0.6456 0.7257 0.7118 0.7186
#> Specificity 0.9224 0.9850 0.9559 0.9236 0.9357 0.9684
#> Pos Pred Value 0.6362 0.9167 0.7677 0.6684 0.6941 0.8200
#> Neg Pred Value 0.9637 0.9780 0.9227 0.9407 0.9407 0.9449
#> Prevalence 0.1457 0.1580 0.1843 0.1750 0.1700 0.1670
#> Detection Rate 0.1160 0.1393 0.1190 0.1270 0.1210 0.1200
#> Detection Prevalence 0.1823 0.1520 0.1550 0.1900 0.1743 0.1463
#> Balanced Accuracy 0.8593 0.9334 0.8007 0.8247 0.8238 0.8435
From the confusion matrix, we can see that the model has a hard time differentiating between the classes. We got 74% accuracy on the validation dataset. There are 106 street images predicted as buildings, more than 20% of all street images. This makes sense, since many street images also contain buildings.
We can improve the model’s performance in various ways, but for now let’s improve it by simply changing the architecture.
Deeper CNN
Now we make a deeper CNN with more convolution layers. Here’s the architecture:
- Block 1: 2 convolution layers and 1 max-pooling layer
- Block 2: 1 convolution layer and 1 max-pooling layer
- Block 3: 1 convolution layer and 1 max-pooling layer
- Block 4: 1 convolution layer and 1 max-pooling layer
- Flatten layer
- One dense layer
- Output layer
tensorflow::tf$random$set_seed(RS)
model_big <- keras_model_sequential(name = "model_big") %>%
# First convolutional layer
layer_conv_2d(filters = 32,
kernel_size = c(5,5), # 5 x 5 filters
padding = "same",
activation = "relu",
input_shape = c(target_size, 3)
) %>%
# Second convolutional layer
layer_conv_2d(filters = 32,
kernel_size = c(3,3), # 3 x 3 filters
padding = "same",
activation = "relu"
) %>%
# Max pooling layer
layer_max_pooling_2d(pool_size = c(2,2)) %>%
# Third convolutional layer
layer_conv_2d(filters = 64,
kernel_size = c(3,3),
padding = "same",
activation = "relu"
) %>%
# Max pooling layer
layer_max_pooling_2d(pool_size = c(2,2)) %>%
# Fourth convolutional layer
layer_conv_2d(filters = 128,
kernel_size = c(3,3),
padding = "same",
activation = "relu"
) %>%
# Max pooling layer
layer_max_pooling_2d(pool_size = c(2,2)) %>%
# Fifth convolutional layer
layer_conv_2d(filters = 256,
kernel_size = c(3,3),
padding = "same",
activation = "relu"
) %>%
# Max pooling layer
layer_max_pooling_2d(pool_size = c(2,2)) %>%
# Flattening layer
layer_flatten() %>%
# Dense layer
layer_dense(units = 64,
activation = "relu") %>%
# Output layer
layer_dense(name = "Output",
units = output_n,
activation = "softmax")
summary(model_big)
#> Model: "model_big"
#> _________________________________________________________________
#> Layer (type) Output Shape Param #
#> =================================================================
#> conv2d_5 (Conv2D) (None, 150, 150, 32) 2432
#> _________________________________________________________________
#> conv2d_4 (Conv2D) (None, 150, 150, 32) 9248
#> _________________________________________________________________
#> max_pooling2d_4 (MaxPooling2D) (None, 75, 75, 32) 0
#> _________________________________________________________________
#> conv2d_3 (Conv2D) (None, 75, 75, 64) 18496
#> _________________________________________________________________
#> max_pooling2d_3 (MaxPooling2D) (None, 37, 37, 64) 0
#> _________________________________________________________________
#> conv2d_2 (Conv2D) (None, 37, 37, 128) 73856
#> _________________________________________________________________
#> max_pooling2d_2 (MaxPooling2D) (None, 18, 18, 128) 0
#> _________________________________________________________________
#> conv2d_1 (Conv2D) (None, 18, 18, 256) 295168
#> _________________________________________________________________
#> max_pooling2d_1 (MaxPooling2D) (None, 9, 9, 256) 0
#> _________________________________________________________________
#> flatten_1 (Flatten) (None, 20736) 0
#> _________________________________________________________________
#> dense_1 (Dense) (None, 64) 1327168
#> _________________________________________________________________
#> Output (Dense) (None, 6) 390
#> =================================================================
#> Total params: 1,726,758
#> Trainable params: 1,726,758
#> Non-trainable params: 0
#> _________________________________________________________________
The rest is the same as previously done.
model_big %>%
compile(
loss = "categorical_crossentropy",
optimizer = optimizer_adam(lr = 0.001),
metrics = "accuracy"
)
history <- model_big %>%
fit_generator(
train_image_array_gen,
steps_per_epoch = as.integer(train_samples / batch_size),
epochs = 10,
validation_data = val_image_array_gen,
validation_steps = as.integer(valid_samples / batch_size)
)
plot(history)
pred_test <- predict_classes(model_big, test_x)
pred_test <- sapply(pred_test, decode)
cm_big <- confusionMatrix(as.factor(pred_test), as.factor(val_data$class))
acc_big <- cm_big$overall['Accuracy']
cm_big
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction buildings forest glacier mountain sea street
#> buildings 390 3 24 24 11 34
#> forest 3 465 11 7 8 11
#> glacier 2 0 367 35 9 1
#> mountain 0 2 82 415 17 1
#> sea 3 1 57 42 461 6
#> street 39 3 12 2 4 448
#>
#> Overall Statistics
#>
#> Accuracy : 0.8487
#> 95% CI : (0.8353, 0.8613)
#> No Information Rate : 0.1843
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.8185
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Statistics by Class:
#>
#> Class: buildings Class: forest Class: glacier Class: mountain Class: sea Class: street
#> Sensitivity 0.8924 0.9810 0.6637 0.7905 0.9039 0.8942
#> Specificity 0.9625 0.9842 0.9808 0.9588 0.9562 0.9760
#> Pos Pred Value 0.8025 0.9208 0.8865 0.8027 0.8088 0.8819
#> Neg Pred Value 0.9813 0.9964 0.9281 0.9557 0.9798 0.9787
#> Prevalence 0.1457 0.1580 0.1843 0.1750 0.1700 0.1670
#> Detection Rate 0.1300 0.1550 0.1223 0.1383 0.1537 0.1493
#> Detection Prevalence 0.1620 0.1683 0.1380 0.1723 0.1900 0.1693
#> Balanced Accuracy 0.9275 0.9826 0.8222 0.8746 0.9301 0.9351
This result is better overall than the earlier model’s, since model_big is more complex and hence able to capture more features. We got 85% accuracy on the validation dataset. While predictions for street images have gotten better, predictions for glacier images are still all over the place.
Deeper CNN with Pretrained Weights
Researchers have already developed many models for image classification problems, from the VGG model family to one of the latest state-of-the-art models, EfficientNet, developed by Google. For the purpose of learning, in this section we will use the VGG16 model, as it is one of the simplest: it consists only of the convolution layers, max-pooling layers, and dense layers introduced before. This process is called transfer learning, which transfers the knowledge of pre-trained models to solve our problem.
The original VGG16 model was trained on 1000 classes. To make it suitable for our problem, we will exclude the top (dense) layers of the model and plug in our own predictive layers, consisting of one global average pooling layer (as a replacement for the flatten layer), one dense layer with 64 nodes, and one output layer with 6 nodes (one per class).
Let’s see the overall architecture.
# load original model without top layers
input_tensor <- layer_input(shape = c(target_size, 3))
base_model <- application_vgg16(input_tensor = input_tensor,
weights = 'imagenet',
include_top = FALSE)
# add our custom layers
predictions <- base_model$output %>%
layer_global_average_pooling_2d() %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = output_n, activation = 'softmax')
# this is the model we will train
vgg16 <- keras_model(inputs = base_model$input, outputs = predictions)
summary(vgg16)
#> Model: "model"
#> _________________________________________________________________
#> Layer (type) Output Shape Param #
#> =================================================================
#> input_1 (InputLayer) [(None, 150, 150, 3)] 0
#> _________________________________________________________________
#> block1_conv1 (Conv2D) (None, 150, 150, 64) 1792
#> _________________________________________________________________
#> block1_conv2 (Conv2D) (None, 150, 150, 64) 36928
#> _________________________________________________________________
#> block1_pool (MaxPooling2D) (None, 75, 75, 64) 0
#> _________________________________________________________________
#> block2_conv1 (Conv2D) (None, 75, 75, 128) 73856
#> _________________________________________________________________
#> block2_conv2 (Conv2D) (None, 75, 75, 128) 147584
#> _________________________________________________________________
#> block2_pool (MaxPooling2D) (None, 37, 37, 128) 0
#> _________________________________________________________________
#> block3_conv1 (Conv2D) (None, 37, 37, 256) 295168
#> _________________________________________________________________
#> block3_conv2 (Conv2D) (None, 37, 37, 256) 590080
#> _________________________________________________________________
#> block3_conv3 (Conv2D) (None, 37, 37, 256) 590080
#> _________________________________________________________________
#> block3_pool (MaxPooling2D) (None, 18, 18, 256) 0
#> _________________________________________________________________
#> block4_conv1 (Conv2D) (None, 18, 18, 512) 1180160
#> _________________________________________________________________
#> block4_conv2 (Conv2D) (None, 18, 18, 512) 2359808
#> _________________________________________________________________
#> block4_conv3 (Conv2D) (None, 18, 18, 512) 2359808
#> _________________________________________________________________
#> block4_pool (MaxPooling2D) (None, 9, 9, 512) 0
#> _________________________________________________________________
#> block5_conv1 (Conv2D) (None, 9, 9, 512) 2359808
#> _________________________________________________________________
#> block5_conv2 (Conv2D) (None, 9, 9, 512) 2359808
#> _________________________________________________________________
#> block5_conv3 (Conv2D) (None, 9, 9, 512) 2359808
#> _________________________________________________________________
#> block5_pool (MaxPooling2D) (None, 4, 4, 512) 0
#> _________________________________________________________________
#> global_average_pooling2d (GlobalAveragePooling2D) (None, 512) 0
#> _________________________________________________________________
#> dense_3 (Dense) (None, 64) 32832
#> _________________________________________________________________
#> dense_2 (Dense) (None, 6) 390
#> =================================================================
#> Total params: 14,747,910
#> Trainable params: 14,747,910
#> Non-trainable params: 0
#> _________________________________________________________________
Wow, that’s a lot of layers! We could directly use vgg16 for training and predicting, but again, for the purpose of learning, let’s build the VGG16 architecture from scratch ourselves.
model_bigger <- keras_model_sequential(name = "model_bigger") %>%
# Block 1
layer_conv_2d(filters = 64,
kernel_size = c(3, 3),
activation='relu',
padding='same',
input_shape = c(94, 94, 3), # note: the original uses 94 x 94 here; c(target_size, 3) would match our 150 x 150 generators
name='block1_conv1') %>%
layer_conv_2d(filters = 64,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block1_conv2') %>%
layer_max_pooling_2d(pool_size = c(2, 2),
strides=c(2, 2),
name='block1_pool') %>%
# Block 2
layer_conv_2d(filters = 128,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block2_conv1') %>%
layer_conv_2d(filters = 128,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block2_conv2') %>%
layer_max_pooling_2d(pool_size = c(2, 2),
strides=c(2, 2),
name='block2_pool') %>%
# Block 3
layer_conv_2d(filters = 256,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block3_conv1') %>%
layer_conv_2d(filters = 256,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block3_conv2') %>%
layer_conv_2d(filters = 256,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block3_conv3') %>%
layer_max_pooling_2d(pool_size = c(2, 2),
strides=c(2, 2),
name='block3_pool') %>%
# Block 4
layer_conv_2d(filters = 512,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block4_conv1') %>%
layer_conv_2d(filters = 512,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block4_conv2') %>%
layer_conv_2d(filters = 512,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block4_conv3') %>%
layer_max_pooling_2d(pool_size = c(2, 2),
strides=c(2, 2),
name='block4_pool') %>%
# Block 5
layer_conv_2d(filters = 512,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block5_conv1') %>%
layer_conv_2d(filters = 512,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block5_conv2') %>%
layer_conv_2d(filters = 512,
kernel_size = c(3, 3),
activation='relu',
padding='same',
name='block5_conv3') %>%
layer_max_pooling_2d(pool_size = c(2, 2),
strides=c(2, 2),
name='block5_pool') %>%
# Dense
layer_global_average_pooling_2d() %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = output_n, activation = 'softmax')
model_bigger
#> Model
#> Model: "model_bigger"
#> _________________________________________________________________
#> Layer (type) Output Shape Param #
#> =================================================================
#> block1_conv1 (Conv2D) (None, 94, 94, 64) 1792
#> _________________________________________________________________
#> block1_conv2 (Conv2D) (None, 94, 94, 64) 36928
#> _________________________________________________________________
#> block1_pool (MaxPooling2D) (None, 47, 47, 64) 0
#> _________________________________________________________________
#> block2_conv1 (Conv2D) (None, 47, 47, 128) 73856
#> _________________________________________________________________
#> block2_conv2 (Conv2D) (None, 47, 47, 128) 147584
#> _________________________________________________________________
#> block2_pool (MaxPooling2D) (None, 23, 23, 128) 0
#> _________________________________________________________________
#> block3_conv1 (Conv2D) (None, 23, 23, 256) 295168
#> _________________________________________________________________
#> block3_conv2 (Conv2D) (None, 23, 23, 256) 590080
#> _________________________________________________________________
#> block3_conv3 (Conv2D) (None, 23, 23, 256) 590080
#> _________________________________________________________________
#> block3_pool (MaxPooling2D) (None, 11, 11, 256) 0
#> _________________________________________________________________
#> block4_conv1 (Conv2D) (None, 11, 11, 512) 1180160
#> _________________________________________________________________
#> block4_conv2 (Conv2D) (None, 11, 11, 512) 2359808
#> _________________________________________________________________
#> block4_conv3 (Conv2D) (None, 11, 11, 512) 2359808
#> _________________________________________________________________
#> block4_pool (MaxPooling2D) (None, 5, 5, 512) 0
#> _________________________________________________________________
#> block5_conv1 (Conv2D) (None, 5, 5, 512) 2359808
#> _________________________________________________________________
#> block5_conv2 (Conv2D) (None, 5, 5, 512) 2359808
#> _________________________________________________________________
#> block5_conv3 (Conv2D) (None, 5, 5, 512) 2359808
#> _________________________________________________________________
#> block5_pool (MaxPooling2D) (None, 2, 2, 512) 0
#> _________________________________________________________________
#> global_average_pooling2d_1 (GlobalAveragePooling2D) (None, 512) 0
#> _________________________________________________________________
#> dense_5 (Dense) (None, 64) 32832
#> _________________________________________________________________
#> dense_4 (Dense) (None, 6) 390
#> =================================================================
#> Total params: 14,747,910
#> Trainable params: 14,747,910
#> Non-trainable params: 0
#> _________________________________________________________________
Note that model_bigger has exactly the same number of parameters per layer as vgg16. The advantage of transfer learning is that we don’t have to start training from random weights; instead, we start from the pre-trained weights of the original model. These weights are already optimized for image classification problems, and we only need to fine-tune them for our purpose. Hence the metaphor:
We are standing on the shoulders of giants.
That being said, let’s assign all the weights of vgg16 to model_bigger.
set_weights(model_bigger, get_weights(vgg16))
Here’s the summary of our model_bigger layers:
layers <- model_bigger$layers
for (i in 1:length(layers))
cat(i, layers[[i]]$name, "\n")
#> 1 block1_conv1
#> 2 block1_conv2
#> 3 block1_pool
#> 4 block2_conv1
#> 5 block2_conv2
#> 6 block2_pool
#> 7 block3_conv1
#> 8 block3_conv2
#> 9 block3_conv3
#> 10 block3_pool
#> 11 block4_conv1
#> 12 block4_conv2
#> 13 block4_conv3
#> 14 block4_pool
#> 15 block5_conv1
#> 16 block5_conv2
#> 17 block5_conv3
#> 18 block5_pool
#> 19 global_average_pooling2d_1
#> 20 dense_5
#> 21 dense_4
Note that layers 19–21 still have random weights, since we created them ourselves and they don’t come from the original model. We need to train these layers first, so we freeze all the layers before them.
freeze_weights(model_bigger, from = 1, to = 18)
To train those predictive layers, we will simply use the previous settings.
# compile the model (should be done *after* setting base layers to non-trainable)
model_bigger %>% compile(loss = "categorical_crossentropy",
optimizer = optimizer_adam(lr = 0.001),
metrics = "accuracy")
history <- model_bigger %>%
fit_generator(
train_image_array_gen,
steps_per_epoch = as.integer(train_samples / batch_size),
epochs = 10,
validation_data = val_image_array_gen,
validation_steps = as.integer(valid_samples / batch_size)
)
Now, fine-tune the model. To do this, we apply a low learning rate to the optimizer so the established pre-trained weights don’t get disrupted. We will use a learning rate of 0.00001. Also, to save time, we train the model for only 4 epochs.
Before fine-tuning, don’t forget to unfreeze the layers to be trained. In our case, we will unfreeze all layers.
unfreeze_weights(model_bigger)
# compile again with low learning rate
model_bigger %>% compile(loss = "categorical_crossentropy",
optimizer = optimizer_adam(lr = 0.00001),
metrics = "accuracy")
history <- model_bigger %>%
fit_generator(
train_image_array_gen,
steps_per_epoch = as.integer(train_samples / batch_size),
epochs = 4,
validation_data = val_image_array_gen,
validation_steps = as.integer(valid_samples / batch_size)
)
plot(history)
pred_test <- predict_classes(model_bigger, test_x)
pred_test <- sapply(pred_test, decode)
cm_bigger <- confusionMatrix(as.factor(pred_test), as.factor(val_data$class))
acc_bigger <- cm_bigger$overall['Accuracy']
cm_bigger
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction buildings forest glacier mountain sea street
#> buildings 396 0 2 1 2 13
#> forest 1 469 2 2 4 0
#> glacier 1 2 479 61 5 0
#> mountain 0 0 50 452 4 0
#> sea 1 1 16 7 492 2
#> street 38 2 4 2 3 486
#>
#> Overall Statistics
#>
#> Accuracy : 0.9247
#> 95% CI : (0.9146, 0.9339)
#> No Information Rate : 0.1843
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.9095
#>
#> Mcnemar's Test P-Value : 0.00281
#>
#> Statistics by Class:
#>
#> Class: buildings Class: forest Class: glacier Class: mountain Class: sea Class: street
#> Sensitivity 0.9062 0.9895 0.8662 0.8610 0.9647 0.9701
#> Specificity 0.9930 0.9964 0.9718 0.9782 0.9892 0.9804
#> Pos Pred Value 0.9565 0.9812 0.8741 0.8933 0.9480 0.9084
#> Neg Pred Value 0.9841 0.9980 0.9698 0.9707 0.9927 0.9939
#> Prevalence 0.1457 0.1580 0.1843 0.1750 0.1700 0.1670
#> Detection Rate 0.1320 0.1563 0.1597 0.1507 0.1640 0.1620
#> Detection Prevalence 0.1380 0.1593 0.1827 0.1687 0.1730 0.1783
#> Balanced Accuracy 0.9496 0.9929 0.9190 0.9196 0.9769 0.9752
We’ve found the winner: model_bigger achieves 92% accuracy on the validation dataset! Still, there are some misclassifications, as no model is perfect. Here’s a summary of the predictions:
- Some buildings are mistakenly predicted as streets, and vice versa. Again, this is because some building images also contain streets, which confuses the model.
- Forests are predicted almost perfectly.
- Many glaciers are predicted as mountains and seas, and many mountains are predicted as glaciers. This may stem from the similar blue tones in glacier, mountain, and sea images.
- Good predictions on seas.
Conclusion
rbind(
"Simple CNN" = acc_simple,
"Deeper CNN" = acc_big,
"Fine-tuned VGG16" = acc_bigger
)
#> Accuracy
#> Simple CNN 0.7423333
#> Deeper CNN 0.8486667
#> Fine-tuned VGG16 0.9246667
We’ve successfully performed image classification with 6 classes: "Buildings", "Forest", "Glacier", "Mountain", "Sea", and "Street". Since images are unstructured data, this problem is well suited to a neural network, which performs feature extraction automatically without human intervention. For better performance, we used a Convolutional Neural Network for feature extraction, followed by dense layers for prediction. In the end, the VGG16 model with ImageNet weight initialization achieved 92% accuracy.
With image classification, companies that have websites with large visual databases can easily organize and categorize their database because it allows for automatic classification of images in large quantities. This helps them monetize their visual content without investing countless hours in manual sorting and tagging.