Part 2 – How I Tackled My First Kaggle Challenge Using Deep Learning

This is Part 2 of how I tackled the State Farm Distracted Driver Detection Challenge on Kaggle. A lot of the work here is based on lessons and tutorials from the excellent fast.ai course taught by Jeremy Howard and Rachel Thomas.

I am particularly interested in this challenge as I am currently building KamCar, an AI-powered dash cam mobile app to make driving a safer and richer experience. You can read Part 1 HERE.

Driver talking on the phone while driving

We left off in Part 1 after applying some augmentations to the sample training set and reaching 50% accuracy on the validation set (i.e. Step 5). Now we embark on the remainder of our Deep Learning adventures with a much bigger dataset as our ally.

Step 6 – Invite the whole dataset

While experimenting with the sample set, I also came to understand that it would be much harder to drastically increase accuracy without training the model on a much richer dataset. The natural next move is therefore to load the whole training and validation sets and train the model on the former.
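
In code, this just means pointing our batch loader at the full training and validation directories rather than the sample ones. A minimal sketch, reusing the load_in_batches helper that appears later in this post (the validation directory name is an assumption about my setup):

# Load the full training and validation sets
train_batches = load_in_batches(train_dir, shuffle=True, batch_size=img_batch_size)
val_batches = load_in_batches(valid_dir, shuffle=False, batch_size=img_batch_size)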

Step 7 – Drop some of these weights

I planned to use roughly the same small CNN employed in Part 1, but this time I introduce Dropout, which randomly "drops" (i.e. zeroes out) a fraction of the units in our dense layers during training. I use Dropout because it prevents the network from relying too heavily on particular features it computed on the training set, thereby reducing overfitting of the model on training data. Without Dropout, the features our model develops may not be generic enough to work well on new data (i.e. the validation or test sets).

# Keras 1 imports (assumed; img_size_1D and num_classes are defined earlier)
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization
from keras.optimizers import Adam

def conv_model_with_dropout():
    model = Sequential([
            # Normalise the channels-first (3, H, W) input images
            BatchNormalization(axis=1, input_shape=(3,img_size_1D,img_size_1D)),
            Convolution2D(64,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D(),
            Convolution2D(128,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D(),
            Convolution2D(256,3,3, activation='relu'),
            BatchNormalization(axis=1),
            MaxPooling2D(),
            Flatten(),
            Dense(200, activation='relu'),
            BatchNormalization(),
            Dropout(0.5),
            Dense(200, activation='relu'),
            BatchNormalization(),
            # Using a simple dropout of 50%
            Dropout(0.5),
            Dense(num_classes, activation='softmax')
        ])
    model.compile(Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
    return model
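
The model is then trained on the full set with Keras 1's fit_generator, validating after every epoch. A sketch of that call, reconstructed from the training log below:

model = conv_model_with_dropout()
# One full pass over the ~18k training images per epoch
model.fit_generator(train_batches, samples_per_epoch=train_batches.N, nb_epoch=10,
                    validation_data=val_batches, nb_val_samples=val_batches.N)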

After a few runs, performance is hovering at around 60%–62% accuracy on the validation set, without making much progress:

Epoch 1/10
17990/17990 [==============================] - 521s - loss: 0.7413 - acc: 0.7560 - val_loss: 1.3689 - val_acc: 0.6087
Epoch 2/10
17990/17990 [==============================] - 520s - loss: 0.7123 - acc: 0.7665 - val_loss: 1.3105 - val_acc: 0.6238
Epoch 3/10
17990/17990 [==============================] - 520s - loss: 0.6663 - acc: 0.7815 - val_loss: 1.2932 - val_acc: 0.6283
Epoch 4/10
17990/17990 [==============================] - 520s - loss: 0.6253 - acc: 0.7973 - val_loss: 1.2504 - val_acc: 0.6245

Time to move on to something much more powerful!

Step 8 – Don’t be a hero: let Transfer Learning rescue you

As you can see, all those models I created previously were quite straightforward and not very deep. I could certainly create a much deeper architecture, but would most likely need to train it for a very long time on a massive dataset (much bigger than what I currently have) to be able to significantly improve accuracy. Plus I would not be able to afford the humongous AWS bill that 2 weeks of training would cost me.

Fortunately, many awesome researchers have made our lives easier by training deep learning models on millions of images, saving the final weights of their models, and making their results available. In his CS231n course, Andrej Karpathy, one of the most active researchers in this field, recommends using what’s already out there instead of reinventing a costly wheel:

"don’t be a hero": Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch.

Taking an already pre-trained model and fine-tuning it with our own training data is referred to as Transfer Learning. For this challenge, I will be using a pre-trained VGG network, which placed among the top entries in the ImageNet ILSVRC-2014 competition.

VGG Architecture – From Toronto University

Step 9 – Keep those convolutional layers close

As a reminder, these are some of the reasons why I decided to use a pre-trained VGG network:

  • We have relatively little training data and therefore could not train, from scratch, a model that generalises well
  • VGG has been trained on millions of images
  • The features extracted in the convolutional layers of VGG should be generic enough to apply to images of humans
  • VGG has more convolutional layers than my makeshift neural network above, and my assumption is that we will get higher accuracy

The most important components of a convolutional neural network are its convolutional layers (Matthew Zeiler & Rob Fergus cover this in the paper Visualizing and Understanding Convolutional Networks): they learn to recognise basic features such as edges in the initial layers, and increasingly complex features in the upper layers.

Visualisation of early layers of a CNN – from Karpathy’s blog

We therefore need to extract the convolutional layers from our pre-trained VGG network as we will be creating our own dense layers later on:

# This will import the full VGG model along with its weights
# (Vgg16 is the wrapper class from the fast.ai course's vgg16.py)
vgg = Vgg16()
model = vgg.model
# Find the index of the last convolutional layer
last_conv_idx = [i for i,l in enumerate(model.layers) if type(l) is Convolution2D][-1]
conv_layers = model.layers[:last_conv_idx + 1]
# Make sure those convolutional layers' weights remain fixed. We are purely using them for prediction, not training.
for layer in conv_layers: layer.trainable = False
# Create a model now out of the convolutional layers
conv_model = Sequential(conv_layers)

Step 10 – Pre-compute your convolutional features

Remember from the diagram of the VGG architecture that the last convolutional layer produces an output of dimensions 512 x 14 x 14: that’s precisely what our convolution-only model above will produce. What we need, then, is to pre-compute the training, validation, and test features with the convolution-only model, and then feed them to the dense model. In other words, we are using the output of our convolution-only model as the input to our dense-only model.

# The shuffle flag must be set to False when pre-computing features
train_batches = load_in_batches(train_dir, shuffle=False, batch_size=img_batch_size)
val_batches = load_in_batches(valid_dir, shuffle=False, batch_size=img_batch_size)  # valid_dir assumed
# Running predictions on the conv_model only
conv_train_feat = conv_model.predict_generator(train_batches, train_batches.N)
conv_val_feat = conv_model.predict_generator(val_batches, val_batches.N)
# Pre-compute the conv features on the test set as well
test_batches = load_in_batches(test_dir, shuffle=False, batch_size=img_batch_size)
conv_test_feat = conv_model.predict_generator(test_batches, test_batches.N)

We load our images while making sure the shuffle flag is set to False; otherwise our pre-computed features and labels would no longer match.
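
For reference, the load_in_batches helper is not shown in this post; it is presumably a thin wrapper around Keras' flow_from_directory. A minimal version matching the way it is called above (the body below is an assumption, not the exact implementation):

from keras.preprocessing.image import ImageDataGenerator

def load_in_batches(dir_path, shuffle=True, batch_size=64, target_size=(224, 224)):
    # Yields (images, one-hot labels) batches, one sub-folder per class under dir_path
    return ImageDataGenerator().flow_from_directory(dir_path, target_size=target_size,
                                                    class_mode='categorical',
                                                    shuffle=shuffle, batch_size=batch_size)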

Step 11 – Make the dense layer your own

Recall that the VGG network produces an output containing 1000 classes. That’s because the ImageNet competition it was trained for required classifying images into 1000 different classes. But in this Kaggle challenge we are only dealing with 10 classes, so the original VGG dense layers are inadequate for the problem we are trying to solve.

We therefore need to create a suitable dense layer architecture that ultimately produces predictions for the 10 classes we expect, using softmax as the activation function of the last dense layer. Remember that we are using the output of the convolution-only model as the input to our bespoke dense model. We also introduce Dropout at this stage to limit overfitting on the training set.

def get_dense_layers(dropout_rate = 0.5, dense_layer_size = 256):
    return [
        # The input to MaxPooling is the output of the last convolutional layer (excluding the batch dimension): 512 x 14 x 14
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dropout(dropout_rate),
        Dense(dense_layer_size, activation='relu'),
        BatchNormalization(),
        Dropout(dropout_rate),
        Dense(dense_layer_size, activation='relu'),
        BatchNormalization(),
        Dropout(dropout_rate),
        # num_classes is set to 10
        Dense(num_classes, activation='softmax')
        ]

We can now train our dense model using the pre-computed features of the convolution-only model from step 10:

d_rate = 0.6
dense_model = Sequential(get_dense_layers(dropout_rate = d_rate))
# We can use a more aggressive learning rate right off the bat, as the input features come from a pre-trained network
dense_model.compile(Adam(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
# See how we call fit and not fit_generator, as our array of features has already been pre-computed
dense_model.fit(conv_train_feat, trn_labels, batch_size=img_batch_size, nb_epoch=3, 
               validation_data=(conv_val_feat, val_labels), verbose=1)

This enables us to attain a validation accuracy of around 70%:

Train on 17990 samples, validate on 4434 samples
Epoch 1/3
17990/17990 [==============================] - 9s - loss: 0.9238 - acc: 0.7178 - val_loss: 1.0323 - val_acc: 0.6340
Epoch 2/3
17990/17990 [==============================] - 9s - loss: 0.1435 - acc: 0.9565 - val_loss: 0.8463 - val_acc: 0.7352
Epoch 3/3
17990/17990 [==============================] - 9s - loss: 0.0802 - acc: 0.9748 - val_loss: 0.9537 - val_acc: 0.7082

Doctor House’s nod of approval

While this is a significant improvement from where I started, we can actually do even better: I have not yet introduced data augmentation into this battle, nor used the unsupervised techniques and other tricks I will cover in Part 3. That part is going to be really exciting, so don’t miss out!

Stay tuned, share if you like and don’t hesitate to leave a comment :).


_I am building KamCar, an AI-powered dash cam app to make driving a safer and richer experience. If you are a mobile developer who wants to work on exciting tech and a product with REAL impact, or just someone who wants to offer advice, hit me up on Twitter or here!_

