Building a deep learning model to judge if you are at risk.

Predict vehicle collisions moments before they happen using CNNs + LSTMs and Carla!

Paarvendhan
Towards Data Science


Photo by Author

The project combines CNNs and LSTMs to predict whether a vehicle is on a collision course, using a series of images from the moments before it happens. CNNs are good at image understanding, but without a sequence relation between images, we miss out on a lot of temporal information about how a series of events can lead to an incident.

Note:

This post assumes a basic understanding of CNNs and LSTMs. You don't have to read the entire thing; this story is more about explaining the challenges faced and the various experiments and optimizations done while building the project, so you can be selective about what you read. Treat it as a starting point or a guide for solving the problem you are facing. Not all knowledge is useful.

If you're in the "show me the code" gang, go here.

Environment

A solid simulation environment is needed to collect data. Carla is driving-simulation software that provides environment-level control. Since we need to replay the moments leading up to an accident, Carla's ability to raise red flags whenever a violation or collision occurs is essential. It also lets us use different kinds of agents, from naive to expert ones. This makes Carla a good choice for collecting data. In addition, we get information about the vehicle, climate, street objects, traffic levels, speed and more, which can be vital in complex systems.
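As a rough illustration, here is a minimal sketch of connecting to a running server and putting a vehicle on autopilot using the CARLA 0.9.x Python API; the host, port and blueprint filter are assumptions, not the project's exact setup:

```python
# Minimal sketch (CARLA 0.9.x Python API, assumed): connect to a running
# server and spawn a vehicle on autopilot. Host/port and the blueprint
# filter are illustrative assumptions.
import carla

client = carla.Client("localhost", 2000)   # default CARLA port
client.set_timeout(10.0)
world = client.get_world()

blueprint_library = world.get_blueprint_library()
vehicle_bp = blueprint_library.filter("vehicle.*")[0]        # any vehicle
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)
vehicle.set_autopilot(True)   # hand control to CARLA's built-in pilot
```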

The project is entirely built on Python and TensorFlow Keras.

The project can be divided into three stages: Collecting Data, Creating the appropriate Network and Training.

1 Data Collection

1.1 Carla Challenges

Carla is graphics-intensive software and, surprisingly, did not have a headless mode (on Windows), because of which the system became really slow after a certain point and crashed often.

  • Carla's graphics settings were changed to the lowest possible resolution by editing its configuration files.
  • The frame rate was reduced to 12 fps to decrease the load on the machine, which allowed the program to collect the entire dataset without crashes.

1.2 Custom Scripts

Carla has a Python API that helps you create custom agents to drive around. It also provides expert agents that navigate the map perfectly.

  • First, a naive agent drives the car around the city and takes a photo every 4 frames. We use a naive agent so that we can capture more accidents and violations.
  • When an accident occurs or a violation is committed, Carla raises a red flag, and from that time step we can take the previous 15 images of the episode. Looking at the series, each image brings us one step closer to the accident as time moves forward.
  • Collecting the data for uniform driving is easy: we just use the autopilot API provided by Carla to drive around and take pictures at the same capture rate for consistency.
  • When a collision occurs, the program keeps only the last 15 time steps and automatically deletes all earlier images of the episode to avoid data overflow (a sketch of this capture logic follows the list).
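Continuing from the connection sketch above, a rough sketch of this capture logic: a bounded buffer keeps only the most recent 15 frames, and a collision sensor flushes it to disk. The sensor blueprint ids follow the CARLA 0.9.x API; save_episode is a hypothetical helper:

```python
# Sketch of the last-15-frames buffer, continuing from the connection sketch.
# Sensor blueprint ids follow the CARLA 0.9.x API; save_episode is a
# hypothetical helper that writes the frames out.
import carla
from collections import deque

frame_buffer = deque(maxlen=15)   # older frames fall off automatically

camera_bp = blueprint_library.find("sensor.camera.rgb")
camera_bp.set_attribute("image_size_x", "420")
camera_bp.set_attribute("image_size_y", "280")
camera = world.spawn_actor(camera_bp, carla.Transform(), attach_to=vehicle)
camera.listen(lambda image: frame_buffer.append(image))   # every rendered frame

collision_bp = blueprint_library.find("sensor.other.collision")
collision_sensor = world.spawn_actor(collision_bp, carla.Transform(),
                                     attach_to=vehicle)

def on_collision(event):
    # the buffer now holds the 15 frames leading up to the impact
    save_episode(list(frame_buffer), label="risky")

collision_sensor.listen(on_collision)
```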

1.3 Handling the data:

The collected data had a unique structure and was hard to handle. The total amount collected was around 40 GB, which was hard to move around or load into memory, not only because of the sheer size but also because of the number of files (210,000). Each sample consisted of the 15 time-step images before the incident, each at a resolution of 420 x 280 pixels.

Photo by Author
  • NumPy's binary format is used, with an integer datatype, for efficient storage.
  • The number of time steps was reduced from 15 to 8, and the resolution was reduced to 210 x 140.
  • The images are stored in batches of 8 episodes per file to reduce the number of files and speed up reading.
  • Together, these measures brought the overall data size down to 8 GB and made the logistics faster (a storage sketch follows the list).
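A minimal sketch of that batched storage, assuming uint8 pixels and eight episodes of shape (8, 140, 210, 3) per file; the .npz layout and function names are my own, not the project's exact format:

```python
import numpy as np

def save_batch(episodes, labels, path):
    """episodes: list of 8 arrays, each (8, 140, 210, 3); labels: list of 8 ints."""
    frames = np.stack(episodes).astype(np.uint8)        # (8, 8, 140, 210, 3)
    np.savez(path, frames=frames, labels=np.asarray(labels, dtype=np.uint8))

def load_batch(path):
    data = np.load(path)
    return data["frames"], data["labels"]
```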

A total of 7000 (episodes) x 2 (classes) = 14,000 samples were collected, each containing 8 images, for training the network. Data augmentation is not applied because of the nature of the environment. The samples were collected across the various towns and environmental conditions available in Carla to make the model robust across different conditions.

2 Model Architecture

Photo by Author

It is well known that RNNs work best for problems that come as sequences or time series, and plain RNN units are replaced by LSTMs for the usual reasons (they handle long-range dependencies far better). We could encode the given information directly as a time series, but that works well only for simple data like number sequences; for complex data forms such as images, we need a richer embedding method.

  • For images, we can use convolutional neural networks. CNNs are a proven way to extract spatial information. We use standard CNN architectures to extract features from the images, so for each image in the series we get a fixed-size feature vector that can be passed into its respective LSTM time-step cell.
  • But we don't need a separate CNN for every image. 3D CNNs can handle this kind of data, but they are better suited to learning spatial correlations than temporal ones.
  • So we can wrap our CNN layers in the TimeDistributed wrapper to spread a single CNN across the time steps. This works much better than 3D CNNs for extracting temporal correlation because one shared CNN learns the image features across all time steps.
  • Once these features are extracted as embedding vectors, they are passed into their respective LSTM cells at each time step.
  • The LSTM encoder's output is then passed to the fully connected layers, which learn the classification task.
  • The figure shows the initially proposed network; for the convolutional block, a full VGG network is used to produce the image embeddings. The CNN is distributed in time to perform the convolution for each time step. Though the network seemed logically correct at first, it went through many corrections to achieve the results shown (a minimal Keras sketch follows the list).
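As a rough illustration of the architecture above, a minimal Keras sketch: a shared CNN wrapped in TimeDistributed feeding an LSTM and a small classifier head. The layer sizes are illustrative, not the exact filters tuned in the project:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(time_steps=8, height=140, width=210, channels=3):
    # shared CNN: one set of weights applied to every frame in the sequence
    cnn = models.Sequential([
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),   # fixed-size embedding per frame
    ])

    # channels-last here (the Keras default); the post lists channels-first order
    inputs = layers.Input(shape=(time_steps, height, width, channels))
    x = layers.TimeDistributed(cnn)(inputs)   # one embedding per time step
    x = layers.LSTM(128)(x)                   # temporal encoding
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)   # safe vs. risky
    return models.Model(inputs, outputs)
```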

2.1 Training Issues

  • Training ran into memory-exhaustion errors even though the network itself is relatively small, because distributing the network in time multiplies the memory needed at runtime.
  • Sometimes training would run for a while and then stop after hitting an upper limit; this is due to the data size and the time-distribution operation applied at every CNN layer.
  • So the batch size was reduced, and network parameters such as filter sizes and embedding sizes were carefully adjusted after a lot of experimentation.

2.2 Overfitting

  • Overfitting has always been the Achilles' heel of deep learning, so initially I was not surprised that the model overfitted. The network immediately overfitted the given data, hitting 100 percent accuracy even in epoch 1. Naturally, the results on the test data were really bad.
  • So I went through the standard anti-overfitting procedure: checking the data, checking for biases, adding batch normalization, etc.
  • But the problem, as it turned out, was unique to this kind of architecture. The network overfitted because, for each global update step, the CNN part of the network is updated 8 times (once per time step).
  • This means the CNN part of the network was learning too much about the images while the LSTM layers after it could not keep up.
  • The solution was to reduce the size of the VGG network. The CNN was cut to almost half its original size, and the filters were carefully chosen after repeated experimentation with various filter counts and combinations.
  • Also, using two fully connected layers with dropout instead of the proposed single layer gave significant performance improvements, bringing accuracy to 65-71% (a sketch of this head follows the list).
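A sketch of that heavier head, replacing the single-layer head in the model sketch above; the widths and dropout rate are assumptions:

```python
from tensorflow.keras import layers

def classifier_head(x, rate=0.5):
    # two fully connected layers with dropout instead of the single layer;
    # widths and dropout rate are illustrative assumptions
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(rate)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(rate)(x)
    return layers.Dense(2, activation="softmax")(x)
```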

2.3 Learning Complex Time Functions

  • Increasing the number of units in the LSTM layer improved performance a bit, but beyond a point, adding more units made no difference.
  • However, adding another LSTM layer to the network helped: it lets the network learn a more complex time function, and accuracy improved to 86% (a sketch follows the list).
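Stacking a second LSTM in Keras only requires the first layer to return its full output sequence; the unit counts here are illustrative:

```python
from tensorflow.keras import layers

def temporal_encoder(x):
    x = layers.LSTM(128, return_sequences=True)(x)   # pass the whole sequence on
    return layers.LSTM(128)(x)                       # keep only the final state
```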

2.4 A Better CNN

  • The VGG layers worked well for the network so far, but it is possible to improve performance by using smarter layers in place of the plain VGG layers.
Photo by Author
  • To implement this idea, a couple of Inception modules with carefully chosen filter counts were employed in place of the VGG layers.
  • It is important to note that ResNet modules were not chosen because the issue was not a forgetful network; we just needed a better feature extractor.
  • Also, a modification to the fully connected layers was required to prevent overfitting. This worked remarkably well, leaving the network with a final training accuracy of 93% (a module sketch follows the list).
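For reference, a minimal GoogLeNet-style Inception module in Keras; the filter counts are placeholders, not the ones tuned for this project:

```python
from tensorflow.keras import layers

def inception_module(x, f1=32, f3=64, f5=16, fp=16):
    # four parallel branches, concatenated along the channel axis
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3 // 2, 1, padding="same", activation="relu")(x)  # bottleneck
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b3)
    b5 = layers.Conv2D(f5 // 2, 1, padding="same", activation="relu")(x)  # bottleneck
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b5)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fp, 1, padding="same", activation="relu")(bp)
    return layers.Concatenate()([b1, b3, b5, bp])
```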

3 Training

The training process is similar to a standard classification task: we use softmax cross-entropy as the loss and minimize it with the Adam optimizer.
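A minimal sketch of that setup, reusing build_model from the architecture sketch; the learning rate, batch size and epoch count are assumptions, and x_train/y_train stand for the packed arrays from the data-handling step:

```python
import tensorflow as tf

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",   # integer labels: 0 = safe, 1 = risky
    metrics=["accuracy"],
)
model.fit(x_train, y_train, batch_size=8, epochs=20,
          validation_data=(x_val, y_val))
```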

  • The input has dimensions (batch size, time, channels, width, height), compared to the traditional (batch size, channels, width, height). LSTMs generally take longer to train than CNNs, and since we are training both, we can expect training to be quite time-consuming for a classification task.
  • It is a decently large network, with around 14 million parameters, and it required a good GPU; otherwise it ran into memory-exhaustion errors. Luckily, I used a GTX 1080 Ti machine, on which training finished within 3 hours. The training and testing metrics are presented in the figures:
Photo by Author
  • The figure shows the accuracy improving over time; note that the network with the Inception modules outperformed the VGG-based CNN by a good margin.
  • The VGG-based network oscillated in its losses, and even its accuracy is fairly low. The Inception-based network, in comparison, performs better and converges more stably, as shown in the figure.
  • For VGG the batch size was 16, while for the Inception-based model it was 8 (due to memory exhaustion). The metrics would have been much smoother with a larger batch size for the Inception model too.
Photo by Author

Results

Photo by Author

The system is supplied with a video feed, and a safety level is produced for each series of images; so for every moment in time, a safety flag is raised.

Multiple video clips of a vehicle driving around various Carla environments were given to the network, and it performed extremely well.

Inference 1

  • The prediction depends on the vehicle's speed: when an actual human drives, the speed varies dynamically, whereas the training samples were collected with constant-speed autopiloting.
  • This affects how far ahead the model can predict: the warning time before the incident shrinks when the vehicle is going fast, and the model predicts well in advance when the vehicle is going slow.

Inference 2

  • One important thing to note is that the network actually learned to judge when an environment is safe or risky. This can be clearly seen when the vehicle backs away from a collision or has a near miss: the network's status changes from risky to safe.
  • This particular situation was never in the training samples, which suggests the network learned about safe and risky environments in a meaningful way rather than just memorizing collision patterns.

Usage

Photo by Author
  • The core of the project is to extract spatio-temporal information and use it to understand our environment better: for risk prediction, context and scene understanding, action recognition, forecasting, and so on.
  • The project could be employed not only in a self-driving car's decision-making system but also in a manual car's emergency-protocol system to prevent extreme events.
  • The network could be integrated into a car to trigger safety protocols such as applying the brakes, deploying airbags and other precautions, and maybe even calling emergency services with the location. Predicting a collision seconds before it happens may not leave enough time for a human to react, but for a machine it is a really long time.
  • It can also be used as a reward-generating system for a reinforcement learning task.
  • Although the project is very specific and applies only to a particular domain, its objective, as I see it, was to make a neural network understand a scene with a temporal element the way we humans do.

Lessons

  • Always check whether you really need the full complexity of the data; we reduced the resolution and the number of time steps to simplify the problem.
  • If you are using simulation software, check whether you can reduce the computational power required, automate, and speed things up.
  • Start with your idea, run experiments, fail, and modify your network over time using standard community practices. This builds intuition, and over time you will find yourself turning the right knobs instead of experimenting randomly.
  • Each problem is different and unique, so always do the above step even when the problem seems very familiar and already solved.
  • Overfitting can have different causes. Examine the nature of your problem, model and data to develop insights. For our problem, the cause lay in the nature of the network.
  • Play around with hyperparameters, but remember that tuning them cannot outdo a bad assumption about the nature of the problem. Sometimes it is just a bad assumption on our side.
  • Have at least a rough idea of what each layer contributes to the network as a whole; this helps while debugging, e.g. adding an LSTM layer instead of adding more units.
  • Try a different approach when possible, just to see if you can make your model better. We experimented with Inception modules, which gave us phenomenal results.
  • Drawing inferences and checking whether the network actually learned the intended objective function is very important. Numbers don't mean anything if the task isn't learned well.
  • See the bigger picture. You may not be able to use a project directly or immediately somewhere, but the knowledge you gain from attempting an uncharted project can be very valuable in your future endeavors.
  • We are all in this together, Stay Safe.

YouTube link: https://www.youtube.com/watch?v=5E20U7b_4zQ

The code, external links and references are in this GitHub repo.

Please star the GitHub repo if you find it useful.
