Lessons from a Deep Learning Master

Rama Ramakrishnan
Towards Data Science
11 min read · Jul 17, 2020



Yoshua Bengio is a Deep Learning legend and won the 2018 Turing Award along with Geoff Hinton and Yann LeCun.

In this short post, I want to highlight for you some clever things that Yoshua and his collaborators did to win a Machine Learning competition from a field of 381 competing teams. Perhaps these ideas will be useful for your own work.

In a world where powerful Deep Learning frameworks (e.g., TensorFlow, PyTorch) are a free download away, their competition-winning approach demonstrates nicely that your edge may come from how well you model the specifics of your problem.

(Caveat: This work was done in 2015. Given all the advancements in Deep Learning and computing hardware since then, Yoshua and team would probably solve the problem differently if the competition were held today.)

The teams participating in the competition were given a dataset of all the taxi trips undertaken over a full year in the city of Porto in Portugal.

There were 1.7 million trips in the training dataset and for each trip, the important data elements were:

  • GPS coordinates — latitude and longitude — of the taxi’s location measured every 15 seconds from the start of the trip to the finish. The first latitude-longitude pair is the starting point of the trip and the final latitude-longitude is the destination of the trip. For example, a taxi’s location at the start of a trip, 15 seconds later and 30 seconds later would look like this: [-8.578719,41.156271],[-8.578629,41.157693],[-8.578521,41.159439].
  • the timestamp at the beginning of the trip
  • taxi ID
  • client ID (if the client requested the taxi by phone) or taxi-stand ID (if they got into the taxi at a taxi stand)
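Putting those fields together, a single training example might look roughly like this (a sketch in Python; the field names and values are illustrative, not the dataset’s actual column names):

```python
trip = {
    "taxi_id": 20000589,                       # illustrative ID
    "start_timestamp": "2013-07-01 00:00:58",  # illustrative timestamp
    "client_id": None,                         # set if the taxi was requested by phone
    "taxi_stand_id": 15,                       # set if the trip began at a taxi stand
    # GPS location every 15 seconds, from the start of the trip to the finish:
    "gps_points": [[-8.578719, 41.156271], [-8.578629, 41.157693], [-8.578521, 41.159439]],
}
```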

The challenge given to the participants is simply stated:

Given a partial trip (i.e., the latitude-longitude of the starting point and the next several consecutive points) and time/ID metadata, predict the latitude-longitude of the final destination.

For example, let’s say a taxi trip started at the Sao Bento Station and ended at the Jardins do Palacio de Cristal.

A partial trip would include the origin point and the first several GPS points along the route, but not the final destination.

The test dataset had 320 partial trips. The evaluation metric was the distance between the predicted destination and the actual destination, averaged over the trips in the test dataset.

But the predicted and actual destinations are points on the surface of the Earth (not points on a plane), so the distance between them is calculated NOT with the Euclidean distance but with something called the Haversine distance:
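Written out, for two points with latitudes φ₁, φ₂ and longitudes λ₁, λ₂ (in radians) on a sphere of radius R, the standard form is:

$$ d = 2R \arcsin\left( \sqrt{ \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right) $$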

Source: https://arxiv.org/abs/1508.00021

Looks simple, right? :-)

This is a structured data problem (i.e., no images, audio, etc.) so if you want to use a neural network approach, a reasonable starting point would be a basic network (an MLP) with a hidden layer and two output nodes, one for the latitude and one for the longitude of the destination.
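To make that concrete, here is a minimal sketch of such a baseline in PyTorch (my own illustration, not the authors’ model; the hidden size is arbitrary):

```python
import torch.nn as nn

class BaselineDestinationMLP(nn.Module):
    """A bare-bones starting point: a fixed-size input vector, one hidden layer,
    and two outputs (the destination's latitude and longitude)."""

    def __init__(self, n_input_features, hidden_size=500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_input_features, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),   # one output node each for latitude and longitude
        )

    def forward(self, x):                # x: (batch, n_input_features)
        return self.net(x)
```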

But complications arise immediately:

  • Since different trips may have different durations, the number of latitude-longitude pairs in each trip will vary and therefore each training example has a variable number of inputs. For example, a 10-minute ride will have about 40 latitude-longitude pairs while a 30-minute ride will have an input that is three times as long. How do we handle a varying number of inputs?
  • That Haversine function looks scary. It is differentiable so maybe optimizing it as-is will just work? We will see.
  • Our two output nodes predict latitude and longitude. Maybe this will work just fine but there are only 320 observations in the test dataset so even a few bad predictions can wreck the evaluation metric. Furthermore, predicting latitude and longitude directly doesn’t take into account the fact that popular destinations (e.g., the Sao Bento station) will occur more frequently in the data and therefore getting them right is very important.

Let’s dive in and see how Yoshua and team solved these problems.

Problem: Varying-length input

(If you are familiar with Recurrent Neural Networks (RNNs), you will have immediately recognized their applicability to this problem. Indeed, in their paper, Yoshua and co-authors explore a few different variants of RNNs to address this issue, but their competition-winning model didn’t use RNNs; it used the simple idea described below.)

Solution:

The solution that worked best was incredibly simple.

Concatenate the first 5 coordinates and the last 5 coordinates of the input. If the input has fewer than 10 coordinates, still take the first 5 and the last 5; it is OK if they overlap. Finally, if the partial trip has fewer than 5 coordinates, just repeat the first or the last coordinate until you get to 10 coordinates.

For example, from this ‘raw’ input …

[[-8.611794,41.140557],[-8.611785,41.140575],[-8.612001,41.140566],[-8.612622,41.140503],[-8.613702,41.140341],[-8.614665,41.140386],[-8.615844,41.140485],[-8.61561,41.140683],[-8.614566,41.141088],[-8.614395,41.141979],[-8.613936,41.142942],[-8.612793,41.143851],[-8.611488,41.144787],[-8.610543,41.144391],[-8.610282,41.143536],[-8.610255,41.143401],[-8.608824,41.143239],[-8.608419,41.143149],[-8.606565,41.142348],[-8.605179,41.143446],[-8.604549,41.144796],[-8.604297,41.1453],[-8.603505,41.145561],[-8.602488,41.145633],[-8.601039,41.145759],[-8.600436,41.146443],[-8.599977,41.147289],[-8.598681,41.14827],[-8.598303,41.148423],[-8.598618,41.149467],[-8.597529,41.151294],[-8.596161,41.153679],[-8.594838,41.155983],[-8.594163,41.157135],[-8.593002,41.159187],[-8.591454,41.161608],[-8.589924,41.163453],[-8.589402,41.163309]]

… only the first five and the last five coordinates would be used:

[[-8.611794,41.140557],[-8.611785,41.140575],[-8.612001,41.140566],[-8.612622,41.140503],[-8.613702,41.140341],[-8.594163,41.157135],[-8.593002,41.159187],[-8.591454,41.161608],[-8.589924,41.163453],[-8.589402,41.163309]]

In case you are wondering why they picked 5 rather than another number, I suspect that they thought of this as a hyper-parameter k and tried a few different values; k = 5 may have turned out to be the best.
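Here is what that rule looks like in code (a sketch of my own, not the authors’ code, with the cut-off as a parameter k):

```python
def fixed_length_input(coords, k=5):
    """Reduce a variable-length partial trip to exactly 2k coordinates:
    the first k points concatenated with the last k points."""
    if not coords:
        raise ValueError("a partial trip must contain at least one point")
    first, last = coords[:k], coords[-k:]
    # If the trip is shorter than k points, pad by repeating the end points.
    first = first + [first[-1]] * (k - len(first))
    last = [last[0]] * (k - len(last)) + last
    return first + last   # for short trips the two halves simply overlap
```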

Lesson learned:

In problems with varying-length inputs, a carefully chosen fixed-length subset of the input may capture the input’s essence.

For a taxi trip, knowing the origin point and the last point of the partial trip is probably all the information you need about the partial trip; knowing the exact path taken by the taxi between those two points is probably unnecessary.

But in other problems, knowing the beginning and end may not be enough; representing the entire path in some way may be necessary. In those cases, sampling the entire path at regular intervals may do the trick. Or sampling the more interesting parts of the path more densely than the less interesting parts may be the right approach.

These ideas are not foolproof though: if the input is a sentence, we can’t just look at the first few words or the last few words. And sampling a fixed number of words from every sentence won’t work either; omitting a single word (e.g., the word ‘not’) may change the meaning of the sentence.

Nevertheless, Yoshua’s solution demonstrates that you may be able to come up with a simple approach that is good enough for your specific problem if you think about it carefully.

Problem: How do we handle that intimidating Haversine distance function?

Solution:

It turns out that our concern about that distance function was justified. Yoshua and team did run into trouble when they used the Haversine function directly, so they had to find a simpler alternative (see their paper for the details: https://arxiv.org/abs/1508.00021).

Lesson learned:

Again, this is a good example of problem-specific thinking.

They didn’t try to devise a universal approximation to the Haversine distance. Given that the problem is set in Porto, they just needed something that worked well at the scale of that city. It didn’t have to work for larger distances.

Once you realize this, a little Googling can lead you to the equirectangular distance, which looks a lot simpler than the Haversine.
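To see the difference concretely, here are both distances in Python (standard textbook formulas, not code from the paper); at the scale of a single city the two agree to well within a fraction of a percent:

```python
import math

R_EARTH_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance: the competition's evaluation metric."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * R_EARTH_KM * math.asin(math.sqrt(a))

def equirectangular_km(lat1, lon1, lat2, lon2):
    """Treat the city as locally flat: much simpler, nearly identical at city scale."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    x = math.radians(lon2 - lon1) * math.cos((phi1 + phi2) / 2)
    y = phi2 - phi1
    return R_EARTH_KM * math.sqrt(x * x + y * y)

# Two points a few kilometres apart in Porto (approximate coordinates):
print(haversine_km(41.1456, -8.6106, 41.1579, -8.6291))
print(equirectangular_km(41.1456, -8.6106, 41.1579, -8.6291))
```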

If you are familiar with machine learning, you have probably learned the importance of making sure that your loss function accurately captures the real-world objectives you care about for your problem.

But what you may not have learned is that when your loss function is complex (as it often is), you don’t have to find an approximation that’s good everywhere. It just has to be good enough within the scope of your problem.

Problem: Does having two simple output nodes — one for latitude and one for longitude — work?

As the destination we aim to predict is composed of two scalar values (latitude and longitude), it is natural to have two output neurons. However, we found that it was difficult to train such a simple model because it does not take into account any prior information on the distribution of the data.(emphasis mine) Source: https://arxiv.org/abs/1508.00021

By “prior information on the distribution of the data”, Yoshua and team are referring to the varying popularity of different destinations (e.g., the Sao Bento train station will be more popular than a particular residential address).

Let’s see what they did! This is my favorite part of their paper.

Solution:

They ran a clustering algorithm on all the final destinations in the training set and grouped them into a few thousand clusters (3,392 to be exact).

Conceptually, they went from a cloud of 1.7 million individual destination points to a map of Porto carved up into a few thousand cluster regions, each represented by its center point. (The actual clusters were probably not all of the same size and shape.)

Now, instead of directly predicting the latitude-longitude of the final destination, we can think of this as a multi-class classification problem where the task is to classify the input into one of those 3,392 clusters.

The final layer for a multi-class classification problem is usually a softmax layer, which gives you a probability distribution over all the possible output classes. In our example, the softmax layer will generate a probability for every one of the 3,392 clusters.

It is standard practice in multi-class classification to pick the class with the highest probability as the predicted output. Accordingly, we can pick the highest-probability cluster and use the latitude-longitude of its center point as the predicted destination.
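In code, that standard choice is just an argmax over the softmax output followed by a lookup of the corresponding center (a tiny sketch with made-up numbers; probs and centers are stand-ins):

```python
import torch

probs = torch.tensor([[0.1, 0.7, 0.2]])     # softmax output over C = 3 toy clusters
centers = torch.tensor([[41.141, -8.611],   # (C, 2) matrix of cluster-center coordinates
                        [41.149, -8.600],
                        [41.158, -8.629]])

best = probs.argmax(dim=-1)             # index of the most probable cluster
predicted_destination = centers[best]   # its center, used as the predicted destination
```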

Notice how this transformation neatly takes into account the ‘prior information on the distribution of the data’: the clusters containing popular destinations will occur more frequently in the training set and will therefore, on average, have higher predicted probabilities.

This sounds pretty good, right?

But what if an actual destination is at the corner of a cluster, far from the cluster center? Since we are using the cluster center as the prediction, the distance between our prediction and the actual destination will be non-zero for sure and may be sizable.

One way to get around this issue is to increase the number of clusters we use. By generating (say) 5,000 clusters, each cluster gets smaller and every point in a cluster will be closer to its center. But we now have a multi-class classification problem with many more output classes. Without sufficient training data for every cluster, we won’t be able to train a good model.

Yoshua and team devised a better way.

They multiplied the predicted cluster probabilities (i.e., the output of the softmax) by the coordinates of the cluster centers and added them up to calculate a weighted average latitude …

predicted latitude = Σ_i p_i · lat_i

… and a weighted average longitude:

predicted longitude = Σ_i p_i · lon_i

where p_i is the predicted probability of cluster i and (lat_i, lon_i) is the center of cluster i.

This (probability-weighted) latitude-longitude pair is the predicted destination.

This means, for example, that if the model thinks that two adjacent clusters are equally likely to be the final destination, the midpoint of their centers will be predicted as the final destination.
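For instance (with made-up numbers): if clusters centered at (41.150, -8.610) and (41.160, -8.600) each get probability 0.5 and every other cluster gets roughly zero, the prediction is 0.5 · (41.150, -8.610) + 0.5 · (41.160, -8.600) = (41.155, -8.605), the midpoint of the two centers.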

It is important to note that this final weighted-averaging step is not a post-processing step. It has to be part of the network; only then can the predicted latitude-longitude pair be fed into the loss function, which, in turn, can be optimized to train the network.

To make this part of the network, they add a single linear layer after the softmax layer. This, in my opinion, was a master move :-)

The weight matrix of this linear layer is just the cluster centers …

… but with an important twist: the weights are kept fixed during training.

After all, we already know what they are (i.e., they aren’t randomly initialized weights; they come from the clustering algorithm) and don’t need to learn them.
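Here is a minimal PyTorch sketch of this output head, reconstructed from the description above (not the authors’ code; the names and sizes are mine):

```python
import torch
import torch.nn as nn

class ClusterWeightedHead(nn.Module):
    """Softmax over C destination clusters, followed by a fixed linear map
    whose weights are the cluster centers."""

    def __init__(self, hidden_size, cluster_centers):
        super().__init__()
        # cluster_centers: (C, 2) tensor of (lat, lon) centers from the clustering step
        self.to_scores = nn.Linear(hidden_size, cluster_centers.shape[0])
        # Registered as a buffer: known in advance and never updated by the optimizer.
        self.register_buffer("centers", cluster_centers)

    def forward(self, h):                              # h: (batch, hidden_size)
        p = torch.softmax(self.to_scores(h), dim=-1)   # (batch, C) cluster probabilities
        return p @ self.centers                        # (batch, 2) weighted-average destination
```

The rest of the model feeds the fixed-length trip representation (plus any metadata features) through the hidden layer(s) to produce h, and training minimizes the (approximate) distance between this head’s output and the true destination.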

In summary, Yoshua and team:

  • first changed the problem from a two-output regression problem to a multi-class classification problem
  • and then changed it back to a two-output regression problem by adding a final linear layer and two output nodes
  • by using the cluster centers as the frozen weight matrix of that linear layer, they brought the weighted-averaging step inside the network and made end-to-end training of the network possible.

Neat, right?

BTW, if you are curious about which clustering algorithm was used:

The clusters were calculated with a mean-shift clustering algorithm on the destinations of all the training trajectories, returning a set of C = 3392 clusters. Source: https://arxiv.org/abs/1508.00021
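If you want to try something similar yourself, scikit-learn’s MeanShift is one option (a sketch with synthetic stand-in data; the bandwidth value is illustrative and is what controls how many clusters you end up with):

```python
import numpy as np
from sklearn.cluster import MeanShift

# Stand-in for the real data: the (lat, lon) of every training trip's destination.
rng = np.random.default_rng(0)
destinations = rng.normal(loc=[41.15, -8.61], scale=0.02, size=(10_000, 2))

ms = MeanShift(bandwidth=0.005, bin_seeding=True)   # bandwidth in degrees, illustrative
ms.fit(destinations)
cluster_centers = ms.cluster_centers_               # (C, 2) array of cluster centers
print(len(cluster_centers))
```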

Lessons learned:

  • It is important to consider the prior distribution of the output values when thinking about the output layer. For classification problems, this is usually straightforward (and even automatic) but for regression problems like this one, it requires paying more attention than we normally do.
  • If the specifics of the problem require a particular kind of computation, define a layer to do it and include it in the network (rather than do it in an ad hoc manner outside the network) so that you can learn its parameters as part of the training process. As long as its derivative can be calculated, it is worth a try.
  • If the above lesson makes you wonder why Yoshua and team did the clustering outside the network, instead of defining a layer for it in the network and learning the best clusters as part of training the network:

One potential limitation of our clustering-based output layer is that the final prediction can only fall in the convex hull of the clusters. A potential solution would be to learn the clusters as parameters of the network and initialize them either randomly or from the mean-shift clusters. (emphasis mine)

Source: https://arxiv.org/abs/1508.00021

I hope you enjoyed this peek into how a Deep Learning master thinks. If none of these lessons were new to you, congratulations — you are well on your way to Deep Learning mastery!

