Using Unsupervised Learning to plan a vacation to Paris: Geo-location clustering

Hamza B., PhD
Towards Data Science
6 min read · May 21, 2018


The Eiffel Tower in the City of Love. But there are so many great sights in Paris — how can I help her organise her trip?

When my friend told me she was planning a 10-day trip to Paris, I figured I could help.

Her: “My friend and I are thinking of going to Paris on vacation.”
Me: “Oh sounds fun. Maybe I can join in as well.”
Her: “Yes! We’re planning to enrol in a French language course, and do loads of shopping! We already listed all the shopping malls we want to visit, and …”
Me: “Ah sounds great… so what time do you need me to drive you to the airport?”

Since I’ve been to Paris a few times myself, I figured I could help in other ways like contributing to the list of sights and places to visit. After listing all those sights and attractions, I created a Google map with a pin for each location.

The initial Google Map with pins for sights to visit — but in what order should these sights be visited?

It became quite clear that some scheduling work would be needed to see all that Paris had to offer — but how could she decide what to see first, and in what order? This seemed like a clustering problem to me, and an unsupervised learning method could help solve it.

Algorithms like k-means or DBSCAN would probably do the trick. But first, the data must be prepared so that such algorithms can perform as intended.

Geolocations Collection and Data Preparation

First, I had to gather the geo-locations of the Google map pins in a 2D format (something that could be stored in a numpy array). Translating those pins into [longitude, latitude] pairs would be perfect.

Since this was a one-off, I looked for a quick way to extract this info from the Google map. This Stack Overflow question gave me all that I needed.

Basically, one needs to go to google.com/maps/d/kml?mid={map_id} and download a *.kmz file. Then, manually change the *.kmz extension to *.zip, extract the archive, and open doc.kml in a text editor of your choice (SublimeText is my personal go-to).

Then, you may decide to manually Ctrl+F your way through the <coordinates> fields, or decide not to be so lazy about it and use BeautifulSoup (as shown below)!
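A minimal sketch of that parsing step (the function name and the inline KML sample are mine, for illustration; the real doc.kml holds all 26 placemarks, and the "xml" parser requires lxml to be installed):

```python
from bs4 import BeautifulSoup

def extract_placemarks(kml_text):
    """Return (name, longitude, latitude) for each <Placemark> in a KML document."""
    soup = BeautifulSoup(kml_text, "xml")
    rows = []
    for pm in soup.find_all("Placemark"):
        name = pm.find("name").text.strip()
        # KML stores coordinates as "longitude,latitude,altitude"
        lon, lat = pm.find("coordinates").text.strip().split(",")[:2]
        rows.append((name, float(lon), float(lat)))
    return rows

sample = """<kml><Document>
  <Placemark><name>Eiffel Tower</name>
    <Point><coordinates>2.2945,48.8584,0</coordinates></Point>
  </Placemark>
</Document></kml>"""
print(extract_placemarks(sample))
```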

Once I extracted the coordinates from the XML file, I stored them in a dataframe. There are 26 landmarks in total (each stored in a <Placemark></Placemark> element in the XML file).

Dataframe populated with info from the XML file (first 7 rows only shown)

From the dataframe which stored coordinates and the name of the landmark/sight, I generated a scatter plot.

Google Map Pins Scatter Plot
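A plot like the one above can be produced with matplotlib; the three sample rows below are placeholders standing in for the full 26-landmark dataframe:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data: one row per Google map pin
df = pd.DataFrame({
    "name": ["Eiffel Tower", "Le Louvre", "Arc de Triomphe"],
    "longitude": [2.2945, 2.3376, 2.2950],
    "latitude": [48.8584, 48.8606, 48.8738],
})

# Scatter the pins in longitude/latitude space
plt.scatter(df["longitude"], df["latitude"])
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Google Map Pins")
plt.show()
```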

K-Means Clustering

In general, unsupervised learning methods are useful for datasets without labels AND when we do not necessarily know the outcome we are trying to predict. These algorithms typically take one of two forms: (1) clustering algorithms, or (2) dimensionality reduction algorithms.

In this section, we’ll focus on k-means, which is a clustering algorithm. With k-means, a pre-determined number of clusters is provided as input, and the algorithm generates the clusters within the unlabeled dataset.

k-means generates a set of k cluster centroids and a labeling of input array X that assigns each of the points in X to a unique cluster. The algorithm determines cluster centroids as the arithmetic mean of all the points belonging to the cluster, AND defines clusters such that each point in the dataset is closer to its own cluster center than to other cluster centers.

source: https://bit.ly/1z29ZIV

It can be implemented in Python using sklearn as follows:
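A sketch of that step (the four sample coordinates are stand-ins for the full 26-landmark array, and the article used n_clusters=10, one cluster per vacation day):

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is a [longitude, latitude] pair for one landmark (sample values)
X = np.array([
    [2.2945, 48.8584],  # Eiffel Tower
    [2.3376, 48.8606],  # Le Louvre
    [2.2950, 48.8738],  # Arc de Triomphe
    [2.1204, 48.8049],  # Palais de Versailles
])

# n_clusters=10 in the article; 2 here for the tiny sample
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
labels = kmeans.labels_              # cluster assignment for each landmark
centroids = kmeans.cluster_centers_  # arithmetic mean of each cluster's points
```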

As one can see in the scatter plot below, I generated 10 clusters — one for each vacation day. But the sights in the center of Paris are so close to each other that it is quite difficult to distinguish one cluster from another. The resulting predictions were also sorted and stored in a dataframe.

k-means was used to generate 10 clusters (one for each day), the black dots represent cluster centers

Using the 10 clusters generated by k-means, I generated a dataframe that assigns a day of the week to each cluster. This would constitute an example of a schedule.
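A minimal sketch of that mapping, with hypothetical cluster labels (in the article, each of the 10 clusters maps to one vacation day):

```python
import pandas as pd

# Hypothetical k-means labels, one per landmark
df = pd.DataFrame({
    "name": ["Eiffel Tower", "Arc de Triomphe", "Le Louvre", "Palais de Versailles"],
    "cluster": [0, 0, 1, 2],
})

# Map each cluster id to a vacation day, then sort to get the schedule
df["day"] = df["cluster"].map(lambda c: f"Day {c + 1}")
schedule = df.sort_values("cluster").reset_index(drop=True)
print(schedule)
```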

k-means results as Vacation Days of the Week (first 3 days only shown)

Now, at this stage, I could have simply handed over the sorted dataframe, re-organised the Google map pins by layers (i.e. each layer representing one day), and that’s it — itinerary complete.

But something was still bugging me: k-means generates clusters based on the Euclidean distance between points, meaning the straight-line distance between two pins on the map.

But as we know, the Earth isn’t flat (right?) so I was wondering if this approximation was affecting the clusters being generated, especially since we have quite a few landmarks far from the high-density region in the center of Paris.

Because the Earth isn’t flat: enter HDBSCAN

Hence, we need a clustering method that can handle geographical distances, i.e. the length of the shortest curve between two points along the surface of the Earth.
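For two points given as longitude/latitude in degrees, that great-circle distance is given by the haversine formula; a quick sketch (function name mine):

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance between two points on Earth, in kilometres."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Eiffel Tower to the Palais de Versailles: roughly 14 km as the crow flies
d = haversine_km(2.2945, 48.8584, 2.1204, 48.8049)
```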

A density-based algorithm such as HDBSCAN, which builds on the DBSCAN algorithm, may be useful for this.

Both HDBSCAN and DBSCAN are density-based spatial clustering methods that group together points that are close to each other, based on a distance measurement and a minimum number of points. They also mark as outliers any points that lie in low-density regions.

Thankfully, HDBSCAN supports haversine distance (i.e. longitude/latitude distances) which will properly compute distances between geo-locations. For more on HDBSCAN, check out this blog post.

HDBSCAN isn’t included in your typical Python distribution so you’ll have to pip or conda install it. I did so, and then ran the code below.

I ended up with the following scatter plot and dataframe. We see that isolated points were assigned to cluster ‘-1’, which means they were identified as ‘noise’.

HDBSCAN generated 9 clusters and 3 data points as ‘noise’
HDBSCAN results as Vacation Days of the Week (first 3 days only shown)

Unsurprisingly, we end up with several points being flagged as noise. Since the minimum number of points for an HDBSCAN cluster is 2, isolated locations like the Palais de Versailles were categorised as noise. Sainte-Chapelle de Vincennes and Musée Rodin suffered a similar fate.

The interesting part, however, is the number of clusters that HDBSCAN identified: 9, one fewer than the planned vacation days. I guess that for the number of sights/data points we chose, 10 days sounds like it’ll be enough.

Ultimately, the results from k-means were the ones we used to lay out a schedule, as the clusters generated by k-means were similar to those generated by HDBSCAN, and all data points were included.

The clustering method presented here can of course be improved. One possible improvement is the addition of a weight feature for the data points. For instance, the weights could represent the amount of time needed to fully visit a particular venue (e.g. Le Louvre easily takes one full day to appreciate), which would limit the total number of data points in a cluster containing highly weighted points — something to investigate in future projects.

Jupyter Notebook for this mini-project can be found here.
