Planet Beehive

Ep. 1 : Exploring our planet’s touristic activities

Thijs Bressers
Towards Data Science


Our output for today. Read on to see how we get there and how we calculate our measures, and to find an interactive visualization.

Close your eyes for a second, ignore the rain dripping outside and the beep of a new incoming email, and think deeply about ten global activities that are at the top of your to-do list...

Now be honest, how many of those are among the most visited activities in the world? No harm in that, they are visited for a reason. However, are they actually worth the limited resources we have to travel there? Or might there be alternatives that don’t come to mind immediately?

Those questions are among the ones we wish to answer today. Can we objectively create a measure that represents not only the most visited but also the best rated global activities, and can we visualize these on a map that makes it easy to spot new pearls?

Our Approach

  1. Have a look at the currently available measures and transform the data where required for better downstream analysis and visualization
  2. Create a formula that combines the number of reviews as well as their user rating into a new measure that we will use in our algorithm in follow-up episodes
  3. Perform exploratory data analysis on regional, country and individual activity level to verify our newly created measure
  4. Visualize our data in an interactive chart that we will extend into an interactive dashboard during the next episodes

The Data

The data that we will be using today includes: the activity metadata, their number of reviews, their user rating and some additional geodata. These we will derive from the following sources initially:

  • TripAdvisor : Its collection of activities is our starting point for the activities themselves, their number of user reviews and their rating. Here we focus only on activities that are featured on each country’s ‘top things to do’ list OR activities that have more than 750 reviews.
  • Google : Using the Google Places and Geocoding APIs, we enrich our TripAdvisor activity data with geodata (i.e. coordinates) as well as the average rating from Google users.
  • CIA World Factbook : For reference data about countries (size, citizens, etc.) we will use the comprehensive database from the CIA World Factbook. We start with 174 countries in scope.

Data Transformation

Our data set starts with 7'044 activities; for 5'971 of them we could also derive the Google user rating. An okay start for today’s exercise.

Let’s see what the data looks like by creating two scatter plots with the 4 measures we care about the most: one with the latitude and longitude, and another one with the number of reviews and the average rating per activity:

Fig. 1, 2 Scatter plots created with Matplotlib in Python

Some initial key takeaways here:

  • Europe seems awfully crowded with activities, which might be caused by a bias in who uses TripAdvisor. In general, however, our geodata looks clean and we have a proper global spread.
  • We love to give high ratings! There is hardly any rating lower than 6/10 and clearly the majority of the ratings are higher than 8/10. The simple average combined rating is even an amazing 8.98/10!
  • When it comes to the number of reviews, the majority of activities have 20'000 reviews or fewer. At the same time there are quite a few outliers, and one definite highlight activity with way over 120'000 reviews! (can you guess which highlight this is?)
  • Neither the number of reviews nor the average rating is normally distributed.

And the latter is our key point here. As our goal today is to use these measures for comparison, we want both of them to be as close to normally distributed as possible. Fig. 2 tells us a bit about both measures and how they are distributed.

Looking at the number of reviews first, it seems that we are dealing with a log-normal distribution, given the large number of small values and the very limited number of very high values. To confirm this observation we will draw two QQ plots, one with a regular axis and another one with a logarithmic axis. A straight diagonal line would confirm a normal distribution:

Fig. 3, 4 QQ plots with regular y-axis (left) and log y-axis (right) produced with Probscale in Python

As expected, the logarithmic scale gives us a much better fit to a normally distributed data set than the regular scale. Therefore we will apply a LOG10 transformation to the number-of-reviews measure. This new measure is what we will use for comparing the reviews going forward.
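As a rough numeric companion to the QQ plots, the sketch below measures how straight a normal QQ plot would be by correlating the sorted values with theoretical normal quantiles, before and after a log10 transform. It runs on synthetic log-normal numbers, not on our actual review counts, and the helper name `qq_correlation` is made up for this illustration. It also assumes every activity has at least one review, since log10 of zero is undefined.

```python
from statistics import NormalDist

import numpy as np


def qq_correlation(values):
    """Correlation between the sorted values and theoretical normal quantiles.

    A result close to 1 means the points of a normal QQ plot lie close to a
    straight line.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    # Plotting positions (i - 0.5) / n, one common convention for QQ plots
    probs = (np.arange(1, n + 1) - 0.5) / n
    standard_normal = NormalDist()
    theoretical = np.array([standard_normal.inv_cdf(p) for p in probs])
    return float(np.corrcoef(theoretical, x)[0, 1])


# Synthetic stand-in for the review counts (an assumption, not the real data)
rng = np.random.default_rng(42)
reviews = rng.lognormal(mean=7.0, sigma=1.2, size=2000)

log_reviews = np.log10(reviews)  # the LOG10 transformation

qq_raw = qq_correlation(reviews)
qq_log = qq_correlation(log_reviews)
print(f"QQ correlation, raw counts:   {qq_raw:.3f}")
print(f"QQ correlation, log10 counts: {qq_log:.3f}")
```

On data of this shape the log10 version should come out much closer to 1 than the raw version, which is the numeric counterpart of the straighter line in the right-hand QQ plot.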

Then over to the distribution of the average rating. Here the problem does not necessarily seem to be skewness in the data, but more or less a problem of users giving very high ratings. Looking only at the ratings above 8, the distribution seems to be rather close to normal.

Fig. 5 Boxplots produced with Pandas in Python

We can confirm this observation with the boxplots in Fig. 5. Looking at each of the rating values, and at the average of both sources, we can see that the distribution can be considered normal between 8 and 10, with outliers at any rating below 8.

As a result, we will transform the rating into a new measure with a bottom threshold of 0.8, i.e. 8/10 expressed on a 0 to 1 scale (meaning that every rating at or below 0.8 is set to 0.8).

As a final step we will apply a MinMax normalization to both measures.
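The flooring and MinMax steps can be sketched in a few lines of pandas. This is a minimal illustration on made-up ratings, assuming the ratings are already expressed on a 0 to 1 scale; the function name `floor_and_scale` is hypothetical.

```python
import pandas as pd


def floor_and_scale(ratings: pd.Series, floor: float = 0.8) -> pd.Series:
    """Raise every rating below `floor` to `floor`, then MinMax-scale to [0, 1]."""
    floored = ratings.clip(lower=floor)
    lowest, highest = floored.min(), floored.max()
    if highest == lowest:
        # All values identical after flooring: avoid dividing by zero
        return pd.Series(0.0, index=ratings.index)
    return (floored - lowest) / (highest - lowest)


# Made-up ratings on a 0-1 scale (illustrative only)
ratings = pd.Series([0.55, 0.80, 0.90, 0.95, 1.00])
scaled = floor_and_scale(ratings)
print(scaled.round(2).tolist())
```

In this example both 0.55 and 0.80 end up at 0 after scaling, while the best rating ends up at 1, so the whole 0 to 1 range is spent on the ratings between 8 and 10.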

Now we can compare our transformed measures’ distribution with the original measures’ distribution in the jointplot comparison below:

Fig. 6, 7 Combined distribution plots Before and After respectively created with Seaborn in Python

From this we can create a new measure that should represent the best combination of both measures. Multiplying them gives us a score that is weighted by the number of reviews (i.e. the more reviews, the more reliable the rating):

Activity Score = norm. floored avg. visitor rating × (norm. log10 review count / avg(norm. log10 review count))
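Put together, the formula could be computed along the following lines. This is a sketch under a few assumptions: ratings sit in a `rating` column on a 0 to 1 scale, review counts in a positive `reviews` column, and both normalizations are plain MinMax scalings. The column names, the helper functions and the four example rows are all made up.

```python
import numpy as np
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Scale a series linearly to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())


def add_activity_score(df: pd.DataFrame, floor: float = 0.8) -> pd.DataFrame:
    """Return a copy of `df` with an `activity_score` column added."""
    out = df.copy()
    # Normalized, floored average visitor rating
    norm_rating = min_max(out["rating"].clip(lower=floor))
    # Normalized log10 review count
    norm_reviews = min_max(np.log10(out["reviews"]))
    # Rating weighted by the review measure relative to its average
    out["activity_score"] = norm_rating * (norm_reviews / norm_reviews.mean())
    return out


# Made-up activities (illustrative only)
activities = pd.DataFrame(
    {
        "activity": ["A", "B", "C", "D"],
        "rating": [0.95, 0.99, 0.85, 0.70],
        "reviews": [120000, 900, 20000, 5000],
    }
)
scored = add_activity_score(activities)
print(scored.sort_values("activity_score", ascending=False))
```

One consequence of multiplying two MinMax-scaled measures shows up in this example: the activity with the fewest reviews (B) and the activity rated below the floor (D) both end up with a score of 0, however well they do on the other measure.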

Exploring our Activities

Now that we have created our Scoring Measure, we can start exploring the data.

Let’s start by comparing global regions’ activity rating with the average number of reviews per activity:

Fig. 8 Scatter plot generated with Seaborn in Python

Now this is getting interesting! Some very particular facts here that might just go against your preconceptions.

First of all, at a high level, there is clearly a positive relation between these two variables. That makes sense: the better an activity is rated, the more people will visit it. Until, of course, you reach the point where the number of visitors starts to weigh on the rating:

Nobody goes there anymore. It’s too crowded.

Secondly, the top 3 regions by average number of reviews come in an order many of us could have guessed. Western Europe, Southern Europe and Northern America contain many well-known tourist hot spots that receive many visitors and therefore also many reviews on average per activity.

The same goes for the top 3 regions by average rating: Australia & New Zealand, Central America and Northern America are often praised regions. The most surprising region, in my personal opinion, is Eastern Europe, which takes fourth place, ahead of any other European region!

Finally, looking at it from a continent perspective: the Americas and Europe are at the top of the list, while Asia, Oceania and Africa are much more divided. Western and Middle Africa in particular are at the absolute bottom of both measures, while Southern and Northern Africa are not too far behind the rest.

Let’s move down to a country view and plot the top 15 and bottom 15 countries based on their weighted average rating, and see what we can make of it:

Russia claims the top spot! Probably quite unexpected for some of us, but at the same time we see Ukraine in the top 5 as well, completely in line with the regional overview above, and another two Eastern European countries in the remaining top 15 (Belarus and Slovenia).

I guess all the other countries in the top 15 will be quite obvious to most of us; each one of them is a very well-known holiday destination. Probably most surprising is to see Chile beating all other South American countries.

It’s good to see that the top 15 features countries from 5 out of 7 continents. The bottom 15, however, looks a lot less diversified and is almost completely dominated by African countries (mainly Western and Middle Africa, to be more precise). North Korea is the only country from another continent that blends into this not so prestigious list.

To make these country hot spots a bit more visible on a global scale, we can create a global heatmap, plotting the weighted average activity score on a country by country basis:

Interactive Chart #1. Choropleth Map created with Plotly
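Before a map like this can be drawn, the activity scores have to be rolled up per country. The sketch below assumes a plain mean of the activity score per country, which may differ from the weighting actually used for the map; the column names, country labels and numbers are made up.

```python
import pandas as pd

# Made-up activity scores for placeholder countries (illustrative only)
activities = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B", "C"],
        "activity_score": [1.4, 1.2, 1.1, 0.9, 0.2],
    }
)

# Assumption: a plain mean per country, sorted from best to worst
country_scores = (
    activities.groupby("country")["activity_score"]
    .mean()
    .sort_values(ascending=False)
)

top = country_scores.head(2)     # best scoring countries
bottom = country_scores.tail(2)  # worst scoring countries
print(top)
print(bottom)
```

The resulting per-country series is the kind of input a choropleth needs: one value per country name or country code.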

This plot largely confirms what we already discovered. There are clear hot spots including Europe (especially Eastern and South-Western Europe), North America (especially Canada and Central America), Oceania and South America, although Paraguay and some smaller nations in the north are negative outliers there.

The ‘cold’ spots are most obvious in Middle and Western Africa, while Southern, Northern and some parts of Eastern Africa are rated slightly better. Asia shows a very mixed picture, with some countries belonging to the absolute cold spots (North Korea, Papua New Guinea, Kyrgyzstan) while other countries are rated very high (Thailand, Turkey, Yemen).

So far we have quite a good picture of how regions as well as countries get rated by fellow travelers. Now it is time to look at individual activities. After all, our goal today is to put each activity on the map and make it comparable on the average rating it receives as well as the number of reviews it got.

Therefore in the following plot we see the top 15 activities purely based on number of reviews and then the top 15 activities based on our newly created scoring measure:

What interesting differences! Some key takeaways:

  • Probably none of the activities in the top 15 by number of reviews will surprise anyone; the list is completely dominated by famous activities in Europe and North America (actually New York).
  • The top 15 activities by our newly created weighted score measure, by contrast, span a high diversity of continents (again 5 out of 7) and definitely include some new bucket list entries for me personally.
  • It surprises me that none of the top activities by number of reviews also made it into the weighted score top list. This must mean that their average online rating is significantly lower than that of the top 15 weighted score activities.
  • While the top chart is dominated by European activities, the second chart includes only two activities from Europe. So while they receive a lot of reviews (and visitors), their ratings fall a bit behind other (less visited) activities.

The concluding chart

Now it’s time to enable all of us to explore all activities ourselves while clearly distinguishing the rating as well as the number of reviews for each of them. We will plot all activities using the floored average rating as the marker fill and the number of reviews as the marker size. All details for each activity appear in the label when you hover over its marker.

Also below you will find the top 100 activities based on our newly created measure for further reference.

Be sure to zoom into your home country (don’t blame the map for showing a bit less detail) or your next travel destination, as you just might find some really great catches!

Interactive Chart #3 scatter plot with hex bins as markers with Plotly
