Putting “Uber Movement” Data into Action — Machine Learning Approach

Alptekin Uzel
Towards Data Science
15 min read · Sep 25, 2018


“Everything is related to everything else. But near things are more related than distant things.”

First Law of Geography, Waldo R. Tobler

Uber launched its Uber Movement service at the beginning of 2017. It consists of billions of pieces of trip data and provides access to the summary of travel times between different regions of the selected city. It has created huge enthusiasm, yet aroused suspicion at the same time among researchers, mobility experts, and city planners. So, is Uber democratizing data and providing a free tool to access its huge database? Maybe not so much, but it has still made a huge effort to aggregate and visualize that amount of data for different cities around the world.

Uber Movement Tool

So what do we have here? We have a nice example of isochrone mapping for travel times based on the selected origin. The travel times are also segmented for the different times of the day. Let’s specify just weekdays and morning peaks.

Range

Most importantly for data scientists and analysts, this data can be downloaded in CSV format. The catch is that the downloadable data is not segmented for "time of day." You can download travel time data from all origins to all destinations for a quarter of the year, but the available aggregations are limited to monthly, hourly, and daily (for a given day of the week). We will come back to that later.

Download Options

Mobility Assessment as a Service

Mobility is the catchy term for Smart City projects and location intelligence. Think of it as a service that gives you an estimated travel time in the city you live in, based on the origin and destination pair of your travel and the time of day. The time of day can even cover seasonality, since you wouldn't expect the same travel times in the thick summer afternoon heat and the grey winter morning fog. Moreover, the selection of a precise date can yield endless searches: morning peaks just on weekends in May, since you are traveling to Barcelona at that time of the year, or summer middays around 15:20 on Fridays, since you have a summer internship in London. Uber Movement data is just the beginning.

Not that mobile, is it?

The potential of such a service means a lot for companies trying to incorporate location intelligence into their services. Retail and wholesale businesses want to evaluate catchment areas in their regions, real estate companies want to assess locations by their accessibility, and logistics and cargo companies want (and need) to know travel times because they transport goods.

The Challenge: Travel Time Prediction

So, in order to offer such services and assess locations based on the access times to different regions, what do we need to know? Put simply, travel times. We either need a travel time matrix or the ability to create one on the fly. And the machine learning approach is to train your model based on a large enough historical travel time dataset so that it will predict the travel time accurately for a new travel query with a source location, destination location, and date. Basically, we need a huge dataset within a given city and a proper machine learning model.

As a tech company, Uber refers to this question as a billion-dollar question. Travel time, or — in their lingo — ETA (estimated time of arrival), is one of the key performance indicators for their business.

Not time travel! Travel time! (That’s also a billion-dollar question, though.)

Machine learning enthusiasts might already remember this challenge from a couple of Kaggle competitions, such as this one on predicting NYC taxi trip duration and, more recently, this one on NYC taxi fare prediction. Both of them broadly focused on New York City, covering mostly Manhattan and Brooklyn.

My question is: what is the key challenge in Uber Movement data that we should build our model around? And why are we motivated to model the data rather than just query it?

Motivation

First, we need to define our perspective. Our mobility assessment needs to be able to create highly accurate travel time predictions with monthly, daily, and even hourly precision for a city of interest. We are using a machine learning approach, so we need a large dataset.

However, there are several issues with Uber’s dataset:

  • It does not cover all source and destination pairs for each time interval. There are gaps, since some routes do not have enough trips in a given interval for Uber to aggregate them and include them in the CSV.
  • It does not provide data aggregated for a specific date-time range in a downloadable format. This means you get an average travel time from an origin region to a destination region across all Mondays, or for 1 pm averaged over three months.
  • It is aggregated for districts. It does not have the location (longitude/latitude) of trip start and end points.
  • And of course, it only covers selected cities of the world. It might not cover the one that interests you.

There are also important issues with Kaggle / NYC trip data:

  • It mainly covers Manhattan which is a much smaller region than a big metropolitan area like London. It is not enough to understand city-wide dynamics.
  • Manhattan has Manhattan distance! Distance calculation there is relatively simple. For most other cities in the world, calculating the distance between two locations is not that easy: you need routing software.
Finding your way in Manhattan

Finally, the main issue when creating travel time predictions for a large city is historical data and filtering it smartly. In this domain, data is really valuable, big, and hard to reach. You need to downsize it in order to even model it. Think of a specific route and the travel times on that route. If you want hourly precision, your data is multiplied by 24; for day-of-week precision, by 7; and for daily precision over one quarter, by 90 again. The combinations multiply very quickly. Then think of this for the tens of thousands of possible routes in a large city.
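To get a feel for the numbers, here is a quick back-of-the-envelope calculation in R; the route count is purely an illustrative assumption:

# purely illustrative: assume ~50,000 distinct routes in a large city
n_routes <- 50000
# hourly precision (24 slots) for every day of one quarter (90 days), per route
n_routes * 24 * 90
# [1] 108000000

That is over a hundred million rows for a single quarter of a single city, before adding any other dimension.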

The exponential growth of spatiotemporal data

It may quickly occur to you that you’ll need to model this data, rather than storing each of these combinations in a database. And even for modeling, you can downsize your data by selecting specific origin and destination points because there are almost infinite combinations of different routes in a city.

Uber Movement Data and modeling it comes into play at this point:

  • Using Uber Movement Data can help you select your OD (origin-destination) matrix pairs in a large city so that a minimum number of route combinations captures the maximum variability in travel time (and thus traffic congestion).

1- Ask yourself: How many origin locations do you need to select?

2- Then ask: How many destinations?

3- Finally: optimize your selection for different parts of the city.

  • Uber Movement Data used in this way can help you to understand the real flow and mobility of people in a large city.
  • Once you have the model, you can query any specific location pair as a route and disregard the missing entries in the dataset. Machine learning already covers that for you.

And Action!

Enough with the introduction, I’ll summarize the steps I’m about to show you:

1- Download and explore the weekly aggregated dataset for London. (Because it is a large enough dataset, and I like London!)

2- For the data preparation, integrate and format the data.

3- Choose a model and apply it. Evaluate the accuracy metrics.

4- Visualize the prediction errors on the map. Compare it with the findings in data exploration.

5- Compare some travel time results between Google Maps and the model.

6- Comment on possible improvements to the model. Include a "what we've learned" section.

Data Understanding

We’ll first go to the Uber Movement website and navigate our way to London. Then we’ll download the CSV file for “Weekly Aggregate.” In this case, we'll choose the latest quarter as of now: 2018 Quarter 1.

We’ll also need the geographical boundaries file to set regional coordinates.
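As a minimal sketch, loading the weekly aggregate could look like this; the file name is an assumption, so adjust it to whatever your download is called:

# read the weekly aggregated travel times into a data frame
# (the file name is an assumption; adjust it to your download)
my_london_mv_wa <- read.csv("london-lsoa-2018-1-WeeklyAggregate.csv",
                            stringsAsFactors = FALSE)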

We’ll summarize the data. As you can see, there are close to 3 million records there! From the origin region to the destination region, we can find the mean travel time for each day of the week (dow) coded as 1 to 7.

str(my_london_mv_wa)

'data.frame':	2885292 obs. of  7 variables:
 $ sourceid                                : int 705 137 131 702
 $ dstid                                   : int 167 201 261 197 131 232 157 451 302 137 ...
 $ dow                                     : int 7 7 7 7 7 1 7 7
 $ mean_travel_time                        : num 1699 1628 3157
 $ standard_deviation_travel_time          : num 600 541 688 206
 $ geometric_mean_travel_time              : num 1621 1556 3088
 $ geometric_standard_deviation_travel_time: num 1.34 1.34 1.23

We'll need to map each "sourceid" and "dstid" to a region, so we'll read the GeoJSON file. It defines 983 regions in London. For each of them, there is a bounding polygon that defines the region. A polygon here is a list of line segments that form the boundary, and each segment has a start and an end point defined by longitude and latitude.

library(jsonlite)
library(geosphere)
# load objects
my_london_regions <- jsonlite::fromJSON("london_lsoa.json")
# check your region list
head(my_london_regions$features$properties)
# polygon coordinates for each region
str(my_london_regions$features$geometry$coordinates)

Now let’s do the trick and then explain what happened here.

my_london_polygons <- my_london_regions$features$geometry$coordinates
my_temp_poly <- my_london_polygons[[1]]
poly_len <- length(my_temp_poly)/2
poly_df <- data.frame(lng = my_temp_poly[1, 1, 1:poly_len, 1],
                      lat = my_temp_poly[1, 1, 1:poly_len, 2])
my_poly_matrix <- data.matrix(poly_df)
temp_centroid <- centroid(my_poly_matrix)

We save the polygon coordinates into an object and run the procedure for region 1 as a demo. Since our shape is a polygon, we can represent it by its centroid, because we need a single coordinate for each region. The centroid function from the "geosphere" package calculates it once we provide the input in the required matrix format. At last, we have the centroid for that region. Let's visualize what we did, using the Leaflet package.

library(leaflet)
leaflet(temp_centroid) %>%
  addTiles() %>%
  addMarkers() %>%
  addPolygons(lng = poly_df$lng, lat = poly_df$lat)
The geometric center of the polygon is its centroid.

Now, this is what we expected. There is a bounding polygon for region 1, and we've calculated its centroid, so region 1 is now defined by a single location: its centroid latitude and longitude. It is easy to repeat this procedure for each region and prepare a final list:
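A minimal sketch of that loop follows, assuming every region's coordinate array has the same structure as region 1 (true multipolygons would need extra handling) and using the list index as a stand-in for the region id:

# repeat the centroid calculation for every region
my_london_centroids <- do.call(rbind, lapply(seq_along(my_london_polygons), function(i) {
  my_poly <- my_london_polygons[[i]]
  n <- length(my_poly)/2
  # assumes the same [1, 1, n, 2] array layout as region 1
  m <- cbind(lng = my_poly[1, 1, 1:n, 1], lat = my_poly[1, 1, 1:n, 2])
  cent <- centroid(m)
  # the list index stands in for the region id; in practice, join it back to
  # the id in my_london_regions$features$properties
  data.frame(id = i, lng = cent[1, 1], lat = cent[1, 2])
}))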

head(my_london_centroids_450, 3)

   id         lng      lat
1 212 -0.16909585 51.49332
2  82 -0.04042633 51.39922
3 884 -0.02818667 51.45673

Data Preparation

Our final data set needs to have a source location, destination location, date, and distance. We’ve already prepared centroid coordinates of regions in the previous section to see our regions on the map. All that is left is to choose a subset of regions and then calculate the distance between each origin and destination pair. Let’s do it.

a- Subsetting

Why don't we just use all 983 regions? If you recall the quote at the beginning of the article, near things are more related, so nearby regions have similar travel times. Subsetting the regions makes our final dataset smaller and our modeling time shorter. It also reduces the number of route combinations for which we have to calculate distances, which is the costly part.

Now, our dataset has 983 different regions and on average, they have around 450 destinations. After a couple of subsetting and modeling trials (to evaluate the accuracy), we’ll randomly select 450 origin regions and select 150 random destinations for each of them. Let’s see what that looks like:

Randomly selected 450 origin region centroids inside London city boundaries.
Here are 150 trips (destinations) from region 1. The circle has a 15km radius.
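For reference, here is a rough sketch of that sampling step. Treat it as an assumption of how it might be done, reusing the hypothetical my_london_centroids table from the earlier sketch, not the exact code behind the trials above:

set.seed(42)
# pick 450 origin regions at random from the centroid table
my_origin_ids <- sample(my_london_centroids$id, 450)
# for each origin, sample up to 150 of the destinations present in the trip data
my_od_pairs <- do.call(rbind, lapply(my_origin_ids, function(o) {
  dsts <- unique(my_london_mv_wa$dstid[my_london_mv_wa$sourceid == o])
  data.frame(sourceid = o,
             dstid = dsts[sample.int(length(dsts), min(150, length(dsts)))])
}))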

So, most of the trips fall within a 15 km radius, and some trips to Heathrow Airport are included as well. We can easily say (by checking other regions as well) that our model will be good enough to predict the travel time of trips that are (1) around 15 km in distance and (2) to airports.

We are also interested in the density distribution of our 450 origin regions. R has powerful geospatial packages to help us with this.

library(spatstat)
## create a point pattern object
my_london_centroids_450_pp <- ppp(my_london_centroids_450[,2], my_london_centroids_450[,3],
                                  c(-0.4762373, 0.2932338), c(51.32092, 51.67806))
# visualize density
plot(density(my_london_centroids_450_pp))
The density of our origin locations (regions).
The rasterized density image on top of London map

So, the density of our origin locations is higher in the center and decreases towards the outskirts. Again, we can make a rough guess before modeling and say that the prediction error will be lower in the center, since there are many more origin locations (regions) there.

Again using the powerful "spatstat" and "geosphere" packages, we can analyze the distances to destinations in more detail. They can easily give us the kth nearest neighbors via the point pattern object and the "nnwhich" and "geoDist" functions:

# closest first 5 neighbor distances to destination ids
head(my_london_centroids_450_nd3[,c(1,5,6,7,11,12)])

   id       gd1       gd2       gd3      gd4      gd5
1 212  573.0708  807.4307  710.0935 1694.490 1325.124
2  82 1086.2706 1370.0332 1389.9356 3018.098 2943.296
3 884  641.8115  767.1245 1204.1413 2428.555 2320.905
...
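The helper that produced this table is not shown here; as a rough sketch under that caveat, kth-neighbor distances between the selected centroids can be obtained by combining spatstat's nnwhich with geosphere's distGeo (the point pattern is in degrees, so the neighbor search itself is only approximate):

library(spatstat)
library(geosphere)
# indices of the 5 nearest neighbors for each centroid in the point pattern
nn_idx <- nnwhich(my_london_centroids_450_pp, k = 1:5)
# geodesic distance (in meters) from each origin to its kth nearest neighbor
nn_dist_m <- sapply(1:5, function(k)
  distGeo(as.matrix(my_london_centroids_450[, c("lng", "lat")]),
          as.matrix(my_london_centroids_450[nn_idx[, k], c("lng", "lat")])))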

b- Distance Calculation

Let's consider one last thing before modeling: we need to calculate the distances between our origin and destination pairs. We cannot rely on Manhattan distance or as-the-crow-flies distance. It needs to be the real distance a car would drive, so we need routing software that can calculate the distance between two points along an actual route in the city.

For this, we have a couple of options. Paid options like the Google Maps API can be costly (hundreds of dollars), since we will have around 67,500 routes (450 origins × 150 destinations). The most comprehensive free routing software is OSRM (Open Source Routing Machine), which runs on OpenStreetMap data.

library(osrm)
# calculate distance
my_route_d <- osrmRoute(src = my_london_centroids_450[my_r1,],
                        dst = my_london_centroids_450[my_r2,], overview = FALSE)
# route segments if needed to draw a polyline
my_route <- osrmRoute(src = my_london_centroids_450[my_r1,],
                      dst = my_london_centroids_450[my_r2,])
Calculations from OSRM results are drawn with Leaflet

The OSRM package uses the demo OSRM server by default, and it is restricted to reasonable and responsible usage. We do not want to create bottlenecks in the demo server by sending tens of thousands of requests. We need our own routing server!

There is a neat tutorial here that describes how to set up your own OSRM server on an Ubuntu machine. The good news is that you don't need to be a Unix guru to set it up; rookie-level familiarity is enough. Personally, I used an Amazon EC2 instance of type m4.xlarge running Ubuntu 16.04 (Xenial).

Once we are done, we can set the OSRM server options to our new server IP:

options(osrm.server = "http://xxx.xx.xx.xx:5000/")

Now we are ready to calculate distances for each route combination of our origin and destination pairs.
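A minimal sketch of that batch run, reusing the hypothetical my_od_pairs and my_london_centroids objects from the earlier sketches and the same osrmRoute call as above:

library(osrm)
options(osrm.server = "http://xxx.xx.xx.xx:5000/")  # point this to your own instance
# distance (in km) for every origin-destination pair; kept sequential on purpose
my_od_pairs$distance <- vapply(seq_len(nrow(my_od_pairs)), function(i) {
  src <- my_london_centroids[my_london_centroids$id == my_od_pairs$sourceid[i], ]
  dst <- my_london_centroids[my_london_centroids$id == my_od_pairs$dstid[i], ]
  osrmRoute(src = src, dst = dst, overview = FALSE)[2]  # second element is the distance
}, numeric(1))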

Modeling

Finally, we are ready for the fun part. Let’s look at our data set after the preparation:

head(my_london_sample_wa450150)

     lng_o   lat_o      lng_d    lat_d dow distance travel_time
 -0.374081 51.5598 -0.4705937 51.54076   5    10.92     1048.46
 -0.399517 51.4904 -0.4705937 51.54076   3    27.86      947.94
 -0.368098 51.5900 -0.4705937 51.54076   4    14.30     1550.46
...

We have the origin/destination coordinates, the day of the week, the distance (in kilometers), and the travel time (in seconds). This dataset has 421,727 rows.

In travel time prediction, there are a couple of favored algorithms. In the same presentation, Uber lists them as:

The four preferred algorithms by Uber in Travel Time Prediction

We are going to try Random Forest. It's an out-of-the-box algorithm that requires minimal feature engineering. After creating a training set from 70% (around 290K rows) of the data set:

library(randomForest)
modFitrf <- randomForest(travel_time ~ dow + lng_o + lat_o + lng_d + lat_d + distance,
                         data = training_shuf[, c(3:9)], ntree = 100)
# result
randomForest(formula = travel_time ~ dow + lng_o + lat_o + lng_d + lat_d + distance,
             data = training_shuf[, c(3:9)], ntree = 100)
Type of random forest: regression
Number of trees: 100
No. of variables tried at each split: 2
Mean of squared residuals: 18061.1
% Var explained: 96.84

Training takes around 2 hours on an Amazon EC2 instance of type m4.2xlarge.

The training error is 3.2% and the test error is 5.4%. We also have a holdout dataset, made up of regions that were not included in our subset. For 100 randomly selected regions from those remaining ones, the holdout error rate is 10.8%.
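For reference, these error figures are relative errors on the predicted travel times. Assuming the metric is the mean absolute percentage error (my assumption) and that the 30% test split sits in a hypothetical testing_shuf data frame, it could be computed like this:

# mean absolute percentage error on the test set
# (the metric choice and the testing_shuf name are assumptions)
test_pred <- predict(modFitrf, testing_shuf)
mape <- mean(abs(test_pred - testing_shuf$travel_time) / testing_shuf$travel_time)
round(100 * mape, 1)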

Model Evaluation

Let’s visualize our prediction errors spatially on our London map.

Test error rates grouped.

Our intuition has turned out to be correct. We have lower errors for the regions in the center, and Heathrow Airport also has a black dot on it, which means a lower error rate there as well.

Numerically, we can calculate the correlation as well. The correlation between distance to the center and prediction error is moderate (about 0.55).

cor(my_london_centroids_450_hm$distc, my_london_centroids_450_hm$testprc)
[1] 0.5463132

We can again use our spatial package "spatstat" to visualize the error rates. This time, we are going to smooth them over two-dimensional space with spatial interpolation.

library(spatstat)
# assign corresponding prediction errors to our coordinates in 2-d
marks(my_london_centroids_450_pp) <- my_london_centroids_450_hm$testprc
## apply inverse distance weighting / spatial interpolation
plot(idw(my_london_centroids_450_pp))
Our interpolation graphic is worth the effort. Again, we can observe the regions with a low error rate.

We can now try our London travel time predictor. Let's do a random comparison: we'll choose two random points in Google Maps:

Google Maps calculates 21 minutes travel time on Monday evening

Then we’ll calculate the distance between the very same points with OSRM and pass the required parameters to our model to predict the travel time.

library(osrm)
lon_o <- 0.054089 ; lat_o <- 51.591831
lon_d <- 0.114256 ; lat_d <- 51.553765
# calculate distance
my_distance <- osrmRoute(src = data.frame(id = 1, lon = lon_o, lat = lat_o),
                         dst = data.frame(id = 2, lon = lon_d, lat = lat_d),
                         overview = FALSE)[2]
# get route
my_route <- osrmRoute(src = data.frame(id = 1, lon = lon_o, lat = lat_o),
                      dst = data.frame(id = 2, lon = lon_d, lat = lat_d))
# calculate travel time with our model for Monday
travel_time <- predict(modFitrf, data.frame(dow = 1, lng_o = lon_o, lat_o = lat_o,
                                            lng_d = lon_d, lat_d = lat_d,
                                            distance = my_distance))

Let’s visualize the results:

Prediction from our model visualized on Leaflet

Even for a region that is not close to the center, our model made a fair prediction, missing the Google Maps estimate by just a few minutes.

We can find our expected test error rate on the origin location by using our interpolated test error rates.

my_image[list(x= 0.054089,y=51.591831)]
[1] 0.06567522

So based on the distribution of test error rates in 2-dimensional space we expect around 6% error for the travel times in that region.

Naturally, we face a higher error rate here, since the Uber Movement dataset we used does not have hourly precision: we made an overall prediction for a Monday and could not capture peak or off-peak times. Also, note that we used first-quarter data, which means our model mostly reflects winter conditions, while this comparison was made in September 2018, so we did not capture the seasonal variation either.

Final Notes

Based on the Uber Movement data for London covering the first quarter of 2018, we built a travel time predictor with machine learning, using the Random Forest algorithm.

Possible improvements:

  • Hourly aggregated data can be analyzed. We can then combine our current model with a model trained on the hourly aggregated data to get more precise results that capture daily variation in traffic congestion.
  • Now that we have the first results, subsetting can be done more strategically. It is obvious that we need more regions on the outskirts to further reduce error rates.
  • We can create separate models for the center and the outskirts, tuned separately, for example with deeper trees for the outskirts in Random Forest.
  • We can try other algorithms (KNN, XGBoost, or neural networks) and combine them into ensembles.
  • The same modeling can be done for the other quarters of the year to capture seasonality.

What we can learn from this:

  • Spatial analysis is required since we have spatiotemporal data.
  • Interpolation is a powerful transformation tool to explore and use such data.
  • Selecting origin and destination regions is essentially an optimization problem: we are trying to capture the most variability in travel time while keeping the number of origins and destinations at a minimum.

I would like to hear your comments and suggestions!

Please feel free to reach out to me on LinkedIn and Github.
