Photo by Viktor Laszlo taken from Pixabay

Big Data: Uber Forecast in New York

Converting 4.5 million records into insights and predictions

Aviv Yaniv
Towards Data Science
6 min read · Jul 14, 2020


In this story, I describe the prediction method I developed during the Big Data course I took at Tel Aviv University as an Economics student.

I received a dataset of 4.5 million Uber ride orders in New York City; Uber is a service that connects drivers with people looking for a ride.
The dataset contained observations from April through July 2014 and included the location and time of each order.
Based on the dataset, I was asked to predict the number of orders in each 15-minute interval of a future month, September 2014.

First, I will describe the research on the data to find patterns and connections; then, based on the research insights, additional data will be cross-checked.
Finally, the model that predicts future ride trends will be elaborated.

Data Research

First, I cleaned the data based on the project specifications: I narrowed the rides to a 1-kilometer radius around the New York Stock Exchange and to the hours from 5 PM until midnight, and made sure there were no missing data cells.
Then, the orders were divided into 15-minute time slots.
I created the following graph, which describes how many 15-minute time slots contained a given number of orders
(i.e., a given number of orders, such as 20, is on the x-axis, and the number of 15-minute time slots with that many orders is on the y-axis).
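
Under the hood, this binning step can be sketched with pandas; note that the original analysis was done in R, and the column name `pickup_datetime` is an assumption, since the dataset's exact schema isn't shown:

```python
import pandas as pd

# Hypothetical column name; toy timestamps stand in for the real 4.5M rows.
orders = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2014-04-01 17:03", "2014-04-01 17:07", "2014-04-01 17:21",
        "2014-04-01 17:59", "2014-04-01 18:02",
    ])
})

# Bin each order into its 15-minute time slot and count orders per slot
slots = orders["pickup_datetime"].dt.floor("15min")
orders_per_slot = slots.value_counts().sort_index()

# Distribution for the graph: how many slots had a given number of orders
distribution = orders_per_slot.value_counts().sort_index()
print(distribution)
```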

Number of rides distribution

As you can see, the distribution's peak is close to its mean, after which it drops drastically.
The distribution's tail is composed mostly of anomalous time slots in which there was soaring demand for rides.
This pattern remains consistent across the other months as well.
This phenomenon of an extraordinary number of orders triggered my curiosity, so I organized the following table, which describes major events and weather conditions.
The table is ordered by descending number of rides and is meant to reveal whether there is a correlation between major events or rainy weather and the number of pickups.
As you can see, in 52% of the time slots with more than 40 rides, and in 48% of the time slots with 30 to 40 rides, there was indeed cold weather or a major event, which may explain the soaring demand for rides.
Although this can explain some of the anomalous time slots, these factors (such as future weather or major events) cannot be predicted, and special attention was paid to handling them when devising the model.

The data research continued with the creation of a “heat map” that describes the number of orders at every round hour on each day of the week.

Heatmap of rides number for each hour in each day

As you can see, there is a rise in demand from 5 PM to 7 PM on workdays (probably explained by commuting home from work), as well as a drop in demand during the late-night hours of workdays.
This pattern remains consistent across the other months as well.
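
A heat map like this boils down to a pivot table of order counts by hour and weekday. A minimal pandas sketch, with toy timestamps in place of the real data:

```python
import pandas as pd

# Toy pickups; April 7, 2014 was a Monday, April 12 a Saturday
pickups = pd.to_datetime([
    "2014-04-07 17:10", "2014-04-07 17:40", "2014-04-07 18:05",
    "2014-04-12 22:15", "2014-04-12 23:30",
])
df = pd.DataFrame({
    "day_of_week": pickups.day_name(),
    "hour": pickups.hour,
})

# Rows = round hour, columns = day of week, values = number of orders
heatmap = df.pivot_table(index="hour", columns="day_of_week",
                         aggfunc="size", fill_value=0)
print(heatmap)
```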

Moreover, the trend in the total number of orders across months demonstrates Uber's growth from April to July, with a notable increase during May.

Uber’s growth in the number of orders during April-July, 2014

In addition, I researched demand areas and their patterns.
The following heat map marks the areas with the highest order counts in warmer colors, for different days of the week.

Dynamic heatmap for each day; warmer colors are areas with a higher number of orders

As you can see, demand changes across the days of the week, most significantly between workdays and the weekend.

However, the “warmer” areas remain stationary across months and exhibit lower entropy than the daily heat maps.

Dynamic heatmap for different months; warmer colors are areas with a higher number of orders

These warm areas correlate with attraction and interest points in Manhattan.
Another interesting observation is that the warm areas are not close to train stations, where trains pass frequently during those hours.
It is reasonable to believe that trains are substitute goods for Uber rides in some cases.

Lastly, a correlation matrix was created.

Building the Model

The developed model is a clustering model.
Each order was assigned to a cluster, with cluster centers placed at the centers of the warmest areas.
This division is meant to learn the patterns of each area independently, since different areas become warm on different days and hours, yet the cluster centers, a.k.a. “centroids”, are almost stationary.
To choose the right number of clusters, I created a Total Within-Cluster Sum of Squares graph.
This graph is used to determine a reasonable number of clusters (denoted K) using the “elbow method” heuristic: stop adding clusters (raising K) at the cutoff point where the diminishing returns are no longer worth the additional cost, i.e., when an extra cluster explains only an inconsiderable amount of the data.

Total Within Cluster Sum of Squares graph

The chosen number of clusters (configurable in the code) is K = 8, whose centroids matched the warmer areas mentioned above.
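
Although the original analysis was done in R, the elbow computation can be sketched in Python with scikit-learn; the coordinates below are synthetic stand-ins for the real pickup locations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic pickup coordinates scattered around three hypothetical "warm" centers
centers = np.array([[40.71, -74.01], [40.75, -73.99], [40.76, -73.97]])
points = np.vstack([c + rng.normal(scale=0.003, size=(200, 2)) for c in centers])

# Total within-cluster sum of squares (inertia) for each candidate K
wss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wss.append(km.inertia_)

# The "elbow" is where adding another cluster stops reducing WSS substantially
print([round(w, 4) for w in wss])
```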

To build the model, after dividing the orders into clusters, a designated table of tables was built for the clusters using the dplyr library developed by Hadley Wickham.
Each row in the main table holds the data for one cluster, and a linear regression was fitted to it.
In this way, the model learns the patterns of each cluster independently, and the regression yields different coefficients for each cluster based on its unique characteristics.
The desired prediction for a future 15-minute time slot is the sum of the predictions of all clusters.

The linear regression model is:

pick_num ~ minute + hour + day_in_week + hour*day_in_week

The interaction between the hour and the day of the week was added to capture the effects of their different combinations.

In addition, as described above, every month contains anomalous time slots with an extraordinary number of orders;
to tackle this issue, and to cancel the harmful side effects of weather and unpredictable events, a cap on the number of orders was introduced: threshold = 9 per cluster, configurable in the code.
Applying the threshold per cluster is beneficial because it suppresses an anomaly in one cluster without side effects on the others, and it is more flexible than a single global threshold.
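
Putting the pieces together, the per-cluster pipeline can be sketched as: cap anomalous counts at the threshold, fit one regression per cluster including the hour-by-day interaction, and sum the per-cluster predictions. The original used R and dplyr; this Python sketch uses entirely illustrative data and column names:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

THRESHOLD = 9  # per-cluster cap on orders, as in the article (configurable)

rng = np.random.default_rng(0)
# Toy training data: one row per (cluster, 15-minute slot) with an order count
df = pd.DataFrame({
    "cluster": rng.integers(0, 3, 600),
    "minute": rng.choice([0, 15, 30, 45], 600),
    "hour": rng.integers(17, 24, 600),
    "day_in_week": rng.integers(0, 7, 600),
})
df["pick_num"] = 2 + df["hour"] - 16 + rng.poisson(2, 600)

# Cap anomalous counts per cluster before fitting
df["pick_num"] = df["pick_num"].clip(upper=THRESHOLD)

def features(d):
    # pick_num ~ minute + hour + day_in_week + hour*day_in_week
    X = d[["minute", "hour", "day_in_week"]].copy()
    X["hour_x_day"] = d["hour"] * d["day_in_week"]
    return X

# Fit one regression per cluster, learning separate coefficients for each
models = {c: LinearRegression().fit(features(g), g["pick_num"])
          for c, g in df.groupby("cluster")}

# Predict a future 15-minute slot by summing the per-cluster predictions
future_slot = pd.DataFrame({"minute": [30], "hour": [18], "day_in_week": [2]})
total = sum(float(m.predict(features(future_slot))[0]) for m in models.values())
print(round(total, 2))
```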

Summary and other models

To sum up, I started with data research, recognizing patterns across days and hours, and continued by researching patterns across months and warm areas.

I found that across months, the warm areas remain almost stationary.
However, across the days of the week, and especially when comparing workdays and weekends, the warm areas shifted.
Then, the anomalous time slots, in which the number of orders soared, were researched.
A correlation between cold weather, major events, and those anomalies was demonstrated.
Armed with those insights, I developed a cluster model that matched the warm areas and learned those patterns for each cluster independently.
In addition to the developed model, simple linear models (although it is clear they cannot grasp the whole picture) and random forest models were tested, as well as their combinations with the cluster model, yet none of them exceeded the results achieved by the model described above.
Finally, a model was presented that divides the city into interest areas, learns unique coefficients for each, and minimizes the harmful effect of anomalies.

Enjoyed this article? Feel free to long-press the 👏 button below 😀
