Traffic in Phnom Penh, Cambodia

AI For SEA Traffic Management: Feature Engineering (Part 1/2)

Create relevant model features and handle data gaps

Kilian Tep
Towards Data Science
6 min read · Jun 17, 2019


Also read:

AI for SEA Traffic Management: Modeling (Part 2/2)
AI For SEA Traffic Management: Window LSTM for Multi-Step Forecasting (Epilogue)

I recently decided to participate in Grab’s AI for SEA challenge. Grab presented three different challenges: Traffic Management (the one I chose), Computer Vision, and Safety.

The process was quite fun and required quite a bit of work, so I decided to write these posts to detail how I tackled the problem. Hopefully, this will be helpful to the assessors and to the other people who participated in the challenge. In this article, I will introduce the challenge and share how I transformed the original training set provided by Grab (feature engineering, filling data gaps, etc.).

All my code can be found on GitHub.

Traffic Management Challenge

Problem statement from the website:

Economies in Southeast Asia are turning to AI to solve traffic congestion, which hinders mobility and economic growth. The first step in the push towards alleviating traffic congestion is to understand travel demand and travel patterns within the city.

Can we accurately forecast travel demand based on historical Grab bookings to predict areas and times with high travel demand?

In this challenge, participants are to build a model trained on a historical demand dataset, that can forecast demand on a Hold-out test dataset. The model should be able to accurately forecast ahead by T+1 to T+5 time intervals (where each interval is 15-min) given all data up to time T.

1. Understanding the dataset

The original dataset provided by Grab looks like this:

First few rows of original dataset
Description of the meaning of each column. More info can be found here.

Each demand record corresponds to a 15-minute interval. The dataset comprises 61 days in total.

The tricky part to understand is that each geohash6 code comes with its own set of days and timestamps. This information is crucial since we are dealing with a time series problem, and it will make sequencing the dataset much easier.

In total, there are around 1,300 unique geohash6 codes. If you do a quick aggregation, you can see that a complete code should have a number of timestamps close to 61 days * 24 hours * 4 quarters = 5,856.

The first step I took was to use pandas to create a column ‘timestamp_hour’, which combines the day and timestamp columns into a single hourly-format timestamp. This makes the data much easier to analyze. The following code will help you understand:
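The exact code is in the repo; the snippet below is a minimal sketch of the same idea, assuming the raw columns are named day (an integer from 1 to 61) and timestamp (a string such as ‘9:15’), and using an arbitrary anchor date for day 1:

import pandas as pd

df = pd.read_csv('training.csv')  # path is an assumption

# split '9:15' into hour and minute components
hours_minutes = df['timestamp'].str.split(':', expand=True).astype(int)

# build a proper datetime: arbitrary anchor date + day offset + time of day
df['timestamp_hour'] = (
    pd.Timestamp('2019-01-01')
    + pd.to_timedelta(df['day'] - 1, unit='D')
    + pd.to_timedelta(hours_minutes[0], unit='h')
    + pd.to_timedelta(hours_minutes[1], unit='min')
)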

Then I aggregated the data to confirm the statement above:

Count of number of timestamps per geohash6 code
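The aggregation itself is a simple groupby count; here is a minimal sketch (the variable name agg_geohash matches the one referenced below):

# count how many 15-minute timestamps each geohash6 code has
agg_geohash = (
    df.groupby('geohash6')['timestamp_hour']
      .count()
      .sort_values(ascending=False)
)
agg_geohash.head()  # complete codes should be close to 61 * 24 * 4 = 5856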

We are pretty close to having complete records for these geohash6 codes, as we are only missing 10 timestamps to get a complete time series for the first four geohash6 codes. Unfortunately, this is not the case for all geohash6 codes. Some are incomplete (e.g. around 3,000 timestamps) and others have just one record:

agg_geohash.tail()
The tail of agg_geohash. Some codes only have one record. Because we are dealing with time series, these geohash6 codes will have to be dropped.

2. Getting an intuition of the time series pattern

I wanted to get an intuition of how demand behaves over time for one particular geohash. It turns out that the sample I picked follows a very typical time series pattern. I tried different (complete) geohashes and they all display a somewhat stationary behavior:

Scatter plot of demand over time for selected geohash6 code ‘qp03wz’ over 15 days

We can clearly see a stationary pattern for this geohash6 code. The high peaks occur around the same hours of the day, and demand on weekends is quite low. We should be able to model this with time series features built from previous values of demand. Now let us move on to the feature engineering so that we get a training set we can actually build a model upon.

3. How to create time lags while handling the incomplete geohash6 codes?

As mentioned above, the dataset is incomplete: for some timestamps, the provided dataset has no recorded value. Luckily for us, the Grab team noticed this and told us in the FAQ to simply assume no demand (a value of 0) for missing timestamps. This assumption will be particularly useful.

Given the problem, I need to create Demand at T+1 through Demand at T+5 as targets. As for my model features, I decided to use Demand at T-1 down to Demand at T-5. I also decided to include the latitude, longitude, and corresponding timestamps (in a decimal hourly format). Since I decided to use an LSTM (see the next post), I had to normalize these features using Min-Max scaling in order to avoid exploding gradients.
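As a rough illustration of the normalization step (the full preprocessing lives in the repo linked below), scikit-learn’s MinMaxScaler can rescale all of these columns to [0, 1]. The column names are placeholders: ts_d is the decimal timestamp described further down, and latitude/longitude are assumed to have already been decoded from the geohash6 codes:

from sklearn.preprocessing import MinMaxScaler

feature_cols = ['latitude', 'longitude', 'ts_d',
                'd_t', 'd_t_minus_1', 'd_t_minus_2',
                'd_t_minus_3', 'd_t_minus_4', 'd_t_minus_5']

# fit on the training set and rescale every feature column to [0, 1]
scaler = MinMaxScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])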

The preprocessing code can be found here. The functions I called in the preprocessing code can be found on this link.

I will not go into detail on how I created these lags, as the code is a bit dense. However, I’d like to spend some time explaining how I handled the missing timestamps in order to fill in the incomplete demand. Basically, once we order the dataset by geohash and timestamp, the time difference between a current timestamp and its predecessor should be the number of lags * 15 minutes. Similarly, the time difference between a current timestamp and its successor should be the number of steps * 15 minutes. The code replaces the demand for the previous/next timestamps that do not satisfy these conditions with 0.

See code below:

I basically process each geohash6 code one by one so that I can use pd.Series.shift() to get my time lag/step. However, since the dataset is not complete, I have no guarantee that shifting always lands on the right timestamp. I need to check the previous/next timestamps. The way I do it is the following (pseudocode):

time_delta = timestamp_lag - timestamp_hour
if time_delta != 0.25 * lag:
    return 0
else:
    return demand
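To make this more concrete, here is a minimal, vectorized sketch of the same idea using groupby().shift() instead of an explicit loop over geohash6 codes (not the exact code from the repo). It assumes timestamp_hour is a datetime column and works in hours, so one 15-minute interval equals 0.25:

df = df.sort_values(['geohash6', 'timestamp_hour']).reset_index(drop=True)

for lag in range(1, 6):
    shifted_demand = df.groupby('geohash6')['demand'].shift(lag)
    shifted_ts = df.groupby('geohash6')['timestamp_hour'].shift(lag)
    # time difference, in hours, between the current row and the shifted row
    tdelta = (df['timestamp_hour'] - shifted_ts).dt.total_seconds() / 3600
    # keep the lagged demand only when the rows are exactly `lag` 15-minute
    # intervals apart; otherwise assume zero demand for the missing timestamp
    df[f'd_t_minus_{lag}'] = shifted_demand.where(tdelta == 0.25 * lag, 0.0)

# the same idea with shift(-step) produces the d_t_plus_{step} target columns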

If the code looks a bit complicated to you, the following example should help you understand it in a very simple way.

The d_t_minus_1 and d_t_plus_1 columns correspond to the observed demand at T-1 and T+1 respectively. The ts values correspond to the shifted timestamps. The time difference (tdelta) is correct here for lag 1 and step 1 (1 * 0.25 = 0.25), so we do not replace the value with 0.
Here, the tdeltas are incorrect for lag 2 and step 2. They should be equal to 2 * 0.25 = 0.5, so we replace the values with 0.0.

Once the feature engineering is done, we obtain the following demand features, which I have shortened with the letter “d”.

*note that d_t and demand are the same values. I just created another column for consistency.

Another feature I will use for modeling is the normalized decimal hourly timestamp, which ranges from 0.0 (midnight) to 23.75 (11:45 PM) before scaling:

ts_d is short for timestamp_decimal. The scaling was obtained through Min-Max scaling
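For illustration, the unscaled decimal timestamp can be derived from timestamp_hour in one line (a sketch; the Min-Max step is the same one shown earlier):

# decimal hour of day: 9:15 becomes 9.25, 23:45 becomes 23.75
df['ts_d'] = df['timestamp_hour'].dt.hour + df['timestamp_hour'].dt.minute / 60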

I will also use the normalized latitude and longitude:

Latitude/Longitude values with the Min-Max scaled values

Hope you’ve enjoyed this post! If you’re interested in seeing how I modeled the problem, please take a look at the second part of the challenge here:

AI for SEA Traffic Management: Modeling (Part 2/2)

Kilian
