AI For SEA Traffic Management: Feature Engineering (Part 1/2)
Create relevant model features and handle data gaps
Also read:
AI for SEA Traffic Management: Modeling (Part 2/2)
AI For SEA Traffic Management: Window LSTM for Multi-Step Forecasting (Epilogue)
I recently decided to participate in Grab’s AI for SEA challenge. Grab presented 3 different challenges: Traffic Management (the one I chose), Computer Vision, and Safety.
The process was quite fun and required a fair amount of work, so I decided to write these posts to detail how I tackled the problem. Hopefully, this will be helpful to the assessors and to the other participants in the challenge. In this article, I will introduce the challenge and share how I transformed the original training set provided by Grab (feature engineering, filling data gaps, etc.).
All my code can be found on GitHub.
Problem statement from the website:
“Economies in Southeast Asia are turning to AI to solve traffic congestion, which hinders mobility and economic growth. The first step in the push towards alleviating traffic congestion is to understand travel demand and travel patterns within the city.
Can we accurately forecast travel demand based on historical Grab bookings to predict areas and times with high travel demand?
In this challenge, participants are to build a model trained on a historical demand dataset, that can forecast demand on a Hold-out test dataset. The model should be able to accurately forecast ahead by T+1 to T+5 time intervals (where each interval is 15-min) given all data up to time T.”
1. Understanding the dataset
The original dataset provided by Grab looks like this:
Each demand record corresponds to a 15-minute interval. The dataset spans 61 days in total.
The tricky part is that each geohash6 code has its own set of days and timestamps. This is crucial to understand since we are dealing with a time series problem, and it makes sequencing the dataset much easier.
In total, there are around 1,300 unique geohash6 codes. A quick aggregation shows that a complete code should have a record count close to 61 days * 24 hours * 4 quarters = 5,856 timestamps.
My first step was to use pandas to create a ‘timestamp_hour’ column: a single numeric timestamp that combines the day number and the time of day into an hourly format. This makes the data much easier to analyze. The following code will help you understand:
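The original code screenshot is not reproduced here, so below is a minimal sketch of the same idea. The column names `day` (1 to 61) and `timestamp` (“HH:MM”) follow the raw Grab dataset; the geohash6 and demand values are placeholders:

```python
import pandas as pd

# Placeholder rows standing in for the raw Grab training set.
df = pd.DataFrame({
    "geohash6": ["qp03wc", "qp03wc", "qp03wc"],
    "day": [1, 1, 2],
    "timestamp": ["0:0", "0:15", "10:30"],
    "demand": [0.12, 0.08, 0.31],
})

# Split "HH:MM", then express day + time of day as hours elapsed
# since day 1, 00:00 -- consecutive 15-min intervals differ by 0.25.
hm = df["timestamp"].str.split(":", expand=True).astype(int)
df["timestamp_hour"] = (df["day"] - 1) * 24 + hm[0] + hm[1] / 60

print(df["timestamp_hour"].tolist())  # [0.0, 0.25, 34.5]
```

With this encoding, a gap in the 15-minute grid shows up as a jump larger than 0.25 between consecutive `timestamp_hour` values, which will be useful later.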
Then I aggregate the data to confirm the statement above:
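The aggregation itself can be sketched like this (a toy frame stands in for the full dataset; in the real data, complete geohash6 codes have close to 5,856 rows):

```python
import pandas as pd

# Toy frame: one geohash6 with several records, one with a single record.
df = pd.DataFrame({
    "geohash6": ["qp03wc", "qp03wc", "qp03wc", "qp09sx"],
    "demand": [0.1, 0.2, 0.3, 0.4],
})

# Count how many 15-minute records each geohash6 actually has.
agg_geohash = (df.groupby("geohash6")
                 .size()
                 .sort_values(ascending=False))
print(agg_geohash)  # qp03wc -> 3 records, qp09sx -> 1 record
```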
We are pretty close to having complete records for these geohash6 codes: we are only missing 10 timestamps to get a complete time series for the first four codes. Unfortunately, that is not the case for all geohash6 codes. Some are incomplete (e.g., around 3,000 timestamps) and others have just a single record:
agg_geohash.tail()
2. Getting an intuition of the time series pattern
I wanted to get an intuition of the behavior of demand over time for one particular geohash. It turns out that the sample I have picked follows a very typical time series behavior. I have tried with different geohashes (complete ones) and they all display a somewhat stationary behavior:
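The plot itself is not reproduced here, but it can be recreated with a few lines of matplotlib. The frame below uses synthetic, daily-cycling data standing in for one geohash6; in practice you would filter the real dataset, e.g. `df[df["geohash6"] == code]`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in: one week of 15-minute steps with a daily cycle.
hours = np.arange(0, 24 * 7, 0.25)
one_geo = pd.DataFrame({
    "timestamp_hour": hours,
    "demand": np.abs(np.sin(hours * np.pi / 12)),
})

plt.figure(figsize=(14, 4))
plt.plot(one_geo["timestamp_hour"], one_geo["demand"])
plt.xlabel("hours since start of dataset")
plt.ylabel("demand")
plt.title("Demand over time for one geohash6")
plt.savefig("demand_one_geohash.png")
```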
We can clearly see a stationary pattern for this geohash6. Peaks occur around the same hours of the day, and demand on weekends is noticeably lower. We should be able to model this with time-series features built from previous demand values. Now let us move on to the feature engineering so that we get a training set we can actually build a model upon.
3. How to create time lags while handling the incomplete geohash6 codes?
As mentioned above, the dataset is incomplete: for some timestamps, the provided dataset has no recorded value. Luckily for us, the Grab team noticed this and stated in the FAQ that we should simply assume no demand (a value of 0) for missing timestamps. This assumption will be particularly useful.
Given the problem, I need to create Demand at T+1 through Demand at T+5 as targets. As model features, I decided to use Demand at T-1 down to Demand at T-5. I also included the latitude, longitude, and corresponding timestamps (in a decimal hourly format). Since I decided to use an LSTM (see next post), I had to normalize these features using Min-Max scaling to avoid exploding gradients.
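Min-Max scaling can be done by hand with pandas, as in the sketch below (the column names and coordinate values are illustrative, not the actual data):

```python
import pandas as pd

# Illustrative feature columns; real values come from the Grab dataset.
features = pd.DataFrame({
    "latitude": [-5.35, -5.40, -5.30],
    "longitude": [90.65, 90.70, 90.60],
    "timestamp_hour": [0.0, 6.5, 23.75],
})

# Min-Max scaling: each column is mapped linearly onto [0, 1].
scaled = (features - features.min()) / (features.max() - features.min())
print(scaled)
```

Keeping all inputs on a comparable [0, 1] scale is what prevents any one feature (e.g., longitude around 90) from dominating the LSTM’s gradients.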
The preprocessing code can be found here. The functions I called in the preprocessing code can be found on this link.
I will not go into detail on how I created these lags, as the code is a bit dense. However, I’d like to spend some time explaining how I handled the missing timestamps in order to fill in the incomplete demand. Once the dataset is ordered by geohash and timestamp, the time difference between a current timestamp and its predecessor should be the number of lags * 15 minutes. Similarly, the time difference between a current timestamp and its successor should be the number of steps * 15 minutes. The code replaces the demand with 0 for the previous/next timestamps that do not satisfy these conditions.
See code below:
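As a condensed sketch of that idea (the function and column names here are my own, assuming a `timestamp_hour` column that advances by 0.25 per 15-minute interval):

```python
import pandas as pd

def add_lags(geo_df, n_lags=5):
    """Add demand lag columns for a single geohash6, zeroing invalid lags."""
    geo_df = geo_df.sort_values("timestamp_hour").reset_index(drop=True)
    for lag in range(1, n_lags + 1):
        shifted_demand = geo_df["demand"].shift(lag)
        shifted_time = geo_df["timestamp_hour"].shift(lag)
        # Only keep the shifted demand when the row `lag` positions back
        # really is `lag` 15-minute steps earlier; else assume 0 demand.
        valid = (geo_df["timestamp_hour"] - shifted_time) == 0.25 * lag
        geo_df[f"d_t-{lag}"] = shifted_demand.where(valid, 0.0)
    return geo_df

# One geohash with a gap: 0.50 is missing, so the T-1 lag at 0.75 is 0.
geo = pd.DataFrame({"timestamp_hour": [0.0, 0.25, 0.75],
                    "demand": [0.1, 0.2, 0.4]})
out = add_lags(geo, n_lags=1)
print(out["d_t-1"].tolist())  # [0.0, 0.1, 0.0]
```

The same pattern, with `shift(-step)`, produces the T+1 to T+5 target columns.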
I basically process each geohash6 code one by one so that I can use the function pd.Series.shift() to get my time lags/steps. However, since the dataset is not complete, there is no guarantee that shifting always lands on the expected timestamp. I need to check the previous/next timestamps explicitly. The logic is the following (pseudocode):
time_delta = timestamp_lag - timestamp_hour
if time_delta != 0.25 * lag:
    return 0
else:
    return demand
If the code looks a bit complicated, the following example should make it clear.
Once the feature engineering is done, we obtain the following demand features, which I have shortened with the letter “d”.
Another feature I will use for modeling is the decimal hourly timestamp, from 0.0 (midnight) to 23.75 (11:45 PM), normalized:
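Building that feature from the raw “HH:MM” strings can be sketched as follows (the sample strings are illustrative):

```python
import pandas as pd

# Illustrative raw "HH:MM" timestamps.
ts = pd.Series(["0:0", "6:30", "23:45"])

# Decimal hour within the day, then Min-Max normalized; the minimum
# possible value is 0.0 and the maximum is 23.75, so dividing suffices.
hm = ts.str.split(":", expand=True).astype(int)
decimal_hour = hm[0] + hm[1] / 60
normalized = decimal_hour / 23.75
print(normalized.tolist())
```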
I will also use the normalized latitude and longitude:
Hope you’ve enjoyed! If you’re interested in seeing how I modeled the problem, please take a look at the second part of the challenge here:
AI for SEA Traffic Management: Modeling (Part 2/2)
Kilian