
Time Series for Climate Change: Origin-Destination Demand Forecasting

Mining floating car data to tackle climate change

This is Part 8 of the series Time Series for Climate Change.


Floating Car Data for Mobility Modelling

Mining floating car data is a key task in intelligent transportation systems. Floating car data refers to data collected by vehicles equipped with GPS devices. These provide information about the location and speed of vehicles.

Understanding mobility patterns within cities is an important task in transportation. For example, it helps reduce congestion and overall transportation activity. Less time in traffic means fewer greenhouse gases are emitted. So, accurate mobility models can contribute to climate change mitigation.

The widespread adoption of GPS devices has produced many data sets related to mobility. But learning from GPS data is a challenging problem. Spatial dependencies are tricky to capture, yet fundamental. There are also temporal dependencies, such as rush hours. Mobility patterns also differ depending on whether it's a weekday or not.

Origin-destination flow count estimation

Floating car data offers many possibilities for mobility modeling. One of these possibilities is the origin-destination (OD) flow count problem.

OD flow count refers to estimating how many vehicles travel from one sub-region to another in a given period. This task is relevant for several reasons. For example, taxi companies can allocate their fleet dynamically according to the expected demand in a particular zone.


Hands-on: Forecasting OD Demand in San Francisco

In the rest of this article, we’ll forecast taxi passenger demand in San Francisco, USA. We’ll tackle this problem as an OD flow count task.

The full code used in this tutorial is available on GitHub:

Data set

We will use a data set collected by a taxi fleet in San Francisco, California, USA. The data set contains GPS data from 536 taxis over a period of 21 days. In total, there are 121 million GPS traces split across 464045 trips. You can check reference [1] for more details.

At each time step and for each taxi, we have information about its coordinates and whether a passenger occupies it.

Defining the problem

Our goal is to model where people are moving to given their origin. OD flow count estimation can be split into four sub-tasks:

  1. Spatial grid decomposition
  2. Selection of origin-destination pairs
  3. Temporal discretization
  4. Modeling and forecasting

Let’s dive into each problem in turn.

Spatial grid decomposition

Spatial decomposition is a common preprocessing step for OD flow count estimation. The idea is to split the map into grid cells, which represent a small part of the city. Then, we can count how many people traverse each possible pair of grid cells.
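To make the idea concrete, here is a minimal sketch of how a point could be assigned to a cell of a regular grid. The function name cell_index and the bounding-box values are hypothetical, chosen for illustration only. They are not taken from the src.spatial module used in this tutorial, which may work differently.

```python
def cell_index(lat, lon, bbox, n_rows, n_cols):
    """Map a (lat, lon) point to a (row, col) cell of a regular grid.

    bbox is (lat_min, lat_max, lon_min, lon_max). Points lying exactly on
    the upper edges are clamped into the last row or column.
    """
    lat_min, lat_max, lon_min, lon_max = bbox
    if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
        raise ValueError("point lies outside the bounding box")

    # relative position inside the box, scaled to the number of cells
    row = int((lat - lat_min) / (lat_max - lat_min) * n_rows)
    col = int((lon - lon_min) / (lon_max - lon_min) * n_cols)
    return min(row, n_rows - 1), min(col, n_cols - 1)


# Illustrative bounding box roughly around San Francisco (assumed values)
BBOX = (37.70, 37.82, -122.52, -122.36)

# a 100 x 100 grid gives 10000 cells
origin_cell = cell_index(37.7749, -122.4194, BBOX, 100, 100)
```

Once every trip has an origin cell and a destination cell, counting trips per pair of cells becomes a simple grouping operation.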

In this case study, we split the city map into 10000 grid cells as follows:

import pandas as pd

from src.spatial import SpatialGridDecomposition, prune_coordinates

# reading the data set
trips_df = pd.read_csv('trips.csv', parse_dates=['time'])

# removing outliers from coordinates
trips_df = prune_coordinates(trips_df=trips_df, lhs_thr=0.01, rhs_thr=0.99)

# grid decomposition with 10000 cells
grid = SpatialGridDecomposition(n_cells=10000)
# setting bounding box
grid.set_bounding_box(lat=trips_df.latitude, lon=trips_df.longitude)
# grid decomposition
grid.grid_decomposition()

In the code above, we remove outlying locations. These can occur due to GPS malfunctions.
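The internals of prune_coordinates are not shown here. One plausible reading of lhs_thr=0.01 and rhs_thr=0.99 is that they are quantile thresholds applied to latitude and longitude. The sketch below illustrates that idea under this assumption, using only the standard library. The names quantile and prune_points are hypothetical.

```python
def quantile(sorted_values, q):
    """Return an approximate q-quantile of an already sorted list."""
    idx = round(q * (len(sorted_values) - 1))
    return sorted_values[idx]


def prune_points(points, lower_q=0.01, upper_q=0.99):
    """Keep (lat, lon) points whose coordinates fall within the given quantiles."""
    lats = sorted(lat for lat, _ in points)
    lons = sorted(lon for _, lon in points)
    lat_lo, lat_hi = quantile(lats, lower_q), quantile(lats, upper_q)
    lon_lo, lon_hi = quantile(lons, lower_q), quantile(lons, upper_q)
    return [
        (lat, lon)
        for lat, lon in points
        if lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi
    ]


# 200 plausible points plus one faulty reading at (0, 0)
points = [(37.70 + 0.001 * i, -122.50 + 0.001 * i) for i in range(200)]
points.append((0.0, 0.0))

kept = prune_points(points)
```

Note that quantile trimming also discards a small share of legitimate points at the edges of the map. That is usually an acceptable price for a tighter bounding box.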

Getting the most popular trips

After the spatial decomposition process, we get the origin and destination of each taxi trip in which the taxi is occupied by a passenger.

from src.spatial import ODFlowCounts

# getting origin and destination coordinates for each trip
df_group = trips_df.groupby(['cab', 'cab_trip_id'])
trip_points = df_group.apply(lambda x: ODFlowCounts.get_od_coordinates(x))
trip_points.reset_index(drop=True, inplace=True)

The idea is to reconstruct the data set to contain the following information: origin, destination, and origin timestamp of each passenger trip. This data forms the basis for our origin-destination (OD) flow count model.
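As a rough illustration of this reconstruction, the sketch below groups GPS records by trip and keeps the first and last point of each one. It assumes the records are already restricted to periods when a passenger is on board. The record layout and the function name od_per_trip are hypothetical and do not reflect how ODFlowCounts.get_od_coordinates is implemented.

```python
from collections import defaultdict
from datetime import datetime


def od_per_trip(records):
    """Return one dict per trip with its origin, destination, and start time.

    Each record is a tuple (cab, trip_id, timestamp, lat, lon).
    """
    trips = defaultdict(list)
    for cab, trip_id, ts, lat, lon in records:
        trips[(cab, trip_id)].append((ts, lat, lon))

    result = []
    for (cab, trip_id), pts in trips.items():
        pts.sort(key=lambda p: p[0])  # order the GPS points by time
        (t0, lat0, lon0), (t1, lat1, lon1) = pts[0], pts[-1]
        result.append({
            "cab": cab,
            "trip_id": trip_id,
            "time_start": t0,
            "time_end": t1,
            "origin": (lat0, lon0),
            "destination": (lat1, lon1),
        })
    return result


# toy records, deliberately out of time order
records = [
    ("cab1", 1, datetime(2008, 5, 17, 8, 10), 37.78, -122.41),
    ("cab1", 1, datetime(2008, 5, 17, 8, 0), 37.77, -122.42),
    ("cab1", 1, datetime(2008, 5, 17, 8, 20), 37.79, -122.40),
    ("cab2", 7, datetime(2008, 5, 17, 9, 0), 37.75, -122.45),
    ("cab2", 7, datetime(2008, 5, 17, 9, 30), 37.76, -122.43),
]

od_trips = od_per_trip(records)
```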

This data allows us to count how many trips go from cell A to cell B:

# getting the origin and destination cell centroid
od_pairs = trip_points.apply(lambda x: ODFlowCounts.get_od_centroids(x, grid.centroid_df), axis=1)

For simplicity, we get the top 50 OD grid cell pairs with the most trips. Taking this subset is optional. Yet, OD pairs with only a few trips will show a sparse demand over time, which is difficult to model. Besides, trips with low demand may not be useful from a fleet management point of view.

# count trips per OD pair, sorted from most to least frequent
flow_count = od_pairs.value_counts().reset_index(name='count')

top_od_pairs = flow_count.head(50)
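On toy data, the counting and selection step boils down to the following. This is a minimal sketch with hypothetical cell labels, using the standard library instead of pandas.

```python
from collections import Counter

# hypothetical (origin_cell, destination_cell) labels, one tuple per trip
trips = [("A", "B"), ("A", "B"), ("C", "A"), ("A", "B"), ("C", "A"), ("B", "C")]

# number of trips observed for each OD pair
flow_counts = Counter(trips)

# keep the k OD pairs with the most trips (k=2 here, 50 in the tutorial)
top_pairs = flow_counts.most_common(2)
```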

Temporal discretization

After finding the top OD pairs in terms of demand, we discretize these over time. This is done by counting how many trips occur in each hour for each given top pair. This can be done as follows:

import numpy as np

# preparing data
trip_points = pd.concat([trip_points, od_pairs], axis=1)
trip_points = trip_points.sort_values('time_start')
trip_points.reset_index(drop=True, inplace=True)

# getting origin-destination cells for each trip, and origin start time
trip_starts = []
for i, pair in top_od_pairs.iterrows():

    origin_match = trip_points['origin'] == pair['origin']
    dest_match = trip_points['destination'] == pair['destination']

    # copy the selection so that adding a column does not modify a view of trip_points
    od_trip_df = trip_points.loc[origin_match & dest_match, :].copy()
    od_trip_df['pair'] = i

    trip_starts.append(od_trip_df[['time_start', 'time_end', 'pair']])

trip_starts_df = pd.concat(trip_starts, axis=0).reset_index(drop=True)

# more data processing
od_count_series = {}
for pair, data in trip_starts_df.groupby('pair'):

    new_index = pd.date_range(
        start=data.time_start.values[0],
        end=data.time_end.values[-1],
        freq='H',
        tz='UTC'
    )

    od_trip_counts = pd.Series(0, index=new_index)
    for _, r in data.iterrows():
        dt = r['time_start'] - new_index
        dt_secs = dt.total_seconds()

        valid_idx = np.where(dt_secs >= 0)[0]
        idx = valid_idx[dt_secs[valid_idx].argmin()]

        od_trip_counts[new_index[idx]] += 1

    od_count_series[pair] = od_trip_counts.resample('H').mean()

od_df = pd.DataFrame(od_count_series)

This leads to a set of time series, one for each top OD pair. Here’s the time series plot for four example pairs:

The time series show a daily seasonality, which is mostly driven by rush hours.
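The core of the temporal discretization can also be expressed compactly. The sketch below floors each trip start time to the hour and counts trips per hour, filling hours without trips with zero. It is a simplified illustration with a hypothetical function name, hourly_counts. Bins are aligned to clock hours here, which may differ slightly from the loop shown earlier.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts(start_times):
    """Count trips per clock hour, with 0 for hours that have no trips."""
    floored = [t.replace(minute=0, second=0, microsecond=0) for t in start_times]
    counts = Counter(floored)

    # walk from the first to the last observed hour so gaps are kept as zeros
    hour, last = min(floored), max(floored)
    series = {}
    while hour <= last:
        series[hour] = counts.get(hour, 0)
        hour += timedelta(hours=1)
    return series


# toy start times for a single OD pair
starts = [
    datetime(2008, 5, 17, 8, 5),
    datetime(2008, 5, 17, 8, 40),
    datetime(2008, 5, 17, 10, 15),
]

series = hourly_counts(starts)
```

Keeping the zero-demand hours matters: dropping them would hide the sparsity of the series and distort its seasonality.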

Forecasting

The set of time series that results from the temporal discretization can be used for forecasting. We can build a model to forecast how many passenger trips will occur for a given OD pair.

Here’s how this can be done for an example OD pair:

from pmdarima.arima import auto_arima

# getting the first OD pair as example
series = od_df[0].dropna()

# fitting an ARIMA model
model = auto_arima(y=series, m=24)

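Above, we built a forecasting model based on ARIMA. Setting m=24 tells auto_arima to consider a seasonal period of 24 hours, which matches the daily seasonality seen in the data. The model forecasts passenger demand in the next hour given the recent past demand. We use ARIMA for simplicity, but other approaches, such as deep learning, can also be used.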
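When judging a model like this, it can help to compare it against a simple reference. The sketch below shows a seasonal naive forecast, which repeats the value observed 24 hours earlier. It is not part of the original tutorial code. It is only a possible sanity check, and the function name seasonal_naive_forecast is hypothetical.

```python
def seasonal_naive_forecast(history, season_length=24, horizon=1):
    """Forecast each future step with the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")

    forecasts = []
    for h in range(1, horizon + 1):
        # position of the matching step in the last observed season;
        # the modulo lets horizons longer than one season wrap around
        idx = len(history) - season_length + (h - 1) % season_length
        forecasts.append(history[idx])
    return forecasts


# toy hourly demand covering two days
history = list(range(48))

next_hour = seasonal_naive_forecast(history, season_length=24, horizon=1)
next_three_hours = seasonal_naive_forecast(history, season_length=24, horizon=3)
```

If a fitted model cannot beat this baseline on held-out data, the extra complexity is probably not paying off for that OD pair.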


Going further with Graph Neural Networks

The approach described above is a simple but effective way to solve OD flow count problems. However, it treats each OD pair as a separate time series.

In reality, each pair is correlated with the neighboring OD pairs or surrounding roads. Because of this, graph neural networks have been increasingly used to forecast traffic conditions. The road network is modeled as a graph, and neural networks can capture complex interactions therein. You can check this Keras example to learn how to implement this kind of method.


Key Takeaways

  • Mobility modeling is an important task in intelligent transportation systems;
  • OD flow count models can help reduce traffic within cities, thereby decreasing the emission of greenhouse gases;
  • You can tackle OD flow count problems with an approach based on spatial decomposition and temporal discretization. This results in one time series per OD pair, which can be used for forecasting.

Thank you for reading, and see you in the next story!

References

[1] Dataset of mobility traces of taxi cabs in San Francisco, USA. (License CC BY 4.0)

[2] Moreira-Matias, Luís, et al. "Time-evolving OD matrix estimation using high-speed GPS data streams." Expert systems with Applications 44 (2016): 275–288.

