
Generate Synthetic Mobility Data

Naive solution proposal for synthetic mobility data generation

Mobility data is a record of a device’s geographic locations, passively produced through normal activity. It has important applications ranging from transportation planning to migration forecasting. As mobility data is scarce and hard to collect, researchers have begun exploring ways to generate it synthetically.

In this article, I will discuss a naive solution for generating synthetic mobility data. This synthetic data can be used for research purposes and for training / fine-tuning algorithms. For example, one can synthetically generate tagged mobility data, and train a model to forecast urban traffic congestion. Then, the trained model can be applied to real-life data.

The code can be found here and you can use this colab notebook to try it yourself.

The Data

The data to be synthetically generated will represent location data records that were collected from cell phone devices. Normally, such data contain the following attributes:

  • phone_id – unique identifier of the cell phone
  • phone_type – cell phone operating system (iOS / Android)
  • timestamp (in epoch time)
  • latitude
  • longitude
  • accuracy (in meters)
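
As a concrete reference, one record with these attributes could be modeled as shown here. This is only an illustrative sketch: the `Signal` class name, the choice of seconds for the epoch timestamp, and the sample values are assumptions, not taken from the original code.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Signal:
    """One location record; field names follow the attribute list above."""
    phone_id: str      # unique identifier of the cell phone
    phone_type: str    # "iOS" or "Android"
    timestamp: int     # epoch time (assumed here to be in seconds)
    latitude: float    # decimal degrees
    longitude: float   # decimal degrees
    accuracy: float    # in meters


# Made-up sample values, for illustration only.
example = Signal("phone-001", "Android", 1672531200, 40.7128, -74.0060, 15.0)
row = asdict(example)  # plain dict, ready to become a data frame row
```

A frozen dataclass keeps each record immutable, and `asdict` gives a plain dictionary that can be collected into a list and loaded into a data frame.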

Methodology

Part A – Get Public Location Data

Pick a location in the USA and create a bbox (bounding box) of x meters. Next, get public data sets:

Create a bounding box
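
A bounding box around a center point can be computed with basic spherical geometry. The sketch below is one possible approach, not the original implementation: the function name `create_bbox`, the interpretation of "x meters" as the side length of a square, and the sample coordinates are all assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in meters


def create_bbox(lat, lng, size_m):
    """Return (min_lng, min_lat, max_lng, max_lat) for a square of
    side `size_m` meters centered on (lat, lng)."""
    half = size_m / 2
    # Angular offset north/south of the center.
    d_lat = math.degrees(half / EARTH_RADIUS_M)
    # Longitude degrees shrink with latitude, so scale by cos(lat).
    d_lng = math.degrees(half / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return (lng - d_lng, lat - d_lat, lng + d_lng, lat + d_lat)


# Illustrative center point and a 2,000 meter box.
bbox = create_bbox(40.7128, -74.0060, 2000)
```

The approximation is fine for boxes of a few kilometers; near the poles or for very large boxes a proper projection would be a better choice.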

Get ArcGIS Residence Locations

Use the arcgis_rest_url to get the building polygons within your bbox. Note that the result is limited to a sample of 2,000 polygons.
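
For reference, a spatial query against an ArcGIS REST feature layer could be built roughly as follows. This is a hedged sketch: the helper name `build_arcgis_query` and the placeholder service URL are hypothetical, and whether a given layer accepts `f=geojson` or `resultRecordCount` depends on the service, so treat the parameter set as an assumption to verify.

```python
from urllib.parse import urlencode


def build_arcgis_query(service_url, bbox, max_records=2000):
    """Build a layer query URL for features intersecting `bbox`.

    `bbox` is (min_lng, min_lat, max_lng, max_lat) in WGS84 (EPSG:4326).
    """
    params = {
        "where": "1=1",                              # no attribute filter
        "geometry": ",".join(str(v) for v in bbox),  # envelope: xmin,ymin,xmax,ymax
        "geometryType": "esriGeometryEnvelope",
        "inSR": 4326,
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "*",
        "outSR": 4326,
        "resultRecordCount": max_records,            # cap on returned polygons
        "f": "geojson",
    }
    return f"{service_url.rstrip('/')}/query?{urlencode(params)}"


# Placeholder URL: substitute the real arcgis_rest_url of the buildings layer.
url = build_arcgis_query(
    "https://example.com/arcgis/rest/services/Buildings/FeatureServer/0",
    (-74.02, 40.70, -74.00, 40.72),
)
```

The resulting URL can be fetched with any HTTP client, and a GeoJSON response can then be loaded into a geo data frame.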

Get Kaggle POIs data sets

Use the Kaggle API to download the POI data sets. Then parse them, load them into geopandas, and keep only the points within the bbox.
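
The filtering step can be illustrated without any dependencies. The sketch below uses plain dictionaries with made-up POIs; the function name `filter_points_in_bbox` and the record layout are assumptions. With geopandas, a comparable filter could be written with the coordinate indexer, for example `gdf.cx[min_lng:max_lng, min_lat:max_lat]`.

```python
def filter_points_in_bbox(points, bbox):
    """Keep only the points whose coordinates fall inside `bbox`.

    `bbox` is (min_lng, min_lat, max_lng, max_lat).
    """
    min_lng, min_lat, max_lng, max_lat = bbox
    return [
        p for p in points
        if min_lng <= p["longitude"] <= max_lng
        and min_lat <= p["latitude"] <= max_lat
    ]


# Made-up POIs, for illustration only.
points = [
    {"name": "cafe", "latitude": 40.71, "longitude": -74.01},
    {"name": "far_away_store", "latitude": 41.50, "longitude": -73.00},
]
inside = filter_points_in_bbox(points, (-74.02, 40.70, -74.00, 40.72))
```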

Get OSM roads from Overpass API
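
Roads can be requested from Overpass with a short Overpass QL query. The sketch below only builds the query string; the helper name `build_overpass_roads_query` is hypothetical, and selecting every way tagged `highway` is an assumption about which roads are wanted.

```python
def build_overpass_roads_query(bbox, timeout_s=60):
    """Return an Overpass QL query for all `highway` ways inside `bbox`.

    `bbox` is (min_lng, min_lat, max_lng, max_lat).
    """
    min_lng, min_lat, max_lng, max_lat = bbox
    # Overpass expects the bounding box as (south, west, north, east).
    area = f"{min_lat},{min_lng},{max_lat},{max_lng}"
    return (
        f"[out:json][timeout:{timeout_s}];"
        f'way["highway"]({area});'
        "out geom;"  # include the coordinates of each way
    )


query = build_overpass_roads_query((-74.02, 40.70, -74.00, 40.72))
```

The query is typically sent as the `data` parameter of a POST request to an Overpass interpreter endpoint, and the returned ways can then be turned into a roads graph.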

Part B – Generate Synthetic Timeline

We now have everything we need to create a phone timeline: residence locations (used for stays at home and at the homes of family and friends), POI locations (used for store visits), and roads (used for drives between stays). Before generating the actual mobility data, we will generate a synthetic timeline that holds the phone’s stays and their timeframes.

Synthetic Timeline Logic

The synthetic timeline logic iterates over all days between the start date and the end date and randomizes stays at the workplace, residence locations, and POIs. To mimic normal human behavior, the logic produces work stays on weekdays only and ensures the user returns home for the night.

Before running the logic, make sure to:

  • Set random home & work locations
  • Set a timeframe (start date & end date)
  • Set max POIs and max residence locations to be visited on a given day

    The gif below shows the first day in our synthetic timeline.
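
The timeline logic described above could be sketched as follows. This is a simplified illustration under stated assumptions, not the original code: the function name `generate_timeline`, the fixed hours (home until 08:00, work from 09:00 to 17:00), the 45 minute visit length, and the place names are all made up for the example.

```python
import random
from datetime import date, datetime, time, timedelta


def generate_timeline(start, end, home, work, residences, pois,
                      max_residences=1, max_pois=2, seed=0):
    """Return a chronological list of (place, stay_start, stay_end)."""
    rng = random.Random(seed)
    timeline = []
    day = start
    while day <= end:
        midnight = datetime.combine(day, time.min)
        # Night stay at home until 08:00.
        timeline.append((home, midnight, midnight + timedelta(hours=8)))
        cursor = midnight + timedelta(hours=9)
        if day.weekday() < 5:  # Monday to Friday: add a work stay
            timeline.append((work, cursor, cursor + timedelta(hours=8)))
            cursor += timedelta(hours=9)
        # Randomize how many residences and POIs are visited today.
        n_res = rng.randint(0, min(max_residences, len(residences)))
        n_poi = rng.randint(0, min(max_pois, len(pois)))
        visits = rng.sample(residences, n_res) + rng.sample(pois, n_poi)
        rng.shuffle(visits)
        latest = midnight + timedelta(hours=23)
        for place in visits:
            if cursor + timedelta(hours=1) > latest:
                break  # keep time to get back home
            timeline.append((place, cursor, cursor + timedelta(minutes=45)))
            cursor += timedelta(hours=1)
        # Back home for the night, until midnight.
        timeline.append((home, cursor, midnight + timedelta(days=1)))
        day += timedelta(days=1)
    return timeline


# One week starting on Monday, 2023-01-02, with placeholder place names.
timeline = generate_timeline(
    date(2023, 1, 2), date(2023, 1, 8), "home", "work",
    ["residence_1", "residence_2"], ["poi_1", "poi_2", "poi_3"],
)
```

The gaps left between consecutive stays are deliberate: they leave room for the drives that connect one stay to the next.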

Part C – Generate Synthetic Mobility Data (Signals)

Our synthetic timeline is ready, and we need a second piece of logic to translate it into synthetic signals. The first event in our timeline is a home stay (00:00 -> 08:00), so let’s start by generating signals for this stay.

Static Mode Signals

The following script produces a data frame of signals between the stay start and the stay end. The sampling rate (the time interval between adjacent signals) is a configurable parameter; I’ve set it to 600 seconds (10 minutes). Each signal’s lat,lng is perturbed with a random "noise factor".
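
A minimal version of such a script might look like the sketch below. It returns a list of dictionaries rather than a data frame to stay dependency-free; the function name `generate_static_signals`, the uniform noise of up to 0.0001 degrees (roughly 11 meters of latitude), and the sample coordinates are assumptions.

```python
import random


def generate_static_signals(lat, lng, start_ts, end_ts,
                            sampling_rate_s=600, noise_deg=0.0001, seed=0):
    """Emit one noised signal every `sampling_rate_s` seconds of a stay."""
    rng = random.Random(seed)
    signals = []
    for ts in range(start_ts, end_ts + 1, sampling_rate_s):
        signals.append({
            "timestamp": ts,
            # Shift each coordinate by a small random offset.
            "latitude": lat + rng.uniform(-noise_deg, noise_deg),
            "longitude": lng + rng.uniform(-noise_deg, noise_deg),
        })
    return signals


# An eight hour stay, expressed as seconds from an arbitrary start time.
signals = generate_static_signals(40.7128, -74.0060, 0, 8 * 3600)
```

Passing the list to a data frame constructor would give the tabular form used in the rest of the article.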

Applying the logic on the first stay will result in the following output:

Drive Mode Signals

The next event on our timeline is a stay at "Residence 1290", but before generating signals for this stay, we need to generate signals for the drive that brings our phone from its origin (home) to its destination ("Residence 1290").

To do that, we will use the roads graph and look for the shortest path from origin to destination. Then, we will randomly generate signals along the ordered road segments with a sampling rate of 60 seconds.
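
The idea can be sketched on a toy graph with a hand-written Dijkstra search. This is a simplification under stated assumptions: the original code most likely relies on a graph library, the signals here are placed at a constant assumed speed of 10 m/s instead of randomly, and all names and coordinates are made up.

```python
import heapq
import math


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6_371_000 * math.asin(math.sqrt(h))


def shortest_path(graph, origin, destination):
    """Dijkstra over `graph` ({node: [neighbor, ...]}); nodes are (lat, lng)."""
    dist, prev, heap = {origin: 0.0}, {}, [(0.0, origin)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == destination:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt in graph.get(node, []):
            nd = d + haversine_m(node, nxt)
            if nd < dist.get(nxt, math.inf):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    if destination not in dist:
        raise ValueError("no route between origin and destination")
    path = [destination]
    while path[-1] != origin:
        path.append(prev[path[-1]])
    return path[::-1]


def generate_drive_signals(path, start_ts, speed_mps=10.0, sampling_rate_s=60):
    """Emit a signal every `sampling_rate_s` seconds while moving along `path`."""
    signals = [{"timestamp": start_ts,
                "latitude": path[0][0], "longitude": path[0][1]}]
    step = speed_mps * sampling_rate_s  # meters traveled between signals
    to_next, ts = step, start_ts
    for a, b in zip(path, path[1:]):
        seg, pos = haversine_m(a, b), 0.0
        while seg - pos >= to_next:
            pos += to_next
            f = pos / seg  # fraction of the segment already covered
            ts += sampling_rate_s
            signals.append({"timestamp": ts,
                            "latitude": a[0] + f * (b[0] - a[0]),
                            "longitude": a[1] + f * (b[1] - a[1])})
            to_next = step
        to_next -= seg - pos  # carry the remainder into the next segment
    return signals


# Toy road graph: A-B-C is the short route, A-X-C is a long detour.
A, B, C, X = (40.0, -74.0), (40.0, -73.99), (40.01, -73.99), (40.05, -74.0)
graph = {A: [B, X], B: [A, C], C: [B, X], X: [A, C]}
path = shortest_path(graph, A, C)
drive = generate_drive_signals(path, start_ts=28800)
```

Adding the same kind of coordinate noise used for static signals would make the drive points look less perfectly aligned with the road geometry.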

That’s how the synthetic drive signals look on a map:

Full Synthetic Mobility Data Generation

In our final step, we will iterate over the entire synthetic timeline. For each stay, we will generate static mode signals, and between every two consecutive stays, we will generate drive mode signals.
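
The overall loop might be organized as in the sketch below. It is only an outline: `generate_mobility_data` takes the two signal generators as parameters, and the `static_stub` and `drive_stub` functions are deliberately crude stand-ins (no noise, straight-line drives) so the example stays self-contained. All names and values are assumptions.

```python
def generate_mobility_data(timeline, static_fn, drive_fn):
    """Static signals for every stay, drive signals between consecutive stays.

    `timeline` is a list of ((lat, lng), start_ts, end_ts) tuples.
    """
    signals = []
    for i, (location, start_ts, end_ts) in enumerate(timeline):
        if i > 0:
            prev_location, _, prev_end = timeline[i - 1]
            if prev_location != location:  # no drive needed if we did not move
                signals.extend(drive_fn(prev_location, location, prev_end, start_ts))
        signals.extend(static_fn(location, start_ts, end_ts))
    return sorted(signals, key=lambda s: s["timestamp"])


def static_stub(location, start_ts, end_ts, rate_s=600):
    """Stand-in static generator: one un-noised signal every `rate_s` seconds."""
    return [{"timestamp": ts, "latitude": location[0],
             "longitude": location[1], "mode": "static"}
            for ts in range(start_ts, end_ts + 1, rate_s)]


def drive_stub(origin, destination, start_ts, end_ts, rate_s=60):
    """Stand-in drive generator: straight-line interpolation between stays."""
    n = max((end_ts - start_ts) // rate_s, 1)
    return [{"timestamp": start_ts + k * rate_s,
             "latitude": origin[0] + (destination[0] - origin[0]) * k / n,
             "longitude": origin[1] + (destination[1] - origin[1]) * k / n,
             "mode": "drive"}
            for k in range(1, n)]


# Made-up day: home, a 30 minute drive, a visit, a 30 minute drive, home.
home, friend = (40.0, -74.0), (40.01, -73.99)
day = [(home, 0, 28800), (friend, 30600, 34200), (home, 36000, 86400)]
data = generate_mobility_data(day, static_stub, drive_stub)
```

Keeping the generators as parameters makes it easy to swap the stubs for real static and drive logic without touching the loop itself.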

Boom! We now have full synthetic mobility data, produced with open-source packages and free data.


Conclusion

Generating synthetic mobility data is doable and requires no special resources. All the data we need is out there and free to use. Having said that, there is still room for improvement:

  1. Logic’s naiveness – The logic is quite simple, so the output does not really represent full human behavior (e.g., traveling abroad, migration, etc.).
  2. Logic’s efficiency – The logic takes some time to run. The major bottleneck is the drive generation part (shortest path calculations).
  3. Logic’s coverage – The logic supports the USA only, because I used USBuildingFootprints, which is very accurate but unfortunately covers only the US.

As mentioned before, the code can be found here and you can use this colab notebook to try it yourself.

*All images unless otherwise noted are by the author.

Thank you for reading!

