Visualize the Crowdedness of Dutch Trains with Open Data and Kepler

Open data provided by Dutch public transport makes it possible to visualize crowdedness over the day.

Leo van der Meulen
Towards Data Science

Kepler visualization of train occupancy

One of the greatest developments in Dutch public transport is the ongoing growth of publicly available data. Timetables, including real-time updates, have been available for several years, and crowdedness information was added a couple of months ago. All this open data makes it possible for developers to create apps, both for the general public and for niche markets. In this article I will show you how to create a great view of the crowdedness of trains in the Netherlands.

The data sources used for this project are OVInfo for timetable information and NDOV for crowdedness information. The timetable is available in GTFS format and the crowdedness in a dedicated CSV format. Visualizations are made with Kepler and are based on GeoJSON files.

I was inspired to create this visualization by the articles of Ozan Kara and Abdullah Kurkcu on visualizing Istanbul's bus traffic and Denver's bus traffic. They did some excellent work on data engineering and visualization.

The complete code for this article can be found on GitHub.

We will start with the planned timetable from the GTFS feed. It is also possible to use the realized timetable information from the GTFS feed, but for now we will use the planned schedule, partly because the available crowding information is itself a forecast.

The required data structure for visualizing the timetable with crowdedness looks like this:

This contains the following columns:

  • timestamp — Unix timestamp; there is one record for every minute a trip is underway (the time between arrival at the first station and departure at the last station)
  • ritnumber — Trip number, identifies a trip from the GTFS
  • sequence — A trip consists of several sections. A section is the part of a trip between two consecutive stops or the time spent stationary at a stop.
  • lat, lon — The location of the train running trip ritnumber at time timestamp
  • classification — A crowdedness classification for the train at the time of departure from the most recently visited stop. A value between 1 and 5.
  • passengers — A rough estimate of the number of passengers on the train

With this dataset, we are able to create an animation of all trains during a day where the crowdedness between two stops can be used to colorize the train location.

Obtaining data

The first step is importing the required libraries (pandas, numpy and geojson are the most important non-standard libraries) and setting some parameters.

Then we import the crowdedness information from NDOV. This data is available as compressed CSV files.

If the file has not been downloaded yet, we download the compressed CSV file. The file is imported with the pandas CSV reader into a DataFrame. After renaming some columns, we estimate the number of seats in each train. Wikipedia is a great source for finding the seat capacity of train coaches. Since each coach type has several subtypes, we take a rough estimate of the average number of seats per coach. The NDOV data specifies the coach type and the number of coaches, so we can estimate the total number of seats in the train.
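The seat estimate described above can be sketched as follows. This is a minimal illustration, not the article's own code: the column names (`coach_type`, `coach_count`), the coach type labels and the seats-per-coach figures are placeholder assumptions, and a small in-memory CSV stands in for the downloaded file.

```python
import io

import pandas as pd

# Hypothetical average seats per coach, keyed by coach type.
# Type names and numbers are placeholders, not NDOV or Wikipedia values.
SEATS_PER_COACH = {"TYPE_A": 100, "TYPE_B": 60}


def add_seat_estimate(df, seats_per_coach, default=80):
    """Return a copy of df with an estimated total seat count per train."""
    out = df.copy()
    # Unknown coach types fall back to a default average (an assumption).
    per_coach = out["coach_type"].map(seats_per_coach).fillna(default)
    out["seats"] = (per_coach * out["coach_count"]).astype(int)
    return out


# Stand-in for the downloaded CSV; the column names are assumed.
sample_csv = io.StringIO(
    "ritnumber,coach_type,coach_count,classification\n"
    "1001,TYPE_A,6,3\n"
    "1002,TYPE_B,4,1\n"
    "1003,UNKNOWN,2,2\n"
)
crowdedness = add_seat_estimate(pd.read_csv(sample_csv), SEATS_PER_COACH)
```

For the real file, `pd.read_csv` can read a gzip-compressed CSV directly, since it infers the compression from the file extension by default.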

Importing the GTFS data is a bit more work. First we download the GTFS file if it isn't already present (it is over 200MB in size, so preventing an unnecessary download saves time and bandwidth).
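A download-only-if-missing helper could look like the sketch below. The function name `ensure_downloaded` and the injectable `fetch` parameter are illustrative choices, not part of the original code; the URL and file name to pass in depend on the feed you use.

```python
import os
import urllib.request


def ensure_downloaded(url, path, fetch=urllib.request.urlretrieve):
    """Download url to path unless the file already exists.

    Returns True if a download happened, False if the local copy was reused.
    """
    if os.path.exists(path):
        return False
    fetch(url, path)
    return True
```

Making `fetch` a parameter keeps the helper easy to test without touching the network.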

Then we import the GTFS files one by one from the downloaded zip file and filter them by operator. The GTFS files contain the timetable for all modalities and operators in Dutch public transport, so filtering on train operators greatly reduces the data size. For this example we only use the data from NS, which operates the majority of trains in the country, but filtering on all train operators, or on the transport provider for one area or city, is also possible by changing the agencies.

Starting with the agencies, all data is filtered step by step for the specified operator: first the agencies, then the routes for the specified agencies, the trips for these routes and the stop times for these trips. Note that stops in GTFS are platforms in real life and stop areas represent stations. The arrival and departure times are translated from hh:mm to minutes since midnight, so they are easier to use down the road.
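The filtering cascade can be sketched with pandas as below. The column names follow the GTFS files (`agency_id`, `route_id`, `trip_id`, `arrival_time`, `departure_time`); the function names, the `_min` suffix for the converted columns and the sample agency ids are assumptions made for this illustration.

```python
import pandas as pd


def to_minutes(value):
    """Convert 'hh:mm' or 'hh:mm:ss' to minutes since midnight.

    GTFS times may run past 24:00 for trips that end after midnight.
    """
    parts = value.split(":")
    return int(parts[0]) * 60 + int(parts[1])


def filter_gtfs(agency, routes, trips, stop_times, agency_ids):
    """Narrow each table to the rows that belong to the given agencies."""
    agency = agency[agency["agency_id"].isin(agency_ids)]
    routes = routes[routes["agency_id"].isin(agency["agency_id"])]
    trips = trips[trips["route_id"].isin(routes["route_id"])]
    stop_times = stop_times[stop_times["trip_id"].isin(trips["trip_id"])].copy()
    for col in ("arrival_time", "departure_time"):
        stop_times[col + "_min"] = stop_times[col].map(to_minutes)
    return agency, routes, trips, stop_times


# Tiny placeholder tables; the ids are made up.
agency = pd.DataFrame({"agency_id": ["A1", "A2"]})
routes = pd.DataFrame({"route_id": ["r1", "r2"], "agency_id": ["A1", "A2"]})
trips = pd.DataFrame({"trip_id": ["t1", "t2"], "route_id": ["r1", "r2"]})
stop_times = pd.DataFrame({
    "trip_id": ["t1", "t1", "t2"],
    "arrival_time": ["08:05:00", "25:10:00", "09:00:00"],
    "departure_time": ["08:06:00", "25:10:00", "09:01:00"],
})

f_agency, f_routes, f_trips, f_stop_times = filter_gtfs(
    agency, routes, trips, stop_times, agency_ids=["A1"]
)
```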

Data processing

A really important next step is to determine the location of the train for all trips of the day, between arrival at the first station and departure from the last station. The route of the trip is specified as a GTFS shape, consisting of a sequence of LatLon coordinates with the distance traveled from the first station. The following function interpolates the location between two LatLon locations based on the distance traveled. After finding the previous stop (ps) and the next stop (ns), we can use simple straight-line interpolation because high accuracy is not needed and the distances are relatively short.

The parameter tripshape is the set of points from the GTFS shapes file specifying the route of the trip, and dist is the distance traveled along the tripshape. After finding the two LatLon locations surrounding the specified distance, the location is interpolated between these two based on the distance (see the figure below). The travel speed between the two points is thereby assumed to be constant.
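A minimal sketch of this straight-line interpolation is shown below. It assumes tripshape is a list of `(distance, lat, lon)` tuples sorted by distance; the original code may store the shape differently, and the function name is chosen for this example.

```python
import bisect


def interpolate_position(tripshape, dist):
    """Return (lat, lon) at distance dist along the shape.

    tripshape: list of (distance, lat, lon) tuples sorted by distance.
    Distances outside the shape are clamped to the first or last point.
    """
    dists = [point[0] for point in tripshape]
    if dist <= dists[0]:
        return (tripshape[0][1], tripshape[0][2])
    if dist >= dists[-1]:
        return (tripshape[-1][1], tripshape[-1][2])
    # Index of the first shape point that lies strictly beyond dist.
    i = bisect.bisect_right(dists, dist)
    d0, lat0, lon0 = tripshape[i - 1]
    d1, lat1, lon1 = tripshape[i]
    frac = (dist - d0) / (d1 - d0)
    return (lat0 + frac * (lat1 - lat0), lon0 + frac * (lon1 - lon0))


# Placeholder shape with three points.
shape = [(0.0, 52.0, 5.0), (100.0, 52.1, 5.2), (300.0, 52.3, 5.2)]
halfway_first_leg = interpolate_position(shape, 50.0)
```

Interpolating directly on latitude and longitude ignores the curvature of the earth, which is acceptable here because consecutive shape points are close together.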

The following function interpolates the location for every minute of one trip in a single day:

After determining the stop locations of this trip (tripstops), we iterate over all these stops and perform two actions. First, all the minutes between the departure from the previous stop location and the arrival at this stop location are interpolated. Then records are created for all minutes during which the train is located at the stop location. These are determined by the arrival and departure times at the stop location. Each group of interpolated points is assigned a sequence number for later use.
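The per-minute loop could be sketched like this. It is an illustration under assumptions: each stop is a dict with `arrival` and `departure` in minutes since midnight and `dist` as the distance along the shape, and `locate` is any function that maps a distance to a `(lat, lon)` pair. These names are not taken from the original code.

```python
def trip_minutes(tripstops, locate):
    """Return one (minute, sequence, lat, lon) row per minute of a trip.

    tripstops: stops in travel order, each a dict with 'arrival',
               'departure' (minutes since midnight) and 'dist'.
    locate:    function mapping a distance along the shape to (lat, lon).
    """
    rows = []
    sequence = 0
    prev = None
    for stop in tripstops:
        if prev is not None:
            # Section between two stops: constant speed is assumed.
            sequence += 1
            travel = stop["arrival"] - prev["departure"]
            for minute in range(prev["departure"] + 1, stop["arrival"]):
                frac = (minute - prev["departure"]) / travel
                dist = prev["dist"] + frac * (stop["dist"] - prev["dist"])
                rows.append((minute, sequence, *locate(dist)))
        # Section at the stop itself: the train is stationary.
        sequence += 1
        for minute in range(stop["arrival"], stop["departure"] + 1):
            rows.append((minute, sequence, *locate(stop["dist"])))
        prev = stop
    return rows


# Placeholder trip with three stops and a stand-in locate function
# that simply echoes the distance instead of real coordinates.
example_stops = [
    {"arrival": 600, "departure": 600, "dist": 0.0},
    {"arrival": 604, "departure": 605, "dist": 4000.0},
    {"arrival": 610, "departure": 610, "dist": 10000.0},
]
example_rows = trip_minutes(example_stops, locate=lambda d: (d, 0.0))
```

In this sketch, sections alternate between "at a stop" and "between stops", so every minute of the trip belongs to exactly one sequence number.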

The function returns a dataframe with the LatLon location of this trip for every minute, including information on the last stop location the train visited. Obtaining a dataframe with this information for all trips is then straightforward by iterating over all the trips on the day we are visualizing:

When we add all the crowdedness information by merging this dataframe with the NDOV data, we have the dataset of all the train locations during the day along with crowding information:
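As a rough sketch of such a merge, assuming both dataframes share a trip number column and a column identifying the most recently visited stop (the names `ritnumber` and `last_stop` and the sample values are placeholders):

```python
import pandas as pd

# Per-minute train positions (placeholder values).
positions = pd.DataFrame({
    "ritnumber": [1001, 1001, 1001],
    "last_stop": ["STOP_A", "STOP_A", "STOP_B"],
    "minute": [600, 601, 602],
})

# Crowdedness forecast per trip and stop (placeholder values).
crowdedness = pd.DataFrame({
    "ritnumber": [1001],
    "last_stop": ["STOP_A"],
    "classification": [3],
    "seats": [600],
})

# A left merge keeps every position, even when no forecast is available;
# those rows get NaN in the crowdedness columns.
merged = positions.merge(crowdedness, on=["ritnumber", "last_stop"], how="left")
```

Whether rows without a forecast should be kept, dropped or filled with a default is a design choice that depends on how complete the crowdedness data is.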

The number of passengers is estimated from the number of seats and the classification. This is a very rough estimate of the number of passengers, but with the available data it is not possible to get a better indication. The GeoJSON file requires a Unix timestamp, so this is calculated, and an elevation column is added since it is also needed for the GeoJSON file.
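Both calculations can be sketched as below. The occupancy fraction per classification is a made-up mapping for illustration only, and the fixed UTC+1 offset is a simplifying assumption that ignores daylight saving time.

```python
from datetime import date, datetime, time, timedelta, timezone

# Hypothetical share of seats occupied per classification (1 to 5).
OCCUPANCY = {1: 0.2, 2: 0.4, 3: 0.6, 4: 0.8, 5: 1.0}

# Assumption: fixed UTC+1 offset (Central European winter time).
LOCAL_TZ = timezone(timedelta(hours=1))


def estimate_passengers(seats, classification):
    """Rough passenger estimate from seat count and classification."""
    return round(seats * OCCUPANCY[classification])


def unix_timestamp(service_date, minutes, tz=LOCAL_TZ):
    """Unix timestamp for a number of minutes after local midnight.

    Minutes beyond 1440 roll over into the next calendar day.
    """
    midnight = datetime.combine(service_date, time(0, 0), tzinfo=tz)
    return int((midnight + timedelta(minutes=minutes)).timestamp())


passengers_example = estimate_passengers(600, 3)
timestamp_example = unix_timestamp(date(2021, 1, 4), 600)
```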

Data export and visualization

The created dataset contains all the information described in the introduction (and a little bit more). With this dataset we can create GeoJSON files to use with Kepler. In order to display an animation of a dot following a path, Kepler requires the following GeoJSON structure:

Each coordinate of the LineString contains four elements: [longitude, latitude, elevation, timestamp]. The numerical properties of the feature can be used to colorize the shape. Since crowding only changes at stop locations, a feature is created for each path between two stations. These paths are identified by the sequence numbers added in the previous step.
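A single feature of this kind can be built with plain dictionaries, as in the sketch below. The helper name, the property names and the sample coordinates are illustrative; the elevation is fixed at 0.

```python
import json


def make_feature(points, properties):
    """Build a GeoJSON LineString feature for an animated path.

    points:     iterable of (lon, lat, unix_timestamp) tuples in time order.
    properties: dict of values that can be used for coloring.
    """
    coordinates = [[lon, lat, 0, ts] for lon, lat, ts in points]
    return {
        "type": "Feature",
        "properties": dict(properties),
        "geometry": {"type": "LineString", "coordinates": coordinates},
    }


# Placeholder section of a trip: two minutes, two positions.
feature = make_feature(
    points=[(5.00, 52.00, 1609750800), (5.01, 52.01, 1609750860)],
    properties={"ritnumber": 1001, "sequence": 2, "passengers": 360},
)
feature_json = json.dumps(feature)
```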

By iterating over all trips and all sequences in a trip, the total set of features is constructed. These features are stored as a feature collection in a GeoJSON file.
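Grouping the per-minute records into features and writing the collection could look like the sketch below. The record keys (`ritnumber`, `sequence`, `timestamp`, `lon`, `lat`, `passengers`) are assumed names, and skipping sections with fewer than two points is a choice made here because a LineString needs at least two positions.

```python
import json
from itertools import groupby
from operator import itemgetter


def build_feature_collection(rows):
    """Group per-minute rows into one LineString feature per trip section."""
    ordered = sorted(rows, key=itemgetter("ritnumber", "sequence", "timestamp"))
    features = []
    for (ritnumber, sequence), group in groupby(
        ordered, key=itemgetter("ritnumber", "sequence")
    ):
        group = list(group)
        if len(group) < 2:
            continue  # a LineString needs at least two positions
        features.append({
            "type": "Feature",
            "properties": {
                "ritnumber": ritnumber,
                "sequence": sequence,
                "passengers": group[0]["passengers"],
            },
            "geometry": {
                "type": "LineString",
                "coordinates": [
                    [r["lon"], r["lat"], 0, r["timestamp"]] for r in group
                ],
            },
        })
    return {"type": "FeatureCollection", "features": features}


def write_geojson(collection, path):
    """Write the feature collection to a GeoJSON file."""
    with open(path, "w") as handle:
        json.dump(collection, handle)
```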

To improve the usability of the map, a second GeoJSON file is created with the network layout. This layout is created by simply adding one trip per route to a GeoJSON file.
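One possible way to build this network layer with pandas is sketched below. It assumes a trips table with `route_id` and `shape_id` and a shapes table with the GTFS columns `shape_id`, `shape_pt_lat`, `shape_pt_lon` and `shape_pt_sequence`; taking the first trip of each route is an arbitrary choice for this illustration.

```python
import pandas as pd


def network_features(trips, shapes):
    """Build a static GeoJSON layer with one trip shape per route."""
    one_per_route = trips.drop_duplicates(subset="route_id")
    features = []
    for trip in one_per_route.itertuples(index=False):
        points = shapes[shapes["shape_id"] == trip.shape_id]
        points = points.sort_values("shape_pt_sequence")
        coordinates = points[["shape_pt_lon", "shape_pt_lat"]].values.tolist()
        if len(coordinates) < 2:
            continue  # a LineString needs at least two positions
        features.append({
            "type": "Feature",
            "properties": {"route_id": str(trip.route_id)},
            "geometry": {"type": "LineString", "coordinates": coordinates},
        })
    return {"type": "FeatureCollection", "features": features}


# Placeholder tables: two routes, three trips, two shapes.
sample_trips = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3"],
    "route_id": ["r1", "r1", "r2"],
    "shape_id": ["s1", "s1", "s2"],
})
sample_shapes = pd.DataFrame({
    "shape_id": ["s1", "s1", "s2", "s2"],
    "shape_pt_sequence": [2, 1, 1, 2],
    "shape_pt_lat": [52.1, 52.0, 51.9, 51.8],
    "shape_pt_lon": [5.1, 5.0, 4.5, 4.4],
})
network = network_features(sample_trips, sample_shapes)
```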

And finally it is time for the visualization, which is done with Kepler. Navigate to the Kepler site and press Get Started. On the next page, drag and drop the two created files and Kepler will prepare the appropriate layers: one static layer with the network layout and one with the animation of the trains.

By adjusting the line width and color, a nice presentation of the train movements over the day can be created. We see all trains during the day, including the burst of the morning rush hour. The trains are colored according to the estimated number of passengers:

Link to the created video on GitHub showing one day of NS train travel.

Final thoughts

The presented code is not optimized for speed. Running it on an average machine takes up to 30 minutes, so improvements are welcome, but for the purpose of this demonstration optimization is not required.

Kepler is a very powerful tool for visualizing geospatial data, including animations over time. Combined with the increasing amount of publicly available data, it opens the door to fantastic insights into public transport with stunning graphs and animations. I hope this example inspires you to create new visualizations and insights on public transport.

Disclaimer: The views and opinions included in this article belong only to the author.

Dutch open data and public transportation enthusiast. Working for over 15 years in public transport. LinkedIn: https://www.linkedin.com/in/leovandermeulen/