Making Sense of Big Data

Just over a hundred years ago, two ships collided in Halifax Harbour in Nova Scotia, Canada. The Mont Blanc was weighted down with explosives destined for the war in Europe, and when she collided with the Imo, the cargo ignited. It was the largest man-made explosion before the advent of nuclear weapons, flattening part of the city and killing more than 1,700 people in wartime Halifax.
Collisions are rarely this dramatic, but even as recently as 2017, lives were lost to these accidents.
The Automatic Identification System was designed to prevent this. Ships outfitted with AIS transponders broadcast their position, course and identity as they move along the water.
Designed to be an aid to RADAR, these low power broadcasts are picked up by neighbouring vessels, shore receivers and even low altitude satellites. Collecting AIS data from a worldwide network of terrestrial and space sensors can provide a global picture of ship traffic.
The data has been used to gauge the impacts of ice-free summers on ship traffic in Canada’s northern archipelago, estimate the noise impact of ships on whale habitats, and plan the development of rail and road services to port communities.
With this in mind, my goal is to structure AIS storage so that extracts on time, position, identity and destination can be created easily. To do that, I need to pull some essential information out of AIS's cryptic encoding scheme.
The Data
The data that streams out of an AIS receiver is only vaguely readable. AIS transmitters use a vocabulary of 27 messages, and there's more than ship movements being broadcast: AIS messages can also carry information on weather, crew, search and rescue activities and navigation hazards.
I use Eric Raymond’s AIVDM/AIVDO protocol decoding guide to navigate the AIS sentence structure. Since I’m working with data that has already been stored in compressed files, my first step is to load each file into pandas and split each line on the exclamation mark into two columns- the prefix and suffix.
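The split itself is one line of pandas. A minimal sketch, using a canonical sentence from Raymond's guide with a hypothetical tag-block prefix attached:

```python
import pandas as pd

# One hypothetical raw line: a tag-block prefix ahead of the AIS sentence
lines = pd.Series([
    r"\c:1503079500*55\!AIVDM,1,1,,B,177KQJ5000G?tO`K>RA1wUbN0TKH,0*5C",
])

# Split each line at the first "!" into two columns: prefix and suffix
parts = lines.str.split("!", n=1, expand=True)
parts.columns = ["prefix", "suffix"]
```

Note that `str.split` consumes the exclamation mark itself, so the suffix starts directly at `AIVDM`.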
The prefix contains metadata about the message in a series of tagged blocks. Regular expressions can extract these interesting bits. Satellite sourced AIS can be buffered in orbit, and some vendors provide both the time the sensor picked up the broadcast (_reporttime) and the time the data was received by the ground station (_receivedtime). The download time is not found on every sentence, so this data is parsed when it exists.
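As a sketch of that parsing step: the `c:` tag carries a UNIX timestamp in NMEA tag blocks, and `Series.str.extract` returns NaN wherever the tag is absent. The prefix strings and tag names here are illustrative; real vendor tags vary.

```python
import pandas as pd

# Hypothetical prefixes: the first carries a "c:" timestamp tag, the second does not
prefix = pd.Series([r"\s:sat1,c:1503079500*55\,", r"\s:shore2*44\,"])

# Extract the timestamp where it exists; rows without one become NaT
stamp = prefix.str.extract(r"c:(?P<received>\d+)")["received"]
received = pd.to_datetime(pd.to_numeric(stamp), unit="s")
```

Rows that never carried the tag simply come through as NaT, which keeps the frame rectangular without special-casing.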
Regular expressions can also be used to isolate values from the suffix: is the message split over more than one sentence? If so, is this the first or second sentence? What's the fragment id that ties the two sentences together? Does the payload require padding?
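One way to answer all of those questions in a single pass is a named-group regex over the comma-separated AIVDM fields. The sentence below is the first fragment of a two-part type 5 example from Raymond's guide; the group names are my own.

```python
import pandas as pd

# First fragment of a two-part sentence from Raymond's guide
suffix = pd.Series(
    ["AIVDM,2,1,3,B,55P5TL01VIaAL@7WKO@mBplU@<PDhh000000001S;AJ::4A80?4i@E53,0*3E"]
)

# Fragment count, fragment number, fragment id, channel, payload, pad bits, checksum
fields = suffix.str.extract(
    r"AIVDM,(?P<frag_count>\d),(?P<frag_num>\d),(?P<frag_id>\d*),"
    r"(?P<channel>[AB12]?),(?P<payload>[^,]*),(?P<pad>\d)\*(?P<checksum>..)"
)
```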
The most important part of the AIS message is found in the string of random-looking characters that makes up most of the second half of the message.
The AIS payload is encoded in six-bit ASCII. To make it readable, we have to convert this part of the message to binary, pull out the requisite bits of data and cast the data into the proper type: string, float, signed or unsigned int.
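The six-bit armoring described in Raymond's guide works by subtracting 48 from each character code and subtracting a further 8 when the result exceeds 40, leaving a value in 0-63 that maps to six bits. A minimal decoder:

```python
def payload_to_bits(payload: str) -> str:
    """Six-bit ASCII to a bit string: subtract 48, then 8 more above 'W'."""
    out = []
    for ch in payload:
        v = ord(ch) - 48
        if v > 40:
            v -= 8
        out.append(format(v, "06b"))
    return "".join(out)

# Position-report payload from Raymond's guide: 28 characters -> 168 bits
bits = payload_to_bits("177KQJ5000G?tO`K>RA1wUbN0TKH")
```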
The last step in constructing the payload involves appending the second fragment to the first for multipart sentences. I enlist a pandas "outer" merge for this job.
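A sketch of that merge, with hypothetical column names: keying both frames on the fragment id, an outer merge keeps single-part messages that have no second fragment, and `fillna` turns the missing tail into an empty string before concatenation.

```python
import pandas as pd

# Hypothetical frames: first fragments and second fragments, keyed on frag_id
firsts = pd.DataFrame({"frag_id": [3, 7], "bits": ["HEAD3", "HEAD7"]})
seconds = pd.DataFrame({"frag_id": [3], "bits2": ["TAIL3"]})

# Outer merge keeps message 7 even though it has no second fragment
merged = firsts.merge(seconds, on="frag_id", how="outer")
merged["bits"] = merged["bits"] + merged["bits2"].fillna("")
```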
Now we have a string of 1’s and 0’s that we can carve up to find the ship’s identity and position.
Carving up bits
The first step is to tag each line with the sentence’s message type by pulling the first six bits of the message into an unsigned integer.
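In code, that tagging is a six-character slice and an `int(..., 2)` cast. Using the position-report payload from Raymond's guide:

```python
def payload_to_bits(payload):
    # six-bit ASCII: subtract 48, then 8 more for characters above 'W'
    return "".join(
        format((ord(c) - 48) - (8 if ord(c) - 48 > 40 else 0), "06b")
        for c in payload
    )

bits = payload_to_bits("177KQJ5000G?tO`K>RA1wUbN0TKH")
msg_type = int(bits[:6], 2)   # first six bits as an unsigned integer
```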
With the message type, we can start extracting ships’ positions. Messages 1, 2, 3, 4 and 18 have latitude and longitude.
Each ship with an AIS transmitter is identified with a Maritime Mobile Service Identity (MMSI) carried in bits 8 to 37 of the payload.
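Putting the last two steps together on the same example payload: the MMSI is an unsigned field, while latitude and longitude are signed two's-complement fields expressed in 1/10000 of a minute, so dividing by 600,000 yields degrees.

```python
def payload_to_bits(payload):
    # six-bit ASCII: subtract 48, then 8 more for characters above 'W'
    return "".join(
        format((ord(c) - 48) - (8 if ord(c) - 48 > 40 else 0), "06b")
        for c in payload
    )

def signed(bitstr):
    # two's-complement signed integer from a bit string
    v = int(bitstr, 2)
    return v - (1 << len(bitstr)) if bitstr[0] == "1" else v

bits = payload_to_bits("177KQJ5000G?tO`K>RA1wUbN0TKH")
mmsi = int(bits[8:38], 2)            # bits 8-37, unsigned
lon = signed(bits[61:89]) / 600000   # bits 61-88, 1/10000 minute -> degrees
lat = signed(bits[89:116]) / 600000  # bits 89-115
```

Decoded, this report places MMSI 477553000 just off Seattle.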
Message type 5 holds the ship's destination, if declared.
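The destination lives in bits 302-421 of the type 5 payload as twenty six-bit characters, where values below 32 map into the '@'-'_' range and '@' is padding. Joining the fragments of the two-part example from Raymond's guide and decoding that span:

```python
def payload_to_bits(payload):
    # six-bit ASCII: subtract 48, then 8 more for characters above 'W'
    return "".join(
        format((ord(c) - 48) - (8 if ord(c) - 48 > 40 else 0), "06b")
        for c in payload
    )

def sixbit_text(bitstr):
    """Decode six-bit groups into the AIS character set, dropping '@' padding."""
    chars = []
    for i in range(0, len(bitstr) - 5, 6):
        v = int(bitstr[i:i + 6], 2)
        chars.append(chr(v + 64) if v < 32 else chr(v))
    return "".join(chars).rstrip("@").strip()

# Two-part type 5 example from Raymond's guide, fragments concatenated
bits = payload_to_bits(
    "55P5TL01VIaAL@7WKO@mBplU@<PDhh000000001S;AJ::4A80?4i@E53"
    "1@0000000000000"
)
destination = sixbit_text(bits[302:422])   # bits 302-421
```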
Using the examples above and the detailed specifications from Raymond's AIS guide, it's straightforward to parse other attributes from the payload. At this point, I have enough elements to meet my goal of creating extracts on time, position, identity and destination.
How to store the data?
I mostly see AIS data stored as CSV. I get it: CSV is ubiquitous and understood by most tools, but reads are slow. Since this archive will only be read, it's ideal to store the data in a column-store format like Apache Parquet.
There's lots of information on Parquet on the intertubes, so I won't go into it here. I'd rather look at how it compares to CSV as a format for reading.
Writing our dataframe to Parquet is straightforward. To help with time based searches, I partition the table on year, month, day and hour.
To test the impact of Parquet on search performance, I created a large repository of AIS data in both snappy-compressed Parquet and uncompressed CSV formats, and ran some queries using Apache Drill. For each timing, I record the value of the second run to give the optimiser a chance to cache metadata.
A row count of around 334 million reports took 90 seconds to execute with CSV-stored data. With Parquet, the count took 0.121 seconds. A group-by and sorting query to find the top 100 referenced destinations took 87 seconds for CSV, 4.5 seconds for Parquet.
A frequency count of all unique MMSIs executed on CSV-stored data in 89 seconds; with Parquet it took 19.1 seconds. And finally, a search for an individual MMSI took 82 seconds for CSV storage and 6.2 seconds for Parquet.
Code can be found here.