Creating sea routes from the sea of AIS data.

Published in Towards Data Science, May 15, 2019

Maritime routes are an important characteristic of maritime transportation. Sometimes they are clearly defined by official “traffic separation schemes”; sometimes they are closer to recommendations. The International Maritime Organization (IMO) is responsible for routeing systems, including traffic separation schemes, which are published in the IMO publication Ships’ Routeing (currently the 2013 edition). Unfortunately, they are not digitised.

Some commercial and freemium products provide the distance, and sometimes the route, between ports on a per-route basis. At the time of writing, we were not able to find a comprehensive library of routes that is available off-line, so we decided to construct our own.

The automatic identification system (AIS) relies on transponders installed on almost all ships. The unique identification, position, course and speed of the vessel are usually transmitted at regular intervals. These signals can be received by nearby vessels, satellites and terrestrial antennas. Several companies collect, cleanse and sell this data.

Mariquant has hourly data from one of the AIS data providers for bulkers and tankers from the beginning of 2016 to the middle of 2018. The data contains records from around 19 000 unique ships and occupies approximately 100 GB as an uncompressed Parquet file.

We have a library of polygons covering approximately 8 000 port harbours and 20 000 anchorage and waiting areas. Most of the time anchorage polygons lie outside the port polygons, so we used graph algorithms to determine which anchorage belongs to which port. With this data in hand, we started to construct routes.

Anchoring and loitering.

We use the notion of distance between trajectories to construct the best representation of a route and to separate different trajectories on the same route. Most distance calculation algorithms rely on the distances between points on the trajectories. We found that noise prevented us from obtaining meaningful distances between trajectories. Looking at the data, we found that ships spend a significant amount of time (and therefore contribute a large number of trajectory points) either anchoring or moving near the port harbour.

To find at least some of these points on the trajectories, we used the scikit-learn [1] implementation of the Random Forest Classifier. There are various approaches to finding different clusters on vessel trajectories; see for example [2] and references therein. We preferred a somewhat simplistic approach, as it was easy to implement and gave sufficiently accurate results. As features of the classifier we used the distance to the port, the distance to the previous port, the speed of the ship, the radial velocity of the ship (as if circling the port) and the speed in the direction of the port. We manually constructed training and testing sets, separately marking ‘loitering’, anchoring, port approach and general voyage points. We had around 6 400 points in the training set and 1 600 points in the testing set. Anchoring points are less represented in the sets, as they create fewer problems in the distance calculations. The sets are small because manual markup is laborious.
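The five features listed above can be derived from a single AIS report and the two port positions. The following is a minimal sketch, not the code used for the article: the function names, the use of knots and kilometres, and the reduction of each port to a single (lat, lon) point are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360


def point_features(lat, lon, speed_kn, course_deg, port, prev_port):
    """Feature vector for one AIS point; port and prev_port are (lat, lon) tuples."""
    # Angle between the vessel's course and the direction to the destination port.
    angle = math.radians(course_deg - bearing_deg(lat, lon, *port))
    return [
        haversine_km(lat, lon, *port),       # distance to the port
        haversine_km(lat, lon, *prev_port),  # distance to the previous port
        speed_kn,                            # speed of the ship
        speed_kn * math.sin(angle),          # speed across the line to the port ("circling")
        speed_kn * math.cos(angle),          # speed in the direction of the port
    ]
```

Rows produced this way, one per AIS point, would form the feature matrix passed to the classifier. The sign convention for the circling component is arbitrary here.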

The confusion matrix shows that the loitering, approach and general voyage points are well separated. Where approach and general voyage points are misidentified, they are confused with each other rather than with the points of interest (loitering and anchoring).

Confusion matrix of the Random Forest Classifier

Precision and recall

Precision and recall, together with the F1 score, show that the results are good, although the classifier could be improved further. As noted above, anchoring points are less troublesome for the calculation of the distance between trajectories, so we proceed with these results.

We found that distance to the port, distance to the previous port and speed of the ship are the most important features, with scores of 0.42, 0.21 and 0.28 respectively; radial speed and speed in the direction of the port score 0.04 and 0.05.

One can use standard Scikit-learn functions to calculate the scores mentioned above.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
# X is the feature matrix (one row per AIS point), y holds the manually assigned labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=20, max_depth=6, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall and F1 per class
print(clf.feature_importances_)  # one importance score per feature column

We found that, irrespective of the classifier scores, additional cleanup is required for our large dataset. Sometimes small parts of the port approach trajectories were mistakenly identified as loitering, so we applied a conservative check: anchoring and loitering parts of a trajectory must be long enough and must either be self-intersecting or have a distance travelled in the direction of the port that is small compared to the total length travelled.
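The conservative check can be sketched roughly as follows. This is an illustration only: it assumes points already projected to planar (x, y) coordinates, and the thresholds `min_points` and `max_progress_ratio` are hypothetical values, not the ones used for the dataset.

```python
import math


def _orient(a, b, c):
    """Signed area test: > 0 if a->b->c turns left, < 0 if right, 0 if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _cross(p1, p2, p3, p4):
    """True if segments p1p2 and p3p4 properly cross (touching endpoints ignored)."""
    return (_orient(p1, p2, p3) * _orient(p1, p2, p4) < 0
            and _orient(p3, p4, p1) * _orient(p3, p4, p2) < 0)


def is_self_intersecting(pts):
    """Check every pair of non-adjacent segments of the polyline for a crossing."""
    n = len(pts) - 1  # number of segments
    return any(_cross(pts[i], pts[i + 1], pts[j], pts[j + 1])
               for i in range(n) for j in range(i + 2, n))


def confirm_loitering(pts, port, min_points=5, max_progress_ratio=0.3):
    """Keep a loitering/anchoring label only if the segment passes the checks."""
    if len(pts) < min_points:
        return False  # too short to be trusted
    length = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
    if length == 0:
        return True  # vessel did not move at all
    # Net distance gained towards the port over the whole segment.
    progress = math.dist(pts[0], port) - math.dist(pts[-1], port)
    return is_self_intersecting(pts) or progress / length < max_progress_ratio
```

The pairwise crossing test is quadratic in the number of points, which is acceptable for short segments; a sweep-line algorithm or a geometry library would be a better fit for long ones.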

After all these checks we average the position of the vessel for the loitering and anchoring parts of the trajectories.
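As an illustration of this averaging step, the sketch below replaces every run of consecutive loitering or anchoring points with a single mean position. The label strings are hypothetical, and the plain arithmetic mean of latitude and longitude assumes that a run does not cross the antimeridian.

```python
from itertools import groupby


def collapse_stationary(points, labels, stationary=("anchoring", "loitering")):
    """Replace each run of consecutive stationary points with its mean position.

    points: list of (lat, lon) tuples; labels: one class label per point.
    """
    out = []
    idx = 0
    for label, run in groupby(labels):
        n = len(list(run))
        chunk = points[idx: idx + n]
        idx += n
        if label in stationary:
            lat = sum(p[0] for p in chunk) / n
            lon = sum(p[1] for p in chunk) / n
            out.append((lat, lon))  # one averaged point for the whole run
        else:
            out.extend(chunk)  # voyage and approach points are kept as they are
    return out
```

Collapsing the runs removes the dense clouds of points near ports that would otherwise dominate point-based trajectory distances.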

Route construction. Theory.

Our approach was inspired by the work of Philippe Besse, Brendan Guillouet, Jean-Michel Loubes and François Royer [3]. In that work, the authors proposed a new method, the ‘Symmetrized Segment-Path Distance’, for calculating the distance between trajectories, and used it to cluster taxi trips and predict the final destination point as part of a Kaggle competition [4]. Article [3] also gives an in-depth description of various methods for trajectory distance calculation.

In short, our approach is:

to calculate the distances between trajectories using some measure;
to cluster the trajectories based on these distances;
to select the “best” trajectory in each cluster.

Route construction. Practice.

We tested different methods for the distance calculations and decided to use Edit Distance with Real Penalty (ERP) [5]. ERP is a warping distance: it allows comparison between trajectories of different lengths by aligning them while the distance is computed. It runs in O(n²) time. An implementation is part of the trajectory_distance package created by the authors of [3]. We slightly modified this package to speed up the Python part of the calculations and added Dask support for parallel computation. Because of memory and computation time restrictions, we select at most 50 random trajectories per route, as a popular route may have hundreds of them. This leaves us with 43 847 distinct routes to be reconstructed from almost one million trips.
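To make the ERP recurrence concrete, here is a plain Python sketch of the dynamic programme together with the random down-sampling to at most 50 trajectories. It is not the trajectory_distance implementation: it uses planar Euclidean distances, takes the origin as the gap reference point, and the function names are chosen for illustration.

```python
import math
import random


def erp_distance(t1, t2, gap=(0.0, 0.0)):
    """Edit Distance with Real Penalty between two lists of (x, y) points."""
    n, m = len(t1), len(t2)
    # dp[i][j] is the ERP distance between the first i points of t1
    # and the first j points of t2.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + math.dist(t1[i - 1], gap)
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + math.dist(t2[j - 1], gap)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + math.dist(t1[i - 1], t2[j - 1]),  # align both points
                dp[i - 1][j] + math.dist(t1[i - 1], gap),  # point of t1 against a gap
                dp[i][j - 1] + math.dist(t2[j - 1], gap),  # point of t2 against a gap
            )
    return dp[n][m]


def sample_trajectories(trajectories, max_count=50, seed=0):
    """Keep at most max_count randomly chosen trajectories of one route."""
    trajectories = list(trajectories)
    if len(trajectories) <= max_count:
        return trajectories
    return random.Random(seed).sample(trajectories, max_count)


def distance_matrix(trajectories):
    """Symmetric matrix of pairwise ERP distances."""
    n = len(trajectories)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = erp_distance(trajectories[i], trajectories[j])
    return dist
```

The double loop over both trajectories is where the quadratic running time comes from, and the pairwise matrix adds another quadratic factor in the number of trajectories, which is why capping the sample size matters.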

Different trips along the same route can have completely different trajectories, so we need to cluster these trajectories where necessary. The affinity propagation clustering algorithm is a natural choice for this task, as it requires neither prior knowledge of the number of clusters nor the triangle inequality for the distance (not all distance calculation methods produce distances that satisfy this inequality). We used the scikit-learn [1] implementation of the affinity propagation algorithm.
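With scikit-learn, clustering from a precomputed distance matrix might look like the sketch below. Affinity propagation expects similarities, so the distances are negated here; that choice, the damping value and the toy one-dimensional data are assumptions for illustration, not the settings used for the published routes.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def cluster_by_distance(dist, damping=0.7, random_state=0):
    """Cluster items given a square matrix of pairwise distances."""
    # Larger similarity must mean "closer", hence the sign flip.
    similarity = -np.asarray(dist, dtype=float)
    model = AffinityPropagation(
        affinity="precomputed",
        damping=damping,
        max_iter=500,
        random_state=random_state,
    )
    return model.fit_predict(similarity)


# Toy example: six "trajectories" reduced to positions on a line,
# forming two well-separated groups.
positions = np.array([0.0, 1.0, 2.2, 10.0, 11.3, 12.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])
labels = cluster_by_distance(toy_dist)
```

By default the algorithm uses the median similarity as the preference of each point, which in turn influences how many clusters are found; this may need tuning on real routes.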

Reconstruction of the trajectories

Missing points are common in data collected by satellites because of the restrictions of the AIS protocol. We had to find a solution to this problem.

First, to define the “best” trajectory we use a cost function that takes into account both the distance between the trajectories (we use a simple average) and the number of missing points in the trajectory.
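The exact form of the cost function is not given here, so the following is only one possible shape of it: the average distance to the other trajectories plus a penalty per missing point. The additive form, the `penalty` weight and the way missing points are counted from hourly timestamps are all assumptions.

```python
def count_missing(timestamps_h, step_h=1.0):
    """Number of absent reports, assuming one report is expected every step_h hours."""
    return sum(max(0, round((b - a) / step_h) - 1)
               for a, b in zip(timestamps_h, timestamps_h[1:]))


def trajectory_costs(dist, missing_counts, penalty=1.0):
    """Cost per trajectory: mean distance to the others plus a missing-point penalty."""
    n = len(dist)
    costs = []
    for i in range(n):
        others = [dist[i][j] for j in range(n) if j != i]
        mean_dist = sum(others) / len(others) if others else 0.0
        costs.append(mean_dist + penalty * missing_counts[i])
    return costs


def best_trajectory(dist, missing_counts, penalty=1.0):
    """Index of the trajectory with the lowest cost."""
    costs = trajectory_costs(dist, missing_counts, penalty)
    return min(range(len(costs)), key=costs.__getitem__)
```

With a zero penalty this reduces to picking the medoid of the set; raising the penalty shifts the choice towards more complete trajectories.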

Second, we update the trajectories iteratively:

1. We find the “best” trajectory using only the available data.
2. We enhance the trajectories that have missing points using the points of the “best” trajectory: the nearest points on the “best” trajectory are found, and the corresponding segment of the “best” trajectory is added to the enhanced trajectory.
3. We repeat steps 1 and 2 with the additional data until the “best” trajectory remains stable and the value of the cost function diminishes.
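The iteration above can be sketched as follows, under several simplifying assumptions: points are planar (x, y) tuples, a gap is any pair of consecutive points further apart than a hypothetical `max_gap`, and `select_best` is a caller-supplied function returning the index of the lowest-cost trajectory. The loop stops when the selected trajectory no longer changes and does not track the cost value itself.

```python
import math


def nearest_index(traj, point):
    """Index of the point of traj that is closest to the given point."""
    return min(range(len(traj)), key=lambda k: math.dist(traj[k], point))


def fill_gaps(traj, best, max_gap):
    """Insert points of `best` wherever consecutive points of `traj` are too far apart."""
    if not traj:
        return []
    out = [traj[0]]
    for a, b in zip(traj, traj[1:]):
        if math.dist(a, b) > max_gap:
            i, j = nearest_index(best, a), nearest_index(best, b)
            if i < j:
                # Borrow the part of the "best" trajectory lying between the
                # points nearest to the two ends of the gap.
                out.extend(best[i + 1: j])
        out.append(b)
    return out


def refine(trajs, select_best, max_gap, max_iter=10):
    """Alternate between choosing the best trajectory and filling gaps with it."""
    best_idx = select_best(trajs)
    for _ in range(max_iter):
        best = trajs[best_idx]
        trajs = [t if k == best_idx else fill_gaps(t, best, max_gap)
                 for k, t in enumerate(trajs)]
        new_idx = select_best(trajs)
        if new_idx == best_idx:
            break  # the choice is stable
        best_idx = new_idx
    return trajs, best_idx
```

Gaps where the nearest points on the “best” trajectory come in reverse order are left untouched in this sketch; handling trips recorded in the opposite direction would need extra logic.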

After enhancing the trajectories, we split the set of trajectories into clusters. The “best” trajectory within each cluster is found using the iterative approach described above. We use these trajectories as the routes between ports.

Results

The results are available in the Amazon S3 bucket http://worldroutes.s3.amazonaws.com. You can get and use them for any purpose under the Creative Commons Attribution 4.0 License. You can get information about the license here.

The data consists of four files:

Port-to-port distances calculated along the routes (distances.csv).
Port-to-port routes as discussed above (routes.csv).
Port reference data (ports.csv). We can’t provide the port polygons; instead, the file contains our internal index, the World Port Index INDEX_NO and PORT_NAME where available (otherwise any name we have), and a tuple of coordinates of the representative point of the port polygon, obtained using the geopandas representative_point() method.
An HTML file with the world map (WorldRoutes.html).

About Mariquant

Mariquant is a company formed with a strong focus on the development of analytical tools. We believe there is tremendous value yet to be realised from consistent and broad adoption of data-driven analytics in the maritime industry. However, poor data quality, limited coverage and the sheer amount of data present a hurdle for many, leading them down the path of top-down analysis, frequently with significant manual effort involved.

At Mariquant we challenge those barriers by adopting cutting-edge technology developed by the likes of Amazon and Google, while keeping a sharp focus on the needs of the maritime industry. The introduction of fully automated, data-driven analytics allows scaling to the magnitude of individual cases found in the maritime industry, enabling timely, accurate and measured commercial decisions.

References

1. F. Pedregosa et al., “Scikit-learn: Machine Learning in Python,” JMLR 12, pp. 2825–2830, 2011.
2. P. Sheng and J. Yin, “Extracting Shipping Route Patterns by Trajectory Clustering Model Based on Automatic Identification System Data,” Sustainability 10.7 (2018): 2327.
3. P. Besse, B. Guillouet, J.-M. Loubes and F. Royer, “Review and perspective for distance based trajectory clustering,” arXiv preprint arXiv:1508.04904, 2015.
4. ECML/PKDD 15: Taxi Trajectory Prediction (I), Kaggle competition.
5. L. Chen and R. Ng, “On the marriage of lp-norms and edit distance,” Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30, VLDB Endowment, 2004, pp. 792–803.