Data for Change

About half a year ago I embarked on a very interesting quest: finding refugee boats using satellite data. I work with the NGO Space-Eye, a non-profit engaged in this quest, and a lot of other initiatives are working on this topic as well. I have an ML background and some programming experience, but tackling a problem like this from the very start was (and is) quite a challenge and intriguing at the same time.
We pondered a lot of questions in the beginning – and we come back to most of them every once in a while – like: What kind of satellite data is available? Is the resolution good enough to detect the boats? How can we verify our predictions? Is the system scalable?
After a lot of research and super interesting discussions we realized we needed to do a feasibility study first. The team here in Regensburg opted for visual satellite imagery from planet.com. Pedro in Berlin is doing something similar for radar imagery and is getting some great results, and SearchWing in Augsburg is working on detection algorithms for their own UAV. But the Planet data seemed a better fit for our non-geo eyes and my experience in computer vision. Unlike radar data, it just looks like the normal images and pictures we're used to.
In this post, I'll describe the process we went through to create a ship dataset for the feasibility study. The code to recreate this can be found on my GitHub – you will need your own access to planet.com to query the images.
Step 1 – Conceptualization
The satellite imagery we decided to use comes from the PlanetScope 4-band satellites, which provide 4-channel images (RGB + NIR) at 3 m × 3 m resolution. We opted for this product because it has the best coverage/resolution trade-off. Planet does offer imagery at 0.8 m × 0.8 m resolution, but it has to be ordered in advance for a certain position and costs quite a lot more. The PlanetScope satellites have been up there for a while, so we have some historic data to look at.

Planet has mostly only published satellite imagery from the shore – everything on land and up to 7 km into the ocean. This is partially because the ocean is pretty vast and not many people are interested in its imagery (not as many as in land pictures, at least). On the other hand, Planet has had some problems geolocating the satellite images taken over the ocean. On land the satellites can locate themselves and their pictures using landmarks, but those are not available in ocean imagery. This leads to a positioning error and therefore lower quality. But sometime this year they improved their algorithm a lot and now provide higher-quality ocean imagery.

With all this information we still had no solution for our reference system. We don’t know where those refugee boats are – that’s why we started the whole thing.
To train and verify our ship detection model we need some ground truth data – at least for a supervised model. We probably won't be able to find enough refugee boats by visual inspection to train anything, so we looked for other ship detection possibilities. We found the obvious Kaggle ship detection dataset, but those ships are way bigger than what we needed: they typically span a few hundred pixels, whilst refugee boats on Planet data only measure about 3 pixels. And that dataset does not have the NIR channel, which seemed super valuable to us. So we decided to go for a model trained on original PlanetScope data, but on non-refugee boats. But how to find normal boats in the pictures? That would imply a lot of manual work…
After getting together with some more experienced boat people I learned about AIS. AIS is a system that all ships above a certain size have to sign up for. They have a transmitter on board which broadcasts their position constantly, and a receiver which tells them where all the other ships around them are. It's basically a system to make sure they don't run into each other – at least from what I understood. This data can be super useful for machine learning in general; there are some great visualization projects and ship tracking applications. WWF, for example, is currently working on a tracking system to prevent illegal oil spills using AIS. Our refugee boats don't have an AIS transmitter, of course, but AIS can still help us create a ground truth training set for our application.
There are websites where one can request current AIS data to log it yourself, but that seemed too much of a hassle. The US provides old AIS log files from its coasts on marinecadastre.gov. Some other countries have something similar, but none we found matched the resolution of the US data, whose files have a one-minute logging resolution – enough for our application.
So now we have everything to automatically create a ship/non-ship dataset of some place in the US by intersecting AIS log files with Planet satellite imagery. The conceptual phase was over and we could finally start implementing the dataset creation. At the start it seemed pretty straightforward.
Step 2 – Intersecting the two data sources
First we need to intersect those two data sources. I selected the region between Miami and Palm Beach as my region of interest; to describe it, I created a .geojson file (not too hard using geojson.io).

Since the AIS file from MarineCadastre is extremely big, I wanted to remove all entries which are not inside my region of interest. GeoPandas was a great help there, even though it drove me mad at times. This code can be found in the function reduce_ais_csv() in the Jupyter notebook linked at the top.
Now we need to find out where the satellite took its pictures. There's no full-time coverage; a satellite only passes over one to three times a day to take a picture. To communicate with Planet I used the porder tool, which is greatly explained in this article. I asked Planet for an id-list of images within my geojson with less than 50% cloud coverage and at least 80% overlap with my region of interest. From the id-list we can extract the exact times when the imagery was taken, and from our AIS dataframe we can then select those entries which match the timing of the corresponding satellite image. (Well, not the exact same timing, but ±30 seconds, since the AIS file only logs once per minute.)
porder idlist --input "infile.geojson" --start "day.month.year" --end "day.month.year" --item "PSScene4Band" --asset "analytic" --cmax 0.5 --outfile "outfile.csv" --overlap 80
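The ±30-second matching described above can be sketched in pandas like this – `BaseDateTime` as the timestamp column is an assumption from the MarineCadastre files, and `image_times` stands for the acquisition times extracted from the id-list:

```python
import pandas as pd

def ships_at_capture_times(ais_df, image_times, tolerance_s=30):
    """Select AIS rows logged within +/- tolerance of any image capture time."""
    ais_times = pd.to_datetime(ais_df["BaseDateTime"])
    tol = pd.Timedelta(seconds=tolerance_s)
    mask = pd.Series(False, index=ais_df.index)
    # an AIS row survives if it is close to at least one capture time
    for t in pd.to_datetime(pd.Series(image_times)):
        mask |= (ais_times - t).abs() <= tol
    return ais_df[mask]
```

Since the AIS logs tick once per minute, a ±30-second window guarantees at most one log entry per ship and image.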
Now we have a list of candidate ships that we can try to download as satellite image files.
Step 3 – Downloading the ship imagery
This step sounds easy, but it has taken us some afternoons and a lot of brain cells. It probably isn't hard for someone experienced with geographic data, but we suffered from the great problem that the earth is round.
Yes, the earth is round. Therefore a simple rectangle in coordinate space is not a rectangle in satellite imagery, and vice versa. To make this very long story short, the final conclusion was to download small snippets centered around the expected boat position, specified as rectangles in coordinate space. The images that come back are twisted diamonds embedded in black, undefined pixels. But if we specify the rectangle big enough, we can just cut a rectangle (in pixel space) out of the middle of this mess to use for our training system.
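The pixel-space cut is nothing more than a center crop – a minimal sketch with NumPy, where the crop size is a free parameter you tune so no black border pixels survive:

```python
import numpy as np

def center_crop(img, size):
    """Cut a size x size pixel rectangle from the centre of a snippet.

    The downloaded snippet is a rotated footprint embedded in black,
    undefined pixels; if the coordinate rectangle we ordered is big
    enough, this centre crop contains only valid image data.
    """
    h, w = img.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return img[top:top + size, left:left + size]
```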

Using the porder tool we were able to simply order small (coordinate-wise) rectangular snippets around the boat positions and download those images.
Oh, and did you know that there are hundreds of coordinate systems? The one normal people use is called EPSG:4326, and there is another one especially for the Mediterranean… And since the earth is round, the conversion between them is not straightforward. This revelation took another few hours, but turned out to be unnecessary for the project.
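For the record, such conversions are a one-liner with pyproj – here EPSG:4326 to a UTM zone, where the zone choice (17N for the Miami area) and the sample point are illustrative assumptions:

```python
from pyproj import Transformer

# EPSG:4326 (lon/lat, the "normal" one) -> EPSG:32617 (UTM zone 17N,
# which covers the Miami area); always_xy keeps lon-before-lat order
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32617", always_xy=True)
easting, northing = transformer.transform(-80.13, 25.79)  # roughly Miami Beach
```

The result is in meters, which is handy when you want to reason about snippet sizes in real-world distances rather than degrees.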
Step 4 – Finding corresponding non-ship imagery
We have now successfully created our positive samples – those that contain a ship. Now we also need negative examples: satellite images that do not contain any ships. This should not be too complicated, since the ocean is extremely vast. But I did not want to just take any satellite image; I wanted a dataset as balanced as possible.
So I tried to fetch one image not containing ships from the same satellite scene as each image that contained a ship. This automatically balances the weather and lighting conditions as well as clouds and ocean conditions. Sounds pretty nice, doesn't it?
To find a snippet that fulfills my requirements, I created a search radius around the ship I currently wanted to supplement. I randomly selected a region from this search radius and checked, using the AIS data, that there are no ships inside it. Next I try to download the image using the porder tool. I repeat this process several times in the hope of getting a non-ship image for every ship image. Sometimes it fails – either because there are too many ships traveling in the region at that time, or because it's close to the boundary of the geojson I defined. If that happens, I just redo the process on a ship that was already used to create a non-ship. Since this was only the case for around one in a hundred images, I considered it a good approach.
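The rejection-sampling loop can be sketched like this – all the radii and the retry count are illustrative assumptions, and `ais_positions` stands for the (lon, lat) tuples of ships logged at the same capture time:

```python
import math
import random

def sample_nonship_position(ship_lon, ship_lat, ais_positions,
                            search_radius_deg=0.05, snippet_deg=0.01,
                            max_tries=20):
    """Pick a random spot near a ship whose snippet contains no AIS ship."""
    for _ in range(max_tries):
        # random point inside the search radius, but outside the ship's snippet
        angle = random.uniform(0, 2 * math.pi)
        dist = random.uniform(snippet_deg, search_radius_deg)
        lon = ship_lon + dist * math.cos(angle)
        lat = ship_lat + dist * math.sin(angle)
        # reject if any known ship falls inside the candidate snippet
        if all(abs(lon - sl) > snippet_deg or abs(lat - sa) > snippet_deg
               for sl, sa in ais_positions):
            return lon, lat
    return None  # give up; the caller retries with another ship
```

A `None` result corresponds to the roughly one-in-a-hundred failure case described above, where I fall back to a ship that already has a negative.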
Step 5 – Putting it all together
After I successfully downloaded all data for the selected region in the year 2016, I just had to organize it better. I subsumed all ships in one .csv file and all non-ships in another; they contain all the metadata that is available in the AIS database, like speed, name, and so on. Next I pushed all ship images into one base folder and all non-ships into another (while remembering their original name and AIS entry!). Finally, we add some documentation so whoever works on the dataset does not get completely lost. So, some polish for the dataset 😉
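The final layout step boils down to a couple of file operations – a sketch under the assumption that each metadata row carries a `filename` column pointing at its snippet (names here are illustrative, not the actual dataset schema):

```python
import shutil
from pathlib import Path

import pandas as pd

def organize(ship_rows, nonship_rows, image_dir, out_dir):
    """Write ships.csv / non_ships.csv and sort the snippets into folders."""
    out = Path(out_dir)
    for name, rows in [("ships", ship_rows), ("non_ships", nonship_rows)]:
        folder = out / name
        folder.mkdir(parents=True, exist_ok=True)
        df = pd.DataFrame(rows)
        df.to_csv(out / f"{name}.csv", index=False)  # keeps the AIS metadata
        for fname in df["filename"]:
            # the original file name ties each image back to its AIS entry
            shutil.copy(Path(image_dir) / fname, folder / fname)
```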
After all, it's been quite a journey, and I'm very much looking forward to getting to the fun part: creating models and evaluating them on this dataset. Let's hope we can find boats that are as small as 3 pixels, so we can finally start to tackle the real task!
Disclaimer: This is not only my work. There was a lot of help from colleagues and other organizations. To mention only a few: Stephan, Simon and Pedro helped me wherever possible. Thank you!
I’ll publish updates from experiments on the dataset here on medium (and if it works maybe even somewhere else). 🙂 So stay posted!
Known issues: Visual inspection showed that we have quite a few false negatives – a number of the non-ship pictures actually include ships…