Deep learning in Space

How AI and machine learning can support spacecraft docking.

Nick Evers
Towards Data Science


Magellan space probe, meant to map the surface of Venus, simulated in orbit around Earth using Unity (credits and info at the end).

Artificial intelligence is everywhere. Home appliances, cars, entertainment systems: you name it, they all pack AI capabilities. The space industry is no exception.

In the past few months I have been working on a machine learning application that assists satellite docking from a simple camera video feed. If you want to know how deep learning, neural nets and Tensorflow are useful for satellite docking, keep reading.

In this blog post I’ll guide you through the approach, working principles, results and lessons I learned. I don’t pretend to challenge the state of the art in object recognition at all. But when retracing my steps, I realize how much I learned. Therefore, I hope my story is useful to you and inspires you to create your own machine learning applications.

If you prefer to skip the reading and dive straight into the code, everything is available on GitHub: https://github.com/nevers/space-dl

Please note that this post assumes the reader is familiar with the basic principles of Tensorflow, Convolutional Neural Nets (CNN) and how to train them using back propagation.

Credits

  • A big thanks to Rani Pinchuk for the support, the countless discussions, the resulting insights and the time spent tediously labelling all the training data.
  • Training and evaluation images of the OS-SIM facility by courtesy of DLR.
  • Magellan space probe, simulated using Unity by Xavier Martinez Gonzalez.

Index

  1. Dataset preparations
  2. Model principles
  3. The loss function
  4. Transfer learning
  5. Results
  6. Lessons learned the hard way
  7. Hardware
  8. Conclusions and next steps
  9. Credits

Intermezzo — object segmentation, detection and localization.

Intermezzo — Tensorflow estimators versus manual abstractions.

Dataset preparations

Knowing the detailed dimensions of the satellite, the goal is to create an algorithm that can accurately predict its pose and relative distance from the camera. The dataset for this project was created from a life-size mockup of a satellite mounted on a robotic arm at the DLR OS-SIM facility. The arm simulated various movements whilst a camera recorded a video feed.

The satellite mockup captured by the video camera on the robotic arm. Source: OS-SIM facility, DLR.

I decided to focus my efforts on finding the tip of the satellite. If I could accurately locate it, I was confident I could do the same for at least two other tags on the model. (The ‘tip’ of the satellite is actually part of its docking mechanism.) Given those 3 (or more) points and the 3D model of the satellite, I could then reconstruct the pose of the satellite and its position relative to the camera.

The camera recorded 14,424 unlabelled images that I wanted to use for training and evaluating a neural network. One of my worries was that I would have to spend ages on manually labelling the tip on each of those images. Luckily, I learned about OpenCV’s excellent image tagging tool: CVAT.

With CVAT you can bulk-import all the images you want to annotate, play them back as a movie and interpolate annotations that are many frames apart. It also allows the work to be split up amongst multiple people, and it even has a nice docker-compose file that allows you to run it at the click of a button.

CVAT saved tons of time and work: it only took a few hours to annotate the tip on each of the 14,424 images. (Actually, I can’t take credit for this work.) For linear motions of the satellite, we simply had to annotate the start and end position and CVAT would interpolate and add all the labels in between. If you ever need a video or image annotation tool, CVAT comes highly recommended.

Annotating the tip of the satellite using boxes in OpenCV’s CVAT.

There are, however, some opportunities for improvement, or rather, features I would love to have. For example, CVAT does not support interpolation between points. As a work-around, boxes had to be used instead of points for all annotations. (The top-left coordinate of the box was used to match the position of the tip.) Also, any frames not annotated, i.e. frames where the tip was not visible, were not included in the XML output.

XML output from CVAT after annotating the images.

In order to make this XML file suitable for training and evaluating the model, it had to be post-processed into the right format. Funny thing is: this seemingly trivial task actually took quite a few iterations to get right. I often had to go back to fix labels, add new labels, update the output format, etc. To me, there’s a lesson to be learned here.

The code that converts the raw data and annotations into a dataset suitable for training and evaluation is an important part of the codebase. It is not just a bunch of obscure one-off terminal commands. You should treat it with respect: it is part of the playbook that allows you to reproduce your results, and it doubles as documentation. Version your code, review it, use semantic versioning for your dataset releases and, most importantly, make it easy for others working on the same problem to use the dataset by zipping it up and offering it for download.

Once I had a baseline for the dataset build script, colleagues were able to reuse it and share their changes. I have introduced Nexus in our company, and we now use it to distribute all our code artifacts from Java, Python and Docker, including datasets and more.

The dataset build script also allowed quick experiments with different versions of the dataset:

  • Applying data augmentation: rotation, blurring, sharpening.
  • Experimenting with different training and evaluation splits.
  • Tailoring the data to a representation that fits your model.
  • Converting and packaging the data into a suitable format.

This last point deserves a bit more attention. Because I’m using Tensorflow, I wanted to use the TFRecords data format. Not only because it nicely integrates into the TF Dataset API, but mostly because I assumed that this binary data format would read much more efficiently from disk. Here’s a code excerpt on how I converted the images and labels to a TFRecords file using Python multi-processing. (I wanted to use multi-threading but… in Python-land threads are not cool and GIL said so.)

Convert images and labels to a TFRecords file using multi-processing in Python.
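
The gist itself is not reproduced here, but a minimal sketch of the idea looks roughly like this. The helper names (create_example, write_tfrecords) and the label layout (an (x, y) tuple per image) are illustrative, not the exact code from the repository:

```python
import multiprocessing as mp
import tensorflow as tf

def create_example(args):
    """Wrap one image file and its (x, y) tip label into a serialized tf.train.Example."""
    image_path, (x, y) = args
    with open(image_path, 'rb') as f:
        image_bytes = f.read()
    features = tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(float_list=tf.train.FloatList(value=[x, y])),
    })
    return tf.train.Example(features=features).SerializeToString()

def write_tfrecords(image_paths, labels, output_path):
    """Serialize the examples in parallel worker processes, write them out sequentially."""
    with mp.Pool() as pool, tf.python_io.TFRecordWriter(output_path) as writer:
        for example in pool.imap(create_example, zip(image_paths, labels), chunksize=64):
            writer.write(example)
```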

After creating the TFRecords file, I created this script to benchmark and compare the time it takes to read the 13,198 training images from the TFRecords file versus simply reading each image from disk and decoding them on the fly. Surprisingly enough, the TFRecords data format did not really improve the speed of reading the training dataset. The timing outputs below show that sequentially reading from a TFRecords file is slower than reading each image from disk and decoding them on the fly. The difference is marginal but I definitely expected TFRecords to be faster.

If you really want to improve the performance of your data import pipeline, consider parallel processing and prefetching of the data. By simply setting the tf.data.Dataset.map num_parallel_calls argument when parsing the dataset, parallel reading of those very same images from the TFRecords file is 2 times faster than its sequential counterpart. Reading each image from disk and decoding them on the fly is even 3 times faster. However, in the parallel example, reading the TFRecords file is almost 2 times slower than reading the images on the fly. Again not what I expected. I would be happy if someone could point out the issue and share their experiences with TFRecords.

In the end, combining parallel parsing and prefetching allowed me to remove any CPU bottlenecks during training and increased the average GPU utilization from 75% to more than 95% as measured with the nvidia-smi command.
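
For reference, the relevant part of such an input pipeline could look roughly like the sketch below. The feature keys, the file name and the degree of parallelism are placeholders; pick whatever matches your TFRecords layout and CPU:

```python
import tensorflow as tf

def parse_example(serialized):
    """Decode one serialized tf.train.Example back into an image tensor and its label."""
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([2], tf.float32),
    })
    image = tf.image.decode_png(features['image'], channels=3)
    return image, features['label']

dataset = (tf.data.TFRecordDataset('train.tfrecords')
           .map(parse_example, num_parallel_calls=8)  # parse several examples at once
           .shuffle(1000)
           .batch(32)
           .prefetch(2))  # keep the next batches ready so the GPU never waits on the CPU
```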

Here are the timing outputs from the script when run on my old 2011 iMac (2.7 GHz Intel Core i5):

Sequential parsing of 13198 images:

  • TFRecords dataset: 50.13s
  • Plain PNG files dataset: 49.46s

Parallel parsing of 13198 images:

  • TFRecords dataset: 26.78s
  • Plain PNG files dataset: 15.96s

Model principles

Recently, I completed Andrew Ng's Deep Learning Specialization on Coursera. (Ng is pronounced a bit like the n-sound at the end of 'song'.) These five courses cover the core concepts of Deep Learning and neural networks, including Convolutional Networks, RNNs, LSTMs, Adam, Dropout, BatchNorm, Xavier/He initialization, and more. The course also details practical case studies from healthcare, autonomous driving, sign language reading, music generation, and natural language processing. Beyond Andrew's amazing track record, he's also a great teacher and I must say it was a wonderful experience. I can recommend this course to anyone who wants to get into Deep Learning.

In the 4th course, covering Convolutional Neural Networks, he gave an excellent explanation of object detection using the YOLO algorithm (You Only Look Once). This algorithm performs real-time object detection, as you can see below.

Object detection using YOLO. Source: https://pjreddie.com/darknet/yolo/

“YOLO is one of the most effective object detection algorithms that encompasses many of the best ideas across the entire computer vision literature that relate to object detection.” — Andrew Ng

With that, I just couldn’t resist implementing my own naive version of the algorithm. I won’t be explaining the full working principles and details of the original YOLO paper in this post as there are so many excellent blog posts out there doing just that. (Like this one for example.) Instead, I will focus on how I used YOLO to solve my specific localization problem.

Intermezzo — object segmentation, detection and localization.

There’s a difference between object segmentation, detection and localization. Object segmentation aims to find segments of various shapes that give a pixel-wise description of the contours of the object to be detected in the image. Object detection is the process of finding a rectangular bounding box around one or more objects in a given image. Object localization is about finding the position of one or multiple objects.

Object segmentation, detection and localization from left to right.

The main principle of the algorithm is simple: take the input images and apply a deep stack of convolutional layers, each with its own set of filters. Every set of convolutional layers decreases the feature space, i.e. the resolution, of the image. Remember that convolutions preserve spatial locality because each layer has a local connectivity pattern to its adjacent layers. Therefore, each element in the output layer represents a small region of the original image at the input. The number of filters per convolutional step may vary from 64 to 1024 or even 4096. In the final output layer, however, the number of filters is reduced to 3. In other words, the output layer has 3 channels and each of them will be trained to activate for a different purpose for that specific region in the image:

  • Channel 1 — the predictor bit: represents the chance between 0 and 1 that the satellite tip is present in that region of the picture.
  • Channel 2 — the relative X-position: vertical position of the tip (if available), relative to the top-left origin of that region.
  • Channel 3 — the relative Y-position: the same as channel 2, but for the other axis.

Have a look at the image below; it’s my attempt to depict the concept.

The input image in the top layer is dimensionally reduced to the output layer at the bottom (omitting the convolutional layers in between). The grey lines between the input and output layer show how each neuron along the depth dimension (or per channel) is dedicated to a specific region of the image. Per region, the output volume predicts whether a tip is visible and its X and Y coordinate relative to the origin of that region. In an ideal scenario, a prediction would have all elements set to zero except for the highlighted volume where the tip is visible.

In my first and naive version of the algorithm, I didn’t spend a lot of time figuring out what would be the perfect CNN model architecture to solve my problem. Instead, I wanted to focus on the parts in the code that would allow me to train and evaluate the model. As a result, I simply implemented the same model layout as that of the architecture picture from the original YOLO paper (below).

YOLO v1 CNN model (source: https://arxiv.org/pdf/1506.02640.pdf)

This is how my simple mind interpreted those layers into code.

My naive interpretation of the Yolo v1 model.
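
The actual implementation follows the architecture picture above layer for layer; the much-condensed sketch below only illustrates the idea (strided convolutions that shrink the image down to a coarse grid with 3 output channels). The filter counts, strides and the sigmoid output are purely illustrative:

```python
import tensorflow as tf

def naive_yolo_like_model(images):
    """Reduce the input image to a coarse grid with 3 channels per cell:
    predictor bit, relative x and relative y."""
    x = images
    for filters in [64, 128, 256, 512, 1024]:
        x = tf.layers.conv2d(x, filters, kernel_size=3, strides=2,
                             padding='same', activation=tf.nn.leaky_relu)
    # A final 1x1 convolution maps the feature volume to the 3 output channels.
    # The sigmoid keeps every channel in [0, 1], matching the label encoding.
    return tf.layers.conv2d(x, 3, kernel_size=1, activation=tf.nn.sigmoid)
```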

I’m quite happy I didn’t spend too much time on the model since setting up the loss function and training/evaluation proved to take up much more time. Besides, these days there are so many great models out there that are just too difficult to beat. Have a look at the Tensorflow Hub for example, or at the models available in Keras. For that reason, I didn’t care too much about the performance of the model. Instead, my main target was to get all the moving parts of the algorithm working: the dataset pipeline at the input, the trainable model, the loss function and the evaluation metrics.

The loss function

To calculate the loss, my first step was to convert all labels (which are basically x, y positions of the tip of the satellite) into an output volume as shown in the picture above. This is what I came up with. Or if you prefer skipping the bulk of the code, just have a look at the simple example at line 20 of the script.

Code excerpt that parses a given label (i.e. the x and y position of the tip of the satellite) into a volume similar to the output of the model.
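
In essence, the conversion boils down to something like the sketch below. The image and grid dimensions are placeholders, and the (-1, -1) convention for frames without a visible tip matches the labels shown later in this post:

```python
import numpy as np

def label_to_volume(x, y, image_size=(512, 512), grid_size=(16, 16)):
    """Turn a single (x, y) tip position into a grid_size x 3 target volume:
    channel 0 = predictor bit, channels 1 and 2 = position relative to the cell origin."""
    volume = np.zeros(grid_size + (3,), dtype=np.float32)
    if x < 0 or y < 0:  # (-1, -1) marks frames where the tip is not visible
        return volume
    cell_w = image_size[0] / grid_size[0]
    cell_h = image_size[1] / grid_size[1]
    col, row = int(x // cell_w), int(y // cell_h)
    volume[row, col, 0] = 1.0                          # tip present in this cell
    volume[row, col, 1] = (x - col * cell_w) / cell_w  # x relative to the cell origin, in [0, 1]
    volume[row, col, 2] = (y - row * cell_h) / cell_h  # y relative to the cell origin, in [0, 1]
    return volume
```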

The next step is to compare a given parsed label to the output of the model and set up a loss function that would allow gradient descent to optimize the model parameters. I tried many alternatives, fiddling around with the mean squared error and the mean squared logarithmic error. In the end, I settled on using cross entropy loss (or log loss if you want) because it is especially effective for classification tasks whose probability value is between 0 and 1, like the prediction loss.

The loss function itself is the weighted sum of two parts:

  • Prediction loss: how well did the model predict that there is a satellite tip or not per box in the output volume. I gave this prediction loss a weight of 5 because it is the main contributor to having a correct prediction.
  • XY-loss: how well did the model predict the position of the tip if there is one in that box. If there is no tip available in the picture, this part of the loss function should be zero so that only the prediction loss determines the final loss. I gave this XY-loss a weight of 1.

Have a look at the code implementing this loss function below. With that, I was ready to train the model using an Adam optimizer, hooray!

The loss function of the model.
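
As a sketch of the intent described above (including the per-cell masking that, as noted below, my original version did not get quite right), the loss boils down to something like this:

```python
import tensorflow as tf

def loss_fn(labels, predictions, prediction_weight=5.0, xy_weight=1.0):
    """Weighted sum of a prediction loss (is there a tip in this cell?) and an
    XY-loss (where in that cell is it?), both computed as log loss on [0, 1] values."""
    presence = labels[..., 0]  # predictor bit per grid cell
    # Prediction loss: log loss over every cell of the output volume.
    prediction_loss = tf.losses.log_loss(labels=presence,
                                         predictions=predictions[..., 0])
    # XY-loss: only counted where a tip is actually present, so frames (and cells)
    # without a tip contribute nothing to this term.
    xy_mask = tf.stack([presence, presence], axis=-1)
    xy_loss = tf.losses.log_loss(labels=labels[..., 1:],
                                 predictions=predictions[..., 1:],
                                 weights=xy_mask)
    return prediction_weight * prediction_loss + xy_weight * xy_loss
```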

In hindsight, when writing this, I realize that this loss function can still be improved a lot. If there is a tip in the image, the XY-loss is calculated for every box in the output volume. This means that the XY-loss is also taken into account for all boxes with no tip visible, and that’s not what I intended. Consequently, the XY-loss is mostly trained to detect background rather than the satellite’s tip… oops. Furthermore, the XY-loss is not a classification task like the prediction loss is. Therefore, it is probably better to calculate it using mean squared error or a similar strategy. The funny thing is: this loss function performed great. So, that’s actually good news: it can perform even better :)

Transfer learning

Once I had the model and the loss function up, running and training properly, I wanted to swap out my naive interpretation of the YOLO model for a battle-tested and pre-trained version. Since I only had a limited dataset, I assumed that transfer learning would be needed to solve the problem.

One option was to simply pick a model from the Tensorflow Hub. However, TensorFlow makes it too easy to use those models and I wanted to take a more challenging route so I could learn more. I decided to use the latest version of the YOLO model from the original author because it was written for Darknet and I wanted to learn how that model could be imported into Tensorflow.

When I started to look into the latest YOLO model definition I quickly realized that I needed to parse and map each of the sections in that definition file to the right Tensorflow layer. Perhaps I should have been more careful what I wished for because it is tedious and time-consuming work. Luckily, I found this script that converts the YOLO model definition into a Keras model, which I could load using Tensorflow.

Transfer learning is about reusing part of the layer weights of an existing model pre-trained on a different but similar dataset and retraining only the remaining layers. Once I had all of the 252 model layers loaded up, I had to figure out which layers (and their associated weights) I wanted to keep constant, which layers I wanted to retrain, and each of their dimensions. To that end, I wrote a short script that plots the Keras model to an image and calculates the dimensions from a given list of layers.

Using this script, I could simply preview the full model layout, including all the layer names. I then hand-picked a layer at the very middle of the model: “add_19”. In my implementation, using the layer.trainable property, all the weights of the layers in the first half are kept constant and all weights of the layers in the second half are retrained. The last layer in the model is “conv2d_75” and it has 255 channels. I added one additional convolutional layer with kernel/filter size 3 to reduce and fit the model output to the final dimensions I was aiming for.

Loading the YOLO model in Keras, enabling transfer learning and matching the output layer dimensionality to match the labels and loss function.
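
Stripped down to its essence, that boils down to something like the snippet below. The .h5 path is a placeholder and the sigmoid on the extra output layer is an assumption, but the layer names (“add_19”, “conv2d_75”) and the kernel size of 3 are the ones mentioned above:

```python
import tensorflow as tf

# Load the YOLO model previously converted from its Darknet definition to a Keras .h5 file.
yolo = tf.keras.models.load_model('yolo.h5')  # placeholder path

# Freeze the first half of the network (up to and including "add_19");
# everything after that stays trainable.
trainable = False
for layer in yolo.layers:
    layer.trainable = trainable
    if layer.name == 'add_19':
        trainable = True

# The last YOLO layer ("conv2d_75") has 255 channels; one extra convolution
# with kernel size 3 reduces that to the 3 channels the loss function expects.
features = yolo.get_layer('conv2d_75').output
output = tf.keras.layers.Conv2D(3, kernel_size=3, padding='same',
                                activation='sigmoid')(features)
model = tf.keras.Model(inputs=yolo.input, outputs=output)
```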

Results

First, let’s check how transfer learning impacts the results. Have a look at the image below. Reusing the first half of the YOLO model and retraining the second half makes a huge difference. In fact, the results are beyond comparison. Without transfer learning, the loss function stagnates around 80 whereas with transfer learning, the loss function immediately drops to almost zero.

Model loss function output per training step. The dark blue line shows the loss without transfer learning or simply a randomly initialized model. The light blue line shows the loss when half of the model reuses weights from the YOLO model.

The following picture visualizes the model output when not using transfer learning. Notice how the model is able to filter out the background and focus on the tip, but is never able to make an accurate prediction.

Model prediction output when not using transfer learning. Each output volume of the model is shown as a semi-transparent box with a color that ranges from (very transparent) blue (indicating a low chance of a tip present in that box) to green and then to red (indicating a high chance).
The model prediction output visualized for the same image over 42 training epochs when not using transfer learning. Notice how the model learns to filter out the background, but never succeeds in narrowing down on the tip.

This is what that looks like for the whole evaluation dataset, still without transfer learning.

Video animation of the model predictions for the evaluation data without transfer learning.

However, this is what that looks like for the whole evaluation dataset, with transfer learning enabled.

Video animation of the model predictions for the evaluation data using transfer learning.

It is clear that transfer learning has a huge impact on the results. Therefore, the remainder of this article and results assume transfer learning is enabled.

Beyond the output of the loss function, the performance of the model was measured in 4 ways:

  1. metrics/dist_mean: for all samples where the model correctly predicts the presence of the tip, what is the average distance from the prediction to the label in pixels.
  2. accuracy/point_mean: for all samples where the model correctly predicts the presence of the tip, what percentage of those samples are within 10 pixels from the labeled tip.
  3. accuracy/prob_mean: how accurately can the model predict the presence of the tip. I.e. the predictor bit must be higher than 0.5.
  4. accuracy/overall_mean: what percentage of the samples are predicted correctly. I.e. if there is no tip, the model also predicts the same and if there is a tip, it is within 10 pixels from the label.
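
To make those definitions concrete, here is roughly how they could be computed from decoded per-image predictions and labels. This is a NumPy sketch; the 10 px threshold is the one used above, everything else is illustrative:

```python
import numpy as np

def evaluate(pred_prob, pred_xy, label_present, label_xy, threshold_px=10):
    """Compute the four metrics from per-image predictions and labels."""
    pred_present = pred_prob > 0.5               # predictor bit thresholded at 0.5
    both = pred_present & label_present          # tip correctly predicted as present
    dist = np.linalg.norm(pred_xy - label_xy, axis=1)
    return {
        'metrics/dist_mean': dist[both].mean(),
        'accuracy/point_mean': (dist[both] < threshold_px).mean(),
        'accuracy/prob_mean': (pred_present == label_present).mean(),
        'accuracy/overall_mean': np.where(label_present,
                                          both & (dist < threshold_px),
                                          ~pred_present).mean(),
    }
```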

Here are the evaluation results from the evaluation dataset of 2885 samples, after training the model for about 8 hours.

  1. metrics/dist_mean: 1.352px
  2. accuracy/point_mean: 98.2%
  3. accuracy/prob_mean: 98.7%
  4. accuracy/overall_mean: 98.7%

Below you can see those numbers plotted over time in Tensorboard. In simple words, the algorithm is off by one pixel on average.

Four evaluation metrics and the loss function calculated after every training epoch during a training period of 8 hours.

Out of the 2885 evaluation samples, 32 pictures had a prediction that was off. When I looked at them, 28 were pictures where the position of the tip was detected quite accurately, but the model simply wasn’t confident enough to say there is a tip at all. I.e. the predictor bit didn’t exceed 0.5, but the correct box was selected. Here is an example.

The model predicts the tip within 10px but the confidence level is just below 0.5 and therefore it is marked as an erroneous prediction. It’s so close to 0.5 that when rounding the predictor bit, it yields exactly 0.50.

The four remaining negative predictions are more interesting. They are mostly mislabeled or at least ambiguous. When the tip is hidden behind an object, yet still easy for a human to localize, some of the images are labeled inconsistently. This is exactly what the model caught. Two examples are shown below: the tip is hidden behind an object and labeled as not having a visible tip, whereas the model predicts there is a tip, and at the correct position.

Examples of a negative prediction where the tip of the satellite is hidden behind an object. These images are labeled as not having a visible tip (hence the label -1, -1), whilst the model is still able to predict a correct position.

Intermezzo — Tensorflow estimators versus manual abstractions.

Tensorflow includes estimators to shield developers from boilerplate code and to guide their code into a structure that should easily scale up to multiple machines. I always used estimators and assumed my loyalty to them would be rewarded with great efficiency, clean code and free features. In Tensorflow 1.12, part of those assumptions holds true, but I still decided to create my own abstraction. Below I explain why.

In order to ensure consistency, estimators reload the model from disk every time you call estimator.{train(), predict(), evaluate()}. (The train_and_evaluate method is simply a loop for calling estimator.train and estimator.evaluate.) If you have a big model (which is quite common) and you want to interleave training and evaluation on the same machine, reloading the model really slows down the training process.

The reason estimators reload the model is to ensure consistency when distributing it. It’s a big part of the design rationale behind them, but as you can read here, the slowdown does cause frustrations. Also, not everyone has the need or the luxury of having an army of GPUs at their disposal or, even more importantly, the time to make their model concurrent, since this requires careful (re)design and effort. Tensorflow does have an InMemoryEvaluatorHook to overcome this problem. I tried it and it works fine, but it feels more like a workaround than a real fix.

Additionally, when I tried loading my Keras model from within the estimator model function, it took me some time to realize that one has to manually clear the Keras model after every train or evaluate call. That’s awkward.

These things are not really showstoppers, but together with the urge to learn how Tensorflow works, they were enough to convince me to create my own micro-abstraction.

With the advent of Tensorflow 2.0, I believe most of the things I struggled with will be resolved. Keras will be integrated into the very core of Tensorflow and become one of its primary interfaces. Estimators will remain supported. If you want to know more about Tensorflow 2.0 check this blog and this video.

Lessons learned the hard way

I cannot believe how many mistakes I made when working on this. Some of them were just silly and easy to catch, but some of them were really hard to spot. Here are some of the lessons I learned that might be useful to you:

  • Double, triple and quadruple-check the semantics, interpretation and correctness of your evaluation/training metrics. My model, for example, scored 100% accuracy from the very beginning. That wasn’t because the model was super accurate, but because this metric only takes into account the samples where the model correctly predicted that there is a tip. If only 5 samples out of 10,000 have a correct tip detected, a 100% accuracy still means only 5 images were detected within 10px.
  • The tf.metrics API in particular fooled me more than once. Use tf.metrics wisely. They are meant for evaluation, i.e. to aggregate results over multiple batch operations and over the full evaluation dataset. Be sure to reset their state at appropriate times.
  • If you use batch norm in Tensorflow do not forget to update the moving mean and variance during training. These update operations are automatically stored in the tf.GraphKeys.UPDATE_OPS collection, so don’t forget to run them.
Two code examples on how to update moving mean and variance when performing batch norm in Tensorflow.
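
A minimal sketch of both options, assuming the model is built with tf.layers (the tiny model here is just a stand-in):

```python
import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 64, 64, 3])
labels = tf.placeholder(tf.float32, [None, 1])
is_training = tf.placeholder(tf.bool)

x = tf.layers.conv2d(images, 8, 3, activation=tf.nn.relu)
x = tf.layers.batch_normalization(x, training=is_training)
logits = tf.layers.dense(tf.layers.flatten(x), 1)
loss = tf.losses.sigmoid_cross_entropy(labels, logits)

# The moving mean/variance updates live in the UPDATE_OPS collection and are
# not run automatically: tie them to the train op explicitly.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):                     # option 1
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
# train_op = tf.group(train_op, *update_ops)                  # option 2: group them explicitly
```
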
  • Write unit tests as a sanity-check or at least keep your quick and dirty test scripts into a separate file for later reference. It especially makes sense to thoroughly test the loss function.
  • Every time you train the model, make sure that all output metrics and other data are kept in a unique, time-tagged directory. Additionally, store the git tag (e.g. heads/master-0-g5936b9e). That way, whenever you mess up the model, it will help you revert to a previous working version.
Example code on how to write the git description to a file.
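
A minimal sketch, assuming the code lives in a git repository and `git describe --all --long` produces the tag format mentioned above:

```python
import subprocess

def write_git_description(output_path):
    """Store the current git description (e.g. heads/master-0-g5936b9e) next to the
    training run, so every result can be traced back to an exact code version."""
    description = subprocess.check_output(
        ['git', 'describe', '--all', '--long']).decode().strip()
    with open(output_path, 'w') as f:
        f.write(description + '\n')
```
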
  • Write your metrics to Tensorboard, both for training and evaluation. It’s well worth the effort because the visualization gives you key insight into the performance of your work. There are some challenges to it, but in return you can iterate and test your ideas faster.
  • Keep track of all your trainable variables in TensorBoard to help you detect exploding or vanishing gradients early. Here’s some inspiration on how you could do that.
Example code on how to visualize the mean value and a histogram for each trainable variables in the model and an overall histogram for all trainable variables.
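
A minimal sketch of that idea; the summary names are illustrative:

```python
import tensorflow as tf

def summarize_trainable_variables():
    """Add a mean scalar and a histogram per trainable variable, plus one overall
    histogram, so exploding or vanishing weights show up early in Tensorboard."""
    variables = tf.trainable_variables()
    for var in variables:
        name = var.name.replace(':', '_')  # ':' is not allowed in summary names
        tf.summary.scalar('means/' + name, tf.reduce_mean(var))
        tf.summary.histogram('histograms/' + name, var)
    flat = tf.concat([tf.reshape(v, [-1]) for v in variables], axis=0)
    tf.summary.histogram('all_trainable_variables', flat)
    return tf.summary.merge_all()
```
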
  • Try to automatically and periodically pause the training process to evaluate the model. Be sure to render both training and evaluation curves to the same graph in Tensorboard. That way, you can visualize the performance of the model on data that it has never seen during the training process and stop as soon as you notice a problem. Be aware that you cannot show multiple summaries in the same plot by simply reusing the same tag. Tensorboard will automatically make those summaries unique by adding “_1” to the tag name and thus forcing them to show in a separate plot. If you want to work around this limitation, you can generate the Summary protocol buffers yourself, and then manually add them to the summary.FileWriter(). Awkward, but it works.
Example on how to save a metric with the tag “metrics/loss” during evaluation whilst, during training, a metric with the very same tag was used. This allows having both the training and evaluation curves shown on the same graph in Tensorboard.
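
A minimal sketch of that workaround, with placeholder log directories:

```python
import tensorflow as tf

# One writer per curve; both write the very same tag so Tensorboard overlays them.
train_writer = tf.summary.FileWriter('logs/train')
eval_writer = tf.summary.FileWriter('logs/eval')

def write_scalar(writer, tag, value, step):
    """Build the Summary protocol buffer by hand so the tag is reused verbatim,
    instead of Tensorflow renaming it to something like 'metrics/loss_1'."""
    summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)])
    writer.add_summary(summary, global_step=step)
    writer.flush()

# During training:   write_scalar(train_writer, 'metrics/loss', train_loss, step)
# During evaluation: write_scalar(eval_writer, 'metrics/loss', eval_loss, step)
```
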
  • Monitor the GPU utilization and memory consumption and try to get the GPU utilization as high as possible. If you use an NVIDIA graphics card, you can use the nvidia-smi command to do that. You can also monitor CPU and memory consumption using htop.

Hardware

  • NVIDIA Geforce RTX2080TI (11GB, 4352 Cuda cores, 600W, INNO3D GAMING OC X3)
  • Supermicro X9DRG-QF dual CPU motherboard
  • 2x Intel Xeon E5-2630 (12 cores)
  • Samsung 860 EVO SSD (500 GB)
  • 128 GB RAM

Conclusion and next steps

When I started to write this post, I was aiming for a short but informative message. Little did I know, it was going to be such a monster of a post, sorry for that. This project has allowed me to touch many aspects of supervised learning algorithms which might explain the size.

With the given evaluation data, the model is able to pinpoint the tip of the satellite 98% of the time, off by one pixel on average. I’m pretty happy with those results. However, if I want to take this model into production, I still have a lot of things to do. The current dataset is too limited to train a robust model: the images are very similar to each other and only a small number of different poses are covered. As I didn’t have the opportunity to obtain more data, a colleague decided to help and render satellite images using Unity. The picture of the Magellan space probe at the top of this post is a sample of that dataset. I realize that this will not produce a realistic model, but it will allow me to continue learning.

Also, having only one point of the satellite recognized is not enough to accurately calculate the pose and distance relative to the observing camera. In theory, at least 3 points must be tracked; in practice, the model should predict many more points for a reliable output.

You’ve made it to the end of this post, thanks for reading! I hope it inspires you to work on your own AI projects and challenges.
