How Starship robots see the world

Dmitrii Khizbullin
Towards Data Science
7 min read · Dec 4, 2020


A 6-wheeler and an 8-wheeler. Image by the author.

Hey folks! Today we are going to peek into the technological stack of Starship’s delivery robots.

Starship Technologies is an amazing startup founded by Ahti Heinla, a co-founder of Skype. I had the honor of working at Starship in 2018 and hold the warmest memories of those times. Starship develops very cute wheeled robots that deliver your lunch, groceries, or parcels to your front door in a matter of minutes. To make this possible, Starship's team embarked on solving many challenging technological problems. Let's take a deep dive into the one my team was solving at the time.

First off, the robots drive on sidewalks, just as human delivery couriers do. This involves navigating around pedestrians, staying on the sidewalk, and spotting oncoming cars when crossing a road at a zebra crossing. Yes, you read that right: the robots cross roads just as we urban inhabitants do, with the only difference that robots never jaywalk.

Image by the author

As a matter of fact, robots have been officially introduced into Estonian traffic rules. They are granted their own third status, apart from pedestrians and cars, with the lowest priority of all. That means a robot must yield to absolutely everyone on the street. Consider a situation: a pedestrian at an unregulated crosswalk. Any car has to yield to the pedestrian. However, if a robot wants to cross the same road at that crosswalk, it must yield to the car. This leads us to the understanding that vision is essential for the robots. They cannot make their way to their destination without looking around with their cameras and understanding where pedestrians, and especially cars, are and what they are going to do.

So, in a nutshell, here is what I was doing at Starship as part of the Perception team: developing and iterating on the neural network that detects surrounding dynamic objects. A robot has around 10 cameras covering 360 degrees around it. The images from the cameras are stitched together and fed into a convolutional neural network (CNN) that frames all dynamic objects (whether currently moving or stationary) with axis-aligned bounding boxes. Starship already had a baseline implementation by the time I joined, and my job was to improve it.
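
Starship's actual network and camera setup are of course proprietary, but a tiny sketch can illustrate the input/output contract: several synchronized frames are combined into one wide image, and a detector returns axis-aligned boxes. The frame sizes and the off-the-shelf torchvision detector below are stand-ins, not what runs on the robot.

```python
# Minimal sketch (not Starship's actual pipeline): stitch several camera
# frames side by side and run a generic off-the-shelf detector on the result.
import torch
import torchvision

# Pretend we have 10 synchronized camera frames, 3 x 256 x 320 each.
frames = [torch.rand(3, 256, 320) for _ in range(10)]

# Naive "stitching": concatenate along the width to get one panorama tensor.
panorama = torch.cat(frames, dim=2)  # 3 x 256 x 3200

# Stand-in detector with random weights (pass weights="DEFAULT" for a
# pretrained one); the real network is a custom CNN tuned for the robot.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
model.eval()

with torch.no_grad():
    predictions = model([panorama])  # list with one dict per input image

# Each prediction holds axis-aligned boxes (x1, y1, x2, y2), labels and scores.
print(predictions[0]["boxes"].shape, predictions[0]["scores"].shape)
```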

Children like Starship’s robots and enjoy playing with them. Image by the author.

First of all, deep learning is called deep for a reason. The deeper the network, the more sophisticated the patterns it is able to learn. Why not just stack 2-3x more layers onto the baseline net? Doing so would proportionally increase the amount of compute required, increase the latency, and consequently the robot's reaction time. We do not want the robot to lag 1-2 seconds behind reality, right? By that time a car may have already rushed right past the robot's nose and disappeared over the horizon. Luckily, we can interleave 3x3 and 1x1 Conv2D layers and juggle channel sizes to match the inference time of the baseline, much like ResNet or Darknet base blocks do. One nice consequence of increasing the depth of the network is an increase in the weight count. Why on Earth would you want a heavier model, you may ask? The reason is that the task at hand has its own intrinsic complexity, its own number of degrees of freedom, and a neural network must have enough capacity to incorporate all the knowledge about the task to achieve maximal accuracy.
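
To make the depth-for-roughly-the-same-FLOPs idea concrete, here is a minimal sketch of a ResNet/Darknet-style bottleneck that alternates a cheap 1x1 squeeze with a 3x3 convolution; all channel sizes are made up for illustration and are not Starship's.

```python
# Sketch of the depth-for-free trick: alternate cheap 1x1 "squeeze" convs with
# 3x3 convs, as in ResNet/Darknet bottleneck blocks.
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels, squeeze):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, squeeze, kernel_size=1, bias=False),  # cheap 1x1
            nn.BatchNorm2d(squeeze),
            nn.ReLU(inplace=True),
            nn.Conv2d(squeeze, channels, kernel_size=3, padding=1, bias=False),  # 3x3
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual connection keeps gradients healthy as depth grows.
        return self.relu(x + self.block(x))

# Stacking many such blocks roughly doubles depth while keeping FLOPs close to
# the baseline, because half of the extra layers are 1x1 convolutions.
net = nn.Sequential(*[Bottleneck(128, 64) for _ in range(8)])
print(net(torch.rand(1, 128, 64, 80)).shape)
```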

Honey, I’ve shrunk the robot. Image by the author.

An important thing to do when you have a CNN is to analyze its receptive field. For every pixel of the output feature map, the receptive field is the area of the original image that is visible to it.
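
For a plain stack of convolutions the receptive field can be estimated with a few lines of code; the layer list below is hypothetical, just to show how quickly (or slowly) the field grows.

```python
# Estimate the receptive field of a plain convolutional stack by tracking how
# the field grows layer by layer.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input-to-output order."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * current jump
        jump *= stride             # stride multiplies the step between output pixels
    return rf

# Hypothetical example: five 3x3/stride-2 stages followed by four 3x3/stride-1 layers.
layers = [(3, 2)] * 5 + [(3, 1)] * 4
print(receptive_field(layers), "input pixels per output pixel")
```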

Analysis of the receptive field size. Image by the author.

It turned out that even the deeper architecture had a very limited receptive field and did not give the detector a chance to analyze the global context of the scene. Luckily, academia had already proposed a way to increase the receptive field: the Feature Pyramid Network (FPN). You can take a look at the link to the paper down below. FPN takes up a share of the compute budget itself, so to incorporate it into the architecture I had to shrink the channel counts throughout the main backbone layers. Eventually, though, the network showed a couple of percentage points of improvement in mean average precision (mAP) under the same latency cap: a clear win for the FPN-enabled architecture.
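
As a rough illustration of what bolting an FPN onto a backbone looks like (not the actual Starship architecture), here is a sketch using torchvision's ready-made FeaturePyramidNetwork module with made-up channel counts and feature map sizes.

```python
# Minimal illustration of adding an FPN on top of backbone feature maps.
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

# Pretend the backbone outputs maps at strides 8, 16 and 32 with growing channels.
features = OrderedDict(
    p3=torch.rand(1, 128, 64, 80),
    p4=torch.rand(1, 256, 32, 40),
    p5=torch.rand(1, 512, 16, 20),
)

# The FPN's top-down pathway mixes coarse, large-receptive-field features back
# into the fine maps, so detections see more global context.
fpn = FeaturePyramidNetwork(in_channels_list=[128, 256, 512], out_channels=128)
outputs = fpn(features)
for name, tensor in outputs.items():
    print(name, tuple(tensor.shape))
```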

Snow? Robots love it! Image by the author.

Wait, Feature Pyramid Network. Isn't it a rip-off of the hourglass architecture that U-Net proposed for semantic segmentation of medical images back in 2015? Looks like it.

And… it seems we already have an architecture suitable for coarse semantic segmentation of the scene, we just don't train the network on this target. After talking to the localization team, I discovered that there is indeed a need for semantic segmentation. Visual localization extracts visual features from the scene and matches them to the HD map, allowing the robot to understand its pose (position + orientation) in the world. However, if a robot drives along a row of parked cars, it detects features on the cars and may accidentally make a mistake in its positioning. A car's bounding box won't work for designating an ignore area, since it captures too much of the background. Semantic segmentation would do the trick.
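
A minimal sketch of how such an ignore mask could be used (this is my illustration, not the localization team's actual code): drop any visual feature whose pixel lands on a "car" region of the segmentation output before matching against the map.

```python
# Sketch: use a per-pixel "car" mask from the segmentation head to drop visual
# features detected on parked cars, so they never take part in map matching.
import numpy as np

H, W = 480, 640
car_mask = np.zeros((H, W), dtype=bool)
car_mask[300:480, 100:400] = True  # pretend a parked car occupies this region

# Pretend keypoints from the localization front-end: (x, y) pixel coordinates.
keypoints = np.array([[150, 350], [500, 200], [50, 50], [390, 470]])

on_car = car_mask[keypoints[:, 1], keypoints[:, 0]]  # index the mask as (row=y, col=x)
static_keypoints = keypoints[~on_car]                # keep only features off cars

print(static_keypoints)  # only keypoints on static scenery survive
```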

Tallinn is a beautiful city full of contrasts. Images by the author.

Another direct application of semantic segmentation is distinguishing between a sidewalk and a road. There is no way to do this without semantic analysis of the entire scene, since both the sidewalk and the road are made of visually identical asphalt.

The image was created by the author based on an image from the nuScenes autonomous driving dataset (https://www.nuscenes.org/, CC BY-NC-SA 4.0)

This source of knowledge about the terrain can not only augment and cross-validate localization in the manner of Arguing Machines, but also enables autonomous driving in exploratory mode, i.e. into areas that have not yet been mapped. Check out the paper about Arguing Machines below.

One misconception that I encountered: "if you want both object detection and semantic segmentation capability at the same time, you need two separate neural networks." That would mean a 2x increase in compute consumption, which is unaffordable, especially on an embedded platform. Fortunately, there is a solution to this problem: multitarget training. The idea is to share most of the computation in a common backbone while having two lightweight heads for the different targets.
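
Here is a bare-bones sketch of that layout in PyTorch: one shared backbone and two cheap 1x1 heads. The channel counts and head designs are illustrative, not the production network.

```python
# Sketch of the shared-compute idea: one backbone, two lightweight heads.
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_classes=5, num_seg_classes=3):
        super().__init__()
        # Heavy shared backbone: this is where almost all the compute lives.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Cheap detection head: per-cell box offsets + objectness + class scores.
        self.det_head = nn.Conv2d(128, 4 + 1 + num_classes, 1)
        # Cheap segmentation head: per-pixel class logits at 1/8 resolution.
        self.seg_head = nn.Conv2d(128, num_seg_classes, 1)

    def forward(self, x, task):
        features = self.backbone(x)  # shared, computed once per image
        if task == "detection":
            return self.det_head(features)
        return self.seg_head(features)

net = MultiTaskNet()
print(net(torch.rand(1, 3, 256, 320), task="detection").shape)
print(net(torch.rand(1, 3, 256, 320), task="segmentation").shape)
```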

Interleaved training. Image made by the author in Inkscape.

It is pretty straightforward to implement this kind of neural network in your favorite deep learning framework (mine is PyTorch). But how would we train such a hydra? Academic benchmarks suitable for multitarget training, like COCO, often provide exhaustive annotation of an image: both bounding boxes and segmentation masks are available for the same image. But what if you only have one set of bbox-annotated images and a separate one with semantic labels? Furthermore, what if the resolutions of these two groups of images are different? No worries, you can easy-peasy handle it with interleaved training! Just form a batch of detection images and annotations and perform a forward-backward pass through the backbone + detection head and loss. Then form a batch of segmentation images and annotations and perform another forward-backward pass through the same backbone (keep it shared) + this time the segmentation head and loss. Voilà! You train for the two targets at the same time without any transfer learning or network surgery. Don't forget to carefully weight the losses to balance accuracy between the targets.
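
A minimal sketch of such an interleaved loop, with tiny stand-in modules, dummy data and dummy losses just to show the alternation over a shared backbone; the real losses, loss weights and dataloaders would of course be task-specific.

```python
# Sketch of interleaved training with two independently annotated datasets.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())  # shared
det_head = nn.Conv2d(16, 10, 1)  # detection logits (stand-in)
seg_head = nn.Conv2d(16, 3, 1)   # segmentation logits (stand-in)

params = list(backbone.parameters()) + list(det_head.parameters()) + list(seg_head.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3)
seg_loss_weight = 0.5  # balances accuracy between the two targets

def fake_batch(n, h, w, channels):  # stand-in for the two real dataloaders
    return torch.rand(n, 3, h, w), torch.rand(n, channels, h, w)

for step in range(100):
    # 1) Detection batch: forward-backward through backbone + detection head.
    images, det_target = fake_batch(8, 128, 160, channels=10)  # bbox-style targets
    det_loss = nn.functional.mse_loss(det_head(backbone(images)), det_target)
    optimizer.zero_grad()
    det_loss.backward()
    optimizer.step()

    # 2) Segmentation batch (can even use a different resolution): same shared
    #    backbone, the other head and its own weighted loss.
    images, seg_target = fake_batch(4, 192, 256, channels=3)   # per-pixel targets
    seg_loss = seg_loss_weight * nn.functional.mse_loss(seg_head(backbone(images)), seg_target)
    optimizer.zero_grad()
    seg_loss.backward()
    optimizer.step()
```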

Working at Starship was a hell of a journey for me, full of deep learnings (pun intended). If you'd like to get connected, my LinkedIn link is below.

If you want to learn more, check out the videos below where Ahti and the team share lots of nitty-gritty about Starship’s technology.

Fresh webinar and Q&A by Ahti Heinla
Personally, I find robo-vision to be sci-fi-esque

By the way, Starship is actively expanding and hiring at the moment, so feel free to reach out to them if you get interested: https://www.starship.xyz/careers/

Dear reader! I would like to thank you for your interest in robotics and deep learning and hope you enjoyed the story. Don’t forget to smile and have a good day.
