How to Use the STAPLE Algorithm to Combine Multiple Image Segmentations

This code example shows how to use the STAPLE algorithm on a medical image dataset

Published in

Towards Data Science

7 min readAug 4, 2021

Medical images contain a wealth of information that helps us understand patient health. To unlock that information, the first step is usually to segment, or trace, important structures.

Segmentation is the most important step in medical image analysis — and it’s often overlooked. For more information on the process of segmentation, please check out my introduction to segmenting the pancreas.

Defining the Ground Truth

One of the biggest challenges facing researchers in the field of medical imaging — and data science in general — how do we define what is a “true” segmentation?

On the surface, it may seem like a simple problem. We ask the smartest person in the room (usually a radiologist) to do their best job segmenting the pancreas, and this is our ground truth, right?

There’s a few problems with this approach. Firstly, doctors are incredibly busy. Getting an expert radiologist to trace out a few hundred (or even a few dozen) scans is a big task. Secondly, if there are multiple radiologists working on the same project, how do we say which one is the most “expert”? Is an expert radiologist at the end of a long day of work really that much more accurate than a medical student armed with a pot of coffee?

Incidentally, this is one of the ways that Artificial Intelligence (AI) can improve radiology. Studies have shown that, although AI models may not always be more accurate than human experts, in most cases they are more precise, meaning that an AI will give the same prediction every single time, when human experts tend to vary in the amount of time and energy they dedicate to a project.

If only there were a democratic way to create our ground truth segmentations. If multiple tracers segment the same scan, there may be small errors or differences between individuals, but these mistakes average out into a decision that’s better than the work of any single person.

Enter the STAPLE Algorithm, one of my favorite algorithms and possibly the single most useful bit of code in the field medical imaging.

In this post, I’ll give a brief description of STAPLE, and show an example of how to use it in Python.

The STAPLE Algorithm

STAPLE stands for “Simultaneous Truth and Performance Level Estimation” and was published in 2004 by Warwick et al. STAPLE is widely accepted in the field of medical imaging for its sound theory and ease of use.

Let’s take our CT pancreas segmentation task as an example. As a reminder, this is from a publicly available dataset available through The Cancer Imaging Archive. Let’s say we have an abdominal CT scan where the pancreas has been segmented 3 times, by 3 different radiologists. Each pixel of the pancreas has been labeled “1” (Pancreas) or “0” (No Pancreas). Figure 1 shows an example of one radiologist’s segmentation.

**Figure 1**. Example of three radiologists’ segmentation of the pancreas in abdominal CT. Blue is labeled “1” (“pancreas”). Background (transparent) is labeled “0”. Image by author.

At first glance, all three segmentations may look about the same. But there are differences, if you look closely. The lower left radiologist is a bit eager to paint over the whole abdomen (even painting into the dark fat tissue), where the lower right radiologist might be missing part of the pancreas body. STAPLE is going to help us combine these into one tracing.

How STAPLE works

STAPLE is a weighted voting algorithm. As an initial step, the algorithm will combine all three into a test segmentation, by simple voting on each pixel. STAPLE will rate the accuracy of each of the 3 radiologists compared to this initial test segmentation. It will then re-draw a new, second test segmentation by weighting the votes of the 3 radiologists according to their accuracy. If Radiologist 3 is found to be 90% accurate, and Radiologist 1 and 2 are only 30% and 40% accurate, then Radiologist 3 will be weighted much higher than the others.

STAPLE is an iterative process; this cycle of estimating the accuracy and redrawing the test segmentation will repeat until the test segmentation converges (stops changing). STAPLE is guaranteed to converge in a reasonable length of time, and usually runs for no more than a few seconds. STAPLE will return this final test segmentation as the “ground truth”.

These weights are re-calculated on a per-image basis. So, for example, if one radiologist tends to be more accurate, but happens to have a bad case, that case won’t necessarily be rated higher based on prior accuracy.

As you might expect, STAPLE becomes more accurate the more raters are used to vote. In cases of 10 or 12 raters, the STAPLE-generated segmentation can be considered independent of the individual contributions, which makes it a great tool for estimating the accuracy of any given rater.

STAPLE will work with two raters, but I would recommend that you use at least three, since two raters are not sufficient to estimate rater accuracy which is the core strength of using STAPLE.

Drawbacks of the algorithm

STAPLE is an excellent algorithm and the gold-standard in the field, but there are a few limitations. Firstly, because STAPLE relies on majority voting, it can tend to underestimate the edges of the structure that you’re trying to trace. This can exaggerate the “false positive” rate of your tracers compared to the STAPLE-combined ground truth.

Secondly, STAPLE is a naïve voting algorithm, it doesn’t take into account semantic or underlying image properties. If multiple tracers label something, it will show up in the final version regardless of the actual anatomy. This can result in small pixel-wise errors when random variations or mistakes are picked up by multiple tracers (trust me, it happens).

These problems tend to coincide when using STAPLE for multilabel segmentations. In one my colleague’s publications labeling brain anatomy, STAPLE tended to leave small gaps between adjacent structures, not taking into account the fact that the brain is continuous. This resulted in a STAPLE segmentation that, although technically accurate, looked very different than the segmentations that were used to generate it.

Because of this, it may be appropriate to apply some postprocessing to your STAPLE ground truths. For example, you may choose to preserve only the largest single structure for a given class, throwing out small pixel mistakes. Or, you may choose to define a “background” class to fill in any gaps. For example, in abdomen organ segmentation, we define “fat” as the background so any gaps in the organs will be filled in (which usually matches the actual case very closely).

Just be sure to report such postprocessing steps in your final paper, so others can learn from your shared knowledge!

Example of using STAPLE

In Python, STAPLE is available through SimpleITK. Below is a code snippet which illustrates how to use STAPLE to combine three segmentations. I’m using SimpleITK version 2.1.0 with Python 3.7.8.

Example 1. Using STAPLE to combine three segmentations into a single ground truth

Before we move on, some real quick notes on this Simple ITK implementation, which can be finicky.

SimpleITK requires that the input arrays be integer, either int16, int32, or int64. The arrays must all be cast to the same type
STAPLE can only be run on 1 class at a time. The class label is defined as “foreground value” (second operator) and all other pixels are treated as background.
For multi-class segmentations, you must run STAPLE on each class individually and then recombine them into a single array, which is a limitation of package.

Now that we’ve generated our STAPLE segmentation, we can plot it side-by-side with our three raters using matplotlib:

Example 2. Plotting our STAPLE segmentation using matplotlib

Figure 2. Example of using STAPLE to combine 3 pancreas segmentations into a single ground truth (far left). Image by author

If you squint you can tell a difference between our tracers and the STAPLE ground-truth, but it’s hard to see. Let’s highlight the differences between each rater and STAPLE:

Example 3. Plotting the differences for each rater, compared to STAPLE

Figure 3. The overlap (true-positives) between each rater and the STAPLE is shown in red. Differences are in light blue. Image by author

Here, the differences really pop. There’s a little variation at the edge of the pancreas, but the biggest difference is that STAPLE was able to correct a mistake that Rater 2 made in segmenting the body of the pancreas.

Summary

In this tutorial, I’ve introduced the STAPLE algorithm and shown how to combine simple segmentations from multiple tracers into a single ground truth.

This tutorial shows a simple example, but STAPLE holds up for large tasks as well. For one project, we used it on a dataset consisting of 35 organs, involving 17 raters. Besides interrater reliability, it’s also my preferred method for combining ensemble predictions, where I’ve used it to combine the outputs of up to ten models.

I hope this article has been helpful! If you have any questions, please comment and I will do my best to answer.

Other links

https://nipy.org/nibabel/ — Link to the Nibabel package for reading/writing NIFTI files

https://simpleitk.org/ — Link to the SimpleITK implementation of STAPLE

Pypi.org/project/staple — Another implementation of STAPLE I found when researching this article

A more in-depth tutorial on segmentation with SimpleITK—

34_Segmentation_Evaluation

Evaluating segmentation algorithms is most often done using reference data to which you compare your results. In the…

insightsoftwareconsortium.github.io

Links to some of my other articles on medical imaging —

Understanding DICOM

How to read, write, and organize medical images

towardsdatascience.com

A Python script to sort DICOM files

This script will help you understand and organize your dataset of medical images