Development of a Benchmark Dataset Integrating the fast.ai DataBlock API with Label Studio

I was excited to try Label Studio because it has generated so much buzz in the MLOps community, and its marketing page shows some impressive visuals. The use case we’re working on is generating detailed segmentation masks for a small number of images, to serve as a benchmark dataset while we train a model to identify the "Canvas" part of a picture frame. In another blog article we will discuss the actual training data, which uses a synthetic data approach. The benchmark dataset will give us an "in the wild" evaluation of the model created with synthetic data. Label Studio enables the labeler to describe regions of an image. This demo shows how to export the labels from the Label Studio server into a segmentation data loader that can be recognized by FastAI and PyTorch.
This effort was conducted under a project called wallshots.co, which is developing live-streaming tools for digital art collectors. The creators of this demo are Aaron Soellinger and Will Kunz.
Objective
The objective is to reduce the lead time and improve the accuracy of segmentation mask creation and of label maintenance operations such as adding new categories, re-labeling regions, and fine-tuning labels. In other words, we evaluate whether Label Studio is a labeling toolkit that enables improved labeling speed and accuracy. I was looking for the following characteristics:
- Well thought out dependencies, fits into my architecture
- Fast to set up
- Simple to maintain
- Good GUI for region labeling
- Configurable labels
- Product targeted at segmentation masks
- Interoperable with Python (nice to have)
- Interoperable with PyTorch ecosystem tools (nice to have)
The Demo
Installation
docker run -it -p 8080:8080 -v `pwd`:/label-studio/data heartexlabs/label-studio:latest
For this demo I used the Docker installation. It was easy to get the server up and running, though I am already familiar with Docker deployment. Altogether, it took less than 30 minutes from install to the initial configuration of my modeling environment, which is impressive.
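Once the container starts, the web UI is served on the mapped port. As a quick sanity check that the server is reachable (assuming the port mapping above), one can hit it from Python:
import requests

resp = requests.get('http://localhost:8080')
print(resp.status_code)  # expect 200 once the server is up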
Labeling
The labeling process was intuitive. Before implementing Label Studio, I used a combination of draw.io and Mac’s Photo Viewer. That setup allowed me to see the pixel indexes for image points, but the process was far from enjoyable: it was extremely tedious and error prone. It’s unfathomable to scale up a data labeling operation around that kind of approach. That’s why I needed a labeling tool like Label Studio.
The Label Studio GUI made it possible for me to greatly improve on the draw.io + Mac Photo Viewer process. My favorite features of Label Studio were:
- Setting up and editing the labeling configuration was easy. I added new label categories, then went back and re-assigned regions I had previously labeled one way to the new category. The configuration change was seamless, and I was amazed.
- It kept track of labeling lead time, attributed each annotation to its labeler, and tracked project state (e.g. Completed, Not Completed, etc.)!
- It enables data ingest from and output to cloud buckets.
Some issues I encountered:
- Not clear how to label non-rectangular regions effectively
- The interface’s zoom and drag features, while functional, were clunky. I found myself labeling in two steps: first placing the approximate vertices, then fine-tuning by dragging those vertices into place. That worked.
- Overlapping regions caused some issues because it was not possible to add/create a point on top of an existing labeled region. However, one can drag an existing point over the top of an existing labeled region. Eventually we found a workaround, but this is definitely an area for future improvement.
- Currently Label Studio only delivers the vertices of the polygon, and the user is meant to reconstruct the region themselves. At best, that requires a tight communication loop between the labeler and the data scientist, covering all standards as well as any edge cases that occur in the labeling activities. Problems are complicated further if the labeled regions are non-convex sets; in that case, reconstructing the region requires the list of vertices to also be ordered. Better than providing the ability to order the points would be to go the "last mile" and deliver the region itself. That reduces the occasions on which the data scientist needs to delve deeply into the specifics of the labeling activities, and thus reduces delays in delivering trained models.
Getting the Labels into the Modeling Runtime
I learned that the best way to source the latest labels from the server is to use the Label Studio API. This was confusing at first because a file is generated automatically from the labeling activity and stored in the export/ directory, but that file is not continually updated. The recommended solution is the Label Studio REST API, with HTTP requests made from the training runtime, where the data is needed, to the label server. That required me to tweak my environment, since my training runtime and label server were running in Docker containers that didn’t necessarily communicate with one another. I solved it by adding the --net=host option to the training docker container.
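For reference, here is a minimal sketch of how the training container can be launched so it shares the host’s network stack (the image name and volume mount here are hypothetical placeholders):
docker run -it --net=host -v `pwd`:/ws my-training-image:latest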

We use the Label Studio API, which is implemented as a REST API. It can be accessed on localhost:8902, which is where the Label Studio server is running. Here’s the code used to source the labels from the Label Studio server. Note that the Label Studio API caches results automatically, which is not desirable here, so the request sends a Cache-Control: no-cache header. I found the token in the Account Settings page of the Label Studio GUI.
# Docs: https://api.labelstud.io/
import requests
import json

ls_export = "http://localhost:8902/api/projects/{id}/export?exportType=JSON"
proj_id = "1"
# got the token from account settings in the gui
# http://localhost:8902/user/account
token = "***"
bm_data = requests.get(
    ls_export.format(id=proj_id),
    headers={
        'Authorization': 'Token {}'.format(token),
        # disable the API's automatic result caching
        'Cache-Control': 'no-cache'
    }
).content
bm_data = json.loads(bm_data)
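Before preprocessing, it is worth sanity-checking the structure of the export. These keys are the ones the downstream code consumes:
print(len(bm_data))                 # number of labeled tasks
print(bm_data[0]['data']['image'])  # image path as seen by the Label Studio server
print(bm_data[0]['annotations'][0]['result'][0]['value']['polygonlabels'])  # e.g. ['Canvas']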
PreProcessing for the PyTorch Segmentation DataLoader
As always, preprocessing the data to load into the data loader is the hardest and most time-consuming part of the entire process. (That is actually why I wanted to write this article: to share my code.)
Vertices to Polygons using numpy:
Using numpy.linspace, it is possible to rasterize the polygon’s boundary lines, then loop over each row of the mask and fill between them. This is the "from scratch" solution. The catch is that it suffers from a number of edge cases, and this is code I don’t typically like to maintain, so I normally look for someone who has already solved it in the open source community.
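To make the idea concrete, here is a tiny illustration of my own (not part of the original pipeline): rasterize two polygon edges onto a small mask with numpy.linspace, then fill between the row-wise bounds.
import numpy as np

mask = np.zeros((4, 6), dtype=int)
# rasterize two edges: left from (1, 0) to (1, 3), right from (4, 0) to (4, 3)
for start, stop in [((1, 0), (1, 3)), ((4, 0), (4, 3))]:
    for x, y in np.floor(np.linspace(start=start, stop=stop, num=4)).astype(int):
        mask[y, x] = 1
# fill between the leftmost and rightmost bound in each row
for row in mask:
    xs = np.where(row == 1)[0]
    if len(xs) >= 2:
        row[xs[0]:xs[-1] + 1] = 1
print(mask)  # every row is filled from column 1 through 4
With that intuition, here is the full working solution: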
import requests
import json
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt


class NonConvexityDetected(Exception):
    pass


def label_path_to_docker_path(label_path: str, rawdatadir: Path):
    """
    Converts the path used in Label Studio server to local path.
    Uses: instance.data.image from bm_data
    """
    return Path(label_path.replace('/data', str(rawdatadir)))


def key_points_to_pixels(key_points, width, height):
    """Converts proportion space to pixel indices."""
    out = np.floor(np.matmul(
        key_points,
        np.array([[width, 0], [0, height]])  # scale x by width, y by height
    )).astype('int')
    # clamp indices that land exactly on the image edge
    bounds_checker = lambda x, x_max: x_max if x > x_max else x
    bounds_checker = np.vectorize(bounds_checker)
    out[:, 1] = bounds_checker(out[:, 1], height - 1)
    out[:, 0] = bounds_checker(out[:, 0], width - 1)
    return out


def apply_polygon_horizontal_bounds(maskbase, label_data, label_num):
    """Labels the bounds in each row, there will be 0, 1, or 2 bounds."""
    width, height = label_data['original_width'], label_data['original_height']
    closed_poly = label_data['value']['points']
    closed_poly.append(closed_poly[0])  # close the polygon
    props = np.array(closed_poly)
    out = key_points_to_pixels(
        key_points=props * 0.01,  # Label Studio points are percentages
        width=width, height=height,
    )
    # draw lines between consecutive vertices
    last = out[0]
    for i, pair in enumerate(out):
        if i == 0:
            continue
        start_y = last[1]
        end_y = pair[1]
        # for each row the edge crosses, label the bounds of y
        min_pts = 2 if start_y == end_y else 3
        label_pts = np.floor(
            np.linspace(
                start=last,
                stop=pair,
                num=max(min_pts, abs(end_y - start_y) + 1)
            )
        ).astype(int)
        for pt in label_pts:
            maskbase[int(pt[1]), int(pt[0])] = label_num
        last = pair


def fill_rowwise(maskbase, fill_label):
    """Looks for bounds drawn in a previous step and fills between them."""
    for i in range(len(maskbase)):
        xs = np.where(maskbase[i] == fill_label)[0]
        if len(xs) == 0:
            # region not present in this row
            continue
        if len(xs) == 1:
            print(
                'Could be a local max/min in row={}, found {} in {}'
                .format(i, len(xs), xs)
            )
            # it's already labeled
            continue
        bounds = [[i, xs[0]], [i, xs[1]]]
        maskbase[i, bounds[0][1]:bounds[1][1]] = fill_label


def delete_duplicates(l):
    out = []
    for e in l:
        if e not in out:
            out.append(e)
    return out


def prep_mask_for_combining(mask):
    """Prepares elements for combining."""
    mask = mask.astype(str)
    mask = np.where(mask == '0', '', mask)
    return mask


def combine_masks(masks):
    """Combines a set of masks.

    Note: relies on the module-level multi_label_rules dict defined below.
    """
    apply_multi_label_rules = np.vectorize(
        lambda x: multi_label_rules[
            ''.join(sorted(''.join(delete_duplicates(list(x)))))
        ]
    )
    outmask = prep_mask_for_combining(masks[0])
    for mask_i in masks[1:]:
        outmask = np.core.defchararray.add(
            outmask,
            prep_mask_for_combining(mask_i)
        )
    return apply_multi_label_rules(outmask)


def compute_segmask_from_labelstudio(instance, rawdatadir, labels_map, multi_label_rules):
    """Processes the labeled region export from LabelStudio into a segmentation mask."""
    t = instance
    raw_fp = label_path_to_docker_path(
        label_path=t['data']['image'],
        rawdatadir=rawdatadir
    )
    # MaskedImg is a small helper image class from the author's codebase
    # (definition not shown in this post)
    baseimg = MaskedImg()
    baseimg.load_from_file(fn=raw_fp)
    imgname = Path(raw_fp).name
    maskbase = np.zeros(shape=baseimg.img.shape[:2], dtype=int)
    masks = []
    for i, label_data in enumerate(t['annotations'][0]['result']):
        mask_i = maskbase.copy()
        print('######', i)
        label = label_data['value']['polygonlabels'][0]
        label_num = labels_map[label]
        apply_polygon_horizontal_bounds(
            maskbase=mask_i,
            label_data=label_data,
            label_num=label_num
        )
        fig = plt.figure(figsize=(9, 9))
        plt.title('Drawn row-wise bounds')
        plt.imshow(baseimg.img)
        plt.imshow(mask_i, alpha=0.25)
        plt.show()
        fill_rowwise(
            maskbase=mask_i,
            fill_label=label_num
        )
        fig = plt.figure(figsize=(9, 9))
        plt.title('Filled between bounds')
        plt.imshow(baseimg.img)
        plt.imshow(mask_i, alpha=0.25)
        plt.show()
        masks.append(mask_i)
    print('\n\n#########')
    final_mask = combine_masks(masks)
    fig = plt.figure(figsize=(9, 9))
    plt.title('Final Mask')
    plt.imshow(baseimg.img)
    plt.imshow(final_mask, alpha=0.25)
    plt.show()
    return imgname, baseimg, final_mask
For my project, the specific configuration is defined as follows:
rawdatadir = Path('/ws/data/wallshots-framefinder/benchmark_data/1/media')

# A place to define one's own labels corresponding to named regions in LabelStudio
labels_map = {
    'Background': 0,
    'Canvas': 1,
    'Obstruction': 2,
    'PartialCanvas': 3
}

# TODO: Automate this from labels_map
# This object defines new labels based on the overlapping regions.
# E.g. '12' is when regions 1 and 2 are both present in a pixel.
multi_label_rules = {
    '': 0,
    '1': 1,
    '2': 2,
    '3': 3,
    '12': 4,
    '13': 5,
    '23': 6
}

saveto = Path('benchmark')
saveto.mkdir(exist_ok=True, parents=True)
(saveto/'scenes').mkdir(exist_ok=True, parents=True)
(saveto/'masks').mkdir(exist_ok=True, parents=True)
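As the TODO above notes, multi_label_rules can in principle be generated from labels_map. Here is a minimal sketch of my own (not from the original pipeline) that assigns a fresh id to every combination of overlapping non-background labels:
from itertools import combinations

def build_multi_label_rules(labels_map):
    # single-label keys: the stringified non-background label ids
    base = sorted(str(v) for v in labels_map.values() if v != 0)
    rules = {'': 0}
    rules.update({k: int(k) for k in base})
    # assign fresh ids to every multi-label overlap, in a stable order
    next_id = max(labels_map.values()) + 1
    for r in range(2, len(base) + 1):
        for combo in combinations(base, r):
            rules[''.join(combo)] = next_id
            next_id += 1
    return rules

# build_multi_label_rules(labels_map) reproduces the table above,
# plus a '123' entry for triple overlaps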
import imageio
from PIL import Image

for instance in bm_data:
    imgname, img, mask = compute_segmask_from_labelstudio(
        instance=instance,
        rawdatadir=rawdatadir,
        labels_map=labels_map,
        multi_label_rules=multi_label_rules
    )
    basename = imgname.split('.')[0]
    # save the scene as jpg and the mask as tif
    fn = '{}.jpg'.format(basename)
    img.save(saveto=saveto/'scenes'/fn)
    fn = '{}.tif'.format(basename)
    im = Image.fromarray(mask.astype(np.uint8))
    im.save(saveto/'masks'/fn)
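Because the masks are stored as TIFFs, a quick round-trip check (my own suggestion, not from the original post) confirms the label ids survive the save:
check = np.array(Image.open(saveto/'masks'/fn))
print(np.unique(check))  # should be a subset of the multi_label_rules values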
Running this pipeline will plot your images with the segmentation masks overlaid on them. To avoid that, you can just comment out the plt lines. For each labeled region, this code plots one chart for the row-wise boundary lines and another for the filled region. Then, after it shows all the individual regions plotted separately, it combines them into the final segmentation mask (the one loaded into the model training). Here’s a video of how Label Studio can be used in concert with the training runtime to update as new labels are added or fine-tuned.
Here’s an example that sequentially shows how the mask creation algorithm works:



This setup works with multiple labeled regions in the same base image, even if they overlap. See the example below, which contains non-rectangular regions and overlapping regions. The overlapping regions are hard to see, but the plants that block the picture frames in the image are labeled separately as "Obstruction".

Create the FastAI/PyTorch DataLoader for the Benchmark Set
The data from Label Studio comes as a list of points, which are the vertices or corners of our labeled regions. In our case, each polygon is labeled as one of several categories, for example "Canvas", "Obstruction" and "PartialCanvas". We read in our region vertex data, obtain the raw image, and transform it into a "segmentation mask" that can be read by the segmentation data loader. The benefit of loading our benchmark dataset as a DataLoader is that we get the data validation and logging facilities that are integrated into the data loader interface. Eventually, the in-the-wild images may replace the synthetic training data we create, or at least augment that synthetic dataset.
from fastai.vision.all import *

def get_y_fn(x):
    # a scene image maps to its mask by swapping directory and extension
    return (
        str(x)
        .replace('scenes', 'masks')
        .replace('jpg', 'tif')
    )

size = 250
dls = SegmentationDataLoaders.from_label_func(
    saveto,
    bs=12,
    fnames=[
        name
        for name in (saveto/'scenes').iterdir()
        if not name.is_dir()
    ],
    label_func=get_y_fn,
    item_tfms=[Resize((size, size))],
    batch_tfms=[
        Normalize.from_stats(*imagenet_stats)
    ]
)
dls.show_batch()
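With the DataLoader built, the benchmark can be used to score a model trained elsewhere. A minimal sketch, assuming a fastai Learner named learn exists from the separate synthetic-data training run (not shown in this post):
# evaluate the synthetic-data model against the benchmark set
metrics = learn.validate(dl=dls.train)
print(metrics)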

Conclusions
Label Studio provides a powerful tool set for labeling data in segmentation problems. Label Studio supports other types of labeling tasks as well, but here we focused on its segmentation mask capability. Future work will show what happens when we introduce an Active Learning loop. We will also move the benchmark data to the cloud, which enables integration with the core product user flow.