
Development of a Benchmark Dataset with an interface to the FastAI DataLoader using Label Studio

A demo using Label Studio segmentation mask capabilities to develop a benchmark dataset for model viability testing.

Development of a Benchmark Dataset Integrating the fast.ai DataBlock API with Label Studio

Sequential Illustration of the Creation of a Segmentation Mask (Image by Author)

I was excited to use Label Studio because it had generated so much buzz in the MLOps community, and its marketing page showed some impressive visuals. Our use case is to generate detailed segmentation masks for a small number of images that can serve as a benchmark dataset while we train a model to identify the "Canvas" part of a picture frame. A separate blog article will discuss the actual training data, which is generated synthetically. The benchmark dataset gives us an "in the wild" evaluation of the model trained on synthetic data. Label Studio lets the labeler describe regions of an image. This demo shows how to export the labels from the Label Studio server into a segmentation data loader that FastAI and PyTorch can consume.

This effort was conducted under a project called wallshots.co, which is developing live-streaming tools for digital art collectors. The creators of this demo are Aaron Soellinger and Will Kunz.

Objective

The objective is to reduce the lead time and improve the accuracy of segmentation mask creation and of label maintenance operations such as adding new categories, re-labeling regions, and fine-tuning labels. To that end, I evaluated Label Studio as a labeling toolkit that could improve labeling speed and accuracy. I was looking for the following characteristics:

  • Well thought out dependencies, fits into my architecture
  • Fast to set up
  • Simple to maintain
  • Good GUI for region labeling
  • Configurable labels
  • Product targeted at segmentation masks
  • Interoperable with Python (nice to have)
  • Interoperable with PyTorch ecosystem tools (nice to have)

The Demo

Installation

docker run -it -p 8080:8080 -v `pwd`:/label-studio/data heartexlabs/label-studio:latest

For this demo I used the Docker installation. It was easy to get the server up and running, though I was already familiar with Docker deployments. Altogether, it took less than 30 minutes to go from install to initial configuration of my modeling environment, which is impressive.

Labeling

The labeling process was intuitive. Before implementing Label Studio, I used a combination of draw.io and Mac’s Photo Viewer. That setup allowed me to see the pixel indexes for image points, but the process is far from enjoyable. It was extremely tedious and error prone. It’s unfathomable to scale up a data labeling operation around this kind of approach. That’s why I needed a labeling tool like Label Studio.

The Label Studio GUI was a great improvement on the draw.io + Mac Photo Viewer process. My favorite features of Label Studio were:

  1. Setting up and editing the configuration was easy. I added new labeling categories, then went back and reassigned regions I had previously labeled to the new category. Configuration was seamless, and I was amazed.
  2. It tracked lead time, labeler identity, and project state (e.g. Completed, Not Completed)!
  3. It supports data ingest from, and output to, cloud buckets.

Some issues I encountered:

  1. It was not clear how to label non-rectangular regions effectively.
  2. The interface’s zoom and drag features, while functional, were clunky. I found myself labeling in two steps: I would place approximate vertices first, then fine-tune by dragging them into place. That worked.
  3. Overlapping regions caused some issues because it was not possible to add or create a point on top of an existing labeled region. However, one can drag an existing point on top of an existing labeled region. Eventually we found a solution, but this is definitely an area for future improvement.
  4. Currently Label Studio only delivers the vertices of the polygon, and the user is expected to draw the region themselves. At best, that requires a tight communication loop between the labeler and the data scientist, covering all labeling standards as well as any edge cases that come up during labeling. The problem is further complicated if the labeled regions are non-convex sets. In that case, reconstructing the region requires the list of vertices to also be ordered. Better than providing the ability to order the points would be to go the "last mile" and deliver the region itself. That would reduce the number of reasons the data scientist needs to delve into the specifics of the labeling activities, and thus reduce delays in delivering trained models.

Getting the Labels into the Modeling Runtime

I learned that the best way to source the latest labels from the server is the Label Studio API. This was confusing at first because the labeling activity automatically generates a file in the export/ directory, but that file is not continually updated. The recommended solution is the Label Studio REST API: HTTP requests are made from my training runtime, where the data is needed, to the label server. That required me to tweak my environment, since my training runtime and my label server were running in separate Docker containers that didn’t necessarily communicate with one another. I solved it by adding the --net=host option to the training Docker container. To illustrate, see the question mark (?) below:

The relevant parts of my development server architecture (Image by Author)

We use the Label Studio API, which is implemented as a REST API. It can be accessed on localhost:8902, which is where the Label Studio server is running. Here’s the code used to source the labels from the Label Studio server. The Label Studio API will cache results automatically, which is not desirable, so the request sends a Cache-Control: no-cache header. Note: I found the token in the Account Settings page of the Label Studio GUI.

# Docs: https://api.labelstud.io/
import json
import requests

ls_export = "http://localhost:8902/api/projects/{id}/export?exportType=JSON"
proj_id = "1"
# got the token from account settings in the gui
# http://localhost:8902/user/account
token = "***"
bm_data = requests.get(
    ls_export.format(id=proj_id),
    headers={
        'Authorization': 'Token {}'.format(token),
        # ask for a fresh export rather than a cached one
        'Cache-Control': 'no-cache'
    }
).content
# bm_data becomes a list with one entry per labeled image
bm_data = json.loads(bm_data)
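To make the shape of the export concrete, here is a minimal sketch that walks one hand-written task and pulls out each polygon's label and vertices. The sample values and the helper name iter_polygons are made up for illustration; the keys are the ones the preprocessing code in this article reads. The points are treated as percentages of the image width and height, which is why the preprocessing code multiplies them by 0.01.

```python
# Hypothetical, hand-written task that mimics the keys read later in this article.
sample_task = {
    "data": {"image": "/data/upload/example.jpg"},
    "annotations": [{
        "result": [{
            "original_width": 200,
            "original_height": 100,
            "value": {
                "points": [[10.0, 20.0], [90.0, 20.0], [90.0, 80.0], [10.0, 80.0]],
                "polygonlabels": ["Canvas"],
            },
        }]
    }],
}


def iter_polygons(task):
    """Yield (label, pixel_vertices) for each polygon in the first annotation."""
    for region in task["annotations"][0]["result"]:
        w, h = region["original_width"], region["original_height"]
        label = region["value"]["polygonlabels"][0]
        # Percent -> pixel index, clamped so 100% maps to the last row/column.
        pixels = [
            (min(int(x * w / 100), w - 1), min(int(y * h / 100), h - 1))
            for x, y in region["value"]["points"]
        ]
        yield label, pixels


polygons = list(iter_polygons(sample_task))
print(polygons)
```

In the real pipeline the loop would run over every entry of bm_data instead of a single sample task.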

PreProcessing for the PyTorch Segmentation DataLoader

As always, preprocessing the data for the data loader is the hardest and most time-consuming part of the entire process. (That is actually why I wanted to write this article: to share my code.)

Vertices to Polygons using numpy:

By using numpy.linspace it is possible to draw the vertical lines, then loop over each row of the mask and fill between them. This is the "from scratch" solution. The issue is that it suffers from a bunch of edge cases. This is the kind of code I don’t typically like to maintain, so I would normally look for someone in the open source community who has already solved it. Here is the working solution:

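import requests
import json
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
# NOTE: MaskedImg (used below) is an image helper class that is not defined in this article.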
class NonConvexityDetected(Exception):
    pass
def label_path_to_docker_path(label_path:str, rawdatadir:Path):
    """
    Converts the path used in Label Studio server to local path.
    Uses: instance.data.image from bm_data

    """
    return Path(label_path.replace('/data', str(rawdatadir)))
def key_points_to_pixels(key_points, width, height):
    """Converts proportion space to pixel indices."""
    # Scale x by width and y by height. The off-diagonal terms must be 0,
    # otherwise x and y leak into one another.
    out = np.floor(np.matmul(
        key_points,
        np.array([[width, 0],[0, height]])
    )).astype('int')
    # Clamp indices that land exactly on the right or bottom edge
    out[:,1] = np.minimum(out[:,1], height-1)
    out[:,0] = np.minimum(out[:,0], width-1)
    return out
def apply_polygon_horizontal_bounds(maskbase, label_data, label_num):
    """Labels the bounds in each row, there will be 0, 1, or 2 bounds."""
    width, height = label_data['original_width'], label_data['original_height']
    # Copy before closing the polygon so the exported data is not mutated
    closed_poly = list(label_data['value']['points'])
    closed_poly.append(closed_poly[0])
    props = np.array(closed_poly)
    out = key_points_to_pixels(
        key_points=props*0.01,width=width, height=height,
    )
    # draw lines
    last = out[0]
    for i, pair in enumerate(out):
        if i == 0: continue
        start_y = last[1]
        end_y = pair[1]
        # for each row, label the bounds of y
        min_pts = 3 if not start_y == end_y else 2
        label_pts = np.floor(
            np.linspace(
                start=last, 
                stop=pair, 
                num=max(min_pts, abs(end_y-start_y)+1)
            )
        ).astype(int)
        for pt in label_pts:
            maskbase[int(pt[1]),int(pt[0])] = label_num    
        last = pair

def fill_rowwise(maskbase, fill_label):
    """Looks for bounds drawn in a previous step and fills between them."""
    for i in range(len(maskbase)):
        xs = np.where(maskbase[i] == fill_label)[0]
        if len(xs) == 0:
            # print('Region not present', xs)
            continue
        if len(xs) == 1:
            print(
                'Could be a local max/min in row={}, found {} in {}'
                .format(
                    i, len(xs), xs
                )
            )
            # it's already labeled
            continue

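        # Fill from the first to the last bound found in this row
        # (assumes the region is convex along the row)
        bounds = [[i,xs[0]],[i,xs[-1]]]
        maskbase[i, bounds[0][1]:bounds[1][1]] = fill_label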

def delete_duplicates(l):
    out = []
    for e in l:
        if e not in out:
            out.append(e)

    return out

def prep_mask_for_combining(mask):
    """Prepares elements for combining."""
    mask = mask.astype(str)
    mask = np.where(mask == '0', '', mask)
    return mask
def combine_masks(masks, multi_label_rules):
    """Combines a set of masks."""
    apply_multi_label_rules = np.vectorize(
        lambda x: multi_label_rules[
            ''.join(
                sorted(
                    ''.join(delete_duplicates(list(x)))
                )
            )
        ]
    )
    outmask = prep_mask_for_combining(masks[0])
    for mask_i in masks[1:]:
        # element-wise string concatenation, e.g. '1' + '2' -> '12'
        outmask = np.char.add(
            outmask,
            prep_mask_for_combining(mask_i)
        )

    return apply_multi_label_rules(outmask)
def compute_segmask_from_labelstudio(instance, rawdatadir, labels_map, multi_label_rules):
    """Processes the labeled region export from LabelStudio into a segmentation mask."""

    t = instance
    raw_fp = label_path_to_docker_path(
        label_path=t['data']['image'],
        rawdatadir=rawdatadir
    )
    baseimg = MaskedImg()
    baseimg.load_from_file(fn=raw_fp)
    imgname = Path(raw_fp).name
    maskbase = np.zeros(shape=baseimg.img.shape[:2], dtype=int)
    masks = []
    for i, label_data in enumerate(t['annotations'][0]['result']):
        mask_i = maskbase.copy()
        print('######', i)
        label = label_data['value']['polygonlabels'][0]
        label_num = labels_map[label]
        apply_polygon_horizontal_bounds(
            maskbase=mask_i,
            label_data=label_data,
            label_num=label_num
        )
        fig = plt.figure(figsize=(9,9))
        plt.title('Drawn row-wise bounds')
        plt.imshow(baseimg.img)
        plt.imshow(mask_i,alpha=0.25)
        plt.show()
        fill_rowwise(
            maskbase=mask_i,
            fill_label=label_num
        )
        fig = plt.figure(figsize=(9,9))
        plt.title('Filled between bounds')
        plt.imshow(baseimg.img)
        plt.imshow(mask_i,alpha=0.25)
        plt.show()
        # keep the mask of every region, not only the last one
        masks.append(mask_i)
    print('\n\n#########')
    final_mask = combine_masks(masks, multi_label_rules)
    fig = plt.figure(figsize=(9,9))
    plt.title('Final Mask')
    plt.imshow(baseimg.img)
    plt.imshow(final_mask, alpha=0.25)
    plt.show()
    return imgname, baseimg, final_mask
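For comparison, an existing open source rasterizer can do the fill step. The sketch below is not the method used in this article: it swaps in Pillow's ImageDraw.polygon, which fills a polygon from its ordered vertices and also copes with non-convex shapes. The helper name polygon_to_mask and the L-shaped test region are hypothetical, and the points are assumed to be percentages of the image size, as in the export.

```python
import numpy as np
from PIL import Image, ImageDraw


def polygon_to_mask(points_pct, width, height, label_num):
    """Rasterize one polygon (vertices in percent, in drawing order) into a mask."""
    # Convert percentages to pixel coordinates.
    vertices = [(x * width / 100, y * height / 100) for x, y in points_pct]
    # 8-bit single-channel canvas, all zeros (background).
    canvas = Image.new("L", (width, height), 0)
    ImageDraw.Draw(canvas).polygon(vertices, fill=label_num, outline=label_num)
    # Resulting array has shape (height, width), like the masks above.
    return np.array(canvas)


# An L-shaped (non-convex) region on a 100x100 image.
l_shape = [[10, 10], [50, 10], [50, 50], [90, 50], [90, 90], [10, 90]]
mask = polygon_to_mask(l_shape, width=100, height=100, label_num=1)
print(mask.shape, mask[30, 30], mask[30, 70], mask[70, 70])
```

The vertex order matters here: the same six points in a different order would describe a different region, which is the ordering concern raised in the labeling section.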

For my project, the configuration and the loop that writes the images and masks are defined as follows:

rawdatadir = Path('/ws/data/wallshots-framefinder/benchmark_data/1/media')
# A place to define one's own labels corresponding to named regions in LabelStudio
labels_map = {
    'Background': 0,
    'Canvas': 1,
    'Obstruction': 2,
    'PartialCanvas': 3
}
# TODO: Automate this from labels_map
# This object defines new labels based on the overlapping regions.
# E.g. '12' is when region 1 and 2 are both present in a pixel.
multi_label_rules = {
    '': 0,
    '1': 1,
    '2': 2,
    '3': 3,
    '12': 4,
    '13': 5,
    '23': 6
}
saveto = Path('benchmark')
saveto.mkdir(exist_ok=True, parents=True)
(saveto/'scenes').mkdir(exist_ok=True, parents=True)
(saveto/'masks').mkdir(exist_ok=True, parents=True)
from PIL import Image
for instance in bm_data:
    imgname, img, mask = compute_segmask_from_labelstudio(
        instance=instance, 
        rawdatadir=rawdatadir,
        labels_map=labels_map, 
        multi_label_rules=multi_label_rules
    )

    basename = imgname.split('.')[0]
    fn = '{}.jpg'.format(basename)
    img.save(saveto=saveto/'scenes'/fn)
    fn = '{}.tif'.format(basename)
    im = Image.fromarray(
        mask.astype(np.uint8)
    )
    im.save(saveto/'masks'/fn)

This will display your images with the segmentation masks overlaid on them. To avoid that, comment out the plt lines. For each labeled region, the code plots one chart for the vertical lines and another for the filled region. After showing all the individual regions separately, it combines them into the final segmentation mask (the one loaded into model training). Here’s a video of how Label Studio can be used in concert with the training runtime to update as new labels are added or fine-tuned.

Here’s an example that sequentially shows how the mask creation algorithm works:

Shows the vertical lines that define the region that will be filled (Image by Author)
Shows the filled region (Image by Author)
Shows the final mask, in this case there is only one region. (Image by Author)

This setup works with multiple labeled regions in the same base image, even when they overlap. See the example below, which contains non-square rectangles and overlapping regions. The overlapping regions are hard to see, but the plants that block the picture frames in the image are labeled separately as "Obstruction".

Illustration of segmentation mask with multiple regions, some of which are overlapping (Image by Author)
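The configuration code carries a TODO to automate multi_label_rules from labels_map. One possible way to do that is sketched below. The helper name build_multi_label_rules and its max_overlap parameter are hypothetical, and the string keys are only unambiguous while label numbers are single digits, as they are in this article.

```python
from itertools import combinations


def build_multi_label_rules(labels_map, max_overlap=2):
    """Assign a class id to every combination of up to `max_overlap` labels."""
    # Label 0 is treated as background and never takes part in an overlap.
    nums = sorted(n for n in labels_map.values() if n != 0)
    rules = {"": 0}
    # A single label keeps its own number.
    for n in nums:
        rules[str(n)] = n
    # Overlaps get fresh ids after the largest single label.
    next_id = max(nums, default=0) + 1
    for k in range(2, max_overlap + 1):
        for combo in combinations(nums, k):
            rules["".join(str(n) for n in combo)] = next_id
            next_id += 1
    return rules


labels_map = {"Background": 0, "Canvas": 1, "Obstruction": 2, "PartialCanvas": 3}
generated_rules = build_multi_label_rules(labels_map)
print(generated_rules)
```

With the default max_overlap=2 this reproduces the hand-written table for the four labels used here. Raising max_overlap would also cover pixels where three regions overlap.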

Create the FastAI/PyTorch DataLoader for the Benchmark Set

The data from Label Studio comes as a list of points, which are the vertices (corners) of our labeled regions. In our case, each polygon is labeled as one of several categories, for example "Canvas", "Obstruction" and "PartialCanvas". We read in the region vertices, obtain the raw image, and transform the vertices into a "segmentation mask" that the segmentation data loader can read. The benefit of loading our benchmark dataset as a DataLoader is that we get the data validation and logging facilities that come with it. Eventually, the "in the wild" images may replace the synthetic training data we create, or at least augment it.

from fastai.vision.all import *

def get_y_fn(x):
    # scenes/<name>.jpg -> masks/<name>.tif
    return (
        str(x)
        .replace('scenes', 'masks')
        .replace('jpg', 'tif')
    )
size = 250
dls = SegmentationDataLoaders.from_label_func(
    saveto,
    bs=12,
    # the scene images were written to saveto/'scenes' above
    fnames=[
        name
        for name in (saveto/'scenes').iterdir()
        if not name.is_dir()
    ],
    label_func=get_y_fn,
    item_tfms=[Resize((size,size),)],
    batch_tfms=[
        Normalize.from_stats(*imagenet_stats)
    ]
)
dls.show_batch()
From the FastAI data loader, we see the segmentation masks combined with the original images. This is proof that the data is processed correctly.
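A visual check can be complemented with a small automated one. The sketch below is an assumption about what such a check might look like, not part of the original pipeline: the hypothetical helper check_benchmark_pairs verifies that every scene in scenes/ has a same-sized mask in masks/ and that the mask only contains expected class codes. The demo builds a throwaway directory so that it runs on its own.

```python
import tempfile
from pathlib import Path

import numpy as np
from PIL import Image


def check_benchmark_pairs(root, valid_codes):
    """Return a list of problems found in root/scenes and root/masks."""
    problems = []
    for scene_fp in sorted((Path(root) / "scenes").glob("*.jpg")):
        mask_fp = Path(root) / "masks" / (scene_fp.stem + ".tif")
        if not mask_fp.exists():
            problems.append(f"{scene_fp.name}: missing mask")
            continue
        with Image.open(scene_fp) as scene, Image.open(mask_fp) as mask_img:
            mask = np.array(mask_img)
            if scene.size != mask_img.size:
                problems.append(f"{scene_fp.name}: size mismatch")
        unexpected = set(np.unique(mask).tolist()) - set(valid_codes)
        if unexpected:
            problems.append(f"{scene_fp.name}: unexpected codes {sorted(unexpected)}")
    return problems


# Demo on a temporary directory: one valid pair and one scene without a mask.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "scenes").mkdir()
    (root / "masks").mkdir()
    Image.new("RGB", (8, 6)).save(root / "scenes" / "a.jpg")
    good = np.zeros((6, 8), dtype=np.uint8)
    good[2:4, 2:6] = 1
    Image.fromarray(good).save(root / "masks" / "a.tif")
    Image.new("RGB", (8, 6)).save(root / "scenes" / "b.jpg")  # no mask written
    demo_problems = check_benchmark_pairs(root, valid_codes=range(7))
print(demo_problems)
```

For the benchmark set in this article, the call would be check_benchmark_pairs(saveto, valid_codes=multi_label_rules.values()).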

Conclusions

Label Studio provides a powerful tool set for labeling data in segmentation problems. There are other types of labeling tasks that are supported by Label Studio as well, but here we are looking at the segmentation masks capability. Future work will show what happens when we introduce an Active Learning loop. We will move the benchmark data to the cloud, which enables integration with the core product user flow.

