By: Aaron Soellinger & Will Kunz @ WallShots.co

Manual labeling of data is expensive and tedious. One emerging approach that dramatically reduces the effort needed to build a sufficiently large segmentation dataset is synthetic data generation. In this article, we create a benchmark model using a synthetic dataset and a small, traditionally labeled dataset. In our case, no off-the-shelf training dataset existed, so we had to create it.
The goal of the model we’re building is to identify any picture frames in an image. Specifically, we want to identify the regions of the picture frames that contain the art (not the padding or the frame itself). This article describes our solution architecture and how we use it to compare baseline models created with two different approaches to forming the training dataset. First, we generate a simple synthetic dataset composed of colored rectangles. We compare that with the so-called "traditional approach," in which images found "in the wild" are labeled using our labeling tool (Label Studio).
Our stack for this problem is Label Studio for labeling and a Python Fast.ai environment for training. The experiments in this article are based on the unet architecture with a resnet34 backbone.
In the end, we will show a comparison of two baseline models trained with different approaches to developing a training dataset. Our hope is that this article will be an interesting case study for those implementing segmentation models.
For the traditional labeling task, we used Label Studio. An in-depth description of our labeling setup can be found here:
Datasets
Synthetic Dataset
Treatment 1: Trained on 2000 synthetically generated images with the following characteristics:
- A single rectangle ("the frame") placed on a background ("the scene")
- Randomly sized within bounds
- Randomly colored within bounds
- Randomly placed within bounds inside a background rectangle of fixed size (224×224)

## Primary synthetic data generator; the amount of generated data can be adjusted ##
import numpy as np
from pathlib import Path

# MaskedImg and vertices_to_region are helpers from our own codebase (not shown here).

def _random_select_valid_tl(scene: MaskedImg, frame_template: dict, mask: MaskedImg):
    """Pick a random top-left corner such that the frame fits inside the scene.

    TODO: support multiple frames; currently frames could overlap.
    """
    sx, sy, sz = scene.img.shape
    vertices = np.array(frame_template['label_data'])
    fwidth = abs(max(vertices[:, 0]) - min(vertices[:, 0]))
    flength = abs(max(vertices[:, 1]) - min(vertices[:, 1]))
    tlx, tly = (
        np.random.randint(0, sx - fwidth - 1),
        np.random.randint(0, sy - flength - 1)
    )
    return tlx, tly

def add_frame_to_scene(scene: MaskedImg, frame_template: dict, mask: MaskedImg, plot: bool):
    """
    In:
        scene: MaskedImg, mutable, the frame gets written on top of it
        frame_template: dict with 'label_data' (list of vertices), 'color', 'label_num'
    """
    # shift the frame coords to a random, valid top-left corner
    tlx, tly = _random_select_valid_tl(
        scene=scene, frame_template=frame_template, mask=mask
    )
    frame = np.array(frame_template['label_data']).copy()
    frame[:, 0] = frame[:, 0] + tlx
    frame[:, 1] = frame[:, 1] + tly
    # create the "filled scene"
    vertices_to_region(
        mask_i=scene.img,  # gets mutated
        label_data=frame.tolist(),
        label_num=frame_template['color'],
        plot=plot
    )
    # update the mask to reflect the frame added to the scene
    vertices_to_region(
        mask_i=mask.img,  # gets mutated
        label_data=frame.tolist(),
        label_num=frame_template['label_num'],
        plot=plot
    )

def select_random_color(color_list: list):
    sc = color_list[np.random.randint(0, len(color_list))]
    return color_list[
        np.random.randint(0, len([x for x in color_list if x != sc]))
    ]

def generator(scene_shapes, frame_templates, color_list, masks_path, scenes_path, plot=False):
    errors = 0
    for i in range(len(scene_shapes)):
        scene_shape = scene_shapes[i]
        frame_template = frame_templates[i]
        # instantiate the scene
        scene = MaskedImg()
        scene.load_from_parameters(
            shape=scene_shape,
            value=select_random_color(color_list),
            value_space='rgb'
        )
        # instantiate the mask
        mask = MaskedImg()
        mask.load_from_parameters(
            shape=scene_shape[:2],
            value=0,
            value_space='int'
        )
        try:
            add_frame_to_scene(
                scene=scene,
                frame_template=frame_template,
                mask=mask,
                plot=plot
            )
        except Exception:
            errors += 1
            continue
        maskfp = f'{str(masks_path)}/{i}.tif'
        scenefp = f'{str(scenes_path)}/{i}.jpeg'
        mask.save(fp=maskfp)
        scene.save(fp=scenefp)
    print('Finished with {} errors.'.format(errors))

exp_id = 0
scenes_path = Path("/ws/data/wallshots-framefinder/{}/scenes".format(exp_id))
masks_path = Path("/ws/data/wallshots-framefinder/{}/masks".format(exp_id))
scenes_path.mkdir(exist_ok=True, parents=True)
masks_path.mkdir(exist_ok=True, parents=True)

n = 2000
color_list = [
    (0, 0, 255), (102, 255, 51), (204, 153, 0), (255, 51, 204),
    (51, 102, 153), (255, 0, 0), (0, 255, 0), (255, 255, 0), (0, 255, 255),
    (128, 0, 255), (204, 102, 0), (153, 0, 51), (255, 102, 153),
    (102, 255, 153), (204, 255, 153), (255, 255, 204), (51, 51, 0),
    (126, 153, 64), (230, 30, 120), (50, 23, 200)]
scene_shapes = [(224, 224) for i in range(n)]
lws = [(np.random.randint(10, 100), np.random.randint(10, 100)) for i in range(n)]
frame_templates = [{
    'label_data': np.array([
        (0, 0), (0, lw[1]), (lw[0], lw[1]), (lw[0], 0)
    ]),
    'color': select_random_color(color_list),
    'label_num': 1
} for lw in lws]

generator(
    scene_shapes=scene_shapes,
    frame_templates=frame_templates,
    color_list=color_list,
    scenes_path=scenes_path,
    masks_path=masks_path,
    plot=False
)
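Before training, it helps to sanity-check a few of the generated scene/mask pairs. The snippet below is a minimal sketch using PIL and matplotlib; it simply reads back files written by the generator above (indices that errored during generation are skipped), and is not part of our original pipeline.

import matplotlib.pyplot as plt
from PIL import Image

# Inspect a few generated scene/mask pairs produced by the generator above.
for i in range(3):
    scene_fp = scenes_path/f'{i}.jpeg'
    mask_fp = masks_path/f'{i}.tif'
    if not scene_fp.exists():
        continue  # this index errored during generation
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(8, 4))
    ax0.imshow(Image.open(scene_fp))
    ax0.set_title(f'scene {i}')
    ax1.imshow(Image.open(mask_fp))
    ax1.set_title(f'mask {i}')
    plt.show()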
Manually Labeled Dataset
We selected 10 images from the internet and labeled them using our labeling test bench. We again used Label Studio for the traditional labeling task, setting up a new Label Studio project so that the images labeled for benchmarking did not get mixed up with the labels created for training. A sketch of how a polygon export can be rasterized into masks follows the list below.
- 10 base images taken from "in the wild" sources, e.g. Google searches.
- Manually labeled in a "first pass" without significant fine-tuning.
- Images contain picture frames in real-world or contrived real-world scenes.
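For completeness, here is a minimal sketch of how a Label Studio polygon export could be rasterized into mask files matching the scenes/masks layout used above. It assumes the standard Label Studio JSON export for PolygonLabels (points stored as percentages of the original image dimensions); the export.json filename, output paths, and the single class with label_num 1 are illustrative assumptions, not our exact pipeline, which is described in our earlier labeling article.

import json
from PIL import Image, ImageDraw

# Assumption: 'export.json' is a Label Studio JSON export containing PolygonLabels results.
with open('export.json') as f:
    tasks = json.load(f)

for i, task in enumerate(tasks):
    results = task['annotations'][0]['result']
    w = results[0]['original_width']
    h = results[0]['original_height']
    mask = Image.new('L', (w, h), 0)  # background pixels = 0
    draw = ImageDraw.Draw(mask)
    for r in results:
        # Label Studio stores polygon points as percentages; convert to pixels.
        pts = [(x / 100 * w, y / 100 * h) for x, y in r['value']['points']]
        draw.polygon(pts, fill=1)  # frame interior = label_num 1
    mask.save(f'masks/{i}.tif')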


Benchmark Dataset
We created a benchmark dataset by carefully selecting example images "from the wild" and labeling them using Label Studio's segmentation features. See our previous article, which describes this in depth: https://towardsdatascience.com/development-of-a-benchmark-dataset-with-an-interface-to-the-fastai-dataloader-using-label-studio-d3aa3c26661f
- 8 base images taken from "in the wild" sources, e.g. Google searches.
- Manually labeled with significant fine-tuning.
- Images contain picture frames in real-world or contrived real-world scenes.
- Images chosen to trip up the models, for example: striped walls, obstructions, depth of field, image noise.

Experiments
Measuring Results
The models created in each experiment are evaluated against an independent benchmark dataset designed to accurately represent the real-world context of the model. This separate benchmark dataset is the single source of truth for experiment results. We keep it in its own data pipeline because its purpose needs to be clear and reflected in the process that creates it: the benchmark dataset is deliberately built to inject examples that cause the models to falter, whereas the training dataset is designed to make the model as good as it can be. We believe this distinction is important enough to warrant the extra overhead of maintaining the benchmark dataset separately. A sketch of how a model can be scored against the benchmark follows.
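As an illustration, a trained learner can be scored on the benchmark images with a simple pixel-overlap metric. The sketch below computes mean intersection-over-union for the frame class, given a trained fastai learner learn (as created in the Training section below); the benchmark path and the choice of IoU are assumptions for illustration, not the exact metric we report.

import numpy as np
from pathlib import Path
from PIL import Image
from fastai.vision.all import *

benchmark_path = Path('benchmark')  # assumed location of the benchmark dataset

def frame_iou(pred_mask, true_mask):
    # Pixelwise IoU for the frame class (label 1).
    pred = np.asarray(pred_mask) == 1
    true = np.asarray(true_mask) == 1
    union = np.logical_or(pred, true).sum()
    return np.logical_and(pred, true).sum() / union if union > 0 else 1.0

scores = []
for scene_fp in (benchmark_path/'scenes').iterdir():
    mask_fp = benchmark_path/'masks'/scene_fp.name.replace('jpg', 'tif')
    pred, _, _ = learn.predict(PILImage.create(scene_fp))  # decoded class-index mask
    # Resize the ground-truth mask to the prediction size using nearest-neighbor.
    true = Image.open(mask_fp).resize(pred.shape[::-1], Image.NEAREST)
    scores.append(frame_iou(pred, true))
print('mean IoU over benchmark:', np.mean(scores))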
Models
Here, we compare the outcomes of two experiments. The models are very similar but differ in the training dataset on which they are trained. In each case, the model is instantiated the same way, using the fastai unet_learner, which requires a segmentation data loader. We were able to use the same directory structure for all our datasets, which simplifies the creation of the data loaders because they can share the same code. That structure is illustrated as follows:
benchmark/
├── masks
│   ├── 1.tif
│   ├── 2.tif
│   ├── 3.tif
│   ├── 4.tif
│   ├── 5.tif
│   ├── 6.tif
│   ├── 8.tif
│   └── 9.tif
└── scenes
    ├── 1.jpg
    ├── 2.jpg
    ├── 3.jpg
    ├── 4.jpg
    ├── 5.jpg
    ├── 6.jpg
    ├── 8.jpg
    └── 9.jpg
The data loader is instantiated as follows, where saveto is the dataset folder location (e.g. benchmark/ above):
from fastai.vision.all import *  # SegmentationDataLoaders, Resize, Normalize, imagenet_stats

size = 224
imgs_saveto = saveto/'scenes'
dls = SegmentationDataLoaders.from_label_func(
    imgs_saveto,
    bs=6,
    fnames=[
        name
        for name in imgs_saveto.iterdir()
        if not name.is_dir()
    ],
    # map each scene path to its mask path
    label_func=lambda x: str(x).replace('scenes', 'masks').replace('jpg', 'tif'),
    item_tfms=[Resize((size, size))],
    batch_tfms=[
        Normalize.from_stats(*imagenet_stats)
    ],
    valid_pct=0.0
)
The valid_pct argument is 0.0 in the case of the traditionally created training examples because we want to utilize all 10 images, since there are so few. Each one is precious…
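Before training, a quick visual check that images and masks line up can save a lot of debugging; fastai's show_batch works on segmentation data loaders. This is an optional sanity check, not part of our original pipeline.

# Display a few scenes with their masks overlaid, drawn from the training set.
dls.show_batch(max_n=4, figsize=(8, 8))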
Training
We present a side-by-side comparison of both models by plotting the predicted masks overlaid onto the input images from the benchmark dataset. For the model trained on synthetic data, we ran the following training procedure:
# WandbCallback comes from fastai.callback.wandb and requires wandb.init() beforehand.
learn = unet_learner(
    dls,
    resnet34,
    cbs=WandbCallback(),
    n_out=2,
    path=Path('.')
)
learn.fit_one_cycle(10, slice(1e-2, 1e-3))
learn.save('model1-a-0')
learn.fit_one_cycle(10, slice(1e-2, 1e-3))
learn.fit_one_cycle(10, slice(1e-5, 1e-7))
learn.fit_one_cycle(10, slice(1e-6, 1e-7))
learn.save('model1-a-1')
Each epoch takes longer on the synthetic dataset than on the traditionally labeled dataset (2,000 images versus 10). The training strategies we used are just meant to be a reasonable starting point given the data. For the traditionally labeled dataset, we ran the following procedure:
learn = unet_learner(
    dls,
    resnet34,
    cbs=WandbCallback(),
    n_out=2,
    path=Path('.')
)
learn.fit_one_cycle(20, slice(1e-2, 1e-3))
learn.fit_one_cycle(20, slice(1e-2, 1e-3))
learn.fit_one_cycle(20, slice(1e-2, 1e-3))
learn.fit_one_cycle(20, slice(1e-2, 1e-3))
learn.fit_one_cycle(20, slice(1e-2, 1e-3))
learn.fit_one_cycle(20, slice(1e-5, 1e-7))
learn.fit_one_cycle(20, slice(1e-6, 1e-7))
learn.save('model2')
Qualitative Evaluation
In the following image, we overlay the predicted masks on the resized inputs for each of the models trained with the strategies shown above. On the left are predictions from the model trained only on synthetic data; on the right, predictions from the model trained on our manually labeled dataset.
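For reference, overlays of this kind can be produced by running each benchmark scene through a trained learner and blending the predicted mask with the resized input. The sketch below uses matplotlib; the benchmark path, alpha value, and figure layout are illustrative choices, not our exact plotting code.

import matplotlib.pyplot as plt
from pathlib import Path
from fastai.vision.all import *

benchmark_scenes = Path('benchmark/scenes')  # assumed benchmark location

def show_overlay(learn, scene_fp, ax):
    """Predict a mask for one scene and draw it over the (resized) input."""
    img = PILImage.create(scene_fp)
    pred_mask, _, _ = learn.predict(img)         # decoded class-index mask
    resized = img.resize(pred_mask.shape[::-1])  # match the model's input size
    ax.imshow(resized)
    ax.imshow(pred_mask, alpha=0.4, cmap='viridis')
    ax.set_title(scene_fp.name)
    ax.axis('off')

fps = sorted(benchmark_scenes.iterdir())
fig, axes = plt.subplots(1, len(fps), figsize=(4 * len(fps), 4))
for fp, ax in zip(fps, axes):
    show_overlay(learn, fp, ax)
plt.show()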

Conclusions
The objective of this exercise was to decide where to invest our time next: should we spend more resources labeling data, or further develop the code to make the synthetic dataset more realistic? We were most impressed by the effectiveness of the 10-image traditionally labeled training dataset. The path forward is therefore to add data augmentation to get the most out of each training sample in the traditional dataset. While promising, the up-front investment required to make the synthetic generation code more realistic is high, so we decided to revisit it later. The approach with the most merit is a combination of the traditional and synthetic labeling approaches, which would let us generate a large number of reasonably realistic labeled images without a huge investment in labeling or virtual-reality development. A sketch of the kind of augmentation we have in mind follows.
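As a sketch of the augmentation step we plan to add, fastai's aug_transforms can be dropped into the existing data loader as batch transforms, so each of the 10 manually labeled samples is seen under many random variations. The specific transforms and parameters below are illustrative choices, not a tuned configuration.

from fastai.vision.all import *

# Same data loader as before, with standard fastai augmentations added.
dls_aug = SegmentationDataLoaders.from_label_func(
    imgs_saveto,
    bs=6,
    fnames=[n for n in imgs_saveto.iterdir() if not n.is_dir()],
    label_func=lambda x: str(x).replace('scenes', 'masks').replace('jpg', 'tif'),
    item_tfms=[Resize((size, size))],
    batch_tfms=[
        # random flips, small rotations, zoom, and lighting changes
        *aug_transforms(flip_vert=False, max_rotate=10.0, max_zoom=1.2, max_lighting=0.3),
        Normalize.from_stats(*imagenet_stats),
    ],
    valid_pct=0.0,
)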