
How to utilise a manual labelling workforce

From interview through task definition to evaluation

Photo by Littlehampton Bricks from Pexels

Why manual labelling is needed

To train machine learning models we first need datasets to train them on. In many cases we can rely on public datasets (using dataset search indexes like Google's or Kaggle's to find the data we need). But some cases require generating the dataset on our own (due to the data scale, privacy concerns, or simply because the use case is niche), and if the ML task is supervised then we also need to take care of labelling the dataset. Commonly the labelling starts in house, and once the scale needed becomes too high (probably when moving from POC to MVP), the manual labelling process is outsourced (to a labelling team or a dedicated contractor). A common mistake is to assume that what worked in house will seamlessly work externally. There are many pitfalls to avoid in order to make sure the labelling process output is what we wished for, from interviewing the candidates, to formalising the work, to evaluating its quality. A walk through what is needed is ahead.

Starting to collect the data

Let’s assume our task is to find orange cats in images. The first step would be to search for relevant existing datasets, but since our use case is quite niche (specifically orange cats) we decide to generate the required dataset on our own. We start by crawling relevant images from the web (ones with orange cats, ones with non-orange cats, and general images). A common initial simplification is to rely on existing open source image segmentation libraries (like Facebook’s Detectron or Google’s DeepLab) to generate initial labelling for the images. Using them we can control the dataset’s topic distribution (to enforce stratification). But as we look specifically for orange cats (and the common output of such open source libraries would be just ‘cat’), we decide to hire a labelling workforce to assist us with that need. How to do so?
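As a minimal sketch of this pre-labelling step, the snippet below uses a COCO-pretrained Mask R-CNN from torchvision as one possible stand-in for the segmentation libraries mentioned above (Detectron, DeepLab); the directory name, score threshold and output format are illustrative assumptions, not a prescribed pipeline.

```python
# Pre-labelling sketch: tag crawled images that likely contain a cat using a
# COCO-pretrained Mask R-CNN (one possible stand-in for Detectron / DeepLab).
# Paths and thresholds are illustrative assumptions.
from pathlib import Path

import torch
from PIL import Image
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

COCO_CAT_ID = 17        # 'cat' label id in torchvision's COCO-pretrained detectors
SCORE_THRESHOLD = 0.7   # assumed confidence cut-off

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()


def pre_label(image_path: Path) -> dict:
    """Return a coarse 'cat' / 'no cat' pre-label plus detection context."""
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        prediction = model([image])[0]
    cat_boxes = [
        box.tolist()
        for box, label, score in zip(
            prediction["boxes"], prediction["labels"], prediction["scores"]
        )
        if label.item() == COCO_CAT_ID and score.item() >= SCORE_THRESHOLD
    ]
    return {
        "image": str(image_path),
        "pre_label": "cat" if cat_boxes else "no cat",  # colour still unknown
        "cat_boxes": cat_boxes,  # position context for the manual labellers
    }


# Pre-label every crawled image before handing the work to the labellers
crawled = [pre_label(p) for p in Path("crawled_images").glob("*.jpg")]
```

The ‘orange’ part still has to come from the manual labellers; the pre-label only narrows the pool and provides position context for later validation.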

Orange cat with a few competitors. Photo by cottonbro from Pexels

Interviewing

The first step is to find labellers who will best fit our needs. Surprisingly, high grades and fancy diplomas are not always correlated with a fit for the work. Moreover, a few of our best manual labellers were undergraduates. While at first glance it may seem counterintuitive, a possible explanation is that labelling requires special characteristics like attention, self discipline, out of the box thinking and the willingness to go the extra mile needed for success (not to mention that not all labelling tasks are of the same complexity level, and therefore each requires a different skill level). Such characteristics are likely to have some correlation with high grades, but the two don't necessarily go together. This is why the interview process should be tuned to better target these qualities; less grades filtering, more hands-on evaluation. Being smart doesn’t directly indicate one would be good at labelling cats. Try to make sure the questions you ask are targeted at measuring relevance to the labelling work. A small step in that direction could be a home task that resembles the manual labelling work. It provides a fair and immediate indication of the candidate’s fitness for the job. A good interview should measure the required KPIs (like willingness to do manual, repetitive work), to make sure the candidates you accept are able to fulfil the task at hand.

Introducing the need

Manual labelling is in many cases a one-off project, which leads to the use of out-of-scope employees, often contractors. This is why the tendency is to reveal as little as possible and to lower the context you share, assuming contractors may work tomorrow for your competitors. But this is a mistake, as such an attitude may lower the labellers’ engagement. The required task can be well defined without revealing any internal top secrets. Explain what is needed and why. Give your manual labellers the feeling that their work is important and appreciated. Make them your partners. One who understands the why besides the what is more likely to pay more attention and, in general, to produce better output.

Formalising the task

In most cases the first one to label the data will be ourselves. During that process we probably come across special and anomalous samples, and therefore we generate many free-text feedback fields (like concerns or comments) to highlight them. Once we need to introduce the task to the labellers, the tendency is to keep the same feedback field structure. But the issue is that what worked for us can confuse the out-of-scope labellers. Therefore the task definition should be as clear as possible. Closed lists are better than free-text fields. In our example, we could use – [orange cat / not orange cat / not cat / ?]. It is also important to ask the labellers to fill in the context of each label (in our example it could be the position of the identified object). The two main reasons are that providing context requires more attention, and that it enables a quick manual validation path (a strange label plus a problematic context can highlight samples that should be double checked; in our example, a labeller marking ‘orange cat’ in a position where a ‘dog’ should appear according to the open source segmentation we use). The ‘?’ option is important in order to give the labellers a place to admit they don’t know. Consider for example a scenario where an image includes both orange and black cats. What should the label be? Context together with a ‘misc’ label (the ‘?’ sign) enables the labellers to mark samples that should be double checked, in an efficient way. Mistakes are likely to happen. Formalising the label input together with a well defined context can prevent the mistakes from becoming fatal.
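One way this could be formalised in code is sketched below: a closed label list, a per-sample context field, and a simple cross-check against the automatic pre-labelling. The enum values mirror the list above; the field names and the double-check rule are assumptions for illustration.

```python
# A formalised labelling record: a closed label list instead of free text,
# plus a context field (object position) that enables quick cross-checks
# against the automatic pre-labelling. Names are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class CatLabel(Enum):
    ORANGE_CAT = "orange cat"
    NOT_ORANGE_CAT = "not orange cat"
    NOT_CAT = "not cat"
    UNSURE = "?"  # the escape hatch: lets labellers admit they don't know


@dataclass
class LabelledSample:
    image_id: str
    label: CatLabel
    box: Optional[Tuple[float, float, float, float]] = None  # x1, y1, x2, y2 context


def needs_double_check(sample: LabelledSample, pre_label_boxes: list) -> bool:
    """Flag samples where the manual label clashes with the automatic pre-label."""
    if sample.label is CatLabel.UNSURE:
        return True
    if sample.label is CatLabel.ORANGE_CAT and not pre_label_boxes:
        # labeller says 'orange cat' where the segmentation model saw no cat at all
        return True
    return False
```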

Work scope

Try to split the needed work into small batches, especially at the beginning, to let the labellers see they can manage it. From our experience, labellers showed better performance when the work was divided into smaller files versus working on one large merged file. Probably facing a smaller file was less intimidating and therefore felt like a more achievable task. With time, make the files bigger and bigger. It will reflect to the labellers their improvement, an implicit feedback that binds them in. More confidence will make the labellers feel better, work better and eventually even outperform your initial labelling, finding things you weren’t even aware of.
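A tiny sketch of this batching idea, under assumed numbers: start with a small batch and let each subsequent one grow by a fixed factor. The starting size and growth factor are illustrative, not recommendations.

```python
# Split the labelling pool into batches that grow over time: small files first,
# so the task feels manageable, larger ones later. Sizes are illustrative.
def growing_batches(samples, first_batch=50, growth=1.5):
    """Yield successive batches, each up to `growth` times bigger than the last."""
    start, size = 0, first_batch
    while start < len(samples):
        yield samples[start:start + int(size)]
        start += int(size)
        size *= growth


# Example: the first, small batches handed to a new labeller
batches = list(growing_batches(list(range(1000))))
print([len(b) for b in batches[:5]])  # e.g. [50, 75, 112, 168, 253]
```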

Experts gap

In most cases you’ll find that not all the ‘to label’ samples are of the same complexity level; some will be more difficult to label than others. To best utilise your workforce, find a way to direct the easier-to-label samples to the labellers and keep the more difficult ones for your experts. This matters since experts’ availability is usually limited, and since labellers struggling with overly complex samples can become discouraged. A common way to distinguish the more difficult samples from the rest is to use a proxy classifier: train a simple classifier on a subset of the already labelled dataset. This classifier isn’t going to be the final classifier, but a tool to prioritise the samples by complexity by looking at its prediction likelihood (a high likelihood can indicate a simple sample which should be directed to the labellers). It can also be used to verify the labellers’ work (looking at places where its output was very different from what the labellers said), or in an ‘active learning’ way: once in a while, re-train the proxy classifier on the newly labelled data in order to highlight which samples should be prioritised next. Not all samples are equally important, and therefore the labelling force should be directed towards where it is most needed, to the samples that the simple proxy classifier failed to deal with.
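A minimal sketch of the routing step, assuming pre-computed feature vectors and a scikit-learn logistic regression as the proxy; the model choice, feature representation and confidence threshold are all assumptions.

```python
# Proxy-classifier routing: train a simple model on what's already labelled and
# route the remaining samples by its prediction confidence. High-confidence
# samples go to the labellers, low-confidence ones to the experts.
import numpy as np
from sklearn.linear_model import LogisticRegression

CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off between 'easy' and 'hard'


def route_samples(X_labelled, y_labelled, X_unlabelled):
    """Return indices of easy samples (for labellers) and hard ones (for experts)."""
    proxy = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)
    confidence = proxy.predict_proba(X_unlabelled).max(axis=1)
    to_labellers = np.where(confidence >= CONFIDENCE_THRESHOLD)[0]
    to_experts = np.where(confidence < CONFIDENCE_THRESHOLD)[0]
    return to_labellers, to_experts, proxy


# Active-learning flavour: periodically re-fit the proxy on the newly labelled
# data and re-route, so the hardest remaining samples keep surfacing first.
```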

Validating the work

The labelled data should be validated before entering our dataset. A common approach is to give a few labellers the same task and later validate their work by looking into places where the labels differed. But the issue is that it includes an implicit failure point, as the subtext is that you don’t trust your labellers enough. For us, when we tried it, the results were so different that it highlighted the fact we had other issues to solve first. The underlying motivation is to bring the labellers to a point where you can trust their work as if it were you doing it. It doesn’t mean work shouldn’t be verified; everyone makes mistakes. Find a mechanism to validate the labellers’ work, for example by comparing it to a proxy classifier (like the image segmentation tools or the proxy classifier we mentioned earlier). It can be a simple compass that enables you to focus on the more complicated cases which require more manual validation.
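One way such a compass could look is sketched below: surface only the samples where a confident proxy prediction contradicts the manual label. The input layout and the confidence threshold are assumptions; the proxy probabilities could come from the earlier proxy classifier or from the pre-labelling model.

```python
# Lightweight validation pass: compare the labellers' output to the proxy
# classifier and surface confident disagreements for manual review.
import numpy as np


def flag_for_review(manual_labels, proxy_probs, proxy_classes, min_confidence=0.9):
    """Return indices where a confident proxy prediction contradicts the labeller."""
    proxy_labels = proxy_classes[np.argmax(proxy_probs, axis=1)]
    confidence = proxy_probs.max(axis=1)
    disagree = (proxy_labels != np.asarray(manual_labels)) & (confidence >= min_confidence)
    return np.where(disagree)[0]


# Example: with the sklearn proxy from the previous sketch,
# flagged = flag_for_review(manual_labels, proxy.predict_proba(X), proxy.classes_)
```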

Feedback

No one likes doing bad work, especially not on purpose, and especially not contractors who are explicitly rewarded based on their performance. Therefore it is important to highlight the labellers’ mistakes in order to enable them to improve. Mistakes are likely to happen, especially at the beginning when the data domain is new and it’s not yet clear what to do and how. On the other hand, bad labellers are easiest to identify in the initial phases. This is why beginners’ mistakes should be distinguished from unsuitability for the work. A possible indicator is the lack of labelling repeatability; it’s ok to make mistakes, and guidance should be given on how to avoid them. But if the same mistake happens over and over again, it’s troubling. And if the mistakes seem random, with the same context receiving different labels, then it should ring a bell. A slow-learning but willing labeller is far better than a fast-learning one who doesn’t pay enough attention to the task.
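A small sketch of how repeatability could be measured, assuming a handful of control samples are quietly repeated across each labeller's batches; the column names and data layout are assumptions.

```python
# Repeatability check: re-insert control samples into each labeller's batches
# and measure how consistently they are labelled over time.
import pandas as pd


def repeatability(control_labels: pd.DataFrame) -> pd.Series:
    """Share of repeated control samples each labeller labelled consistently.

    Expects columns: 'labeller', 'sample_id', 'label' (an assumed layout).
    """
    def consistent(group: pd.DataFrame) -> float:
        # one distinct label per (labeller, sample) pair means the labeller is consistent
        return (group.groupby("sample_id")["label"].nunique() == 1).mean()

    return control_labels.groupby("labeller").apply(consistent)
```

A low score on its own is a prompt for guidance; a score that stays low after feedback is the troubling signal described above.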

Final notes

Don’t underestimate your labellers. Bind them to the cause and make them your labelling task partners. Not all tasks are the same; some will require experts and some are simpler. But all need proper attention. Once guided, a labelling workforce can be the special sauce needed for your model’s success.

