
Notification: You have a new job request!
As a data scientist, you will definitely need to train models to meet your organization’s needs. Most of the time, you require labelled data from within your company in order to build a customized solution.
One day, a product manager approaches you and asks you to build a named entity recognition (NER) model to improve the quality of a downstream data science product. Because the product must launch on a short timeline, he has hired a team of human labellers to assist you. What should you do next?
"Give us the tools, and we will finish the job"
by Winston Churchill
You need a labelling tool
The task objective is clear, manpower is ample, and a deadline is set. What’s next? A useful labelling tool. Please note the emphasis on "useful". But what is it that I mean by "useful"? The tool should at least give you the ability to:
- Track the labelling quality, on-premises or in the cloud, at any time. The annotation staff can be informed if they are not following the instructions before the entire labelling project is completed.
- Keep track of the progress of every annotator. The deadline is given to you, so it is imperative that every annotator be given a specific deadline for finishing their labelling. As a project leader, you should be able to track the progress to ensure everyone stays on track.
- Allow multiple annotators to work together. Bias is one of the most problematic topics in machine learning. As a data scientist, you do not want to spend several days or weeks training a model, only to find out that annotation bias has skewed it. One possible solution is to have multiple annotators label the same task/article and approve only the labels that are unanimously accepted.
- Post-process your labelling result with the least amount of effort. One particular aspect you will appreciate is the ease with which you can handle the labelled data. In my experience, the JSON format is the easiest to work with when I am working on labelling projects for clients. It would be great if the tool could support multiple types of formats for importing and exporting data.
- Provide a user-friendly interface to your human annotators. For me, this is the main criterion for a good data labelling tool. Not only was I responsible for preparing the client documentation, but I also had to train the human annotators to use the labelling tool with as little human error as possible. An easy-to-use labelling tool therefore saves you a great deal of hassle.
- Code your labelling interface to suit your needs. I’m sure, as a programmer, you’d like to control your tool with code. Customizing the interface through code frees up time for other important tasks.
These are the features I would at least like my labelling tool to have. A tool like this could cost you thousands of dollars and countless hours of labour to develop. What is the best way to find such an off-the-shelf tool … for free?
Label Studio
Before I begin, let me disclaim the following: I do not work for Label Studio and I am not affiliated with Label Studio. It is simply my personal experience from working on a client’s project.
GitHub – heartexlabs/label-studio: Label Studio is a multi-type data labeling and annotation tool…
I ran into this tool while developing a data labelling system on a client’s behalf. Given that it is free and open source, I was surprised by its flexibility and functionality. In a few simple steps, I will show you how to construct a simple NER labelling interface.
Case study: NER Labelling System

You can easily set up Label Studio by following the instructions in the GitHub repository. It supports both installation on a local machine and deployment in the cloud. For my client, I built and deployed the tool using Docker. This lets him store the data locally, since data confidentiality is the main requirement for his business. To demonstrate the tool, I am going to use Heroku Buttons. You may deploy it in whatever manner suits your business.
Setting Up Your First Project

You will see the project page after you log in. The process of creating a project is illustrated in the gif above. I call this project NER.
Our next step is to upload our own data to prepare for labelling. For your convenience, I have created a script to help you get started. As a starting point, I used two articles from Reuters.com. Run the script above in Colab, and you should be able to view a list of tasks in JSON format like this:
[
  {
    "data": {
      "text": "TOKYO, Oct 20 (Reuters) - A volcano ..."
    }
  },
  {
    "data": {
      "text": "HONG KONG, Oct 20 (Reuters) - Bitcoin ..."
    }
  }
]
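As a rough sketch (not the original script), a task file in this shape could be generated with a few lines of Python. The helper name build_tasks and the placeholder article strings are assumptions for illustration; the "text" key is chosen to match the $text variable used by the NER template.

```python
import json
from pathlib import Path


def build_tasks(texts):
    """Wrap each non-empty article text in the {"data": {"text": ...}} task shape."""
    return [{"data": {"text": text}} for text in texts if text.strip()]


# Placeholder snippets (hypothetical); replace them with your own documents.
articles = [
    "TOKYO, Oct 20 (Reuters) - A volcano ...",
    "HONG KONG, Oct 20 (Reuters) - Bitcoin ...",
]

tasks = build_tasks(articles)

# Write the file that is later uploaded on the import page.
output_path = Path("label_studio_input.json")
output_path.write_text(
    json.dumps(tasks, indent=2, ensure_ascii=False), encoding="utf-8"
)
print(f"Wrote {len(tasks)} tasks to {output_path}")
```

Keeping one article per task means each annotator sees a single document at a time in the labelling view.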
You can import data by uploading the label_studio_input.json file. After that, you may select from a large variety of templates to customize your labelling interface in the last tab. The default NER template can be used to start off our first NER project. Here are the steps to show you the first project page.

Start labelling
Label Studio is by far the simplest labelling tool I have ever used, largely thanks to its user interface. As you can see in the above video, you simply select the label button and highlight the word associated with it. If you want to work more quickly, you can activate the label button with a hotkey. For example, pressing the "3" key activates the label button for the "LOC" label. Once you have finished, confirm your labelling by clicking "Submit" on the right panel. If you need to correct a submitted annotation, simply edit it and update it.
You can see how intuitive the entire labelling process is. What about exporting your result for further processing?
Exporting your result

Once you have completed labelling, export the result in one of the available formats so you can post-process it. In this demo, I will export it in JSON format and clean the result into a pandas DataFrame.
Here is a script that shows you a very simple step to clean the Label Studio results:
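A minimal sketch of such a cleaning step is shown below. It assumes the JSON export is a list of tasks, each with an "annotations" list (some older exports use "completions") whose "result" items carry a "value" holding start, end, text and labels. The function name flatten_export and the sample data are hypothetical, so check the keys against your own export before relying on it.

```python
def flatten_export(tasks):
    """Turn a Label Studio-style JSON export into one row per labelled span."""
    rows = []
    for task in tasks:
        text = task.get("data", {}).get("text", "")
        # Assumption: annotations live under "annotations"; older exports
        # may use "completions" instead, so both keys are checked.
        annotations = task.get("annotations") or task.get("completions") or []
        for annotation in annotations:
            for item in annotation.get("result", []):
                value = item.get("value", {})
                start, end = value.get("start"), value.get("end")
                entity = value.get("text")
                if entity is None and start is not None and end is not None:
                    # Fall back to slicing the article text by character offsets.
                    entity = text[start:end]
                for label in value.get("labels", []):
                    rows.append({
                        "task_id": task.get("id"),
                        "annotation_id": annotation.get("id"),
                        "start": start,
                        "end": end,
                        "entity": entity,
                        "label": label,
                    })
    return rows


# Tiny hand-written sample in the assumed export shape (not real tool output).
sample_export = [{
    "id": 1,
    "data": {"text": "TOKYO, Oct 20 (Reuters) - A volcano ..."},
    "annotations": [{
        "id": 10,
        "result": [
            {"value": {"start": 0, "end": 5, "text": "TOKYO", "labels": ["LOC"]}},
            {"value": {"start": 15, "end": 22, "labels": ["ORG"]}},
        ],
    }],
}]

rows = flatten_export(sample_export)
for row in rows:
    print(row)
```

If pandas is installed, passing the rows to pandas.DataFrame(rows) gives a DataFrame with one labelled span per row.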
Is that all it offers?
Certainly not! Earlier in the article, we discussed the defining characteristics of a good data labelling tool. Let’s go over some of the cool features you can customize for different business scenarios.
Additional tasks for labelling and more labels per article

# Code for the default NER template
<View>
  <Labels name="label" toName="text">
    <Label value="PER" background="red"/>
    <Label value="ORG" background="darkorange"/>
    <Label value="LOC" background="orange"/>
    <Label value="MISC" background="green"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>
In the case study above, I only showed you how to start an NER project using the freely available template. Label Studio also gives you access to a code editor that you can use to customize your labelling templates. The configuration is written in XML, so you can add custom label configurations using tags. The following two examples show how to customize the labelling setup further for your dataset.
1 Modify label colour and add more labels

NER’s default template assigns ORG and LOC labels very similar colours. It can cause confusion when annotations are checked after labelling. By changing the colour of the LOC label to blue, we can avoid this human error. To make the task more complete, we will also add more labels.
# Modify the LOC background color to blue
<View>
  <Labels name="label" toName="text">
    <Label value="PER" background="red"/>
    <Label value="ORG" background="darkorange"/>
    <Label value="LOC" background="blue"/>
    <Label value="MISC" background="green"/>
    <Label value="BRAND" background="yellow"/>
    <Label value="TIME" background="purple"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>
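If the label set changes often, the same configuration can also be generated rather than typed by hand. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name build_ner_config and the label_colours mapping are assumptions for illustration, and the resulting string would still need to be pasted into the code editor.

```python
import xml.etree.ElementTree as ET


def build_ner_config(label_colours, text_field="text"):
    """Build a <View> config with one <Label> per entry in label_colours."""
    view = ET.Element("View")
    labels = ET.SubElement(view, "Labels", {"name": "label", "toName": "text"})
    for value, colour in label_colours.items():
        ET.SubElement(labels, "Label", {"value": value, "background": colour})
    # "$text" tells the template to read the "text" field of each task.
    ET.SubElement(view, "Text", {"name": "text", "value": f"${text_field}"})
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(view, space="  ")
    return ET.tostring(view, encoding="unicode")


# Same labels and colours as the configuration above.
label_colours = {
    "PER": "red",
    "ORG": "darkorange",
    "LOC": "blue",
    "MISC": "green",
    "BRAND": "yellow",
    "TIME": "purple",
}

config_xml = build_ner_config(label_colours)
print(config_xml)
```

Keeping the labels in one mapping makes it easy to spot colours that are too similar before the annotators start working.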

2 New labelling task
When we set up the project, we could only choose one template. Even within a single project, however, you may need labels for more than one task. In one of my client’s projects, I was required to design a multi-label classification model that also performed NER. Adding a new labelling job to our existing tasks is simply a matter of adding a few lines to the configuration.
<View>
  <Labels name="label" toName="text">
    <Label value="PER" background="red"/>
    <Label value="ORG" background="darkorange"/>
    <Label value="LOC" background="blue"/>
    <Label value="MISC" background="green"/>
    <Label value="BRAND" background="yellow"/>
    <Label value="TIME" background="purple"/>
  </Labels>
  <Text name="text" value="$text"/>
  <Taxonomy name="article_class" toName="text">
    <Choice value="world">
      <Choice value="africa"/>
      <Choice value="america"/>
    </Choice>
    <Choice value="business">
      <Choice value="environment"/>
      <Choice value="finance"/>
    </Choice>
  </Taxonomy>
</View>

We can see that adding a new tag called Taxonomy generates a multi-label classification job right away for the existing task.
Now that you know how easy Label Studio is to use, you can try it on your next labelling project. In closing, I’d like to point out one more important feature.
Collaboration is indispensable
There is no doubt that labelling jobs are labour-intensive, which means that we need to spread the tasks among multiple annotators to complete them as quickly as possible. To reduce human bias, we should assign more than one annotator to each labelling task. The good news is that Label Studio allows annotators to sign up and work on the same task if needed.
To start annotating the same articles as the existing annotator, all they need to do is create a new tab under their account:

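Once several annotators have labelled the same article, the idea of approving only unanimously accepted labels can be sketched in a few lines. This is an illustration under stated assumptions, not a Label Studio feature: each annotator's work is represented as a list of (start, end, label) tuples, and the function name unanimous_spans and the annotator names are hypothetical.

```python
def unanimous_spans(spans_by_annotator):
    """Return the (start, end, label) spans that every annotator agreed on."""
    if not spans_by_annotator:
        return []
    span_sets = [set(spans) for spans in spans_by_annotator.values()]
    # A span survives only if it appears in every annotator's set.
    agreed = set.intersection(*span_sets)
    return sorted(agreed)


# Hypothetical annotations of the same article by two annotators.
spans_by_annotator = {
    "annotator_a": [(0, 5, "LOC"), (15, 22, "ORG")],
    "annotator_b": [(0, 5, "LOC"), (15, 22, "MISC")],
}

approved = unanimous_spans(spans_by_annotator)
print(approved)  # [(0, 5, 'LOC')]
```

Spans where the annotators disagree, such as the second one here, can then be sent back for review instead of going straight into the training set.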
Takeaways
Our goal in this article was to show you how to set up a labelling tool to support your next labelling project. In addition to its array of free features, Label Studio also allows you to customize your labelling interface with just a few lines of code. The platform supports not only natural language processing tasks but also computer vision and speech recognition tasks. The team edition would be ideal if you intend to use this tool for a long time.
Label Studio really did a great job of making this labelling tool as easy to use as possible. No matter if you are a startup or a small business, labelling can sometimes be a problem when you are developing a new machine learning project. Costs and time can add up when designing a new labelling tool. In this regard, Label Studio is a viable option.
About the Author
Woen Yon is a Data Scientist based in Singapore. His experience includes developing advanced Artificial Intelligence products for several multinational enterprises.
Woen Yon works with a handful of smart people to offer web solutions, including web crawling services and website development, for local and international start-up business owners. They are well aware of the challenges of building quality software. Please do not hesitate to drop him an email if you need assistance.
He loves making friends! Feel free to connect with him on LinkedIn and Medium.