A lot has been said about the importance of clean data with accurate and consistent labels. The entire data-centric paradigm relies heavily on making data labels more consistent to improve model performance. So why isn’t the data science community quickly adopting this approach? There might be many reasons, but a recurring claim is that labeling is too tedious a task, one that’s hard to iterate over, manage, and scale.
DagsHub integrates Label Studio
Labeling is such an important task, but it’s more complex than it should be. Knowing that, we decided that it’s a barrier to entry DagsHub should help remove. We had many strong candidates for this integration, but the one that stood out was Label Studio – a powerful open-source tool that supports the labeling of many unstructured and structured data types with a strong and active community.
Supported data types:
- Computer Vision – images and video
- Audio & Speech Applications
- NLP, Documents, Chatbots, Transcripts
- Time series
- Structured data – tabular, HTML, freeform
- Multi-Domain Applications
Every repository on DagsHub comes with a fully configured Label Studio workspace. This workspace lets you annotate your data, with access to all the project’s files. It fetches data directly from DagsHub Storage, so you no longer need to move, copy, or pull it to a third-party platform. This removes a significant burden associated with labeling: managing and synchronizing data and labels.
Git Flow for Data Labeling
A labeling workflow is equivalent to developing a new feature. It should be done in an isolated environment with the ability to compare, analyze, and merge changes or roll them back and restore previous versions. Because labeling is usually outsourced, these capabilities become even more vital to ensure its success.
Based on those needs and the challenges labelers face, DagsHub added a few toppings to Label Studio and created its unique flavor of the open-source version. It provides a Git experience, following the industry’s best practices, to ensure full reproducibility, scalability, and efficient version control of the labels and data.
The workflow for DagsHub and Label Studio
When creating a new labeling project on DagsHub, you associate it with the tip of an active branch. This commit marks the project’s starting point and makes all the files hosted on DagsHub Storage under it available for labeling. Once you reach a valuable result, you can version and commit the annotations using Git, directly to a remote branch. Once the task is complete, you can create a pull request on DagsHub, where a reviewer can see and comment on every annotation.
How to version a Label Studio project with DagsHub?
To version control any artifact, it needs to have a single source of truth. To provide this source of truth for annotations, we created the .labelstudio directory, which holds annotations for every task in open source formats. When creating a new labeling project, DagsHub parses the selected commit for this directory and loads the existing annotations to their associated tasks. This way, we can roll back to previous versions with a click of a button.
Get started with Label Studio and DagsHub
In this section, I’ll guide you, step-by-step, on how to use Label Studio and DagsHub Annotations while following the recommended Git Flow. The main goal is to help you gain hands-on experience while having the benefit of following my lead. For that, I’ll use my "Where’s Elon" project, where I annotate Elon Musk’s images. I’m assuming you already have a project on DagsHub, with versioned data ready to be annotated.
Step 1: Create a Label Studio workspace
Navigate to the Annotations tab in your DagsHub repository and create a new workspace. This process can take 2–3 minutes as DagsHub spins up the Label Studio machine behind the scenes.
Step 2: Create a Label Studio project
In the new Annotation Project menu, choose the tip of a remote branch to associate the project with. It marks the project’s starting point and will make all the files hosted on DagsHub Storage, under the selected commit, available for labeling. To work in an isolated environment, we will create a new branch for the labeling project. The default project name is based on the annotator who created it; however, you can change it as you wish.
Step 3: Choose the files to annotate
When launching the project for the first time, you’ll need to choose the files to annotate (AKA tasks). You can choose a specific file or an entire directory by checking the box next to its name.
Note: you can annotate files hosted on both Git and DVC remotes. As a rule of thumb: "if you can see the file – you can annotate it."
Step 4: Configure Label Studio
You can configure Label Studio’s labeling interface using one of its many great templates. If you need a custom template, you can create it using Label Studio’s basic HTML-like tags.
Note: If you choose to work with a template, you’ll need to set the project’s labels manually.
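As a rough sketch of what a custom template can look like, the snippet below keeps a small bounding-box configuration in a Python string and lists the labels it declares using only the standard library. The tag names (View, Image, RectangleLabels, Label) are Label Studio's own configuration tags; the label values "Elon" and "Not Elon" and the helper name list_labels are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical labeling configuration for drawing bounding boxes on images.
# "$image" tells Label Studio to read the image URL from each task's data.
LABEL_CONFIG = """
<View>
  <Image name="image" value="$image"/>
  <RectangleLabels name="label" toName="image">
    <Label value="Elon"/>
    <Label value="Not Elon"/>
  </RectangleLabels>
</View>
"""


def list_labels(config: str) -> list[str]:
    """Return the label values declared in a labeling configuration."""
    root = ET.fromstring(config.strip())
    return [label.get("value") for label in root.iter("Label")]


labels = list_labels(LABEL_CONFIG)
print(labels)
```

Checking the declared labels like this before pasting the configuration into the labeling interface is one cheap way to catch a typo in a label name early.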
Step 5: Annotate the data
As simple as that, you can start annotating your data. No need to move the data to a different platform, change its structure or synchronize anything. You can start working on the tasks and save the annotations to DagsHub’s database.
Step 6: Commit changes to Git
At any point in time, you can version the state of the project using Git, and commit the changes back to the branch you chose in step 2 or create a new branch and commit to it. The commit will include the special .labelstudio directory. You can add an annotations file in one of the commonly used formats (JSON, COCO, CSV, TSV, etc.) to the commit.
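Once an annotations file is committed, it can be inspected like any other versioned artifact. The sketch below counts labels in a JSON file, assuming it follows Label Studio's standard JSON export layout (a list of tasks, each with annotations whose result regions store their labels under a key named after the region type). The sample data and the helper name count_labels are hypothetical, and files inside the .labelstudio directory may be laid out differently.

```python
import json
from collections import Counter

# Hypothetical sample in the shape of a Label Studio JSON export.
SAMPLE_EXPORT = json.dumps([
    {
        "id": 1,
        "data": {"image": "images/001.jpg"},
        "annotations": [{"result": [
            {"type": "rectanglelabels",
             "value": {"x": 10, "y": 12, "width": 30, "height": 40,
                       "rectanglelabels": ["Elon"]}},
        ]}],
    },
    {
        "id": 2,
        "data": {"image": "images/002.jpg"},
        "annotations": [{"result": [
            {"type": "rectanglelabels",
             "value": {"x": 5, "y": 8, "width": 20, "height": 25,
                       "rectanglelabels": ["Elon"]}},
            {"type": "rectanglelabels",
             "value": {"x": 50, "y": 8, "width": 20, "height": 25,
                       "rectanglelabels": ["Not Elon"]}},
        ]}],
    },
])


def count_labels(export_json: str) -> Counter:
    """Count how often each label appears across all annotated regions."""
    counts = Counter()
    for task in json.loads(export_json):
        for annotation in task.get("annotations", []):
            for region in annotation.get("result", []):
                value = region.get("value", {})
                # Assumption: labels live under a key named after the region type.
                counts.update(value.get(region.get("type"), []))
    return counts


label_counts = count_labels(SAMPLE_EXPORT)
print(label_counts)
```

A heavily skewed count is often the first hint that the labeling guidelines need another look before the next iteration.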
Utilizing Git’s capabilities, you can now seamlessly iterate over steps 5 and 6, compare the different versions, merge the results, or roll back the changes.
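Git handles the versioning itself, but the idea of comparing two annotation versions can be sketched in plain Python. The example below assumes each version has already been reduced to a mapping from file path to its list of labels. That mapping, the sample data, and the helper name diff_annotations are illustrative and not part of DagsHub or Label Studio.

```python
def diff_annotations(
    old: dict[str, list[str]], new: dict[str, list[str]]
) -> dict[str, list[str]]:
    """Report which files gained, lost, or changed labels between two versions."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(
        path
        for path in old.keys() & new.keys()
        # Label order is ignored; only the set of labels per file matters here.
        if sorted(old[path]) != sorted(new[path])
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots of two commits on the labeling branch.
previous = {
    "images/001.jpg": ["Elon"],
    "images/002.jpg": ["Not Elon"],
    "images/003.jpg": ["Elon"],
}
current = {
    "images/001.jpg": ["Elon"],
    "images/002.jpg": ["Elon"],
    "images/004.jpg": ["Elon"],
}

report = diff_annotations(previous, current)
print(report)
```

A summary like this can help a reviewer focus on the files whose labels actually moved between iterations.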
Step 7: Create a pull request
When you’re satisfied with the labels, meaning they’re accurate and consistent, you can merge them to the main branch. With DagsHub, communicating over the labels is part of the pull request, without moving to a third-party platform. The reviewer can leave comments on each label, and the entire process is logged and easy to manage. Once the task is complete, merging it to the project’s main branch is one click of a button away.
Summary
Labeling unstructured data comes with various challenges, many of which are a by-product of the workflow and unrelated to the annotation task itself. DagsHub Annotations and the Label Studio integration are designed to help overcome those challenges and create a smooth labeling workflow. Together, they do the DevOps heavy lifting and provide you with the tools you need to manage and scale the labeling process.
If you have any questions, ideas, or thoughts about the integration – we’d love to hear them on our Discord channel! We can’t wait to see your amazing projects become even greater with accurate and consistent labels.