I work a lot on time series anomaly detection for industrial use cases, and most of the time I rely on unsupervised approaches. Yet, semi-supervised approaches can add valuable incremental performance. In other situations, you might also want to confirm unsupervised model outputs, and having a labeling tool that integrates easily into your workflow becomes a must.
This is where Label Studio comes in!
Some time ago, a colleague of mine (Sofian, who you can follow here) wrote the following article to explain how to deploy Label Studio on Amazon SageMaker:
I’ve been toying with this open source package to label time series data in the past: I thought this was the perfect time to show how I integrate this labeling tool into my machine learning workflow.
In this article, I will show you the notebook I run to automatically deploy a Label Studio instance in my SageMaker environment. I will then show how I automatically configure my annotation environment to match the structure of the time series data I would like to annotate.
I encourage you to follow along this blog post by heading over to GitHub to grab the series of companion Jupyter notebooks. As the objective is to deploy Label Studio on Amazon SageMaker, you will need an AWS account. Then you can create a SageMaker notebook instance (use a t3.medium type to benefit from the free tier if you have a new account). From there, clone this GitHub repository:
git clone https://github.com/aws-samples/amazon-lookout-for-equipment.git
Navigate into the `apps/annotation-label-studio/` folder and open the `1-initialize-label-studio.ipynb` notebook.
Before we jump in the step by step process to configure your own environment from scratch, let’s have an overview of what we are going to assemble…
Technology overview
In this article, you are going to deploy a Docker image of Label Studio in a SageMaker notebook instance. You will then connect Label Studio to the Amazon S3 bucket where your time series data will be stored.
Label Studio is a flexible data annotation tool that can be used to label every data type: text, images, tabular data or time series data. In this article, we are going to programmatically configure a custom user interface to label time series data for anomaly detection purpose.
Amazon SageMaker is a managed machine learning service that helps you build, train, and deploy machine learning models for any use case with fully managed infrastructure, tools, and workflows. In this article, we are going to use the managed JupyterLab experience offered by SageMaker Notebook Instances.
Installing Label Studio
First, we are going to download the Label Studio docker image and deploy it in our notebook environment. To do this, we need to configure some parameters:
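Here is a sketch of what this step can look like (the `username`, `password`, and `token` values below are placeholders, and the exact Docker invocation may differ slightly from the companion notebook):

```python
# Placeholder values: customize these before running the notebook
username = 'admin@example.com'   # must follow an email address format
password = 'change-me'
token = '2edfe403f2f326e810b9553f8f5423bf04437341'
port = 8080

# Generate a shell script that runs the Label Studio Docker image;
# the environment variables create an initial user at startup
script = f"""#!/bin/bash
docker run -t --rm \\
    -p {port}:8080 \\
    -v $(pwd)/label-studio-data:/label-studio/data \\
    -e LABEL_STUDIO_USERNAME={username} \\
    -e LABEL_STUDIO_PASSWORD={password} \\
    -e LABEL_STUDIO_USER_TOKEN={token} \\
    heartexlabs/label-studio:latest
"""
with open('label-studio.sh', 'w') as f:
    f.write(script)
```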
The previous piece of code generates a shell script which will run a dockerized version of Label Studio. This instance will be configured with an initial user that you can customize by changing the `username`, `password`, and `token`. Note that the `username` must follow an email address format; otherwise, the user won’t be created when the Label Studio instance is launched. If you don’t create a user at this stage, you will have the opportunity to create one when you sign in to the application.
The `token` can of course be generated randomly. For instance, you could use the following code for this:
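One simple option, sketched here with Python’s standard `secrets` module:

```python
import secrets

# 20 random bytes rendered as 40 hexadecimal characters, matching
# the shape of the Label Studio access token shown below
token = secrets.token_hex(20)
print(token)
```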
This will generate a token that looks like this one: `2edfe403f2f326e810b9553f8f5423bf04437341`.
The `get_notebook_name()` method is used to generate the URL of your Label Studio instance.
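On a SageMaker notebook instance, the instance name can be read from the resource metadata file; a possible implementation (the companion notebook may differ slightly):

```python
import json

def get_notebook_name(metadata_path='/opt/ml/metadata/resource-metadata.json'):
    # SageMaker notebook instances expose their own name in this
    # metadata file, under the ResourceName field
    with open(metadata_path) as f:
        return json.load(f)['ResourceName']
```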
Once your shell script is generated, you can run it from a Jupyter cell with `!source ./label-studio.sh`. The first time, it will download the Docker image for Label Studio, then run it with the parameters you defined above. After a few seconds, you should see the following message:
Django version 3.1.14, using settings 'core.settings.label_studio'
Starting development server at http://0.0.0.0:8080/
Quit the server with CONTROL-C.
This means your Label Studio instance is up and running!
Time to go and configure it to suit the time series dataset you want to label…
Preparing an example dataset
If you’re following this article with the companion GitHub repo, you can now open the second Jupyter notebook (`2-configure-label-studio.ipynb`) while leaving the other notebook running. You should see an hourglass icon next to the JupyterLab tab name in your browser. That’s your cue that a process is actually running (in this case, your Label Studio instance).
I put a synthetic time series dataset in the repo: if you’re interested in how this dataset was created, feel free to check out this article:
You can of course use your own time series dataset if you have one ready! One option is to store your dataset locally in your instance and let Label Studio access it from there.
However, if you’re an AWS user, chances are your data is already stored in an Amazon S3 bucket. Label Studio must then access your data from there, which is not authorized by default. To enable this access, you need to enable cross-origin resource sharing (CORS) for your S3 bucket. CORS defines a way for client web applications loaded in one domain (in our case, our Label Studio instance running in a SageMaker notebook instance) to interact with resources in a different domain (your dataset stored in Amazon S3). To do this, you can check out the CORS documentation and use the following JSON document to configure your access. You will need to update the `AllowedOrigins` parameter below:
[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "https://<<notebook_name>>.notebook.<<current_region>>.sagemaker.aws"
        ],
        "ExposeHeaders": [
            "x-amz-server-side-encryption",
            "x-amz-request-id",
            "x-amz-id-2"
        ],
        "MaxAgeSeconds": 3000
    }
]
Time to configure your annotation template to match the data you want to label…
Configuring your Label Studio instance
Let’s assume you now have a dataset ready and loaded into a pandas DataFrame. The next step is to configure an annotation template. Label Studio comes with several existing templates, but yours will depend on how many time series (or channels) your file contains. Building a customized template adapted to your dataset is a two-step process. First, we build a list of channels, one for each field in your multivariate time series dataset:
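A sketch of this first step, with a small synthetic DataFrame standing in for your own dataset:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for your dataset: three signals indexed by timestamp
df = pd.DataFrame(
    np.random.randn(100, 3),
    index=pd.date_range('2021-01-01', periods=100, freq='h'),
    columns=['signal_00', 'signal_01', 'signal_02'],
)

# Matplotlib's default color cycle, reused so each channel gets its own color
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']

# One <Channel> tag per column of the multivariate time series
channel_fields = '\n'.join(
    f'<Channel column="{col}" legend="{col}" '
    f'strokeColor="{colors[i % len(colors)]}" displayFormat=",.1f"/>'
    for i, col in enumerate(df.columns)
)
print(channel_fields)
```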
A given channel will take the following format:
<Channel
    column="signal_00"
    legend="signal_00"
    strokeColor="#1f77b4"
    displayFormat=",.1f"
/>
You can customize the name of the channel and the color that will be used to plot the time series.
Then, you use this `channel_fields` to generate the annotation template:
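A sketch of this second step, assuming `channel_fields` was built as above (stubbed here with a single channel). The label name (`Anomaly`) and the CSV time column name (`timestamp`) are assumptions to adapt to your own data:

```python
# Stand-in for the channel_fields string built in the previous step
channel_fields = '<Channel column="signal_00" legend="signal_00" ' \
                 'strokeColor="#1f77b4" displayFormat=",.1f"/>'

# Wrap the channels in a Label Studio time series labeling template:
# TimeSeriesLabels defines the label(s) annotators can apply, and
# TimeSeries points at the CSV file referenced by each task ($csv)
template = f'''<View>
    <TimeSeriesLabels name="label" toName="ts">
        <Label value="Anomaly" background="red"/>
    </TimeSeriesLabels>
    <TimeSeries name="ts" valueType="url" value="$csv"
                sep="," timeColumn="timestamp">
        {channel_fields}
    </TimeSeries>
</View>'''
print(template)
```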
Our template is ready, we will now:
- Create a new annotation project
- Configure the storage configuration
- Log into our Label Studio instance and create some labels
- Collect the results so that they can be further used in your machine learning pipeline
Creating a new annotation project
We will use the Label Studio API to interact with our Label Studio instance. Creating a project requires you to use the create project API:
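A minimal sketch of this call using the `requests` library (the base URL and project title below are placeholders):

```python
import requests

def create_project(base_url, token, title, label_config):
    # POST /api/projects creates a new Label Studio project owned by
    # the user identified by the API token
    response = requests.post(
        f'{base_url}/api/projects',
        headers={'Authorization': f'Token {token}'},
        json={'title': title, 'label_config': label_config},
    )
    response.raise_for_status()
    return response.json()['id']
```

For example: `project_id = create_project('http://localhost:8080', token, 'Anomaly detection', template)`.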
After running this code, a new project will be created for your user (identified by the `token` variable above) in your Label Studio environment. We can now connect Label Studio to your data storage.
Configure storage to use Amazon S3 as a data source
Using the S3 configuration API from Label Studio you can tell where it can find the time series data to label:
To configure this data location, you need to know the `bucket` and `prefix` where your time series data is located on Amazon S3. You will also need AWS credentials to be passed to Label Studio. These credentials can be collected from the current SageMaker environment with the following piece of code:
Once your project is created, you can sync it: when synchronizing, Label Studio searches the configured data source for valid files (CSV files in our case) and adds them to your project so that you can start your labeling work. Triggering a sync is simple enough and just requires the storage ID returned by the storage configuration API call:
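A sketch of the sync call (note that it targets the storage ID returned when the S3 storage was configured):

```python
import requests

def sync_storage(base_url, token, storage_id):
    # POST /api/storages/s3/<id>/sync scans the bucket and creates
    # one labeling task per valid file found
    response = requests.post(
        f'{base_url}/api/storages/s3/{storage_id}/sync',
        headers={'Authorization': f'Token {token}'},
    )
    response.raise_for_status()
    return response.json()
```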
Our project is now ready and some annotation tasks should have been added after synchronization…
Creating some labels
We will now access our Label Studio instance by opening this link in a new tab of our browser:
https://{notebook_name}.notebook.{region}.sagemaker.aws/proxy/8080/
You will have to replace the placeholders in curly braces in this URL with your own parameters. You should see the login page, where you can use the credentials you configured in the first notebook to log in:

Once logged in, you should see a project already populated and synced (the 0/1 tasks counter under the project title means there’s one outstanding annotation task):

Click on this project tile to bring up the time series to annotate. Each time series dataset will appear as an individual task to label:

Scroll down to the bottom of the time series view on the right and reduce the time period using the overview slider until the time series plot appears. You can then start labeling your data (check out the Label Studio website if you want to know more about the actual labeling process):

Once you have a few labels done, scroll up and click on the `Submit` button. The annotations are saved in Label Studio’s local database (you can also configure a target location on Amazon S3). You can now collect your results!
Collecting the annotations
Use the following API call to get the labels from your previous labeling job and save them in a CSV format ready to be used by an anomaly detection machine learning service such as Amazon Lookout for Equipment:
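A sketch of this last step: the export endpoint returns the annotations as JSON, which can then be flattened into `(start, end, label)` rows and saved as CSV (the output layout below is an assumption to adapt to your downstream service):

```python
import pandas as pd
import requests

def export_annotations(base_url, token, project_id, output_path='labels.csv'):
    # GET /api/projects/<id>/export returns all annotations for the project
    response = requests.get(
        f'{base_url}/api/projects/{project_id}/export',
        headers={'Authorization': f'Token {token}'},
        params={'exportType': 'JSON'},
    )
    response.raise_for_status()

    # Flatten the labeled time ranges into (start, end, label) rows
    rows = []
    for task in response.json():
        for annotation in task['annotations']:
            for result in annotation['result']:
                rows.append({
                    'start': result['value']['start'],
                    'end': result['value']['end'],
                    'label': result['value']['timeserieslabels'][0],
                })
    pd.DataFrame(rows).to_csv(output_path, index=False)
    return rows
```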
And this is it! Congratulations on reading this far: you now know how to seamlessly integrate a labeling workflow into your time series anomaly detection process!
Conclusion
In this article, you learned how to…
- deploy a Label Studio instance with Docker on a SageMaker notebook instance.
- create a user and a project, and configure the project to access your time series data from an Amazon S3 bucket.
- authorize CORS from your Amazon S3 Bucket to allow Label Studio to directly access your time series data from there without the need to copy it locally in your instance.
- collect your annotation results once a labeling job is done.
I hope you found this article insightful: feel free to leave me a comment here and don’t hesitate to subscribe to my Medium email feed if you don’t want to miss my upcoming posts! Want to support me and future work? Join Medium with my referral link: