Prepare data for your ML models in the fastest and easiest way with Amazon SageMaker Data Wrangler — a visualisation and data preparation tool

A step-by-step guide to preprocessing and visualising data with SageMaker Data Wrangler

Georgios Schinas
Towards Data Science

--

Summary

Amazon SageMaker Data Wrangler is a new service, announced in December 2020, that aims to simplify the process of data preparation and feature engineering for machine learning. You can use the tool's visual interface to build preprocessing and visualisation pipelines (or flows, as they are called). When ready, you can export your flows as code and run them anywhere. It also comes with integrations for other SageMaker (SM) services, such as SM Pipelines and SM Feature Store.

In this post, we will go through the features of this new service by building a Data Wrangler flow with a sample dataset. We will import our data into the Data Wrangler environment and explore its data processing and visualisation capabilities. Finally, we will see how our work can be operationalised. The objective of this post is to explore what is possible with this new SageMaker offering and to showcase how you can prepare data for your machine learning tasks in an easy, fast and repeatable way.

Prerequisites

If you plan to follow along, there are a few things you will need to have at hand. Even if you don't, you can still read this post and decide for yourself if this tool is right for you.
Data Wrangler is part of SageMaker Studio, which is a cloud-based IDE. Therefore, you will need access to SM Studio as well as Data Wrangler. If it is your first time using SageMaker from your account, the free tier will cover your costs, as it includes 25 hours of monthly Data Wrangler usage for the first two months. However, be aware that if you do not fall into this category, you will incur some charges.
In either case, make sure to read to the end of this post to see how to clean up the resources used, so that nothing is left running to surprise you with a bill at the end of the month.

You will also need access to an existing or new S3 bucket.

Dataset

We will be using the well-known wine quality dataset, which you can find and download here. Essentially, this is a regression problem where we try to predict the wine quality score (sensory data) based on some known features (physicochemical tests).
The data at the above link consist of two datasets, one for white wines and one for red.

Set-up

Before we dive into the SageMaker world, make sure to download the data and do the following processing. At the time of writing, Data Wrangler can load data directly from S3 if the files are in CSV format (it also supports loading data from Redshift or via Amazon Athena, but that goes beyond the scope of this post).
The dataset we have is actually semicolon-separated, not comma-separated. Therefore, you need to open a text editor of your choice and replace every ";" with a ",".
Alternatively, if you like the command line and are on a Unix-based system, you can quickly do this with sed:

sed 's/;/,/g' winequality-white.csv > winequality-white-converted.csv
sed 's/;/,/g' winequality-red.csv > winequality-red-converted.csv
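
If you prefer to stay in Python, here is a minimal pandas sketch that does the same conversion (it assumes the file names match the downloads above):

import pandas as pd

# Read the semicolon-separated originals and rewrite them as comma-separated CSVs
for colour in ["white", "red"]:
    df = pd.read_csv(f"winequality-{colour}.csv", sep=";")
    df.to_csv(f"winequality-{colour}-converted.csv", index=False)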

As a next step, upload these files to your S3 bucket. It doesn't matter whether it is a new or an existing bucket, as long as you have permission to upload the data there and retrieve it.
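
If you would rather script this step too, a minimal boto3 sketch could look like the following (the bucket name and key prefix are placeholders, use your own):

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket-name"  # placeholder: replace with your own bucket

# Upload both converted files so Data Wrangler can import them from S3
for colour in ["white", "red"]:
    filename = f"winequality-{colour}-converted.csv"
    s3.upload_file(filename, bucket, f"wine-quality/{filename}")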

To get started, open SageMaker Studio, since this is how we will access the graphical user interface (GUI) for Data Wrangler.
If you have not done that before, open the AWS console, search for "SageMaker" and open Studio. If you would like to know more about Studio, take a look here.

Step 1 — Create a new flow and import data

Once SageMaker Studio is up and running, create a new Data Wrangler flow by clicking on "New flow", which appears once you go to the Data Wrangler menu, as shown below.

Once this is done, it's time to import the data. On the Import tab, click on Amazon S3 and navigate to the S3 file with the data.

Note that you can only import one dataset at a time, so you will need to do this twice since our data are in two files. Click on the Import tab again so that a page like the one above appears, and import the second file from Amazon S3.

Step 2 — joining the data together

In that same window, the Prepare tab should look like the screenshot below. If not, check again that you followed the instructions above.

Now this is where we will build our processing flow!

The first step will be to join our datasets by concatenating them.
Clicking on the “+” icon reveals actions we can take/add to our processing.

In our case, we will choose Concatenate.
Once you click Concatenate, you will need to select the dataset to concatenate with, so click on the other dataset. Then click on Configure for more options.
Here we choose the name of the new dataset, as well as whether we want to add an extra column indicating which file each row originated from.
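
For intuition, the concatenate step with the optional origin column enabled is roughly equivalent to the pandas snippet below (the column name wine_type is just an illustrative choice):

import pandas as pd

white = pd.read_csv("winequality-white-converted.csv")
red = pd.read_csv("winequality-red-converted.csv")

# Stack the two datasets and record which file each row came from
wine_quality = pd.concat(
    [white.assign(wine_type="white"), red.assign(wine_type="red")],
    ignore_index=True,
)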

Step 3 — Transforming data

One of the benefits of Data Wrangler is the collection of transformations you can apply to your data out of the box. And if those are not enough for you, you can add your own custom transformations as well.

Once again, we will press the “+” icon, this time the one that is next to the newly created wine-quality box (that represents the new dataset). From the dropdown, select “Add transform”.
On the right you will see a list of transforms that are supported. Feel free to explore and experiment with those you prefer.

To show you how it works, let's choose Process numeric → leave the transform and scaler as they are → select a column (e.g. residual sugar) → click on scale & centre → Preview & Add.
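
For reference, "scale & centre" amounts to standardisation: subtract the column mean and divide by its standard deviation. A rough pandas equivalent for the same column would be:

# Standardise the column: zero mean, unit variance
col = "residual sugar"
df[col] = (df[col] - df[col].mean()) / df[col].std()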

Let me show you how you can add your own custom calculation as well.
Click on Custom Transform, select the underlying framework you want to use and type your code. As simple as that.
Currently your options are Python with pandas, Python with PySpark, or SQL (PySpark SQL).

So, let's create a new feature. I'll create it using the following Python (pandas) code:

df['leakyFeature'] = df['quality'] * 2

If you are paying attention, you might be jumping off your chair shouting that this makes no sense. Bear with me; there is something interesting I want to show you later. For now, this simply shows how you can use custom code to add your own, unique transformations.

Step 4 — Visualising data

An integral part of building your processing flow is understanding your data, and one of the best ways to do that is to visualise them. Data Wrangler can help you here as well.

Let's go back to the main Prepare page and click on the "+" icon again. This time select "Add analysis".
This will take you to a menu where you can select from a list of available graph types. Feel free to explore this as much as you like, but for now let's make a scatter plot. I would like to see the relationship of residual sugar to total sulfur dioxide, broken down by wine type.
Choose Preview to generate the plot and, if you want to keep it, click Create.
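
If you wanted to reproduce a similar plot locally, a pandas/matplotlib sketch could look like this (it uses the wine_quality frame and the wine_type column from the earlier concatenation sketch):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for wine_type, group in wine_quality.groupby("wine_type"):
    ax.scatter(group["residual sugar"], group["total sulfur dioxide"],
               label=wine_type, alpha=0.5, s=10)
ax.set_xlabel("residual sugar")
ax.set_ylabel("total sulfur dioxide")
ax.legend()
plt.show()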

That was quick and painless. Using these graphs we can explore the dataset and better understand the relationships in our data, and we can keep the analysis steps that we would like to recreate in the future while deleting the uninformative ones.

When it comes to this analysis functionality, there are a couple of non-trivial analyses you can run that can greatly help with preparing your data for the upcoming machine learning modelling phase.

First, let's look at leakage detection. This analysis attempts to inform us of potential target leakage, that is, when information about the target variable leaks through another feature. In real life, this may happen when we use a feature that is a by-product of the target variable but will not actually be available at inference time. This is exactly why we created "leakyFeature" earlier: to simulate such a scenario. It may sound unlikely, but it does happen, and you need to be prepared and on the lookout for it. Let's see how we can detect it.

We create a new analysis and this time select Target Leakage as the analysis type. Select the appropriate problem type and the target variable ("quality" for us). Click Preview and wait a couple of seconds.

As you can see, it was able to detect that information is leaking through the feature "leakyFeature" (as expected), so now we can add a step to remove this column from our dataset.

Please note that at this step the data scientist is responsible for investigating further whether a feature is indeed leaking information or not. It could be that a feature is simply very predictive and will actually be available at inference time. Machines and algorithms are getting smarter, but the data scientist is still needed to make such decisions.
To remove the column, add a transformation as before and choose Manage columns to drop the column leakyFeature (of course, you could just not create it in the first place, but this is just an example).
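
Data Wrangler's leakage report is more thorough than this, but a very rough local sanity check (not what Data Wrangler actually does under the hood) is to look at how strongly each feature correlates with the target, and then drop the offender, which is what the Manage columns step achieves:

# Features that correlate almost perfectly with the target are suspicious
correlations = df.select_dtypes("number").corr()["quality"].drop("quality")
print(correlations.sort_values(ascending=False))

# Drop the leaking column once confirmed
df = df.drop(columns=["leakyFeature"])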


The last analysis capability I want to show you is that of quick model.

This is a very handy feature that, under the hood, trains a model (a random forest) and reports back on performance and feature importance. It is not meant to be used as a model for predictions, but rather as an indication of how well you can expect the modelling to work, and it gives you insights into which features will matter most. Given this information, you may decide to go back to the drawing board and rethink your feature engineering, saving you time and allowing you to iterate over feature engineering much more quickly.

If you are interested in knowing more about how quick model works, I encourage you to read the documentation. The one point I need to stress is this: quick model will attempt to automatically guess what type of problem we are trying to solve (classification or regression). It does that by looking at the target variable: if it has more than 100 distinct values, it assumes a regression problem; if it has fewer, it assumes classification.
In our case the quality score takes only about 10 distinct values, so it will be treated as a classification problem.
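
Expressed in code, the heuristic boils down to something like the following (a paraphrase of the documented rule, not Data Wrangler's actual source):

# Quick model's documented rule of thumb for inferring the problem type
problem_type = "regression" if df["quality"].nunique() > 100 else "classification"
print(problem_type)  # "classification" for the raw quality scores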

--- The next step is entirely optional ---

Someone could argue that we can address this problem as a classification one, but I'd like to make it a regression. How? Remember that custom transform we used before? Let's make use of that again.
Let's create a custom transform and this time add this Python code:

import numpy as np
df['quality'] = df['quality'] + 0.1 * np.random.random(size=df.shape[0])

What is this doing? It adds some low-magnitude random noise to our target variable. This should not meaningfully affect any analysis, but it will make Data Wrangler treat our problem as a regression.
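
You can verify the effect of the trick locally, or wherever you have the dataframe at hand: after adding the noise, the number of distinct target values is roughly the number of rows, comfortably above the 100-value threshold.

print(df["quality"].nunique(), "distinct values out of", len(df), "rows")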

Note: if you run the quick model below both ways, as a regression and as a classification, you will see that the feature importances are essentially the same.

--- End of entirely optional step ---

To run a quick model analysis, add an analysis, choose Quick Model and then select the target variable. Click Preview and wait a few seconds to see the result.

As you can see, quick model was indeed quick and easy to use, and it can give us some really good insights into which features might be most useful at the modelling stage.
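
For intuition, quick model is similar in spirit to the sketch below: fit a random forest and inspect its feature importances. Data Wrangler's actual implementation differs in details such as train/test splitting and scoring, so treat this only as a mental model (it reuses the wine_quality frame from earlier):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = wine_quality.select_dtypes("number").drop(columns=["quality"])
y = wine_quality["quality"]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features by importance, roughly what the quick model report shows
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))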

Step 5 — Export your work

This has been great so far, but I hear you ask: how can we possibly operationalise a flow we created by clicking around?
Time to export our work in a format that can be operationalised more easily.

Data Wrangler offers a few different ways to export your flow. Here are these options in a nutshell.

  • Export as Python code: You can export your flow into a Python file. You can then run this file in any way and anywhere you need. You can even edit it to add your own extra steps or change something.
  • Export as a Data Wrangler Job: This will generate a notebook that when run, will load your flow file and execute it as a SageMaker Processing job.
  • Export as Feature Store: This is similar to the above method but will then also create and ingest your data in a Feature Store. This can be handy if you are creating features that you want other members of your team — or your future self — to have access to and you decide that SageMaker Feature Store is the right service for you to do so.
  • Export as Pipeline: This will create a notebook that is defining a SageMaker Pipeline with a processing step being the execution of the flow. This will be ideal if you are looking to use SageMaker Pipelines to operationalise the whole ML project you are working on, as Pipelines is the CI/CD tool offering of AWS in the ML space.

In any case, let's see how it is done.
Click on the Export tab at the top, click on the steps of our flow and then click "Export step".

As you can see, the generated Python file is quite long, but don't worry: most of it consists of the function definitions for the transformations. Scroll to the bottom and you will clearly see all the steps we took in the flow, now expressed in code.

Sum-up

Summarising all of the above: we have done quite a bit of work, quickly, and wrote almost no code at all.
We started by combining the datasets, then created and transformed features, visualised our data and got insights into which features will be more useful during modelling, so that we can iterate through the feature engineering phase without even leaving the tool we were using. Finally, we saw that this interactive and visual flow can easily be exported to various formats, depending on our operationalisation needs.

You are now ready to experiment with your own data and create your own flows. Let me know in the comments below about your successes or any hiccups you encounter along the way.

Clean-up

If you have been following along, before you close your browser window, don't forget to shut down the Data Wrangler instance we have been using, or else you will incur charges that you didn't intend.
First make sure you save your Data Wrangler flow, if you want to keep it, and then shut down the instance as in the screenshot below.

Now go and enjoy a cup of tea or coffee. You’ve earned it!
