
AWS releases SageMaker Studio Data Wrangler

Categorical Encoding, Target Leakage Identification, and Feature Importance are readily available as no-code capabilities.

Image by Gerd Altmann from Pixabay

While the AWS Data Wrangler open source code has been available for some time, AWS has announced that SageMaker Studio now has Data Wrangler built right into the interface. No coding required. That is very appealing when building machine learning models in SageMaker. Let’s take a look.

The Setup

If you want to play along at home, the data is an open crash dataset from New York.

Once you have AWS and SageMaker set up, Data Wrangler needs no additional configuration. Launch SageMaker Studio, and Data Wrangler is one of the options. Click New Flow, and you are on your way.

From SageMaker Studio, navigate to Data Wrangler - screenshot by the author.

Data Import

As you might have guessed, the first step is to import your dataset. The three options available at this time are S3, Athena, and Redshift, all AWS offerings. I am using the crash dataset I previously uploaded to S3. Following the prompts is easy.

S3, Athena, and Redshift are the available data sources - screenshot by the author.
Data Import screenshot by the author

Add Transformations

Data transformations can be chosen from a pre-defined list or entered as custom transformations (PySpark). Most of the canned data transformations are typical of a data prep tool. Of note are categorical encoding and one-hot encoding, which are very useful for getting data ready for machine learning.

Available Transformations - screenshot by the author
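
To make the two encodings concrete, here is a minimal plain-Python sketch of what they do to a single column. It is only an illustration, not the code Data Wrangler runs; the function names and the `weather` column are hypothetical and not taken from the crash dataset.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {category: index for index, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one 0/1 column per distinct category, keyed by category name."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical column; not taken from the crash dataset.
weather = ["clear", "rain", "snow", "rain", "clear"]

codes, mapping = ordinal_encode(weather)
print(codes)    # [0, 1, 2, 1, 0]
print(mapping)  # {'clear': 0, 'rain': 1, 'snow': 2}
print(one_hot_encode(weather))
```

Ordinal codes keep the data in one column but imply an order between categories, while one-hot encoding avoids that at the cost of one extra column per category.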

Add Analysis

There is also the option to add analysis steps to your process. This is useful for validating results as you go, and it provides ongoing insights if you continue using this recipe in the future. The short video below shows how easy it is to navigate between the different options and set up an analysis.

gif of adding an analysis step by the author

Analyze – Target Leakage

Under Analyze, you can identify target leakage. Avoiding leakage is essential when training a machine learning model. The question to ask is whether any data available at training time wouldn’t be known when making the prediction. Leakage can sneak in where you don’t realize it, so having a quick and easy analysis is helpful. Leakage in the training and test datasets may not be evident until your model hits production, when accuracy suddenly drops off a cliff.
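
The idea behind a leakage check can be sketched in a few lines of plain Python: score how well each feature predicts the target on its own, and treat a near-perfect score as suspicious. This is a simplified illustration and not Data Wrangler's actual computation; the column names, the toy rows, and the 0.95 threshold are all assumptions made up for the example.

```python
from collections import Counter, defaultdict


def single_feature_accuracy(feature, target):
    """Accuracy of predicting the target from one feature alone, using the
    most common target value seen for each feature value (in-sample)."""
    by_value = defaultdict(Counter)
    for x, y in zip(feature, target):
        by_value[x][y] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(target)


# Hypothetical rows: predict whether a crash caused an injury.
target = [1, 0, 1, 0, 1, 0, 0, 1]
features = {
    # Recorded after the outcome is known, so it mirrors the target: leakage.
    "ambulance_dispatched": [1, 0, 1, 0, 1, 0, 0, 1],
    # Known before the crash; only loosely related to the outcome.
    "weather": ["rain", "rain", "clear", "clear", "rain", "clear", "rain", "clear"],
}

scores = {name: single_feature_accuracy(col, target) for name, col in features.items()}
for name, score in scores.items():
    # 0.95 is an arbitrary illustrative threshold, not a Data Wrangler setting.
    flag = "possible leakage" if score >= 0.95 else "ok"
    print(f"{name}: {score:.2f} ({flag})")
```

A flagged column is not automatically leakage. It is a prompt to ask when that value would actually become known relative to the prediction.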

Target Leakage identification - screenshot by the author.

Analyze – Feature Importance through Quick Model

You can run Feature Importance by choosing the Quick Model option. At the time of this writing, it returned unknown errors on the crash dataset, and I was unable to determine the root cause.

I ran a separate, fully prepped dataset for a different project through Quick Model to verify I could get results. I did – it’s your standard feature importance listing, as seen in the screenshot below.

Feature Importance screenshot by the author
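
For readers new to the concept, one common way to estimate feature importance is permutation importance: shuffle one column and measure how much the model's accuracy drops. The sketch below uses that technique purely as an illustration; it is not a description of how Quick Model computes its listing, and the toy rows and the hand-written `model` function are hypothetical stand-ins for real data and a trained model.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, column, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, column: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / repeats


# Hypothetical toy data and a hand-written stand-in for a trained model.
rows = [{"speed": s, "day": d} for s, d in
        [(80, "mon"), (30, "tue"), (75, "wed"), (20, "thu"),
         (90, "fri"), (25, "sat"), (70, "sun"), (35, "mon")]]
labels = [1, 0, 1, 0, 1, 0, 1, 0]


def model(row):
    # Predicts an injury when speed is high; ignores the day entirely.
    return 1 if row["speed"] > 50 else 0


importances = {c: permutation_importance(model, rows, labels, c) for c in ("speed", "day")}
print(importances)
```

Because the stand-in model never reads `day`, shuffling that column changes nothing and its importance is zero, while shuffling `speed` breaks the predictions and produces a positive score.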

Export

Once you have created the steps you require, you can export the resulting ‘recipe.’ You have several options available. The code provided is Python/PySpark scripts. You can build this output right into your SageMaker machine learning projects.

Pick your output format - screenshot by the author.

Some Additional Resources

Amazon SageMaker Data Wrangler – Aggregate and Prepare Data for Machine Learning – Amazon Web…

What is AWS Data Wrangler? – AWS Data Wrangler 1.10.1 documentation

How Data Leakage Impacts Machine Learning Models – ML in Production

A Gentle Introduction to Feature Importance in Machine Learning – Sefik Ilkin Serengil

Conclusion

The integration of a no-code solution right in AWS SageMaker makes sense. The move to simpler no-code interfaces is an industry trend, and there are many emerging competitors. What is interesting to me is that AWS recently released DataBrew, yet SageMaker uses Data Wrangler. We will have to watch how AWS evolves its machine learning platform and the data prep tools it offers.

