While the AWS Data Wrangler open source library has been available for some time, AWS has announced that SageMaker Studio now has Data Wrangler built right into the interface. No coding required. That is very appealing when building machine learning models in SageMaker. Let’s take a look.
The Setup
If you want to play along at home, the data is an open crash dataset from New York.
Once you have AWS and SageMaker set up, Data Wrangler requires no additional configuration. You launch SageMaker Studio, choose Data Wrangler from the options, click New Flow, and you are on your way.

Data Import
As you might have guessed, the first step is to import your dataset. The three options available at this time are S3, Athena, and Redshift, all AWS offerings. I am using the crash dataset I have previously uploaded to S3. Following the prompts is easy.
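Outside of the Data Wrangler UI, the same S3 import can be done in code with pandas. A minimal sketch follows; the column names and sample rows are hypothetical stand-ins for the crash dataset, and an in-memory buffer stands in for the S3 object (in practice you would pass an `s3://bucket/key.csv` URI to `pd.read_csv`):

```python
import io
import pandas as pd

# In a real workflow you would point pandas at the S3 object, e.g.
#   df = pd.read_csv("s3://my-bucket/crashes.csv")
# Here a small in-memory sample stands in for the file; the schema
# below is hypothetical, not the actual crash dataset's columns.
sample_csv = io.StringIO(
    "crash_date,borough,injured\n"
    "2020-01-01,BROOKLYN,2\n"
    "2020-01-02,QUEENS,0\n"
)
df = pd.read_csv(sample_csv)
print(df.shape)  # (2, 3)
```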


Add Transformations
Data transformations can be chosen from a pre-defined list or entered as custom transformations (PySpark). Most of the canned transformations are typical of a data prep tool. Of note, there are categorical encoding and one-hot encoding, which are very useful for getting data ready for machine learning.
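To make the one-hot encoding step concrete, here is a hedged pandas sketch of what that transform produces; the `borough` column and its values are hypothetical examples, not the tool's actual output format:

```python
import pandas as pd

# Hypothetical categorical column, similar in spirit to what the
# one-hot encoding transform does inside Data Wrangler: each
# category becomes its own 0/1 indicator column.
df = pd.DataFrame({"borough": ["BROOKLYN", "QUEENS", "BROOKLYN"]})
encoded = pd.get_dummies(df, columns=["borough"], prefix="borough")
print(encoded.columns.tolist())
# ['borough_BROOKLYN', 'borough_QUEENS']
```

Exactly one indicator is set per row, which is what makes the encoding safe to feed into most model types.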

Add Analysis
There is also the option to add analysis steps into your process. This is good to validate results as you go and provide ongoing insights if you continue using this recipe in the future. The short video below shows the ease of navigating between the different options and setting up analysis.

Analyze – Target Leakage
Under Analyze, you can check for target leakage. When training a machine learning model, it is essential to avoid leakage: is any data available at training time that wouldn’t be known when making the prediction? Leakage can sneak in where you don’t realize it, so having a quick and easy analysis is helpful. Leakage in the training and test datasets may not be evident until your model hits production, when the accuracy suddenly drops off a cliff.
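One simple way to flag leakage candidates yourself is to look for features that correlate almost perfectly with the target. This is a rough sketch of that idea, not Data Wrangler's actual method, and the feature names and data are invented for illustration:

```python
import pandas as pd

# Tiny hypothetical example: "total_claim_paid" would only be known
# after the crash outcome, so it tracks the target almost perfectly.
df = pd.DataFrame({
    "speed": [30, 55, 40, 70, 25],
    "total_claim_paid": [0, 1000, 0, 1200, 0],
    "severe_crash": [0, 1, 0, 1, 0],  # target
})
target = "severe_crash"
corr = df.corr()[target].drop(target).abs()

# Flag any feature whose correlation with the target is suspiciously high.
suspects = corr[corr > 0.95].index.tolist()
print(suspects)  # ['total_claim_paid']
```

A flagged feature is not proof of leakage, but it is exactly the kind of column worth double-checking before training.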

Analyze – Feature Importance through Quick Model
You can run feature importance by choosing the Quick Model option. At the time of this writing, I was getting unknown errors and was unable to determine the root cause.
I ran a separate, fully prepped dataset for a different project through Quick Model to verify I could get results. I did – it’s your standard feature importance listing, as seen in the screenshot below.
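The kind of feature importance listing Quick Model produces can be approximated with an off-the-shelf tree ensemble. This sketch uses scikit-learn on synthetic data and is an assumption about the general technique, not a reproduction of what Quick Model runs internally:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic data: the target depends only on the first feature,
# so it should dominate the importance ranking.
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in zip(["feat_a", "feat_b", "feat_c"], model.feature_importances_):
    print(f"{name}: {score:.3f}")
```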

Export
Once you have finished creating the steps you require, you can export the resulting ‘recipe.’ Several options are available. The code provided consists of Python/PySpark scripts, and you can build this output right into your SageMaker machine learning projects.
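Conceptually, an exported recipe is just your transform steps replayed as code. The sketch below is a hypothetical stand-in using pandas (the real export is generated Python/PySpark, and the column names here are invented), showing the shape of what you would wire into a project:

```python
import pandas as pd

def prep_recipe(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for an exported Data Wrangler recipe:
    each line mirrors one transform step from the flow."""
    df = df.dropna(subset=["borough"])            # drop-missing step
    df = pd.get_dummies(df, columns=["borough"])  # one-hot encoding step
    return df

raw = pd.DataFrame({"borough": ["QUEENS", None, "BRONX"], "injured": [1, 0, 2]})
prepped = prep_recipe(raw)
print(prepped.shape)  # (2, 3)
```

Packaging the steps as a single function is what makes the recipe reusable: the same preparation runs identically at training time and at inference time.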

Some Additional Resources
Amazon SageMaker Data Wrangler – Aggregate and Prepare Data for Machine Learning – Amazon Web…
What is AWS Data Wrangler? – AWS Data Wrangler 1.10.1 documentation
How Data Leakage Impacts Machine Learning Models – ML in Production
A Gentle Introduction to Feature Importance in Machine Learning – Sefik Ilkin Serengil
Conclusion
The integration of a no-code solution right into AWS SageMaker makes sense. The move to simpler no-code interfaces is the industry trend, and there are many emerging competitors. What is interesting to me is that AWS recently released DataBrew, yet Data Wrangler is what was built into SageMaker. We will have to watch how AWS evolves the machine learning platform and the data prep tools it offers.