How to Decide Between Amazon SageMaker and Microsoft Azure Machine Learning Studio
Both Build Models Faster, but for Very Different Types of Users
I recently published a walk-through of Microsoft Azure Machine Learning Studio (Studio) https://towardsdatascience.com/how-microsoft-azure-machine-learning-studio-clarifies-data-science-8e8d3e6ed64e and was favorably impressed with its simplicity and power. But other tools also claim to make machine learning easier and to speed up model development, and I wondered how they compare. So, this week, I am taking a look at Amazon SageMaker (SageMaker) and how it compares to Studio.
What I found when I looked at SageMaker in comparison to Studio is a significantly different approach to model building. Both vendors claim to offer a fully managed service that covers the entire machine learning workflow to build, train, and deploy machine learning models quickly. And this is totally true. However, they accomplish this in vastly different ways.
Azure Studio has a drag and drop UI where the machine learning modeling process is architected on a canvas (entirely without code if the user doesn’t stray too much). The user is shielded from the complexities of data engineering, open source libraries and Python coding. This product is targeted at data analysts and citizen data scientists and others who want a simple, visual way to build models.
SageMaker, on the other hand, relies heavily on code, and much of the user interaction is designed to take place in a familiar Jupyter Notebook (certainly one of the most popular tools used by data scientists). The SageMaker environment allows maximum flexibility with Python, the most popular coding language for data scientists, but requires much more knowledge of data engineering, data storage and compute resources than Studio does.
So, while both products make data science easier, this is really a case of comparing apples and oranges because they operate so differently. SageMaker is not appropriate for the target users of Studio, who are not knowledgeable about coding and data engineering, and Studio would appear limiting to software-savvy data scientists and developers who are used to coding up anything they desire.
To see why these products were designed for different users, it is best to just walk through the process of model building with screenshots of how each product would work for setup, getting data, preparing data, building and training models, testing and scoring models and deploying.
For SageMaker, I will use Python 3 to implement the XGBoost algorithm to predict for the marketing department of a bank whether a customer will buy a certificate of deposit (CD) or not. For Studio, I will conduct a linear regression using various car attributes to predict the price of a car. Here is how both products work.
Setup — Create an Environment
With Amazon SageMaker, we start out by creating a Jupyter notebook instance in the cloud.
The notebook instance is created so a user can access S3 (AWS storage) and other services. Note that in this setup process, the user is making decisions about which S3 buckets they should access, selecting the size of their cloud instance and other technical details — likely to be confusing for citizen data scientists.
The launching point for Azure Machine Learning Studio is the homepage.
In contrast to the initial setup and instance management required in SageMaker, Studio looks much more like a business application and skips the complexity. The basic layout is represented in the following tabs on the left:
· PROJECTS — Collections of experiments, datasets, notebooks, and other resources representing a single project
· EXPERIMENTS — Experiments that you have created or saved
· WEB SERVICES — Web services models that you have deployed from your experiments
· NOTEBOOKS — Jupyter notebooks that you have created
· DATASETS — Datasets that you have uploaded into Studio
· TRAINED MODELS — Models that you have trained in experiments and saved
Get the Data
Studio lets you import and save datasets on the left of this screenshot. In this example, I have chosen a dataset labeled Automobile price data (Raw) and then dragged this dataset to the experiment canvas.
Studio works through the creation of a canvas for conducting a machine learning experiment. On the canvas, the user drags and drops the necessary components, such as datasets, transformations and algorithms, into an orderly process flow for the model experiment. After testing and deciding on a model, the experiment is converted to a working model and published.
One really nice feature that data scientists appreciate is the ability to get a quick look at the data columns and distribution to understand the data they are dealing with. To see what this data looks like, you can simply click the output port at the bottom of the dataset, then select Visualize.
The process of getting data into SageMaker is accomplished programmatically with Python in this example. The user selects the dataset (could be a CSV file etc.) and imports it into a Pandas dataframe for analysis. This is integrated into the data preparation part of SageMaker shown later. In this example, the code would look like the Python command:
model_data = pd.read_csv(….)
Prepare the Data
In SageMaker, once the data is in Python, the user is free to programmatically transform columns, drop rows with missing data and so on, with complete flexibility. I’ll show you some sample code involving reformatting a header and column as part of the model selection section later.
To accomplish data prep in SageMaker before training a model, there are again a number of steps to walk through to configure the environment. Note: there is again a lot of focus on containers, dataframes, libraries, defining the right S3 buckets and the AWS regions. This complexity is shielded from users in Studio.
First, we need to select the version of Python we are using, which here is the conda_python3 kernel.
To prepare the data, train the machine learning model, and deploy it, we will need to import some libraries and define a few environment variables. In this case, the code to do that looks as below. We would run that code in the Jupyter Notebook.
We will need an S3 bucket to store our training data once we have processed it, and that needs to be defined too.
Next, we need to download the data to our SageMaker instance and load it into a Pandas dataframe as discussed above. To do this would look as follows:
In contrast to a code-based approach, Studio offers a drag and drop visual approach. Datasets and modules in Studio have input and output ports represented by small circles — input ports at the top, output ports at the bottom. To create a flow of data through your experiment, you’ll connect an output port of one module to an input port of another. At any time, you can click the output port of a dataset or module to see what the data looks like at that point in the data flow.
Studio makes the process of data prep very easy for business users. It supplies a module, Select Columns in Dataset, that removes unwanted columns completely (in this case the normalized-losses column, which has many missing values). We simply connect the Select Columns in Dataset module to our automobile dataset and choose the columns to exclude.
Similarly, to remove rows with missing data, we drag the Clean Missing Data module to the experiment canvas and connect it to the Select Columns in Dataset module.
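For readers who think in code, the two Studio modules above can be approximated in a few lines of plain Python. This is only a sketch under stated assumptions: the rows, the helper names drop_columns and drop_incomplete_rows, and the use of None for a missing value are all hypothetical, and Studio's own modules offer more options than shown here.

```python
# Hypothetical rows standing in for the automobile dataset; None marks a missing value.
rows = [
    {"make": "audi", "normalized-losses": 164, "horsepower": 102, "price": 13950},
    {"make": "bmw", "normalized-losses": None, "horsepower": 101, "price": 16430},
    {"make": "mazda", "normalized-losses": None, "horsepower": None, "price": 6795},
]


def drop_columns(rows, columns):
    """Rough analogue of Select Columns in Dataset: exclude the named columns."""
    excluded = set(columns)
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


def drop_incomplete_rows(rows):
    """Rough analogue of Clean Missing Data: keep rows with no missing values."""
    return [row for row in rows if all(v is not None for v in row.values())]


selected = drop_columns(rows, ["normalized-losses"])
cleaned = drop_incomplete_rows(selected)
print(len(rows), "rows in,", len(cleaned), "rows out")
```

Dropping the sparse column first matters: if the rows were cleaned first, every row with a missing normalized-losses value would be discarded along with its otherwise usable data.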
Build & Train the Model
This step involves splitting the data into training and testing subsets, selecting the features to base the prediction on, and choosing an algorithm that best suits the data.
Split the Data and Select Features
Splitting data into training and testing sets is accomplished with a few lines of Python code. The training data (70% of the customer dataset in this example) is used during an iterative cycle called gradient optimization to learn the model parameters and infer the class label from the input features with the least possible error. This seems to be an advantage of SageMaker in this example, as feature selection (the choice of independent variables) is performed automatically by the algorithm, whereas in Studio we select the features manually, as you will see. The remaining test data (30%) will be used to evaluate the performance of the model later.
To split data and select features in Studio, we use the Select Columns in Dataset and Split Data modules.
Studio lets us select features to pass to the training algorithm module fairly simply by selecting the columns that we believe will offer predictive power and iterating until we get the desired performance. I did discover that some of the machine learning algorithms in Studio do use feature selection or dimensionality reduction as part of the training process. When we use these algorithms, we can skip the feature selection process and let the algorithm decide the best inputs just as SageMaker does. For linear regression, we select the features manually though.
To split data, we use the Split Data module and visually select the split percentages, rather than coding the split as we do in SageMaker.
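To make the coded alternative concrete, here is a minimal sketch of a 70/30 split in plain Python. The function name train_test_split, the seed, and the stand-in customer list are assumptions for illustration; in a notebook the same idea is usually expressed with pandas or NumPy.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


customers = list(range(1000))  # stand-in for 1,000 customer records
train_data, test_data = train_test_split(customers)
print(len(train_data), "training rows,", len(test_data), "test rows")
```

Shuffling before cutting avoids a biased split when the source file is sorted, for example by date or by outcome.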
Train the Model
We have predetermined that we will use the SageMaker pre-built XGBoost algorithm. Just as we eliminated columns and rows in Studio in the Prepare the Data section, we can do the same type of data cleansing and reformatting in Python in SageMaker. The code below (in cell #6) reformats the header and first column of the training data and loads the data from the S3 bucket.
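A minimal sketch of that kind of reformatting, not the actual notebook cell: for CSV training data, the built-in XGBoost algorithm expects the label in the first column and no header row. The column names (age, duration, y_yes) and the rows below are hypothetical, and the upload to S3 is left out.

```python
import csv
import io

# Hypothetical training rows; "y_yes" is the assumed label column (1 = enrolled).
train_rows = [
    {"age": 34, "duration": 210, "y_yes": 0},
    {"age": 51, "duration": 640, "y_yes": 1},
]
label = "y_yes"
features = [name for name in train_rows[0] if name != label]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in train_rows:
    # Label first, then the features; no header row is written.
    writer.writerow([row[label]] + [row[name] for name in features])

train_csv = buffer.getvalue()
print(train_csv)
```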
Next, we need to set up the SageMaker session, create an instance of the XGBoost model (an estimator), and define the model’s hyperparameters. The code for this is in cell #7.
With the data loaded and the XGBoost estimator set up, we can now train the model using gradient optimization on a ml.m4.xlarge instance by selecting run in the Jupyter notebook using the code in cell #8.
In Studio, to add the learning algorithm we wish to use, we expand the Machine Learning category on the left side of the canvas and then expand the Initialize Model section. We have predetermined that we will use linear regression, so we select the Linear Regression module and drag it to the canvas. Next, we find the Train Model module and drag it to the experiment canvas. We connect the output of the Linear Regression module to the input of the Train Model module, and connect the training data (left port) of the Split Data module to the Train Model module as shown.
Click the Train Model module, click Launch column selector in the Properties pane, and then select the price column. Price is the value that our model is going to predict. We move price from the available columns list to the selected columns list.
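Conceptually, training a linear regression with price as the label means finding the line (or plane, with several features) that minimizes squared error on the training rows. The sketch below does this for a single feature with ordinary least squares; the horsepower and price values and the name fit_line are hypothetical, and Studio's module may use a different solver and many more features.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training rows: horsepower is the feature, price is the label.
horsepower = [70, 100, 130, 160]
price = [7000, 10000, 13000, 16000]

slope, intercept = fit_line(horsepower, price)
predicted_price = slope * 115 + intercept  # prediction for a 115 hp car
print(slope, intercept, predicted_price)
```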
At last we can Run the experiment. We now have a trained regression model that can make price predictions and our model creation flow in Studio looks as follows:
Test, Score and Deploy the Model
To test the model in SageMaker, we must first deploy it, which is a different process than in Studio. To deploy the model on a server and create an endpoint, we run the code in cell 9 in Jupyter. To predict whether customers in the test data enrolled for the bank product or not, we run the code in cell 10.
To evaluate the performance, we write Python code to compare actual vs. predicted performance and produce a table of results as follows.
From these results, we can conclude that the model predicted the outcome accurately for 90% of customers in the test data, with a precision of 65% (278/429) for enrolled and 90% (10,785/11,928) for didn’t enroll.
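The arithmetic behind those percentages can be checked directly from the counts quoted above. This sketch assumes the two denominators (429 and 11,928) are the numbers of customers predicted as enrolled and as not enrolled, so that together they cover the whole test set; the variable names are my own.

```python
# Counts quoted in the text above.
correct_enrolled, predicted_enrolled = 278, 429
correct_not_enrolled, predicted_not_enrolled = 10_785, 11_928

# Precision: of the customers predicted to be in a class, how many really were.
precision_enrolled = correct_enrolled / predicted_enrolled
precision_not_enrolled = correct_not_enrolled / predicted_not_enrolled

# Accuracy: all correct predictions over all predictions.
accuracy = (correct_enrolled + correct_not_enrolled) / (
    predicted_enrolled + predicted_not_enrolled
)

print(f"precision (enrolled):      {precision_enrolled:.1%}")
print(f"precision (didn't enroll): {precision_not_enrolled:.1%}")
print(f"overall accuracy:          {accuracy:.1%}")
```

The gap between the two precision figures is worth noticing: because most customers did not enroll, the overall accuracy is dominated by the majority class and says little about how well the model finds likely buyers.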
Last, in SageMaker, we also need to remember to terminate our session and clean up the cloud resources to avoid further charges in our account. The following code deletes the SageMaker endpoint and the objects in the S3 bucket.
Now that we’ve trained the Studio model on 75% of the data, we can use it to score the other 25% of the data to see how well our model functions. We do this by dragging the Score Model module to the experiment canvas and connecting the output of the Train Model module to it. We then connect the test data output (right port) of the Split Data module to the Score Model module as shown.
We then run the experiment and view the output for Score Model by clicking the bottom port and selecting Visualize. The predicted prices are shown in the column Scored Labels along with all of the known feature data used by the model. The column price is the actual known price from the data set.
As with SageMaker, we want to evaluate how well our model performed. To do this, we drag the Evaluate Model module onto the canvas and simply connect it to the output of the Score Model module. Now when we run this experiment again, we can visualize statistical results such as the mean error and R squared.
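To show what those two statistics measure, here is a small sketch that computes them by hand. I am assuming "mean error" refers to mean absolute error, and the price and scored-label values are hypothetical; Evaluate Model reports several other metrics as well.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, in the units of the label."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus residual over total variance."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical values: the known "price" column versus "Scored Labels".
price = [13950, 16430, 6795, 9980]
scored_labels = [13500, 17000, 7100, 9500]

mae = mean_absolute_error(price, scored_labels)
r2 = r_squared(price, scored_labels)
print(f"mean absolute error: {mae:.2f}")
print(f"R squared: {r2:.3f}")
```

An R squared close to 1 means the model explains most of the variation in price, while the mean absolute error tells you how far off a typical prediction is in dollars.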
Once we are satisfied with the accuracy of the model, Studio makes it easy to publish the model for others to use with a Set up Web Service button. This option converts the model from a training experiment to a predictive experiment by eliminating data splits, training and other steps that are unnecessary once we have decided on the features and algorithm. We run the model one last time to check the results, and it is ready to go with an API key for others to use on Azure.
Both Microsoft and Amazon offer a robust process and UI-based tool to accelerate and simplify the process of machine learning model development with Azure Studio and Amazon SageMaker. But the tools are designed for totally different users.
Studio offers a beautiful drag and drop interface with simple modules to perform common functions like accessing data, cleansing data, scoring and testing models, and deployment. It is designed to walk the citizen data scientist and beginner through the process of building a machine learning model while shielding them from the underlying complexity of managing cloud instances, Python coding and Jupyter Notebooks.
SageMaker was built to serve the needs of developers and data scientists who are comfortable working in a Jupyter Notebook and programming in Python, and who want flexibility and total control of resources. But users of SageMaker are not shielded from cloud operational complexities such as instance management: knowing what size of cluster to choose, which locations to use, and when to spin down clusters after the work is done. These types of tasks, in addition to the Python programming, would make SageMaker an inappropriate choice for most business analysts trying to build a model.
But does one product produce a more accurate model than the other? I don’t believe model accuracy will differentiate one from the other as both products will let us import any algorithms desired and both products offer some automated feature selection on different models. I doubt there is a generalizable difference where one product can produce more accurate modeling results than the other consistently.
The real difference is in the user design point. SageMaker is for data scientists and developers, and Studio is designed for citizen data scientists. But Studio also supports a Jupyter Notebook interface, making it possible for data scientists to use Studio and the cloud infrastructure for Azure Machine Learning Services to accomplish what SageMaker offers on top of Amazon cloud infrastructure. To that end, Studio may be a more versatile choice right now for more user types.
About the Author
Steve Dille is a Silicon Valley product management and marketing leader who has been on the executive teams of companies resulting in 5 successful company acquisitions and one IPO in the data management, analytics, BI and big data sectors. Most recently, he was the CMO of SparkPost, where he was instrumental in transitioning the company from an on-premises high volume email sender to a leading predictive analytics-driven cloud email API service growing it from $13M ARR to over $50M ARR. He is currently building deep knowledge in data science, AI and machine learning by pursuing his Master’s in Information and Data Science at UC Berkeley while working. His past education includes an MBA from University of Chicago Booth School of Business and a BS in Computer Science/Math from University of Pittsburgh. He has served as a software developer at NCR, product manager at HP, Data Warehousing Director at Sybase (SAP) and VP of Product or CMO at numerous other startups and mid-size companies.