such as the Microsoft Azure platform.

Deploying an ML model on the cloud is definitely different from working with Jupyter notebooks on your own machine, but the difference is mostly a matter of understanding the cloud system and how ML solutions get deployed on it. Given the way machine learning and data science have evolved so far, we are all very comfortable with our Jupyter notebook ways of working. Unfortunately, this does not fit well with the way data is collected, stored and processed in organizations. Increasingly, machine learning projects are moving from pilot mode to the cloud as data sizes increase and working on local systems becomes unviable. Many projects involve petabytes of data, for which a cloud based solution is the only practical answer. Further, most large organizations are investing in cloud based data infrastructure. The big players are AWS and Microsoft Azure. So, whether you are a junior or a senior data scientist, some kind of cloud exposure becomes essential.
In this post, I discuss the process for developing a project in the Azure system from scratch. The territory is familiar, but it is not the same. Some modifications in the approach are needed, given that cloud systems are complex and have many layers related to access, permissions, data storage types, ETL, how data is processed, etc. Further, once we move our project to the cloud, it definitely moves from being our ‘baby’ to needing the support of a variety of other roles, including data engineers and solution architects. Is it possible for one person to play all these roles? At a pinch, maybe. But with the way IT departments are structured in most organizations, it may not be possible.
What this post specifically covers is the end-to-end data science project process and how it takes place. This includes both provisioning resources on the cloud (not the exact process with screenshots, but rather what you need in terms of storage and compute capacity) and the process of setting up, training, testing and deploying models. As mentioned before, I have made this specific to Azure, but I am sure similar constructs can be used for AWS, for example. I also strongly recommend going through the Microsoft Azure documents on the Team Data Science Process, which are listed in the References.
The Data Science Process Using Cloud Based Systems

The image above illustrates the cloud based data science process. At a high level it looks similar to the good old data science process; however, the devil is really in the details. Let us look at each step in detail.
Step 1: Business Understanding
This step deals with understanding the business requirements for a data science project. Typically it requires us to define the business problem for which the data science solution may be used. Business problems are often big and global, and the data science solution may be only a small part of the overall problem. The data science project may aim to develop an automation or application that can be used as part of the business process. For example, a machine learning led automation that improves the targeting of emails to raise the open rate is one part of the bigger objective of engaging customers with marketing emails. So, it is important to be clear about the distinction.
Once the specific business objective related to the data science application is identified, the next step is to identify what type of questions the data science problem would aim to solve. Broadly there are 5 types of questions that data science typically seeks to solve:
- Predicting ‘how much or how many’ – typically a regression type problem
- Classifying observations into categories – typically a classification problem
- Identifying groups in an unknown dataset – typically a clustering problem
- Trying to assess if ‘this is weird’ – typically an anomaly detection problem
- Identifying which options should be offered – a recommendation problem
Once the key question that data science can help with is identified, we can also attempt to identify the target variable, which can be a key business KPI, for example. At this stage we should also attempt to define what ‘good’ looks like for the machine learning algorithm. This could be, for example, the accuracy percentage, MAPE or MSE that is desired or that would make the solution acceptable.
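To make ‘good’ concrete, it can help to write the agreed metrics down as code. The following is a minimal sketch of accuracy, MSE and MAPE in plain Python. The 10% MAPE threshold and the sample numbers are hypothetical and only illustrate how an acceptance criterion might be checked.

```python
from typing import Sequence


def accuracy(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Share of predictions that match the actual label."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error: the average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumes that no actual value is zero.
    """
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100 * total / len(actual)


# Hypothetical acceptance threshold agreed with the business during scoping.
MAX_ACCEPTABLE_MAPE = 10.0

# Hypothetical actuals and forecasts, for illustration only.
actual_sales = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]

model_is_acceptable = mape(actual_sales, forecast) <= MAX_ACCEPTABLE_MAPE
```

Which metric to agree on depends on the question type: accuracy suits classification, while MAPE and MSE suit ‘how much or how many’ problems.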
It is also useful to discuss how the solution could be deployed in production, although this part of the discussion may be hypothetical in nature. The identification of data sources should also be discussed at this stage, as should cloud resource provisioning. One of the problems with the cloud is that, unlike your local system, a virtual machine on the cloud carries a charge, and someone has to pick up that cost. In general, all compute, storage and deployment applications will have some associated cost and will most likely be gated. So, it is important to be mindful that clear business benefits may need to be demonstrated.
Typically Step 1 is signed off by a project charter or scoping document. It is important to have this in place as data science projects are highly evolutionary in nature and documentation on agreement and scope is critical.
Step 2: Identification and Understanding of Data Sources
Identification
This step is the one that is likely to take the longest. Data in most organizations is messy, incomplete and difficult to source. Things can be a bit better if the data is already in the cloud, since it may have gone through some kind of harmonization and rationalization when it was ingested into cloud storage. However, data being in the cloud does not necessarily make it easier to reach. As noted above, the data, compute and analytical resources all need to be provisioned. Given this cost, outcomes will be scrutinized more closely than if a flat file pulled from the ‘on premise’ data warehouse were analyzed on the analyst's own laptop. When reviewing the data sources it is good to understand which type they are: on premise or cloud, streaming pipeline or batch, data lake or SQL database.
The goal of this phase is to identify good quality data for the initial pilot rounds and then to develop data pipelines that refresh and score data automatically.
Understanding of Data
Data understanding follows the standard processes of exploratory data analysis, data cleaning, identifying missing values and outliers and determining the best methods to impute them. This process is often iterative.
Once data quality has been reviewed and outliers and missing values have been treated, the next step is to understand patterns in the data in order to choose appropriate analytical models. A key focus of the EDA should be on:
- Determining if there is sufficient data for modeling
- Identifying variables that are closely connected to the target variable
- Determining patterns and interconnections among variables which can act as a precursor to the feature engineering phase of modeling
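The checks described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: missing values are represented as None, outliers are flagged with the 1.5 × IQR rule (one common convention, not the only one), imputation uses the median, and the column values are made up.

```python
import math
import statistics
from typing import List, Optional


def missing_share(values: List[Optional[float]]) -> float:
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries with the median of the observed ones."""
    median = statistics.median([v for v in values if v is not None])
    return [median if v is None else v for v in values]


def iqr_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Values further than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between a candidate variable and the target."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical column with one missing value and one extreme value.
monthly_spend = [1.0, 2.0, None, 4.0, 5.0, 100.0]

share_missing = missing_share(monthly_spend)
filled = impute_median(monthly_spend)
outliers = iqr_outliers(filled)
```

In practice this loop is repeated several times, since treating outliers changes the medians and correlations that the later checks rely on.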
Step 2a: Preparing Solution Architecture
This is the first point at which we begin to depart from the approach used on local machines. One of the key features of cloud systems is that all the data exists in various forms of storage, such as data lakes, and can easily be pulled into the ML application by building automated data pipelines. This has the benefit of automating the entire end-to-end process. Unlike models run on organizational servers or on the local machines of data scientists, a model on the cloud can be wired into automated pipelines for scoring data and for retraining as well.
At this point we need to start thinking about the cloud architecture and how the data flow can be automated. This can be done either by an automated pipeline or a workflow. The architecture will consist of pipelines for:
- Scoring fresh data
- Retraining models based on new data
This process can be made more advanced by developing automations for time consuming and iterative tasks such as EDA, model training and validation. However, if we are early in the model development phase, we can focus on the pipelines for scoring.
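As a rough illustration of the two pipelines, the sketch below chains plain Python functions. Everything in it is hypothetical: the row layout with `x` and `y` keys, the placeholder cleaning step and the trivial ratio model simply stand in for real storage, ETL and training components, which on a cloud platform would be separate managed services.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]
Model = Callable[[Row], float]


def clean(rows: List[Row]) -> List[Row]:
    """Placeholder cleaning step: drop rows whose feature x is missing."""
    return [r for r in rows if r.get("x") is not None]


def train(rows: List[Row]) -> Model:
    """Fit a trivial model that predicts y as (mean of y / x) * x.

    Assumes x is never zero in the training rows.
    """
    ratio = sum(r["y"] / r["x"] for r in rows) / len(rows)
    return lambda row: ratio * row["x"]


def scoring_pipeline(model: Model, fresh_rows: List[Row]) -> List[float]:
    """Pipeline 1: clean fresh data and score it with the current model."""
    return [model(r) for r in clean(fresh_rows)]


def retraining_pipeline(history: List[Row], new_rows: List[Row]) -> Model:
    """Pipeline 2: refit the model on historical plus newly labeled data."""
    return train(clean(history + new_rows))


history = [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}]
model = retraining_pipeline(history, [{"x": 4.0, "y": 8.0}])
scores = scoring_pipeline(model, [{"x": 3.0}, {"x": None}])
```

Keeping the cleaning step shared between both pipelines is a deliberate choice here: it helps ensure that the model is scored on data prepared the same way as the data it was trained on.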
Step 3: Feature Engineering
This is the phase in which the modeling process is developed. This is often an iterative process. The first step is to carry out feature engineering to identify the best variables for modeling, and the second step is to carry out model training.
Feature engineering is a complex and critical part of the modeling step. An important part of feature engineering is to leverage domain and business knowledge for EDA, for deciding which variables to include and for choosing transformations of those variables. The key balancing act is between including variables that are strongly connected with the target variable and avoiding too many variables, which can bring noise into the model.
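One simple way to picture this balancing act is shown below: a domain-driven derived feature is added, and candidates are then kept only if their absolute correlation with the target clears a threshold. This is a sketch, not a recommended selection method; the feature names, the numbers and the 0.5 cut-off are all hypothetical, and correlation only captures linear relationships.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(
    candidates: Dict[str, List[float]],
    target: List[float],
    min_abs_corr: float,
) -> List[str]:
    """Keep features whose |correlation| with the target meets the cut-off."""
    return [
        name
        for name, values in candidates.items()
        if abs(pearson(values, target)) >= min_abs_corr
    ]


# Hypothetical raw data for an email engagement problem.
emails_sent = [10.0, 20.0, 10.0, 40.0, 50.0]
emails_opened = [1.0, 4.0, 3.0, 16.0, 25.0]
noise = [1.0, -1.0, 0.0, -1.0, 1.0]
engagement_score = [1.0, 2.0, 3.0, 4.0, 5.0]  # target variable

candidates = {
    "emails_sent": emails_sent,
    # Derived feature based on domain knowledge: opens per email sent.
    "open_rate": [o / s for o, s in zip(emails_opened, emails_sent)],
    "noise": noise,
}

selected = select_features(candidates, engagement_score, min_abs_corr=0.5)
```

With these made-up numbers the noise column is dropped while the raw and derived features are kept.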
Step 4: Model Training and Validation
The model training procedure is standard. If sufficient data is available we can follow the ‘train – test’ split approach. One of the advantages of carrying out analytics using cloud based resources is the availability of flexible compute resources. This enables multiple models and hyperparameter settings to be tested at the same time and the results compared. The algorithms to be used depend on:
- Business question to be answered (prediction, classification, exploration, etc.)
- Volume of data and availability of labeled training data
- Accuracy requirements of the scenario – accuracy measures the overall effectiveness of the model
- Number of parameters – these are settings, such as error tolerance or number of iterations, that are used to improve the fit of the model
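The idea of splitting the data and comparing several hyperparameter settings at once can be sketched as follows. The example is deliberately tiny: a one-feature ridge regression through the origin, synthetic noiseless data, and a local thread pool standing in for the flexible compute a cloud platform would provide. All names and values are illustrative assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Point = Tuple[float, float]  # (feature x, target y)


def train_test_split(
    rows: List[Point], test_share: float = 0.25, seed: int = 0
) -> Tuple[List[Point], List[Point]]:
    """Shuffle the rows and hold out a share of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]


def fit_ridge_slope(train: List[Point], penalty: float) -> float:
    """One-feature ridge regression through the origin.

    Minimizing sum((y - b*x)^2) + penalty * b^2 gives
    b = Sxy / (Sxx + penalty).
    """
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + penalty)


def holdout_mse(slope: float, holdout: List[Point]) -> float:
    """Mean squared error of the fitted slope on the held-out rows."""
    return sum((y - slope * x) ** 2 for x, y in holdout) / len(holdout)


def evaluate(
    penalty: float, train: List[Point], holdout: List[Point]
) -> Tuple[float, float]:
    """Train with one hyperparameter value and report its hold-out error."""
    return penalty, holdout_mse(fit_ridge_slope(train, penalty), holdout)


# Synthetic, noiseless data: y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 21)]
train, holdout = train_test_split(data)

# Evaluate several penalty values concurrently and keep the best one.
penalties = [0.0, 1.0, 10.0, 100.0]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda p: evaluate(p, train, holdout), penalties))

best_penalty, best_mse = min(results, key=lambda r: r[1])
```

Because the synthetic data has no noise, the unpenalized fit wins here; with real data a non-zero penalty often performs better on the hold-out set.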
Cloud systems such as Azure offer fully automated machine learning services such as AutoML. During training, AutoML creates a number of pipelines that try different algorithms in parallel along with hyperparameter tuning. It exits the run when a target exit criterion, such as an accuracy percentage, is met. AutoML also enables easy ensemble modeling via voting or stacking ensemble methods. It also includes best practices to avoid overfitting using regularization (L1, L2 and Elastic Net). Other best practices such as K fold cross validation can be easily incorporated using pre-built modules.
Other advantages of Azure ML include centralized workspaces that keep track of the artefacts created when running ML models. These include the history of all models, logs, metrics, outputs and scripts in one place. Sharing a workspace by assigning user roles is definitely a big advantage of cloud working, enabling projects to proceed faster with an efficient and collaborative division of labour.
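To show what K fold cross validation and an exit criterion mean mechanically, here is a small plain Python sketch. It is not how AutoML is implemented and it does not use the Azure SDK: the candidates are tried one after another rather than in parallel, and the two toy classifiers, the data and the 0.85 target are all hypothetical.

```python
import statistics
from typing import Callable, List, Tuple

Predictor = Callable[[float], int]
Fitter = Callable[[List[float], List[int]], Predictor]


def k_fold_indices(n_rows: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split row indices into k (training, validation) pairs."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [idx for j, f in enumerate(folds) if j != i for idx in f]
        splits.append((training, validation))
    return splits


def fit_majority(xs: List[float], ys: List[int]) -> Predictor:
    """Baseline: always predict the most frequent training label."""
    label = max(set(ys), key=ys.count)
    return lambda x: label


def fit_threshold(xs: List[float], ys: List[int]) -> Predictor:
    """Predict 1 above the midpoint of the two class means.

    Assumes both classes are present in the training rows.
    """
    mean0 = statistics.mean([x for x, y in zip(xs, ys) if y == 0])
    mean1 = statistics.mean([x for x, y in zip(xs, ys) if y == 1])
    threshold = (mean0 + mean1) / 2
    return lambda x: int(x > threshold)


def cross_val_accuracy(fit: Fitter, xs: List[float], ys: List[int], k: int) -> float:
    """Average validation accuracy over k folds."""
    scores = []
    for training, validation in k_fold_indices(len(xs), k):
        predict = fit([xs[i] for i in training], [ys[i] for i in training])
        hits = sum(predict(xs[i]) == ys[i] for i in validation)
        scores.append(hits / len(validation))
    return sum(scores) / k


def search(candidates, xs, ys, k: int, target_accuracy: float):
    """Try candidates in order and stop once one meets the target."""
    tried = []
    for name, fit in candidates:
        score = cross_val_accuracy(fit, xs, ys, k)
        tried.append((name, score))
        if score >= target_accuracy:
            break  # exit criterion met, skip the remaining candidates
    return tried


xs = [float(x) for x in range(1, 11)]
ys = [0 if x <= 5 else 1 for x in xs]
candidates = [
    ("majority", fit_majority),
    ("threshold", fit_threshold),
    ("unused", fit_majority),  # never evaluated if the target is met earlier
]
tried = search(candidates, xs, ys, k=5, target_accuracy=0.85)
```

The third candidate is never evaluated in this run, which is the point of an exit criterion: compute is only spent until an acceptable model is found.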
Step 5: Deploying Model and Monitoring
Model Deployment
Once the model is considered fit for purpose and has been approved by business, we can proceed to deploy the model in a more scalable fashion. Cloud based systems make model deployment a much easier and faster process. Deployment of the model is typically the process of setting up the model so that it can consume fresh data. When deploying your AI model during production, you need to consider how it will make predictions. The two main processes for AI models are:
Batch inference: A process that bases its predictions on a batch of observations. The predictions are then stored as files or in a database for end users or business applications. This is useful when large amounts of data need to be processed asynchronously.
Real-time (or interactive) inference: The model makes predictions at any time and triggers an immediate response. This pattern can be used to analyze streaming and interactive application data.
— Machine Learning Inference during Deployment, n.d.
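The difference between the two patterns can be illustrated with a short sketch. The `predict` function below is a hypothetical stand-in for a trained model, and the feature names `recency` and `frequency` are made up. In a real deployment the batch job would typically write to cloud storage or a database, and the real-time function would sit behind a web service endpoint.

```python
import csv
import io
from typing import Dict, Iterable, TextIO


def predict(features: Dict[str, float]) -> float:
    """Hypothetical trained model: a fixed linear score."""
    return 0.5 * features["recency"] + 2.0 * features["frequency"]


def batch_inference(rows: Iterable[Dict[str, float]], out: TextIO) -> int:
    """Score a whole batch and store the predictions as CSV for later use."""
    writer = csv.writer(out)
    writer.writerow(["id", "prediction"])
    count = 0
    for row in rows:
        writer.writerow([row["id"], predict(row)])
        count += 1
    return count


def realtime_inference(request: Dict[str, float]) -> Dict[str, float]:
    """Score a single request and return the response immediately."""
    return {"id": request["id"], "prediction": predict(request)}


# Batch: many observations, results stored for downstream applications.
batch = [
    {"id": 1, "recency": 2.0, "frequency": 1.0},
    {"id": 2, "recency": 4.0, "frequency": 0.0},
]
stored = io.StringIO()  # stands in for a file or database table
rows_scored = batch_inference(batch, stored)

# Real time: one observation, answered on the spot.
response = realtime_inference({"id": 3, "recency": 0.0, "frequency": 1.5})
```

Both paths call the same `predict` function; what differs is when the model is invoked and where the result goes.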
Model Monitoring
An important aspect of model deployment is to monitor the model to ensure that it is technically functional and able to generate predictions. This is especially important if an organization's applications depend on the model and use it in real time. Secondly, it is important to monitor model performance to check whether the predictions it generates remain relevant.
Another important thing to watch for is data drift, which occurs when there is a significant difference between the data used to train the model and the data sent to the model during the prediction phase. There are many causes of data drift, including sensor issues, seasonality, changes in user behavior, and data quality issues related to the data source. Cloud platforms typically have built-in functionality for monitoring data drift, which can raise an alert that the model may need to be retrained with more relevant data. However, model retraining isn't required in all cases, so it is recommended to investigate and understand the cause of the data drift before plunging into retraining. The usual investigation is to collect production data and perform an EDA on it vis-a-vis the training data.
Retraining the model is never an entirely automated process. Before retraining, the data has to be examined for drift, or for a change that is already obvious, for example when a model needs to be deployed to a new region. Initially it is common for an organization to automate only a model's training and deployment, while the validation, monitoring and retraining steps are performed manually.
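One common way to put a number on data drift for a single feature is the Population Stability Index (PSI), which compares the binned distribution of training data with that of production data. The sketch below is an illustration of that general technique, not the method any particular cloud platform uses. The bin edges, the sample data and the 0.2 alert threshold are assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def bin_shares(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values falling into each bin defined by the sorted edges.

    A small floor replaces empty bins so that the logarithm stays defined.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v > e for e in edges)] += 1
    return [max(c / len(values), 1e-6) for c in counts]


def psi(
    training: Sequence[float],
    production: Sequence[float],
    edges: Sequence[float],
) -> float:
    """Population Stability Index: sum((actual - expected) * ln(actual / expected))."""
    expected = bin_shares(training, edges)
    actual = bin_shares(production, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical alert threshold; the right value depends on the use case.
DRIFT_THRESHOLD = 0.2

training_values = [float(v) for v in range(100)]
edges = [24.5, 49.5, 74.5]  # four equally populated bins on the training data

stable_production = list(training_values)
shifted_production = [v + 30.0 for v in training_values]

stable_alert = psi(training_values, stable_production, edges) > DRIFT_THRESHOLD
shifted_alert = psi(training_values, shifted_production, edges) > DRIFT_THRESHOLD
```

An alert like this only says that the distributions differ. Whether retraining is the right response still needs the kind of investigation described above.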
Final Thoughts
This has been a somewhat long retelling of ‘old wine in new bottles’, i.e., the data science process transitioned to the cloud as opposed to being carried out ‘on premise’ using servers and various laptops. It is clear that the cloud comes with significant advantages in terms of scalability, collaborative workspaces and preprogrammed modules for many key machine learning algorithms and processes. The con of the cloud based approach is that machine learning projects cannot simply be piloted quickly. There is a fair amount of infrastructure that needs to be arranged even for a simple pilot. Secondly, cloud based solutions also bring in requirements for additional roles such as data engineers and machine learning engineers.
References
This article is based on Microsoft documentation on the Azure platform for machine learning, data science, AutoML, deployment, etc. In the reference list, I have mentioned the pages that I referred to most.
- Team Data Science Process for Data Scientists. (n.d.). docs.Microsoft.com. Retrieved March 31, 2021, from https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/team-data-science-process-for-data-scientists
- Machine Learning Inference During Deployment. (n.d.). docs.Microsoft.com. Retrieved April 2, 2021, from https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/innovate/best-practices/ml-deployment-inference
- https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview
- https://docs.microsoft.com/en-us/azure/machine-learning/