Various Steps Involved in Building a Machine Learning Pipeline

Understanding the essential steps in machine learning helps you organize and focus your energy and resources on completing each step in the overall ML workflow.

Published in

Towards Data Science

8 min read · Jun 16, 2022

In machine learning, there is often confusion about how to build scalable and robust models that can be deployed in real time. What mostly complicates this is a lack of knowledge about the overall machine learning workflow. Understanding the steps in that workflow is especially handy for data scientists and machine learning engineers, as it saves considerable time and effort in the long run. In this article, we will go over the steps that are usually involved in building a machine learning system.

A good understanding of the principles behind the high-level design of an AI system helps you allocate time and resources to each piece of the puzzle before arriving at a robust, high-performance model that is put into production. Each step highlighted in this article is worth checking and monitoring to get the best model deployed in real time. Let us go over the steps in the lifecycle of a machine learning project.

Defining the Business Goal or Objective

The number of industries moving towards AI is increasing and is likely to keep growing, as companies profit from the tools and skills needed to reach human-level performance on various tasks. As a result, many projects now have goals and objectives that involve machine learning capabilities. It is important that the goals of the project are discussed and understood before building or extracting the data. By defining the problem and understanding the goals, we learn where ML is actually worth applying and where a task does not need to be automated with it. Consider the example of predicting whether a customer is going to exceed a daily transaction limit based on features such as the amount spent, the card usage, the time spent shopping and various other factors. This problem does not need an ML model, because the feature "the amount spent" is by itself a direct indicator of whether the customer exceeded the transaction limit. Training ML models here would be a considerable waste of resources and compute power, since the predictions would have little to no business impact. Defining the business goal or objective is therefore a powerful first step that should be taken before allocating the resources needed for the ML system.
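To make the transaction-limit example concrete, here is a minimal sketch of the kind of plain rule that could replace an ML model in that situation. The limit value, the customer names and the amounts are hypothetical and invented purely for illustration.

```python
# Hypothetical daily limit and transaction records, invented for illustration.
DAILY_LIMIT = 1000.0


def exceeds_daily_limit(amounts_spent, limit=DAILY_LIMIT):
    """Return True when the day's total spend is above the limit."""
    return sum(amounts_spent) > limit


# Each customer maps to the amounts they spent during one day.
customers = {
    "customer_a": [120.0, 80.5, 40.0],
    "customer_b": [600.0, 450.0],
}

# A single comparison answers the business question; no model is trained.
flags = {name: exceeds_daily_limit(spend) for name, spend in customers.items()}
print(flags)
```

If a simple rule like this already answers the question, the effort is better spent on problems where the outcome cannot be read directly from a single feature.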

Data Collection

We have now defined the business objective and have a concrete idea of what the ML models should predict. The next important step is to collect the information the models need in order to perform well on the key performance indicators (KPIs). These indicators vary depending on whether the problem we are trying to tackle is a classification or a regression task. Understanding these metrics and optimizing the models for them is important when deploying the models in real time. It is always worth checking the quality of the data: as a matter of principle, if the data given to the model does not exhibit any relationship with the target variable, we will most likely get poor performance on the KPIs we have defined. Gathering the most appropriate data for the problem at hand can therefore have a significant positive impact on the performance of the models.
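As a rough illustration of how KPIs differ between the two kinds of task, the sketch below computes precision and recall for a classification problem, and mean absolute error (MAE) and root mean squared error (RMSE) for a regression problem. The choice of these four metrics is an assumption made for the example, and the labels and prices are made-up values.

```python
import math


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mae_rmse(y_true, y_pred):
    """Mean absolute error and root mean squared error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Classification KPI example (made-up labels and predictions).
labels = [1, 0, 1, 1, 0, 0]
predicted_labels = [1, 0, 0, 1, 1, 0]
precision, recall = precision_recall(labels, predicted_labels)

# Regression KPI example (made-up prices and predictions).
prices = [10.0, 12.0, 15.0, 20.0]
predicted_prices = [11.0, 12.0, 13.0, 24.0]
mae, rmse = mae_rmse(prices, predicted_prices)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"MAE={mae:.2f} RMSE={rmse:.2f}")
```

Which metric becomes the KPI depends on the business goal: for instance, recall matters more when missing a positive case is costly.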

Exploratory Data Analysis

After getting the right data for predictions, it is time to explore whether there is any relationship between the features in the data and the output variable. Visualizations such as bar plots, scatterplots and count plots help a great deal in understanding and analyzing the data, and they also make it easier to explain the data to stakeholders. Furthermore, our data may contain a lot of missing values or outliers. Outliers can mislead the model into treating them as important, so that it often fails on real data once deployed in real time. Hence it is good practice to explore the data and check for outliers and missing values. Dealing with missing values is also crucial, as many ML models are not robust to them, and passing data with missing values to such models often raises errors. Various strategies can be employed to deal with missing values, such as mean, median or mode imputation, along with a few others.
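The checks described above can be sketched with the Python standard library alone. The example below counts missing values, flags outliers with the interquartile range (IQR) rule, and fills the gaps with the median. The IQR rule and the 1.5 multiplier are a common convention chosen here as an assumption, and the feature values are hypothetical.

```python
import statistics

# Hypothetical feature column; None marks a missing value.
amount_spent = [25.0, 30.0, None, 28.0, 32.0, 27.0, 500.0, None, 29.0, 31.0]

# Separate the observed values from the missing ones.
observed = [v for v in amount_spent if v is not None]
missing_count = len(amount_spent) - len(observed)

# IQR rule: values far outside the middle 50% of the data are flagged.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < lower or v > upper]

# Median imputation: the median is less affected by outliers than the mean.
median_value = statistics.median(observed)
imputed = [median_value if v is None else v for v in amount_spent]

print("missing:", missing_count)
print("outliers:", outliers)
print("imputed column:", imputed)
```

Whether a flagged value is an error or a genuine rare event is a judgment call that usually needs domain knowledge, so the code only reports outliers rather than dropping them.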

Data Preprocessing

We now have a good understanding of the overall goal of the ML project, as well as an intuition about the data from the visualizations we generated. As discussed in the previous section, the data may contain a lot of missing values or outliers, so it is time to deal with them before feeding the data to the ML models for predictions. After that, it is also important to standardize the data, as this is useful for many models. The presence of categorical features must be taken into consideration as well: they can be converted to numerical features using various feature encoding techniques. Once all of these steps are complete, it is time to train the models, which we will go over in the next part of this article.
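Here is a minimal sketch of two of the preprocessing steps mentioned above: standardization of a numerical feature and one-hot encoding of a categorical feature. One-hot encoding is only one of several possible encoding techniques and is picked here as an assumption; the feature names and values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def one_hot_encode(categories):
    """Return the column names and one 0/1 row per input category."""
    columns = sorted(set(categories))
    rows = [[1 if c == col else 0 for col in columns] for c in categories]
    return columns, rows


# Hypothetical numerical and categorical features.
amount_spent = [20.0, 40.0, 60.0, 80.0]
card_type = ["debit", "credit", "debit", "prepaid"]

scaled = standardize(amount_spent)
columns, encoded = one_hot_encode(card_type)

print([round(v, 3) for v in scaled])
print(columns)
print(encoded)
```

In a real project, the mean and standard deviation should be computed on the training data only and then reused for validation and live data, so that no information leaks from data the model has not seen.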

If you are interested in learning more about the preprocessing steps that can be applied to the data, feel free to take a look at my earlier article, where I cover them in great detail. The link is below.

What Are the Most Important Preprocessing Steps in Machine Learning and Data Science? | by Suhas Maddali | May, 2022 | Towards Data Science (medium.com)

Training Machine Learning (ML) Models

The data is now ready to be used by the ML models to make useful predictions. It is time to train the models and let them learn important representations from the data before they predict the outcome or target variable. We train various models to find the best-performing one, judged by the key performance indicators (KPIs) we defined earlier in the workflow. After training and performing hyperparameter tuning to get the best model, we finally deploy it in real time, which is the next step in the workflow.
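As a small sketch of training plus hyperparameter tuning, the example below tries several values of k for a k-nearest-neighbours classifier and keeps the one with the best accuracy on a validation set. The article does not prescribe a particular model, so k-nearest neighbours, accuracy as the KPI, and the toy data (including one deliberately mislabeled training point) are all assumptions made for illustration.

```python
def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return max(set(nearest_labels), key=nearest_labels.count)


def accuracy(train_points, train_labels, val_points, val_labels, k):
    """Fraction of validation points that are classified correctly."""
    hits = sum(
        knn_predict(train_points, train_labels, point, k) == label
        for point, label in zip(val_points, val_labels)
    )
    return hits / len(val_points)


# Toy training data: two clusters plus one noisy point labeled 1 at (2.0, 2.0).
train_points = [
    (1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (2.5, 2.5), (2.0, 2.0),
    (5.0, 5.0), (5.5, 6.0), (6.0, 5.5), (4.5, 4.5),
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

# Held-out validation data used only to compare hyperparameter values.
val_points = [(1.8, 1.9), (2.2, 2.1), (5.2, 5.1), (4.8, 5.0)]
val_labels = [0, 0, 1, 1]

# Odd values of k avoid tied votes when there are two classes.
scores = {
    k: accuracy(train_points, train_labels, val_points, val_labels, k)
    for k in (1, 3, 5)
}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

With k = 1 the noisy training point misleads the model, while larger values of k smooth it out. In practice a separate test set, untouched during tuning, would be kept for the final performance estimate.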

Deploying the Models

Photo by Nguyen Dang Hoang Nhu on Unsplash

We have trained various models and performed hyperparameter tuning (adjusting the settings of a model to get the best performance). It is now time to deploy the chosen model in real time and understand how it performs. While we do not know what the output label is going to be in real time, we should use our domain knowledge, and the expertise of others, to judge whether the model is on track and whether its predictions are what we expect from it. Any deviation from the desired performance can have a significant impact on the business value that the models create for the organization. That is why the model should be monitored constantly, which we are going to talk about in the next part of this article.

Monitoring the Performance

The final step of the workflow is to constantly monitor the model, see how well it is performing and check whether it is meeting expectations based on the KPIs. One might ask how we can determine performance without having the output label. That is a good question. In that case, we use our own domain knowledge and the expertise of others to find out whether the results match what is generally expected when a human analyst makes their guess about a particular outcome. If the results match those of a human analyst, it suggests that our model is performing quite well. On the other hand, there can be situations where the model performs poorly after being deployed. In this case, one solution is to retrain the model on data that reflects present patterns rather than just past ones, so that it can learn new representations. Without constant monitoring, situations such as concept drift or data drift can go unnoticed and cause the model to perform poorly at run time. Hence, constantly monitoring and evaluating the performance of the model is a good practice in the final stages.
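One simple way to watch for data drift without output labels is to compare a feature's live values against its training values. The sketch below measures how far the live mean has moved from the training mean, in units of the training standard deviation. This mean-shift check, the alert threshold of 2.0 and the sample values are all assumptions for illustration. It only looks at one feature's distribution, so it can hint at data drift but cannot detect concept drift, which needs labels or expert review.

```python
import statistics

DRIFT_THRESHOLD = 2.0  # hypothetical alert threshold, in standard deviations


def mean_shift_score(reference, live):
    """Shift of the live mean from the reference mean, in reference std units."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.pstdev(reference)
    return abs(statistics.fmean(live) - ref_mean) / ref_std


# Feature values seen at training time versus two hypothetical live batches.
training_amounts = [20.0, 25.0, 30.0, 35.0, 40.0]
live_stable = [22.0, 28.0, 31.0, 36.0, 38.0]
live_shifted = [55.0, 60.0, 65.0, 70.0, 75.0]

stable_score = mean_shift_score(training_amounts, live_stable)
shifted_score = mean_shift_score(training_amounts, live_shifted)

for name, score in (("stable", stable_score), ("shifted", shifted_score)):
    status = "possible drift" if score > DRIFT_THRESHOLD else "ok"
    print(f"{name}: score={score:.2f} -> {status}")
```

A batch that crosses the threshold is a signal to investigate and possibly retrain on more recent data, not proof that the model has degraded.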

Conclusion

I hope this article has given you a good intuition about the overall machine learning workflow and the steps described above. By learning the steps in the workflow, one can devote time and effort to the aspects of a project that need careful attention. Sometimes more emphasis is placed on training ML models than on the data preprocessing side of things. It should always be kept in mind that the quality of the data largely determines the quality of the outcomes and the performance of the models. Therefore, spending the right amount of time on each part of the pipeline is an efficient way to develop ML solutions and generate business impact and value from them. Thanks for taking the time to go through the article.

Below are the ways you can contact me or take a look at my work. Thanks.

GitHub: suhasmaddali (Suhas Maddali ) (github.com)

LinkedIn: Suhas Maddali, Northeastern University, Data Science | LinkedIn

Medium: Suhas Maddali — Medium

Thanks to Ben Huberman