Understanding the whole process from data gathering to model deployment.
Data Science – the business affair
The typical process of building a machine learning model involves data collection: getting the data from databases (via ETL) or from any other source. It then moves to analyzing the data, engineering and selecting features, building a model on that data, and finally analyzing the model's results. But what do you do once you have built the model? Where and how is it used to derive benefits?


Breaking into pieces
There are many ways to collect data, or rather, many sources of data. Sometimes it takes a long process (ETL from databases) to get the data relevant to the business problem, and sometimes the client hands you the data in an Excel, CSV, or other file format, though the latter is rare. Most of the time you need to connect to a database, or follow a full database-management process to get the data into a warehouse, before you can analyze it and build a model. I will talk about this process in detail in another post.
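As a minimal sketch of pulling data from a database into an analysis-ready table, here is an in-memory SQLite database standing in for a warehouse (the table and column names are made up for illustration):

```python
import sqlite3

import pandas as pd

# A tiny in-memory database standing in for the warehouse
# (table name "sales" and its columns are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 95.0)],
)
conn.commit()

# Pull the data into a DataFrame, the usual starting point for analysis
df = pd.read_sql("SELECT region, amount FROM sales", conn)
print(df.shape)  # 3 rows, 2 columns
```

In a real project the connection string would point at the production warehouse, but the shape of the workflow (query, then load into a DataFrame) stays the same.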
Once the data is cleaned and ready, we start by analyzing it, usually with basic insights and some sanity checks (datatypes, summary statistics, etc.). In data science, the term for this is exploratory data analysis (EDA). It can be done while retrieving the data, i.e. in the systems that serve as the data source, or through a tool like Python or R. I prefer Python.
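Those sanity checks are a few lines in pandas. A sketch on a small made-up dataset (the column names are illustrative):

```python
import pandas as pd

# A small stand-in dataset; "age" deliberately has a missing value
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "income": [50_000, 64_000, 58_000, 72_000],
})

# Typical EDA sanity checks
print(df.dtypes)         # datatypes of each column
print(df.isna().sum())   # count of missing values per column
print(df.describe())     # summary statistics for numeric columns
```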
From here we move towards the business end of the process, starting with feature engineering: missing value imputation, outlier detection, variable transformation, and so on. Feature selection, which means picking the most relevant features (features = variables) out of all those in the dataset, is also a pivotal part of this stage. Together, these two steps give us what we call "featurized" data. Feature engineering and feature selection are large and time-consuming topics in their own right, so for now it's important to stay focused on the overall process.
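To make those steps concrete, here is one possible sketch of imputation, outlier handling, transformation, and a simple manual feature selection on a toy dataset (all values and column names are invented; real projects would use more principled selection methods):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0, 29.0],
    "income": [50_000.0, 64_000.0, 58_000.0, 1_000_000.0, 61_000.0],
    "id": [1, 2, 3, 4, 5],  # an irrelevant column we will drop
})

# Missing-value imputation: fill gaps with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Outlier handling: cap extreme incomes at the 95th percentile
cap = df["income"].quantile(0.95)
df["income"] = df["income"].clip(upper=cap)

# Variable transformation: log-scale the skewed income column
df["log_income"] = np.log1p(df["income"])

# Feature selection: keep only the columns relevant to modelling
features = df[["age", "log_income"]]
```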
Finishing Touch
Having gotten the "featurized" data, we amble over to the model building and evaluation stage: building different machine learning models and selecting the one that gives the best results (the training and testing phase). Once we have the model, what do we do after that?
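A minimal sketch of that train-test-compare loop, using scikit-learn on synthetic data (the candidate models here are arbitrary examples):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic "featurized" data standing in for the real dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a few candidate models and keep the one with the best test accuracy
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
```

In practice the comparison would also use cross-validation rather than a single split, but the selection logic is the same.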
You have the final model, approved by the team, and now you need to productionize it. What does productionization mean? It means deploying the model into a production system, a real-time scenario. All the necessary tests have been performed in the research stage; now we need to deploy the model to put that research to use.
We train our model offline, on the given data. The deployed model then runs in a real-time scenario, where it takes continuous inputs from the real world and, depending on the type of model (regression, classification, etc.), produces the corresponding outputs.
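The offline/online split can be sketched with serialization: train once, save the fitted model, then load it in the serving process and call it on each fresh input as it arrives (pickle is used here for brevity; a real deployment would version and store the artifact more carefully):

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Offline: train on historical data
X, y = make_regression(n_samples=100, n_features=3, random_state=0)
model = LinearRegression().fit(X, y)

# Serialize the trained model, as you would before shipping it
blob = pickle.dumps(model)

# Online: the saved model is loaded once, then called repeatedly
# on fresh inputs as they stream in from the real world
live_model = pickle.loads(blob)
incoming = X[:1]                      # one "real-time" input row
prediction = live_model.predict(incoming)
```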
At this point, I think it is important that we discuss the difference between a machine learning pipeline and a machine learning model because we deploy not just the model but the entire pipeline.
A machine learning pipeline encompasses all the steps required to get a prediction from data (the steps I mentioned above). A machine learning model, however, is only one piece of this pipeline (the model building part). While a model describes a specific algorithm or method for using patterns in data to generate predictions, a pipeline outlines all the steps involved in a machine learning process, from gathering data to acquiring predictions.
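scikit-learn makes this distinction literal: a `Pipeline` chains the preprocessing steps and the model into a single object that can be fitted, evaluated, and deployed as one unit. A small sketch (the specific steps chosen are just examples):

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# The pipeline bundles preprocessing and the model together;
# the LogisticRegression step alone is "the model"
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Deploying `pipe` rather than just the classifier guarantees that live inputs pass through exactly the same imputation and scaling as the training data did.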
How you put your model into a production environment, and all the necessary details, will be discussed in subsequent posts.