
Life Cycle of Data Science


An overview of the various stages of a Data Science project

Up-skilling is an inevitable part of today’s world, whether to kick-start a career or to move ahead to another phase, and well-planned skill enhancement always pays off. Before jumping into any technology or field of study, it is necessary to do the groundwork and gain awareness of what lies ahead. One of the best ways to do that is to get a grip on the end-to-end process. A firm idea of where we start and where we finish maps out the road for the journey. It creates a smooth learning path and provides an opportunity to set short-term goals and milestones. Data Science as a field of study is no different.

The project life cycle of Data Science consists of six major phases. Each has its own significance.

1. UNDERSTANDING THE PROBLEM STATEMENT

The first and probably the most important step is to understand the business problem. This requires constant communication and careful listening. If you are new to the field, expect the problem statement to be far less simple than the ones encountered while learning the concepts; in the real world, the complexity of the problem statement increases manyfold. Understanding it is imperative both to fulfill the business needs and to give the data scientist a clear end goal. Usually, there are three types of firms in the field of Data Science/Analytics:

  • Captive Analytics Firm: There are no actual clients, but a problem statement is already formulated. The firm aims to work on it constantly and improve on it.
  • Non-Captive Analytics Firm: These firms look for clients to whom they can provide their analytics services. The problem statement needs to be formulated well by the clients.
  • Product Based Analytics Firm: These firms have neither clients nor a problem statement to begin with. They focus on building an analytics tool that will be sold to clients who need it. The primary focus is to build an extensive product/tool that satisfies multiple clients.

2. DATA COLLECTION

Data acquisition, or data collection, is the next step. Data is the starting point for solving the problem. It is a combination of information and noise, and the aim is to work with the information while filtering out the noise. Basically, there are two types of data:

  • Primary Data: Raw data that is usually obtained through surveys or questionnaires. It is first-hand data that we can use directly.
  • Secondary Data: Data that has already been collected and published but is still unprepared.

3. UNDERSTANDING THE DATA

This point is more of a consequence of the first point. In order to understand the data well, one needs to pay undivided attention to the problem statement. The data points construct a smooth road for solving the problem. This step involves getting familiar with the different variables in the data set, their nature, and the impact each has on reaching the end result. By doing so, the priorities are set, and working with only the relevant data makes the job that much easier.
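Getting familiar with the variables can start with a very small profiling pass. The following is a minimal sketch using only the Python standard library; the `records` data set, its column names, and the `profile_columns` helper are all hypothetical and exist only for illustration.

```python
# Hypothetical toy data set: each dict is one record, None marks a missing value.
records = [
    {"age": 34, "city": "Pune", "income": 52000.0},
    {"age": 29, "city": "Delhi", "income": None},
    {"age": 41, "city": "Pune", "income": 61000.0},
]


def profile_columns(rows):
    """Summarise each variable: observed types, missing values, distinct values."""
    profile = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [value for value in values if value is not None]
        profile[column] = {
            "types": sorted({type(value).__name__ for value in present}),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return profile


for name, summary in profile_columns(records).items():
    print(name, summary)
```

A summary like this makes it easier to decide which variables are relevant to the problem statement and which ones need attention before any analysis.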

4. DATA PREPARATION

Data preparation is the phase where one can understand what is actually happening in the data. Technically speaking, this is where one performs Exploratory Data Analysis. As the term suggests, the aim is to explore the given data. Understanding the data also means representing it in an understandable way, and an efficient way to do that is to plot it as graphs so that it can be understood visually. There are broadly two ways of analyzing the given data:

  • Uni-variate Analysis: The process of analyzing a single variable. This method determines the behavior and properties of that particular variable.
  • Multivariate Analysis: The process of analyzing two or more variables together; with exactly two variables it is called bi-variate analysis. It is generally used to determine the relationship between the variables, which can hint at a cause-and-effect relation but does not prove one by itself.
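The two kinds of analysis above can be sketched in a few lines. This is only an illustration using the Python standard library: the two variables (`hours_studied`, `exam_score`) are made-up values, and the Pearson correlation coefficient is one assumed choice of relationship measure among many.

```python
import math
import statistics

# Hypothetical toy data: two numeric variables for the same five records.
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [52.0, 55.0, 61.0, 70.0, 72.0]


def univariate_summary(values):
    """Uni-variate: describe one variable by its centre, spread, and range."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson_correlation(xs, ys):
    """Bi-variate: strength of the linear relationship, between -1 and 1."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


print(univariate_summary(hours_studied))
print(pearson_correlation(hours_studied, exam_score))
```

A correlation close to 1 or -1 signals a strong linear relationship between the two variables; it does not, on its own, say which one drives the other.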

5. DATA MODELLING

This is the penultimate step and probably the less time-consuming one. Since 60–70% of the work is done in understanding and preparing the data, the job here is to fit the data to various algorithms and find the one that works best for the problem statement. The major division in this step is between two approaches:

  • Supervised Learning: A learning model built on data that contains independent variables (inputs) and the dependent variable (output). Because the output is known, there is something to cross-check the result against. Popular methods of Supervised Learning are Regression and Classification.
  • Unsupervised Learning: It is quite the opposite of Supervised Learning. The model is built on data that has only independent variables (inputs) and no dependent variable (output). This is mainly done to group the data and find patterns. Popular methods of Unsupervised Learning are Clustering, Dimensionality Reduction and Association Rule Mining.

These models are time-invariant, i.e., they do not depend on a time factor. However, there is a different set of model building/prediction methods that does depend on time, known as Time Series Analysis.
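To make the supervised/unsupervised split concrete, here is a minimal sketch in plain Python. It assumes a least-squares line fit as the example of Regression and a tiny one-dimensional two-cluster k-means as the example of Clustering; the function names and data are hypothetical, and a real project would more likely rely on an established library.

```python
import statistics


def fit_line(xs, ys):
    """Supervised: learn slope and intercept from inputs xs and known outputs ys."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (1-D k-means, k=2)."""
    centers = [min(values), max(values)]  # simple deterministic starting points
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for value in values:
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            groups[nearest].append(value)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [statistics.mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical labeled data: outputs are known, so the fit can be checked.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, intercept)

# Hypothetical unlabeled data: no outputs, only structure to discover.
centers, groups = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])
print(centers, groups)
```

The regression has known outputs to compare against, while the clustering can only be judged by how sensible the discovered groups look.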

6. MODEL EVALUATION

The last important step in the life cycle is model evaluation. Once the model is built, its quality is measured by evaluating it with different techniques, generally by putting a quantitative measure on it. The confusion matrix, the classification report, loss functions and error metrics are some of the measures used to evaluate a model. The benchmark the model must meet depends on the stakeholders. If the model is not good enough, it can be reworked by tracing back to the previous phases. Only models with acceptably low error that are well weighted are eligible to move further.
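As an illustration of putting a quantitative measure on a model, the sketch below computes a binary confusion matrix, the accuracy, precision and recall that a classification report typically contains, and mean squared error as one example of an error metric. The labels and predictions are made up, and the helper names are hypothetical rather than taken from any particular library.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["tp"] += 1
        elif p == positive:
            counts["fp"] += 1
        elif a == positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def classification_scores(cm):
    """Derive accuracy, precision, and recall from the confusion matrix counts."""
    total = sum(cm.values())
    predicted_positive = cm["tp"] + cm["fp"]
    actual_positive = cm["tp"] + cm["fn"]
    return {
        "accuracy": (cm["tp"] + cm["tn"]) / total,
        "precision": cm["tp"] / predicted_positive if predicted_positive else 0.0,
        "recall": cm["tp"] / actual_positive if actual_positive else 0.0,
    }


def mean_squared_error(actual, predicted):
    """Average squared difference, a common error measure for regression."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(classification_scores(matrix))
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))
```

Which of these numbers counts as "good enough" is exactly the benchmark that has to be agreed with the stakeholders.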

The aforementioned six phases are the important ones; however, there are other phases too that work as part of the life cycle:

  • Output interpretation
  • Model Deployment
  • Monitoring
  • Maintenance and Optimization (if necessary)

We have reached the end. This was a high-level explanation of the life cycle of Data Science.

Feel free to connect, discuss, exchange knowledge with me on LinkedIn.

