
Approaching Data Science Projects like a Scientist

Using the Scientific Method approach in a Data Science Project


Introduction

A scientist uses the "Scientific Method" to conduct research. The term refers to a process that begins with an observation and then proceeds through asking questions, forming hypotheses, carrying out experiments, analyzing the results, and drawing conclusions.

The scientific method is not only for scientists running experiments. It can also be used in everyday life to solve problems: we learn to ask questions, collect information, analyze the evidence, use logic to draw conclusions, and make better decisions.

In this article, we will explore what the scientific method is and how it can be used in data science projects.

What is the Scientific Method?

The Scientific Method is commonly credited to Sir Francis Bacon, who introduced it at a time when people began to perform studies through microscopes. Bacon created the scientific method for philosophers to ensure the truthfulness of their results, and argued that philosophers must doubt their claims before confirming them as true. The scientific method consists of a series of steps: making an observation, asking questions, forming a hypothesis, performing experiments to test the hypothesis, analyzing the results, and determining whether the hypothesis is correct or incorrect.

"The approach that science uses to gain knowledge based on observation, formulating laws and theories, testing theories or hypothesis through experimentation."

Now, let's take a deeper look at the "Scientific Method" from a data science perspective.

Understanding the Scientific Method

Step 1: Observe & Understand

Before starting a machine learning project, we first need to observe, do some research, and understand the business that the project is part of. At this stage, you may or may not be given a data set. If no data is provided, you will need to look into similar cases, do research, and search for existing data to build a better understanding. If data is provided, observe it, explore it, and identify which information is relevant to your use case.

Eg: A telco company approaches us to build a machine learning model that identifies customers who are likely to churn in the following month. What should we observe and understand at this stage?

  • Understand the big picture of the use case, the business problem, and the goal/outcome.
  • Communicate with stakeholders to understand more about the telco industry.
  • Understand the previous approach taken to identify and prevent customer churn.
  • Explore the data to have a better understanding of the information available.
  • Observe if there are any patterns that can be seen in the data set.
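
The last two bullets can be sketched in a few lines of code. The snippet below is only an illustration under assumptions: the customer records, the column names (`plan`, `region`, `churned`), and the helper `profile` are all made up, and a real project would likely use a data analysis library instead. It counts the missing values in each column and lists how often each value occurs, which is often enough to spot a first pattern.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
customers = [
    {"customer_id": 1, "plan": "A", "region": "city", "churned": True},
    {"customer_id": 2, "plan": "B", "region": None, "churned": False},
    {"customer_id": 3, "plan": "A", "region": "rural", "churned": True},
    {"customer_id": 4, "plan": None, "region": "city", "churned": False},
]


def profile(rows):
    """Return, per column, the number of missing values and the value counts."""
    summary = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        summary[col] = {
            "missing": len(values) - len(present),
            "counts": Counter(present),
        }
    return summary


for col, info in profile(customers).items():
    print(col, "| missing:", info["missing"], "| values:", dict(info["counts"]))
```

In this toy data, both churned customers are on plan A, which is the kind of observation that feeds the questions in the next step.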

After observing, understanding, and exploring, questions will start popping up in your head, which moves us to the next step of the scientific method: Begin with a Question.

Step 2: Begin with a Question

In this stage, formulate questions based on the observation, understanding, and exploration done in the previous step. Listing the questions clarifies which areas can be answered with the information currently available. Start with the seven question words: "What", "Why", "How", "Which", "When", "Who", and "Where".

Eg: What are the possible questions?

  • What is the churn rate?
  • Are there unusual behaviors or signs before a customer churns?
  • Do most of the customers that churn hold similar mobile plans?
  • Which packages have the lowest churn rate?
  • Which customer will leave next month?
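
Some of these questions can be answered directly from the data. As a rough sketch, assuming each customer is reduced to a hypothetical `(plan, churned)` pair, the snippet below computes the overall churn rate and finds the plan with the lowest churn rate. The data and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (plan, churned) pairs for last month's customers.
records = [
    ("A", True), ("A", True), ("A", False),
    ("B", False), ("B", True), ("B", False), ("B", False),
    ("C", False), ("C", False),
]


def churn_rate(rows):
    """Share of customers in `rows` who churned."""
    return sum(1 for _, churned in rows if churned) / len(rows)


def churn_rate_by_plan(rows):
    """Map each plan to the churn rate of its customers."""
    grouped = defaultdict(list)
    for plan, churned in rows:
        grouped[plan].append((plan, churned))
    return {plan: churn_rate(group) for plan, group in grouped.items()}


overall = churn_rate(records)
by_plan = churn_rate_by_plan(records)
lowest_plan = min(by_plan, key=by_plan.get)

print(f"Overall churn rate: {overall:.1%}")
print(f"Plan with the lowest churn rate: {lowest_plan}")
```

Questions such as "Which customer will leave next month?" cannot be answered by counting alone, which is where the hypotheses and the model come in.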

Once we have a preliminary list of questions and a better understanding of the use case, the questions are translated into hypotheses.

Step 3: Develop Hypothesis

In a machine learning project, hypothesis generation is like making an "educated guess" about the factors that can affect the problem the business is trying to solve. Developing hypotheses is an important part of a data science project: it helps identify the information or drivers that could make the model predict more accurately, and it shows what additional data needs to be collected, or what data to collect in the first place if none was provided in the earlier stages.

Eg:

  • Customers who churn often belong to a similar mobile plan.
  • Customers who churn live outside the city, where the connection is weaker.
  • Senior citizens tend to have a higher churn rate than working adults.

As a data scientist, you need to think critically and have enough domain expertise to identify the different factors and generate relevant hypotheses, so that the model leads to a successful outcome.

The next step after hypothesis generation is to perform experiments and test the hypothesis.

Step 4: Perform Experiments to Test Hypothesis

In this stage, you will collect relevant data based on the hypotheses generated and build exploratory charts to check whether the data supports each hypothesis before adding it as a factor/driver when training the machine learning model.

For example, after formulating the hypothesis "Customers who churn often belong to a similar mobile plan", we need to look into the data set to determine whether the data supports it. This can be done by tabulating the results in a table or by building a visualization to get a better view.

The chart above shows that the churn rate is highest for customers in Plan A, which aligns with our hypothesis. Therefore, including each customer's mobile plan in the model training data set could lead to better prediction results. If, on the other hand, the data does not support a hypothesis, the hypothesis should not be accepted.
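
The tabulation described above can be sketched as follows. The counts are invented (they only mirror the pattern of Plan A having the highest churn rate) and the plan names besides Plan A are assumptions. The article itself relies on a table or chart; the Pearson chi-square statistic computed here is one possible extra check of whether the differences between plans are larger than chance would explain, not something the original analysis prescribes.

```python
# Hypothetical counts per plan: (churned, stayed).
observed = {
    "Plan A": (40, 60),
    "Plan B": (15, 85),
    "Plan C": (10, 90),
}


def chi_square_statistic(table):
    """Pearson chi-square statistic for a plans x (churned, stayed) table."""
    total_churned = sum(churned for churned, _ in table.values())
    total_stayed = sum(stayed for _, stayed in table.values())
    grand_total = total_churned + total_stayed
    statistic = 0.0
    for churned, stayed in table.values():
        row_total = churned + stayed
        # Counts we would expect if churn were unrelated to the plan.
        expected_churned = row_total * total_churned / grand_total
        expected_stayed = row_total * total_stayed / grand_total
        statistic += (churned - expected_churned) ** 2 / expected_churned
        statistic += (stayed - expected_stayed) ** 2 / expected_stayed
    return statistic


# Critical value of the chi-square distribution for
# (3 plans - 1) * (2 outcomes - 1) = 2 degrees of freedom at the 5% level.
CRITICAL_VALUE_DF2 = 5.991

for plan, (churned, stayed) in observed.items():
    print(f"{plan}: churn rate {churned / (churned + stayed):.0%}")

statistic = chi_square_statistic(observed)
supported = statistic > CRITICAL_VALUE_DF2
print(f"chi-square = {statistic:.2f}, plan and churn look related: {supported}")
```

A statistic above the critical value suggests that churn and mobile plan are related in this toy table, which would justify keeping the plan as a factor/driver.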

This step is repeated for each of the hypotheses generated: we test and verify each hypothesis before adding it as a factor/driver for model training.

The next step after hypothesis testing is to analyze and interpret the results, which also means evaluating our model's performance.

Step 5: Analyze & Interpret Results

In this stage, we evaluate the performance of the model trained on the factors verified in the previous step. A machine learning model typically goes through iterative rounds of tuning, but a set of strong hypotheses that are tested and supported by the data will lead to better model performance. After training, the model's performance is validated against the validation set and the holdout set. (If you would like to understand more about the difference between a validation set and a holdout set, refer to this article.)
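
As a minimal sketch of how the three sets can be produced, the snippet below shuffles the records and cuts them into train, validation, and holdout parts. It assumes one common convention, in which the validation set is used while tuning and the holdout set is kept aside for a final check. The 60/20/20 proportions, the seed, and the function name `split_dataset` are arbitrary choices for illustration.

```python
import random


def split_dataset(rows, validation_share=0.2, holdout_share=0.2, seed=42):
    """Shuffle `rows` and split them into train, validation, and holdout lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_share)
    n_validation = int(len(shuffled) * validation_share)
    holdout = shuffled[:n_holdout]
    validation = shuffled[n_holdout:n_holdout + n_validation]
    train = shuffled[n_holdout + n_validation:]
    return train, validation, holdout


customer_ids = list(range(100))  # stand-in for real customer records
train, validation, holdout = split_dataset(customer_ids)
print(len(train), len(validation), len(holdout))
```

Fixing the seed keeps the split reproducible, so that changes in model performance between tuning rounds are not caused by a different split.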

There are several questions to answer while validating the model performance, such as:

  • How accurate is the model in identifying customers who churn? What is the error rate?
  • Is there any new hypothesis that can be tested and added as a feature to the model?
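
The first question can be made concrete with a small sketch. The labels and predictions below are made up, and the helper names are hypothetical. Besides accuracy and error rate, the snippet also reports precision and recall, which the article does not mention but which are often useful because customers who churn are usually a minority, so accuracy alone can look good even when few of them are found.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, with churn as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    return tp, tn, fp, fn


def evaluate(actual, predicted):
    """Return accuracy, error rate, precision, and recall as a dict."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical validation-set labels (True = churned) and model predictions.
actual = [True, True, True, False, False, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False, False, False]

for name, value in evaluate(actual, predicted).items():
    print(f"{name}: {value:.2f}")
```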

Once we have a finalized model that answers the problem statement, the next step is to draw a conclusion and communicate the results back to the stakeholders.

Step 6: Communicate

The final step is to communicate the results in an understandable format and provide recommendations that the stakeholders can act on, for example, a list of customers who are likely to churn next month together with the actions the marketing team can take based on this information.
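
A list like the one mentioned above could be produced along these lines. This is only a sketch: the customer IDs, the predicted probabilities, and the 0.5 threshold are invented, and in practice the threshold would be agreed with the stakeholders.

```python
# Hypothetical model output: customer id -> predicted churn probability.
predicted_probability = {
    "C001": 0.91,
    "C002": 0.12,
    "C003": 0.67,
    "C004": 0.48,
    "C005": 0.83,
}


def likely_churners(scores, threshold=0.5):
    """Return (customer, probability) pairs at or above `threshold`, riskiest first."""
    flagged = [(cid, p) for cid, p in scores.items() if p >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


for customer_id, probability in likely_churners(predicted_probability):
    print(f"{customer_id}: {probability:.0%} estimated churn risk")
```

Sorting by risk lets the marketing team start with the customers who are most likely to leave.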

A great way to present your findings to the audience in an understandable format is through visualizations with interactive charts or presentation slides.

Conclusion

Data science is not just about different machine learning models (deep learning, neural networks, clustering, etc.); it also involves taking a scientific approach. Just as in science experiments, the scientific method in data science helps us avoid reaching a wrong conclusion and building a model that is biased by our assumptions. Organizations that adopt the scientific method in their projects will be better able to make sense of their data, which eventually leads to more robust models and better decisions.



References & Links:

[1] https://openstax.org/books/biology/pages/1-1-the-science-of-biology

[2] scientific method. Oxford Reference. Retrieved 1 Nov. 2021, from https://www.oxfordreference.com/view/10.1093/oi/authority.20110803100447727.

[3] https://searchbusinessanalytics.techtarget.com/feature/The-data-science-process-6-key-steps-on-analytics-applications

