
3 Reasons Why Data Scientists Should Learn Statistics Well

Without understanding the data, we can only be tool experts

Photo by Jack Hunter on Unsplash

Data science is an interdisciplinary field. To build a flourishing career, a data scientist needs a comprehensive set of skills that covers each of its building blocks.

One of these building blocks is statistics. Some even call machine learning glorified statistics. I do not completely agree with this argument, but machine learning and statistics are closely related.

The goal of data science is to create value out of data. The first requirement for accomplishing this goal is to understand the data very well. Statistics is arguably the most impactful tool for understanding, interpreting, and evaluating data.

In this article, we will go over 3 main reasons why a data scientist should have a comprehensive understanding of statistical concepts.


Know what you have

A successful product starts with understanding the data. We cannot just dump the raw data into a model and expect it to create meaningful results. A substantial amount of time in a typical workflow is spent on understanding the data.

Statistics helps us describe what we have with quantitative measures. Instead of browsing through a large amount of data, we can use a few measures to explain it in a sensible way.

Suppose we have data on a basketball player's three-point shots. The data contains the distance to the basket and the result of each shot. It is hard to make sense of such data just by looking at the raw values.

We can simplify this data using the following pieces of information:

  • The average number of points scored per shot
  • The standard deviation of the distance to the basket

With just two simple measures, we have an informative summary of the shots and the performance of the player. We can also use these measures to compare the performance of different players.
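
As a minimal sketch, these two measures could be computed with NumPy on a small, made-up set of shots (all values below are purely illustrative):

```python
import numpy as np

# Hypothetical three-point shot data for a single player:
# distance to the basket (in feet) and whether the shot was made (1) or missed (0).
distances = np.array([23.8, 24.5, 26.1, 23.9, 25.2, 27.0, 24.1, 25.8])
made = np.array([1, 0, 1, 1, 0, 0, 1, 0])

# Average points per shot (a made three-pointer is worth 3 points).
avg_points = 3 * made.mean()

# Standard deviation of the distance to the basket.
distance_std = distances.std()

print(f"Average points per shot: {avg_points:.2f}")
print(f"Std of distance to basket: {distance_std:.2f} feet")
```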

These quantitative measures are part of descriptive statistics since they are used to describe the data. Descriptive statistics are not limited to the mean and standard deviation.

The mean, median, and mode provide an overview of the distribution of the data. They are also called measures of central tendency. The standard deviation measures how much the individual values are spread out around the mean.

The distribution of a variable (e.g. normal distribution, binomial distribution) is also a very important concept in descriptive statistics. For instance, in the case of a normal distribution, we can learn a lot about the data from the mean and standard deviation alone.
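
For example, a quick simulation with made-up parameters illustrates the well-known 68–95 rule for normally distributed data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a roughly normally distributed variable, e.g. shot distances in feet.
values = rng.normal(loc=24.5, scale=1.2, size=10_000)

mean, std = values.mean(), values.std()

# For a normal distribution, about 68% of values fall within one standard
# deviation of the mean and about 95% within two.
within_1_std = np.mean(np.abs(values - mean) < std)
within_2_std = np.mean(np.abs(values - mean) < 2 * std)

print(f"Within 1 std: {within_1_std:.1%}")  # ~68%
print(f"Within 2 std: {within_2_std:.1%}")  # ~95%
```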


Go beyond what you have

Statistics not only helps us understand what we have but also lets us go beyond it. We can use statistics to infer meaningful results about the entire scope (i.e. the population) from a limited scope of data (i.e. a sample).

This part of statistics is known as inferential statistics. It allows us to extend the findings from the data at hand to a broader scope. This is of crucial importance because we usually do not have data for the entire population.

Suppose you work for a retail chain and are tasked with analyzing and comparing the sales patterns of the stores in two different countries. The entire scope would be all of the sales data generated since the stores opened. However, it is not feasible or affordable to collect and work with such a huge amount of data.

Instead, you take samples from both groups, analyze them, and compare the stores. Inferential statistics tells us whether the sample results generalize to the entire population.

Hypothesis testing, p-value, statistical significance, and z-score are some of the terms and concepts used in inferential statistics. A data scientist should have a comprehensive understanding of these concepts and be able to apply them.
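
As a hedged sketch of what such a comparison might look like in code, the snippet below runs a two-sample t-test on simulated sales samples (the figures are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical daily sales samples (in thousands) from stores in two countries.
sales_country_a = rng.normal(loc=52, scale=8, size=200)
sales_country_b = rng.normal(loc=50, scale=8, size=200)

# Two-sample t-test: is the difference between the average sales significant?
t_stat, p_value = stats.ttest_ind(sales_country_a, sales_country_b)

print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.4f}")

# With a 0.05 significance level, a p-value below 0.05 would lead us to reject
# the null hypothesis that the two countries have equal average sales.
```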

With inferential statistics, we can reach conclusions about a population based on our findings on a small scope of data. It is extremely important since we are likely to work with sample data instead of the entire population.


Machine Learning is not just about importing an algorithm

Machine learning is a part of data science. There are several machine learning algorithms that we use to learn from data.

In the case of supervised learning, we train an algorithm on labeled data and expect it to make predictions on new observations. Unsupervised learning algorithms provide insight into the underlying structure of the data or the relationships among the observations.

In both cases, processing the raw data properly is extremely important for getting reliable and accurate results. We cannot just dump the raw data into a ready-to-use algorithm and expect outstanding results.

The raw data might contain outliers that negatively affect the performance of a model. There might also be some missing values in the data. They need to be carefully handled to preserve the integrity of features.

How we perform these operations has a large impact on model performance. To handle them appropriately, we need strong statistical knowledge. For instance, we use statistical techniques to detect outliers. Similarly, an appropriate replacement for a missing value is determined with the help of statistics.
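
The exact techniques depend on the data, but as a minimal sketch, outliers can be flagged with the interquartile-range rule and missing values filled with a robust statistic such as the median (the values below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical feature with a missing value and one extreme observation.
df = pd.DataFrame({"daily_sales": [48.0, 51.5, 47.2, np.nan, 50.3, 49.8, 250.0]})

# Flag outliers with the interquartile-range (IQR) rule.
q1 = df["daily_sales"].quantile(0.25)
q3 = df["daily_sales"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = (df["daily_sales"] < lower) | (df["daily_sales"] > upper)

# Replace the missing value with the median, which is robust to the outlier above.
df["daily_sales"] = df["daily_sales"].fillna(df["daily_sales"].median())

print(df)
```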

Evaluating the results of a model is just as important as creating it. We cannot just look at a single metric and call the evaluation done. In fact, evaluation should be a dynamic and iterative process.

We evaluate the results to provide feedback for improving the model. For instance, it is of crucial importance to detect high bias or high variance in the results. The model is tuned or updated differently depending on the pattern of errors. Statistics helps us create a valuable and informative evaluation process.
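
As one possible illustration of this feedback loop, comparing the training error with the error on a held-out validation set is a simple way to spot high bias or high variance (the model and synthetic data below are only an example, not a prescribed setup):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, purely for illustration.
X, y = make_regression(n_samples=1_000, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained decision tree tends to overfit the training data.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
val_mse = mean_squared_error(y_val, model.predict(X_val))

# A large gap between the two errors points to high variance (overfitting);
# errors that are both high point to high bias (underfitting).
print(f"Training MSE: {train_mse:.1f}, Validation MSE: {val_mse:.1f}")
```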

Machine learning is not just about importing an algorithm and using it. We need to prepare and process the data appropriately, and the output of a model needs to be evaluated carefully. Both tasks require statistical knowledge, so it is a must-have skill for data scientists.


Data science is an interdisciplinary field, and statistics is an integral part of it and an absolute requirement for data scientists. Without a decent level of statistical knowledge, we can only be tool experts.

Thank you for reading. Please let me know if you have any feedback.

