
Basic Statistical Concepts Every Data Scientist Must Know

From a practical point of view, with Python implementations


Photo by Scott Graham on Unsplash

Statistics plays an important role in data science projects. It is very helpful for getting a better understanding of the data and for extracting insights. It is an important area, and every data scientist must have a clear understanding of the basic statistical concepts.

In this article, I will walk you through the basic statistical concepts that are frequently used in data science projects, the scenarios where they can be used, and their implementation using Python. The scripts and data used in this article can be found in the git repository here.

1. Sampling Techniques

Sampling is an important concept in statistics used to select a subset of data from a larger population, where the population refers to the entire data. To make this more intuitive, consider an e-commerce company that wants to better understand the interests of its customers. We could learn about the customers by asking them to participate in a survey, but it is neither feasible nor advisable to ask every customer to participate. Hence we need to come up with a target audience, and for this purpose a suitable sampling technique can be used. There are many sampling techniques; we are going to see some of the popular methods and their implementation using Python.

Simple Random Sampling

In simple random sampling, we select data points from the population purely at random. The problem with this method is that it is always possible to omit a category present in the original data, so the sample we choose might not be a good representation of the population.

Below is the implementation of simple random sampling using Python. We first compute the mean of the population so that it can be compared with the sample mean; when the sample mean is close to the population mean, it is reasonable to assume that the distribution of the sample is close enough to that of the population.
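A minimal sketch of the idea, assuming a hypothetical population DataFrame with a customer `segment` column and a `purchase_amount` column (the article's repository uses its own dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical customer data standing in for the dataset in the repository
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "segment": rng.choice(["new", "regular", "premium"], size=10_000, p=[0.5, 0.35, 0.15]),
    "purchase_amount": rng.exponential(scale=50, size=10_000),
})
print("Population mean:", round(population["purchase_amount"].mean(), 2))

# Simple random sampling: every row has an equal chance of being selected
simple_sample = population.sample(n=500, random_state=1)
print("Simple random sample mean:", round(simple_sample["purchase_amount"].mean(), 2))
```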

Systematic Sampling

In this method, we use a systematic approach to select the elements for the sample, such as picking every Nth element based on a column id or the date of record creation. The probability of ignoring a category is lower here, but there is a chance of over- or under-representing certain categories. Below is a simple implementation of this sampling technique using Python.
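A minimal sketch of systematic sampling, continuing with the hypothetical population DataFrame from the previous sketch:

```python
# Systematic sampling: select every k-th row of the (hypothetical) population
k = 20  # interval = population size / desired sample size (10,000 / 500)
systematic_sample = population.iloc[::k]
print("Systematic sample mean:", round(systematic_sample["purchase_amount"].mean(), 2))
```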

Cluster-based Sampling

This method addresses the issues we have with systematic sampling and simple random sampling: here we ensure that we include data points from all the categories and that we don't over- or under-represent any of them. The only issue is that, in reality, the data can be divided into categories in many different ways. Taking humans as an example, we can be grouped by gender, age, country, education, and so on. In such complex cases, cluster-based sampling might not work well for all the categories.
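One possible sketch of the idea described above is to sample the same number of records from every category so that no category is left out (again using the hypothetical population DataFrame):

```python
# Sample a fixed number of records from each segment so every category is represented
cluster_sample = (
    population.groupby("segment", group_keys=False)
    .apply(lambda g: g.sample(n=150, random_state=1))
)
print(cluster_sample["segment"].value_counts())
```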

Stratified Sampling

This is one of the most commonly used techniques; it ensures that the distribution of the population is retained in the sample. Below is the implementation using Python; after creating the sample, the script also compares its distribution with that of the original data.
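A minimal sketch of stratified sampling under the same assumptions, keeping each segment's share of the sample proportional to its share of the population:

```python
# Stratified sampling: sample the same fraction within each segment so the
# sample retains the population's category distribution
stratified_sample = (
    population.groupby("segment", group_keys=False)
    .apply(lambda g: g.sample(frac=0.05, random_state=1))
)

# Compare the category distributions of the population and the sample
print(population["segment"].value_counts(normalize=True))
print(stratified_sample["segment"].value_counts(normalize=True))
```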

That covers sampling and the popular methods with their implementation using Python. In general, the stratified sampling approach will be closest to the actual population.

2. Descriptive Statistics

Descriptive statistics provide a high-level analysis of the various features present in the dataset. Below are some of the commonly used descriptive statistics techniques, with their implementation using Python provided for reference.

Histogram

Histograms are useful for understanding the distribution of the features (the attributes in the dataset). Understanding the distributions is important because it plays a key role in selecting the algorithm for prediction. For example, linear regression relies on normality assumptions (strictly speaking, that the residuals are normally distributed), so heavily skewed features often call for a transformation. Below is the script to plot a histogram.
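A minimal plotting sketch, again using the hypothetical population DataFrame from the sampling section; any numeric column of your own dataset works the same way:

```python
import matplotlib.pyplot as plt

# Histogram of a single feature to inspect its distribution
plt.hist(population["purchase_amount"], bins=50, edgecolor="black")
plt.xlabel("purchase_amount")
plt.ylabel("Frequency")
plt.title("Distribution of purchase_amount")
plt.show()
```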

Central Tendency

The measures of central tendency are the mean, median, and mode; these values identify the central point of a distribution. Generally speaking, the mean is a good measure of the central point; for skewed data the median may work better, and for ordinal data the median or mode is a better option than the mean.

Image from codeburst.io, author Diva Dugar

Skewness and Kurtosis

Skewness is used to identify whether a distribution is symmetrical (as a normal distribution is) or asymmetrical (a skewed distribution). A skewness of zero means the distribution is symmetric. A negative value means the data is negatively skewed, i.e. there is a long tail on the left side of the distribution, and a positive value means there is a long tail on the right side.

A lot of financial data is positively skewed; the wealth of individuals, housing prices, and spending are a few examples. In all these cases, the median and the mean will be much higher than the mode because, although the majority of values fall around the mode, there are rare cases with extremely high values. While building a predictive model we can't ignore those values, but we also can't use those features as they are, since many predictive models don't work well with skewed data; hence a suitable data transformation is required.

Kurtosis, on the other hand, can be used to estimate the number of outliers in the data. An excess kurtosis of 0 corresponds to a normal distribution without many outliers; a high kurtosis value suggests that there could be many outliers in the data, and a low kurtosis means fewer outliers.

Both skewness and kurtosis are helpful for understanding the distribution better so that a treatment plan can be designed if any deviations are found in the dataset; for example, if the data is highly skewed and/or has a large number of outliers, a suitable transformation such as a log transformation can be applied. Below is the script to get the central tendency measures, skewness, and kurtosis.
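A minimal sketch of these measures on the hypothetical `purchase_amount` column; note that pandas reports Fisher's (excess) kurtosis, so values near 0 indicate a roughly normal shape:

```python
import numpy as np

feature = population["purchase_amount"]

# Central tendency
print("Mean:  ", feature.mean())
print("Median:", feature.median())
print("Mode of segment:", population["segment"].mode().iloc[0])  # mode suits categorical data

# Skewness and (excess) kurtosis
print("Skewness:", feature.skew())
print("Kurtosis:", feature.kurtosis())

# A log transformation is a common treatment for heavily right-skewed features
print("Skewness after log transform:", np.log1p(feature).skew())
```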

Variability Measures

Variability in statistics measures the extent to which the data is spread out. The different measures of variability are percentile values, standard deviation, and variance. These measures are very useful when making comparisons. To quote an example, let's say an e-commerce company is making some design changes to reduce the overall time customers take to complete a transaction. In this case, the variability measures before and after the changes are used to see whether the changes were successful.

In simple terms, say that before the design changes the time taken to complete a transaction was 10 minutes with a standard deviation of 2 minutes, and after the changes it is 9 minutes with a standard deviation of 3 minutes. To check whether this difference is significant, we use a suitable hypothesis test.

Also, when we use distance-based algorithms such as K-Means or KNN, the features are expected to be on a relatively similar scale, so the variability measures help here as well to check whether the data complies. Below is the implementation of the above variability measures using Python.
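A minimal sketch of the variability measures under the same assumptions:

```python
feature = population["purchase_amount"]

# Spread of the feature
print("Variance:          ", feature.var())
print("Standard deviation:", feature.std())

# Percentile values (25th, 50th, 75th and 95th)
print(feature.quantile([0.25, 0.50, 0.75, 0.95]))
```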

3. Relationship-Based Measures

The relationship-based statistical measures are correlation and covariance. I have seen a lot of people mistake correlation for causation. There is a big difference between them, and it is important for a data scientist to be aware of those differences.

Correlation

The correlation between two variables describes the relationship between them: when one variable increases, how does the other variable change? When we say two variables are positively correlated, it means that as one variable increases the other also increases. When the correlation between two variables is close to zero, there is not much of a relationship between them. Below are some examples of scatter plots with different levels of correlation.

Image Source – Pierce, Rod, 2019, 'Maths is Fun – Correlation', available here.

Causation

As mentioned earlier, many people assume correlation is causation: when variables ‘A’ and ‘B’ are highly correlated, they assume that A is causing B to occur, but that might not be the case. In fact, correlation doesn’t say anything about causation. A popular example: with an increase in temperature, the number of crimes increased and ice cream sales also increased. Here, ice cream sales and the number of crimes are positively correlated, but ice cream sales are not contributing to (causing) the increase in crime. On the other hand, the increase in temperature is causing people to buy ice cream and is also, in a way, contributing to the increase in crimes; that is causation.

So never take correlation for causation, even when it is tempting, because it is often easy to mistake a correlation for causation. Below is the implementation for finding correlation and covariance using Python.
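A minimal sketch using a small hypothetical dataset built around the ice cream example (the names and numbers are made up for illustration); the same corr()/cov() calls work on any numeric DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: sales rise with temperature, so the two are correlated
rng = np.random.default_rng(0)
temperature = rng.normal(loc=25, scale=5, size=365)
ice_cream_sales = 30 + 2 * temperature + rng.normal(0, 5, size=365)
df = pd.DataFrame({"temperature": temperature, "ice_cream_sales": ice_cream_sales})

print(df.corr())  # Pearson correlation, values between -1 and 1
print(df.cov())   # covariance, expressed in the units of the variables
```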

4. Distribution

It is good to know the different distributions, as this helps in better understanding the dataset and in choosing an appropriate prediction model. There are many distributions, but to start with you need to be aware of at least the ones below (a short sampling sketch follows the list).

Image from caladis.org
  • Normal Distribution – In a normal distribution, most of the observations are concentrated near the mean, the number of observations reduces as we move away from the mean, and the distribution is symmetric, i.e. the left side and the right side of the mean mirror each other.
  • Uniform Distribution – In a uniform distribution, the probability of occurrence is the same for all the options, like tossing a fair coin, where the probability of heads and of tails is both 50%.
  • Binomial Distribution – It describes the number of successes in a fixed number of independent yes/no trials. For example, the number of heads in 10 coin tosses can take values between 0 and 10, and each of them has a probability value.
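A minimal sketch that draws samples from each of the three distributions and plots them side by side (the parameters are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = {
    "Normal":   rng.normal(loc=0, scale=1, size=10_000),
    "Uniform":  rng.uniform(low=0, high=1, size=10_000),
    "Binomial": rng.binomial(n=10, p=0.5, size=10_000),  # heads in 10 coin tosses
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (name, data) in zip(axes, samples.items()):
    ax.hist(data, bins=30)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```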

5. Central Limit Theorem (CLT)

The Central Limit Theorem is a popular concept in statistics. As per the CLT, as we take more and more samples from a distribution, the averages of those samples tend towards a normal distribution, regardless of the actual population's distribution.

To mention a real-life use case for the Central Limit Theorem, consider predicting election results. If we ask one group to run a survey, the results might be skewed because their target audience might not be a good representation of the population; but when we ask multiple independent groups to run the survey and combine the results, they will be closer to the actual population, as the Central Limit Theorem suggests. Let's test the Central Limit Theorem using the code below, where we first create random data with a uniform distribution, then repeatedly pick samples and compute their means; as we increase the number of iterations, the sample means tend to form a normal distribution.
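A minimal sketch of the experiment described above (the sample size and iteration count are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Population with a uniform distribution
rng = np.random.default_rng(0)
uniform_population = rng.uniform(low=0, high=100, size=100_000)

# Repeatedly draw samples and record their means
sample_means = [rng.choice(uniform_population, size=50).mean() for _ in range(5_000)]

# The population is flat, but the sample means form a roughly normal distribution
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].hist(uniform_population, bins=30)
axes[0].set_title("Population (uniform)")
axes[1].hist(sample_means, bins=30)
axes[1].set_title("Distribution of sample means")
plt.tight_layout()
plt.show()
```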

What’s Next?

Once you are comfortable with these basic concepts, you can focus on inferential statistics, where you learn to use suitable hypothesis tests to draw conclusions that can't be extracted from descriptive statistics alone. Some of the statistical tests commonly used by data scientists are:

  • Z-Test
  • T-Test
  • F-Test
  • ANOVA
  • Chi-Squared Test

To learn more about the basic statistical concepts required for data science, check out my tutorial videos from the playlist below.

About Me

I am a data science professional with over 10 years of experience, and I have authored 2 books on data science. I write data science-related content with the intention of making it simple and accessible. Follow me on Medium. I also have a YouTube channel where I teach and talk about various data science concepts. If interested, subscribe to my channel below.

Data Science with Sharan

