Introduction: what am I talking about?
In this article I argue that thinking a little about statistics can save the data scientist time and the company that employs them money. Although I do assume that the reader has some working knowledge of statistics (I’m writing with the data scientist in mind), I try to explain the reasoning in the simplest terms I can think of. The assumption is there only to keep the text short: it spares me from explaining in detail what an average or a standard deviation is, how we calculate them, and what the shape of a normal distribution looks like.
I’m perfectly aware that statistics is hard to grasp. So, I use my favorite rhetorical tool to clarify ideas: the working example. By making matters simple and tangible with an example, I try to justify why statistical inference is done the way it is, and of course why that makes sense, without relying on a bunch of mathematics.
After the example, I draw some conclusions by directly applying the rationale of the working example to a machine learning context. In particular, I argue that if we have a random variable, whatever it is, we can use basic statistics to save time and money. Researchers have been doing statistical inference for decades in critical fields such as engineering and medicine. It has proved its value on the battlefield, so to speak, and in most cases I honestly don’t see the advantage of trading these methods for some expensive computational routine.
That being said, let’s go to work.
A working example
To illustrate the ideas behind statistical inference and how we can use them, imagine the following scenario. We both work at an online education company, like Coursera or Udemy, and the company wants to reward its better teachers. This is crucial because companies such as these obviously benefit from good courses that attract more and more students.
So, as data scientists (emphasis on scientists), we do an experiment: we write a course curriculum on Python programming and tell teachers A and B to implement it. We randomly assign 1,000 people to take the course with teacher A and another 1,000 to take it with teacher B. Then we apply the same test to every student enrolled in each course and measure their grades on a scale from 0 to 100.
Now, let’s make some assumptions about the design of our experiment to make matters simple:
- People will do the course at the same time
- People won’t share knowledge with each other nor have any contact with Python programming outside the courses
- People are similar in every respect, so that differences in grades are due solely to teacher ability and random chance. Think of these people as 1,000 pairs of identical twins
- People are honest when taking the test, and they try as hard as possible to get the maximum grade of 100
In this context, we have a database of 2,000 rows and 3 columns. The first is the student’s ID, the second their teacher and the third their grade, like this:
| student_id | teacher | grade |
|------------|---------|-------|
| 1          | A       | 65    |
| 2          | A       | 89    |
| ...        | ...     | ...   |
| 2000       | B       | ...   |
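For readers who want to follow along in code, here is a minimal sketch of how such a dataset could be simulated with pandas and NumPy. The column names and the grade-generating parameters are my own assumptions, chosen to mirror the summary numbers that appear later in the article; the real data would of course come from the experiment itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000  # students per teacher

# Assumed parameters, picked to mirror the averages and standard deviations
# reported further down (teacher A: 76 +/- 20, teacher B: 54 +/- 30).
# Real grades are bounded between 0 and 100; the plain normal keeps the sketch simple.
grades_a = rng.normal(loc=76, scale=20, size=n)
grades_b = rng.normal(loc=54, scale=30, size=n)

df = pd.DataFrame({
    "student_id": np.arange(1, 2 * n + 1),
    "teacher": ["A"] * n + ["B"] * n,
    "grade": np.concatenate([grades_a, grades_b]),
})
print(df.head())
```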
Now, let’s think about statistics using this data.
Random Variables and Probability Distributions
Because the 2,000 students in our sample are exactly alike, the differences in grades among students taking the course with the same teacher are purely random. Maybe John got 65 on the test instead of Bob’s 89 because he hit his toe on the edge of the door that morning and was distracted by the pain during the test. In this case, the distribution of grades under a specific teacher, say teacher A, is the famous bell-shaped curve called the normal (or Gaussian) distribution.
So, we have two potentially different bell curves: one for teacher A and one for teacher B. Now we can describe those distributions quantitatively. In particular, we want to know two things about each of them: what the most typical value is and how far we expect any given value to be from it. For a normal distribution, the most typical value is the famous average (or mean), and the measure of how far values tend to stray from it is the standard deviation. So, the larger the standard deviation, the farther away from the average we should expect a typical value to be. In practice, a larger standard deviation translates into a wider range of values that are likely to occur.
So we open up our spreadsheet app and do those calculations. We find that teacher A’s average score is 76 with a standard deviation of 20 points. For teacher B, the average is 54 and the standard deviation is 30 points. That is all we need to decide which teacher is better based on this particular experiment. Therefore, we are ready to do some statistical inference.
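Assuming the simulated DataFrame from the sketch above, the spreadsheet step is a one-liner in pandas (on the real data, the result would be exactly the figures quoted here):

```python
# Average, standard deviation and count of grades per teacher.
summary = df.groupby("teacher")["grade"].agg(["mean", "std", "count"])
print(summary)
# With the assumed parameters this lands close to the article's numbers:
# teacher A: mean ~76, std ~20; teacher B: mean ~54, std ~30.
```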
Statistical Inference
Statistical inference tries to answer the questions that keep every empirical researcher up at night: are my results just luck of the sampling? If I drew new samples and repeated my study, would I arrive at the same conclusion? If so, how often?
To answer these questions, we can do a simple test. It rests on two very simple statements. The first is that the average value is a good summary of the information, meaning that the teacher with the higher average grade is the better teacher. The second is that we should account for the uncertainty inherent in our measurement of the average in order to reach a fair conclusion and be right most of the time.
So, many people would say "Well, just take two standard deviations below and above the average for each teacher and call it a day". This is wrong, although not far from the correct answer. First, let’s assume that we are comfortable with being right 95 times out of 100 if we repeated this experiment over and over. This is our confidence level (95%). Now, for a normal distribution, covering that much probability requires an interval of about 1.96 standard deviations on either side of the average, hence the famous 2-standard-deviations rule.
If we do this, we are describing the interval in which 95% of individual grades would fall for each teacher every time we ran the experiment. For example, for teacher A we would expect 95% of the grades to fall between roughly 36 and 100 (the 2-standard-deviation shortcut actually gives an upper bound of 116, but the maximum grade is 100). This is not what we want. What we want to know is the average grade we would get for each teacher, and the interval it would fall into 95% of the time.
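The 1.96 multiplier itself comes straight from the normal distribution. A quick SciPy check (a sketch using teacher A’s numbers from above) shows both where it comes from and why this interval describes individual grades rather than the average:

```python
from scipy import stats

z = stats.norm.ppf(0.975)    # ~1.96: multiplier for a central 95% normal interval
mean_a, sd_a = 76, 20

print(z)                       # ~1.96
print(mean_a - z * sd_a,       # ~36.8
      mean_a + z * sd_a)       # ~115.2
# Rounding 1.96 up to 2 gives the 36-116 range quoted in the text. Either way,
# this is a spread of individual grades, not of the average grade we care about.
```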
Because it is expensive to repeat this experiment and manually calculate the average score every time, we rely on another measure: the standard error. This measure is a standard deviation, but not of the grades: it is the standard deviation of the average grade. Now, this is what we actually want. We want to know how much the average grade would change if we repeated the experiment over and over again, so that we can be confident we are making the right decision.
This is the remarkable beauty of statistical inference: we can get a very reasonable estimate of the standard error without repeating this expensive and annoying experiment hundreds of times. To do that, we simply divide the standard deviation of the grades by the square root of the sample size, in this case the number of students enrolled under each teacher. As in other exact sciences, there is a reason for this, and here it is actually very simple. We are saying that the variance of the average is proportional to the variance of the variable it is an average of (how could it be otherwise?). We are also saying that, as the sample size grows, there is less and less room for the average to vary, simply because there are fewer ways of obtaining meaningfully different samples. In the extreme scenario, we would "sample" the whole population, like every student in the world, and the variation in the average grade would be zero. There would simply be no student left in the world who was not included in the calculation of the average, so it would be nonsensical for that average to carry any sampling uncertainty.
But we are not observing every student in the world, so our estimate of the average grade carries some degree of uncertainty. The question is: is this uncertainty large enough to prevent us from deciding who is the better teacher? Calculating the standard error for teachers A and B, we get about 0.63 points for teacher A and 0.95 points for teacher B (just take the standard deviation and divide by the square root of 1,000). So, if we take an interval of 1.96 standard errors below and above the average for each teacher, we are correct when we say the following: if we repeated this experiment 100 times with different people, in 95 of them the average grade under teacher A would fall between 74.76 and 77.24, and under teacher B it would fall between 52.14 and 55.86. So, we can reasonably conclude that teacher A is the better one, because the highest average grade we expect for B (55.86) is still lower than the lowest average grade we expect for A (74.76). Therefore, even when we account for the fact that our measurement of the average grade is somewhat imperfect, we can still differentiate between teachers A and B with an acceptable error rate.
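Here is the same arithmetic in a few lines of Python, just to make the recipe explicit. The only inputs are the summary numbers from the spreadsheet step.

```python
import numpy as np

def ci_for_mean(mean, sd, n, z=1.96):
    """95% confidence interval for the average, using SE = sd / sqrt(n)."""
    se = sd / np.sqrt(n)
    return se, (mean - z * se, mean + z * se)

se_a, ci_a = ci_for_mean(76, 20, 1_000)   # SE ~0.63, CI ~(74.76, 77.24)
se_b, ci_b = ci_for_mean(54, 30, 1_000)   # SE ~0.95, CI ~(52.14, 55.86)
print(se_a, ci_a)
print(se_b, ci_b)
# The intervals do not overlap: the upper end for teacher B (~55.86) is still far
# below the lower end for teacher A (~74.76).
```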
Applications to Machine Learning
This carries over to the star of the moment in Data Science: machine learning. Everybody wants to run 10,000 versions of a neural network concurrently in the cloud on a dataset of 20 million rows. But I often ask myself: is that really necessary? I know data is more abundant today than ever before, and cloud computing is certainly affordable. However, why should I go to so much trouble if I know a faster and cheaper way to get to the same place? (With a reasonable error rate, of course.)
Statistics is all about samples. It is about doing the best you can, as fast and as cheaply as possible. It is also about knowing how wrong you might be. I personally think that speed, low cost and a properly accounted-for degree of uncertainty are much more valuable to a company than a fancy model trained in a fancy way. Today, we data scientists need to deliver value fast and in a way most people can understand. We are also required not to spend too much money on research, since the value delivered by the project will be weighed against the cost of delivering it.
So, how can we translate my simple example into a machine learning context? Well, we could look at evaluation metrics, for example. These are random variables too, just like the grades above, so we can make inferences about them in the same way. In regression problems, the RMSE can be used with a confidence interval in much the same way as in the example: it is just the square root of the average of something we measure for every observation in the dataset (the squared error).
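To make that concrete, here is one simple way to put a rough 95% confidence interval around the RMSE: build the interval for the mean of the per-observation squared errors, then take square roots of its endpoints. This is a sketch under the same assumptions as the working example, not the only (or most rigorous) way to do it; `y_test` and `model` are placeholders for your own data and estimator.

```python
import numpy as np

def rmse_with_ci(y_true, y_pred, z=1.96):
    """Rough 95% CI for the RMSE, built from the per-observation squared errors."""
    sq_err = (np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2
    mean_sq = sq_err.mean()
    se = sq_err.std(ddof=1) / np.sqrt(len(sq_err))  # standard error of the mean squared error
    low, high = mean_sq - z * se, mean_sq + z * se
    return np.sqrt(mean_sq), (np.sqrt(max(low, 0.0)), np.sqrt(high))

# Hypothetical usage on a held-out sample:
# rmse, (rmse_low, rmse_high) = rmse_with_ci(y_test, model.predict(X_test))
```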
For classification, however, we have a problem: most metrics cannot be calculated for each observation. That list includes AUC-ROC, accuracy, precision, recall and the F-score (to name only the most popular). This is unfortunate because, while in regression a single sample gives us a squared error for each of its 10,000 observations, in classification it gives us only one observation of the AUC-ROC random variable. So we have to partition our dataset into many smaller samples via cross-validation to get a distribution of this variable, which consumes a lot of time (often several hours) and makes cloud providers very rich.
However, there is a classification metric we can use in much the same way as RMSE in regression: the log loss (or cross-entropy). It can be calculated for every observation, and therefore we can calculate its average. We can then make inferences about the average log loss just as in our working example, simply because the average log loss and the average grade are the same thing from a purely statistical point of view. So I can get a reasonable estimate of performance with just one sample. Much easier, much faster, much cheaper.
For example, say we have two classification models, M1 and M2. The average log loss of M1 is lower than that of M2 even after we account for the uncertainty in measuring them; that is, our 95% confidence intervals still show that M1 has a lower average loss than M2. Hence M1 is better than M2. This is exactly the reasoning we used in the example. To get there, we just need one sample to train the models and another to make predictions, and we obtain as many observations of the loss as there are rows in the prediction sample. If we calculated the AUC-ROC instead, we would get only one observation of it.
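A minimal sketch of that comparison, assuming a binary problem where `y_test` holds the true labels and `p1`, `p2` hold predicted probabilities from M1 and M2 on the same held-out sample (these names are placeholders, not anything from a real project):

```python
import numpy as np

def per_observation_log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy computed for each observation separately."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def mean_with_ci(values, z=1.96):
    """Average of the losses with a 95% confidence interval, as in the example."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    se = values.std(ddof=1) / np.sqrt(len(values))
    return mean, (mean - z * se, mean + z * se)

# loss1, ci1 = mean_with_ci(per_observation_log_loss(y_test, p1))
# loss2, ci2 = mean_with_ci(per_observation_log_loss(y_test, p2))
# If ci1[1] < ci2[0], M1's average log loss is lower than M2's even after
# accounting for the uncertainty of the estimates.
```

On the same test set, a paired look at the per-observation difference in losses would be even sharper, but the simple overlap check above already mirrors the teacher comparison.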
Furthermore, our working example showed that a mere sample of 1,000 observations is enough to make fairly reliable statements about averages. So there is really no reason to get upset when our dataset has fewer than 100,000 rows or so. There is a lot we can do with just a few thousand rows. Of course, this depends on the standard deviation of the random variable, but how big would it have to be to remain large after division by the square root of 1,000 or more?
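A two-line illustration of that point: even a generous standard deviation becomes a small standard error once divided by the square root of a few thousand.

```python
import numpy as np

for sd in (20, 30, 100):
    for n in (1_000, 10_000, 100_000):
        print(f"sd={sd:>4}, n={n:>7} -> standard error = {sd / np.sqrt(n):.3f}")
```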
As the reader may have guessed, I’m a fan of statistics and hypothesis testing. I honestly think that knowing statistics pays off more for a data scientist than knowing programming (although programming is just as indispensable for the job). I test hypotheses about everything, and I particularly like to use them to rank variables when modelling. This is another use of statistical inference that is often much easier to understand and faster to execute. It is also reliable, built on methods that have proved their value in research across many disciplines worldwide. Why toss them away for some brute-force computational routine like recursive feature elimination (RFE)?
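As one concrete example of what I mean, and strictly as a sketch on synthetic data (the dataset and every parameter below are made up for illustration), variables can be ranked with plain univariate hypothesis tests from scikit-learn instead of running RFE:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Synthetic regression data, only so the snippet runs end to end.
X, y = make_regression(n_samples=1_000, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

f_stats, p_values = f_regression(X, y)   # one univariate F-test per feature
ranking = np.argsort(p_values)           # smallest p-value first
print("Top features by univariate test:", ranking[:5])
```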
To be clear, I’m not saying that cross-validation, RFE and other computationally expensive methods are irrelevant. They are amazing tools created by people much more capable in the matter than I am, and it’s perfectly fine to use them if you want. That is for sure. But there are other cool things out there that are pretty useful and reliable. Maybe you should give them a try and see if they make you more effective at your beloved job. They sure did for me.