
Bounding the Sample Size of a Machine Learning Algorithm

It can be very helpful to bound the sample size of a machine learning algorithm. Read on to find out how.

Photo by Mika Baumeister on Unsplash

One common problem with machine learning algorithms is that we don’t know how much training data we need. A common workaround is to keep training until the training error stops decreasing. However, this still has issues. How do we know we’re not stuck in a local minimum? What if the training error behaves strangely, staying flat over many training iterations and then dropping sharply? The bottom line is that without a precise way of knowing how much training data we need, there will always be some uncertainty about whether or not we are done training.

However, for certain problems it is actually possible to mathematically determine how much training data we need. The idea is this: we have two parameters, ε and 1 – δ. These represent our desired test set error and our desired probability of achieving that test set error, respectively. With these two parameters, our goal is to find m, the minimum number of samples (training data points) needed to meet them. For example, say we have ε = 0.05 and δ = 0.1 (so 1 – δ = 0.9). Then our goal is to find the minimum number of samples m such that there’s at least a 90% chance our learning algorithm will achieve at most 5% error on the test set.

Let’s look at a specific problem to see how this approach is used. Our two independent variables are SAT score and GPA. Our goal is to use these two variables to predict whether or not someone is an average student. We define an average student as someone who has both a middle range SAT score and a middle range GPA. Graphically, we can put SAT score on the x-axis and GPA on the y-axis. Therefore, the rectangle in the graph below represents an "average student". Everyone inside the rectangle is an average student, everyone outside is not. The goal of our learning algorithm is to figure out the size and location of this rectangle.

The blue points are average students. The black points are not. The blue rectangle represents the boundary for a student that is average. Image by Author.

Next, we need to define our learning algorithm. We are given a sample of (SAT score, GPA) pairs, along with a label for each pair indicating whether or not it corresponds to an average student. Our algorithm takes all the pairs labeled as average and draws the smallest rectangle that fits around them. The algorithm is easiest to show with a picture:

The green points are our sample. Our algorithm draws the smallest possible rectangle that fits around the points in the sample that are average. Image by Author.
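Here is a minimal sketch of this learner in Python. The function names and the NumPy dependency are my own choices for illustration, not something prescribed by the problem: the learner takes a labeled sample and returns the tightest axis-aligned rectangle around the points labeled average.

```python
import numpy as np

def fit_tightest_rectangle(X, y):
    """Tightest-fit rectangle learner.

    X : array of shape (m, 2) holding (SAT score, GPA) pairs.
    y : boolean array of shape (m,), True for "average student".

    Returns (x_min, x_max, y_min, y_max), the smallest axis-aligned rectangle
    containing every positively labeled point, or None if the sample has no
    positive points.
    """
    positives = X[y]
    if len(positives) == 0:
        return None  # predict "not average" everywhere
    x_min, y_min = positives.min(axis=0)
    x_max, y_max = positives.max(axis=0)
    return x_min, x_max, y_min, y_max

def predict(rect, X):
    """Label points as average (True) iff they fall inside the learned rectangle."""
    if rect is None:
        return np.zeros(len(X), dtype=bool)
    x_min, x_max, y_min, y_max = rect
    return (
        (X[:, 0] >= x_min) & (X[:, 0] <= x_max)
        & (X[:, 1] >= y_min) & (X[:, 1] <= y_max)
    )
```

Prediction with the learned rectangle is then just a containment check, which the predict helper performs.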

Now let’s think about how this algorithm depends on the sample size. The larger the sample, the more of the blue rectangle (the true average-student region) will likely be covered by sampled points, and so the more of the blue rectangle the green rectangle will cover. In other words, a larger sample size leads to lower error, as we would expect. Our goal is to make this relationship precise in terms of ε and δ.

First, we fix an ε. Recall that ε is the error we want to achieve – we want at most ε error (lower is also acceptable). Next, we will construct a scenario in which this error is achieved, and find the probability of that scenario. Consider "growing" rectangles inward from the four sides of the blue rectangle, stopping each one as soon as it has probability mass ε/4. We can represent these four new rectangles on our graph, labeled A, B, C, and D.

Each red rectangle has probability mass ε/4. Image by Author.
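To make the construction concrete, here is a small illustration under assumptions that are mine, not part of the argument: the (SAT score, GPA) pairs are uniform on the unit square, the blue rectangle is [a, b] × [c, d], and it carries at least ε probability mass. Under those assumptions, the thickness each strip needs in order to carry mass ε/4 can be computed directly.

```python
# Illustration only. Assumes (not from the article) that (SAT score, GPA) is
# uniform on the unit square and the target rectangle is [a, b] x [c, d]
# with total probability mass of at least eps.
def strip_sizes(a, b, c, d, eps):
    """Inward strip sizes so each of the four strips has mass eps / 4."""
    top_bottom_height = eps / (4 * (b - a))  # strips along the top and bottom edges
    left_right_width = eps / (4 * (d - c))   # strips along the left and right edges
    return top_bottom_height, left_right_width

# Example: target rectangle [0.4, 0.8] x [0.3, 0.7] and eps = 0.05
# gives strips 0.03125 units thick on every side.
print(strip_sizes(0.4, 0.8, 0.3, 0.7, 0.05))
```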

First, note that if the error is greater than ε, the sample must be missing a point in at least one of the rectangles. This is because if the sample had a point in all four rectangles, the green rectangle would extend into each of them, so everything it fails to cover would lie inside the four rectangles – and we constructed those rectangles to have a combined probability mass of at most ε, so the error would be at most ε.

A sample that has points in all four red rectangles. Because of this, the error is at most ε. Image by Author.

Therefore, P(Error > ε) ≤ P(sample has no point in at least 1 rectangle). There are 4 rectangles, so by the union bound, P(sample has no point in at least 1 rectangle) ≤ P(sample has no point in A) + P(sample has no point in B) + P(sample has no point in C) + P(sample has no point in D) = 4·P(sample has no point in A), by symmetry, since we set all rectangles to have the same probability mass ε/4. Because our sample consists of m independent draws, P(sample has no point in A) = (1 – ε/4)^m. Putting it all together, we have P(Error > ε) ≤ 4(1 – ε/4)^m.

Thus, given an arbitrary error ε, if we want the probability of exceeding that error to be at most δ, we require 4(1 – ε/4)^m ≤ δ. Solving for m (and remembering that ln(1 – ε/4) is negative, which flips the inequality) gives m ≥ ln(δ/4) / ln(1 – ε/4). Using the inequality 1 – x ≤ e^(–x), we can simplify this to the slightly stronger but cleaner condition m ≥ (4/ε)·ln(4/δ). We’ve achieved our goal: the sample size m is bounded in terms of ε and δ.
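As a quick sanity check of the final formula, here is a tiny helper (my own, not part of the derivation) that evaluates the bound m ≥ (4/ε)·ln(4/δ) and rounds it up to an integer sample size.

```python
import math

def sample_size_bound(eps, delta):
    """Smallest integer m satisfying m >= (4 / eps) * ln(4 / delta)."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# The earlier example with eps = 0.05 and delta = 0.1 gives 296 samples.
print(sample_size_bound(0.05, 0.1))
```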

Having a bound like this is very useful. We can directly calculate how many samples we need to achieve a certain probability of a certain error. For example, if we want a 99% probability of getting 1% error or better, we need (4/0.01)·ln(4/0.01) ≈ 2397 training examples. If we could calculate bounds like these for all machine learning algorithms, life would be so much easier. Unfortunately, most machine learning algorithms are far more complicated than our rectangle algorithm, so the mathematical analysis may be intractable. Other times, a bound can be found, but it is so high that it’s functionally useless. However, I believe that advances in machine learning theory will eventually give us useful bounds for most machine learning algorithms.
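If you want to check the bound empirically, here is a short simulation sketch that reuses the learner and the bound helper from the earlier snippets. The uniform distribution on the unit square and the particular target rectangle are my own assumptions for the experiment; the bound itself makes no distributional assumption. The simulation repeatedly draws m samples, fits the tightest rectangle, computes its exact error under the uniform distribution, and reports how often that error exceeds ε. By the bound, that frequency should stay below δ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup for the experiment (not from the derivation): uniform data on
# the unit square, target rectangle [0.4, 0.8] x [0.3, 0.7].
TARGET = (0.4, 0.8, 0.3, 0.7)

def true_label(X, rect=TARGET):
    """True labels: a point is an average student iff it lies in the target rectangle."""
    x1, x2, y1, y2 = rect
    return (X[:, 0] >= x1) & (X[:, 0] <= x2) & (X[:, 1] >= y1) & (X[:, 1] <= y2)

def uniform_error(learned, target=TARGET):
    """Exact error of the learned rectangle under the uniform distribution.

    The tightest-fit rectangle always sits inside the target, so the error
    region is the target minus the learned rectangle, and its probability
    mass is the difference of the two areas.
    """
    tx1, tx2, ty1, ty2 = target
    target_area = (tx2 - tx1) * (ty2 - ty1)
    if learned is None:  # no positive points in the sample
        return target_area
    lx1, lx2, ly1, ly2 = learned
    return target_area - (lx2 - lx1) * (ly2 - ly1)

def trial(m, eps):
    """Draw m samples, fit the rectangle, and report whether the error exceeds eps."""
    X = rng.uniform(size=(m, 2))
    learned = fit_tightest_rectangle(X, true_label(X))  # learner from the earlier sketch
    return uniform_error(learned) > eps

eps, delta = 0.05, 0.1
m = sample_size_bound(eps, delta)  # 296, from the helper above
failures = sum(trial(m, eps) for _ in range(2000))
print(f"Observed P(error > eps) ≈ {failures / 2000:.3f}, bound guarantees below {delta}")
```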

In this article, we derived a sample size bound for a simple learning problem in terms of ε and δ. If you are interested in this general framework of analyzing machine learning algorithms, go take a look at my article about PAC learning. Let me know if you have questions/comments about anything, and thanks for reading!

