In my previous article, the first part of this series, Maths behind Supervised Learning for Dummies: The theory in plain words, we saw an overview of how geometry and algebra form the basis of supervised learning. In this second part, I will introduce the Vapnik-Chervonenkis dimension, Probably Approximately Correct (PAC) learning, how to extend our binary classifier to multiple classes, and regression.
The Vapnik-Chervonenkis Dimension a.k.a. Brute Force
If four individuals (x1, x2, x3, x4) can each be assigned to one of 2 classes (class A or class B), then there are 2^4 = 16 possible classifications: (x1 in A, x2 in A, x3 in A, x4 in A), or (x1 in A, x2 in A, x3 in A, x4 in B), or (x1 in A, x2 in A, x3 in B, x4 in A), and so on. In general, N data points give rise to 2^N learning problems (or possible labelings). If, for every one of these learning problems, we can find a hypothesis h (belonging to the hypothesis class H) that separates the elements correctly into their two classes, then we say that H shatters N points. In other words, the given N points can always be separated with no error by some hypothesis h from the class H.
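To make the counting concrete, here is a small sketch (the point names are just illustrative) that enumerates all 2^N labelings for N = 4 points:

```python
from itertools import product

# With N data points and 2 classes, there are 2**N possible
# labelings, i.e. 2**N distinct learning problems.
points = ["x1", "x2", "x3", "x4"]

labelings = list(product("AB", repeat=len(points)))
print(len(labelings))   # 2**4 = 16
print(labelings[0])     # ('A', 'A', 'A', 'A')
print(labelings[1])     # ('A', 'A', 'A', 'B')
```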
The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis dimension of H. Note that it is enough to shatter one specific placement of N points in the space, not every possible placement: four collinear points cannot be shattered by rectangles, but we only need one configuration that works. For a suitable placement of 4 points in two dimensions, we can find a rectangle h that correctly solves any of the 2^4 = 16 learning problems (or possible labelings), thus 4 points are shattered by H:

Nevertheless, there is no way to shatter 5 points with rectangles in two dimensions. For any 5 points in two dimensions, some of the 2^5 = 32 possible labelings can never be solved:

The Vapnik-Chervonenkis dimension of H, VC(H), is equal to 4 when H is the hypothesis class of axis-aligned rectangles in two dimensions.
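We can sketch this shattering test by brute force. For axis-aligned rectangles, a labeling is solvable exactly when the tightest bounding box of the positive points contains no negative point, so it suffices to check that box for every labeling. The two point sets below (a diamond, and the same diamond plus its centre) are my own illustrative choices:

```python
from itertools import product

def bounding_box(pts):
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return min(xs), max(xs), min(ys), max(ys)

def shatterable(points):
    """True if axis-aligned rectangles shatter this point set: for
    every labeling, the tightest rectangle around the positive points
    must exclude every negative point."""
    for labels in product([0, 1], repeat=len(points)):
        pos = [p for p, lab in zip(points, labels) if lab == 1]
        neg = [p for p, lab in zip(points, labels) if lab == 0]
        if not pos:
            continue  # an empty rectangle handles the all-negative labeling
        x0, x1, y0, y1 = bounding_box(pos)
        if any(x0 <= x <= x1 and y0 <= y <= y1 for x, y in neg):
            return False
    return True

# Four points in a diamond: shattered by axis-aligned rectangles.
print(shatterable([(0, 1), (1, 0), (2, 1), (1, 2)]))           # True

# Adding the centre point: labeling the four outer points positive
# and the centre negative cannot be solved, so shattering fails.
print(shatterable([(0, 1), (1, 0), (2, 1), (1, 2), (1, 1)]))   # False
```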
Why did I say that this is also known as brute force? Well, most real-life problems do not need to consider all 2^N possibilities. Real-life data points are generally drawn from some distribution, and points that are close in the space usually share the same label. We do not have to worry about every possible learning problem, so we can prefer a hypothesis class with a small VC dimension over hypothesis classes with higher dimensions. This simplifies our search, and it will work well for most given points (do you remember the definition of bias?). There will be cases where we cannot divide all the drawn points correctly into two classes. In such cases, we have to state how good or accurate our hypothesis class is; for example, it works with 94% accuracy, or it makes an error of less than 3% in more than 95% of cases.
Probably Approximately Correct Learning a.k.a. Real-life Approximations
So, imagine we have a rectangle hypothesis h from H for a real-life problem whose data points are drawn from an unknown probability distribution. We want to find the number of points, N, for which the hypothesis h has error at most E with probability at least 1-d. In other words, if C is the true concept from H, which divides all points correctly, then the region of difference between C and h produces false negatives when using h as the hypothesis. We want the probability of misassigning a point, i.e. of it falling in that region, to be at most E (the error probability), with confidence probability at least 1-d:

The region between C and h is the difference of two rectangles (the shaded region in the image), so it can be covered by 4 strips (which overlap in the corners). Suppose each strip has probability mass at most E/4; then a randomly drawn point misses a given strip with probability at least 1-E/4, and all N independent points miss that strip with probability at most (1-E/4)^N. By the union bound, the probability that all N points miss at least one of the four strips is at most 4(1-E/4)^N. This must be at most d, thus we require 4(1-E/4)^N <= d. Without going into greater detail: dividing both sides by four, taking logs (using 1-x <= e^-x) and rearranging, we get N >= (4/E)log(4/d). So we need at least this sample size N to ensure the given error probability with the given confidence probability. This derivation depends on the hypothesis class (we were working with rectangles), so the formula must be worked out for each class.
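Sticking to the bound just derived, the closed form N >= (4/E)log(4/d) can be checked numerically against the original inequality (the helper name pac_sample_size is my own):

```python
import math

def pac_sample_size(eps, delta):
    """Smallest N from the closed form N >= (4/eps) * ln(4/delta),
    which guarantees 4 * (1 - eps/4)**N <= delta."""
    return math.ceil((4 / eps) * math.log(4 / delta))

# e.g. error at most 5% with 95% confidence:
n = pac_sample_size(0.05, 0.05)
print(n)  # 351

# Sanity check against the original inequality 4*(1 - E/4)**N <= d:
print(4 * (1 - 0.05 / 4) ** n <= 0.05)  # True
```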
Learning Multiple Classes
So far, we have focused on binary classifiers, but what if there are several classes? If we have K different classes, where each instance belongs to exactly one of them, we can see this as K two-class problems. The empirical error on a training set X is now the sum of the prediction errors over all classes (k) and all instances (N):

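A minimal sketch of this double sum, assuming one-hot 0/1 indicator labels and one hypothesis h_k per class (the function name and data are my own illustration):

```python
def empirical_error(predictions, labels):
    """Sum of disagreements over all N instances and all K classes.
    predictions[t][k] and labels[t][k] are the 0/1 output of h_k and
    the true indicator of class k for instance t."""
    return sum(
        int(p != y)
        for pred_t, label_t in zip(predictions, labels)
        for p, y in zip(pred_t, label_t)
    )

labels      = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # 3 instances, K = 3
predictions = [[1, 0, 0], [0, 0, 1], [0, 0, 1]]  # 2nd instance misassigned
print(empirical_error(predictions, labels))      # 2
```

Note that one misassigned instance contributes two disagreements here, one for the class it should have matched and one for the class it wrongly matched.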
Regression
In classification, we calculate the empirical error with a binary behavior: a prediction is either right or wrong. For a given instance x, C(x) was either 0 or 1. When we predict values instead of classes, the problem is different and is called regression. Now r is a real value, obtained from the inputs as r = f(x) + E. We must add a noise term E to the output, which represents the effect of hidden variables that we cannot observe. To be fully precise, we should say that r = f(x, z), where z denotes the hidden variables. If there were no noise, the task would be interpolation rather than regression. Please observe that this noise in the underlying output function is not due to imprecision in recording or errors in labeling; those errors only affect the measured targets.
So we want to approximate f(·) with a hypothesis from our class G(·), and the empirical error on the training set X must consider the numeric distance between outputs, not just the equal/not-equal test used in classification. A common way to measure this distance is the square of the difference:

Our G(·) can be linear, quadratic, or any other higher-order function. We just replace G(·) with its definition in the empirical error formula and minimize it by taking partial derivatives with respect to the coefficients w (which gives an analytical solution for the parameters). Please observe that we first choose the class G(·) (for example, we decide to use a linear function), but once we set the coefficients that minimize the empirical error, we obtain a particular instance g(·) of the class G(·). Which G(·) should we choose? As I said in the previous article, as the order of the function increases, the training error decreases, but you may fall into overfitting.
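As an illustration of the analytical solution for a linear G(·), here is a sketch on synthetic data (the true coefficients 3.0 and 2.0 are my own choice). Setting the partial derivatives of the squared error to zero yields the normal equations, which numpy.linalg.lstsq solves:

```python
import numpy as np

# Synthetic data: r = f(x) + noise, with f(x) = 3.0*x + 2.0.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
r = 3.0 * x + 2.0 + rng.normal(0, 1, size=x.shape)

# Choose G as the class of linear functions g(x) = w1*x + w0 and
# solve the least-squares problem A w = r analytically.
A = np.column_stack([x, np.ones_like(x)])
w, _, _, _ = np.linalg.lstsq(A, r, rcond=None)
w1, w0 = w
print(round(w1, 1), round(w0, 1))  # close to the true 3.0 and 2.0

# Training (empirical) error: mean squared difference on X.
mse = np.mean((r - (w1 * x + w0)) ** 2)
print(mse)  # roughly the noise variance
```

Replacing the two columns of A with powers of x (x^2, x, 1, ...) gives the analytical fit for quadratic or higher-order choices of G(·).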
Adrian Perez works as Head of Data Science and has a PhD in Parallel Algorithms for Supercomputing. You can check out more of his Spanish and English content on his Medium profile.
Bibliography
Introduction to Machine Learning. Fourth Edition. Ethem Alpaydin. The MIT Press 2020.