Machine Learning Background Necessary for Deep Learning

Basics to Generalization to Maximum Likelihood Estimator to Kullback Leibler Divergence

Jake Batsuuri
Computronium Blog

--

In order to understand Feedforward Neural Networks, you have to have a pretty good grasp of the Kernel Trick. The Kernel Trick appears in Support Vector Machines, and in order to understand SVMs, you have to know Maximum Likelihood Estimation. The Maximum Likelihood Estimator is a type of Point Estimator, and in many settings it is equivalent in use to minimizing the Mean Squared Error.

So there’s a lot you have to know before you get to the fun stuff. Let’s do a quick recap.

Basics

An agreed-upon definition of machine learning is: a computer program is said to have learned when its performance, as measured by P, at task T improves with experience E.

Under the definition of Supervised Learning, we get this diagram.

Here the experience would be the training data required to improve the algorithm. In practice we put this data into the Design Matrix.

Design Matrix [dəˈzīn ˈmātriks]: term — if a single input can be represented as a vector, then putting all of the training examples, i.e. the vectors, into one matrix captures the entire input side of the training data. We denote this with:
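X ∈ ℝ^(m×n), where each of the m rows is one training example x^(i) with n features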

This is not all of the experience. If the examples are the inputs, we still need the labels: the outputs generated by the real data generating function. We denote these with:
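y, a vector whose entry y_i is the label for the training example x^(i)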

Now both X and y comprise the experience. What we mean by “improve with experience E” is exactly this ability to improve. The Optimizer is the function that allows us to improve; for now we leave it as a black-box function. Just know that the Optimizer changes a bunch of parameters every time it trains with new data, and it changes them in a way that makes our entire machine better and better at approximating the real data generating function.

We conceptualize the actual process we are applying our algorithm to, as a probability generating function:
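f*(x), or in its probabilistic form, p_data(x, y)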

you might see both versions, they are context dependent, but mean the same thing

Now every time you see this, think of this as the real function we are trying to approximate. We generally don’t know what it looks like or exactly how it does it.

Let’s integrate our previously learned terms into this. Our design matrix consists of vectors x and the corresponding labels y.

The label values are generated from the real function.
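In symbols, y^(i) = f*(x^(i)).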

Now, because we don’t know what the real function does, we try to approximate it with our algorithm, which we symbolize with:
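f(x; θ)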

It doesn’t matter whether it’s a vector or not at the moment. What’s important here is that the real function has the star and our approximating function has no star; instead it has 2 types of inputs, the input x and the parameter theta.

It turns out that in general we can approximate a function’s behaviour with parameters instead of an explicit mathematical formula; for complicated functions we just have to use either many parameters or a more complicated structure, as in the small example below.
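For instance, a simple linear model

f(x; θ) = θ_1·x_1 + θ_2·x_2 + … + θ_n·x_n

mimics whatever mapping it is fitted to using just n parameters; a more complicated target function simply needs more parameters, or parameters arranged in a more complicated structure.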

From our earlier diagram of supervised learning, notice that there’s a weights box. The weights are roughly what these parameters, thetas are.

And the Optimizer adjusts the values of the weights with each iteration so that the parameters are fine-tuned to produce the right outputs in our algorithm. Therefore we call these parameters the learned variables, or learned inputs.

The Optimizer can sometimes take in other settings that change its behaviour; these settings are variables that are not learned. With each iteration, they don’t change. We call them hyper-parameters.

Soft Introduction To Generalization

Underfitting & Overfitting

The whole point of the trained machine learning program is that, once it’s trained it should be able to take in previously unseen data and output meaningful and hopefully correct outputs.

We call this idea Generalization, the ability of the algorithm to perform well on previously unobserved values. This can go one of several ways: really well, not useful, too specific.

When a program generalizes well, the new input is entered and a useful output is generated.

When a program doesn’t generalize well, the new input is entered and a useless output is generated. This can happen when the algorithm overfits or under-fits.

When a program overfits, during training the weights/parameters are adjusted so that it performs really well on the training data but not on new data. This happens because the weights have become too specific to the training data.

The other useless kind of program is one that is under-fit. This gives us outputs that are so generic that they aren’t accurate, and that’s not useful.

Formal Introduction to Generalization

Let’s break down the many temporal stages of the algorithm:

  • Training
  • Testing
  • Usefulness

In the training stage, we continuously output training errors. We use these to improve the parameters, or weights, in the Optimizer. The Loss Function is what measures the Training Error; in fact it measures all the other errors as well.

In short, the Loss Function gives the distance between the actual output y* and the algorithm-generated y. So you can imagine that during the training stage this error gets lower and lower, like this:

actual curve of the red line doesn’t matter, just that the error goes down with more training
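A common concrete choice of Loss Function, for regression tasks, is the mean squared error between the algorithm’s outputs and the actual ones:

MSE = (1/m) Σ_{i=1..m} ( y^(i) − y*^(i) )²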

Then the next stage is the test stage, where we just wanna make sure that our algorithm really worked. Given a giant data set of thousands of training examples, you can break it down into 80% training and 20% testing, or even 50% training and 50% testing; it’s all contextual. What matters is that we always test. Why? In order to get a generalization error, or test error.

This generalization error is an indicator of how this trained algorithm is likely to behave out in the wild.

Let’s develop some intuition about these different types of errors. We know that during training, we get our training error to an acceptably low state. Then we do a test with the held-out data to get the generalization error. Often, in fact almost always, the test error is bigger than the training error.

Although we use the training error to optimize the weights/parameters, the value that we really care about is the Generalization Error. How come we attempt to affect a value by indirectly changing things in another stage?

You might have intuited already that because we split our original training data into 2 parts, we can assume improvements made on one side will also show up on the other side.

The field of Statistical Learning Theory provides some answers and helps us formalize this understanding.

We basically assume that the data generating process, or what we’re calling the real data generating function, produces some sort of distribution, like this:

Instead of splitting the data into one clean, contiguous chunk for training and another for testing, we randomly sample, so that the two sets interleave across the distribution, more like a zebra’s stripes than two solid halves.

Although we know that the 2 data sets come from the same data source, here we make the IID assumption about the 2 data sets: training and test. IID stands for independent and identically distributed. If this assumption were ever to break, don’t expect to see close generalization and training errors. Sometimes, in the midst of all the coding and too many unknowns, you can forget to check these fundamental assumptions.

With this assumption, there is one data generating process, and we use it for both sets.
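As a rough sketch of what such a random split can look like in code, assuming the examples are stored in a NumPy design matrix X with a label vector y (the function name and arguments here are just for illustration):

import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly split the design matrix X and labels y into training and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])        # shuffle the example indices
    n_test = int(len(indices) * test_fraction)   # e.g. hold out 20% for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]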

So how do we finally make sure to not overfit and not under-fit? How do training errors and test errors lead to the formal understanding of Generalization?

We have:

  • Training Stage gives the Training Error
  • Testing Stage gives the Testing Error

Generalization problems:

  • Under-fitting happens when the training error is too big
  • Overfitting happens when the difference between the training and testing error is too big

You can probably see that you must first make sure that the training error is small. Once that’s achieved, you keep the training error small and fixed, at least for the next stage, testing.

These error measures are primarily the “performance measure P”.

So far we’ve touched on the experience E, being data points or examples in the design matrix, along with the labels. We touched on the fact that the Optimizer changes a bunch of learned parameters using the loss score from the Loss Function, thereby touching on the performance measure P. The last piece is the task T, which is really important to know about, because you shouldn’t think of machine learning as a solve-anything-and-everything magic pill. It’s really only good at a few categories of tasks.

Before we look at the categories of tasks it’s good at and the amazing applications in the field, let’s understand how the Optimizer works.

Point Estimators

Point estimation is the attempt to provide the single best prediction of some quantity of interest.

This quantity can be a single parameter, a vector of parameters or a whole function. As with everything in machine learning, there is the true value of the parameter and the estimate of the parameter, which are denoted by:
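θ and θ̂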

… respectively.

Let X be a set of m independent and identically distributed, i.i.d, data points:
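{x^(1), x^(2), …, x^(m)}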

… a point estimator or statistic is any function of the data, such that:
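θ̂_m = g(x^(1), …, x^(m))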

The odd thing about the definition is that it doesn’t require that g return a value that is close to the true parameter, or even that its range be the same as that of the true parameter. But this is on purpose: the lax definition allows the user a lot of flexibility. We should always strive for a close estimate of the true parameter, though.

The point estimate can be a scalar, a vector or even a function. In the case of a function, think of the function as belonging to a function space; the point estimate is a point in that space.

With our algorithm we wanna make a really, really good function point estimator. This really is the point of machine learning, right?

You might ask yourself, what makes a good function estimator?

Maximum Likelihood Estimation

Maximum Likelihood Estimation, which is quite important, will give us a clue or two.

Consider the same set of m independent, identically distributed examples:
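{x^(1), …, x^(m)}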

… drawn from the true data generating function/process:
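p_data(x)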

… now consider the function we’re trying to make that mimics this baby:
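p_model(x; θ)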

… think of this function as a parametric family of probability distributions over the same space indexed by:
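θ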

could be a scalar or vector of parameters

The maximum likelihood estimator for theta then is defined by:
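θ_ML = argmax_θ p_model(X; θ)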

the subscript ML stands for, you guessed it, maximum likelihood (check out “Bitch You Guessed It” by OG Maco, idk)

If we write it broken down from the set notation we get:
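θ_ML = argmax_θ ∏_{i=1..m} p_model(x^(i); θ)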

Quick cutaway: argmax is mathematical shorthand for “arguments of the maximum”; it gives an answer, often as a set, of which input values produce the maximum output. So for example for a sine wave, the function first peaks at 90 degrees and repeats every full rotation, giving us the answer of:
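argmax_x sin(x) = {90° + n·360°, for any integer n}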

Back to the max point estimate of theta: if you’ve dealt with products of probabilities before, you might see a problem here. The product of a bunch of independent probabilities is problematic because if even one of the probabilities is zero, or gets rounded to zero, it kinda ruins the party for everyone else: the whole product collapses to zero. Formally, it removes a bunch of points that could have been possible max points. This, by the way, is called the numerical underflow problem: numbers close to zero get approximated to zero, and the result ends up being zero.

So to get around the numerical underflow, we take the log of the probabilities, thereby getting this:
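θ_ML = argmax_θ Σ_{i=1..m} log p_model(x^(i); θ)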

To understand why this helps us, you have to remember how log behaves, especially between 0 and 1, since we’re dealing with probabilities.

the log curve intercepts the x-axis at 1

… so logging probabilities always gives us negative values, and the best part is that it keeps the argmax values the same: even though we change the outputs, argmax only cares about which input produces the maximum, and because log is monotonically increasing it doesn’t change which input that is.

The next thing to pay attention to is that when we log the expression, we also sum the logged parts instead of multiplying them, like in the first part. This again is convenient, because we can express the whole thing as an expectation:
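θ_ML = argmax_θ E_{x~p̂_data} [ log p_model(x; θ) ]

where p̂_data is the empirical distribution defined by the training data.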

… remember that expectation looks like this:
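E_{x~P} [ f(x) ] = Σ_x P(x) f(x)   (a sum for discrete x, an integral for continuous x)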

Now this maximum likelihood estimate appears in a related and important concept of dissimilarity measurement.

Kullback Leibler Divergence

This divergence measures how different one distribution is from another, i.e. it is a measure of dissimilarity. It’s given by:
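D_KL(p̂_data ‖ p_model) = E_{x~p̂_data} [ log p̂_data(x) − log p_model(x) ]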

we wanna measure the dissimilarity between the model distribution and the real data distribution.

When we train the algorithm, we can’t change the actual data generating process, so it’s constant. Therefore, to make these distributions more and more similar to each other, we minimize the KL divergence. The expected value of the first part is constant, so the only part that matters is the second part, and we get:
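argmin_θ − E_{x~p̂_data} [ log p_model(x) ]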

… so we need only minimize the second part of the KL divergence.

So there you have it: minimizing the mean squared error, maximizing the likelihood and minimizing the Kullback Leibler divergence all arrive at the same goal of making our model distribution more like the empirical distribution.

Up Next…

These articles always get too long; I’ve really only started on the machine learning part. The next one will be about conditional log likelihood and mean squared error, and will review and solidify some generalization concepts. If you would like me to write another article explaining a topic in depth, please leave a comment.

For table of content and more content click here.
