What is Learning in Machine Learning?

Aurangazeeb A K
6 min read · Nov 10, 2019

An easy way to learn the terms of Machine Learning from a philosophical point of view.

Through the Philosopher’s specs

Aristotle — a polymath (Source)

Before diving into the philosophy of Machine Learning, let’s bring some clarity to a few terms. Datapoints are pieces of a whole — fragments of the ultimate essence from which they all originated, which machine learning experts usually call insight. More technically, they are samples from the underlying distribution that generates these pieces of information. So every datapoint counts. On your journey of learning, you pick up each of these stones (datapoints) along the way and gain some understanding from it — you learn something. As you progress, you will see and pick up many stones and keep accumulating knowledge about them.

After you have traveled far enough, you will have accumulated a considerable amount of understanding about all those stones (or datapoints). Remember that commencement speech Steve Jobs gave at Stanford —

“You can’t connect the dots looking forward; you can only connect them looking backwards.” — Steve Jobs

Connecting the dots or the datapoints


Now, following Steve Jobs, you connect those dots (your understandings) — i.e., you combine everything you have learned on your journey so far. What you are about to experience is something else.

When you analyze the information, you may find that all those stones have something in common — similar chemical composition, similar shape, and so on. You may also find that all of them bear similar damage, leading you to infer that they fell from the same height at the same velocity, which you might attribute to a truck that carried them all to its destination. Do you feel like Sherlock Holmes now? Anyone who analyses data is one. There are many possible explanations (models) for why all these stones have something in common.

The idea is to find the best explanation (model) — the one that is least surprising (the Surprise Principle, after Charles Sanders Peirce).

Fitting and Generalization

Connecting the story behind each stone (the understanding you gained from them) to generate a grand story or theory is called fitting a model in technical terms.

So, for instance, when you come up with a theory about why your air conditioner is not cooling the room as usual, you are actually fitting a model to the under-performance of your a/c. This fitting happens as you try to fit all the evidence you gathered from inspecting the a/c into one convincing theory, just like a jigsaw puzzle. In layman’s terms, you learned the reason behind the a/c breakdown — and the interesting part is, there can be more than one explanation!

Mathematically, learning is fitting a line to your datapoints or evidence. And there can be more than one possible line!
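The “fitting a line” idea can be sketched in a few lines of Python. The datapoints below are invented purely for illustration; NumPy finds the straight line that best explains them:

```python
import numpy as np

# Made-up datapoints: the "stones" we picked up along the way
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit one candidate "explanation": a straight line y = m*x + c
m, c = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {m:.2f}x + {c:.2f}")
```

The fitted slope and intercept are the “grand story” distilled from all the individual stones; a different loss or a different model family would tell a slightly different story from the same data.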

The fitted line or model need not touch all the datapoints perfectly — that would be overfitting. This means some information from each datapoint will always be left out, and only such a model will have better generalization performance. So what is generalization performance?
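To see why a line that touches every point can be the worse theory, here is a small sketch with invented noisy data: a degree-5 polynomial passes through all six training points exactly (zero training error), while a simple line accepts some leftover error — and the two can then be compared on fresh points drawn from the same underlying truth:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 6)
y_train = 2.0 * x_train + rng.normal(0.0, 0.1, size=6)  # noisy line, truth is y = 2x

line = np.polyfit(x_train, y_train, deg=1)    # general model: just 2 numbers
wiggly = np.polyfit(x_train, y_train, deg=5)  # touches every training point exactly

train_err_line = np.mean((np.polyval(line, x_train) - y_train) ** 2)
train_err_wiggly = np.mean((np.polyval(wiggly, x_train) - y_train) ** 2)

# Fresh "cases" the models have never seen, from the same underlying truth
x_new = np.linspace(0.05, 0.95, 50)
y_new = 2.0 * x_new
new_err_line = np.mean((np.polyval(line, x_new) - y_new) ** 2)
new_err_wiggly = np.mean((np.polyval(wiggly, x_new) - y_new) ** 2)

print(f"train error:   line={train_err_line:.4f}  wiggly={train_err_wiggly:.2e}")
print(f"new-data error: line={new_err_line:.4f}  wiggly={new_err_wiggly:.4f}")
```

The wiggly model “wins” on the training stones by memorizing the noise; the line leaves information out, which is exactly what lets it tell a story that typically holds up better on new data.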

Sherlock Holmes

Let’s play Sherlock again. Imagine you cracked a murder case in which the culprit was the one who smoked an expensive limited-edition cigarette. Now you are assigned a second murder case with two suspects who smoke — one smokes a regular cigarette and the other the expensive limited edition. You cannot conclude that the criminal is the latter, even though all the other circumstances of the crime are similar to the first case you cracked. This is the moment a data scientist would say your model or theory has poor generalization performance: the model you fitted (learned) from the first case cannot be applied to the second. The first model is too closely adapted to the first case (overfitted) — you went so deep into the details of each piece of evidence that the probability of another case having an identical set of details is negligible. Had you built a general enough model for the first case, you could also apply it to the second, and it would, to a great extent, give you convincing predictions about who the prime suspects should be.

What you need is a crime-pattern theory that successfully explains the first crime and can also explain future crimes to a great extent. Such a theory or model is general in nature — far less overfitted.

This philosophy works great if you apply it to your own learning. What you learn is that line.

The line is the physical representation of your knowledge.

This line tells us that, to gain better understanding than bookworms (the overfitters), one must extract only the core ideas from each datapoint. You fit a line for physics when you learn physics chapters, and likewise for chemistry and biology. We could call them the physics line, the chemistry line and the biology line.

Remember, each datapoint is a stepping stone for your line and every stepping stone is just part of the story and NOT the whole story.

If you focus on just a few datapoints and fit a line, chances are you will be highly biased, since the line would capture only part of the reality or knowledge you are seeking. This is called underfitting.

In the murder-investigation setting, underfitting would mean collecting just one or two pieces of evidence and reporting the murder as a mob lynching when it was actually a one-man crime.
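Underfitting has an equally short sketch: fitting a model that is too simple for the data. Below, a single constant (the “one-evidence theory”) is fitted to made-up data that clearly follows a straight line, and its error dwarfs that of a model matching the data’s real complexity:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 20)
y = 3.0 * x + 1.0  # made-up data with an obvious linear trend

flat = np.polyfit(x, y, deg=0)  # the "one-evidence theory": a single constant
line = np.polyfit(x, y, deg=1)  # a model matching the data's complexity

err_flat = np.mean((np.polyval(flat, x) - y) ** 2)
err_line = np.mean((np.polyval(line, x) - y) ** 2)
print(f"underfit error={err_flat:.1f}  matched-model error={err_line:.2e}")
```

The constant model cannot even explain the evidence it was shown — the hallmark of a theory built from too few stones.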

Note: The process of making a line less overfitted than before is known, in technical terms, as regularization. In layman’s terms, it makes the line or model more general — and hence could be called generalization.
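One concrete form of regularization is ridge regression, where a penalty shrinks a model’s coefficients toward zero. The sketch below (with invented noisy data) uses the closed-form ridge solution: with no penalty, a degree-7 polynomial traces every point and needs wildly large coefficients; with a penalty, the coefficients shrink and the model becomes more general:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, size=8)  # made-up noisy wave

# Polynomial features up to degree 7 -- flexible enough to trace every point
X = np.vander(x, 8, increasing=True)

def ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_overfit = ridge(X, y, lam=0.0)  # no penalty: traces the noise exactly
w_regular = ridge(X, y, lam=0.1)  # penalty tames the wild coefficients

print(f"coefficient size without penalty: {np.linalg.norm(w_overfit):.2f}")
print(f"coefficient size with penalty:    {np.linalg.norm(w_regular):.2f}")
```

Shrinking the coefficients is the mathematical version of refusing to build your whole theory around one limited-edition cigarette.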

Final Note


The ultimate information or theory you seek is spread across many datapoints, and you must visit a sufficient number of them to get a reasonably accurate understanding. For example, Andrew Ng’s deep learning course is just one datapoint in the deep learning domain. The actual truth — the complete concept of deep learning — is spread across all the different deep learning courses available, online and offline. If you pour too much time and effort into Andrew’s course alone, you will never reach that ultimate truth and you will be at a disadvantage. What is necessary is to extract the main concepts from Andrew’s course. There will be other teachers with different viewpoints, covering the parts of the story or dimensions that Andrew overlooks.

Seeing through one pair of eyes blinds you to the actual picture; you must see it through many pairs of eyes.

Layman’s term, (Source)

P.S.: Share this article if you found it informative, so that people in need may find it.


Aurangazeeb A K

Machine Learning Engineer, AI Educator, Mechanical Engineer by profession, Mathematics and Physics enthusiast and an aspiring Philosopher