
Machine Learning or The Unexpected Virtue of Regression

Things I wish I knew before I learned ML the hard way

Photo by Dominik Scythe on Unsplash

The idea that computers can learn sparks our imagination. We see robots and autonomous cars and wonder: how is that possible? OK, we could try telling our laptop, "come on buddy, follow me and learn". That's a great idea, isn't it? Oh, you're right, it doesn't work that way… yet.

Understanding how computers are capable of learning may be quite intimidating. I've been there, working hard to figure out how artificial neural networks work. Finally, all the elements clicked. However, the more I learned, the more I understood that I should have started from a different position to make the learning easier and more effective.

That's why I'm sharing a story that explains how our computers can learn. You'll also find out that regression doesn't always mean something bad. In fact, it will give you a solid start in the fascinating world of Machine Learning.

Flashback

Imagine a stereotypical computer science student: wearing glasses and a geeky t-shirt, looking tired from staying up late at night learning (or playing video games). The main source of light is the one from a computer display; the sun is not his friend. That's me, a couple of years ago. I already understand how neural networks work and I'm about to attend a lab on more advanced intelligent systems. The lab starts, I get my task, and I start to read:

Implement an intelligent method that…

This simple sentence strikes me. What is an intelligent method? I tried to follow the Feynman Technique[1] to learn things:

Step 1: Choose a subject you want to learn and start learning.

Step 2: Try to explain it as you would explain it to a child.

I couldn't get through step 2, so I decided to ask my lecturer. Here's what he answered:

"Hmm, I consider every method that needs some kind of training as an intelligent or a Machine Learning method. You know, animals and people, usually, are considered to be intelligent. And we need to train to master the skills that we need. Your programs also train hard to solve problems for you."

This simple, informal definition is brilliant. Just impersonate a program and see it as an athlete or a student who, by trial and error, becomes better and better, and is very persistent in achieving the goal. That's the big-picture view of Machine Learning.

With this picture in mind, let's move on to one of the main approaches in Machine Learning: Supervised Learning.

Back to school

I like to compare supervised learning to being taught by a teacher. Let's go back to when we were at primary school, sitting in our classrooms, waiting for our teacher to show us how to recognize different types of leaves, or how to name and write letters and numbers.

Photo by CDC on Unsplash

Your teachers probably gave you some examples. For instance, they showed you leaves from different trees and named those trees. Additionally, they could draw your attention to features of the leaves like shape, color, or structure. After such a lesson you could pick up a leaf while walking in the park and name the tree.

This idea applies to supervised learning. First of all, you're a teacher. You want some task to be solved and want to teach someone how to do it. That means you need a student. In the Machine Learning world, this will usually be a mathematical model (like an equation or a set of numbers combined in a specific way).

Then, you have to prepare your examples and the correct answer for each example. This data set should be useful for solving your task (e.g. photos of leaves if you want a program that recognizes them).

Having that data set, you present your examples to your model. You check how the model responds and compare its responses to the correct answers. Finally, you adjust your model's parameters (usually a lot of numbers) so it responds correctly.

The process of presenting examples, comparing the model's responses to the correct answers, and adjusting parameters is iterative. You repeat it until your model learns the correct answers.
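To make that loop concrete, here's a minimal Python sketch of the present, compare, adjust cycle. Everything in it is illustrative: a deliberately tiny one-parameter model, made-up examples, and a naive nudge-style update, not the method we'll build later in the article.

```python
# Illustrative supervised learning loop: present examples, compare the
# model's responses to the correct answers, adjust the parameter, repeat.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (example, correct answer)
w = 0.0  # the "student": a model with a single parameter

for _ in range(100):                # repeat the lesson many times
    for x, answer in examples:      # present an example
        response = w * x            # the model responds
        error = response - answer   # compare to the correct answer
        w -= 0.01 * error * x       # adjust the parameter a little

print(w)  # close to 2.0: the model learned that the answer is 2 * example
```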

Learn by having fun

Now we can go into more detail and try to solve a problem using ML. As a teacher, you'd like to spark your students' curiosity to learn ML. A lot of students probably love superheroes (I love them too). You can find them everywhere; they've become part of our culture.

Photo by Ali Kokab on Unsplash

So let's formulate a problem you want to solve. You'd like to know what the relationship is between a superhero's costume and their popularity.

First of all, let's prepare our examples and correct answers. To do that, we need to figure out what kind of data we can use to model the relationship between costume and popularity.

Fortunately, browsing the Internet brings you an awesome idea: money can be quite a good indicator of popularity. Let's try it out: choose around 50 superheroes and check how much you'd need to pay for each superhero's costume.

Then you can find out how many comic book issues a given superhero appeared in, and assume that the more appearances, the better. To find such data, we'll use the following ranking[2].

Gathering it all together, we have two numbers for every superhero in the ranking: costume price (x) and comic book issue appearances (y):

Superheroes data set. (Image by Author)

That's our dataset. As it has two numbers in each row, we can draw a plot and see what it looks like in 2D.

Superheroes in 2D.
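If you'd like to play along in code, here's a small sketch of how such a data set could be stored and plotted with Python and matplotlib. The numbers below are made up for illustration; the real values come from the costume prices and the ranking[2].

```python
import matplotlib.pyplot as plt

# (costume price, comic book issue appearances): made-up example values
data = [
    (200.0, 1500.0),
    (350.0, 2800.0),
    (500.0, 4100.0),
    (750.0, 6000.0),
    (900.0, 7300.0),
]

prices = [x for x, _ in data]       # our x values
appearances = [y for _, y in data]  # our y values

plt.scatter(prices, appearances)
plt.xlabel("Costume price (x)")
plt.ylabel("Comic book issue appearances (y)")
plt.title("Superheroes in 2D")
plt.show()
```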

Now it's time to choose our student: a machine learning model that will learn how to solve the task. If you look at our plot, you probably see the trend: the more expensive the costume, the more comic book issue appearances. If I gave you a pen and asked you to draw this trend, you could draw a simple, straight line.

Drawing a trend using our own human brain. (Image by Author)

It fits our data pretty well, doesn’t it? So let’s use a line as our ML model.

It’s a bird… it’s a plane… it’s a linear regression!

The formal name of the method we're going to use is linear regression[3]. It's a method for modeling the relationship between variables. You can use it to check how a house's size relates to its price, or how electric current relates to voltage. And we want to know how costume price relates to comic book appearances.

Our model is a line equation, f(x) = Θ₁x + Θ₀, where f(x) is the number of comic book issues, x (the costume price) is the variable, and Θ₀ and Θ₁ (the thetas) are its two parameters.

Line equation. (Image by Author)

How does it work? Let's assume that Θ₀ = 1 and Θ₁ = 1. If we plug in 0 as x, we have 0 times 1, which is 0, plus 1, which gives 1. For x = 2, it's 2 times 1, plus 1, which gives 3. We can use these points, (0, 1) and (2, 3), to sketch a line.

Linear function sketch for Θ₀ = 1, Θ₁ = 1. (Image by Author)
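In code, the model really is a one-liner. A tiny sketch that reproduces the two points we just calculated by hand:

```python
def f(x, theta0=1.0, theta1=1.0):
    """Our model: a line with intercept theta0 and slope theta1."""
    return theta1 * x + theta0

print(f(0))  # 1.0 -> the point (0, 1)
print(f(2))  # 3.0 -> the point (2, 3)
```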

If we change the parameters, our line will go through different points. Of course, in our Machine Learning flavored linear regression we want to find the optimal parameters automatically. Our program will try different values of the thetas, and it needs to know what it means for these parameters to be good. That's why we need an objective function.

Test your student

In ML it's very important to be able to assess your model. You simply need to know whether it performs well or badly. We do this using so-called objective functions. Let's define our linear regression objective function.

Linear regression objective function. (Image by Author)

The objective function Q depends on the theta parameters, which change while our program is looking for a solution, and on our data points representing superheroes (the points don't change; they are the source of truth). For our line, Q is the sum of squared errors: Q(Θ₀, Θ₁) = Σⱼ (f(xʲ) − yʲ)².

Following the formula, we need to iterate through all of our data points (indexed by j). Take the first point, get its costume price (xʲ), and calculate the value of the function f. Then subtract this point's number of comic book issues (yʲ), square the result, and add it to the overall sum. Do this for every point in our data set.

To make this easier to understand, let's look at two examples. Grab a point from our data set. Imagine that f(xʲ) for this point gives you 4, and yʲ is equal to 2. Put those values into the objective function formula:

Image by Author.

As you can see, after calculating the objective function formula for this particular point, (4 − 2)² = 4, we end up with 4. That means we need to add 4 to our sum. OK, the sum is growing; on to the next point.

Image by Author.

This time the value of f(xʲ) is 1 and we were expecting 1 (yʲ = 1). As you can see, according to the objective function formula, (1 − 1)² = 0, so we get 0 for this point. That means we don't need to add anything to our sum; it stays at the same level.

And that's how our objective function works! The closer the sum is to 0, the better. If the values of f(x) for the data points are close or equal to the expected values (y), it means we found the optimal theta parameters. In other words, we're happy when the model's answers match the correct answers. So we need to minimize the objective function (the overall sum should be low) to solve our problem.
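Here's the same objective function as a short Python sketch. It simply mirrors the formula: for every point, take the model's answer, subtract the correct one, square it, and add it to the sum.

```python
def objective(data, theta0, theta1):
    """Sum of squared errors of the line over all data points."""
    total = 0.0
    for x_j, y_j in data:                         # every point j in the set
        residual = (theta1 * x_j + theta0) - y_j  # f(x^j) - y^j
        total += residual ** 2                    # square it, add to the sum
    return total

# The two worked examples above: (4 - 2)**2 adds 4, (1 - 1)**2 adds 0.
```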

Superhero of Machine Learning

To minimize our objective function we'll use the gradient descent algorithm[3,4]. For me, it's one of the major algorithms in ML. The basic idea behind it is commonly used in complex and powerful ML methods like Deep Learning[4]. Fortunately, we can use it in quite a simple way for our linear regression task and understand how it works under the hood.

First of all, gradient descent is an iterative algorithm. In every iteration we update our theta parameters following these mysterious-looking formulas:

Image by Author.
Image by Author.

No worries, it looks much more complex than it is. Before we start, we initialize our thetas to some random values. Then we start updating them. Θ₀(t) and Θ₁(t) are the values of the theta parameters in the current iteration (t). We change them by subtracting α, also called the learning step (a small real number, like 0.25), multiplied by the partial derivative (∂Q/∂Θ) of our objective function. OK, that doesn't sound simple. Why do we need derivatives?

The most important thing you need to know about derivatives is that the derivative of a function tells you how the function is changing. If the function is increasing at a point, its derivative at that point will be positive. If it's decreasing, the derivative will be negative.

For the sake of simplicity, let's assume that our objective function is a simple parabola whose value depends on the theta parameter (the Θ axis). If we calculate the derivative at a point where the parabola is going up, the value of the derivative will be positive. In the gradient descent formula, that means −α times a positive derivative. Negative times positive gives us a negative value (e.g. −4 × 3 = −12). So we need to subtract something from our current theta.

During the next iteration, we calculate the derivative for the new point, and if it's positive we subtract something from our theta again. That's how we reach the minimum in a couple of steps. Let's visualize it for better understanding.

Updating theta by subtracting, when starting on increasing edge of parabola. (Image by Author)

If we had chosen our starting point on a decreasing edge of the parabola, our derivative would be negative. So when we follow the gradient descent formula this time, we have −α times a negative derivative. Negative times negative gives us a positive value (e.g. (−2) × (−3) = 6). So now we are adding something to our theta. Adding means moving to the right on the Θ axis. Following the same steps, we finally reach the minimum. Here's the visualization.

Updating theta by adding, when starting on the decreasing edge of the parabola. (Image by Author)

And that's how gradient descent is used to minimize the objective function and find the optimal thetas. Imagine you're walking down a hill. No matter where you are, calculate the value of the derivative and go the opposite way (−α) to reach the valley. Keep this idea in your head: that's how computers learn.
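You can watch this happen in a few lines of Python. A minimal sketch, assuming the simple parabola Q(Θ) = Θ² from the figures above (its derivative is 2Θ):

```python
alpha = 0.25  # the learning step
theta = 3.0   # start on the increasing edge, where the derivative is positive

for _ in range(10):
    derivative = 2 * theta       # dQ/dtheta for Q(theta) = theta**2
    theta -= alpha * derivative  # minus alpha times the derivative
    print(theta)                 # 1.5, 0.75, 0.375, ... toward the minimum at 0

# Start from theta = -3.0 instead: the derivative is negative, so
# -alpha * derivative is positive and theta walks right, to the same valley.
```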

How much to invest in a superhero costume?

Finally, we have all elements of our puzzle:

  1. Data set (points representing superheroes)
  2. Model (line equation)
  3. The objective function (fancy sum over all of our points)
  4. Learning algorithm (gradient descent)

Let's run our gradient descent for 1000 iterations. That means we'll update both thetas 1000 times. Afterwards, our program should have the best thetas it could find.
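Here's a minimal sketch of that training run (not the repository's actual code[5]). For our sum of squared errors the partial derivatives work out to ∂Q/∂Θ₀ = Σⱼ 2(f(xʲ) − yʲ) and ∂Q/∂Θ₁ = Σⱼ 2(f(xʲ) − yʲ)xʲ, and with the made-up data from the earlier sketch the learning step has to be tiny, because the prices aren't scaled:

```python
def train(data, alpha, iterations=1000):
    """Gradient descent for the line f(x) = theta1 * x + theta0."""
    theta0, theta1 = 0.0, 0.0                # starting values for the thetas
    for _ in range(iterations):              # update both thetas each time
        grad0, grad1 = 0.0, 0.0
        for x_j, y_j in data:
            residual = (theta1 * x_j + theta0) - y_j  # f(x^j) - y^j
            grad0 += 2 * residual                     # dQ/dtheta0
            grad1 += 2 * residual * x_j               # dQ/dtheta1
        theta0 -= alpha * grad0              # the update formulas from above
        theta1 -= alpha * grad1
    return theta0, theta1

# Made-up (price, appearances) pairs from the earlier sketch.
data = [(200.0, 1500.0), (350.0, 2800.0), (500.0, 4100.0),
        (750.0, 6000.0), (900.0, 7300.0)]
theta0, theta1 = train(data, alpha=1e-7)
print(theta0, theta1)
```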

And our optimal thetas are:

Optimal thetas. (Image by Author)

When we put them into the line equation we get:

Optimal line equation for our data set. (Image by Author)

That's what our model learned from the data set, but I agree it's not very informative. To make it more practical, we can ask it to answer our questions. Let's say you'd like to invest $1,000 in a superhero costume (a lot of money, but a wannabe superhero would do that). As a picture is worth a thousand words, let's plot our optimal line:

Costume price to comic book issues model by Machine Learning flavored Linear Regression. (Image by Author)

Thanks to our model, we can see that it predicts around 8000 comic book issue appearances if we invest in such an expensive costume. Not bad, we could be among the top 10 most published superheroes!
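Reading that answer off in code is one line. A sketch assuming theta0 and theta1 hold trained values like the ones above:

```python
price = 1000.0
prediction = theta1 * price + theta0  # ask the line about a $1,000 costume
print(prediction)                     # around 8000 issue appearances
```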

You’re going on an adventure

That's the end of the story about teachers, students, recognizing tree leaves, and solving superhero problems. However, it's just the beginning of your journey into the Machine Learning world. You're now equipped with an intuition for how computers can learn, and you're familiar with one of the most powerful ML concepts, the gradient descent algorithm. If you're interested in an actual implementation of the ML flavored linear regression presented in this article, check my GitHub repository[5].

It's time to go further: meet new algorithms and concepts, and ask a lot of questions. Maybe it's worth checking what happens when there is no teacher in the classroom? If so, understanding unsupervised learning should be your next step.

Bibliography:

  1. https://interestingengineering.com/learn-like-an-engineer-the-feynman-technique
  2. https://www.ranker.com/list/superheroes-ranked-by-most-comic-book-appearances/ranker-comics
  3. Joel Grus, Data Science from Scratch, 2nd Edition, O’Reilly Media, Inc.
  4. Josh Patterson and Adam Gibson, Deep Learning, O’Reilly Media, Inc.
  5. https://github.com/rauluka/mluvr-regression
