GETTING STARTED

Binary classification and logistic regression for beginners

How to solve binary classification with gradient ascent

Lily Chen
Towards Data Science
7 min read · Dec 2, 2020


Binary classification (Image created by me)

Let’s say you have a dataset where each data point consists of a middle school GPA, an entrance exam score, and whether that student was admitted to her town’s magnet high school.

Given a new (GPA, exam score) pair from a student named Sarah, how can you predict whether she will be admitted? Sarah’s GPA is 4.3 and her exam score is 79.

We need to classify Sarah as “yes” or “no” for admission

This is a binary classification problem because we’re predicting an outcome that can only be one of two values: “yes” or “no”.

A classic algorithm for solving binary classification is logistic regression.

Before we delve into logistic regression, note that this article assumes an understanding of linear regression and familiarity with how gradient descent works in linear regression. Need a refresher? Read this:

Another way of asking “will Sarah be admitted to magnet school” is:

“What is the probability of Sarah being admitted given her GPA and entrance exam score?”

The mathematical way of representing this question is:

P(y = 1 | x; θ)

This equation reads “the probability of y equaling 1, given x, parameterized by θ”.

y = 1 means “admitted”. Conversely, y = 0 means “not admitted”.

x is the set of features, which in this case, are GPA and entrance exam score.

θ is the set of parameters that describes how much GPA and exam score affect the probability. Remember that in linear regression, θ is the vector [y-intercept, slope], and the slope m of a line (y = mx + b) describes how much the variable x affects y.

Logistic regression is about finding this probability, i.e. P(y=1 | x; θ).

The logic behind logistic regression

We don’t know Sarah’s admission status, but we do know the admission status of 17 other students.

For instance, as the chart shows, we know that John is not admitted, Elise is not either, and Bob is. We also know the score and GPA for all of them.

The probability of John not being admitted is some number between 0 and 1. The probability of Bob being admitted is also somewhere between 0 and 1.

We want our model to maximize P(y=0 | x; θ) for John, P(y=1 | x; θ) for Bob, P(y=0 | x; θ) for Elise, and so on. In logistic regression, we want to maximize the probability of observing all of the data points we were given.

Visualizing logistic regression

In linear regression and gradient descent, your goal is to arrive at the line of best fit by tweaking the slope and y-intercept little by little with each iteration. The line of best fit minimizes the sum of squared errors.

Linear regression is about finding the line with the least sum of squared errors

Obviously, finding the least-squares line makes less sense when you’re doing classification.

To visualize logistic regression, let’s start in 2D first, when you only have 1 feature instead of 2. In our case, let’s only look at GPA.

The x-axis is the GPA. The y-axis is the probability that a student gets admitted given her GPA.

Since it’s a binary classification, all the data points given have a y-value of either 0 or 1.

Instead of finding the least-squares regression line, you want to find a sigmoid function that best fits the dataset.

Which is a better fit? Red line or green line?

To answer this question, find where P(y | x) lands for each GPA. For all your GPA values, you want P(y | x) to be as close as possible to the observed value of y (either 0 or 1). For instance, we know John is not admitted and his GPA is 2.7, so we want P(y | 2.7) to be close to 0. Similarly, Bob is admitted and his GPA is 3.8, so we want P(y | 3.8) to be close to 1. The exact math to compute P(y | x) will be discussed momentarily.

Logistic regression is about finding a sigmoid function h(x) that maximizes the probability of your observed values in the dataset.
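To make this concrete, here is a minimal sketch of a one-feature sigmoid in Python. The exact formula for the sigmoid appears in the next section; the two parameter values below are made up purely to illustrate the S-shaped curve and are not the ones fitted later in the article.

```python
import math

def admit_probability(gpa, theta0=-32.0, theta1=10.0):
    """P(admitted | GPA) for a 1-feature logistic model.

    theta0 and theta1 are hypothetical values chosen only for
    illustration; they are not fitted to the article's dataset.
    """
    return 1.0 / (1.0 + math.exp(-(theta0 + theta1 * gpa)))

print(admit_probability(2.7))  # John, GPA 2.7 -> ~0.007, close to 0 (not admitted)
print(admit_probability(3.8))  # Bob, GPA 3.8  -> ~0.998, close to 1 (admitted)
```

A good sigmoid fit pushes each student’s predicted probability toward that student’s actual outcome, which is exactly what the red-versus-green comparison above is about.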

Logistic regression algorithm

Onto the math itself!

If you remember from statistics, the probability of two independent events A and B both occurring is equal to the probability of A times the probability of B.

P(A and B) = P(A) * P(B).

In logistic regression, we want to maximize the probability of all of the observed values occurring. Mathematically, the number we’re trying to maximize can be written as:

L(θ) = Π P(yᵢ | xᵢ; θ), the product of the probabilities of every data point in the dataset. Π (pi) means “product”.

L(θ) is what we want to maximize. In machine learning terms, L(θ) is called the likelihood, and maximizing it is known as “maximum likelihood estimation”, or MLE.

For a single data point, P(y = 1 | x; θ) = h(x) and P(y = 0 | x; θ) = 1 − h(x), where h(x) is the model’s predicted probability of admission (defined below). These combine into a third function, P(y | x; θ) = h(x)^y · (1 − h(x))^(1−y). The third function is a combination of the first two: if you plug y = 0 or y = 1 into it, you get back one of the first two.
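Here is a tiny sketch of how the combined form collapses to the right case. The value h below stands for the model’s predicted probability P(y = 1 | x; θ) for one student; 0.9 is an arbitrary example value.

```python
def prob_of_label(h, y):
    """Combined form: P(y | x; theta) = h^y * (1 - h)^(1 - y)."""
    return (h ** y) * ((1 - h) ** (1 - y))

h = 0.9                      # say the model assigns a 90% chance of admission
print(prob_of_label(h, 1))   # 0.9  -> same as h       (the y = 1 case)
print(prob_of_label(h, 0))   # ~0.1 -> same as 1 - h   (the y = 0 case)
```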

In linear regression, h(x) takes the form h(x) = mx + b, which can also be written as follows:

All the different ways of representing a linear function

In logistic regression, we use a sigmoid function instead of a line.

Sigmoid function for logistic regression: h(x) = 1 / (1 + e^(−θᵀx))

Taken together, this is the equation for P(y | x; θ): P(y | x; θ) = h(x)^y · (1 − h(x))^(1−y), with h(x) = 1 / (1 + e^(−θᵀx)).

This is how you compute P(y | x) for every data point. Remember, y is either 0 or 1.
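As a sketch of what this looks like in code, here is one way to compute P(yᵢ | xᵢ; θ) for every student at once with NumPy. The column of ones is the intercept term, and the feature order (intercept, GPA, exam score), the two example rows, and the starting θ are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prob_of_labels(X, y, theta):
    """P(y_i | x_i; theta) for every row of X, via h^y * (1 - h)^(1 - y)."""
    h = sigmoid(X @ theta)                # predicted P(y = 1 | x; theta) per student
    return (h ** y) * ((1 - h) ** (1 - y))

# Two made-up students: [intercept, GPA, exam score]; labels 0 = rejected, 1 = admitted
X = np.array([[1.0, 2.7, 52.0],
              [1.0, 3.8, 81.0]])
y = np.array([0, 1])
theta = np.array([-1.0, 0.5, 0.01])       # arbitrary, untrained parameters
print(prob_of_labels(X, y, theta))        # one probability per student
```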

The last step in logistic regression is finding good values for theta.

(Sigmoid curve image source: https://en.wikipedia.org/wiki/Sigmoid_function#/media/File:Logistic-curve.svg)

The way we go about finding the parameters in theta is similar to what we do in linear regression to find the line of best fit.

In linear regression, we adjust the y-intercept and slope through multiple iterations to arrive at the least square regression line.

In logistic regression, instead of minimizing the sum of squared errors (as in linear regression), we’ll adjust the parameters of theta to maximize L(θ).

As a reminder: L(θ) = Π P(yᵢ | xᵢ; θ), the product of the probabilities of every data point in the dataset.

From L(θ), we derive log(L(θ)), or l(θ).

Deriving l(θ) from L(θ): l(θ) = log L(θ) = Σ [ yᵢ · log h(xᵢ) + (1 − yᵢ) · log(1 − h(xᵢ)) ]

The last equation for l(θ) is what the logistic regression algorithm actually maximizes. We take the log of L(θ) because it turns the product into a sum, which is much easier to compute, and maximizing l(θ) yields the same θ as maximizing L(θ).
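In code, l(θ) is just a sum over students of y·log(h) + (1 − y)·log(1 − h). A minimal sketch, reusing the same NumPy conventions as the snippet above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, theta):
    """l(theta) = sum over all students of y*log(h) + (1 - y)*log(1 - h)."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```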

Because we’re trying to maximize a number here, the algorithm we’ll use is called gradient ascent. This is in contrast to gradient descent used in linear regression where we’re trying to minimize the sum of squared errors.

Note: you can also use gradient descent in logistic regression. After all, maximizing the likelihood is the same as minimizing the negative log-likelihood.

To get the gradient ascent formula, we take the partial derivative of l(θ) with respect to theta.

Computation steps found in this article. The result is the gradient ascent update rule θⱼ := θⱼ + α · Σ (yᵢ − h(xᵢ)) · xᵢⱼ, where α is the learning rate.

If you were doing gradient descent instead, you would take the partial derivative of −l(θ) to arrive at the update formula.

I’ve implemented logistic regression with gradient ascent in the gist shown below.
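In outline, a gradient-ascent implementation looks something like the sketch below (a simplified stand-in, not the gist itself; the feature order [intercept, GPA, exam score], the default learning rate, and the iteration count are illustrative assumptions, and, as described next, alpha and the starting theta need tuning in practice).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gradient_ascent(X, y, alpha=0.001, n_iters=100_000):
    """Fit theta by gradient ascent on the log-likelihood l(theta).

    X: (n_students, 3) array of [1, GPA, exam score] rows
    y: (n_students,) array of 0/1 admission labels
    alpha, n_iters: placeholder values -- tune them for your data
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)       # current P(y = 1 | x; theta) for every student
        gradient = X.T @ (y - h)     # partial derivatives of l(theta) w.r.t. theta
        theta += alpha * gradient    # ascend: step *up* the gradient
    return theta
```

The only difference from gradient descent is the sign: descending on −l(θ) would use theta -= alpha * (X.T @ (h - y)), which works out to exactly the same update.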

Through trial and error, tweaking the learning rate alpha and the initial values for theta, I found the parameters [-109.99, 10.655, 0.821] to be a good fit for the model.

Using these parameters, the probability of Sarah being admitted (remember, her GPA is 4.3 and her exam score is 79) is:

P(y = 1 | x; θ) = 1 / (1 + e^(−(−109.99 + 10.655 · 4.3 + 0.821 · 79))) ≈ 0.665

P = 0.665. Her chances aren’t great, but she has a decent shot: she’s more likely than not to be admitted, so we’ll classify her as “admitted”.
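To double-check this number by hand, here is a small snippet using the fitted parameters, with the same assumed feature order [intercept, GPA, exam score] as the sketches above:

```python
import numpy as np

theta = np.array([-109.99, 10.655, 0.821])  # fitted [intercept, GPA, exam score] weights
sarah = np.array([1.0, 4.3, 79.0])          # intercept term, GPA 4.3, exam score 79

p_admit = 1.0 / (1.0 + np.exp(-(sarah @ theta)))
print(round(p_admit, 3))                    # ~0.665
```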

This article talks about binary classification. In my next article, I will write about multiclass classification. Stay tuned!

You can find me on LinkedIn https://www.linkedin.com/in/yilingchen405/
