
What is Logistic Regression?

Logistic Regression is a fundamental classification tool for data scientists and machine learning engineers.

Photo by Clay Banks on Unsplash

This tutorial covers the basics of applying logistic regression, using a little bit of Python. It is also a continuation of the earlier post "What is Linear Regression?".

About Logistic Regression

It is a little counterintuitive, but Logistic Regression is typically used as a classifier. In fact, it is one of the most widely used and well-known classification methods among Data Scientists. The idea behind this classification method is that the output falls between 0 and 1: essentially, the model returns the probability that the data you gave it belongs to a certain group or class. From there, the developer can set a threshold, depending on how cautious they want to be.

For example, I may set a threshold of 0.8, which means any output from the Logistic Regression model greater than or equal to 0.8 will be classified as a 1, and anything less will be classified as a 0. From there I can move that 0.8 threshold up or down, depending on my use case and the metrics I care about.
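As a minimal sketch of that decision rule (the probability value here is made up purely for illustration):

# A hypothetical probability returned by the model
p = 0.83

# The decision threshold we chose
threshold = 0.8

# Classify as 1 (approved) if the probability meets the threshold, else 0
label = 1 if p >= threshold else 0
print(label)  # prints 1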

How does Logistic Regression relate to Linear Regression?

The Linear Regression formula can be included in the Logistic Regression formula. If you recall, the Linear Regression formula is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

Well…the formula for Logistic Regression is:

$$p = \frac{1}{1 + e^{-y}}$$

And we can swap out the y with our Linear Regression formula:

$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n)}}$$

What is this Logistic Regression equation doing?

The Linear Regression portion produces some output value, and the Logistic Regression portion squashes that value into the range between 0 and 1. Plotted over many inputs, this forms the S-curve you may have seen before.
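To see that squashing in action, here is a minimal sketch of the logistic (sigmoid) function from the equation above; the sample inputs are arbitrary:

import numpy as np

# The logistic (sigmoid) function maps any real number into the range (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Arbitrary inputs running from very negative to very positive
print(sigmoid(np.array([-6, -2, 0, 2, 6])))
# Outputs approach 0 on the far left and 1 on the far right, tracing the S-curve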

If not, here is an example from the Scikit Learn docs:

https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html

Let’s code up an example with Python

We mainly need the popular machine learning library Scikit Learn. NumPy is used later on to build the data structures that feed our data into Scikit Learn.

We will create sample data about patients and whether they should be approved for some type of treatment. We will create columns for age, weight, and average resting heart rate, as well as a column for the approval decision. All of the columns besides the approval decision serve as our features. Our model should estimate how important each feature is to the approval decision, and we should get back a probability that the patient is approved for the treatment.

Note: This data is completely made up. This is not indicative of any real treatments or decision making by medical professionals.

Library imports

# Import numpy to create numpy arrays
import numpy as np

# Import Scikit-Learn to allow us to run Logistic Regression
from sklearn.linear_model import LogisticRegression

Creating data for our model

The data we are creating contains features for age, weight, and average resting heart rate. Our approval decision will be a column called approved, containing 0s and 1s that indicate whether or not the patient was approved: 0 means the patient was not approved, 1 means the patient was approved.

# Sample Data
approved = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
age = [21, 42, 35, 33, 63, 70, 26, 31, 52, 53]
weight = [110, 180, 175, 235, 95, 90, 175, 190, 250, 185]
avg_hrt = [65, 70, 72, 77, 67, 62, 68, 65, 73, 75]

Structuring the features and labels

We stack the multiple Python lists into one NumPy object, and we also make sure the approved array has the one-dimensional shape that Scikit Learn expects for labels.

# Combining the multiple lists into one object called "X"
X = np.column_stack([age, weight, avg_hrt])
# Reshaping the approvals to work with scikit learn
y = approved.reshape(len(approved), )

Let’s Build a Logistic Regression model

First we instantiate the model.

# Instantiating the model object
model = LogisticRegression()

Now we can fit the model with data. This is how we approximate a patient’s approval status given their age, weight, and average resting heart rate.

# Fitting the model with data
fitted_model = model.fit(X, y)

With the model now trained, let’s take a look at the coefficients.
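The snippet that prints them is not shown above, but one way to produce the output below, assuming Scikit Learn’s coef_ attribute (which stores one learned coefficient per feature), is:

# Pairing each feature name with its learned coefficient
feature_names = ["age", "weight", "average resting heart rate"]
for name, coef in zip(feature_names, fitted_model.coef_[0]):
    print(f"The coefficient for {name} = {coef}")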

The coefficient for age = -0.6785458695283049
The coefficient for weight = -0.10023841185826837
The coefficient for average resting heart rate = -0.5577492564980651

Interpreting the coefficients

What these coefficients tell us is that age is the most impactful factor, closely followed by average resting heart rate, with the patient’s weight coming in last for this specific data and use case.

The values of the coefficients quantify how much each feature impacts the probability of getting approved. Think of it this way: as the feature value (left side of the printout above) increases, the probability of getting approved decreases, since the coefficients are negative. The size of the decrease is quantified by the coefficient’s magnitude (right side of the printout).

This is useful for understanding which features contribute to your model and impact its decisions. In discussions of explainable AI, having a way of interpreting the model and seeing how it came to its decision is extremely important. In this case we can look at the coefficients to determine how impactful each feature was, and why the algorithm may choose to approve some people but not others.
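One common trick, not shown in the original, is to exponentiate the coefficients into odds ratios, since Logistic Regression coefficients are expressed in log-odds:

# Each coefficient is a log-odds value; exponentiating gives an odds ratio:
# the factor the odds of approval are multiplied by per one-unit increase in that feature
odds_ratios = np.exp(fitted_model.coef_[0])
print(odds_ratios)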

Let’s test the model on new data

Create new data

new_age = [20, 45, 33, 31, 62, 71, 72, 25, 30, 53, 55]
new_weight = [105, 175, 170, 240, 100, 95, 200, 170, 195, 255, 180]
new_avg_hrt = [64, 68, 70, 78, 67, 61, 68, 67, 66, 75, 76]
# Combining the multiple lists into one object called "test_X"
test_X = np.column_stack([new_age, new_weight, new_avg_hrt])

Run new data through the model

results = fitted_model.predict(test_X)

Take a look at the results

print(f"Our approval results are: {results}")

Our approval results are: [1 1 1 0 0 0 0 1 1 0 0]

As you can see, Scikit Learn automatically applied a default threshold of 0.5 for us and determined our approvals. If you want to look at the actual probabilities, we can use a different function provided by Scikit Learn, predict_proba():

results_w_probs = fitted_model.predict_proba(test_X)
print("Our approval results with their probabilites:")
for result in results_w_probs:
    print(f"Probability not approved = {result[0]:.2f}, Probability approved = {result[1]:.2f}")

Our approval results with their probabilities:
Probability not approved = 0.00, Probability approved = 1.00
Probability not approved = 0.28, Probability approved = 0.72
Probability not approved = 0.00, Probability approved = 1.00
Probability not approved = 0.84, Probability approved = 0.16
Probability not approved = 0.92, Probability approved = 0.08
Probability not approved = 0.99, Probability approved = 0.01
Probability not approved = 1.00, Probability approved = 0.00
Probability not approved = 0.00, Probability approved = 1.00
Probability not approved = 0.00, Probability approved = 1.00
Probability not approved = 1.00, Probability approved = 0.00
Probability not approved = 1.00, Probability approved = 0.00

Note: The list of probabilities for the 0s and 1s is in the same order as the numpy array we passed to the predict_proba() function earlier.

From here we can set different thresholds to produce a different number of approvals based on probability. If we want to be cautious about approving people, we can set the threshold to 0.8 or 0.9. If the treatment is safe, non-invasive, and low cost, we can set the threshold lower, to 0.25 or 0.3. This all depends on the use case.
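As a minimal sketch of applying such a custom threshold to the probabilities above (the 0.8 cutoff is just one example):

# Probability of the positive class (approved) for each new patient
approved_probs = fitted_model.predict_proba(test_X)[:, 1]

# Apply a stricter custom threshold instead of the default 0.5
threshold = 0.8
cautious_results = (approved_probs >= threshold).astype(int)
print(f"Approvals at a {threshold} threshold: {cautious_results}")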

Next Steps

I hope you enjoyed this tutorial and found it useful! Here are some next steps to consider if you want to make your model better:

  • Find and add more data, whether it’s more rows of data or new features.
  • Test out different thresholds.
  • Learn about classification metrics. (Scikit Learn’s classification report is a good place to start; see the sketch after this list.)
  • Tailor your model towards classification metrics that make sense for your use case. Accuracy may not be the best metric. (An example would be when you have imbalanced classes)
  • Explore more complex models like XGBoost, LightGBM, SVMs, Neural Networks, etc.
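As a minimal sketch of that classification report, assuming we somehow knew the true approval decisions for the new patients (the labels below are purely hypothetical):

from sklearn.metrics import classification_report

# Hypothetical ground-truth labels for the new patients, for illustration only
true_labels = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0])

# Precision, recall, F1, and support per class, comparing truth vs. predictions
print(classification_report(true_labels, results))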
