A while ago I started a Coursera course on Applied Machine Learning. To help others taking the course, and to better understand the topics myself, I’ve decided to make short tutorials following the curriculum. My last two articles covered KNN classification and Linear (Polynomial) Regression. If interested, feel free to take a look:
Towards machine learning – K Nearest Neighbour (KNN)
Towards Machine Learning – Linear Regression and Polynomial Regression
Today, I will cover a technique called Logistic Regression. Even though it has "Regression" in its name, it is a classification method. The main difference compared to Linear Regression is the output: Linear Regression gives a continuous value, while Logistic Regression returns a binary class. In simple terms, it models the relationship between the inputs and a two-class outcome, e.g. whether an animal is a giraffe or not.

But more on that later.
Structure of the article:
- Introduction
- Dataset loading and description
- Data formatting and model definition
- Result visualization
- Conclusion
Enjoy the reading! 🙂
Introduction
Recently, I’ve tried KNN and Linear Regression on car data. KNN helped us classify cars into classes like "Compact car" or "Large car", with pretty satisfying results. Now let’s say someone wants to know the probability of a car being Front-wheel driven. Here Logistic Regression comes into play, since it calculates the score ("chance") of an object belonging to the target class, i.e. it can predict the probability of a car being Front-wheel driven, according to its fuel consumption, engine size, cylinder count or any other relevant feature of the car.
How does Logistic Regression work?
Like in Linear Regression, we have some input variables, X1, X2, X3. Linear Regression calculates a weight for each of these variables, adds a bias and returns a continuous value. Similarly, in Logistic Regression, weights for each input variable (X1, X2, X3) are calculated and a bias term is added, but then a logistic function is applied to the result. That function returns a value between zero (negative class) and one (positive class), which describes the probability that the input object belongs to the positive class.
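As a minimal sketch of this computation (the weights, bias and input values below are made-up numbers, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    # The logistic function: maps any real-valued score to the range (0, 1)
    return 1 / (1 + np.exp(-z))

# Made-up weights and bias for three input variables X1, X2, X3
w = np.array([0.8, -0.4, 1.2])
b = -0.5
x = np.array([1.0, 2.0, 0.5])

score = np.dot(w, x) + b  # the linear part, same as in Linear Regression
print(sigmoid(score))     # probability of the positive class
```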

Let’s see a simple example. According to Wikipedia, a fully grown giraffe has a height between 4.3 and 5.7 metres. Say we want to know whether an animal is a giraffe or not, so we measure the heights of animals and display the values on the X-axis; the logistic function then returns the probability (Y-axis) of an animal being a giraffe. If the value is above 0.5, the animal gets predicted as a giraffe; if below, the model predicts that the animal is not a giraffe. In addition to the classification, we get information about the probability. So an animal with a height of 4.8 metres has a higher probability of being a giraffe than an animal with a height of 2.7 metres.
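To make this concrete, here is a tiny sketch; the animal heights and labels below are invented for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: animal heights in metres, 1 = giraffe, 0 = not
heights = np.array([[1.5], [2.0], [2.7], [4.3], [4.8], [5.2], [5.7]])
is_giraffe = np.array([0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(heights, is_giraffe)
# Probability of being a giraffe at 4.8 m vs 2.7 m
print(model.predict_proba([[4.8], [2.7]])[:, 1])
```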
Let’s get practical, shall we?
Dataset loading and description
First, we will import the dependencies, load the cars fuel economy data and take a look at the dataframe. To use Logistic Regression, we need to import the LogisticRegression class from Scikit-Learn’s linear_model module.
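A minimal version of this setup could look like the sketch below; the file name fuel_econ.csv is an assumption, so adjust it to wherever your copy of the dataset lives:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Load the cars fuel economy dataset (file name is an assumption)
cars = pd.read_csv('fuel_econ.csv')
cars.head()
```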
Again, we will use the cars dataset from Udacity. It contains technical specifications of 3,920 cars, with data on cylinder count, engine size (displacement), fuel consumption, CO2 output, etc., as well as the drive type.
Our goal today is to classify cars into two classes according to the driven wheels. Basically, there are Front-wheel driven cars, Rear-wheel driven cars, and All-wheel driven cars in our database. Each type has its own pros and cons, but that’s not the topic here. 🙂
We want to predict the probability of a car being Front-wheel driven, according to displacement [litres] and combined fuel consumption [mpg]. To do so, we need to manage the labels: Front-wheel driven cars receive the label 1 (positive), while Rear-wheel and All-wheel driven cars receive the label 0 (negative).
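A sketch of the labelling step could look like this; the column name drive and the exact category string are assumptions about the dataset:

```python
# Binary target: 1 for Front-wheel driven cars, 0 for all others
cars['label'] = (cars['drive'] == 'Front-Wheel Drive').astype(int)
```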

If we plot displacement and combined fuel consumption as a pairplot, we can notice a pattern: Front-wheel driven cars (class 1, orange dots) tend to burn less fuel (higher mpg value) and have lower displacement.
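The pairplot could be produced roughly like this; the column names displ and comb are assumptions:

```python
import seaborn as sns

# Pairplot of displacement and combined mpg, coloured by the binary label
sns.pairplot(cars[['displ', 'comb', 'label']], hue='label')
```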

Data formatting and model definition
First, we select the desired columns from our dataset.
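For example, assuming the column names from above:

```python
X = cars[['displ', 'comb']]  # displacement [litres] and combined fuel economy [mpg]
y = cars['label']
```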
Also, when applying Logistic Regression to any data, it is good practice to perform data standardization (Source 1, Source 2). Basically, it centres each variable around zero and sets its variance to one, by subtracting the mean from each measurement and dividing the result by the standard deviation. We will run our model with and without standardization, just to see if it makes any difference on this particular dataset.
How to perform standardization?
Scikit-learn has a class for that as well; it’s called StandardScaler. Let’s see an example:
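A minimal sketch, assuming we also split the data into training and test sets first:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
# Fit the scaler on the training data only, then apply it to both sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```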
Now, let’s compare the data before and after the standardization.
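One simple way to compare them is through the summary statistics:

```python
print(X_train.describe())
print(pd.DataFrame(X_train_scaled, columns=X.columns).describe())
```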

As expected, we can observe that the standard deviation of the data is now significantly lower (exactly one, by construction) and the range of the data is smaller, so the values are closer together.

In the next step, we want to check whether this measure yields any benefits or drawbacks for our model.
Model definition
Then we can apply the model. As usual with Scikit-learn, this is a very straightforward process.
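A sketch of this step, fitting one model on the raw data and one on the standardized data (the variable names are my own):

```python
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# The same model type, trained on the standardized data
logreg_scaled = LogisticRegression()
logreg_scaled.fit(X_train_scaled, y_train)
```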
Nice, the model is ready, now let’s see the results!
Does standardizing the data affect model accuracy?
Well, probably due to the large dataset, hardly at all. Let’s see the results.
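Under the setup sketched above, the accuracies can be compared like this:

```python
print('Train accuracy (raw):   ', logreg.score(X_train, y_train))
print('Test accuracy (raw):    ', logreg.score(X_test, y_test))
print('Train accuracy (scaled):', logreg_scaled.score(X_train_scaled, y_train))
print('Test accuracy (scaled): ', logreg_scaled.score(X_test_scaled, y_test))
```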


Contrary to our expectation, the training result is unchanged, while the test set saw a minor decrease in accuracy.
These numbers look nice, but it certainly would be nicer to test the model on a real example, wouldn’t it?
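A sketch of such a spot check; the displacement and mpg values below are rough, illustrative figures for the two cars, not the exact specs used in the article:

```python
# Rough, illustrative specs: [displacement in litres, combined mpg]
fiesta = pd.DataFrame([[1.6, 32]], columns=['displ', 'comb'])  # Ford Fiesta Sport 2016
m5 = pd.DataFrame([[4.4, 17]], columns=['displ', 'comb'])      # BMW M5 2018

print(logreg.predict(fiesta), logreg.predict_proba(fiesta))  # expect class 1
print(logreg.predict(m5), logreg.predict_proba(m5))          # expect class 0
```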


I tested the model on two cars, a Ford Fiesta Sport from 2016 and a BMW M5 from 2018. For both cars, the model prediction is correct.
Result visualization
So we ask ourselves, where is the boundary between the data clusters? Well, let’s find out.
There are two ways of creating the plot. One involves calculating the slope and intercept of the decision boundary line and plotting it as a simple line plot. The other approach creates a mesh grid and plots the boundary as a contour that splits the data in two halves.
First things first, we start with the simpler approach.
We retrieve the model parameters and then calculate the slope and intercept of the boundary. Basically, we are calculating the straight line along which the predicted probability equals 0.5, expressed as a relationship between our X-axis value (displacement) and Y-axis value (combined fuel economy).
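A minimal sketch of this calculation and plot, using the model trained on the raw data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Boundary: w1*displ + w2*comb + b = 0  ->  comb = -(w1/w2)*displ - b/w2
w1, w2 = logreg.coef_[0]
b = logreg.intercept_[0]
slope = -w1 / w2
intercept = -b / w2

xs = np.linspace(X['displ'].min(), X['displ'].max(), 100)
plt.scatter(X['displ'], X['comb'], c=y, cmap='coolwarm', s=10)
plt.plot(xs, slope * xs + intercept, 'k--')  # the decision boundary line
plt.xlabel('Displacement [l]')
plt.ylabel('Combined fuel economy [mpg]')
plt.show()
```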

Again, the label 1 represents the positive class (Front-wheel driven cars), while label 0 represents all other cars (negative class), in this case Rear-wheel and All-wheel driven cars.
The second way to plot the boundary is by using the contour method. First, a mesh grid is defined, where the minimum and maximum are set for both features (displacement and combined fuel economy). numpy.c_ then concatenates the arrays along the second axis, and the probability of each grid point gets calculated. It is important to understand that we again use our model logreg here, whose slope and intercept we want to visualize.
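A sketch of the grid construction; the margins and grid resolution are arbitrary choices of mine:

```python
# Grid covering both feature ranges
x_min, x_max = X['displ'].min() - 0.5, X['displ'].max() + 0.5
y_min, y_max = X['comb'].min() - 2, X['comb'].max() + 2
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))

# np.c_ stacks the flattened grid coordinates into (n_points, 2) feature pairs;
# we then compute the positive-class probability for every grid point
probs = logreg.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
probs = probs.reshape(xx.shape)
```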
Next, we need to define the plot.
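Something along these lines:

```python
plt.scatter(X['displ'], X['comb'], c=y, cmap='coolwarm', s=10)
# Draw only the contour line at probability 0.5, i.e. the decision boundary
plt.contour(xx, yy, probs, levels=[0.5], colors='black')
plt.xlabel('Displacement [l]')
plt.ylabel('Combined fuel economy [mpg]')
plt.show()
```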

To display the boundary line, we only draw the contour line at the 50% probability of a car being Front-wheel driven. All cars above the line are predicted to be Front-wheel driven (positive class 1), while all cars under the line are predicted to NOT be Front-wheel driven, so they are either Rear-wheel or All-wheel driven.
We can also add the probability as a colour to our plot. This helps with visual identification, so one can estimate the probability of a point without running the model or calculating the exact value.
Here’s the code:
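A possible version of it, reusing the grid probabilities from above (the colour map choice is mine):

```python
# Filled contours show the predicted probability across the whole plane
plt.contourf(xx, yy, probs, levels=20, cmap='RdBu_r', alpha=0.8)
plt.colorbar(label='P(Front-wheel driven)')
plt.scatter(X['displ'], X['comb'], c=y, cmap='coolwarm', edgecolors='k', s=10)
plt.contour(xx, yy, probs, levels=[0.5], colors='black')  # decision boundary
plt.xlabel('Displacement [l]')
plt.ylabel('Combined fuel economy [mpg]')
plt.show()
```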

I need to point out that we have two features here, which is why the plot is two-dimensional. Each additional feature would add a dimension to the plot.
Conclusion
Here we presented an example of how Logistic Regression can be used to predict the drive type of a car according to its displacement and combined fuel economy. A data standardization technique was also evaluated; for this particular dataset, standardization had practically no effect on model accuracy.
The decision boundary between the two data clusters (Front-wheel and non-Front-wheel driven cars) has been calculated and plotted using two techniques. In addition, a contour map of the probabilities has been added.
I hope you liked my article. If some of the presented topics need further explanation, feel free to contact me on LinkedIn. 🙂
Also, until next time, check out my other articles. Cheers!