This is the fourth post in my scikit-learn tutorial series. If you didn’t catch them, I strongly recommend reading my first three posts first – it’ll be way easier to follow along:
This 4th module introduces the concept of linear models, using the well-known linear regression and logistic regression models as working examples.
In addition to these basic linear models, we show how to use feature engineering to handle nonlinear problems using only linear models, as well as the concept of regularization to prevent overfitting.
Altogether, these concepts enable us to create simple yet powerful models that can handle a lot of ML problems: with fine-tuned hyperparameters, without overfitting, and even on nonlinear data.

All graphs and images are made by the author.
Linear models
Linear models are models that "fit" or "learn" by setting coefficients such that they eventually rely only on a linear combination of the input features. In other words, if the input data is made of N features f_1 to f_N, the model at some point is based on the linear combination:

y = beta_0 + beta_1 f_1 + beta_2 f_2 + ... + beta_N f_N
The coefficients the model learns are the N+1 coefficients beta. The coefficient beta_0 represents an offset: a constant value in the output whatever the values of the inputs. The idea behind such models is that the "truth" can be approximated with a linear relationship between the inputs and the output.
In the case of regression problems, where we want to predict a numerical value from the inputs, one of the simplest and best-known linear models is linear regression. You have most likely run hundreds of linear regressions already (by hand, in Excel or Python).
In the case of classification problems, where we want to predict a category from the inputs, the simplest and best-known linear model is logistic regression (don’t get fooled by the "regression" in "logistic regression": it really deals with classification).
Many other linear models exist, for example Support Vector Regression and Support Vector Classification, as well as lots of variants of linear regression and logistic regression. They could all be the subject of a series of posts. The idea here is not to review them in depth, but to show their basic usage and limitations (although my completeness syndrome will make me detail them a bit).
Important note: one of scikit-learn’s committed stances is to provide models that work out of the box, so newcomers can have running code quickly (and not spend too much time setting up framework gimmicks or dealing with errors when setting up new models). I’d still recommend reading the documentation extensively, as it is well written and you’ll learn a lot both about the Python API and about the mathematics and good practices.
Linear model for regression: linear regression
Linear regression is the most well-known linear model for regression. As stated above, the idea is to approximate the output y as well as possible with a linear combination of the inputs f_i:

y ≈ beta_0 + beta_1 f_1 + beta_2 f_2 + ... + beta_N f_N
One of the reasons for the popularity of linear regression, and of linear models in general, is that they can be handled with matrices, since matrices are representations of linear operations. In particular, one of the possible ways to learn the coefficients beta (and the most used in science in general, not just in ML) is the Ordinary Least Squares method.
The Ordinary Least Squares method consists in selecting the vector of coefficients beta such that the sum of squared errors is minimal. This method has the advantage of being easily interpretable (the model "minimizes" the squared distance between the data and its predictions) and of having a closed-form solution (so no numerical optimization is needed: it basically comes down to matrix multiplication and inversion). If you check the documentation of sklearn’s LinearRegression, you’ll see that it’s this method that is implemented.
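To make the closed-form idea concrete, here is a minimal sketch (my own illustration with made-up data, not sklearn’s internal code) that solves the normal equations directly with NumPy and checks that the result matches LinearRegression:
import numpy as np
from sklearn.linear_model import LinearRegression
# Illustrative synthetic data: y = 4 + 3x plus noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + 0.5 * np.random.randn(100)
# Ordinary Least Squares via the normal equations: beta = (A^T A)^-1 A^T y,
# where A is X with an extra column of ones for the intercept beta_0
A = np.hstack([np.ones((X.shape[0], 1)), X])
beta = np.linalg.solve(A.T @ A, A.T @ y)
print(beta)  # [beta_0, beta_1], close to [4, 3]
# Same coefficients with sklearn's LinearRegression
ols = LinearRegression().fit(X, y)
print(ols.intercept_, ols.coef_)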
Many other methods exist to fit a linear regression, all of which lead to "variants" of that simplest model. As we’ll see below, ridge regression and lasso regression are among such variants.
Note that linear regression allows you to do polynomial regression if a small preprocessing step is applied. Indeed, a polynomial is just a linear combination of monomials of the input:

x, x², x³, ..., x^d

so, in the univariate case (a single input feature), a polynomial can be written

y = beta_0 + beta_1 x + beta_2 x² + ... + beta_d x^d

To do so, we simply generate a new input matrix made of all the polynomial variables we want (the X¹, X², etc. for a univariate problem, and even cross-variables in the case of a multivariate polynomial, like X_1 X_2, X_1 X_3, and so on).
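To see concretely what this preprocessing step produces, here is a small sketch (illustrative values only) using sklearn’s PolynomialFeatures, which also generates the cross-variables in the multivariate case:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# A single sample with two features f_1=2 and f_2=3
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))
# [[2. 3. 4. 6. 9.]]  ->  f_1, f_2, f_1², f_1*f_2, f_2²
print(poly.get_feature_names_out(["f_1", "f_2"]))
# ['f_1' 'f_2' 'f_1^2' 'f_1 f_2' 'f_2^2']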
Here is a simple example of a linear regression on a 1D input feature, so the model actually learns only beta_0 and beta_1:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Synthetic data: y = 4 + 3x plus Gaussian noise
X = 2 * np.random.rand(100, 1)
y_true = 4 + 3 * X
y = y_true + 0.5 * np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
fig, ax = plt.subplots()
ax.scatter(X_train, y_train, alpha=0.5, label='Training Set', color='blue')
ax.scatter(X_test, y_test, alpha=0.5,label='Test Set', color='green')
ax.plot(X, y_true, label='True Underlying Model', color='red', linestyle='--')
ax.plot(X_test, y_pred, label='Linear Regression Model', color='orange')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title(f'Linear Regression on a 1D feature with test score\nR^2={model.score(X_test, y_test):.2f}')
ax.legend()

To give you a bit more information about the linear regression model in sklearn:
- you can tune whether the model learns the offset beta_0 with the fit_intercept hyperparameter, as in LinearRegression(fit_intercept=True). If set to False, the model expects the target y to be centered, that is, to have mean 0.
- once the model has been fitted, it has learned the beta coefficients. You can inspect them with model.intercept_ and model.coef_. Remember that in the sklearn API, the learned parameters are suffixed with an underscore "_".
- the default score for linear regression is the R² coefficient, which translates how well the fitted model "explains" the variability of the dataset. Of course, you can also import any score function from the metrics module and compute other scores, for example with from sklearn.metrics import mean_absolute_error; mean_absolute_error(y_test, y_pred).
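For instance, continuing with the 1D regression example above (the exact numbers depend on the random data, so treat this as an illustrative sketch):
from sklearn.metrics import mean_absolute_error
# Learned parameters (note the trailing underscore): close to 4 and 3 for this synthetic data
print(model.intercept_, model.coef_)
# Default score (R²) on the test set, plus an alternative metric
print(model.score(X_test, y_test))
print(mean_absolute_error(y_test, y_pred))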
Linear model for classification: logistic regression
The equivalent of linear regression for classification problems is logistic regression.
The idea is pretty simple: create a linear combination y that, when fed to the logistic function, best separates the target classes. As in linear regression, the linear combination y can take any value – but to fit our classification context it is fed to the logistic function, an S-shaped function that takes any real input and maps it to the [0, 1] interval. This interval is then associated with the target classes, where 0 corresponds to one class and 1 to the other.
In other words, if the linear combination of a sample is very negative, its logistic value is close to 0 and the sample is strongly associated with class 0. As the linear combination increases towards 0, the logistic value approaches 0.5 and the target class becomes "uncertain". If the linear combination keeps increasing, the logistic value goes above 0.5 and the sample is mapped to class 1. In this case, we say that 0.5 is the classification threshold. Note that some other, similar algorithms rather use a [-1, 1] mapping interval with a threshold value of 0. These are basically just conventions and don’t change the model’s performance.
So we could write the model as:

P(class = 1 | x) = logistic(y) = 1 / (1 + exp(-y)), with y = beta_0 + beta_1 f_1 + ... + beta_N f_N

where x represents a sample vector of length N with features f_1 to f_N, and y is the linear combination of that sample with the model’s coefficients. y can take any value, and that value is mapped to the [0, 1] interval thanks to the logistic function.
To express it yet another way, the probability that a sample belongs to a certain class is linked to its linear combination value y. The final predicted class is simply the closest – or most probable – class, based on the position of the logistic value relative to the threshold.
In sklearn terms, the probability is computed with .predict_proba, which returns an array of floats that sum to 1, representing the probabilities of belonging to each class. The .predict method, on the other hand, returns a class: the most probable class according to .predict_proba.
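To make the link between the two methods concrete, here is a small self-contained sketch (synthetic data, illustrative only) checking that .predict simply returns the class with the highest .predict_proba value:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=50, n_features=2, n_informative=2, n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:5])  # one row per sample, columns sum to 1
pred = clf.predict(X[:5])         # most probable class for each sample
print(proba)
print(pred)
print(np.all(pred == clf.classes_[np.argmax(proba, axis=1)]))  # True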
Let’s see a simple 1D example. Again, the linear model only has a single input to work on, so the x-axis can be used to plot either the feature value, or the linear combination y (beta_1 X + beta_0):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=1, n_informative=1, n_redundant=0, n_clusters_per_class=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
# Grid of feature values used to draw the fitted logistic curve
x_ = np.linspace(-2, 3).reshape(-1, 1)
fig, ax = plt.subplots()
ax.scatter(X_train[y_train == 0], y_train[y_train == 0], label='Class 0 (Training)', color='blue')
ax.scatter(X_train[y_train == 1], y_train[y_train == 1], label='Class 1 (Training)', color='red')
ax.scatter(X_test[y_test == 0], y_test[y_test == 0], label='Class 0 (Test)', marker='s', color='blue', alpha=0.5)
ax.scatter(X_test[y_test == 1], y_test[y_test == 1], label='Class 1 (Test)', marker='s', color='red', alpha=0.5)
ax.plot(x_, model.predict_proba(x_)[:, 1], label='Logistic Regression Model', color='green')
ax.axhline(0.5, color='gray', linestyle='--', label='Decision Boundary (0.5)')
ax.set_xlabel('X')
ax.set_ylabel('Probability')
ax.set_title(f'Logistic Regression Example with score={model.score(X, y):.2f}')
ax.legend()
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

Here the green line corresponds to the logistic value of the linear combination for a given sample with value x. By tuning the linear coefficients, the shape and position of this green line move to better match the training samples. The curve is then used to predict the class and probability of new samples.
To understand this better, let’s look at a 2D example, which is more suited to visualizing linear classification:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.inspection import DecisionBoundaryDisplay
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(C=10000000, max_iter=100000)  # very large C: practically no regularization
model.fit(X_train, y_train)
db = DecisionBoundaryDisplay.from_estimator(
model,
X,
response_method="predict_proba", # "predict_proba",
cmap="RdBu_r",
alpha=0.5, grid_resolution=200,
)
sns.scatterplot(x=X_train[:, 0], y=X_train[:, 1], hue=y_train, palette={0:"blue", 1:"red"}, alpha=0.5, ax=db.ax_)
sns.scatterplot(x=X_test[:, 0], y=X_test[:, 1], hue=y_test, palette={0:"blue", 1:"red"}, ax=db.ax_)
db.ax_.set_title(f"Decision boundary of the trainedn LogisticRegression with score={model.score(X, y):.2f}')

This example shows how a 2D input space is split by the model with a 1D line. This "line" corresponds to the threshold of the logistic function: on one side of it the samples belong to one class, and on the other side they belong to the other. The important idea here is to extend the reasoning from the previous example to higher dimensions.
Like before, here is some additional information about the logistic regression model in sklearn:
- LogisticRegression takes more hyperparameters, including fit_intercept like linear regression, but also additional parameters that allow you to tune regularization – we’ll see those further below.
- also like linear regression, the learned coefficients can be accessed with model.coef_ and model.intercept_. Additionally, you can get the list of encountered classes with model.classes_.
- the default score is the accuracy, which is simply the fraction of correct classifications: 0 corresponds to no correct prediction and 1 corresponds to all predictions being correct, as illustrated in the quick check below.
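As a quick sanity check of that last point, reusing the model and test split from the 2D example above (illustrative only, the value depends on the random split):
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
# The default .score of a classifier is the accuracy: the fraction of correct predictions
print(model.score(X_test, y_test))
print(accuracy_score(y_test, y_pred))
print((y_pred == y_test).mean())  # same value, computed by hand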
Again, there is a lot to say about linear models for classification, but the point here is just to provide simple examples. To learn more about LogisticRegression, I strongly encourage you to go check scikit-learn’s user guide.
Handling non-linear data
So far we have seen examples of linear regression and logistic regression on synthetic data that were indeed linear. In other words, the truth we tried to approximate with a linear model was indeed linear. But that almost never happens with real data, where the systems we try to model and reproduce are usually quite non-linear. So does that mean that linear models fall short? Actually no, there is a workaround.
In addition to using natively non-linear models (models that handle non-linear data by design), we can still use our linear models by creating new features in the input data that hold some non-linearity. In other words, we are going to use the same models, but with a "bigger" input data matrix, where we ourselves add new features that contain non-linear relations between the input features.
A good, simple example is polynomial regression, as introduced above. Say we want to fit a target y that is non-linear with respect to a single feature x. With a standard linear model, the regression is simply y = beta_1 x + beta_0. If instead we create new features, say x² and x³, the input matrix now has 3 features, and the linear regression can use the relation y = beta_3 x³ + beta_2 x² + beta_1 x + beta_0 to fit the target. See the following example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
X = 2 * np.random.rand(100, 1)
y = X**3 + 3 * X**2 + 0.5 * X + 2 + np.random.randn(100, 1)
linreg = LinearRegression()
linreg.fit(X, y)
degree = 3
poly_linreg = make_pipeline(PolynomialFeatures(degree, include_bias=False), LinearRegression())  # the intercept is handled by LinearRegression
poly_linreg.fit(X, y)
x_ = np.linspace(0, 2, 100).reshape(-1, 1)
fig, ax = plt.subplots()
ax.scatter(X, y, label='Original Data')
ax.plot(x_, linreg.predict(x_), color='blue', label=f'Linear Regression (linreg) score={linreg.score(X, y):.2f}')
ax.plot(x_, poly_linreg.predict(x_), color='red', label=f'Polynomial Regression (poly_linreg, Degree {degree}) score={poly_linreg.score(X, y):.2f}')
ax.set_xlabel('X')
ax.set_ylabel('y')
ax.set_title('Fitting Linear and Polynomial Curves to Data')
ax.legend()

In particular, let’s inspect the coefficients for both the linear regression and the polynomial regression:
print(linreg.coef_, linreg.intercept_)
print(poly_linreg[-1].coef_, poly_linreg[-1].intercept_)
# [[10.607613]] [-2.32139028]
# [[-0.83958618 5.07382762 0.30530322]] [1.9408571]
For the simple linear regression we only get the beta_1 value and the beta_0 intercept, but for the polynomial regressor we get 3 coefficients, corresponding to beta_1, beta_2 and beta_3, as well as the beta_0 intercept.
Using PolynomialFeatures is just one of many possibilities to create new features that contain non-linearity. Other options include KBinsDiscretizer (especially with encode='onehot'), SplineTransformer, or kernel approaches with Nystroem and the kernel trick implemented in some models (like Support Vector Machine models, with SVR for regression and SVC for classification).
The approach is always the same: create new non-linear features and add them to the input data so the linear models can use them to fit a complex target y. And the good news is that in sklearn all these approaches are implemented either as preprocessing steps to use in a pipeline, or built into the estimator models, as in the sketch below.
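As an example, here is a hedged sketch (parameter values and data are purely illustrative) plugging two of these transformers in front of a linear model, exactly like we did with PolynomialFeatures:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LinearRegression
X = 2 * np.random.rand(200, 1)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(200)  # a non-linear target
# Spline features + linear regression
spline_model = make_pipeline(SplineTransformer(degree=3, n_knots=8), LinearRegression())
print(spline_model.fit(X, y).score(X, y))
# Kernel approximation (Nystroem) + linear regression
nystroem_model = make_pipeline(Nystroem(kernel='rbf', n_components=20, random_state=0), LinearRegression())
print(nystroem_model.fit(X, y).score(X, y))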
Regularization
So far we have seen how to use basic linear models, both on linear problems and on non-linear problems by adding new non-linear features.
Regularization consists in changing or tweaking the way models learn, usually by changing the objective/cost function, in order to keep their complexity from getting too high.
Mathematically, it is often implemented by adding a term to the cost function of the problem. One of the simplest examples of regularization is applied to linear regression, in which case it is called "ridge regression". The classic linear regression cost function is given by the mean (or sum) of the squared errors over the samples i:

cost(beta) = sum_i ( y_i - (beta_0 + beta_1 f_1(i) + ... + beta_N f_N(i)) )²
With regularization, the cost function includes an additional term that consists of the (squared) L2 norm of the vector beta:

cost(beta) = sum_i ( y_i - (beta_0 + beta_1 f_1(i) + ... + beta_N f_N(i)) )² + alpha * ||beta||²

The norm of the coefficient vector is weighted by the alpha hyperparameter, so we can control how much that norm should weigh in the final solution. This way, during the optimization/learning process, the coefficients beta won’t grow arbitrarily large; instead, a good balance between their norm and the errors is found.
This concept can be applied to pretty much any other models, including logistic regression of course.
But let’s go a bit further regarding regularization: just as we saw how important it is to tune the hyperparameters of a pipeline/model, the alpha parameter of the ridge regression should be optimized (and this applies to any regularization parameter).
To do so we can use a GridSearchCV or RandomizedSearchCV as seen in the previous module, but since optimizing the alpha parameter of a ridge regression is so common, sklearn provides a RidgeCV model that takes a list of alpha values to test and selects the best one using cross-validation.
So let’s sum up the 4 approaches to handle regularization for a linear regression:
- No regularization, using LinearRegression()
- Standard, non-optimized ridge regression, using Ridge(), equivalent to alpha=1
- Optimized ridge regression using GridSearchCV or RandomizedSearchCV
- Ridge regression with built-in optimization using RidgeCV (see the sketch right after this list)
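Here is a minimal sketch of the last two approaches (the alpha grid and the data are illustrative choices, not a recommendation):
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import GridSearchCV
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)
alphas = np.logspace(-3, 3, 13)
# Approach 3: optimize alpha with an explicit grid search and cross-validation
grid = GridSearchCV(Ridge(), param_grid={'alpha': alphas}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
# Approach 4: RidgeCV runs the same kind of search internally
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(ridge_cv.alpha_)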
Let’s see visually how regularization influences the coefficients of a linear regression, using a Ridge model. In the following example, we use a polynomial feature expansion (degree 8 in the code below) to do a linear regression, with a regularization term through the Ridge model and its alpha hyperparameter to control the strength of the regularization.
With alpha=0, there is no regularization and we get the classic linear regression results. Since we use high-degree polynomial features to regress a noisy linear relation, the model tends to overfit a bit, and the linear coefficients span a large range (here between about -1500 and +1500).
With alpha=1, we get "mild" regularization. The model overfits much less and the coefficient amplitudes are considerably smaller.
With alpha=100, we get very strong regularization: the coefficients are not allowed to grow much and the model tends to underfit.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
X = 2 * np.random.rand(50, 1)
y = 4 + 3 * X + np.random.randn(50, 1)
# Function to fit and plot Ridge Regression models
def plot_ridge(alpha, ax1, ax2, deg=8):
    model = make_pipeline(PolynomialFeatures(deg, include_bias=False), Ridge(alpha=alpha))
    model.fit(X, y)
    ax1.scatter(X, y, color='blue', s=10, label='Data')
    x_range = np.linspace(0, 2, 100).reshape(-1, 1)
    y_pred = model.predict(x_range)
    ax1.plot(x_range, y_pred, color='red', label=f'Ridge Regression (alpha={alpha})')
    coefs = model.named_steps['ridge'].coef_.ravel()
    ax2.plot(range(deg), coefs, color='green', marker='o', label='Coefficients')
    ax1.set_title(f'Ridge Regression with alpha={alpha} / R^2={model.score(X, y):.2f}')
    ax2.set_title(f"Linear coefficients with alpha={alpha}")
    ax1.legend()
fig, axs = plt.subplots(2, 3, figsize=(18, 6))
plot_ridge(0, axs[0,0], axs[1,0])
plot_ridge(1, axs[0,1], axs[1,1])
plot_ridge(100, axs[0,2], axs[1,2])
fig.tight_layout()

Let’s go a bit further and inspect how the train score and test score evolve with the value of alpha:
from sklearn.model_selection import ValidationCurveDisplay
# Plotting the validation curve
ValidationCurveDisplay.from_estimator(
    make_pipeline(PolynomialFeatures(10, include_bias=False), Ridge()),
    X, y,
    param_name='ridge__alpha',
    param_range=np.logspace(-3, 3),
)

For a degree 10 polynomial, the optimum regularization coefficient alpha seems to be between 0.01 and 1.
Finally, remember that, just like creating new features to handle non-linearity can be applied to pretty much any model, regularization can also be included in most models (including logistic regression, where the C parameter is the inverse of the regularization strength).
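For instance, here is a tiny sketch (illustrative data only) showing that a smaller C in LogisticRegression means stronger regularization, and therefore smaller coefficients:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for C in (100, 1, 0.01):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # Smaller C = stronger regularization = smaller coefficient norm
    print(C, np.linalg.norm(clf.coef_))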
Wrapup
In this 4th post, we saw:
- the most important linear models, namely linear regression and logistic regression
- how we can handle non-linear problems while using such linear models, by creating new features (for example polynomial features)
- how regularization can help control the complexity of models by adding a regularization term to the objective function, so that the linear coefficients cannot grow arbitrarily large
You might like some of my other posts, make sure to check them out: