
Build your own custom scikit-learn Regression

Write a regressor and benefit from all the awesome tools of the scikit-learn ecosystem, such as pipelines, grid search, …

Build your own model

Photo by Osman Rana on Unsplash

If Python is your programming language of choice for Data Science and Machine Learning, you have probably used the awesome scikit-learn library already. I’m a big fan of this project myself due to its consistent API: You define some object such as a regressor, you fit, you predict, done. It cannot get much easier than this.

While scikit-learn has you covered with a whole range of useful algorithms out of the box, at some point this selection will not be enough for you anymore. You might want to define your own transformers, regressors, and classifiers, and you want them to work together with scikit-learn pipelines and hyperparameter optimizations such as grid search. You want it all. And if you are at this point in your career, this article is for you.


SPOILER: Get the complete code here.


Own Regressor Examples

Let us now see how to implement our own regressors that work perfectly as a small gear in the large scikit-learn clockwork. We first start with something simple and turn to more sophisticated things later on.

Mean Regression

Let us implement the easiest regressor available, just to get started quickly and without getting lost in theoretical details. I call this the Mean Regressor, which discards all input features (which I call X) and only outputs the mean of the labels (which I call y). That’s the whole theory.

The learning merely consists of computing the mean of y and storing the result inside the model, the same way the coefficients of a Linear Regression are stored within the model. The most basic scikit-learn-compatible implementation can look like this:

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    def fit(self, X, y):
        self.mean_ = y.mean()
        return self

    def predict(self, X):
        return np.array(X.shape[0]*[self.mean_])

Done. If you input n samples now, the output will be n times the same number, as it is supposed to be. Just try it out via the following snippet (load_boston was removed in scikit-learn 1.2, so substitute any other regression dataset if you are on a newer version):

from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

l = MeanRegressor()
l.fit(X, y)

followed by a

l.predict(X)

which outputs 22.53280632 exactly 506 times, the size of the dataset. Do you know what’s cool? You can even use the score method, although you never defined it! It comes for free because you are using the class BaseEstimator, which is the base of all of your own regressors, classifiers, and transformers, as well as RegressorMixin (this one actually adds the score method!).

The score method computes the _r_² score by default, and if you know a bit about it, you won’t be surprised by the following observation:

print(l.score(X, y))

# Output:
# 0.0

Constant Regression

Let us generalize our model slightly. Instead of always computing the mean, we want to be able to pass a parameter c during model instantiation. The model will then only ever output c. If we don’t provide the parameter, the model should fall back to being the MeanRegressor.

class ConstantRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, c=None):
        self.c = c

    def fit(self, X, y):
        if self.c is None:
            self.const_ = y.mean()
        else:
            self.const_ = self.c
        return self

    def predict(self, X):
        return np.array(X.shape[0]*[self.const_])

We just created a model with a hyperparameter! We can use it via

X, y = load_boston(return_X_y=True)

l = ConstantRegressor(10.)

l.fit(X, y)
l.predict(X)

Again, check that the model really outputs the parameter c that you provide, and also that the score method works. In this case, if c is not None and also not the mean, the _r_² score is negative.

Quick excursion: The _r_² score is just designed that way. A model outputting the mean of the labels for any input feature always has an _r_² score of 0. A perfect model that hits all of the labels exactly has an _r_² score of 1. Usually, you land somewhere in between. And if you have a horrible model, which we just created, you can even get a negative _r_² score.
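If you want to see this behavior in isolation, here is a tiny sketch (the numbers are made up) using r2_score from sklearn.metrics, which is exactly what the score method of RegressorMixin calls under the hood:

import numpy as np
from sklearn.metrics import r2_score

y = np.array([1., 2., 3., 4., 5.])  # some labels with mean 3

print(r2_score(y, np.full(5, y.mean())))  # 0.0 -> predicting the mean
print(r2_score(y, y))                     # 1.0 -> a perfect model
print(r2_score(y, np.full(5, 10.)))       # negative -> a horrible constant model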

Anyway, we have only seen basic functionality so far. Let us try a grid search now since our model has a hyperparameter.

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    estimator=ConstantRegressor(),
    param_grid={
        'c': np.linspace(0, 50, 100)
    },
)

grid.fit(X, y)

It works! You can check the best c according to the standard 5-fold cross-validation via

grid.best_params_

Perfect!
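
The fitted grid search object carries some other useful attributes as well, and since refit=True is the default, the best model is already refitted on the whole dataset:

print(grid.best_params_)  # should be a value of c close to the mean of y
print(grid.best_score_)   # the mean cross-validated r² score for that c

grid.predict(X)           # works, because the best model was refitted on X, y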


Now, before we continue with a more interesting model, let’s polish our code to make it truly scikit-learn-compliant. I skipped just a bit of boilerplate code which does some basic checks and input validations but is not needed for our model to run. The full and proper scikit-learn regressor looks like this (the additions are the three validation helpers from sklearn.utils.validation):

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class ConstantRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, c=None):
        self.c = c

    def fit(self, X, y):
        X, y = check_X_y(X, y)

        if self.c is None:
            self.const_ = y.mean()
        else:
            self.const_ = self.c
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        return np.array(X.shape[0]*[self.const_])

Not as bad as you expected, right? Basically, we have added the following (taken from their respective docstrings):

check_X_y checks X and y for consistent length, enforces X to be 2D and y 1D. By default, X is checked to be non-empty and containing only finite values. Standard input checks are also applied to y, such as checking that y does not have np.nan or np.inf targets. […]

check_array does input validation on an array, list, sparse matrix or similar. By default, the input is checked to be a non-empty 2D array containing only finite values. If the dtype of the array is object, attempt converting to float, raising on failure.

check_is_fitted checks if the estimator is fitted by verifying the presence of fitted attributes (ending with a trailing underscore) and otherwise raises a NotFittedError with the given message. […] (that’s why we called our internal variable self.const_ and not only self.const!)
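
A quick sketch of what this buys us: calling predict on an instance that was never fitted now raises a proper NotFittedError instead of a confusing AttributeError about a missing const_ attribute.

from sklearn.exceptions import NotFittedError

try:
    ConstantRegressor().predict(X)  # predict without fitting first
except NotFittedError as e:
    print(e)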

With this knowledge, we can create a useful regressor now.

Least Absolute Deviation Regression

Something that I still miss in scikit-learn is the Least Absolute Deviation Regression. Simply speaking, it is just a Linear Regression, but instead of minimizing the mean squared error (MSE), we minimize the mean absolute error (MAE).

MAE(y, ŷ) = (1/n) · Σᵢ |yᵢ − ŷᵢ|

Here, y is the actual target, ŷ is the model outcome, and n is the number of samples.

This is useful because the MSE loss punishes large deviations from the true target heavily, which means that outliers can influence the predictions a lot. The MAE loss, however, does not have this property, which makes it a good tool for creating more robust linear models.
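
A tiny sketch to get a feeling for this, using constant predictions again: the constant that minimizes the MSE is the mean of the labels, while the constant that minimizes the MAE is their median, and the median barely reacts to an outlier.

import numpy as np

y = np.array([1., 2., 3., 100.])  # one heavy outlier

print(y.mean())      # 26.5 -> the MSE-optimal constant, dragged towards the outlier
print(np.median(y))  # 2.5  -> the MAE-optimal constant, barely affected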

There are at least two ways to implement it:

  • via Gradient Descent
  • using scipy.optimize (great numerical calculation library)

While I encourage you to compute the gradients of the loss function yourself and implement a simple Gradient Descent in your fit method, we will go for the other option. Our model will essentially be a wrapper around the scipy package that lets us minimize functions. Let’s dive right into the code.

import numpy as np
from scipy.optimize import minimize
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class LADRegression(BaseEstimator, RegressorMixin):
    def fit(self, X, y):
        X, y = check_X_y(X, y)

        d = X.shape[1]

        # loss as a function of the d coefficients (coefs[:-1]) and the intercept (coefs[-1])
        mae_loss = lambda coefs: np.mean(np.abs(y - X@coefs[:-1] - coefs[-1]))

        res = minimize(mae_loss, x0=np.zeros(d + 1))  # the heavy lifting
        self.coef_ = res.x[:-1]
        self.intercept_ = res.x[-1]

        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        return X@self.coef_ + self.intercept_

Let us do a small check to see whether this implementation does the trick. What I usually do is create a small 1- or 2-dimensional dataset that I can plot.
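
The dataset behind the plot is not reproduced here, but a minimal sketch of such a check, with made-up data, could look like the following; the Linear Regression coefficients should drift away from the true slope much more than the LAD ones.

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)

# a small 1-dimensional dataset roughly following y = 2x ...
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + np.random.randn(50)

# ... plus three heavy outliers
y[:3] += 50

lad = LADRegression().fit(X, y)
lr = LinearRegression().fit(X, y)

print(lad.coef_, lad.intercept_)  # should stay close to a slope of 2
print(lr.coef_, lr.intercept_)    # noticeably pulled towards the outliers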

Here, we can see that the LAD Regression (red) yields a better fit since outliers do not affect the prediction as much as a normal Linear Regression (green). The Linear Regression gets pulled upwards by the three outliers at the top.

Looks good! Just as expected.


We have created a regressor that optimizes a different loss function. Feel free to create even more regressors that suit your needs! For example, if you are in a setting where an overestimation is 100 times worse than an underestimation, create a loss function that punishes overestimation much more heavily and plug it in. Do you want to optimize for the mean absolute percentage error (MAPE) instead? Use that as the loss. The possibilities are endless.
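
As a concrete sketch, such custom losses could look like this (the function names and the weighting factor of 100 are just for illustration); inside fit, you would then minimize them over the coefficients in place of mae_loss:

import numpy as np

# an asymmetric loss: overestimation (y_pred > y) is punished 100 times as hard as underestimation
def overestimation_loss(y, y_pred):
    residuals = y - y_pred
    return np.mean(np.where(residuals < 0, 100., 1.) * np.abs(residuals))

# MAPE, the mean absolute percentage error (assumes that no target is zero)
def mape_loss(y, y_pred):
    return np.mean(np.abs((y - y_pred) / y))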

Outlook

The journey does not stop here. Of course, you can also build classifiers in the same fashion. Just use the ClassifierMixin. Or build a transformer (no, not Optimus Prime) such as PCA yourself using the TransformerMixin. You now have all the tools you need. Good luck!
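
As a small teaser, a minimal (hypothetical) transformer that centers every feature around zero could look like this; TransformerMixin then provides fit_transform for free:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class MeanCenterer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = check_array(X)
        self.means_ = X.mean(axis=0)  # note the trailing underscore again
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return X - self.means_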

Conclusion

In this article, we have seen how to build our own scikit-learn regressors. Whenever you cannot find a particular regressor for your special problem, you can just build your own and benefit from the rest of scikit-learn, such as pipelines and grid searches. The same goes for classification, transformations, clustering, and more.


I hope that you learned something new, interesting, and useful today. Thanks for reading!

As the last point, if you

  1. want to support me in writing more about machine learning and
  2. plan to get a Medium subscription anyway,

why not do it via this link? This would help me a lot! 😊

To be transparent, the price for you does not change, but about half of the subscription fees go directly to me.

Thanks a lot, if you consider supporting me!

If you have any questions, write me on LinkedIn!

