Logistic Regression has traditionally been used as a linear classifier, i.e. when the classes can be separated in the feature space by linear boundaries. That can be remedied, however, if we happen to have a better idea as to the shape of the decision boundary…

Logistic Regression is known and used as a linear classifier. It is used to come up with a hyperplane in feature space that separates observations belonging to a class from all the other observations that do not. The decision boundary is thus linear. Robust and efficient implementations are readily available (e.g. scikit-learn) for using logistic regression as a linear classifier.
While logistic regression makes core assumptions about the observations, such as IID (each observation is independent of the others and they all have an identical probability distribution), the use of a linear decision boundary is not one of them. The linear decision boundary is used for reasons of simplicity, following the Zen mantra – when in doubt, simplify. In those cases where we suspect the decision boundary to be nonlinear, it may make sense to formulate logistic regression with a nonlinear model and evaluate how much better we can do. That is what this post is about. Here is the outline. We go through some code snippets here, but the full code for reproducing the results can be downloaded from GitHub.
- Briefly review the formulation of the likelihood function and its maximization. To keep the algebra at bay we stick to 2 classes in a 2-d feature space. A point [x, y] in the feature space can belong to only one of the classes, and the function f(x, y; c) = 0 defines the decision boundary, as shown in Figure 1 below.

- Consider a decision boundary that can be expressed as a polynomial in the feature variables but is linear in the weights/coefficients. This case lends itself to being modeled within the (linear) framework using the API from scikit-learn.
- Consider a generic nonlinear decision boundary that cannot be expressed as a polynomial. Here we are on our own to find the model parameters that maximize the likelihood function. The scipy.optimize module has excellent APIs that help us out here.
1. Logistic Regression
Logistic regression is an exercise in predicting (regressing to, one can say) discrete outcomes from a continuous and/or categorical set of observations. Each observation is independent, and the probability p that an observation belongs to the class is some (and the same!) function of the features describing that observation. Consider a set of n observations [x_i, y_i; Z_i] where x_i, y_i are the feature values for the ith observation. Z_i equals 1 if the ith observation belongs to the class, and equals 0 otherwise. The likelihood of having obtained n such observations is simply the product of the probabilities p(x_i, y_i) of obtaining each one of them separately:

L = Π_i p(x_i, y_i)^Z_i [1 - p(x_i, y_i)]^(1 - Z_i)     (Equation 1)
The ratio p/(1-p) (the odds) would be unity along the decision boundary, as p = 1-p = 0.5. If we define a function f(x, y; c) as:

log[p/(1-p)] = f(x, y; c)     (Equation 2)
then f(x, y; c) = 0 would be the decision boundary. c is the vector of m parameters in the function f, and the log-likelihood in terms of f would be:

log(L) = Σ_i { Z_i f(x_i, y_i; c) - log[1 + exp(f(x_i, y_i; c))] }     (Equation 3)
All that is left to be done now is to find the set of values for the parameters c that maximizes log(L) in Equation 3. We can either apply an optimization technique or solve the coupled set of m nonlinear equations d log(L)/dc = 0 for c:

d log(L)/dc_j = Σ_i [Z_i - p(x_i, y_i)] df(x_i, y_i; c)/dc_j = 0,   j = 1, …, m     (Equation 4)

where, from Equation 2, p(x_i, y_i) = 1/[1 + exp(-f(x_i, y_i; c))].
Either way, having c in hand completes the definition of f and allows us to find out the class of any point in the feature space. That is logistic regression in a nutshell.
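For instance, once c is in hand, classifying a point is just a matter of evaluating f and the probability it implies. A minimal sketch in Python, with a made-up linear f and coefficients purely for illustration:

```python
import numpy as np

def predict(f, x, y, c):
    """Class membership of a point (x, y) given a fitted f(x, y; c)."""
    p = 1.0 / (1.0 + np.exp(-f(x, y, c)))  # invert log(p/(1-p)) = f(x, y; c)
    return int(p > 0.5), p                 # in the class when p > 0.5, i.e. f > 0

# a hypothetical linear boundary f = c_0 + c_1*x + c_2*y with made-up coefficients
f_linear = lambda x, y, c: c[0] + c[1] * x + c[2] * y
print(predict(f_linear, 0.3, 0.7, c=[1.0, -2.0, 0.5]))
```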
2. The function f(x,y; c)
Would any function for f work in Equation 2? Most certainly not. As p goes from 0 to 1, log(p/(1-p)) goes from -inf to +inf. So we need f to be unbounded as well in both directions. The possibilities for f are endless so long as we make sure that f has a range from -inf to +inf.
2.1 A linear form for f(x,y; c)
Choosing a linear form such as

f(x, y; c) = c_0 + c_1 x + c_2 y     (Equation 5)
will certainly work. It leads to traditional logistic regression as available for use in scikit-learn, and it is the reason logistic regression is known as a linear classifier.
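For reference, the linear case needs nothing beyond the stock scikit-learn API; a minimal sketch with random stand-in data (not the post's simulation data) looks like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# stand-in data: two features x, y and a binary label Z from a linear ground truth
rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
Z = (1.0 - 2.0 * X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, Z)
print(clf.intercept_, clf.coef_)   # estimates of c_0 and [c_1, c_2] of Equation 5
```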
2.2 A higher-order polynomial for f(x,y; c)
An easy extension of Equation 5 would be to use a higher-degree polynomial. A second-order one would simply be:

f(x, y; c) = c_0 + c_1 x + c_2 y + c_3 x² + c_4 x y + c_5 y²     (Equation 6)
Note that the above formulation is identical to the linear case if we treat x², xy, and y² as three additional independent features of the problem. The impetus for doing so would be that we could then directly apply the API from scikit-learn to get an estimate for c. We see in the results section, however, that the c obtained this way is not as good as directly solving Equation 4 for c. It is a bit of a mystery as to why. But in any case, solving Equation 4 is easy enough with modules from scipy. The derivatives we need for Equation 4 fall out simply as:

df/dc_0 = 1,  df/dc_1 = x,  df/dc_2 = y,  df/dc_3 = x²,  df/dc_4 = x y,  df/dc_5 = y²
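The augmentation itself is a one-liner. Here is a minimal sketch of that scikit-learn route with stand-in data (sklearn.preprocessing.PolynomialFeatures would build the same columns):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
x = rng.uniform(-1.0, 1.0, 1000)
y = rng.uniform(-1.0, 1.0, 1000)

# treat x^2, x*y and y^2 as three additional, independent features
X_aug = np.column_stack([x, y, x**2, x * y, y**2])

# an illustrative quadratic ground truth for the labels
Z = (1.0 - 3.0 * x**2 - 2.0 * y**2 > 0).astype(int)

clf = LogisticRegression().fit(X_aug, Z)
print(clf.intercept_, clf.coef_)   # estimates of c_0 and c_1..c_5 of Equation 6
```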
2.3 Other generic forms for f(x,y; c)
A periodic form like f(x, y; c) = sin(c_0 x + c_2 y) will not work, as its range is limited. But the following will:

f(x, y; c) = c_0 + c_1 sin(c_2 x²) + c_3 y
We can again evaluate the derivatives and solve for the roots of Equation 4, but sometimes it is simpler to just directly maximize log(L) in Equation 3, and that is what we will do in the simulations to follow.
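For a form like this one, a minimal sketch of that direct maximization might look as follows; the gradient-free Nelder-Mead method sidesteps the derivatives entirely (the data and names here are purely illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def f_generic(c, x, y):
    # the generic form above: f = c_0 + c_1 sin(c_2 x^2) + c_3 y
    return c[0] + c[1] * np.sin(c[2] * x**2) + c[3] * y

def negLL(c, x, y, Z):
    # negative of Equation 3, to be minimized
    f = f_generic(c, x, y)
    return -np.sum(Z * f - np.logaddexp(0.0, f))  # log(1 + e^f), computed stably

# stand-in data labeled with known coefficients, purely for illustration
rng = np.random.RandomState(3)
x, y = rng.uniform(0.0, 1.0, (2, 1000))
Z = (f_generic([1.0, -1.0, 1.0, -1.0], x, y) > 0).astype(float)

res = minimize(negLL, np.full(4, 1.0e-3), args=(x, y, Z), method='Nelder-Mead')
print(res.x)
```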
3. Simulations
We generate an equal number of points in the x, y feature space that belong to the class (Z = 1 where f(x, y; c) > small_value) and that do not belong (Z = 0 where f(x, y; c) < -small_value), which we can do because we know the explicit functional form of f(x, y; c).
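The generation amounts to sampling points and keeping only those that land clearly on one side of the known boundary. A minimal sketch of the idea, not the actual script (the sampling box, seed, and helper name are illustrative):

```python
import numpy as np

def generate_data(f, c, n_per_class, small_value=0.01, box=(0.0, 1.0), seed=0):
    """Sample points in the box and keep those clearly on either side of f = 0."""
    rng = np.random.RandomState(seed)
    in_class, out_class = [], []
    while len(in_class) < n_per_class or len(out_class) < n_per_class:
        x, y = rng.uniform(box[0], box[1], size=2)
        val = f(x, y, c)
        if val > small_value and len(in_class) < n_per_class:
            in_class.append([x, y, 1])       # Z = 1: belongs to the class
        elif val < -small_value and len(out_class) < n_per_class:
            out_class.append([x, y, 0])      # Z = 0: does not belong
    return np.array(in_class + out_class)
```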
The data is split 80/20 for train vs test. We use the API from scikit-learn for the linear case that we want to compare with.
We use the scipy.optimize module either to solve Equation 4 for c or to maximize the likelihood in Equation 3 directly. LL and dLLbydc in the code snippet below implement Equations 3 and 4 respectively. The scipy routines minimize, so we negate the sign in each case, as we want to maximize the likelihood.
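The full code is on GitHub; for the polynomial case of Equation 6, LL and dLLbydc amount to something like the following sketch (f_poly is an illustrative helper name, not necessarily what the repository uses):

```python
import numpy as np

def f_poly(c, x, y):
    # Equation 6: f = c_0 + c_1 x + c_2 y + c_3 x^2 + c_4 x y + c_5 y^2
    return c[0] + c[1]*x + c[2]*y + c[3]*x**2 + c[4]*x*y + c[5]*y**2

def LL(c, x, y, Z):
    # negative of Equation 3 (scipy minimizes, we want to maximize log(L))
    f = f_poly(c, x, y)
    return -np.sum(Z * f - np.logaddexp(0.0, f))   # log(1 + e^f), computed stably

def dLLbydc(c, x, y, Z):
    # negative of Equation 4: -sum_i (Z_i - p_i) * df_i/dc_j for each c_j
    f = f_poly(c, x, y)
    p = 1.0 / (1.0 + np.exp(-f))
    dfdc = np.array([np.ones_like(x), x, y, x**2, x*y, y**2])
    return -dfdc.dot(Z - p)
```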
Finally we solve for c starting with a small initial guess close to 0.
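Continuing the sketch above (f_poly, LL, and dLLbydc as defined there), the solve itself could look like this; the stand-in data here just labels random points with the known c so the snippet runs on its own:

```python
import numpy as np
from scipy.optimize import fsolve, minimize

# stand-in data; in the real script x, y, Z come from the 80% training split
rng = np.random.RandomState(2)
x = rng.uniform(0.0, 1.0, 1000)
y = rng.uniform(0.0, 1.0, 1000)
c_true = [2.25, -3.0, -18.0, 35.0, -64.0, 50.0]
Z = (f_poly(c_true, x, y) > 0).astype(float)

c_init = np.full(6, 1.0e-3)        # small initial guess close to 0

# route 1: maximize log(L) by minimizing its negative (Equation 3), gradient from Equation 4
res = minimize(LL, c_init, args=(x, y, Z), jac=dLLbydc, method='BFGS')

# route 2: solve d log(L)/dc = 0 (Equation 4) for its roots
c_roots = fsolve(dLLbydc, c_init, args=(x, y, Z))

print(res.x / res.x[0])            # scale c_0 to unity, as in Table 1
print(c_roots / c_roots[0])
```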
4. Results
We look at the polynomial and the other generic case separately.
4.1 f(x,y; c) = c_0 + c_1 x + c_2 y + c_3 x² + c_4 x y + c_5 y²
We get an ellipse as the decision boundary for the following c values (these are the values passed to the simulation command below):

c_0 = 2.25, c_1 = -3.0, c_2 = -18.0, c_3 = 35.0, c_4 = -64.0, c_5 = 50.0
We generate 1000 data points for each class based on the above function and apply logistic regression to that data to see what it predicts for the decision boundary.
pipenv run python ./polynomial.py 1000 2.25 -3.0 -18.0 35.0 -64.0 50.0
Figure 2 below shows the data distribution and the decision boundaries obtained by the different approaches. The contour traced by green triangles separates the data perfectly, i.e. it gets an F1 score of 1; it is obtained by solving Equation 4 for c. The red line is obtained by the traditional logistic regression approach – clearly a straight line. It does the best it can given the nonlinearities and gets an F1 score of, well, 0.5… The contour traced by the blue triangles is interesting. It is obtained by applying traditional logistic regression on the augmented feature space, that is, with "x²", "x*y", and "y²" treated as three additional features, thereby linearizing f for use with scikit-learn. It gets an F1 score of 0.89, which is not bad. But it is not entirely clear why it should do worse than solving d log(L)/dc = 0.

The actual values for the coefficients are shown in Table 1 below. c_0 is scaled to unity in each case so we can compare. Clearly, the coefficients obtained by solving d log(L)/dc = 0 are closest to the actual values used in generating the data.

4.2 f(x,y; c) = c_0 + c_1 sin(c_2 x²) + c_3 y = 0
Figure 3 shows the data distribution and the predicted boundaries upon running the simulation with
pipenv run python ./generic.py 1000 1.0 -1.0 1.0 -1.0
The straight line from traditional logistic regression obtains an F1 score of 0.675, whereas directly maximizing log(L) gets a perfect score.

Table 2 below compares the values for the coefficients obtained in the simulation. It is interesting to note that the signs of c_1 and c_2 are opposite between the actual values and those predicted by the optimization. That is fine, because sin(-k) = -sin(k), so the term c_1 sin(c_2 x²) is unchanged when both signs flip.

5. Conclusions
Logistic regression has traditionally been used to come up with a hyperplane that separates the feature space into classes. But if we suspect that the decision boundary is nonlinear, we may get better results by attempting some nonlinear functional forms for the logit function. Solving for the model parameters can be more challenging, but the optimization modules in scipy can help.
Originally published at xplordat.com on March 13, 2019.