
Polynomial Regression – An Alternative to Neural Networks?

A discussion of polynomials and neural networks, which theoretically are both able to approximate continuous functions infinitely closely.

Example of polynomial regression performed on a data set, using three different degrees of polynomials. Theoretically, polynomials can achieve infinitely close fits. [image from WikiMedia by Shobha]

To start: this is more a discussion piece than anything else. Although I will attempt to compare polynomials and neural networks, there is not a great deal of mathematical rigor or literature support to back up this article. It is not a tutorial either: no Python implementation or numerical experiments (at least not by myself) this time. Finally, I’m also not trying to convince you to ditch neural networks and embrace polynomial regression.

With that out of the way, let me say where I do want to go with this article. Pretty much everyone learned to work with polynomials at some point in life, and their ability to fit just about any function was proven long ago. Combining those two observations should make for interesting use cases, yet the attention polynomial regression gets in data science is rather underwhelming. If this article succeeds in provoking some thought and restoring a bit of attention to the age-old art of polynomial design, that would be more than enough.

Neural networks

Let’s start with neural networks and why we like them. From a mathematical perspective, the great appeal of neural networks is that they can approximate any continuous function infinitely closely. Even a simple, single-layered network can – in principle – accomplish this task. Yes, there are strict theoretical requirements and you may never achieve that perfect fit in practice, but it is a reassuring property nonetheless.

[To provide some background on Universal Approximation Theorems: George Cybenko (1989) proved that a network with a single hidden layer and sigmoid activation functions can approximate any continuous function infinitely closely. Hornik et al. (1989) showed that the result also holds for other activation functions and multi-layer networks. Ever since, the theorems have been extended in other directions, e.g., arbitrary layer widths and depths, various activation functions, non-continuous functions, etc. Again, these results do not provide any practical guarantees, as the conditions may be unrealistic.]

Example of a neural network. The multiple layers often capture specific features of the data set. [Image by OpenClipArt on FreeSVG]

I know this was not the most elaborate description of neural networks, but we have plenty more material to cover. Let’s move on.

Polynomials

Another class of functions capable of fitting pretty much any data set seems nearly forgotten – in the data science community, at least – even though it has approximation properties similar to those of neural networks (Stone–Weierstrass theorem). As you may have surmised from the title, that is the class of polynomials.

You probably encountered polynomials in high school. Depending on your direction afterwards, you may have studied them extensively or not at all. At any rate, they do not seem overly popular in data science nowadays.

To provide a brief refresher, a polynomial is defined as:

"an expression consisting of indeterminates (also called variables) and coefficients, that involves only the operations of addition, subtraction, multiplication, and non-negative integer exponentiation of variables." – Wikipedia

An example of a polynomial could be something like:

f(x) = (x − 3)(x − 2)(x − 1)(x)(x + 1)(x + 2)(x + 3)

The corresponding plot looks as follows:

Example of a 7-degree polynomial. A high-degree polynomial can fit very complex patterns. [Image from Wikimedia by Melikamp]

Observe that the polynomial may be written out by multiplying the terms. In fact, any polynomial can be rewritten in canonical form, i.e., as a weighted sum of powers of x:

f(x) = w_nxⁿ + w_{n-1}xⁿ⁻¹ + … + w_1x + w_0

For the example above, expanding the factors gives f(x) = x⁷ − 14x⁵ + 49x³ − 36x.

Note that, although the features xᵏ with k > 1 are non-linear, the statistical estimation problem itself is linear (a weighted linear combination of features) and can be solved with basic linear regression techniques. This is a powerful property that makes life much easier.
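
As a minimal illustration of that linearity (on made-up data, so purely a sketch): fitting w_2x² + w_1x + w_0 is just ordinary least squares on the transformed feature matrix [1, x, x²].

import numpy as np

# synthetic data following a quadratic with a bit of noise
rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 30)
y = 0.5 * x**2 - x + 1 + rng.normal(scale=0.1, size=x.size)

X = np.column_stack([np.ones_like(x), x, x**2])  # features: 1, x, x^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)        # basic linear regression
print(w)                                         # roughly [1.0, -1.0, 0.5]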

In data science, we typically deal with multiple features (i.e., more than one explanatory variable, say x_1, x_2, x_3), translating to a multi-variate polynomial. For instance, a term like this might represent a feature:

x_1x_2x_3²

Or if you prefer something more descriptive:

x_years_experience * x_salary_last_year * x_num_past_jobs²

And that is just one out of many possible features. As you can imagine, polynomials can quickly get rather messy and complex. Two aspects determine the complexity of a polynomial:

  • The number of variables. The expression 2x_1²x_2⁴ - 3x_1x_2² + x_1 + 8 contains two explanatory variables, x_1 and x_2. A polynomial with two variables can still be visualized in a 3D plot; at higher dimensions our human brains tend to fall short.
  • The degree of the polynomial. This is the highest total power appearing in the standard-form expression (exponents are summed when variables are multiplied). For instance, in the 2x_1²x_2⁴ - 3x_1x_2² + x_1 + 8 example, the degree would be 6 (i.e., 2 + 4).

At this point you may see a potential problem arising. A polynomial of degree 5 with one variable (e.g., w_5x⁵ + w_4x⁴ + w_3x³ + w_2x² + w_1x + w_0) seems doable, as does a degree 1 polynomial with five variables (e.g., w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 + w_0).

However, if we increase both the degree and the number of variables, the number of features may explode. When we move to contraptions such as x_1⁴x_2x_3³x_4², we realize there are a lot of high-degree multi-variable features we might design to capture complex interactions. Up front, we don’t know which features are relevant – no matter how much domain knowledge we have. Mentally we can conceptualize basic interactions and quadratic effects, but not much beyond that.
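
To get a feel for this explosion: the number of monomial features (including the constant term) in n variables up to total degree d is C(n + d, d). A quick illustrative check shows how fast this grows.

from math import comb

# count of candidate polynomial features in n variables up to total degree d
for n in (3, 5, 10):
    for d in (2, 3, 5):
        print(f"{n} variables, degree {d}: {comb(n + d, d)} candidate features")

With ten variables and degree 5, that is already 3,003 candidate features.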

In summary, we have two problems to deal with: (i) which features to include and (ii) up to which degree we should go to capture relevant effects. Before moving on to these challenges, let’s circle back to neural networks for a moment.

The link between polynomials and neural networks

We rarely bother to explicitly write out neural networks in mathematical form, but we could. In essence, each hidden node takes as input a weighted combination of the preceding layer, transforms it according to some activation function (e.g., a sigmoid or ReLU), and passes on the output to the next layer.

Just as a higher-degree polynomial takes some initial explanatory variables x_1, x_2, x_3 and embeds them into complex features like x_1²x_2x_3⁴, a neural network also transforms its input and implicitly extracts features. You see where this is going. According to researchers at the University of California and Stanford (Cheng et al., 2019), a neural network is actually a special class of polynomials. Each layer adds to the complexity of the polynomial.

"If the activation function is any polynomial, or is implemented by one, an NN exactly performs polynomial regression.

Moreover, the degree of the polynomial will increase from layer to layer." – Cheng et al., 2019

This is a captivating result. It turns out we are not really discussing opposing solution methods. If anything, a neural network adds a specific angle and conceptualization to the overarching solution class of polynomial regression.
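
To make the claim a bit more concrete, here is a toy sketch (random, made-up weights; a squaring function standing in for the activation) in which a one-hidden-layer network is, by construction, an exact degree-2 polynomial of its inputs:

import numpy as np

# toy network: one hidden layer with a polynomial (squaring) activation
rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)  # input -> hidden
W2, b2 = rng.normal(size=4), rng.normal()             # hidden -> output

def network(x):
    hidden = (W1 @ x + b1) ** 2  # each hidden unit is a quadratic in x_1, x_2
    return W2 @ hidden + b2      # the output is therefore a degree-2 polynomial

print(network(np.array([0.5, -1.0])))

Stacking more such layers would keep raising the degree, which is exactly the point of the quote above.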

To avoid overcomplicating matters: in the remainder I will continue using the term polynomials for the typical polynomial regression and the term neural networks for, well, neural networks.

[Note on the paper by Cheng et al.: only plain artificial neural networks are discussed; special flavors like recurrent and convolutional networks are not included. It is also worth pointing out that the paper provides no rigorous mathematical proof. Lastly, the manuscript has not been peer-reviewed or published as of yet.]

Polynomial regression – Feature selection and data fitting

Back to the polynomials. We now know what they look like, but how do we use them in a data science context? To address this question, let’s assume we want to use a given data set to construct a model that both fits the training data and generalizes well to unseen data.

Although there are tons of ways to combine variables into high-degree terms, creating the features themselves is not overly complicated these days. Without going into much depth here: something like sklearn.preprocessing.PolynomialFeatures can generate all features up to a certain degree. For instance, with two variables and degree 2 we would generate [1, x_1, x_2, x_1², x_1x_2, x_2²].
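
A minimal sketch of what that looks like in scikit-learn, using a single made-up sample:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with two variables: x_1 = 2, x_2 = 3

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())  # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2'] (scikit-learn >= 1.0)
print(X_poly)                        # [[1. 2. 3. 4. 6. 9.]]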

Fitting a polynomial is also not challenging in essence. For instance, the numpy.polynomial package contains a number of common polynomial families (Chebyshev, Hermite, Legendre, etc.), and fitting is accomplished within a single line of code:

output = Polynomial.fit(x_data, y_data, deg=3)
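
To put that single line in a self-contained context, here is a minimal sketch on made-up noisy cubic data (the .convert() call maps the fitted coefficients back to the standard power basis):

import numpy as np
from numpy.polynomial import Polynomial

# synthetic, noisy cubic data
rng = np.random.default_rng(0)
x_data = np.linspace(-3, 3, 50)
y_data = x_data**3 - 2 * x_data + rng.normal(scale=1.0, size=x_data.size)

output = Polynomial.fit(x_data, y_data, deg=3)
print(output.convert().coef)  # coefficients from degree 0 upward, roughly [0, -2, 0, 1]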

The true challenge is in finding the right subset of features. Perhaps x_1⁷x_2³x_3² proves a profound predictor, or maybe x_1⁵x_2⁴x_3² does. Even for more reasonable degrees, there are often so many possibilities that an exhaustive search over all models (i.e., combinations of features) is infeasible.

Some techniques may be helpful in dealing with large sets of features, for example:

  • Principal Component Analysis (PCA) to reduce the dimensionality of the feature set prior to designing polynomials;
  • Chi-square tests to assess the significance of features in predicting the outcome;
  • Decision trees (or random forests) to select successful features;
  • Elastic net regression (combining LASSO and ridge) to weed out features with low significance, as sketched below.
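
A rough sketch of that last option, on synthetically generated data (the data and all parameters here are purely illustrative): polynomial features are generated up to degree 3, and the elastic net shrinks the coefficients of irrelevant ones towards zero.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import ElasticNetCV

# synthetic data: 200 samples, 3 variables; only one interaction and one quadratic term matter
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] * X[:, 1] + X[:, 2] ** 2 + rng.normal(scale=0.1, size=200)

model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    ElasticNetCV(l1_ratio=0.5, cv=5),
)
model.fit(X, y)
print(model[-1].coef_)  # most coefficients end up at or near zero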

I admit: not necessarily easy. However, keep in mind that serious efforts in neural network training also require feature selection, architecture selection and hyperparameter tuning. Neither solution method will instantly yield a perfect model.

Another challenge lies in overfitting. A 20-degree polynomial likely captures a lot of peculiar effects, outliers and all. We don’t necessarily want a perfect fit to the training data, though; we want the model to work well on data it has not seen before. The aforementioned elastic net helps, as does avoiding high-degree polynomials. Neural networks are also prone to overfitting, but polynomial regression has an extra safeguard: since we are still performing ordinary least squares (OLS) regression, we can compute the variance inflation factor, which warns us when we are about to introduce multicollinearity into our model.
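
For completeness, a small sketch of that check using statsmodels (the feature matrix is made up, and the threshold quoted in the comment is only a rule of thumb):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# hypothetical polynomial features of a single variable: x, x^2, x^3
rng = np.random.default_rng(7)
x = rng.normal(loc=5.0, scale=1.0, size=200)
features = sm.add_constant(pd.DataFrame({"x": x, "x_sq": x**2, "x_cube": x**3}))

vif = pd.Series(
    [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    index=features.columns,
)
print(vif.drop("const"))  # values far above ~5-10 flag multicollinearity among the features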

In summary: we have all the necessary tools to fit a complicated polynomial to a complicated dataset. That’s not to say polynomials are the magic bullet of data science, nor that fitting an appropriate polynomial will be a breeze, and it certainly doesn’t mean we will achieve the perfect fit that is attainable in theory. But it’s not an incredibly bleak outlook either.

(Dis)advantages of polynomials and neural networks

One of the greatest benefits of neural networks is that they effectively capture hierarchical structures. [Photo by Eugene Tkachenko on Unsplash]

The origins of polynomials date back centuries; by now, they are very well studied and understood. Obviously, they do have some drawbacks – if not, nobody would have bothered studying new techniques in the first place. Neural networks have earned a rightful place in the field.

Perhaps the greatest appeal of neural networks lies in their ability to capture hierarchical effects. This is one of the reasons they work so well on computer vision, with specific layers dedicated to large shapes, small shapes, colors, etc. Due to the hierarchical architecture, they are often able to generalize well, introducing the inductive bias required to handle new data.

Also, neural networks are convenient to work with in many ways. Yes, you have to play around with the number of layers and nodes, learning rates, and additional hyperparameters when networks get more advanced. Finding the appropriate architecture takes time and effort.

In terms of feature design, however, a lot of work is taken out of your hands. Basically, you feed in a number of explanatory variables, and the network figures out the relevant interactions and higher-order effects by itself. With all the libraries and sophisticated gradient descent algorithms available, tailoring a neural network is quite doable.

In contrast, for polynomial regression the training itself is straightforward (basic regression). The challenge lies in selecting appropriate features and combining them into a sensible model. A small aid here is the aforementioned multicollinearity warning, which makes overfitting comparatively easier to avoid than it is for neural networks. Outlier detection is a bit of a problem, though: in linear regression outliers are easily discerned, but here they might be mistaken for part of some odd non-linear pattern.

With the aid of libraries, both solution methods are pretty manageable nowadays. You don’t need to be a mathematician to fit a polynomial, and you don’t need to be a machine learning engineer to train a neural network.

Some results (from Cheng et al.)

Ultimately, the relevant question is ‘which one works better?’. Hopefully, you did not expect to find a conclusive answer here.

Still, it is noteworthy that Cheng et al. (2019) compared the performance of both on multiple data sets (both classification and regression tasks). Interestingly, they found that polynomial regression performed as well as or even better than neural networks. In addition, their experiments showed that polynomials rarely need to go beyond the second or third degree. As such, they are also fairly explainable.

"For this and other reasons, e.g. large fitted values at the edges of the data, many authors recommend not using polynomial models of degree higher than 2 or 3, and in fact in our empirical experiments in this paper, we have seldom found it necessary to use a higher degree." – Cheng et al., 2019

Intuitively this makes sense. Interaction effects tend to drastically weaken for higher orders. Multiplying two or three variables yields relevant insights, as does capturing linear and quadratic effects. But do we really encounter meaningful 20th power effects or interactions between a dozen variables all that often in real life? If anything, the results imply that neural networks are often needlessly elaborate for the task at hand, leading to the notorious overfitting effect.

"Most importantly, we have shown that PR should be an effective alternative to NNs. We performed experiments with a variety of data of various types, and in all cases, PR performed either similarly to, or substantially better than, NNs – without the NN troubles of trying to find good combinations of tuning parameters.

The fact that in some cases PR actually outperformed NNs reflects the fact that NNs are often overparameterized, in essence fitting a higher-degree polynomial than they should." – Cheng et al., 2019

Even though a variety of datasets was tested, it would be a stretch to claim that polynomial regression generally outperforms neural networks. Surely, especially with sufficient finetuning, it is easy enough to come up with counterexamples in which neural networks outperform polynomial regression. I also doubt whether polynomials would be equally successful at completing a game of Super Mario or recognizing faces (i.e., the tasks tackled by more specialized neural networks).

For typical data science tasks – estimating salaries, predicting travel times, classifying songs by genre – it definitely appears that polynomials are a more than reasonable alternative to neural networks. Linear regression often falls short due to oversimplification of reality, yet deep neural networks might look for highly complex patterns that are simply not there.

Closing words

As mentioned at the start of the article, this is more of a discussion piece than anything else. It is not intended to be complete or rigorous, nor does it advocate for polynomial regression to displace neural networks.

Nonetheless, it is definitely interesting to see how competitive polynomial regression can be with neural networks. The advances made on the latter are intriguing, yet most data science tasks are not on the level of the average DeepMind challenge.

Although deep learning strongly appeals to many people, it is questionable whether it is always the most appropriate approach for the job. Polynomials offer very rich representations of reality; even low degrees and interaction effects capture the bulk of real-world effects. In that form, they are relatively compact and comprehensible, avoiding the ‘black box’ effect that neural networks tend to suffer from.

Modern toolkits take away many practical problems experienced with polynomial regression in the past. Theoretically, polynomials can achieve the same ‘perfect’ fit as neural networks, while requiring little more knowledge than basic linear regression.

All in all, perhaps it is time to give polynomials a second chance.


References

Cheng, X., Khomtchouk, B., Matloff, N., & Mohanty, P. (2019). Polynomial Regression As an Alternative to Neural Nets. ArXiv preprint. https://arxiv.org/abs/1806.06850v2

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.

Stone, M. H. (1948). The generalized Weierstrass approximation theorem. Mathematics Magazine, 21(5), 237–254.

Weierstrass, K. (1885). Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen Veränderlichen. Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften zu Berlin, 2, 633–639.

Polynomial – Wikipedia

Polynomial regression – Wikipedia

Variance inflation factor – Wikipedia

Universal approximation theorem – Wikipedia

sklearn.preprocessing.PolynomialFeatures

Convenience Classes – NumPy v1.22 Manual

