The world’s leading publication for data science, AI, and ML professionals.

Selecting the Correct Predictive Modeling Technique

Using statistics, probability, and data mining to predict future outcomes.

What is Predictive Modeling?

Predictive modeling is the process of taking known results and developing a model that can predict values for new occurrences. It uses historical data to predict future events. There are many different types of predictive modeling techniques, including ANOVA, linear regression (ordinary least squares), logistic regression, ridge regression, time series, decision trees, neural networks, and many more. Selecting the correct predictive modeling technique at the start of your project can save a lot of time. Choosing the wrong technique can result in inaccurate predictions and residual plots that exhibit non-constant variance and/or a non-zero mean.

Regression Analysis

Regression analysis is used to predict a continuous target variable from one or multiple independent variables. Typically, regression analysis is used with naturally-occurring variables, rather than variables that have been manipulated through experimentation. As stated above, there are many different types of regression, so once we’ve decided regression analysis should be used, how do we choose which regression technique should be applied?

ANOVA

A scatterplot of data that may be best modeled by an ANOVA model looks like this:

ANOVA, or analysis of variance, is used when the target variable is continuous and the independent variables are categorical. The null hypothesis in this analysis is that there is no significant difference between the group means. The populations should be normally distributed, the sample cases should be independent of each other, and the variance should be approximately equal across the groups.
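As a minimal sketch, a one-way ANOVA can be run in Python with SciPy. The three groups below are invented illustrative data (a categorical grouping variable with a continuous measurement), not taken from any real study:

```python
from scipy import stats

# Hypothetical continuous measurements for three categorical groups
# (all values are invented for illustration)
group_a = [20.1, 21.3, 19.8, 22.0, 20.5]
group_b = [24.2, 25.1, 23.8, 24.9, 25.5]
group_c = [19.9, 20.4, 21.1, 20.0, 19.5]

# One-way ANOVA: the null hypothesis is that all group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A small p-value here would lead us to reject the null hypothesis that the group means are equal.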

Linear Regression

Linear regression is to be used when the target variable is continuous, the independent variable(s) are continuous or a mixture of continuous and categorical, and the relationship between the independent variables and the target is linear. Furthermore, the residuals should be normally distributed with constant variance, and the predictor variables should demonstrate little to no multicollinearity or autocorrelation with one another.
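A minimal sketch with scikit-learn, fitting ordinary least squares to synthetic data generated from a known linear relationship (y = 3x + 5 plus noise), so we can check that the fitted coefficients recover it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship: y = 3x + 5 + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                 # continuous predictor
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, size=100)  # linear target + noise

model = LinearRegression().fit(X, y)
# The fitted slope and intercept should land close to 3 and 5
print(model.coef_[0], model.intercept_)
```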

Logistic Regression

https://www.researchgate.net/figure/Linear-Probability-Versus-Logistic-Regression-6_fig2_224127022

Logistic regression does not require a linear relationship between the target and the independent variable(s). The target variable is binary (assumes a value of either 0 or 1) or dichotomous. The errors/residuals of a logistic regression need not be normally distributed and the variance of the residuals does not need to be constant. However, the observations must be independent of each other, there must be little to no multicollinearity or autocorrelation in the data, and the sample size should be large. Lastly, while this analysis does not require the independent variables and the target to be linearly related, the independent variables must be linearly related to the log odds.

If the scatter plot between the independent variable(s) and the dependent variable looks like the plot above, a logistic model might be the best model to represent that data.
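A minimal sketch with scikit-learn, using synthetic data whose log odds are, by construction, linear in the predictor (as the assumption above requires):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary target whose log odds are linear in x
rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(200, 1))
p = 1 / (1 + np.exp(-2.0 * X[:, 0]))         # true probability of class 1
y = (rng.uniform(size=200) < p).astype(int)  # binary (0/1) outcome

clf = LogisticRegression().fit(X, y)
# The fitted coefficient lives on the log-odds scale and should be positive
print(clf.coef_[0][0], clf.score(X, y))
```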

Ridge Regression

For variables that experience high multicollinearity, such as X1 and X2 in this case, a ridge regression may be the best choice in order to normalize the variance of the residuals with an error term.

Ridge regression is a technique for analyzing multiple regression data that experience multicollinearity. Ridge regression takes the ordinary least squares approach but, acknowledging that the estimates have high variance, adds a degree of bias to the regression estimates in order to reduce the standard errors. The assumptions follow those of multiple regression: the relationships must be linear, there must be constant variance with no outliers, and the observations must be independent.
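A minimal sketch of this trade-off with scikit-learn: two nearly identical synthetic predictors create severe multicollinearity, so OLS coefficient estimates become unstable, while ridge (the `alpha` parameter sets the amount of shrinkage) keeps them small and stable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly identical predictors -> severe multicollinearity
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the added bias/shrinkage

# Ridge shrinks the coefficient vector relative to ordinary least squares,
# trading a little bias for much smaller standard errors
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```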

Time Series

https://simplystatistics.org/2016/05/05/timeseries-biomedical/

Time-series regression analysis is a method for predicting future responses based on response history. The data for a time series should be a set of observations on the values that a variable takes at different points in time. The data is bivariate and the independent variable is time. The series must be stationary, meaning its mean and variance are constant over long periods of time. Furthermore, the residuals should be normally distributed with a constant mean and variance over time, as well as uncorrelated. The series should not contain any outliers. If random shocks are present, they should indeed be randomly distributed with a mean of 0 and a constant variance.
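As a minimal sketch of these ideas (not the article's own analysis), we can simulate a stationary AR(1) process in NumPy, whose shocks have mean 0 and constant variance as described above, then recover the autoregressive coefficient and use it to forecast:

```python
import numpy as np

# Simulate a stationary AR(1) series: y_t = 0.6 * y_{t-1} + shock
rng = np.random.default_rng(3)
n, phi = 500, 0.6
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()  # shocks: mean 0, constant variance

# Estimate phi by regressing each value on the previous one (least squares)
phi_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])

# One-step-ahead forecast from the last observed value
forecast = phi_hat * y[-1]
print(phi_hat, forecast)
```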

Classification Analysis

Decision Trees

https://hackernoon.com/what-is-a-decision-tree-in-machine-learning-15ce51dc445d

Decision trees are a type of supervised learning algorithm that repeatedly splits the sample based on certain questions about the sample. They are very useful for classification problems, relatively easy to understand, and very effective. Decision trees represent several decisions followed by different chances of occurrence. This technique helps us identify the most significant variables and the relationships between two or more variables.
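A minimal sketch with scikit-learn on the built-in Iris dataset; the `feature_importances_` attribute illustrates how a tree surfaces the most significant variables:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A shallow tree keeps the sequence of splits easy to read
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.score(X, y))            # accuracy on the training data

# Relative importance of each feature in the tree's splits
print(tree.feature_importances_)
```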

Neural Networks

Neural networks help to cluster and classify data. These algorithms are modeled loosely after the human brain and are designed to recognize patterns. Neural networks tend to be very complex and hard to interpret, so while this type of analysis can be very useful, if you are trying to determine why something happened, this may not be the best model to use.

http://www.asimovinstitute.org/neural-network-zoo/
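A minimal sketch of a small feed-forward network with scikit-learn's MLPClassifier, again on the built-in Iris dataset (the layer size and iteration count here are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling the inputs first helps the network converge
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
net.fit(X, y)
print(net.score(X, y))  # accuracy on the training data
```

Note that unlike the decision tree above, the fitted weights give little insight into why a particular prediction was made.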

In conclusion, these are just a handful of the different predictive techniques that can be used to model data. It should be noted that drawing causal conclusions between variables when using predictive analysis techniques is very dangerous. We cannot state that one variable caused another in predictive analysis; rather, we can state that a variable is associated with another and describe the size and direction of that association.

Let’s connect:

https://www.linkedin.com/in/mackenzie-mitchell-635378101/

mackenziemitchell6 – Overview

Resources:

https://www.statisticssolutions.com/manova-analysis-anova/

https://dss.princeton.edu/online_help/analysis/regression_intro.htm

https://www.statisticssolutions.com/assumptions-of-logistic-regression/

https://www.microstrategy.com/us/resources/introductory-guides/predictive-modeling-the-only-guide-you-need

https://skymind.ai/wiki/neural-network

https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf

https://www.analyticsvidhya.com/blog/2015/01/decision-tree-simplified/2/

