The Data Scientist’s Guide to Selecting Machine Learning Predictive Models in Python

Learn how to select the best model for your ML project.

Nick Minaie, PhD
Towards Data Science


This article provides a quick overview of some of the predictive machine learning models available in Python and serves as a guideline for selecting the right model for a data science problem.

In recent years, with advances in computing power, predictive modeling has gone through a revolution. We can now run thousands of models at multi-GHz speeds on multiple cores, making predictive modeling more efficient and more affordable than ever. Virtual machines, such as those provided by Amazon Web Services (AWS), give us access to practically unlimited computing power (at a cost, of course!).

However, one of the fundamental questions every data scientist faces is:

Which predictive model is most appropriate for the problem at hand?

Answering this question comes down to a more fundamental question at the heart of every machine learning problem:

What does the target you are trying to predict look like?

If you are trying to predict a continuous target, then you will need a regression model.

But if you are trying to predict a discrete target, then you will need a classification model.

Regression Models in Python

Regression Modeling — https://en.wikipedia.org/wiki/Regression_analysis
  • Linear Regression: When you are predicting a continuous target that can vary between -∞ and +∞ (such as temperature), the best starting point is a linear regression model. Depending on how many predictors (aka features) you have, you may use Simple Linear Regression (SLR) or Multiple Linear Regression (MLR). Both use the same class in Python, sklearn.linear_model.LinearRegression(); see the scikit-learn documentation for details, and the usage sketch after this list.
  • Gamma Regression: When the target is strictly positive (ranging from 0 to +∞), then in addition to linear regression, a Generalized Linear Model (GLM) with a Gamma distribution can be used for prediction. Details on GLMs can be found in the scikit-learn User Guide.
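As a minimal usage sketch of both regression models, here is an example on synthetic data generated purely for illustration (note that GammaRegressor is available in scikit-learn 0.23 and later):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, GammaRegressor
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 500 samples, 3 features
rng = np.random.RandomState(42)
X = rng.normal(size=(500, 3))

# A continuous target in (-inf, +inf) suits linear regression
y_continuous = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_continuous, random_state=42)
linreg = LinearRegression().fit(X_train, y_train)
print("Linear regression R^2:", linreg.score(X_test, y_test))

# A strictly positive target suits a GLM with a Gamma distribution.
# GammaRegressor uses a log link by default.
y_positive = np.exp(0.3 * X[:, 0]) * rng.gamma(shape=2.0, size=500)
gamma = GammaRegressor().fit(X, y_positive)
print("Gamma regression D^2:", gamma.score(X, y_positive))
```

Note that GammaRegressor.score reports D^2, the fraction of deviance explained, rather than R^2.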

Classification Models in Python

Python offers many classification models. In this section we review some of the most widely used models in the scikit-learn library.

  • Logistic Regression (LogReg): This model is used when predicting a binary target, and it extends to multi-class targets as well. Unlike K-Nearest Neighbors (kNN), this model works well in linearly separable cases. Scikit-learn offers it in its linear model library as sklearn.linear_model.LogisticRegression(); see the documentation for details.
  • KNN (or K-Nearest Neighbors) is a non-parametric model, whereas logistic regression is a parametric model. Generally speaking, kNN is less efficient than a LogReg model, but it supports non-linear decision boundaries. As the name suggests, this model classifies a sample based on the classes of its nearest neighbors; see the documentation for sklearn.neighbors.KNeighborsClassifier, and the sketch after this list. It should be noted that sklearn also offers a KNeighborsRegressor, which this article does not cover.
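A quick sketch comparing the two classifiers, using synthetic two-class data invented here for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data for illustration only
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Parametric model with a linear decision boundary
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Non-parametric: classifies by majority vote of the k nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("LogReg accuracy:", logreg.score(X_test, y_test))
print("kNN accuracy:", knn.score(X_test, y_test))
```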

Confusion Matrix for Classification Problems

My article “Evaluating Machine Learning Classification Problems in Python: 5+1 Metrics That Matter” provides an overview of classification performance metrics for these models, including the definition of the confusion matrix and the metrics derived from it.

Structure of a Binary Classification Confusion Matrix https://medium.com/@minaienick/evaluating-machine-learning-classification-problems-in-python-5-1-metrics-that-matter-792c6faddf5
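As a small sketch, scikit-learn computes this matrix with sklearn.metrics.confusion_matrix (the label arrays below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for a binary problem
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```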

Advanced Classifier/Regressor Models

Many algorithms are offered by Python libraries such as scikit-learn and XGBoost, among others. Some of these algorithms offer both classifiers and regressors, and provide many parameters for customization.

  • Decision Trees: Decision trees offer highly customizable models, and also serve as the foundation for more optimized models such as Random Forest or Gradient Boosting. See the documentation for sklearn.tree.DecisionTreeClassifier. Decision trees are non-parametric supervised-learning models, and are thus capable of handling outliers and interdependent variables. However, they can easily overfit to the training dataset, and the user should be cautious about this issue.
Visualization of a Decision Tree — https://scikit-learn.org/stable/modules/tree.html
Bagging Approach — “Data Mining: Accuracy and Error Measures for Classification and Prediction”, Paola Galdi and Roberto Tagliaferri
  • Random Forest: A random forest is like a bagging model, but with a difference. According to the sklearn documentation for sklearn.ensemble.RandomForestClassifier: “A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement.” This type of model has many advantages, including high learning speed, scalability, and its non-iterative nature (it always converges). Another important advantage is that it can deal with imbalanced cases, leveraging bootstrapping to handle them. However, the model can take up a lot of memory and can overfit to the training dataset.
General Scheme of a Random Forest Model — https://www.mql5.com/en/articles/3856
  • Voting Models: Voting models can be used to package multiple models into one. The sklearn documentation calls this a “Soft Voting/Majority Rule classifier for unfitted estimators.” In this model, a weight can be assigned to each constituent model, so that weaker models are discounted in the final vote. This is similar to bagging, but across different models and with different weights (bagging works with a single base model and averages its predictions). See sklearn.ensemble.VotingClassifier for more details on this model.
  • Boosting Models: In boosting models, each tree gets an importance weight based on its accuracy. The more accurate models receive a higher weight and thus contribute more to the final prediction, while highly inaccurate models are penalized with a negative weight, which means their predictions are reversed in the final prediction. There are multiple boosting models, but the noteworthy ones in scikit-learn are sklearn.ensemble.GradientBoostingClassifier and sklearn.ensemble.AdaBoostClassifier. A usage sketch of these ensemble models follows this list.
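To make the comparison concrete, here is a sketch that fits the ensemble models discussed above on synthetic data; the parameters are illustrative defaults, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "grad_boosting": GradientBoostingClassifier(random_state=42),
    "ada_boost": AdaBoostClassifier(random_state=42),
    # Soft voting averages predicted probabilities across heterogeneous
    # models, with optional per-model weights
    "voting": VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(random_state=42))],
        voting="soft", weights=[1, 2]),
}

# 5-fold cross-validated accuracy for each ensemble
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```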

Scikit-Learn Algorithm Cheat-Sheet

Scikit-learn has developed a flowchart for selecting the right model for a machine-learning problem based on the characteristics of the samples, the features (or predictors), and the target. This interactive cheat-sheet can be found at the link below.

Scikit-learn Algorithm Cheat-Sheet (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

Final Thoughts…

This article barely scratches the surface when it comes to machine-learning predictive models. Numerous packages have been developed for this purpose (and more keep appearing), and reviewing and learning them all requires a significant time commitment. The best way to learn these models is to use them in a real project. I hope this article can serve as a guideline for selecting the right model(s) for your data science project and help you through your data science journey.

Nick Minaie, PhD (LinkedIn Profile) is a senior leader and a visionary data scientist, and represents a unique combination of leadership skills, world-class data-science expertise, business acumen, and the ability to lead organizational change. His mission is to advance the practice of Artificial Intelligence (AI) and Machine Learning in the industry.
