Tips and Tricks

What Do 101 Dalmatians and Machine Learning Have in Common?

There are at least 101 examples of machine learning that data scientists can use to create valuable insights. Here they are.

Ron Ozminkowski, PhD
Towards Data Science
33 min read · Nov 30, 2021


Junior — a very good boi, Courtesy of Jim Ozminkowski

Recently I was asked to help prepare an organization for an expected increase in business requiring artificial intelligence. AI for that firm would focus on machine learning (ML) and deep learning, techniques which use computers, guided by humans, to find patterns in data that can be used to create insights for industrial and public policy applications.

This organization is very well steeped in statistics, econometrics, and visualization. Their empirical work is on par with or better than anyone else’s, evidenced by a 50-year history of excellent public policy research, evaluation, and consulting experience, especially in the federal and state government space. To help broaden perspectives and expand capabilities I searched for and then documented examples of many empirical approaches that could add to their work. Many of these approaches are used by computer scientists but are not as generally associated with economics or other social sciences, and therefore may be less familiar to policy analysts and policy makers.

The No Free Lunch (NFL) Theorem shows why considering many analytic approaches is valuable. Goodfellow et al. (2015) explain the meaning of this theorem in their textbook on Deep Learning. They cite Wolpert’s (1996) definition of NFL, which says, “averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other” (page 113).

This equality holds only in the limit though. Indeed, there are many local exceptions. In any given case, Goodfellow et al. say there may be one or more ML algorithms which generate better predictions than others. Moreover, the set of better predictors is likely to vary from one study to another. In my view this is the most important point and the takeaway lesson from their discussion of NFL: it means that data scientists and other policy analysts should hunt for the best prediction algorithm to address each of their prediction challenges. For example, it implies that using some form of logistic or linear regression for every classification or regression problem may not be ideal, even if those approaches are the ones most familiar to analysts and policy makers. This holds even for inferential work, where the goal is association or causal attribution rather than just prediction, whenever the underlying causal mechanisms are non-linear.

If we must hunt for the best approach, and if our proficiency with each potentially worthwhile approach is not the same, it would be useful to have a quick and practical reference pointing to many helpful analytic methods. This post provides such a reference.

An Annotated List of Several Machine Learning Articles

Below you will find an annotated list of links to many articles on ML. Most of these are blogs or tutorials, so they are not published in peer-reviewed journals. However, many times comments follow the articles, and faults, questions, or issues are often posted by knowledgeable readers. Many of the writers are prominent in their fields and often respond to these queries or provide justifications for and details of their methods on GitHub or in other places.

These articles are generally easy and fast to read, usually taking fewer than 15 minutes. Sometimes readers will find equations that look intense, but they are typically described well, and the underlying intuition is usually clear. Many authors use visualization to clarify major points.

Most policy analysts work with supervised models, so there are many links to information about those models. A few links to discussions of ensemble models are provided as well, because these offer great extensions of supervised approaches.

Then I provide links to unsupervised models. The clustering, visualization, and data reduction techniques described in those articles can help data scientists use huge and varied data sets to specify models in ways that account for more variation in the underlying data. A few links to causal inference models are also provided because interest in moving from prediction to assessing causality has been building in recent years. Limitations are noted, as are suggestions for using these articles with other sources to build better models.

Every link below is accessible as of this date (11/30/21) but I do not know how long they will be active.

Photo by Tim Mossholder on Unsplash.com

Supervised Learning

This is a form of Inductive Learning that uses a statistical or other model to learn how to use input data to generate a correct prediction of an outcome of interest. The input data for supervised models are referred to as predictors, independent variables, or explanatory variables in the parlance of economics and other social sciences. They are called features in the computer science / engineering / ML world. Outcome (i.e., dependent) variables are also available in the data used for supervised learning, allowing statistical packages to differentiate between correct and incorrect predictions made by the ML model. Statistical packages also report the ability of the ML models to correctly predict the outcomes of interest, using several model performance metrics. The focus of supervised learning is typically on regression or classification problems.
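
To make the vocabulary concrete, here is a minimal sketch of the supervised learning loop: "train" a model on labeled data, then apply the learned rule to new inputs. The one-feature linear model and the data below are invented purely for illustration.

```python
# Minimal supervised-learning loop: fit a model on labeled data, then predict.
# Toy data: feature x (e.g., hours studied), outcome y (test score).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [52.0, 55.0, 61.0, 64.0, 68.0]

# "Training": ordinary least squares estimates of slope and intercept.
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx

# "Prediction": apply the learned rule to an input the model has never seen.
def predict(new_x):
    return intercept + slope * new_x
```

Real supervised learning adds held-out test data and performance metrics on top of this loop, but the fit-then-predict core is the same.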

In this section you will find links to several articles about various forms of supervised learning and related topics. We begin with links to articles describing general methodological concepts and issues. Then I provide links to articles describing several prediction methods. These include naïve Bayes, logistic regression, support vector machines, linear discriminant analysis, and decision trees, all of which can be used to classify new objects or predict binary (yes or no) outcomes. These are followed by links to articles about quantile regression, spline regression, ridge and LASSO regression, and generalized additive models, which address continuous outcomes. We then turn to forecasting with time series models, and we finish this long section with links to articles describing neural network and deep learning approaches, which can be used to predict either binary or continuous outcomes.

Some of the articles compare multiple approaches or describe ensemble applications, helping analysts address the no-free-lunch problem. We will turn to automated machine learning to facilitate such comparisons in a future post.

High-level Methodological Issues:

This section includes links to several articles that help data scientists better understand how ML can be used. We start with an article on combining ML with human expertise to make better models. Every machine learning text I have read strongly recommends this team-sport aspect of ML — humans should be guiding the work to make sure it is relevant and likely to produce useful insights. The machine is meant to enhance, not replace, the human perspective.

Next we provide a link describing the utility of notebook technologies for tracking and documenting our work. Then we focus on data preparation, programming languages, and other tools and approaches which facilitate optimal use and evaluation of ML models. After that we provide links to articles describing:

· Where to get data for our machine learning models,

· Feature (i.e., independent variable) selection processes,

· The bias-variance tradeoff that must be considered when choosing and interpreting models,

· How to use LIME, SHAP, and other methods to assess the importance of each model predictor, and

· How to create dashboards to help people who are not experts interpret the results of ML models.

Here we go …

1. J. Langston, Machine Teaching: How People's Expertise makes AI even more Powerful (2019), on https://blogs.microsoft.com/ai/machine-teaching/. When humans guide the work of the machine, the machine will produce more useful results. This is true even with automated ML approaches, which I will come back to in a future post.

2. S. Sjursen, The Pros and Cons of Using Jupyter Notebooks as Your Editor for Data Science Work (2020), on https://medium.com/better-programming/pros-and-cons-for-jupyter-notebooks-as-your-editor-for-data-science-work-tip-pycharm-is-probably-40e88f7827cb. Notebooks are used to house code, drive execution of machine learning, and store results for future reference. There are a few competing notebook products out there; Jupyter is the most widely used. Alternatives to it are compared.

3. Active Wizards, Data Science for Managers: Programming Languages (2019), on https://www.kdnuggets.com/2019/11/data-science-managers-programming-languages.html. This article describes the pros and cons of several programming languages used for supervised ML models. Examples include Python, R, Scala, Matlab, and others.

4. V. Kashid, What are the Major Differences Between Python and R for Data Science? (2020), on https://www.quora.com/What-are-the-major-differences-between-Python-and-R-for-data-science. A nice high-level comparison of Python and R, which are among the leading programming languages used for data science in industry and academia.

5. C. D., Best Data Science Tools for Data Scientists (2020), on https://towardsdatascience.com/best-data-science-tools-for-data-scientists-75be64144a88. Many tools are described for handling large and varied data, building analytic data sets, and for statistical analysis, machine learning, deep learning, model deployment, and reporting and visualization.

6. D. Johnston, How Deeply Should a Data Scientist Study Optimization Techniques? (2019), on https://www.quora.com/How-deeply-should-a-data-scientist-study-optimization-techniques/answer/David-Johnston-147?ch=8&share=64abfe6f&srid=KAgTn. Every supervised ML approach “learns” by solving an optimization problem, usually defined in terms of minimizing the difference between actual values of an outcome observed in real life vs. the predictions of those outcomes generated by the ML model. It is helpful to understand how the optimization works (hint: it is typically based on an iterative approach to using the chain rule you learned about in calculus class). In addition to this article, there are some good YouTube videos as well, describing backpropagation to apply the chain rule.
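
As a toy illustration of that iterative optimization (not the method of any one article above), here is gradient descent on a one-parameter squared-error loss. The data are made up so that the true slope is 2, and the loop just applies the chain rule repeatedly.

```python
# Gradient descent on mean squared error for a single weight w.
# Chain rule: d(loss)/dw = d(loss)/d(pred) * d(pred)/dw.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # true relationship is y = 2x (invented data)

w = 0.0                # initial guess
lr = 0.05              # learning rate (step size)
for _ in range(200):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad     # step downhill
```

After a couple hundred steps `w` sits essentially on the true value of 2; too large a learning rate would make the same loop diverge, which is why step-size tuning matters in practice.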

7. T. Waterman, Google just Published 25 Million free Datasets (2020), on https://link.medium.com/5Ssd71u2y3. Before we can get started with the process of machine learning, we obviously need data which represent the constructs, predictors, and outcomes of interest. Google did not really publish 25 million data sets, and only about half of what they have are free. They did, however, publish searchable metadata about many data sets, which can be of great help in finding the right one for your use.

8. M. Dei, Catalog of Variable Transformations To Make Your Model Work Better (2019), on https://link.medium.com/PO7esQMJn4. Sometimes variables need to be transformed (e.g., standardized, normalized, categorized) to make machine learning models run better. This article describes how to do that.

9. Z. Zhang, Understand Data Normalization in Machine Learning (2019), on https://link.medium.com/S6QjeRvFC4. This is a primer on the value of standardizing variables prior to modeling, to reduce model training time and improve accuracy.
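
The most common of those transformations, z-score standardization, is short enough to sketch directly; the raw numbers below are invented for illustration.

```python
# Z-score standardization: rescale a variable to mean 0, standard deviation 1,
# so no single feature dominates model training just because of its units.
def standardize(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    sd = var ** 0.5
    return [(v - mean) / sd for v in values]

raw = [10.0, 20.0, 30.0, 40.0]
z = standardize(raw)   # mean 0, standard deviation 1
```

In real projects the mean and standard deviation must be computed on the training data only and then reused on the test data, to avoid leaking information.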

10. T. Yiu, Understanding Cross Validation (2020), on https://link.medium.com/gEwvRFTE83. Cross-validation is a way of subdividing your data set many times, each time training a model with one subset left out, then aggregating information across all the models to find the model with the best predictive power. This article describes how it works.
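
The resampling idea is easy to see in miniature. In this sketch the "model" is deliberately trivial (predict the training-set mean) so the focus stays on the fold mechanics; the data are invented.

```python
# k-fold cross-validation: hold out each fold once, train on the rest,
# and average the held-out error across folds.
def kfold_mse(y, k):
    n = len(y)
    fold_size = n // k
    errors = []
    for i in range(k):
        test = y[i * fold_size:(i + 1) * fold_size]      # held-out fold
        train = y[:i * fold_size] + y[(i + 1) * fold_size:]
        pred = sum(train) / len(train)                   # "model": training mean
        errors.append(sum((t - pred) ** 2 for t in test) / len(test))
    return sum(errors) / k                               # average held-out MSE

cv_error = kfold_mse([1.0, 2.0, 3.0, 4.0], k=2)
```

Swapping the training-mean line for any real model's fit/predict step gives ordinary k-fold cross-validation.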

11. R. Agarwal, The 5 Classification Evaluation Metrics Every Data Scientist must know (2019), on https://link.medium.com/IMLmW0IrW3. This is a primer on model performance metrics for classification analyses — how to tell if your model is performing well.
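
Several of those metrics come straight from the confusion matrix, as this small sketch shows; the label vectors are invented for illustration.

```python
# Accuracy, precision, recall, and F1 from true/false positives and negatives.
def metrics(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp)          # of predicted positives, how many right
    recall = tp / (tp + fn)             # of actual positives, how many found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Accuracy alone can mislead on imbalanced outcomes, which is why precision and recall are usually reported alongside it.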

12. S. Kochlerlakota, Bias and Variance — Cut Through the Noise (2019), on https://link.medium.com/XIhInRhIk3. Every machine learning and statistical procedure that is used to predict a phenomenon or explain it must balance the notions of bias (i.e., are we getting the correct answer?) and variance (how certain can we be about that?). This article addresses that tradeoff.

13. B. O. Tayo, Sources of Error in Machine Learning (2020), on https://medium.com/towards-artificial-intelligence/sources-of-error-in-machine-learning-33271143b1ab. This article describes ten types of errors to guard against in machine learning models. These errors are not unique to machine learning; they apply to every type of analysis.

14. J. Zornoza, An Introduction to Feature Selection (2020), on https://towardsdatascience.com/an-introduction-to-feature-selection-dd72535ecf2b. Once one has data it is time to grab useful pieces for analysis. Feature selection is the art of finding independent variables for machine learning models. This can be guided by theory and aided greatly by computer analysis.

15. M. Grogan, Feature Selection Techniques in Python: Predicting Hotel Cancellations (2020), on https://link.medium.com/H5MsJdIJn4. Here is more on finding good predictors, with an example and Python code.

16. W. Koehrsen, Why Automated Feature Engineering Will Change the Way You Do Machine Learning (2018), on https://link.medium.com/L8jojOVlJ1. Feature engineering is the process of taking a dataset and constructing explanatory variables — features — that can be used to train a machine learning model for a prediction problem. This can now be automated, as described in this article.

17. G. Hutson, FeatureTerminatoR — a Package to Remove Unimportant Variables from Statistical and Machine Learning Models Automatically (2021), on https://www.r-bloggers.com/2021/07/featureterminator-a-package-to-remove-unimportant-variables-from-statistical-and-machine-learning-models-automatically/. This article describes several types of cross-validation that can be used to eliminate variables which are not helping to improve model prediction. It also addresses ways to avoid multicollinearity, which can result in unstable predictions.

18. J. Brownlee, A Gentle Introduction to Model Selection for Machine Learning (2019), on https://machinelearningmastery.com/a-gentle-introduction-to-model-selection-for-machine-learning/. This article describes how to pick among the many machine learning approaches, based on the problem you want to solve.

19. S. Mazzanti, SHAP Explained the way I Wish Someone Explained it to me (2020), on https://towardsdatascience.com/shap-explained-the-way-i-wish-someone-explained-it-to-me-ab81cc69ef30. The author provides a nice way to understand how SHAP can be used to explain the contribution of various features (variables) in predictive models.

20. D. Dataman, Explain Any Models with the SHAP Values — Use the KernelExplainer (2019), on https://link.medium.com/Fjo6lzakY1. Machine learning models sometimes appear to be black boxes that do not show the relative importance of predictor variables. This article explains how to find that out by using a new Python utility, illustrating results for many different types of ML models.

21. P. Dos Santos, Be Careful What You SHAP For (2020), on https://medium.com/@pauldossantos/be-careful-what-you-shap-for-aeccabf3655c. The author notes that SHAP can produce misleading results in models that combine continuous and binary predictors.

22. A. Nayak, Idea Behind LIME and SHAP (2019), on https://link.medium.com/13VS0JSmD2. This article introduces another way to estimate the relative importance of predictor variables (LIME) and compares it to SHAP.

23. M. Dei, Three Model Explanability Methods Every Data Scientist Should Know (2019), on https://towardsdatascience.com/three-model-explanability-methods-every-data-scientist-should-know-c332bdfd8df. This article goes much deeper into model transparency and interpretability. The author introduces the concept of permutation importance and describes the utility of a partial dependence plot. He references a new version of the scikit-learn library that can help apply these techniques, and SHAP too.
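
Permutation importance, one of the methods that article covers, is simple enough to sketch in a few lines. The toy model below deliberately uses only one of its two features, and the data and permutation are invented, so the contrast in scores is easy to predict.

```python
# Permutation importance: permute one feature's values and measure how much
# the model's error worsens. A feature the model relies on should hurt a lot.
def model(x1, x2):
    return 2 * x1            # toy "fitted model": x2 is deliberately ignored

x1s = [1.0, 2.0, 3.0, 4.0]
x2s = [5.0, 6.0, 7.0, 8.0]
ys  = [2.0, 4.0, 6.0, 8.0]   # matches y = 2 * x1 exactly

def mse(x1_col, x2_col):
    return sum((model(a, b) - y) ** 2
               for a, b, y in zip(x1_col, x2_col, ys)) / len(ys)

baseline = mse(x1s, x2s)                     # 0.0: the model fits perfectly
perm = [3.0, 4.0, 1.0, 2.0]                  # a fixed permutation of x1s
importance_x1 = mse(perm, x2s) - baseline    # error jumps: x1 matters
importance_x2 = mse(x1s, list(reversed(x2s))) - baseline   # 0.0: x2 unused
```

In practice the permutation is random and repeated several times per feature, with the increases averaged.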

24. S. Mazzanti, Which of Your Features are Overfitting? Discover ParShap: an Advanced Method to Detect Which Columns Make Your Model Underperform (2021), on https://towardsdatascience.com/tagged/tips-and-tricks?p=c46d0762e769. This article describes an extension of SHAP that he created, to figure out which independent variables (features) cause models to perform very well on sampled data but poorly on new data.

25. S. Mazzanti, Black Box Models are Actually More Explainable Than a Logistic Regression (2019), on https://towardsdatascience.com/black-box-models-are-actually-more-explainable-than-a-logistic-regression-f263c22795d. The author provides a nice explanation of how to derive probabilities from complex models, so they are a lot easier to explain.

26. H. Sharma, Machine Learning Model Dashboard: Creating Dashboards to Interpret Machine Learning Models (2021), on https://towardsdatascience.com/machine-learning-model-dashboard-4544daa50848. Sharma shows how to create a dashboard to see better how your model is performing.

Some Illustrative Use Cases:

There are thousands of supervised ML applications described in the literature and/or being used in industry and academia all over the world. Here are just a few showing how ML can provide useful insights. Chances are just googling a phrase describing your issue of interest along with “machine learning” will illustrate how ML may be useful for your application.

27. S. Pasutto, What makes a good machine learning use-case? (2019), on https://link.medium.com/Ja4FUL3yG2. The majority of empirical questions our clients ask us to address may make good use cases for ML. This is true even for inferential work, because the program impact estimates or other inferences we make can be viewed as predictions of future outcomes, and ML excels at making solid predictions of outcomes clients care about. There are many other use cases as well, some of which address internal management challenges such as cost reduction, risk reduction, and profit maximization. These three management challenges are noted in this article.

28. J. Wang, Musings at the Intersection of Data Science and Public Policy (2017), on https://towardsdatascience.com/musings-at-the-intersection-of-data-science-and-public-policy-cf0bb2fadc01. The author provides some words of wisdom about the utility and limitations of data science to address public policy issues. She includes a link to several use cases.

29. J. Rowe, How Natural Language Processing and Machine Learning can help Track the Corona Virus (2020), on https://www.healthcareitnews.com/ai-powered-healthcare/researchers-tap-ai-track-spread-coronavirus. This is an interesting use case illustrating how natural language processing and machine learning can help track the spread of the coronavirus.

30. Emerging Technology from the arXiv, Machine Learning has Revealed Exactly how much of a Shakespeare play was Written by Someone Else (2019), on https://www.technologyreview.com/s/614742/machine-learning-has-revealed-exactly-how-much-of-a-shakespeare-play-was-written-by-someone/. This use case is radically different from the others we consider, but the article opens our eyes to the fact that ML is a flexible tool that can be used in a wide variety of applications.

31. J. Kent, Machine Learning Could Improve End-of-Life Communication (2019), on https://healthitanalytics.com/news/machine-learning-could-improve-end-of-life-communication?eid=CXTEL000000199563&elqCampaignId=12666&utm_source=nl&utm_medium=email&utm_campaign=newsletter&elqTrackId=891254c5466e49b0bec8c00bda09517b&elq=c3c28e7e780c43faac51db1f0fd44882&elqaid=13294&elqat=1&elqCampaignId=12666. This article describes the use of natural language processing and machine learning to assess the content of end-of-life conversations and then suggest better approaches.

32. J. Harris (with M. Stewart), Data Privacy and Machine Learning in Environmental Science, in https://towardsdatascience.com/data-privacy-and-machine-learning-in-environmental-science-490fded366d5. This is a podcast where Jeremie Harris interviews Matthew Stewart, a PhD student in environmental science at Harvard. Matthew is a frequent contributor to Toward Data Science, and his PhD work is at the nexus of machine learning and environmental science. He and his colleagues focus on climate change models, carbon emissions, and other important issues. In this podcast he focuses on tradeoffs between data privacy, the analytic value of individual observations, and model bias, among other issues.

Photo by Alexander Schimmeck on Unsplash.com

Supervised Learning Methods:

33. J. Brownlee, A Tour of Machine Learning Algorithms (2020), on https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/. We open with a nice review of many types of machine learning models by a master of pragmatic machine learning approaches.

34. S. Patel, Chapter 1: Supervised Learning and Naive Bayes Classification — Part 1 (Theory) (2017), on https://link.medium.com/VZlAzZKeK2. Naïve Bayes takes advantage of Bayes’ Theorem to help generate predictions of binary outcomes when variables used to generate those predictions are independent (i.e., naïve) of each other. This is easy to do and provides readily interpretable results, so Naïve Bayes is often a favored initial approach. This article describes how it works.

35. S. Patel, Chapter 1: Supervised Learning and Naive Bayes Classification — Part 2 (Coding) (2017), on https://link.medium.com/87CBHgWeK2. This article describes how to carry out a Naïve Bayes approach in Python.

36. T. Yiu, Understanding The Naive Bayes Classifier (2019), on https://link.medium.com/OpjLwR7ZF4. Here is another nice primer on Naïve Bayes, a classifier that often generalizes well and is easy to use. This sentence from the article summarizes it in a nutshell: At a high level, naive Bayes is just applying a simplified version of Bayes’ Theorem to every observation based on its features (and for each potential class).
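
That "simplified version of Bayes' Theorem" fits in a few lines: multiply per-feature likelihoods as if the features were independent, weight by the class prior, and normalize. The probabilities below are invented for illustration.

```python
# Naive Bayes posterior for a binary class: P(class | features) is proportional
# to prior * product of per-feature likelihoods (the "naive" independence step).
def nb_posterior(likelihoods_pos, likelihoods_neg, prior_pos):
    up, un = prior_pos, 1 - prior_pos
    for lp, ln in zip(likelihoods_pos, likelihoods_neg):
        up *= lp          # P(feature_i | positive class)
        un *= ln          # P(feature_i | negative class)
    return up / (up + un) # normalize so the two posteriors sum to 1

# Two features observed; made-up likelihoods and a 50% prior.
posterior = nb_posterior([0.8, 0.6], [0.1, 0.4], 0.5)
```

With likelihoods estimated from training-data counts, this is the whole classifier, which is why naive Bayes trains so quickly.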

37. M. Maramot Le, Why is Logistic Regression used so Often in Data Science? (2019), on https://www.quora.com/Why-is-logistic-regression-used-so-often-in-data-science/answer/Marmi-Maramot-Le?ch=8&share=7e0c7b2a&srid=KAgTn. This is a nice description of the virtues of logistic regression, which may be the most popular classification method used across many industries.
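
Part of that popularity is how little machinery logistic regression needs at prediction time: a linear score pushed through the sigmoid. The coefficients below are hypothetical, standing in for values a fitting routine would estimate.

```python
import math

# Logistic regression turns a linear score into a probability via the sigmoid.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted coefficients: intercept -1.5, slope 0.8 per unit of x.
def predict_prob(x, intercept=-1.5, slope=0.8):
    return sigmoid(intercept + slope * x)
```

The slope also has a clean interpretation: each one-unit increase in x multiplies the odds of the outcome by exp(0.8), which is part of why the method is so popular with analysts.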

38. H. H. Strand, When Evaluating a Logistic Regression, what are the Differences Between Interpreting the Brier Score vs the AUROC? When would you use one vs the other? Is it possible to have a high score in one and a low score in the other? (2019), on https://www.quora.com/When-evaluating-a-logistic-regression-what-are-the-differences-between-interpreting-the-Brier-Score-vs-the-AUROC-When-would-you-use-one-vs-the-other-Is-it-possible-to-have-a-high-score-in-one-and-a-low-score-in-the/answer/H%C3%A5kon-Hapnes-Strand?ch=8&share=d7361af6&srid=KAgTn. Here is a quote from the author: The AUC of a ROC curve is one of the most generally applicable metrics for classification models, but it only evaluates the rank of your predictions. The Brier score calculates the accuracy of probabilistic predictions. If you only care about classification, AUC is fine, but if you want to estimate the probability of something, then the Brier score is a good additional metric to look at.

39. F. Qiao, Logistic Regression Model Tuning with Scikit-learn — Part 1 (2019), on https://link.medium.com/nJf2bACEY2. Ways to tinker with logistic regression models are described in this article, to improve model performance. A comparison to random forest is also conducted. Python code is provided.

40. S. Deb, Naive Bayes vs Logistic Regression (2016), on https://link.medium.com/UPGymGIfK2. Since Naïve Bayes and logistic regression are two popular, relatively easy, and interpretable approaches to use, this article describes where each may shine. Logistic regression dominates with huge datasets, but most datasets are not that large and many times Naïve Bayes generates accurate results faster. Empirical error, generalization error, and confounding error are described, including Simpson’s Paradox.

41. M. Sanjeevi, Support Vector Machine With Math (2017), on https://medium.com/deep-math-machine-learning-ai/chapter-3-support-vector-machine-with-math-47d6193c82be. The author provides a nice description of how SVM works, pairing graphics with math. This one should appeal to those with a strong math background.

42. R. Pupale, Support Vector Machines — An Overview (2018), on https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989. This is a nice introduction to SVM, with intuitive graphics to describe underlying theory, and Python code.

43. ODSC — Open Data Science, Build a Multi-Class Support Vector Machine in R (2018), on https://link.medium.com/ztbUrFveK2. The author provides theory, along with an application of SVM in R, for an example with multiple classes in the outcome variable.

44. P. Nistrup, Linear Discriminant Analysis (LDA) 101, Using R (2019), on https://link.medium.com/WWlqD6xQN2. LDA is another form of classification analysis and in this sense can be an alternative to logistic regression, decision trees, Naïve Bayes, and other methods. It can also be viewed either as an alternative to or used in complement with principal component analyses (see Unsupervised Models below for info on PCA). There are some nice visualizations to illustrate LDA in this article, along with R code to see how one might try it out.

45. S. T, Entropy: How Decision Trees Make Decisions (2019), on https://link.medium.com/vXj620nRN2. This article describes how entropy, a measure of purity, and information gain are used to create decision trees.
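
Both quantities are compact enough to compute by hand. In this sketch, a split that separates the classes perfectly recovers all of the parent node's entropy as information gain; the class proportions are invented.

```python
import math

# Entropy of a node with binary classes, where p is the share of one class.
def entropy(p):
    if p in (0.0, 1.0):
        return 0.0                 # a pure node has zero entropy
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Information gain = parent entropy minus the weighted child entropies.
def information_gain(parent_p, left_p, right_p, left_frac):
    child = left_frac * entropy(left_p) + (1 - left_frac) * entropy(right_p)
    return entropy(parent_p) - child

# A 50/50 parent split perfectly into two pure children: gain = 1 full bit.
gain = information_gain(0.5, 1.0, 0.0, 0.5)
```

A decision tree simply evaluates this gain for every candidate split and greedily keeps the best one at each node.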

46. P. Flom, An Introduction to Quantile Regression (2018), on https://link.medium.com/hcLvE9WoK2. In contrast to ordinary least squares, quantile regression makes no assumptions about the distribution of residuals. It also works better than ordinary least squares when the dependent variable is bimodal or multimodal. This article provides an example, along with SAS code, to analyze birth weight data with a quantile approach.

47. C. Lee, Quantile Regression — Part 1 (2018), on https://link.medium.com/HtaBXw6oK2. This article describes how quantile regression works, and provides examples using Tensorflow, Pytorch, Light GBM, and Scikit-Learn. Math-based theory is provided as well.

48. M. Ghenis, Quantile Regression, from Linear Models to Trees to Deep Learning (2018), on https://link.medium.com/arc6wSspK2. This article compares quantile regression to random forest, gradient boosting, ordinary least squares, and deep learning approaches.
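
What all these quantile approaches share is the asymmetric "pinball" loss they minimize, sketched below with invented numbers to show the asymmetry.

```python
# Pinball (quantile) loss: for quantile q, under-predictions are weighted q
# and over-predictions are weighted 1 - q.
def pinball(actual, pred, q):
    diff = actual - pred
    return q * diff if diff >= 0 else (q - 1) * diff

# Median regression (q = 0.5) penalizes both sides equally...
symmetric = (pinball(10.0, 8.0, 0.5), pinball(6.0, 8.0, 0.5))
# ...while q = 0.9 penalizes under-prediction nine times as heavily.
asymmetric = (pinball(10.0, 8.0, 0.9), pinball(6.0, 8.0, 0.9))
```

Minimizing this loss instead of squared error is what lets a model target the median, the 90th percentile, or any other quantile of the outcome.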

49. S. Vyawahare, Spline Regression in R (2019), on https://link.medium.com/T7IGk9QpK2. Spline regression (also called piece-wise regression) is a nonparametric technique that can be used well with non-linear data. It can be an alternative to quantile regression.

50. M.S. Mahmood, Fit Non-linear Relationship Using General Additive Model (2021), https://towardsdatascience.com/fit-non-linear-relationship-using-generalized-additive-model-53a334201b5d. This is a nice article describing how general additive models can be used to generate predictions for non-linear relationships. Outcomes are modelled as the sum of non-linear functions of input variables. Comparisons to spline and polynomial regression are provided.

51. V. Sonone, Conventional Guide to Supervised Learning with Scikit-learn — Robustness regression: Outliers and Modeling Errors — Generalized Linear Models (2018), on https://link.medium.com/UtGDO37BL2. Robust regression procedures are valuable when outliers or other data problems exist.

52. S. Bhattacharyya, Ridge and Lasso Regression: A Complete Guide with Python Scikit-Learn (2018), on https://link.medium.com/RGST954ii2 — This quote from the article summarizes its focus well: Ridge and Lasso regression are some of the simple techniques to reduce model complexity and prevent over-fitting which may result from simple linear regression.

53. A. Singhal, Machine Learning: Ridge Regression in Detail (2018) on https://link.medium.com/KoK0E7Oxx3. As the title says, here is more information on ridge regression. The author provides a nice review of the problem of overfitting, which causes models to perform very well on training data sets but poorly when used for completely new data. He shows how ridge regression can be used to avoid this problem.
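
The shrinkage mechanism is visible even in the one-feature case, where ridge has a closed form: the penalty just inflates the denominator of the least-squares slope. The centered data below are invented so the unpenalized slope is exactly 2.

```python
# One-feature ridge regression (centered data, no intercept):
# slope = sum(x*y) / (sum(x^2) + lambda), so lambda shrinks it toward zero,
# trading a little bias for lower variance.
def ridge_slope(x, y, lam):
    return sum(xi * yi for xi, yi in zip(x, y)) / \
           (sum(xi * xi for xi in x) + lam)

x = [-2.0, -1.0, 1.0, 2.0]
y = [-4.0, -2.0, 2.0, 4.0]            # exact relationship y = 2x
unpenalized = ridge_slope(x, y, 0.0)  # ordinary least squares slope
shrunk = ridge_slope(x, y, 10.0)      # ridge pulls the slope toward 0
```

LASSO uses an absolute-value penalty instead of a squared one, which is what lets it zero out coefficients entirely rather than just shrinking them.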

54. S. Gupta, Pros and Cons of Various Machine Learning Algorithms (2020) on https://towardsdatascience.com/pros-and-cons-of-various-classification-ml-algorithms-3b5bfb3c87d6. This article reviews the pros and cons of Support Vector Machines, Logistic Regression, Naïve Bayes, Random Forest, Decision Trees, k-Nearest Neighbor, and XGBoost for prediction exercises.

55. E. Lisowski, The best Forecast Techniques, or how to Predict from Time Series Data (2019) on https://link.medium.com/tR9OlYzpR1. This article describes how to generate good predictions using time series data. Think, for example, of how to best forecast payments to health care providers that should be made under all-payer (i.e., all-insurance) systems used in several states across the country. ARIMA models like the ones mentioned here are well suited to that and to similar questions. The firm I mentioned in the introduction to this post developed a great ARIMA model for this purpose, greatly enhancing the fairness and accuracy of payments to providers in several states.

56. J. Nikulski, Time Series Forecasting with AdaBoost, Random Forests and XGBoost (2020) on https://link.medium.com/RX2I64fLf5. This is a nice review of three ML techniques that can be used with time series data. The author provides guidance about how to test performance of these models. The performance testing techniques for time series models are fundamentally different from those used for other kinds of ML, for reasons noted in the article.

57. J. D. Seo, Trend, Seasonality, Moving Average, Auto Regressive Model: My Journey to Time Series Data with Interactive Code (2018), on https://link.medium.com/n5qvZnKhY1 — This article reviews more models and has some nice visualizations.

58. B. Etienne, Time Series in Python — Exponential Smoothing and ARIMA processes (2019), on https://link.medium.com/IwSvpfzfn2 — This one has more info on checking for stationarity, which is required to make good predictions from time series models. Transformations to obtain stationarity are provided too. Python code and visualizations are provided, as are thoughts about how to pick a time series model from among several choices, and how to plot the predictions.
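
The exponential smoothing half of that article reduces to a very short recursion: each smoothed level is a weighted average of the newest observation and the previous level. The series and smoothing weight below are invented.

```python
# Simple exponential smoothing: level_t = alpha * obs_t + (1 - alpha) * level_{t-1}.
# The final level serves as the one-step-ahead forecast.
def exp_smooth(series, alpha):
    level = series[0]
    for obs in series[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

forecast = exp_smooth([10.0, 12.0, 11.0, 13.0], alpha=0.5)
```

A larger alpha tracks recent observations more closely; a smaller one averages over more history, which is the basic trade-off behind choosing the smoothing constant.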

59. S. Palachy, Stationarity in Time Series Analysis (2019), on https://link.medium.com/SLpze7QdI2. For those who would like a heavier dose of theory and math, here is a description of these in the context of the stationarity property that underlies solid approaches to forecasting with time series data.

60. R. Kompella, Using LSTMs to forecast time-series (2018), on https://towardsdatascience.com/using-lstms-to-forecast-time-series-4ab688386b1f. One can also forecast from time series using a neural network approach. This article describes how to do that in Python.

61. J. Brownlee, How to Code a Neural Network from Scratch with Backpropagation in Python (2021), on https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/. A neural network approach is loosely based on how neurons in the brain pass information along to each other. This biological process has inspired neural network and deep learning models, which can be very good at predicting binary or continuous outcomes. This article explains how to generate neural network models in Python and how to use a technique called backpropagation to minimize model error.
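
Brownlee's full from-scratch network is longer, but the essence of backpropagation fits in a sketch: run a forward pass, then apply the chain rule backward through each layer to update the weights. The single training example and starting weights below are invented.

```python
import math

# One-hidden-unit network trained by backpropagation on one made-up example:
# input 1.0, target 0.0, squared-error loss.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

w1, w2 = 0.5, 0.5              # input->hidden and hidden->output weights
x, target, lr = 1.0, 0.0, 0.5
for _ in range(500):
    h = sigmoid(w1 * x)        # forward pass: hidden activation
    out = w2 * h               # forward pass: network output
    err = out - target
    grad_w2 = err * h                          # chain rule: dLoss/dw2
    grad_w1 = err * w2 * h * (1 - h) * x       # chain rule through the hidden unit
    w2 -= lr * grad_w2         # gradient descent step
    w1 -= lr * grad_w1

final_out = w2 * sigmoid(w1 * x)   # prediction after training
```

Deep learning libraries automate exactly these gradient computations across millions of weights, but the mechanics are the same chain-rule bookkeeping.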

62. A. Ye, You Don’t Understand Neural Networks Until You Understand the Universal Approximation Theorem (2020), on https://medium.com/analytics-vidhya/you-dont-understand-neural-networks-until-you-understand-the-universal-approximation-theorem-85b3e7677126. This theorem states that a neural network with just one hidden layer, given enough neurons, can approximate any continuous (linear or non-linear) function to a reasonable degree of accuracy, provided a sigmoid-like activation function is used. The author provides a nice description and a few examples.
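One informal way to see the theorem in action: fix random weights in a single sigmoid hidden layer and fit only the output weights by least squares. This numpy sketch (my illustration, not the article's, and a demo rather than a proof) approximates sin(x) on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a smooth non-linear function on a grid.
x = np.linspace(-3, 3, 100)
y = np.sin(x)

# One hidden layer of 200 sigmoid units with fixed random weights;
# only the output weights are fit (by ordinary least squares).
n_hidden = 200
w = rng.normal(scale=2.0, size=n_hidden)
b = rng.uniform(-3, 3, size=n_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = sigmoid(np.outer(x, w) + b)             # (100, 200) hidden features
H1 = np.column_stack([H, np.ones(len(x))])  # add a bias column
coef, *_ = np.linalg.lstsq(H1, y, rcond=None)
approx = H1 @ coef

max_err = float(np.max(np.abs(approx - y)))
```

With enough hidden units the fitted curve hugs the target closely, which is the theorem's promise in miniature; a genuinely hard function would of course need many more units.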

63. S. K. Dasaradh, A Gentle Introduction to Math Behind Neural Networks (2019), on https://link.medium.com/MDZLalMfI2. While “gentle” is in the eye of the beholder, this article describes the intuition as well as the math behind neural networks.

64. V. Sethi, Types of Neural Networks (and what each one does!) Explained (2019) on https://link.medium.com/hWRnjnJEY2. Neural networks can be used for regression or classification problems, so may be alternatives to any of the methods described in the links above. The different types of neural networks are described at a high level in this article.

65. Towards AI Team, Main Types of Neural Networks and its Applications — Tutorial by Towards AI Team (2020), on https://link.medium.com/rgohjQrQG8. This is a short review of 27 types of neural network models. The pictures at the top provide a nice visualization of these approaches.

66. E. Hlav, Activation Functions in Deep Learning: From Softmax to Sparsemax — Math Proof (2020), on https://link.medium.com/sul2IG8799. This article describes Sparsemax, an activation function used in neural networks to predict outcomes with multiple categories.
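For context, softmax and sparsemax can be compared in a few lines of numpy. The sparsemax projection below follows the Martins and Astudillo (2016) recipe as I understand it; treat it as an illustrative sketch rather than the article's code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

def sparsemax(z):
    """Euclidean projection of logits onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, which is
    useful when only a few classes should get any probability.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum    # coordinates that stay positive
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z  # threshold subtracted from logits
    return np.maximum(z - tau, 0.0)

z = np.array([3.0, 1.0, 0.2])
p_soft = softmax(z)     # dense: every class gets some probability
p_sparse = sparsemax(z) # sparse: dominated classes get exactly zero
```

On these logits softmax spreads probability over all three classes, while sparsemax puts all of it on the dominant class, illustrating why sparsemax suits multi-label problems where most classes are irrelevant.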

67. A. Bonner, What is Deep Learning and How Does it Work? (2019) on https://link.medium.com/csFjYHHoR1. Deep learning is a form of ML based on neural networks. While basic neural networks have just one “hidden” layer between the input data and the output prediction, deep learning models have two or more. In complex imaging, genomic, or physics applications there may be dozens, hundreds, or even thousands of hidden layers, each of which builds upon the previous layer(s) in an effort to generate better predictions.

68. M. Lugowska, A Must-read Tutorial When You are Starting Your Journey with Deep Learning (2019), on https://blog.skygate.io/a-must-read-tutorial-when-you-are-starting-your-journey-with-deep-learning-5fa4da071510. This introduction to deep learning offers advice about how to handle hidden layers, where the real work gets done, and about how to tune these models.

69. D. Jangir, Need and Use of Activation Functions (2018), on https://link.medium.com/dJYXxPbNf5. The author provides a nice review of activation functions used in neural networks to generate predictions with nonlinear data. Several activation functions are reviewed.

70. T. Folkman, How To Use Deep Learning Even with Small Data (2019), on https://link.medium.com/f57XSYHCB2. Most people think that deep learning requires huge datasets, but this is not always the case, as explained in this article.

Photo by Manuel Nageli on Unsplash.com

Compositional and Ensemble Learning:

Compositional and ensemble learning both combine several machine learning models to solve a single problem.

71. A. Ye, Compositional Learning is the Future of Machine Learning (2020), on https://link.medium.com/89fJQJGV66. Compositional Learning is used when large tasks can be broken into meaningful, disparate parts, with each part being analyzed by a different machine learning approach.

72. J. Roca, Ensemble methods: bagging, boosting and stacking (2019), on https://link.medium.com/93kLZDWPS1. Bagging (bootstrap aggregating) refers to training many (i.e., 10s, 100s, or 1000s of) similar models in parallel, each on a bootstrapped sample of the data, and averaging their predictions; random forest models are bagging models. Boosting refers to fitting models sequentially, with each new model focusing in part on the errors of the previous ones. Stacking refers to running many different types of models in parallel and then combining their results with another model.
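The contrast between bagging and boosting can be sketched with a single shared "stump" learner in plain numpy. Everything below (the toy data, the stump, the learning rate) is my own illustrative choice, not the article's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(x, y):
    """Best single-split regression 'stump' on 1-D data (minimizes SSE)."""
    u = np.unique(x)
    if len(u) < 2:
        m = y.mean()
        return lambda q: np.full(np.shape(q), m, dtype=float)
    best = (np.inf, u[0], y.mean(), y.mean())
    for t in u[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, thr, lo, hi = best
    return lambda q: np.where(q <= thr, lo, hi)

# Noisy non-linear toy data.
x = np.linspace(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=80)

# Bagging: many stumps fit in parallel on bootstrap samples, then averaged.
bagged = [fit_stump(x[idx], y[idx])
          for idx in (rng.integers(0, len(x), len(x)) for _ in range(50))]
bag_pred = np.mean([m(x) for m in bagged], axis=0)

# Boosting: stumps fit sequentially, each to the current residuals.
boost_pred = np.zeros_like(y)
for _ in range(50):
    stump = fit_stump(x, y - boost_pred)  # fit what the ensemble still gets wrong
    boost_pred += 0.5 * stump(x)          # shrink each stump's contribution

mse_bag = float(np.mean((bag_pred - y) ** 2))
mse_boost = float(np.mean((boost_pred - y) ** 2))
```

Both ensembles far outperform any single stump here; the structural difference is that the bagged stumps never see each other, while each boosted stump is trained on its predecessors' mistakes.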

73. S. Glenon, Decision Tree vs Random Forest vs Gradient Boosting Machines: Explained Simply (2019), on https://www.datasciencecentral.com/profiles/blogs/decision-tree-vs-random-forest-vs-boosted-trees-explained. A random forest (RF) is constructed by combining many decision trees after they are all generated. Gradient boosted machine (GBM) approaches combine trees as they are being built, one after the other. Some pros and cons of RF and GBM are provided.

74. W. Koehrsen, Random Forest in Python (2017) on https://link.medium.com/zJEfiDWEB2 — Random Forest is one of the most popular and often one of the best supervised approaches to predict binary outcomes. It is often but not always better than logistic regression, support vector machines, gradient boosting, or neural network approaches. RF is a decision-tree approach, combining output from 10s or 100s or 1000s of individual decision trees. The number of trees to use is a hyperparameter to be set by the user — see below for more on hyperparameters.

75. A. Ye, When and Why Tree-Based Models (Often) Outperform Neural Networks (2020), on https://towardsdatascience.com/when-and-why-tree-based-models-often-outperform-neural-networks-ceba9ecd0fd8. While the no-free-lunch principle guarantees that no particular modeling approach dominates in every situation, there are some situations in which random forest and other tree-based models tend to outperform neural networks. See why inside.

76. W. Koehrsen, Hyperparameter Tuning the Random Forest in Python (2018), on https://link.medium.com/TFvgX1DlD2. Hyperparameters are settings the analyst chooses before training that govern how a model learns (think of settings as in settings used for email or mobile phones); unlike model parameters, they are not estimated from the data. Hyperparameter choice is not well guided by theory; trial and error and intuition are used to arrive at the final settings, which are those associated with minimum model error. This search can be automated.
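A minimal version of that automated search can be written in a few lines. Here I use ridge regression's penalty strength as the hyperparameter and a held-out validation set as the scoring rule; all of these choices are illustrative assumptions of mine, not the article's setup:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy regression data split into training and validation sets.
X = rng.normal(size=(120, 10))
beta = rng.normal(size=10)
y = X @ beta + rng.normal(scale=2.0, size=120)
X_tr, X_val, y_tr, y_val = X[:80], X[80:], y[:80], y[80:]

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression: solve (X'X + alpha*I) w = X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Random search: sample hyperparameter values on a log scale and keep
# whichever yields the lowest validation error.
results = {}
for alpha in 10 ** rng.uniform(-3, 3, size=20):
    w = fit_ridge(X_tr, y_tr, alpha)
    results[float(alpha)] = float(np.mean((X_val @ w - y_val) ** 2))

best_alpha = min(results, key=results.get)
```

Sampling on a log scale matters because sensible penalty strengths span several orders of magnitude; the same pattern extends directly to tuning the number of trees or tree depth in a random forest.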

77. P. Patidar, HyperParameters in Machine Learning (2019), on https://link.medium.com/VAVKbs6qi4. Another brief primer on hyperparameters is provided in this article.

78. T. S, A Mathematical Explanation of AdaBoost in 5 Minutes (2020), on https://towardsdatascience.com/a-mathematical-explanation-of-adaboost-4b0c20ce4382. This is a short explanation of AdaBoost, an adaptive boosting tree-based approach to ML. An example is provided.

Photo by Jan Antonin Koler on Unsplash.com

Unsupervised Learning:

This is a form of learning that involves a search for patterns within a data set, but without the advantage of having labelled outcome variables. Many unsupervised approaches are used to group similar observations into smaller clusters.

Visualization is a form of unsupervised learning as well; see the articles about Topological Data Analysis (TDA) below as examples. TDA is a way to understand the data we analyze in a more complete way.

Finally, efforts to create lower dimensional representations of data are also examples of unsupervised learning approaches. Great examples include principal component analyses, factor analyses, and autoencoders, which vary according to how the user wants to reduce the information in, say, hundreds or thousands of variables, into many fewer (perhaps just a few) variables, for easier use in ML.

79. J. T. Raj, A Beginners Guide to Dimensionality Reduction for Machine Learning (2019), on https://link.medium.com/wWOFkXNoe3. This article describes how to reduce a dataset with hundreds or thousands of features (i.e., variables, or functions of variables) into a much smaller number, without incurring much information loss. This allows users to capitalize on huge data sets and still get models to converge.

80. A. Ye, Linear Discriminant Analysis Explained in Under Four Minutes: the Concept, the Math, the Proof, & Applications (2020), on https://medium.com/analytics-vidhya/linear-discriminant-analysis-explained-in-under-4-minutes-e558e962c877. LDA is a technique that can be used for supervised or unsupervised analytics. This article focuses more on unsupervised applications.

81. P Nistrup, Linear Discriminant Analysis (LDA) 101, Using R (2019), on https://towardsdatascience.com/linear-discriminant-analysis-lda-101-using-r-6a97217a55a6. This is a nice application showing how to conduct LDA using R statistical software.

82. M. J. Garbade, Understanding K-means Clustering in Machine Learning (2018), on https://link.medium.com/YAkPCunvp1. K-means Clustering is a method whereby the user decides how many subsets (clusters) he or she wants to create from a source dataset (this is the number k) and the software figures out how to build those clusters so that the observations within a cluster are similar, but different from the observations in other clusters.

83. C. Maklin, K-means Clustering Python Example (2018), on https://link.medium.com/9Ro8ahOgn2 — The author provides a nice introduction to K-means Clustering, with good visualizations and Python code.
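For comparison with the article's version, here is a bare-bones numpy implementation of Lloyd's algorithm for k-means. Note that seeding one center per known blob, as done below, is a toy simplification for reproducibility; real implementations use k-means++ or random restarts:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two well-separated 2-D blobs of 50 points each.
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

def kmeans(X, k, init_idx, n_iter=20):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centers = X[init_idx].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(X, k=2, init_idx=[0, 50])
```

Within a few iterations each center settles at a blob's mean, giving clusters that are internally similar but distinct from each other, which is exactly the behavior the article describes.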

84. A. Singh, Build Better and Accurate Clusters with Gaussian Mixture Models (2019), on https://link.medium.com/Z0Ls9KUqW3. Cluster analyses help make sense out of unlabeled data. This article describes how to use GMMs to better understand such data by clustering similar observations.

85. T. Yiu, Understanding PCA (2019), on https://link.medium.com/ZHwhuCajK2. Principal components analysis is a technique to combine information from many variables into many fewer, which, according to the author, “helps us uncover the underlying drivers hidden in our data.” This article is a simple, high-level description of how that happens, with some nice visuals.

86. A. Dubey, The Mathematics Behind Principal Component Analysis (2018), on https://link.medium.com/xz2bMKOhK2. This article describes the math used in PCA to transform many variables into many fewer, while keeping as much information as possible in the process.
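That math reduces to a few numpy calls: center the data, eigendecompose the covariance matrix, and project. This sketch (my own toy data, not the article's) makes the steps explicit:

```python
import numpy as np

rng = np.random.default_rng(5)

# Correlated 3-D data: the third column is nearly a mix of the first two.
base = rng.normal(size=(200, 2))
third = base @ [0.6, 0.8] + rng.normal(scale=0.05, size=200)
X = np.column_stack([base, third])

# PCA by hand: center, form the covariance matrix, eigendecompose, project.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]       # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()     # share of variance per component
scores = Xc @ eigvecs[:, :2]            # data projected onto top 2 components
```

Because the third column is almost a linear combination of the first two, the top two components capture nearly all the variance, which is why dropping the third loses almost no information.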

87. T. Santos, The Differences Between Factor Analysis and Principal Component Analysis (2019), on https://link.medium.com/LHtYPlmjK2. PCA and FA are both data reduction techniques, but PCA is focused on retaining as much variation as possible in the smaller set of variables it creates, while FA is focused on combining information from a large set of variables into a smaller set of latent (hidden) factors that can account for the correlations between the observed larger set of variables.

88. T. Ciha, PCA & Autoencoders: Algorithms Everyone Can Understand (2018), on https://link.medium.com/eYPIiKdiK2. PCA is a linear-algebra-based approach to combining many variables into many fewer. Autoencoders are neural-network-based approaches to the same task. The attributes of each are described in this article.

89. J. Brownlee, Principal Component Analysis for Visualization (2021), on https://machinelearningmastery.com/principal-component-analysis-for-visualization/. Most people use PCA for dimension reduction, but it is quite useful for visualizing data as well, as described in this article.

90. Z. Singer, Topological Data Analysis — Unpacking the Buzzword (2019), on https://link.medium.com/WKBRmTMjK2. TDA is a geometric approach to understanding how variables in a data set relate to each other. Applications of TDA can often uncover hidden patterns that lead to key understandings of how outcomes of interest arise. Visualizations provided by TDA illustrate the “shape” of data that can help us generate better predictive or explanatory models. This article illustrates how TDA works at a high level.

91. V. Deshmukh, Topological Data Analysis — A Very Short Introduction (2019), on https://link.medium.com/yqlneOEjK2. More information about how TDA works is provided in this article.

Causal Inference:

There are many descriptions of causal inference on Medium. Several are by economists and other social scientists, some are by neuroscientists, and many other disciplines are represented as well. Complete coverage of causal inference is impossible. Below you will find a few that struck my fancy, starting with an overview I wrote a few months ago. Search for “causality” on Medium to find many more. Three great textbooks on causal inference in economics and other social sciences have been written by Maziarz (2020), Pearl and Mackenzie (2018), and Morgan and Winship (2015).

92. R. Ozminkowski, What Causes What and how Would we Know? (2021), on https://towardsdatascience.com/what-causes-what-and-how-would-we-know-b736a3d0eefb. This is a review of five major causal inference approaches identified by leaders in the field, and how they can help us think about the major socio-political-economic issues of the day.

93. A. Kelleher, A Technical Primer on Causality (2016), on https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41. This primer on causality goes deep into underlying concepts, intuitively and with some heavy-duty math. It should satisfy people with many different backgrounds and training who are interested in causal inference.

94. P. Prathvikuman, Causal Inference via Causal Impact (2020), on https://link.medium.com/rSFRdqmW66. This is a short article about Google’s Causal Impact software — basically about how to find counterfactuals and use those to make inferences about causality in studies where randomization is not feasible. Many clients interpret program impact or other inferential estimates we generate as if they are causal, and regression coefficients do not always meet the standards of causal inference. Design issues, and particularly how to handle selection bias, are key considerations to discuss.

95. P. Rajendran, Causal Inference Using Difference-in-Differences, Causal Impact, and Synthetic Control (2019), on https://towardsdatascience.com/causal-inference-using-difference-in-differences-causal-impact-and-synthetic-control-f8639c408268. The author provides a nice introduction to three ways to find the impact of an intervention, describing advantages and disadvantages of each.

96. G. M. Duncan, Causal Random Forests on https://econ.washington.edu/sites/econ/files/old-site-uploads/2014/08/Causal-Random-Forests_Duncan.pdf. This is a short PowerPoint presentation describing random forests as used for causal inference.

97. A. Tanguy, Bayesian Hierarchical Modeling (or more reasons why autoML cannot replace Data Scientists yet) (2020), on https://towardsdatascience.com/bayesian-hierarchical-modeling-or-more-reasons-why-automl-cannot-replace-data-scientists-yet-d01e7d571d3d. The author says, “Bayesian networks allow one to model causal relationships between variables, compensating the lack of information provided by data.”

98. The Cthaeh, What are Bayesian Belief Networks? (Part 1) (2016), on https://www.probabilisticworld.com/bayesian-belief-networks-part-1/. Bayesian Belief Networks are often described as efforts to infer causality between variables by using conditional probabilities and Bayes Theorem. The intuition about these networks is explained here.

99. The Cthaeh, What are Bayesian Belief Networks? (Part 2) (2016), on https://www.probabilisticworld.com/bayesian-belief-networks-part-2/. In this part, The Cthaeh goes a little deeper by showing the math behind BBNs.
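The conditional-probability machinery behind BBNs rests on Bayes' theorem, which a few lines of Python make concrete. The disease-testing numbers below are hypothetical, chosen only to show how a low prior tempers even an accurate test:

```python
# Bayes' theorem: P(disease | positive) =
#     P(positive | disease) * P(disease) / P(positive)

prior = 0.01        # P(disease): 1% of the population is affected (assumed)
sensitivity = 0.90  # P(positive | disease) (assumed)
specificity = 0.95  # P(negative | no disease) (assumed)

# Total probability of a positive test: true positives + false positives.
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

posterior = sensitivity * prior / p_positive  # P(disease | positive)
```

Despite the test's 90% sensitivity, the posterior probability of disease given a positive result is only about 15%, because false positives from the large healthy population swamp the true positives. A BBN chains many such updates together across a graph of variables.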

100. C. Y. Wijaya, A Quickstart for Causal Analysis Decision-Making with DoWhy: Predict the Causal Effect from the Intervention (2021), on https://medium.com/geekculture/a-quickstart-for-causal-analysis-decision-making-with-dowhy-2ce2d4d1efa9. The author describes the DoWhy package developed by Microsoft to generate causal models, based on graph theory and counterfactual analyses. Python code is provided. A nice companion piece to this article is the paper by the DoWhy developers, Amit Sharma and Emre Kiciman (2020).

101. H. Naushan, Causal Inference Using Natural Language Processing (2021), on https://towardsdatascience.com/causal-inference-using-natural-language-processing-da0e222b84b. This article provides a nice link between machine learning, causal inference, and natural language processing. At its heart, NLP can be viewed as ML with text data, and using it to draw causal inferences from text is a novel and useful idea.

Limitations:

While this article covers a lot of ground, it is not possible to explore the huge variety of important machine learning topics in one post. Many other types of machine learning are worth the time and effort required to understand them. Examples include reinforcement learning (i.e., goal-focused learning) and transfer learning (training, validating, and testing a model for one purpose then leveraging it for a completely different purpose). There are many other examples too. Those and other ML topics are defined and described by Brownlee (2019), a leader in practical approaches to machine learning. His many books are well worth the cost.

My focus here has been on short articles on Medium, in Towards Data Science, and elsewhere. These are quite useful because of their pragmatic approaches. The authors provide clear and brief explanations, often mixing theory, math, and modeling tips with lists of or links to Python, R, or SAS code and many more detailed references. Many provide greater detail in links to GitHub or other places where code and notes can be found.

This still leaves at least five limitations though. First, it is impossible to find and describe every good article on even the short list of topics included above. The reader is encouraged to supplement this list with his or her own search to fully investigate interesting topics.

Second, the brief descriptions included above are not meant to include a critical review. No article is perfect, but in the interest of brevity I chose not to dwell on any errors, leaving that up to the reader.

Third, I chose not to provide lists of peer-reviewed methodological publications and textbooks that describe ML theory and applications in much greater detail. That decision was made only to save space, and I highly recommend reading such material. The combination of theory, deep explanation, and pragmatic approaches that readers will find if they supplement the above articles with peer-reviewed articles and textbook material will generate gains in understanding that are well worth the time invested.

Fourth, the links above describe how to conduct training, testing, and validation of ML and deep learning models. I purposely decided not to focus on model deployment (i.e., what it takes to use these models in local or widescale industrial applications). This is a field unto itself. More information about deployment can be found in an excellent O’Reilly publication by Ted Malaska and Shivnath Babu (2019).

Fifth and finally, the articles described above focus on the mechanics of ML. In recent years a parallel and very important focus has arisen, describing how to use ML models responsibly. More about this topic can be found in one of my recent posts (Ozminkowski, 2021), and in another great O’Reilly publication by Hall et al. (2021). The Data Science and Public Policy Team at Carnegie Mellon University (2018) has a fine software program called Aequitas, which also promotes responsible machine learning. Obermeyer et al. (2020) provide another great description of how to avoid bias in machine learning as well.

Conclusions:

This post provides links to many ML approaches that can help data scientists generate useful insights for their clients. It was motivated by the No Free Lunch Theorem, which suggests the need to hunt for the best learner for any analysis. This means testing many different ML approaches to find the one which generates the best (most accurate, sensitive, and specific) predictions.

Recently some work has been conducted on what is called the Super Learner. This is an approach which combines information from many different ML models, to generate an even better overall prediction. The Super Learner has been shown to work better with large numbers of input ML approaches. One could, for example, use many of the approaches described in the links above as input into the Super Learner. A nice example of how Super Learners work in healthcare analytics (i.e., to predict mortality among patients treated in hospital intensive care units) is provided by Romain Pirracchio (2016). The paper which introduced the Super Learner was published by Van der Laan, et al. (2007) and much more information about it can be found there. Think of the Super Learner as ML applied to many ML models.
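A stripped-down sketch of the stacking idea behind the Super Learner: get out-of-fold predictions from each base learner, then fit a meta-model on those predictions. Van der Laan et al. constrain the meta-weights (e.g., to a convex combination), whereas this illustration simply uses unconstrained least squares; the two base learners and the toy data are my own choices:

```python
import numpy as np

rng = np.random.default_rng(11)

# Non-linear toy data.
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.5 * x + rng.normal(scale=0.2, size=200)

def fit_linear(x_tr, y_tr):
    """Base learner 1: ordinary least squares line."""
    a, b = np.polyfit(x_tr, y_tr, 1)
    return lambda q: a * q + b

def fit_binned(x_tr, y_tr, n_bins=10):
    """Base learner 2: piecewise-constant (binned-average) fit."""
    edges = np.linspace(0, 1, n_bins + 1)
    overall = y_tr.mean()
    means = np.array([
        y_tr[(x_tr >= lo) & (x_tr < hi)].mean()
        if ((x_tr >= lo) & (x_tr < hi)).any() else overall
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    return lambda q: means[np.clip(np.searchsorted(edges, q, side="right") - 1,
                                   0, n_bins - 1)]

# Out-of-fold predictions: each base learner predicts points it never saw.
Z = np.zeros((len(x), 2))
for test_idx in np.array_split(np.arange(len(x)), 5):
    train = np.ones(len(x), dtype=bool)
    train[test_idx] = False
    Z[test_idx, 0] = fit_linear(x[train], y[train])(x[test_idx])
    Z[test_idx, 1] = fit_binned(x[train], y[train])(x[test_idx])

# Meta-learner: least squares on the base predictions (plus an intercept).
A = np.column_stack([Z, np.ones(len(x))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
meta_pred = A @ w

mse = lambda p: float(np.mean((p - y) ** 2))
mse_linear, mse_binned, mse_meta = mse(Z[:, 0]), mse(Z[:, 1]), mse(meta_pred)
```

Because the meta-learner is free to put all its weight on a single base learner, its fitted error can never be worse than the best base learner's on the same out-of-fold predictions, which is the intuition behind the Super Learner's oracle-like guarantees.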

For those who want to hunt for their best model in other ways, Neo (2019) provides links to many other websites about ML and data science that offer value as well. Some of these websites are free and some are not.

The field of machine learning and deep learning is growing rapidly and the task of keeping up with it can be daunting. It is impossible to cover every new development in one post. My focus here has been on many of the timeless approaches that have provided value for decades and will continue to do so. I hope you enjoy reading!

References:

J. Brownlee. 14 Types of Learning in Machine Learning (2019), on https://machinelearningmastery.com/types-of-learning-in-machine-learning/

Data Science and Public Policy Team at Carnegie Mellon University, Aequitas: An Open Source Bias Audit Toolkit for Machine Learning (2018), on http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (2016), Cambridge, MA: The MIT Press

P. Hall, N. Gill, and B. Cox, Responsible Machine Learning (2021), Sebastopol, CA: O’Reilly Media Inc.

T. Malaska and S. Babu, Rebuilding Reliable Data Pipelines Through Modern Tools (2019), Sebastopol, CA: O’Reilly Media Inc.

M. Maziarz, The Philosophy of Causality in Economics (2020), Routledge — Taylor & Francis Group, New York, NY

S. L. Morgan and C. Winship, Counterfactuals and Causal Inference: 2nd Edition (2015), Cambridge University Press, Cambridge, UK

B. Neo, Top 20 Websites for Machine Learning and Data Science (2019), on https://link.medium.com/zwUcGQCvz3

Z. Obermeyer, R. Nissan, M. Stern, et al., Algorithmic Bias Playbook (2020), Chicago Booth: The Center for Applied Artificial Intelligence. Also on https://www.ftc.gov/system/files/documents/public_events/1582978/algorithmic-bias-playbook.pdf

R. Ozminkowski, Garbage in Garbage out: Saving the World is just one good Reason to Address this Common Problem (2021), on https://towardsdatascience.com/garbage-in-garbage-out-721b5b299bc1

R. Ozminkowski, What Causes What, and how Would we Know? (2021), on https://towardsdatascience.com/what-causes-what-and-how-would-we-know-b736a3d0eefb

J. Pearl and D. MacKenzie, The Book of Why: The New Science of Cause and Effect (2018), Basic Books, New York, NY

R. Pirracchio, Mortality Prediction in the ICU Based on MIMIC-II Results from the Super ICU Learner Algorithm (SICULA) Project (2016), Chapter 20 in Secondary Analysis of Electronic Health Records, by the MIT Critical Data team, Cham, Switzerland: Springer

A. Sharma and E. Kiciman, DoWhy: An End-to-end Library for Causal Inference (2020), on arXiv:2011.04216v1 [stat.ME] 9 Nov 2020

M. J. Van der Laan, E. C. Polley, and A. E. Hubbard, Super Learner (2007), Berkeley, CA: University of California at Berkeley Division of Biostatistics Working Paper Series, Paper 222


Internationally recognized executive leader and chief scientist whose research has been viewed by people in over 100 countries