‘Meta’ machine learning packages in R

Dror Berel
Towards Data Science
Jul 2, 2018


This post was published around mid 2018.

Also check out a follow-up post from the end of 2019 about second-generation meta-packages.

Do you remember learning about linear regression in your Statistics class? Congratulations! You are now a fortunate member of a diverse community of Statisticians, Mathematicians, Data Scientists, Computer Scientists, Engineers, and many (many) others who have used a ‘machine learning’ approach! What the heck, you can even tell your friends that you are doing some fancy ‘Artificial Intelligence’.

Linear regression is only one of many supervised models. There are hundreds of other models, including non-linear models, polynomial models, tree-based models (e.g. CART, XGBoost), SVMs, neural networks, and more. These supervised models can handle either a continuous outcome (regression) or a binary or multi-class outcome (classification).

What if I told you that instead of looking at multiple packages, each focused on only a single method/model, there is a single package, termed here a 'meta-package', that gives you easier access to all of these models as dependencies? This meta-package can also facilitate deploying multiple models in parallel and then aggregating the results. Each model can still be carefully controlled through its various parameters, but the meta-package adds meta-analysis methods at a higher layer of analysis, wrapping and tying together results from the individual models.

In fact, working at the meta level, the individual models can be applied simply by listing the names of the specific models under consideration, each with a set of default parameters to be tuned if needed. This allows users who may not be fully familiar with a model's fine details to use it as a sort of 'black box'.
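To make this concrete, here is a minimal sketch of the 'name the model, keep the defaults' idea, using the caret package discussed later in this post; the iris dataset and the choice of "rpart" are purely illustrative.

library(caret)

data(iris)

# The model is selected by name only; caret calls the rpart package as a
# dependency and fits it with sensible default parameters and its default
# resampling scheme.
fit <- train(Species ~ ., data = iris, method = "rpart")
predict(fit, newdata = head(iris))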

In this post I will use the analogy of a car's mechanics for the entire machine learning framework, and the car's engine for any individual multivariate model. Early cars with basic performance used to have simple engines. As technology advanced, today's cars became much faster and more reliable. Simple transmission systems were replaced with automatic gearboxes. Modern cars are complex systems of advanced components, each carefully designed and well integrated into a broader, well-synced system, which ultimately allows the car to run smoothly.

Different programming languages have various implementations of this approach. This post focuses on my very own journey of exploring this topic, specifically as implemented in R.


The old scripts:

Back in the day, the first statistical model I was taught was linear regression. This is perhaps the most basic method for introducing predictive models. There are, however, many other models with the same scope, both supervised (classification, regression) and unsupervised (clustering), which were (and still are) springing up like mushrooms after the rain. In addition, since the field itself is very diverse, it was hard to review the hundreds of scientific papers, each with its own unique framework and mathematical notation. Then, around the year 2002, I was given a great reference to an amazing book, 'The Elements of Statistical Learning', written by the founders of this emerging field themselves. The 764 pages of this book cover everything one needs to know, from linear regression models to neural networks. It also covers over-fitting, cross-validation, and ensembles. Methods are described with detailed, consistent mathematical notation and supported by illustrative, colorful visualizations.

Alas, that book might not be a good fit for everyone, especially for those who lack the mathematical background, myself included. After reading it, I wished for a similar book that was more applied. Ideally, it would cover the exact same topics, but with simplified mathematical concepts and more applied examples, including actual code demonstrations on real datasets. My wildest dream was that the code would be in R.

My dream came true around 2013, when students of the original authors of the above book, along with some of the original authors themselves, wrote another book, 'An Introduction to Statistical Learning', with this exact purpose. They even managed to compress it into (only) 441 pages.


First (random) steps in the forest:

Even during those early days of R, there was more than one package to do almost anything. However, I felt fearless. Feeling safely guided by the second book, I knew exactly which recommended R packages to use for each model. I was excited to learn specific packages for GLM, CART, and others.

The more I learned about applying the individual models, the more curious (and greedy) I became to learn additional new models, though they were usually less trivial to understand and implement with my datasets. My motivation to learn and implement more models was partly to achieve higher performance on my datasets (in terms of the accuracy of the model's predictions). However, this competitive race was not the only reason. Guided by the 'no free lunch' theorem of Wolpert and Macready, which argues that no single model will always perform best on every type of dataset, I had a legitimate scientific excuse to keep trying as many models as possible.


Too early excitement?

For the basic models, with abundant documentation and examples, learning how to apply them was almost straightforward. However, there were still many other, more complex models that were less trivial to apply. Their complexity, for me, was often due to additional tunable parameters that either lacked clear guidelines on how to set them, or required techniques beyond my understanding at the time. For example, choosing the alpha and lambda parameters of a penalized regression such as the LASSO via an internal cross-validation, or other parameter-tuning approaches.
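As a minimal sketch of that internal cross-validation, here is how the glmnet package handles it on simulated data: alpha (the elastic-net mixing parameter) is fixed at 1 for the LASSO, and lambda is chosen by cross-validation. The data and settings are purely illustrative.

library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100)   # simulated predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)      # simulated outcome

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cv_fit$lambda.min                  # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")     # coefficients at that lambda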

In parallel to my endless search for new models, another challenge I experienced was implementing methods that deal with overfitting, such as cross-validation and bootstrapping. Even though there were packages specializing in such methods, it was not trivial to integrate them and wrap them around the various models/packages. The same was true for advanced tuning-parameter search methods and for ensembles (specifically stacking).

Even though I thought I had a map to guide me through the various models and packages, disappointment arrived sooner rather than later. It was too hard to integrate all of these methods together. Not only can each of these models be complex in itself; wrapping multiple models together, nested within each other and tied into other 'heavy lifting' approaches such as resampling, benchmarking, and stacking, made things even more complex and overwhelming.

Frustrated and exhausted, I still hoped there would be some efficient way to integrate it all, since in the end, all of these methods share the same 'statistical learning' workflow: train/fit a model on a training set, predict outcomes on a new dataset (test), and measure some performance metric. Granted, each of these packages had its own 'dialect', reflecting the diverse background of the open-source R developer community. However, within this shared framework of predictive models, there should be a way to unify and reformat the inputs and outputs of each model/package so that everything fits under the same hood and can then be aggregated together.
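To illustrate the point, here is the same three-step workflow written in two different 'dialects' (base R's lm and the rpart package), on an arbitrary split of a built-in dataset; everything here is illustrative, but it shows why a unifying layer is appealing.

library(rpart)

set.seed(1)
idx       <- sample(nrow(mtcars), 22)
train_set <- mtcars[idx, ]
test_set  <- mtcars[-idx, ]

# Dialect 1: base R linear model
fit_lm  <- lm(mpg ~ ., data = train_set)
pred_lm <- predict(fit_lm, newdata = test_set)

# Dialect 2: rpart regression tree (different defaults, same three steps)
fit_tree  <- rpart(mpg ~ ., data = train_set)
pred_tree <- predict(fit_tree, newdata = test_set)

# Same performance metric (RMSE) for both
sqrt(mean((test_set$mpg - pred_lm)^2))
sqrt(mean((test_set$mpg - pred_tree)^2))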


The (first?) missing link: the caret package

In spite of these challenges, I still felt determined! I was not going to let mathematical notation or poorly documented packages scare me away, and I did not lose hope.

Around the same time as the previous book, another book was published, 'Applied Predictive Modeling'. This book mostly uses a single package, caret. That package attempts to fill the gap described above, providing the meta-approach I was so desperately looking for. It unifies the access (input) to, and the returned output from, many (currently 237) supervised models (both classification and regression). There was no need to re-write these models from scratch; instead, the caret package calls the individual packages as dependencies.

Once multiple models were unified and integrated, the developer(s) moved on to integrate the other advanced methods at the higher meta-level: 1. benchmarking (including tuning as a special case); 2. resampling (cross-validation); and 3. ensembles (bagging, boosting, and stacking).
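A minimal benchmarking sketch along these lines is shown below; the models, dataset, and tuning-grid sizes are purely illustrative, and the only back-ends used (rpart, MASS) ship with R.

library(caret)

data(iris)
ctrl <- trainControl(method = "cv", number = 5)   # the shared resampling scheme

# Benchmark several models under the same resampling scheme; tuneLength asks
# caret to search a small default grid of tuning parameters for each model.
fits <- list(
  cart = train(Species ~ ., data = iris, method = "rpart", trControl = ctrl, tuneLength = 3),
  lda  = train(Species ~ ., data = iris, method = "lda",   trControl = ctrl),
  knn  = train(Species ~ ., data = iris, method = "knn",   trControl = ctrl, tuneLength = 3)
)

summary(resamples(fits))   # compare resampled accuracy/kappa across the models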

By that time I was even more excited! Even though I may not have fully understood or appreciated the comprehensive capabilities of the caret package (perhaps not even today), I felt readier than ever to start using this very efficient, well-oiled 'machine'.

There must be more than one meta-package to do it! The mlr and SuperLearner packages:

With a similar goal in mind, two additional meta-packages were developed (I do not know the exact history and timing; this post only represents my own, self-selectively biased way of learning about these packages). Each of these meta-packages has a different focus. The mlr package, in addition to supervised models, also implements other groups of models (termed 'learners') for unsupervised (cluster) analysis and time-to-event (survival) models. It also lists a large collection of performance measures and provides very detailed documentation for common analysis workflows. The SuperLearner package emphasizes the ensemble (stacking) part and scalability via multi-core optimization.
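Here is a minimal sketch of both flavors. The mlr calls follow the mlr 2.x API of that era (the package has since been succeeded by mlr3), the SuperLearner example assumes a binary outcome on simulated data, and both assume the rpart and glmnet back-ends are installed.

# mlr dialect: a task, a learner, a resampling description, one resample() call
library(mlr)

task    <- makeClassifTask(data = iris, target = "Species")
learner <- makeLearner("classif.rpart")
rdesc   <- makeResampleDesc("CV", iters = 5)
resample(learner, task, rdesc, measures = acc)

# SuperLearner dialect: candidate learners listed by name, stacked into an ensemble
library(SuperLearner)

set.seed(1)
x <- data.frame(matrix(rnorm(200 * 5), nrow = 200))
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))
SuperLearner(Y = y, X = x, family = binomial(),
             SL.library = c("SL.mean", "SL.glm", "SL.glmnet"))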


Feature engineering: Specific extensions for composable preprocessing steps/operations/pipelines:

A crucial step for the success of a predictive model is 'feature engineering'. This is done in an early pre-processing step, before a multivariate model is applied to the data. It may consist of transformations of the dataset, imputation of missing values, and filtering (early screening) of features or samples/observations. It may also include any combination of these operators, in any order. Once composed together, it is then integrated back into the learner/model.

The recommended implementation of this step is to apply whatever transformation is learned on the training dataset to the test/validation dataset as well. However, nesting these steps within the cross-validation is not trivial at all. It is not as simple as writing one long function that includes all of the pre-processing steps. Instead, it requires a more general framework for composing different steps/operations on the data, so they can be chained in any sequential order, while properly carrying over the parameters of each transformation to the test/validation part. Another (obvious) use of such a composable function is to apply it directly to the raw dataset itself, outside of the nested resampling, simply to examine the data in a trial-and-error fashion.

The above meta-packages were recently extended with new complementary packages that facilitate such composable pre-processing operations. The recipes package offers multiple common 'steps' that are 'piped' together and can be 'baked' into raw data outside of the nested resampling. Similarly, mlrCPO extends the mlr package and sl3 extends SuperLearner. A new book, 'Feature Engineering and Selection: A Practical Approach for Predictive Models', was recently written by the authors of the caret/recipes packages to discuss such transformations.
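A minimal recipes sketch of the 'learn on the training data, apply to new data' idea follows; the steps and dataset are illustrative, and the argument names follow the current recipes API.

library(recipes)

set.seed(1)
idx       <- sample(nrow(mtcars), 22)
train_set <- mtcars[idx, ]
test_set  <- mtcars[-idx, ]

# Declare composable steps; prep() estimates their parameters (means, SDs)
# from the training data only, and bake() applies that same learned
# transformation to new data.
rec <- recipe(mpg ~ ., data = train_set) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

prepped <- prep(rec, training = train_set)
bake(prepped, new_data = test_set)   # test data scaled with training statistics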

Which of these three meta-packages to choose? De gustibus non est disputandum:

There is no accounting for taste(s). It would do no justice to list the advantages and disadvantages of each of these meta-packages, mostly because each is still constantly being updated and extended with new tools. Nonetheless, this fruitful competition also has players from other programming languages (which will NOT be named here), all aiming at the same ultimate goal.

What really matters is the user's ability to learn each of these meta-packages, and the success of implementing these solutions on their own datasets. This depends on the availability of the specific solution one needs, and on the level of documentation and support required for implementation. Each of these meta-packages has a different style of documentation/support (e.g. GitHub issues, Stack Overflow, etc.), influenced either by the background of its developers (industry, academia, other) or simply by personal style. Scalability may also pose a critical bottleneck one should care about, and each of these meta-packages deals with it in a different way. mlr integrates with OpenML, an online machine learning platform, for sharing and deploying large benchmark experiments. The SuperLearner ecosystem has a parallelization framework called delayed.

If you are very motivated (as I am), I would challenge you to try each of these three packages. In theory, you should get the same result.


Meta-meta-package (this is not a typo).

Still not sure which meta-package to choose? Why compromise on a single meta-package when you can try them all? Consider the next higher level of 'meta' analysis: meta (multiple) 'meta-packages'. Wait, what?!? I know, this idea might be hard to grasp on a first read, but please bear with me. Each of these three meta-packages offers advanced meta solutions over individual models. Why not do the exact same thing around the three meta-packages themselves? We already know what needs to be done. It is the exact same thing each of them does across individual packages, but this time at the next level up: unify, aggregate, compare, stack.

In fact, there are already cross-meta-package functions to convert objects with the same purpose from one package to another. For example, the mlr package has a function to convert caret's preProcess object into its mlr equivalent. While this is currently done for only one component, and in one direction only, my guess is that more cross-package object-conversion functions will become available soon, as well as conversions among the composable pre-processing operators across all possible combinations and directions of recipes, mlrCPO, and sl3. Perhaps this meta-meta approach can even be extended beyond R, across different programming languages.


‘The two cultures’ (oy vey):

So far I have carefully avoided being tempted into the endless discussion of the 'two cultures', though it may not be hard to guess the origin of my breed. My understanding of this controversial topic, in my car analogy, is that it is not only about the car/engine one has built, but also about the driving style one uses it for. Are you a careful driver, who carefully evaluates the environment around you in order to safely reach your destination? Or are you willing to speed and disobey a few traffic laws, just so you can be the first to arrive? Perhaps you are a little bit of both, depending on the situation. My take is that getting to a point where there is at least a unified way to access and evaluate the different cars/engines allows us to answer these questions in a more systematic way.


How to design a good complex system? A cohesive design:

The ideas of integrating multiple models, comparing them, and nesting them within resampling are definitely not new. However, putting it all together is what each of the above meta-packages attempts to achieve. If you have ever tried to write a cross-validation routine by yourself, you have probably soon run into programming design gaps (edge cases or other bugs) and computational bottlenecks such as scalability.
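For contrast, a naive hand-rolled k-fold cross-validation for a plain linear model might look like the sketch below; even this illustrative version involves a fair amount of bookkeeping, and it still ignores stratification, preprocessing inside the folds, and tuning, which is exactly the gap the meta-packages fill.

set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # random fold assignment

rmse <- sapply(1:k, function(i) {
  train_set <- mtcars[folds != i, ]
  test_set  <- mtcars[folds == i, ]
  fit  <- lm(mpg ~ ., data = train_set)
  pred <- predict(fit, newdata = test_set)
  sqrt(mean((test_set$mpg - pred)^2))
})

mean(rmse)   # cross-validated RMSE for a plain linear model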

Unifying the input and output names/formats into a single common form is only a technical, semantic rendering task. However, once unified, designing a cohesive system that ties everything together, knitting all of the various components into one powerful, well-oiled machine while still allowing independent development of the individual components, is not trivial at all.


(Today) anyone can drive a car, but do you know how to fix it when it breaks?!?

Occasionally I go back to the 'old scripts' described above and look for fine details on a specific model. I feel lucky to have access to these resources, which guided me through the slippery path of statistical learning. Though I am still not sure whether I should also feel privileged to have followed this path as it emerged, rather than starting it somewhere in the future with immediate access to these modern tools. After all, these 'old scripts' are actually not that old; they were written relatively recently. Suppose I had known nothing about the field, read this review, and overnight become an expert in these meta-packages, would I still have the skills I slowly gained by taking the long, winding trail?

We live in an exciting era when such powerful tools are available at our fingertips, with open-source distributions making them freely available to all. These meta-packages give us access to a comprehensive tool that integrates multiple combinations of models, performance measures, pre-processing steps, tunable parameter values, and more. All of these combinations can be neatly wrapped into a long list of parameter groups and evaluated in a single run of a benchmarking function.

Using these meta-packages, this is no longer an overwhelming task. Complex components can easily be dissected and disassembled back into simpler sub-components, to be carefully tested for their behavior and influence on overall performance. They also enable a beginner to skip the methodological understanding of these sub-components and treat each as a black box that is nicely connected to the others.

Back to the car-driving analogy. A driver is not required to understand the car's complex mechanics in order to drive it. However, if the driver wants to increase the car's performance, one had better understand what is going on under the hood, or have access to a good mechanic.

In this post I have described my own journey to become my car's 'mechanic'. In my case it started with a couple of books. Hopefully, after reading this, you will find a better starting point. After becoming familiar with these tools, sooner or later you might want to modify something in these methods yourself, either tailoring them to your very own specific needs or perhaps suggesting an improvement to an existing method. That is the moment, my friend, when you can no longer avoid the dirty work. It is time to take a deep breath, learn how to extend the methods, and dive in. I can assure you that you will learn many useful things along the way, just as I did. Have fun!

Also check out a follow-up post from the end of 2019 about second-generation meta-packages.

Check my blog: https://medium.com/@drorberel

Check more related topics here: https://drorberel.github.io/

Consultant: currently accepting new projects!

