3 Probabilistic Frameworks You Should Know | The Bayesian Toolkit

Build better Data Science workflows with probabilistic programming languages and counter the shortcomings of classical ML.

Richard Michael
Towards Data Science

The tools to build, train and tune your probabilistic models. Photo by Patryk Grądys on Unsplash.

We should always aim to create better Data Science workflows.
But to achieve that, we first need to find out what is lacking.

Classical ML workflows are missing something

Classical Machine Learning pipelines work great. The usual workflow looks like this:

  1. Have a use-case or research question with a potential hypothesis,
  2. build and curate a dataset that relates to the use-case or research question,
  3. build a model,
  4. train and validate the model,
  5. maybe even cross-validate, while grid-searching hyper-parameters,
  6. test the fitted model,
  7. deploy the model for the use-case,
  8. answer the research question or hypothesis you posed.

As you might have noticed, one severe shortcoming of this workflow is that it does not account for the model's uncertainty or the confidence we can place in its output.

Certain about being Uncertain

After going through this workflow, and given that the model results look sensible, we take the output for granted. So what is missing?
First, we have not accounted for missing or shifted data that comes up in our workflow.
Some of you might interject that you already have an augmentation routine for your data (e.g. image preprocessing). That's great, but did you formalize it?
Secondly, what about building a prototype before having seen the data — something like a modeling sanity check? Simulate some data and build a prototype before you invest resources in gathering data and fitting insufficient models.
Andrew Gelman already pointed this out in his keynote at NY PyData 2017.
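To make that idea concrete, here is a minimal sketch of such a sanity check in plain NumPy; the model, parameters, and sample size are all made-up assumptions for illustration:

```python
# A minimal modeling sanity check: simulate data from assumed parameters,
# then verify that a simple prototype fit recovers them. All values are made up.
import numpy as np

rng = np.random.default_rng(0)

true_slope, true_intercept, noise_scale = 2.0, -1.0, 0.5  # assumed "ground truth"
x = rng.uniform(0, 10, size=200)
y = true_intercept + true_slope * x + rng.normal(0, noise_scale, size=200)

# Prototype model: ordinary least squares on the simulated data.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # should land close to 2.0 and -1.0
```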
Lastly, get better intuition and parameter insights! For deep-learning models you need to rely on a plethora of tools like SHAP and plotting libraries to explain what your model has learned.
For probabilistic approaches, you can get insights on parameters quickly.
So what tools do we want to use in a production environment?

I. STAN — The Statistician's Choice

STAN is a well-established framework and tool for research. Strictly speaking, this framework has its own probabilistic language and the Stan-code looks more like a statistical formulation of the model you are fitting.
Once you have built and done inference with your model, you save everything to file, which brings the great advantage that everything is reproducible.
STAN is well supported in R through RStan, Python with PyStan, and other interfaces.
In the background, the framework compiles the model into efficient C++ code.
In the end, the computation is done through MCMC inference (e.g. the NUTS sampler), which is easily accessible, and even Variational Inference is supported.
If you want to get started with this Bayesian approach, we recommend the case studies.
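To give a flavor of what this looks like in practice, here is a minimal sketch of a simple normal model fitted through PyStan; it assumes the PyStan 2.x interface and uses simulated data, so treat it as an illustration rather than a production recipe:

```python
# A minimal Stan model driven from Python via PyStan (assumes the PyStan 2.x API).
import numpy as np
import pystan

# Stan code: estimate the mean and scale of normally distributed data.
model_code = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);      // prior on the mean
  sigma ~ normal(0, 5);    // half-normal prior (because of the lower bound)
  y ~ normal(mu, sigma);   // likelihood
}
"""

y = np.random.normal(loc=2.0, scale=1.5, size=100)  # simulated data
model = pystan.StanModel(model_code=model_code)      # compiles to C++ behind the scenes
fit = model.sampling(data={"N": len(y), "y": y}, iter=2000, chains=4)  # NUTS by default
print(fit)  # posterior summaries for mu and sigma
```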

II. Pyro — The Programming Approach

My personal favorite tool for deep probabilistic models is Pyro. This language was developed and is maintained by the Uber Engineering division. The framework is backed by PyTorch. This means that the modeling that you are doing integrates seamlessly with the PyTorch work that you might already have done.
Building your models and training routines reads and feels like writing any other Python code, with some special rules and formulations that come with the probabilistic approach.

As an overview, we have already compared STAN and Pyro modeling on a small problem set in a previous post:

Pyro excels when you want to treat parameters as random variables, sample data, and perform efficient inference.
As this language is under constant development, not everything you are working on might be documented yet, but there are plenty of use cases, existing model implementations, and examples, and the documentation gets better by the day.
The examples and tutorials are a good place to start, especially when you are new to the field of probabilistic programming and statistical modeling.
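For a feel of the API, here is a minimal sketch of the same normal model in Pyro, fitted with stochastic variational inference and an automatic guide; it assumes a Pyro 1.x installation and simulated data:

```python
# A minimal Pyro model fitted with SVI and an automatic guide (assumes Pyro 1.x).
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

def model(y):
    mu = pyro.sample("mu", dist.Normal(0.0, 10.0))         # prior on the mean
    sigma = pyro.sample("sigma", dist.HalfNormal(5.0))      # prior on the scale
    with pyro.plate("data", len(y)):
        pyro.sample("obs", dist.Normal(mu, sigma), obs=y)   # likelihood

y = 2.0 + 1.5 * torch.randn(100)   # simulated data
pyro.clear_param_store()
guide = AutoNormal(model)           # mean-field variational approximation
svi = SVI(model, guide, Adam({"lr": 0.02}), loss=Trace_ELBO())

for step in range(2000):
    svi.step(y)

print(guide.median())  # approximate posterior medians for mu and sigma
```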

III. TensorFlow Probability — Google’s Favorite

When you talk Machine Learning, especially deep learning, many people think TensorFlow. Since TensorFlow is backed by Google developers, you can be certain that it is well maintained and has excellent documentation.
When you already have TensorFlow, or better yet TF2, in your workflows, you are all set to use TF Probability.
Josh Dillon made an excellent case for why probabilistic modeling is worth the learning curve, and why you should consider TensorFlow Probability, at the TensorFlow Dev Summit 2019:

TensorFlow Probability: Learning with confidence (TF Dev Summit ’19) by TensorFlow Channel

And here is a short Notebook to get you started on writing Tensorflow Probability Models:
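In the meantime, here is a minimal sketch of what a model can look like in TensorFlow Probability; it assumes TF2 with the tfp.distributions and tfp.mcmc modules, keeps the scale fixed for simplicity, and uses simulated data:

```python
# A minimal TensorFlow Probability example: Hamiltonian Monte Carlo over the mean
# of normally distributed data (scale held fixed to keep the sketch short).
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

y = tf.random.normal([100], mean=2.0, stddev=1.5)  # simulated data

def target_log_prob(mu):
    prior = tfd.Normal(0.0, 10.0).log_prob(mu)                   # prior on the mean
    likelihood = tf.reduce_sum(tfd.Normal(mu, 1.5).log_prob(y))  # fixed scale
    return prior + likelihood

kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob,
    step_size=0.05,
    num_leapfrog_steps=5,
)

samples = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=tf.constant(0.0),
    kernel=kernel,
    trace_fn=None,  # return only the chain states
)

print(tf.reduce_mean(samples))  # posterior mean of mu
```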

Honorable Mentions

PyMC3 is an openly available Python probabilistic modeling API. It is widely used in research, has great community support, and you can find a number of talks on probabilistic modeling on YouTube to get you started.
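For comparison, here is a minimal sketch of the same normal model in PyMC3, assuming the PyMC3 3.x API and simulated data:

```python
# A minimal PyMC3 model (assumes the PyMC3 3.x API).
import numpy as np
import pymc3 as pm

y = np.random.normal(loc=2.0, scale=1.5, size=100)     # simulated data

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)            # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5.0)           # prior on the scale
    pm.Normal("obs", mu=mu, sigma=sigma, observed=y)    # likelihood
    trace = pm.sample(1000)                             # NUTS under the hood

print(pm.summary(trace))
```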

If you are programming in Julia, take a look at Gen. It is also openly available but in a very early stage, so documentation is still lacking and things might break. Even so, it appears to be an exciting framework, and if you are happy to experiment, the publications and talks so far have been very promising.

References

[1] Paul-Christian Bürkner. brms: An R Package for Bayesian Multilevel Models Using Stan
[2] B. Carpenter, A. Gelman, et al. STAN: A Probabilistic Programming Language
[3] E. Bingham, J. Chen, et al. Pyro: Deep Universal Probabilistic Programming

I am a Data Scientist and M.Sc. student in Bioinformatics at the University of Copenhagen. You can find more content on my weekly blog http://laplaceml.com/blog