Uncertainty Sampling Cheatsheet

Robert (Munro) Monarch
Towards Data Science
4 min readJul 29, 2019

--

When a Supervised Machine Learning model makes a prediction, it often gives a confidence in that prediction. If the model is uncertain (low confidence), then human feedback can help. Getting human feedback when a model is uncertain is a type of Active Learning known as Uncertainty Sampling.

The four types of Uncertainty Sampling covered in the cheatsheet are:

  1. Least Confidence: difference between the most confident prediction and 100% confidence
  2. Margin of Confidence: difference between the top two most confident predictions
  3. Ratio of Confidence: ratio between the top two most confident predictions
  4. Entropy: difference between all predictions, as defined by information theory

This article shares a cheatsheet for these four common ways to calculate uncertainty, with examples, equations and python code. Use it as a reference the next time you need to decide how to calculate your model’s confidence!

Data Scientists often use Uncertainty Sampling to sample items for human review. For example, imagine that you are responsible for a Machine Learning model to help a self-driving car understand traffic. You might have millions of unlabeled images taken from cameras on the front of cars, but you only have the time or budget to label 1,000. If you sample randomly, you might get images that are mostly from high-way driving, where the self-driving car is already confident and doesn’t need additional training. So, you will used Uncertainty Sampling to find the 1,000 most “uncertain” images, where your model is the most confused.

Confusing traffic lights

When you update your model with the newly labeled examples, it should get smarter, faster.

An uncertain robot

This cheatsheet is excerpted from my book, Human-in-the-Loop Machine Learning: https://www.manning.com/books/human-in-the-loop-machine-learning

See the book for more details on each method and for more sophisticated problems than labeling, like predicting sequences of text and semantic segmentation for images. The principles of uncertainty are the same, but the calculations of uncertainty will be different.

The book also covers other Active Learning strategies, like Diversity Sampling, and the best ways to interpret your models probability distributions (hint: you probably can’t trust the confidence of your model).

Download Code and Cheatsheet:

You can download the code here:

You can download a PDF version of the cheatsheet here:

I’ll also be releasing the code in PyTorch at a later date, along with a larger variety of algorithms.

EDIT: the PyTorch version of this cheat sheet is now available: http://robertmunro.com/Uncertainty_Sampling_Cheatsheet_PyTorch.pdf

Further Reading

Uncertainty Sampling has been around for a long time and there is a lot of good literature. A good early paper on Least Confidence is:

Aron Culotta and Andrew McCallum. 2005. Reducing Labeling Effort for Structured Prediction Tasks. AAAI. https://people.cs.umass.edu/~mccallum/papers/multichoice-aaai05.pdf

A good early paper on Margin of Confidence is:

Tobias Scheffer, Christian Decomain, and Stefan Wrobel. 2001. Active Hidden Markov Models for Information Extraction. IDA . https://link.springer.com/content/pdf/10.1007/3-540-44816-0_31.pdf

A good early paper using Information Theory for Entropy-based sampling is:

Ido Dagan and Sean P. Engelson. 1995. Committee-based Sampling for Training Probabilistic Classifiers. ICML’95. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.6148

A foundational paper for Uncertainty Sampling more generally is:

David D. Lewis and William A. Gale. 1994. A Sequential Algorithm for Training Text Classifiers. SIGIR’94. https://arxiv.org/pdf/cmp-lg/9407020.pdf

For more academic papers on Uncertainty Sampling, look for well-cited recent work that cite the above papers.

These examples are obviously all covered in my own text, too:

Robert Munro. 2020 (expected). Human-in-the-Loop Machine Learning. Manning Publications. https://www.manning.com/books/human-in-the-loop-machine-learning

The chapters of my text are being published as they are written — the Uncertainty Sampling chapter is out now and the Diversity Sampling chapter will be next. I’ll share excerpts as I go, like I did with the Knowledge Quadrant for Machine Learning recently:

Implementing Uncertainty Sampling

All the equations in the cheatsheet return uncertainty scores in a [0,1] range, where 1 is the most uncertain. Many academic papers don’t normalize the scores, so you might see different equations in the papers above. However, I recommend that you implement the normalized [0–1] equations for any code that you will use for non-research purposes: a [0–1] range makes it easier for downstream processing, for unit tests, and to spot-check the results.

The code shared in this article and in the GitHub repository is open source, so please use it to get started!

Robert Munro

July, 2019

--

--

Private/Global Machine Learning at @Apple | Runs @BayAreaNLP | Wrote bit.ly/human-in-the-l… | Prev @StanfordNLP @AWSCloud | Opinions my own | 🚲🌍 | they/he