Hands-on Tutorials
How to Handle Uncertainty in Forecasts
A deep dive into conformal prediction.
Anytime we develop a forecast, there is uncertainty in our estimate. For instance, let’s consider a lemonade stand that is looking to forecast demand. If the forecast is very precise, it’s actionable and they can optimize their lemon-buying strategy. On the other hand, if the forecast has a large range of possible values, it’s useless.
In a 2017 paper, researchers at Carnegie Mellon University tackled this very problem. They proved that, with minimal assumptions, conformal prediction guarantees correct coverage, and the guarantee holds even for datasets with small sample sizes. Fields where uncertainty is important, such as object detection for self-driving cars, can really benefit from this method. However, despite its proven effectiveness, conformal prediction is still a relatively underused method and has not been adopted by much of the DS industry.
Below, we will outline how it works and provide some implementation notes...
Technical TLDR
Conformal prediction guarantees 1-α coverage on forecasts under just the exchangeability assumption — IID data are not required. It does this by creating a distribution of conformal scores using an out-of-sample calibration set. Note that these conformal scores are essentially residuals based on conditional probability — see figure 2.
From the distribution of ordered conformal scores, we find the quantile that corresponds to our significance level (α). We then use that quantile to determine a prediction interval with guaranteed 1-α coverage, even without IID data or a correctly specified model.
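To make the quantile step concrete, here is a minimal sketch in Python. The scores below are randomly generated placeholders standing in for calibration-set conformal scores, and the (n + 1) correction is the standard finite-sample adjustment used in split conformal prediction.

```python
import numpy as np

# Placeholder conformal scores from a hypothetical calibration set
scores = np.abs(np.random.default_rng(0).normal(size=500))
alpha = 0.1  # significance level -> target 90% coverage

# Finite-sample-corrected (1 - alpha) quantile of the ordered scores
n = len(scores)
idx = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
q_hat = np.sort(scores)[idx]

# Any new prediction whose conformal score is <= q_hat falls inside
# the 1 - alpha prediction interval / prediction set
print(q_hat)
```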
But how does conformal prediction actually work?
Ok that was a lot. Let’s slow down and really understand what’s going on.
Background on Prediction Intervals
First, let’s discuss prediction intervals and their problems.
In one sentence, prediction intervals are confidence intervals for predicted values, i.e. we expect a large percentage of future observed values to fall within the range.
In figure 3, we can see some examples of prediction intervals for some fake data. The horizontal lines above and below each data point correspond to the 95% prediction interval range — 95% of future observed values will fall within that range.
Unfortunately, prediction intervals often require several assumptions. Some examples include independent and identically distributed (IID) data as well as a correctly specified model. If you really care about accurate prediction interval coverage, those assumptions can be problematic.
So what does conformal prediction bring to the table?
Conformal Prediction is Assumption-Lean
Conformal prediction only requires one assumption, called exchangeability. Exchangeability is the notion that any ordering of the data is equally likely to occur.
For those of you familiar with the IID assumption, it’s the “ID” portion — exchangeable data are identically distributed with respect to the predictors.
By only requiring one assumption, conformal prediction is arguably the most statistically robust way to develop prediction intervals.
How Does the Method Work?
Let’s start with a binary classification example. From here, we’ll extend the concepts to multi-class classification as well as forecasting continuous variables.
As shown in figure 4, we’re looking to determine whether an image shows a camel or a cow. We will predict y=1 if we think the image shows a cow and y=0 if we think the image shows a camel.
Step 1 — Create a Conformal Score Distribution
In our first step, we split our data into a training set and a held-out set, then fit a classification model on the training data. Let’s say we’re using logistic regression. After fitting the model, we generate probabilistic predictions of our dependent variable (cow vs. camel) on the held-out set, which is called the calibration set.
At this stage we have a vector of predicted probabilities for each observation in our calibration set.
From there we calculate conformal scores using each predicted probability. The formula for a conformal score is defined in figure 5.
Conformal scores can be thought of as nonconformity scores: they measure how much a prediction disagrees with the observed label. They take on a value between -1 and 1, and the closer they are to either extreme, the more they diverge from the observed label. Note that we often use an absolute value because we only care about the magnitude of the nonconformity, not the direction.
After computing a conformal score for every observation in our calibration set, we order the absolute values of the conformal scores from low to high.
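To make Step 1 concrete, here is a minimal sketch using scikit-learn. The synthetic data, the logistic regression model, and the exact score (observed label minus predicted probability of class 1, consistent with the -1 to 1 range described above) are illustrative assumptions rather than the paper’s exact setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cow (1) vs. camel (0) image features
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Split into a training set and a held-out calibration set
X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the classifier on the training data only
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of class 1 ("cow") for each calibration observation
p_cow = model.predict_proba(X_cal)[:, 1]

# Conformal score: observed label minus predicted probability (lives in [-1, 1]);
# we keep the absolute value since only the magnitude of nonconformity matters
cal_scores = np.abs(y_cal - p_cow)

# Ordered from low to high, this forms the conformal score distribution
ordered_scores = np.sort(cal_scores)
```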
Step 2 — Create a Prediction Interval
Armed with a distribution of conformal scores, we can connect the concept back to prediction intervals.
As noted before, prediction intervals require that the engineer specify a significance level (α), often 0.05 or 0.01. To find the upper and lower bounds of our prediction interval, we find the quantile in our distribution that corresponds to α.
In figure 6, the 1-α quantile is shown by the vertical line labeled “Critical Threshold” in green. The x-axis corresponds to an ordered set of conformal scores; let’s say there are 500 of them. The intersection between the blue and red regions, the “Critical Threshold,” is the 1-α quantile. So if α is 0.1, then the cutoff for statistical significance will be the conformal score at the 90th percentile, i.e. the 450th ordered conformal score.
Conformal scores in blue are not statistically significant; they’re within our prediction interval. Very large conformal scores (red) indicate high divergence from the true label. These conformal scores are statistically significant and therefore fall outside our prediction interval.
Also note that we derive a quantile cutoff for each label.
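Continuing the sketch from Step 1, the critical threshold is the finite-sample-corrected 1-α quantile of the ordered scores. Computing one cutoff per observed label, as described above, is done here by simply repeating the quantile step on each label’s subset of calibration scores; treat this as one reasonable reading of the per-label cutoff, not a definitive implementation.

```python
import numpy as np

def critical_threshold(scores: np.ndarray, alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of a set of conformal scores."""
    n = len(scores)
    idx = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    return np.sort(scores)[idx]

# Using cal_scores and y_cal from the Step 1 sketch: one cutoff per observed label
alpha = 0.1  # significance level -> 90% target coverage
thresholds = {
    label: critical_threshold(cal_scores[y_cal == label], alpha)
    for label in (0, 1)  # 0 = camel, 1 = cow
}
```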
Step 3 — Estimate the Probability of Each Label
From here, for each observation we determine the probability of being a cow, a camel, both, or neither. How is this done?
Well, for each observation, we calculate the conformal score for each label, i.e. camel and cow. From there, we determine whether each conformal score is beyond its corresponding critical threshold. If it’s beyond the threshold, we deem it statistically significant and therefore false. If not, we deem it true.
There are four possible outcomes for our binary classification problem, as shown in figure 7. Red indicates that the forecast is statistically significant and blue indicates that it’s not. Here, statistical significance means that the forecast is false and a lack of statistical significance means the forecast is true.
Let’s take a look at each cell in turn…
- Top-left: forecasts where both labels are not statistically significant. Here, the model predicted that both classes are the true class.
- Bottom-right: forecasts where both labels are statistically significant. Here, the model predicted that neither class is the true class.
- Top-right: forecasts where the camel label (0) is not statistically significant but the cow label (1) is. Here, the model predicted that camel is the true class.
- Bottom-left: forecasts where the cow label (1) is not statistically significant but the camel label (0) is. Here, the model predicted that cow is the true class.
This step can be a bit counterintuitive. How can a model predict more than one class or even no classes at all?
Well, let’s take an intuitive approach. Simple binary classification models find which class has the highest probability and determine that to be the forecast. But with conformal prediction, we can allow the model to say that neither or both are the true label. In effect, the model can say “I don’t know.”
Since we don’t force the model to make a prediction, we get much more robust forecasts in our prediction sets.
And our 1-α prediction sets are those with just one forecast: the top-right and bottom-left cells.
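Here is a sketch of how the prediction set for a new image could be assembled, continuing the earlier snippets and their assumptions about the score. For each candidate label we compute the score that label would receive, compare it against that label’s critical threshold, and keep every label that is not rejected.

```python
def prediction_set(model, x_new, thresholds):
    """Return every label whose conformal score is not statistically significant."""
    p_cow = model.predict_proba(x_new.reshape(1, -1))[0, 1]
    kept = []
    for label in (0, 1):                # 0 = camel, 1 = cow
        score = abs(label - p_cow)      # score this label would receive if it were true
        if score <= thresholds[label]:  # below the cutoff -> keep the label
            kept.append(label)
    return kept                         # may be [0], [1], [0, 1], or []

# [1] reads as "cow", [0] as "camel", [0, 1] as "could be either",
# and [] as "neither label looks plausible"
print(prediction_set(model, X_cal[0], thresholds))
```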
Extension to More Complex Labels
Now that we understand how conformal prediction works for binary classification, the concept can be (sort of) easily extended to multi-class classification. As in step 3, where we found the probability of being labeled a 1, 0, both, or neither, we just do the same operations for three (or more) classes instead of two.
Conceptually this is a simple next step, but the number of possible prediction sets grows exponentially with the number of classes, so conformal prediction for labels with many levels is not always a good idea.
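To sketch the multi-class generalization, the loop simply runs over every class. Note that the score used here is 1 minus the predicted probability of the candidate class, a common choice in the multi-class setting that differs slightly from the binary formula above; that choice, and the assumption that per-class thresholds were built the same way as in Step 2, are mine rather than the source’s.

```python
def multiclass_prediction_set(model, x_new, thresholds):
    """Keep every class whose conformal score is below its per-class cutoff."""
    probs = model.predict_proba(x_new.reshape(1, -1))[0]  # one probability per class
    kept = []
    for label, p in enumerate(probs):
        score = 1 - p                   # common multi-class score: 1 - P(candidate class)
        if score <= thresholds[label]:  # not statistically significant -> keep
            kept.append(label)
    return kept
```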
For continuous variables, there are several approaches, but the easiest is simply forecasting a quantile instead of the mean. As noted before, we can think of a conformal score as a residual: the true y minus the predicted y. However, unlike with categorical variables, we don’t have a fixed set of labels to try.
So, for continuous labels we forecast a finite number of different quantiles and see where they become statistically significant. The values on the edge of statistical significance become the boundaries of our 1-α prediction interval. Note that a quantile loss can be swapped into nearly any model, which is why this approach is cited as the “easiest.”
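For a concrete version of the quantile approach, the sketch below fits lower and upper quantile models with scikit-learn’s gradient boosting and then conformalizes the resulting band on a calibration set, in the spirit of conformalized quantile regression. The data, model, and hyperparameters are placeholders, not the source’s setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)

alpha = 0.1  # significance level -> 90% target coverage

# One model for the lower quantile, one for the upper quantile
# (GradientBoostingRegressor's "alpha" parameter is the quantile level, not the miscoverage rate)
lower = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2).fit(X_train, y_train)
upper = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2).fit(X_train, y_train)

# Conformal score: how far the true value falls outside the predicted band
lo, hi = lower.predict(X_cal), upper.predict(X_cal)
cal_scores = np.maximum(lo - y_cal, y_cal - hi)

# Finite-sample-corrected (1 - alpha) quantile of the scores
n = len(cal_scores)
idx = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
q_hat = np.sort(cal_scores)[idx]

# 1 - alpha prediction intervals for new points: widen the band by q_hat
X_new = X_cal[:5]
intervals = np.column_stack([lower.predict(X_new) - q_hat, upper.predict(X_new) + q_hat])
print(intervals)
```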
If you’re curious to learn more, here’s a really good resource for using conformal prediction on continuous variables.
And there you have it! Conformal prediction in all its glory.
Summary and Implementation Notes
To hammer home the concepts, let’s summarize.
Conformal prediction is the most assumption-lean approach to developing prediction intervals. It is therefore the most universally defensible method. It also works on datasets with small sample sizes and can produce prediction intervals/sets that are more precise.
The method leverages a distribution of conformal scores to judge whether each candidate label (or value) is plausible. If the probability of a prediction is low enough, i.e. the conformal score is large enough, we deem it to be outside of our prediction interval. These judgments are then used to develop prediction intervals for continuous variables or prediction sets for categorical variables.
The method still hasn’t gotten much attention, but it holds tremendous potential for predictive modeling.
Thanks for reading! I’ll be writing 36 more posts that bring academic research to the DS industry. Check out my comment for links to the main source for this post and some useful resources.