Notes from Industry
When evaluating an intervention, whether a simple A/B test or a complicated contextual bandit, the benefits of a holdout group are overrated.

I’m a co-founder at Aampe, where we embed contextual learning algorithms into mobile apps’ push notifications to learn about and adapt to individual user preferences. We do a lot of tests and a lot of experimental design, as well as a fair amount of machine learning. This post is about a particular request we often get from potential customers: that we hold out a subset of users who will have neither the content nor the timing of their notifications chosen by Aampe’s learning systems. This request seems to stem from the mistaken belief that a holdout comparison is somehow inherently a "scientific" practice.
Science necessarily involves making comparisons, but not all comparisons are necessarily scientific. The details matter, and several details speak to why a holdout comparison isn’t necessarily a good practice. Below, I’ve written about several issues that make it extremely difficult to use a holdout sample to evaluate the effectiveness of Aampe’s learning algorithms, but the principles apply to any situation where you’re dealing with an extended, adaptive intervention – a bandit algorithm, for example.
Comparability
It should go without saying that we should only compare a holdout group to a test group if the two groups are comparable, but that question of comparability is exactly the question we beg each time we talk about holding users out. One of my co-founders, Sami Abboud, has written about the limits of random assignment: if we randomly assign some percentage of users to the holdout and then keep the rest in a "test group" where Aampe manages all the communication, it’s entirely possible – likely, in fact – that certain user attributes are going to be disproportionately reflected in one group or the other. The holdout group will have a disproportionate number of men, or the test group will have a disproportionate number of new users. The more ways the app’s users can vary (Does the app offer several different products? Is it available in several different geographic areas?), the greater the likelihood that random assignment will be inherently biased assignment.
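To make that concrete, here is a minimal simulation of how often a purely random split leaves at least one user attribute noticeably imbalanced between holdout and test. The attribute counts, base rates, group sizes, and the 3-point threshold are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical user base: several binary attributes per user
# (the number of attributes and their base rates are made up).
n_users, n_attrs, holdout_frac = 5_000, 10, 0.05
attrs = rng.random((n_users, n_attrs)) < rng.uniform(0.1, 0.5, n_attrs)

imbalanced_splits = 0
n_trials = 1_000
for _ in range(n_trials):
    # Pure random assignment: ~5% holdout, rest in the test group.
    holdout = rng.random(n_users) < holdout_frac
    gaps = np.abs(attrs[holdout].mean(axis=0) - attrs[~holdout].mean(axis=0))
    # Call a split "imbalanced" if any attribute differs by > 3 points.
    if gaps.max() > 0.03:
        imbalanced_splits += 1

print(f"{imbalanced_splits / n_trials:.0%} of random splits left at least "
      f"one attribute imbalanced by more than 3 percentage points")
```

With these (made-up) numbers, the large majority of random splits leave at least one attribute imbalanced, and the problem only gets worse as the holdout gets smaller or the number of ways users can vary gets larger.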
At Aampe, we insist on conditioned assignment, where we explicitly measure similarities and differences between users to ensure that we assign similar messages to users with different behaviors and attributes, and assign different messages to users with similar behaviors and attributes. That allows us to estimate and discount the impact of these baseline influences when assessing the impact of the messaging choices we make. (You can find a simple explanation of conditioned assignment in the "User Landscape" section of A User Story, or you can walk through an illustration of it here.)
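Aampe’s actual implementation is more involved than anything I can show here, but purely to illustrate the flavor of conditioned assignment, here is a toy sketch: group users by the similarity of their behaviors and attributes, then cycle message variants within each group so that variant choice ends up decorrelated from those attributes. The clustering step, feature counts, and variant counts below are stand-ins, not our production logic:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical user features (behavioral + demographic), standardized.
n_users, n_features, n_variants = 5_000, 8, 4
X = rng.normal(size=(n_users, n_features))

# Step 1: group users by similarity of behavior/attributes.
# (KMeans is just a stand-in; any similarity measure would do.)
clusters = KMeans(n_clusters=20, n_init=10, random_state=1).fit_predict(X)

# Step 2: within each cluster, cycle through the message variants so that
# similar users receive different messages and each variant is spread
# across dissimilar users.
assignment = np.empty(n_users, dtype=int)
for c in np.unique(clusters):
    members = rng.permutation(np.where(clusters == c)[0])
    assignment[members] = np.arange(len(members)) % n_variants

# Result: variant choice is (approximately) independent of user attributes,
# so baseline differences can be estimated and discounted when measuring
# the effect of each variant.
```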
Ok, so why not use conditioned assignment to select the holdout group? That would ensure the two groups are comparable. Yes, it would…but only on the day that we made the assignments. Most apps have high churn – one study of 37,000 apps in late 2018 showed that, on average, over half of an app’s users churn within one month of downloading the app. For many apps, that churn happens within the first few days of that first month – users come in, maybe look around, and then never come again. Many of those users would be in any holdout group we constructed, which means the holdout group will shrink over time – probably quickly. We could repopulate it with new users, but those users by definition offer us very little information about themselves, because they’re new.
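As a back-of-the-envelope illustration of how fast that shrinkage happens, here is a toy model in which each holdout user gets a random lifetime after install, tuned so that roughly half of users churn within a month. The lifetime distribution is made up; your app’s retention curve will differ:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: each holdout user gets a random "lifetime" after install.
# An exponential lifetime with a ~40-day scale means roughly half of
# users churn within a month (numbers are illustrative only).
holdout_size = 1_000
lifetimes = rng.exponential(scale=40, size=holdout_size)

for day in (7, 14, 30, 60, 90):
    still_active = (lifetimes > day).mean()
    print(f"day {day:3d}: {still_active:.0%} of the original holdout is still active")
```

Under these assumptions, only around a fifth of the original holdout is still around after two months – and that’s before accounting for the backfilling problem.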
When dealing with humans, holdout groups are difficult to create and even more difficult to maintain. If we’re dealing with millions of daily active users, then the difficulties possibly shrink a little – throwing massive amounts of data at a problem can work wonders – but the difficulties never disappear. Using a holdout group despite these challenges isn’t scientific. It’s just wishful thinking.
Ethics
Let’s say that we could magically make all of the aforementioned difficulties go away, so there was no longer a strong analytic reason to question the validity of a holdout comparison. I strongly believe that, in most cases, we still wouldn’t want to do it.
Consider a situation where there’s a drug that trials have indicated successfully treats a common disease. We have early indicators that the drug works, but we still don’t feel we know for sure. So when people come into the pharmacy to buy the drug, we randomly choose whether they’ll get the actual drug or a sugar pill. We don’t tell them they’re involved in an ongoing trial – we just make the switch and monitor what happens. It’s obviously not ok to do that.
Now consider a slightly less fraught situation, where we’re offering education support services to students who are struggling academically. Most of the students who come and pay for our services get our approved program, which we’ve spent years developing into something that we believe really works. But we randomly assign a percentage of students to get bland, boilerplate materials that were designed just to fill space rather than provide educational value. We don’t tell students that there’s a possibility that they’ll get the fake materials – we just make the switch and watch. It’s clearly not ok to do that.
Alright, now consider: we make decisions about our app’s offerings and user experience that we believe provide clearer value for our users. Each user downloaded our app for a purpose, and we spend a lot of time and resources trying to make sure they get easy access to the things that brought them to us in the first place. We make changes to our app that we think are improvements. In what world is it suddenly ok to take a random subset of those users, give them an experience that we believe offers less value, and not tell them that’s what’s going on or give them the chance to provide informed consent to an experiment?
It’s not ethical.
It’s especially not ethical in cases where our users are paying us to provide them with value. They pay us to give them the best experience we can. We don’t get to go back on that bargain just so we can feel a little bit better about making decisions under conditions of uncertainty.
Now, everything I’ve said here about ethics applies to a global holdout group – people we exclude from all interventions in order to assess the effectiveness of those interventions. It’s less of an issue if we do temporary holdouts – randomly stop interventions for a certain number of users for, say, a month, and then pick a new holdout the next month, and then a new holdout the next. That way, everyone has an equal chance of losing certain benefits, but the loss is always temporary. But that still doesn’t address the analytic problems we talked about earlier. And, anyway, it’s unnecessary because of what we’re going to talk about next.
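For what it’s worth, a rotating temporary holdout is easy to implement. Here is one hypothetical way to do it, hashing the user id together with the month so that a fresh slice of users is paused each period; the function name, parameters, and 5% default are mine, purely for illustration:

```python
import hashlib

def in_holdout(user_id: str, month: str, holdout_frac: float = 0.05) -> bool:
    """Deterministic, rotating temporary holdout: hash the user id together
    with the month so a different ~5% of users is held out each month and
    every user faces the same chance of a (temporary) pause."""
    digest = hashlib.sha256(f"{user_id}:{month}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < holdout_frac * 10_000

# Example: pause interventions for this user during June 2021 only if
# in_holdout("user_123", "2021-06") returns True.
```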
Competing hypotheses
It seems to me that many people who insist on holdout samples view science as mainly a matter of null hypothesis testing (sometimes called "statistical significance" testing). I don’t have a very high opinion of null hypothesis testing (and here’s a good summary of the most prominent problems, if you’re interested). My main complaint against the approach is that I don’t think it often makes for good science.
At its core (and at its best), all science follows a simple process:
- Make up a plausible story. This could be descriptive ("X and Y both happen together") or explanatory ("X causes Y"), but it’s still just a story.
- Make up as many alternative stories as you can. Maybe the relationship between X and Y isn’t actually very consistent. Maybe Y actually causes X. Maybe X only causes Y when Z is present. Maybe Z causes Y, but X tends to happen at the same time as Z so it looks like X causes Y. Make up as many stories as possible that (1) are plausible and (2) can’t be true if your original story is also true.
- Try to kill all the stories. This includes your original (usually preferred) story. Set up conditions where observations really shouldn’t match a story unless the story really is more than just plausible. The stories that generate fewer contradictions deserve more of our belief than stories that are more easily contradicted.
So which stories live or die based on the data that come from a holdout group? At Aampe, our preferred story, of course, is that users engaged with the app in ways, and to an extent, that we wouldn’t have seen if Aampe’s learning systems had not managed the communication. What are the plausible alternatives?
- Maybe we could have done nothing. After all, why pay for a service that optimizes user communication through continuous, massively-parallel learning if you can get the same results by not doing anything at all? That’s a fair question, but it’s not one that a holdout is capable of answering. This goes back to the challenge of comparability that I mentioned earlier. A totally random holdout will almost certainly exhibit systematic differences from the rest of the population, but even a holdout created through a more sophisticated experimental design will age away from the rest of the population. In fact, the more successful the automated management of notifications is, the more we would expect a constantly-backfilled holdout to contain a disproportionate number of new users, because successful management would result in more new users turning into returning users.
- Maybe we could have managed notifications manually. Yes, while an automated system is managing notifications for one subset of users, you could be hand-crafting the perfect messages for another subset. Even if your messages outperform the managed messages (and they typically won’t), unless you’re prepared to permanently hire an entire agency – or hire the in-house equivalent of one – to keep baking up those artisanal notifications, the comparison is meaningless, because it’s not sustainable. A comparison shouldn’t be an academic question – it should help you decide between two courses of action. If fully-manual management of notifications at scale were a feasible course of action, more people would be doing it successfully. The fact that practically no one is doing that successfully is one of the main reasons we created Aampe in the first place.
- Maybe we could have just used a set of triggers and rules. Try breaking this down into the actual rules you might use. Maybe you have in mind some really simple rules like "message everyone on Monday afternoon", or some more complex rules like "message people who bought something two days after their purchase and then two days after that if they don’t respond." Those rules contain implicit hypotheses: "Monday afternoons are a good time to message users," "two days is the right amount of time to wait before following up on a purchase," and so forth. A holdout sample rolls all of those implicit hypotheses into one big bundle from which you cannot unpack insights. Maybe Monday afternoons are good, but maybe Tuesday mornings are good, or maybe Wednesday afternoons, or Thursday nights. Any decent automation system should already test all of those hypotheses against each other. Same thing goes for waiting periods after actions: maybe two days is good, but maybe one day is good too, and maybe three days is better. Any rules you might employ for a holdout sample are going to be only a small subset of all the rules you could employ. Instead of stuffing a handful of alternative stories into an undifferentiated box labelled "holdout", you should actually evaluate each story in comparison to all of the others, as in the sketch that follows this list. (And that’s exactly what Aampe does.)
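To give a sense of what "testing all of those hypotheses against each other" can look like in code, here is a minimal Thompson-sampling sketch over a handful of made-up send-time hypotheses. This is not Aampe’s system; it is just the standard bandit pattern, with the timing options and "true" response rates invented for the example:

```python
import numpy as np

rng = np.random.default_rng(3)

# Competing timing hypotheses, each treated as an arm of a bandit
# (the options and true rates below are made up for illustration).
arms = ["Mon afternoon", "Tue morning", "Wed afternoon", "Thu night"]
true_response_rates = [0.04, 0.06, 0.05, 0.07]   # unknown in practice

# Beta(1, 1) priors, updated as each notification succeeds or fails.
successes = np.ones(len(arms))
failures = np.ones(len(arms))

for _ in range(20_000):                      # one simulated send per step
    # Thompson sampling: draw a plausible rate for each arm, send on the best.
    sampled = rng.beta(successes, failures)
    arm = int(np.argmax(sampled))
    responded = rng.random() < true_response_rates[arm]
    successes[arm] += responded
    failures[arm] += 1 - responded

for name, s, f in zip(arms, successes, failures):
    print(f"{name:14s} sends={int(s + f - 2):6d} est. rate={s / (s + f):.3f}")
```

Each timing hypothesis gets evaluated against all of the others, and sends concentrate on the better-performing times as evidence accumulates – which is exactly the kind of unpacking that a single undifferentiated holdout can’t give you.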
The purpose of this post has not been to tout the wonders of Aampe – I accept that I’m biased on that point. What I’ve tried to show is that, in most cases, holdout groups are more scientif-ish than scientific. A holdout sample too often gives a test the trappings of rigor while not necessarily introducing rigor. Rigor requires the formulation and testing of alternative hypotheses, not only a null hypothesis. When it comes to something as complicated as user notifications, with countless permutations of message content and timing, a null hypothesis just isn’t all that useful.
Schaun Wheeler is a co-founder at Aampe, a software company that transforms push notifications into a proactive user interface. Schaun is both an anthropologist and a data scientist, and has worked across the security and intelligence, travel, investment, education, advertising, and user experience industries. And he recently wrote a children’s book to explain his company’s algorithms. You should read it: https://www.aampe.com/blog/a-user-story.