A couple of weeks ago, I was simulating some A/A test data for my workmate Andrew. We wanted to produce a chart like the one below for a company-wide presentation he was doing that week about A/B testing and the dangers of peeking:

The goal was to show how p-values moved around and how, even with an A/A test where the null hypothesis should definitely be true, we saw p-values < 0.05 at some point.
Hence, I went on to simulate 1,000 A/A tests to calculate how many of them reached significance (p-value < 0.05) at some point. Each test ran for 16 days. With this simulation, I found that ~23% of them did.
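For anyone who wants to play with this, here is a minimal sketch of that kind of simulation; the standard-normal metric, the 1,000 visitors per variant per day and the two-sample t-test are illustrative choices rather than the exact setup I used:
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_tests, n_days = 1_000, 16
daily_n = 1_000  # visitors per variant per day (an illustrative choice)

# p_values[i, d] = p-value of test i when evaluated at the end of day d
p_values = np.empty((n_tests, n_days))

for i in range(n_tests):
    a = np.empty(0)
    b = np.empty(0)
    for d in range(n_days):
        # A/A test: both variants draw from the same distribution,
        # and each day's data is added on top of the previous days'
        a = np.concatenate([a, rng.normal(0, 1, daily_n)])
        b = np.concatenate([b, rng.normal(0, 1, daily_n)])
        p_values[i, d] = stats.ttest_ind(a, b).pvalue

# proportion of A/A tests that were "significant" on at least one of the 16 days
print((p_values < 0.05).any(axis=1).mean())
```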
However, when I did a much simpler simulation where I just:
- Drew 16 random numbers from a Uniform(0,1) – because p-values under the null hypothesis are uniformly distributed,
- Took note of whether there was at least one number < 0.05, and
- Repeated this many times,

I found that ~56% of the 16-number arrays had at least one number < 0.05.
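That simpler simulation only takes a few lines (the 100,000 repetitions here are an arbitrary choice):
```python
import numpy as np

rng = np.random.default_rng(0)

n_sims = 100_000
# 16 independent draws from Uniform(0, 1) per simulated "test"
draws = rng.uniform(size=(n_sims, 16))

# proportion of 16-number arrays with at least one number < 0.05
print((draws < 0.05).any(axis=1).mean())  # ≈ 0.56, i.e. 1 - 0.95**16
```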

I was confused. With the A/A test data, even though I could see that the p-values were uniformly distributed, I found that ~23% of the tests reached significance at some point during the 16 days. In contrast, when doing the much simpler simulation and drawing 16 random numbers from a Uniform(0,1), I got a much higher proportion of ~56%.
The next day I woke up with a theory in my head: could it be that the p-values within each A/A test were not truly independent across the 16 days? That is, could it be that if a test randomly got a very high p-value on the first day, it was much more likely to have a high p-value on the second day too? After all, we were not using completely different data every day, but adding new data on top of the old data that had been used to calculate the previous day's p-value.
Hence, I:
- Grouped the A/A tests by whether their first-day p-value was high (≥ 0.9) or low (≤ 0.1), and
- Looked at the distribution of second-day p-values for each group.
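In code, assuming the p_values matrix from the earlier sketch (one row per A/A test, one column per day), that grouping looks roughly like this:
```python
import numpy as np

# split tests by their day-1 p-value
high_start = p_values[p_values[:, 0] >= 0.9]  # high p-value on day 1
low_start = p_values[p_values[:, 0] <= 0.1]   # low p-value on day 1

# compare the day-2 p-value distributions of the two groups
print(np.percentile(high_start[:, 1], [25, 50, 75]))
print(np.percentile(low_start[:, 1], [25, 50, 75]))
```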
This is what I saw:

This suggested quite strongly that I was onto something. A/A tests that happened to have a very high p-value on their first day were much more likely to have a high p-value on their second day too, and vice versa. This is how different the distributions were when looking at the full 16-day period (instead of just the second day):

Same story. Even stronger evidence.
While I was figuring all this out, I also asked Simon (our Data Science consultant and the brain behind Coppelia) about it, as I was interested in his thoughts as well as his seal of approval for my absence-of-independence hypothesis.
When grouping by day, he also saw that about 5% of A/A tests were significant at any one point, as we would expect. However, when grouping by test the story was quite different: the dependency was such that if we had a false positive on day d, then we were much more likely to still have that false positive on day d+1. Likewise for true negatives. That is why the overall rates, when grouped by test, looked so different.
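To make the "grouped by day" versus "grouped by test" distinction concrete, both rates can be computed from the same daily p-value matrix (again assuming the p_values array from the earlier sketch):
```python
import numpy as np

significant = p_values < 0.05  # shape (n_tests, n_days)

# grouped by day: on any single day, roughly 5% of A/A tests look significant
print(significant.mean(axis=0))

# grouped by test: a much larger share is significant on at least one day,
# because a false positive tends to stick around on subsequent days
print(significant.any(axis=1).mean())
```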
This is clearest in the plot he sent me, which shows how, when a false positive (white square) occurs, it is likely to persist on subsequent days:

Oh well, mystery solved then! And that's it for today. I'm conscious I haven't covered some of the statistical concepts in this blog post in detail, but if you're interested I'd definitely encourage you to go and learn more about:
- p-values: https://en.wikipedia.org/wiki/P-value
- significance: https://en.wikipedia.org/wiki/Statistical_significance
- independence: https://en.wikipedia.org/wiki/Independence_(probability_theory)
- the uniform distribution: https://en.wikipedia.org/wiki/Continuous_uniform_distribution
Thanks for reading!