We’ve learnt about p-values many times before. First, in the statistics book we crammed at university; second, when we read some articles assuring us they were legitimate; and finally now as we decide how to run our businesses. Throughout this, we’re told definitions, and hopefully, we’ve learnt enough to regurgitate something about null hypotheses and probabilities – but do we really understand?
Understanding is more than knowing a definition, and to effectively use p-values it’s necessary to have a ‘feel’ for what they mean. To do this, we’ll take a visual approach.
In fact, when you look into p-values through visuals, they’re remarkably straightforward, and that’s what I’d like to share with you today.
We’ll follow this rough guideline:
- How can I be sure my Conversion Rate is what I think it is?
- Measuring Probabilities
- The AB Test Scenario
The first few paragraphs will cover some basics so feel free to skip over these, but don’t forget to check the visuals!
Let’s start by looking at a simple p-value before moving on to the all-important AB Test scenario that shapes product decisions around the world in all your favourite apps, services and programs. Hopefully, after reading this (or bookmarking it for later) you can understand the p-value in all its glory, like I finally have.
Note that the following content is framed around e-commerce products, but the same understanding applies to many other scenarios.
Your First CVR
Imagine this: you’ve launched a new website, and you’d like to know what your conversion rate (CVR) is, so you begin to measure the sessions and the conversions. Using these you can calculate a simple proportion for the conversion rate as:
CVR = Conversions / Sessions
You measure 1000 sessions and 50 conversions which gives you a CVR of 5% – awesome! But wait, how accurate is that 5%?
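If you prefer code to formulas, the same calculation is a trivial sketch using the numbers above:

sessions = 1000
conversions = 50

# CVR = Conversions / Sessions
cvr = conversions / sessions
print(f"CVR = {cvr:.1%}")  # CVR = 5.0%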
If you measure another 1000 sessions, will you get another 50 conversions or will it be something different? This is where our friend Statistics comes in – to help us understand what we can infer about the underlying truth from our samples of data.
To understand the p-value in its simplest form, let’s pose a different question. If the actual CVR of our website is 5%, what’s the probability that I’ll measure a CVR of 6% or greater if I measure 1000 sessions as I did above? If you don’t like to think in percentages, it’s the same as saying "what’s the chance of getting 60 or more conversions from 1000 sessions?".
Getting Experimental
With a problem statement like this, we can now run an experiment through computation. To do this, we’ll set up a program that simulates 1000 sessions and, for each session, randomly decides whether it converts (with the assumed 5% probability). With these simulated sessions we can calculate a CVR for that experiment. We’ll then repeat this 10,000 times to see what the most common outcome is.
import random

cvrs = []
# Run the test 10,000 times.
for i in range(10000):
    # Simulate 1,000 sessions.
    conversions = 0
    for j in range(1000):
        # Get a random number between 0 and 1 and count a
        # conversion if it falls below 5% (0.05).
        if random.random() < 0.05:
            conversions += 1
    # Calculate our CVR for this experiment.
    cvr = conversions / 1000
    # Store the CVR for later.
    cvrs.append(cvr)
After running this, we’re left with a long list of possible CVRs which will look something like:
# cvrs = [0.053, 0.049, 0.050, 0.051, ...]
We can then group all these values together in some bands and see how many of each we recorded. Plotting this gives us the histograms below.
As a quick recap, each column in the histogram shows how many CVRs ended up within a band, e.g. for 0.05 the band is 0.0495 to 0.0505, and the graph shows that this band collected the most CVRs.
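If you’d like to reproduce the histogram yourself, here’s a minimal sketch using matplotlib (my choice of plotting library, not something prescribed above), reusing the cvrs list from the simulation:

import matplotlib.pyplot as plt

# Group the simulated CVRs into 0.1%-wide bands (e.g. 0.0495 to 0.0505)
# and count how many trials landed in each band.
band_edges = [0.0195 + 0.001 * k for k in range(62)]
plt.hist(cvrs, bins=band_edges)
plt.xlabel("Measured CVR")
plt.ylabel("Number of trials")
plt.show()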


What does this graph tell us? Well, we can see that the most common value is 5% and this is exactly what we’d expect since that is the actual CVR. We can also see that there are many values which aren’t 5% and this is due to the random behaviour of users.
Any individual user landing on a site has a 5% chance to convert, but that doesn’t mean that 5 out of 100 users will convert. It’s still possible that 0 out of 100 users will convert, just from chance.
That being said, on average, there will tend to be 5 converting users out of 100 and that’s why the number of experiments where the measured CVR is different to 5% drops off as we move away from the centre. This distribution is known as the normal distribution and is used across many fields of study due to its capacity to approximate natural phenomena.
Calculating Probabilities
Now you might wonder what the point of this was, but with our newfound knowledge, we can answer our original question, which was:
If the actual CVR of our website is 5%, what’s the probability I’ll measure a value of 6% or greater if I measure 1000 sessions?
This is easy now. We simply have to count all the trials that ended up with a CVR of 6% or greater and divide by the total number of trials.
P(CVR ≥ 6%) = Count of trials with CVR ≥ 6% / Total count of trials
            = 200 / 10000
            = 2%
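If you’d rather get the count from the simulation than read it off the graph, a short sketch over the cvrs list from earlier does it (your exact number will vary from run to run):

# Count the trials with a CVR of 6% or greater and divide by the total.
extreme_trials = sum(1 for cvr in cvrs if cvr >= 0.06)
p_value = extreme_trials / len(cvrs)
print(f"P(CVR >= 6%) = {p_value:.1%}")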
There we have it. The chance of measuring 6% or more for our CVR is just 2%. We can calculate similar numbers for other potential CVRs, but the trend is clear: as we move away from 5% (our assumed CVR in the experiment), the probability of measuring that value drops towards 0.
Let’s recap what we just said:
The chance of measuring 6% or more for our CVR is just 2%.
Whether or not you noticed: that’s a p-value.

A p-value is the probability that we observe something as extreme as, or more extreme than, what we measured, assuming the null hypothesis is correct.
In our case, the null hypothesis is that the CVR is 5% and it being correct means that the actual CVR is 5%. So there you have it – "the chance of measuring 6% or more for our CVR is just 2%." – a p-value without even meaning to find one.
Reversing It
In a typical application, you’d normally start with a target p-value, and then check how the measured CVR compares to this to see if it is significant. This is the opposite to the procedure we just followed. Let’s try it out.
I’m adding advertising to a page which has previously performed well with a CVR of 4%, but I want to be pretty sure (95%) that my advertising doesn’t distract users from making purchases. Let’s go back to our simulation, this time run with the baseline CVR of 4%. Instead of counting all the trials above a certain value, we need to find the CVR below which only 5% of the trials fall. This will be something like:
No. of Trials below the cutoff CVR = 5% of Total Trials = 0.05 * 10000 = 500
If we line up our measured CVRs in increasing order, we just need to choose the 500th CVR. A neat way to do this is using a Cumulative Distribution, built by adding up the counts of all the bands to the left as you move along the list. As you go from left to right you get the total number of trials with a CVR less than a given value. At a CVR of 4%, you’d expect 5,000 trials, exactly half.
Typically, we will plot cumulative distributions as percentages so we don’t have to keep track of the number of trials.
Percentage of Measurements = Measurements So Far / Total Measurements


Now that we have our cumulative distribution, we can find our cutoff by drawing a horizontal line at 5%, across to the curve, and down to the CVR axis. This gives a result of 3%. So we can conclude that if, after 1000 sessions, we record a value of 3% or less, we can be confident (at the 95% level) that the true CVR is below 4% and we should reconsider our page design.
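Here’s a sketch of the same calculation in code, assuming we re-run the earlier simulation with a 4% conversion probability and then pick the CVR sitting 5% of the way up the sorted list:

import random

# Re-run the simulation with a baseline CVR of 4%.
cvrs_4pct = []
for i in range(10000):
    conversions = sum(1 for j in range(1000) if random.random() < 0.04)
    cvrs_4pct.append(conversions / 1000)

# Sort the CVRs and take the 500th value (5% of 10,000 trials).
cvrs_4pct.sort()
cutoff = cvrs_4pct[int(0.05 * len(cvrs_4pct))]
print(f"5% cutoff CVR = {cutoff:.1%}")  # roughly 3%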
AB Testing
Hopefully the above has given you a basic understanding of how we can visualise p-values. We’ve now arrived at the summit of this article – AB testing.
AB tests shape product decisions around the world in all your favourite apps, services and programs and for this reason, understanding them is of the utmost importance.

An AB Test is used when one would like to test the difference between a variation and the original. An example is a new purchase flow on an e-commerce website. The original design would be called the Control and the new design would be the Variation. In this case, we would want to know: "is the CVR for the Variation better than the Control?".
In such a test, you might measure the results of 1000 sessions and find that the CVR for the Variation is 9% but the CVR for the Control is 7%. Can we conclude that the Variation is better? Let’s do some more statistics.
We will take the same approach as before, first restating the question:
What is the probability that we measure a difference of 2% or more due to pure chance?
With this, we can simulate a bunch of experiments as we did before. Given our measured CVRs of 7% and 9%, the null hypothesis is that both versions share the same underlying CVR, and the most likely pooled value is 8%. So we can run two simulated experiments with a CVR of 8% and see how many times the difference is greater than 2%.
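As a sketch, here is one way to run that simulation: simulate both groups at 8% and count how often they end up 2 percentage points or more apart, in either direction, purely by chance.

import random

differences = []
for i in range(10000):
    # Simulate 1,000 sessions each for the Control and the Variation,
    # both with the same underlying 8% CVR.
    control = sum(1 for j in range(1000) if random.random() < 0.08) / 1000
    variation = sum(1 for j in range(1000) if random.random() < 0.08) / 1000
    differences.append(variation - control)

# How often does pure chance produce a gap of 2% or more (in either direction)?
p_value = sum(1 for d in differences if abs(d) >= 0.02) / len(differences)
print(f"P(|difference| >= 2%) = {p_value:.1%}")  # around 10%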


We can see the results of our trials here. As before, we have a normal distribution and its cumulative distribution. To work out the chance of measuring a 2% difference, we find where the value of 2% sits on the cumulative distribution. Drawing a line up, it lands at around 90%, which tells us that 10% of the time we’ll see a difference of more than 2% by pure chance. Rephrasing:
There is a 10% chance that we will see a difference of greater than 2% due to pure chance.
There we have it, another p-value. Typically for business testing we aim for a p-value of 0.05 (a 95% confidence level), so with the amount of data we have we can’t conclude that the Variation is better; instead, we must collect more data.
Why AB Test?
The attentive reader might wonder why we run an AB Test at all. In the previous section we showed how to find p-values when you make a change and know the underlying CVR of your website.
The reason we run an AB Test is to control for known and unknown factors that may influence the outcome. For example, if users are more likely to convert at the end of the month and we launch our test at the start of the month then it is likely that the variation will do worse even if it is a better design. Alternatively, if during the test period one of the suppliers discontinues a product this would put the variation at a loss.
By running an AB test where users are randomly assigned to a group, we can be confident that the two samples will have similar characteristics, creating a fair test.
Conclusion
We found that the p-value is the probability that we observe an effect as extreme as, or more extreme than, the one measured, assuming the null hypothesis is correct.
We saw that p-values are fairly straightforward to understand with the power of visualisation and some computational experiments. This method of understanding helped me greatly, and I hope it serves as a bookmark for whenever someone else needs help grasping what a p-value actually means.
Know another way to simplify p-values? Let me know!