
Three Steps to Better A/B Tests

A simple guide to improving testing at your organisation

Photo by National Cancer Institute on Unsplash

A/B tests originate from academia and science, where they go by their much fancier name, Randomised Controlled Trials. Organisations in every sector have used them to improve things for their users and, in turn, improve results. Whilst A/B test processes are well known and well documented, there is some art to the science. This article looks at the steps you can take to extract the most out of your testing regime.

Step 1 – Ask Better Questions

Namely, how can I make things better for my users and how do I prioritise this?

Making things better for your users needs to be the guiding principle of any test. Photo by Adam Wilson on Unsplash

There are two ways of answering those questions: evidence-based and assumption-based. There are pros and cons to both approaches. Questions based on assumptions will likely be faster to generate but are also more likely to produce less effective tests. Evidence, on the other hand, will be slower to gather, but it gives a better understanding of what's best to test. The right approach depends on the circumstances, but I would always recommend evidence-based where possible.

Collecting Evidence to Generate Tests

There are multiple ways of collecting useful evidence, both qualitative and quantitative.

Qualitative evidence could come from, amongst other things, survey responses or interviews. There is nothing better than talking to your users, customers or supporters and asking them what frustrates them, what problems they have, or what they like about your organisation or others. People are normally happy to be asked their opinion, and if they interact with you, they generally want you to succeed. Their answers will leave you with a treasure trove of ideas, and it is up to you to prioritise testing them; you could even ask people directly what's most important to them.

Quantitative evidence could come from analysis of the data in your organisation. Are there places or processes where users, customers or supporters drop off? Are there products or processes with better than expected results that could be rolled out across the wider organisation? Are certain groups of people reacting better to a product than others? There are lots of ways in which quantitative data can help generate possible tests, which is why organisations pay good money for great analysts.
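
As one illustration, here is a minimal sketch of the kind of drop-off analysis that can surface test ideas, assuming a hypothetical event log with one row per user per funnel step reached (the step names and data are made up):

```python
import pandas as pd

# Hypothetical event log: one row per user per funnel step reached
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "step":    ["landing", "basket", "checkout",
                "landing", "basket",
                "landing", "basket", "checkout",
                "landing"],
})

# Count unique users at each step, in funnel order
funnel = (events.groupby("step")["user_id"]
                .nunique()
                .reindex(["landing", "basket", "checkout"]))

# Step-to-step conversion; the biggest drops are test candidates
drop_off = funnel / funnel.shift(1)

print(funnel)
print(drop_off)
```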

Generating Good Tests From Assumptions

In an era of big data and analytics, expertise can be unnecessarily discarded. You wouldn't ignore a deep learning neural network trained on ten years' worth of data, and you shouldn't ignore experienced people either.

Just like a deep learning neural network, there are ways to fine-tune experienced people's ideas for more effective results. The key is always to start and finish with user needs. If you want to run a test based on an evidence-free idea, it should be grounded in a user need, and writing it down in a specific format can help to clarify and justify it. It could be something as simple as:

As a (type of user), I need…

A solution for testing to satisfy this need would be…

Don't just generate one or two ideas in this format; get together as a team and generate loads. Role-play as your different types of users and try out your products or processes yourself. You could make it fun by gamifying things, and Post-it notes are always your friend!

Once you have the ideas, you need to prioritise them. Have your overall goals and objectives in mind and work out which tests are going to be the most effective in achieving those.

Step 2 – Make Better Samples

The golden sample is a random, representative sample. A representative sample allows you to be confident that the thing you are testing will work on the wider population. In academia, the classic analogy involves some soup. If I were to put salt into some soup without stirring it and took a spoonful close to where I added the salt, I would probably conclude that it was too salty. If I took a spoonful from the other side, not salty enough. The soup needs a good stir and you need to test people across the entire group if you want the right answers to your questions.
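
One common way to get a well-stirred split is to assign users to variants deterministically from a hash of their ID. A minimal sketch of that idea follows; the experiment name and hashing scheme are illustrative, not a standard:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "cta-button-test") -> str:
    """Assign a user to A or B from a hash of their ID.

    Salting with the experiment name means the same user can land
    in different groups across different experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Stable: the same user always sees the same variant
print(assign_variant("user-123"))
```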

Making a good sample can be similar to stirring soup. Everything needs to be well distributed. Photo by Jason Briscoe on Unsplash

There are many ways to make sure your sample is properly stirred and free from bias. First things first: make sure your sample is large enough to draw statistically significant conclusions. You can do this with a power analysis, a standard calculation that takes your baseline rate, the smallest effect you care about detecting, your significance level and your desired statistical power, and returns the number of people required per group.
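
A minimal sketch of such a power analysis, assuming a conversion-style test with illustrative rates (statsmodels is one common choice for this):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # current conversion rate (illustrative)
target = 0.12    # smallest uplift worth detecting (illustrative)

# Cohen's h measures the gap between two proportions
effect = abs(proportion_effectsize(baseline, target))

n_per_group = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,   # acceptable false-positive rate
    power=0.80,   # chance of detecting the effect if it is real
    ratio=1.0,    # equal group sizes
)
print(round(n_per_group))  # required people per variant
```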

Second, you need to consider the possible biases that might arise from your sampling and your testing. There are many different types of bias, and they will affect your testing to different degrees. It is important to minimise bias, but it is also important not to get so bogged down chasing marginal gains that you end up doing less testing.

The solutions for minimising bias depend on what you are testing and how, but there are many standard approaches. For example, in UK politics, how people vote is strongly correlated with age. Polling companies often get proportionally more respondents of certain ages than exist in the voting public, meaning the voting intentions of the sample will differ from those of the country. Their solution is to weight responses, so that answers from people in undersampled age groups contribute more to the end result than those from groups with lots of responses.
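
A minimal sketch of that kind of weighting, using made-up age-group shares for the population and an oversampled older group:

```python
import pandas as pd

# Made-up shares; real values would come from census data or similar
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

sample = pd.DataFrame({
    "age_group": ["18-34"] * 10 + ["35-54"] * 40 + ["55+"] * 50,
    "response":  [1] * 6 + [0] * 4 + [1] * 18 + [0] * 22 + [1] * 15 + [0] * 35,
})

sample_share = sample["age_group"].value_counts(normalize=True)

# Upweight undersampled groups, downweight oversampled ones
sample["weight"] = sample["age_group"].map(
    lambda g: population_share[g] / sample_share[g]
)

raw = sample["response"].mean()
weighted = (sample["response"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"raw: {raw:.3f}, weighted: {weighted:.3f}")
```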

Bias may also creep in through the method of collecting the results. For example, surveys may be affected by psychological biases. A current example is COVID-related surveys. If you were to compare the percentage of people on YouGov who say they wear masks at all times against a simple count of the people you can actually see wearing masks (as representative a count as you can manage!), which percentage would be higher? This is an example of social desirability bias, where people answer surveys in the way they think society expects them to, even when the survey is anonymous and online. Bias minimisation matters throughout testing, so consider which types are most likely to affect your tests and guard against them.

Step 3 – Act on Your Results

Act, Iterate, Ignore should be the three-word mantra of anyone with even the slightest connection to testing in an organisation. Statistically significant test results are only useful if you actually roll out the changes more widely. For example, if a test shows that a large call-to-action button at the top of a retention email improves retention, then always put a large call-to-action button at the top of your emails. Making the changes permanent means you can build on them further and continue to improve the user experience.
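
Before acting, it is worth confirming the result really is significant. A minimal sketch of a two-proportion z-test with statsmodels, using made-up retention counts:

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up counts: users retained out of users emailed, per variant
retained = [220, 290]   # A (control), B (large call-to-action button)
emailed = [2000, 2000]

stat, p_value = proportions_ztest(count=retained, nobs=emailed)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# Roll the change out only if p is below your pre-agreed threshold, e.g. 0.05
```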

You need to act on your results. Otherwise, what is the point of testing? Photo by Jonny Rogers on Unsplash

Obviously, maintaining those changes and communicating them within an organisation can sometimes be tough, so think of ways of mitigating that. This could be anything that works for the organisation, such as centralised learning logs shown to people at inductions, guidelines, training sessions, or even go/no-go checklists for the people or teams responsible for final outputs such as social media posts and email content.

Iteration occurs when your results weren't significant but you have an idea for tweaks that might make a difference, when unexpected biases cropped up, or when the result was significant but had an unexpected consequence. For example, when testing two different user journeys, one journey had a higher success rate on the final page, but absolute numbers across the whole journey decreased. Iterating should be quicker to set up, as most of the infrastructure is already in place.

Ignoring occurs when the results show no significant change and no tweaks can be made to run an iteration. Don’t be disheartened. Most tests end up with no significant changes, which is why it is important to keep on testing. Some organisations run hundreds of tests a year, whereas I have seen others running literally one a year. Be the former, not the latter, and you will soon see your user experience and therefore your results improving.

