Validating A/B Test Results: SQL Case Study 2

Determining whether a new feature is the real deal or too good to be true.

Shweta Yadav
Towards Data Science

--

“I am who I am today because of the choices I made yesterday” — Eleanor Roosevelt

The above lines by the former First Lady of the United States signify the importance of choices for individuals and companies alike. Companies that regularly update or change features and products based on customer demand tend to grow faster than companies that avoid adaptation.

Whether a change is a minor UI tweak or a large-scale new feature, it is always best to test our assumptions before rolling it out. That is where A/B testing comes into the picture.

What is A/B Testing?

A/B testing (also known as split testing) is a process of showing two versions of the same webpage or app to different segments of users at random and comparing which one drives more conversions.

Testing takes the guesswork out of website optimization and enables data-informed decisions that shift business conversations from “we think” to “we know”. It also helps to make the most out of existing traffic and increase conversion without having to spend on acquiring new traffic.

“If you double the number of experiments you do per year, you’re going to double your inventiveness.” — Jeff Bezos

Challenges of A/B Testing

A structured A/B testing program can make marketing efforts more profitable by pinpointing the most crucial problem areas that need optimization. But there are a number of challenges associated with performing A/B testing. A few of them are: deciding what to test and for how long, formulating correct hypotheses, choosing sample sizes for the control and treatment groups, and, most importantly, analyzing the test results.

“Less than 25% of A/B tests produced significant positive results” — AppSumo

Analyzing the Results of an A/B Test on the Yammer Website

In the previous blog post, I performed a case study on Yammer data to analyze a drop in user engagement.

Here, I will use the Yammer data again to understand a new feature’s effect on user behavior and the overall user experience.

The Problem

Yammer is planning to improve the core “publisher” — the module at the top of a Yammer feed where users type their messages. The product team ran an A/B test from June 1 through June 30. During this period, some users who logged into Yammer were shown the old version of the publisher (the “control group”), while other users were shown the new version (the “treatment group”).


On July 1, the results of the A/B test indicate that message posting is 50% higher in the treatment group — a huge increase in posting. The job is to determine whether this feature is the real deal or too good to be true.

The table below summarizes the results:

Image by Author

The chart shows the average number of messages posted per user by each group. The additional test result details are the following (a query sketch that reproduces the per-group summary follows the list):

  • users: The total number of users shown that version of the publisher.
  • total_treated_users: The total number of users in the experiment, across both groups.
  • treatment_percent: The number of users in that group as a percentage of the total number of treated users.
  • total: The total number of messages posted by that treatment group.
  • average: The average number of messages per user in that treatment group (total/users).
  • rate_difference: The difference in posting rates between the treatment and control groups (group average minus control group average).
  • rate_lift: The percent difference in posting rates between the treatment and control groups ((group average / control group average) - 1).
  • stdev: The standard deviation of messages posted per user for users in that group. For example, if there were three people in the control group and they posted 1, 4, and 8 messages, this value would be the standard deviation of 1, 4, and 8 (which is 2.9).
  • t_stat: A test statistic for determining whether the average of the treatment group is statistically different from the average of the control group. It is calculated from the averages and standard deviations of the treatment and control groups.
  • p_value: Used to determine the test’s statistical significance. The smaller the p-value, the more likely you are to reject the null hypothesis.
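To make these column definitions concrete, the per-group summary (users, total, average, stdev) could be reproduced with a query along the following lines. This is only a minimal sketch, not the query behind the table above: the experiment label 'publisher_update', the 'send_message' event name, and the June 2014 date window are assumptions about how the data is recorded and may need adjusting.

-- Per-group user counts, total messages, average messages per user,
-- and the standard deviation of messages per user (zeros included for
-- treated users who never posted).
SELECT e.experiment_group,
       COUNT(e.user_id)                AS users,
       SUM(COALESCE(m.messages, 0))    AS total,
       AVG(COALESCE(m.messages, 0))    AS average,
       STDDEV(COALESCE(m.messages, 0)) AS stdev
  FROM tutorial.yammer_experiments e
  LEFT JOIN (
       -- messages sent per user during the assumed test window
       SELECT user_id,
              COUNT(*) AS messages
         FROM tutorial.yammer_events
        WHERE event_name = 'send_message'
          AND occurred_at >= '2014-06-01'
          AND occurred_at <  '2014-07-01'
        GROUP BY 1
       ) m
    ON m.user_id = e.user_id
 WHERE e.experiment = 'publisher_update'   -- assumed experiment label
 GROUP BY 1

The remaining columns (rate_difference, rate_lift, t_stat, p_value) are then derived from these per-group figures.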

The Data

The dataset consists of four tables.

Table 1: Users — This table includes one row per user, with descriptive information about that user’s account.

Table 2: Events — This table includes one row per event, where an event is an action that a user has taken on Yammer. These events include login events, messaging events, search events, events logged as users progress through the signup funnel, and events around received emails.

Table 3: Experiments — This table shows which group users were sorted into for each experiment. There should be one row per user per experiment (a user should not be in both the test and control groups of a given experiment).

Table 4: Normal Distribution — This table is purely a lookup table, similar to what you might find in the back of a statistics textbook. It is equivalent to a standard Z-table, though it omits negative Z-scores.

Disclaimer: The data was generated for the purpose of this case study. It is similar in structure to Yammer’s actual data, but for privacy and security reasons it is not real. Here is the link to the datasets and original case study on Mode.

Validating the Results

Before digging into the data, I first listed a few possible causes that could explain the anomalous test results.

Cause 1: The test was calculated incorrectly. There may be a mistake in how the test’s variables were calculated or in the statistical methods used.

Cause 2: There might be something wrong with message posting rate as a metric. On its own, it cannot tell us whether customers are getting value out of Yammer, and it is insufficient to measure overall success. It makes sense to dig into a few other metrics to be sure that their outcomes were also positive.

Cause 3: The problem could be in the way users were assigned to the test and control groups. For example, all users in one group might use the same device (such as a phone, where posting a message is quick and convenient), belong to the same company with a stronger messaging culture, or all be long-time users prone to a novelty effect.

With the brainstorming out of the way (phew!), I analyzed the data to find the root cause of the anomaly in the A/B test results.


Analyzing Cause 1:

The A/B test above, which compares average posting rates between groups, uses a simple Student’s t-test to determine statistical significance. Furthermore, it is a two-tailed test, because the treatment group could perform either better or worse than the control group. These methodological choices, however, do not meaningfully affect the test results here. I also checked the query that calculates the lift and p-value, and there is no problem there either.
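To illustrate the check, here is a hedged sketch of how the significance calculation could be reproduced end to end: compute messages per user per group, derive each group’s size, mean, and standard deviation, form the unpaired two-sample statistic the table calls t_stat, and convert its absolute value into a two-tailed p-value via the normal-distribution lookup table (Table 4). The lookup table’s name and columns (score, value), its rounding granularity, the experiment label, and the date window are all assumptions for illustration.

WITH per_user AS (
     -- messages sent by each treated user during the assumed test window
     SELECT e.experiment_group,
            e.user_id,
            COUNT(ev.user_id) AS messages
       FROM tutorial.yammer_experiments e
       LEFT JOIN tutorial.yammer_events ev
         ON ev.user_id = e.user_id
        AND ev.event_name = 'send_message'
        AND ev.occurred_at >= '2014-06-01'
        AND ev.occurred_at <  '2014-07-01'
      WHERE e.experiment = 'publisher_update'   -- assumed experiment label
      GROUP BY 1, 2
),
group_stats AS (
     -- per-group size, mean, and standard deviation of messages per user
     SELECT experiment_group,
            COUNT(*)         AS n,
            AVG(messages)    AS avg_messages,
            STDDEV(messages) AS stdev_messages
       FROM per_user
      GROUP BY 1
),
stat AS (
     -- unpaired two-sample statistic: difference in means over its standard error
     SELECT (t.avg_messages - c.avg_messages)
            / SQRT(POWER(t.stdev_messages, 2) / t.n
                 + POWER(c.stdev_messages, 2) / c.n) AS t_stat
       FROM (SELECT * FROM group_stats WHERE experiment_group = 'test_group') t
      CROSS JOIN (SELECT * FROM group_stats WHERE experiment_group = 'control_group') c
)
SELECT s.t_stat,
       -- two-tailed p-value, assuming nd.value is the cumulative probability
       -- for the z-score nd.score; table name, columns, and precision are assumptions
       2 * (1 - nd.value) AS p_value
  FROM stat s
  JOIN tutorial.normal_distribution nd
    ON nd.score = ROUND(ABS(s.t_stat), 2)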

Analyzing Cause 2:

Metrics — whether revenue, customer satisfaction, or engagement — are the driving force behind a company’s advancement. Having the right metrics can mean the difference between success and failure.

Apart from the number of messages sent, login frequency (which Yammer uses as a core value metric) can help gauge the test’s success.
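As a sketch of what such a login comparison might look like (the 'login' event name and the June 2014 window are assumptions about how the events are labeled):

SELECT e.experiment_group,
       COUNT(DISTINCT e.user_id) AS users,
       COUNT(ev.user_id)         AS logins,
       -- average logins per treated user in each group
       COUNT(ev.user_id) / COUNT(DISTINCT e.user_id)::FLOAT AS logins_per_user
  FROM tutorial.yammer_experiments e
  LEFT JOIN tutorial.yammer_events ev
    ON ev.user_id = e.user_id
   AND ev.event_name = 'login'
   AND ev.occurred_at >= '2014-06-01'
   AND ev.occurred_at <  '2014-07-01'
 GROUP BY 1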

The query result shows that the number of logins is also higher for the test group. This means that not only are users in the treatment group sending more messages, they are also signing in to Yammer more often.

Image by Author

Analyzing Cause 3:

Validating the change across different cohorts is always a good idea; it checks whether some group characteristic is directly or indirectly affecting the test results. I divided the users into different segments to check whether they were correctly assigned to the test and control groups.

3.1 Users by Device

SELECT ex.device,
       ex.experiment_group,
       -- count distinct users so each user is counted once per device/group;
       -- the LEFT JOIN keeps users who never sent a message
       COUNT(DISTINCT ex.user_id) AS users
  FROM tutorial.yammer_experiments ex
  LEFT JOIN tutorial.yammer_events ev
    ON ex.user_id = ev.user_id
   AND ev.event_name = 'send_message'
 GROUP BY 1, 2
 ORDER BY 1, 2
Image by Author

The figure above shows that there is no problem with how users were assigned to the groups based on the device they use.

3.2 Users by Company

SELECT u.company_id,
       -- DISTINCT prevents users from being counted once per message
       -- after joining to the events table
       COUNT(DISTINCT CASE WHEN e.experiment_group = 'control_group' THEN u.user_id END) AS control_users,
       COUNT(DISTINCT CASE WHEN e.experiment_group = 'test_group' THEN u.user_id END) AS test_users
  FROM tutorial.yammer_experiments e
  JOIN tutorial.yammer_users u
    ON e.user_id = u.user_id
  JOIN tutorial.yammer_events ev
    ON u.user_id = ev.user_id
   AND ev.event_name = 'send_message'
 GROUP BY 1
 ORDER BY 1
 LIMIT 100
Image by Author

The groups are not skewed by company either. Even when a company has a stronger messaging culture, like company 1, its users are split roughly evenly between the control and test groups.

3.3 Users by their Age in Yammer

SELECT DATE_TRUNC('month', u.activated_at) AS month_activated,
       COUNT(CASE WHEN e.experiment_group = 'control_group' THEN u.user_id ELSE NULL END) AS control_users,
       COUNT(CASE WHEN e.experiment_group = 'test_group' THEN u.user_id ELSE NULL END) AS test_users
  FROM tutorial.yammer_experiments e
  JOIN tutorial.yammer_users u
    ON u.user_id = e.user_id
 GROUP BY 1
 ORDER BY 1
Image by Author

Splitting the users into new and existing cohorts reveals the main problem: all new users added just before the test was conducted were placed in the control group. Given their shorter exposure to Yammer, new users would be expected to post less than existing users, so including all of them in the control group lowers that group’s overall posting rate.

Long-time users of Yammer might try out a new feature just because it’s new, temporarily boosting their overall engagement. That is not the case for new users: for them, the feature isn’t “new,” so they are much less likely to use it just because it’s different.

Conclusion

After investigating the test’s calculation, methodology, choice of metric, and assignment of users to groups, the test appears to be biased in the sense that all new users were placed in the control group.

A reasonable next course of action would be either to re-analyze the test in a way that excludes new users or to re-run the test after randomly assigning both new and existing users to the test and control groups.
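A hedged sketch of the first option, re-computing the posting comparison with new users excluded, might look like this (the cutoff date, the 'send_message' event name, and the June 2014 window are assumptions to adjust against the actual data):

SELECT e.experiment_group,
       COUNT(DISTINCT e.user_id) AS users,
       COUNT(ev.user_id)         AS total_messages,
       -- average messages per user, with the new-user cohort removed
       COUNT(ev.user_id) / COUNT(DISTINCT e.user_id)::FLOAT AS avg_messages_per_user
  FROM tutorial.yammer_experiments e
  JOIN tutorial.yammer_users u
    ON u.user_id = e.user_id
   AND u.activated_at < '2014-06-01'   -- assumed cutoff: drop users activated around the test start
  LEFT JOIN tutorial.yammer_events ev
    ON ev.user_id = e.user_id
   AND ev.event_name = 'send_message'
   AND ev.occurred_at >= '2014-06-01'
   AND ev.occurred_at <  '2014-07-01'
 GROUP BY 1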

--

“If you really look closely, most overnight successes took a long time.” -Steve Jobs