Complementing A/B Testing with Machine Learning and Feature Importance

How feature importances helped me draw the right conclusions from A/B testing

Alessandro Kosciansky
Towards Data Science



In this post I want to focus on using machine learning to complement other data science tasks, in particular A/B testing. This is meant to be a practical post rather than a theoretical discussion, and I assume that you are at least somewhat familiar with A/B testing, Random Forest, and feature importance.

I came across an interesting challenge recently that required me to run an A/B test to determine whether one auction type sold cars faster than another. Simply running the A/B test would have concluded that one auction type did indeed sell cars faster, but it turned out that the auction type wasn't the primary driver behind the faster selling times. It was the lower prices associated with that auction type. This could have had dire consequences if the company selling these cars had decided to focus on selling via that one auction type instead of focusing on pricing first.

I uncovered this by running a Random Forest on the dataset and inspecting the feature importances. In fact, I generally believe that machine learning is a great complementary tool for Exploratory Data Analysis (EDA) as well as A/B testing.

The Data

Let's first have a quick look at the data. For the A/B test, what matters are sales_channel, i.e. the two different auction types, and the selling time, which is the difference between sold_date and bought_date.
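As a quick illustration, deriving the selling time from the two date columns could look like the sketch below; the toy rows are made up, but the column names match the ones above.

```python
import pandas as pd

# Hypothetical rows standing in for the real auction data
df = pd.DataFrame({
    'sales_channel': [1, 1, 2, 2],
    'bought_date': ['2020-01-01', '2020-02-10', '2020-01-05', '2020-03-01'],
    'sold_date':   ['2020-01-15', '2020-02-20', '2020-03-05', '2020-05-30'],
})

# Selling time in days: difference between sold_date and bought_date
df['selling_time'] = (pd.to_datetime(df['sold_date'])
                      - pd.to_datetime(df['bought_date'])).dt.days
print(df[['sales_channel', 'selling_time']])
```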

The Challenge

The challenge was to run an A/B test to say whether auction type 1 or 2 sells cars faster. If we can conclude that one type sells faster than the other, the company could, for example, focus on selling more through that auction type.

A/B Test

It turns out that the selling-time distributions of both auction types are not normal; both are skewed with a long tail. Thanks to the Central Limit Theorem, however, we can still work with approximately normal distributions: we repeatedly draw random samples from each auction type and take the mean selling time of each sample. If we do this 1,000 times, we go from Graph 1 (which shows the selling times per auction type) to Graph 2 (which shows the distribution of the average selling time per auction type).

# Do the sampling
results = []  # empty list into which I insert the sampled means
random_state = np.arange(0, 1000)  # random seeds for reproducibility
# sample with replacement using 50% of the data; do this 1000 times
# and append the mean selling time per channel to the list 'results'
for i in range(1000):
    sample = df.sample(frac=0.5, replace=True,
                       random_state=random_state[i]) \
               .groupby(by='sales_channel')['selling_time'].mean()
    results.append(sample)
results = pd.DataFrame(results)
Graph 1 — Distribution of Selling Time in Days
Graph 2 — Average Selling Time in Days

What wasn't so obvious in Graph 1 becomes very obvious in Graph 2: the average selling time of auction type 1 is much shorter than that of auction type 2. Running an A/B test confirms this with a p-value of effectively 0. The test almost seems redundant here, as the two distributions don't even overlap.
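The p-value itself can be obtained with, for example, Welch's t-test on the two columns of sampled means. A minimal sketch, assuming `results` has one column of bootstrapped means per sales channel (the synthetic data below only stands in for the real sampled means):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for the sampled-means DataFrame built earlier; in the real data
# each column holds 1000 bootstrapped mean selling times per sales channel
results = pd.DataFrame({
    1: rng.normal(loc=30, scale=1.0, size=1000),  # auction type 1
    2: rng.normal(loc=55, scale=1.5, size=1000),  # auction type 2
})

# Welch's t-test (no equal-variance assumption) on the two channels
t_stat, p_value = stats.ttest_ind(results[1], results[2], equal_var=False)
print(t_stat, p_value)
```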

If we now concluded that the company should only sell through auction type 1 because its cars sell much faster, we'd probably be making a mistake.

What if there were other characteristics/features that make the two auction types very distinct, apart from selling time? This dataset has only a few features, so we could draw more distributions, calculate correlations where possible, and so on. An easier and more effective approach is to let machine learning help us out; this becomes particularly useful, if not necessary, when we have many more features.

Feature Importance to the Rescue

Most machine learning algorithms have a method for calculating feature importance, that is, how big a role each feature played in predicting the target (dependent) variable. Here the target is selling time.

I ran a Random Forest on the dataset and calculated the feature importances. I also used CatBoost, a gradient-boosted tree method, and came to the same conclusion, so I'll stick with Random Forest here, as it's a much easier algorithm to get your head around.

# Set up the model and define its parameters; let's keep it simple
rf = RandomForestRegressor(n_estimators=100, max_depth=5)
# Fit the model
rf.fit(X_train, y_train)
# Get the mean feature importances
importances = rf.feature_importances_
# Calculate the standard deviation of the importances across trees
std = np.std([tree.feature_importances_ for tree in rf.estimators_],
             axis=0)
# Sort the features by importance (ascending, so the most important
# feature ends up at the top of the horizontal bar plot)
indices = np.argsort(importances)
# Plot the feature importances
plt.figure(figsize=(12, 8))
plt.title("Feature importances")
plt.barh(range(X.shape[1]), importances[indices],
         color="r", xerr=std[indices], align="center")
plt.yticks(range(X.shape[1]), X.columns[indices], rotation=0)
plt.ylim([len(importances) - 6, len(importances)])
plt.show()
Random Forest Feature Importance
SHAP Values

On the left you see the importance of each feature for predicting selling time: the longer the bar, the more important the feature. Buy price is by far the most important feature, whereas sales channel (the auction type) hardly matters at all.

I then calculated another type of feature importance called SHAP values. The graph reads as follows: each dot is a row/observation in the dataset. Negative SHAP values mean that the feature value reduced the predicted target (selling time). Blue indicates that the feature value was low, red that it was high. Looking at buy price, we see that all negative SHAP values are blue, i.e. low buy prices led to predictions of short selling times. Note, however, that there are also some blue dots with positive SHAP values; in those cases a low buy price led to the opposite prediction of a longer selling time. This can happen when features interact with each other. Interestingly, sales channel (the auction type) is split up very cleanly, with auction type 1 in red and auction type 2 in blue.

# Plot the overall feature importance using SHAP
# ('cat' is the fitted CatBoost model mentioned above and
#  'pool_val' a catboost.Pool holding the validation data X_val)
shap.initjs()
explainer = shap.TreeExplainer(cat)
shap_values = explainer.shap_values(pool_val)
shap.summary_plot(shap_values, X_val)
Average selling price distributions per auction type

Before we wrap up, let's briefly look at buy prices by auction type; the feature importances had already made us suspect a difference. The two auction types have very different price distributions: auction type 1 (ca. $6k on average) has much lower prices than auction type 2 (ca. $24k).
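A comparison like this can be done with a simple groupby; a minimal sketch with made-up prices (the real averages from the article are ca. $6k and $24k):

```python
import pandas as pd

# Toy stand-in for the auction data; the prices are illustrative only
df = pd.DataFrame({
    'sales_channel': [1, 1, 1, 2, 2, 2],
    'buy_price': [5500, 6000, 6500, 23000, 24000, 25000],
})

# Mean buy price per auction type, the comparison discussed above
mean_prices = df.groupby('sales_channel')['buy_price'].mean()
print(mean_prices)
```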

Conclusion

Simply running an A/B test to see whether one auction type sells cars faster than another would probably have led to the wrong conclusion. Running Random Forest or any other machine learning algorithm and then calculating the feature importances gives you a much better picture of which features influence the target you're interested in. I found that using machine learning to complement A/B tests helped me make better decisions. I hope it helps you, too.


