
Key-Driver Analysis in Python

A guide to answering the question, "What drives the winningness of candy?"


Image by author, dashboard available here

In a Data Science interview a few years ago, I was challenged to use a small data set from our friends at FiveThirtyEight to suggest how best to design a good-selling candy. "Based on ‘market research’ you see here," the prompt gestured, "advise the product design team on the best features of a candy that we can sell alongside brand-name candies."

The original dataset from FiveThirtyEight, available online here

In the applied, commercial world of data science, the word best is always a weasel word intended to test your business awareness. One of the tell-tale signs of a greener data scientist is whether they’re thinking about the best business outcome vs. the best Machine Learning model. ‘Best’ is a balance of ‘what candy elements drive the highest satisfaction/enjoyment?’ and ‘what candy elements drive the highest price?’ We’re basically trying to find a balance between

  1. a candy guaranteed to delight consumers, that
  2. occupies a niche enough space such that it’s not just ‘knock-off, discount M&Ms’, and
  3. is also cost-optimized to increase profit margins by being cheaper than M&Ms.

Our friends at FiveThirtyEight made a grave statistical error (or two) when trying to solve for #1.

This tutorial walks through doing ‘key driver’ analysis in Python using the proper statistical tools, breaking away from the FiveThirtyEight methodology. Along the way, I explain 1) why data scientists and product strategists should trust my numbers more, and 2) how to communicate those results in a way that gains that trust (see my candy dashboard).

Here’s the roadmap of this article:

  • Methodology: linear regression isn’t right, use relative weight analysis instead
  • Implementation: Doing RWA in Python for candy flavor and price
  • Triangulation: Why the business should trust the RWA via triangulation

Statistics First, ML Second

Let’s arrive at our statistical methodology by way of understanding why linear regression is not the right answer… or at least not entirely the right answer.

FiveThirtyEight builds a multiple regression, including all possible features of a candy captured in their data set. Importance is then read off the coefficients of the linear regression, using each dimension’s p-value to decide whether the coefficient can be taken as reliable.
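For reference, that approach amounts to something like the sketch below, using statsmodels and the column names from the FiveThirtyEight CSV; the exact feature set and library they used are my assumptions, not theirs.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed local path to the FiveThirtyEight candy data
df = pd.read_csv('candy-data.csv')

# One multiple regression with every candy feature as a predictor
model = smf.ols(
    'winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat'
    ' + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent',
    data=df,
).fit()

# Coefficients and p-values -- the quantities importance gets read from
print(model.summary())
```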

Equation for multiple regression
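(The pictured equation is presumably the standard multiple-regression form,

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon $$

where each $\beta_i$ is the coefficient on feature $x_i$ and $\varepsilon$ is the error term.)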

However, looking at the equation for a linear regression, we see a pretty significant problem. Remember: an OLS regression coefficient tells us whether an increase in the independent variable correlates to an increase in the mean of the dependent variable (and vice versa for a negative coefficient). But it is not, by itself, a measure of magnitude.

In our candy problem, if we build an OLS regression to predict the winning-ness of a candy bar, and we change a weight feature’s units from grams to pounds, we would get a much higher coefficient. Nothing about the candy changed besides the units. The argument here might be made that you can standardize your variables, such as normalizing to zero mean and unit variance. Yet even with normalization, you may have an issue of collinearity – if predictors are linearly dependent or highly correlated, the OLS becomes unstable, and that remains true even when the independent variables are standardized.
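Here’s a tiny, self-contained illustration of the units problem (a toy weight feature I made up, not a column in the candy data): the fit is identical, but the coefficient balloons when the same values are expressed in pounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight feature in grams, and a noisy outcome it drives
grams = rng.uniform(20, 100, size=200)
wins = 0.05 * grams + rng.normal(0, 1, size=200)
pounds = grams / 453.6  # same information, different units

slope_grams = np.polyfit(grams, wins, 1)[0]
slope_pounds = np.polyfit(pounds, wins, 1)[0]

# Identical model, but the coefficient is ~453.6x larger in pounds
print(slope_grams, slope_pounds)
```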

So we need another tool in our toolbelt. Something that can help us get around the bad assumptions about coefficients. While ML may not have the answer, statistics does.

Relative Weight Analysis, and why R² is the real Jedi

Here, we’ll implement code that will tell us how much each feature/independent variable contributes to criterion variance (R²). In its raw form, Relative Weight Analysis returns raw importance scores whose sum equals the overall R² of a model; its normalized form allows us to say "Feature X accounts for Z% of variance in target variable Y." Or, more concretely,

"Assuming that a key driver of what makes a candy popular is captured here, Chocolates with nuts is the winning-est flavor combination."

Relative weight analysis relies on the decomposition of R² to assign importance to each predictor. Where intercorrelations between independent variables make it near impossible to take standardized regression weights as measures of importance, RWA solves this problem by creating predictors that are orthogonal to one another and regressing on these, free of the effects of multicollinearity. The resulting weights are then transformed back to the metric of the original predictors.

Example of 3-feature RWA, from here

We have multiple steps involved, and rather than demonstrate in notation, I’ll walk through a Python script that you’ll be able to use (please credit this post if you use it verbatim or spin off your own version!). I’m assuming that at this point you’ve done the EDA and manipulations necessary to build a logically and mathematically sound model.

Implementing RWA in Python

Step 1: Get the correlation matrix covering all of the dependent and independent variables.
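A minimal sketch of this step, assuming the FiveThirtyEight CSV is loaded into a pandas DataFrame and using the binary flavor/form columns plus sugarpercent as predictors (the exact feature list and file path are choices you can swap out):

```python
import numpy as np
import pandas as pd

df = pd.read_csv('candy-data.csv')  # assumed local path to the 538 data

predictors = ['chocolate', 'fruity', 'caramel', 'peanutyalmondy', 'nougat',
              'crispedricewafer', 'hard', 'bar', 'pluribus', 'sugarpercent']
target = 'winpercent'

# One correlation matrix covering both the predictors and the criterion
corr = df[predictors + [target]].corr()
corr_xx = corr.loc[predictors, predictors].values  # predictor intercorrelations
corr_xy = corr.loc[predictors, target].values      # predictor-criterion correlations
```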

Step 2: Create orthogonal predictors by taking the eigenvectors and eigenvalues of the correlation matrix and building a diagonal matrix of the square roots of the eigenvalues. This gets around the issue of multicollinearity. Note the Python trick for getting the diagonal indices.
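Continuing the same sketch (np.diag_indices_from is the diagonal-index trick mentioned above):

```python
# Eigendecomposition of the predictor correlation matrix
eigvals, eigvecs = np.linalg.eigh(corr_xx)

# Diagonal matrix holding the square roots of the eigenvalues
delta_sqrt = np.zeros_like(corr_xx)
delta_sqrt[np.diag_indices_from(delta_sqrt)] = np.sqrt(eigvals)
```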

Step 3: Multiply the eigenvector matrix, the diagonal matrix from Step 2, and the eigenvector matrix’s transpose (in Python, we can use @ as an operator, called matmul). This allows us to treat X as the set of dependent variables, regressing X onto matrix Z – itself the orthogonal counterpart of X with the least squared error. To get the partial effect of each independent variable, we apply matrix multiplication to the inverse of that matrix and the correlation matrix.
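In code, still building on the previous blocks (the variable names are mine, chosen to match the note below):

```python
# Lambda: the symmetric square root of corr_xx, i.e. the weights relating
# the original predictors X to their orthogonal counterpart Z
lambda_ = eigvecs @ delta_sqrt @ eigvecs.T

# Regress the criterion on Z: beta = Lambda^-1 @ corr_xy
coef_yz = np.linalg.inv(lambda_) @ corr_xy

# The squared coefficients sum to the model's total R^2
r_squared = np.sum(coef_yz ** 2)
```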

NOTE: As mentioned, the sum of the squares of coef_yz above should add up to the total R²! This will be important in the next step!

Step 4: We then calculate the raw relative weights by multiplying the element-wise squares of the matrix and coefficients from Step 3. The normalized version is then each weight’s percentage of the total R²!
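Sketched out with the same variable names:

```python
# Raw relative weights: element-wise squares of Lambda times the squared betas
raw_weights = (lambda_ ** 2) @ (coef_yz ** 2)

# Normalized weights: each predictor's share of the total R^2
norm_weights = raw_weights / r_squared
```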

Now, you can just zip up your features and these two lists to get the relative weight of each one as it ‘drives’ (or, more mathematically, accounts for variance in relation to increases in) the percentage of wins in the candy duels.
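For example:

```python
# Pair each feature with its raw and normalized relative weight
for feature, raw, norm in zip(predictors, raw_weights, norm_weights):
    print(f'{feature}: raw weight = {raw:.3f}, share of R^2 = {norm:.1%}')
```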

Raw & Normal relative weights for predictiveness of winning-ness

Chocolate and nuts win on flavor, but we’re not done. We also have to make money by generating value from the candy we create, measured by revenue minus costs. Using the price percentiles in our table, we can also look at what drives the prices of candy.

Raw & Normal relative weights for predictiveness of price percentile

Communicating results to the business

We have two responsibilities here: make recommendations, and make stakeholders comfortable with acting on them.

plt is always a good scientific way to visualize things on the fly. Good science and bad visualizations go hand in hand for many stakeholders.

Recommendation 1: Chocolate accounts for 38% of what makes a candy a winner, while peanuts account for 15% when we look at key drivers of win percentage. We can confirm this by looking at the relative win percentage of candies that are both chocolatey and peanutty in our sample (a quick check is sketched below).

Recommendation 2: Pluribus candies are the weakest driver of price (3% relative weight) relative to other candy forms, which helps offset our chocolate and nut costs.

Recommendation 3: According to our data, savory pluribus chocolatey peanutyalmondy candies are not readily available despite higher performance. There were only 2 captured in our dataset.
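The chocolate-plus-peanut confirmation from Recommendation 1 can be done straight off the raw table, again assuming the FiveThirtyEight column names:

```python
# Mean win percentage and count for each chocolate / peanutyalmondy combination
print(df.groupby(['chocolate', 'peanutyalmondy'])['winpercent'].agg(['mean', 'count']))
```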

These recommendations can be further backed up by analyzing candy in the Amazon grocery data, which has 2.9M candy reviews.

While fruit out-ranks nuts in flavor/review score in the Amazon data, nuts remain in the top 3.

Relative weight analysis of flavors' importance in predicting review scores of ~3M candies on Amazon Groceries

In addition, a topic model built from high-selling (>50th percentile), highly-ranked (>3-star) Amazon chocolate and nut reviews further suggests that our flavor profile is a winning decision.

This is an example of an insight delivery slide. The final version of my slideshow had 5 slides: Recommendations up front, then RWA visualizations, then confirming results, and finally a ‘this is where the risk/uncertainty is, and how we’ve tried to minimize it’ slide.

Conclusion

Indeed, we might think of Key-Driver analysis as somewhat of a hunt for the holy grail: there are so many possible confounding variables that knowing which lever to pull to drive our target variable can feel like a shot in the dark. But we can gain confidence by abandoning the traditional (albeit mistaken) use of linear regression coefficients and moving to relative weight analysis. And, to be sure, this version still has room for improvement – if you’re paying attention to that R² total in the flavor profiles, and try it yourself with pricing, you’ll see that there are gains to be had through various mathematical optimizations.

Hidden here is a story about how data scientists must be well-versed in the statistics and math behind reducing business risk and maximizing the potential of decisions. Otherwise, we’re just low-rent software devs/engineers solving input/output problems in ways that could cost the business significantly.

For those DS leaders and managers reading, this example shows how building data science teams without a mix of both statistics-forward and technology-forward data scientists is a recipe for disaster. Why? Well, in this case, ‘how do we make the best candy?’ is at once a statistics problem and a business-domain question, and it isn’t readily solvable with an sklearn API call.

