Notes from Industry

The Gap in Your Data Strategy (Part 2)

A Data-Driven Solution to Cross The Gap

Nate Coleman
Towards Data Science
7 min read · Apr 10, 2021


In Part 1 of this piece I argued that businesses need to ask two key questions to become more effectively data-driven.

1. What data do we need to address this problem?
2. Is this data worth collecting?

If you haven’t read that post yet, I recommend you do before reading on.

In Part 1 I presented a scenario for my hypothetical business, which curates an email list of data science content. As a quick recap, my email list follows a freemium model: anyone can sign up and get one article per week delivered to them for no charge. Folks can also subscribe to a premium plan and get three insightful data-related articles delivered to them each week, instead of just one.

As an entrepreneur who’s incentivized to get more content out to more people and earn a living in the process, I want to get as many premium subscribers as possible. Right now, 2% of visitors to my landing page, or 200 out of every 10,000, sign up for my email list. Not bad.

Of those 200 people who sign up, 5%, or 10 of them, upgrade to premium within the first 3 emails they receive. These are folks I assume would have subscribed to premium without any additional intervention on my part, so we’ll ignore them for now, since we think they’d sign up regardless of any of the interventions discussed here.


Of the folks who sign up and don’t upgrade to premium within their first 3 emails, 3% will subscribe to premium within the first 52 emails they receive (approximately one year, assuming diligence on my part!). So for every 10,000 visitors, about 190 folks sign up without immediately intending to become premium subscribers, and I expect about 6 of them to get free content for a while, eventually see the great value of my content, and go premium.
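For concreteness, here’s that funnel arithmetic as a few lines of Python (the rates are exactly the ones described above; nothing new is assumed):

```python
# Baseline funnel, using the rates described above, per 10,000 visitors.
visitors = 10_000
signups = visitors * 0.02        # 2% sign up               -> 200
immediate = signups * 0.05       # 5% upgrade right away    -> 10
remaining = signups - immediate  # set the 10 aside         -> 190
eventual = remaining * 0.03      # 3% convert within a year -> ~5.7

print(f"expected eventual premium conversions: {eventual:.1f}")  # ~6
```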

How can I increase this number from 6?

As I described in Part 1 of this post, I think I can achieve this by sending each free reader whichever of the three articles I write each week would interest them most. My hypothesis is that sending the most relevant article to each user will help them see the value of my email list. Seems pretty reasonable.

So I ask the question: What data do I need to address this problem?

In order to find out what my users are interested in, I want to add a step to my signup flow asking folks to choose up to three data science topics they’re most interested in (e.g., machine learning, visualization, data strategy).

However, there is no free lunch in this world. My hypothesis is that adding a step to my sign-up flow will decrease my signup rate, since impatient readers may be deterred by yet another online form.

So now I ask the question: Is this data worth collecting?

This question can be rephrased to fit nicely into an analytical context: is the net benefit of collecting this data positive? This type of question can be addressed with a cost-benefit approach.

One common problem with cost-benefit analyses is that they require an analyst to make many static assumptions, which can be unrealistic or ill-informed. The fallibility of these assumptions can undermine your entire analysis, so I want to be proactive about addressing this concern.

That’s why I’ll use a simulation that incorporates some randomness to perform the cost-benefit analysis. I’ll walk through the framework for this simulation and its results. If you’d like to see the code for the simulation, you can check it out on my Github page.

Cost-Benefit Simulation

Acknowledge Your Bias, and Keep it In Mind

Before you start any coding, the first step an analyst should take is to acknowledge any bias they have. For example, I may be really excited at the idea of collecting this data on my readers’ preferences, which could lead me to design a cost-benefit analysis that comes to the conclusion of “yes, great idea!”

It’s okay to have this bias. What’s important is to acknowledge it and keep it in mind as you do your analysis. When you make an assumption, ask yourself whether it would still seem realistic if you held the exact opposite belief to the one you’re currently biased towards. This will keep you honest and your analysis robust.

Simulation Design

This simulation has 3 steps, which mirror the key events in the sign-up/conversion process for my email list. I chose to simulate 90 days’ worth of activity, but this can be easily changed in the simulation code.

  1. Daily Traffic: Based on historical data I know that I get about 1,000 visitors a day on average, and daily traffic falls between 900 and 1,100 ~95% of the time. My daily traffic is approximately normally distributed, so I assume it will continue to follow this distribution going forward.
  2. Signup: For each run of my simulation, I assume we continue to see a baseline signup rate of 2%. However, instead of assuming a fixed impact on our sign-up rate from adding the new form to the flow, we allow some variability, since at this time we’re not confident in exactly what this impact will be.
    Our best guess is that adding this form will decrease our signup rate from 2% to 1.9%, deterring 10 out of every 200 signups. Again, we’re not totally sure about this effect, so we allow for some variability: we assume that 95% of the time our signup rate effect will be between 0 and 0.2 percentage points.
  3. Conversion Rate: Similar to the signup rate effect, we want to add some variability in what our outcome will be. We think it’s reasonable for this new targeting exercise to increase our conversion rate from 3% to 4%, an increase of 1 percentage point, but also think it’s reasonable if 95% of the time the effect falls between 0 and 2 percentage points.

Note that we assume the sign-up and conversion rate effects follow normal distributions. We chose the normal distribution because it makes values close to our assumed mean more likely, but it doesn’t rule out unexpected values, such as a negative effect of our targeted article sending. We don’t think that scenario is likely, but it’s not out of the realm of possibility, so we want to model that uncertainty.
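To make the design concrete, here’s a minimal Python sketch of these three steps. It’s a simplification, not the full code from my Github: the standard deviations are backed out from the 95% ranges above (a 95% range spans roughly ±2 standard deviations), and the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the runs are reproducible

N_DAYS = 90

def simulate_once(rng, signup_hit_mean=0.001, signup_hit_sd=0.0005,
                  lift_mean=0.01, lift_sd=0.005):
    """One 90-day run; returns the extra premium subscribers gained
    relative to a counterfactual with no extra signup form."""
    # 1. Daily traffic ~ Normal(1000, 50): ~95% of days fall in [900, 1100].
    traffic = rng.normal(1_000, 50, size=N_DAYS)

    # 2. Signup-rate hit ~ Normal(0.1pp, 0.05pp): ~95% within [0, 0.2pp].
    signup_hit = rng.normal(signup_hit_mean, signup_hit_sd)
    signups_new = traffic * (0.02 - signup_hit)  # with the extra form
    signups_old = traffic * 0.02                 # counterfactual, no form

    # Set aside the 5% who upgrade immediately; they convert either way.
    eligible_new = signups_new.sum() * 0.95
    eligible_old = signups_old.sum() * 0.95

    # 3. Conversion lift ~ Normal(1pp, 0.5pp): ~95% within [0, 2pp].
    lift = rng.normal(lift_mean, lift_sd)
    return eligible_new * (0.03 + lift) - eligible_old * 0.03

extra_subscribers = np.array([simulate_once(rng) for _ in range(10_000)])
```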

Presentation of Results: Frame Your Analysis in the Context of the Business

I often hear cries for data scientists to have empathy by putting themselves in the shoes of their audience when presenting their analytical results. I think this is generally a good practice, because it forces you to put your analysis in the context of the business problem you’re trying to solve. But rather than focusing on the individuals in the room, frame your presentation around the business problem itself; that way your stakeholders, and frankly anyone at your company, will be able to understand the results.

In this example, I’m trying to decide whether I’ll earn more money by collecting more user data and curating my content for users, at the expense of potentially deterring some folks from signing up. As such, I want to frame my results in a way that demonstrates the financial net benefit of this proposed solution. To do this, I need to layer in one more assumption: how much a paid subscriber is worth to me today.

In this example, let’s say I’ve run this analysis and found that on average a paid subscriber generates $60 for my business. At $5/mo. for the subscription, that means the average premium subscriber stays on for 12 months.
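Continuing the sketch above, one way to layer the $60 value onto the simulated subscriber counts and summarize the runs looks like this (again a simplified sketch; the exact distribution you get depends on modeling details in the full code):

```python
import matplotlib.pyplot as plt

PREMIUM_VALUE = 60.0  # $5/month x 12 months of average tenure

revenue_impact = extra_subscribers * PREMIUM_VALUE

print(f"mean impact:   ${revenue_impact.mean():,.0f}")
print(f"median impact: ${np.median(revenue_impact):,.0f}")
print(f"runs with non-negative impact: {(revenue_impact >= 0).mean():.0%}")

plt.hist(revenue_impact, bins=50)
plt.xlabel("90-day net revenue impact ($)")
plt.ylabel("number of simulation runs")
plt.show()
```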

Alright, so let’s take a look at the simulation results.

We can demonstrate the range of possibilities by using a histogram. The average 90-day revenue impact is positive, at about $550. The median impact is about the same, which tells us that in half of our simulation runs the change generated $550 or more in revenue.

It’s powerful to see the full range of possibilities: given our assumptions, we see net revenue impacts from -$1,500 all the way to $2,500. What I’d highlight here, too, is that 80% of simulation runs returned a non-negative net revenue change.

This tells us that making this change doesn’t come without its risks, but if the effect sizes are in the ballpark of what we’ve assumed, we’d generally expect to earn more revenue than we otherwise would have. This is enough to convince me to take the leap!


Reproducibility and Extensibility

Another important feature of this simulation exercise is that it’s easily reproducible (see the code) and its assumptions are easily adjusted. This is especially important if you’re working with multiple stakeholders who have different beliefs about these assumptions, since it’s easy to test out different input values.
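For instance, a skeptical stakeholder could rerun the sketch from earlier with their own inputs; the parameter values below are made up purely for illustration:

```python
# Example: a more pessimistic stakeholder doubles the signup hit
# and halves the conversion lift (hypothetical alternative inputs).
pessimistic = np.array([
    simulate_once(rng, signup_hit_mean=0.002, lift_mean=0.005)
    for _ in range(10_000)
]) * PREMIUM_VALUE

print(f"pessimistic median impact: ${np.median(pessimistic):,.0f}")
```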

Recap

This post demonstrates an example of how we can use a simulation-based cost-benefit analysis to answer the question: Is this data worth collecting?

This certainly isn’t the only way to answer this question, and it’s likely that the solution won’t be as cut-and-dry as the one I presented here. But that’s why your company hired a smart data scientist like you to figure it out!

Making the case for whether or not to collect data is important, but you can’t even get to this stage without promoting a culture where you and your team ask these two vital questions.

1. What data do we need to address this problem?

2. Is this data worth collecting?
