Welcome to the Data Team! Please Solve Everything. (Part I: The Problem)

Data products aren’t magic elixirs.

Jarus Singh
Towards Data Science

--

Let’s begin with a common situation. Somebody with at least an ounce of authority, whether it be a plucky MBA marketing intern or the CEO, is faced with a difficult problem. They schedule a project kickoff meeting with various folks to brainstorm a solution, and know that it will need to be supported by data if it has any chance of being approved by their superiors. Although they make the meeting invite optional for a few folks, they ensure that somebody from the data analytics or science team (henceforth referred to as “data team”) will be there.

A quick note: in my opinion, there still isn’t a consensus about the difference between data analytics and data science. The tasks that data analysts perform at one company would fit the job description of data scientists at another. Regardless, know that when I refer to the “data team” for the rest of this article and future ones, I’m describing end users of data who are doing data reporting, analysis, and/or modeling to help an organization accomplish its goals. They’re not data engineers, tasked with building, maintaining, and improving data pipelines and other infrastructure. They’re also not machine learning engineers, who build models that are integrated into the company’s product, such as recommender or algorithmic pricing systems.

Let’s imagine you’re on this highly demanded data team. On the one hand, you’re directly involved in solving some of the most important problems your company faces. On the other, it’s very possible that some of the problems you’re faced with fall into one of the following unfortunate categories:

1. The problem is not informed by data.

Although progress is being made in natural language processing, creative writing is best left to humans. Brand messaging in particular comes to mind.

We didn’t get the successful slogans “Where’s the Beef?”, “Just Do It”, or “Got Milk?” from big data piped into fancy modeling frameworks and dynamic dashboards. In fact, if we imagine that a data and modeling team with access to historical brand advertising campaign performance was heavily involved at the ad agencies of the latter two campaigns, we might have ended up with “Where’s the Air?” and “Where’s the Milk?” The goal of these advertisers was to come up with a punchy, original slogan that differentiates their brand from the competition. The algorithmically generated slogans I shared are clearly ridiculous, but they drive home the fact that historical advertising data and typical modeling techniques aren’t appropriate for this problem.

import creativity as crtv. Photo by Pixabay from Pexels.

At best, the data folks who are involved in the project have their time wasted and politely excuse themselves as the project progresses. At worst, they dazzle stakeholders with impressive looking but irrelevant analyses, crowding out the actual work that needs to be done by other project contributors.

As more data, and in particular more new datasets, are generated over time, reliance on data for decision making will grow, and we’ll find ourselves in this situation less frequently than in the subsequent ones.

2. The problem is informed by data, but you don’t have it.

Now imagine that you’re a junior data scientist assisting the product team at a personal finance startup. You’ve been asked to help them because you’ve been around for a few months, excelled on other projects, and demonstrated that you know your way around your company’s internal datasets. Your company’s current product is a tool that allows people to set budgets and track them by importing and displaying credit card spending. Leadership at your company wants to expand the startup’s offerings and is considering adding either personal loans or an investing platform, but doesn’t have the product development resources to do both. They want to use your data prowess to come up with defensible predictions, such as the addressable audience and predicted loan and investment dollars, to help them pick which product to build.

Let’s start with the personal loans product. I mentioned above that we have access to credit card spending specifically, but perhaps our dataset, upon further examination, contains repayment history or outstanding balances as well. Using this information, we may be able to determine the number of users on our service who we think would be interested in personal loans for credit card payment consolidation. If we have access to outstanding balances, we could even predict loan amounts. Nice!

For identifying possible investors on our platform, things are a bit trickier, but we’re creative and tenacious. We could scan through our datasets for purchases that likely investors would make, such as subscriptions to investing publications like Barron’s, Investor’s Business Daily, and Kiplinger’s Personal Finance, as well as services like The Motley Fool and Seeking Alpha. For predicting the amounts that could be invested on our prospective platform, we could take our likely investors and, applying a few assumptions, estimate their income and investible savings. Excellent!
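To make the merchant-scan idea concrete, here’s a minimal sketch in pandas. The table, column names, and merchant list are all invented for illustration; they’re not our startup’s actual schema, just one way the scan could look.

```python
import pandas as pd

# Hypothetical credit card transactions table (illustrative schema).
transactions = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "merchant": ["Barron's", "Grocery Mart", "Seeking Alpha",
                 "Gas Station", "The Motley Fool"],
    "amount": [19.99, 84.12, 29.99, 40.00, 99.00],
})

# Merchants whose subscriptions suggest an interest in investing.
INVESTING_MERCHANTS = {
    "Barron's", "Investor's Business Daily",
    "Kiplinger's Personal Finance", "The Motley Fool", "Seeking Alpha",
}

# Flag any user with at least one purchase from those merchants.
likely_investors = (
    transactions.loc[transactions["merchant"].isin(INVESTING_MERCHANTS),
                     "user_id"]
    .unique()
)
print(sorted(likely_investors))  # → [1, 2, 3]
```

In practice the match would need fuzzier logic (merchant strings in card feeds are messy), but the shape of the analysis is the same: filter transactions on a signal list, then deduplicate to users.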

You walk the rest of the project group through your methodology and results, impressing a few of the executives with your calm, collected answers to tricky questions. They review the information you’ve provided and use your work to decide which product to build and launch. You’re excited about the recognition you got and the inevitable success your company is going to have, and you’re daydreaming about how to describe this experience in the groundbreaking TED talk you’ll give later in your career.

But hold your horses, Simon Sinek in training, you missed a few important things. Off the top of my head, here are some issues you didn’t address that could easily affect the choice of product to launch:

  • Competitors: There’s so much we don’t know about our competition. For the users we determine are likely to be interested in personal loans, how do we know whether they’re actively considering other loan providers? How would our offering compete with those? Similar reasoning applies to the investing example, and there it gets even worse: they could already be investing elsewhere and be perfectly happy with their choice of platform. Although they’d be valuable customers to have, we don’t stand a chance of winning them over, and we’re blind to that.
  • Other reasons for personal loans: We probably underestimated the market for personal loans by only looking at our internal credit card spending dataset. There are plenty of other reasons for getting a personal loan: refinancing existing loans or covering emergency expenses, for instance. Only 23% of personal loans on LendingClub, one of our soon-to-be competitors, were for credit card consolidation.
Source: LendingClub statistics as of 6/26/2020.
  • Miscellaneous other reasons: We may not have access to data from all the credit cards our users have, which throws off our volume estimates; for our possible dollars invested estimate, we guessed total income from credit card spending, but not all spending happens on credit cards; not all people with credit card debt need or are interested in personal loans for consolidation…
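The LendingClub figure also suggests a rough way to correct the consolidation-only undercount: if consolidation is only about 23% of personal-loan volume, dividing our consolidation-based estimate by that share gives a crude total-market figure. The dollar amount below is invented purely for illustration.

```python
# Back-of-the-envelope market scaling (illustrative numbers only).
consolidation_estimate = 1_000_000   # hypothetical dollars, from internal data
consolidation_share = 0.23           # LendingClub stat as of 6/26/2020

# If consolidation is ~23% of all personal-loan demand, the total
# market is roughly the consolidation estimate divided by that share.
total_market_estimate = consolidation_estimate / consolidation_share
print(round(total_market_estimate))  # → 4347826
```

This assumes our users' loan motivations mirror LendingClub's mix, which is itself a leap; it's a sanity check on the order of magnitude, not a forecast.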

So, what’s the solution? Do we preemptively call out these glaring issues with the proposed analysis during the first project meeting and twiddle our thumbs for a few weeks? No. It turns out that there are some sneaky ways of getting the data we need that can be done instead of or in addition to our internal data analysis:

  • Beg: It might be that easy. Let’s ask our users by designing a survey and sending it out to them. Obviously we have our own set of challenges here related to accuracy (i.e. are people who said they’d use us for personal loans actually going to do so when we launch the product?), response rates (i.e. what if too many users ignore the survey?), and sampling (i.e. what if non-respondents are structurally different from respondents?). Good survey design can help us ameliorate those issues. As a bonus, we can even describe our proposed offering in a little more detail in the survey itself or let the users tell us what’s important to them in a personal loan or investing service.
  • Beguile: Let’s get our users to give us useful information by using the increasingly popular placebo button. We add a button on our webpage labelled “personal loans”, “investments”, or one of each. When users navigate there, they’re told that the feature isn’t available yet but they can sign up to learn more. We can even combine this with the Beg strategy by sending a follow up survey to determine possible volumes of loans and investments, desired features of both products, and further gauge interest.
  • Buy: Let’s partner with a third party data provider to learn more about how many of our users are interested in personal loans or investing, and whether they’re already using a competitive service.

Is it fair to have expected either party, the data scientist or the product team, to have known to seek alternate data to solve the problem? I’d answer no. The coursework that prepares data scientists for their jobs rarely involves procuring data: even before question 1 on a problem set, they’re typically handed the dataset they need to complete it. It shouldn’t be their responsibility to advocate for these alternate datasets. Similarly, the responsibility to ask for new data shouldn’t lie with the product folks either; they’re probably concentrating on everything else: strategy, design, development, and the go-to-market plan. As frustrating as this situation is, I find the next of the three categories outright infuriating.

3. The problem is informed by data, and you have it, but you’re not asked for it!

In this category, the organization is so close to making the right move, but ends up bungling it anyways. It’s like learning all the concepts covered on the SAT, but bringing the wrong type of pencil to the testing center.

What you forgot at home. Photo by Kim Gorga on Unsplash.

Let’s imagine that, in spite of whatever missteps our startup made in the prior example’s decision making, we’ve successfully launched both the personal loans and investing services. Things are going well, and we have a newly hired Chief Marketing Officer (CMO) who wants to impress the CEO.

The CMO is a self-described creative type and visionary. He always has vague evening plans. He holds a potted plant in front of his face in his LinkedIn profile picture. He didn’t go to Burning Man last year because “they ruined it” (who’s they?!). And through a combination of intuition, experience, and the long-term effects of drug use, he’s unilaterally decided that getting more college students on the site is going to take the company into “hypergrowth” mode. He’s got the whole marketing campaign already planned: the Gen Z-specific messaging, guerrilla marketing tactics run by student brand ambassadors, t-shirts. But there’s one issue: the CEO will want to see data that supports this big bet. To get that data, the CMO needs you.

One morning, you wake up to a very important e-mail sent by the CMO at 5:53am to you directly. You know it’s very important because he added the very important flag and put “VERY IMPORTANT” in the subject line. He left filling out the data slides of his presentation to the last minute, and needs to know the average lifetime value (LTV) of your company’s active users.

You’re excited that the sophisticated user level LTV prediction model you built is getting traction, and it’s all the way up through the c-suite no less! You’ve already set up automated reporting on this important metric and rush him back an answer.

He thanks you and tells you the number is “way better than I expected” and that he owes you a beer, but like a very specific beer that edgy people drink. After he gets your response, he puts the average LTV number on a slide, contrasts it with the much lower customer acquisition cost (CAC) of these college students, and proves that there’s plenty of profit to be made. He knocks his presentation to the CEO out of the park on two hours of sleep. Hypergrowth, here we come!

Not so fast. Did you spot the issues with how your analysis was used? Here are some attributes of the acquired users that were ignored, probably to our detriment:

  • College Students: College students may have characteristics that make them more or less valuable than the average user. They may need personal loans, though not necessarily for large credit card debt consolidations; student debt is the more likely driver, and is your company’s offering competitive there? They’re also unlikely to need an investing platform until well after they graduate. Both characteristics suggest the LTV number you provided the CMO is too high.
  • New: Your average LTV blends users who have been on your service for a while with those who joined recently. If longer-tenured users are worth more, as is often the case, this is another reason the LTV number you gave the CMO is too high.
  • Active: The pain train keeps coming. We sent the average LTV of active users, but how is active user defined exactly? Let’s assume a user is active if they set up an account and import credit card data, get a personal loan, or invest money. What if the CMO was using CAC estimates which include users who just set up accounts? Yet another reason our LTV number is too high.

The worst part of the whole situation is that, had we known how the CMO planned to use our number, we may have been able to use the data we already have to send a more accurate estimate of LTV for these users. We could have subset our LTV predictions dataset to newly signed-up college students, both active and inactive, to get a better estimate of how much similar users could be worth.
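A minimal sketch of that subsetting, on an invented user-level LTV predictions table (every column name and value here is an assumption for illustration, not a real schema):

```python
import pandas as pd

# Hypothetical user-level LTV predictions (illustrative schema).
ltv = pd.DataFrame({
    "user_id":            [1, 2, 3, 4, 5],
    "predicted_ltv":      [120.0, 45.0, 300.0, 15.0, 60.0],
    "is_college_student": [True, True, False, True, False],
    "tenure_days":        [20, 45, 400, 10, 600],
    "is_active":          [True, False, True, False, True],
})

# The naive number the CMO asked for: average LTV of active users.
naive_ltv = ltv.loc[ltv["is_active"], "predicted_ltv"].mean()

# A better estimate for the campaign: recently signed-up college
# students, active or not, since newly acquired users look like them.
segment = ltv[ltv["is_college_student"] & (ltv["tenure_days"] <= 90)]
segment_ltv = segment["predicted_ltv"].mean()

print(naive_ltv, segment_ltv)  # → 160.0 60.0
```

On this toy data the segment average comes out far below the overall active-user average, which is exactly the gap that would have sunk the CMO’s profit math.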

Summary, with a Hint of Solution

In all three scenarios, data is being used in ways that don’t best accomplish an organization’s goals. In the brand slogan example, the data team’s efforts distract from the actual work that needs to be done, which doesn’t require data analysis or modeling at all. In the personal loans or investment features example, preexisting data is used to decent effect, but data that would need to be procured would serve the decision better. In the marketing acquisition example, the wrong datapoint informs the initiative even though the right one was available.

As illustrated in the examples, these problems are more frequent and severe because of data illiteracy. The data illiterate come in various forms. Some, like in the first example, expect data to deliver more than it can, treating it as anything from a steroid (accelerating growth) to a panacea (fixing everything). Others view it as a box to be checked off: Used data. In our second example, the product team lets their decision rest entirely on poor data, which I think reflects a common attitude: if a decision is informed by data, it must be a good one. In the third example, the decision to attract more college students to the platform has already been made by the CMO; he just needs to deal with the hassle of showing a datapoint to the CEO to get her approval. Large differences in authority between data requester and data provider exacerbate these issues: the higher up the chain the data illiteracy goes, the less likely the data team is to clarify, or outright refuse, requests.

The solution, then, is to have somebody explicitly tasked with making sure that data is used responsibly and effectively. This person first determines if data is necessary to accomplish the goals of a project, seeks it out, and even generates or procures it if it isn’t already available.

I encourage you to spend a little time reflecting on who you would want tasked with these responsibilities. What would their job description look like? What skills and background would you want them to have? Does a similar role or field that does this already exist? All that and more in “Welcome to the data team! Please solve everything. (Part II: The Solution)”, linked below!

--

Director, FP&A @Adobe. Mentor @Springboard. Bridging the gap between business and data teams. Opinions are my own. #rstats