Defining A Data Science Problem

The most important non-technical skill for a Data Scientist

Published in Towards Data Science, Aug 19, 2019

According to Cameron Warren, in his Towards Data Science article Don’t Do Data Science, Solve Business Problems, “…the number one most important skill for a Data Scientist above any technical expertise — [is] the ability to clearly evaluate and define a problem.”

As a data scientist, you will routinely discover or be presented with problems to solve. Your initial objective should be to determine whether your problem is in fact a Data Science problem and, if so, what kind. There is great value in being able to translate a business idea or question into a clearly formulated problem statement, and in being able to communicate effectively whether that problem can be solved by applying appropriate Machine Learning algorithms.

Is it a Data Science problem?

A true data science problem may:

Categorize or group data
Identify patterns
Identify anomalies
Show correlations
Predict outcomes

A good data science problem should be specific and conclusive. For example:

As personal wealth increases, how do key health markers change?
Where in California do most people with heart disease live?

Conversely, a vague and unmeasurable problem may not be a good fit for a data science solution. For example:

What is the link between finances and health?
Are people in California healthier?

What type of Data Science problem is it?

Once you’ve decided that your problem is a good candidate for Data Science, you’ll need to determine the type of problem you’re working with. This step tells you which types of Machine Learning algorithms can be applied effectively.

Machine Learning problems generally fall into one of two buckets:

Supervised — predicts future outputs based on labeled input & output data
Unsupervised — finds hidden patterns or groupings in unlabeled input data

*There is a third bucket (Reinforcement Learning) which is not covered in this post, but you can read about it here.

Supervised learning can be broken down into two additional buckets:

Classification — predicts discrete categorical response (ex: benign or malignant)
Regression — predicts continuous numerical response (ex: $200,000 home price or 5% probability of rain)
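
The buckets above can be sketched as a small decision rule. This is only an illustrative heuristic, not a standard API: the function name infer_problem_type is hypothetical, and it assumes that continuous targets are stored as floats while categorical targets are stored as strings, booleans, or integer class codes.

```python
from typing import Optional, Sequence


def infer_problem_type(labels: Optional[Sequence]) -> str:
    """Map the available label data to a Machine Learning bucket.

    Assumption (for illustration only): continuous targets are floats;
    anything else (strings, booleans, integer codes) is categorical.
    """
    # No labeled outputs at all: only unsupervised methods apply.
    if labels is None or len(labels) == 0:
        return "unsupervised"

    # Labeled, continuous numerical response: regression.
    if all(isinstance(y, float) for y in labels):
        return "regression"

    # Labeled, discrete categorical response: classification.
    return "classification"


print(infer_problem_type(None))                     # no labels
print(infer_problem_type(["benign", "malignant"]))  # categorical labels
print(infer_problem_type([200000.0, 350000.5]))     # continuous labels
```

In practice the choice depends on what the response means rather than how it is stored (an integer home price is still a regression target), so treat the type check as a stand-in for that judgment.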

What are the use cases for each type of Machine Learning problem?

Unsupervised (mainly considered “Clustering”) — market segmentation, political polling, retail recommendation systems, and more
Classification — medical imaging, natural language processing, image recognition, and more
Regression — weather forecasting, voter turnout, home sale pricing, and more

______________________________________________________________

Pro Tip: By using conditional logic to convert a continuous numerical response into a discrete categorical response, a Regression problem can be turned into a Classification problem! For example:

Problem: Estimate the probability that someone will vote.
Regression response: 60% probability
Classification response: Yes (if Regression response is greater than 50%), No (if Regression response is 50% or less)
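
The conversion in this example can be sketched in a few lines of Python. The function name to_class and the default threshold of 0.5 are illustrative choices, and treating a value exactly at the threshold as "No" is an assumption made here so that every input gets a class.

```python
def to_class(probability: float, threshold: float = 0.5) -> str:
    """Turn a regression-style probability into a Yes/No class."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Assumption: a value exactly at the threshold counts as "No".
    return "Yes" if probability > threshold else "No"


print(to_class(0.60))  # regression response of 60% probability
```

Moving the threshold changes how many cases land in each class, so it is worth choosing it with the intended decision in mind rather than defaulting to 50%.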

______________________________________________________________

Getting granular on the subtype for your problem

Before homing in on your final problem definition, you’ll want to get very specific about the Machine Learning subtype for your problem. Getting clear on the terminology will inform your decisions about which algorithms to choose. The diagrams below illustrate an example workflow for deciding on Classification subtypes (based on classes) and Regression subtypes (based on numerical values).

Finalizing the problem statement

Once you’ve determined the specific problem type, you should be able to clearly articulate a refined problem statement, including what the model will predict. For example:

This is a multi-class classification problem: the model predicts which of three classes a medical image belongs to, {benign, malignant, inconclusive}.

You should also be able to express a desired outcome or intended usage for the model prediction. For example:

The ideal outcome is to provide healthcare providers with an immediate notification when a prediction is malignant or inconclusive.
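
As a minimal sketch, the intended usage above can be written as a rule that maps each predicted class to an action. The names CLASSES, NOTIFY_ON, and should_notify are hypothetical, and how a notification would actually be delivered is left out.

```python
# The three classes from the problem statement.
CLASSES = ("benign", "malignant", "inconclusive")

# Predictions that should trigger an immediate notification.
NOTIFY_ON = {"malignant", "inconclusive"}


def should_notify(prediction: str) -> bool:
    """Return True when a healthcare provider should be notified."""
    if prediction not in CLASSES:
        raise ValueError(f"unknown class: {prediction!r}")
    return prediction in NOTIFY_ON


for label in CLASSES:
    print(label, should_notify(label))
```

Writing the rule down this way makes the desired outcome testable alongside the model itself.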

Conclusion

A good data science problem will aim to make decisions, not just predictions. Keep this objective in mind as you contemplate each problem you are faced with. In the example above, some action might be taken to reduce the number of inconclusive predictions, thereby avoiding additional rounds of testing and delays in needed treatment. Ultimately, the predictions from your model should empower your stakeholders to make informed decisions and take action!