How to solve any problem in data science

A simple approach for solving complex problems in data science.

TLDR:

Understand the problem

  1. Write it down
  2. Write down what you know
  3. Write down your assumptions/conditions

Plan your solution

  1. Connect what you know to what you don’t
  2. Write down a plan

Implement your solution

Evaluate your solution

Introduction

The data scientist’s job is to solve problems. Most tutorials emphasize specific skills, tools, or algorithms. This tutorial instead outlines a problem-solving approach that you can apply to almost any situation.

The book How to Solve It: A New Aspect of Mathematical Method by George Polya outlines a four-step methodology for solving math problems.

This methodology, with some minor changes, is perfect for data scientists.

The four steps are:

  1. Understand the problem
  2. Devise a plan
  3. Implement the solution
  4. Evaluate the solution

To demonstrate this methodology in action, we will go through a fairly simple coding exercise.

Understand the problem

Understanding the problem is the most neglected aspect of problem solving.

The single most important thing you can do is write the problem down.

Once you’ve written it down, make sure it’s the real problem. One thing I learned when I worked in IT is that the problem I was asked to solve was seldom the actual problem that needed solving.

Once you have identified the correct problem, write down everything you know in relation to the problem.

Items to include when you do this:

  • Similar problems that you have solved before
  • Limitations
  • Requirements
  • Anything you know that relates to finding a solution

Lastly, write down all conditions that relate to the solution. Some of these are organizational, others are more technical.

As an example, suppose you’re going to run your solution with Python on your company’s Amazon EC2 instance, and that Python environment uses pandas version 0.23. In that case, you shouldn’t use features from pandas 1.x in your solution.
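A condition like this can be checked in code before any work starts. The following is a minimal sketch, not a definitive implementation: the helper names version_tuple and installed_version are hypothetical, and the comparison assumes a simple "major.minor" version scheme.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '0.23.4' into (0, 23) for comparison.

    Assumption: only the major and minor numbers matter for this check.
    """
    return tuple(int(number) for number in re.findall(r"\d+", version)[:2])


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


pandas_version = installed_version("pandas")
if pandas_version is None:
    print("pandas is not installed in this environment")
elif version_tuple(pandas_version) < (1, 0):
    print(f"pandas {pandas_version}: avoid features introduced in 1.x")
else:
    print(f"pandas {pandas_version}: 1.x features are available")
```

Writing the check down this way turns an organizational condition into something the solution can verify on the machine where it will actually run.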

The Pangrams problem:

This problem is taken from the Python track at exercism.io.

Determine if a sentence is a pangram. A pangram (Greek: παν γράμμα, pan gramma, "every letter") is a sentence using every letter of the alphabet at least once. The best known English pangram is: The quick brown fox jumps over the lazy dog. The alphabet used consists of ASCII letters a to z, inclusive, and is case insensitive. Input will not contain non-ASCII symbols.

exercism.io

The problem

Is a given input string an English-language pangram?

What do we know?

  1. Pangram: A string that contains at least one of every letter in the alphabet
  2. There are 26 letters in the alphabet
  3. The letters are ABCDEFGHIJKLMNOPQRSTUVWXYZ
  4. If any letter is missing from a string, that string cannot be a pangram
  5. The letters, ordered from least common to most common, are: QJZXVKWYFBGHMPDUCLSNTOIRAE Source: https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html

What are the conditions of our solution?

  1. We don’t care about non-alphabetic characters
  2. We don’t care whether or not a letter is capitalized
  3. We don’t care if letters have duplicates
  4. We don’t care about the order of the letters
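The four conditions above amount to a normalization step: drop non-letters, ignore case, ignore duplicates, and ignore order. As a rough sketch (the function name distinct_letters is hypothetical and not part of the original exercise), that could look like this:

```python
import string


def distinct_letters(text):
    """Return the distinct ASCII letters in text as an uppercase set.

    Filtering on ascii_letters drops non-alphabetic characters, upper()
    removes case differences, and the set discards duplicates and order.
    """
    return {ch.upper() for ch in text if ch in string.ascii_letters}


print(sorted(distinct_letters("Hello, World!")))
# ['D', 'E', 'H', 'L', 'O', 'R', 'W']
```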

Planning a solution

Once we have understood our problem, we can plan our solution. This is where we take what we know and sketch out how to use it to solve the problem.

Questions we should ask in this stage:

  • How can we connect what we know to what we don’t know?
  • Are there additional problems that we need to solve to get there?
  • Have we solved something like this in the past, and can we adapt that solution?
  • If we can’t solve the problem as it exists, can we solve a simpler version of it first?

The Pangrams Plan:

To solve the pangrams problem, we can use what we know to work out how to determine whether or not an input is a pangram:

  • If the length of the input is less than 26, it cannot be a pangram
  • If any letter is missing from the input, it is not a pangram
  • If we progress through the alphabet from least common to most common letters, the check should be more computationally efficient, because a failure is more likely to occur on an early iteration
  • If we iterate through the alphabet we have at most 26 cycles to go through
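The ordering idea in the list above can be illustrated with a small experiment. This is only a sketch: the helper checks_until_verdict is hypothetical, and it counts letter checks rather than measuring real running time.

```python
import string

# Least common to most common letters, as listed earlier in the article.
RARE_FIRST = "QJZXVKWYFBGHMPDUCLSNTOIRAE"
ALPHABETICAL = string.ascii_uppercase


def checks_until_verdict(text, order):
    """Count how many letters are tested before the loop can stop."""
    upper_text = text.upper()
    for count, letter in enumerate(order, start=1):
        if letter not in upper_text:
            return count  # the first missing letter ends the search
    return len(order)  # every letter was found, so all were tested


missing_z = "The quick brown fox jumps over the lay dog"
print(checks_until_verdict(missing_z, RARE_FIRST))    # 3
print(checks_until_verdict(missing_z, ALPHABETICAL))  # 26
```

For this input the rare-first order stops after three checks, while alphabetical order needs all 26. A single sentence proves nothing about typical inputs, but it shows the mechanism the plan relies on.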

Implement the solution

This is where you go out and actually build your solution.

This is the part many of us do quite well; so well that it is often the first step we jump to when we try to solve a problem (possibly right after we frantically search Google or Stack Overflow).

The key to a successful implementation is to follow the plan methodically. If the plan isn’t working, return to the planning phase to see if you’ve missed anything.

The pangrams implementation:

This is the Python function I wrote to execute the plan I made:

# Execute the plan:

letter_list = ['Q', 'J', 'Z', 'X', 'V', 'K', 'W', 'Y', 'F', 'B', 'G', 'H', 'M',
               'P', 'D', 'U', 'C', 'L', 'S', 'N', 'T', 'O', 'I', 'R', 'A', 'E']

def is_pangram(input_string):
    pangram = True
    if len(input_string) < 26:
        pangram = False

    else:
        input_string = input_string.upper()
        for letter in letter_list:
            if letter not in input_string:
                pangram = False
                break
    return pangram

# This will evaluate to False because the input is missing a 'Z'
is_pangram("The quick brown fox jumps over the lay dog")
# This will evaluate to True
is_pangram("The quick brown fox jumps over the lazy dog")

Evaluate the solution:

This last step is easy to skip, but vital if we want to continuously improve. We need to evaluate the quality of our answer.

Questions we can ask are:

  • Is this the best solution?
  • Is there a better way to approach the problem?
  • What are the most expensive parts of our solution in terms of memory and speed?
  • Is my code readable?
  • Are there enough comments?
  • When this breaks in 3 months will I have any idea what it does?

Pangrams evaluation:

While I am fairly satisfied with this solution, there are some improvements we can make.

Rather than searching the entire input for every letter, we could first pass it to the Python set() function to remove duplicate characters, which leaves less text for each "if not in" check to search.
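A minimal sketch of the set() idea follows. It goes one step further than deduplicating the input and replaces the loop with a single subset test; the names ALPHABET and is_pangram_with_set are hypothetical.

```python
import string

ALPHABET = frozenset(string.ascii_uppercase)


def is_pangram_with_set(input_string):
    """Return True if every letter A-Z appears in input_string."""
    # set() collapses duplicates; <= tests whether ALPHABET is a subset.
    return ALPHABET <= set(input_string.upper())


print(is_pangram_with_set("The quick brown fox jumps over the lay dog"))   # False
print(is_pangram_with_set("The quick brown fox jumps over the lazy dog"))  # True
```

This version gives up the early exit on rare letters in exchange for shorter code, so which one is preferable depends on the inputs you expect.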

We could also approach the problem completely differently: build a dictionary that pairs each letter with its ASCII code, and then use an index function to look each letter up.
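The dictionary idea is not spelled out in detail, so the following is just one possible reading of it, with hypothetical names throughout: each letter’s ASCII code is used to compute a position in a 26-slot list of flags.

```python
import string

# Pair each uppercase letter with its ASCII code, for example 'A' -> 65.
ASCII_CODES = {letter: ord(letter) for letter in string.ascii_uppercase}


def is_pangram_by_code(input_string):
    """Return True if every letter A-Z appears in input_string."""
    seen = [False] * 26  # one flag per letter, indexed from 'A'
    for ch in input_string.upper():
        code = ASCII_CODES.get(ch)  # None for anything that is not A-Z
        if code is not None:
            seen[code - ASCII_CODES["A"]] = True
    return all(seen)


print(is_pangram_by_code("The quick brown fox jumps over the lay dog"))   # False
print(is_pangram_by_code("The quick brown fox jumps over the lazy dog"))  # True
```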

I am particularly proud of my idea to iterate through the alphabet in reverse order of commonality. While researching this article I checked several code practice websites and didn’t see anyone else implement that particular idea.

Conclusion

It’s easy to jump right into a problem and try to solve it immediately. This works well on simple, familiar problems. On complex, unorthodox problems, however, adopting a rigorous methodology like this will help you stand out from your peers.

When my team looks to hire people, we are far more interested in their problem-solving process than in the specific answers they come up with.

Lastly, I strongly recommend you read Polya’s book.

About the author:

Charles Mendelson is a marketing data analyst at PitchBook. He is also working on his master’s in psychology at the Harvard Extension School. If you’re looking for a guest for your data-oriented podcast or YouTube channel, the best way to get in touch with him is on LinkedIn.


Originally published at https://charlesmendelson.com on February 11, 2021.

