Problem-solving competence and data skills are the superpowers of the 21st century. In this article I describe my scientific method for solving problems, with a focus on data-related ones.
I’m a physicist, but I left the fields of academia and have a job in the universe of data now. Some of the things I do are commonly attributed to the "data scientist" role, but I have a bit of a problem with this deeply misunderstood term – because many self-proclaimed data scientists completely lack the science part. Also, I believe it’s counter-productive to draw hard lines between the various data job roles (data scientist / analyst / engineer / etc.) – after all, these are strongly interdependent. That’s why I prefer "data-oriented jobs" as an umbrella term for all the strongly interwoven roles that require good data skills.
I found that the things I learned during my studies and in my job as a postdoctoral researcher in experimental physics are extremely helpful in data-oriented jobs. That is because physics is all about problem-solving. Sure, there’s a lot of stuff in physics which you probably won’t need, like a bunch of natural constants and the equations of general relativity. But all this stuff is completely useless without problem-solving competence.
But what exactly is problem-solving competence? Any competence consists of knowledge and experience. You will have to learn the things you need (like coding and statistics), and you must actually solve problems to build experience. But in addition to that, it greatly helps to have a structured approach you can apply. A guideline to adhere to. A scientific method.
I thought quite a bit about my own problem-solving process, and I decided to break it down into six steps for you. By the way, none of this was taught to me that explicitly at university. These are my condensed thoughts based on many years of studies and experience. Here’s how you solve problems like a physicist.
1. Clarify the goal
While it may seem obvious, this first step is where many attempts at problem-solving fail. Let’s say I present a problem to you: climate change. All clear now? Of course not. Do I want you to come up with strategies to mitigate the greenhouse effect? Or maybe I want you to run a prediction on how the climate will change if emissions stay at their current level. Or I could be interested in how climate change is already impacting the global economy.
If you don’t know what is expected of you, you’re unlikely to perform well. That’s why your first step in problem-solving is to get a clear picture of the goal. Ask these two questions:
1) What are the questions I’m supposed to answer?
These questions should be as precise as possible to avoid misunderstandings. Your assignment may not be presented to you in the form of questions (e.g., "Set up a database for this project."), but you can always look at it by asking questions (e.g., "What database architecture serves us best?").
2) For whom am I preparing my results?
This is important because you should always consider the background and expectations of your audience. Explaining a complex analysis to people with a STEM background is very different from presenting the same data to someone who works in marketing.
2. Check your data
Once you know what you want to find out, the next step is to take inventory by asking yourself:
1) What data do I have?
2) And, maybe even more importantly, what data do I not have?
You know these trick questions where the correct answer is that there is no answer? Something like, "Alice is 2 years older than Bob. Charly is 1 year younger than Daisy, and Daisy is 3 years older than Alice. How old is Bob?" You need the age of at least one of them to answer that question.
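One way to make this check concrete is to write the known relations as linear equations and compare the number of independent equations with the number of unknowns. The following is a minimal sketch for the age puzzle above, not a general-purpose solver. The helper `rank` and all variable names are made up for illustration, and the extra fact about Bob’s age is a hypothetical addition.

```python
from fractions import Fraction


def rank(rows):
    """Number of independent equations, found by Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0  # index of the next pivot row
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no remaining row involves this unknown
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


unknowns = ["Alice", "Bob", "Charly", "Daisy"]

# One row of coefficients per statement, columns ordered as in `unknowns`.
puzzle = [
    [1, -1, 0, 0],   # Alice - Bob = 2
    [0, 0, 1, -1],   # Charly - Daisy = -1
    [-1, 0, 0, 1],   # Daisy - Alice = 3
]

independent = rank(puzzle)
missing_facts = len(unknowns) - independent

# Hypothetical extra fact: someone tells us Bob's age.
with_bob = puzzle + [[0, 1, 0, 0]]
solvable_with_bob = rank(with_bob) == len(unknowns)
```

Here `missing_facts` comes out as 1: three independent statements cannot pin down four ages. The sketch looks only at the coefficients, so it assumes the statements do not contradict each other.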
If you’re unlucky, the problem you’re supposed to solve may be just like that, only more complex, making its unsolvability a lot less obvious. That’s why you must always check if you have all the data you need to solve a problem. If not, you have two options:
1) Discuss the lack of data with your client or manager.
If the missing data can easily be obtained, that’s the way to go. If not, they may want to reconsider the questions you’re supposed to answer. But I guess it’s more likely that they will want you to go for option 2:
2) Make assumptions about the missing data.
This can be quite tricky and certainly deserves an article of its own. What I want to stress here is that transparency is super important when it comes to assumptions. Communicate your assumptions clearly and try to put uncertainty estimates / credible intervals on them. This is important for judging the expected accuracy of your results.
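To illustrate what putting an uncertainty estimate on an assumption can look like, here is a minimal Monte Carlo sketch. Everything in it is hypothetical: the scenario (estimating revenue without having measured the conversion rate or the average order value), the function name `revenue_interval`, and the assumed ranges. It samples the assumed quantities many times and reports the central 95% range of the simulated outcomes.

```python
import random


def revenue_interval(visitors, n_draws=10_000, seed=42):
    """Propagate two assumed (not measured) quantities into a revenue estimate."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    outcomes = []
    for _ in range(n_draws):
        # Assumption 1: conversion rate lies somewhere between 2% and 4%.
        conversion_rate = rng.uniform(0.02, 0.04)
        # Assumption 2: average order value is between 15 and 25, most likely 20.
        order_value = rng.triangular(15.0, 25.0, 20.0)
        outcomes.append(visitors * conversion_rate * order_value)
    outcomes.sort()
    lower = outcomes[int(0.025 * n_draws)]
    median = outcomes[n_draws // 2]
    upper = outcomes[int(0.975 * n_draws)]
    return lower, median, upper


lower, median, upper = revenue_interval(visitors=10_000)
print(f"Revenue: about {median:.0f}, 95% of simulations between {lower:.0f} and {upper:.0f}")
```

Reporting the range together with the assumptions that produced it lets your audience judge how much weight the result can bear.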
3. Decompose and simplify
Let’s say you are reasonably sure that you have all the data you need to solve the problem. Now what? There are two important methods to apply in the next step: decomposition and simplification. I mention them in one breath here because they are strongly interwoven. We will start with decomposition.
What I mean by decomposition is to split a problem into smaller ones. For example, your task may be to answer the question, "Which type of newsletter email that we sent out has been most successful so far?" One approach would be to first come up with a way to classify all the newsletters (e.g., using a decision tree model), and then compute statistics such as average read and click-through rates for each of the newsletter types obtained this way. By doing so, you decompose the problem into two relatively independent parts.
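In code, such a decomposition could look roughly like the sketch below. It is only an illustration under assumptions: a simple keyword rule stands in for a real classifier such as a decision tree, the field names (`subject`, `sent`, `opened`, `clicked`) and the example data are made up, and the rates are defined here as pooled opens and clicks divided by emails sent.

```python
from collections import defaultdict


def classify_newsletter(newsletter):
    """Part 1: assign a type. A hypothetical keyword rule stands in for a real model."""
    subject = newsletter["subject"].lower()
    if "sale" in subject or "discount" in subject:
        return "promotion"
    if "new" in subject:
        return "product_update"
    return "other"


def rates_by_type(newsletters, classify):
    """Part 2: pooled read and click-through rates per newsletter type."""
    totals = defaultdict(lambda: {"sent": 0, "opened": 0, "clicked": 0})
    for newsletter in newsletters:
        counts = totals[classify(newsletter)]
        for key in counts:
            counts[key] += newsletter[key]
    return {
        kind: {
            "read_rate": counts["opened"] / counts["sent"],
            "click_rate": counts["clicked"] / counts["sent"],
        }
        for kind, counts in totals.items()
    }


# Made-up example data.
newsletters = [
    {"subject": "Summer sale starts now", "sent": 1000, "opened": 300, "clicked": 60},
    {"subject": "Extra discount for members", "sent": 1000, "opened": 200, "clicked": 40},
    {"subject": "New features in our app", "sent": 500, "opened": 100, "clicked": 10},
]

rates = rates_by_type(newsletters, classify_newsletter)
best_type = max(rates, key=lambda kind: rates[kind]["click_rate"])
```

Because the classifier is passed in as an argument, you can replace the keyword rule with a trained model later without touching the statistics part.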
Decomposition makes your life easier in two ways:
- It adds structure to your problem. That makes it easier for you to find a way to solve the problem and to explain your solution process.
- It tends to make re-using code easier when you’re faced with a similar problem in the future.
Now let’s talk about simplification, which is just as important as decomposition. Physicists love simplification. There is this misconception that physicists enjoy doing a lot of complicated math. We don’t. Actually, most physicists are quite lazy when it comes to doing math. That’s why we simplify problems.
To simplify a problem, ask yourself these questions:
1) What can I ignore without changing the outcome too much?
For instance, when you compute the trajectory of a relatively slow, heavy object (like a basketball), you would usually ignore air drag. That makes the equations simpler and affects the outcome only slightly.
2) Can I describe features which should not be ignored in a more convenient way?
Physicists are quite often faced with the situation that the complexity of a given condition stands in the way of finding an exact solution. In this case we make an approximation. A good example is friction: Instead of considering every microscopically tiny bump on the surfaces of two objects in contact, we simply use an average value known as the friction coefficient.
When I mention features and approximations, you may be thinking of principal component analysis. This is not what I’m talking about here. PCA certainly has its use cases, but it should (if at all) be applied at a later stage. The current step of the solution process happens in your head. Why? Because if you’re looking for a good solution, there’s nothing that beats a deep understanding of the problem.
The hard part about simplifications is to avoid over-simplifying. Make sure you understand how your simplifications will affect your results and under what circumstances they are valid. To continue with the friction example: Calculations based on a friction coefficient will not be valid anymore when you’re dealing with objects so small that their size is not much larger than their surface bumps.
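The basketball example and the warning about over-simplifying can both be checked numerically. The sketch below is a rough illustration under stated assumptions, not a careful physical model. It assumes a quadratic drag force, and the mass, radius, drag coefficient, and air density are approximate values chosen for illustration. The function names are made up as well. It compares the throwing distance with and without drag for a gentle toss and for a very hard throw.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def throw_range(v0, angle_deg, drag_k=0.0, dt=1e-4):
    """Distance at which a thrown ball returns to launch height.

    drag_k is the quadratic drag constant per unit mass (in 1/m);
    drag_k=0 means air drag is ignored.
    """
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = v0 * math.cos(angle), v0 * math.sin(angle)
    while True:
        speed = math.hypot(vx, vy)
        ax = -drag_k * speed * vx
        ay = -G - drag_k * speed * vy
        vx += ax * dt
        vy += ay * dt
        x_new, y_new = x + vx * dt, y + vy * dt
        if y_new < 0.0:
            # Interpolate between the last two points to find the landing spot.
            return x + (x_new - x) * y / (y - y_new)
        x, y = x_new, y_new


# Assumed, approximate values for a basketball in air.
mass = 0.62          # kg
radius = 0.12        # m
drag_coeff = 0.47    # common textbook value for a sphere
air_density = 1.2    # kg/m^3
drag_k = 0.5 * air_density * drag_coeff * math.pi * radius**2 / mass


def relative_error_without_drag(v0, angle_deg=45.0):
    """Fraction by which ignoring drag overestimates the range."""
    simple = throw_range(v0, angle_deg)
    with_drag = throw_range(v0, angle_deg, drag_k)
    return (simple - with_drag) / with_drag


slow_throw = relative_error_without_drag(5.0)    # gentle toss at 5 m/s
fast_throw = relative_error_without_drag(30.0)   # very hard throw at 30 m/s
```

With these assumed numbers, `slow_throw` stays small (well under ten percent), while `fast_throw` is far larger. The same simplification that is harmless in one regime can be badly wrong in another.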
4. Make a plan
If you arrived here, you should already have a pretty good understanding of your problem. You know what you want to find out, what data you have, and how your problem can be decomposed and simplified. At this point you may be tempted to crack your knuckles and jump right into the implementation. I advise against it.
It’s not that I’m a one-in-a-million wise guy who always has a perfect plan. It’s just that I’ve made the "who needs a plan – let’s get started" mistake often enough to know it’s a bad idea. What happens is that you first make quick progress and then at some point you realize that you forgot about something important. Then you back-pedal, rewrite your code, re-run your evaluations, simulations, or experiments, and end up wasting time. Lots of time.
Resist your urge to start with the implementation and plan carefully first. I recommend the following, especially for problems to be solved by coding:
- Visualize the problem. Draw a flowchart to outline the various steps needed to arrive at a solution. Give names to these steps and to their inputs and outputs – these may later become the names of your code’s functions and their arguments and return values.
- Go backwards from the final step in your flowchart and ask yourself, "What needs to happen to achieve this, given the result of the previous step?" This will help you identify errors in your flowchart, such as required data for a particular step not being available at this point of the execution flow.
- For each step, make notes about what exactly needs to be done. Try to be precise and make an ordered list for more complex steps.
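A plan written down this way can even be checked mechanically before any real code exists. The sketch below is one possible way to do that, using a hypothetical plan for the newsletter example. All step and data names are invented for illustration. Note that it walks the plan forwards rather than backwards as suggested above. It is a different route to catching the same kind of error, namely a step that needs data which is not available yet.

```python
def find_missing_inputs(steps, initial_data):
    """Return (step name, missing inputs) for every step that cannot run yet."""
    available = set(initial_data)
    problems = []
    for name, inputs, outputs in steps:
        missing = [item for item in inputs if item not in available]
        if missing:
            problems.append((name, missing))
        available.update(outputs)
    return problems


# Hypothetical plan: (step name, inputs, outputs), in execution order.
plan = [
    ("load_newsletters", ["raw_export"], ["newsletters"]),
    ("classify_newsletters", ["newsletters"], ["newsletter_types"]),
    ("compute_rates", ["newsletter_types", "engagement_log"], ["rates_by_type"]),
]

problems = find_missing_inputs(plan, initial_data=["raw_export"])
for step_name, missing in problems:
    print(f"{step_name} is missing: {', '.join(missing)}")
```

In this made-up plan, the check flags that `compute_rates` needs an `engagement_log` that no earlier step provides, which is exactly the kind of gap you want to find before you start implementing.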
5. Implement
Having a plan is good, but – big surprise – you also need to implement it to get a result. Unfortunately, this is quite often the hardest part. Not necessarily in terms of the intellectual challenge – after all, you’ve already put a lot of thought into the problem in the four previous steps. But writing hundreds of lines of code or (this is for the physicists out there) filling page after page trying to solve a complicated integral can be quite tedious.
For any reasonably complex problem, you’re likely to run into some trouble at this stage. It could be that:
1) Your code is just not working as expected and you’re having a hard time debugging it.
2) You’re stuck because you don’t know how to deal with a difficult mathematical expression or how to implement the algorithm you need for your data transformation.
3) You realize that you overlooked something important at a previous stage.
Point (3) could be considered the worst of these, as it forces you to leave the implementation phase and revisit a previous step. But it happens and it’s not the end of the world. Realizing that something has gone wrong and dealing with it in a meaningful way is far better than producing an incorrect solution or giving up on the problem altogether.
For points (1) and (2) my advice is to stick with it. You’ve come this far, so don’t give up now. Keep on working in a structured manner and test your code (or check your calculations) thoroughly. If you’re so stuck that you can’t find a way to make progress, you may need to go back to a previous step to get a fresh perspective and attack the problem from a different angle. To use a physics analogy: If a particle-based description of the problem is not serving you, maybe you need to switch to a wave-based description (wave-particle duality allows you to do that).
6. Question your results
This is probably the most neglected and underrated point on my list. I can’t stress enough how important this is, especially if you call yourself a data scientist. The very core of scientific methodology is trying to prove something wrong. No matter how sophisticated and mathematically beautiful a theory is, if anyone finds something that contradicts the theory (the result of an experiment or a counter-example in mathematics), that theory is proven wrong. And wrong theories are obviously not the goal of science. Therefore, every scientist is obliged to try and prove their own theories wrong and to question their results. That applies to data scientists, too, but unfortunately it is often neglected, as I commented on in a previous article.
So, how do you go about questioning your results? I recommend asking yourself the following questions.
1) Are my results realistic?
If you followed the above steps, you put a lot of thought into the problem to be solved. Hence, you should have some expectations before you see the results. If what you got is far from what you expected, it’s time to re-examine your solution process.
2) Does my solution really work as it should?
You may have heard of the Clever Hans effect, originally a psychology term, which can also cause false results in machine learning applications, as explained in this Nature article. In essence, what happens is that you build something which appears to solve a given problem, but it actually solves another one which just happens to be correlated with your problem. Change your input parameters significantly and check if the thus obtained solutions appear reasonable.
3) Am I fooling myself?
If your results look suspiciously good, ask yourself this question. You may have inadvertently shaped your solution according to your initial expectations by making certain assumptions. Always check how much impact your assumptions have on your results.
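Questions 2) and 3) both come down to changing something and watching how the result responds. A minimal sketch of such a check is shown below. It raises each input by a fixed percentage, one at a time, and records how much the result moves. The function `sensitivity` and the toy model `monthly_profit` are hypothetical, as are all numbers. In practice the model would be your own analysis, and you may want to try much larger changes than 10%.

```python
def sensitivity(model, baseline, rel_change=0.10):
    """Relative change of the result when each input is raised by rel_change."""
    base_result = model(**baseline)
    effects = {}
    for name, value in baseline.items():
        varied = dict(baseline)
        varied[name] = value * (1 + rel_change)
        effects[name] = (model(**varied) - base_result) / base_result
    return effects


def monthly_profit(visitors, conversion_rate, order_value, fixed_costs):
    """Hypothetical toy model standing in for a real analysis."""
    return visitors * conversion_rate * order_value - fixed_costs


baseline = {
    "visitors": 10_000,       # measured (in this made-up example)
    "conversion_rate": 0.03,  # assumed
    "order_value": 20.0,      # assumed
    "fixed_costs": 4_000.0,   # measured (in this made-up example)
}

effects = sensitivity(monthly_profit, baseline)
for name, effect in effects.items():
    print(f"{name}: {effect:+.0%}")
```

In this toy model, a 10% change in the assumed conversion rate shifts the profit by about 30%, so any conclusion drawn from it leans heavily on that assumption.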
Only once your solution has made it through this final step is it time to claim that you solved the problem. And you did it the scientific way – like a physicist!