What does “work on a real problem” mean for prospective data scientists?

Breaking down the expectations around the most common advice

Wayne Lee
Towards Data Science


If you ask any data science hiring manager for interview advice, they'll tell you to work on a real problem and upload your code/presentation online. It's hard to argue against this: who doesn't want to hire someone who can already do the job?

To a student (undergrad or PhD) without any experience, however, this advice is frustrating. The dark secret is that most hiring managers cannot tell you exactly what they're looking for either. We just know that, with experienced candidates, the interview is smoother and our confidence is higher. Talking to candidates without experience, on the other hand, we are often left with the sense that something is missing. This post is mostly meant for students, but new managers might get some clarity as well.

Context

Before I break down the specifics, let me tell you what typically happens just before a hiring manager talks to you on the phone: we have just come out of a different meeting, recalled what the position was for, and started skimming your resume while dialing your number. The resume helps us understand your likely strengths/weaknesses and hopefully provides some material for a good conversation. At this point, we are tired from work, bored from the repetitive questions, but really hoping you can be the one to end the hunt.

On the bright side, we are rooting for you. On the other hand, we have interviewed enough people that this conversation will likely feel like a replay. Thanks to your dedicated instructors at school, your class projects are likely too manicured to provide the small moments of inspiration or education we enjoy, not too different from watching a good episode on the Discovery Channel (back in the day).

If you think this seems prone to bias, you’re not wrong. Fortunately, we are usually required to write down what we thought about the interview and how we are judging your various talents around communication, technical skills, and fit for the role. This documentation process is quite good at filtering out biases since I always write reviews with the belief that the candidate will see them someday.

What we are looking for in a “real problem”

I’ll skip over the non-technical portions of the interview and describe what I look for within a “real problem”.

Do you understand what solving a problem means?

  • Who is the audience? A problem in healthcare can target clinicians, patients, insurers, or institutions. Healthcare systems in different countries are also very different. You should be able to articulate who your audience is and who it might be in the future.
  • How can your audience verify the solution is working? In agriculture, this could require a strip trial that takes an entire planting season. In tech, your team might need to see sign-ups improve in an A/B test (see the sketch after this list). In some cases, qualitative feedback could be sufficient. Remember, you're the one selling, so your audience may not know what they want to see.
  • Is there a timeline? How much time are you given to solve the problem and how will you allocate it? A typical student mistake is to spend too much time generating the solution and too little validating or delivering it.
  • What aspect of the problem is your solution tackling? If you're trying to increase members for a non-profit cause, is your solution to grow the top of the funnel by increasing exposure, or to reduce fallout by improving retention? Not articulating the nuances in a problem is the number one sign that you have not been humbled by real life yet.
  • What resources or limitations exist? Does the data already exist, are there regulatory requirements, is there a budget constraint?
  • What is the status quo and why aren’t you spending your time on another problem? Is this really a problem worth solving? If it’s just for fun, make sure you know that.
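
As a concrete example of the verification point above: if "working" means sign-ups improved in an A/B test, a pooled two-proportion z-test is one common way to check whether the lift is distinguishable from noise. This is a minimal sketch with invented traffic numbers, not a prescription for how any particular team runs experiments.

```python
# Hypothetical A/B test: did the new sign-up flow beat the old one?
# All counts below are invented for illustration.
from math import sqrt
from scipy.stats import norm

control_signups, control_visitors = 480, 10_000   # old flow
variant_signups, variant_visitors = 540, 10_000   # new flow

p_control = control_signups / control_visitors
p_variant = variant_signups / variant_visitors

# Pooled two-proportion z-test on the difference in sign-up rates.
p_pool = (control_signups + variant_signups) / (control_visitors + variant_visitors)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_visitors + 1 / variant_visitors))
z = (p_variant - p_control) / se
p_value = 2 * norm.sf(abs(z))  # two-sided

print(f"lift = {p_variant - p_control:.4f}, z = {z:.2f}, p = {p_value:.3f}")
```

The statistics here are the easy part. Deciding that sign-up rate is the right thing to verify, and over what horizon, is the judgment the interviewer is actually probing.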

Can you formalize ideas with mathematical rigor?

  • What does it mean to formalize feedback? On Kaggle, the evaluation metric and labels are given to you. For a real problem, coming up with the objective and the data that guide the algorithms is the more challenging task. By choosing an objective, you should also have consciously rejected its alternatives, whether because obtaining labels was infeasible or because a metric was irrelevant to your end goals. Make sure you can articulate the rationale.
  • Do you have a mathematical understanding of the relationship between your problem and your data? Students often treat data as "correct" without realizing the challenges that go into sampling, measuring, and processing. Sadly, this is also the first lesson in statistics that people forget: data, at best, is a contaminated glimpse of reality. Understanding this contamination is critical to understanding the limitations of your work (the simulation after this list shows one way contamination quietly distorts results). If you don't understand your data, why should I trust a solution based on it?
  • What are you learning from your model? Statistics students love to talk about the models they tried, and programming students love to talk about how many models their code iterated through. The issue is that fitting models is a task that can be automated as machine learning platforms mature. Are you gaining an understanding of the data, the model, and/or the problem? I want to hire a human because the system is complex and constantly changing: can you tell me when we should retrain the model? Talk to any scientist: models are meant to simplify reality and help advance our knowledge. It saddens me when junior data scientists show me pages of output from scikit-learn with little insight beyond which model fits the data best.
  • Can you help non-mathematical people quantify their intuitions? If you can do this, you should be comfortable communicating/selling your results to a non-quantitative audience — a skill commonly advocated for data scientists. Unfortunately, this is again a hard skill for beginners to understand and train. The ability to quantify someone else’s opinions requires empathy, mathematical rigor, and practice. Fortunately, unlike “communicating your models to leadership”, formalizing a non-mathematical intuition can start with talking over a problem with your non-quantitative friends.
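
To make the "contaminated glimpse" idea concrete, here is a toy simulation (plain numpy, numbers invented) of one classic form of contamination: measurement error in a predictor. Even with ten thousand rows, the estimated effect is biased toward zero, a phenomenon statisticians call regression dilution or attenuation.

```python
# Toy simulation: measurement error in a predictor attenuates the
# estimated slope toward zero, no matter how much data you collect.
import numpy as np

rng = np.random.default_rng(0)
n, true_slope = 10_000, 2.0

x = rng.normal(size=n)                      # the quantity we wish we observed
y = true_slope * x + rng.normal(size=n)     # outcome driven by the true x
x_obs = x + rng.normal(scale=1.0, size=n)   # what we actually measure

# Simple OLS slope: cov(x, y) / var(x)
slope_clean = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
slope_noisy = np.cov(x_obs, y, ddof=1)[0, 1] / np.var(x_obs, ddof=1)

print(f"slope using the true x:     {slope_clean:.2f}")  # close to 2.0
print(f"slope using the measured x: {slope_noisy:.2f}")  # close to 1.0
```

The fix is not more rows but a better understanding of how the data was generated, which is exactly the understanding these bullets are asking for.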

Can you deal with uncertainty?

  • How do you deal with surprises? Real problems are never short of surprises, so how do you respond to them? Your data source may no longer be available, your package may be deprecated, your audience may lose interest in your solution, your partner may suddenly quit: what do you do? This gives us a small glimpse into how you might deal with problems. Do you continue to work on the problem or do you stop, and how did you make that decision? There is a time to show grit and there is a time to quit early. Sadly, good teaching is often coupled with minimal surprises, so this is hard to learn in school.
  • Do you ask questions that decrease the uncertainty around a problem? One of my biggest complaints about interviewing junior data scientists is that they are often afraid to ask questions. The right questions in school can be obvious, but the questions around real problems rarely are. Questions can align people on the same problem and demonstrate your level of understanding of the nuances. To an interviewer, this is the most realistic data point on "what will working with you feel like?" So ask!
  • Can you focus? Some people can talk endlessly about the nuances of a problem and go off on various tangents. While this demonstrates deep and broad understanding of the subject, such people often lose track of the original problem. In junior data scientists, this often materializes as mindlessly attempting every analysis you've seen in the past or regurgitating random facts around a topic. When presented with seemingly endless choices, show me that you can navigate with little guidance.

Can you identify problems?

  • It would be great if you could show us some entrepreneurial insight, but simply noticing whether an anomaly in the data requires further investigation is sufficient. Data, models, and relationships can be messy, so when do you flag the mess as a problem that needs, and can be, solved? (A few routine checks, sketched after this list, go a long way.) Similar to the naiveté around data, identifying problems requires attention to detail, some training, and a curious mind that does not take things for granted. This is likely the hardest skill to learn, which is why people often suggest you work on problems in a domain you are passionate about. Someone without the curiosity to identify problems should really consider a different career path.
  • What type of work excites you? It's often clear from your tone what type of work motivates you and whether you would stay in the role if we hired you. I discourage you from faking this, but I do encourage you to be open to discovering excitement in seemingly boring problems. #DataQuality
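
Noticing anomalies is mostly a habit of routinely interrogating your data. As a minimal sketch (the file name and column names here are hypothetical), a handful of cheap checks like the ones below regularly surface things worth investigating:

```python
# Minimal data sanity checks; "signups.csv" and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("signups.csv", parse_dates=["date"])

checks = {
    "duplicate rows": df.duplicated().sum(),
    "missing dates": df["date"].isna().sum(),
    "negative sign-up counts": (df["signup_count"] < 0).sum(),
    "days with zero sign-ups": (df["signup_count"] == 0).sum(),
    "gaps in the date range": pd.date_range(
        df["date"].min(), df["date"].max()
    ).difference(df["date"]).size,
}
for name, count in checks.items():
    if count:
        print(f"investigate: {count} {name}")
```

None of these flags is a problem by itself; the skill being tested is deciding which ones deserve a follow-up.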

Can you execute?

  • Of everything you're suggesting, how much can you do yourself? To solve a real problem, can you write down the problem, collect the data, fit the model, validate the results, sell it, support it, and expand it? Each of these steps requires a different skill set. While the data scientist's role focuses mostly on the modeling and coding aspects of this process, you definitely need an appreciation for the challenges and the amount of work required at each step. A friend once told me that we should not tell others to do what we are unwilling or unable to do ourselves. I hope you follow that advice as well.
  • Are you familiar with the basic terminology and tools? You should assume that every candidate who reaches an interview can check this box. If you know a particular tool (e.g. tensorflow) or skill (e.g. remote sensing) is especially desirable for the role, try to work it into your project and interview. Just note that this is the bare minimum and rarely differentiates you from the rest.

Conclusion

No one expects an early data scientist to perfect every aspect mentioned above. If this post is giving you more questions than answers, you’re on the right track.

As I mentioned in my previous post, what separates an expert from an amateur is not only the quality of execution but, more importantly, the number of conscious decisions made. Working on a "real problem" will surface all the decisions that your instructors in school have hidden away and force you to make many conscious choices (remember, not making a decision is also a choice). Hopefully, this post gives students some clarity around that suggestion.
