How to construct valuable data science projects in the real world

Lessons learned by making several mistakes

Jonny Brooks-Bartlett
Towards Data Science

--

Introduction

Most articles about how to “complete” a data science task usually discuss how to write an algorithm to solve a problem, for example how to classify a text document or forecast financial data. Learning how to do these things can be vital knowledge for a data scientist if it falls within their remit. However, that task is just a small part of the process of completing a data science project in the real world. Even if you can code up the perfect solution to a multi-class text classification problem, is it actually valuable to the business? What’s the current solution to the problem, and what benchmark do you have to surpass so that users can trust its output? When the algorithm is up and running in production, are you getting feedback to understand whether the output is continually producing usable results?

In this post I want to set out some guidelines on developing and carrying out effective and sustainable data science projects. This is a list that I came up with after making several mistakes in my own data science projects, as well as seeing others make theirs. Some of it won’t apply to all data scientists, because not everything here will fall within their remit. However, I’ve been in a team where we didn’t have the luxury of a dedicated business analyst, product manager or even a data science manager. It meant that I had to take on some of the responsibilities of these roles myself and, often, not do a great job of them. But it was a valuable learning experience, and here are some of the things that I’ve learned.

  • Special mention: Regardless of whether you agree with my waffling or not, you should check out the video in this article on how to do stakeholder-driven data science by Max Shron, the head of data science at Warby Parker. It’s amazing and sets a great bar for doing good data science projects in the real world.

What questions should be addressed for a project to be considered successful?

When coming up with a solution to a problem I find it useful to picture what success looks like (here we’re assuming that we already know what the problem is, but don’t underestimate how hard it can be to identify a problem that your team is currently set up to solve). This helps me to develop strategies to get to the end goal. In this case I want to write down a set of questions that I should be able to answer immediately if the project is successful. Here they are:

  1. Why are you doing the project? I.e. what value does the project bring and how does it contribute to the wider data science team goals?
  2. Who are the main stakeholders of the project?
  3. What is the current solution to the problem?
  4. Is there a simple and effective solution to the problem that can be performed quickly?
  5. Have you made an effort to involve the right people with enough notice and information?
  6. Have you sense-checked your solution with someone else?
  7. Have you made an effort to ensure that the code is robust?
  8. Have you made an effort to make sure that the project can be easily understood and handed over to someone else?
  9. How are you validating your model in production?
  10. How are you gathering feedback from the solution?

In my experience if these questions can be answered adequately then the project is likely to be successful. This may not always be the case and this list may be far from exhaustive depending on the project, but it’s at least a good starting point.

Kate Strachnyi has a set of 20 Questions to Ask Prior to Starting Data Analysis in a much shorter article if you decide you would rather not make your way through my mammoth brain fart (I wouldn’t blame you).

The steps below help in addressing each of these questions.

A 5-step guideline for data science projects

Step 1: Get an initial evaluation of the potential value of the project

  • Why do it? It helps you prioritise projects. You should be able to adequately explain why one project should be completed before another. It also allows us to understand how the project aligns with the goals of the team and the company. In addition, it provides some guidance on which metric we should optimise the model for.
  • What does this involve? A rough quantification of the benefits, e.g. money saved, revenue increased or time spent on manual labour reduced. The argument against this is that it’s hard to do and not always quantifiable. My response is: if you or your stakeholders can’t figure out the value of the project, then why are you allowing yourself or your stakeholders to waste your time? The value doesn’t have to be perfect, just a ballpark estimate. This step also involves determining who the main stakeholders are.
  • What happens if it’s not done? We may spend ages doing a project that no one benefits from. One example of a project that I saw could’ve done with better scoping was where the data science team was tasked with identifying a list of people who were most likely to benefit from being contacted by our marketing team. A model was built, but we decided to spend a couple of months improving it. Despite the new model giving better results, the team had to adjust their threshold because the business didn’t care about the scores that were generated for each customer; instead, they wanted to make sure they were contacting a fixed number of people. So it’s arguable that the time spent improving the model was pointless, and we would’ve known this if we’d scoped the project better with the stakeholders. (An argument could be made that the fixed number of people contacted are actually a better segment due to the better model, but this wasn’t measured so we don’t know whether that’s the case.)
  • What is the outcome from this step? A rough quantitative estimate of the value of the project, accompanied by a brief paragraph giving more context (an executive summary of the project). It’s important to note that, depending on the company and the project, just the perceived value of having a data model may be enough for the business to deem the project a success, in which case a quantitative estimate isn’t necessary. But that’s more about company politics and can only happen for so long before you need hard numbers to show your team’s worth.
  • Useful resources: The article titled “A Simple way to Model ROI of any new Feature” helps a lot. It gives a simple formula: Expected ROI = (reach of users * new or incremental usage * value to business) - development cost (see the sketch below). Other useful reads are “Prioritizing data science work” and “Product and Prioritisation”.
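
To make that formula concrete, here’s a minimal sketch in Python. The function and every number in it are hypothetical, purely to show how a ballpark estimate might be put together:

```python
# A rough, illustrative ROI estimate: all figures are made up.
def expected_roi(reach, usage_rate, value_per_use, development_cost):
    """Expected ROI = (reach of users * incremental usage * value to business) - development cost."""
    return (reach * usage_rate * value_per_use) - development_cost

# e.g. 50,000 users reached, a 10% expected uplift in usage,
# £2 of value per incremental use, £15,000 of development cost
print(expected_roi(reach=50_000, usage_rate=0.10, value_per_use=2.0, development_cost=15_000))
# -> -5000.0, i.e. under these assumptions the project wouldn't pay for itself
```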

Step 2: Determine current approach/Create baseline model

  • Why do it? The current approach gives us a benchmark to target. All useful models should beat the current approach, if there is one. If there is no current solution to the problem then you should develop a baseline model. The baseline model is essentially the solution to the problem without machine learning. It’s likely that a complex solution may only provide incremental value so you’ll need to evaluate if it’s actually worth building anything more complex.
  • What does this involve? Speak to stakeholders to determine what they currently do and what success they have. It’s likely they don’t measure their success rate, so it’s something that you’ll have to estimate/calculate. Building a baseline model should not involve any complex methods. It should be fairly quick and rudimentary, probably using simple counting methods (see the sketch after this list).
  • What is the outcome of this phase? A baseline evaluation number of the performance required to be successful/useful for stakeholders. An assessment of whether a complex model is worth building.
  • What happens if not done: You could waste time building a complex model that, at best, probably wasn’t worth the time spent getting the additional accuracy, or at worst, doesn’t even best the current approach. This was something that was missed when we built our recommendation engine. We didn’t check that the algorithm was better than a sensible baseline (recommending the most popular content). It could’ve been that the recommendation algorithm didn’t provide enough value to warrant doing it when we did.
  • Resources to help: The articles titled “Create a Common-Sense Baseline First” and “Always start with a stupid model, no exceptions.” are good reads that emphasise this point.
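
For illustration, a counting-based baseline for the recommendation example above could be as simple as the sketch below. The data and column names are made up; the point is just that no machine learning is involved:

```python
import pandas as pd

def most_popular_baseline(interactions: pd.DataFrame, n: int = 5) -> list:
    """Recommend the n most frequently viewed content ids to everyone."""
    return interactions["content_id"].value_counts().head(n).index.tolist()

# Tiny made-up interaction log
interactions = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3, 3, 3],
    "content_id": ["a", "b", "a", "c", "a", "b", "c"],
})
print(most_popular_baseline(interactions, n=2))  # e.g. ['a', 'b']
```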

Step 3: Have a “Team” discussion

  • Why do it? At this point you’ve come to the conclusion that this project is worth doing (step 1) and that success is feasible (step 2), so it’s time to speak to the people involved in making the project successful; engineers and/or other data scientists are obvious candidates. After this you should be clearer about what code you should write, what data you need, what you should test, what performance measure to use and what model approaches you should try. It’s easy to think you know what you need on your own, but having discussions with others can often help highlight things you’ve missed or things that could be improved. Don’t underestimate the importance of having people with different viewpoints contribute to the discussion.
  • What does it involve? Speak to at least one other data scientist and show them the results you’ve obtained so far. Perhaps they’ll have ideas about how you can improve on your approach. It’s vital you do this before you start on the model because you’ll be less likely to change your model once it’s written. Also, the data scientist you speak to might be the one doing your code review, so it’ll help them with context. Speak to the engineer who will be involved in productionising your work. They’ll likely need to know what to expect and may have suggestions that will make productionising the code much easier.
  • What is the outcome of this phase? Nothing concrete! This step is just about ensuring quality is as good as it can be the first time round, and that the relevant people are aware of and on board with the project.
  • What happens if not done: Best case, you’ve managed to think about and avoid all pitfalls on your own. However, it’s more likely that you haven’t thought about everything and there’ll be important things that you’ll have missed. Typical problems here include unmanageable transfer and handling of storage files when the model is moved into production, or model output that misses the mark and isn’t in the most useful form. This was the case with one of the models I produced. I wrote code that made multiple API calls, many of which were unnecessary. It was fine on a small dataset which I ran locally, but the servers struggled with the load in production. It wasn’t fixed until I spoke to an engineer who helped me diagnose the problem.

Step 4: Model development

  • Why do it? This is the model that we use to ultimately solve the problem.
  • What does it involve? The interesting part is what’s involved: it’s not only about creating a model. There are countless articles about how to write machine learning algorithms to solve specific problems, so I won’t cover that here. Instead I want to emphasise the steps that should be carried out to produce high-quality production code. During development you should be doing regular code reviews. Remember that you are likely not the only one who will see the code, and you are not the only one invested in the success of the project, so there should be good code documentation. This is vital for the longevity of the project. There will almost certainly be bugs and unexpected inputs in production, so you can mitigate these issues by testing the code to improve its robustness. This includes unit testing, integration testing, system testing and user acceptance testing (UAT); a minimal example of a unit test is sketched after the resources below. The specifics of how to make your code productionisable may vary from one team to the next, but other things that will help are: working in an isolated environment (virtual environments or Docker containers), using logging to write log files, and using configuration files so the configuration is separate from the main code.
  • What is the outcome of this phase? A shared (GitHub) repository with the required files and a working model that solves the problem defined in the project.
  • What happens if not done: The model has to be completed otherwise the problem will not have been solved. If your code isn’t tested there’ll be mistakes with logic that may not be noticed until production. If the code isn’t reviewed by someone else or it’s not documented, it’ll be difficult for other people to take over when you inevitably leave the company or are on annual leave. Some of these issues would crop up consistently on projects that I’d done previously that weren’t robust. I was still fixing bugs on a project that I was involved with 9 months after it was “completed” because the code was not robust. This eats into time that you could be using to do other valuable things and it causes lots of frustration for everyone involved. Make sure you spend the extra time required to make the code robust during the development period because it will save you time in the long run.
  • Resources: There are loads, but these are some of the ones that I’ve read that I really like:
  • At the very top of the list here is an article called “How to write a production-level code in Data Science?” It covers pretty much everything that I can think of. If you’re a data scientist building production level code then you should read it.
  • Code reviews: Code reviewing data science work, and an article about how to ‘do code reviewing by yourself’, i.e. writing code in such a way that it reads as though it has already been reviewed (this is not a substitute for doing actual code reviews; it’s only here to help you think about how to write good code)
  • Code documentation: A good guide on how to document code properly. I also really like NumPy-style docstrings
  • Code testing: A guide on how to write unit tests for machine learning code. This is a good guide for writing unit tests using Python’s Pytest library. I used this to help me write my first set of tests for one of my projects. The same company also have an article on mocking data for tests. And here’s another guide on Pytest and mocking
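
To give a flavour of the testing mentioned above, here’s a minimal Pytest sketch. The clean_amount function and the file name are made up; the idea is just that small, well-named tests catch “unexpected input” bugs before they reach production:

```python
# test_preprocessing.py (hypothetical file; run with `pytest`)
import pytest

def clean_amount(raw: str) -> float:
    """Parse a monetary string such as '£1,250.50' into a float."""
    return float(raw.replace("£", "").replace(",", ""))

def test_clean_amount_handles_currency_symbol_and_commas():
    assert clean_amount("£1,250.50") == 1250.50

def test_clean_amount_raises_on_unexpected_input():
    with pytest.raises(ValueError):
        clean_amount("not a number")
```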

Step 5: Model monitoring and feedback

  • Why do it? This is to ensure that our product is working as intended in production. The outputs of the solution should be stable and reliable, and we should be the first to know if something is wrong. Is model performance lower than expected? Are the data formatted differently to the training data? Are the data incorrect? Monitoring saves us a lot of time manually checking outputs and going through code to make sure things are working as expected, especially when the stakeholders begin questioning our data. We’re in the business of providing value to the company, so we should also be measuring the impact that our solutions have. Is it working? Does it need tweaking? How much money are we generating? How frequently is the solution being used? These are the numbers that we can report to the executives to show the value that data science is contributing to the business.
  • What does it involve? This involves a period of time (perhaps a couple of weeks) after the model has been put into production to manually and proactively check that everything is working. It also involves automating the monitoring process, perhaps by creating a dashboard, automated email alerts and/or an anomaly detection system (a minimal sketch of such a check follows this list). Perhaps the stakeholders need to do monitoring as well, so the monitoring solution may need to be tweaked to be user friendly for non-technical colleagues. For feedback purposes, this can involve discussing with the stakeholder how you’ll receive feedback. Will it be qualitative or quantitative? You could write something into the product that logs usage so that you don’t have to explicitly ask the stakeholder about their usage habits.
  • What is the outcome of this phase? Methods and products to ensure that the model is working properly and providing qualitative or quantitative feedback on the usage and value of the solution. It may also include some form of notification/alert if something goes wrong.
  • What happens if not done: If our products aren’t monitored properly, we risk business stakeholders losing trust in the things we produce when they break. It could also cost the business money and our team a lot of time trying to fix things. One example of this that I’ve faced is when the data in the analytics tool that we built started giving wildly incorrect figures. It turned out that the provider of the raw data had messed up. But importantly, it was our stakeholders who picked up on the problem before our data team did. Robust testing (as described in the model development step above) and automated alerting should’ve picked this up, but we didn’t have that in place. When the data finally came back, the stakeholders thought that the data were still incorrect. Our data team spent 2 weeks checking the data only to conclude that there was nothing wrong with it! That’s 2 weeks that we lost providing value to the business. Automated monitoring could’ve reduced 2 weeks to 2 minutes! Additionally, if we’re not getting feedback then we have no idea how useful our projects are or whether they’re still being used. One example was a meeting with the marketing team in which we had to ask the question “are you using the model?”. We should never have to ask that question because 1) we should have scoped the project properly during steps 1 and 2, so we know the value and we’re certain that our model improves upon the current solution/baseline model, and 2) we should’ve done the monitoring as part of the project, not found out in an unrelated meeting. This shows that we were not measuring the impact our models were having.
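
As a flavour of what an automated check might look like, here’s a minimal sketch. The thresholds, column and function names are all hypothetical; in practice you’d run something like this from a scheduled job and wire the warnings into email or Slack alerts:

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_monitoring")

def check_daily_scores(scores: pd.Series, expected_mean: float, tolerance: float = 0.1) -> bool:
    """Return True if today's model scores look healthy, otherwise log a warning."""
    if scores.isna().any():
        logger.warning("Missing scores detected: %d rows", scores.isna().sum())
        return False
    drift = abs(scores.mean() - expected_mean)
    if drift > tolerance:
        logger.warning("Mean score drifted by %.3f (tolerance %.3f)", drift, tolerance)
        return False
    logger.info("Scores look healthy (mean %.3f)", scores.mean())
    return True

# e.g. run daily against the latest batch of predictions
check_daily_scores(pd.Series([0.42, 0.47, 0.45]), expected_mean=0.45)
```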

Notable omissions

A good friend of mine (Michael Barber) mentioned that I’d missed an important step: evaluation. Model evaluation is an incredibly important part of the process and, if it’s not done properly, can lead to unexpected degradation of a model in production. Here’s a great article by Rachel Thomas at Fast.ai about using suitable train, test and validation sets for model evaluation.
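
For example, when the data have a time component, it’s usually better to split by date rather than randomly, so that the validation and test sets simulate predicting the future. A minimal sketch, with made-up cut-off dates:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "target": range(10),
})

# Chronological split: train on the past, validate and test on the "future"
train = df[df["date"] < "2018-01-07"]
valid = df[(df["date"] >= "2018-01-07") & (df["date"] < "2018-01-09")]
test  = df[df["date"] >= "2018-01-09"]

print(len(train), len(valid), len(test))  # -> 6 2 2
```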

Furthermore, I’ve completely missed out experimentation. One of the best ways to determine whether your data science solution has had the desired impact is to run an experiment and carry out an A/B test. Here’s a great article by Julia Silge about A/B testing at Stack Overflow.
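
As a sketch of how the outcome of such an experiment might be checked, a simple two-proportion test on conversion rates could look like the following (all of the counts are made up):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: conversions out of users in the control group
# vs the group exposed to the data science solution
conversions = [120, 160]
users = [2400, 2500]

z_stat, p_value = proportions_ztest(count=conversions, nobs=users)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference in conversion rate
# is unlikely to be down to chance alone
```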

I’m not going to go into detail about these because this post is already too long, and these things should be brought up in the early discussions with stakeholders about how the project is going to add value to the business (see step 3 about having a discussion with the relevant people before you complete a project 😂. Perhaps then I would’ve included these points the first time around).

So there we have it. These are 5 high level steps that I feel are required to develop a successful data science project. I should note that not all of these steps need to be carried out in full by a data scientist as some teams may have dedicated members to do some of the tasks. For example a business analyst or product manager may be the person who’ll liaise with stakeholders to determine whether a project is valuable and what the requirements are. Also there may be some steps that aren’t required for all projects. If a project is a one-off analysis piece then there’s no need to create a dashboard to continually monitor the output.

Realistically you don’t need to follow all of these steps to create a successful data science project. In most cases it won’t be pragmatic to write a whole suite of tests and documentation in addition to setting up a monitoring dashboard for a stakeholder with automated alerting when anomalies crop up. It’s more likely that you’ll have to weigh up which of these things are worth doing so you can complete a project by the (likely unreasonable) deadline that you’ve been set. I confess that I’ve never completed every single step for any one particular project; however, I have been aware of what I’m not doing and why not doing it at that particular time is a pragmatic decision.

The other important thing that I want to highlight is that these steps are only guidelines to help make a project successful. But to become a successful data scientist there are other traits that you must possess and skills that should be learned.

I’m very aware that I’m not the most experienced data scientist and I don’t know what I don’t know so if anyone has any comments, questions or suggestions, please feel free to fire away in the comments. Thank you for reading :)

--

Data scientist at Deliveroo, public speaker, science communicator, mathematician and sports enthusiast.