Notes from Industry

Once upon a time, Data Science was valuable only for a handful of Big Tech companies. Those days are over. Data science is now revolutionizing many "traditional" sectors: from automotive to finance, from real estate to energy.
Research by PwC estimates that AI will contribute up to 15.7 trillion US dollars to the global GDP by 2030 – for reference, the GDP of the Eurozone in 2018 was worth 16 trillion dollars [1].
All businesses now perceive their data as an asset and the insights they can gain from it as a competitive advantage.
Yet, more than 80% of all data science projects fail [2].
Why?
Each failed project fails for its own peculiar reasons, but, in three years of experience, we noticed some patterns. So, here they are: the seven mistakes you should avoid in your next data science project.
1. Pretending Data Science is Not Software Development
model.ipynb, _modelnew.ipynb, _modelfinal.ipynb… How many times have we seen an uncontrolled proliferation of notebooks and scripts? After a few weeks, it is impossible to remember what they contain and why they were created in the first place. Confusion leads to duplication, duplication to bugs, bugs to slowness. And slow experimentation leads to poor outcomes.
Usually, data scientists do not have a software engineering background: still, this is no excuse to accept compromises on code quality.
The quality of production code is not negotiable.
Make no mistake: we are talking about production code. Cutting some corners to get to the bottom of an experiment quickly is OK. Actually, it is good practice: quick and free experimentation is key to finding better and better solutions. Yet, experimentation and production are two separate worlds: as we discussed in a previous post, once a solution is working, it should be taken out of notebooks, refactored, engineered into proper software modules, put under automated testing, and integrated into the production codebase.
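To make this concrete, here is a minimal, purely illustrative sketch of what that refactoring can look like: a notebook snippet promoted to a small module, with a unit test that can run automatically on every change. The module, function, and column names are hypothetical.

```python
# features.py - a hypothetical module extracted from an exploratory notebook
import pandas as pd


def add_rolling_mean(df: pd.DataFrame, column: str, window: int = 7) -> pd.DataFrame:
    """Return a copy of df with a rolling-mean feature for the given column."""
    out = df.copy()
    out[f"{column}_rolling_{window}"] = out[column].rolling(window, min_periods=1).mean()
    return out


# test_features.py - a unit test that CI can run on every change
def test_add_rolling_mean_preserves_rows():
    df = pd.DataFrame({"sales": [1.0, 2.0, 3.0]})
    result = add_rolling_mean(df, "sales", window=2)
    assert len(result) == len(df)
    assert "sales_rolling_2" in result.columns
```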
2. Pretending Data Science is Just Software Development
No, this is not a copy-paste mistake – this point is in contrast with the previous one. Data science is not a subfield of software engineering: it is a complex discipline, incorporating elements of mathematics, software engineering, and domain knowledge.
Therefore, the practices that best suit software development may not be appropriate for data science. The discrepancy is particularly apparent in the project workflow, post-deployment maintenance and testing.
Agile methodologies, like Scrum and Kanban, are widely used in software development. Unfortunately, they fail to consider the different steps involved in a data science project:
- defining the problem statement,
- retrieving useful data,
- performing the exploratory analysis,
- developing a model and evaluating the results.
Therefore, at the very minimum, a higher-level framework, like Microsoft's Team Data Science Process [3], is required. More often, ad-hoc workflows are adopted: a comprehensive discussion is outside the scope of this article but may come in future posts – let me know if you're interested!
Point 7 of our list addresses the maintenance and evolution of models after the first deployment: this leaves us with testing.
In software development, the test pyramid is a well-established metaphor [4]. You test the correct execution of every function or method with unit tests, the proper integration of individual components with integration tests, and the sound behaviour of a feature of your application with end-to-end tests. All the checks are automated and can be repeated every time an update is introduced.
Now, this is great and should be applied to a data science project whenever possible, but… How can you check for the "correctness" of a model or of an exploratory analysis?
A different approach is required. First, appropriate metrics to establish the performance of a model should be defined before the modelling phase starts, to avoid any bias. Then, data should be versioned as well, as they are key to the reproducibility of each experiment. Finally, automated checks should assess whether each updated modelling pipeline brings any added benefit. By "modelling pipeline" we mean the series of transformations leading from the raw data to a trained model. This process is the core of the rising MLOps paradigm and is facilitated by new tools like MLFlow.
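As a minimal sketch of what such an automated check might look like with MLFlow (the metric, the baseline value, and the function name are illustrative, not taken from a real project):

```python
# Sketch: track each run of the modelling pipeline and compare the chosen
# metric against the current production baseline before promoting the model.
import mlflow
from sklearn.metrics import mean_absolute_error

PRODUCTION_MAE = 12.5  # illustrative baseline; in practice read from a model registry


def evaluate_pipeline(model, X_valid, y_valid):
    with mlflow.start_run():
        preds = model.predict(X_valid)
        mae = mean_absolute_error(y_valid, preds)
        mlflow.log_param("model_class", type(model).__name__)
        mlflow.log_metric("mae", mae)
        mlflow.log_metric("improves_on_production", int(mae < PRODUCTION_MAE))
    return mae
```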
3. Isolating Your Data Scientists
A data science project requires many different skills, which are seldom found in a single individual. At least three groups are usually involved: the IT team, business stakeholders, and data scientists.
Pretending a bunch of capable data scientists can address a data science project by themselves is just pointless. First and foremost, because data science is a tool, which should be adopted to produce business value. The knowledge of where the business value resides, and the ability to ask the right questions, lies with the business stakeholders. Similarly, taking validated solutions to production is extremely difficult without the help of an IT team.
The three groups must thus work together, be able to understand each other, and properly blend their expertise. To this end, it is useful to establish a shared glossary and a communication platform like Slack or MS Teams. Moreover, it is important to accept failure and make the most out of it. Even in a client-consultant setting, experiments carrying no performance improvements must still be shown and discussed, as they help pave the way for future iterations.
Notebooks are powerful tools to foster effective collaboration: for data scientists, they are the perfect platform for quick experimentation; for IT and business people, they are easily comprehensible reports.
4. Neglecting the Data
Exploratory Data Analysis (EDA) is a critical phase of any data science project. If up to 80% of the overall elapsed time is devoted to retrieving, cleaning, and analyzing the data, it may also be argued that a big chunk of the success of a project depends on the quality of the data feeding the models.
As we argued in a previous post, the insights gathered from EDA can bring intrinsic value. Another example comes from one of our projects in the food and beverage industry. The project was about estimating the impact of advertising and discounts on sales and margins, in order to automatically compute an optimal strategy. During EDA, we noticed that the competitors of our customer were capable of simultaneously boosting the sales of a large subset of their products, while our customer was not. This fact surprised their marketing department and triggered a detailed investigation of the phenomenon.
However, the main purpose of EDA is understanding how to clean the data prior to modelling and which hypotheses to embed into the model itself. Cutting corners in this phase can greatly compromise the outcome of the whole project.
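As a purely illustrative example, a handful of pandas one-liners already answers many of the questions EDA should cover (the file and column names below are hypothetical):

```python
# A few of the basic checks any EDA should cover (file and column names are hypothetical).
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"])

print(df.describe(include="all"))           # ranges, cardinality, obvious outliers
print(df.isna().mean().sort_values())       # share of missing values per column
print(df.duplicated().sum())                # duplicated rows
print(df.groupby(df["date"].dt.to_period("M"))["revenue"].sum())  # gaps or breaks over time
```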
5. Skipping Documentation
Investing time and resources in documentation is wasteful. The code speaks for itself, doesn’t it?
Wrong.
High-performing organizations find that documentation upfront saves heartache down the road [5].
Production code may tell you what you are doing, but it does not provide you with hints on why you are doing it. What about all the experiments yielding suboptimal performance? What about all the little insights, discoveries, challenges that led to the final solution? The answer to such questions is as valuable as the solution itself.
Yes, but how do you document the evolution of a data science project?
We found LaTeX on Overleaf and exports of Jupyter notebooks to be extremely effective. Even more so if they are complemented with proper reproducibility, achieved by versioning code with git, data with DVC, and experiments with MLFlow or a similar tool.
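For instance, once the data is tracked with DVC, a documented experiment can point back to the exact revision of the dataset it used. The snippet below is only a sketch: the repository URL, path, and revision tag are hypothetical.

```python
# Read the exact version of a dataset referenced in a documented experiment.
# Repository URL, path, and revision tag are hypothetical.
import pandas as pd
import dvc.api

with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/acme/energy-forecast",
    rev="experiment-2021-05",   # git tag or commit recorded in the report
) as fd:
    train = pd.read_csv(fd)
```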
6. Trying to Get The First Model Right
A data science project is iterative by nature. It is simply impossible to achieve the optimal solution on the first attempt, and it is often useless, too.
Before over-engineering a solution, it is worth studying the literature for tried and tested approaches. For a computer vision task, very few people would develop their own neural network architecture: it is so much easier and more effective to rely on off-the-shelf models and transfer learning.
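As a sketch of what relying on off-the-shelf models can look like (assuming a recent version of torchvision; the number of target classes is illustrative):

```python
# Transfer learning sketch: reuse a pretrained backbone, retrain only the head.
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # illustrative

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                               # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)       # new trainable classification head
```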
In a different context, such as dealing with time series or tabular data, it is often sensible to start with simple models, which are easier to tune and interpret. The good old ARIMA predictors are definitely not state-of-the-art for time series forecasting, but they can tell a lot about the structure of the series and guide the development of more complex approaches.
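Such a baseline can be a handful of lines with statsmodels (the file name and the ARIMA order below are illustrative):

```python
# A simple ARIMA baseline for a univariate series (file name and order are illustrative).
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

series = pd.read_csv("demand.csv", index_col="date", parse_dates=True)["demand"]

model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.summary())                  # coefficients hint at the structure of the series
forecast = model.forecast(steps=7)      # one week ahead, as a first yardstick
```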
Moreover, the most important metrics are not about accuracy or error, but about business value. Hitting the market quickly with a good-enough model brings meaningful feedback and can help steer the following iterations into producing the most value, not just the best model.
The best organizations start simple and get the result into the business. Learn and measure before updating the model with a more sophisticated approach [5].
7. Thinking that a Model is Forever
No model is forever. Even if a predictor works in production today, it may not do so in a few weeks. This is because the effectiveness of a model often depends on the context in which the modelled phenomenon takes place.
When COVID-induced lockdowns struck in early 2020, some of the models we developed to forecast electric energy demand experienced a dramatic drop in performance. And this is OK: no model was trained to cope with a nationwide shutdown, as (i) it had never happened before, and (ii) it does not make economic sense to account for such a remote possibility.
The example is extreme, but even less dramatic changes in the context the models operate in, or in the data they are fed, can have a deep impact on their performance. Therefore, it is critical to build a proper monitoring infrastructure and to ship new solutions to production as quickly as possible, while avoiding dangerous mistakes.
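Monitoring does not need to be sophisticated to be useful. As a deliberately simple sketch (the baseline error and alert threshold are illustrative), a check like the one below can run on every batch of fresh predictions and trigger an alert, or a retraining, when performance degrades:

```python
# A deliberately simple monitoring check: compare recent live error
# with the error measured at validation time (thresholds are illustrative).
import numpy as np

VALIDATION_MAE = 10.0      # recorded when the model was shipped
ALERT_FACTOR = 1.5         # tolerate up to 50% degradation before alerting


def check_model_health(y_true: np.ndarray, y_pred: np.ndarray) -> bool:
    live_mae = float(np.mean(np.abs(y_true - y_pred)))
    if live_mae > ALERT_FACTOR * VALIDATION_MAE:
        print(f"ALERT: live MAE {live_mae:.1f} vs validation {VALIDATION_MAE:.1f}")
        return False
    return True
```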
Although five years old, What's your ML test score? A rubric for ML production systems [6] by Google is still one of the main references on the topic. Many of the issues the paper addresses are now at the core of MLOps, which promises to bring to data science as big a boost as DevOps brought to software development.
Thank you, my reader, for getting here!
Every comment, question, or general feedback is always welcome. If you are curious about me or xtream, check us out on LinkedIn!
If you appreciated this article, you may be interested in:
Stop copy-pasting notebooks, embrace Jupyter templates!
Lessons from a real Machine Learning project, part 1: from Jupyter to Luigi
Lessons from a real Machine Learning project, part 2: the traps of data exploration
References
[1] PwC Research Group, Sizing the prize: what’s the real value of AI for your business and how can you capitalise?, 2017
[2] Brian T. O’Neill, Failure rates for analytics, AI, and big data projects = 85% – yikes!, 2019
[3] Microsoft, What is team data science process?, 2021
[4] Ham Vocke, The Practical Test Pyramid, 2018
[5] Domino, The Practical Guide to Managing Data Science at Scale, 2021
[6] Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley, What’s your ML test score? A rubric for ML production systems, 2016
Acknowledgements
Much of the content and an original draft of this post were created by Marco Paruscio. The whole Data Science team at xtream provided effective support in reviewing this article.