Data Scientists are from Mars and Software Engineers are from Venus (Part 2)

Consequences of mistaking models for software (Part 2)

Twelve traps to avoid when building and deploying models

Anand S. Rao
Towards Data Science
12 min read · Sep 6, 2020


Image: edited from "Hungry Venus flytraps snap shut on a host of unfortunate flies"

In Part 1 of this series on why data scientists are from Mars and software engineers are from Venus, we examined the five key dimensions of difference between software and models. The natural follow-on question is: so what? Does it really matter if models are conflated with software and data scientists are treated as software engineers? After all, for a large cross-section of the population, and more importantly the business world, the similarities between the two are far more visible than their differences. In fact, Andrej Karpathy refers to this new way of solving problems using models as Software 2.0. If models really are the next iteration of software, are these differences truly consequential?

The challenges of building models are exacerbated when we conflate models and software. In this blog, we describe the twelve 'traps' we face when we conflate the two and argue that we need to be cognizant of the differences and address them accordingly.

Data Trap

As we examined in our previous blog, models are formal mathematical representations that can be applied to or calibrated to fit data. Hence, data is the starting point for building a model. While test data is critical for building software, one can start building an algorithm from a given specification before collecting or preparing the test data.

However, when it comes to building models, the data has to be of good quality (i.e., garbage in, garbage out), available in sufficient quantity, and, for supervised learning models, labeled (a label is the response variable the model is trying to predict). The data also needs to be fit for purpose. For example, it should be representative of the population the model will face when deployed in production. Recent examples of skin-type and gender bias in facial recognition models underscore the importance of having a representative (and statistically significant) dataset for building models. Such data biases are surprisingly common in practice.

We have seen the failure to address this challenge of gathering, curating, and labeling the data needed to build a model as one of the significant traps of mistaking models for software. A number of companies eager to launch their AI or ML programs pay very little attention to this aspect and start building models with very little data. For example, a company recently wanted to build an NLP (natural language processing) model to extract structured information from documents using just eight PDF documents. The cost and time required for labeling, especially from domain experts (e.g., legal experts or clinicians), make it a significant challenge. While techniques are evolving to learn from less data and to help experts label data as part of their normal work, obtaining sufficient, well-labeled data remains a significant departure from the way software is traditionally developed.

In summary, the data trap can be further categorized into the data volume trap, data quality trap, data bias trap, and data labeling trap. A company can suffer from one or more of these traps. Getting a realistic sense of the data trap is critical to ensuring you don't go down the wrong path, spending millions on your modeling effort without realizing the expected returns. Understanding these traps can also change how you approach your modeling effort, for example by first collecting more labeled data or by looking for alternative rule-based ways of solving the problem.
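As a minimal sketch of how these traps can be surfaced early, a few quick checks on a candidate dataset can expose volume, quality, bias, and labeling issues before any modeling starts. The DataFrame and column names below are hypothetical, and what counts as 'enough' data is entirely problem-dependent.

```python
import pandas as pd

def quick_data_trap_checks(df: pd.DataFrame, label_col: str, group_col: str) -> None:
    """Rough, illustrative checks for the four data traps."""
    # Volume trap: is there enough data at all?
    print(f"Rows available: {len(df)}")

    # Quality trap: how much is missing per column?
    missing = df.isna().mean().sort_values(ascending=False)
    print("Share of missing values per column:")
    print(missing.head())

    # Labeling trap: how many rows are actually labeled, and how balanced are the labels?
    labeled_share = df[label_col].notna().mean()
    print(f"Share of labeled rows: {labeled_share:.1%}")
    print("Label distribution:")
    print(df[label_col].value_counts(normalize=True, dropna=True))

    # Bias trap: does each subgroup appear often enough to be representative?
    print("Subgroup representation:")
    print(df[group_col].value_counts(normalize=True, dropna=True))

# Example call with hypothetical column names:
# quick_data_trap_checks(claims_df, label_col="fraud_flag", group_col="region")
```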

Scoping Trap

With three to four decades of software engineering practices and methodologies behind them, software developers and systems analysts have become reasonably good (or at least much better than model developers) at estimating the time required to build and test software. With agile software development methods, software can be developed incrementally and iteratively in fixed time periods, typically two-week or four-week sprints.

Assuming that we want our models to satisfy certain performance criteria (e.g., accuracy, precision, recall), it is hard to estimate the effort and duration it will take to achieve the results. Worse, we may not be able to tell a priori whether we can, in fact, succeed in satisfying the performance criteria. In addition, the difficulty of meeting the performance criteria may be non-linear. For example, in one of our recent client projects we were able to achieve 90% accuracy with a decision tree model within a couple of weeks. However, the client was aiming for 99% accuracy. After a couple of months of further work, the accuracy could get no better than 93%, even with a neural network model.
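For concreteness, here is a minimal sketch of how such performance criteria are typically checked against an agreed target on a held-out test set, assuming scikit-learn and binary labels; the 99% target echoes the client example above and is purely illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def meets_target(y_true, y_pred, target_accuracy: float = 0.99) -> bool:
    """Evaluate a trained model's predictions against an agreed performance target."""
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    print(f"accuracy={acc:.3f}  precision={prec:.3f}  recall={rec:.3f}")
    # The scoping trap: nothing here tells us how much more data, time,
    # or compute it would take to close the remaining gap to the target.
    return acc >= target_accuracy
```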

Lukas Biewald gives another classic example: in one Kaggle competition, thousands of people around the world improved the accuracy of a model from a baseline of 35% to 65% in just one week. However, even after several more months, with several thousand people trying to improve on this result, the best they managed was 68% accuracy, a mere 3-point improvement.

We call this the scoping trap: data scientists are unable to scope the effort and duration, the data, and the computational resources required to achieve a given performance criterion (e.g., accuracy). The scoping trap can occur at different stages of the model lifecycle. It might be difficult to scope the effort needed to reach a certain performance before the model is built, which we call the pre-build scoping trap. The training scoping trap arises when data scientists are unable to tell how long they should continue training the model (with new data, new techniques, additional resources, etc.) in order to achieve the performance criteria during the training phase.

These two traps can drive a product manager, scrum master, or project manager crazy when it comes to embedding models within traditional software or delivering a data science project. In large software development efforts we have often seen the 'voice' of the data scientist ignored because of tight, fixed deadlines, forcing data scientists to perform simple descriptive analytics rather than generating insights from a model. Alternatively, they might develop brittle rule-based models instead of true ML models. We believe this is one of the significant reasons many AI/ML projects fail to deliver on their stated ROI (return on investment).

When we build models that learn continuously, we face an additional challenge. Let's say the target accuracy determined by the business for model deployment is 90%, and the trained model has achieved 86%. The business and data scientists can jointly decide to deploy the model, have it continue learning, and hope that its accuracy crosses the 90% threshold. Once again, the data scientists will be unable to scope if and when the model will cross this threshold, and under what conditions. We call this variant the deployment scoping trap.

Finally, models can suffer from model drift, where the performance of the production model decreases because the underlying conditions change. Such drift can happen abruptly or gradually. Once again, the data scientists will be unable to scope the nature, timing, and extent of the deterioration in model accuracy. We call this the drift scoping trap. As a result, one needs to institute model monitoring practices to measure and act on such drift.
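One minimal way to operationalize such monitoring, sketched below under the assumption that labeled outcomes eventually arrive for production predictions, is to track rolling accuracy over recent cases and flag when it drops below an agreed threshold; the window size and threshold here are hypothetical.

```python
from collections import deque

class DriftMonitor:
    """Track rolling accuracy on recent labeled production cases and flag drift."""

    def __init__(self, window: int = 500, min_accuracy: float = 0.85):
        self.outcomes = deque(maxlen=window)  # 1 = correct prediction, 0 = incorrect
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual) -> None:
        self.outcomes.append(1 if prediction == actual else 0)

    def drifted(self) -> bool:
        if not self.outcomes:
            return False
        rolling_accuracy = sum(self.outcomes) / len(self.outcomes)
        return rolling_accuracy < self.min_accuracy

# Illustrative usage:
# monitor = DriftMonitor(window=500, min_accuracy=0.85)
# monitor.record(model_prediction, ground_truth)
# if monitor.drifted():
#     trigger_retraining()  # hypothetical downstream action
```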

In summary, the scoping trap can be further categorized into pre-build scoping, training scoping, deployment scoping, and drift scoping traps. The figure below highlights these different types of scoping traps using an illustrative example.

Scoping traps and how they manifest pre-build, during training and after deployment

Return Trap

Business sponsors and project managers often have to show the expected ROI before embarking on building any large-scale software. As data science projects become more common in enterprises, it is natural that business leaders want to understand the expected ROI before making or prioritizing their investments. While estimating returns on a new piece of software is not easy, the task gets even more complex when it comes to the expected ROI of models.

Conceptually, ROI is a relatively straightforward computation: it is the net benefit divided by the cost.

ROI = (Benefits from model − Cost of model) / Cost of model
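Expressed as a tiny sketch in code, with purely illustrative dollar figures:

```python
def roi(benefits: float, cost: float) -> float:
    """Return on investment: net benefit relative to cost."""
    return (benefits - cost) / cost

# Illustrative numbers only: $500k of annual benefit against a $400k model build.
print(f"ROI = {roi(benefits=500_000, cost=400_000):.0%}")  # ROI = 25%
```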

The benefits of AI/ML models in companies typically fall under two broad categories: efficiency and effectiveness. When companies automate repetitive manual or cognitive tasks, they improve the efficiency of the process, reduce the time it takes to perform those tasks, and improve the productivity of their labor force. When companies use models to augment human decision-making, they improve the effectiveness of their decisions. In other words, the benefits come from being faster and better. The question we need to ask is: faster and better relative to what baseline? It is in estimating this baseline that companies often fall short.

When automating a task, we need a baseline of how long it takes a human to perform that task. Unfortunately, estimating this is not easy, especially for a cognitive task (e.g., assessing the risk of a customer) or a non-repetitive task (e.g., handling exceptions in expense approval). People with different skills, backgrounds, and tenure might take different amounts of time to complete the task. A proper analysis of all these factors to determine the true duration of a task is a non-trivial exercise, and may be impractical in a service or knowledge-based organization with a wide variety of tasks spanning a spectrum of complexity levels.

Another common problem in deriving the baseline for efficiency is that it can be difficult to isolate the given task from all the other tasks a person does. Take the example of a purchasing manager who, among her various activities in a day, examines a purchase order in the system and cross-checks it against the packing slip and vendor invoice to determine whether the transaction is accurate. Let's say we have built an NLP model to extract key fields from the invoice so that they can be reconciled with the purchase order. Even for this single individual, estimating the total time spent on invoice processing is difficult, as this task is interleaved with others, such as attending meetings and inspecting shipments, and depends on the complexity of the purchase orders, invoices, and packing slips (e.g., the complexity and time increase if a shipment spans multiple purchase orders or multiple invoices).

Getting a baseline for effectiveness is an even more challenging endeavor. Efficiency was computed for tasks: discrete activities whose duration can be measured. With effectiveness, however, we are evaluating decisions and actions. How do we determine whether one action is better than another? The results of an action are multi-dimensional, may be uncertain, and may be delayed in their effects. Say you are driving and, just as you near an intersection, the green light turns amber. Do you apply your brakes and stop, risking that the car following closely behind hits you, or do you cross on amber (still legal)? Which action is better, and in what way: better for the vehicle behind you, better in terms of fuel consumption, or better in terms of obeying the law more strictly? And this was a relatively simple action; estimating the baseline for more complex decisions is harder still.

So far, we have examined only the estimation of a baseline for efficiency and effectiveness. This must occur before we start building the model, so that we have a good idea of the performance we require of it. We call this the return estimation trap. Another type of return trap occurs once we have built and deployed the model and are trying to realize the benefits. We call this the return realization trap.

Once again, we run into issues when calculating the returns. In the case of efficiency benefits, we may be able to show categorically that automation reduced the time required to complete a task. Let's say your automated invoice processing model has reduced the average time to process an invoice from 30 minutes to 15 minutes. If a person processes four invoices a day, they save an hour a day, or five hours a week. Now let's say the person works ten hours a day, or fifty hours a week. The time saving is 10%. However, there may not be a tangible dollar benefit to the company, and this can happen for a number of reasons. The employee is already working a 50-hour week, and with a 5-hour saving they might simply reduce the number of hours they work. This might still be an overall benefit to the organization in terms of employee satisfaction and retention, but we probably have not factored that into our benefit estimation. Even if they were working only the required 40 hours a week and automation saved them 5 hours, they might find other things to fill the gap rather than the organization being able to monetize the fractional time savings. This is one of the biggest challenges with RPA (Robotic Process Automation) and IPA (Intelligent Process Automation): time is saved for individuals, so there is a decrease in FTEs (full-time equivalents), but the savings do not translate into a headcount reduction from which you can clearly demonstrate the return on automation.
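The arithmetic in this example can be made explicit in a short sketch (using the illustrative figures above), which also shows why a 10% fractional saving per employee rarely converts directly into a monetizable return:

```python
minutes_saved_per_invoice = 30 - 15        # automation halves processing time
invoices_per_day = 4
days_per_week = 5
hours_worked_per_week = 50

hours_saved_per_week = minutes_saved_per_invoice * invoices_per_day * days_per_week / 60
share_of_week_saved = hours_saved_per_week / hours_worked_per_week

print(f"Hours saved per week: {hours_saved_per_week:.1f}")      # 5.0
print(f"Share of the working week: {share_of_week_saved:.0%}")  # 10%

# The return realization trap: a 10% fractional saving spread across employees
# does not automatically become a measurable dollar benefit or headcount reduction.
```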

When it comes to realizing the benefits of the effectiveness of decisions or actions, we run into similar issues. The biggest challenge here is attribution. Even when an action can be shown to be measurably better than an alternative, it is not always possible to isolate the entire context in which the action was performed. In the example of stopping at the intersection or crossing on amber, stopping suddenly might be the better choice on a dry, sunny day and the wrong one on a wet, slippery, snowy day. You cannot completely attribute the benefit of stopping on a dry, sunny day to your action; part of the credit goes to Mother Nature for providing the right environment. This attribution challenge is all too common when evaluating decisions and actions, where competitors, customers, suppliers, regulators, and a host of other stakeholders might have a hand in making an action or decision 'better' or 'worse'.

The return trap is more acute for models than for software because we are comparing the performance of models with human performance. In cases where humans cannot perform certain tasks at the speed of automated models (e.g., algorithmic trading), or where the model can evaluate a humanly impossible number of choices and make the right decision (e.g., playing Go or chess), the value of models is reasonably clear. However, in the majority of cases, where models automate tasks or augment human decisions or actions, the return trap is a significant challenge to contend with.

In summary, we end up with four different types of return traps: the return efficiency estimation, return effectiveness estimation, return efficiency realization, and return effectiveness realization traps.

Summary

We have looked at three broad categories of traps and a total of twelve different sub-categories as shown below.

Twelve traps of models across three different categories

In Part 1 we examined five dimensions of difference between models and software. The data trap discussed above largely stems from the fundamental way in which models are constructed to fit data. In addition, the uncertainty around the output and the inductive inference mechanism also contribute to the different data traps. The scoping trap arises from the need to be scientific (i.e., to take a test-and-learn, experimental approach) in training models. Similar scoping traps are common in the pharmaceutical sector, where scientists cannot estimate the time required to find a drug to cure a condition or whether a drug will successfully pass the different clinical trials. Likewise, even after a drug is released to the market its efficacy can drop (e.g., antibiotic-resistant bacteria). The drift scoping trap is an effect of the dynamic manner in which the decision space evolves. Finally, the return traps arise from the experimental, scientific mindset required to build models and from the dynamic nature of the decision space.

In subsequent blogs, we will look at some of the best practices to address these traps and challenges of scoping, building, and delivering models.

Authors: Anand S. Rao and Joseph Voyles

