Data Scientists are from Mars and Software Developers are from Venus (Part 1)



AnandSRao · Towards Data Science · 8 min read · Aug 29, 2020


Figure 1: Data Scientists are from Mars and Software Developers are from Venus

Mars and Venus are very different planets. Mars’s atmosphere is very thin and it can get very cold, while Venus’s atmosphere is very thick and can get hot enough to melt lead! Yet they are our closest sister planets, and they have a number of similarities too. Both have a high concentration of carbon dioxide in their atmosphere and are exposed to solar radiation with no protective magnetic field.

Software Engineers and Data Scientists come from two different worlds, one from Venus and the other from Mars. They have different backgrounds and mindsets, and they deal with different sets of issues. They also have a number of things in common. In this and subsequent blogs, we will look at the key differences (and similarities) between them, why those differences exist, and what kind of bridge we need to build between them. In this blog, we explore the fundamental differences between software and models.

Software vs Models

In traditional programming, one provides line-by-line instructions (often called an algorithm) for a computer to process the input data and produce the desired output that matches a given software specification. The line-by-line instructions can be written in one of many programming languages, e.g., Lisp, Prolog, C++, Java, Python, or Julia.

In data science, one provides the input data and, in some cases (e.g., supervised machine learning), a sample of the output data to build a model that can recognize the patterns in the data. Unlike traditional programs, models are trained on this data to recognize patterns or make inferences. When fully trained, validated, and tested, they perform predictions or inferences on new data.
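As a minimal sketch of this flow (using scikit-learn’s built-in iris dataset and a logistic regression classifier purely for illustration, not tied to any particular project), the “program” is not written line by line; it is fit from example inputs and outputs and then evaluated on data it has not seen:

```python
# A minimal supervised-learning sketch: fit a model from example inputs and outputs,
# then use it to make predictions on data it has not seen before.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the evaluation reflects genuinely unseen inputs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # "training": patterns are learned from the examples

preds = model.predict(X_test)      # "inference": predictions on new data
print(f"Test accuracy: {accuracy_score(y_test, preds):.2f}")
```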

A model is a formal mathematical representation that can be applied to or calibrated to fit data. Scott Page provides a number of examples of models and how they are used to make decisions. Models can be machine learning models, system dynamic models, agent-based models, discrete event models or a number of other different types of mathematical representations. In this article, we will focus primarily on machine learning (ML) models.

Figure 2: Software and Model Definition

Software and models differ along five key dimensions. These dimensions are shown in Figure 3.

Figure 3: Five key dimensions of difference between software and models

Output

The output of software is certain. For example, consider an algorithm (e.g., bubblesort) for sorting an array of numbers. The bubblesort algorithm takes as input an array of numbers and iteratively goes through a series of steps to produce an output array sorted in ascending or descending order. Given any array, the bubblesort algorithm will always produce a sorted array as output; there is no uncertainty in its output. If the program has been properly implemented and tested, it will always produce the result, and the result will be 100% accurate.
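For concreteness, here is a plain bubblesort in Python. Given the same input, it always returns the same sorted output, with no notion of confidence attached:

```python
def bubble_sort(arr):
    """Return a new list sorted in ascending order using bubblesort."""
    result = list(arr)
    n = len(result)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if result[j] > result[j + 1]:
                # Swap adjacent elements that are out of order.
                result[j], result[j + 1] = result[j + 1], result[j]
                swapped = True
        if not swapped:   # no swaps in a full pass: the array is already sorted
            break
    return result

print(bubble_sort([5, 1, 4, 2, 8]))   # [1, 2, 4, 5, 8] -- the same output every time
```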

In contrast, take a deep learning model that has been trained on a large number of images and is capable of recognizing different breeds of cats. When the model is given a cat image as input, it predicts the breed of the cat. However, it may not always provide an answer with 100% accuracy; in fact, more often than not the accuracy will be less than 100%. Figure 4 illustrates the input, the deep learning network layers, and the output of such a model. The model predicts that the image contains a tabby cat with 45% confidence, and that it could be an Egyptian cat with 23% confidence. In other words, the predictions from models are often uncertain. This uncertainty in predictions is a challenging concept for businesses to grasp. We will come back to the implications of this dimension for model development in a subsequent blog.

Figure 4: Deep learning image recognition model
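To make the contrast concrete, the sketch below uses a pretrained torchvision ResNet as a stand-in for the network in the figure (the file name cat.jpg is a placeholder, and the exact classes and probabilities will differ from those in the figure). The key point is that the output is a probability distribution over classes, not a single certain answer:

```python
# A sketch of probabilistic image classification with a pretrained torchvision ResNet.
# "cat.jpg" is a placeholder path; the printed classes and probabilities are illustrative.
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.eval()

preprocess = weights.transforms()
image = preprocess(Image.open("cat.jpg")).unsqueeze(0)   # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(image)
probs = logits.softmax(dim=1)[0]          # a probability distribution over 1,000 classes

top_probs, top_ids = probs.topk(3)
for p, i in zip(top_probs, top_ids):
    print(f"{weights.meta['categories'][int(i)]}: {p.item():.0%}")   # e.g. "tabby: 45%"
```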

Decision Space

The second dimension of difference between software and models is the decision space. By decision space we mean the context in which the software or model is used to make decisions. When we build software, we typically have a specification or a user need that is coded as an algorithm. The software is tested and, once fully tested, is made available for use. The software then gets executed to produce an outcome. This decision space is fixed, or static. If the needs of the user change, the algorithm has to be modified or rebuilt and retested. There is no promise of the algorithm learning or modifying itself. Figure 5 illustrates the context in which software gets used (an adaptation of how models are used from the book Prediction Machines).

Figure 5: Software decision space

When it comes to models, the decision space is more dynamic. Consider a machine-learning-based chatbot that has been trained to provide first-level support for queries related to smartphones. The model would have been trained on historical data covering the makes, models, and accessories of different smartphones. Once deployed, the chatbot will be able to answer customer queries about all the smartphones on the market. Let’s assume that when the chatbot cannot answer a query above a certain level of confidence, it routes the query to a human customer service agent. Such a chatbot will work fine for a few months, but as new models and accessories are introduced to the market, the chatbot will be unable to respond to customer queries and will progressively transfer more and more calls to human agents, eventually becoming useless.
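A minimal sketch of this hand-off logic is shown below. The classify_intent, answer_for, and escalate_to_human functions and the 0.70 threshold are hypothetical placeholders rather than part of any particular chatbot framework:

```python
# Confidence-based escalation: if the model's best guess falls below a threshold,
# the query is routed to a human agent instead of being answered by the chatbot.
CONFIDENCE_THRESHOLD = 0.70   # illustrative value

def handle_query(query, classify_intent, answer_for, escalate_to_human):
    intent, confidence = classify_intent(query)   # e.g. ("battery_replacement", 0.82)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer_for(intent)                 # the chatbot answers on its own
    return escalate_to_human(query)               # low confidence: hand off to a human

# Toy usage with stand-in functions:
reply = handle_query(
    "Does the new model support wireless charging?",
    classify_intent=lambda q: ("unknown_product", 0.41),   # a product not in the training data
    answer_for=lambda intent: f"Here is what I know about {intent}.",
    escalate_to_human=lambda q: "Transferring you to a customer service agent...",
)
print(reply)   # low confidence, so the query is escalated
```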

Models rely on historical data, and when that historical data is no longer relevant they need to be refreshed with newer data. This makes the context in which models operate more dynamic. While user needs or software specifications can also change, such changes happen less frequently. Moreover, there is no expectation that software will continue to function when the specification changes. In the case of models there is a clear expectation that when the performance of the model deteriorates, at the very least we get alerted to the deterioration (often called model drift), or at best the outcomes and new data are used to continuously improve the model (we will come back to this issue of continuous improvement in a subsequent blog). Figure 6 illustrates the context in which models get used. Note the feedback loop from outcome to training (the red dotted lines have been added to the original diagram in Prediction Machines for emphasis). It is this feedback process that makes the decision space for models more dynamic.

Figure 6: Model decision space
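As a simple illustration of such an alert, one can track the fraction of recent queries that get escalated to humans and flag when it drifts well above the rate observed at deployment. The DriftMonitor class below, the 500-query window, and the tolerance are all illustrative assumptions, not a production monitoring design:

```python
# A sketch of simple drift monitoring: watch the rolling escalation rate and raise a
# flag when it rises well above the baseline measured when the model was deployed.
from collections import deque

class DriftMonitor:
    def __init__(self, baseline_rate, window=500, tolerance=0.10):
        self.baseline = baseline_rate         # escalation rate at deployment time
        self.tolerance = tolerance            # acceptable degradation before alerting
        self.outcomes = deque(maxlen=window)  # 1 = escalated to a human, 0 = handled by the model

    def record(self, escalated):
        self.outcomes.append(1 if escalated else 0)

    def drifted(self):
        if len(self.outcomes) < self.outcomes.maxlen:
            return False                      # not enough recent traffic to judge
        current_rate = sum(self.outcomes) / len(self.outcomes)
        return current_rate > self.baseline + self.tolerance

monitor = DriftMonitor(baseline_rate=0.15)
# In production: call monitor.record(escalated) after every query, and trigger an alert
# (or kick off retraining with newly collected data) when monitor.drifted() is True.
```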

Inference

Traditional software typically uses a deductive inference mechanism, while machine learning models are based on inductive inference. As shown in Figure 7, a software specification acts as the theory from which the code is developed. The code can be viewed as the hypothesis that needs to be confirmed against the theory through observations. The observations are simply the outputs produced by the code, which need to be repeatedly tested against the specification.

Models are patterns derived from observational data. The initial model acts like a hypothesis that is iteratively refined to ensure that the model best matches the observational data. The trained model, once validated, captures the theory underlying the data. Sufficient care needs to be taken to ensure that the model does not overfit or underfit the data.
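A common way to check for this is to compare training and validation error as model complexity grows. The sketch below does so with synthetic data and polynomial regression, purely for illustration: an underfit model has high error on both sets, while an overfit one does much better on the training data than on the validation data.

```python
# Comparing training and validation error across model complexities to spot
# under-fitting and over-fitting, using synthetic data and polynomial regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)   # noisy observations of an underlying "theory"

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```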

Moving from theory to observation, as software does, is arguably easier than moving from observation to theory, as models do. This also highlights the challenge around the certainty or uncertainty of the output discussed in the first dimension.

Figure 7: Inference in software and models

Development Process

The process by which software is developed is also fundamentally different from the way models are typically developed. Software development typically follows a waterfall approach or an agile approach. The waterfall approach goes through a series of steps from specification to design to coding to testing and deployment. In an agile approach, the development process is iterative and often embodies a set of principles centered on user needs and self-organizing, cross-functional teams. Software is typically developed in one- to four-week sprints, with each successive sprint adding functionality and leading to a minimum viable product that is released to users.

The model development process has to follow a somewhat different approach. The availability, quality, and labelling of data, and the difficulty of estimating the desired accuracy (or, more generally, the performance of the algorithm), mean that model development needs to take a portfolio approach. Data scientists need to develop a number of models, and only a subset of these models may meet the performance criteria. As a result, the model development process needs to follow a more scientific process of experimentation, testing, and learning from those experiments to refine the next set of experiments. This hypothesize-test-learn cycle does not fit well with an agile software development life cycle. In a subsequent blog, we will revisit the model development process and how it can be integrated with the agile approach.
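The sketch below illustrates the portfolio idea in scikit-learn terms: several candidate models are treated as experiments, each is scored with cross-validation, and only those clearing an acceptance criterion move forward. The dataset, candidates, and 0.95 threshold are assumptions made for the example, not a prescription:

```python
# A portfolio of candidate models: evaluate each experiment with cross-validation
# and keep only the models that meet an agreed performance criterion.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "k_nearest_neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

REQUIRED_ACCURACY = 0.95   # illustrative acceptance criterion

selected = {}
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
    if score >= REQUIRED_ACCURACY:
        selected[name] = model   # only the experiments that meet the bar move forward

print("Models meeting the criterion:", list(selected))
```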

Mindset

The four dimensions we have addressed so far clearly separate the mindsets of those who build software from those who build models. Software developers typically have an engineering mindset: they work on architectural blueprints and the connections between different components, and they are typically responsible for production software. Software engineers typically have a computer science, information technology, or computer engineering background. They develop the software products that create the data.

Model developers, on the other hand, have more of a scientific mindset — they work on experiments, are better at dealing with ambiguity, and are typically interested in innovation as opposed to production models. A data scientist is someone who has augmented their mathematics and statistics background with programming to analyze data and develop mathematical models. They use data to draw insights and effect outcomes.

In subsequent blogs, we will see the implications of these differences for how software and models get developed in organizations. We will also look at the model development process in detail, how it can be integrated with the agile approach, and the emergence of new practices and roles like ModelOps and ML engineers.

Authors: Anand S. Rao and Joseph Voyles

