DATA SCIENTISTS ARE FROM MARS AND SOFTWARE ENGINEERS ARE FROM VENUS (PART 5)

Time to combine agile programming and agile data science

Agile Software 2.0 Manifesto and Process

Anand S. Rao
Towards Data Science
9 min read · Nov 6, 2020

Source: Photo by Robert Collins on Unsplash

In Part 1 of this series, we examined the key differences between software and models; in Part 2, we explored the twelve traps of conflating models with software; in Part 3, we looked at the evolution of models; and in Part 4, we went through the model lifecycle. Now, in our final part of the series, we address how the model lifecycle and the agile software development methodology should come together.

Based on our previous discussions, we are primarily concerned with how the model lifecycle process, with its iterative value discovery, value delivery and value stewardship, can be combined with the traditional agile software development methodology. The emphasis of this article is on combining the two methodologies; it is not about making data science or the model lifecycle agile, because the model lifecycle process is itself both iterative and agile.

History of Agile

The roots of agile can be traced back to agile manufacturing in the 1930s. In their article “The Secret History of Agile Innovation,” Darrell Rigby, Jeff Sutherland and Hirotaka Takeuchi describe this historical context:

Some trace agile methodologies all the way back to Francis Bacon’s articulation of the scientific method in 1620. A more reasonable starting point might be the 1930s when the physicist and statistician Walter Shewhart of Bell Labs began applying Plan-Do-Study-Act (PDSA) cycles to the improvement of products and processes. Shewhart taught this iterative and incremental-development methodology to his mentee, W. Edwards Deming, who used it extensively in Japan in the years following World War II.

One of the early uses of the concept of agility in software (that is, its continuous and iterative nature) goes back to the 1950s. According to Craig Larman and Victor Basili, Gerald Weinberg and Bernie Dimsdale at IBM’s Service Bureau Corporation were doing incremental software development in Los Angeles as early as 1957. However, the birth of agile is usually credited to a 2001 conference in Snowbird, Utah. The following quote from an article in The Atlantic captures the spirit of this birth:

But it was here, nestled in the white-capped mountains at a ski resort, that a group of software rebels gathered in 2001 to frame and sign one of the most important documents in its industry’s history, a sort of Declaration of Independence for the coding set. This small, three-day retreat would help shape the way that much of software is imagined, created, and delivered — and, just maybe, how the world works.

It was at this venue that the Agile Manifesto and the twelve principles of agile were born. It is worth going into the agile manifesto in detail so that we can develop an alternative manifesto for agile data science.

Manifesto for Agile Software Development

The manifesto for agile software development consists of four key statements:

  1. Individuals and interactions over processes and tools
  2. Working software over comprehensive documentation
  3. Customer collaboration over contract negotiation
  4. Responding to change over following a plan

The document goes on to state that while there is value in the items on the right of each statement, its authors value the items on the left more. The agile mindset is therefore additive to the best practices of its time rather than a replacement for them. In practice, agile is sometimes used as an excuse for skipping comprehensive documentation, an architectural design, or a development plan. Ironically, agile itself has become more of a process and methodology than a focus on individuals and their interactions.

Agile Data Science Manifesto

In his first book on Agile Data Science in 2013 and a revised version called Agile Data Science 2.0 in 2017, Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications. He defines the goal of agile data science as follows:

The goal of the Agile Data Science process is to document, facilitate, and guide exploratory data analysis to discover and follow the critical path to a compelling analytics product.

He then organizes his Agile Data Science Manifesto along the following seven principles.

1. Iterate, iterate, iterate: tables, charts, reports, predictions.

2. Ship intermediate output. Even failed experiments have output.

3. Prototype experiments over implementing tasks.

4. Integrate the tyrannical opinion of data in product management.

5. Climb up and down the data-value pyramid as we work.

6. Discover and pursue the critical path to a killer product.

7. Get meta. Describe the process, not just the end state.

These principles target two of the key differences between software and models that we outlined in Part 1 of this series: the development process, which is centered on experiments, and the inferencing mechanism, which is based on induction from data.

In our view this agile data science manifesto misses out on some key developments:

  • Data products vs Models: The emphasis is on building (big) data products with a web front-end in an agile manner. The richness of continuous and iterative value discovery, value delivery, and value stewardship is missing in the manifesto.
  • Data scientists vs interdisciplinary teams: As we have discussed earlier, model lifecycle management is moving from prediction as a service to the model factory, which is resulting in the emergence of new roles such as ML engineers, ModelOps, and DataOps experts. This shift requires a stronger focus on interdisciplinary teams.
  • Responding to changes vs Disruption: The agile data science manifesto focuses more on responding to change as opposed to disrupting the market with an innovative product.

Now we look at how we combine the agile software development and data science manifestos into what we call the Agile Software 2.0 Manifesto.

Agile Software 2.0 Manifesto

Andrej Karpathy, the senior director of artificial intelligence at Tesla, introduced the term Software 2.0 and contrasted it with traditional programming, which he called Software 1.0. Specifically, he said:

The “classical stack” of Software 1.0 is what we’re all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer identifies a specific point in program space with some desirable behavior.

In contrast, Software 2.0 can be written in much more abstract, human unfriendly language, such as the weights of a neural network. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard (I tried).

While Karpathy uses the term Software 2.0 only for systems developed using machine learning and deep learning, it is clear that Software 1.0 and Software 2.0 development will need to co-exist for the foreseeable future. As a result, we should combine the key values of agile software development with those of agile model development. We view the objective of this combined manifesto as follows:

We are uncovering better ways of developing software powered by (AI/ML) models by facilitating continuous integration (of data, software, and models), delivery, and (machine) learning.

Extending the Agile Software Development Manifesto into the Agile Software 2.0 Manifesto, we have these four basic values:

  1. Multi-disciplinary teams AND individuals & interactions: When we combine the traditional Software 1.0 (e.g., traditional programming) with Software 2.0 (e.g., machine learning or more generically model building) we are combining the data, software, and model lifecycles. This requires us to focus on multi-disciplinary teams from data, software, and data science, including business analysts, data analysts, architects, software developers, data scientists, ML Engineers, DevSecOps, MLOps or ModelOps, and AI ethicists. No one individual can bring all these capabilities, and no single discipline can supply them all; the mix of disciplines on the team should be determined by the stage in the lifecycle.
  2. Insightful actions and decisions AND working software: Traditional agile software development accelerates the timeline for useful working software, often called the Minimum Viable Product (MVP). However, forcing data exploration and model building into this agile software cycle often yields only simple descriptive analytics, with no real insights and no predictive or prescriptive models. The model building or data science agile cycle needs to be decoupled from the software agile cycle (as described below) in order to produce working software that is also insightful.
  3. Data & model exploration AND customer collaboration: Agile software development emphasized customer collaboration because the traditional waterfall method over-indexed on collecting requirements from customers, designing the software, building, and testing. Software 2.0 adds another dimension to this equation: data and models. Customers may not always be able to articulate their preferences or the reasons why they made certain decisions. The ability to have the “data tell the story” and the “model bring out the essence of the data” is critical for Software 2.0.
  4. Being innovative & disruptive AND responding to change: Agile software development is very good at responding to change — especially changes requested by customers. In our experience, the short, iterative cycles of software development (i.e., sprints) often lead to incremental improvements in functionality, but do not provide sufficient opportunity for innovating and disrupting by using the data that is available and the insights obtained from the models.
Figure 1: Agile Software Development Manifesto (Source: Agilemanifesto.org) and Agile Software 2.0 Manifesto

Having discussed the Agile Software 2.0 Manifesto, we now go into the details of how to combine the agile software development lifecycle with the model development lifecycle that we discussed in Part 4 of this series.

Integrated Agile Software 2.0 Process

Given the fundamental differences between software and models — and the reasons mentioned above in the Agile Software 2.0 Manifesto — we need to separate the traditional agile software sprints from the model lifecycle in terms of the timing. At the same time, we do not want to run them as two separate and distinct agile cycles, as that would prevent us from realizing the full benefits of traditional software and models. So, what is the solution?

One solution involves interleaving the sprint cycles and having a separate clock for software sprints and model sprints. There are a few key interaction points that are worth elaborating on in this interleaved process:

  1. Product Start/Backlog: At the start of a product — especially a product that incorporates both traditional software and models — the software, data and modeling teams need to come together to decide on the key functionality desired by the customer, the data that is available, the potential insights that need to be generated or hypotheses that need to be tested, and the key product differentiators. Once this is done, the software and modeling teams can independently carry out their sprints. The software team can go through their standard sprint process, while the modeling team does the exploration and experimentation with data and models as described by our value delivery loop in the model lifecycle.
  2. Sprint Backlog: When the modeling team has verified the hypotheses and generated the insights, the tested models and data pipeline are placed in the sprint backlog. When the software team has finished its standard sprints, it takes the tested models and incorporates them in the current version of the software. At this step, the software developers, data engineers, ML Engineers and data scientists must work together to tune the models and deploy them at scale.
  3. Finished Work and Value Delivery: This software and model integration sprint then produces software with an embedded model that is ready for deployment. At the conclusion of this integrated sprint, the model is deployed in a production environment, and the value delivery process begins. Unlike traditional software, which can go straight into an operational phase, software with embedded models needs to go through the value delivery phase to confirm that the model is still performing as required in a production environment. Once this is done, the integrated model can go into the value stewardship process.
Figure 2: Agile Software 2.0 Process
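
The interleaved process above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not a prescribed implementation: the `Team` class, the `interleave` function, the two-week and three-week sprint lengths, and the `model-vN` names are all hypothetical and chosen purely for illustration. It shows the software and modeling teams running on separate clocks, with tested models handed over through a shared sprint backlog and picked up at the end of a software sprint.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    sprint_weeks: int  # hypothetical cadence; the article does not prescribe one

    def finishes_sprint(self, week: int) -> bool:
        # A team finishes a sprint whenever its own clock completes a cycle.
        return week % self.sprint_weeks == 0


def interleave(total_weeks: int, software: Team, modeling: Team) -> list[str]:
    """Return a week-by-week log of the interleaved sprint process."""
    # Step 1: all teams agree on the product backlog before the clocks diverge.
    log = ["week 0: joint product backlog agreed by software, data and modeling teams"]
    sprint_backlog: list[str] = []  # tested models waiting to be integrated
    model_version = 0

    for week in range(1, total_weeks + 1):
        # Step 2a: a finished modeling sprint places a tested model in the backlog.
        if modeling.finishes_sprint(week):
            model_version += 1
            model = f"model-v{model_version}"
            sprint_backlog.append(model)
            log.append(f"week {week}: modeling sprint done, {model} placed in sprint backlog")

        # Step 2b: a finished software sprint picks up the oldest tested model, if any.
        if software.finishes_sprint(week):
            if sprint_backlog:
                model = sprint_backlog.pop(0)
                # Step 3 (deployment and value delivery) would follow this integration sprint.
                log.append(f"week {week}: software sprint done, integration sprint picks up {model}")
            else:
                log.append(f"week {week}: software sprint done, no tested model to integrate yet")

    return log


for entry in interleave(6, Team("software", 2), Team("modeling", 3)):
    print(entry)
```

With these assumed cadences, the first software sprint ends before any model is ready, which is the situation the separate clocks are meant to tolerate: the software team keeps shipping, and integration happens at the next software sprint boundary after a tested model lands in the backlog.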

In conclusion, Agile Software 2.0 needs to incorporate the leading methods from agile software development and agile data science to help generate ROI for clients that want to deploy software with embedded AI/ML models.

The convergence of data, software, and AI/ML models is just beginning. The extensive engineering, development and maintenance methodologies that exist for software do not yet exist for AI/ML models. Best practices are just emerging, and the next decade will see more academic and industry advances in this area. We started this series with a claim that data scientists are from Mars and software developers are from Venus. These disciplines have been coming together over the past couple of years. This five-part series is just the beginning of what is likely to be a major area of focus for software developers and AI/ML modelers.

Global AI lead for PwC; Researching, building, and advising clients on AI. Focused at the intersection of AI innovation, policy, economics and application.