
All meaningful data science explorations should be reproducible

Data science exploration is the process of understanding the requirements and feasibility of a technical approach that contributes to AI/ML…

Photo by Keven Ku on Pexels

Synopsis

If you do not have time, here is the 30-second version:

  • If you or your team is doing data science exploration work that is not a total waste, you need to preserve it in such a way that you, your team, or someone else can get back to it later without too much trouble. The value of an idea starts with exploration, and the value of exploration starts with sharing it in a way that is easy to reproduce.
  • It may be tempting to point to a notebook running on a platform accessible to you or your team at the moment of exploration, e.g., hosted JupyterHub, Databricks, etc. However, such an approach is unlikely to survive the test of time. The harder the work is to reproduce, the higher the chance that it will not be studied carefully months or years later. That, in turn, increases the likelihood of accepting assumptions already negated by prior exploration, rejecting recommended propositions, or, much worse, redoing the same exploration only to reach the same conclusion.
  • Reproducing data science work requires maintaining a version of its dependencies, i.e., the code, the data (both inputs and outputs), and the parameters. It also requires a guide that explains the problem, the solution approach, the solution's assumptions and limitations, potential future work, the code execution environment and how to execute the code in it, where the data is located and how to access/load it while executing the code, the key parameters and metrics needed to manipulate the code and the resulting analysis, and a brief analysis of the results and conclusions.
  • A suggested approach is to put the code in a distributed version control system like GitHub, define libraries and their versions in an explicit file like requirements.txt, save parameters and data paths in a nested file format like YAML, add an infrastructure description file that defines the environment where the code can execute, like environment.yml or a Dockerfile, and write a user guide like README.md that includes as much detail as possible to facilitate the manual steps, e.g., running the code on an external platform, loading the data, understanding the code, analysing the results, etc.
A sample structure for versioning exploration work (photo by author)

If you are intrigued and have more time, here is the longer version:


Background

Data science exploration is the process of understanding the requirements and feasibility of a technical approach that contributes to AI/ML solutions. At the time of writing, I am working heavily with a fashion retailer. Here are a few examples of explorations that can be relevant to a fashion retailer. Please share explorations from your own domains in the comments for a wider understanding of how exploration works.

Photo by Artem Beliaikin on Pexels
  1. Can we use image features to identify similar/complementary garments?
  2. Can we evaluate the impact of discount strategy on garment stock through simulation?
  3. How much time can we save on customer order fulfilment if we choose allocation strategy X over Y?
  4. Does buying variations of the same garment increase the probability of returns?

These explorations vary a lot. In scope, an exploration can range from answering a simple question about a data source to something significantly more complicated where a real-world process needs to be modelled. In human resources, it can be done by an individual expert or by a team of experts of several kinds. In duration, it can run from a week to several months. In area, it can include data analysis, engineering proofs of concept, algorithm experimentation, or some combination of the three.

Explorations are by nature messy. There is no guarantee that a good, usable result will come out of them. The same is true in many other disciplines of applied engineering and science, e.g., aeronautical, pharmaceutical, etc. If explorers in those fields are expected to keep as much detail as possible for every important experiment, why should we lower our standards, especially when our decisions, often shaped by our explorations, can influence the lives of millions?

Currently, exploration in data science typically means writing code that studies data in order to answer a question, or to expose problems or promise in assumptions, on platforms that allow easy execution of code, analysis of results, and saving of exploration tables, models, and plots that are shareable and viewable by many collaborators. Databricks and Google Colaboratory are examples of cloud-based platforms widely adopted by organisations and companies all over the world. Our work has no doubt become better thanks to such powerful platforms, since finishing explorations on them has become significantly faster. Among users of these platforms, a common practice is to archive explorations only as notebooks (in certain states) on the platform itself. Drawing from the software engineering discipline that heavily influences the world of data science, this is an anti-pattern. Here are a few problems with that approach that can change the truth conveyed by the original developer:

  1. Someone with access to the notebook can alter the code
  2. Source code and other types of files in the platform may be deleted/renamed/moved by someone with access to the environment
  3. If the code reads live data, that data may change
  4. Data and results may be overwritten by a more recent execution on more recent data
  5. The platform may be a transient one, which may be purged

In any of the above cases, the problem can easily arise without intention because of the flexibility of these platforms.

Suggested Approach

The suggested approach is very obvious. I am sure thousands before me have advocated this. However, the issue is so important that it needs to be said again and again.

Reproducing explorations requires archiving the exploration work, i.e., its artifacts, process, and details, in a way that guarantees their versions are kept intact for a long time. This means the platform where we execute explorations should not be mistaken for an archiving platform, unless the execution platform is specifically designed to optimise archiving: version controlling all aspects of explorations, keeping archives alive even when the platform is purged, supporting "lift and shift" migration of explorations, etc.

Photo by author

I do not possess extensive knowledge of such platforms. However, with the exposure I have, I have yet to see a scalable platform that guarantees both easy execution and durable archiving of explorations. One will probably become a reality in the not-so-distant future. Until then, we have to use our current arsenal to come up with a workable archiving strategy. So, here are the suggestions:

  1. Data version: Maintain your input and output data as a snapshot, and if possible a versioned snapshot, in a long-lived storage facility. Avoid versioning data in a version control system built for code, since such systems are designed for a different purpose: maintaining small files, not large ones. A minimal sketch of the snapshot idea follows this list.
  2. Code/Parameter version: Maintain your experiment and analysis code and parameters in a distributed version control system. Maintain your data generation pipeline as code in the same version control system. Avoid graphical tools for generating data if they cannot guarantee reproducibility. External libraries used in the code and the data access paths should be explicitly stated. It may be tempting to maintain the parameters in a separate system, but keeping them with the code keeps things simple.
  3. Guide: Add a small report explaining how to use the above artifacts to understand and reproduce the work. The report can be kept with the versioned code, since it is likely to be a small file. Avoid including large images in the report.
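
To make the data version point concrete, here is a minimal sketch in Python of writing inputs and outputs as timestamped snapshots to a long-lived storage location. The storage path, file names, tables, and the pandas/parquet choice are illustrative assumptions, not prescriptions for any particular platform.

```python
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

# Illustrative long-lived storage root; in practice this would point to
# your datalake or another durable storage facility.
STORAGE_ROOT = Path("/mnt/long-lived-storage/explorations/discount-simulation")


def snapshot_dir(root: Path) -> Path:
    """Create a new snapshot directory named with a monotonically
    increasing timestamp, e.g. .../snapshots/20240131T120000Z."""
    version = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = root / "snapshots" / version
    path.mkdir(parents=True, exist_ok=False)  # never overwrite an existing snapshot
    return path


# Hypothetical input and output tables of an exploration.
inputs = pd.DataFrame({"garment_id": [1, 2], "stock": [120, 45]})
outputs = pd.DataFrame({"garment_id": [1, 2], "predicted_sellout_weeks": [6.5, 2.0]})

target = snapshot_dir(STORAGE_ROOT)
inputs.to_parquet(target / "inputs.parquet")
outputs.to_parquet(target / "outputs.parquet")
print(f"Snapshot written to {target}")
```

Both the input extract and the data behind the results are written, so the analysis can later be re-run against exactly what was explored.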

Note that I am not advocating against archiving in the execution platform. It is quite beneficial to archive explorations in the execution platform for immediate use. However, the benefit decreases as time progresses.

Concrete Example

I will provide a concrete example inspired by my current ways of working with explorations. If you align with the principles described above, please share in the comments suggestions that match your own stack. It would be really nice to learn about all the possibilities.

One stack that I use these days includes the following:

  1. Python
  2. Azure Databricks
  3. Azure Repos
  4. Azure Data Lake
  5. Microsoft SharePoint

Here is my suggestion, based on the above stack:

Data version: Preserve all non-aggregated/non-joined but filtered inputs extracted from databases, the datalake, other tables, or streams in an Azure Data Lake location that is meant to be long-lived. Data versioning technology is still in its infancy; it may be tempting to use advanced techniques such as Delta Lake, but understand their limitations. For outputs, it is far more important to save the data that generates the plots or analysis than the plots and analysis themselves. For explorations, it may be good enough to preserve a couple of versions, which can be managed by a simple strategy such as directories named with monotonically increasing numbers, e.g., integers, timestamps, etc. If your stack is based on some other public/private cloud, choose the datalake at your disposal that allows saving a wide range of file formats, including but not limited to csv, parquet, avro, png, pkl, json, yml, etc. A sketch of this idea on the Databricks/Azure stack follows below.
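
As a sketch of this on the Databricks/Azure stack, the snippet below writes a filtered, non-aggregated input extract and the data behind an analysis to versioned datalake directories. It assumes the spark session that Databricks notebooks provide; the storage account, container, table, and column names are placeholders.

```python
from datetime import datetime, timezone

from pyspark.sql import functions as F

# `spark` is the SparkSession provided by the Databricks notebook environment.
# Placeholder datalake location; replace with your own account and container.
BASE = "abfss://explorations@mydatalake.dfs.core.windows.net/garment-returns"
VERSION = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

# Non-aggregated but filtered input extracted from a live (hypothetical) table.
orders = (
    spark.table("sales.orders")
    .filter(F.col("order_date") >= "2024-01-01")
    .select("order_id", "garment_id", F.col("is_returned").cast("int").alias("is_returned"))
)
orders.write.mode("errorifexists").parquet(f"{BASE}/inputs/orders/{VERSION}/")

# Save the data behind the plot/analysis, not only the rendered plot.
return_rates = orders.groupBy("garment_id").agg(F.avg("is_returned").alias("return_rate"))
return_rates.write.mode("errorifexists").parquet(f"{BASE}/outputs/return_rates/{VERSION}/")
```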

Code/Parameter version: Distributed version control technology has improved significantly, not only in efficiency but also in the portability of repositories across platforms. It makes little difference whether your code is maintained in Azure Repos, Bitbucket, or GitHub. Maintain all your code in a repository, and add the following files to document its dependencies, even if you do not use them explicitly:

  • libraries with their version numbers in a requirements.txt
  • an execution environment file, such as conda’s environment.yml, Docker’s Dockerfile, or Databricks’ cluster description in a YAML file
  • code manipulation parameters, data access paths (with their version numbers), and external integration information in json or yml files (avoid including sensitive information at all costs); a sketch of such a file follows this list
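
As an illustration of the parameter file, here is a hedged sketch of what such a YAML might contain and how the exploration code could load it. The file name, keys, and paths are invented for the example, and PyYAML is assumed to be listed in requirements.txt.

```python
import yaml  # PyYAML, assumed to be listed in requirements.txt

# Example contents of a hypothetical params.yml:
#
# data:
#   inputs:
#     orders: "abfss://explorations@mydatalake.dfs.core.windows.net/garment-returns/inputs/orders/20240131T120000Z/"
#   outputs:
#     return_rates: "abfss://explorations@mydatalake.dfs.core.windows.net/garment-returns/outputs/return_rates/20240131T120000Z/"
# model:
#   min_purchases_per_customer: 3
#   train_test_split: 0.8

with open("params.yml") as f:
    params = yaml.safe_load(f)

orders_path = params["data"]["inputs"]["orders"]   # versioned data access path
split = params["model"]["train_test_split"]        # code manipulation parameter
```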

Guide: Add a README.md file in the code repository that includes the following:

  1. Contact information, such as name, email address, role, and team of the developer
  2. A brief description of the problem, solution approach, assumptions/limitations, and future work, along with a brief analysis of the exploration and its conclusion. This information is often contained in presentation materials, such as PowerPoint or PDF files. In that case, save the file in SharePoint (or whatever document archiving solution you have) and link to it from this guide. If the exploration uses an approach published by someone else, refer to the publication.
  3. A brief description of the execution environment that explains cluster configuration, library installation, data file loading, secret management, etc. Where supporting files exist, e.g., requirements.txt, refer to those files. If you are using a model management system, e.g., MLflow, mention the experiment names; a sketch follows this list.
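
If MLflow is the model management system in use, naming the experiment explicitly in code makes the README reference stable. A minimal sketch, assuming MLflow is available on the cluster and using an invented experiment name, parameter, and metric:

```python
import mlflow

# Name the experiment explicitly so the README can refer to it by name.
mlflow.set_experiment("/Shared/explorations/garment-returns-2024")

with mlflow.start_run(run_name="baseline-logistic-regression"):
    mlflow.log_param("min_purchases_per_customer", 3)  # illustrative parameter
    mlflow.log_metric("auc", 0.71)                     # illustrative metric
```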

Disclaimers

In this article, I have expressed my opinion based on common sense and experience. I do not expect it to match your reality as is. However, like my code, my opinion has versions. It will not change much by next week or month, but it will probably change a lot by next year. If you do not agree with me or prefer a variation of what I proposed, please provide feedback in the comments. While working on this piece, I was listening to Bill Maher, who was advocating speaking your mind. I may have been slightly influenced by that.

