
Machine Learning code reproducibility using versioning

Software dependency and version control are hard and tedious processes. But, as such, they are the ideal target for automation.

Machine Learning systems are based on three main components: data, code and model. The three components are interrelated in many ways, and by using a dependency control system we can ultimately achieve reproducibility. This factor is key because it is one of the main drivers when managing ML solutions in production.

Photo by James Sutton on Unsplash

Usually, software dependency and version control are hard and tedious processes. But, as such, these kinds of processes are the ideal target for automation. Before delving into the implementation, let’s review the key players interacting with these processes:

Version schema

Currently, there is one schema used by most Linux distribution packages, called semantic versioning. On the other hand, Python has its own versioning schema, defined by PEP 440 (Version Identification and Dependency Specification). Both schemas share some common ground but also differ in some aspects.

Public version and Local version

As shown in the above illustration, the first part (public version) is almost identical for both schemas. The second part (local version) is what differentiates SEMVER from PEP440. These two parts are evaluated every time the precedence order between versions needs to be calculated.

Given the public version, you should increment the:

  1. MAJOR version when you make incompatible API changes (in case of ML applications 🡒 input data/output response),
  2. MINOR version when you add functionality in a backward-compatible manner (in case of ML applications 🡒 improve feature engineering, change ML model, etc)
  3. PATCH/MICRO version when you make backward-compatible bug fixes

The local version is used for testing versions not ready for general use (alpha/beta/pre-release/post-release versions, developer builds, etc.).
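
To make the precedence rules concrete, here is a minimal sketch comparing versions under both schemas; it assumes the python-semver library (pinned later in this post to 2.13.0) and the packaging library are installed:

import semver                            # SemVer precedence
from packaging.version import Version    # PEP 440 precedence

# A pre-release always sorts before the corresponding final release
assert semver.compare("1.2.0-alpha.1", "1.2.0") == -1
assert semver.VersionInfo.parse("1.2.0-rc.1") < semver.VersionInfo.parse("1.2.0")

# The same rule holds under PEP 440 (rc1 is a pre-release segment)
assert Version("1.2.0rc1") < Version("1.2.0")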

Package

Most Python packages are packaged using setuptools. The setup method has a version parameter, which is just a string, so it is supposed to be schema agnostic… right? In reality, setuptools follows the PEP 440 standard mentioned before, so if you use semantic versioning, you might lose some information about the local version. More on this later when we review pysemver.

Artifact Repository

Another player in the architecture is the place where packages are hosted and distributed to the clients. It also deals with other subtleties: for instance, depending on the version requested by the client, the repository needs to return (in most cases) the latest version, which involves ordering versions and getting precedence right. The most common Artifact Repository is PyPI, but for private solutions you can use others, such as Azure Artifacts.
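
To get a sense of what "getting precedence right" means, this sketch (assuming the packaging library is installed) shows how a PEP 440 aware client would order published versions, which plain string sorting gets wrong:

from packaging.version import Version

published = ["0.1.9", "0.1.27", "0.1.10", "0.2.0rc1"]

# Lexicographic sorting would place "0.1.9" after "0.1.27"; PEP 440 ordering does not
print(sorted(published, key=Version))
# ['0.1.9', '0.1.10', '0.1.27', '0.2.0rc1']
# Note: pip skips pre-releases such as 0.2.0rc1 unless they are explicitly requested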

SCM

Depending on the way we store the current package’s version, we can consider two main strategies for developing a version control system:

  1. The first strategy is adding a file that stores the package’s version. For example, python-versioneer or bump2version follow this strategy.
  2. The second strategy relies on the SCM, as you store the version using SCM artifacts. For example, if you’re using GIT as SCM, the most widespread practice is to store the version as a TAG; in this way, you can quickly associate the package’s version with the git history log (for example, git checkout 0.1.3 retrieves the code that was packaged in version 0.1.3). In other SCMs you could also use branches. Tools like setuptools_scm use this strategy (a minimal sketch follows right after this list).
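
For completeness, this is a minimal sketch of the second strategy using setuptools_scm (not the tooling used in the rest of this post): the version is derived from the latest git tag at build time, so no version string lives in the source tree.

from setuptools import setup, find_packages

# Minimal sketch: setuptools_scm reads the latest git tag (e.g. 0.1.3) at build time
setup(name="product_sales_ml_etl",            # package name reused from the example below
      packages=find_packages(include=["etl"]),
      use_scm_version=True,
      setup_requires=["setuptools_scm"])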

Yet another Python version control

I will say it now and again at the conclusion: there is no silver bullet. The solution presented here is fairly simple, but involved enough to give you a sense of the troubles related to versioning Python code for Machine Learning applications, so that hopefully you will test for possible shortcomings before implementing your own version control.

For this simple solution, I will use GIT with TAGs, because I don’t like creating bump version file commits or detached commits (usually authored by the CI/CD).

Secondly, I will use semantic versioning, with the support of the lightweight python-semver, although I will use only the public version, to maintain compatibility with PEP 440. It’d be nice to support local versioning during PR (branch development), but for the moment I will keep things simple.

Python stores package metadata information (namely, many of the parameters used in the setup function, such as version and description) in the distribution itself. We will use the support for package metadata introspection built-in since Python 3.8, but if you want to use it in previous versions, you can install the package importlib-metadata.
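
As a quick illustration of the introspection API (querying pip here only because it is always installed):

from importlib.metadata import version, metadata    # Python >= 3.8
# from importlib_metadata import version, metadata  # backport package for Python < 3.8

print(version("pip"))              # e.g. 21.0.1
print(metadata("pip")["Summary"])  # short description stored in the package metadata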

Lastly, we will rely on Azure Devops for GIT SCM (Repository), CI/CD (Pipelines) and Artifact repository. Alternatively, you can also use GitHub for SCM and CI/CD (GitHub Actions) and PyPI for the Artifact repository (at least until GitHub Packages supports Python).

Simple version

For the sake of this post, let’s create a simple Python package (a console application) for executing an ETL for a Machine Learning application. The console application, at this stage, simply outputs the package version, as a test of the metadata introspection explained before.

There is one requirement to be fulfilled before applying this process: you need to push an initial tag (the first version – this is a one-time action). You can name this tag as you wish as long as it complies with semantic versioning syntax (for example: 0.1.0).

git tag -a "0.1.0" -m "Release v. 0.1.0"
git push origin "0.1.0"

Then achieving the mentioned version control is just a matter of executing the following steps:

1. Retrieve the most recent tag

export DESCRIBE=$(git describe --always --tags --long --first-parent)
export VERSION=$(echo $DESCRIBE | cut -d "-" -f 1)

First of all, to create a new version, we need to know the current version. When using GIT as SCM, we store the (current) version as a tag of the repository. So, in this case, [git describe](https://git-scm.com/docs/git-describe) will return the most recent tag, and the result of executing the command might look like 0.1.27-3-ge72f11d. The command response can be split into distinct parts:

  • 0.1.27 : most recent tag
  • 3 : updates (number of commits) after the 0.1.27 tag
  • e72f11d: git commit hash. This is very handy, as you can check out the associated version using the commit hash:
git checkout using hash and tag are equivalent
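
One caveat: the cut -d "-" -f 1 command above keeps only the text before the first hyphen, so it breaks if the tag itself contains hyphens (for example, a pre-release tag like 0.1.27-rc.1). A sketch of a more defensive parsing in Python, splitting from the right instead:

describe = "0.1.27-rc.1-3-ge72f11d"   # hypothetical output of git describe --tags --long

# Split from the right: the last two fields are always the commit count and the hash
tag, commits, commit_hash = describe.rsplit("-", 2)
print(tag, commits, commit_hash)      # 0.1.27-rc.1 3 ge72f11d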

2. Install python-semver and bump the current version

pip install semver==2.13.0
export VERSION=$(pysemver bump patch $VERSION)

python-semver is a transparent and simple library (in the sense that it doesn’t introduce any new artifacts in the source code). Furthermore, python-semver installs the pysemver command, which can be launched from the CLI. For example, the following table collects the data returned when executing pysemver bump <command> <version>:

Bump version using pysemver

This approach doesn’t need python-semver as a package dependency of the application, as it is only needed in the CI/CD agent to bump the current version (so you don’t need to include python-semver in your requirements.txt).
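
If you ever need to bump the version from a Python script rather than from the CLI, the library exposes equivalent functions; a small sketch (assuming semver==2.13.0 as installed above):

import semver

current = "0.1.27"
print(semver.bump_patch(current))   # 0.1.28
print(semver.bump_minor(current))   # 0.2.0
print(semver.bump_major(current))   # 1.0.0

# The same bump using the object-oriented API
print(semver.VersionInfo.parse(current).bump_patch())   # 0.1.28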

3. Create a new tag using the bumped version

git tag -a "$VERSION" -m "Release v. $VERSION"
git push origin "$VERSION"

Pushing a tag to the repository does not trigger the CI/CD pipelines that are fired by new commits.

On the application side, we need to:

1. Define setup.py

import os
from setuptools import setup, find_packages

_package_name_ = "product_sales_ml_etl"

setup(name=_package_name_,
      packages=find_packages(include=["etl"]),
      # VERSION is exported by the CI/CD pipeline; the default is only a fallback
      version=os.getenv('VERSION', '2.0.0'),
      python_requires=">=3.8")

The setup method takes as its version parameter the environment variable set up before (which contains the bumped version). Then you can execute:

python setup.py bdist_wheel

and obtain a file named product_sales_ml_etl-0.1.2-py3-none-any.whl inside the dist folder. Also, mind that if you are using local versions, the version might get mangled in the filename. For example, in the following table, you can observe that the -rc.1 version is altered before getting stamped in the filename:

Package naming using setuptools

This is one of the reasons why it is inadvisable to obtain the version from the filename; you should use other mechanisms instead, such as package metadata introspection.
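
The mangling is simply PEP 440 normalization at work; you can reproduce it with the packaging library (a quick check, assuming packaging is installed):

from packaging.version import Version

# setuptools normalizes the version string before stamping it into the wheel filename
print(Version("0.1.2-rc.1"))      # 0.1.2rc1 -> the semver separators disappear
print(Version("0.1.2+build.5"))   # 0.1.2+build.5 -> local versions keep their dots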

The described steps can be easily packaged into a pipeline and executed by CI/CD. In the third step explained before, at the time of pushing the TAG, you will need to grant permissions to the build service to complete that action. After following the steps annotated at run git commands from the pipeline, I kept getting an error stating You need the Git 'GenericContribute' permission to perform this action. If this is your case, the way I solved the problem was by checking the user ID returned in the error and looking it up in the Permissions tab, as described in the following figure:

Azure Devops – Repository permissions tab

For the selected user, you should check the "Contribute", "Create tag" and "Read" permissions.

2. Inspect current version

For the sake of your curiosity, you can inspect a folder named product_sales_ml_etl.egg-info, which contains a file named PKG-INFO with contents similar to this one:

Metadata-Version: 2.1
Name: product-sales-ml-etl
Version: 2.0.0
Requires-Python: >=3.8

This folder is also packaged into your wheel/egg file, and when you use importlib.metadata (or the importlib-metadata backport), it is these packaged files that get inspected:

from functools import lru_cache

_package_name_ = "product_sales_ml_etl"

@lru_cache
def get_version():
    # Introspect the installed package metadata (built-in since Python 3.8)
    from importlib.metadata import version
    return version(_package_name_)

Finally, another devops pipeline task can be set up for publishing and/or consuming the package:

1. Publish package

The python package produced by the pipeline can be uploaded to PyPI or any compatible pip repository. Azure Artifact repository is compatible with pip and, after setting it up, you can use the following devops template that uses twine to upload the package:

- task: TwineAuthenticate@0
  inputs:
    artifactFeeds: 'productsalesml/wheel_feed'
    publishPackageMetadata: true

- bash: |
    pip install twine
    python -m twine upload --config-file $(PYPIRC_PATH) -r "productsalesml/wheel_feed" dist/*.whl
  displayName: Upload artifacts to feed

At this stage, you start to realize some annoying details when dealing with versioning: for example, in the bash task described before, we used dist/*.whl. As you saw before, the generated file name depends on the package name + the version set in the setup function + the Python version, so the filename changes as soon as any of these parameters change. This is why we use pattern matching to pick any file with the .whl extension instead of a fixed filename.

2. Consume package

Once the package has been uploaded into the feed, it can also be consumed within the pipeline:

- task: PipAuthenticate@1
  displayName: Pip Authenticate
  inputs:
    artifactFeeds: 'productsalesml/wheel_feed'
    onlyAddExtraIndex: true

- bash: |
    pip install product-sales-ml-etl
    python -m etl.console_main
  displayName: Install package and test it

Extended version

Sometimes you need to store additional information about the version/building process (for example, storing the whole git describe response and/or the [Build.BuildNumber](https://docs.microsoft.com/en-us/azure/devops/pipelines/build/variables?view=azure-devops&tabs=yaml#build-variables-devops-services)). In theory, you can take advantage of the package’s metadata to store such information, but I couldn’t find any friendly way of extending the metadata (although PEP 459 affirms it is possible). In the end, I encoded the extended version into the long description metadata.

First, we need to extend the information captured during the pipeline execution. For example, add the following variable to the devops pipeline:

variables:
- name: buildnumber
  value: $(Build.DefinitionName)_$(Build.Repository.Name)_$(Build.SourceBranchName)_$(Build.BuildNumber)

This creates a pipeline variable named buildnumber, whose value is composed from predefined Build variables. The value of buildnumber might be similar to this: cicd-main_etl_testBranch_20201229.18

  • cicd-main: Build.DefinitionName 🡒 devops pipeline name
  • etl: Build.Repository.Name 🡒 repository name
  • testBranch: Build.SourceBranchName 🡒 the branch name’s last segment (for example, the last segment of features/testBranch is testBranch)
  • 20201229.18: Build.BuildNumber 🡒 the build number is based on the build date + incremental ID

Now, the setup method can collect the values of the environment variables previously set during the execution of the pipeline (here I define default values, useful when testing locally or when the variables are empty):

import os
import json
from setuptools import setup, find_packages

_package_name_ = "product_sales_ml_etl"

# Values injected by the CI/CD pipeline; the defaults are only useful for local testing
metadata_custom = {
    'version': os.getenv('VERSION', '0.1.2'),
    'revision': os.getenv('DESCRIBE', '0.1.2-1-ge34ce98'),
    'buildNumber': os.getenv('BUILDNUMBER', 'etl-main_etl_main_20210101.1')
}

# Encode the extended version as a fenced json block inside the long description
metadata_custom_as_markdown = rf"""```json
{json.dumps(metadata_custom)}
```"""

setup(name=_package_name_,
      version=os.getenv('VERSION', '0.1.2'),
      packages=find_packages(include=["etl"]),
      long_description=metadata_custom_as_markdown,
      long_description_content_type="text/markdown",
      python_requires=">=3.8")

How can you access the information stored in the description metadata field? In the same way as in the simple version, using importlib.metadata, although in this case we use the distribution() method, which returns not only the version but all the metadata properties. Then, you can use the following code to obtain the version metadata after parsing the description accordingly:

@lru_cache
def get_version_metadata():
    from importlib.metadata import distribution
    import json, re
    metadata = distribution(_package_name_).metadata
    # The long description ends up either in the Description field or in the payload
    description = metadata["Description"] or metadata.get_payload()
    # Extract the json block embedded in the markdown description
    regex = r"^\s*```json(.*)^\s*```"
    matches = re.findall(regex, description, re.MULTILINE | re.DOTALL)[0]
    return json.loads(matches)
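
To see the parsing in isolation, here is a self-contained sketch with illustrative values (the real description is the one generated by the setup.py above):

import json, re

# Illustrative description, mimicking what setup.py stores in the long description
description = """```json
{"version": "0.1.3", "revision": "0.1.2-3-ge72f11d", "buildNumber": "cicd-main_etl_main_20201229.18"}
```"""

match = re.findall(r"^\s*```json(.*)^\s*```", description, re.MULTILINE | re.DOTALL)[0]
print(json.loads(match))   # {'version': '0.1.3', 'revision': ..., 'buildNumber': ...}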

Finally, you can extend the main function to display both versions:

import argparse
import etl   # the package exposing get_version / get_version_metadata

def display_version():
    print(f"version: {etl.get_version()}")
    print(f"extended version: {etl.get_version_metadata()}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--version', '-v', help="display version", action="store_true")
    args = parser.parse_args()
    if args.version:
        display_version()

Differences between main branch and other branches

In some cases, your requirements for the main branch (releases) and for feature/fix branches will be different: for example, you may need to generate a pre-release package from a feature branch to test it in some staging environment. In these cases, it is recommended not to create a new tag (because this would trigger a new release), but to generate a local version (using build, pre-release or other custom naming).
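
For instance, instead of bumping the patch and tagging, a PR pipeline could derive a pre-release version from the current tag; a possible sketch with python-semver (the rc naming is just a choice):

import semver

current = "0.1.27"   # most recent tag, as returned by git describe

# 0.1.27 -> 0.1.28-rc.1 : bump the patch and mark it as a pre-release, so it
# never takes precedence over the next real release cut from the main branch
print(semver.VersionInfo.parse(current).bump_patch().bump_prerelease("rc"))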

We might use two different pipelines to differentiate between the two workflows:

  • cicd-main: triggered by any commit to the main branch

trigger:
  batch: True
  branches:
    include:
    - main

  • cicd-pr: triggered by pull requests to the main branch. The setup is a bit different, as the trigger is set to none in the YAML file:

trigger: none

Although there is a pr schema for YAML, it seems to be valid only for GitHub and Bitbucket repositories. Instead, we will trigger the pipeline using a branch policy. Take into account that the configuration shown in the picture below 1) allows pushing commits to the main branch, 2) does not require the pipeline to succeed for the PR to be approved, and 3) does not run the pipeline when the PR is merged (if you change these options, both the main and the pr pipelines will be executed when merging the PR). It would be great if this behaviour could be managed in finer detail using YAML, but for the moment we need to use the UI, as shown below:

Azure Devops – Edit policy tab

Lastly, an updated version is still generated, but the tag is not pushed to the repository and the Python package is not uploaded using twine. Instead, the Python package is published as a pipeline artifact, so it can be used later:

- bash: |
    ls $(Pipeline.Workspace)/dropArtifacts/*.whl | xargs -I {} pip install {}
    pip list
  displayName: Install package from artifact store

- bash: |
    export LOGURU_LEVEL=WARNING
    python -m etl.console_main -v
  displayName: Test etl.console_main

One drawback of this approach is that all builds generated from the same (or even a different) feature branch will have the same version. If you need more control over this behaviour, I recommend using local versions, as sketched below.
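
A simple way to disambiguate builds coming from the same branch is to append build metadata (which does not affect precedence) to the pre-release version; a sketch, using only the Build.BuildNumber portion because dots and digits are valid semver build-metadata characters, unlike the underscores of the full buildnumber variable:

import semver

build_number = "20201229.18"   # e.g. the Build.BuildNumber predefined variable

# 0.1.28-rc.1+20201229.18 : every pipeline run gets a unique, traceable version
print(semver.format_version(0, 1, 28, prerelease="rc.1", build=build_number))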

Summary

There is no silver bullet. This might seem a convoluted solution, but if there is a single thing you should take with you, it is that you must set up a mechanism that ensures the reproducibility of your Machine Learning application. You can check out the following repository – https://github.com/franperezlopez/blog_databricks – which shows the techniques reviewed here, together with other aspects of deploying Databricks applications (leave a comment if you are interested in reviewing those aspects in a future post). Also, feel free to share any insights into this process in the comments.

