
Elegant CICD with Databricks notebooks

How to release Databricks Notebook artifacts with Azure DevOps

With Luuk van der Velden

Notebooks are the primary runtime on Databricks, from data science exploration to ETL and ML in production. This emphasis on notebooks calls for a change in our understanding of production-quality code. We have to do away with our hesitancy about messy notebooks and ask ourselves: How do we move notebooks into our production pipelines? How do we perform unit and integration tests on notebooks? Can we treat notebooks as artifacts of a DevOps pipeline?

Databricks Notebooks as first-class citizens

When choosing Databricks as the compute platform, your best option is to also run notebooks in your production environment. This decision is dictated by the overwhelming support for the notebook runtime compared with classic Python scripting. We argue that one should fully embrace the notebook approach and choose the best methods to test and deploy notebooks in a production environment. In this blog, we use Azure DevOps pipelines for notebook (unit, integration) testing on transient Databricks clusters and for notebook artifact registration.

Notebooks: entry point of a Python package

Notebooks can live in isolation, but we prefer them as part of a Git repository with the following structure. It contains a notebooks directory for Databricks notebooks checked in as source files, a Python package (‘my_model’) with functionality to be imported in a notebook, a tests directory with unit tests for the Python package, an Azure DevOps pipeline and a cluster-config.json to configure our transient Databricks clusters. Additionally, we use Poetry for Python dependency management and packaging based on the pyproject.toml specification.

notebooks/
    run_model.py       # Databricks notebook checked in as .py file
my_model/
    preprocessing.py   # Python module imported in notebook
tests/
azure-pipelines.yml
cluster-config.json
pyproject.toml
...

Notebooks can be committed into a Git repository either by linking a Git repository to the notebook in the Databricks workspace or by manually exporting the notebook as a source file. In both cases, each notebook is available in the repository as a Python file with Databricks markup commands. The notebook entry point of our repository is shown below. Notice that it installs and imports the Python package ‘my_model’ built from the containing repository; the package versioning will be worked out in detail later. Any notebook logic is captured in the main function. After the main function has executed, dbutils.notebook.exit() is called, which signals successful completion and allows a result value to be returned to the caller.

# Databricks notebook source
dbutils.widgets.text("package_version", defaultValue='')
package_version = dbutils.widgets.get("package_version")

# COMMAND ----------

devops_pat = dbutils.secrets.get("devops_scope", "devops-artifact-read")
%pip install my_model==$package_version --index-url=https://build:$devops_pat@pkgs.dev.azure.com/organization/project/_packaging/feed/pypi/simple/

# COMMAND ----------

from my_model.preprocessing import do_nothing

# COMMAND ----------

# define the main model function
def main(spark):
  do_nothing(spark)

# COMMAND ----------

# run the model
from loguru import logger

with logger.catch(reraise=True):
  main(spark)

# COMMAND ----------

dbutils.notebook.exit("OK")

Notebook pull request pipeline

When developing notebooks and their supporting Python package, a developer commits on a development branch and creates a Pull Request for colleagues to review. Figure 1 shows the steps of the pipeline that supports our Pull Requests. The Pull Request automatically triggers an Azure DevOps pipeline that has to succeed on the most recent commit. First, we run the unit tests of the Python package and, on success, build it and publish the dev package to Azure Artifacts. The version string of this dev package is passed to the notebook input widget "package_version" for notebook integration testing on our staging environment. The pipeline validates whether the notebook runs successfully (i.e. whether dbutils.notebook.exit is called) and provides feedback on the Pull Request.

Fig 1: PR notebook pipeline for continuous integration, Image by Author
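The integration-test part of this pipeline is covered step by step below. For the first part, a minimal sketch of the unit-test, build and publish steps is shown here, assuming Poetry for testing and building and the TwineAuthenticate task for pushing to Azure Artifacts; the job name, feed name and dev-version scheme are illustrative, not our exact pipeline.

# azure-pipelines.yml sketch -- job name, feed name and version scheme are placeholders
- job: unit_test_and_publish
  displayName: Unit test and publish dev package
  pool:
    vmImage: "ubuntu-latest"
  steps:
  - script: |
      pip install poetry
      poetry install
      poetry run pytest tests/
    displayName: "Run unit tests"
  - script: |
      poetry version "0.0.0.dev$(Build.BuildId)"   # dev build version, later passed to the notebook widget
      poetry build
    displayName: "Build dev package"
  - task: TwineAuthenticate@1
    inputs:
      artifactFeed: "project/feed"                 # placeholder Azure Artifacts feed
  - script: |
      pip install twine
      twine upload -r feed --config-file $(PYPIRC_PATH) dist/*
    displayName: "Publish dev package to Azure Artifacts"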

Integration test on a transient cluster

The goal is to execute this notebook on Databricks from an Azure DevOps pipeline. For flexibility, we choose Databricks Pools. The advantage of these pools is that they can reduce the startup and auto-scale times of clusters when many different jobs need to run on just-in-time clusters. For the execution of the notebook (and access to optional data sources) we use an Azure App Registration. This Azure App Registration will have permissions to manage Databricks clusters and execute notebooks. The basic steps of the pipeline include Databricks cluster configuration and creation, execution of the notebook and finally deletion of the cluster. We will discuss each step in detail (Figure 2).

Fig 2: Integration test pipeline steps for Databricks Notebooks, Image by Author

In order to use Azure DevOps pipelines to test and deploy Databricks notebooks, we use the Azure DevOps tasks developed by Data Thirst Ltd to create clusters and the tasks from Microsoft DevLabs to execute notebooks. As these tasks do not yet support all the operations we need, we also use the PowerShell tools that Data Thirst developed for Databricks. Both the tasks and the PowerShell tools are wrappers around the Databricks API.

Databricks permissions for the App Registration

As preparation, we create a Databricks pool that is available for integration tests. We use an Azure App Registration that acts as the principal executing notebooks on the instance pool. The App Registration is registered as a Databricks Service Principal with the "Can Attach To" permission on the pool, which allows it to create clusters from the pool.

Databricks instance pool configuration, Image by Author
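This registration can also be scripted. Below is a sketch using the Databricks SCIM and Permissions REST APIs; the workspace URL, admin token and environment variable names are placeholders, and the payloads should be verified against the current API documentation.

# one-off setup sketch -- register the service principal and grant pool access
- bash: |
    # DATABRICKS_HOST, DATABRICKS_ADMIN_TOKEN, APPREG_CLIENT_ID and POOL_ID are assumed to be set
    # register the App Registration as a Databricks service principal
    curl -X POST "$DATABRICKS_HOST/api/2.0/preview/scim/v2/ServicePrincipals" \
      -H "Authorization: Bearer $DATABRICKS_ADMIN_TOKEN" \
      -H "Content-Type: application/scim+json" \
      -d '{
            "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
            "applicationId": "'"$APPREG_CLIENT_ID"'",
            "displayName": "ci-integration-tests"
          }'
    # grant the service principal "Can Attach To" on the instance pool
    curl -X PATCH "$DATABRICKS_HOST/api/2.0/permissions/instance-pools/$POOL_ID" \
      -H "Authorization: Bearer $DATABRICKS_ADMIN_TOKEN" \
      -d '{
            "access_control_list": [
              { "service_principal_name": "'"$APPREG_CLIENT_ID"'",
                "permission_level": "CAN_ATTACH_TO" }
            ]
          }'
  displayName: "Register service principal and grant pool access (one-off)"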

Preparing pipeline secrets

The first step of the CI/CD pipeline is to fetch all required secrets. For simplicity, we store the App Registration client ID, client secret, tenant ID and the Databricks pool ID in a Key Vault. The secrets are collected using the AzureKeyVault task.

# azure-pipelines.yml excerpt
jobs:
- job: integration_test
  displayName: Test on databricks
  pool:
    vmImage: "windows-latest"
  steps:
  - task: AzureKeyVault@1
    inputs:
      azureSubscription: "Azure DevOps Service Connection"
      keyVaultName: "keyvault-test-environment"
      secretsFilter: "appreg-client-id,appreg-client-secret,tenant-id,databricks-pool-id"

Databricks workspace connection

To interact with Databricks, we need to connect to the workspace from Azure DevOps. We use two Azure DevOps tasks from Data Thirst: one to generate an access token for Databricks and one to connect to the workspace. The token is generated for the App Registration we granted permissions in Databricks and is stored in the BearerToken variable. The workspace URL can be found in the Azure Portal on the Databricks resource.

# azure-pipelines.yml excerpt
- task: databricksDeployCreateBearer@0
  inputs:
    applicationId: $(appreg-client-id)
    spSecret: $(appreg-client-secret)
    resourceGroup: "DatabricksResourceGroup"
    workspace: "DatabricksWorkspace"
    subscriptionId: "AzureSubscriptionId"
    tenantId: $(tenant-id)
    region: "westeurope"
- task: configuredatabricks@0
  inputs:
    url: "https://adb-000000000000.0.azuredatabricks.net"
    token: $(BearerToken)

Please note that there is a potential security issue with the databricksDeployCreateBearer task, which we have resolved in our live pipelines. The current version of the task creates bearer tokens without an expiration date and, unfortunately, there is no way to set one through the task. As an alternative, the PowerShell Databricks tools from Data Thirst can be used: by consecutively calling Connect-Databricks and New-DatabricksBearerToken it is possible to create a token with a limited lifetime, as sketched below.
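A sketch of what that alternative could look like as a pipeline step follows; the Connect-Databricks parameter names and the shape of the returned token object are assumptions on our part, so check the azure.databricks.Cicd.tools documentation before relying on it.

# azure-pipelines.yml sketch -- create a short-lived token instead of a non-expiring one
- task: PowerShell@2
  inputs:
    targetType: "inline"
    script: |
      Install-Module -Name azure.databricks.Cicd.tools -Force -Scope CurrentUser
      # authenticate as the App Registration (parameter names assumed, see module docs)
      Connect-Databricks -Region "westeurope" `
        -ApplicationId "$(appreg-client-id)" `
        -Secret "$(appreg-client-secret)" `
        -TenantId "$(tenant-id)" `
        -ResourceGroupName "DatabricksResourceGroup" `
        -WorkspaceName "DatabricksWorkspace" `
        -SubscriptionId "AzureSubscriptionId"
      # request a token that expires after one hour
      $token = New-DatabricksBearerToken -LifetimeSeconds 3600 -Comment "CI integration test"
      # expose it to later tasks; token_value is assumed from the underlying token API response
      Write-Host "##vso[task.setvariable variable=BearerToken;issecret=true]$($token.token_value)"
  displayName: "Create short-lived Databricks bearer token"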

Creating a transient test cluster

After setting up the connection to Databricks, we create a dedicated cluster in the Databricks workspace for the integration tests executed by this pipeline. The cluster configuration consists of just one worker, which is sufficient for the integration test. As the test data needed by the notebooks lives on an ADLS Gen2 storage account, we configure OAuth access to ADLS in the Spark config so the App Registration can authenticate with the storage account. As a best practice, we do not insert the App Registration client secret directly into the cluster config, as it would be visible in Databricks. Instead, we reference a Databricks secret scope with the {{secrets/...}} template markup in the cluster config, which is resolved at runtime.

// cluster-config.json
{
    "num_workers": 1,
    "cluster_name": "",
    "spark_version": "",
    "spark_conf": {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "",
        "fs.azure.account.oauth2.client.secret": "{{secrets/zpz_scope/appreg-client-secret}}",
        "fs.azure.account.oauth2.client.endpoint": "",
        "spark.hadoop.fs.permissions.umask-mode": "002"
    },
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {
        "PYSPARK_PYTHON": "/databricks/python3/bin/python3",
        "NSK_ENV": ""
    },
    "autotermination_minutes": 10,
    "cluster_source": "API",
    "init_scripts": [],
    "instance_pool_id": ""
}
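The zpz_scope secret scope and its appreg-client-secret key referenced above have to exist before the cluster is created. A sketch of a one-off setup step using the (legacy) Databricks CLI, reusing the workspace URL and bearer token from earlier, could look like this:

# one-off setup sketch -- create the secret scope referenced in the cluster config
- bash: |
    pip install databricks-cli
    databricks secrets create-scope --scope zpz_scope
    databricks secrets put --scope zpz_scope --key appreg-client-secret \
      --string-value "$(appreg-client-secret)"
  displayName: "Create Databricks secret scope (one-off)"
  env:
    DATABRICKS_HOST: "https://adb-000000000000.0.azuredatabricks.net"
    DATABRICKS_TOKEN: $(BearerToken)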

We commit the above template of the cluster configuration file and use the jq command to fill in details such as the pool ID and the App Registration client ID from Azure Key Vault at runtime. The cluster name is based on the current DevOps build ID; together with the other parameters, cluster-config.json is rendered and written to disk.

# azure-pipelines.yml excerpt
- bash: |
    jq -c --arg name "$CLUSTER_NAME" \
          --arg runtime "$RUNTIME" \
          --arg pool_id "$INSTANCE_POOL_ID" \
          --arg nsk_env "$NSK_ENV" \
          '.cluster_name = $name
           | .spark_conf."fs.azure.account.oauth2.client.id" = "$(appreg-client-id)"
           | .spark_conf."fs.azure.account.oauth2.client.endpoint" = "https://login.microsoftonline.com/$(tenant-id)/oauth2/token"
           | .spark_version = $runtime
           | .instance_pool_id = $pool_id
           | .spark_env_vars.NSK_ENV = $nsk_env' \
          cluster-config.json > tmp.$$.json
    mv tmp.$$.json cluster-config.json
    echo "Generated cluster-config.json:"
    cat cluster-config.json
  displayName: "Generate cluster-config.json"
  env:
    CLUSTER_NAME: "integration-build-$(Build.BuildId)"
    RUNTIME: "7.5.x-scala2.12"
    INSTANCE_POOL_ID: $(databricks-pool-id)
    NSK_ENV: "test"

The databricksClusterTask from Data Thirst uses the rendered cluster-config.json to create and deploy a cluster on our staging environment with resources taken from the Databricks pool.

- task: databricksClusterTask@0
  name: createCluster
  inputs:
    authMethod: "bearer"
    bearerToken: $(BearerToken)
    region: "westeurope"
    sourcePath: "cluster-config.json"

Execute notebook

Now we can upload the notebook under test and execute it. The databricksDeployScripts task uploads the notebook to Databricks, which is then executed using the executenotebook task from Microsoft DevLabs. The notebook is stored in a path containing the DevOps build ID, so it can be identified (and deleted) later if needed. If the notebook uses widgets, the executionParams input is used to pass a JSON string with input parameters; in our case, the Python package dev version string is passed as "package_version" for controlled integration testing. Finally, we wait for the notebook run to finish. The executenotebook task finishes successfully if the Databricks built-in dbutils.notebook.exit("returnValue") is called during the notebook run.

# azure-pipelines.yml excerpt
- task: databricksDeployScripts@0
  inputs:
    authMethod: "bearer"
    bearerToken: $(BearerToken)
    region: "westeurope"
    localPath: "notebooks/"
    databricksPath: "/test/package_name/$(Build.BuildId)"
    clean: false       
- task: executenotebook@0
  inputs:
    notebookPath: "/test/package_name/$(Build.BuildId)/$(notebook_name)"
    executionParams: '{"package_version":"0.0.0-dev"}'
    existingClusterId: $(createCluster.DatabricksClusterId)
- task: waitexecution@0
  name: waitForNotebook

Delete cluster

As a last step, we delete the cluster. Unfortunately, there is no Azure DevOps task from Data Thirst to delete clusters, so we install their PowerShell Databricks tools and use the Remove-DatabricksCluster command to delete the cluster.

# azure-pipelines.yml excerpt
- task: PowerShell@2
  condition: always()
  inputs:
    targetType: "inline"
    script: |
      Install-Module -Name azure.databricks.Cicd.tools -force -Scope CurrentUser
- task: PowerShell@2
  condition: always()
  inputs:
    targetType: "inline"
    script: |
      Remove-DatabricksCluster -BearerToken $(BearerToken) -Region 'westeurope' -ClusterId $(createCluster.DatabricksClusterId)
  displayName: "Delete Databricks integration cluster"

Notebook artifact release

Notebooks that have been tested successfully are ready to be merged into the main branch. After merging, we want to bring the notebooks to production. We use Azure DevOps Artifacts to register the project notebook directory as a universal package in an Azure DevOps artifact feed. We tag the main branch with a release version, which triggers a pipeline run that includes artifact registration. Before registering the notebook artifact, the release version is set with sed as the default "package_version" of the notebook input widget (the example below shows release 1.0.0). Note that the accompanying Python package is also registered as an artifact with the same name and version, but in a different DevOps artifact feed. This ensures that, by default, the notebook runs with the Python package version it was tested against. Our notebook artifacts are thus reproducible and allow for a controlled release process.

# Databricks notebook source
dbutils.widgets.text("package_version", defaultValue='1.0.0')
package_version = dbutils.widgets.get("package_version")

How you generate release versions is up to you. Initially, you can add a Git tag to the main branch to trigger a build that includes artifact registration, as shown below. For full CI/CD you can generate a version on the fly when merging a pull request (a sketch of one option follows the excerpt below).

# azure-pipelines.yml excerpt
variables:
  packageName: 'my_model'
  ${{ if startsWith(variables['Build.SourceBranch'], 'refs/tags/') }}:
    packageVersion: $(Build.SourceBranchName)
  ${{ if not(startsWith(variables['Build.SourceBranch'], 'refs/tags/')) }}:
    packageVersion: 0.0.0-dev.$(Build.BuildId)
jobs:
- job: publish_notebook_artifact
  pool:
    vmImage: "ubuntu-latest"
  dependsOn: [integration_test]
  condition: and(succeeded(), or(eq(variables['Build.Reason'], 'Manual'),  startsWith(variables['Build.SourceBranch'], 'refs/tags/')))
  steps:
  - bash: |
      set -e
      if [ -z "$PACKAGEVERSION" ]
      then
        echo "Require PACKAGEVERSION parameter"
        exit 1
      fi
      sed -i "s/defaultValue=.*/defaultValue='$PACKAGEVERSION')/" 
        notebooks/run_model.py
    displayName: Update default value for version
  - task: UniversalPackages@0
    displayName: Publish notebook artifact $(packageVersion)
    inputs:
      command: publish
      publishDirectory: "notebooks/"
      vstsFeedPublish: "DNAKAA/databricks-notebooks"
      vstsFeedPackagePublish: "my_model"
      packagePublishDescription: "notebooks of my_model"
      versionOption: custom
      versionPublish: "$(packageVersion)"
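For the on-the-fly option mentioned above, one possibility (a sketch, not our setup) is to let an Azure DevOps counter expression mint the release version on the main branch:

# azure-pipelines.yml sketch -- auto-increment the release version instead of tagging
variables:
  releaseCounter: $[counter('my_model-release', 0)]   # increments once per run of this pipeline
  ${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main') }}:
    packageVersion: 1.0.$(releaseCounter)
  ${{ if ne(variables['Build.SourceBranch'], 'refs/heads/main') }}:
    packageVersion: 0.0.0-dev.$(Build.BuildId)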

Conclusion

We have shown how to run notebook integration tests on transient Databricks clusters, accompanied by unit tests for the supporting Python package. This results in reproducible notebook artifacts and a controlled release process for notebooks. Databricks notebooks are first-class citizens, and engineers should emancipate them into their test and release processes. We look forward to learning more about merging the realities of data scientists with those of data engineers, with the goal of increasing productivity and releasing regularly. Our aim is to ease the move from exploration and proof of concept to production. In our next blog we will go in depth on how to use notebook artifacts in production pipelines, with an emphasis on Azure Data Factory pipelines.


Originally published at https://codebeez.nl.

