
Snakes on a Plane

Using the Python package distribution system to ship your Snakemake pipelines

Image by pch.vector / Freepik

Snakemake is one of the most popular workflow management packages for data science, but it lacks functionality to distribute the workflows (pipelines) you create. In this post, I will demonstrate how to add such functionality by using the Python package distribution system. Using this distribution system has the following benefits:

  • Distribute workflows as an installable package to other data scientists who have no access to your infrastructure
  • Add version control to your pipelines
  • Reduce Snakemake boilerplate (e.g. the required argument --cores)

When following these guidelines, installing and running your Snakemake workflow comes down to installing the package and executing a single command that takes the location of a config file.

Step 0: Prerequisites

To build a package, a (discoverable) Python 3 installation is required. The easiest way to achieve this is by creating a Conda environment with Python 3 included. I used Python 3.9 on a Unix system for this post. To follow along, some experience using Conda and Snakemake is required.

Step 1: Creating the packaging material

Image via burst.shopify.com

For this example, we will create a Python package (called snakepack) that uses a simple internal Snakemake pipeline that copies all .txt files from an input directory to an output directory.

To mimic a real-world situation, we will create two separate folders: one containing the package files, and one containing the input and configuration files that a user would run the package on.

Package folder structure

This structure will contain all files to create the actual package. The root folder contains the files that are required to build the package. The snakepack subfolder contains the actual modules. Inside the snakepack subfolder is the snakemake folder which contains the pipeline file.

Create this structure with empty files for now; we will fill them in later.
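The package layout described above can be created in several ways. As a minimal sketch in Python, assuming a root folder of your own choosing (its name is not fixed by this post) and assuming an `__init__.py` inside snakepack so that find_packages() can discover the module, the hypothetical helper below creates the folders and empty files:

```python
from pathlib import Path


def create_package_skeleton(root):
    """Create the empty package layout under `root` and return the created files."""
    root = Path(root)
    module_dir = root / "snakepack"          # the actual Python module
    pipeline_dir = module_dir / "snakemake"  # holds the pipeline file
    pipeline_dir.mkdir(parents=True, exist_ok=True)

    files = [
        root / "setup.py",
        root / "requirements.txt",
        module_dir / "__init__.py",  # assumption: marks snakepack as a package
        module_dir / "copyfiles.py",
        pipeline_dir / "Snakefile",
    ]
    for path in files:
        path.touch(exist_ok=True)  # files stay empty for now
    return files
```

Calling `create_package_skeleton("snakepack_package")` would create the skeleton in a folder with that (hypothetical) name inside the current directory.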

User files

These folders contain the files that will be used to demonstrate that the pipeline works. They can be located anywhere, but in this example they reside in a snakepack_files folder in the home (~) folder.

Create this structure before continuing.
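A possible way to set up the user files is sketched below. The folder name snakepack_files and the file config.yaml come from this post; the subfolder names `input` and `output`, the sample file names, and the helper itself are assumptions made for illustration.

```python
from pathlib import Path


def create_user_files(base, n_files=3):
    """Create a demo input/output layout plus an empty config file under `base`."""
    project = Path(base) / "snakepack_files"
    input_dir = project / "input"    # hypothetical folder name
    output_dir = project / "output"  # hypothetical folder name
    input_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(exist_ok=True)

    (project / "config.yaml").touch()  # filled in later
    for i in range(1, n_files + 1):
        # A few small .txt files for the pipeline to copy
        (input_dir / f"file{i}.txt").write_text(f"sample {i}\n", encoding="utf-8")
    return project
```

To place the files in the home folder as in this example, call `create_user_files(Path.home())`.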

Step 2: Forging the pipe to pack

Image via burst.shopify.com

Let’s start by creating the Snakemake pipeline in the snakemake folder and see if it works without the package wrapper. As we want the user of our package to be able to configure the pipeline, we will use Snakemake’s config file to define the input and output directories and pass them to Snakemake.
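To make the intended behaviour concrete, the job of the pipeline can be sketched in plain Python. This is only an illustration of what the workflow should achieve, not Snakemake code: the function names and the config keys `input_dir` and `output_dir` are assumptions.

```python
import shutil
from pathlib import Path


def plan_copies(input_dir, output_dir):
    """Pair every .txt file in input_dir with its target path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return [(src, output_dir / src.name) for src in sorted(input_dir.glob("*.txt"))]


def copy_txt_files(config):
    """Copy all .txt files; `config` mimics the parsed config file as a dict."""
    pairs = plan_copies(config["input_dir"], config["output_dir"])
    Path(config["output_dir"]).mkdir(parents=True, exist_ok=True)
    for src, dst in pairs:
        shutil.copyfile(src, dst)
    return [dst for _, dst in pairs]
```

In a Snakemake pipeline, the pairing of input and output files is what the rules and their wildcards express, while the two directories come from the config file.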

Snakefile

In our snakepack_files folder (see above) we can fill in the config file:

config.yaml

Testing

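This pipeline can now be run from the command line, from within the snakemake folder, by calling snakemake with the --cores option and with --configfile pointing to the config file in snakepack_files.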

Step 3: Packing up

Image via burst.shopify.com

We will create a package wrapper around the existing snakemake pipeline which allows us to create one .whl file that contains the package and can be distributed. Use the above folder/file structure to fill in these files:

copyfiles.py

As we want to be able to invoke the pipeline from the command line after we install the package, this file acts as a wrapper that takes the command line arguments and uses them to start the snakemake pipeline. An advantage is that we can control which parameters the user sets and which parameters we fix or determine automatically.

In this case, we can distinguish the following parameters:

  • User defined: the location of the config file
  • Fixed: the location of the Snakefile inside the package
  • Automatically determined: the number of available cpus
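The three kinds of parameters listed above can be sketched as follows. This is a minimal sketch, not necessarily how the wrapper in this post is written: it assumes the pipeline is started through the snakemake command-line tool via subprocess, and the function names are hypothetical. The options --snakefile, --configfile and --cores are existing Snakemake command-line options.

```python
import argparse
import os
import subprocess
from pathlib import Path


def packaged_snakefile():
    """Fixed: the Snakefile that ships inside the package, next to this module."""
    return Path(__file__).resolve().parent / "snakemake" / "Snakefile"


def build_command(config_file, snakefile=None, cores=None):
    """Assemble the snakemake call; only the config file comes from the user."""
    snakefile = Path(snakefile) if snakefile else packaged_snakefile()
    cores = cores or os.cpu_count() or 1  # automatically determined
    return [
        "snakemake",
        "--snakefile", str(snakefile),
        "--configfile", str(config_file),
        "--cores", str(cores),
    ]


def main(argv=None):
    """Parse the command line and run the packaged pipeline."""
    parser = argparse.ArgumentParser(description="Copy .txt files with the packaged pipeline.")
    parser.add_argument("configfile", help="location of the user's config file")
    args = parser.parse_args(argv)
    return subprocess.run(build_command(args.configfile), check=False).returncode
```

To make the module runnable, `main()` still has to be called, for instance from an `if __name__ == "__main__":` block at the end of copyfiles.py. Keeping the command assembly in its own function makes it easy to check which options are passed without actually starting Snakemake.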

requirements.txt

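This file lists the requirements of the Python package. In this case, snakemake is a requirement. We also list mamba, which Snakemake can use as a faster frontend for creating Conda environments. Note that the mamba package manager is normally installed through Conda, and the project named mamba on PyPI is an unrelated testing tool, so verify what this entry resolves to in your setup.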

The requirements will be automatically downloaded and installed when you install the snakepack package later on.

setup.py

The setup.py file tells the Python packaging tool (setuptools) how to build the package.

The setup() call tells the package builder that this is version 0.0.1 of the package snakepack. The find_packages() call ensures that the Python module will be included in the wheel. install_requires defines the requirements, which we read in from requirements.txt. Lastly, we set include_package_data to True and point to the snakemake directory so that it is included in the package.
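The arguments described above could be assembled as in the sketch below. The package name, version, find_packages(), install_requires and include_package_data follow the description; the helper names and the package_data pattern used to point at the snakemake directory are assumptions.

```python
from pathlib import Path


def read_requirements(path):
    """Return the non-empty, non-comment lines of a requirements file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


def setup_kwargs(root):
    """Build the keyword arguments that would be passed to setuptools.setup()."""
    # Imported here so read_requirements() can be used without setuptools
    from setuptools import find_packages

    return {
        "name": "snakepack",
        "version": "0.0.1",
        "packages": find_packages(),
        "install_requires": read_requirements(Path(root) / "requirements.txt"),
        "include_package_data": True,
        # Assumed pattern: ship the files in snakepack/snakemake/ (e.g. the Snakefile)
        "package_data": {"snakepack": ["snakemake/*"]},
    }
```

With these helpers, setup.py itself would import setup from setuptools and end with `setup(**setup_kwargs(Path(__file__).parent))`.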

Testing

Once you have created the above files, you can test whether your package installs in development mode and whether the module runs.

This should run the pipeline and copy the files. Delete the copied files afterwards.

Step 4: Take-off!

Image via burst.shopify.com

If step 3 works, you are ready to build the actual package, for example by running python setup.py bdist_wheel from the package root.

This will create a wheel (.whl) file in a new subfolder called dist/. This wheel can be shared with others.

Testing

Let’s check that the wheel installs and that the pipeline runs.

Conclusion

This post uses a very simple Snakemake pipeline to show that it can be made distributable by wrapping it in a custom Python package. If you are a frequent Snakemake user and would like to easily share and version your pipelines, you can extend this framework with more complex pipelines and tailor it to your needs.
