Data Science in the Real World

Solving dependency management in Python with Poetry

Bringing simplicity and consistency to our Python workflow

Christopher Sidebottom
Towards Data Science
5 min read · Feb 16, 2020


A train pulling along all of its dependencies in Emeryville, CA (Image by author)

My team has recently opted to add another piece of tooling to our tech stack, Poetry. As with every technical decision, this one was made carefully so as to avoid unnecessary complexity. So why did we decide to use Poetry?

Coming from a world of building websites, I was used to npm; in the JavaScript world, dependency management is treated as a solved problem. In our Python projects we were just using pip, and there were two fundamental workflows we could follow:

  • Curating the dependencies you need at the top level by hand
  • Collecting all of your dependencies together using pip freeze

Defining what you need

In this approach, you create a requirements file which looks like:

pandas==1.0.1

This has the advantage of being simpler to understand, because you’re only directly using pandas. You then use pip to install these dependencies and their dependencies:

pip install -r requirements.txt

Which means you can create a separate file for any development dependencies:

pip install -r dev_requirements.txt
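For example, a dev_requirements.txt holding only your testing and linting tools might look like this (the pins here are illustrative, not from an actual project):

pytest==5.3.5
flake8==3.7.9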

This allows you to split out your production requirements from your development requirements. Installing only your production requirements helps reduce the size of the production build. Reducing the number of packages in production also improves security by reducing the attack surface.

Unfortunately, you have to maintain this file by hand, and you have little control over your transitive dependencies. Each time you run pip install, your dependencies’ dependencies are at risk of being installed at a different version. For example, pinning pandas==1.0.1 says nothing about which version of numpy gets pulled in, so two installs a month apart can produce different environments.

Freezing all dependencies

A different approach is to use more of pip’s functionality to freeze your dependencies. Running pip install and then pip freeze will give you a list of all packages installed. For instance, running pip install pandas will result in a pip freeze output of:

numpy==1.18.1
pandas==1.0.1
python-dateutil==2.8.1
pytz==2019.3
six==1.14.0
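In practice, you’d redirect this output straight into your requirements file:

pip freeze > requirements.txt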

This is great as it lets you keep your dependencies locked to specific versions, creating a repeatable environment. Unfortunately, if you install anything in the environment pip is running in, it’ll be added to this list. As an example, you decide to install numba to see if it speeds up your processing. This results in the following frozen dependencies:

llvmlite==0.31.0
numba==0.48.0
numpy==1.18.1
pandas==1.0.1
python-dateutil==2.8.1
pytz==2019.3
six==1.14.0

But unfortunately, numba didn’t deliver the performance improvement you wanted. When you then run pip uninstall numba, it removes numba itself, but not necessarily the dependencies numba brought with it:

llvmlite==0.31.0
numpy==1.18.1
pandas==1.0.1
python-dateutil==2.8.1
pytz==2019.3
six==1.14.0

Notice how you’re now left with that extra llvmlite package, which you may never get rid of?

It’s worth noting that there are tools to work around this and remove orphaned dependencies; one example is sketched below. But there’s one more caveat to consider: if you install development dependencies, they will end up in the frozen list too.
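One such tool is pip-tools: you declare your top-level packages in a requirements.in file, compile that to a fully pinned requirements.txt, and then sync your environment to match the file exactly, uninstalling anything that shouldn’t be there. This wasn’t part of our workflow, so take it as an aside:

pip install pip-tools
pip-compile requirements.in    # resolves and pins every transitive dependency
pip-sync requirements.txt      # installs/uninstalls until the environment matches the file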

The best of both worlds

Poetry is capable of doing both of the above with a simple interface. When you call poetry add, it adds the package to a pyproject.toml file that keeps track of your top-level dependencies (including Python itself):

[tool.poetry.dependencies]
python = "^3.7"
pandas = "^1.0.1"
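That snippet is what you’d get after running, for example:

poetry add pandas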

This is paired with a poetry.lock file which includes all of the installed packages, locked to a specific version. Embedding the lock-file in this article would take up a lot of space, because it includes hashes by default to ensure package integrity. Here’s a snippet instead:

[[package]]
category = "main"
description = "Python 2 and 3 compatibility utilities"
name = "six"
optional = false
python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*"
version = "1.14.0"

[metadata]
content-hash = "5889192b2c2bef6b6ceae7457fc90225ba0c38a80ecd15bbbbf5871f91a08825"
python-versions = "^3.7"

[metadata.files]
six = [
    {file = "six-1.14.0-py2.py3-none-any.whl", hash = "sha256:8f3cd2e254d8f793e7f3d6d9df77b92252b52637291d0f0da013c76ea2724b6c"},
    {file = "six-1.14.0.tar.gz", hash = "sha256:236bdbdce46e6e6a3d61a337c0f8b763ca1e8717c03b369e87a7ec7ce1319c0a"},
]

To repeat our experiment with numba, we use poetry add numba and then poetry remove numba. This time, removing numba also removes llvmlite from both the environment and the lock file. By cleaning up transitive dependencies, Poetry makes it easy to try out packages while keeping the virtual environment tidy.
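The whole round trip looks like this (the behaviour in the comments is what we observed above):

poetry add numba       # installs numba plus its dependency llvmlite, pinning both in poetry.lock
poetry remove numba    # uninstalls numba and llvmlite, and drops them from poetry.lock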

Finally, Poetry splits out production dependencies from development dependencies. If we want to write some tests, we can use poetry add -D pytest, which results in:

[tool.poetry.dependencies]
python = "^3.7"
pandas = "^1.0.1"

[tool.poetry.dev-dependencies]
pytest = "^5.3.5"

When creating a production bundle you can then use poetry install --no-dev to ignore anything used for development.
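In a production image or CI step, that’s just:

poetry install --no-dev    # installs only [tool.poetry.dependencies], skipping the dev-dependencies table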

Working with Virtual Environments

Last year, one of my colleagues wrote a great post about why we use virtual environments in Data Science. Poetry has something to add here as well: automatic management of virtual environments for your projects. So instead of having to run:

python -m venv .venv
. .venv/bin/activate
super-command

You instead just type:

poetry run super-command

Poetry creates a virtual environment for your project which matches your specified Python version. This is kept outside your project root by default, so as not to create clutter. My team opted to put the virtual environment back in the project directory for other reasons, though, which was as easy as:

poetry config --local virtualenvs.in-project true

This creates the virtual environment in the fairly standard .venv folder. The slight annoyance is that it creates another file, poetry.toml, rather than storing the setting in the main pyproject.toml.
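For what it’s worth, that generated poetry.toml contains just the one setting (as of the Poetry version we were using):

[virtualenvs]
in-project = true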

Surely this has been done before?

Other tools existed before Poetry that could meet these requirements. Poetry, however, also leverages several of Python’s own enhancement proposals, such as PEP-508, PEP-517 and PEP-518. These PEPs work towards using pyproject.toml as the primary place to configure a project. When considering the longevity of Poetry, following the PEPs gives us confidence in the future support for the features we’re using.
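You can see those PEPs at work in the [build-system] table Poetry writes into pyproject.toml, which is how PEP-517/518 tell pip how to build the project. The exact contents vary by Poetry version, but at the time it looked like:

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"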

Hopefully this explains why I was confident in my team choosing to use Poetry instead of the alternatives that came before it. If you’re looking for a more detailed introduction to the history and more specific poetry commands I’d suggest Package Python Projects the Proper Way by Todd Birchard. And for other tools to combine with Poetry, take a look at How to Setup an Awesome Python Environment for Data Science or Anything Else by Simon Hawe.

It’s been a really simple transition, with Engineers and Data Scientists collaborating on translating existing projects. Starting new projects has also become much simpler, letting us experiment with different packages and still get a deterministic build when we go to production.
