Fast Docker Builds With Caching (Not Only) For Python

Does installing application dependencies during your Docker builds take too long? Does your CI/CD setup limit the effectiveness of Docker caching? Are you installing from a private repository? … Have you heard about the new BuildKit caching features?

Jan Michelfeit
Towards Data Science



One of the most important things for my productivity as a developer is the speed of the make changes — build — test loop. It currently involves building several Docker images with Python inside. As it turns out, making Docker builds with Python fast is not that simple and there are not many great articles available. I learned more than I wanted about Docker caching on my journey towards efficient builds, and I would like to share the findings with you here.

I’ll start by explaining the different caching options Docker provides with the new BuildKit backend (Docker 18.09+), and show step-by-step how to combine them so that you don’t spend a second more than you need to waiting for your build pipelines. The impatient can jump straight to the complete solution at the end of the article.

While the code examples show a Python application installed with Poetry, most of the techniques are also applicable to other languages, or to any situation where the build phase of your application in Docker takes a long time and caching can help.

The Challenge

There are a couple of things that can make fast builds challenging:

  • The application has many dependencies that take a long time to download and install.
  • CI/CD jobs run on machines we have limited control over. In particular, we cannot rely on what Docker cache is present.
  • Some dependencies are installed from a private repository which requires secret credentials for authentication.
  • The resulting image is so large that pushes and pulls from a registry take non-negligible time.

What we want is for builds, image pulls, and image pushes to be as fast as possible, even with all these constraints. But before we get to the solution, let’s have a look at what tools we have at our disposal.

The Tools


Docker layer caching

Everybody knows about Docker layers and caching — unless the inputs of an image layer change, Docker can reuse locally cached layers. Just order Dockerfile commands carefully to avoid invalidating the cache.

External cache sources

What if we don’t have a local cache available, e.g., on a CI/CD agent? A lesser-known feature addressing this problem is external cache sources. You can supply a previously built image from your registry with the --cache-from flag of the build command. Docker will check the manifest of the image and pull any layers that can be used as a local cache.

There are a few caveats to making it work. The BuildKit backend is required — this means Docker version ≥ 18.09 and setting the DOCKER_BUILDKIT=1 environment variable before invoking docker build. The source image should also be built with --build-arg BUILDKIT_INLINE_CACHE=1 so that it has cache metadata embedded.

Build mounts

When it comes to using a cache directory in Docker builds, one might think that we can just mount it from the host. Easy, right? Except that it’s not supported. Fortunately, BuildKit adds another feature that can help: build mounts. They enable mounting a directory from various sources for the duration of a single RUN instruction:

RUN --mount=type=cache,target=/var/cache/apt \
    apt-get update && apt-get install -y gcc

There are several kinds of mounts, e.g.,

  • bind mount allows you to mount a directory from an image or from the build context;
  • cache mount mounts a directory whose content will be locally cached between builds.

Note, however, that the content of a mount is not available to any later instruction in the Dockerfile.
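
For illustration, here is a hypothetical bind mount that exposes wheel files from an earlier build stage for a single RUN instruction (the stage name build-stage and the /app/dist path are made up for this example):

RUN --mount=type=bind,from=build-stage,source=/app/dist,target=/dist \
    pip install /dist/*.whl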

To make build mounts work, you need to include a special first line in the Dockerfile, # syntax=docker/dockerfile:1.3, and enable BuildKit with the DOCKER_BUILDKIT=1 environment variable.

Improving Builds Step-by-step

Now that we know what Docker with BuildKit has to offer, let’s combine it with some best practices to improve our build times as much as possible.


Dependencies first, application code later

The first trick frequently recommended for Python images is to reorder instructions so that changes in application code do not invalidate the cache of the layer with installed dependencies:
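
A minimal sketch of this ordering might look as follows (the base image and paths are illustrative):

FROM python:3.8
WORKDIR /app
RUN pip install poetry
# Dependency specifications change rarely — copy and install them first
COPY pyproject.toml poetry.lock /app/
RUN poetry install
# Application code changes often — copy it last
COPY . /app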

Now the layer with dependencies will be rebuilt only if pyproject.toml or the lock file changes. By the way, some people recommend installing dependencies with pip rather than Poetry, but there are also reasons not to.

Multi-stage builds and virtual environments

Another obvious thing is to leverage multi-stage builds so that the final image has only necessary production files and dependencies:
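
A sketch of such a multi-stage Dockerfile, assuming the application is started as a module (the name my_app is a placeholder):

FROM python:3.8 as build-stage
WORKDIR /app
RUN pip install poetry
# Create a virtual environment and activate it by setting the variables
RUN python -m venv /venv
ENV VIRTUAL_ENV=/venv PATH="/venv/bin:$PATH"
COPY pyproject.toml poetry.lock /app/
# Skip development dependencies in the production image
RUN poetry install --no-dev
COPY . /app

FROM python:3.8-slim as final
ENV VIRTUAL_ENV=/venv PATH="/venv/bin:$PATH"
# Only the virtual environment and application files reach the final image
COPY --from=build-stage /venv /venv
COPY --from=build-stage /app /app
WORKDIR /app
# Placeholder entry point
CMD ["python", "-m", "my_app"]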

You can see we used a smaller base image (3.8-slim) for the final stage and the --no-dev Poetry option to make the result smaller.

We also added a Python virtual environment. While it may seem superfluous in an already isolated container, it provides a clean way to transfer dependencies between build stages without unnecessary system packages. All you need to activate it is to set the PATH and VIRTUAL_ENV variables (some tools use VIRTUAL_ENV to detect an environment). An alternative to venv is building wheel files.

One caveat with Poetry is that you should be careful with the virtualenvs.in-project setting. Here is a simplified example of what not to do:

COPY ["pyproject.toml", "poetry.lock", "/app/"]
RUN poetry config virtualenvs.in-project true && poetry install
COPY [".", "/app"]
FROM python:3.8-slim as finalENV PATH="/app/.venv/bin:$PATH"
WORKDIR /app
COPY --from=build-stage /app /app

This would keep the resulting image as small and the build as fast as before, but application files and dependencies end up in a single layer of the final image, breaking the caching of dependencies for pulls and pushes. The correct version shown earlier keeps dependencies in their own layer, which can be cached in a remote registry.

Passing repository secrets

Poetry accepts credentials through environment variables such as POETRY_HTTP_BASIC_<REPO>_PASSWORD. A naive solution for passing PyPI repository credentials is therefore to supply them with --build-arg. Don’t do it.

The first reason is security. The variables will remain embedded in the image, as you can verify with docker history --no-trunc <image>. Another reason is that if you use temporary credentials (e.g., supplied by your CI/CD), passing credentials in --build-arg or through the COPY instruction will invalidate the cached layer with dependencies!

BuildKit to the rescue again. The new recommended approach is to use secret build mounts.

  • First, prepare an auth.toml file with your credentials, e.g.:
[http-basic]
[http-basic.my_repo]
username = "my_username"
password = "my_ephemeral_password"
  • Place it outside of your Docker context or exclude it in .dockerignore (the cache would still get invalidated otherwise).
  • Update your Dockerfile to include # syntax=docker/dockerfile:1.3 as the very first line, and change the poetry install command to use a secret build mount (see the sketch after this list).
  • Finally, build the image with DOCKER_BUILDKIT=1 docker build --secret id=auth,src=auth.toml ....
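
The modified poetry install line might look something like this (Poetry reads auth.toml from its configuration directory; the path below assumes the build runs as root):

RUN --mount=type=secret,id=auth,dst=/root/.config/pypoetry/auth.toml \
    poetry install --no-dev

Because the secret is mounted only for the duration of this one instruction, the credentials never end up in any image layer.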

The contents of build mounts are not considered by the caching logic, therefore the layer with installed dependencies will be reused even when credentials change.

Caching without a local cache

One of our requirements was to leverage the Docker cache in CI/CD jobs, which may not have a local cache available. That’s where the external cache sources and --cache-from mentioned earlier can help us. If your remote repository is my-repo.com/my-image, your build command would change to:

DOCKER_BUILDKIT=1 docker build \
--cache-from my-repo.com/my-image \
--build-arg BUILDKIT_INLINE_CACHE=1 \
...

This would work fine with a single-stage build. Unfortunately, we also need to cache layers for the build stage, which requires building and pushing a separate image for it:
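
The two builds might look something like this (the registry path and tags are illustrative):

# Build and push an image for the build stage
DOCKER_BUILDKIT=1 docker build \
--target build-stage \
--cache-from my-repo.com/my-image:build-stage \
--build-arg BUILDKIT_INLINE_CACHE=1 \
--tag my-repo.com/my-image:build-stage \
.
docker push my-repo.com/my-image:build-stage

# Build and push the final image, using both images as cache sources
DOCKER_BUILDKIT=1 docker build \
--cache-from my-repo.com/my-image:build-stage \
--cache-from my-repo.com/my-image:latest \
--build-arg BUILDKIT_INLINE_CACHE=1 \
--tag my-repo.com/my-image:latest \
.
docker push my-repo.com/my-image:latest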

Notice we used --target in the first build command to stop at the build-stage stage, and that the second build command referenced both the build-stage and latest images as cache sources.

An alternative to pushing the cache images to a remote registry is using docker save and managing them as files.

One last thing: now that you push your build stage image too, it’s a good idea to make it smaller as well. Setting the PIP_NO_CACHE_DIR=1 environment variable can help.
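
In the Dockerfile of the build stage, that is a single line:

# Prevent pip from keeping downloaded packages, shrinking the build-stage image
ENV PIP_NO_CACHE_DIR=1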

Use .dockerignore

A cherry on top is omitting unnecessary files from your build context and the resulting Docker image with .dockerignore exclusions. Here is an example of what you may want to ignore (adjust it to your project):
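
# Version control
.git
# Python caches anywhere in the tree
**/__pycache__
**/*.pyc
# Local virtual environments and credentials
.venv
auth.toml
# Files not needed inside the image
tests
docs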

Get a cache directory inside docker build

I’ll mention one last trick, though I don’t recommend it unless you really need it. So far, we have managed to avoid installing dependencies repeatedly thanks to Docker layer caching, as long as the dependency definitions (pyproject.toml or poetry.lock) don’t change. What if we wanted to reuse previously installed packages even when some dependencies change (like Poetry does when running locally)? We would need to get the cached venv directory into the docker build container before poetry install runs.

The simplest solution is to use a cache build mount, as sketched below. The downside is that the cache is only available locally and cannot be reused across machines. Also bear in mind that build mounts are only available during a single RUN instruction, so you need to copy the files to a different location in the image before the RUN instruction finishes (e.g., with cp).
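
A rough sketch of the pattern, assuming the virtual environment in /venv is activated via VIRTUAL_ENV and PATH as in the earlier examples (Docker strips comment lines from multi-line RUN commands, so the inline comments are safe):

RUN --mount=type=cache,target=/venv-cache \
    mkdir -p /venv && \
    # Seed the venv with packages cached by previous builds, if any
    cp -a /venv-cache/. /venv && \
    poetry install --no-dev && \
    # Copy the refreshed environment back so future builds can reuse it
    cp -a /venv/. /venv-cache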

If you manage the cache directory yourself on your build host, you can mount it with a bind build mount instead. The same caveat applies: the mount is available only for a single RUN instruction.

Another approach is to COPY the cache directory from another image, e.g., a previously built build-stage image, which you can pull in as another stage (FROM my-image:build-stage as cache). The tricky part is solving the chicken-and-egg problem: your build needs to work the very first time, before any cache source is available, and there is no if in a Dockerfile. The solution is to parametrize which image the cache stage is based on:

ARG VENV_CACHE_IMAGE=python:3.8

FROM $VENV_CACHE_IMAGE as cache
RUN python -m venv /venv

FROM python:3.8 as build-stage
# ...
COPY --from=cache /venv /venv
RUN poetry install --remove-untracked

If you already have a build-stage image available, point the VENV_CACHE_IMAGE build argument to it. Otherwise, use some other available image as the default; the RUN python -m venv /venv instruction ensures that an empty /venv directory exists so that the COPY won’t fail.
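
Usage could then look like this (image names are illustrative):

# The very first build: no cache image exists yet, so the default base is used
DOCKER_BUILDKIT=1 docker build --target build-stage \
--tag my-repo.com/my-image:build-stage .

# Later builds: reuse the venv from the previously pushed build-stage image
DOCKER_BUILDKIT=1 docker build \
--build-arg VENV_CACHE_IMAGE=my-repo.com/my-image:build-stage \
--tag my-repo.com/my-image:latest .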

The Complete Solution


Let’s summarize what steps you can take to make builds faster and images smaller:

  • Reorder Dockerfile instructions so that only the COPY of dependency specifications precedes the installation of dependencies. Copy application files later.
  • Leverage multi-stage builds and copy only the necessary dependencies with a virtual environment (or wheels) to the final image to make it small.
  • Do not mix application code and dependencies in a single layer with virtualenvs.in-project = true.
  • Use a secret build mount to pass repository credentials.
  • Use --cache-from to reuse images from a registry if a local cache may not be available. This may require pushing a separate image with your build stage to the registry too.
  • If you absolutely need to get a cache directory into a container during docker build, use a cache or bind build mount, or COPY the directory from another image pulled in as an extra build stage. Bear in mind that all of these options are a bit tricky to implement right.

The resulting Dockerfile can look something like this:

Dockerfile for a Python application tailored for optimal Docker caching
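
Something along these lines (the image tags, user name, and application module my_app are placeholders):

# syntax=docker/dockerfile:1.3
ARG VENV_CACHE_IMAGE=python:3.8

# Optional stage supplying a previously built venv as a package cache
FROM $VENV_CACHE_IMAGE as cache
RUN python -m venv /venv

FROM python:3.8 as build-stage
# Keep the build-stage image small by disabling pip's cache
ENV PIP_NO_CACHE_DIR=1
RUN pip install poetry
# Start from the cached virtual environment and activate it
COPY --from=cache /venv /venv
ENV VIRTUAL_ENV=/venv PATH="/venv/bin:$PATH"
WORKDIR /app
# Dependencies first so that their layer is reused while only code changes
COPY pyproject.toml poetry.lock /app/
# Repository credentials are available only during this one instruction
RUN --mount=type=secret,id=auth,dst=/root/.config/pypoetry/auth.toml \
    poetry install --no-dev --remove-untracked
COPY . /app

FROM python:3.8-slim as final
ENV VIRTUAL_ENV=/venv PATH="/venv/bin:$PATH"
COPY --from=build-stage /venv /venv
COPY --from=build-stage /app /app
WORKDIR /app
# Don't run the application under root
RUN useradd --create-home app_user
USER app_user
# Placeholder entry point
CMD ["python", "-m", "my_app"]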

(I also added instructions so that the application doesn’t run under root for security reasons. You can find this and other useful tips in this great article.)

Building and pushing the image in your CI/CD can then look like this:

Script for building the above Dockerfile with build stage caching support
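
Something like the following (the registry path and credentials file are placeholders; the docker pull guard handles the very first build, when no cache image exists yet):

#!/bin/bash
set -euo pipefail

IMAGE=my-repo.com/my-image  # placeholder registry path

# Point the cache stage at the previously pushed build-stage image, if it exists
VENV_CACHE_IMAGE=python:3.8
if docker pull "$IMAGE:build-stage"; then
    VENV_CACHE_IMAGE="$IMAGE:build-stage"
fi

# Build and push the build stage so its layers can serve as a cache next time
DOCKER_BUILDKIT=1 docker build \
--target build-stage \
--cache-from "$IMAGE:build-stage" \
--build-arg BUILDKIT_INLINE_CACHE=1 \
--build-arg "VENV_CACHE_IMAGE=$VENV_CACHE_IMAGE" \
--secret id=auth,src=auth.toml \
--tag "$IMAGE:build-stage" \
.
docker push "$IMAGE:build-stage"

# Build and push the final image, reusing layers from both cache sources
DOCKER_BUILDKIT=1 docker build \
--cache-from "$IMAGE:build-stage" \
--cache-from "$IMAGE:latest" \
--build-arg BUILDKIT_INLINE_CACHE=1 \
--build-arg "VENV_CACHE_IMAGE=$VENV_CACHE_IMAGE" \
--secret id=auth,src=auth.toml \
--tag "$IMAGE:latest" \
.
docker push "$IMAGE:latest"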

As you can see, proper caching with Python in Docker is not exactly trivial. But when you know how to use the new BuildKit features, Docker can work for you instead of against you, and do a lot of the heavy lifting.
