Notes from Industry

“DAG Card” Is the New “Model Card”

Automated generation of DAG Cards from Metaflow, inspired by Google's Model Cards for machine learning models.

Jacopo Tagliabue
Towards Data Science
7 min read · Mar 13, 2021

Introduction

“Imitation is the sincerest form of flattery” — O. Wilde (possibly imitating someone else)

Software is eating the world, and A.I. is eating software. However, there is still a relatively small number of people who can understand the behavior of machine learning models: truth is, even A.I. experts struggle to understand a model built by somebody else, especially if it is not a rarefied model from the literature, but an actual API serving millions of requests a day. So the problem is pressing: how do we make models easy to understand for a wider audience, inside and outside our organization?

In 2019, a FAT* paper by Margaret Mitchell et al. introduced the idea of Model Cards: a “one-pager” summing up what we know about a given model, its input and output of course, but also accuracy on a test set, biases and limitations, best practices for its use, etc. As aptly exemplified by their Face Detection page, cards are meant to be a “reference for all, regardless of expertise”: ML engineers will find pointers on architectures and quantitative performance, PMs will read about strengths and weaknesses to imagine new use cases, and marketing folks will get a bird’s-eye view of its capabilities.

We liked the idea so much that we decided to hack together a Metaflow card generator ourselves, self-documenting ML pipelines from code and comments.

To build our own cards, we made a few tweaks to the original idea (mostly due to our focus on B2B, as well as an internal-only audience):

  • while the Google sample card is obscenely cute, we want to generate readable cards without ad hoc manual work; in other words, we want cards to be automatically built from our code;
  • Google model cards mostly address a B2C use case, in which a single model is globally available at scale through a public interface; for us, cards are primarily targeted at the internal audience of a growing company, where keeping track of new features is hard, significant domain knowledge goes into models, and several services/frameworks are involved in training and testing;
  • finally, while models are certainly the frontmen of A.I. bands, they are not a monad living in isolation: to us, ML work is almost invariably synonymous with “ML pipeline”, or better still, with “ML DAG”: given that our tool of choice for ML DAGs is Metaflow, our card generator may be thought of as a “plugin” that runs on top of existing Metaflow classes.

To get a feeling of what we are building, this is a screenshot of our sample DAG card:

A glimpse of a DAG card, illustrating owners, tasks, input files and parameters. The accompanying code shows how to programmatically generate a card like this from a Metaflow class [ screenshot by the author — original Google card here ].

And now, let’s make documentation great again!

[ NERD NOTE: The code to reproduce this post is shared on GitHub: it’s all WIP and hand-waving, but if there is interest we may actually build a legit PyPI package down the line! ]

Prerequisites

This post assumes you know why Metaflow is great, Weights & Biases is cool, and how they fit together: in case you missed Season 1, you can start from our pilot post (note that today’s DAG is just a stripped down version of the previous one).

Go through the README, but you basically just need to make sure you have:

  • Metaflow up and running, preferably configured to use the AWS-based setup;
  • an account on Weights & Biases, with a valid API key.
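
As a quick refresher on how the two fit together, here is a minimal, hypothetical sketch (the flow, project and metric names are made up): a Metaflow flow whose training step logs a metric to Weights & Biases, reading the API key from the environment.

```python
from metaflow import FlowSpec, step


class ToyTrainingFlow(FlowSpec):
    """A toy flow: the class docstring is exactly what a DAG card can reuse as its Overview."""

    @step
    def start(self):
        self.next(self.train)

    @step
    def train(self):
        import wandb  # imported inside the step, so only this task needs it

        # Hypothetical project/entity; WANDB_API_KEY is expected in the environment.
        run = wandb.init(project="dag-card-demo", entity="my-team")
        self.accuracy = 0.92  # placeholder metric instead of a real training loop
        wandb.log({"accuracy": self.accuracy})
        run.finish()
        self.next(self.end)

    @step
    def end(self):
        print(f"Done, accuracy={self.accuracy}")


if __name__ == "__main__":
    ToyTrainingFlow()
```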

Wait, Why Cards and Not Confluence Pages?

We know what your colleagues will say:

“Of course we have documentation on model XYZ! Have you checked Confluence?”

The uncomfortable truth is that, yes, we checked, and we are none the wiser. While Confluence (and similar solutions) is an indispensable tool to share knowledge inside a company, in our experience it rarely works well for ML DAGs (especially for new, still-changing features). From the perspective of the people owning the ML pipeline (that is, ultimately the people responsible for its behavior), we see DAG Cards as having key advantages over Confluence pages:

  • building and updating Confluence pages is a quintessentially manual endeavor. On top of that, a 360° view of a DAG involves tapping into different services and APIs, and therefore a complex layer of authorizations, domain knowledge, etc.; DAG Cards are built programmatically, and can be updated, for example, at the end of each training cycle;
  • Confluence pages are inherently static: the model changes all the time, but we all tend to forget to change Confluence as well. DAG Cards depend only on what is in the repo: if developers and testers update files and comments (as they should do anyway as part of good coding practices), the card will reflect all the latest changes;
  • Confluence pages are not interactive: while the current DAG Card is not interactive either, it is not hard to extend the layout to include (as Google cards do) a testing interface;
  • last, but definitely not least, Confluence pages may or may not be written “in the same spirit” as our DAG Cards: if PMs write the page, we detach the model’s creators from how it is meant to be consumed; if ML engineers write the page, they may tend to stress only a certain type of information. While no bullet-proof solution exists of course (as human affairs tend to resist silly standardization), we believe that DAG Cards point in the right direction, as there is tremendous value in encouraging engineers to explain their work in a comprehensive way, both quantitatively and qualitatively. By first asking the people behind the DAG to think hard about what their code really does (its shortcomings, the KPIs to monitor, the special cases to get right no matter what), we believe we can foster a bigger sense of ownership and prevent many errors from ever showing up in production. In a sense, our way is the old way: “the man who passes the sentence should swing the sword”.

Building Cards Programmatically

The card builder script is pretty straightforward. Given:

  • an HTML (Jinja) template;
  • a Metaflow class (and config file to instantiate the client);
  • a Weights & Biases API key;

the builder will call the relevant services, grab information about the DAG and the model, prettify a bunch of JSONs, and finally compose a static HTML page for consumption.
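
As a rough, hypothetical sketch of that logic (the flow name, the Weights & Biases project path, the template folder and the metric key are all invented, and the actual builder in the repo does more), the core could look roughly like this:

```python
from jinja2 import Environment, FileSystemLoader
from metaflow import Flow
import wandb

FLOW_NAME = "CardFlow"                   # hypothetical Metaflow class name
WANDB_PROJECT = "my-team/dag-card-demo"  # hypothetical entity/project path


def build_card(template_dir="templates", output_path="card.html"):
    # 1. DAG metadata from the Metaflow client API
    latest_run = Flow(FLOW_NAME).latest_successful_run
    steps = [s.id for s in latest_run.steps()]

    # 2. Model metrics from the Weights & Biases public API
    api = wandb.Api()
    runs = list(api.runs(WANDB_PROJECT))[:5]
    metrics = [{"id": r.id, "accuracy": r.summary.get("accuracy")} for r in runs]

    # 3. Fill the Jinja template and write out a static HTML page
    env = Environment(loader=FileSystemLoader(template_dir))
    html = env.get_template("card.html").render(
        flow_name=FLOW_NAME, steps=steps, metrics=metrics
    )
    with open(output_path, "w") as f:
        f.write(html)


if __name__ == "__main__":
    build_card()
```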

The card builder takes a Jinja template and fills its slots by calling APIs from Metaflow, Weights & Biases and possibly other services: the output is a straightforward static HTML page [ image by the author ].

When running the code end-to-end, the final result is a Google-like card collecting in one place a lot of information about our Metaflow DAG:

Animated GIF showing the final DAG card [ image by the author — original screencapture here]

In particular, we organized the content in five main parts:

  • Overview: high-level description of the DAG (i.e. the docstring of the Python class).
  • Owners: a section containing DAG developers/owners, and a chart displaying distribution of runs among them.
  • DAG: a visual description of the DAG, and collapsible sections with step details (built through Metaflow’s own representation).
  • Model: collapsible sections reporting metrics for the latest K runs, the path to cloud storage (for ML engineers that wish to recover the exact artifact), an architecture sketch, and a loss/epoch chart.
  • Tests: on top of code checks (e.g. vector size) and general quantitative tests (e.g. accuracy), behavioral tests are designed to be “sanity checks” on the model’s qualitative behavior, and probe performance on cases which are deemed important for deployment (e.g. model performance on a small, but crucial set of inputs). While explaining our testing philosophy is out of scope (this is a great paper on NLP tests!), we just note that DAG cards can report predictions for important use cases, as specified by ML/QA engineers when building the pipeline (see the sketch right after this list). A natural extension for more complex models would be to break down accuracy by partitions of the input space (say, loan default probability by gender, race and age): while human analysis is unbeatable, behavioral tests act like a powerful safeguard against obvious (but still not rare) mistakes.
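
To make that last point concrete, here is a minimal, hypothetical sketch (the test cases and the predict stand-in are invented): behavioral test results are stored as a regular flow artifact, so the card builder can later read and render them.

```python
from metaflow import FlowSpec, step

# Hypothetical "crucial" cases an ML/QA engineer wants the card to always report on.
BEHAVIORAL_CASES = [
    {"query": "nike shoes", "expected": "sneakers"},
    {"query": "nkie shoes", "expected": "sneakers"},  # typo robustness
]


class CardFlow(FlowSpec):
    """Toy flow whose behavioral test results end up on the DAG card."""

    @step
    def start(self):
        self.next(self.behavioral_tests)

    @step
    def behavioral_tests(self):
        def predict(query):
            # stand-in for the real model produced by an earlier training step
            return "sneakers"

        self.behavioral_results = [
            {**case,
             "predicted": predict(case["query"]),
             "passed": predict(case["query"]) == case["expected"]}
            for case in BEHAVIORAL_CASES
        ]
        self.next(self.end)

    @step
    def end(self):
        # The card builder can read the artifact through the client API, e.g.
        # Flow("CardFlow").latest_successful_run.data.behavioral_results
        print(self.behavioral_results)


if __name__ == "__main__":
    CardFlow()
```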

The MVP is simple, and you should be able to extend it to include more data and/or more services: Jinja templates can be composed like Russian dolls, offering a principled way to extend the UI to many more use cases.
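
For instance, here is a tiny, hypothetical illustration (template names and variables are invented) of the “Russian doll” idea: the card template includes a section sub-template, so adding a new card section is just a matter of adding a new sub-template.

```python
from jinja2 import Environment, DictLoader

# Inline templates for brevity; in the repo these would be separate HTML files.
templates = {
    "card.html": "<h1>{{ flow_name }}</h1>{% include 'model_section.html' %}",
    "model_section.html": "<p>Latest accuracy: {{ accuracy }}</p>",
}

env = Environment(loader=DictLoader(templates))
print(env.get_template("card.html").render(flow_name="CardFlow", accuracy=0.92))
```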

While we published working code to get the discussion started, we approach this from a Lean perspective, which basically means the least amount of effort needed to just get our cards done: make sure to check out the README for some technical “backlog”, and let us know if you would like to see something more polished come out of this experiment!

While “everybody wants to do the model work”, we do believe that documenting and sharing information is equally important in a functioning A.I. organization: (model) knowledge is indeed power (and yes, since you asked, France is bacon).

See you, space cowboys

If you have questions or feedback, please connect and share your MLOps story (all Medium opinions are my own). If you want to team up and extend this idea into a full-fledged package, please do reach out!
