An organised codebase enables you to implement changes faster and make fewer mistakes, ultimately leading to higher code and model quality. Read on to learn how to structure your ML projects with TensorFlow Extended (TFX), the easy and straightforward way.

Project Structure: Requirements
- Enable experimentation with multiple pipelines
- Support both a local execution mode and a deployment execution mode. This ensures the creation of 2 separate running configurations, with the first being used for local development and end-to-end testing and the second one used for running in the cloud.
- Reuse code across pipeline variants if it makes sense to do so
- Provide an easy-to-use CLI interface for executing pipelines with different configurations and data
A correct implementation also ensures that tests are easy to incorporate in your workflow.
Project Structure: Design Decisions
- Use Python.
- Use TensorFlow Extended (TFX) as the pipeline framework.
In this article we will demonstrate how to run a TFX pipeline both locally and on a Kubeflow Pipelines installation with minimum hassle.
Side Effects Caused By Design Decisions
- By using TFX, we are going to use TensorFlow. Keep in mind that TensorFlow supports more types of models, like boosted trees.
- TFX runs its data processing on Apache Beam, which can execute locally, anywhere Kubernetes runs and on all public cloud providers. Examples include, but are not limited to, GCP Dataflow and Azure Databricks.
- Due to Apache Beam, we need to make sure that the project code is easily packageable by Python's `sdist` for maximum portability. This is reflected in the top-level module structure of the project. (If you use external libraries, be sure to include them by providing an argument to Apache Beam. Read more about this in Apache Beam: Managing Python Pipeline Dependencies.)
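As a small illustration, these are the kinds of packaging arguments Beam's "Managing Python Pipeline Dependencies" guide describes; the file paths are placeholders, and the exact flags you need depend on your runner:

```python
# Sketch: telling Beam how to package the project code and its dependencies.
beam_pipeline_args = [
    "--setup_file=./setup.py",                  # package the project via sdist
    "--requirements_file=./requirements.txt",   # external PyPI dependencies
]

# These args are later forwarded to the TFX pipeline definition, e.g.
# tfx.orchestration.pipeline.Pipeline(..., beam_pipeline_args=beam_pipeline_args)
```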
[Optional] Before continuing, take a moment to read about the provided TFX CLI. Currently, it is embarrassingly slow to operate and the directory structure is much more verbose than it needs to be. It also does not include any notes on reproducibility and code reuse.
Directory Structure and Intuition Behind It
- `$project-name` is the root directory of your project
- `$project-name/ml` includes Machine Learning related stuff.
- `$project-name/ml/pipelines` includes the actual ML pipeline code
- Typically, you may find yourself with multiple ML pipelines to manage, such as `$project-name/ml/pipelines/predict-sales` and `$project-name/ml/pipelines/classify-fraud` or similar.
- Here is a simple `tree` view:
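Below is a minimal sketch of that layout, assembled from the files discussed in the rest of this section (the `predict-sales` pipeline name is just a placeholder):

```
$project-name/ml/pipelines
├── data/                      # small, representative dataset for local runs and CI
├── util/                      # code shared across pipelines
├── cli.py                     # entry point and CLI for all pipelines
├── pipeline.py                # parameterised component declaration and wiring
├── local_beam_dag_runner.py   # run locally with the Beam runner
├── kfp_runner.py              # run on Kubeflow Pipelines
└── predict-sales/             # one directory per $pipeline-name
    ├── constants.py
    ├── model.py
    └── training.py
```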
`$project-name/ml/pipelines` includes the following:
- `data` → a small amount of representative training data to run locally for testing and on CI. That holds if your system does not have a dedicated component to pull data from somewhere; if it does, make sure to include a sampling query with a small, limited number of items.
- `util` → code that is reused and shared across `$pipeline-name`s. It is not necessary to include `input_fn_utils.py` and `model_utils.py`. Use whatever makes sense here. Here are some examples:
- In my own projects, it made sense to abstract some parts into the utility module, like building named input and output layers for the Keras models.
- Building the serving signature metagraph using TensorFlow Transform output.
- Preprocessing features into groups by using keys.
- And also other common repetitive tasks, like building input pipelines with the TensorFlow Dataset API.
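As an illustration, here is a minimal sketch of the first of those helpers: building named Keras input layers from a list of feature keys. The module path, the function name and the feature-key convention are my own assumptions, not part of any TFX API:

```python
# util/layers.py -- hypothetical shared helper (names are illustrative).
from typing import List

import tensorflow as tf


def build_named_inputs(feature_keys: List[str]):
    """Returns a dict of named scalar Keras Input layers, one per feature key.

    Keeping this in `util` lets every $pipeline-name's model.py reuse the same
    naming convention for serving signatures and TFT outputs.
    """
    return {
        key: tf.keras.layers.Input(shape=(1,), name=key, dtype=tf.float32)
        for key in feature_keys
    }
```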
- `cli.py` → entry point and command line interface for the pipelines. Here are some common things to consider when using TFX.
By using [abseil](https://github.com/abseil/abseil-py) you can declare and access flags globally. Each module defines the flags that are specific to it, so flag declaration is distributed across the system. Common flags, like `--data_dir=...`, `--hparam_tuning`, `--pipeline_root`, `--ml_metadata_url`, `--use_cache` and `--train_epochs`, are ones you can define in the actual `cli.py` file. Other, more specific ones for each pipeline can be defined in submodules.
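A minimal sketch of what those shared declarations might look like in `cli.py`; the defaults and help strings are placeholders:

```python
# cli.py -- common flags declared once with abseil, readable from any module.
from absl import flags

FLAGS = flags.FLAGS

flags.DEFINE_string("data_dir", None, "Root directory of the training data.")
flags.DEFINE_string("pipeline_root", None, "Where pipeline artifacts get written.")
flags.DEFINE_string("ml_metadata_url", None, "Connection string for ML Metadata.")
flags.DEFINE_bool("use_cache", True, "Reuse cached component executions.")
flags.DEFINE_integer("train_epochs", 10, "Default number of training epochs.")
flags.DEFINE_bool("hparam_tuning", False, "Whether to run hyperparameter search.")
```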
This file acts as the entry point for the system. It uses the contents of `pipeline.py` to set up the components of the pipeline, as well as to provide the user-provided module files (in the tree example these are `constants.py`, `model.py`, `training.py`), based on some flag like `--pipeline_name=$pipeline-name` or some other configuration. Finally, with the assembled pipeline, it calls some `_runner.py` file by using a `--runner=` flag.
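A sketch of how that entry point and dispatch might look, continuing the flag declarations above; `create_pipeline` and the `run` helpers inside the runner modules are assumptions about how you organise your own code, not a TFX API:

```python
# cli.py (continued) -- pipeline selection and runner dispatch.
from absl import app, flags

import pipeline                                     # defines create_pipeline()
from local_beam_dag_runner import run as run_local
from kfp_runner import run as run_kfp

FLAGS = flags.FLAGS
flags.DEFINE_string("pipeline_name", None, "Which $pipeline-name to assemble.")
flags.DEFINE_enum("runner", "local", ["local", "kfp"], "Which runner to use.")


def main(argv):
    del argv  # unused
    # create_pipeline() wires the TFX components and points the Trainer at the
    # user-provided module files under <pipeline_name>/.
    tfx_pipeline = pipeline.create_pipeline(
        pipeline_name=FLAGS.pipeline_name,
        pipeline_root=FLAGS.pipeline_root,
        data_dir=FLAGS.data_dir,
    )
    if FLAGS.runner == "local":
        run_local(tfx_pipeline)
    else:
        run_kfp(tfx_pipeline)


if __name__ == "__main__":
    app.run(main)
```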
- `pipeline.py` → parameterised pipeline component declaration and wiring. This is usually just a function that declares a bunch of TFX components and returns a `tfx.orchestration.Pipeline` object (see the sketch after this list).
- `local_beam_dag_runner.py` → configuration to run locally with the portable Beam runner. This can typically be almost configuration-free, just by using the `BeamDagRunner`.
- `kfp_runner.py` → configuration to run on Kubeflow Pipelines. This typically includes different data path and pipeline output prefixes and auto-binds an ml-metadata instance.
Note: you can have more runners, like something that runs on GCP and just configures more provisioned resources, like TPU instances, parallel AI Platform hyperparameter search, etc.
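A minimal sketch of that wiring, assuming a recent TFX release; the component list, the `create_pipeline` signature and the file paths are illustrative only:

```python
# pipeline.py -- declare components and return a tfx.orchestration.Pipeline.
import os

from tfx.components import CsvExampleGen, Trainer
from tfx.orchestration import metadata, pipeline
from tfx.proto import trainer_pb2


def create_pipeline(pipeline_name: str, pipeline_root: str,
                    data_dir: str) -> pipeline.Pipeline:
    example_gen = CsvExampleGen(input_base=data_dir)
    trainer = Trainer(
        # User-provided module file, e.g. <pipeline_name>/training.py.
        module_file=os.path.join(pipeline_name, "training.py"),
        examples=example_gen.outputs["examples"],
        train_args=trainer_pb2.TrainArgs(num_steps=1000),
        eval_args=trainer_pb2.EvalArgs(num_steps=100),
    )
    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, trainer],
        enable_cache=True,
        metadata_connection_config=metadata.sqlite_metadata_connection_config(
            os.path.join(pipeline_root, "metadata.sqlite")),
    )
```

The local runner can then stay almost configuration-free, e.g. `BeamDagRunner().run(create_pipeline(...))` inside `local_beam_dag_runner.py`.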
- `$pipeline-name` → this is the user-provided code that makes different models, schedules different experiments, etc.

Due to the `util` submodule, code under each pipeline should be much leaner. There is no need to split it into more than 3 files, though nothing prohibits you from splitting your code across more files.
From experimentation, I converged to a `constants`, `model` and `training` split.
- `constants.py` → declarations. Sensible default values for training parameters, hyperparameter keys and declarations, feature keys, feature groups, evaluation configurations and metrics to track. Here is a small example:
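A minimal sketch of what such a module might hold; the feature keys, groups and defaults below are placeholders rather than values from a real project:

```python
# $pipeline-name/constants.py -- sketch; all names and values are placeholders.

# Sensible training defaults (overridable from the CLI flags).
TRAIN_EPOCHS = 10
LEARNING_RATE = 1e-3
BATCH_SIZE = 64

# Hyperparameter search space keys.
HP_LEARNING_RATE_KEY = "learning_rate"
HP_HIDDEN_UNITS_KEY = "hidden_units"

# Feature keys, grouped so preprocessing can treat them uniformly.
NUMERIC_FEATURE_KEYS = ["price", "quantity"]
CATEGORICAL_FEATURE_KEYS = ["store_id", "product_category"]
LABEL_KEY = "units_sold"

# Metrics to track during evaluation.
EVAL_METRICS = ["mean_absolute_error", "mean_squared_error"]
```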
- `model.py` → Model definition. Typically contains a `build_keras_model` function and uses imports from `util` and `$pipeline-name.constants`. Here's an example from a recent project of mine:
- Lastly, `training.py` includes all the fuss required to train the model. This is typically: preprocessing definition, hyperparameter search, setting up training data, data- or model-parallel strategies and TensorBoard logs, and saving the model for production.
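With the TFX Trainer this usually boils down to exposing a `run_fn`; here is a hedged sketch, assuming the standard `FnArgs` interface and the placeholder modules above, with a deliberately minimal input pipeline:

```python
# $pipeline-name/training.py -- sketch of a TFX Trainer module file.
import tensorflow as tf
from tfx.components.trainer.fn_args_utils import FnArgs

from . import constants
from .model import build_keras_model


def _input_fn(file_patterns, batch_size):
    """Minimal tf.data pipeline over the gzipped TFRecords that ExampleGen emits."""
    feature_spec = {
        key: tf.io.FixedLenFeature([1], tf.float32)
        for key in constants.NUMERIC_FEATURE_KEYS + [constants.LABEL_KEY]
    }
    files = tf.data.Dataset.list_files(file_patterns)
    dataset = tf.data.TFRecordDataset(files, compression_type="GZIP")
    dataset = dataset.batch(batch_size).map(
        lambda batch: tf.io.parse_example(batch, feature_spec))

    def _split_label(features):
        label = features.pop(constants.LABEL_KEY)
        return features, label

    return dataset.map(_split_label).repeat()


def run_fn(fn_args: FnArgs):
    """Entry point the TFX Trainer calls with data paths and output locations."""
    strategy = tf.distribute.MirroredStrategy()  # simple data-parallel training
    with strategy.scope():
        model = build_keras_model()

    model.fit(
        _input_fn(fn_args.train_files, constants.BATCH_SIZE),
        validation_data=_input_fn(fn_args.eval_files, constants.BATCH_SIZE),
        steps_per_epoch=fn_args.train_steps // constants.TRAIN_EPOCHS,
        validation_steps=fn_args.eval_steps,
        epochs=constants.TRAIN_EPOCHS,
        callbacks=[tf.keras.callbacks.TensorBoard(log_dir=fn_args.model_run_dir)],
    )
    # Save the serving model where TFX (and a downstream Pusher) expects it.
    model.save(fn_args.serving_model_dir, save_format="tf")
```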
That’s it. Thank you for reading to the end!
I hope that you enjoyed reading this article as much as I enjoyed writing it.