Generating a working (value-generating) machine learning model is not an easy task. It usually involves advanced modelling techniques and teams with scarce skills. However, this is only the first step of an even more complex task: deploying the model into production and preventing its degradation.
Even with the relief brought by the shift to the cloud, at least two-thirds of IT spend is still concentrated on maintenance-mode tasks. There is still little research on whether this split holds for ML-related projects, but my take is that the percentage will increase significantly, because an ML workload has more "liquid" inputs and fewer control levers, as shown below:

In essence, maintenance is mainly driven by the level of variability and control we have over the different components of the system. As shown in the diagram, it is reasonable to conclude that machine learning workloads are more prone to maintenance tasks. To make things even worse, the evolution paths of data and code (business rules) do not necessarily need to be aligned. This is explained in depth in Hidden Technical Debt in Machine Learning Systems.
It is therefore absolutely necessary to develop a robust framework for keeping control over machine learning operations once our model is deployed into production, while at the same time ensuring that the quality of the models and their evolution is not compromised.
The science (art) of developing models is a well-studied field, with industry reference frameworks for model development such as CRISP-DM and specific EDA methodologies, so for the rest of the article we will assume we already have a trained model with acceptable performance.
What infrastructure do we need for running Machine Learning at scale?
In a nutshell, there are three big platforms we need to engineer, apart of course from the development platform (where we build the initial model, run experiments and so on) and other cross-functional platforms such as code repositories, container registries, schedulers or monitoring systems. The three platforms are depicted in the following diagram:

Inside the feature store
In essence, a feature store decouples the feature engineering process from its usage. This is especially useful where the input data is subject to complex feature transformation logic or where one feature is used by many models; in those scenarios a feature store is an excellent component to engineer since it hides complexity and promotes reusability. However, there are certain scenarios where we can skip this component, for example where the data used for training the model is in its natural state or where the model itself incorporates feature generators (e.g. convolutional, bidirectional or embedding layers).

The feature store comprises a number of elements:
- Ingestion: This component is responsible for loading the raw data into the feature store storage. Both batch and online ingestion paths should be supported.
- Feature transformation: This component is responsible for actually computing the features; again, both batch and online processing should be supported. Computation time is critical when designing this component.
- Feature serving layer: This component actually serves the features for downstream processing. Again, features can be retrieved online or in batches; a minimal sketch of this interface follows the list.
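As a rough illustration of what the serving layer's contract might look like, here is a minimal sketch (the class and method names are hypothetical, not a real feature store API): the same feature definitions back a batch path for assembling training sets and an online path for low-latency lookups.

```python
from typing import Dict, List

import pandas as pd


class FeatureServingLayer:
    """Hypothetical serving-layer interface: one set of feature definitions,
    two retrieval paths (batch for training, online for inference)."""

    def __init__(self, offline_table: pd.DataFrame, online_kv: Dict[str, Dict[str, float]]):
        self._offline = offline_table   # historical, point-in-time feature values
        self._online = online_kv        # latest value per entity, low-latency store

    def get_batch_features(self, entity_ids: List[str], feature_names: List[str]) -> pd.DataFrame:
        """Batch retrieval, used to assemble a training set."""
        rows = self._offline[self._offline["entity_id"].isin(entity_ids)]
        return rows[["entity_id", "event_timestamp"] + feature_names]

    def get_online_features(self, entity_id: str, feature_names: List[str]) -> Dict[str, float]:
        """Online retrieval, used at inference time."""
        latest = self._online.get(entity_id, {})
        return {name: latest.get(name) for name in feature_names}
```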
Inside the training rig
The objective of the training rig is to find and produce the best model (at a specific point in time) given: (i) an initial model architecture, (ii) a set of tunable hyperparameters and (iii) a historical labelled feature set.
The next diagram highlights the main components I believe should be present to ensure a smooth and effective training operation.

The output of the training rig is what I call the "golden model": in other words, the architecture, weights and signature that will be deployed on the inference platform. In order to generate these assets, several components must intervene.
- Re-train checker: This component's mission is to detect when the current golden model needs to be re-trained; there are many situations in which a re-train event must be raised. I propose deploying a number of evaluators to check for re-train conditions. Some examples are changes in the features (addition or deletion), statistical divergence within the training set (data drift) or between the training data and the serving data (skew), or simply a fall in accuracy metrics (see the drift-check sketch after this list). The model generated should be passed to the promoter component, which has the last word on whether and how to deploy it to production.
- Golden model loop: This is probably the most critical step, since it actually performs the training. Therefore, performance considerations should be taken into account when engineering the system (e.g. distributed infrastructure and access to hardware ASICs). Another responsibility is to generate the model signature, clearly defining the input and output interfaces as well as any initialisation tasks (e.g. variable loaders).
- Next golden model loop: This component aims to discover potentially better models by continuously optimising (or attempting to optimise) the current golden model. There are two sub-loops: one for optimising the hyperparameters (e.g. learning rate, optimisers) and another for searching for a new model architecture (e.g. number of layers). Although the loops are separate, new architecture candidates can be further refined in the hyperparameter loop; a minimal search sketch follows this list. This component can be resource-intensive, particularly if the search space is big and the optimisation algorithm (e.g. grid search, Hyperband) is greedy. From an engineering perspective, techniques such as checkpointing for resumable operations and job prioritisation mechanisms should be taken into account. The outputs of this component should be further evaluated before taking any additional actions.
- Model promoter: This component is responsible for issuing models that are ready for production work, so extensive testing should be performed at this step. In any case, as we will examine in the prediction rig, no new model is deployed straight away to its entire potential user base.
- Metadata store: This component centralises all the metadata associated with the training phase (model repository, parameters, experiments, etc.).
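To make the re-train checker more concrete, here is a minimal sketch of one possible drift/skew evaluator, using a two-sample Kolmogorov-Smirnov test per numeric feature; the function name and the p-value threshold are illustrative choices, not a prescription.

```python
from typing import Dict

import numpy as np
from scipy.stats import ks_2samp


def detect_drift(train_features: Dict[str, np.ndarray],
                 serve_features: Dict[str, np.ndarray],
                 p_threshold: float = 0.01) -> Dict[str, bool]:
    """Flag features whose serving distribution diverges from the training one.

    A two-sample Kolmogorov-Smirnov test is run per numeric feature; a p-value
    below the threshold is treated as evidence of drift/skew and can be used
    to raise a re-train event.
    """
    drifted = {}
    for name, train_values in train_features.items():
        serve_values = serve_features.get(name)
        if serve_values is None:
            drifted[name] = True  # feature disappeared: also a re-train condition
            continue
        _, p_value = ks_2samp(train_values, serve_values)
        drifted[name] = p_value < p_threshold
    return drifted
```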
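And for the hyperparameter sub-loop of the next golden model loop, a bare-bones random search sketch; checkpointing and job prioritisation are deliberately left out, and `train_and_evaluate` is a placeholder for whatever training routine the golden model loop exposes.

```python
import random
from typing import Callable, Dict, List


def random_search(train_and_evaluate: Callable[[Dict], float],
                  search_space: Dict[str, List],
                  n_trials: int = 20,
                  seed: int = 42) -> Dict:
    """Sample hyperparameter combinations and keep the best-scoring one.

    `train_and_evaluate` is expected to train a candidate model with the given
    parameters and return a validation metric (higher is better).
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = train_and_evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params


# Illustrative search space for the hyperparameter sub-loop
space = {"learning_rate": [1e-4, 1e-3, 1e-2], "optimizer": ["adam", "sgd"], "batch_size": [32, 64, 128]}
```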
Inside the prediction rig
The main goal of the prediction platform is to execute inferences. The next diagram presents a set of components for achieving that.

A few components are present in the inference stage:
- Feature transformer: Even with a feature store in place to decouple features from data production systems, I think a feature transformer still has its place at inference time, applying low-level, specific operations to a potentially reusable and more abstract feature. For online systems, latency requirements are critical.
- Dispatcher: The dispatcher's objective is to route requests to a particular prediction endpoint. I believe that every single request should be subject to an experiment, which is why the dispatcher should be able to redirect a call to one or many live experiments, to the golden model, or to both (see the routing sketch after this list). Each request not subject to experimentation is a lost improvement opportunity.
- Predict backbone: The horsepower of the prediction rig resides in this component; hence, from an engineering standpoint, it is critical to design for classical non-functional requirements such as performance, scalability or fault tolerance.
- Cache layer: A low-latency key-value store to quickly respond to repeated queries. It must implement the classical cache mechanisms (invalidation, key computation based on feature hashing, LRU queues, etc.); see the cache-key sketch after this list.
- Golden promoter/de-promoter: As A/B tests take place, we could reach a point where one of the live experiments is actually more performant than the current golden model. This component's mission is to analyse metadata, and in particular ground-truth data in the feature store, to suggest replacing the golden model with one of the experiments.
- Model warmers: A component that ensures cache and memory warm-up when a cold-start situation happens (e.g. a new model promotion).
- Explainer: A component that implements model explainability logic (e.g. Anchors, CEM) and returns an explanation for a given request.
- Metadata store: This component centralises all the metadata associated with the prediction phase (live experiment performance, prediction data statistics, etc.).
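A minimal sketch of the dispatcher's routing logic, assuming a simple hash-based traffic split (the experiment names and percentages are illustrative):

```python
import hashlib
from typing import Dict


def route_request(request_key: str, experiments: Dict[str, float], golden: str = "golden") -> str:
    """Route a prediction request to a live experiment or to the golden model.

    `experiments` maps experiment names to the fraction of traffic they should
    receive (the remainder goes to the golden model). Hashing the request key
    keeps the assignment deterministic, so the same user/entity consistently
    hits the same variant while an experiment is live.
    """
    bucket = int(hashlib.sha256(request_key.encode()).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for name, share in experiments.items():
        cumulative += share
        if bucket < cumulative:
            return name
    return golden


# Example: 10% of traffic to a candidate model, the rest to the golden model
# route_request("user-1234", {"candidate-v2": 0.10})
```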
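And for the cache layer, one possible way to compute a deterministic cache key from the feature vector; this is a sketch, not a prescription, and including the model version means a promotion naturally invalidates old entries.

```python
import hashlib
import json
from typing import Dict


def cache_key(model_version: str, features: Dict[str, float]) -> str:
    """Build a deterministic cache key from the model version and the feature vector.

    Serialising the features with sorted keys makes the hash stable across
    dict orderings; including the model version invalidates entries when a
    new golden model is promoted.
    """
    payload = json.dumps(features, sort_keys=True)
    return hashlib.sha256(f"{model_version}:{payload}".encode()).hexdigest()
```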
Some user journeys enabled by the platform
There are several journeys this architecture can articulate; my goal is not to mention all of them, but I would like to highlight a few interesting ones:
At feature generation time
- Compose a complex feature and serve it in real time
- Compose a complex feature by kicking off a long-running operation (LRO) and use it consistently across many models
- Change/update a feature information producer without affecting the transformation and serving logic
At training time
- Launch a re-training (distributed) job triggered by the inclusion of a new feature in the training set
- Run model evaluators based on data (feature) dependencies
- Discover a new and more performant architecture for a current DNN model
- Optimize the learning rate for an already deployed model
At prediction time
- Gradually roll out a new golden model by progressively expanding its reach to the whole population
- Query a prediction along with its black-box explanation
Manageability
- Inspect versions of models, features and signatures
Available technology for deployment
There is a great array of open source software we can use to build a platform with all the components described above, but before thinking about designing each component independently, wouldn't it be great if we could solve non-functional requirements such as scalability, security or portability in a standard and unified way?
Luckily, we can rely on Kubernetes as the main platform on which to deploy our components. The following diagram shows a proposed mapping of components to open source products/projects*.

*FEAST and Kubeflow integration is currently a work in progress
To make things even easier, Kubeflow already packages all those components nicely, so much of the integration is already done.
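As an illustration of the common deployment primitive underneath all of this, here is a minimal sketch that uses the official Kubernetes Python client to deploy a model-serving container; the image name, labels and namespace are placeholders.

```python
from kubernetes import client, config


def deploy_model_server(name: str = "golden-model-server",
                        image: str = "registry.example.com/model-server:latest",
                        replicas: int = 2) -> None:
    """Create a Deployment for a model-serving container (image name is a placeholder)."""
    config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
    container = client.V1Container(
        name=name,
        image=image,
        ports=[client.V1ContainerPort(container_port=8080)],
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=replicas,
        selector=client.V1LabelSelector(match_labels={"app": name}),
        template=template,
    )
    deployment = client.V1Deployment(metadata=client.V1ObjectMeta(name=name), spec=spec)
    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```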
Conclusion
Even though one might think that running a machine learning operation is fundamentally different from a traditional one, most software engineering principles still hold; they are simply applied in a different context. In this article, we have presented a logical high-level architecture that can be easily deployed using open source components such as Kubeflow.
I am publishing some components and example notebooks on this topic here.