As demand for AI applications grows, companies have put considerable effort into building Machine Learning Engineering (MLE) tools tailored to their needs. Industry faces many challenges in building a well-designed environment for the Machine Learning (ML) lifecycle: building, deploying, and managing ML models in production. This post covers two papers that explain MLE practices from two of the leading tech companies: Google and Microsoft.
For a little context: this article is part of a graduate-level course at Columbia University, COMS6998 Practical Deep Learning System Performance, taught by Prof. Parijat Dube, who is also a Research Staff Member at IBM in New York.
The first section presents a paper from Google and touches on the building part of the ML lifecycle. Google has an internal tool called Google Vizier [1] that aims to make Hyperparameter Optimization easier for its engineers.
The second section covers the paper from Microsoft and touches on the deployment part of the ML lifecycle, as well as how Microsoft manages ML models in production. All of these practices are bundled into a service called Microsoft DLIS (Deep Learning Inference Service) [2].
Technical Contributions and Their Importance
Paper [1]: Google Vizier, A Service for Black-Box Optimization
Introduction
- Black-Box Optimization
Before diving into what Google Vizier is, we would like to give you a little introduction to black-box optimization. Black-box simply means that we can only evaluate a function f(x); we don’t have direct access to the function’s internals, e.g. its gradients or Hessian.
Putting this together with optimization, black-box optimization is the process of finding the best operating parameters by evaluating candidate parameters against the function. For a company like Google, which uses many heavy DNN models, each evaluation is more often than not expensive. Thus, it’s in their best interest to generate a sequence of x that reaches the global optimum faster than other sequences in the parameter space. There are existing algorithms for this, such as Random Search, Grid Search, Simulated Annealing, Genetic Algorithms, and Bayesian Optimization; I believe many of you have used the first two.
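To make this concrete, here is a minimal sketch of black-box optimization with Random Search; the objective f and the parameter ranges are made up for illustration. We can only call f(x) and observe its value, and we simply keep the best x seen so far.

```python
import random

def f(x):
    # A stand-in black-box objective: we can evaluate it, but we treat its
    # internals (gradients, Hessian) as inaccessible.
    return (x["learning_rate"] - 0.01) ** 2 + (x["dropout"] - 0.3) ** 2

def random_search(objective, num_trials=50, seed=0):
    rng = random.Random(seed)
    best_x, best_value = None, float("inf")
    for _ in range(num_trials):
        # Sample a candidate x from the parameter space.
        x = {
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "dropout": rng.uniform(0.0, 0.5),
        }
        value = objective(x)       # the only interaction we have with f
        if value < best_value:     # keep the best point seen so far
            best_x, best_value = x, value
    return best_x, best_value

print(random_search(f))
```

Bayesian Optimization and the other algorithms differ only in how the next x is chosen; the black-box interface stays the same.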
- Google Vizier Terms
There are three terms that I will use in explaining the concepts of this paper:
- Trial, a list of parameter values, i.e. the x, that leads to a single evaluation of f(x)
- Study, a single optimization run over a parameter space, i.e. a set of Trials
- Worker, the process/instance responsible for evaluating Trials
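To keep these terms straight, here is a hypothetical sketch (not Vizier’s actual schema) of how they relate:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Trial:
    # One assignment of parameter values, i.e. a single x to evaluate.
    parameters: Dict[str, float]
    status: str = "PENDING"              # PENDING or COMPLETED
    objective_value: Optional[float] = None

@dataclass
class Study:
    # One optimization run over a parameter space: a named set of Trials.
    name: str
    owner: str
    trials: List[Trial] = field(default_factory=list)

# A Worker is whatever process evaluates Trials; here it is just an ID string.
worker_handle = "worker-process-42"
```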
System Overview
- Basic User Workflow
There is no better way to understand a tool than looking directly at the code. So, take a look at the snippet below, which shows the basic user workflow for Google Vizier. I’ve added comments to explain each line.
# Register this client with the Study, creating it if necessary.
# Supply the study_config (name, owner, etc.)
# and the worker handle, i.e. the process ID.
client.LoadStudy(study_config, worker_handle)

# The Study is not done while there are still pending Trials.
while not client.StudyIsDone():
    # Obtain a Trial to evaluate.
    trial = client.GetSuggestion()
    # Evaluate the objective function at the Trial's parameters.
    metrics = RunTrial(trial)
    # Report back the results and mark the Trial as complete.
    client.CompleteTrial(trial, metrics)
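The only piece the client has to supply is RunTrial, which trains and evaluates the model with the suggested parameters. Here is a hypothetical sketch: train_and_evaluate is a placeholder for the client’s own training code, and trial.parameters follows the illustrative Trial structure above rather than Vizier’s exact API.

```python
import random

def train_and_evaluate(learning_rate, batch_size):
    # Placeholder for the client's real training loop; returns a fake
    # accuracy so the sketch runs end to end.
    rng = random.Random(f"{learning_rate}-{batch_size}")
    return rng.uniform(0.7, 0.95)

def RunTrial(trial):
    # Pull the suggested hyperparameters out of the Trial.
    params = trial.parameters
    # Train and evaluate the model with these parameters.
    accuracy = train_and_evaluate(
        learning_rate=params["learning_rate"],
        batch_size=int(params["batch_size"]),
    )
    # Return the metrics Vizier should optimize.
    return {"accuracy": accuracy}
```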
Google Vizier Architecture
Below is the architecture of Google Vizier:
Figure 1: Google Vizier Architecture [1]
There are 6 main components:
- Evaluation Workers, a component provided by the client/user which will execute Trials
- Vizier API, a component that connects Evaluation Workers with Google Vizier functionalities
- Persistent Database, an internal Google Vizier component that holds the current state of all Studies, i.e. the Complete and Pending Trials
- Suggestion Service, a component that returns a new Trial to be executed by Evaluation Workers
- Early Stopping Service, a component that helps terminate a Trial early by analyzing several conditions
- Dangling Work Finder, a component that restarts work lost due to preemption
Google Vizier Algorithms
Google Vizier provides the algorithms below, which clients can use to design their hyperparameter tuning experiments.
- Automated Early Stopping
- Performance Curve Stopping Rule, this rule performs regression on the performance (loss/accuracy) curves to predict the final objective value of a Trial, given a set of previous Trials, so that the Early Stopping decision can be based on that predicted value.
- Median Stopping Rule, this rule stops a pending Trial when the Trial’s objective value at step s is strictly worse than the median of the completed Trials’ running averages at step s-1; a minimal sketch of this rule follows.
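Here is a minimal sketch of the Median Stopping Rule, assuming lower objective values are better; the function and variable names are mine, not Vizier’s.

```python
from statistics import median

def should_stop_early(pending_value_at_s, completed_running_averages):
    """Return True if the pending Trial should be stopped at step s.

    pending_value_at_s: the pending Trial's objective value at step s.
    completed_running_averages: running averages of the objective for
        previously completed Trials up to the preceding step.
    """
    if not completed_running_averages:
        return False  # nothing to compare against yet
    # Stop only if the pending Trial is strictly worse than the median.
    return pending_value_at_s > median(completed_running_averages)

# Example: the pending Trial sits well above the median of past averages.
print(should_stop_early(0.9, [0.4, 0.5, 0.6, 0.7]))  # True -> stop early
```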
- Transfer Learning
The idea is much the same as Transfer Learning in general. Suppose we have tuned the learning rate and regularization hyperparameters for a certain ML system. We can then use that Study (or Studies) as a prior when tuning the same ML system on a different dataset. Note that we can use more than one Study, which isn’t common in Transfer Learning in general.
The scheme is as below:
- Build a stack of Gaussian Process regressors, i.e. a stack of Studies.
- Priors are placed at the bottom, and their residuals are used to train the Studies higher up the stack.
- This approach works best when priors have many well-spaced data points but the top-level regressor has relatively little training data. This is exactly the same scenario as Transfer Learning in general, e.g. use ImageNet at the bottom and a small, domain-specific image dataset at the top.
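Here is a rough sketch of the residual-stacking idea, using scikit-learn’s GaussianProcessRegressor purely for illustration; Vizier uses its own Gaussian Process machinery and transfer-learning formulation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def predict_stack(regressors, X):
    # The stack's prediction is the sum of every level's prediction.
    pred = np.zeros(len(X))
    for gp in regressors:
        pred += gp.predict(X)
    return pred

def fit_gp_stack(studies):
    """Fit one GP per Study, bottom (prior) to top (current Study);
    each level models the residual left by the levels below it.
    `studies` is a list of (X, y) pairs ordered bottom to top."""
    regressors = []
    for X, y in studies:
        residual = y - predict_stack(regressors, X)
        regressors.append(GaussianProcessRegressor().fit(X, residual))
    return regressors

# Example: a large prior Study plus a small current Study over one parameter.
rng = np.random.default_rng(0)
X_prior = rng.uniform(0, 1, size=(50, 1)); y_prior = np.sin(6 * X_prior[:, 0])
X_top = rng.uniform(0, 1, size=(5, 1));    y_top = np.sin(6 * X_top[:, 0]) + 0.2
stack = fit_gp_stack([(X_prior, y_prior), (X_top, y_top)])
print(predict_stack(stack, X_top))
```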
- Algorithm Playground
If you need to implement a custom algorithm, i.e. none of the algorithms implemented in Google Vizier match your needs, you can do so through the Algorithm Playground.
Below is the architecture of the Algorithm Playground.
![Figure 2: Algorithm Playground Architecture [1]](https://towardsdatascience.com/wp-content/uploads/2020/11/0eMSjM54m3wYGF_Et.png)
There are 4 main components:
- Vizier API, same as before
- Persistent Database, same as before
- Abstract Policy, an abstract class, containing two abstract methods to be implemented
- Custom Policy, a class that inherits from the Abstract Policy and implements the two abstract methods
- Playground Binary, a binary executable compiled from the Custom Policy that serves the requests relayed by the Vizier API
- Evaluation Workers, same as before; note that these workers are unaware of the Playground Binary since it is bridged through the Vizier API, so clients don’t need to adjust their Evaluation Workers when they use the Algorithm Playground
Clients need to implement the two abstract methods of the Abstract Policy class in a way that suits their needs. Those two functions are:
# 1. Function to generate num_suggestions number of new Trials
GetNewSuggestions(trials, num_suggestions)

# 2. Function that returns a list of Pending Trials that should be stopped early
GetEarlyStoppingTrials(trials)
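To give a feel for what a client implements, here is a hypothetical Custom Policy that suggests random Trials and never stops anything early; the two method names follow the paper, while the class name and parameter dictionaries are illustrative.

```python
import random

class RandomSearchPolicy:
    """A toy Custom Policy: random suggestions, no early stopping."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def GetNewSuggestions(self, trials, num_suggestions):
        # `trials` holds the Study's Trials so far; a smarter policy (e.g.
        # Bayesian Optimization) would use them, Random Search ignores them.
        return [
            {
                "learning_rate": 10 ** self.rng.uniform(-4, -1),
                "batch_size": self.rng.choice([32, 64, 128]),
            }
            for _ in range(num_suggestions)
        ]

    def GetEarlyStoppingTrials(self, trials):
        # Return the subset of pending Trials to stop early; this simple
        # policy never stops anything.
        return []

policy = RandomSearchPolicy()
print(policy.GetNewSuggestions(trials=[], num_suggestions=2))
```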
Google Vizier Use Cases
- HyperTune, a Google Cloud Machine Learning product that uses Google Vizier and is accessible to external users.
- Automated A/B testing. For example, to optimize the reading experience for users by tuning the font, color, and spacing on a certain webpage owned by Google. Another example is when Google needs to answer questions like: "How should the search results returned from Google Maps trade-off search-relevance for distance from the user?"
- Addressing more complex problems in black-box optimization, two of which are Infeasible Trials and Overriding Suggested Trials. An Infeasible Trial is one that cannot be evaluated for hyperparameter-related reasons, e.g. the learning rate is so high that the model cannot converge; Google Vizier can mark such a Trial as infeasible. Overriding a Suggested Trial covers the case where you have to manually replace the suggested Trial due to resource/worker limitations, e.g. not enough memory.
Paper [2]: Microsoft DLIS (Deep Learning Inference Service)
Introduction
Microsoft’s aim in building DLIS is to achieve ultra-low-latency inference for its deployed DNN models. As many of you might know, Microsoft uses DNN models across many of its products, for example to improve search relevance in the Bing search engine. Microsoft DLIS is claimed to have these capabilities:
- Handling 3 million inference calls per second
- Serving 10 thousand model instances
- Distributed in nature and is deployed across 20 data centers worldwide
There are three core concepts/practices that Microsoft DLIS employs so that the above is achievable:
- Intelligent Model Placement
- Low-Latency Model Execution
- Efficient Routing
System Overview
![Figure 3: Microsoft DLIS Architecture [2]](https://towardsdatascience.com/wp-content/uploads/2020/11/0ZIR0zxyV3N0tt26S.png)
There are 7 main components:
- Model Instance (MI), this is the deployed model
- Model Container (MC), a wrapper around an MI that makes it flexible to deploy across different environments
- Model Server (MS), this is where MCs are deployed
- Model Loader (ML), this component is responsible for loading MCs and deploying them on an MS
- Model Executor (ME), this component handles the inference process
- Router, this directs incoming requests from clients to MEs
- Model Master (MM), this component holds all information about the components above
Core Concepts/Practices
Intelligent Model Placement
The motivation behind this practice is that, as many of us know, the performance of a DNN model varies across hardware. Thus, we need a strategy to allocate suitable resources to different types of models. CNN-based models are better deployed on GPUs to parallelize their convolution operations. On the other hand, we can use cheaper resources like FPGAs/CPUs for sequential models like RNNs, because GPUs offer little benefit there.
There are two main modules in the Intelligent Model Placement implementation, which are Model Placement and Diverse Hardware Management.
Model Placement detail is as below.
- MM has a global view of every MS’s hardware and resources
- MM knows estimated resource usage through a prior validation test
- Model placement is multi-tenant: an MS can have MIs of the same or different models (e.g. it could host two ResNet-18s, or a ResNet-18 and a VGG-16)
- Model placement is dynamic: MM reads resource availability at runtime and could move instances to another MS if necessary
- For an MS to host an MI, there are two requirements: the MS meets the minimum hardware requirements (known through the prior validation test) and the MS currently has enough available resources for at least one MI (known through dynamic scanning)
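Here is a simplified sketch of those two checks; the function and dictionaries are illustrative, not DLIS’s actual data structures.

```python
def can_host(model_server, hardware_requirements, instance_resource_usage):
    """Decide whether a Model Server (MS) can host one more Model Instance (MI).

    hardware_requirements: minimum hardware learned from the prior validation
        test, e.g. {"gpu": 1, "memory_gb": 8}.
    instance_resource_usage: estimated per-MI usage, e.g. {"memory_gb": 4}.
    model_server: dict with static "hardware" and dynamically scanned
        "available" resources.
    """
    # 1. MS meets the minimum hardware requirements.
    meets_hardware = all(
        model_server["hardware"].get(key, 0) >= value
        for key, value in hardware_requirements.items()
    )
    # 2. MS currently has enough free resources for at least one MI.
    has_capacity = all(
        model_server["available"].get(key, 0) >= value
        for key, value in instance_resource_usage.items()
    )
    return meets_hardware and has_capacity

ms = {"hardware": {"gpu": 2, "memory_gb": 64}, "available": {"memory_gb": 10}}
print(can_host(ms, {"gpu": 1, "memory_gb": 8}, {"memory_gb": 4}))  # True
```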
Diverse Hardware Management detail is as below.
- This module is used to handle specialized hardware, such as GPU and FPGA
- There is a component inside this module called the Machine Configuration Model (MCM) that properly configures the specialized hardware and manages it at runtime
- MCM installs drivers for GPUs and FPGAs at the very start of deployment, i.e. when DLIS spawns a new MS and attaches a GPU or FPGA to it
- MCM also checks each MS at regular intervals, i.e. roughly every 10 minutes, to verify GPU health and reset clock speeds if necessary
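A hypothetical sketch of such a periodic check is below; the nvidia-smi probe and the remediation hook are my own illustration, not MCM’s actual implementation.

```python
import subprocess
import time

CHECK_INTERVAL_SECONDS = 10 * 60  # roughly every 10 minutes, as described above

def gpu_is_healthy():
    # Minimal health probe: `nvidia-smi` exits non-zero when the driver or
    # GPU is in a bad state. A real MCM check would be much richer.
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=30)
        return result.returncode == 0
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False

def mcm_health_loop():
    while True:
        if not gpu_is_healthy():
            # Placeholder: a real MCM would reset clocks or reinitialize the
            # GPU / flag this Model Server for remediation here.
            print("GPU unhealthy on this Model Server")
        time.sleep(CHECK_INTERVAL_SECONDS)
```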
Low-Latency Model Execution
This practice is motivated by the fact that DNN models typically have a high computational cost. Thus, we need optimization at both the system and model levels to achieve low-latency inference. Microsoft DLIS covers only the system-level optimization. It enforces two important practices, Resource Isolation and Data Locality, detailed below.
- Data access is localized so that MS can benefit from different cache layers
- MS isolates MIs in containers using MCs; Docker is used for Linux-based environments, while for Windows-based environments Microsoft implements a custom wrapper
- Resources are isolated to ensure that MIs don’t interfere with each other
- To isolate resources for MIs, Microsoft DLIS enforces three things: Processor Affinity (allowing model critical data to stay in the nearest processor caches), NUMA Affinity (guarantees that an MI doesn’t have to cross memory banks), and Memory Restrictions (ensures that an MI never needs to access the disk to reduce I/O operations).
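As a rough illustration of what this isolation can look like, here is a hypothetical `docker run` invocation built in Python: the flags pin an MI to specific CPU cores and a single NUMA node and cap its memory. The image name is made up, and DLIS uses its own container infrastructure rather than raw Docker commands.

```python
import subprocess

def start_model_instance(name, cpu_ids, numa_node, memory_limit_gb, image):
    """Launch one Model Instance with pinned CPUs, one NUMA node,
    and a hard memory cap (illustrative only)."""
    cmd = [
        "docker", "run", "-d",
        "--name", name,
        "--cpuset-cpus", ",".join(str(c) for c in cpu_ids),  # processor affinity
        "--cpuset-mems", str(numa_node),                     # NUMA affinity
        "--memory", f"{memory_limit_gb}g",                   # memory restriction
        image,
    ]
    return subprocess.run(cmd, check=True)

# Hypothetical usage: pin the MI to cores 0-3 on NUMA node 0 with 8 GB of RAM.
# start_model_instance("ranker-mi-0", [0, 1, 2, 3], 0, 8, "example/model-image")
```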
Furthermore, since MIs run inside containers (the MCs), server-to-model communication is needed, so efficient communication between MS and MI through MC is a must. Microsoft DLIS therefore implements the practices below for Linux-based and Windows-based environments.
- Linux-based MI is wrapped in a custom infrastructure built by Microsoft that enables very fast communication over UDP
- Windows-based MI is wrapped in a custom infrastructure built by Microsoft that enables communication over a shared-memory queue, with inter-process communication taking only a few hundred microseconds
Efficient Routing
Last but not least, we need to consider how client requests should be routed to an ME (and eventually to an MS). This is a concern because of some unique challenges in traffic patterns. For example, Microsoft often experiences bursty traffic, an explosion of requests within a span of a few milliseconds, which leads to long request queues.
To handle this, MS supports a practice called Backup Request. It mitigates the case where the first (primary) request might miss the SLA (Service Level Agreement, which defines the latency an MS should achieve). The time delay before sending the Backup Request can be either static or dynamic.
There are, however, issues with Backup Requests. Consider the following scenario. Suppose our SLA latency is 15ms and, based on data gathered from an experiment over a certain period of time, we know that the average latency is 8ms and the 95th-percentile latency is 10ms. If we use a static Backup Request delay, there are two options.
- If the static Backup Request delay is set to 10ms, the request will most likely time out, because the Backup Request is sent 10ms after the first request while the SLA latency is 15ms, leaving it only 5ms to complete
- If we pick an earlier delay, say 2ms, then the workload is effectively doubled and the MSs will be overloaded
To mitigate the above issue, Microsoft DLIS introduces a practice called Cross-Level Cancellation to accompany the Backup Request practice. Backup Request with Cross-Level Cancellation has the following mechanisms.
- Always send the Backup Request early, e.g. with a 2ms delay after the first request
- When an MS successfully dequeues the first request, it notifies the other MSs (through MM) to abandon the forthcoming request with the same ID, i.e. the Backup Request
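To make the mechanism concrete, here is a toy asyncio sketch: the backup copy of a request is fired 2ms after the primary, and whichever copy is served first signals the other to be abandoned. The shared event stands in for the MM-mediated cancellation; all timings and names are illustrative.

```python
import asyncio
import random

async def serve_on_model_server(request_id, replica, cancel_event):
    # Simulated queueing + execution time on one Model Server replica.
    await asyncio.sleep(random.uniform(0.001, 0.020))
    if cancel_event.is_set():
        return None                    # the other copy already won; abandon
    cancel_event.set()                 # tell the other copy to abandon
    return f"{request_id} served by {replica}"

async def request_with_backup(request_id, backup_delay=0.002):
    cancel_event = asyncio.Event()
    primary = asyncio.create_task(
        serve_on_model_server(request_id, "MS-A", cancel_event))
    await asyncio.sleep(backup_delay)  # send the backup request early (2 ms)
    backup = asyncio.create_task(
        serve_on_model_server(request_id, "MS-B", cancel_event))
    results = await asyncio.gather(primary, backup)
    # Exactly one copy returns a result; the other sees the cancel signal.
    return next(r for r in results if r is not None)

print(asyncio.run(request_with_backup("req-123")))
```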
Observation and Insights
Paper [2]: Microsoft DLIS (Deep Learning Inference Service)
- A Model Placement strategy is important for saving resources. Always place a model on suitable hardware and monitor resource availability at runtime.
- Low-latency inference can be achieved internally (model optimization) and externally (system optimization). System optimization techniques in Microsoft DLIS include using caches and isolating resources to avoid interference between Model Instances.
- Latency can also be affected by concurrent requests. Make sure there is a strategy for handling them, such as Backup Requests with Cross-Level Cancellation.
Paper [1]: Google Vizier, A Service for Black-Box Optimization
- Hyperparameter Tuning, particularly in a black-box setting, is not a trivial computation. More often than not, we need to tune a complex model, e.g. a DNN, so we need to search the parameter space efficiently.
- When building an automation framework/service, it’s always good to leave some room for customization, e.g. Google Vizier with its Algorithm Playground.
- Halt expensive computations where possible, such as through the Early Stopping mechanism or by marking certain Trials as infeasible, e.g. Trials whose models cannot converge. This helps boost the speed of our Hyperparameter Tuning engine.
Conclusion
In designing a Hyperparameter Tuning Engine, we need to:
- Determine which algorithm is best for a given Study; Vizier has default algorithms based on a Study’s characteristics
- Find a good parameter space (i.e. the Study) and a good starting point (i.e. which Trials should be evaluated first)
- Leave some room for users to customize the above strategies
- Avoid expensive computations that aren’t meaningful
In serving models, we need to:
- Define a scheme to efficiently use resources
- Optimize the model inference process, model-wise and system-wise, to achieve low latency
- Compose a strategy to handle concurrent requests efficiently
References
[1] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. 2017. Google Vizier: A Service for Black-Box Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). Association for Computing Machinery, New York, NY, USA, 1487–1495.
URL: https://research.google/pubs/pub46180/
[2] Jonathan Soifer, Jason Li, Mingqin Li, Jeffrey Zhu, Yingnan Li, Yuxiong He, Elton Zheng, Adi Oltean, Maya Mosyak, Chris Barnes, Thomas Liu, and Junhua Wang. 2019. Deep Learning Inference Service at Microsoft. In Proceedings of 2019 USENIX Conference on Operational Machine Learning (OpML ’19). USENIX, Santa Clara, CA, USA, 15–17.
URL: https://www.microsoft.com/en-us/research/publication/deep-learning-inference-service-at-microsoft/