My previous posts have been mostly technical, covering a range of topics on training in the cloud and advanced TensorFlow development. This post is different. You might consider it more of an opinion piece.
In the course of my career I have rarely seen tensions run higher than when discussing the topic of software development practices. Even the most introverted engineer will suddenly come alive. Peers, who otherwise work in unison, will fall into heated argument. I have worked with cowboy coders and Scrum masters, with people who have adopted Agile as a way of life, and others who would sooner quit than participate in even one more [standup meeting](https://en.wikipedia.org/wiki/Stand-up_meeting), with people who capitalize their function names and others who would deem such a practice a capital offense.
Truth be told, most common development practices are rooted in solid common sense. While developers may feel strongly about whether, and how, to adopt Scrum, it’s common sense that modern-day development teams should be able to adapt quickly to changing environments. While the specific details may be disputed, it is common sense for teams to adopt a coding style in order to increase code readability and maintainability. While test-driven development might seem overwhelming, introducing testing at various levels and at different phases of development is common sense. When working on a team, performing continuous integration, including automated integration testing, is common sense. Even the best developers sometimes make mistakes, and catching them before they are pushed to the main repository and end up ruining everybody’s day is common sense. Conducting peer code reviews is common sense, and might just prevent small oversights, such as a memory leak or a data type overflow, from taking down your company.
Never underestimate the capacity of a person, no matter how stupid, to do something genius. More importantly, never underestimate the capacity of a person, no matter how genius, to do something monumentally stupid. --Chaim Rand
Another claim, well rooted in common sense, is that our development habits should be adapted to the details of the project on which we are working, including the team, the timeline, and, importantly, the development environment. This is true, in particular, when training machine learning models in the cloud.
Advantages of Cloud ML
There is a growing trend towards training machine learning models in the cloud, with cloud service providers rolling out ever-expanding services for just about any machine learning related task you can imagine. It is not hard to see the appeal of training in the cloud.
Building a modern-day machine learning infrastructure can be hard. First, there’s the hardware infrastructure. GPUs and alternative machine learning accelerators are expensive, and if you want to stay up to date with the latest and greatest, you will need to refresh them regularly. Since you might want to distribute your training over more than one machine, and will probably want to run more than one training experiment at a time, you will likely need multiple GPUs of varying specifications. Machine learning typically requires a sophisticated and scalable storage infrastructure for storing ever-expanding datasets, as well as a network infrastructure that supports a high volume of traffic between the data storage and the training machines.
Then there is the software infrastructure. Getting your training to work will require appropriate BIOS setup, regular driver updates, sophisticated access permissions management, and careful environment variable configuration. You will typically require a long list of software packages that might have conflicting dependencies you will need to sort out. You might require Docker or some other virtualization layer, introducing a whole new slew of potential problems.
On the other hand, when you train in the cloud, life is easy. The hardware, storage, network, and software infrastructures are all set up and ready to go. You have a wide variety of GPUs (and other accelerators) to choose from, as well as different types and versions of software frameworks. In the cloud you can scale freely to run as many parallel experiments as your heart desires, and also scale independent training sessions to run on many GPU cores (or other accelerators) in parallel. The storage capacity in the cloud is virtually limitless, saving you the headache of having to meet ever-increasing storage demands. Furthermore, cloud offerings often include unique optimizations based on their own custom hardware or network topology. Creating your own cost-efficient infrastructure, with the same flexibility and capability that is offered in the cloud, is virtually unattainable for all but the largest organizations, which are already managing their own data centers. See here for more on the advantages and challenges of migrating your training to the cloud.
Managing Cost
While cost might be one of the primary considerations for using cloud ML services, it is easy to see how, without proper governance, cost could also become the primary pitfall. I believe that wired into the DNA of every developer is an innate inclination towards the biggest, the newest, the fastest, and the strongest computation resources available. If we can use a powerful (and expensive) GPU (or other accelerator) system, why settle for a cheaper one? If we can distribute our training over 1024 cores, why settle for 8? If we can run 1000 experiments in parallel, why limit ourselves to 4? And if we can run our training for 20 days, why limit ourselves to 1?
Clearly, in order to ensure that the transition to the cloud will be successful, managers must establish clear policies for how and when to use the different cloud services. Guidelines should be defined, including details such as the number and types of training resources to use, how long training sessions should run, when to pay full price for cloud resources, and when to use cheaper, "low priority" resources such as spot instances or preemptible virtual machines (at the risk of having a job terminated prematurely). Usage patterns should be monitored, and continuous adjustments made to maximize productivity.
While defining and enforcing such policies falls under the responsibility of management, the onus is on us, the developers, to adapt our development processes to this new era of engineering. We need to adjust our habits so as to maximize the use of cloud resources and minimize waste, idle time, and resource underutilization.
Development Habits for Increasing Cloud Training Productivity
In order to maximize efficiency, I would like to propose six common-sense development practices that I believe should be emphasized when training machine learning models in the cloud.
1. Test locally before running in the cloud
As we all know, the applications we write almost never run the way we intended straight out of the box. More often than not, we will access an undefined variable, use the wrong data type, access a list out of bounds, use an API incorrectly, etc. Working out these kinks in the cloud is unproductive, time-consuming, and wasteful. Common sense dictates setting up a local (CPU-based) environment, with a small subset of training data, in which you verify, to the furthest extent possible, that everything is working as expected.
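As a minimal sketch, a local smoke test might look something like the following, where the toy model and randomly generated data stand in for your real model and data pipeline:

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Stand-in for your real model-building code.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])

def local_smoke_test(num_samples=256, batch_size=32):
    """Run one short epoch on the CPU with a tiny data subset."""
    # Force CPU execution so the test mirrors a cheap local environment.
    with tf.device("/CPU:0"):
        x = np.random.rand(num_samples, 32).astype("float32")
        y = np.random.randint(0, 10, size=(num_samples,))
        model = build_model()
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
        # A few steps are enough to catch undefined variables, shape mismatches,
        # wrong dtypes, and misused APIs before paying for cloud resources.
        model.fit(x, y, batch_size=batch_size, epochs=1)

if __name__ == "__main__":
    local_smoke_test()
```

Running a test like this before every cloud submission costs seconds and can save hours of billed GPU time.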
2. Adapt your debugging tools and techniques to cloud training
Unfortunately, debugging failing applications in the cloud is rarely as simple as running your preferred IDE in debug mode. At the same time, our ability to quickly identify and fix bugs in our cloud runs can have a huge impact on cost and productivity. Common sense dictates revisiting our debugging processes and adapting them to the new development paradigm. Code modifications should be made to support fast reproducibility of bugs. For example, saving intermediate model checkpoints might shorten the time to reproduction, capturing the input tensors might help in identifying the precise combination of inputs and weights that resulted in the bug, and dedicated debug flags for running the model training with greater verbosity may help pinpoint the root cause of the issue. See here for more details on effective TensorFlow debugging.
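As an illustrative sketch (the `--debug` flag name, the checkpoint path, and the save interval are hypothetical choices, and the surrounding training code is elided), a dedicated debug mode might combine TensorFlow's numeric checking with more frequent checkpointing:

```python
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
# Hypothetical flag name; wire this into your own training entry point.
parser.add_argument("--debug", action="store_true",
                    help="Run training with extra instrumentation.")
args = parser.parse_args()

callbacks = []
if args.debug:
    # Fail fast with an op-level stack trace as soon as a NaN or Inf
    # appears anywhere in the computation.
    tf.debugging.enable_check_numerics()
    # Save the latest weights every few hundred batches so a failure can be
    # reproduced from a state close to the failure, not from scratch.
    callbacks.append(
        tf.keras.callbacks.ModelCheckpoint(
            filepath="debug_ckpts/latest",
            save_weights_only=True,
            save_freq=500,  # in batches
        )
    )

# ... build the model and dataset as usual, then:
# model.fit(dataset, epochs=10, callbacks=callbacks)
```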
3. Conduct continuous, in-depth, performance analysis and optimizations
In order to maximize the use of your training resources, you should make performance analysis and optimization an integral part of your development process. By analyzing the speed at which the training is progressing (as measured, for example, in iterations per second) and the manner in which you are utilizing the system resources, you can identify opportunities for increasing efficiency and reducing overall cost. While any underutilized training resource is a potential opportunity for increasing training efficiency, this is true first and foremost for the GPU (or other training accelerator), typically the most expensive and powerful resource in the system. Your goal should be to increase GPU utilization as much as possible.
A typical performance optimization strategy includes executing the following two steps iteratively until you are satisfied with the training speed and resource utilization:
- Profile the training performance to identify bottlenecks in the pipeline and under-utilized resources
- Address bottlenecks and increase resource utilization
There are many different tools and techniques for conducting performance analysis, which vary based on the specifics of the development environment. See here for more details on analyzing the performance of TensorFlow models.
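For TensorFlow models, one common option is the profiler built into the TensorBoard callback. The sketch below profiles a short window of training steps on a toy model; the model, data, log directory, and batch range are all illustrative:

```python
import numpy as np
import tensorflow as tf

# Toy model and data, standing in for your real training pipeline.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
x = np.random.rand(4096, 32).astype("float32")
y = np.random.randint(0, 10, size=(4096,))

# Profile a short window of steps (batches 20-40 here, skipping the first
# steps to avoid warm-up noise) via the TensorBoard callback.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/profile",
                                             profile_batch=(20, 40))
model.fit(x, y, batch_size=32, epochs=1, callbacks=[tb_callback])

# Inspect the step timeline and device utilization in TensorBoard's
# Profile tab:
#   tensorboard --logdir logs/profile
```

The resulting trace helps distinguish time spent in the input pipeline from time spent on the accelerator, which is usually the first question to answer when GPU utilization is low.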
4. Introduce fault tolerance into your code
Imagine running a highly distributed training session on 32 GPU machines for two days, only for one of the machines to crash just as the model appears to have converged. While machines can break down in a local training environment as well, there are two reasons why this might be a greater concern in the cloud. The first is the greater prevalence of distributed training. Generally speaking, the higher the number of machines we are training on, the more vulnerable we are to system failures. The second reason is that one of the most common ways of reducing the cost of training in the cloud is by using "low priority" resources (such as spot instances or preemptible virtual machines), which are prone to being terminated prematurely if a higher-priority request comes along. Thus, it is only common sense to introduce fault tolerance into your development requirements.
The first thing you should do is ensure that you are periodically capturing snapshots (a.k.a. checkpoints) of your machine learning model so that, in the event of a premature termination, you can resume from the most recent model captured rather than start training from scratch. The intervals at which to capture snapshots should be determined by weighing the overhead of capturing a snapshot against the expected training time you would lose in the event of a failure.
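As a minimal sketch in TensorFlow/Keras (the directory names and the 1000-batch interval are arbitrary choices), periodic checkpointing together with automatic resumption might look like this:

```python
import tensorflow as tf

callbacks = [
    # Write periodic weight checkpoints. The interval (here every 1000
    # batches) should balance checkpointing overhead against the amount of
    # training you are willing to lose on a failure.
    tf.keras.callbacks.ModelCheckpoint(
        filepath="ckpts/weights_{epoch:02d}",
        save_weights_only=True,
        save_freq=1000,
    ),
    # In recent TensorFlow versions, BackupAndRestore additionally lets
    # fit() resume from the last backed-up state after an interruption.
    tf.keras.callbacks.BackupAndRestore(backup_dir="backup"),
]

# model.fit(dataset, epochs=50, callbacks=callbacks)
# If the job is preempted and relaunched, fit() picks up from the backup
# rather than starting from epoch 0.
```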
Another thing to consider is implementing a mechanism for recovering from failures of individual systems without needing to restart the training session. One way to do this, in a distributed training setting, is with Elastic Horovod, which provides support for scaling the number of workers dynamically based on system availability.
5. Perform automated monitoring of your training jobs
Monitoring the learning process refers to the art of tracking different metrics during training in order to evaluate how the training is proceeding and to determine which hyperparameters to tune to improve it. While monitoring the learning process is a critical part of training machine learning models in general, it is especially important in the cloud, where our goal is to minimize waste as much as possible. Once we identify a failed training experiment, or a training session that has stopped learning, our goal should be to halt the training as soon as possible so as to avoid waste.
Since it is unreasonable for us to track every single training experiment at every moment of the day, common sense dictates that we automate the monitoring wherever possible. We can program rules to detect common training failures, such as exploding or vanishing gradients or a non-decreasing loss, and take the appropriate action, whether that be terminating the training or adjusting some of the hyperparameters.
See [here](https://towardsdatascience.com/the-tensorflow-keras-summary-capture-layer-cdc436cb74ef) and here for more on the topic of monitoring, and automated monitoring in TensorFlow.
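As a minimal sketch of such automated rules using built-in Keras callbacks (the patience and threshold values below are illustrative and should be tuned per project), the following stops training as soon as the loss becomes NaN or the validation loss stops improving:

```python
import tensorflow as tf

monitoring_callbacks = [
    # Stop immediately if the loss becomes NaN (e.g. exploding gradients).
    tf.keras.callbacks.TerminateOnNaN(),
    # Stop when the validation loss has stopped improving, and roll back to
    # the best weights seen so far.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=5,
        min_delta=1e-3,
        restore_best_weights=True,
    ),
]

# model.fit(train_ds, validation_data=val_ds, epochs=100,
#           callbacks=monitoring_callbacks)
```

More elaborate rules (for example, alerting or adjusting hyperparameters instead of terminating) can be implemented as custom callbacks following the same pattern.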
6. Adopt advanced hyperparameter tuning techniques
While hyperparameter tuning is generally required for successful training, regardless of the training location, our desire to minimize waste in the cloud dictates that we perform this tuning in the most cost-efficient way possible. While we might settle for manual or grid-search tuning in a local environment, we should aim to use more advanced techniques in the cloud, techniques that have been shown to converge faster and reduce the overall cost of the tuning phase. Once again, there are a number of hyperparameter tuning solutions available, varying based on the training environment. Strive to choose one that offers modern, Bayesian-based algorithm support (e.g., Ray Tune or the Amazon SageMaker hyperparameter tuner).
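As one hedged example, a search with Ray Tune might look like the sketch below. It uses the classic `tune.run`/`tune.report` API, which has changed across Ray releases, and the toy objective, search space, and trial count are purely illustrative:

```python
from ray import tune

def train_fn(config):
    # Stand-in for your real training loop; config holds the sampled
    # hyperparameters. Report the metric you want Tune to optimize.
    loss = (config["lr"] - 0.01) ** 2 + config["dropout"]  # toy objective
    tune.report(loss=loss)

analysis = tune.run(
    train_fn,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),
        "dropout": tune.uniform(0.0, 0.5),
    },
    num_samples=20,  # number of trials; an illustrative value
    metric="loss",
    mode="min",
    # A Bayesian searcher (e.g. BayesOptSearch) can be plugged in via
    # search_alg=...; the exact import path varies across Ray versions.
    # Newer Ray releases replace tune.run/tune.report with the Tuner and
    # ray.train.report APIs.
)
print("Best config:", analysis.best_config)
```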
Summary
The transition to Cloud Computing is a paradigm shift that calls for adapting and tightening up our development processes towards the goal of increasing productivity and efficiency. While the details will certainly change from team to team, and from project to project, the core principles are common sense. Success is in our hands.