Making the AI Journey from Public Cloud to On-prem

Lessons learned from a deep learning team that outgrew experiments in AWS

Emily Potyraj
Towards Data Science

--

Joint post with Farhan Abrol.

Machine learning on array telemetry

One of our machine learning teams at Pure Storage works on a range of forecasting, regression, and classification problems. A core piece of technology we build is a predictive performance planner for our customers. It models a storage array and predicts its performance based on signals from the workload running on it. These signals include things like read and write bandwidth, I/O size, dedupability, and access pattern.

At a high level, our system takes a collection of time series data from the past 1 to 12 months for N features and predicts a system’s performance over the next 1 to 12 months. Performance is then computed analytically as a metric derived from multiple system bottlenecks, such as CPU, SSD, and I/O ports, together called “load”.

Our current model splits the problem into two halves: the first forecasts the time series of the features, and the second then uses a regression model to predict the associated load.

The time series projections are based on ARIMA and a few other detrending statistical techniques — i.e. not deep learning. We found that it was becoming hard to get this model to perform well in a large number of cases without significant tuning. As a development team, our aim is to develop a highly accurate model that we can then deploy to production.
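To make the two-stage shape concrete, here is a minimal sketch assuming statsmodels for the ARIMA forecasts and a scikit-learn regressor for the load model. The feature columns, ARIMA order, and choice of regressor are illustrative, not our production configuration.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from sklearn.ensemble import RandomForestRegressor

# Stage 1: forecast each feature's time series independently.
def forecast_features(history: pd.DataFrame, horizon: int) -> pd.DataFrame:
    forecasts = {}
    for col in history.columns:
        fit = ARIMA(history[col], order=(2, 1, 2)).fit()  # illustrative (p, d, q)
        forecasts[col] = fit.forecast(steps=horizon)
    return pd.DataFrame(forecasts)

# Stage 2: learn the feature-to-load mapping on history, then apply it to the forecasts.
def predict_load(history: pd.DataFrame, observed_load: pd.Series, future_features: pd.DataFrame):
    reg = RandomForestRegressor(n_estimators=200).fit(history, observed_load)
    return reg.predict(future_features)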

We decided to experiment with deep learning based models to see if we could improve either our time series models or the entirety of our pipeline by doing a direct prediction of load from the time series.

The dataset consisted of ~25GB of time series data pulled from our telemetry system (Pure1) and stored as a CSV file. Pure1 streams telemetry data every 30 seconds from our fleet of deployed systems; today, we capture about 60 billion events per day.

In this post, we’ll review some of the challenges we faced — from dataset scale to the software stack to infrastructure.

Standing on the shoulders of giants … or not

When we started this project, the best literature we found was a paper from Uber where they used LSTMs to predict daily completed trips. They used an encoder-decoder architecture to learn the structure of the time series and then a separate neural net for inference.

We tried to replicate the paper but ran into several problems. The paper tackled daily, univariate time series prediction, while we were attempting hourly, N-to-1 multivariate prediction. Our resulting model was only able to learn the mean of the predicted value and had high training and test errors.
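For reference, the general shape of an encoder-decoder network for an N-feature, single-target forecast looks roughly like the sketch below (written with Keras; the layer sizes are placeholders, and this is neither the Uber architecture nor our final model).

import tensorflow as tf

def build_encoder_decoder(n_features, input_steps, output_steps, latent_dim=64):
    inputs = tf.keras.Input(shape=(input_steps, n_features))
    # Encoder: compress the input window into a fixed-size state.
    state = tf.keras.layers.LSTM(latent_dim)(inputs)
    # Decoder: unroll that state over the forecast horizon.
    repeated = tf.keras.layers.RepeatVector(output_steps)(state)
    decoded = tf.keras.layers.LSTM(latent_dim, return_sequences=True)(repeated)
    # One predicted value (e.g. load) per future time step.
    outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))(decoded)
    return tf.keras.Model(inputs, outputs)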

Ultimately, we realized that the data we were modeling was much larger and more complex than the data used in the paper.

This brought up a series of questions we had to answer. First, we had to understand how different layers learned characteristics of the time series. Second, we had to evaluate what kinds of layers and topologies would give better accuracy. Finally, we had to explore how large those layers needed to be, in neuron count, to get good performance.

All this meant that we needed lots more experiments, and we needed them to get us answers faster.

More experiments, faster experiments

Our primary requirement for the dev environment was to empower our data scientists to be more productive, which meant removing any bottlenecks to their experiment rate. To us, solving this challenge meant:

➊ increasing hardware flexibility
➋ moving to round-the-clock testing

Our data scientists were doing good work already, but they only worked during the day (of course!), and they had to feed their training jobs through slower infrastructure than we would have liked.

We were previously running all training jobs in the public cloud, but it was limiting us on both fronts.

Problem 1: Hardware flexibility (compute & storage)

In the public cloud, higher GPU counts get expensive fast, and GPU allocation isn’t as fine-grained as we wanted. For example, in AWS, a single-V100 GPU instance only runs one job at a time; running multiple concurrent jobs means managing multiple GPU instances at higher cost.

As an alternative to public cloud GPUs, we switched to on-prem GPUs in the form of two Nvidia DGX-1 servers (with a FlashBlade storage system serving the data on-prem). Since each DGX-1 contains 8 GPUs, a developer can manage multiple concurrent jobs within a single DGX-1 server by targeting specific GPUs — which is great for exploring several hyperparameter settings — or they can combine the two DGX-1s to run a larger-scale training job across the 16 GPUs.
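The simplest mechanism for targeting specific GPUs is to restrict which devices a training process can see before any CUDA-backed library initializes; the device indices below are just an example.

import os

# Pin this experiment to GPUs 2 and 3 of the DGX-1; a second experiment
# launched with a different device list runs on the remaining GPUs without conflict.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"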

After switching to the 16 on-prem GPUs instead of various single-GPU public cloud instances, our monthly compute cost was significantly lower.

Cost comparison from when we made the switch. Note: we could have used a larger AWS instance (a p3.8xlarge with 4x Tesla V100 at $12.24/hr), but that limits the flexibility in the number of concurrent jobs. Assume an initial cost of $150,000 per on-prem DGX-1.

Even assuming that we only used the DGX-1s for one year, the hourly “rate” for those on-prem GPUs was cheaper than in the public cloud. Across multi-year use of the DGX-1s, the hourly “rate” would be even lower.
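As a rough back-of-the-envelope check using the numbers above (one year of continuous use, ignoring power, cooling, and staffing): $150,000 / 8,760 hours ≈ $17/hr for a DGX-1’s 8 GPUs, or about $2.14 per GPU-hour, versus $12.24/hr / 4 GPUs ≈ $3.06 per GPU-hour for a p3.8xlarge.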

Problem 2: Testing 24 hrs/day

To get the highest efficiency out of the infrastructure, we switched to a two-stage development effort. During the day, when we have human eyes on the tests, we iterate quickly through model experiments by tuning hyperparameters. At night, we take the best-of-the-day model and run it through more strenuous testing with a larger dataset and more training epochs.

Part of the solution there was to have two training datasets: a smaller set of daily logs and a larger set of hourly logs. We didn’t need to train on every single data point from our logs to fast-fail a new network architecture, but we did want to train with every data point when solving for the highest accuracy. Running with the full dataset overnight gives us more realistic tests.
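One lightweight way to encode that split is a single training entry point with a day/night switch. The dataset names, epoch counts, and GPU counts below are hypothetical placeholders, not our actual settings.

import argparse

parser = argparse.ArgumentParser(description="Daytime vs. overnight training runs")
parser.add_argument("--mode", choices=["day", "night"], default="day")
args = parser.parse_args()

# Quick daytime iteration on the small daily-log set; the full hourly-log set,
# more epochs, and all 16 GPUs for the overnight run.
profiles = {
    "day":   {"dataset": "logs_daily.csv",  "epochs": 10,  "gpus": 1},
    "night": {"dataset": "logs_hourly.csv", "epochs": 200, "gpus": 16},
}
config = profiles[args.mode]
print(f"Training on {config['dataset']} for {config['epochs']} epochs on {config['gpus']} GPU(s)")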

The large size of the hourly dataset meant that training it on the public cloud would be too slow and too expensive. Using our on-prem hardware, we were able to run 12-hour, 16-GPU jobs overnight without worrying about memory management or memory/GPU cost.
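Those multi-GPU runs use our Horovod-based setup (more on that below). The basic per-worker pattern, sketched here with horovod.tensorflow.keras (the framework binding is an assumption), is for each of the 16 processes to pin itself to one local GPU and wrap its optimizer for allreduce; the learning rate is a placeholder.

import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

# Each worker sees only its own local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Scale the learning rate by the number of workers and add the allreduce wrapper.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))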

While we succeeded in switching to 24 hr/day experimenting, we’re not quite at the ideal state yet. This particular team and its hardware & datasets are currently small enough that we can manage jobs manually. The optimal state would be for this team to use some kind of resource scheduler to manage jobs, which would provide ever tighter infrastructure utilization and ensure that we never had user-induced job conflicts.

Several machine learning platforms exist today, such as Kubeflow, MLflow, and H2O.ai. None of them is a generalized one-stop solution yet, so some companies prefer to simply set up Slurm as their resource manager.

Problem 3: Memory management required for larger datasets

Our initial experiments in the public cloud had been designed around the fixed memory available in those instances. With our on-prem infrastructure, we had both more HBM and more DRAM available (and could even spill over to NFS as needed), so our experiments could be more ambitious. Sometimes, however, the software stack between us and our hardware got in the way.

For example, for multi-DGX-1 (16 GPU) jobs, we use a Horovod-based setup that uses Docker to communicate between compute nodes. Unfortunately, memory management can get a little tricky for these workflows with several layers of memory parameters: DGX HBM, Docker container memory, Docker swap drive, etc.

This problem was by far the easiest for us to solve technically. There are a couple of parameters for fine-tuning memory limits based on the workload, such as increasing the container’s shared-memory size (shm-size) and mounting a tmpfs.
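For example, launching the training container with the Docker SDK for Python; the image name, command, and sizes are illustrative rather than our exact settings.

import docker

client = docker.from_env()
client.containers.run(
    "horovod/horovod:latest",          # illustrative image
    "python train.py --mode night",    # illustrative command
    runtime="nvidia",                  # expose the DGX-1 GPUs to the container
    shm_size="16g",                    # larger shared memory for data-loader workers
    tmpfs={"/scratch": "size=64g"},    # in-memory temp filesystem for intermediate files
    detach=True,
)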

The harder change for our team was retraining a habit: how to get developers to switch away from workloads limited by what fits in memory. In AWS, we’d selected a specific memory size for each instance, and that hard limit affected the way our team approached experiments. In the future, now that we’ve tuned our application stack to support the large amount of memory available, we can start testing with even more complex model architectures and even larger training datasets.

What did we learn?

Deep learning is in many ways still a nascent field, and the literature for applying deep learning to non-traditional problems is sparse. To effectively apply deep learning to new domains, data scientists will need a lot of iteration on model architecture, size, and hyperparameters to get the best results.

They need flexibility in the infrastructure to experiment efficiently and effectively and to get to their final, production-ready model.

Moving to an on-prem GPU cluster helped solve some of our test-scale and cost issues, but the state of the world is far from ideal. We would like a better experiment management platform, a better job scheduler to keep the cluster busy at scale, and better memory management tooling within the machine learning libraries we use.

While we’re continually iterating on our AI dev platforms internally, we’ve shipped this particular AI project to production. To read more about our final model and how customers can use it to simulate workloads, check out this blog post.
