Chapter 4: Azure Batch HPC, Learning 20 Million Ratings in 20 Seconds

Herger Gattoni
Towards Data Science
15 min read · Sep 25, 2017


This story is part of the series “What You Truly Need for Designing a Real Recommender Engine (with Azure)”.

“With NVIDIA GRID software and NVIDIA Tesla M60 running on Azure, Microsoft is delivering the benefits of cloud-based RDSH virtual apps and desktops to enable broad-scale, graphics-accelerated virtualization in the cloud that meets the needs of any enterprise” John Fanelli, VP NVIDIA GRID

The Leading Microsoft & NVIDIA Partnership

Since the fourth quarter of last year, Graphics Processing Units (GPUs) have started to become generally available on Microsoft's cloud and, gradually, on the platforms of competing providers such as Amazon and Google. The remarkable recent growth of Machine Learning (ML) needs across digital markets, together with the rapid democratization of GPUs largely driven by the gaming industry, has favored exclusive partnerships between the current leader, NVIDIA, and the main cloud players. By pushing forward the winning strategy of offering hardware on a pay-per-use model instead of requiring clients to actually acquire the High-Performance Computing (HPC) infrastructure, Microsoft opens itself to a new market ranging from 3D modelling and remote gaming to scientific research applications. Activity in this area is so high today that providers sometimes struggle to keep up with client demand, a situation bound to intensify as newer and faster GPUs reach the market at an increasing pace.

As these lines are being written, Microsoft offers the NC-Series of specialized virtual machines, targeted at heavy graphics rendering and video editing and available with single or multiple GPUs. They feature the NVIDIA Tesla accelerated platform and NVIDIA GRID 2.0 technology, providing the highest-end graphics support available in the cloud today.

With such power at one's fingertips, it would be unreasonable not to give it a try and adapt a well-known, particularly computationally expensive algorithm from the area of Recommender Engines (RE).

Collaborative Filtering Based Recommenders

In the wide area of RE, Collaborative Filtering (CF) is a specific domain where ratings are predicted by discovering users' behavioral patterns in a large population. The main principle relies on finding users with similar rating habits and recommending items that are still missing from each other's collections. While not subject to the same problems as Social Network Filtering (SNF), which is especially vulnerable to malicious attacks, the main caveat of CF lies in the fact that many users must already have provided their ratings before any good recommendation can be produced, an issue that strongly limits its usefulness in scenarios where a system is just starting up.

It is mainly for this reason that modern REs combine different hybrid approaches, aiming to alleviate the related problems known as the cold-start effect and non-trivial long-tail recommendations, to improve accuracy, and to provide clear explanations of why an item is recommended. On the other hand, CF has the great advantage of bringing serendipity to recommendations and of opening up user profiles to themes that would not surface in Semantic Filtering (SF) based recommenders.

A Powerful Approach: The Singular Value Decomposition (SVD)

While it wouldn’t be wise to solely rely on CF for the design of a RE, some of the methods such as the Singular Value Decomposition (SVD), provide the benefit to be particularly effective at automatically finding latent features that are not explicitly defined by either the items or the users. In the right conditions, this property alone can produce excellent recommendations based on abstract concepts not clearly explainable directly from their features but perfectly valid on a mathematical point of view. Thus, it is possible to find surprising clusters of users sharing notions such as specific movies atmospheres, their implicit targeted audience, political aspects or simply the genre that may not be contained in any semantic description.

The linear algebra behind the SVD aims at finding a low-rank representation of a matrix by keeping only the most significant singular values while minimizing the associated loss function. It is widely used in image processing as a lossy method for picture compression. In the case of CF, this is achieved by factorizing the ratings matrix into three matrices containing, respectively, the users' singular vectors, the singular values, and the movies' singular vectors.
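
For reference, this low-rank approximation can be written compactly in the standard form:

```latex
% Truncated SVD of the ratings matrix R (users x movies): U_k holds the user
% vectors, \Sigma_k the k largest singular values, and V_k the movie vectors.
R \;\approx\; U_k \Sigma_k V_k^{\top},
\qquad
U_k \Sigma_k V_k^{\top} \;=\; \arg\min_{\operatorname{rank}(\hat{R}) \le k}\; \bigl\lVert R - \hat{R} \bigr\rVert_F^2
```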

A Useful Case Study: Predicting MovieLens Ratings

Because Everybody Loves Movies… Chicago International Film Festival

In the case of the MovieLens 20M quadruplets data set, the ratings matrix presents the additional challenge of being extremely sparse, as each user only rates a tiny portion of all the possible items. With 27'000 movies and 138'000 users, the already considerable data set covers only 0.53% of the entire matrix.
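
As a quick sanity check, the rounded counts above give an observed fraction of roughly half a percent of the full matrix:

```latex
% Observed fraction of the full user x movie ratings matrix
\frac{20\,000\,000\ \text{ratings}}{138\,000\ \text{users} \times 27\,000\ \text{movies}} \;\approx\; 0.0054
```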

A quick analysis with [R] also reveals the skewed, roughly normal distribution of the ratings, with a global mean of 3.423, as well as the users' rating tendency throughout the years of activity of the academic portal.

Writing a First Serial Version of the SVD

While the goal of the present article is to explore real solutions for designing a modern, scalable, Azure-based RE, and not to be a course on Machine Learning or Statistics in general, it is still useful to specify that all the experiments presented below are based on the exact same setup. First, the overall mean, user means, movie means, variances, standard deviations and rating counts were incrementally computed with the following online algorithms:
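
One widely used family of such online update rules is Welford's algorithm for the running mean and variance; a minimal C# sketch (an illustration only, not necessarily the exact formulas used in the article) looks like this:

```csharp
using System;

// Minimal sketch of Welford's online algorithm for the running mean and
// variance of a stream of ratings. Illustrative only; the article's exact
// incremental formulas may differ.
public sealed class OnlineStats
{
    private long _count;
    private double _mean;
    private double _m2; // sum of squared deviations from the current mean

    public void Add(double rating)
    {
        _count++;
        double delta = rating - _mean;
        _mean += delta / _count;          // incremental mean update
        _m2 += delta * (rating - _mean);  // incremental update of squared deviations
    }

    public long Count => _count;
    public double Mean => _mean;
    public double Variance => _count > 1 ? _m2 / (_count - 1) : 0.0;
    public double StdDev => Math.Sqrt(Variance);
}
```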

Second, the data set was split in two by randomly selecting 20% of each user profile's ratings as a “holdout” cross-validation test set, while the remaining 80% were kept for training the model.
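
As an illustration of that split, a simple per-user 80/20 holdout could be sketched as follows (the Rating type and its field names are hypothetical, not taken from the article's code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical Rating type; field names are assumptions for this sketch.
public class Rating
{
    public int UserId;
    public int MovieId;
    public double Value;
}

public static class HoldoutSplit
{
    // Randomly keeps ~20% of each user's ratings as a holdout test set.
    public static (List<Rating> Train, List<Rating> Test) Split(
        IEnumerable<Rating> allRatings, int seed = 42)
    {
        var random = new Random(seed);
        var train = new List<Rating>();
        var test = new List<Rating>();

        foreach (var userRatings in allRatings.GroupBy(r => r.UserId))
        {
            // Shuffle this user's ratings, then set 20% aside for testing.
            var shuffled = userRatings.OrderBy(_ => random.Next()).ToList();
            int testCount = (int)(shuffled.Count * 0.2);
            test.AddRange(shuffled.Take(testCount));
            train.AddRange(shuffled.Skip(testCount));
        }
        return (train, test);
    }
}
```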

Surprisingly, it is still very rare nowadays to find good implementations of the SVD for RE; to my knowledge, neither Spark nor other scalable ecosystems offer any pre-written, directly usable model. Moreover, let's be honest: it is often a better idea to have a thorough understanding of how the algorithm actually works, and of its possible implementations and implications, when designing a new system. The SVD in particular is a broad model that comes in many different versions, especially when scalability and parallelism are in question. Additionally, in the case of RE one does not usually use the real SVD but a variant called Maximum Margin Matrix Factorization (MMMF), whose simplest expression is as follows:
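
A common simplified form of this kind of regularized matrix factorization, given here as a general reference (the exact variant implemented in the article may differ), is:

```latex
% Predicted rating as the dot product of user and movie latent vectors,
% optionally combined with the global mean and bias terms:
\hat{r}_{ui} \;=\; \mu + b_u + b_i + p_u^{\top} q_i

% Learned by minimizing the regularized squared error over the known ratings K:
\min_{p,\,q,\,b} \sum_{(u,i) \in \mathcal{K}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  \;+\; \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2 \right)
```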

Therefore, the initial version used in this article was written from scratch in C# and optimizes the model by exploiting the bi-convex property of the MMMF, where the user and movie features are learned simultaneously.
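
A minimal sketch of the corresponding stochastic gradient step, in which both feature vectors are adjusted on every observed rating, could look like the following (reusing the hypothetical Rating type from the split sketch above; biases, learning-rate schedules and shuffling are omitted, and the article's actual implementation may differ):

```csharp
// Hedged sketch of one SGD epoch over the factorization objective above.
// userFeatures[u] and movieFeatures[i] are latent vectors of length k.
static void RunEpoch(Rating[] train,
                     double[][] userFeatures, double[][] movieFeatures,
                     double learningRate, double lambda)
{
    foreach (var r in train)
    {
        double[] p = userFeatures[r.UserId];
        double[] q = movieFeatures[r.MovieId];

        // Current prediction and error for this rating.
        double predicted = 0.0;
        for (int f = 0; f < p.Length; f++) predicted += p[f] * q[f];
        double error = r.Value - predicted;

        // Update both feature vectors simultaneously (bi-convex objective).
        for (int f = 0; f < p.Length; f++)
        {
            double pf = p[f], qf = q[f];
            p[f] += learningRate * (error * qf - lambda * pf);
            q[f] += learningRate * (error * pf - lambda * qf);
        }
    }
}
```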

Designing for Scalability

Houston, we Have a Problem…, Apollo 13, 1995

When it comes to parallelizing an algorithm such as the SVD so that it can absorb near-infinite growth, several approaches are possible. The first aims at reducing the amount of work by splitting the entire matrix into smaller blocks and learning them independently. While theoretically correct, this “divide and conquer” strategy, which relies on creating sub-epochs and synchronizing weights, is not well suited to the RE case and may be relevant only for gigantic data sets of several terabytes with denser matrices.

Despite good scalability, the time spent transferring data largely outweighs the computational gains and, because the blocks are not randomly sampled, the ones with significantly fewer ratings are processed faster than the others, leaving several nodes idle while waiting for their counterparts to complete.

Another strategy would be to simply perform the computation on a multi-core machine and thereby get rid of the data-transfer bottleneck. Unfortunately, even with the help of Moore's Law, it is still rare, if not impossible, to scale up a single computer to the desired level of performance. At best, commodity hardware reaches 8 or 16 cores before part of the work has to be delegated to another node, falling back into data-transfer complications. Despite this, it is still interesting to benchmark such a solution in order to get an idea of its maximum performance compared to the other options. To this end, a parallel but non-scalable C# version of the SVD was written and used in the benchmarks presented further below.
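
A hedged sketch of such a multi-core variant, using the .NET Task Parallel Library and grouping the ratings per user so that each user vector is only touched by one thread (one plausible layout, not necessarily the article's exact partitioning), could be:

```csharp
using System.Threading.Tasks;

// Hedged multi-core sketch: each user's ratings are processed by a single
// thread; concurrent updates to shared movie vectors are tolerated
// Hogwild-style in this rough illustration.
static void RunEpochParallel(Rating[][] ratingsByUser,
                             double[][] userFeatures, double[][] movieFeatures,
                             double learningRate, double lambda)
{
    Parallel.ForEach(ratingsByUser, userRatings =>
    {
        foreach (var r in userRatings)
        {
            double[] p = userFeatures[r.UserId];
            double[] q = movieFeatures[r.MovieId];

            double predicted = 0.0;
            for (int f = 0; f < p.Length; f++) predicted += p[f] * q[f];
            double error = r.Value - predicted;

            for (int f = 0; f < p.Length; f++)
            {
                double pf = p[f], qf = q[f];
                p[f] += learningRate * (error * qf - lambda * pf);
                q[f] += learningRate * (error * pf - lambda * qf);
            }
        }
    });
}
```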

Taking Advantage of Massively Parallel Architectures

“Sometimes you gotta run before you can walk” Tony Stark, Iron Man 2008

As the titles in this article heavily suggest, GPUs and massively parallel architectures may very well provide a solution to our scalability and parallelization problems. First, GPUs offer thousands of cores for processing huge data sets with their Single Instruction / Multiple Data (SIMD) architecture. Second, their high-speed memory can reach tens of GiB on a single device, and the best supercomputers can mount multiple intercommunicating cards in parallel. There is thus little doubt that deploying an algorithm such as the SVD for RE on a GPU could benefit tremendously from these qualities and solve all the aforementioned problems at once, especially since it is very unlikely that the user and movie counts will ever grow beyond a few billion even in the very best case…

QuantAlea GPU: The Swiss Leader

As good news rarely comes alone, a recent round of technology watching on the famous Microsoft Channel 9 TV revealed that a Swiss company located in Zürich provides a framework capable of transparently compiling the required CUDA code from any .NET language. QuantAlea is, to my knowledge, the only professional .NET framework offering direct access to all NVIDIA GPU capabilities. Knowing the overhead of creating a C++ project with the cumbersome Boost library and the CUDA development kit, this new possibility is an undeniable productivity gain, and it is free as long as a non-enterprise GPU is used.

Massively Parallelizing the SVD Algorithm

“With great power comes great responsibility.” Spiderman 2002

A good understanding of the massively parallel architectures offered by General-Purpose Processing on Graphics Processing Units (GPGPU) is very important for obtaining satisfactory results. Bluntly converting the existing code with the QuantAlea framework would yield disappointing results and lead to the false conclusion that only minimal speed gains can be achieved. The diagram below succinctly describes how one wants to take advantage of the power at one's disposal, and may also give an idea or two of why this implementation is still not the best possible case for such an architecture.
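
In code terms, the mapping amounts to giving every rating its own GPU thread and letting the updates run unsynchronized, Hogwild-style. A hedged sketch in the parallel-for style exposed by the Alea GPU framework follows; the API usage, flattened memory layout and scheduling shown here are assumptions for illustration, not the article's actual kernel:

```csharp
using Alea;           // QuantAlea's Alea GPU package
using Alea.Parallel;  // parallel-for style extensions

public static class GpuSvd
{
    // Hedged sketch: one GPU thread per rating, updating flattened feature
    // arrays (k values per user or movie) without synchronization.
    [GpuManaged]
    public static void RunEpochOnGpu(int[] userIds, int[] movieIds, double[] values,
                                     double[] userFeatures, double[] movieFeatures,
                                     int k, double learningRate, double lambda)
    {
        Gpu.Default.For(0, values.Length, n =>
        {
            int pOffset = userIds[n] * k;
            int qOffset = movieIds[n] * k;

            double predicted = 0.0;
            for (int f = 0; f < k; f++)
                predicted += userFeatures[pOffset + f] * movieFeatures[qOffset + f];
            double error = values[n] - predicted;

            for (int f = 0; f < k; f++)
            {
                double pf = userFeatures[pOffset + f];
                double qf = movieFeatures[qOffset + f];
                userFeatures[pOffset + f] += learningRate * (error * qf - lambda * pf);
                movieFeatures[qOffset + f] += learningRate * (error * pf - lambda * qf);
            }
        });
    }
}
```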

Other algorithms, such as Artificial Neural Networks, are better suited to exploiting very technical aspects such as shared memory. However, the random order in which the SVD model is learned on a GPU has the indirect benefit of better matching its theoretical properties and of avoiding the overfitting on the first user profiles that occurs in the single-threaded version.

Test Drives & Results

As the author of this article is not yet in possession of a supercomputer able to demonstrate the best possible results in terms of speed, a high-end Dell Alienware laptop with the following specifications was used for benchmarking all the aforementioned versions.

  • Processor: Intel® Core™ i7-6700HQ CPU @ 2.60GHz, 8 logical cores
  • Memory: 16 GiB
  • GPU: NVIDIA GeForce GTX 1060, 1280 CUDA cores, 6 GiB GDDR5

The chart hereafter displays how many seconds were needed to train one single SVD epoch. With the hardware at hand, the GPU version is up to 12 times faster than the original and almost 4 times quicker than the multi-core improvement.

For a full training of the SVD, one can notice that the GPU implementation runs more epochs than its competitors despite having the exact same parameters. This is actually a good thing and agrees with the statistical properties discussed above: training an SVD is a stochastic process that, in the end, can only be approximated correctly with a massively parallel architecture. Even with more iterations, the total training time stays far below that of the previous best version, with better accuracy.

But What About the Azure Offering?

“M.D.? Molecular Detachment Device. We call her, ‘the Little Doctor’.” Ender’s Game, 2013

Running the same experiment on today's Azure offering would allow the use of the Tesla K80 with no fewer than 4,992 cores. The availability of the new Tesla P100 has already been announced and would push the limits to a theoretical 2 terabytes of global memory for any HPC application. A card costing almost $10'000 that would definitely benefit from a pay-per-use plan, from the user's point of view…

Running the HPC SVD Algorithm with Azure Batch

In the previous article, “Azure Machine Learning: Not Your Usual Regression”, most of the infrastructural needs were fulfilled by Azure Machine Learning. In the present case, the SVD algorithm is much bigger than the elastic net regression used to fit independent user profiles. This leads us to a more complex infrastructure, using Azure Batch for training the model and an independently scalable Azure WebApp for serving personalized predictions. Moreover, a queuing mechanism must be set up to ask the recommender to safely process the required workflow steps. It is also important to note that no mention has been made of how to incrementally retrain the profiles when their ratings change. This operation is commonly called “fold-in” and is out of the scope of the current article.

Another very important point is the time needed to transport the already considerable 20 million ratings data set to the computing cluster. In real-world scenarios, it is usually impracticable to perform this operation repeatedly, one rating after the other, from disk-based storage. The latter would require massive scaling in order to reach performance comparable to what Event Hubs, the fastest solution in the area, can provide. Therefore, instead of going down this road, a mapping file containing all the ratings was uploaded to a single blob and downloaded directly by each node of the cluster. This first data set can be seen as a seed that could ultimately be stored in a distributed memory service such as Redis Cache, thus providing the maximal possible speed with respect to the cloud architecture. The only other alternative would be to use Spark Resilient Distributed Datasets (RDD), but as one does not need all the transformation machinery, this solution was left aside for the time being.
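
A hedged sketch of the per-node download step with the classic WindowsAzure.Storage client of that period follows; the connection string, container and blob names are placeholders:

```csharp
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Hedged sketch: each compute node pulls the single ratings blob locally
// before training starts. Names below are placeholders, not the article's.
static void DownloadRatings(string storageConnectionString, string localPath)
{
    CloudStorageAccount account = CloudStorageAccount.Parse(storageConnectionString);
    CloudBlobClient client = account.CreateCloudBlobClient();

    CloudBlobContainer container = client.GetContainerReference("recommender-data");
    CloudBlockBlob blob = container.GetBlockBlobReference("movielens-20m-ratings.bin");

    // One sequential download of the whole mapping file per node.
    blob.DownloadToFile(localPath, FileMode.Create);
}
```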

In the end, one could imagine either recomputing the whole model from the in-memory data set very quickly every day, or creating a new version of the algorithm that allows stable incremental learning, very much like the simple equations for the online computation of the mean and variance presented in the first chapters.

Setting-Up the Azure Batch HPC Cluster

In my humble opinion, Azure Batch is a great but under-used Azure service, suffering directly from the current popularity of Hadoop MapReduce for no particular technical reason. The Azure Batch API is simple to use and to automate through C# code or simple PowerShell commands. The portal also provides most of the functionality, but should only be used to monitor the deployed virtual machines and running jobs, as it lacks any programmable operationalization.
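
As an illustration of that automation, a hedged sketch of creating a small pool with the Azure Batch .NET client of that period could be (account URL, name, key, VM size and OS family are placeholders, and parameter names can vary slightly between SDK versions):

```csharp
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

// Hedged sketch of pool creation through the Azure Batch .NET SDK.
var credentials = new BatchSharedKeyCredentials(
    "https://mybatchaccount.westeurope.batch.azure.com",
    "mybatchaccount",
    "<account-key>");

using (BatchClient batchClient = BatchClient.Open(credentials))
{
    // Four standard nodes, as in the demonstration cluster described below.
    CloudPool pool = batchClient.PoolOperations.CreatePool(
        "svd-pool",                          // pool id
        "standard_d2_v2",                    // VM size (placeholder)
        new CloudServiceConfiguration("5"),  // Windows OS family (placeholder)
        4);                                  // target number of dedicated nodes
    pool.Commit();
}
```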

For demonstration purposes, a simple cluster of four Virtual Machines (VMs) on the Standard plan was deployed. Each of these nodes was asked to train a version of the SVD with specific parameters in order to quickly find a satisfactory combination. Pushing this aspect further would lead to implementing a fully parallel Genetic Algorithm (GA) to reach a near-optimal solution, a task that becomes difficult to tackle by hand when the number of parameters grows.

Creating a Meaningful Job

One of the many great features of Azure Batch is the possibility of defining task dependencies inside a job, thus providing the ability to wait for the completion of a set of operations before going further. Each time a task is spun up, it downloads an application package and the ratings data set. The diagram below shows the dependency tree chosen for the experiment:

In total, 4 nodes each ran a single multi-threaded task in the cluster at once. Whenever a task finished, it stored the best RMSE attained and the corresponding parameters in a blob, allowing the next one to start from these results and refine the optimization further.
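
A hedged sketch of how such a chain is declared with the Batch .NET client, reusing the batchClient from the pool-creation sketch above (task ids, executable name and arguments are placeholders):

```csharp
// Hedged sketch of chained tasks: each refinement stage waits for the
// previous one and reads the best RMSE and parameters from a shared blob.
CloudJob job = batchClient.JobOperations.CreateJob();
job.Id = "svd-hyperparameter-sweep";
job.PoolInformation = new PoolInformation { PoolId = "svd-pool" };
job.UsesTaskDependencies = true;   // must be enabled before adding dependent tasks
job.Commit();

var coarse = new CloudTask("coarse-search", "cmd /c SvdTrainer.exe --stage coarse");
var refine = new CloudTask("refine-search", "cmd /c SvdTrainer.exe --stage refine")
{
    // Only start once the coarse search has stored its results.
    DependsOn = TaskDependencies.OnId("coarse-search")
};

batchClient.JobOperations.AddTask(job.Id, new[] { coarse, refine });
```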

For practical reasons, the multi-core version of the SVD algorithm was finally used: QuantAlea GPU charges $2'000 a year for using its framework on enterprise GPUs, which is obviously beyond the budget of the present POC.

Monitoring the Training

Here again, the Azure user experience is excellent: the activity of the different nodes can be monitored directly on the portal.

Moreover, the Azure Batch blade allows exploring each node and, ultimately, using Remote Desktop Session Host (RDSH) to log directly onto the corresponding VM in order to investigate a potential problem or access more precise telemetry.

In the end, by observing the High Bias / High Variance indicators shown below, a best RMSE was obtained, successfully concluding this real-world scenario.

Conclusions & Further Thoughts

From a Microsoft cultural perspective, it is pleasant to witness how competitive Azure is in the area of scalable applications, a domain largely dominated by the Apache Unix / Java projects. One would nonetheless welcome further innovation in this area from the company in question. To attain this goal, Microsoft technology will have to find a way to be more widely adopted and welcomed in the domains concerned. If good solutions are proposed, there is no reason for that not to happen, as systems deployed in production get a real productivity boost from today's MS technologies.

Azure Batch is a good service that would definitely benefit from better exposure to the public. An equivalent of the Spark RDD in the Cortana Intelligence suite would be very appealing, at a time when AI is both highly publicized and useful. In the same vein, initiatives such as QuantAlea GPU, which comes from the financial industry, undeniably go in the right direction by easing the acceptance of MS in the domain.

It is now well known that many enterprises and cloud vendors offer Machine Learning capabilities in their portfolios. The usual marketing message suggests that every application can easily integrate ML functionality. But this motto often does not live up to the actual theoretical and technical complexity involved, the diversity of the possibilities, and the rapid pace of technological evolution, especially for the most advanced cases. Operationalizing Data Science is hard and can easily backfire if approached from the wrong angle.

Concerning the RE area, building, or even correctly using, a recommender remains a challenging endeavor. The amount of required experience and insight ultimately forms a shared cultural background in which entire communities naturally come together, like the ones created during the Netflix Prize challenge. While some subjects definitely remain confined to specific scientific areas, REs are useful, broad, modern, likeable and business-relevant to most people, and will undoubtedly keep on progressively integrating into our everyday lives over the coming decades.
