Handling Big Datasets for Machine Learning

Matthew Stewart, PhD
Towards Data Science
12 min read · Mar 11, 2019


More than 2.5 quintillion bytes of data are created each day, and 90% of the data in the world was generated in the past two years. The prevalence of data will only increase, so we need to learn how to handle data at this scale.

“Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” — Dan Ariely

Imagine downloading a dataset of every Tweet ever written, or the data of all 2.3 billion people on Facebook, or even the data for every webpage that exists on the Internet. How do you analyze such a dataset?

This is not an isolated problem that only hits the largest tech companies. In the current age, datasets are already becoming larger than most computers can handle. I regularly work with satellite data, which can easily reach the terabyte range: too large to fit on my computer's hard drive, let alone to process in a reasonable amount of time. Here are some eye-opening statistics regarding big data:

  • More than 16 million text messages are sent every minute
  • More than 100 million spam emails are sent every minute
  • Every minute, there are more than a million Tinder swipes
  • Every day, more than a billion photos are uploaded to Google Photos

Storing this data is one thing, but what about processing it and developing machine learning algorithms to work with it? In this article, we will discuss how to easily create a scalable and parallelized machine learning platform on the cloud to process large-scale data.

This can be used for research, commercial, or non-commercial purposes and can be done with minimal cost compared to developing your own supercomputer.

To develop a very robust and high-performance parallel cluster on the cloud (this can also be used on a local machine for performance enhancement) we will delve into the following topics:

  • Environment Setup
  • Parallelization with Dask and Kubernetes
  • Dask Cloud Deployment
  • Example Cloud Deployment on AWS

This post is based on the contents of a GitHub repository, which can be found here. All of the commands required for setting up the machine learning platform on the cloud can be found in the markdown file here. This is based on a tutorial by the Institute for Applied Computational Science at Harvard University.

Environment Setup — Docker and Containers

If you only read one part of this post, let it be this part.

When people set up their machine learning environment, they typically install everything directly on their operating system. Often this is fine, until the day you try to install something like PyTorch, TensorFlow, or Keras, everything explodes, and you spend hours on Stack Overflow trying to get things working again. I implore you not to work like this, for your own sake.

This problem typically stems from packages that depend on specific versions of other packages. Often, you do not need half of these packages for your work. It makes more sense to start from a clean slate and install only the versions and dependencies required for the task at hand. This will ultimately save you time and stress.

If you are using Anaconda, it is very easy and efficient to separate your projects into isolated ‘containers’ so that they can all run without causing problems for one another. These containers are called Conda environments; Conda is the package and environment manager that ships with Anaconda.

You can think of these environments as different computers that do not know about each other's existence. When I create a new environment, I start with a blank slate and need to install packages again. The great part is that the packages are not actually downloaded twice: Conda simply links the new environment to the specific package versions already cached on your computer.

This may seem pointless unless you have run into dependency issues on your computer before, but I can promise you it is worth knowing about. Another useful feature is that you can install all the packages you need in just one line by using a YAML (.yml) file. This file tells Conda which packages and dependencies the environment requires. You do not even need to write it by hand: it can be exported with one line of code from an environment that already has all the required packages (pretty neat, right?). All of the required commands are shown below.

Example of Linux commands to easily create new environments and dependency files.
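A minimal sketch of these commands (the environment name and packages here are placeholders for your own):

# Create a new, empty environment with a specific Python version
conda create --name my_env python=3.7

# Activate it and install only what the project needs
conda activate my_env
conda install numpy pandas scikit-learn

# Export the environment's packages and dependencies to a YAML file
conda env export > environment.yml

# Recreate the same environment elsewhere from that file
conda env create -f environment.yml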

Here is an example of what the YAML file looks like when the conda env export > environment.yml command is run.
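A trimmed-down sketch of the format (the environment name and version numbers here are only illustrative; the real export pins every package in the environment):

name: my_env
channels:
  - defaults
dependencies:
  - python=3.7
  - numpy=1.16.2
  - pandas=0.24.1
  - scikit-learn=0.20.3
prefix: /home/user/anaconda3/envs/my_env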

There is another good reason for separating things into environments like this. Reproducing the results of a data analysis can depend heavily on the versions of the packages used, and even on the operating system it was run on. By providing an environment.yml file that contains all of the dependencies, you make it much easier for someone else to reproduce your results.

So what did we do when we created our Conda environment? We essentially isolated it from the rest of our system. But what if the things we want to bundle with our environment are not just Python packages? In that case, we use Docker to create containers.

If your application:

  • uses a server (for example a database server with preloaded data), AND
  • you want to distribute this server and its data together with your application and its Python environment to others (for instance to a fellow developer or to a client),

you can “containerize” the whole thing using Docker.

In this case, all these components will be encapsulated in a Docker container:

  • The application itself,
  • The Conda environment that can run your application (so a compatible Python version and packages),
  • The local server or service (for example: a database server and a web server) required to run the application

I admit the concept behind Docker and containers is a bit confusing, and building a Docker image is not a trivial task. Fortunately, the Jupyter folks created repo2docker for exactly this. repo2docker takes a GitHub repository, automatically builds a Docker image from it, and launches a Jupyter server inside the resulting container. This can be done using one line of code.
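That one line looks something like this (the repository URL is a placeholder for your own):

jupyter-repo2docker https://github.com/<your-username>/<your-repository>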

After running the above command, you should see output in the terminal that looks like the following:

Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://0.0.0.0:36511/?token=f94f8fabb92e22f5bfab116c382b4707fc2cade56ad1ace0

Simply copy and paste the URL into your browser and you are inside your running container, ready to go! You can read more about using repo2docker here.

Another really useful tool is Binder. Binder builds on repo2docker to provide a service where you point it at a GitHub repository and it gives you a working JupyterHub where you can “publish” your project, demo, and so on. The GitHub repository associated with this tutorial can be run on Binder by clicking on the link in the ReadMe section.

You can read more about using Binder here.

Parallelization with Dask and Kubernetes

It has taken us quite a while to get to the parallelization part of the tutorial, but the previous steps were necessary to get here. Let’s now dive into using Dask and Kubernetes.

  • Dask - a library for parallel computing in Python
  • Kubernetes - an open-source container orchestration system for automating application deployment, scaling, and management.

Dask has two parts associated with it:

[1] Dynamic task scheduling optimized for computation, similar to Airflow or Luigi but geared toward interactive computational workloads.

[2] “Big Data” collections such as parallel (NumPy) arrays, (pandas) dataframes, and lists.

Dask has only been around for a few years but is steadily gaining momentum thanks to the popularity of Python for machine learning. Dask lets Python applications scale out to clusters with a thousand or more cores, so workloads can be processed much faster than on a regular laptop.
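As a taste of the collections API, here is a minimal sketch using a Dask array. This runs locally, but the same code runs unchanged on a cluster, with the chunks spread across workers:

import dask.array as da

# A 10,000 x 10,000 random array split into 1,000 x 1,000 NumPy chunks;
# operations on the chunks can run in parallel.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Nothing is computed yet: Dask lazily builds a task graph.
result = (x + x.T).mean(axis=0)

# compute() executes the graph, using all local cores by default
# (or the workers of a connected cluster).
print(result.compute())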

I would refer anyone who is interested in working with Dask to the GitHub repository of tutorials by Tom Augspurger (one of the core Dask developers), which can be found here.

So we have talked about Dask, but where does Kubernetes come in? If we run Dask on our laptop, it lets us distribute our code across multiple cores at once, but it does not help us run that code on multiple machines at the same time; everything still runs locally. Ideally, we want to run on a cluster provisioned in the cloud, and we would like this cluster to be self-repairing, that is, we would like our code to respond to failures and expand onto more machines when we need them. We need a cluster manager.

Kubernetes is a cluster manager. We can think of it as an operating system for the cluster. It provides service discovery, scaling, load-balancing, and self-healing. Kubernetes treats applications as stateless and movable from one machine to another, which enables better resource utilization. There is a controlling master node on which the cluster operating system runs, and worker nodes which perform the bulk of the work. If a node (a computer associated with the cluster) loses its connection or breaks, the master node will assign the work to someone new, just like your boss would if you stopped working.

The master and worker nodes each consist of several pieces of software that allow them to perform their tasks. It gets pretty complicated, so I will quickly give a high-level overview.

Master Node:

  • API server, communication between master node and user (using kubectl)
  • Scheduler, assigns a worker node to each application
  • Controller Manager, performs cluster level functions, such as replicating components, keeping track of worker nodes, handling node failures
  • etcd, a reliable distributed data store that persistently stores the cluster configuration (which worker node is doing what at a given time).

Worker Node:

  • Docker, the container runtime that runs your application's components (these are packaged into one or more Docker images and pushed to a registry, from which the worker nodes pull them)
  • Kubelet, which talks to the API server and manages the containers on its node
  • kube-proxy, which load-balances network traffic between application components

The configuration of a Kubernetes cluster.
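To make the "stateless, replicated application" idea concrete, here is a minimal sketch of a Deployment manifest of the kind you would hand to the API server with kubectl apply (the names and image are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                      # the controller manager keeps 3 copies running
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:latest   # image previously pushed to a registry
          ports:
            - containerPort: 8080

If a worker node holding one of these replicas dies, the scheduler simply places a replacement copy on another node.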

Doing all of this is great, but it isn’t particularly helpful unless we have 100 computers at our disposal to make use of the power that Kubernetes and Dask afford us.

Enter the cloud.

Dask Cloud Deployment

If you want to run Dask to speed up your machine learning code in Python, Kubernetes is the recommended cluster manager. This can be done on your local machine using Minikube or on any of the three major cloud providers: Microsoft Azure, Google Cloud Platform, or Amazon Web Services.

You are probably familiar with cloud computing since it is pretty much everywhere these days. It is now very common for companies to keep all of their computing infrastructure on the cloud: this reduces capital expenditure on computing equipment (shifting it to operational expenditure), requires less maintenance, and significantly reduces running costs. Unless you are working with classified information or have very strict regulatory requirements, you can probably get away with running things on the cloud.

Using the cloud allows you to leverage the collective performance of many machines for the same task. For example, if you are performing hyperparameter optimization on a neural network and need to rerun the model 10,000 times to find the best parameter selection (a fairly common situation), it would be nonsensical to run it on one computer for two weeks. If you can spread the same job across 100 computers, you will likely finish the task in a few hours.
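As a rough illustration of what this looks like in code (a sketch, not this tutorial's exact setup): with a Dask cluster running, scikit-learn's own parallelism can be pointed at the cluster through joblib, so each hyperparameter/fold combination lands on a different worker.

from dask.distributed import Client
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Client() starts a local cluster for testing; on a real deployment you would
# pass your scheduler's address instead, e.g. Client("tcp://<scheduler>:8786").
client = Client()

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1_000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    n_jobs=-1,  # let joblib fan the fits out
)

# Route scikit-learn's internal parallelism through the Dask cluster.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)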

I hope I have made a good case for why you should make use of the cloud, but be aware that it can get quite expensive if you use very powerful machines (especially if you do not turn them off after using them!)

To set up the environment on the cloud, you must do the following:

  1. Set up a Kubernetes cluster
  2. Set up Helm (a package manager for Kubernetes; it is like Homebrew for a Kubernetes cluster)
  3. Install Dask.

First, run the following:

helm repo update

and then

helm install stable/dask

See https://docs.dask.org/en/latest/setup/kubernetes-helm.html for all the details.

Deep Learning on the Cloud

There are several useful tools available for building deep learning algorithms on top of Kubernetes and Dask. For example, TensorFlow can be run on the cluster using tf.distribute or Kubeflow. The parallelism can also be exploited trivially during grid search, since a different model can be trained on each worker node. Examples can be found in the GitHub repository here.
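For instance, here is a minimal sketch of data-parallel training with TensorFlow's distribution strategies (assuming TensorFlow 2.x; the model and data are placeholders):

import tensorflow as tf

# MirroredStrategy replicates the model across the devices visible to this worker;
# MultiWorkerMirroredStrategy extends the same idea across the nodes of a cluster.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# model.fit(train_dataset, epochs=10)  # train_dataset is whatever tf.data pipeline you build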

What do you use?

For my own research (I am an environmental scientist) and in my consulting work (as a machine learning consultant), I regularly use JupyterHub and a Kubernetes cluster running Dask on Harvard's supercomputer, Odyssey, or I run the same infrastructure on AWS (no real prejudice against Azure or Google Cloud; I was simply taught how to use AWS first).

Example Cloud Deployment on AWS

In this section, I will run through the setup of a Kubernetes cluster running Dask on AWS. The first thing you need to do is set up an account on AWS; you will not be able to run the following lines of code without one.

First, we download the AWS command line interface and configure it with the access keys supplied by AWS. We then install eksctl, a command-line tool for Amazon's Elastic Container Service for Kubernetes (EKS), using the brew commands below.

pip install awscli
aws configure
brew tap weaveworks/tap
brew install weaveworks/tap/eksctl

Creating a Kubernetes cluster is now ludicrously simple: we only need to run one command, specifying the cluster name, the number of nodes, and the region (in this case I am in Boston, so I choose us-east-1).

eksctl create cluster --name=cluster-1 --nodes=4 --region=us-east-1

Now we must configure the cluster with the following commands:

kubectl get nodes
kubectl --namespace kube-system create sa tiller
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller

Now we set up Helm and Dask on the cluster:

helm init --service-account tiller

Wait a couple of minutes for this to complete, and then we can install Dask. Note that Helm generates a random release name for each install (mine was agile-newt); substitute the release name from your own helm install output in the commands below.

helm version
helm repo update
helm install stable/dask
helm status agile-newt
helm list
helm upgrade agile-newt stable/dask -f config.yaml
helm status agile-newt
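The config.yaml passed to helm upgrade is where you customize the deployment. A minimal sketch, loosely following the values used by the stable/dask chart (the worker count, resource numbers, and extra packages are just examples):

worker:
  replicas: 8
  resources:
    limits:
      cpu: 2
      memory: 7.5G
    requests:
      cpu: 2
      memory: 7.5G
  env:
    - name: EXTRA_PIP_PACKAGES
      value: s3fs dask-ml --upgrade

jupyter:
  enabled: true
  env:
    - name: EXTRA_PIP_PACKAGES
      value: s3fs dask-ml --upgrade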

A few more Kubernetes commands.

kubectl get pods
kubectl get services

For more details on a pod, or to get a shell inside it, you will need commands like the following. Your exact pod names will be different.

kubectl get pod agile-newt-dask-jupyter-54f86bfdd7-jdb5p
kubectl exec -it agile-newt-dask-jupyter-54f86bfdd7-jdb5p -- /bin/bash

Once you are in the cluster, you can clone the GitHub repository and watch Dask go!

Kaggle Rossmann Competition

I recommend that once you have the Dask cloud deployment up and running, you try running rossman_kaggle.ipynb. This is example code from the Kaggle Rossmann competition, which allowed competitors to use any data they wanted to help predict drug store sales across Europe. The competition was run in 2015.

The steps in this notebook walk you through setting up the coding environment for a multilayer perceptron, applying it on a parallel cluster, and then performing hyperparameter optimization. All of the steps in this code are split into functions which are then run in an sklearn pipeline (the recommended way to structure large machine learning programs).
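A stripped-down illustration of the pattern (not the notebook's exact code; the preprocessing step and parameter grid are placeholders):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Chain preprocessing and the multilayer perceptron so the whole thing can be
# cross-validated and grid-searched as a single estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(max_iter=500)),
])

param_grid = {
    "mlp__hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "mlp__alpha": [1e-4, 1e-3, 1e-2],
}

search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
# search.fit(X_train, y_train)  # on the cluster, wrap this in the Dask joblib backend as shown earlier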

There are several other examples on the repository that you can run on the parallel cluster and play with. Also, feel free to clone the repository and tinker with it as much as you like.

Where can I learn more?

To learn more about Dask, the official documentation at docs.dask.org is a good place to start.

To learn more about Dask with Kubernetes, see the documentation at kubernetes.dask.org.

To learn more about Helm, see the documentation at helm.sh.

If you are struggling to work through any of the above steps, there are multiple other walkthroughs online that go through the specifics in more detail, including guides for setting up the cluster on Google Cloud (sadly, I could not find one for Microsoft Azure).

Now you should have a working parallel cluster on which to perform machine learning on big data or for big compute tasks!

Thanks for reading! 🙏

ML Postdoc @Harvard | Environmental + Data Science PhD @Harvard | ML consultant @Critical Future | Blogger @TDS | Content Creator @EdX. https://mpstewart.io