
Bootstrapping Dask on 1000 cores with AWS Fargate

Harness the power of the cloud to provision and deploy 1,000 cores, process large amounts of data, and get stuff done

Photo by NASA on Unsplash

As a data engineer, I have often found myself in need of a system for prototyping heavy big data tasks. My team traditionally used EMR with Apache Spark for such work, but with the news that the Apache Software Foundation is retiring several Hadoop-related projects, and with Python now being the go-to tool for data analysis, I went looking for a more modern, Python-native framework.

My goal was a one-click deployment of a 1,000-core cluster that could be spun up to prototype processing of ~1 TB of data and then torn down. Another requirement was that these resources must not interfere with the rest of my team’s stack. Thus the one-click-eks-fargate-Dask tool was born!

AWS Fargate

AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS).

In short, Fargate lets the developer provision CPU and memory based on the requirements specified by the pods themselves, instead of provisioning nodes beforehand.
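
For example, the capacity a pod gets on Fargate is derived from the resource requests in its spec. A minimal sketch (the image and the sizes here are illustrative, not taken from my deployment):

```yaml
# Sketch of a pod whose resource requests drive Fargate's sizing.
# Fargate rounds the requests up to the nearest supported vCPU/memory combination.
apiVersion: v1
kind: Pod
metadata:
  name: dask-worker-example
spec:
  containers:
    - name: worker
      image: daskdev/dask:latest   # assumed image; use whatever your deployment actually runs
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
        limits:
          cpu: "4"
          memory: 16Gi
```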

eksctl

eksctl is a CLI tool for managing EKS clusters. It can provision and deploy an EKS cluster with a single command, backed by either EC2 (compute nodes) or Fargate (serverless). As of this writing, it does not support specifying Fargate Spot, which is too bad.

eksctl also provides utility commands for granting IAM access to Kubernetes service accounts.
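
For example, mapping an IAM policy onto a Kubernetes service account is a single command (the cluster name, service account name and policy ARN below are placeholders):

```bash
# Map an IAM policy onto a Kubernetes service account via an IAM role (IRSA).
eksctl create iamserviceaccount \
  --cluster dask \
  --namespace default \
  --name dask-worker \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve
```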

Dask

Dask is a library for processing larger-than-memory data structures with parallel computing on a local or remote cluster. It provides familiar Python APIs for data structures like arrays (NumPy) and data frames (pandas). It is very straightforward to spin up a local process-based cluster, but in my humble opinion, where Dask really shines is its amazing ecosystem for setting up a cluster on Kubernetes, YARN, and many other multi-processing backends.
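
As a small taste of the API, here is a sketch on a local cluster (the file path and column names are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A local, process-based cluster; the same Client API later points at the EKS cluster.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# Looks like pandas, but the frame is split into partitions and processed in parallel.
df = dd.read_parquet("data/*.parquet")                      # placeholder path
print(df.groupby("some_column")["value"].mean().compute())  # placeholder column names
```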

Putting it all together

First, I use this script to deploy an EKS Fargate cluster. The script must run in the context of an IAM role that has permissions for many AWS deployment actions (mainly deploying a CloudFormation stack). Note that the IAM role used for the Dask workers grants full access to the S3 service, because I loaded my data from Parquet files on S3.
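
A sketch of such a deployment with eksctl, assuming a cluster named dask and full S3 access for the worker service account (the region and service-account name are placeholders), looks roughly like this:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create an EKS cluster whose default and kube-system namespaces run on Fargate.
eksctl create cluster --name dask --region us-east-1 --fargate

# Enable IAM roles for service accounts (IRSA) on the new cluster.
eksctl utils associate-iam-oidc-provider --cluster dask --approve

# Service account for the Dask workers with full S3 access (the data lives on S3).
eksctl create iamserviceaccount \
  --cluster dask \
  --namespace default \
  --name dask-worker \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
  --approve
```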

After running this command, you should have an EKS cluster named dask up and running. For the next part, we are going to deploy a Dask Helm release. Helm releases can be customized by setting values. These are the values that I used for my Dask cluster:
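
In outline, the values file looks something like the sketch below; the worker size is an assumption (250 workers x 4 cores gives the 1,000-core target), and the exact keys depend on the chart version, so check the chart’s default values:

```yaml
# values.yaml (illustrative): each worker pod asks Fargate for 4 vCPUs and 16 GiB.
worker:
  replicas: 250              # 250 workers x 4 cores = the 1,000-core target
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
    limits:
      cpu: "4"
      memory: 16Gi

scheduler:
  serviceType: LoadBalancer  # expose the scheduler and dashboard outside the cluster

jupyter:
  enabled: true              # the chart can also run a Jupyter server for notebooks
```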

Then install the Helm release:
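
Assuming the values above are saved as values.yaml, something along these lines installs the chart from the official Dask Helm repository:

```bash
# Add the Dask Helm repository and install the chart with the custom values.
helm repo add dask https://helm.dask.org
helm repo update
helm install dask dask/dask -f values.yaml
```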

And that is it! You have a Dask cluster installed! You can run this example notebook, which uses the open Ookla dataset on S3, and see how scaling the cluster improves performance.
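
At its core, such a notebook just connects a Client to the scheduler and reads the Parquet files straight from S3; a rough sketch (the scheduler address and the S3 prefix are assumptions to adapt):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Connect to the scheduler service created by the Helm release
# (service name and port are assumptions; check `kubectl get svc`).
client = Client("tcp://dask-scheduler:8786")

# Ookla's open data is published as Parquet on a public S3 bucket;
# the exact prefix below may differ, check the dataset's documentation.
df = dd.read_parquet(
    "s3://ookla-open-data/parquet/performance/type=fixed/year=2021/*/*.parquet",
    storage_options={"anon": True},  # public bucket, no credentials required
)

# Average fixed-line download speed across all tiles (column from the Ookla schema).
print(df["avg_d_kbps"].mean().compute())
```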

The Dask DataFrame API is not 100% compatible with pandas (arguably, by design). There are a bunch of useful resources on Dask’s documentation website.
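
One difference that tends to surprise pandas users: Dask DataFrames are lazy, and row-wise positional indexing is not supported. A tiny illustration:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Dask builds a lazy task graph; .compute() materializes a pandas object.
print(ddf.groupby("group")["value"].mean().compute())

# Row-wise positional indexing (pdf.iloc[0]) has no direct equivalent, because rows
# are spread across partitions; reach for loc on the index or head() instead.
print(ddf.head(2))
```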

Remember to scale down your cluster after you have finished using it!
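
Tearing everything down is a matter of two commands (using the release and cluster names from above):

```bash
# Remove the Dask release, then tear down the whole EKS cluster.
helm uninstall dask
eksctl delete cluster --name dask
```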

Conclusion

I was looking for a quick, low-dependency way to prototype big data processing on a Python-native framework. This setup is by no means suitable for production use and is probably not cost-optimized either, but it got the job done: within a few minutes I was able to process >300 GB of Parquet files (which can balloon to about 1 TB once loaded as a data frame).

