
Bootstrapping Dask on 1000 cores with AWS Fargate

Harness the power of the cloud to provision and deploy 1,000 cores, process large amounts of data, and get stuff done

Photo by NASA on Unsplash

As a data engineer, I have often found myself in need of a system for prototyping heavy big data tasks. My team traditionally used EMR with Apache Spark for such work, but with the news that the Apache Software Foundation is retiring several Hadoop-related projects, and with Python now being the go-to tool for data analysis, I went looking for a more modern, Python-native framework.

My goal was a one-click deployment of a 1,000-core cluster that could be spun up to prototype processing of ~1 TB of data and then torn down. Another requirement was that these resources must not interfere with the rest of my team’s stack. Thus the one-click-eks-fargate-Dask tool was born!

AWS Fargate

AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS).

In short, Fargate lets the developer provision CPU and memory based on the requirements specified by the pods themselves, instead of provisioning nodes beforehand.
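
For example, the capacity a pod gets on Fargate is derived from the resource requests in its spec. A minimal sketch (the image and the sizes here are illustrative, not taken from my deployment):

```yaml
# Sketch of a pod whose resource requests drive Fargate's sizing.
# Fargate rounds the requests up to the nearest supported vCPU/memory combination.
apiVersion: v1
kind: Pod
metadata:
  name: dask-worker-example
spec:
  containers:
    - name: worker
      image: daskdev/dask:latest   # assumed image; use whatever your deployment actually runs
      resources:
        requests:
          cpu: "4"
          memory: 16Gi
        limits:
          cpu: "4"
          memory: 16Gi
```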

eksctl

eksctl is a CLI tool for managing EKS clusters. It can provision and deploy an EKS cluster with a single command, backed by either EC2 (compute nodes) or Fargate (serverless). As of this writing, it does not support specifying Fargate Spot, which is too bad.

eksctl also provides utility commands for granting IAM access to Kubernetes service accounts.
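
For example, mapping an IAM policy onto a Kubernetes service account is a single command (the cluster name, service account name and policy ARN below are placeholders):

```bash
# Map an IAM policy onto a Kubernetes service account via an IAM role (IRSA).
eksctl create iamserviceaccount \
  --cluster dask \
  --namespace default \
  --name dask-worker \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve
```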

Dask

Dask is a library for processing larger-than-memory data structures with parallel computing on a local or remote cluster. It provides familiar Python APIs for data structures like arrays (NumPy) and data frames (pandas). It is very straightforward to spin up a local process-based cluster, but in my humble opinion, where Dask really shines is its amazing ecosystem for setting up a cluster on Kubernetes, YARN, and many other multi-processing backends.
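
As a small taste of the API, here is a sketch on a local cluster (the file path and column names are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# A local, process-based cluster; the same Client API later points at the EKS cluster.
cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# Looks like pandas, but the frame is split into partitions and processed in parallel.
df = dd.read_parquet("data/*.parquet")                      # placeholder path
print(df.groupby("some_column")["value"].mean().compute())  # placeholder column names
```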

Putting it all together

First, I use this script to deploy an EKS Fargate cluster. The script must run in the context of an IAM role that has permissions for many AWS deployment actions (mainly deploying a CloudFormation stack). Note that the IAM role used for the Dask workers grants full access to the S3 service, because I loaded my data from Parquet files on S3.
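
A sketch of such a deployment with eksctl, assuming a cluster named dask and full S3 access for the worker service account (the region and service-account name are placeholders), looks roughly like this:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create an EKS cluster whose default and kube-system namespaces run on Fargate.
eksctl create cluster --name dask --region us-east-1 --fargate

# Enable IAM roles for service accounts (IRSA) on the new cluster.
eksctl utils associate-iam-oidc-provider --cluster dask --approve

# Service account for the Dask workers with full S3 access (the data lives on S3).
eksctl create iamserviceaccount \
  --cluster dask \
  --namespace default \
  --name dask-worker \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
  --approve
```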

After running this command, you should have an EKS cluster named dask up and running. For the next part, we are going to deploy a Dask Helm release. Helm releases can be customized by setting values. These are the values that I used for my Dask cluster:
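
In outline, the values file looks something like the sketch below; the worker size is an assumption (250 workers x 4 cores gives the 1,000-core target), and the exact keys depend on the chart version, so check the chart’s default values:

```yaml
# values.yaml (illustrative): each worker pod asks Fargate for 4 vCPUs and 16 GiB.
worker:
  replicas: 250              # 250 workers x 4 cores = the 1,000-core target
  resources:
    requests:
      cpu: "4"
      memory: 16Gi
    limits:
      cpu: "4"
      memory: 16Gi

scheduler:
  serviceType: LoadBalancer  # expose the scheduler and dashboard outside the cluster

jupyter:
  enabled: true              # the chart can also run a Jupyter server for notebooks
```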

Then install the Helm release:
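
Assuming the values above are saved as values.yaml, something along these lines installs the chart from the official Dask Helm repository:

```bash
# Add the Dask Helm repository and install the chart with the custom values.
helm repo add dask https://helm.dask.org
helm repo update
helm install dask dask/dask -f values.yaml
```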

And that is it! You have a Dask cluster installed! You can run this example notebook, which uses the open Ookla dataset on S3, and see how scaling the cluster improves performance.
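
At its core, such a notebook just connects a Client to the scheduler and reads the Parquet files straight from S3; a rough sketch (the scheduler address and the S3 prefix are assumptions to adapt):

```python
import dask.dataframe as dd
from dask.distributed import Client

# Connect to the scheduler service created by the Helm release
# (service name and port are assumptions; check `kubectl get svc`).
client = Client("tcp://dask-scheduler:8786")

# Ookla's open data is published as Parquet on a public S3 bucket;
# the exact prefix below may differ, check the dataset's documentation.
df = dd.read_parquet(
    "s3://ookla-open-data/parquet/performance/type=fixed/year=2021/*/*.parquet",
    storage_options={"anon": True},  # public bucket, no credentials required
)

# Average fixed-line download speed across all tiles (column from the Ookla schema).
print(df["avg_d_kbps"].mean().compute())
```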

The Dask DataFrame API is not 100% compatible with pandas (arguably, by design). There are a bunch of useful resources on Dask’s documentation website.
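
One difference that tends to surprise pandas users: Dask DataFrames are lazy, and row-wise positional indexing is not supported. A tiny illustration:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
ddf = dd.from_pandas(pdf, npartitions=2)

# Dask builds a lazy task graph; .compute() materializes a pandas object.
print(ddf.groupby("group")["value"].mean().compute())

# Row-wise positional indexing (pdf.iloc[0]) has no direct equivalent, because rows
# are spread across partitions; reach for loc on the index or head() instead.
print(ddf.head(2))
```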

Remember to scale down your cluster after you have finished using it!
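
Tearing everything down is a matter of two commands (using the release and cluster names from above):

```bash
# Remove the Dask release, then tear down the whole EKS cluster.
helm uninstall dask
eksctl delete cluster --name dask
```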

Conclusion

I was looking for a quick, low-dependency way to prototype big data processing on a Python-native framework. This setup is by no means suitable for production use and is probably not cost-optimized either, but it got the job done: within a few minutes I was able to process >300 GB of Parquet files (which can balloon to about 1 TB once loaded as a data frame).

