Getting Started with PySpark on AWS EMR

A step-by-step guide to processing data at scale with Spark on AWS

Brent Lemieux
Towards Data Science
8 min read · Jul 19, 2019

Data Pipelines with PySpark and AWS EMR is a two-part series. This is part 1. Check out part 2 if you’re looking for guidance on how to run a data pipeline as a production job.

  1. Getting Started with PySpark on AWS EMR (this article)
  2. Production Data Processing with PySpark on AWS EMR (up next)
