Optimized Docker Images for Apache Spark — Now Public on DockerHub

Get started and do your work with all the common data sources supported by Spark.

Jean Yves
Towards Data Science


Our optimized Docker images for Apache Spark are now freely available on our DockerHub repository, whether you’re a Data Mechanics customer or not.

This is the result of a lot of work from our engineering team:

  • We built a fleet of Docker images combining various versions of Spark, Python, Scala, Java, Hadoop, and all the popular data connectors.
  • We automatically tested them across various workloads to ensure the included dependencies work together; in other words, to save you from dependency hell 😄

Our philosophy is to provide high-quality Docker images that come “with batteries included”, meaning you can get started and do your work with all the common data sources supported by Spark. We will maintain this fleet of images over time, keeping them up to date with the latest versions and bug fixes of Spark and the various built-in dependencies.

What’s a Docker Image for Spark?

When you run Spark on Kubernetes, the Spark driver and executors are Docker containers. These containers use an image specifically built for Spark, which contains the Spark distribution itself (Spark 2.4, 3.0, or 3.1). This means that the Spark version is not a global cluster property, as it is on YARN clusters: each application can run with its own image, and therefore its own Spark version.
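Concretely, you pick the image, and therefore the Spark version, per application at submission time. Here’s a minimal spark-submit sketch; the API server URL, namespace, image name, and jar path are placeholders for illustration, not values from our images:

    spark-submit \
      --master k8s://https://<api-server-host>:443 \
      --deploy-mode cluster \
      --name spark-pi \
      --conf spark.kubernetes.namespace=spark \
      --conf spark.kubernetes.container.image=<your-registry>/<your-spark-image> \
      --class org.apache.spark.examples.SparkPi \
      local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar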

You can also use Docker images to run Spark locally. For example, you can run Spark in a driver-only mode (in a single container), or run Spark on Kubernetes on a local minikube cluster. Many of our users choose to do this during development and testing.

Using Docker will speed up your development workflow and give you fast, reliable, and reproducible production deployments. Image by Author.

To learn more about the benefits of using Docker for Spark, and see the concrete steps to use Docker in your development workflow, check out our article: Spark and Docker: Your development cycle just got 10x faster!

What’s in these Docker Images?

They contain the Spark distribution itself, built from the open-source code without any proprietary modifications.

They come built-in with connectors to common data sources:

  • AWS S3 (s3a:// scheme)
  • Google Cloud Storage (gs:// scheme)
  • Azure Blob Storage (wasbs:// scheme)
  • Azure Data Lake generation 1 (adl:// scheme)
  • Azure Data Lake generation 2 (abfss:// scheme)

They also come built-in with Python & PySpark support, as well as pip and conda, so it’s easy to install additional Python packages. (If you don’t need PySpark, you can use the lighter images with the tag prefix ‘jvm-only’.)
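To illustrate how these built-in connectors are used from PySpark, here’s a minimal sketch; the bucket names and paths are hypothetical, and it assumes your credentials are provided by the environment (for example an AWS instance profile or a GCP service account):

    from pyspark.sql import SparkSession

    # The s3a:// and gs:// connectors ship with the image, so no extra jars are needed.
    spark = SparkSession.builder.appName("connectors-demo").getOrCreate()

    # Hypothetical locations; replace with your own buckets.
    events = spark.read.parquet("s3a://my-bucket/events/")
    lookup = spark.read.csv("gs://my-other-bucket/lookup.csv", header=True)

    events.join(lookup, "id").show()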

Finally, each image uses a combination of versions of the following components:

  • Apache Spark: 2.4.5 to 3.1.1
  • Apache Hadoop: 3.1 or 3.2
  • Java: 8 or 11
  • Scala: 2.11 or 2.12
  • Python: 3.7 or 3.8

Note that not all possible combinations exist; check out our DockerHub page for the full list of supported combinations.
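These component versions are encoded in the image tag. As an illustration, a pull might look like the line below; treat the repository name and the exact tag format as assumptions and verify them on the DockerHub page:

    docker pull datamechanics/spark:3.1.1-hadoop-3.2.0-java-8-scala-2.12-python-3.8-latest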

Our images include connectors to GCS, S3, Azure Data Lake, Delta, and Snowflake, as well as support for Python, Java, Scala, Hadoop and Spark! Image by Author.

How To Use Our Spark Docker Images

You should use our Spark Docker images as a base, and then build your own image by adding your code and dependencies on top. Here’s a Dockerfile example to help get you started:

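(The original post shows the Dockerfile as an embedded image; the sketch below reconstructs a minimal version. The base-image tag and the file names are illustrative assumptions to adapt to your project.)

    # Start from one of the Data Mechanics Spark images (tag shown is illustrative).
    FROM datamechanics/spark:3.1.1-hadoop-3.2.0-java-8-scala-2.12-python-3.8-latest

    # Install your Python dependencies on top of the base image.
    WORKDIR /opt/application
    COPY requirements.txt .
    RUN pip install -r requirements.txt

    # Add your application code; the local run command below refers to this path.
    COPY pi.py .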

Once you’ve built your Docker image, you can run it locally with:

    docker run {{image_name}} driver local:///opt/application/pi.py {args}

Or you can push your newly built image to a Docker registry that you own, then use it on your production k8s cluster!
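With a hypothetical private registry, that flow looks something like:

    # Build from the Dockerfile above, then push to a registry you control.
    docker build -t my-registry.example.com/team/spark-app:v1 .
    docker push my-registry.example.com/team/spark-app:v1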

Do not pull our DockerHub images directly from your production cluster in an unauthenticated way, as you risk hitting rate limits. It’s best to push your image to your own registry (or purchase a paid plan from DockerHub).

Data Mechanics users can directly use the images from our documentation. They offer higher availability and a few additional capabilities exclusive to Data Mechanics, like Jupyter support.

Conclusion — We hope these images will be useful to you

Are these images working well for you? Do you need new connectors or versions to be added? Let us know, we’d love your feedback.

Originally published at https://www.datamechanics.co.


Co-Founder @ Data Mechanics, the Cloud-Native Spark Platform. Senior Product Manager @ Spot.io, building Ocean for Spark. Former software engineer @ Databricks.