
Apache Spark: Sharing Fairly between Concurrent Jobs

Ensuring equal resource distribution among concurrent jobs within a Spark application, irrespective of their size

Photo by Krzysztof Maksimiuk on Unsplash

Note: In this post, the phrases "Spark job" and "job" refer to Spark actions like save, collect, count, etc., and the phrase "concurrent jobs" refers to multiple Spark actions running simultaneously within an application.

In my previous post, we discussed speeding up a monotonous Apache Spark application by running its jobs concurrently through Scala Futures or Parallel Collections, which reduced the application's runtime to one-fourth. (If you haven't read it yet, it is worth a look.)

However, there can be scenarios where achieving concurrency at the job level alone is not enough to optimize the application's performance. For instance, one large job may consume all the available Spark resources and push other parallel jobs into a waiting state until its tasks no longer occupy them all. This happens because Spark's default scheduling mode within an application is FIFO (First In, First Out), which gives the first job priority over all available resources for as long as its stages have tasks to run. In most such scenarios, we never want a Spark application to get stuck on just one long job; instead, we want all jobs, whether short or long, to share resources fairly.

To address this, Apache Spark provides a neat solution: we can change the default scheduling mode to FAIR so that tasks from different jobs are executed in a round-robin fashion, giving all jobs a roughly equal share of Spark resources. By using FAIR scheduling, I reduced my application's duration by 27 percent, so I highly recommend it for any Spark application running multiple parallel jobs simultaneously.

Prerequisite to leverage FAIR scheduling: multiple jobs must be submitted from separate threads within a Spark application. If you don't know how to do that, please have a look here.

At this point, we are ready to explore how we can have FAIR Scheduling implemented in a Spark application with concurrent jobs.

Start by creating "fairscheduler.xml" with a pool name of your choice and its scheduling mode set to FAIR. We need to mention the scheduling mode explicitly because, by default, it is FIFO for each pool. You can learn more about it here.
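A minimal allocation file for this step could look like the following, using the pool name "mypool" (the weight and minShare values shown are Spark's defaults and can be tuned per pool):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- A pool whose own scheduling mode is FAIR, so stages submitted
       to it share resources in a round-robin fashion. -->
  <pool name="mypool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```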

Secondly, we need to configure the Spark application to use FAIR scheduling, either while creating the Spark session or while submitting the application.
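In code, this amounts to setting "spark.scheduler.mode" (and, if the allocation file is not at the default location, "spark.scheduler.allocation.file") on the session builder. A sketch, assuming a hypothetical app name and file path:

```scala
import org.apache.spark.sql.SparkSession

// Enable FAIR scheduling at session creation; the allocation file
// path below is a placeholder for wherever fairscheduler.xml lives.
val spark = SparkSession.builder()
  .appName("fair-scheduling-demo")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()
```

Equivalently, the same two properties can be passed at submit time with `spark-submit --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml`.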

And finally, we need to set the pool defined in "fairscheduler.xml" as a local property of the Spark context in each thread that submits jobs.
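Since the pool assignment is thread-local, each concurrent job sets it before triggering its action. A sketch using Scala Futures, assuming a SparkSession named `spark` already exists; the table names and input paths are hypothetical:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future runs on its own thread; setLocalProperty attaches that
// thread's jobs to the FAIR pool defined in fairscheduler.xml.
val jobs = Seq("tableA", "tableB").map { table =>
  Future {
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "mypool")
    spark.read.parquet(s"/data/$table").count()  // the action submitted to the pool
  }
}
jobs.foreach(Await.result(_, Duration.Inf))
```

Calling `setLocalProperty("spark.scheduler.pool", null)` afterwards would return that thread's jobs to the default pool.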


Confirming FAIR Scheduling through Spark UI

It is very important to confirm whether FAIR scheduling has been implemented successfully, because a slight mistake in the above steps can put you back at square one. Keep in mind:

  1. Setting only "spark.scheduler.mode" to "FAIR" is not enough, as job stages still run in the default pool, whose scheduling mode is FIFO.
  2. That's why we create our own pool with scheduling mode FAIR.
  3. And set this pool as a local property of the Spark context.

To confirm our work, we can use Spark UI.

  1. Check on the "Jobs" page that the scheduling mode is FAIR. However, as mentioned earlier, this alone does not ensure that scheduling is FAIR at the pool level as well.
  2. Hence, go to the "Stages" page and check the pool name of the running or completed stages. In our case, it should be "mypool".

The snapshots below will help you see this more clearly.

Spark UI's Jobs webpage
Spark UI's Stages webpage

To conclude, FAIR scheduling is a must-have feature if an application runs big and small Spark jobs concurrently. It can significantly improve the application's performance in terms of resource utilization and, ultimately, save cost.

I hope, you enjoyed this post. I would love to hear your suggestions or feedback in the comments section.
