Apache YARN & Zookeeper

All about Resource Allocation and High Availability in Hadoop

Prathamesh Nimkar
Towards Data Science


YARN Architecture

Architecture and Working

YARN, or “Yet Another Resource Negotiator”, does exactly what its name says: it negotiates for the resources needed to run a job.

YARN, like most Hadoop components, follows a “Master-Slave” architecture, wherein the Resource Manager is the master and the Node Manager is the slave. The master allocates jobs and resources to the slaves and monitors the cycle as a whole. A slave receives the job, requests (additional) resources to complete it, and actually executes it.

  1. The Client submits a job (a jar file in most cases) to the Resource Manager (RM). A minimal submission sketch in Java follows this list.
  2. The RM contains two parts, namely, the Scheduler and the Application Manager (AM). The Scheduler allocates resources for the job, while the AM accepts the submission and finds an available Node Manager (NM); the selected NM spawns the Application Master (App Master) for the job. Please note, the RM Scheduler only schedules the job; it cannot monitor or restart a failed job. The AM monitors the end-to-end life cycle of the job, can reallocate resources if an NM fails, and can restart a failed App Master.
  3. The App Master checks the resources provided for the job in its container; the job now resides with the App Master, which communicates the status of the job back to the AM.
  4. It is important to note that a YARN container can only be launched once the App Master provides a Container Launch Context (CLC), the record carrying the commands, environment and security tokens needed to start it. It works like a key to unlock the container’s resources and is internal to YARN. The job is now executed in the container. However, if the resources provided are not sufficient, then
  5. the App Master creates a list of additional resource requests, and
  6. this list is sent directly to the RM Scheduler (not through the NM), which grants the additional containers back to the App Master.
  7. The new container is launched on another NM, without an App Master of its own. The job is successfully executed and the resources are released.
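Seen from the client’s side, the flow above boils down to a handful of calls on the YARN client API. Below is a minimal, illustrative sketch of steps 1 and 4, assuming a hypothetical `my-app-master.jar` and the default queue; it is not this article’s code, just the general shape of a submission.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Step 1: ask the RM for a new application id
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("demo-job");
    appContext.setQueue("default");

    // Step 4: the Container Launch Context carries the command, environment and
    // security tokens the NM needs to start the Application Master's container
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        null, null,
        Collections.singletonList("java -jar my-app-master.jar"),  // hypothetical App Master jar
        null, null, null);
    appContext.setAMContainerSpec(amContainer);
    appContext.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the App Master

    // Hand over to the RM; the Scheduler and Application Manager take it from here
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted " + appId);
    yarnClient.stop();
  }
}
```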

The RM is a master process and therefore a single point of failure, so let’s add High Availability to YARN’s Resource Manager in our Cloudera Manager setup on GCP.
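Under the hood, Cloudera Manager’s “Enable High Availability” wizard boils down to a handful of yarn-site.xml properties. The sketch below shows those properties set on a plain YarnConfiguration; the host names and ZooKeeper quorum are placeholders, not values from this cluster.

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmHaConfig {
  public static YarnConfiguration build() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
    conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");        // one active RM, one standby
    conf.set("yarn.resourcemanager.hostname.rm1", "master-1.example.com");
    conf.set("yarn.resourcemanager.hostname.rm2", "master-2.example.com");
    // ZooKeeper quorum used for leader election and for storing RM state
    conf.set("yarn.resourcemanager.zk-address",
             "zk-1.example.com:2181,zk-2.example.com:2181,zk-3.example.com:2181");
    conf.set("yarn.resourcemanager.cluster-id", "yarn-cluster");
    return conf;
  }
}
```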

YARN Schedulers

Different types of YARN Schedulers

FIFO Scheduling

As the name suggests, First In First Out, or FIFO, is the most basic scheduling method provided in YARN. Although it is no longer the default scheduler in recent Hadoop releases, FIFO places jobs submitted by the client in a queue and executes them sequentially on a first-come, first-served basis.

FIFO Scheduling

Jobs 1, 2 and 3 have different storage and memory requirements. Although the cluster could run several jobs together, under FIFO they run sequentially, which wastes capacity. This scheduling method is therefore not preferred on a production/shared cluster, as it suffers from poor resource utilization.
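For reference, the scheduler the RM uses is a pluggable class selected through yarn.resourcemanager.scheduler.class. A minimal sketch of how FIFO would be selected explicitly, if you wanted to reproduce the behaviour described above (the Capacity Scheduler is the usual default):

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class UseFifoScheduler {
  public static YarnConfiguration build() {
    YarnConfiguration conf = new YarnConfiguration();
    // Point the RM at the FIFO scheduler implementation
    conf.set("yarn.resourcemanager.scheduler.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler");
    return conf;
  }
}
```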

Since FIFO worked on a sequential basis, resources were massively under-utilized. A real-life drawback was that critical client SLAs were breached because of the sequential queuing: clients/users had to wait unnecessarily. This led to the creation of private clusters within organizations, with each department disconnecting from the central system (the data lake) to avoid waiting. That further decreased utilization and drastically increased operational costs that could otherwise have been saved.

Capacity Scheduler was introduced to maximize utilization.

Capacity Scheduling

Capacity Scheduling

The central idea is that multiple departments fund the central cluster, or “root”, shown as 100% of available resources at the top of the chart above. Each department, or “leaf”, is guaranteed a specific range of capacity (expressed as percentages or absolute numbers) so it can run its jobs whenever required without waiting. This is the second row in the hierarchy: the minimum is guaranteed, while the maximum is a cap that is only reached when free resources are available elsewhere in the cluster. The guaranteed minimums therefore always sum to 100% at each hierarchical level. Naturally, no individual leaf’s maximum can exceed 100% either; in practice each leaf’s maximum is kept below 100%, with 80% being a common choice in production.

We can segregate a leaf further by creating a new child level in the hierarchy, which enables more detailed capacity planning: “ETL” in the image above gets 60% of 60% of root, i.e. 36% of all resources, as its guaranteed minimum. At this level too, the minimums sum to 100%. Lastly, if there are two “ETL” jobs, priority between them falls back to FIFO; depending on the use case, this priority handling is a drawback of Capacity Scheduling.
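For illustration, here is a minimal sketch of the capacity-scheduler.xml properties behind a hierarchy like the one in the chart, expressed through a Hadoop Configuration object. The queue names and the 60/40/80 numbers are made up to mirror the example, not a recommendation.

```java
import org.apache.hadoop.conf.Configuration;

public class CapacityQueues {
  public static Configuration build() {
    // In practice these keys live in capacity-scheduler.xml on the Resource Manager
    Configuration conf = new Configuration();

    // First level: the departments funding the cluster; the minimums sum to 100
    conf.set("yarn.scheduler.capacity.root.queues", "engineering,analytics");
    conf.set("yarn.scheduler.capacity.root.engineering.capacity", "60");          // guaranteed minimum
    conf.set("yarn.scheduler.capacity.root.engineering.maximum-capacity", "80");  // cap on borrowing
    conf.set("yarn.scheduler.capacity.root.analytics.capacity", "40");
    conf.set("yarn.scheduler.capacity.root.analytics.maximum-capacity", "80");

    // Second level: "etl" gets 60% of engineering's 60%, i.e. 36% of the whole cluster
    conf.set("yarn.scheduler.capacity.root.engineering.queues", "etl,reporting");
    conf.set("yarn.scheduler.capacity.root.engineering.etl.capacity", "60");
    conf.set("yarn.scheduler.capacity.root.engineering.reporting.capacity", "40");
    return conf;
  }
}
```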

Fair Scheduling

You need to understand how FIFO and Capacity Scheduling work first, as Fair Scheduling builds upon their drawbacks.

If a single job is running, Fair Scheduling (FS) will throw all of the cluster’s resources at it. If another job is added, FS ensures that a “fair” share of resources is carved out so the new job finishes too. The FIFO issue is thus resolved: a newly submitted job no longer starves for resources.

FS can also do capacity-style scheduling if configured, automatically adding resources at the leaf/department level. Furthermore, it can prioritize jobs based on “weights”, or priorities, set at the leaf or job level, allowing for maximum capacity utilization.

The central idea is to ensure that all jobs receive a “fair” amount of resources over time to execute successfully. This matters most when there are two jobs and one of them is small: the small job also receives an adequate share of resources instead of being put on hold until the big job completes, even if both fall under the same leaf/queue/department. Think of it as dynamic resource balancing.
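The queues, weights and per-queue policies live in the Fair Scheduler’s allocation file (the file that yarn.scheduler.fair.allocation.file points to). The sketch below writes a minimal, illustrative allocation file from Java so the example stays in one language; the queue names and weights are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteFairAllocations {
  public static void main(String[] args) throws Exception {
    // Two illustrative queues: "etl" gets twice the fair share of "adhoc",
    // and jobs inside "adhoc" are ordered FIFO among themselves.
    String allocations =
        "<?xml version=\"1.0\"?>\n"
        + "<allocations>\n"
        + "  <queue name=\"etl\">\n"
        + "    <weight>2.0</weight>\n"
        + "    <schedulingPolicy>fair</schedulingPolicy>\n"
        + "  </queue>\n"
        + "  <queue name=\"adhoc\">\n"
        + "    <weight>1.0</weight>\n"
        + "    <schedulingPolicy>fifo</schedulingPolicy>\n"
        + "  </queue>\n"
        + "</allocations>\n";
    Files.write(Paths.get("fair-scheduler.xml"), allocations.getBytes(StandardCharsets.UTF_8));
  }
}
```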

Lastly, you should also enable High Availability on HDFS.

Apache Zookeeper

A zookeeper is a person who manages animals in a zoo. You’ve probably noticed that most Hadoop application icons are related to animals in one way or another. Apache Zookeeper, similarly, is the application that manages the “animals” (read: applications) in the “zoo” (read: the Hadoop ecosystem).

So, what does it really do?

Coordination

One of the pain points of any massively distributed application is coordination at scale. For example, if you have 100 nodes and you wish to implement a configuration change on all of them, you have to do it manually on every node.
Can you imagine writing a config change on 100 nodes? Doable? Yes?

How about this then — Yahoo at one point had 40,000 Hadoop nodes. Can you imagine writing a config change for all of them? Would you do it?

You’re probably thinking of writing a script that deploys the configuration change across all nodes. While that is a fair idea, it brings concerns of its own, such as tedious, frequent updates and customization of the script for every small change, which we won’t get into here, but I’m sure you get the point.

Here’s where Zookeeper comes into the picture: a change or update can be applied to all nodes without that overhead. It is seamless; the change is made once, on a master node, and is applied automatically across all nodes.
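As a concrete sketch of that idea, the snippet below uses the plain ZooKeeper Java client: the change is written once to a znode, and every node reads it and sets a watch so it is notified of the next change. The znode path, connect string and property value are placeholders.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher {
  private static final String CONFIG_ZNODE = "/config";  // hypothetical path

  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk-1.example.com:2181", 30000, event -> { });

    // Done once, by an admin or master process: publish the new configuration value
    byte[] newConfig = "log.level=DEBUG".getBytes("UTF-8");
    Stat stat = zk.exists(CONFIG_ZNODE, false);
    if (stat == null) {
      zk.create(CONFIG_ZNODE, newConfig, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      zk.setData(CONFIG_ZNODE, newConfig, stat.getVersion());
    }

    // Done on every node: read the value and leave a watch so the next change is pushed to us
    byte[] current = zk.getData(CONFIG_ZNODE, event -> {
      if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
        System.out.println("Configuration changed, re-reading " + CONFIG_ZNODE);
      }
    }, new Stat());
    System.out.println("Current config: " + new String(current, "UTF-8"));
    zk.close();
  }
}
```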

The primary use of Zookeeper isn’t coordination. It is maintaining high availability.

Maintaining High Availability

Apache Zookeeper Basic Architecture

It does this through a simple feedback loop: Zookeeper, working with the fail-over controller, monitors the leader and the follower/stand-by nodes.

It receives a heartbeat, an instant notification of each node’s current health/status. The moment the leader fails, one of the stand-by nodes is elected as the new leader by Zookeeper almost immediately. Once the election is complete, the new leader sends a message to the underlying application, after which the application “reports” to the new leader.
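The election itself typically follows ZooKeeper’s standard leader-election recipe: every candidate creates an ephemeral, sequential znode, and the one holding the lowest sequence number acts as leader; when its session dies, its znode vanishes and the next in line takes over. A bare-bones, illustrative sketch (paths and connect string are placeholders):

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk-1.example.com:2181", 30000, event -> { });

    // Make sure the election parent znode exists
    if (zk.exists("/election", false) == null) {
      zk.create("/election", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Each candidate (e.g. an active and a stand-by master) registers an ephemeral,
    // sequential znode; it disappears automatically if the candidate's session dies
    String myZnode = zk.create("/election/candidate-", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

    // The candidate whose znode carries the smallest sequence number is the leader
    List<String> candidates = zk.getChildren("/election", false);
    Collections.sort(candidates);
    boolean leader = myZnode.endsWith(candidates.get(0));
    System.out.println(leader ? "Acting as leader" : "Standing by");
    // A fuller implementation would watch the znode just ahead of its own, so it is
    // notified the moment the current leader fails.
  }
}
```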

This brings me to the end of Apache YARN and Zookeeper.

Got questions? Don’t hesitate to ask!

References:

[1] Apache Hadoop YARN (2019), Apache Hadoop, ASF

[2] Project Description — Apache Zookeeper, Apache Hadoop, ASF
