Hadoop with GCP Dataproc

An introduction to Hadoop, its services & architecture

Varuni Punchihewa
Towards Data Science


Photo by Richard Jacobs on Unsplash

Today, big data analytics is one of the most rapidly growing fields in the world because of the vast number of benefits it offers. With that tremendous growth comes its own set of problems. One of the major issues in storing big data is that you need a huge amount of space to hold thousands of terabytes of data, which a single personal computer cannot provide. Even if you managed to store part of it, processing that data on one machine would take years. As a solution to this, the Apache Software Foundation developed Hadoop.

Introduction

Let’s start with: What is Hadoop?

Hadoop is an open-source framework designed for storing and processing big data.

Thus, Hadoop offers two major functionalities: storing big data and processing big data. We use HDFS (Hadoop Distributed File System) for storing big data and MapReduce for processing it. We will be talking more about HDFS throughout the rest of this article.

Before talking about HDFS, let’s first take a look at DFS. In a Distributed File System (DFS), you split your data into small chunks and store them across several machines separately.

HDFS is a specially designed distributed file system for storing a large data set in a cluster of commodity hardware.

NOTE: Commodity hardware is cheap hardware. For example, laptops you use daily are commodity hardware.

Generally, you store your files and directories on your machine’s hard disk (HD). An HD is divided into tracks, then sectors, and finally blocks. Generally, the size of one such block is 4 KB.

NOTE: A block is a group of sectors that the operating system can point to.

Set of Blocks in a Hard Disk of size 500 GB (Image by author)

If I want to store a file of size 2 KB on my HD, it is stored inside one block, but 2 KB of that block is left empty. The HD cannot reuse this remaining space for another file, so it is wasted.

Now on top of this HD, we are going to install Hadoop with HDFS.

HDFS uses a block size of 128 MB by default in Hadoop 2.x (64 MB in Hadoop 1.x).

If I want to store a file (example.txt) of size 300 MB in HDFS, it will be stored across three blocks as shown below. Block 3 will hold only 44 MB; the remaining 84 MB is not wasted but is released for use by other files.

Data distribution of a file of size 300 MB in Hadoop (Image by author)

Thus, Hadoop manages data storage more efficiently than HD.

NOTE: Here I have taken a file of size 300 MB for the mere benefit of explaining the concept. Usually, Hadoop deals with very large files of size terabytes!
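
If you have a Hadoop installation at hand, you can verify the configured block size from the command line. This is just a quick sketch; the command prints the value of the dfs.blocksize property in bytes.

# Print the configured HDFS block size; a default Hadoop 2.x install returns
# 134217728 bytes (128 MB), so a 300 MB file needs 128 MB + 128 MB + 44 MB blocks
hdfs getconf -confKey dfs.blocksize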

HDFS Services

HDFS has two main services, namely NameNode and DataNode.

NameNode: A master daemon that runs on the master machine, which is typically a high-end machine.

DataNode: A slave daemon that runs on commodity hardware.

NOTE: We use a high-end machine for the NameNode because all the metadata is stored there. If the NameNode fails, we lose the information about where each part of every file is stored, which ultimately means losing access to the entire Hadoop cluster. The cluster becomes useless even if the DataNodes are still active, because we can no longer locate the data stored on them. The most common practice for overcoming this single point of failure is to keep a second (standby) NameNode that holds a copy of the metadata and can take over if the active NameNode fails.

NameNode

  • Master daemon
  • Maintains and manages DataNodes
  • Records metadata (e.g. location of blocks stored, the size of the files, permissions, hierarchy, etc.)
  • Receives heartbeats and block reports from all the DataNodes

NOTE: Heartbeat tells the NameNode that this DataNode is still alive.

DataNodes

  • Slave daemons
  • Store actual data
  • Serve read and write requests made by the client
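
On a running cluster you can see both services at work. The following command asks the NameNode for its cluster-wide view (a sketch; the exact output depends on your cluster).

# Cluster summary kept by the NameNode: total capacity, live/dead DataNodes,
# per-DataNode disk usage, and the time of the last heartbeat received
hdfs dfsadmin -report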

Block Replication in HDFS

In order to be fault-tolerant, Hadoop stores replicas of each block on different DataNodes. By default, the replication factor is 3; that is, HDFS keeps three copies of every block, spread across the DataNodes in the cluster.

Let’s take our previous example of a 300 MB file.

High-level representation of block replication in HDFS (Image by author)
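
Replication is an ordinary HDFS setting (the dfs.replication property, 3 by default) and can be inspected or changed per file from the shell. The path below is only an example.

# Print the replication factor of a file (%r); 3 on a default installation
hdfs dfs -stat %r /user/hadoop/example.txt

# Change the replication factor of an existing file to 2 and wait for HDFS
# to finish adjusting the number of replicas
hdfs dfs -setrep -w 2 /user/hadoop/example.txt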

How does Hadoop decide where to store the replicas of the blocks created?

It uses Rack Awareness Algorithm.

When a client requests a read or write in a Hadoop cluster, the NameNode chooses a DataNode that is close to the client in order to minimize network traffic. This is called rack awareness.

Distributions of a block and its replicas in Racks to support fault-tolerance (Image by author)

The replicas of a block should not be created in the same rack where the original copy resides. Here, the replicas of block 1 should not be created in rack 1; they can be created in any rack other than rack 1. If I stored the replicas of block 1 in rack 1 and rack 1 failed, I would lose the data in block 1.

NOTE: A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in a rack is greater than the bandwidth between two nodes on different racks.

Why store the two replicas of block 1 on the same rack (rack 2)?

There are two reasons for this. First, the chance of both racks (rack 1 and rack 2) failing at the same time is minimal. Second, the network bandwidth required to move a block between two DataNodes in the same rack is much lower than the bandwidth required to move it to a DataNode in a different rack, and there is no point in consuming extra bandwidth when it is not required.
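
Note that rack awareness is not automatic: an administrator supplies a topology script (the net.topology.script.file.name property) that maps each DataNode address to a rack ID. Without it, every node ends up in a single /default-rack. You can check what the NameNode currently believes the topology to be.

# Print the rack-to-DataNode mapping known to the NameNode
hdfs dfsadmin -printTopology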

Write Architecture of HDFS

Taking our previous example again, let’s say we have a client who wants to store a file called example.txt, which is 300 MB in size. Since the client does not have enough space on the local machine, they want to put it into the Hadoop cluster, but they do not know which DataNodes have enough free space to hold the data. So the client first contacts the NameNode and sends a request saying that they want to put the example.txt file into the cluster.

High-level visual representation of Write architecture of HDFS (Image by author)

Now the NameNode looks at the size of example.txt and calculates how many 128 MB blocks are needed to hold it. The 300 MB file will be split into three blocks holding 128 MB, 128 MB, and 44 MB respectively; let’s call the three splits a.txt, b.txt, and c.txt. The NameNode then does a quick check of which DataNodes have free space and responds to the client: please store your 300 MB file in DataNodes 1, 3, and 5.

Now the client approaches DataNode 1 directly to store a.txt there. By default, the cluster keeps two more copies of a.txt. Once a.txt is stored, DataNode 1 forwards a copy to another DataNode with free space, say DataNode 2, and DataNode 2 in turn forwards a copy to another DataNode, say DataNode 4. Once DataNode 4 has stored the block, it sends an acknowledgment to DataNode 2 saying that the block has been stored locally. In the same way, DataNode 2 acknowledges to DataNode 1 that the block is now stored on DataNode 2 as well as on DataNode 4, and finally DataNode 1 acknowledges to the client that the file has been stored on DataNodes 1, 2, and 4.

Inter-communication between Data Nodes and Client (Image by author)
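
From the client’s point of view, all of this pipelining is hidden behind a single command; the splitting, DataNode selection, and replication happen transparently. A sketch with example paths:

# Upload the 300 MB file; HDFS splits it into 128 MB blocks and replicates
# each block to three DataNodes behind the scenes
hdfs dfs -put example.txt /user/hadoop/example.txt

# List every block of the file together with the DataNodes holding its replicas
hdfs fsck /user/hadoop/example.txt -files -blocks -locations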

But how does the NameNode know where a.txt has actually been stored? All the DataNodes periodically send block reports to the NameNode, which tell the NameNode which blocks are held on each DataNode.

NOTE: A block report contains the block ID, the generation stamp, and the length for each block replica the server hosts.

Using these block reports, the NameNode updates its metadata accordingly. The b.txt and c.txt splits are stored in the cluster in the same way.
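
The reporting interval is itself a configurable HDFS property; on a default installation a full block report is sent every six hours, with smaller incremental reports sent as blocks are added or removed.

# Interval between full block reports, in milliseconds
# (21600000 ms = 6 hours on a default installation)
hdfs getconf -confKey dfs.blockreport.intervalMsec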

What happens when one DataNode goes down?

All the DataNodes send a heartbeat to the NameNode from time to time, which helps the NameNode figure out whether each DataNode is still alive. If a DataNode stops sending proper heartbeats, the NameNode considers it dead. Let’s say DataNode 1 dies. The NameNode then removes it from the metadata entry for a.txt and allocates that block to another DataNode with free space, say DataNode 7. DataNode 7 then sends a block report back to the NameNode, and the NameNode updates the metadata for a.txt.

How Heartbeat works in a Hadoop cluster (Image by author)
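
The timing involved here is also just configuration. On a default installation each DataNode heartbeats every 3 seconds, and a DataNode that stays silent for roughly 10 minutes is marked dead.

# How often each DataNode sends a heartbeat, in seconds (default 3)
hdfs getconf -confKey dfs.heartbeat.interval

# Used together with the heartbeat interval to decide when a silent
# DataNode should be declared dead (default 300000 ms)
hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval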

Read Architecture of HDFS

Let’s say the client wants to read the example.txt file stored earlier. The client first contacts the NameNode saying that they want to read example.txt. The NameNode looks into its metadata for that file, selects for each stored split the replica closest to the client, and sends the IP addresses of those DataNodes back to the client. The client then reaches out to those DataNodes directly and reads the blocks from them. Once the client has all the required blocks, it combines them to form the original file, example.txt.

NOTE: While serving the read request of the client, HDFS selects the replica which is closest to the client. This reduces read latency and network bandwidth consumption.

High-level visual representation of Read architecture of HDFS (Image by author)
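
As with writes, reading is a single client command; fetching the individual blocks and reassembling them is handled inside the HDFS client. The paths below are examples.

# Stream the file back to the terminal
hdfs dfs -cat /user/hadoop/example.txt

# Or copy it back to the local file system
hdfs dfs -get /user/hadoop/example.txt ./example-copy.txt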

Before moving on to the hands-on, let me briefly tell you about Dataproc.

Dataproc is a managed service for running Hadoop and Spark jobs (it now supports more than 30 open-source tools and frameworks). It can be used for big data processing and machine learning.

The below hands-on is about using GCP Dataproc to create a cloud cluster and run a Hadoop job on it.

Hands-on

I will be using the Google Cloud Platform and Ubuntu 18.04.1 for this practical.

First, you need to set up the cloud resources the job will use. The word-count example built later stores its results in Cloud Bigtable, so start by creating a Bigtable instance.

gcloud bigtable instances create INSTANCE_ID \
--cluster=CLUSTER_ID \
--cluster-zone=CLUSTER_ZONE \
--display-name=DISPLAY_NAME \
[--cluster-num-nodes=CLUSTER_NUM_NODES] \
[--cluster-storage-type=CLUSTER_STORAGE_TYPE] \
[--instance-type=INSTANCE_TYPE]
Creating a Bigtable instance via GCloud shell

NOTE: Make sure to use a cluster-zone where Cloud Bigtable is available.

Creating a cloud storage bucket

NOTE: Cloud Storage bucket names must be globally unique across all buckets. Make sure you use a unique name for this. If you get a ServiceException: 409 Bucket hadoop-bucket already exists, that means the given bucket name is already being used.
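
If you prefer the shell to the console, a bucket can be created with gsutil. The bucket name and location below are placeholders; replace them with your own.

# Create a Cloud Storage bucket in a chosen location
# (bucket names must be globally unique)
gsutil mb -l us-central1 gs://my-unique-hadoop-bucket/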

  • You can check the created buckets in a project by running gsutil ls
Checking the created Gcloud storage bucket
Gcloud storage bucket
  • Create a Cloud Dataproc cluster with three worker nodes.

Go to the Navigation Menu; under the “BIG DATA” group you will find the “Dataproc” label. Click it and select “Clusters”, then click the “Create cluster” button. Give your cluster a suitable name and set the number of worker nodes to 3. Click “Advanced options” at the bottom of the page, find the “Cloud Storage staging bucket” section, click “Browse”, and select the bucket you created previously. When everything is set, click “Create cluster” and wait a few minutes for the cluster to be created (the equivalent gcloud command is shown below).
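
For reference, the same cluster can also be created from the shell. This is a sketch; the cluster name, region, and bucket name are assumptions you should replace with your own values.

gcloud dataproc clusters create test-hadoop \
    --region=us-central1 \
    --num-workers=3 \
    --bucket=my-unique-hadoop-bucket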

Once the cluster is created successfully, it will be displayed as follows.

Creating a Dataproc cluster
  • You can go inside the created cluster and click on the “VM Instances” tab. There you can find a list of nodes created for your cluster.
Master & worker nodes in the cluster
  • Go to the directory java/dataproc-wordcount
  • Build the project with Maven
mvn clean package -Dbigtable.projectID=[PROJECT_ID] \
-Dbigtable.instanceID=[BIGTABLE_INSTANCE_ID]
Building with Maven
  • Once the project is fully built, you will see a “BUILD SUCCESS” message.
  • Now start the Hadoop job
    ./cluster.sh start [DATAPROC_CLUSTER_NAME]
gcloud dataproc jobs submit pig --cluster test-hadoop --execute 'fs -ls /'
Details of the execution of the Hadoop job
The output of the Hadoop job

You can delete the Cloud Dataproc cluster by running:

gcloud dataproc clusters delete [DATAPROC_CLUSTER_NAME]
