The World of Hadoop
Interconnecting Different Pieces of the Hadoop Ecosystem
When learning Hadoop, one of the biggest challenges I had was to put different components of the Hadoop ecosystem together and create a bigger picture. It’s a huge system which comprises of different components which can be contrasting as well as complementing to each other. Understanding how these different components are interconnected is a must have piece of knowledge for anyone willing to utilize Hadoop based technologies in a production level big data application. Hadoop ecosystem possesses a huge place in the big data technology stack and it’s a must have skill for data engineers. So, let’s dig a little deeper into the world of Hadoop and try to untangle the pieces of which this world is made.
Starting with a formal definition for Hadoop can help us getting an idea of the overall intention of Hadoop ecosystem:
Hadoop is an opensource software platform for distributed storage and distributed processing of very large data sets on computer clusters
As the definition indicates, the heart of Hadoop is made of its ability to handle data storage and processing in a distributed manner. In order to achieve this, the overall architecture of Hadoop and its distributed file system has been inspired by Google File System (GFS) and Google MapReduce. The distributed nature of Hadoop architecture makes it suitable for very large data, specially by removing a single point of failure when processing these large amounts of data.
Now let’s try to walk through different building blocks of the Hadoop ecosystem and understand how these pieces are interconnected.
I tried to organize the entire Hadoop ecosystem into three categories.
- Hadoop Core Stack
- Hadoop Data Manipulation Stack
- Hadoop Coordination Stack
Hadoop Core Stack
- HDFS (Hadoop Distributed File System): As the name implies HDFS is a distributed file system that acts as the heart of the overall Hadoop eco system. It allows storing data in a distributed manner in different nodes of clusters but is presented to the outside as one large file system. It is built on top of commodity hardware and said to be highly fault tolerant. HDFS is capable of handling huge data sets through spawning clusters of hundreds of nodes. By architecture, it follows a master/slave approach where each cluster consists of a single name node (master) and multiple data nodes (slaves). Name node manages the file system namespace, metadata, and client access to files. Data nodes store actual blocks of files. More details about HDFS can be found here: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
2. YARN (Yet Another Resource Negotiator): YARN is the resource manager which arbitrates all available cluster resources. It also follows the master/slave approach. YARN has a resource manager (master) per cluster and then a node manager (slaves) per node. The Resource Manager is the authority that mediates resources among all the nodes in the cluster. It keeps the meta data about which jobs are running on which node and manages how much memory and CPU is consumed and hence has a holistic view of total CPU and RAM consumption of the whole cluster. The Node Manager is the per-machine agent who is responsible for monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the Resource Manager. ApplicationMaster is the instance of a framework specific library which allows different applications to negotiate requests to resources. Containers are the abstractions of resources which are allocated for requests. More details about YARN can be found here: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
3. MapReduce: The entire programming model of parallelism in Hadoop is based on the software framework called MapReduce. This model allows processing large data sets in parallel on clusters of commodity hardware. MapReduce jobs first splits data into a set of independent chunks called map tasks which can be executed/processed in parallel. Then the sorted outputs of the map tasks are input into a reduce task which will produce an aggregated result. Hadoop’s parallel programming model is based on this concept. More details about Hadoop MapReduce can be found here: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Hadoop Data Manipulation Stack
On top of HDFS, YARN and MapReduce there are various data querying, processing and ingesting technologies which can be used to build big data applications in real-world. Now let’s look into those technologies. These technologies are intended to serve various purposes but can be integrated into a one large application too.
- Apache Pig: This is a high-level scripting framework which allows to write SQL like scripts on distributed data. The developers don’t need to know java or python to work with Apache Pig. This platform is suitable for analyzing large data sets because its capability to support parallelization. Pig’s infrastructure layer is based on MapReduce. More details about Apache Pig can be found here: https://pig.apache.org/
- Apache HBase: This is a no-sql, columnar database provided to allow real time read/write access to big data. Having all the advantages of a no-sql database, HBase provides the ability to host large databases with its distributed nature and scalability. More details about Apache HBase can be found here: https://hbase.apache.org/
- Apache Hive: This is a data warehousing framework which enables reading, writing, and managing data that resides in HDFS and other storage such as HBase. This comes up with a SQL like query language to perform ETL tasks on the data thus allowing to impose a structure on data from different formats. Apache Drill and Impala are two alternatives for this. More details about Apache Hive can be found here: https://hive.apache.org/
- Apache Spark: This is an analytics engine that can run on top of Hadoop as well as many other platforms like Kubernetes and Apache Mesos. With Apache Spark, applications can be written using Java, python, Scala, R and SQL too. Spark offers high performance through its state-of-the are design which includes a DAG scheduler, query optimizer and physical execution engine. Spark offers a variety of libraries for machine learning, graphs, data frames and streaming which can be combined to build high quality big data systems. More details about Apache Spark can be found here: https://spark.apache.org/
- Apache Storm: This is a real time streaming data processing system which is an addition to Hadoop eco system which can augment its batch processing power to real time processing. It is said that Apache Storm can process over a million of jobs on a node in a fraction of a second. Kafka, Flink and Spark stream processing are alternatives for Storm. More details about Apache Storm can be found here: https://storm.apache.org/
- Apache Solr: This is a search platform which can be used on top of Hadoop. It provides enterprise search features with distributed indexing. Apache Solr comes with a bundle of features such as full text search, hit highlighting, real time indexing and many more. More details about Apache Solr can be found here: https://lucene.apache.org/solr/
- Apache Sqoop: This is a tool which can be used for ingesting or pumping data into HDFS. As Hadoop basically deals with big data, pumping petabytes of data into HDFS is going to be challenging, specially because data comes from different sources and in different formats. Sqoop is a tool which is basically designed to import structured data from relational databases into HDFS. Flume is an alternative to Sqoop which supports unstructured data too. More details about Sqoop and Flume can be found here: https://www.dezyre.com/article/sqoop-vs-flume-battle-of-the-hadoop-etl-tools-/176
The technologies described above are the main components of Hadoop data manipulation layer. What technology to use has to be decided based on the business requirements and the features they offer. The above technologies can be used separately as well as in combination with each other to work on top of Hadoop and its distributed file system.
Hadoop Coordination Stack
On top of Hadoop Core and its related data driven technology stack there should be a mechanism to manage and coordinate things, especially due to the distributed nature of Hadoop. For that also there are different platforms provided.
- Apache ZooKeeper: This is a coordination service which can be used by Hadoop to manage and coordinate clusters by providing mechanisms to share data without inconsistencies using different synchronization mechanisms. This provides different services such as naming of nodes, synchronization, locking and configuration management etc. Apache Ambari is an alternative for ZooKeeper. More details about Apache ZooKeeper can be found here: https://zookeeper.apache.org/
- Apache Oozie: This is a scheduling system which can manage Hadoop jobs. As you can see, there is a huge stack of technologies that can operate on Hadoop and they serve different purposes. Any real-world big data application utilizing the above technologies might need to schedule the jobs created with above technologies into a specific pipeline or order. Oozie has integrations with different jobs in the Hadoop stack including Hive, Pig, Sqoop, Java MapReduce etc. AirFlow is an alternative for this. More details about Apache Oozie can be found here: https://oozie.apache.org/
- Apache Atlas: This is a platform which allows enterprises using Hadoop to make sure their data adheres to governance compliance. With the arising legalities across usage of data, big data governance can be a critical concern for any real-world application. Specially this provides mechanisms to manage metadata, classify data into categories such as personally identifiable information (PII) and sensitive data. More details about Apache Atlas can be found here: https://atlas.apache.org/#/
My main attempt through this article was to summarize and structure the ecosystem of Hadoop based on technologies and their intended purposes. When I was a student, I found it slightly difficult to grasp a summarized view because the Hadoop ecosystem is huge and diverse. I have organized the entire Hadoop ecosystem into certain parts here in a way that was easy for me to grasp personally. I believe it will be helpful for someone else too. So, if you are interested in diving deeper into big data engineering, then this might be the starting point.