Data in Warehouses or Data in Hadoop: What’s suitable for your business and Why?

Michael Williams
Towards Data Science
4 min read · Jul 26, 2019

Big data is one of the most sought-after innovations in the IT industry and has taken the world by storm, partly because Hadoop and related big data technologies are growing rapidly. One main reason for Hadoop’s growth is that it puts the power of parallel processing in the programmer’s hands. Yet headlines such as “Will Hadoop replace Data Warehousing?” and “The Death of MapReduce” keep surfacing across the web, sparking fresh rounds of debate among big data programmers. As we move through the middle of 2019, the hype around Hadoop still seems to be growing, with much of the expectation and excitement now coming from IoT.

To clear up any confusion for those who favor Hadoop, we will look at some facts and answer some of the most popular questions being asked online:

  • Will Hadoop replace the data warehouse?
  • Do you still need a warehouse if you have big data?
  • Is the era of the traditional warehouse over?

But before we dive in, let’s clear up any doubts and define the terms correctly.

What is a Data Warehouse?

A data warehouse is a data repository architecture that uses a different design from standard operational databases. Unlike operational databases, data warehouses are designed to provide a long-range view of data over time. As a result, they favor data aggregation and trade off transaction volume.

What are Data Warehouses used for?

Most enterprises use a data warehouse to analyze their business data. The need usually arises when analytics requirements run afoul of operational database performance. Most operational databases cannot run complex queries efficiently, because such queries require the database to enter a temporary, fixed state. This is where a data warehouse comes into play. It can take over most of the analytical work, leaving the transactional database free to focus on transactions.

Another key benefit of a data warehouse is its ability to analyze data from multiple sources. It reconciles differences in storage schema through the ETL (Extract, Transform and Load) process. ETL extracts data from different RDBMS source systems, transforms it (for example by applying calculations or concatenations) and finally loads it into the data warehouse.
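To make the three ETL steps concrete, here is a minimal sketch in Python. It assumes two hypothetical source systems with different schemas, both simulated with in-memory SQLite databases. The table names, column names and sample rows are invented for illustration and do not come from any real system.

```python
import sqlite3

# Source A (hypothetical): split names, quantity and unit price.
source_a = sqlite3.connect(":memory:")
source_a.execute(
    "CREATE TABLE orders (first_name TEXT, last_name TEXT, quantity INTEGER, unit_price REAL)"
)
source_a.execute("INSERT INTO orders VALUES ('Ada', 'Lovelace', 2, 10.0)")

# Source B (hypothetical): full name and a total stored in cents.
source_b = sqlite3.connect(":memory:")
source_b.execute("CREATE TABLE sales (customer_name TEXT, total_cents INTEGER)")
source_b.execute("INSERT INTO sales VALUES ('Alan Turing', 1500)")


def etl_source_a(conn):
    # Extract the raw rows, then transform: concatenate the names
    # and calculate revenue from quantity and unit price.
    rows = conn.execute(
        "SELECT first_name, last_name, quantity, unit_price FROM orders"
    ).fetchall()
    return [(f"{first} {last}", quantity * price) for first, last, quantity, price in rows]


def etl_source_b(conn):
    # Extract the raw rows, then transform: convert cents to currency units.
    rows = conn.execute("SELECT customer_name, total_cents FROM sales").fetchall()
    return [(name, cents / 100) for name, cents in rows]


def load(warehouse, rows):
    # Load: every source ends up in the same warehouse schema.
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    warehouse.commit()


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (customer TEXT, revenue REAL)")
load(warehouse, etl_source_a(source_a))
load(warehouse, etl_source_b(source_b))

result = warehouse.execute(
    "SELECT customer, revenue FROM sales_fact ORDER BY customer"
).fetchall()
print(result)
```

The point of the sketch is that each source keeps its own layout, while the transform step is where the differences are reconciled before anything reaches the warehouse table.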

What is Hadoop?

Hadoop’s architecture is similar to that of MPP data warehouses, but with some obvious differences. Unlike an MPP data warehouse, which defines a parallel architecture, Hadoop consists of processors that are loosely coupled across a Hadoop cluster. Each cluster can work on different data sources. Components such as the data catalog, the data manipulation engine and the storage engine can work independently, with Hadoop serving as a collection point.
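Hadoop’s parallel processing is classically exposed through the MapReduce programming model. The sketch below imitates its three phases (map, shuffle, reduce) on a single machine. It is only an illustration: the worker threads stand in for loosely coupled cluster nodes, the function names are made up for this example, and none of this is the Hadoop API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one chunk of input.
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    # Shuffle: group all emitted values by key across the mapped chunks.
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    # Reduce: collapse each key's list of values into a single count.
    return {key: sum(values) for key, values in groups.items()}


# Each chunk stands in for a block of data held by a different node.
chunks = ["Hadoop stores data", "Hadoop processes data in parallel"]

# The threads here play the role of independent workers; each one
# maps its own chunk without coordinating with the others.
with ThreadPoolExecutor(max_workers=2) as pool:
    mapped = list(pool.map(map_phase, chunks))

word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each map call only sees its own chunk, the work can be spread over as many workers as there are chunks, which is the property that lets the model scale across a cluster.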

What purpose does Hadoop serve?

Hadoop has become a favourite big data technology because it can process large amounts of semi-structured and unstructured data. Here are some example use cases for Hadoop:

  • Large-scale enterprises: Hadoop can be used in large-scale enterprise projects that demand server clusters, where programming skills and specialized data management skills are limited and implementation is costly.
  • Large datasets: Hadoop can provide high scalability and save both money and time when managing datasets that run to terabytes or petabytes.
  • Separate data sources: Big data applications that require data from separate data sources often rely on Hadoop clusters, so Hadoop plays an important role in application development.

Problems with traditional warehousing

A traditional data warehouse cannot handle complex hierarchical data types and other unstructured data types. As the need for scalability grows, cost becomes an added disadvantage that the data warehouse is not equipped to handle.

Moreover, a DWH cannot hold data that lacks a definite schema, because it follows a schema-on-write mechanism. This is where Hadoop comes into play, as it favours schema on read. Data warehouse teams typically need to spend a lot of time modelling the data, which is not always feasible given the business model.
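The difference between schema on write and schema on read can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: the two-column schema, the JSON records and the function names are hypothetical, and a real warehouse or Hadoop deployment enforces these rules very differently.

```python
import json

# Hypothetical warehouse table schema: column name -> required type.
REQUIRED_COLUMNS = {"id": int, "name": str}


def write_with_schema(table, record):
    # Schema on write: validate before storing and reject anything that does not fit.
    if set(record) != set(REQUIRED_COLUMNS):
        raise ValueError(f"record does not match schema: {record}")
    for column, expected_type in REQUIRED_COLUMNS.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"wrong type for column {column!r}: {record}")
    table.append(record)


def read_with_schema(raw_lines, fields):
    # Schema on read: the raw text is stored as-is; structure is applied per query.
    for line in raw_lines:
        doc = json.loads(line)
        yield {field: doc.get(field) for field in fields}


# Raw records with differing shapes, stored without any up-front modelling.
raw_store = [
    '{"id": 1, "name": "sensor-a", "tags": ["iot"]}',
    '{"id": 2, "reading": {"temp": 21.5}}',
    '{"id": 3, "name": "sensor-c"}',
]

warehouse_table = []
rejected_ids = []
for line in raw_store:
    record = json.loads(line)
    try:
        write_with_schema(warehouse_table, record)
    except ValueError:
        rejected_ids.append(record["id"])

# The same raw data can still be queried by projecting a schema at read time.
projected = list(read_with_schema(raw_store, ["id", "name"]))
print(warehouse_table, rejected_ids, projected)
```

Only the record that fits the predefined schema reaches the warehouse-style table, while the read-time projection accepts all three and simply returns a missing value where a field is absent.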

Is Hadoop going to replace the Data Warehouse?

There is no doubt that the Hadoop ecosystem has been evolving rapidly. We won’t be surprised if Hadoop soon becomes capable of handling all types of mission-critical workloads, thereby eliminating the need for a data warehouse. However, both Hadoop and the data warehouse are going to stay in business for a long time. Organizations can rely on their existing data warehouses and bring in Hadoop functionality to meet business requirements and harness the power of application development.
