
Do I need Big Data? And if so, how much?

Many companies follow the hype of big data without understanding the implications of the technology.

Photo by Jan Antonin Kolar on Unsplash

I call myself a "Big Data Expert". I have tamed many animals in the ever-growing Hadoop zoo, like HBase, Hive, Oozie, Spark, Kafka, etc. I have helped companies build and structure their Data Lakes using appropriate subsets of these technologies. I like to wrangle data from multiple sources to generate new insights (or to confirm old insights with evidence). I love to build Machine Learning models for predictive applications. So, yes, I would say that I am well experienced with many facets of what people call "Big Data".

But at the same time, I have become more and more skeptical of blindly following the promises and the hype without understanding all the consequences and without evaluating the alternatives.

What is "Big Data"?

Before jumping into a discussion, let me first explain what I understand by the term "Big Data". If you asked someone on the street what they think the term "Big Data" stands for, you would probably get very broad answers about "huge amounts of data which are continuously generated by people interacting with computer systems and which are then used to analyze personal preferences".

There is a well-known, more formal definition of "Big Data" which focuses on three specific aspects of data:

  • High volume – the amount of data that is stored
  • High velocity – the speed at which new data arrives
  • High variety – the variety of formats and types (images, text, structured data)

It’s interesting to see that this definition is really only about the data itself and not about the processes for analyzing the data. That is in stark contrast to the imagined answer above, which also includes the usage of the data. From my experience, the analytics part is the most important aspect of "Big Data", both from a company’s and from a society’s point of view.

In this article, I want to focus on the technology aspect of "Big Data", i.e. on the software and hardware stack required to work with high volume, high variety, and possibly high velocity data. This includes Hadoop as well as scalable NoSQL databases like MongoDB, Cassandra, etc.

Big Data Promises

A couple of years ago, I had the impression that big software and consulting companies sold the "Big Data" hype to their customers by promising a new level of insights from data. Today I would say that only a small fraction of companies actually succeeded in implementing a successful Big Data strategy; most other companies failed or are happy that they didn’t follow the hype. Simply think of Cloudera, one of the biggest proponents of building Data Lakes, which almost went bankrupt although it had all the technology at hand to generate the insights needed to make the right decisions.

But this situation doesn’t mean that Big Data is useless. I would argue that it was massively oversold by creating the expectation of a simple button that just had to be pressed to generate new insights. We are still very far away from this single-button solution, and we might never get it at all.

The secret of the companies that successfully created a Big Data strategy is very simple and down to earth: they didn’t simply follow the hype. Instead, they analyzed their problems, searched for tools to address their issues, and only used Big Data tools where it made sense. That is, they started from specific technical problems and looked for solutions; they didn’t start from a specific solution in search of a business problem to solve.

Asking the Right Question

Before jumping onto the "Big Data" bandwagon by implementing Hadoop and Spark, you should ask yourself a very simple question: "Do I actually have Big Data?" Just because you set up a Hadoop cluster, the amount of data generated by your operational systems doesn’t magically increase.

Big Data and Technology

It is really important to distinguish between Big Data itself (as in the definition I gave above) and the technology that promises to help you handle Big Data through almost unlimited scalability.

Of course it absolutely makes sense to embrace the "variety" of Big Data by combining data from different sources to gain new insights. But does that already require new technology? The answer simply depends on the volume of the data and on whether your current system can handle the load.

Where does "Big Data" Technology come from?

To give an idea of when to use Big Data technology like Hadoop and Spark, let’s shed some light on the origin of these technologies. Yahoo didn’t start the Hadoop project just because they thought it could be useful; they started it because their existing technology (probably some classic relational databases) couldn’t cope with the amount of data any more. That is, Yahoo had reached a hard limit of their software and hardware stack that prevented them from moving forward.

In this situation they decided to implement a new technology that could scale better with their data, and (this is really important to realize) at the same time they accepted losing 90% of the convenient features of a traditional database. They started the Hadoop project and got the following:

  • Scalability in terms of data
  • Scalability in terms of CPU power

At the same time they lost the following features with Hadoop:

  • SQL
  • Tables
  • Relations
  • Indices
  • Transactions
  • Simple programming model (have you ever used Hadoop Map/Reduce? See the example below.)
  • Tools and integration

Formally, they lost more than they got. But there was no alternative, since Yahoo’s amount of data and their scaling requirements could no longer be handled by the then-existing technology. They had no other choice if they wanted to continue working with their data.
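
To make the loss of a "simple programming model" concrete: a word count, which is a single GROUP BY statement in SQL, turns into two separate programs plus job configuration in classic Map/Reduce. Here is a minimal sketch using Hadoop Streaming with Python (the file names are illustrative):

    #!/usr/bin/env python3
    # mapper.py - emits a "word<TAB>1" pair for every word in the input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py - Hadoop sorts the mapper output by key, so all counts for
    # one word arrive consecutively and can be summed in a single pass.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

For comparison, the equivalent in SQL is a single statement: SELECT word, COUNT(*) FROM words GROUP BY word.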

Of course the situation has changed since the inception of Hadoop: nowadays traditional databases scale much better (although they are often still limited in this respect). On the other side, "Big Data" technology is also catching up, for example with atomic upserts for data lakes (Delta Lake and Iceberg) and continuous improvements in SQL support. So both sides (traditional databases and Big Data technology) improve by learning from each other.
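
As an illustration of such a borrowed feature, here is what an atomic upsert looks like with Delta Lake in PySpark. This is a minimal sketch: the paths and the id column are made up, and it assumes a Spark session already configured with the Delta Lake extensions:

    # Merge new or changed records into an existing Delta table - an atomic
    # upsert, a feature that used to be exclusive to transactional databases.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    updates = spark.read.parquet("s3a://my-bucket/updates/")

    target = DeltaTable.forPath(spark, "s3a://my-bucket/events/")
    (target.alias("t")
        .merge(updates.alias("u"), "t.id = u.id")  # match on the key column
        .whenMatchedUpdateAll()                    # update existing records
        .whenNotMatchedInsertAll()                 # insert new records
        .execute())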

A Gradual Move to Big Data

Keeping in mind that you first need to actually have Big Data before it makes sense to scale out, it often makes sense to follow a gradual approach: don’t put all your bets on Big Data technology from the very beginning, as this will have many negative implications (more on them below). Instead, follow the agile mindset: get things working with a reasonable development effort. If this means using MariaDB, then that is perfectly fine. If you find that using Cassandra is a better long-term decision and doesn’t hurt development costs too much, that is also perfectly reasonable. But don’t try to build a super-scalable financial transaction system on HBase when you just don’t have the transaction volume to justify that approach.

When you started with "simple" technologies like MariaDB or Postgres and you find yourself limited by them, then again, don’t immediately jump onto a completely different stack. Instead, carefully analyze the overall situation and find out which scenarios don’t perform as desired and what options are at hand. These may include moving some analytics workloads into a more scalable NoSQL database or into a Data Lake. They may also include investigating the scalability options offered by your existing technology. As I already mentioned above, and as I will explain in more detail below, NoSQL databases and Data Lakes aren’t a cheap universal silver bullet. They really shine in certain use cases but are much less flexible than a traditional database.

The best strategy is to use a mixture of appropriate technologies depending on the use case (a sketch of this approach follows the list):

  • Use an RDBMS for relational data and master data like CRM
  • Use highly scalable NoSQL databases for high volume data like IoT events or customer interactions
  • Use data lakes based on cheap storage and scalable CPU power (for example S3 and Spark) for analytics
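
To sketch what this mixture can look like in practice, the following hypothetical PySpark job joins master data from an RDBMS with high-volume events from a data lake. All connection details, table names, and columns are made up for illustration, and the JDBC driver is assumed to be on the classpath:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mixed-stack-report").getOrCreate()

    # Master data stays in the relational database and is read via JDBC
    customers = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://crm-db:5432/crm")
        .option("dbtable", "customers")
        .option("user", "analytics")
        .option("password", "secret")
        .load())

    # High-volume interaction events live on cheap object storage
    events = spark.read.parquet("s3a://datalake/events/")

    # The analytical layer combines both worlds
    report = (events.join(customers, "customer_id")
        .groupBy("customer_id", "country")
        .count())
    report.write.mode("overwrite").parquet("s3a://datalake/reports/interactions/")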

Implications of Big Data Technology

Now that I have explained why both worlds (the old relational SQL database and the shiny new Big Data technology) will continue to exist for good reasons, let me discuss the implications of using a Big Data stack instead of a relational database.

Costs of Big Data

Big Data stuff is expensive. If you asked random people on the street what they think is more expensive, a simple database or Big Data, they would probably answer "Big Data"; the term alone already sounds expensive. Nevertheless, many companies already pay premium prices for their DB2 or Oracle databases, and apparently these systems are so expensive that managers cannot imagine that "Big Data" could be even more expensive.

I have no idea how much an Oracle license costs (apparently a lot of money), but going down the "Big Data" route isn’t cheap either. First of all, most bigger companies insist on using a "Hadoop distribution" (i.e. Cloudera, as the only survivor), which also imposes license costs. But unfortunately the story doesn’t stop here; it has only just begun.

When you buy a traditional database, you get a well-integrated package consisting of a storage backend, a query execution engine, a SQL query frontend, indexing, foreign-key relationships, aggregation, transactions, and much more.

When you implement a Big Data stack, you get HDFS (or S3, or Azure Blob Storage, or…) as the storage backend, YARN (or Kubernetes, or Mesos, or…) as the cluster OS, Spark (or Map/Reduce, or Tez, or…) as the execution engine, Hive (or Impala, or Presto, or Drill, or…) as the SQL frontend, Solr (or Elasticsearch) as the index engine, and HBase (or Cassandra) as a column store for fast single-record access and transactions. You essentially get a set of disintegrated building blocks waiting for a developer to glue them together to provide 20% of the functionality of a traditional database system.
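
As a small taste of that glue work, here Spark serves as both the execution engine and the SQL frontend on top of plain files on the storage backend. Note that indices, transactions, and foreign keys simply don’t exist unless you build them yourself (paths and column names are again made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The "table" is just a directory of Parquet files on the storage backend
    spark.read.parquet("s3a://datalake/orders/").createOrReplaceTempView("orders")

    # SQL support comes from the engine you picked, not from the storage layer
    top_customers = spark.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
    """)
    top_customers.show()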

Agility of Big Data

This doesn’t mean that Big Data technology is a scam; it’s really the opposite: all these components work really well and scale very well, both with the amount of data and with the number of machines in a cluster. Each Big Data component is an expert in its own domain, but often completely useless in all neighboring domains. Often there are multiple competing candidates within a single domain (like Hive vs Impala vs Presto vs Drill vs Spark as an SQL query engine), each with specific strengths and weaknesses. Therefore you really need to understand each technology very well as the basis for a good decision.

At first sight, this situation might look scary. And I have to admit, it really is scary at the start of the journey into Big Data. Having to choose between so many competing technologies is not easy. But after some time, you’ll understand the fundamental concepts, which helps you to judge most products simply by looking at their spec sheet.

Once you have reached this point, you will understand that all these options give you the freedom to separately choose the technology that is best for each problem. This is a completely different approach than the well-integrated package offered by relational databases. If you actively make the choice (instead of buying a complete Big Data package with all the choices already made for you), it will give your projects huge agility.

For example, I am a really big fan of Spark, which can run well on a "Spark standalone" cluster without YARN, Mesos, or Kubernetes, and which can even run completely non-distributed on a local machine. Eventually you can mix and match individual components (like Spark, Presto, S3, Alluxio) to precisely fulfill your requirements.
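
For instance, this minimal sketch is all it takes to run Spark completely non-distributed on a laptop; the same code would run unchanged on a large cluster by only changing the master URL:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .master("local[*]")          # all local CPU cores, no cluster manager
        .appName("local-spark-demo")
        .getOrCreate())

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    print(df.count())
    spark.stop()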


Conclusion

Does your company need "Big Data" technology? If you work with an ever-increasing amount of data, then probably yes, at some point in time. Can you dump your RDBMS? Probably not, unless your company has deep pockets and devotion and accepts many intermediate failures on the journey. How much "Big Data" do you need? That depends on your requirements and on your objectives.

Big Data technology is far from being a free lunch. Chosen wisely and used for the right tasks, it can enable new use cases and boost performance. You should never simply follow any hype without knowing the implications and your requirements. This is even more true for the huge zoo of Big Data technologies.

