2003–2023: A Brief History of Big Data

Summing up 20 years of the history of Hadoop and everything around it

Furcy Pin
Towards Data Science


Whenever I enter a library in an RPG video game, I can’t help but look at every shelf to get a better grasp of the game’s universe. Does anyone else remember “A Brief History of the Empire” in the Elder Scrolls?

Big Data, and in particular the Hadoop ecosystem, was born a little more than 15 years ago and has evolved in ways few could have anticipated.

Since its birth and open-sourcing, Hadoop has become the weapon of choice to store and manipulate petabytes of data. A wide and vibrant ecosystem of hundreds of projects has formed around it, and it is still used at many large companies, even if several cloud-based proprietary solutions now rival it. With this article, I aim to quickly retrace these 15 years¹ of evolution of the Hadoop ecosystem, explain how it has grown and matured over the past decade, and describe how the broader Big Data ecosystem has kept evolving over the past few years.

So buckle up for a 20-year trip through time, as our story starts in 2003, in a small town south of San Francisco...

Disclaimer: my initial plan was to illustrate this article with the logos of the companies and software mentioned, but since extensive use of logos is prohibited on TDS, I decided to keep things entertaining with random images and useless trivia instead. It’s fun to try to remember where we were and what we were doing at the time.

2003–2006: The Beginning

Started in 2003: iTunes, Android, Steam, Skype, Tesla. Started in 2004: Thefacebook, Gmail, Ubuntu, World of Warcraft. Started in 2005: YouTube, Reddit. Started in 2006: Twitter, Blu-ray, Waze, Oblivion. (Photo by Robert Anderson on Unsplash)

It all started at the beginning of the millennium, when an already-not-so-small startup in Mountain View called Google was trying to index the entirety of the already-not-so-small internet. They faced two main challenges, both still unsolved at that scale:

How to store hundreds of terabytes of data, on thousands of disks, across more than a thousand machines, with no downtime, data loss, or even data unavailability?

How to parallelize computation in an efficient and resilient way to handle all this data across all these machines?

To better understand why this was a difficult problem, consider that in a cluster of a thousand machines, there is on average always at least one machine down²: even if each machine were unavailable only 0.1% of the time, you would still expect about one of the thousand to be down at any given moment.

From 2003 to 2006, Google released three research papers explaining their internal data architecture, which would change the Big Data industry forever. The first paper came out in 2003 and was entitled “The Google File System”. The second came out in 2004 and was entitled “MapReduce: Simplified Data Processing on Large Clusters”; it has been cited more than 21,000 times since then, according to Google Scholar. The third came out in 2006 and was entitled “Bigtable: A Distributed Storage System for Structured Data”. Even if these papers were essential to the birth of Hadoop, Google did not participate in the birth itself, as they kept their source code proprietary. The story behind that story, however, is extremely interesting, and if you haven’t heard of Jeff Dean and Sanjay Ghemawat, you should definitely read this article from the New Yorker.

Meanwhile, the father of Hadoop, Doug Cutting (soon to become a Yahoo! employee), who was already the creator of Apache Lucene (the search engine library at the core of Apache Solr and Elasticsearch), was working on a highly distributed web crawler project called Apache Nutch. Like Google, this project needed distributed storage and computation capabilities to achieve massive scale. Upon reading Google’s papers on the Google File System and MapReduce, Doug Cutting realised his current approach was wrong, drew inspiration from Google’s architecture, and in 2005 created a new subproject of Nutch, which he named after his son’s toy (a yellow elephant): Hadoop. This project started with two key components: the Hadoop Distributed File System (HDFS) and an implementation of the MapReduce framework. Unlike Google, Yahoo! decided to open source the project as part of the Apache Software Foundation, thus inviting all the other major tech companies to use and contribute to it, and to help Yahoo! narrow the technological gap with its neighbors (Yahoo! is based in Sunnyvale, next to Mountain View). As we will see, the next few years exceeded expectations. Of course, Google did quite well too.
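
To give a rough feel for the programming model described in the MapReduce paper, here is a minimal, single-machine sketch in Python that mimics the map, shuffle, and reduce phases on the classic word-count example. The real Hadoop implementation is written in Java and distributes this work across many machines, so this is purely illustrative; all names here are made up.

```python
from collections import defaultdict

# Illustrative only: a single-machine mimic of the MapReduce word-count example.
# In real Hadoop, mappers and reducers run in parallel on many nodes, and the
# framework handles shuffling, sorting, and fault tolerance.

def map_phase(documents):
    """Emit (word, 1) pairs for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle_phase(map_phase(documents))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```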

2007–2008: Hadoop’s early adopters and contributors

Started in 2007: iPhone, Fitbit, Portal, Mass Effect, Bioshock, The Witcher. Started in 2008: Apple App Store, Android Market, Dropbox, Airbnb, Spotify, Google Chrome. (Photo by Leonardo Ramos on Unsplash)

Quite soon, other companies facing similar data volume issues started using Hadoop. Back in those days, this meant a huge commitment, as they had to install and manage the clusters themselves, and writing a MapReduce job was not a walk in the park (trust me). Yahoo!’s attempt at reducing the complexity of writing MapReduce jobs resulted in Apache Pig, an ETL tool capable of translating its own language, known as Pig Latin, into MapReduce steps. But soon others started contributing to this new ecosystem as well.

In 2007, a young but rapidly growing company called Facebook, led by a 23-year-old Mark Zuckerberg, open-sourced two new projects under the Apache license: Apache Hive, followed a year later by Apache Cassandra. Apache Hive is a framework capable of converting SQL queries into MapReduce jobs on Hadoop, while Cassandra is a wide-column store aimed at accessing and updating content in a distributed way at massive scale. Cassandra did not require Hadoop to function, but it rapidly became part of the Hadoop ecosystem as connectors for MapReduce were created.

Meanwhile, a lesser-known company called Powerset, which was working on a search engine, drew inspiration from Google’s Bigtable paper to develop Apache HBase, another wide-column store relying on HDFS for storage. Powerset was soon acquired by Microsoft to bootstrap a new project called Bing.

Last but not least, another company had a decisive role in Hadoop’s quick adoption: Amazon. By launching Amazon Web Services, the first on-demand cloud, and quickly adding support for MapReduce via the Elastic MapReduce service, Amazon allowed startups to easily store their data on S3, Amazon’s distributed object store, and to deploy and run MapReduce jobs on it, without the hassle of managing a Hadoop cluster.

2008–2012: Rise of the Hadoop vendors

Started in 2009: Bitcoin, Whatsapp, Kickstarter, Uber, USB 3.0. Started in 2010: iPad, Kindle, Instagram. Started in 2011: Stripe, Twitch, Docker, Minecraft, Skyrim, Chromebook. (Photo by Spencer Davis on Unsplash)

The main pain point of using Hadoop was the great amount of effort required to set up, monitor, and maintain a Hadoop cluster. Soon enough, the first Hadoop vendor, Cloudera, was founded in 2008 and was quickly joined by Hadoop’s father, Doug Cutting. Cloudera proposed a pre-packaged distribution of Hadoop, called CDH, along with a cluster-monitoring interface, Cloudera Manager, which finally made it easy to install and maintain a Hadoop cluster along with its companion software like Hive and HBase. Hortonworks and MapR were founded soon afterwards for the same purpose. Cassandra also got its vendor when DataStax was founded in 2010.

Soon enough, everyone agreed that although Hive was a great SQL tool for handling huge ETL batches, it was a poor fit for interactive analytics and BI. Anyone used to standard SQL databases expects them to scan a table with a thousand rows in a few milliseconds at most, whereas Hive took minutes (that’s what you get when you ask an elephant to do a mouse’s job). This is when a new SQL war started, a war that is still raging today (although we’ll see that others have entered the arena since then). Once again, Google indirectly had an enormous influence on the Big Data world, by releasing in 2010 a fourth research paper, called “Dremel: Interactive Analysis of Web-Scale Datasets”. This paper described two major innovations: a distributed interactive query architecture that would inspire most of the interactive SQL engines mentioned below, and a column-oriented storage format that would inspire several new data storage formats, such as Apache Parquet, developed jointly by Cloudera and Twitter, and Apache ORC, developed jointly by Hortonworks and Facebook.
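
To make the column-oriented idea more concrete, here is a small, hedged sketch using the pyarrow library (assuming it is installed; the file, columns, and values are made up). Because Parquet stores each column separately, an analytical query that needs only one column can skip the rest of the data entirely, which is exactly the access pattern row-oriented formats cannot serve without scanning whole rows.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and write it as Parquet (column-oriented).
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["FR", "US", "US", "DE"],
    "amount": [9.99, 24.50, 3.20, 12.00],
})
pq.write_table(table, "sales.parquet")

# An analytical query that only needs one column can read just that column's
# chunks from disk, instead of scanning every full row.
amounts = pq.read_table("sales.parquet", columns=["amount"])
print(amounts.to_pydict())  # {'amount': [9.99, 24.5, 3.2, 12.0]}
```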

Inspired by Dremel, and in an attempt to solve Hive’s high-latency problem and differentiate itself from its competitors, Cloudera decided in 2012 to start a new open-source SQL engine for interactive querying called Apache Impala. Similarly, MapR started its own open-source interactive SQL engine, called Apache Drill, while Hortonworks decided that it would rather make Hive faster than create a new engine from scratch, and started Apache Tez, a sort of version 2 of MapReduce, adapting Hive to run on Tez instead of MapReduce. Two reasons probably drove this decision: first, being smaller than Cloudera, Hortonworks lacked the manpower for the same approach; second, most of its clients were already using Hive and would rather have it run faster than switch to another SQL engine. As we will see, soon enough many other distributed SQL engines appeared, and “everyone is faster than Hive” became the new motto.

2010–2014: Hadoop 2.0 and the Spark revolution

Started in 2012: UHDTV, Pinterest, Facebook reaches 1 billion active users, the “Gangnam Style” video reaches 1 billion views on YouTube. Started in 2013: Edward Snowden leaks NSA files, React, Chromecast, Google Glass, Telegram, Slack. (Photo by Lisa Yount on Unsplash)

While Hadoop was consolidating and adding a new key component, YARN (Yet Another Resource Negotiator), as its official resource manager, a role previously handled clumsily by MapReduce, a small revolution began when the open-source project Apache Spark started to gain traction at an unprecedented rate. It quickly became clear that Spark would be a great replacement for MapReduce, as it had better capabilities and a simpler syntax, and was in many cases much faster than MapReduce, especially thanks to its ability to cache data in RAM. The only downside compared to MapReduce was its initial instability, an issue that faded as the project matured. It also had great interoperability with Hive, since Spark SQL was based on Hive’s syntax (actually, they borrowed Hive’s lexer/parser at first), which made migrating from Hive to Spark SQL quite easy. It also gained great traction in the Machine Learning world, as previous attempts at writing Machine Learning algorithms over MapReduce, like Apache Mahout (now retired), were quickly outperformed by Spark implementations. To support and monetize Spark’s rapid growth, its creators founded Databricks in 2013. Since then, Databricks has aimed at making data manipulation at scale accessible to everyone, by providing simple and rich APIs in many languages (Java, Scala, Python, R, SQL, and even .NET) and native connectors to many data sources and formats (CSV, JSON, Parquet, JDBC, Avro, etc.). One interesting thing to note is that Databricks adopted a different market strategy from its predecessors: instead of proposing on-premise deployments of Spark (which Cloudera and Hortonworks quickly added to their own platforms), Databricks went for a cloud-only platform offering, starting with AWS (by far the most popular cloud at the time), followed by Azure and GCP. Nine years later, we can safely say this was a smart move.
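
For readers who have never seen it, here is a minimal PySpark sketch (assuming a local Spark installation; the events.csv file and its columns are hypothetical) showing the kind of API, including in-memory caching and the SQL interface, that made Spark so much more approachable than hand-written MapReduce jobs.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# Read a CSV file into a DataFrame; "events.csv" is a hypothetical input.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Cache the DataFrame in memory so repeated queries avoid re-reading the file.
events.cache()

# The same data can be queried with the DataFrame API...
daily = events.groupBy("event_date").count()

# ...or with plain SQL, thanks to Spark SQL.
events.createOrReplaceTempView("events")
top_users = spark.sql(
    "SELECT user_id, COUNT(*) AS nb_events "
    "FROM events GROUP BY user_id ORDER BY nb_events DESC LIMIT 10"
)

daily.show()
top_users.show()
```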

Meanwhile, new projects for dealing with real-time events were open-sourced by other rising tech companies: Apache Kafka, a distributed message queue made by LinkedIn, and Apache Storm³, a distributed real-time compute engine made by Twitter. Both were open-sourced in 2011. During this period, Amazon Web Services was also becoming more popular and successful than ever: Netflix’s incredible growth in 2010, made largely possible by Amazon’s cloud, illustrates that point on its own. Cloud competitors finally began to emerge, with Microsoft Azure becoming generally available in 2010 and Google Cloud Platform (GCP) in 2011.
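
To make the “distributed message queue” idea concrete, here is a hedged sketch using the kafka-python client (assuming the package is installed and a broker is reachable on localhost:9092; the topic name and payload are made up).

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: push raw events onto a topic (the topic name is hypothetical).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user_id": 42, "url": "/home"}')
producer.flush()

# Consumer side: another process reads the same topic at its own pace, which is
# what decouples real-time producers from downstream consumers.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no message arrives for 5s
)
for message in consumer:
    print(message.value)
```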

2014–2016: Reaching the Apex⁴

Started in 2014: Terraform, Gitlab, Hearthstone. Started in 2015: Alphabet, Discord, Visual Studio Code. (Photo by Wilfried Santer on Unsplash)

Since then, the number of projects in the Hadoop ecosystem has continued to grow exponentially. Most of them started being developed before 2014, and some became open source before then, too. The sheer number of projects started to become confusing, as we reached the point where multiple software solutions existed for every single need. More high-level projects also started to emerge, like Apache Apex (now retired) and Apache Beam (mostly pushed by Google), aiming at providing a unified interface to handle both batch and streaming processing on top of various distributed back-ends such as Apache Spark, Apache Flink, or Google’s Dataflow.
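
As a small taste of that unified interface, here is a minimal Apache Beam word-count sketch in Python (assuming the apache-beam package is installed). By default it runs on the local DirectRunner, but the same pipeline code can be submitted to Spark, Flink, or Google Dataflow by changing the runner options.

```python
import apache_beam as beam

# A tiny word-count pipeline; by default it runs on the local DirectRunner,
# but the same code can target Spark, Flink, or Dataflow runners.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["big data", "big ecosystems", "data pipelines"])
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```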

We can also mention that we finally started to see good open-source schedulers arrive on the market, thanks to Airbnb and Spotify. Scheduler usage is generally tied to the business logic of the enterprise using it, and a scheduler is a pretty natural and straightforward piece of software to write, at least at first; then you realize that keeping it simple and easy to use for others is very hard. This is why pretty much every big tech company has written and (sometimes) open-sourced its own: Yahoo!’s Apache Oozie, LinkedIn’s Azkaban, Pinterest’s Pinball (now retired), and many more. However, there was never a wide consensus that any of them was a particularly good choice, and most companies stuck to their own. Fortunately, around 2015, Airbnb open-sourced Apache Airflow while Spotify open-sourced Luigi⁵, two schedulers that quickly reached wide adoption across other companies. In particular, Airflow is now available in SaaS mode on Google Cloud Platform and Amazon Web Services.
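
For readers who have never used one of these schedulers, here is a minimal Airflow DAG sketch (the DAG name, tasks, and schedule are made up; the syntax roughly follows Airflow 2.x).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal daily pipeline: extract, then transform, then load.
with DAG(
    dag_id="daily_sales_pipeline",   # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Dependencies: extract runs before transform, which runs before load.
    extract >> transform >> load
```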

On the SQL side, several other distributed data warehouses emerged, aiming to provide faster interactive query capabilities than Apache Hive. We already talked about Spark SQL and Impala, but we should also mention Presto, open-sourced by Facebook in 2013, which Amazon used in 2016 as the basis for its SaaS offering, Athena, and which has since been forked into Trino by its original developers after they left Facebook. On the proprietary side, several distributed SQL analytics warehouses were released too, such as Google’s BigQuery, first released in 2011, Amazon’s Redshift in 2012, and Snowflake, founded in 2012.

To get a list of all the projects that are referenced as part of the Hadoop Ecosystem, check out this page where more than 150 projects are referenced.

2016–2020: The rise of containerisation and deep learning, and the downfall of Hadoop

Started in 2016: Oculus Rift, AirPods, TikTok. Started in 2017: Microsoft Teams, Fortnite. Started in 2018: GDPR, the Cambridge Analytica scandal, Among Us. Started in 2019: Disney+, Samsung Galaxy Fold, Google Stadia. (Photo by Jan Canty on Unsplash)

In the following years, everything kept accelerating and interconnecting. Keeping up with the list of new technologies and companies in the Big Data market became increasingly difficult, so to keep things short I will speak of four trends that, in my eyes, had the most impact on the Big Data ecosystem.

The first trend was the massive migration of data infrastructures to the cloud, where HDFS was replaced by cloud object stores like Amazon S3, Google Cloud Storage, or Azure Blob Storage.

The second trend was containerisation. You have probably heard about Docker and Kubernetes already. Docker is a containerisation framework that was started in 2011 and quickly gained popularity from 2013 onwards. In June 2014, Google open-sourced Kubernetes (a.k.a. K8s), a container orchestration system inspired by its internal Borg system, and it was immediately adopted by many companies to build the foundation of their new distributed, scalable architectures. Docker and Kubernetes allowed companies to deploy new types of distributed architectures, more stable and scalable, for many use cases, including event-based real-time transformations. Hadoop took some time to catch up with Docker, as support for launching Docker containers in Hadoop only arrived with version 3.0 in 2018.

The third trend, as mentioned earlier, was the rise of fully managed, massively parallel SQL data warehouses for analytics. The rise of the “Modern Data Stack” and of dbt, which was first open-sourced in 2016, illustrates that point well.

Finally, the fourth trend that impacted Hadoop was the advent of Deep Learning. In the second half of the 2010s, everyone heard about Deep Learning and AI: AlphaGo passed a milestone by beating the world champion Ke Jie at the game of Go, just as IBM’s Deep Blue had done with Kasparov at chess twenty years before. This technological leap, which has already accomplished marvels and promises even more, like self-driving cars, is often associated with Big Data, as it requires crunching huge amounts of information during training. However, Hadoop and Machine Learning were two very different worlds, and they had a hard time working together. In fact, Deep Learning drove the need for new approaches to Big Data and proved that Hadoop wasn’t the right tool for everything.

Long story short: data scientists working on deep learning needed two things that Hadoop wasn’t able to provide at the time. They needed GPUs, which Hadoop cluster nodes usually did not have, and they needed to install the latest versions of their deep learning libraries, such as TensorFlow or Keras, which was difficult to do on a whole cluster, especially when multiple users asked for different versions of the same library. This particular problem was well addressed by Docker, but the Docker integration for Hadoop took quite some time to become available, and data scientists needed it right away. They therefore generally preferred to spawn a single VM on steroids with 8 GPUs rather than use a cluster.

This is why, when Cloudera made its IPO in 2017, it was already focusing its development and marketing on its latest product, the Data Science Workbench, which was based not on Hadoop or YARN but on containerisation with Docker and Kubernetes, and which allowed data scientists to deploy their models, in their own environments, as containerised applications without risking security or stability issues.

This wasn’t enough to stop the decline, though. In October 2018, Hortonworks and Cloudera merged, and only the Cloudera brand remained. In 2019, MapR was acquired by Hewlett Packard Enterprise (HPE). In October 2021, a private investment firm named CD&R acquired Cloudera at a stock price lower than its initial one.

The decline of Hadoop does not mean its death, though, as many large companies still use it, especially for on-premise deployments, and all the technologies that were built around it keep using it, or at least parts of it. Innovations are still being made, too. For instance, new storage formats have been open-sourced, such as Apache Hudi, initially developed at Uber in 2016; Apache Iceberg, started at Netflix in 2017; and Delta Lake, open-sourced by Databricks in 2019 (see the sketch below). Interestingly, one of the main goals behind these new file formats was to circumvent a consequence of the first trend I mentioned: Hive and Spark were initially built for HDFS, and some of the performance properties guaranteed by HDFS were lost in the migration to cloud storage like S3, which caused inefficiencies. But I won’t go into the details here, as this particular subject would require another full article.
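
As a rough illustration of what these table formats add on top of raw files, here is a hedged PySpark sketch using Delta Lake (assuming the delta-spark package is installed and the Spark session is configured with the Delta extensions; the path and data are made up). The transaction log is what brings ACID guarantees and time travel to files sitting in object storage.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured for Delta Lake
# (e.g. via the delta-spark package and its SQL extensions).
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "open"), (2, "closed")],
    ["order_id", "status"],
)

# Writing as "delta" stores Parquet files plus a transaction log, which is what
# provides ACID guarantees on top of object storage.
df.write.format("delta").mode("overwrite").save("/tmp/orders_delta")

# Readers always see a consistent snapshot, and older versions remain
# queryable ("time travel").
current = spark.read.format("delta").load("/tmp/orders_delta")
version_zero = (
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/orders_delta")
)
current.show()
version_zero.show()
```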

2020–2023: The modern era

Started in 2020: COVID-19 pandemic. Started in 2021: Log4Shell vulnerability, Meta, Dall-E. Started in 2022: War in Ukraine, Midjourney, Stable Diffusion. (Photo by Jonathan Roger on Unsplash)

Nowadays, Hadoop deployments in the cloud have mostly been replaced by Apache Spark or Apache Beam⁶ applications (the latter mostly on GCP), to the benefit of Databricks, Amazon’s Elastic MapReduce (EMR), Google Dataproc/Dataflow, and Azure Synapse. I have also seen many young companies aim straight for the “Modern Data Stack” approach, built around a SQL analytics warehouse such as BigQuery, Databricks SQL, Athena or Snowflake, fed by no-code (or low-code) data ingestion tools and organised with dbt; these companies don’t seem to need distributed computing tools like Spark at all. Of course, companies that prefer on-premise deployments are still using Hadoop and other open-source projects like Spark and Presto, but the proportion of data moved to the cloud keeps increasing every year, and I see no reason for that to change for now.

As the data industry kept maturing, we also saw more metadata management and data catalog tools being built and adopted. In that space, we can mention Apache Atlas, started by Hortonworks in 2015; Amundsen, open-sourced by Lyft in 2019; and DataHub, open-sourced by LinkedIn in 2020. Many private technology startups appeared in that segment, too.

We have also seen startups built around new scheduler technologies, like Prefect, Dagster and Flyte, whose open-source repositories were started in 2017, 2018 and 2019 respectively, and which are challenging Airflow’s current hegemony.

Finally, the concept of the lakehouse has started to emerge. A lakehouse is a platform that combines the advantages of a data lake and a data warehouse⁷. It allows Data Scientists and BI users to work within the same data platform, thus making governance, security, and knowledge sharing easier. Databricks was the first to coin the term and position itself on this product offering, thanks to Spark’s versatility between SQL and DataFrames. It was followed by Snowflake with Snowpark, by Azure Synapse, and more recently by Google with BigLake. On the open-source side, Dremio has provided a lakehouse architecture since 2017.

2023: Who can tell what the future will be like?

Started in 2023: who knows? (Photo by Annie Spratt on Unsplash)

Since it all started, the number of open-source projects and startups in the Big Data world has kept increasing, year after year (just take a look at the 2021 landscape to see how huge it has become). I remember that around 2012, some people were predicting that the new SQL wars would end and true victors would eventually emerge. This has not happened yet. How all of this will evolve in the future is very difficult to predict, and it will take a few more years for the dust to settle. But if I had to take some wild guesses, I would make the following predictions.

  1. As others have already noted, the main existing data platforms (Databricks, Snowflake, BigQuery, Azure Synapse) will keep improving and adding new features to close the gaps between one another. I expect to see more and more connectivity between every component, and also between data languages like SQL and Python.
  2. We might see a slowdown in the number of new projects and companies over the next couple of years, although this would stem more from a lack of funding after the burst of a new dotcom bubble (if that ever happens) than from a lack of will or ideas.
  3. Since the beginning, the scarcest resource has been a skilled workforce. This meant that for most companies⁸, it was simpler to throw more money at performance problems, or to migrate to more cost-effective solutions, than to spend more time optimizing them, especially now that storage costs in the main distributed warehouses have become so cheap. But perhaps at some point the price competition between vendors will become harder for them to sustain, and prices will go up. Even if prices don’t go up, the volume of data stored by businesses keeps increasing year after year, and the cost of inefficiency increases with it. Perhaps at some point we will see a new trend where people start looking for newer, cheaper open-source alternatives, and a new Hadoop-like cycle will start again.
  4. In the long term, I believe the real winners will be the cloud providers: Google, Amazon and Microsoft. All they have to do is wait and see in which direction the wind blows, bide their time, then acquire (or simply reproduce) the technologies that work best. Each tool that gets integrated into their clouds makes things so much easier and more seamless for users, especially when it comes to security, governance, access control, and cost management. As long as they don’t make major organisational mistakes, I don’t see how anyone could catch up to them now.

Conclusion

I hope you enjoyed this trip down memory lane with me, and that it helped you better understand (or simply remember) where and how it all started. I tried to make this article easy to follow for everyone, including non-technical people, so don’t hesitate to share it with your colleagues who are interested in knowing where Big Data comes from.

To conclude, I would like to stress that human knowledge and technology in AI and Big Data would never have advanced so quickly without the magic power of open source and knowledge sharing. We should be thankful to Google, who initially shared their knowledge through academic research papers, and we should be thankful to all the companies who open-sourced their projects. Open source and free (or at least cheap) access to technology has been the largest driver of innovation for the internet economy over the past 20 years. Software innovation really took off in the 1980s, once people could afford home computers. The same goes for 3D printing, which had been around for decades but only took off in the 2000s with the arrival of self-replicating machines, and for the arrival of the Raspberry Pi, which fueled the DIY movement.

Open source and easy access to knowledge should always be encouraged and fought for, even more than they are now. It is a never-ending battle. One such battle, perhaps the most important one, is happening right now with AI. Large companies did contribute to open source (for instance Google with TensorFlow), but they also learnt how to use open-source software as Venus flytraps to lure users into their proprietary ecosystems, while keeping the most critical (and hardest to replicate) features behind patents.

It is vital for humanity and the world economy that we continue to support open-source and knowledge-sharing efforts (such as Wikipedia) as best we can. Governments, citizens, companies, and most of all investors must understand this: growth may be driven by innovation, but innovation is driven by sharing knowledge and technologies with the masses.

“Do what you must. Come what may” (Sentence written on the walls of the Chapelle de l’Humanité in Paris)

Footnotes

¹ : It’s even 20 years if we count the prequel from Google, hence the title.

² : Maybe in 2022 we made enough progress on hardware reliability to make this less true, but that was definitely the case 20 years ago.

³ : In 2016, Twitter open sourced Apache Heron (still in the Apache incubation phase, it seems) to replace Apache Storm.

⁴: pun intended.

⁵ : In 2022, Spotify decided to stop using Luigi and switched to Flyte.

⁶: I suspect Apache Beam is mostly used on GCP with Dataflow.

⁷: As Databricks puts it, a lakehouse combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses.

⁸: Of course, I’m not talking about companies the size of Netflix or Uber here.



[Available for freelance work] Data Engineer, Data Plumber, Data Librarian, Data Smithy.