A tour through history: how we ended up here, what capabilities we’ve unlocked, and where we go next.

How It All Began (1940s)
A long time ago, in December 1945, the first electronic general-purpose digital computer was completed. It was called ENIAC (Electronic Numerical Integrator and Computer). It marked the start of an era where we produced computers for multiple classes of problems instead of custom building for each particular use case.
To compare the performance, ENIAC had a maximum clock of around 5 kHz on a single core, while the latest chip in an iPhone (Apple A13) runs at 2.66 GHz across 6 cores. That works out to roughly three million times more cycles per second, on top of improvements in how much work is accomplished in each of those cycles.
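A quick back-of-the-envelope check (treating all six A13 cores as if they ran at the 2.66 GHz peak clock, which overstates the efficiency cores but is fine for an order-of-magnitude estimate):

```python
# Rough comparison of aggregate clock cycles per second, ENIAC vs. Apple A13.
# Assumes every A13 core runs at the 2.66 GHz peak clock, which is generous
# to the efficiency cores but good enough for an order-of-magnitude figure.
eniac_hz = 5_000        # ~5 kHz, single "core"
a13_hz = 2.66e9         # ~2.66 GHz peak clock
a13_cores = 6

ratio = (a13_hz * a13_cores) / eniac_hz
print(f"~{ratio:,.0f}x more cycles per second")  # roughly 3.2 million
```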
Historically, we’ve gone through cycles of building on the newest hardware advances to unlock new software engineering capabilities. The pattern is one of increasing flexibility that also demands more responsibility from the engineers, followed by a push to reduce that added burden while keeping the same flexibility. That reduction comes from codifying the best practices that get teased out as we understand which patterns work within a particular abstraction.
The Evolution of Data (1960s-1990s)
Historically, servers were costly and had limited storage, memory, and compute, so solving the problems we cared about demanded significant effort from programmers, such as manual memory management. In contrast, today we have languages with automated garbage collection to handle this for us. This is why C, C++, and FORTRAN were used so much and continue to be used for high-performance use cases where we try to drive as much efficiency and value out of a system as possible. Even today, most data analytics and machine learning frameworks in Python call out to C to ensure performance and only expose an API to the programmer.
To extract as much value as possible out of systems organizing data, companies like IBM invested heavily in particular models for storing, retrieving, and working with data. From this work came the hierarchical data model, which was extremely prevalent during the days of big metal on the mainframe. By creating a standard model, they reduced the mental effort required to get a project started and increased the knowledge that could be shared between projects.

Mainframes worked for the day’s problems but were prohibitively expensive, so only the largest enterprises, such as banks, were able to leverage them effectively. Hierarchical databases were very efficient at traversing tree-like structures, but they imposed a strict one-to-many relationship that could be difficult for programmers to express and made their applications hard to change.
Later on, the relational model was created, which powers most of our databases today. In the relational model, data is represented as sets of tuples (tables) with relations between them. A typical relationship is a foreign key, which declares that data in one table must reference data in another. You can’t have a grade without a student to tie it to, and you can’t have a class without a teacher to teach the lesson.

Due to the structure that is applied to the data, we can define a standard language to interact with data in this form. The relational model was proposed by Edgar Codd at IBM, and the Structured Query Language (SQL) built on top of it has become the de-facto standard for accessing data today. This is because SQL is easy to read while also being extremely powerful; SQL is even Turing complete when the system supports recursion and windowing functions. Turing completeness roughly means the language can express any computation, given enough time and memory. That’s an excellent theoretical property of SQL, but it doesn’t mean it is the best tool for every job. This is why we use SQL to access and retrieve data, but leverage Python and other languages to do advanced analytics against the data.
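As a minimal illustration of both ideas, here is a hypothetical students/grades schema in SQLite driven from Python; the table and column names are made up, but the foreign key and the declarative query show the relational model and SQL in miniature:

```python
import sqlite3

# In-memory database purely for illustration; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE grades (
        id INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL REFERENCES students(id),  -- foreign key
        score REAL
    )
""")

conn.execute("INSERT INTO students VALUES (1, 'Ada')")
conn.execute("INSERT INTO grades VALUES (1, 1, 95.0)")

# Declarative SQL: describe the result you want, not how to compute it.
rows = conn.execute("""
    SELECT s.name, AVG(g.score)
    FROM grades g JOIN students s ON g.student_id = s.id
    GROUP BY s.name
""").fetchall()
print(rows)  # [('Ada', 95.0)]

# A grade without a student violates the constraint:
# conn.execute("INSERT INTO grades VALUES (2, 999, 80.0)")  -> IntegrityError
```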
Oracle released the first commercial relational database product in 1979. These systems are called Relational Database Management Systems (RDBMS). Since then, dozens of commercial and open-source RDBMS have been released with varying amounts of fanfare. These early open-source contributions helped make the Apache Software Foundation the de-facto home for tooling in the "Big Data" space, with its permissive license enabling commercial activity on top of the libraries’ core source code.
These systems work exceptionally well for managing and accessing normalized data structures; however, as data volumes grow, their performance starts to buckle under the load. There are several optimizations we can leverage to reduce the pressure on these systems, such as indexing, read replicas, and more. The subject of optimizing RDBMS performance could fill several more blog posts and books, and is out of scope for this discussion.
The Connected World and All of Its Data (1990s)
Before computers were a staple in every home and cell phones were in everyone’s pockets, there was significantly less communication and far less data surrounding that communication. Today, almost every website you visit has tracking enabled to better understand the user experience and provide personalized results back to customers.
This explosion of data collection came from the ability to automate it; historically, users had to provide feedback in the form of surveys, phone calls, and the like. Today we’re tracked by our activities, and our actions speak louder than our thoughts. Netflix no longer lets you rank or score movies, as they found that the signal wasn’t driving utilization of the ecosystem.
Google’s Search Index and the Need for MapReduce (early 2000s)
Google was one of the first companies to run into a scale of data collection that required sophisticated technology; however, in the late ’90s it didn’t have the enormous budget it has now as one of the most profitable companies in the world. Google’s competitive advantage is its data and how effectively it leverages that data. Let’s look into one of the earliest struggles it had as a business.
The engineering leads at Google had a hard problem to solve, and they could not afford the vastly more expensive enterprise-grade hardware on which traditional companies had relied. At the same time, they had the same, if not greater, computing requirements as those organizations. Google had built a Frankenstein system to keep up with the growing demands of its business and of the web in general. The parts in its servers were consumer-grade and prone to failure, and the code that ran on top of them was neither scalable nor robust to these failure events.
Jeff Dean and Sanjay Ghemawat were among the engineering leadership at the time, diligently rewriting parts of the Google codebase to be more resilient to failures. One of the biggest problems they had was that hardware would fail while jobs were running, and the jobs needed to be restarted. As they kept solving these problems within the codebase, they noticed a consistent pattern and realized they could abstract it and build a framework around it. This abstraction would come to be known as MapReduce.
MapReduce is a programming model defined as a two-step process: the map phase and the reduce phase. Maps apply a function to each element independently, and reductions aggregate the results. This framework provides a simple interface through which work can be split intelligently across different workers in a cluster. If we think this through, we realize that if we can split up our work, we can also recover the work lost to a single failure without redoing the entire job. This translates into a much more scalable and robust system. There are many ways to improve the performance of a MapReduce system that we’ll get into as we move through the abstraction layers built up from this concept over time.
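A minimal sketch of the model in plain, single-process Python, with no distribution or fault tolerance, just the shape of the two phases around the classic word count example:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: applied element-wise, each document emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key so each reducer sees one word.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real MapReduce system, each of these three stages runs across many machines, and any failed map or reduce task can be re-run on another worker without redoing the whole job.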

By leveraging the MapReduce framework, Google effectively scaled out its infrastructure with commodity servers that were cheap and easy to build and maintain. Failures could be addressed automatically in the code, which could even alert operators that a server might need repairs or replacement parts. This effort saved a significant number of headaches as the web graph grew so large that no single computer, let alone supercomputer, could handle the scale.
Both Jeff and Sanjay are still with Google and have collaborated on many technological advancements that shape the data landscape. A fantastic article was written about this inflection point for Google by James Somers at The New Yorker.
MapReduce as an Open Source Implementation – Introduction of Hadoop (mid 2000s)
Google tends to keep its tooling internal, as it is purpose-built for its internal systems, but Google also shares its knowledge in the form of academic papers. The "MapReduce: Simplified Data Processing on Large Clusters" paper outlines the MapReduce programming model and includes a sample implementation of a word count algorithm in the appendix. The amount of code required to write a MapReduce job is quite a bit higher than a Python script doing the same thing; however, the Python script will only run on a single thread and be limited in throughput, while the MapReduce job will scale across as many servers as we need to complete the results in a reasonable time frame.
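For contrast, a single-machine Python word count is only a few lines (the file path here is just a placeholder), but it is bounded by one node’s memory and throughput:

```python
from collections import Counter

# Reads the whole file on one machine; simple, but it cannot scale past a
# single node's memory and CPU. "corpus.txt" is a placeholder path.
with open("corpus.txt") as f:
    counts = Counter(word for line in f for word in line.split())

print(counts.most_common(10))
```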
We traded complexity for the ability to scale, and the first systems enabling MapReduce were very cumbersome. Companies like Google, Facebook, and Yahoo, technology giants that were scrappy startups at the time, struggled to make their large data volumes accessible to everyone in their organizations. Many lessons were learned, and various tools were built to solve these problems. The rest of the article will focus on what we learned and how it was brought to each successive generation of tools in the "Big Data" space.
In the mid-2000s, Doug Cutting and Mike Cafarella read the Google papers on MapReduce and on Google’s distributed file system, the Google File System. They were working on Nutch, a distributed web crawler, and realized they had the same problems scaling out their system. Doug was looking for full-time work and ended up interviewing with Yahoo, which bought into the idea of building an open-source system to enable large-scale indexing, as it was falling behind Google at the time. Yahoo hired Cutting to continue working on Nutch, which would spin out a distributed file system and a computation framework, both core components of what would be called Hadoop: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. This would become confusing nomenclature as Hadoop entered the hype cycle, since both were often shortened to simply Hadoop.
MapReduce depends on a distributed file system because it needs to be resilient to failures across the entire hardware stack. Jeff Dean has described this in "The rise of cloud computing systems," where he says, "Reliability must come from software," and lists the failure events in a typical Google cluster in 2006:
- ~1 network rewiring (rolling ~5% of machines down over 2-day span)
- ~20 rack failures (the 40–80 machines within a rack instantly disappearing, 1–6 hours to get back)
- ~5 racks go wonky (40–80 machines within a rack see 50% packet loss)
- ~8 network maintenances (4 might cause ~30-min random connectivity losses)
- ~12 router reloads (takes out DNS and external vips for a couple minutes)
- ~3 router failures (have to immediately pull traffic for an hour)
- ~dozens of minor 30-second blips for DNS
- ~1000 individual machine failures
- ~thousands of hard drive failures; plus slow disks, bad memory, misconfigured machines, flaky machines, etc.
Many other considerations were essential to improving performance across the stack, such as minimizing network latency by moving data as little as possible. To achieve this, we brought the compute to the storage, with large disks in each server instead of keeping storage separate in a traditional NFS or SAN solution. When bringing compute to the storage, we don’t want to overburden one particular node in the cluster with compute just because it holds the data. We also want to make sure that if one node fails, we still have a copy of the data and can split the work effectively. If we take it a step further and look at the types of failure events Google listed, we also need to plan for node failures and rack failures. This means we need rack-awareness in our storage solution as well as multiple copies of the data split across racks.
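A toy sketch of what rack-awareness means in practice; this is illustrative only and not HDFS’s actual placement policy, but the idea is the same: never let a single rack hold every copy of a block.

```python
import random

# Hypothetical cluster map: rack name -> nodes in that rack.
cluster = {
    "rack-1": ["node-1", "node-2", "node-3"],
    "rack-2": ["node-4", "node-5", "node-6"],
    "rack-3": ["node-7", "node-8", "node-9"],
}

def place_replicas(cluster, replicas=3):
    """Choose nodes for `replicas` copies of a block, spread across two racks
    so that a whole-rack failure can never take out every copy."""
    first_rack, second_rack = random.sample(list(cluster), k=2)
    placement = [random.choice(cluster[first_rack])]                # one copy on one rack
    placement += random.sample(cluster[second_rack], replicas - 1)  # the rest on another rack
    return placement

print(place_replicas(cluster))  # e.g. ['node-2', 'node-7', 'node-9']
```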

If this all seems very complicated – that’s because it is! Additionally, the first generation of MapReduce, Hadoop MapReduce, was equally involved, requiring in-depth knowledge of the MapReduce process and the fundamentals of performance optimization within the framework. One of the significant challenges with MapReduce was defining anything non-trivial as a set of MapReduce processes; there was no expressive system for creating complex logic. Instead, Hadoop MapReduce focused on providing the lowest-level blocks for building scalable calculations, not on coordinating them effectively. One additional caveat of Hadoop MapReduce is that it was completely disk-driven, and nothing was stored in memory. Memory was significantly more expensive when MapReduce came out, and if a dataset couldn’t fit in memory on any of the machines, jobs that depended on it would continuously fail. All of these considerations meant that Hadoop MapReduce was slow and hard to write for, but very stable and scalable.
Several solutions were presented to address these challenges, such as Pig (scripting), Hive (SQL), and MRJob (config). These all effectively boiled down to wrappers over the top of Hadoop MapReduce that enabled iterative algorithms, less boilerplate, and slightly better abstractions in the code. These tools allowed less technical people to leverage the "Big Data" ecosystem without being senior software engineers with significant Java experience, but they still left much to be desired.
Reducing Responsibility while Improving Flexibility with Abstractions (early 2010s)
With the advent and growing adoption of the cloud through Amazon Web Services (AWS) in the early 2010s, people began to think about how to run their Hadoop workloads on AWS. The cloud was everything they had dreamed of for their analytical workloads, which typically ran as short bursts while the servers sat idle in a data center the rest of the time.
However, Hadoop MapReduce was heavily reliant on the HDFS file system at the time, making it impossible to leverage the scalable storage the cloud offered, which would have allowed clusters to be decommissioned when not in use. How could we move these workloads to the cloud while also gaining the cloud’s cost benefits by scaling with our elastic workloads? What if we slightly relaxed the no-memory constraint and kept some data in memory for much lower latency? How could we support iterative processes such as those used in Machine Learning?
These ideas were central to the advent of Spark, whose original paper was published in 2010 and which reached its 1.0 release in 2014. The authors saw up to 10x performance increases by allowing even modest amounts of data to be kept in memory, which also reduced the programming complexity dramatically.
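A small PySpark sketch of the idea using the RDD API of that era (the file path is a placeholder and the loop stands in for an iterative algorithm): marking a dataset as cached keeps it in memory after the first pass, so later iterations avoid re-reading it from disk.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "caching-example")

# Placeholder path; assumes one number per line. Could live on HDFS, S3, or locally.
numbers = sc.textFile("events.txt").map(lambda line: float(line))

# cache() keeps the parsed data in memory after the first action, so the
# repeated passes below don't re-read and re-parse the file each time.
numbers.cache()

for _ in range(10):  # stand-in for an iterative algorithm
    positives = numbers.filter(lambda x: x > 0).count()

sc.stop()
```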
Spark in its 1.x form wasn’t an end-all solution for scaling out and working with data efficiently. It still suffered from a complex programming model that required significant knowledge to write efficient code. But it provided a great deal of flexibility and removed responsibilities that didn’t need to sit with the engineer. It was a fantastic second-generation tool, and Spark’s popularity has grown dramatically in the years since; it surpassed Hadoop and continues to evolve as new requirements unfold.

Improving Experience (mid 2010s)
By 2016, Spark had matured enough for the team to realize that its programming model was becoming the bottleneck for adoption. More people without software engineering experience were being tasked with creating value from data, and a new field, data science, was emerging, focused on extracting value from data in a scientific manner rather than just reporting on it with traditional business intelligence tools.
Pandas had made a name for itself as a tool that anyone could use to work with data. Created by Wes McKinney at AQR Capital Management, it provided a simple API to make life easier for quantitative researchers, who were writing the same code over and over to analyze data. The library was open-sourced and has become the standard for how people learn to interact with data programmatically.
The team working on Spark realized that many of their users were working in Pandas for their initial exploration and then moving to Spark once they had solidified their ideas. This split in tooling made the workflow inconsistent and led to significant effort rewriting code to fit Spark’s MapReduce-style model. Spark responded by adding the DataFrame API, which closely resembles the Pandas API, meaning users could keep the same workflow but have it run against Spark’s distributed engine, gaining scale-out capabilities even during the exploratory phase of analysis.
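A hedged sketch of how similar the two can look for a simple aggregation; the column names and data are made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# Pandas: single machine, in memory.
pdf = pd.DataFrame({"city": ["NYC", "NYC", "LA"], "sales": [10, 20, 5]})
print(pdf.groupby("city")["sales"].mean())

# Spark DataFrame API: nearly the same shape of code, but it runs distributed.
spark = SparkSession.builder.appName("pandas-vs-spark").getOrCreate()
sdf = spark.createDataFrame(pdf)
sdf.groupBy("city").agg(F.avg("sales").alias("avg_sales")).show()
spark.stop()
```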
In addition to the programmatic usage of Spark, several of their users also wanted to leverage SQL for accessing data. As discussed above, SQL has been the standard for accessing data for almost 50 years. The Spark team made a fantastic decision at this point: they enabled Spark SQL but utilized the same underlying optimization engine they had written for the DataFrame API. By unifying the optimization engine, any improvement they made would benefit not only the SQL interface but also the programmatic interface. This engine is known as the Catalyst Optimizer, and it works much like a query plan optimizer in a traditional RDBMS. All of this work culminated in Spark 2.0, which dramatically improved usability.
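A small sketch of that unification (again with made-up data): the SQL and DataFrame versions of the same query both go through Catalyst, and calling explain() on each shows essentially the same physical plan.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()
df = spark.createDataFrame([("NYC", 10), ("NYC", 20), ("LA", 5)], ["city", "sales"])
df.createOrReplaceTempView("sales")

# The same logical query, expressed two ways.
via_sql = spark.sql("SELECT city, AVG(sales) AS avg_sales FROM sales GROUP BY city")
via_df  = df.groupBy("city").agg(F.avg("sales").alias("avg_sales"))

# Both are compiled by the Catalyst optimizer into equivalent physical plans.
via_sql.explain()
via_df.explain()
spark.stop()
```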

Reducing Responsibility with the same Flexibility (2020)
Spark recently took everything a step further with its latest 3.0 release in June 2020, but instead of focusing on new capabilities, they focused on improving performance, reliability, and usability.
To increase performance, Spark has added adaptive query execution and dynamic partition pruning, which are relatively new advancements even in the RDBMS landscape. These enhancements can deliver massive gains: in the TPC-DS benchmarks, dynamic partition pruning speeds up 60 out of 102 queries by between 2x and 18x. This means even less compute will be required as companies leverage Spark, reducing infrastructure costs, improving time to value, and letting teams focus more on results. These performance enhancements are transparent to the user and require no code rewrites.
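Because these optimizations live inside the engine, enabling them is a configuration concern rather than a code change. A minimal sketch using the configuration keys documented for Spark 3.0 (defaults vary by version):

```python
from pyspark.sql import SparkSession

# Adaptive query execution re-plans stages at runtime using actual statistics;
# dynamic partition pruning skips partitions that a join makes irrelevant.
spark = (
    SparkSession.builder
    .appName("spark-3-optimizations")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate()
)

# Existing DataFrame and SQL code runs unchanged; the optimizer applies these
# improvements transparently where they help.
```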
To increase reliability, the team closed a significant number of bugs (around 3,400) between the 2.x and 3.0 branches, provided better Python error handling to simplify usage of the Python API, and more. These enhancements once again improve flexibility without changing the responsibility model of Spark.
To increase usability, Spark provides a new UI for streaming use cases with improved visibility into micro-batch statistics and the overall state of the pipeline. In many use cases it was previously hard to understand when the system was experiencing delays and how to respond effectively, such as by scaling out new nodes or scaling in to reduce costs.
Spark 3.0 is continuing to improve the ecosystem by reducing the responsibility of their end users while keeping the same level of flexibility available. This allows us to solve even more challenges and strengthen our operations dramatically to keep us focused on value-add work.

Current Challenges and What the Future Holds
In this article, we focused on Hadoop and Spark to trace the evolution of the ecosystem over time and lay out its history. There are hundreds of tools within the "Big Data" space, each with its own use cases and the challenges it aims to meet. If you look at SQL-like interfaces to data in a distributed framework, you’ll take a ride through history with Hive, Presto, and Impala, each of which solves different challenges within the space. If you’re interested in data serialization formats that reduce storage and compute time compared to the traditional, highly inefficient CSV, you’ll investigate Avro, Parquet, ORC, and Arrow. There could be thousands of blog posts going through their histories and building our understanding of where we are today and how we got here.
Since the start of the "Big Data" hype cycle around 2010, we’ve advanced enormously in technology. We now have cloud providers such as Amazon, Microsoft, and Google enabling capabilities that could only be dreamed about before. The cloud focuses on elastically scaling your workloads to meet your needs. In exchange, you pay a premium over what a data center would cost and reduce your need to maintain the operational, regulatory, and technical aspects of data center infrastructure. If we look at analytics use cases, hardware utilization is all over the map, making them a perfect target for the cloud.
The cloud providers have made scaling up, down, and out incredibly easy for these "Big Data" workloads. There is EMR (Amazon), Azure Databricks (Microsoft), and Dataproc (Google), and each of these providers supports ephemeral workloads in the cloud. They’ve moved storage to their object storage solutions (S3, Azure Data Lake Storage, and Google Cloud Storage, respectively), decoupling compute from storage. If you need to spin up an analysis, you do so in a completely isolated cluster without affecting anyone else’s running jobs. Tuning an on-premise cluster used to be an incredible challenge, as different workloads had different performance constraints. With that abstracted away, our teams can focus on their goals instead of the infrastructure.
Additionally, we want to reduce not only the mental overhead of our infrastructure but also that of our code. We have both real-time and batch use cases in today’s world, and historically we had to maintain two separate code bases to support the differences between them. It turns out that batch is just a special case of streaming, so doesn’t it make sense to unify the way we work with both? Flink, Spark, and Beam are all working to solve this in some fashion, either as a first-class citizen (Beam) or by altering their APIs to make this more comfortable for the end user (Flink/Spark).
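A sketch of what that unification looks like with Spark’s Structured Streaming (paths, schema, and column names are placeholders): the transformation logic is shared, and only the source and sink differ between the batch and streaming versions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

def summarize(df):
    """Same business logic for both modes: count events per user."""
    return df.groupBy("user_id").agg(F.count("*").alias("events"))

# Batch: read a static directory of JSON files (placeholder path).
batch_df = spark.read.json("data/events/")
summarize(batch_df).show()

# Streaming: treat new files arriving in that directory as an unbounded table.
stream_df = spark.readStream.schema(batch_df.schema).json("data/events/")
query = (summarize(stream_df)
         .writeStream.outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```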
At the heart of most of these advancements is the need to enable extremely complicated analytics on the most massive data sets in the world. We need to facilitate machine learning and artificial intelligence (ML/AI) use cases against our data. To do this, we need to be extremely efficient not just in processing our data, but also in the building blocks used to process it, without impeding the end user. The democratization of ML/AI tools has been occurring over the last twenty years with scikit-learn, Keras, TensorFlow, PyTorch, MXNet, and many more libraries that enable traditional statistical modeling in addition to deep learning use cases. The tools in the "Big Data" space aim to take the lessons learned from these tools and integrate them directly into their ecosystems.
So What Exactly Is "Big Data"?
Big data is when you have a use case for which scaling up is no longer economical, and you instead need to look toward scale-out solutions. This threshold differs from company to company, each having its own unique needs and constraints. Some use cases historically required scale-up approaches, but as we’ve improved the abstractions for working within scale-out programming models such as MapReduce, we have been able to recast them as scale-out solutions.
Several companies scale up on a single machine by shoving as much memory as possible into it. Some companies run SAP workloads on virtual machines (VMs) with 24 TB of memory; others can only afford to scale up to 1.5 TB on more traditional bare-metal machines. However, this only gives us faster access to the data. What about when we need to run non-trivial computations against it, such as for Machine Learning or Artificial Intelligence use cases? We can only fit so many cores in a single machine; for instance, the 24 TB memory VM for SAP use cases has 384 cores. Most "Big Data" clusters have thousands of CPUs available and are significantly more cost-effective than purchasing such a large VM, which is purpose-built for these SAP use cases.
If you have more than 100 GB of data, then you’re typically struggling in some respect on a single-node system. You’re likely spending an unreasonable amount of time and/or effort optimizing code, indexes, or other capabilities to run the jobs on that individual node. Looking at clustered solutions such as Spark or Dask makes sense starting from here, but it requires non-trivial investment in infrastructure to work effectively. Cloud providers such as Amazon, Microsoft, and Google have all been working on providing capabilities for scaling out these clustered solutions while reducing the level of effort required of their end customers for managing infrastructure.
As our technology capabilities progress, the difference between single-node workloads and clustered workloads will increasingly be abstracted away. These capabilities will enable even greater efficiencies tomorrow, just as what we experience today dwarfs ENIAC in 1945.