A brief introduction to two data processing architectures — Lambda and Kappa for Big Data

Iman Samizadeh, Ph.D.
Towards Data Science
8 min readMar 15, 2018

Big Data, Internet of things (IoT), Machine learning models and various other modern systems are becoming an inevitable reality today. People from all walks of life have started to interact with data storages and servers as a part of their daily routine. Therefore we can say that dealing with big data in the best possible manner is becoming the main area of interest for businesses, scientists and individuals. For instance an application launched for achieving certain business goals will be more successful if it can efficiently handle the queries made by customers and serve their purpose well. Such applications need to interact with data storage and in this article we’ll try to explore two important data processing architectures that serve as the backbone of various enterprise applications known as Lambda and Kappa.

The rapid growth of social media applications, cloud based systems, Internet of things and an unending spree of innovations has made it important for a developer or a data scientist to take well calculated decisions while launching, upgrading or troubleshooting an enterprise application. Although it has been widely accepted and understood that using a modular approach to build an application has multiple advantages and long term benefits, the pursuit for selecting the right data processing architecture still keeps putting question marks in front of many proposals related to existing and upcoming enterprise software. Although there are various data processing architectures being followed around the globe these days let’s investigate the Lambda and Kappa architectures in detail and find out what makes each of them special and in what circumstances one should be preferred over another.

Lambda Architecture

Lambda architecture is a data processing technique that is capable of dealing with huge amount of data in an efficient manner. The efficiency of this architecture becomes evident in the form of increased throughput, reduced latency and negligible errors. While we mention data processing we basically use this term to represent high throughput, low latency and aiming for near-real-time applications. Which also would allow the developers to define delta rules in the form of code logic or natural language processing (NLP) in event-based data processing models to achieve robustness, automation and efficiency and improve the data quality. Moreover, any change in the state of data is an event to the system and as a matter of fact it is possible to give a command, queried or expected to carry out delta procedures as a response to the events on the fly.

Event sourcing is a concept of using the events to make prediction as well as storing the changes in a system on the real time basis a change of state of a system, an update in the databases or an event can be understood as a change. For instance if someone interact with a web page or a social network profile, the events like page view, likes or Add as a Friend request etc… are triggering events that can be processed or enriched and the data stored in a database.

Data processing deals with the event streams and most of the enterprise software that follow the Domain Driven Design use the stream processing method to predict updates for the basic model and store the distinct events that serve as a source for predictions in a live data system. To handle numerous events occurring in a system or delta processing, Lambda architecture enabling data processing by introducing three distinct layers. Lambda architecture comprises of Batch Layer, Speed Layer (also known as Stream layer) and Serving Layer.

1. Batch layer

New data keeps coming as a feed to the data system. At every instance it is fed to the batch layer and speed layer simultaneously. Any new data stream that comes to batch layer of the data system is computed and processed on top of a Data Lake. When data gets stored in the data lake using databases such as in memory databases or long term persistent one like NoSQL based storages batch layer uses it to process the data using MapReduce or utilizing machine-learning (ML) to make predictions for the upcoming batch views.

2. Speed Layer (Stream Layer)

The speed layer uses the fruit of event sourcing done at the batch layer. The data streams processed in the batch layer result in updating delta process or MapReduce or machine learning model which is further used by the stream layer to process the new data fed to it. Speed layer provides the outputs on the basis enrichment process and supports the serving layer to reduce the latency in responding the queries. As obvious from its name the speed layer has low latency because it deals with the real time data only and has less computational load.

3. Serving Layer

The outputs from batch layer in the form of batch views and from speed layer in the form of near-real time views are forwarded to the serving layer which uses this data to cater the pending queries on ad-hoc basis.

Here is a basic diagram of what Lambda Architecture model would look like:

Lambda Architecture

Let’s translate that to a functional equation which defines any query in big data domain. The symbols used in this equation are known as Lambda and the name for the Lambda architecture is also coined from the same equation. This function is widely known to those who are familiar with tidbits of big data analysis.

Query = λ (Complete data) = λ (live streaming data) * λ (Stored data)

The equation means that all the data related queries can be catered in the Lambda architecture by combining the results from historical storage in the form of batches and live streaming with the help of speed layer.

Applications of Lambda Architecture

Lambda architecture can be deployed for those data processing enterprise models where:

  • User queries are required to be served on ad-hoc basis using the immutable data storage.
  • Quick responses are required and system should be capable of handling various updates in the form of new data streams.
  • None of the stored records shall be erased and it should allow addition of updates and new data to the database.

Lambda architecture can be considered as near real-time data processing architecture. As mentioned above, it can withstand the faults as well as allows scalability. It uses the functions of batch layer and stream layer and keeps adding new data to the main storage while ensuring that the existing data will remain intact. Companies like Twitter, Netflix, and Yahoo are using this architecture to meet the quality of service standards.

Pros and Cons of Lambda Architecture

Pros

  • Batch layer of Lambda architecture manages historical data with the fault tolerant distributed storage which ensures low possibility of errors even if the system crashes.
  • It is a good balance of speed and reliability.
  • Fault tolerant and scalable architecture for data processing.

Cons

  • It can result in coding overhead due to involvement of comprehensive processing.
  • Re-processes every batch cycle which is not beneficial in certain scenarios.
  • A data modeled with Lambda architecture is difficult to migrate or reorganize.

Kappa Architecture

In 2014 Jay Kreps started a discussion where he pointed out some discrepancies of Lambda architecture that further led the big data world to another alternate architecture that used less code resource and was capable of performing well in certain enterprise scenarios where using multi layered Lambda architecture seemed like extravagance.

Kappa Architecture cannot be taken as a substitute of Lambda architecture on the contrary it should be seen as an alternative to be used in those circumstances where active performance of batch layer is not necessary for meeting the standard quality of service. This architecture finds its applications in real-time processing of distinct events. Here is a basic diagram for the Kappa architecture that shows two layers system of operation for this data processing architecture.

Kappa Architecture

Let’s translate the operational sequencing of the kappa architecture to a functional equation which defines any query in big data domain.

Query = K (New Data) = K (Live streaming data)

The equation means that all the queries can be catered by applying kappa function to the live streams of data at the speed layer. It also signifies that that the stream processing occurs on the speed layer in kappa architecture.

Applications of Kappa architecture

Some variants of social network applications, devices connected to a cloud based monitoring system, Internet of things (IoT) use an optimized version of Lambda architecture which mainly uses the services of speed layer combined with streaming layer to process the data over the data lake.

Kappa architecture can be deployed for those data processing enterprise models where:

  • Multiple data events or queries are logged in a queue to be catered against a distributed file system storage or history.
  • The order of the events and queries is not predetermined. Stream processing platforms can interact with database at any time.
  • It is resilient and highly available as handling Terabytes of storage is required for each node of the system to support replication.

The above mentioned data scenarios are handled by exhausting Apache Kafka which is extremely fast, fault tolerant and horizontally scalable. It allows a better mechanism for governing the data-streams. A balanced control on the stream processors and databases makes it possible for the applications to perform as per expectations. Kafka retains the ordered data for longer durations and caters the analogous queries by linking them to the appropriate position of the retained log. LinkedIn and some other applications use this flavor of big data processing and reap the benefit of retaining large amount of data to cater those queries that are mere replica of each other.

Pros and Cons of Kappa architecture

Pros

  • Kappa architecture can be used to develop data systems that are online learners and therefore don’t need the batch layer.
  • Re-processing is required only when the code changes.
  • It can be deployed with fixed memory.
  • It can be used for horizontally scalable systems.
  • Fewer resources are required as the machine learning is being done on the real time basis.

Cons

Absence of batch layer might result in errors during data processing or while updating the database that requires having an exception manager to reprocess the data or reconciliation.

Conclusion

In short the choice between Lambda and Kappa architectures seems like a tradeoff. If you seek you’re an architecture that is more reliable in updating the data lake as well as efficient in devising the machine learning models to predict upcoming events in a robust manner you should use the Lambda architecture as it reaps the benefits of batch layer and speed layer to ensure less errors and speed. On the other hand if you want to deploy big data architecture by using less expensive hardware and require it to deal effectively on the basis of unique events occurring on the runtime then select the Kappa architecture for your real-time data processing needs.

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

Published in Towards Data Science

Your home for data science and AI. The world’s leading publication for data science, data analytics, data engineering, machine learning, and artificial intelligence professionals.

Written by Iman Samizadeh, Ph.D.

I am an entrepreneur and technology agnostic.

Responses (6)

What are your thoughts?