
BEAM (Batch + strEAM) your Data Pipelines on Google Dataflow

Apache Beam is a unified programming model for both batch and stream processing

Why do you need Data Pipelines?

In the 21st century, most enterprises rely on scalable platforms and the datafication of their services or products to remain competitive in the market. With data proliferating from disparate sources in varying volume, velocity, and variety, enterprises need a new data strategy. Hence the need for data pipelines: to consolidate data from all these disparate sources into a common destination for quick analytics, or to process and stream data between connected applications and systems.

As a result, organizations started deploying either batch or streaming pipelines based on their business needs.

PAUSE: Before we proceed further, let's take a short walk down memory lane to understand what differentiates batch and streaming pipelines.

Batch Processing:

  • Used with bounded datasets.
  • A set of data is collected over a period of time and then processed in one shot.
  • More concerned about throughput than latency.
  • Use cases: finding a bank's loyal customers; measuring the difference in sales after a discount, etc.

Stream Processing:

  • Used with unbounded datasets.
  • Data is fed to the processing engine as soon as it is generated.
  • More concerned about latency than throughput.
  • Use cases: stock market sentiment analysis; real-time detection of fraudulent transactions; IoT devices, etc.

RESUME: To set up these data pipelines, organizations need to deploy frameworks like Hadoop, Spark, or Flink, each of which has different abstractions and APIs for processing batch and streaming data. For example, in Spark, RDDs / DataFrames are used for batch processing, whereas you program DStreams for stream processing. Hence, organizations end up maintaining two different pipelines along with their respective execution engines, which not only adds maintenance overhead but also locks them into the associated execution engines, as the sketch below illustrates.
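
To make the two-pipeline problem concrete, here is a minimal sketch of a word count written twice against Spark's two APIs (assuming PySpark; the input path, output path, and socket source are placeholders):

    # Word count written twice: once against Spark's batch API (RDDs)
    # and once against its streaming API (DStreams).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="wordcount")

    # Batch: bounded input, RDD abstraction
    batch_counts = (sc.textFile("hdfs:///data/input.txt")        # placeholder input path
                      .flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b))
    batch_counts.saveAsTextFile("hdfs:///data/word_counts")       # placeholder output path

    # Streaming: unbounded input, DStream abstraction -- same logic, different API
    ssc = StreamingContext(sc, batchDuration=10)
    stream_counts = (ssc.socketTextStream("localhost", 9999)      # placeholder socket source
                        .flatMap(lambda line: line.split())
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda a, b: a + b))
    stream_counts.pprint()
    ssc.start()
    ssc.awaitTermination()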

To mitigate these challenges, Google incubated the Dataflow model, which can be applied to both bounded and unbounded datasets, and then donated its SDK to the Apache Software Foundation.

Since then, the community of contributors has nurtured it, and thus we have Apache Beam: a unified programming model offering easy-to-use, data-parallel processing of both streaming and batch workflows and, most importantly, platform independence (it is portable across multiple runners) to eliminate any API lock-in.

Apache Beam is a unified programming model for both batch and stream processing; its abstraction layer allows pipelines to be written in several languages (Java, Python, Go, etc.) and executed on any supported execution framework such as Google Cloud Dataflow, Spark, or Flink.
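
For a first taste, here is a minimal word-count sketch in the Python SDK; the same code runs on the local DirectRunner or on Dataflow, Spark, or Flink purely by changing the --runner pipeline option (the file names are placeholders):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Runner, project, etc. are picked up from command-line flags,
    # e.g. --runner=DirectRunner locally or --runner=DataflowRunner on GCP.
    options = PipelineOptions()

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "Read"   >> beam.io.ReadFromText("input.txt")    # placeholder input file
         | "Split"  >> beam.FlatMap(lambda line: line.split())
         | "Pair"   >> beam.Map(lambda word: (word, 1))
         | "Count"  >> beam.CombinePerKey(sum)
         | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
         | "Write"  >> beam.io.WriteToText("word_counts"))  # placeholder output prefix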

The architecture of Apache Beam:

Apache Beam Architecture by the author
  • Write the pipeline in your choice of language SDK – Java, Python, or Go.
  • The Beam Runner API converts it into a language-agnostic standard representation that can be consumed by the execution engines.
  • The Fn API provides language-specific SDK workers that act as an RPC interface for the UDFs embedded in the pipeline as function specifications.
  • The selected runner executes the pipeline on the underlying resources; the right choice of runner is key to efficient execution.

Apache Beam Workflow:

Apache Beam workflow by the author

Pipeline: Encapsulates the entire data-processing task from start to end, which includes reading input data, transforming it, and writing output data.

PCollection: The distributed dataset a Beam pipeline operates on. It can be either bounded or unbounded. It is immutable, so any transformation applied to a PCollection creates a new PCollection.

PTransform: Represents a data-processing operation or transformation step applied to a PCollection.

I/O Sources & Sinks: The Source and Sink APIs provide functions to read data into PCollections and write data out of them.
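
These four concepts map directly onto code. Here is a small sketch (run with the default local DirectRunner) in which every transform returns a new PCollection instead of mutating the previous one:

    import apache_beam as beam

    with beam.Pipeline() as p:                                       # Pipeline
        lines = p | "Create" >> beam.Create(["to be or not to be"])  # source -> PCollection
        words = lines | "Split" >> beam.FlatMap(str.split)           # PTransform -> new PCollection
        pairs = words | "Pair" >> beam.Map(lambda w: (w, 1))         # another new PCollection
        counts = pairs | "Count" >> beam.CombinePerKey(sum)
        counts | "Print" >> beam.Map(print)                          # stand-in for a real sink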

Apache Beam Capability Matrix:

The main advantage of Apache Beam is its portable API layer, which can be executed across a diversity of execution engines, or runners. This provides a fair ground for different runners to compete on technical innovations that deliver better performance, reliability, ease of operational management, and so on. Hence, the Apache Beam project publishes a capability matrix that groups runner capabilities by the corresponding What / Where / When / How questions (a small windowing sketch follows the matrix below):

Reference: Apache Beam Capability Matrix from Apache.org
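
To give a flavour of what those What / Where / When / How questions look like in code, here is a hedged sketch using Beam's windowing and triggering primitives; the toy scores, the 60-second window, and the 30-second early-firing delay are all assumptions for illustration:

    import apache_beam as beam
    from apache_beam.transforms import window
    from apache_beam.transforms.trigger import (AccumulationMode,
                                                AfterProcessingTime, AfterWatermark)

    with beam.Pipeline() as p:
        (p
         # Toy, bounded stand-in for an unbounded stream of (user, score) events.
         | beam.Create([("alice", 3), ("bob", 5), ("alice", 2)])
         | "Timestamp" >> beam.Map(lambda kv: window.TimestampedValue(kv, 0))
         # Where: fixed 60-second event-time windows
         # When:  fire at the watermark, with early firings every 30s of processing time
         # How:   panes accumulate across firings
         | "Window" >> beam.WindowInto(
             window.FixedWindows(60),
             trigger=AfterWatermark(early=AfterProcessingTime(30)),
             accumulation_mode=AccumulationMode.ACCUMULATING)
         # What:  a per-key sum of the scores
         | "Sum" >> beam.CombinePerKey(sum)
         | beam.Map(print))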

In the chart, you can see that the Google Cloud Dataflow runner checks all the primary capabilities, but that's not the only reason I believe Google Cloud Dataflow is the recommended choice.

So, Why Google Cloud Dataflow?

First of all, there is Google's unparalleled commitment to open source and its community. Since 2016, Google has contributed to 15,000+ open-source projects, and with Cloud Dataflow, Google also eases the overhead of setup, maintenance, and scaling required for efficient execution of complex, large-scale, continuous jobs by providing:

  • A fully managed, serverless service.
  • Automatic pipeline optimization, such as choosing the number of workers, splitting data streams into key spaces, and processing them in parallel.
  • Liquid sharding (dynamic work rebalancing), which dynamically redistributes outstanding work, and autoscaling, which adjusts the number of workers to the needs of the job pipeline.
  • Low-priced Flexible Resource Scheduling (FlexRS) for batch processing, which guarantees execution within a six-hour window (see the options sketch after the figure below).
  • Built-in real-time AI capabilities that allow you to build intelligent solutions such as predictive analytics and anomaly detection.
  • Out-of-the-box integration with the rest of the Google Cloud resources for seamless connectivity.

Google Cloud native Data Sources and Sinks for Dataflow by the author
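
As a rough sketch of what a Dataflow run looks like from the pipeline's side, only the options change; the project ID, region, bucket, and FlexRS goal below are placeholders:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project, region, and bucket names -- substitute your own.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",
        region="us-central1",
        temp_location="gs://my-bucket/temp",
        flexrs_goal="COST_OPTIMIZED",   # opt a batch job into FlexRS scheduling
    )

    with beam.Pipeline(options=options) as p:
        # ...the same transforms as in the earlier sketches; only the options differ.
        p | beam.Create(["hello dataflow"]) | beam.Map(print)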

Conclusion

Here I gave you an overview of the Apache Beam programming model and why Google Cloud Dataflow should be the de facto runner for it. If you are still reading this, I am certain you are very much interested in exploring Apache Beam for your next data pipeline, or in transforming an existing one into a unified, platform-independent model. So why stop here? Soon I'll publish a practical scenario and its implementation for better understanding.

In the meantime, here are some of the resources I referred to; you should also go through them for a deeper understanding:

