A Modest Introduction to Analytical Stream Processing

Architectural Foundations for Building Reliable Distributed Systems.

Scott Haines
Towards Data Science
23 min read · Aug 15, 2023


Distributed Streaming Data Networks are Unbounded and Growing at Incredible Rates. Image Created via Author’s MidJourney

Foundations of Stream Processing

Foundations are the unshakable, unbreakable base upon which structures are placed. When it comes to building a successful data architecture, the data itself is the central tenet of the entire system and the principal component of that foundation.

The most common way data flows into any data platform today is through stream processing platforms like Apache Kafka and Apache Pulsar.

It therefore becomes critical that we (as software engineers) provide hygienic capabilities and frictionless guardrails that reduce the problem space related to data quality "after" data has entered these fast-flowing data networks.

This means that establishing API-level contracts around our data's schema (types and structure), field-level availability (nullability, etc.), and field-type validity (expected ranges, etc.) becomes the critical underpinning of our data foundation, especially given the decentralized, distributed streaming nature of today's modern data systems.

However, to get to the point where we can even begin to establish blind-faith — or high-trust data networks — we must first establish intelligent system-level design patterns.

Building Reliable Streaming Data Systems

As software and data engineers, building reliable data systems is literally our job, and that means data downtime should be measured like any other component of the business. You've probably heard the terms SLA, SLO, and SLI at one point or another. In a nutshell, these acronyms describe the contracts, promises, and actual measures by which we grade our end-to-end systems.

As service owners, we will be held accountable for our own successes and failures, but with a little upfront effort, standard metadata, and common standards and best practices, we can ensure things run smoothly across the board.

Additionally, the same metadata can provide valuable insight into the quality and trust of our data-in-flight, along its journey until it reaches its final resting place. The lineage tells a story all on its own.

Adopting the Owner's Mindset

For example, Service Level Agreements (SLAs) between your team, or organization, and your customers (both internal and external) are used to create a binding contract with respect to the service you are providing. For data teams, this means identifying and capturing metrics (KPMs, or key performance metrics) based on your Service Level Objectives (SLOs). The SLOs are the promises you intend to keep based on your SLAs; they can be anything from a promise of near-perfect (99.999%) service uptime (API or JDBC) to something as simple as a promise of 90-day data retention for a particular dataset. Lastly, your Service Level Indicators (SLIs) are the proof that you are operating in accordance with the service level contracts, and they are typically presented in the form of operational analytics (dashboards) or reports.

Knowing where we want to go helps establish the plan to get there. This journey begins at the onset (the ingest point), with the data itself. Specifically, with the formal structure and identity of each data point. Considering the observation that "more and more data is making its way into the data platform through stream processing platforms like Apache Kafka," it helps to have compile-time guarantees, backwards compatibility, and fast binary serialization of the data being emitted into these data streams. Data accountability can be a challenge in and of itself. Let's look at why.

Managing Streaming Data Accountability

Streaming systems operate 24 hours a day, 7 days a week, 365 days a year. This can complicate things if the right upfront effort isn't applied to the problem, and one problem that tends to rear its head from time to time is corrupt data, aka data problems in flight.

Dealing with Data Problems in Flight

There are two common ways to reduce data problems in flight. First, you can introduce gatekeepers at the edge of your data network that negotiate and validate data using traditional Application Programming Interfaces (APIs). Second, you can create and compile helper libraries, or Software Development Kits (SDKs), that enforce the data protocols for the distributed writers (data producers) publishing into your streaming data infrastructure. You can even use both strategies in tandem.

Data Gatekeepers

The benefit of adding gateway APIs at the edge (in-front) of your data network is that you can enforce authentication (can this system access this API?), authorization (can this system publish data to a specific data stream?), and validation (is this data acceptable or valid?) at the point of data production. The diagram in Figure 1–1 below shows the flow of the data gateway.

Figure 1–1: A distributed systems architecture showing authentication and authorization layers at a data intake gateway. Flowing from left to right, approved data is published to Apache Kafka for downstream processing. Image credit: Scott Haines

The data gateway service acts as the digital gatekeeper (bouncer) to your protected (internal) data network. Its main role is to control, limit, and even restrict unauthenticated access at the edge (see APIs/Services in Figure 1–1 above) by authorizing which upstream services (or users) are allowed to publish data (commonly handled through service ACLs), coupled with a provided identity (think service identity and access via IAM, web identity and access via JWT, and our old friend OAuth).

The core responsibility of the gateway service is to validate inbound data before publishing it, so that potentially corrupt, or generally bad, data never makes it any further. If the gateway is doing its job correctly, only "good" data will make its way along and into the data network, which is the conduit of event and operational data to be digested via stream processing. In other words:

“This means that the upstream system producing data can fail fast when producing data. This stops corrupt data from entering the streaming or stationary data pipelines at the edge of the data network and is a means of establishing a conversation with the producers regarding exactly why, and how things went wrong in a more automatic way via error codes and helpful messaging.”

Using Error Messages to Provide Self-Service Solutions

The difference between a good and a bad experience comes down to how much effort is required to pivot from bad to good. We've all probably worked with, worked on, or heard of services that just fail with no rhyme or reason (a null pointer exception throws a random 500).

For establishing basic trust, a little bit goes a long way. For example, getting back an HTTP 400 from an API endpoint with the following message body

{
  "error": {
    "code": 400,
    "message": "The event data is missing the userId, and the timestamp is invalid (expected a string with ISO8601 formatting). Please view the docs at http://coffeeco.com/docs/apis/customer/order#required-fields to adjust the payload."
  }
}

provides a reason for the 400 and empowers the engineers sending us data (as the service owners) to fix a problem without setting up a meeting, blowing up the pager, or hitting up everyone on Slack. When you can, remember that everyone is human, and we all love closed-loop systems!
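To make this concrete, here is a minimal sketch (plain Scala, no particular HTTP framework) of the kind of field-level validation a gateway could run before accepting an event. The CustomerOrderEvent shape, the field names, and the docs URL are hypothetical and only for illustration.

import java.time.OffsetDateTime
import scala.util.Try

final case class CustomerOrderEvent(userId: Option[String], timestamp: Option[String])
final case class ApiError(code: Int, message: String)

object OrderEventValidator {
  // Hypothetical docs location, mirroring the error body shown above.
  private val docsUrl = "http://coffeeco.com/docs/apis/customer/order#required-fields"

  /** Returns Right(event) when the payload is acceptable, Left(error) otherwise. */
  def validate(event: CustomerOrderEvent): Either[ApiError, CustomerOrderEvent] = {
    val problems = Seq(
      if (event.userId.forall(_.trim.isEmpty)) Some("the userId is missing") else None,
      if (!event.timestamp.exists(ts => Try(OffsetDateTime.parse(ts)).isSuccess))
        Some("the timestamp is invalid (expected a string with ISO8601 formatting)")
      else None
    ).flatten

    if (problems.isEmpty) Right(event)
    else Left(ApiError(400,
      s"The event data has problems: ${problems.mkString("; ")}. " +
      s"Please view the docs at $docsUrl to adjust the payload."))
  }
}

Turning the Left case into an HTTP 400 response body like the one above is then just a matter of whichever web framework your gateway happens to use.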

Pros and Cons of the API for Data

This API approach has its pros and cons.

The pros are that most programming languages work out of the box with HTTP (or HTTP/2) transport protocols (or can with the addition of a tiny library), and JSON is about as universal a data exchange format as they come these days.

On the flip side (the cons), one can argue that for any new data domain there is yet another service to write and manage, and without some form of API automation, or adherence to an open specification like OpenAPI, each new API route (endpoint) ends up taking more time than necessary.

In many cases, failure to update data ingestion APIs in a "timely" fashion, compounding issues with scaling, API downtime, random failures, or just people not communicating, provides the necessary rationale for folks to bypass the "stupid" API and instead attempt to publish event data directly to Kafka. While APIs can feel like they are getting in the way, there is a strong argument for keeping a common gatekeeper, especially after data quality problems like corrupt events, or accidentally mixed events, begin to destabilize the streaming dream.

To flip this problem on its head (and remove it almost entirely), good documentation, change management (CI/CD), and general software development hygiene, including actual unit and integration testing, enable fast feature and iteration cycles that don't reduce trust.

Ideally, the data itself (schema/format) could dictate the rules of its own data-level contract by enabling field-level validation (predicates), producing helpful error messages, and acting in its own self-interest. Hey, with a little route- or data-level metadata and some creative thinking, the API could automatically generate self-defining routes and behavior.

Lastly, gateway APIs can be seen as centralized troublemakers, since each failure by an upstream system to emit valid data (e.g. being blocked by the gatekeeper) causes valuable information (event data, metrics) to be dropped on the floor. The blame here also tends to go both ways, as a bad deployment of the gatekeeper can blind an upstream system that isn't set up to handle retries in the event of gateway downtime (even for a few seconds).

Putting aside all the pros and cons, using a gateway API to stop the propagation of corrupt data before it enters the data platform means that when a problem occurs (because they always do), the surface area of the problem is reduced to a single service. This sure beats debugging a distributed network of data pipelines, services, and the myriad final data destinations and upstream systems just to figure out that bad data is being published directly by "someone" at the company.

If we were to cut out the middleman (the gateway service), then the responsibility of governing the transmission of "expected" data falls into the lap of "libraries," in the form of specialized SDKs.

Software Development Kits (SDKs)

SDKs are libraries (or micro-frameworks) that are imported into a codebase to streamline an action, activity, or otherwise complex operation. They are also known by another name: clients. Take the earlier example about using good error messages and error codes. That process is necessary in order to inform a client that their prior action was invalid, but it can be advantageous to add appropriate guardrails directly into an SDK to reduce the surface area of any potential problems. For example, let's say we have an API set up to track customers' coffee-related behavior through event tracking.

Reducing User Error with SDK Guardrails

A client SDK can theoretically include all the tools necessary to manage the interactions with the API server, including authentication and authorization, and as for validation, if the SDK does its job, the validation issues go out the door. The following code snippet shows an example SDK that could be used to reliably track customer events.

import com.coffeeco.data.sdks.client._
import com.coffeeco.data.sdks.client.protocol._

Customer.fromToken(token)
  .track(
    eventType = Events.Customer.Order,
    status = Status.Order.Initialized,
    data = Order.toByteArray
  )

With some additional work (aka the client SDK), the problem of data validation or event corruption can just about go away entirely. Additional problems can be managed within the SDK itself, for example how to retry sending a request when the server is offline. Rather than having all requests retry immediately, or in some loop that floods a gateway load balancer indefinitely, the SDK can take smarter actions like employing exponential backoff. See "The Thundering Herd Problem" for a dive into what goes wrong when things go, well, wrong!
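As a concrete (and hedged) example, here is a minimal sketch of client-side retry with exponential backoff and jitter, the kind of guardrail an SDK could bake in. The sendRequest call in the usage comment is a hypothetical placeholder for whatever transport the SDK actually uses.

import scala.util.{Failure, Random, Success, Try}

object RetryWithBackoff {
  def retry[T](maxAttempts: Int, baseDelayMs: Long, maxDelayMs: Long)(op: => T): Try[T] = {
    def attempt(n: Int): Try[T] = Try(op) match {
      case Success(value) => Success(value)
      case Failure(_) if n < maxAttempts =>
        // Exponential backoff: base * 2^(n-1), capped, plus random jitter so many
        // clients don't all retry in lockstep (see the thundering herd below).
        val backoff = math.min(baseDelayMs * (1L << (n - 1)), maxDelayMs)
        val jitter  = (Random.nextDouble() * backoff * 0.5).toLong
        Thread.sleep(backoff + jitter)
        attempt(n + 1)
      case failure => failure
    }
    attempt(1)
  }
}

// Usage: retry up to 5 times, starting at 200ms and capping at 10s.
// RetryWithBackoff.retry(maxAttempts = 5, baseDelayMs = 200, maxDelayMs = 10000) {
//   sendRequest(payload) // hypothetical call that throws on failure
// }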

The Thundering Herd Problem

Let's say we have a single gateway API server. You've written a fantastic API and many teams across the company are sending event data to it. Things are going well until one day a new internal team starts sending invalid data to the server, and instead of respecting your HTTP status codes, they treat all non-200 codes as a reason to retry. Worse, they forgot to add any kind of retry heuristics like exponential backoff, so all requests simply retry indefinitely, across an ever-increasing retry queue. Mind you, before this new team came on board there was never a reason to run more than one instance of the API server, and there was never a need for any sort of service-level rate limiter either, because everything was running smoothly within the agreed-upon SLAs.

The Not-So-Fail-Whale: a happy cartoon whale, out of the hot water and back in its natural habitat again. Image via Midjourney via the Author.

Well, that was before today. Now your service is offline. Data is backing up, upstream services are filling their queues, and people are upset because their services are now starting to run into issues because of your single point of failure…

These problems all stem from a form of resource starvation known as "The Thundering Herd Problem." It occurs when many processes are waiting for an event, such as system resources becoming available, or in this example, the API server coming back online. When that event arrives, there is a scramble as all of the processes compete for resources, and in many cases the load on the single process (the API server) is enough to take the service back offline again, starting the cycle of resource starvation over. That is, unless you can calm the herd, or distribute the load over a larger number of worker processes, decreasing the load across the network to the point where the resources have room to breathe again.

While the example above is more of an unintentional distributed denial of service (DDoS) attack, these kinds of problems can be solved at the client (with exponential backoff or self-throttling) and at the API edge via load balancing and rate limiting.
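For the edge side of that equation, here is a minimal, single-node sketch of token-bucket rate limiting. In practice you would more likely lean on a gateway or proxy for this; the class and numbers below are hypothetical and only illustrate the mechanics.

final class TokenBucket(capacity: Long, refillPerSecond: Long) {
  private var tokens: Double = capacity.toDouble
  private var lastRefillNanos: Long = System.nanoTime()

  /** Returns true if the request may proceed, false if it should be rejected (e.g. HTTP 429). */
  def tryAcquire(): Boolean = synchronized {
    val now = System.nanoTime()
    val elapsedSeconds = (now - lastRefillNanos) / 1e9
    // Refill proportionally to elapsed time, never exceeding the bucket capacity.
    tokens = math.min(capacity.toDouble, tokens + elapsedSeconds * refillPerSecond)
    lastRefillNanos = now
    if (tokens >= 1.0) { tokens -= 1.0; true } else false
  }
}

// Usage: allow bursts of up to 100 requests, refilling 50 tokens per second.
// val limiter = new TokenBucket(capacity = 100, refillPerSecond = 50)
// if (!limiter.tryAcquire()) rejectWith(429) // hypothetical handler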

Ultimately, without the right set of eyes and ears, enabled by operational metrics, monitors, and system-level (SLA/SLI/SLO) alerting, data can play the disappearing act, and that can be a challenge to resolve.

Whether you decide to add a data gateway API to the edge of your data network, employ a custom SDK for upstream consistency and accountability, or take an alternative approach to getting data into your data platform, it is still good to know what your options are. Regardless of the path data takes into your data streams, this introduction to streaming data wouldn't be complete without a proper discussion of data formats, protocols, and binary serializable data. Who knows, we may just uncover a better approach to handling our data accountability problem!

Selecting the Right Data Protocol for the Job

When you think of structured data, the first thing that comes to mind might be JSON. JSON data has structure, is a standard web-based data protocol, and, if nothing else, is super easy to work with. These are all benefits in terms of getting started quickly, but over time, and without the appropriate safeguards in place, you can face problems when standardizing on JSON for your streaming systems.

The Love / Hate Relationship with JSON

The first problem is that JSON data is mutable. As a data structure it is flexible, and therefore fragile. Data must be consistent to be accountable, and when transferring data across a network (on the wire), the serialized format (binary representation) should be as compact as possible. With JSON, you must send the keys for every field of every object in the payload, which inevitably means you'll be sending a large amount of additional weight for each additional record (after the first) in a series of objects.

Luckily, this is not a new problem. There are established best practices for these kinds of things, and multiple schools of thought regarding the best strategy for optimally serializing data. This is not to say that JSON doesn't have its merits; it's just that when it comes to laying a solid data foundation, the more structure the better, and the higher the level of compaction the better, as long as it doesn't burn up a lot of CPU cycles.

Serializable Structured Data

When it comes to efficiently encoding and transferring binary data, two serialization frameworks tend to come up: Apache Avro and Google Protocol Buffers (protobuf). Both libraries provide CPU-efficient techniques for serializing row-based data structures, and both also provide their own remote procedure call (RPC) frameworks and capabilities. Let's look at avro, then protobuf, and then wrap up with remote procedure calls.

Avro Message Format

With avro, you define declarative schemas for your structured data using the concept of records. These records are simply JSON-formatted data definition files (schemas) stored with the file extension avsc. The following example shows a Coffee schema in the avro descriptor format.

{
  "namespace": "com.coffeeco.data",
  "type": "record",
  "name": "Coffee",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "name", "type": "string"},
    {"name": "boldness", "type": "int", "doc": "from light to bold. 1 to 10"},
    {"name": "available", "type": "boolean"}
  ]
}

Working with avro data can take one of two paths, diverging based on how you want to work at runtime: you can take the compile-time approach, or figure things out on demand at runtime. The latter enables a flexibility that can enhance an interactive data discovery session. Avro was originally created as an efficient data serialization protocol for storing large collections of data, as partitioned files, long-term within the Hadoop file system (HDFS). Given that data was typically read from one location and written to another within HDFS, avro could store the schema (used at write time) once per file.
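To illustrate the runtime path, here is a minimal sketch using the Apache Avro Java library's generic API: parse a trimmed copy of the Coffee schema at runtime, build a record, and serialize it to bytes without any generated classes. The values are made up for illustration.

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

object AvroRuntimeExample {
  // Parse the schema from its JSON definition at runtime (no code generation).
  val coffeeSchema: Schema = new Schema.Parser().parse(
    """{"namespace":"com.coffeeco.data","type":"record","name":"Coffee","fields":[
      |{"name":"id","type":"string"},
      |{"name":"name","type":"string"},
      |{"name":"boldness","type":"int"},
      |{"name":"available","type":"boolean"}]}""".stripMargin)

  def serialize(): Array[Byte] = {
    val record: GenericRecord = new GenericData.Record(coffeeSchema)
    record.put("id", "co-123")
    record.put("name", "morning blend")
    record.put("boldness", Int.box(4))        // box primitives for the Java API
    record.put("available", Boolean.box(true))

    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](coffeeSchema).write(record, encoder)
    encoder.flush()
    out.toByteArray
  }
}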

Avro Binary Format

When you write a collection of avro records to disk, the process encodes the schema of the avro data directly into the file itself (once). There is a similar process with Parquet file encoding, where the schema is compressed and written as a binary file footer. We saw this firsthand at the end of chapter 4, when we went through the process of adding StructField-level documentation to our StructType. That schema was used to encode our DataFrame, and when we wrote to disk it preserved our inline documentation for the next read.

Enabling Backwards Compatibility and Preventing Data Corruption

In the case of reading multiple files as a single collection, problems can arise when the schema changes between records. Avro encodes binary records as byte arrays and applies a schema to the data at the time of deserialization (conversion back from a byte array into an object).

This means you need to take extra precautions to preserve backwards compatibility; otherwise, you'll find yourself running into ArrayIndexOutOfBounds exceptions.

Broken schema promises can happen in other subtle ways too. For example, say you need to change an integer field to a long for a specific field in your schema. Don't. This will break compatibility for any reader still holding the old schema, because avro resolves each field's bytes strictly against the schema definitions it is given, and a long cannot be read back as an int. To maintain compatibility, you'll need to deprecate the use of the integer field moving forward (while preserving it in your avro definition) and add (append) a new field to the schema to use going forward.
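As a sanity check before shipping a change like that, here is a minimal sketch that leans on the Apache Avro library's SchemaCompatibility utility (an assumption that you are on a reasonably recent Avro release) to ask whether readers still holding the old schema can decode data written with the proposed one.

import org.apache.avro.{Schema, SchemaCompatibility}

object CompatibilityCheck extends App {
  // A trimmed Coffee schema where only the boldness type varies.
  private def coffeeSchema(boldnessType: String): Schema =
    new Schema.Parser().parse(
      s"""{"namespace":"com.coffeeco.data","type":"record","name":"Coffee","fields":[
         |{"name":"id","type":"string"},
         |{"name":"boldness","type":"$boldnessType"}]}""".stripMargin)

  val oldReader = coffeeSchema("int")   // readers still deployed with the old schema
  val newWriter = coffeeSchema("long")  // the proposed field type change

  val verdict = SchemaCompatibility
    .checkReaderWriterCompatibility(oldReader, newWriter)
    .getType // expected: INCOMPATIBLE, since a long cannot be read back as an int

  println(s"old readers vs new writers: $verdict")
}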

Best Practices for Streaming Avro Data

Moving from static avro files, with their useful embedded schemas, to an unbounded stream of binary data, the main difference is that you need to bring your own schema to the party. This means you'll need to support backwards compatibility (in case you need to rewind and reprocess data from before and after a schema change), as well as forwards compatibility, in case you have existing readers already consuming from a stream.

The challenge here is supporting both forms of compatibility, given that avro doesn't have the ability to ignore unknown fields, which is a requirement for supporting forwards compatibility. To help address these challenges with avro, the folks at Confluent open-sourced their Schema Registry (for use with Kafka), which enables schema versioning at the Kafka topic (data stream) level.

When using avro without a schema registry, you'll have to ensure you've updated any active readers (Spark applications or otherwise) to use the new version of the schema before updating the schema library version on your writers. Flip the switch in the other order, and you could find yourself at the start of an incident.

Protobuf Message Format

With protobuf, you define your structured data definitions using the concept of messages. Messages are written in a format that feels more like defining a struct in C. These message files are written into files with the proto filename extension. Protocol Buffers have the advantage of supporting imports, which means you can define common message types and enumerations that can be reused within a large project, or even imported into external projects, enabling wide-scale reuse. Here is a simple example of creating the Coffee record (message type) using protobuf.

syntax = "proto3";
option java_package = "com.coffeeco.protocol";
option java_outer_classname = "Common";

message Coffee {
  string id = 1;
  string name = 2;
  uint32 boldness = 3;
  bool available = 4;
}

With protobuf you define your messages once and then compile them for your programming language of choice. For example, we can generate Scala code from the coffee.proto file using the standalone compiler from the ScalaPB project (created and maintained by Nadav Samet), or utilize the brilliance of Buf, which has created an invaluable set of tools and utilities around protobuf and gRPC.

Code Generation

Compiling protobuf enables simple code generation. The following example is taken from the /ch-09/data/protobuf directory. The directions in the chapter README cover how to install ScalaPB and include the steps to set the correct environment variables to execute the command.

mkdir /Users/`whoami`/Desktop/coffee_protos
$SCALAPBC/bin/scalapbc -v3.11.1 \
--scala_out=/Users/`whoami`/Desktop/coffee_protos \
--proto_path=$SPARK_MDE_HOME/ch-09/data/protobuf/ \
coffee.proto

This process saves time in the long run by freeing you up from having to write additional code to serialize and deserialize your data objects (across language boundaries or within different code bases).
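As a hedged illustration of what that generated code buys you, assuming ScalaPB emitted a Coffee case class from coffee.proto (the exact package depends on your compiler options; com.coffeeco.protocol.coffee is an assumption here), a round trip looks like this:

import com.coffeeco.protocol.coffee.Coffee // assumed generated package

object CoffeeSerde {
  def roundTrip(): Coffee = {
    val coffee = Coffee(id = "co-123", name = "morning blend", boldness = 4, available = true)

    // Serialize to the compact binary wire format...
    val bytes: Array[Byte] = coffee.toByteArray

    // ...and parse it back. ScalaPB generates both sides for free.
    Coffee.parseFrom(bytes)
  }
}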

Protobuf Binary Format

The serialized protobuf (binary wire format) is encoded using field-level tags, markers that identify the field and wire type of the data encapsulated within a serialized protobuf message. In the coffee.proto example, you probably noticed the index next to each field type (string id = 1;); this is used to assist with encoding and decoding messages on and off the wire. This adds a little overhead compared to the avro binary format, but if you read the encoding specification, you'll see that other efficiencies more than make up for the additional bytes (such as bit packing, efficient handling of numeric data types, and special encoding of the first 15 indices of each message). With respect to using protobuf as your binary protocol of choice for streaming data, the pros far outweigh the cons in the grand scheme of things. One of the ways it more than makes up for itself is with support for both backwards and forwards compatibility.

Enabling Backwards Compatibility and Preventing Data Corruption

There are rules to keep in mind when modifying your protobuf schemas, similar to the ones we discussed with avro. As a rule of thumb, you can change the name of a field, but you never change the type or the position (index) unless you want to break backwards compatibility. These rules can be easy to overlook when supporting any kind of data over the long term, and that can become especially difficult as teams grow more proficient with protobuf; the urge to rearrange and optimize can come back to bite you if you are not careful. (See the tip below, Maintaining Data Quality over Time, for more context.)

Best Practices for Streaming Protobuf Data

Given that protobuf supports both backwards and forwards compatibility, you can deploy new writers without having to worry about updating your readers first, and the same is true of your readers: you can update them with newer versions of your protobuf definitions without worrying about a complex deploy of all your writers. Protobuf supports forwards compatibility using the notion of unknown fields, a concept that doesn't exist within the avro specification. Unknown fields track the indices and associated bytes a reader was unable to parse due to the divergence between its local version of the protobuf definition and the version it is currently reading. The beneficial thing here is that you can also opt in, at any point, to newer changes in the protobuf definitions.

For example, say you have two streaming applications, (a) and (b). Application (a) processes streaming data from an upstream Kafka topic (x), enhances each record with additional information, and then writes it out to a new Kafka topic (y). Application (b) reads from (y) and does its thing. Now say there is a newer version of the protobuf definition, and application (a) has yet to be updated to it, while the producers of the upstream Kafka topic (x) and application (b) are already updated and expecting to use some new fields available from the upgrade. The amazing thing is that application (a) can still pass the unknown fields through to application (b) without even knowing they exist.

See “Tips for maintaining good data quality over time” for an additional deep dive.

Tip: Maintaining Data Quality over Time

When working with either avro or protobuf, you should treat the schemas no differently than you would code you want to push to production. This means creating a project that can be committed to your company's GitHub (or whatever version control system you are using), and it also means writing unit tests for your schemas. Not only does this provide living examples of how to use each message type, but the more important reason for testing your data formats is to ensure that changes to the schema don't break backwards compatibility. The icing on the cake is that in order to unit test the schemas you'll need to first compile the (.avsc or .proto) files and use the respective library's code generation. This makes it easier to create releasable library code, and you can also use release versioning (version 1.0.0) to catalog each change to the schemas.

One simple method to enable this process is to serialize and store a binary copy of each message, across all schema changes, as part of the project lifecycle. I have found success adding this step directly into the unit tests themselves, using the test suite to create, read, and write these records directly into the project's test resources directory. This way each binary version, across all schema changes, is available within the code base itself.
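Here is a minimal sketch of that approach, assuming ScalaTest and the ScalaPB-generated Coffee class from earlier (the package name and file paths are assumptions). Each schema release writes a binary fixture into the test resources, and the suite proves the current code can still parse every fixture ever written.

import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._
import org.scalatest.funsuite.AnyFunSuite
import com.coffeeco.protocol.coffee.Coffee // assumed generated package

class CoffeeSchemaCompatibilitySuite extends AnyFunSuite {
  private val fixtures: Path = Paths.get("src/test/resources/fixtures/coffee")

  test("write a binary fixture for the current schema version") {
    Files.createDirectories(fixtures)
    val current = Coffee(id = "co-123", name = "morning blend", boldness = 4, available = true)
    Files.write(fixtures.resolve("coffee-v1.0.0.bin"), current.toByteArray)
  }

  test("every historical fixture still parses with the current schema") {
    Files.createDirectories(fixtures) // no-op if it already exists
    val binaries = Files.list(fixtures).iterator().asScala.filter(_.toString.endsWith(".bin"))
    binaries.foreach { path =>
      val parsed = Coffee.parseFrom(Files.readAllBytes(path))
      assert(parsed.id.nonEmpty, s"failed to parse ${path.getFileName}")
    }
  }
}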

With a little extra upfront effort, you can save yourself a lot of pain in the grand scheme of things, and rest easy at night knowing your data is safe (at least on the producing and consuming sides of the table).

Using Buf Tooling and Protobuf in Spark

Since I wrote this chapter back in 2021, Buf Build (https://buf.build/) has materialized into the all-things-protobuf company. Their tooling is simple to use, free and open-source, and appeared at just the right time to power a few initiatives in the Spark community. The Apache Spark project introduced full native support for Protocol Buffers in Spark 3.4 in order to support Spark Connect, and uses Buf for compiling gRPC services and messages. Spark Connect is, after all, a gRPC-native connector for embedding Spark applications outside of the JVM.

A traditional Apache Spark application must run as a driver application somewhere, and in the past this meant using PySpark or native Spark, both of which still run on top of a JVM process.

Directory structure via Spark Connect. Shows the protobuf definitions, along with buf.gen.yaml and buf.work.yaml which help with code generation.

At the end of the day, Buf Build enables peace of mind in the build process. To generate the code, one simply runs: buf generate. For linting and consistent formatting: buf lint && buf format -w. The icing on the cake, however, is the breaking change detection: buf breaking --against .git#branch=origin/main is all it takes to ensure that new changes to your message definitions won't negatively affect anything currently running in production. In the future, I will do a write-up on using buf for enterprise analytics, but for now, it is time to conclude this chapter.

So, where were we? You now know there are benefits to using avro or protobuf for your long-term data accountability strategy. By using these language-agnostic, row-based, structured data formats, you reduce the problem of long-term language lock-in, leaving the door open to whatever programming language comes later down the line. Because, honestly, it can be a thankless task to support legacy libraries and code bases. Additionally, the serialized formats help reduce the network bandwidth costs and congestion associated with sending and receiving large amounts of data, which in turn reduces the storage overhead of retaining your data long-term.

Lastly, let’s look at how these structured data protocols enable additional efficiencies when it comes to sending and receiving data across the network using remote procedure calls.

Remote Procedure Calls

RPC frameworks, in a nutshell, enable client applications to transparently call remote (server-side) methods (procedures) via local function calls by passing serialized messages back and forth. The client- and server-side implementations use the same public interface definition to define the functional RPC methods and services available. The Interface Definition Language (IDL) defines the protocol and message definitions and acts as a contract between the client and the server. Let's see this in action by looking at the popular open-source RPC framework gRPC.

gRPC

First conceptualized and created at Google, gRPC is a robust open-source remote procedure call (RPC) framework used for high-performance services ranging from distributed database coordination, as seen with CockroachDB, to real-time analytics, as seen with Microsoft's Azure Video Analytics.

Figure 1–2: RPC (in this example gRPC) works by passing serialized messages to and from a client and server. The client implements the same Interface Definition Language (IDL) interface, which acts as an API contract between the client and server. (Photo credit: https://grpc.io/docs/what-is-grpc/introduction/)

The diagram in Figure 1–2 shows an example of gRPC at work. The server-side code is written in C++ for speed, while clients written in both Ruby and Java can interoperate with the service using protobuf messages as their means of communication.

Using protocol buffers for message definitions and serialization, as well as for the declaration and definition of services, gRPC can simplify how you capture data and build services. For example, let's say we want to continue the exercise of creating a tracking API for customer coffee orders. The API contract could be defined in a simple services file, and from there the server-side implementation and any number of client-side implementations could be built using the same service definition and message types.

Defining a gRPC Service

You can define a service interface, the request and response objects, as well as the message types that need to be passed between the client and server as easily as 1–2–3.

syntax = "proto3";

service CustomerService {
  rpc TrackOrder (Order) returns (Response) {}
  rpc TrackOrderStatus (OrderStatusTracker) returns (Response) {}
}

message Order {
  uint64 timestamp = 1;
  string orderId = 2;
  string userId = 3;
  Status status = 4;
}

enum Status {
  unknown_status = 0;
  initialized = 1;
  started = 2;
  progress = 3;
  completed = 4;
  failed = 5;
  canceled = 6;
}

message OrderStatusTracker {
  uint64 timestamp = 1;
  Status status = 2;
  string orderId = 3;
}

message Response {
  uint32 statusCode = 1;
  string message = 2;
}

With the addition of gRPC, it can be much easier to implement and maintain both the server-side and client-side code used within your data infrastructure. And given that protobuf supports backwards and forwards compatibility, older gRPC clients can still send valid messages to newer gRPC services without running into the common problems and pain points discussed earlier under "Dealing with Data Problems in Flight."
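To ground that, here is a minimal server-side sketch, assuming ScalaPB's gRPC code generation was run on the service definition above and produced a CustomerServiceGrpc object along with Order, OrderStatusTracker, and Response case classes (the package name is an assumption).

import scala.concurrent.{ExecutionContext, Future}
import io.grpc.ServerBuilder
import com.coffeeco.protocol.customer.{CustomerServiceGrpc, Order, OrderStatusTracker, Response}

class CustomerServiceImpl extends CustomerServiceGrpc.CustomerService {
  // Each RPC becomes a plain method taking the request message and returning a Future.
  override def trackOrder(order: Order): Future[Response] =
    Future.successful(Response(statusCode = 200, message = s"tracked order ${order.orderId}"))

  override def trackOrderStatus(tracker: OrderStatusTracker): Future[Response] =
    Future.successful(Response(statusCode = 200, message = s"status ${tracker.status} recorded"))
}

object CustomerServer extends App {
  implicit val ec: ExecutionContext = ExecutionContext.global
  val server = ServerBuilder
    .forPort(9090)
    .addService(CustomerServiceGrpc.bindService(new CustomerServiceImpl, ec))
    .build()
    .start()
  server.awaitTermination()
}

A client stub generated from the same proto would then call trackOrder just like a local function, with the serialization and transport handled for you.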

gRPC speaks HTTP/2

As a bonus, with respect to modern service stacks, gRPC is able to use HTTP/2 for its transport layer. This also means you can take advantage of modern service meshes and proxies (like Envoy) for proxy support, routing, and service-level authentication, all while reducing the connection overhead and head-of-line blocking seen with standard HTTP/1.1 over TCP.

Mitigating data problems in flight, and achieving success when it comes to data accountability, starts with the data and fans outward from that central point. Putting processes in place for how data can enter your data network should be considered a prerequisite to check off before diving into the torrent of streaming data.

Summary

The goal of this post is to present the moving parts, concepts, and background information required to arm ourselves before blindly leaping from a more traditional (stationary) batch-based mindset to one that understands the risks and rewards of working with real-time streaming data.

Harnessing data in real-time can lead to fast, actionable insights, and open the doors to state-of-the-art machine learning and artificial intelligence.

However, distributed data management can also become a data crisis if the right steps aren't taken ahead of time. Remember that without a strong, solid data foundation, built on top of valid (trustworthy) data, the road to real-time will not be a simple endeavor, but one with its fair share of bumps and detours along the way.

I hope you enjoyed the second half of Chapter 9. To read the first part of this series, head on over to A Gentle Introduction to Analytical Stream Processing.

— — — — — — — — — — — — — — — — — — — — — — — —

If you want to dig in even deeper, please check out my book, or support me with a high five.

If you have access to O'Reilly Media, then you can also read the book entirely for free (good for you, not so good for me), but please find the book for free somewhere if you have the opportunity, or get an ebook to save on shipping costs (or on needing to find a place for a 600+ page book).
