Book review: ‘Streaming Data’ by Andrew G. Psaltis

Thomas Treml
Towards Data Science
6 min read · Sep 10, 2018


Stream processing systems have been gaining attention and adoption in industry in recent years because of the need to process (big) data more quickly. To accomplish this, a sequence of digital signals, the so-called data stream, is processed and analyzed in order to produce (near) real-time insights. The book ‘Streaming Data’ by Andrew Psaltis is a collection of best practices for designing and implementing such systems end-to-end.

[Photo: Enns.join(Steyr)]

This review serves mainly as a personal reference to the book’s content. But I would be pleased if the text is also useful to others and perhaps motivates one or another reader to buy the book and read it in depth. It is structured as a summary of the contents, followed by my personal view of the book, and closes with some key takeaways.

The content of the book is broken down along the main tiers of a data streaming system. It concludes with a hands-on pet project whose goal is to build a system that analyzes Meetup RSVPs in real time.

Collection tier

This chapter is about common interaction patterns and fault tolerance. Andrew Psaltis describes the following interaction patterns in depth and applies them to various use cases:

  • Request / response
  • Publish / subscribe
  • One-way pattern
  • Request / acknowledge
  • Stream pattern

As one would expect, special emphasis is placed on the stream pattern, which flips the usual client-service setting around: the service becomes a client and connects to a stream source (instead of a client making requests to a service, as in the other patterns). Scaling concepts for these different interaction patterns are also addressed in the book.
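
To make the stream pattern concrete, here is a minimal sketch of a collection-tier service acting as a client of a long-lived HTTP stream. It is my own illustration in Python, not code from the book; the Meetup RSVP endpoint matches the book’s pet project, but treat the URL as illustrative.

    import json
    import requests

    # Stream source used in the book's pet project; the public endpoint
    # may no longer be live, so the URL is illustrative here.
    STREAM_URL = "http://stream.meetup.com/2/rsvps"

    # The stream pattern: the service opens one long-lived connection and
    # consumes events as the source pushes them, instead of a client
    # issuing individual requests.
    with requests.get(STREAM_URL, stream=True) as response:
        for line in response.iter_lines():
            if line:  # skip keep-alive blank lines
                rsvp = json.loads(line)
                print(rsvp.get("event", {}).get("event_name"))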

In the part about common fault tolerance techniques, three variants of message logging are addressed: receiver-based, sender-based and hybrid. In the first, every received message is persisted to disk before any action is taken on it. In contrast, sender-based message logging writes a message to storage before it is sent. Hybrid message logging is designed to provide the best of both approaches while simplifying the data flow compared to running both techniques side by side to protect the system from data loss and make it recoverable. Essentially, this approach logs a received message to stable storage as soon as it arrives, similar to the receiver-based approach, but asynchronously. And as in a sender-based message logger, the log entry is deleted only once the acknowledgement from the message queuing tier arrives. Andrew Psaltis argues that this approach should be implemented whenever possible, because it preserves fault tolerance and safety without the overhead of setting up two separate message loggers, each with its own persistent storage.
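
A minimal sketch of that idea (my own interpretation in Python, not code from the book): messages are persisted asynchronously on receipt, and a log entry is deleted only once the message queuing tier has acknowledged the message.

    import queue
    import threading

    class HybridMessageLogger:
        """Sketch of hybrid message logging: persist received messages
        asynchronously (receiver-based flavor), delete a log entry only
        after the downstream acknowledgement (sender-based flavor)."""

        def __init__(self):
            self._pending = queue.Queue()  # messages awaiting persistence
            self._log = {}  # msg_id -> payload; stand-in for stable storage
            threading.Thread(target=self._persist_loop, daemon=True).start()

        def on_receive(self, msg_id, payload):
            # Asynchronous: enqueue without blocking the collection hot path.
            self._pending.put((msg_id, payload))

        def _persist_loop(self):
            while True:
                msg_id, payload = self._pending.get()
                self._log[msg_id] = payload  # "write to stable storage"

        def on_ack(self, msg_id):
            # Acknowledged by the message queuing tier: safe to forget.
            self._log.pop(msg_id, None)

        def replay_unacknowledged(self):
            # After a crash, everything still in the log must be re-sent.
            return list(self._log.items())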

Message queuing tier

The chapter about message queuing starts with an introduction to the three main components: producer, broker and consumer. It is written in a technology-agnostic way, so the explained principles can be applied to various products in this domain (e.g. RabbitMQ, ZeroMQ or Kafka).
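
To make the three roles concrete, here is a minimal producer/consumer sketch using Kafka via the kafka-python client. This is my own illustration with one of the products mentioned, not an example from the book.

    from kafka import KafkaConsumer, KafkaProducer

    # Producer: publishes messages to a topic managed by the broker.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("rsvps", b'{"member": "alice", "response": "yes"}')
    producer.flush()

    # Consumer: subscribes to the topic and reads messages from the broker.
    consumer = KafkaConsumer("rsvps",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)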

Particularly important is the introduction of message delivery semantics (which also matter for the analysis tier that follows):

  • At-most-once — a message may get lost, but is never processed more than once
  • At-least-once — a message never gets lost, but may be processed more than once
  • Exactly-once — a message never gets lost and is processed exactly once
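
Which semantics you actually get often comes down to when a consumer marks a message as done. A hedged sketch, again with kafka-python (my illustration; the process function is a hypothetical stand-in for the downstream work):

    from kafka import KafkaConsumer

    def process(message):
        """Hypothetical downstream processing step."""
        print(message.value)

    consumer = KafkaConsumer("rsvps",
                             bootstrap_servers="localhost:9092",
                             group_id="analysis",
                             enable_auto_commit=False)

    for message in consumer:
        # At-most-once would commit the offset *before* processing: a crash
        # during processing loses the message, but never reprocesses it.
        # At-least-once (below) processes first and commits afterwards: a
        # crash between the two steps means the message is seen again.
        process(message)
        consumer.commit()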

Security is also a concern in this chapter, but how to actually implement it is unfortunately only covered by references for further reading. Fault tolerance is again a big topic, with a checklist-like set of questions you should answer while designing a message queuing solution.

Analysis tier

The content about this tier is actually split into two separate chapters. The first introduces distributed stream processing architecture and frameworks that put it into action. The second is about algorithms for stream analytics.

All distributed stream processing systems nowadays consist of three parts:

  • A management component, which distributes the submitted application
  • Worker nodes in the cluster, which execute the algorithms
  • Data sources, which act as input for the algorithms to work on

The frameworks discussed for this tier are Spark Streaming, Storm, Flink and Samza. Message delivery semantics, state management and fault tolerance, each with respect to the analysis tier, are further topics.
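
Since the book stays prose-level here (more on that in my review below), a short hello-world in one of those frameworks may help. Below is the classic word count in Spark Streaming’s Python API; the book’s examples are in Java, so this sketch is mine.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # Management component: the Spark driver distributes this application.
    sc = SparkContext("local[2]", "StreamingWordCount")
    ssc = StreamingContext(sc, batchDuration=5)  # micro-batches of 5 seconds

    # Data source: a socket stream; the lambdas below run on the workers.
    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()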

For the algorithmic processing of stream data, time is the most important concept. One needs to distinguish between event time (when the event actually occurs) and stream time (when the data of an event enters the system). As streams are by nature never-ending, they cannot be kept in memory to perform an analysis on, as in traditional batch processing. The concept of a window (a defined slice of the streaming data to perform computations on) is introduced to overcome this problem. There are two windowing techniques:

  • Sliding window — defined by a window length and a sliding interval
  • Tumbling window — defined by a count-based or temporal-based trigger policy
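
Continuing the Spark Streaming sketch from above, a sliding window over the word counts could look like this (my illustration; the window length and the sliding interval are the two defining parameters):

    # Sliding window: recompute counts over the last 30 seconds of the
    # stream, sliding forward every 10 seconds (both must be multiples
    # of the batch interval).
    windowed = (lines.flatMap(lambda line: line.split())
                     .map(lambda word: (word, 1))
                     .reduceByKeyAndWindow(lambda a, b: a + b,
                                           None,  # no inverse function
                                           30, 10))
    windowed.pprint()

A tumbling window is then simply the special case where the window length equals the sliding interval, so consecutive windows never overlap.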

As summarization is the core of analytics, various summarization techniques with respect to stream processing are finally introduced.
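
To give a flavor of such techniques, here is a one-pass mean and variance in the spirit of Welford’s algorithm, a classic streaming summary. The book covers more than this small sketch of mine shows.

    class RunningStats:
        """One-pass summary of a stream: count, mean and variance are
        updated per element, so the stream never has to fit in memory."""

        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0  # running sum of squared deviations from the mean

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def variance(self):
            return self.m2 / self.n if self.n > 0 else 0.0

    stats = RunningStats()
    for value in [4.0, 7.0, 13.0, 16.0]:
        stats.update(value)
    print(stats.mean, stats.variance)  # 10.0 22.5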

In-memory data store

This chapter is all about storing the previously collected and processed data. Long-term storage options are covered only superficially; in-memory solutions (embedded in-memory and caching systems) in detail. One needs to distinguish between systems built with a disk-first approach (traditional databases that offer an in-memory option) and memory-first in-memory databases (IMDBs). Which product to choose depends highly on the specific use case, but those that follow the latter approach are commonly the best fit for a fast streaming solution.
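
As a concrete example of working with a memory-first store, a minimal sketch with Redis via redis-py (my choice for illustration; the book discusses several products):

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Cache the latest analysis result under a key with a short TTL, so
    # the in-memory store never serves stale streaming results for long.
    result = {"metric": "rsvps_per_minute", "value": 42}
    r.set("latest:rsvps_per_minute", json.dumps(result), ex=60)

    cached = r.get("latest:rsvps_per_minute")
    if cached is not None:
        print(json.loads(cached))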

Data access tier

The last tier is built to make the data accessible to clients, mainly through an API. Four major patterns that fulfill this purpose are discussed in this chapter:

  • Data Sync
  • Remote Method Invocation [RMI]/ Remote Procedure Call [RPC]
  • Messaging
  • Publish / subscribe

Commonly used protocols for building a streaming API are discussed in detail: Webhooks, HTTP Long Polling, Server-Sent Events and WebSockets. They are also compared regarding their communication direction and factors like frequency, latency, efficiency and reliability.
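
As an example of one of these protocols, here is a minimal Server-Sent Events endpoint built with Flask. This is my own Python sketch, not the book’s Java code; SSE pushes data chunks to the client over a single long-lived HTTP response.

    import json
    import time

    from flask import Flask, Response

    app = Flask(__name__)

    @app.route("/stream")
    def stream():
        def event_stream():
            # Each yielded "data: ...\n\n" chunk is pushed to the client
            # as soon as it is produced.
            while True:
                payload = json.dumps({"ts": time.time()})
                yield f"data: {payload}\n\n"
                time.sleep(1)
        return Response(event_stream(), mimetype="text/event-stream")

    if __name__ == "__main__":
        app.run(port=5000)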

Finally, filtering techniques that reduce the stream to the events a streaming client is actually interested in are discussed. Although the majority of filtering should take place in the analysis tier, there are some consumer-specific use cases where filtering in the data access tier makes sense (e.g. geo-location-based filters). Essentially, the pros and cons of static filtering (predefined decision) vs. dynamic filtering (decision at runtime) in this last tier of the system are compared against each other.
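
A toy sketch of that distinction (mine, not the book’s): a static filter is fixed when the service is deployed, while a dynamic filter is parameterized by the client at runtime, e.g. for geo-location.

    # Static filter: the predicate is baked in at deploy time.
    def static_filter(event):
        return event["country"] == "AT"

    # Dynamic filter: the client supplies the parameters at runtime.
    def make_geo_filter(lat, lon, max_degrees):
        def geo_filter(event):
            # Crude bounding-box check, purely to illustrate the idea.
            return (abs(event["lat"] - lat) <= max_degrees
                    and abs(event["lon"] - lon) <= max_degrees)
        return geo_filter

    events = [{"country": "AT", "lat": 48.04, "lon": 14.42},
              {"country": "DE", "lat": 52.52, "lon": 13.40}]

    near_enns = make_geo_filter(48.21, 14.48, max_degrees=1.0)
    print([e for e in events if static_filter(e)])  # the AT event
    print([e for e in events if near_enns(e)])      # the event near Enns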

Personal Review

Overall I really enjoyed reading this book. It has well-structured content of high quality, and complex concepts are broken down in a way that makes them easy to understand. For my personal taste there are too many diagrams in the text, which sometimes hinders fluent reading, especially when the diagrams are redundant (e.g. the architecture of the base tiers is reproduced in nearly every chapter). I found the references to books and articles in the text exceptionally useful and well curated. Also very useful are the frequently occurring listings of benefits and drawbacks; a glance at a comparison table is a great way to get a quick overview of the differences between various methods or technologies.

The only part I’m not so happy with is the one about the analysis tier (which is a bit sad, because I initially bought the book to learn about exactly this topic). The core concepts are described well, but I expected more detail and at least some content about the actual implementation of algorithms for stream analytics. The comparison of frameworks is good, but I miss some code snippets there (e.g. a short hello-world for each framework), especially because the analysis tier is also only loosely covered in the pet project of the final chapter.

As I’m not a Java person (Scala & Python FTW), I’m not the right one to judge the final project. The scope and domain of the project are IMHO a good choice anyway. And the core logic of the code is easy to understand, even for people who aren’t fluent in Java.

Key Take-aways

  • There are no one-size-fits-all solutions for streaming data systems today. Every technology involved has strengths as well as drawbacks. Choose the ones that fit your specific use case best.
  • This book is a great reminder to apply the KISS principle whenever possible. Every unnecessary feature you add is a potential point of failure of the system.
  • The concept of time is highly important when it comes to stream analytics.
  • Analytical tasks that are simple in batch mode (like counting) become really complex in a streaming system.
  • The capabilities of products in the persistence layer are constantly improving, and the boundaries between the different approaches are blurring (well beyond the SQL vs. NoSQL dichotomy).
