Spark Streaming Made Easy with JupyterLab SQL Magic

jupyterlab-sql-editor is a JupyterLab extension that makes it a breeze to execute, display and manage Spark streaming queries

Jean-Claude Cote
Towards Data Science

Photo by Fadhila Nurhakim on Unsplash

One of the roles of the Canadian Centre for Cyber Security (CCCS) is that of a Computer Emergency Response Team (CERT). In this role, the CCCS detects anomalies and issues mitigations as quickly as possible.

In an environment where response time is critical, the CCCS leverages Spark Structured Streaming and the Kafka event streaming platform.

In this article we demo the latest addition to our JupyterLab extension, jupyterlab-sql-editor: support for Spark streaming. See our previous article, Write composable Spark SQL analytics in JupyterLab, for a complete list of features and instructions on how to install jupyterlab-sql-editor.

Typical Spark Streaming API Usage

Here's a typical example of using the Spark streaming API in a notebook. Notice the use of .format("memory"). The console sink does not work in a notebook environment, so we use the memory sink, a well-known alternative in notebooks.
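A minimal sketch of such a cell is shown below; the rate source, row rate and query name are illustrative choices, not prescribed by the extension.

# A streaming dataframe from the built-in rate test source (illustrative choice);
# it emits (timestamp, value) rows at a fixed rate.
df = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# Write the stream to the memory sink; results accumulate in an in-memory
# table named after the query.
query = (
    df.writeStream
    .format("memory")
    .queryName("mem_results")
    .outputMode("append")
    .start()
)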

To display the results of the streaming query, we retrieve data from the mem_results table it creates.
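In practice this amounts to re-running a cell along these lines:

# Query the in-memory table populated by the streaming query; re-running
# this cell shows the latest results as the stream progresses.
spark.sql("SELECT * FROM mem_results").show(truncate=False)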

Wrapping Boilerplate Code

The display_df function of jupyterlab-sql-editor is now able to detect that the provided dataframe is a streaming dataframe. In that case it performs all of the boilerplate code from above and displays the current results of the mem_results table.

All you have to do is create a streaming dataframe and pass it to display_df. The display_df function can render results in many formats and show the schema (the shape) of the results.
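A minimal sketch of that workflow; the import path shown is an assumption and may differ between versions of the extension.

# Import path is an assumption; check your installed version of jupyterlab-sql-editor.
from jupyterlab_sql_editor.ipython.sparkdf import display_df

# Any streaming dataframe will do; the rate source is used for illustration.
streaming_df = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# display_df detects the streaming dataframe and handles the memory-sink
# boilerplate shown earlier.
display_df(streaming_df)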

In addition to these features, display_df now shows a UI tailored to streaming queries: the status of the streaming query, its metrics and a stop button.

Working with SQL

A streaming dataframe can be aliased as a temporary view. In this example, we create a streaming dataframe and then register it as the view uuid_events.
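A sketch of how such a view might be set up; the rate source and the uuid() column are assumptions suggested by the view's name.

from pyspark.sql.functions import expr

# A streaming dataframe of random UUID events; the uuid() column is an
# assumption based on the view's name.
uuid_df = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
    .withColumn("uuid", expr("uuid()"))
)

# Register the streaming dataframe as a temporary view so SQL can reference it.
uuid_df.createOrReplaceTempView("uuid_events")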

Using show tables, we can see that this view exists, which means we can use the %%sparksql magic to query it. In fact, we can also invoke %%sparksql --refresh local to make %%sparksql store this view's information in its auto-completion cache.
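In code, the check and the cache refresh look roughly like this (the --refresh local invocation comes straight from the paragraph above):

# Confirm the temporary view exists and is visible to Spark SQL.
spark.sql("SHOW TABLES").show()

# In a separate cell, refresh the auto-completion cache so %%sparksql
# knows about the new view:
#
# %%sparksql --refresh local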

Having registered this view and cached it in %%sparksql, we can now leverage all of %%sparksql's support for regular tables. These features include output modes, Jinja templating, truncation, limits, auto-completion, formatting and syntax highlighting.
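For instance, a cell along these lines limits the result set and renders it as an interactive grid; the exact flag names are assumptions based on the options covered in the previous article.

%%sparksql --output grid --limit 10
SELECT *
FROM uuid_events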

Remember, we are actually writing a streaming SQL query. As such, any aggregation needs to be bound to a window of time. For more details, see the Spark Structured Streaming guide. Here's an example of a streaming query that counts the number of occurrences of each character within a 5-second window.
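A sketch of what such a query could look like against the uuid_events view, assuming a timestamp column and a uuid column whose characters we split apart:

%%sparksql
SELECT
    window(timestamp, '5 seconds') AS win,
    character,
    count(*) AS count
FROM (
    -- split each UUID into its individual characters; column names are assumptions
    SELECT timestamp, explode(split(uuid, '')) AS character
    FROM uuid_events
)
GROUP BY window(timestamp, '5 seconds'), character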

Below are the live results. Notice that each character and its count are associated with a time window, win.

Conclusion

In this article we showed how to leverage %%sparksql to easily prototype streaming analytics in Spark SQL. The magic takes care of a lot of boilerplate code and lets you focus on elaborating your streaming analytics. You can also validate your results using the many output formats supported by the %%sparksql magic (text, grid, html, json). New feature ideas and contributions are welcome! Here's our git repo: CybercentreCanada/jupyterlab-sql-editor.

Data democratization advocate and principal engineer at Canadian Centre for Cyber Security | jean-claude.cote@cyber.gc.ca | Twitter @cybercentre_ca