Write composable Spark SQL analytics in JupyterLab

Put a bit of magic in your Spark SQL notebooks using JupyterLab SQL cell magic editor

Jean-Claude Cote
Towards Data Science


jupyterlab-sql-editor is an IPython magic that empowers you to write composable analytics in Spark SQL. This JupyterLab extension has a rich set of features:

  • Auto-completion of
    - table names, column names, nested sub-fields
    - table joins
    - functions
  • SQL syntax highlighting and formatting
  • Multiple display outputs
    - HTML and plain text
    - interactive data grid and JSON tree
    - capture of dataframes and views

Editing SQL statements

jupyterlab-sql-editor extends JupyterLab with SQL formatting and syntax highlighting.

jupyterlab-sql-editor also registers an LSP server which provides autocompletion. Press the <tab> key to trigger autocompletion of table and column names, including nested sub-fields. Use the %%sparksql --refresh all command to update the local cache file used by the autocompleter. The refresh command enumerates the functions, tables and columns found in the current Spark context.

Autocomplete Spark SQL functions with documentation and usage examples.

There are also a few power user features like auto filling all column names in the SELECT statement and auto suggesting JOIN conditions on matching column names.

Displaying results

The sparksql cell magic is a convenient way of executing Spark SQL statements. In this example we execute a statement and print the results as text.

%%sparksql --output text
SELECT * from range(1, 1000)

We can do the same thing in Python.

statement = 'select * from range(1, 1000)'
df = spark.sql(statement)
df.show()

As you can see, it’s not really magic. The sparksql magic takes the body of a cell as the statement, executes it and prints the results. The extension simply encapsulates boilerplate Python code and makes it readily available via command-line options.

Jupyterlab-sql-editor supports a rich set of output formats such as an interactive data grid. We use Bloomberg’s ipydatagrid which can post-process (sort, filter, and search) the results returned by the Spark query.
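For example, assuming grid is the output name the extension uses for the ipydatagrid widget, a cell like this renders the results in the interactive grid:

%%sparksql --output grid
SELECT * from range(1, 1000)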

A tree widget displays the schema of the results. This is particularly useful for columns containing complex nested data structures.

Use the JSON output to navigate the contents of complex nested columns. This output uses the standard IPython tree widget, which can search any of the nested fields.
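Assuming json is the corresponding output name (the table name below is illustrative), the tree view can be requested like this:

%%sparksql --output json
SELECT * FROM my_table_with_nested_columns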

So far we have mostly seen how to query data; however, you are not limited to queries. You can perform any SQL statement supported by Spark SQL.

%%sparksql
CREATE TABLE prod.db.sample (
    id bigint COMMENT 'unique id',
    data string)
USING iceberg

%%sparksql
ALTER TABLE prod.db.sample
ADD COLUMNS (
    new_column string comment 'new_column docs'
)

Use the line magic %sparksql for simple one-liner statements:

%sparksql SET spark.sql.variable.substitute=false
%sparksql DROP TABLE prod.db.sample
%sparksql CALL prod.system.set_current_snapshot('db.sample', 1)

Write composable analytics using SQL

Common table expressions (CTEs) are a standard way of breaking large SQL statements into more manageable pieces. Jupyterlab-sql-editor not only supports CTEs but lets you capture or alias SQL statements as views which can then be reused in later cells. This mechanism takes advantage of the Spark dataframe’s createOrReplaceTempView function. Use the --view option to create a temporary view. To prevent immediate execution of the query, use the --output skip option. When this option is used only the view is created, as shown below.
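For example, a cell like this (the view name and filter are illustrative) registers a temporary view without running the query:

%%sparksql --view recent_samples --output skip
SELECT id, data
FROM prod.db.sample
WHERE id > 100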

Autocompletion also works on temporary views. Use the %%sparksql --refresh local option to update your autocomplete cache file with any local views you have created.

Interoperate between dataframe and SQL

So far we have seen how to use SQL views to create composable SQL analytics. However, you are not limited to SQL only; you can switch from SQL to dataframes and back. Parts of your analysis might be better suited to the dataframe API while others might lend themselves better to the SQL dialect. Jupyterlab-sql-editor makes it very easy to switch between the two. Use the --dataframe option to capture the result of an SQL statement as a dataframe.
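For example, a cell like the following (the name df_sample is illustrative) makes the query result available as a Spark dataframe for further processing with the dataframe API:

%%sparksql --dataframe df_sample
SELECT id, data
FROM prod.db.sample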

Given any dataframe, switch back to SQL by calling df.createOrReplaceTempView(). For example, you might be using Spark ML, Spark GraphFrames, or simply a data source like CSV. Many of the libraries in Spark create dataframes.

df = spark.read.csv(path)
df.createOrReplaceTempView('my_csv_table')

Then refer to the view you created in %%sparksql:

%%sparksql
SELECT * FROM my_csv_table

Remember to update your local cache by running %%sparksql --refresh local and enjoy the autocompleter!

Installation

jupyterlab-sql-editor has two main dependencies: jupyterlab-lsp and sql-language-server. Installing the extension is very easy.

pip install jupyterlab-lsp jupyterlab-sql-editor

Install the sql-language-server project to provide autocompletion.

sudo npm install -g sql-language-server

For a complete list of configuration options see the detailed installation guide.

Conclusion

In this article we showed how to leverage %%sparksql to easily write composable analytics in Spark SQL. We focused on Spark, however jupyterlab-sql-editor also includes a %%trino magic! More SQL engines might be added in the future. Contributions are welcome! Here’s our Git repo: CybercentreCanada/jupyterlab-sql-editor.

In a follow-up article we will cover jupyterlab-sql-editor’s support for SQL templating using Jinja and DBT.

Many Thanks

To contributors of these projects:

krassowski/jupyterlab-lsp
joe-re/sql-language-server
zeroturnaround/sql-formatter
cryeo/sparksql-magic
bloomberg/ipydatagrid


Data democratization advocate and principal engineer at Canadian Centre for Cyber Security | jean-claude.cote@cyber.gc.ca | Twitter @cybercentre_ca