SQL + Jinja is not enough — why we need DataFrames

Furcy Pin
Towards Data Science
20 min read · Feb 28, 2022


In October 2021, Max Beauchemin, creator of Apache Airflow, wrote an excellent article about 12 trends that are currently affecting data engineers. One of them was beautifully named “Mountains of Templated SQL and YAML”, and it really echoed my own perception. He compared the SQL + Jinja approach to the early PHP era which — thank god — I never witnessed myself, and explains that:

If you take the dataframe-centric approach, you have much more “proper” objects, and programmatic abstractions and semantics around datasets, columns, and transformations.
This is very different from the SQL+jinja approach where we’re essentially juggling with pieces of SQL code as a collage of strings

So I started an open-source POC in Python to illustrate that point, and in particular, to demonstrate how much further the dataframe-centric approach can bring us.

I called my POC project bigquery-frame; it provides a DataFrame API for Google BigQuery. By sharing it, I hope it will shed more light on the power of DataFrames and, if someone at Google reads this, convince them that they should develop a proper DataFrame API for BigQuery and catch up with Spark and Snowflake (if they are not working on it already).

After briefly explaining where DataFrames come from, how my POC project bigquery-frame works, and how to use it, I will give several practical examples that demonstrate things that can be done much more simply and elegantly with DataFrames than with SQL (even with Jinja, even with dbt).

This is a very long post, so I made a table of contents, and if you don’t have the time to read it now, why not bookmark it and come back whenever the time is right? Also, since I could not find any xkcd comic about DataFrames this time, I used copyright-free images to let you relax from time to time. I hope you enjoy them. Also, I hope you like dogs.

Just go to the top of this page if you want to bookmark this article. I also find this feature quite handy for finding articles I liked again.

Table of contents

Intro

  • A quick history of DataFrames
  • bigquery-frame: how does it work?
  • bigquery-frame: how to try it?

Things you can do with DataFrame that you can’t do with SQL (or at least not as elegantly)

  • Foreword: DataFrame is a superset of SQL
  • On-the-fly introspection
  • Chaining operations
  • Generic transformations
  • Higher level abstractions

Outro

  • Limits of the bigquery-frame approach
  • SQL does have advantages too
  • Towards a unified library for SQL?

Conclusion

There are almost as many puppies in this picture as there are topics to discuss in this blog post. (Photo by Jametlene Reskp on Unsplash)

Intro

A quick history of DataFrames

The concept of the DataFrame was first popularized by Pandas, open-sourced in 2009. In 2015, Spark 1.3.0 released a DataFrame API along with Spark Catalyst, which for the first time unified the SQL and DataFrame APIs under the same optimizer, allowing data engineers to switch easily between SQL and DataFrames, depending on their needs. In 2021, Snowflake released Snowpark, a DataFrame API running on Snowflake, clearly inspired by Spark (even its name: SnowPARK).

Bigquery-frame: how does it work?

The idea behind bigquery-frame is simple: I wanted to make a DataFrame class which would look and behave as similarly as possible to a Spark DataFrame, except it would run on BigQuery. Since BigQuery only understands SQL, my approach was simple: each DataFrame corresponds to a SQL query which is only executed when an action (df.show(), df.collect(), etc.) is performed.

Here is a small example showing what the code looks like with PySpark (on the left) and bigquery-frame (on the right).

On the left: pyspark (code available here); On the right: bigquery-frame (code available here)
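Since the screenshots are not reproduced here, here is a rough, simplified sketch of the kind of code shown on each side (not a copy of the linked examples; spark is a regular SparkSession and bq is a bigquery-frame session object created as shown in the project’s examples):

# PySpark: a DataFrame is built from a query or a table, then refined with method calls
df = spark.sql("SELECT 1 AS id, 'Bulbasaur' AS name")
df.select("id", "name").show()

# bigquery-frame: the same style of code, except the DataFrame wraps a BigQuery SQL query
# that is only sent to BigQuery when an action like .show() is performed
df = bq.sql("SELECT 1 AS id, 'Bulbasaur' AS name")
df.select("id", "name").show()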

Bigquery-frame: how to try it?

My project is available on PyPI, so you can just install it with pip or your favorite Python package manager (by the way, if you don’t know Python Poetry, try it, it’s awesome).

pip install bigquery-frame

(I recommend installing it in a Python virtual environment to avoid any conflict with your main Python installation)

Then, go have a look at the AUTH.md file that explains how to configure your project to access BigQuery. If you have a dummy test GCP project with nothing on it (it takes 5 minutes to create one along with a Gmail address, and it’s completely free), you can use Method 1. Otherwise, if you use a real GCP project, I suggest using Method 2 to use a proper service account with minimal privileges.

The examples folder contains several code examples, most of which I will use in this article.

Things you can do with DataFrame that you can’t do with SQL (or at least not as elegantly)

Foreword: DataFrame is a superset of SQL queries

Before we deep-dive into what DataFrames can do that SQL can’t (or at least not as elegantly), I would like to point out that DataFrame APIs are a superset of SQL queries:

Everything that can be expressed with a SQL query can simply be copy-pasted inside DataFrame code, like this: bigquery.sql(my_sql_query).

If your SQL query has Jinja in it, you can simply compile it with Python code too, like this:

from jinja2 import Template

# Render the Jinja template with plain Python, then run the compiled SQL as a DataFrame
query_template = Template(my_sql_query_with_jinja)
compiled_query = query_template.render(**my_jinja_context)  # my_jinja_context: dict of template variables
df = bigquery.sql(compiled_query)

Of course, this may not work for SQL scripts that use FOR LOOPs and EXECUTE IMMEDIATE to run multi-phased SQL logic, but the good news is that you can use Python and DataFrames to do that in a much, much cleaner way instead!
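For example, a SQL script that loops over monthly tables with EXECUTE IMMEDIATE can become a plain Python loop. A minimal sketch (the table names are made up, and I assume a Spark-like union method on the DataFrame):

# Build one DataFrame per monthly table, then union them all,
# instead of concatenating SQL strings inside a SQL FOR LOOP
months = ["2022_01", "2022_02", "2022_03"]  # hypothetical table suffixes
dfs = [bq.sql(f"SELECT *, '{month}' AS month FROM my_dataset.sales_{month}") for month in months]

result_df = dfs[0]
for df in dfs[1:]:
    result_df = result_df.union(df)  # assumes a Spark-like union() method
result_df.show()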

One more word about SQL scripts

I am well aware that all the examples I will show below to showcase DataFrame’s superiority can probably be re-coded somehow with pure-SQL scripts. But I’m also pretty sure that the result would look horrible. This is why I added the caveat “(at least not as elegantly)”. Therefore, in the examples below, I will focus on the differences between DataFrames and SQL queries and I won’t discuss the SQL-scripts alternative. I invite those who are not convinced to try rewriting some of the most complex examples below with generic, clean and unit-tested pure-SQL scripts. Good luck with that. (In case someone is crazy enough to rise to the challenge and prove me wrong: please make sure to also track how long it takes you. For reference, it took me a couple of weekends to do this entire bigquery-frame POC, at the time of writing.)

And now, let’s have a look at what DataFrames can do for you that SQL queries can’t.

This is where we start giving real examples. Isn’t he cute, by the way? (Photo by Alvan Nee on Unsplash)

On-the-fly introspection

One of the advantages of DataFrames is that at any point in the transformation chain, the program can inspect the schema of the current result and act on it.

An example: sort_columns

Let’s start with a simple example: you have a SQL query, and you would like your output table’s columns to be sorted alphabetically. Of course, if the table has only a few columns, you can do it yourself in SQL. Even if it has a lot of columns, it takes less than 30 seconds to write a SELECT statement in Sublime Text with the columns sorted by name (just press F9).

Classic approach with SQL

While it does solve the problem, this approach has several drawbacks:

  • If you add or rename a column in T, you must remember to update your final statement.
  • If another developer updates your query, they might not know or think about the convention and add new columns at the end of the final SELECT instead of keeping the names sorted.

With DataFrames, on the other hand, the solution is much simpler and more robust:

df = bq.sql("/* SOME TRANSFORMATION */")
df_with_sorted_columns = df.select(*sorted(df.columns))

df.columns returns the list of column names, sorted sorts it, and df.select generates a SELECT statement with the columns sorted alphabetically. If you add or rename a column in your upstream transformation, it doesn’t break.

You can even write a sort_columns function to make the code even simpler:

from bigquery_frame.transformations import sort_columns

df = bq.sql("/* SOME TRANSFORMATION */")
df_with_sorted_columns = sort_columns(df)
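For the curious, such a helper only takes a couple of lines. Under the hood it could look as simple as this (a sketch, not the exact bigquery-frame implementation):

def sort_columns(df):
    """Return a DataFrame with the same rows but with its columns sorted alphabetically."""
    return df.select(*sorted(df.columns))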

Other applications

This introspection capability does not only work for column names: with df.schema you can access the whole schema (column names, types and descriptions) of your intermediate results. With it, it becomes easy to write generic code, such as the sketch below.
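For example, here is a minimal PySpark sketch of a generic transformation that casts every date or timestamp column to string, whatever the input table looks like (the same idea works with bigquery-frame’s df.schema):

from pyspark.sql import functions as f
from pyspark.sql.types import DateType, TimestampType

def stringify_temporal_columns(df):
    """Cast every DATE or TIMESTAMP column to STRING, leaving the other columns untouched."""
    columns = [
        f.col(field.name).cast("string").alias(field.name)
        if isinstance(field.dataType, (DateType, TimestampType))
        else f.col(field.name)
        for field in df.schema.fields
    ]
    return df.select(*columns)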

SQL without DataFrames makes me as sad as a dog on a leash. (Photo by Qusai Akoud on Unsplash)

Chaining operations

As we have just seen, DataFrames allow you to inspect the schema of your intermediate results and make the next transformation depend on it, but you can also directly inspect the data. This can’t be done with a SQL query either, and for a good reason: no matter which SQL engine you are using, SQL queries must be fully compiled — and the output schema must be fully known — before running. This is what allows the SQL optimizer to plan and optimize the query completely. Some advanced engines are capable of Adaptive Query Execution, for instance to choose the right type of join at runtime instead of compile time: Spark added this feature in 3.0, and BigQuery and Snowflake probably have similar optimizations, but they keep their secret sauce to themselves. Still, such SQL optimizations are already highly advanced.

The following example will demonstrate how DataFrames allow the user to adapt their query based on an intermediate result: let’s do a pivot!

An example: pivot

Pivot is a very common example of a transformation that Excel users want to do and can’t do easily with SQL. Of course, most advanced SQL engines, such as Spark or BigQuery, ended up adding a PIVOT statement to their SQL syntax (in 2018 for Spark, in 2021 for BigQuery). But… there’s a catch! Spark’s pivot statement is less powerful in SQL than in the DataFrame API, and as we will see, the same holds between BigQuery and bigquery-frame.

To understand why, let’s have a look at BigQuery’s PIVOT syntax. In their documentation, they give the following example. After generating a table that looks like this:

+---------+-------+---------+------+
| product | sales | quarter | year |
+---------+-------+---------+------+
| Kale    | 51    | Q1      | 2020 |
| Kale    | 23    | Q2      | 2020 |
| Kale    | 45    | Q3      | 2020 |
| Kale    | 3     | Q4      | 2020 |
| Kale    | 70    | Q1      | 2021 |
| Kale    | 85    | Q2      | 2021 |
| Apple   | 77    | Q1      | 2020 |
| Apple   | 0     | Q2      | 2020 |
| Apple   | 1     | Q1      | 2021 |
+---------+-------+---------+------+

They invite us to try a pivot like this:

SELECT * FROM Produce
PIVOT(SUM(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))

+---------+------+----+------+------+------+
| product | year | Q1 | Q2   | Q3   | Q4   |
+---------+------+----+------+------+------+
| Apple   | 2020 | 77 | 0    | NULL | NULL |
| Apple   | 2021 | 1  | NULL | NULL | NULL |
| Kale    | 2020 | 51 | 23   | 45   | 3    |
| Kale    | 2021 | 70 | 85   | NULL | NULL |
+---------+------+----+------+------+------+

Here I highlighted the problematic part FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'): BigQuery expects you to know the values inside the column you want to pivot. They chose their example well, as no one expects to one day see 'Q5' appear in those values. But let’s say you do a pivot by country, and your company launches in a new country. Do you really want to have to update your pivot query every time this happens? Wouldn’t it be simpler if the query was able to automatically infer the values inside the pivot column first, then do the pivot? When you think about it, it is no surprise that the pure-SQL PIVOT syntax can’t do it: remember when we said that “SQL queries must be fully compiled — and output schema must be fully known — before running”? This is a perfect example of this limitation.

If you look at Spark-SQL’s pivot syntax, it has the same issue. However, if you look at Spark DataFrame’s pivot method, you’ll see it comes in two flavors: pivot(pivotColumn) and pivot(pivotColumn, values). The documentation of these methods explains why:

There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

And as we’ve just seen, only the first type of pivot (the one requiring explicit values) is available in the SQL version, and the same goes for BigQuery.

So, I implemented in bigquery-frame a pivot (and unpivot) method that works like Spark DataFrame’s. I also illustrated it with an example. I won’t linger on the details, since we still have two more examples to go, and I invite you to go have a look for further information. One thing to note: I made two distinct implementations: the first one doesn’t even use BigQuery’s PIVOT statement, only a GROUP BY, while the second one does. As you can see, both implementations take less than 15 lines of code, showing how much less work it is to write a pivot method with DataFrames than it is to extend a SQL syntax with a PIVOT statement (which requires updating the SQL lexer/parser before even getting started).
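To give an idea of the logic behind the GROUP BY implementation, here is a simplified PySpark sketch (not the actual bigquery-frame code) of a pivot that infers the distinct values of the pivot column at runtime:

from pyspark.sql import functions as f

def pivot_with_inferred_values(df, group_cols, pivot_col, agg_col):
    """Pivot df on pivot_col without knowing its values in advance."""
    # 1. Inspect the data: collect the distinct values of the pivot column
    values = [row[0] for row in df.select(pivot_col).distinct().collect()]
    # 2. Build one conditional aggregation per distinct value
    aggs = [
        f.sum(f.when(f.col(pivot_col) == value, f.col(agg_col))).alias(str(value))
        for value in values
    ]
    # 3. A single GROUP BY produces the pivoted table
    return df.groupBy(*group_cols).agg(*aggs)

# e.g. pivot_with_inferred_values(produce_df, ["product", "year"], "quarter", "sales")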

By the way, dbt-utils’s pivot suffers from the same limitation as the pure-SQL syntax: the row values that will be turned into columns must be known in advance.

Other applications

Applications here are limitless: once you can chain multiple transformations and make the nature of the next transformation depend on the previous one, the possibilities are huge. Here we did a simple pivot, but if we add for loops in the mix we can do amazing things such as:

  • Implement graph algorithms that run on BigQuery, just like graphframes does on Spark.
  • Implement advanced feature-engineering logic that starts by analyzing a BigQuery table before automatically choosing how to transform each column based on its distribution. All of it running on BigQuery.

DataFrames help reduce code duplication. (Photo by Bharathi Kannan on Unsplash)

Generic transformations

Another thing that can be done with DataFrames is applying generic transformations to our intermediate tables. This is mostly thanks to the on-the-fly introspection feature described above, but it can go much further than simply sorting a DataFrame’s columns.

An example: flatten

One of the things SQL has been very bad at since the beginning is handling nested data. Before 2010, most SQL databases didn’t even support nested data formats, until Apache Hive started doing it. Nowadays, I find that BigQuery does a much better job than every other SQL engine, including Spark-SQL, when it comes to dealing with nested data. But that’s another subject entirely and I plan to write (yet) another article on that later. Even with BigQuery, we may often need to flatten a nested structure, just to be able to export it as CSV or make it usable by “archaic” tools like Excel or Power BI which do not support record (a.k.a. struct in Spark) types.

In bigquery-frame, I added a flatten method that can be used to automatically flatten all non-repeated records (a.k.a. structs) into simple columns.

I wrote a similar function for pySpark a while ago, along with an unflatten method, which is the reverse operation. Spark makes it easier than BigQuery, because it supports any character inside column names, while BigQuery only supports letters, numbers and underscores.

One thing to note if you try to use it: it does not flatten repeated records (a.k.a. array<struct<...>> in Spark). The reason for that is simple: if you have a table with two columns that are repeated records, unnesting them (a.k.a. lateral view explode(...) in Spark) will give you one row per element in the cartesian product of the two arrays. This can quickly lead to combinatorial explosion, and it’s up to the user to handle each array properly.
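To make this concrete, here is a minimal PySpark sketch of such a flatten (simplified, using “_” as a separator, and deliberately leaving repeated records untouched, as explained above; this is not the exact implementation linked above):

from pyspark.sql import functions as f
from pyspark.sql.types import StructType

def flatten(df, separator="_"):
    """Recursively flatten all non-repeated struct columns into simple columns."""
    def expand(schema, path="", alias=""):
        cols = []
        for field in schema.fields:
            col_path = path + field.name    # e.g. "a.b.c", used to reference the nested column
            col_alias = alias + field.name  # e.g. "a_b_c", used as the name of the flat column
            if isinstance(field.dataType, StructType):
                cols += expand(field.dataType, col_path + ".", col_alias + separator)
            else:
                cols.append(f.col(col_path).alias(col_alias))
        return cols
    return df.select(*expand(df.schema))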

Other applications

One application that I would like to implement some day would be a method that makes it easy to apply transformations inside nested structures.

Imagine you have a table with 5 levels of nesting or more, your column types look as ugly as array<struct<array<struct<array<struct<…>>>>>>, and you just want to apply a coalesce() to the column at the deepest level of the structure. Good luck writing that with pure SQL. I can easily imagine a method that would work with SQL like dpath does with JSON. Something which would look like this:

df.transform_column("a.b[*].c", lambda c: f.coalesce(c, f.lit("")))

This is how it feels when you reach the limits of an abstraction. (Photo by Jamie Street on Unsplash)

Higher level abstractions

With this last example, we will show how higher level abstractions can help us automate data profiling.

An example: analyze

I implemented in bigquery-frame a method that I called analyze. It is similar to pandas.DataFrame.describe and pyspark.sql.DataFrame.summary, but it doesn’t give exactly the same statistics. Analyze gives for each column: the column name, type, count, count distinct, number of null values, min, max, and an approximation of the 100 most frequent values (using BigQuery’s APPROX_TOP_COUNT method). It also supports nested structures and automatically unnests records and repeated records.

Here is a quick example of it in action (the full example, including imports and the DataFrame’s creation, is available here):
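In case the linked example isn’t at hand, here is a minimal usage sketch (the import path is from memory and may differ; check the examples folder for the real thing):

from bigquery_frame.transformations import analyze

df = bq.sql("""
    SELECT 1 AS id, 'Bulbasaur' AS name, ['Grass', 'Poison'] AS types
    UNION ALL
    SELECT 2, 'Charmander', ['Fire']
""")
analyzed_df = analyze(df)
analyzed_df.show()  # one row per column: name, type, count, count_distinct, nulls, min, max, top values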

And if you install pandas, pyarrow and xlsxwriter, you can even export the results into an Excel file with this one-liner:

analyzed_df.toPandas().to_excel("output.xlsx", engine="xlsxwriter")

Other applications

In the history of computer science, higher levels of abstraction are like industrial revolutions: they unlock so many possibilities that they can completely change the way people work. Here are examples of higher-level abstractions that people could build more easily with DataFrames.

  • Building tools that automatically compare two DataFrames (here is an open-source example in Spark; a minimal sketch of the idea follows this list). This is especially useful for non-regression testing. Let’s say I refactor a SQL query to make it cleaner. The first thing I want to make sure of is that the result of the query remains exactly the same. Today, every software engineer uses git’s diff to review the changes they make to their code. Tomorrow, every analytics engineer should also diff tables and review the changes they make to the data. Hopefully I’ll have time in the future to add this feature to bigquery-frame. I also just found out that Datafold now offers that feature with a UI to render diff results, which is very exciting.
  • Building a whole data-wrangling interface like Dataiku did in its early days. It probably evolved a lot since, but back in 2013, it used to be called Dataiku Shaker and was just a GUI that chained dataframe transformations (first with Pandas, then Spark for more scalability).
  • Building higher-level pipeline transformation tools like Hamilton (based on pandas DataFrames), which provides column-level lineage to help manage complex data science feature transformations. Ironically, I believe that column-level lineage is much easier to achieve with SQL than with DataFrames, because SQL queries can be statically analyzed.
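Here is a minimal PySpark sketch of the table-diff idea from the first bullet above (df_before and df_after are assumed to be two versions of the same table with identical schemas):

# Rows present in one version of the table but not in the other
added_rows = df_after.exceptAll(df_before)
removed_rows = df_before.exceptAll(df_after)

if added_rows.count() == 0 and removed_rows.count() == 0:
    print("No regression: the refactored query returns exactly the same rows")
else:
    added_rows.show()
    removed_rows.show()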
No matter where this is going, I’m looking forward to it. (Photo by Andrew Pons on Unsplash)

Outro

Limits of the bigquery-frame approach

“Query is too Complex”

If you dig into the code of the analyze method, you will see that I had to use some weird workaround to deal with the following error message:

Not enough resources for query planning — too many subqueries or query is too complex

This message appeared in my first implementation when I tried to analyze tables with hundreds of columns. So I had to work around it by saving intermediary results into temporary tables before returning the whole result. This makes the execution slower, as I run several queries instead of just one, but it doesn’t change the billing cost, thanks to BigQuery’s billing model, because I read each column exactly once and the intermediate results I store in the temporary tables are very small.

This is clearly the limit of my POC’s approach: unlike Spark, BigQuery is currently tailored for SQL queries only, not crazy-complex transformations with 300 sub-steps. I couldn’t find any documentation explaining exactly which of BigQuery’s hard limits triggered this error: for instance, I was still way below the 1 MB limit for “Maximum unresolved Standard SQL query length”. I believe the query had so many sub-stages that the query planner probably consumed too much RAM and crashed, or something like that. I found a workaround, but it does make it harder to implement high-level abstractions, compared to Spark.

Adding Python UDFs will be difficult

Another great thing with pySpark DataFrames is that you can easily add Python UDFs to apply complex business logic to your rows. Granted, this is not optimal in terms of performance, but the productivity gain for the developer can be huge: a unit-tested Python UDF is generally much easier to write and maintain than an (often) awkward-looking SQL implementation, even if you encapsulate it in a dbt macro or a BigQuery SQL UDF with a CREATE FUNCTION statement: it will be harder to unit-test, maintain and even deploy (if you go for the SQL UDF).
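For reference, here is what such a UDF looks like in PySpark (the business logic and column names are hypothetical):

from pyspark.sql import functions as f
from pyspark.sql.types import StringType

def normalize_phone_number(raw):
    """Plain Python function: trivial to unit-test with pytest, no cluster needed."""
    if raw is None:
        return None
    digits = "".join(c for c in raw if c.isdigit())
    return "+33" + digits[1:] if digits.startswith("0") else "+" + digits

normalize_phone_number_udf = f.udf(normalize_phone_number, StringType())

# hypothetical columns: raw_phone in the input table, phone in the output
df = df.withColumn("phone", normalize_phone_number_udf(f.col("raw_phone")))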

Adding Python UDFs to bigquery-frame sounds difficult, although perhaps not impossible: BigQuery recently released a new feature called Remote Functions which allows you to deploy arbitrary code in a Cloud Function (including Python code!) and then call those functions via a BigQuery UDF. At first glance, this seems to require a lot of steps to set up a new UDF, but maybe it could all be automated into a bigquery.register_udf(my_python_function) one day, as pySpark did.

Debugging your code

Another downside of frameworks that generate SQL strings is that they can sometimes make debugging difficult. The error message will describe a problem in the generated SQL string, not in your Python code. But pySpark DataFrame’s error messages aren’t always great either, so perhaps that evens out.

SQL does have advantages too

This comparison wouldn’t be very fair if I didn’t say a quick word about the actual advantages that pure SQL has over DataFrames.

  • SQL is faster to write: When I need to write throw-away code to perform an investigation, I’ll go much faster writing it in SQL than with DataFrames. But I can write SQL inside DataFrames, so I’m fine.
  • SQL makes auditing much easier: Some companies must comply with regulators who ask to be able to audit their data. They want to be able to know who queried a dataset and what they did with it. This is much easier with SQL, since the complete description of the transformation fits in a SQL string (well, except if you call external functions, of course). With pySpark, since your DataFrame code can do anything, call any library, and be spread across hundreds of files, you would have to store every version of the code that ever ran on your data.
  • SQL can be statically analyzed: I already mentioned this, but building column-level lineage is much easier with SQL than with DataFrames. Monzo wrote an article explaining how they did it internally; it’s a very interesting read.
  • With DataFrames you can’t use dbt: Obviously, DataFrame-centric pipelines need a tool that makes them easier to build and maintain, like dbt does for SQL. I’m sure that more open-source initiatives like fal will eventually appear in the next few years. It’s hard to say if they will be able to catch up with dbt’s incredible popularity.

As we have seen, the bigquery-frame approach does have some limitations because it generates SQL strings. But thanks to this, it also benefits from some of SQL’s upsides, like easier auditing and static analysis of the compiled queries.

Towards a unified library for SQL?

Perhaps one day, someone crazy enough will try to extend my POC into a unified DataFrame abstraction that would work with any SQL dialect (not just BigQuery but also Snowflake, Postgres, Azure Synapse, Spark-SQL, etc.). This would have a great advantage: anyone building on top of this abstraction would be able to bring their technology to any SQL engine. Let’s take Malloy, which currently only works on BigQuery and Postgres. If they used an intermediate abstraction level like DataFrames, they could extend to other SQL dialects faster and join forces with other projects in solving this problem.

A few hours after I finished writing this article, I saw George Fraser’s latest blog post for Fivetran. He claims that SQL could become a library language. I definitely agree that it would be great. But then he says:

First, we have to accept the fact that different implementations of SQL are effectively different languages, and that open-source libraries will need to be built separately for each database management system

Well, if this is really what the industry needs, why not instead build a generic DataFrame abstraction that would take care of translating to any SQL dialect, and let everyone benefit from it and focus on building great libraries on top of it?

The obvious pitfall with this approach would be the risk of getting a leaky abstraction. But I’m not too worried about that. Remember when I said that DataFrames were a superset of SQL? This means that whenever you need to, you can still go back to plain-SQL code. So, if you really need to take advantage of this really niche feature that only BigQuery has and bigquery-frame does not support yet, you can still write it in pure-SQL.

Sparky and Biggy running side by side. (Photo by Alvan Nee on Unsplash)

Conclusion

By writing this post, I wanted to convey the following messages:

  • The DataFrame API + Python provide a much more powerful level of abstraction than pure-SQL + Jinja, as I hope to have demonstrated with several examples.
  • It runs on the same infrastructure, giving access to higher-order transformations without requiring you to set up a cluster with another technology (like Spark, Dask or Vaex¹).
  • Spark users and Databricks have known this for years now.
  • Snowflake is quickly catching up with Snowpark.
  • dbt Labs should acknowledge this as one of their greatest weaknesses (in fact, they already did) and build a long-term plan to fix it if they want to grow their sphere of influence beyond analytics engineers. But adding DataFrame support directly into dbt seems so far away from the original design that it might prove quite difficult. Some, like fal.ai, are already working in that direction by providing a way to run pandas tasks between dbt jobs.
  • Last but not least: I feel like Google is lagging behind right now and, as a huge fan of BigQuery, it makes me sad.

This last point is really what motivated me to start this bigquery-frame POC in the first place. I sincerely hope that it will help convince some people at Google to start a similar project, if they haven’t already. BigQuery users need something properly done, not just a POC like this one. I see several possible approaches:

  1. Continuing to improve bigquery-frame, which generates SQL strings. After all, several billion-dollar companies like Tableau or Looker are mostly based on automatically generated SQL (and dataviz), and so is Malloy, Looker’s founder’s new project. The main issue I see with that approach is the infamous “Query is too complex” error (cf. Limits of the bigquery-frame approach). The BigQuery team would have to push this hard limit way further to prevent this kind of error from happening too often. This might not mesh well with BigQuery’s on-demand pricing model that bills per terabyte of data read.
  2. Proposing a real DataFrame API for BigQuery that directly compiles logical plans like Spark Catalyst does. I don’t know the implementation details of Snowpark, but I suspect that’s how they did it (rather than my poor man’s solution of generating SQL strings). One advantage would be that it could return better error messages than the SQL compilation errors you get in bigquery-frame. This might prove challenging for Google if they don’t want to expose too much of BigQuery’s internals (but they did open-source BigQuery’s lexer/parser, so they might be okay with it). Spark has the advantage of being open source, which allows it to run locally: the unit tests of the transformations I implemented in bigquery-frame were significantly slower than their PySpark equivalents because of that.
  3. Investing a lot of effort into Apache Beam’s DataFrame API and its integration with BigQuery. That approach makes a lot of sense, since on GCP, Apache Beam runs on Dataflow, which looks like the closest thing to Spark that Google built internally. Perhaps BigQuery and Dataflow even share some common parts behind the scenes. On paper, Apache Beam already has a DataFrame API and is already well integrated with BigQuery. However, maybe I didn’t look hard enough, but I still haven’t found any code example with Apache Beam that looks as simple as bigquery.table("source_table").withColumn(...).write("dest_table").
  4. Directing their users towards Spark on Dataproc. This does indeed work, but it makes me sad because I think that BigQuery has some huge qualities that Spark-SQL doesn’t have (but that’s a story for another time).

Links to the project

I don’t think I will have much time to spend on improving bigquery-frame in the near future; nevertheless, I will try to look at any issue or pull request opened on it. Any user feedback or contribution would be most welcome, of course. Please keep in mind that this is just a POC before you decide to use it in production ;-).

THANK YOU for reading!

Me after finishing writing this post (Photo by Lucas Expedidor on Unsplash)



[Available for freelance work] Data Engineer, Data Plumber, Data Librarian, Data Smithy.