Modern Data Stack: Which Place for Spark?

Furcy Pin
Towards Data Science
7 min read · Jan 25, 2022


One year ago, some were already predicting that dbt would one day become bigger than Spark, and 2021 proved them right: dbt has become incredibly popular, and rumor has it that dbt Labs might raise again at a $6 billion valuation. At this rate, they will soon catch up with Databricks, which reached a $38 billion valuation in September 2021.

Still, what struck me the most about Spark this year was how absent it was from almost every blog post about the Modern Data Stack, which is built around two key components:

  • A massively-parallel SQL engine (BigQuery, Redshift, Snowflake)
  • and … dbt

Upstream: no-code Extract/Load tools (Fivetran, Stitch, Airbyte, Hevo). Downstream: BI tools (Tableau, Looker, Power BI, Metabase) and reverse-ETL tools to export data into specialized databases (Customer Data Platforms and such).

The Modern Data Stack according to Dataiku: https://blog.dataiku.com/challenges-to-the-modern-data-stack

I only had to type “Modern Data Stack” into Google Images to notice that every company in the data market proposes its own list of technologies composing this stack, generally making sure to include itself in the list.

But I also noticed that this Modern Data Stack is generally built entirely without Spark, and that the Databricks ecosystem is mostly seen as a full alternative to it. Of course, Databricks is well aware of this and, like many others, is trying to join the small circle of SQL engines that can sit at the center of the stack: in December they released a full integration with dbt Core, as well as Databricks SQL.

Source: https://lakefs.io/thoughts-on-the-future-of-the-databricks-ecosystem/

Recently, I replied to someone from another company who was asking whether it would be worth adding Spark to their Modern Data Stack. Since my team currently uses pySpark, BigQuery, and (a little bit of) dbt, I think a lot about this question myself. So I answered them with a long list of pros and cons that fuel my current thinking, and that I’m sharing here:

Infrastructure: BigQuery is fully managed by Google: there is nothing for you to operate. In comparison, Spark is much more complex to master, even if this is getting easier (Spark serverless is available in preview on GCP and is coming to Databricks, along with Databricks SQL).

Learning curve: Here again, it’s easier to find or train skilled people on BigQuery (which is just SQL) than on Spark. My advice: prefer pySpark to Spark in Scala, Java or .NET. Python is more accessible and is already used by Airflow, for instance. I think the only valid reasons to use a language other than Python with Spark are “doing very advanced things with RDDs” and “reusing company code already written in Java without having to re-implement it”.

When I don’t know how to illustrate an article about SQL, I google for xkcd cartoons.
Many thanks to xkcd, who gave me express permission to use one of their comics in this article: https://xkcd.com/1409/

Code organisation: dbt showed the CORRECT way of organizing SQL transformation pipelines (and I should know: from 2014 to 2017 I developed a tool that did the same thing as dbt, but for Apache Hive). To my knowledge, there is currently no tool that does the same thing for pySpark (dbt does support spark-sql, but not pySpark in its entirety). This is why we developed an internal tool, which I hope to open-source one day. Without such a tool, it is very easy to fall back into the bad practices that dbt helped us leave behind.
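
To make the idea more concrete, here is a minimal, hypothetical sketch of what dbt-style organisation can look like in pySpark (this is not our internal tool, and all the model names are made up): each transformation is a named, testable function that declares which tables it depends on, and a small runner resolves them in order.

```python
# Hypothetical sketch of dbt-style organisation for pySpark "models".
# Each model is a pure function from input DataFrames to one output DataFrame,
# registered together with its dependencies so that a runner can execute the DAG.
from pyspark.sql import DataFrame, functions as f

MODELS = {}  # model name -> (list of dependencies, transformation function)

def model(name, depends_on=()):
    def register(func):
        MODELS[name] = (list(depends_on), func)
        return func
    return register

@model("stg_orders", depends_on=["raw_orders"])
def stg_orders(raw_orders: DataFrame) -> DataFrame:
    return raw_orders.select("order_id", "customer_id", f.col("amount").cast("double").alias("amount"))

@model("orders_per_customer", depends_on=["stg_orders"])
def orders_per_customer(stg_orders: DataFrame) -> DataFrame:
    return stg_orders.groupBy("customer_id").agg(f.sum("amount").alias("total_amount"))

def run(sources: dict) -> dict:
    """Naive runner: assumes models are declared in dependency order."""
    results = dict(sources)
    for name, (deps, func) in MODELS.items():
        results[name] = func(*[results[d] for d in deps])
    return results
```

The important part is not the runner itself but the fact that each model is a plain Python function: it can be unit-tested with small in-memory DataFrames, and its dependencies are explicit, much like ref() in dbt.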

Expressiveness: I love SQL, and BigQuery even more, but I just can’t do everything I need with SQL alone. Also, I don’t think that Jinja templates are the right solution: as Maxime Beauchemin puts it very well in his blog post, this will lead to mountains of templated SQL and YAML. Personally, I believe this is the same mistake as with the first schedulers: config-as-code has many limits, and Airflow tipped the scales by proving that code-as-config (thanks, Python) worked much better. Yes, Jinja does solve some of the rigidity issues of config-as-code, but I still find Jinja quite heavy and verbose, not very readable, and unit-testing Jinja code seems rather tedious.
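
To illustrate what I mean by heavy and verbose, here is a toy comparison (purely illustrative, and not a statement about how dbt should be used): the same pivot-style query generated once with a Jinja template and once with plain Python.

```python
# Toy comparison: generating the same SQL with a Jinja template vs plain Python.
from jinja2 import Template

statuses = ["pending", "shipped", "cancelled"]

# Jinja version: the loop lives inside a string template.
jinja_sql = Template("""
SELECT
  customer_id
  {%- for status in statuses %},
  COUNT(IF(status = '{{ status }}', 1, NULL)) AS nb_{{ status }}
  {%- endfor %}
FROM orders
GROUP BY customer_id""").render(statuses=statuses)

# Python version: the same loop as ordinary, testable code.
columns = [f"COUNT(IF(status = '{s}', 1, NULL)) AS nb_{s}" for s in statuses]
python_sql = (
    "SELECT\n  customer_id,\n  "
    + ",\n  ".join(columns)
    + "\nFROM orders\nGROUP BY customer_id"
)

print(jinja_sql)
print(python_sql)
```

Both produce the same query; the difference is that the Python version can be split into functions and covered by plain unit tests, whereas the Jinja version can only really be tested by rendering it.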

Limits of SQL: Like Maxime Beauchemin, I believe (or at least hope) that things will get better when BigQuery, Snowflake or Redshift provide a DataFrame API similar to the one offered by pySpark. Snowflake actually already has: they recently released Snowpark, whose DataFrame API is clearly inspired by Spark’s. I recently started a POC of a DataFrame API for BigQuery, to showcase what more could be done with something like this (some of which can already be done with Jinja macros, I admit, but in a way that I find less elegant and harder to maintain). I’ll talk more about this POC in my next article.
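
To give a taste of what a DataFrame API adds over plain SQL, here is a minimal pySpark sketch (Snowpark code would look very similar) that applies the same cleaning rule to every string column of a table, something that SQL can only do by repeating the expression for each column or by templating it:

```python
# Minimal pySpark sketch: apply the same cleaning rule to every string column,
# without having to write (or template) one expression per column.
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [(" Alice ", "PARIS", 1), ("bob", " Lyon", 2)],
    ["name", "city", "id"],
)

string_cols = [c for c, t in df.dtypes if t == "string"]
other_cols = [c for c, t in df.dtypes if t != "string"]

cleaned = df.select(
    *[f.initcap(f.trim(f.col(c))).alias(c) for c in string_cols],
    *other_cols,
)
cleaned.show()
```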

UDFs: Another important limitation of SQL: some logic is much easier to implement with UDFs than with SQL code. In BigQuery, UDFs must be written either in SQL or JavaScript (!!!). The same goes for Snowflake, which also allows Java. Go tell a Data Engineer/Analyst/Scientist who only knows Python to write a JavaScript UDF… PySpark lets us write UDFs in Python, and I can’t wait for BigQuery to allow that too.
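
For comparison, this is roughly what a Python UDF looks like in pySpark today (keeping in mind that plain Python UDFs have a serialization overhead, which pandas UDFs can reduce); the masking logic below is just a made-up example:

```python
# A Python UDF in pySpark: arbitrary Python logic applied to a column.
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[1]").getOrCreate()

@f.udf(returnType=StringType())
def mask_email(email):
    # Made-up business logic that would be painful to express in pure SQL.
    if email is None or "@" not in email:
        return None
    local, domain = email.split("@", 1)
    return local[:1] + "***@" + domain

df = spark.createDataFrame([("alice@example.com",), ("broken",)], ["email"])
df.select(mask_email("email").alias("masked_email")).show(truncate=False)
```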

Extract-Load (E/L): I’m very surprised by the large number of people who seem to use custom Airflow operators to perform E/L tasks rather than Spark. I think this is one of Spark’s greatest strengths: it has a large number of connectors to read from and write to everything and anything. It can also perform automatic schema detection on JSON and XML without headaches. And, as Ari Bajo notes, it’s better to go through a central state (Spark) and have O(n) connectors than to write O(n²) connectors for each source-destination pair. Spark can do all of that, and I assume at a much lower running cost than Fivetran (although I must say that the open-source tool Airbyte might be a serious alternative there). The setup cost with Spark may be higher, but once it’s paid, replicating and adding new sources and destinations doesn’t take very long. Another advantage: Spark can do both the ETL and the reverse-ETL.
It’s true that this requires developers where non-developers could have used a graphical interface. But I also have the feeling that a non-developer with a GUI will be less able to investigate and solve potential issues (though I might be wrong about that).
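
As a rough sketch of what such an E/L job looks like in pySpark (assuming the spark-bigquery-connector is available on the cluster; the bucket, path and table names below are placeholders):

```python
# Minimal E/L sketch: read raw JSON with automatic schema inference,
# then append it to a BigQuery table via the spark-bigquery-connector.
# Bucket, path and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.read
    .option("multiLine", "true")               # handle pretty-printed JSON files
    .json("gs://my-bucket/raw/orders/*.json")  # schema is inferred automatically
)

(
    raw.write
    .format("bigquery")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("append")
    .save("my_project.raw_layer.orders")
)
```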

Real-time: In my team, we have started using pySpark for simple real-time use cases (raw data ingestion into BigQuery), but I don’t know the subject well enough to compare it with other alternatives (such as lambda functions). It’s worth noting that, thanks to its micro-batch mode, Spark makes it quite easy to get exactly-once guarantees.
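
For reference, here is a minimal Structured Streaming sketch of that kind of ingestion, assuming a Kafka source (which requires the spark-sql-kafka package) and the spark-bigquery-connector; topic, bucket and table names are placeholders. The checkpoint location is what lets Spark keep track of which micro-batches have already been processed and recover cleanly after a failure:

```python
# Minimal Structured Streaming sketch: Kafka -> BigQuery in micro-batches.
# Topic, bucket and table names are placeholders.
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "raw_events")
    .load()
    .select(f.col("value").cast("string").alias("payload"), "timestamp")
)

def write_batch(batch_df, batch_id):
    # Each micro-batch is written with the regular batch connector.
    (
        batch_df.write
        .format("bigquery")
        .option("temporaryGcsBucket", "my-temp-bucket")
        .mode("append")
        .save("my_project.raw_layer.events")
    )

query = (
    events.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "gs://my-bucket/checkpoints/raw_events")
    .trigger(processingTime="1 minute")  # micro-batch mode
    .start()
)
query.awaitTermination()
```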

Conclusion

To sum up in a few words, I think most of these considerations revolve around a central point, which is SQL’s greatest strength and also its greatest weakness: its simplicity. It is thanks to SQL’s simplicity that platforms like BigQuery and Snowflake are so easy to use and so widely adopted, while also reducing the risk of vendor lock-in. Conversely, this simplicity is also its greatest downside, as the limits of SQL are quickly reached, and development best practices are harder to apply and less widespread. Thanks to dbt and Jinja, some of these downsides can be mitigated, but I believe the industry will have to go further, with DataFrames or other similar APIs, to help data practitioners write more generic and advanced transformations and better address the ever-growing need for data.

In my team, one of the main challenges is that we tend to alternate between pySpark and BigQuery to benefit from what each tool does best. This makes data lineage more difficult, since dbt only lets us visualize the BigQuery part, while our internal “dbt for pySpark” tool only lets us see the pySpark part. In the long term, I hope that BigQuery will add the features it currently lacks compared to pySpark (a DataFrame API and Python UDFs), which would allow us to one day migrate our current pySpark transformations to BigQuery. The only things left on pySpark’s side would be E/L and real-time data processing.

(A French version of this post is available here)


[Available for freelance work] Data Engineer, Data Plumber, Data Librarian, Data Smithy.