The limits of SQL in data science, analytics, and engineering

SQL has always seemed like an elegant way to implement simple data analysis and ETL, but it tends to break down beyond a certain level of complexity. Might Spark, DataFrames, and Scala (SDS) be a good solution for more complex work with data, even for non-engineers?

Matt Hagy
Towards Data Science

--

I truly want to do more with SQL. It seems like such a simple and elegant way to represent data joins, filtering, and basic transformations in a concise format. In many ways, I feel SQL is closer to a mathematical way of defining computation (i.e., a functional, LISPy, declarative approach) than conventional imperative code, which is a plus for SQL in my opinion. I particularly treasure short SQL queries that can quickly answer random ad hoc questions such as the following.
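
A purely illustrative sketch of the kind of question I mean (the table and column names are invented, and I run it through Spark SQL here only for concreteness; the same query works on any warehouse):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("adhoc-example").getOrCreate()

// How many distinct users placed an order in the last week?
// Assumes a hypothetical `orders` table is registered in the catalog.
val weeklyBuyers = spark.sql("""
  SELECT COUNT(DISTINCT user_id) AS n_buyers
  FROM orders
  WHERE order_date >= date_sub(current_date(), 7)
""")

weeklyBuyers.show()
```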

Further, I find that many people who don’t identify as highly technical can learn basic SQL quickly and, if necessary, master it well enough to perform serious data analysis and transformations. As a data scientist and engineer, I find it great to be able to work with a larger set of stakeholders in thinking through the business logic we want to implement in computing stats or processing data. We simply access more people’s critiques and ideas when non-engineers can work with the logic.

But I find SQL inadequate once the complexity grows beyond about a dozen lines of code; at that point I start worrying about correctness and legibility. I’d much rather implement the transformation in Spark with DataFrames using Scala (or, if necessary, PySpark using Python). I can then break my logic down into simple, well-named functions and rigorously unit test each function. Simple integration tests can verify that everything is wired together correctly. All around, I’m simply more confident in the correctness of a Scala solution and feel more capable of refining and maintaining the code in the future.
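
As a rough sketch of what that looks like (all names are invented), each step becomes a small, named function from DataFrame to DataFrame, and each one can be exercised on a handful of in-memory rows in a unit test:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object OrderStats {

  // Keep only completed orders
  def filterCompleted(orders: DataFrame): DataFrame =
    orders.filter(col("status") === "completed")

  // Total revenue per user
  def revenuePerUser(orders: DataFrame): DataFrame =
    orders.groupBy("user_id").agg(sum("amount").as("total_revenue"))

  // The full job is just the composition of the named pieces
  def run(orders: DataFrame): DataFrame =
    revenuePerUser(filterCompleted(orders))
}
```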

In many ways, I wish it didn’t have to be that way with SQL. I’d really like a great way to extract arbitrary SQL expressions into functions, and further a great environment for testing SQL. I have yet to find anything that works for a range of SQL dialects and is also open source. Let me know if there is a good tool I should be looking into.

I further find SQL to be a difficult language for expressing certain complex logic, for example, parsing JSON and then extracting and transforming specific fields. I know there exist solutions for loading nested JSON into SQL engines for subsequent processing, but I’d rather write a Spark workflow that directly reads the raw JSON data and performs the parsing and transformations in Scala.
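
Something along these lines, with a hypothetical path and field names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("json-example").getOrCreate()

// Read raw, nested JSON directly from storage (hypothetical path)
val events = spark.read.json("s3://my-bucket/raw/events/")

// Pull out and transform just the fields we care about
val cleaned = events
  .select(
    col("user.id").as("user_id"),
    col("event.type").as("event_type"),
    to_timestamp(col("event.timestamp")).as("event_time")
  )
  .filter(col("user_id").isNotNull)
```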

Further, I’ve found myself having to implement complex business logic for various analyses and transformations (particularly for feature engineering) that just couldn’t be written in SQL, or at least wasn’t straightforward to implement in a conventional SQL approach. Sometimes I’ve found hacky solutions, but they don’t seem like a good representation of the logic we want to develop and maintain. Here’s an example of the kind of Spark workflow that I’d struggle to implement elegantly in SQL.
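
The sketch below is illustrative rather than drawn from a real system: it sessionizes each user’s events, starting a new session whenever the gap between consecutive events exceeds 30 minutes. The names and the gap rule are made up, but it captures the kind of per-user, order-dependent logic I mean.

```scala
import org.apache.spark.sql.Dataset

case class Event(userId: Long, timestampSeconds: Long)
case class SessionizedEvent(userId: Long, timestampSeconds: Long, sessionId: Int)

// Walk each user's events in time order and assign session ids based on gaps
def sessionize(events: Dataset[Event], maxGapSeconds: Long = 30 * 60): Dataset[SessionizedEvent] = {
  import events.sparkSession.implicits._
  events
    .groupByKey(_.userId)
    .flatMapGroups { (userId, userEvents) =>
      val sorted = userEvents.toSeq.sortBy(_.timestampSeconds)
      var sessionId = 0
      var lastTs = Long.MinValue
      sorted.map { e =>
        if (lastTs != Long.MinValue && e.timestampSeconds - lastTs > maxGapSeconds) {
          sessionId += 1
        }
        lastTs = e.timestampSeconds
        SessionizedEvent(userId, e.timestampSeconds, sessionId)
      }
    }
}
```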

Can you think of a good way to implement this logic in SQL?

Traditionally, user-defined functions (UDFs) would be developed in a conventional programming language to extend SQL capabilities. I find that solution inelegant in that we’re now mixing two languages. Further, there are limits to what can be implemented in UDFs (and user-defined aggregates, UDAs). Nowadays, I think it’s just better to use (Py)Spark and define the custom logic alongside the data processing definitions.
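
Here’s a sketch of what I mean, with invented names: the business rule is an ordinary Scala function, tested like any other code and applied in a typed map rather than registered as a UDF in a separate language.

```scala
import org.apache.spark.sql.Dataset

case class Purchase(userId: Long, amount: Double, isFirstPurchase: Boolean)
case class ScoredPurchase(userId: Long, score: Double)

// Ordinary function: trivially testable on its own
def purchaseScore(p: Purchase): Double =
  if (p.isFirstPurchase) p.amount * 2.0 else p.amount

// The same function applied directly within the data processing definition
def scorePurchases(purchases: Dataset[Purchase]): Dataset[ScoredPurchase] = {
  import purchases.sparkSession.implicits._
  purchases.map(p => ScoredPurchase(p.userId, purchaseScore(p)))
}
```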

Lastly, I really like types in any programming language. I think types convey so much information not only to the compiler but also to people reading the code. I find well-named functions, objects, and members, with type annotations, provide a great level of self-documentation in code. Combine that with some small comments and concise external documentation, and you end up with highly legible code that anyone familiar with the language can learn and work with.

Scala has an excellent type system, and even modern Python has optional type annotations. While most SQL dialects support some level of typing in stored procedures and functions, I’ve found that support limited to basic types, so it can’t be used to the same extent as type annotations in a robust programming language.
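
A tiny, hypothetical example of the self-documentation I mean: the case class and the signature say exactly what goes in and what comes out, and the compiler checks it.

```scala
case class User(id: Long, country: String, signupEpochSeconds: Long)

// The signature alone tells the reader what the function expects and returns
def accountAgeDays(user: User, nowEpochSeconds: Long): Long =
  (nowEpochSeconds - user.signupEpochSeconds) / 86400
```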

I just don’t see a path forward to getting SQL to the same quality as a Spark, DataFrame, and Scala solution. Yet, I recognize the need for data processing logic to be legible and modifiable to more than just engineers. I hypothesize that Spark, DataFrames, and basic Scala (SDS) could be used to that end.

I’m not proposing that non-engineers be able to read, reason about, and modify a full production Scala engineering system, but instead that they can work with small code snippets at a complexity level comparable to SQL. Further, DataFrame operations can build on their existing understanding of SQL operations such as joins, filtering, and mapping transformations. They need only learn the new syntax for these familiar operations. Over time, they can learn more Scala to implement more complex logic when their work requires it.
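
For instance, a familiar join-filter-aggregate query maps almost one-for-one onto DataFrame operations (the tables and columns below are made up):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// The SQL
//
//   SELECT u.country, COUNT(*) AS n_orders
//   FROM orders o
//   JOIN users u ON o.user_id = u.id
//   WHERE o.status = 'completed'
//   GROUP BY u.country
//
// reads operation for operation like the DataFrame version:
def ordersByCountry(orders: DataFrame, users: DataFrame): DataFrame =
  orders
    .filter(col("status") === "completed")
    .join(users, orders("user_id") === users("id"))
    .groupBy(users("country"))
    .agg(count(lit(1)).as("n_orders"))
```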

To further facilitate this education, I’d like to develop a small tutorial on translating SQL concepts to the SDS stack. Further, to really sell the materials, I’d also like to develop a cool SQL->SDS transpiler. People could use it to see how their current SQL logic translates into this other tech stack. Combined with a snazzy web app, people could immediately start learning the equivalent SDS code for their work. This would allow people familiar with SQL to start learning the specific SDS knowledge they need to start solving problems.

Such education synergizes with the development of cloud solutions for hosting Spark (e.g., AWS EMR, GCP Dataproc, Databricks, Cloudera Altus Cloud), so that more people can now easily access a Spark environment. Over time, maybe we could even get AWS and GCP to create ad hoc managed Spark environments comparable to the SQL solutions of Athena and BigQuery. Overall, I anticipate that more and more professionals are going to have access to Spark environments when they want them.

I could even see data scientists transitioning to more complicated data engineering work when they need to build something to facilitate their data science work. Similarly, product managers may find themselves coding up complex business logic in Scala because it’s simply harder to write up the logic in English. If we teach these SQL-to-Scala converts how to modularize code and write tests, we may soon find ourselves with an army of situational engineers that can implement parts of systems important to their work.

What do you think? Do you believe Spark/DataFrames/Scala (SDS) can replace SQL for a growing number of non-engineers? Could it even help us all get more engineering work done, which all data companies are constantly looking to do? Let me know your thoughts in the comment section below because I’d like to collect feedback on this idea before investing too much time in helping SQL aficionados transition to SDS.

If you’re interested in collaborating on SQL-to-SDS tutorials and/or the transpiler, drop me a message at matthew.hagy@gmail.com. Also, feel free to beat me to it! I’ll gladly help develop and promote however I can.

Update: If you’ve found the Scala examples in this article interesting and would like to learn more about this powerful programming language, you can check out my article, Quickly learning the basics of Scala through Structure and Interpretation of Computer Programs examples (Part 1).

Software Engineer and fmr. Data Scientist and Manager. Ph.D. in Computational Statistical Chemistry. (matthagy.com)