Falling in Love with SQL Again

By harnessing the power of Common Table Expressions

Jeremy Dorn
Towards Data Science

--

Photo by Jon Tyson on Unsplash

When I first learned SQL over 15 years ago, I was amazed at how useful and powerful it could be. I was in love. But over the years, as I began writing more complex queries, my code started to become an unmaintainable jumble of JOINs and nested subqueries. It got to the point where if I needed to edit a query I wrote just a few weeks earlier, my eyes would glaze over, I would give up trying to understand it, and I would just write a new version from scratch instead. SQL was losing its charm.

Then, a few years ago something amazing happened. I stumbled upon example SQL code on a random blog that was using this feature called Common Table Expressions. The SQL was long and complex, yet easy to read and understand. I was instantly hooked. Since then, I’ve completely changed the way I think about and write SQL.

What are Common Table Expressions?

Common Table Expressions, or CTEs, are basically a way to list out one or more SELECT expressions at the top of your query and reference them later by name, just like you would a table. They are conceptually similar to creating temporary tables, but much easier to use and with lower overhead.

WITH 
cte1 as (SELECT …),
cte2 as (SELECT …)
SELECT * FROM cte1 JOIN cte2

It’s hard to fully appreciate CTEs in a contrived example like this, so let’s walk through a more real-life use case.

Student Test Scores Example

Let’s say I want to find the average top Math test score for students in California. Thinking about the problem, I basically need to do the following steps:

  1. Get a subset of students (California)
  2. Get a subset of test scores (Math)
  3. Join them together to get all Math test scores from California students
  4. Get the top score per student
  5. Take the overall average

To make this a little more interesting, I’m going to assume that we don’t store a student’s geo location directly and instead must join to their school in order to figure it out.

I’ll go through both my old approach to solving this problem (before I learned about CTEs) and my new and improved approach.

Old Approach (Without CTEs)

I’m going to start by getting all students in California. Like I mentioned above, we need to join the student to their school in order to figure out their location:

SELECT students.id 
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)

Then, I can join in the Math test results:

SELECT students.id, test_results.score 
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)
JOIN test_results ON (
students.id = test_results.student_id
AND test_results.subject = 'math'
)

Note that we could have done the reverse — started with Math test results and joined California students. Both are equally valid. At this point, I have all Math scores for all California students. To select just the top score for each student, I can add a MAX and GROUP BY:

SELECT students.id, MAX(test_results.score) as score
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)
JOIN test_results ON (
students.id = test_results.student_id
AND test_results.subject = 'math'
)
GROUP BY students.id

Now I have one row per student with their best math score. Lastly, I would wrap this in an outer SELECT to get the overall average:

SELECT AVG(score) FROM (
SELECT students.id, MAX(test_results.score) as score
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)
JOIN test_results ON (
students.id = test_results.student_id
AND test_results.subject = 'math'
)
GROUP BY students.id
) tmp

And we’re done! The end result isn’t too bad, but it’s still a little intimidating to read and understand with no context. Adding more nesting levels, joins, and conditions only makes things worse.

One interesting thing to note is that each step tends to add code to both the beginning and end of the SQL — in other words, the logic flows from the inside out in layers.

New Approach (Using CTEs)

I would start the same way, by selecting all students in California. To use Common Table Expressions, I give this SELECT statement a name and add it to a WITH clause at the top of the query:

WITH
student_subset as (

SELECT students.id
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)
),

Now I can do the same thing for test scores. Unlike my old approach, I’m not worrying about joins at this point. I’m just selecting an independent data set of all Math test scores for all students.

WITH
student_subset as (
SELECT students.id
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)
),
score_subset as (
SELECT student_id, score
FROM test_results
WHERE subject = 'math'
),

After getting these two independent datasets, I can join them together using another CTE:

WITH
student_subset as (
SELECT students.id
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)
),
score_subset as (
SELECT student_id, score
FROM test_results
WHERE subject = 'math'
),
student_scores as (
SELECT student_subset.id, score_subset.score
FROM student_subset
JOIN score_subset ON (
student_subset.id = score_subset.student_id
)
),

Now I need to limit this to one row per student with just their top score, which again I can do by adding an additional CTE:

WITH
student_subset as (
SELECT students.id
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)
),
score_subset as (
SELECT student_id, score
FROM test_results
WHERE subject = 'math'
),
student_scores as (
SELECT student_subset.id, score_subset.score
FROM student_subset
JOIN score_subset ON (
student_subset.id = score_subset.student_id
)
),
top_score_per_student as (
SELECT id, MAX(score) as score
FROM student_scores
GROUP BY id
)

Lastly, I take the overall average. Since this is the final step, it is done as the main query, outside of the WITH clause:

WITH
student_subset as (
SELECT students.id
FROM students
JOIN schools ON (
students.school_id = schools.id AND schools.state = 'CA'
)
),
score_subset as (
SELECT student_id, score
FROM test_results
WHERE subject = 'math'
),
student_scores as (
SELECT student_subset.id, score_subset.score
FROM student_subset
JOIN score_subset ON (
student_subset.id = score_subset.student_id
)
),
top_score_per_student as (
SELECT id, MAX(score) as score
FROM student_scores
GROUP BY id
)
SELECT AVG(score)
FROM top_score_per_student

And we’re done! The end result is much more verbose (27 lines vs 12 lines), but makes up for it in many ways.

Benefits of CTEs

Logical Flow —With CTEs, the logic flows from top to bottom instead of from the inside out. This is a more natural way to think about and write SQL, although it takes some getting used to.

Readability — The logical flow means you can start reading a query that uses CTEs from the top and immediately begin to understand it. With the more traditional inside-out flow, it can take a frustrating amount of time just to figure out where to start reading. And once you start reading, you will constantly have to jump around to figure out what’s going on. This becomes more and more apparent as the queries get increasingly complex.

Documentation — Another side effect of splitting the query into linear steps is that it gives you a natural place to add documentation. In addition to the CTE names (e.g. “top_score_per_student”), it’s also easy to add inline SQL comments between steps. Doing this without the forced linear structure that CTEs provide is difficult and messy.

Easy Debugging — When the result doesn’t look right, you can easily walk through the steps from top to bottom, verifying the data along the way, to identify exactly where a problem occurs. Without CTEs, you’ll often find yourself butchering the SQL so badly while investigating a bug, that when you finally find it, it can take just as long to unwind everything and get back to your original query. The worst part is that sometimes your fix works on the stripped down butchered SQL, but not on your original query for some reason and you are back to square one.

Re-usability —Multiple queries often share some common pieces of business logic. For example, using the same definition of “Students in California”, but looking at attendance records instead of test scores. With CTEs, each of these definitions is a self-contained SELECT expression and can easily be copy/pasted between queries without modification.

Performance — You might assume using CTEs would result in slower queries, but that’s not actually true most of the time. SQL query engines are really smart and will generally compile both approaches into the same optimized execution path. In a roundabout way, CTEs can actually improve performance since they help you prioritize optimization efforts to focus on the most widely used expressions. For example, if you are constantly using expressions which join students to schools to get geo location, maybe it’s worth creating a new materialized view to get rid of the join.

Parting Notes

I couldn’t find that blog post from years ago that first introduced me to Common Table Expressions, but I’m forever grateful that it helped rekindle my love for SQL. Hopefully this post can do the same for someone else out there.

--

--

Jeremy is the Co-Founder of Growth Book, an open source feature flagging and A/B testing platform.