Writing good SQL

Further structuring the query language by adapting layers

Holger Thoese
Towards Data Science

--

Scientist on doctoring data, maybe SQL?
Photo by National Cancer Institute on Unsplash

Do you want to write good SQL? Sure, but what does “good” mean actually?

In certain real time surroundings only performance counts as “good” and you measure your execution time in milliseconds. In business intelligence and data warehouse environments performance is still important, but often maintainability is even more valuable. It might be a better solution, to produce readable and maintainable SQL code, even if it burns cloud dollars every month. Keep in mind the costs for hours to understand a complex query, the cost that potential errors produce and the hours to fix them. And depending on your database engine optimizer, you often get both. So, in this article we focus on readability and maintainability.

I assume you already know the basic and more advanced elements of SQL: joining, grouping, subqueries or analytic functions. Combining these elements, you can quickly generate a big mess. But what can we do, to keep even large queries with hundreds of lines and 20 or more tables readable?

The very basics

When we say readability, we have to talk on style. There is an excellent SQL style guide from Simon Holywell, take a look at it. But if you’re in a hurry, you get very far with these three advices:

  • Use uppercase for all reserved keywords, like ´SELECT, FROM or WHERE and lowercase for columns, tables and symbolic names
  • Create a gap, by aligning the keywords to the right-hand side and the values to the left-hand
SELECT first_name, last_name, birth_date
FROM emp.employees
WHERE gender = ‘F’;
  • Use comments, whenever a query gets more complicated than a simple SELECT/FROM/WHERE statement
 /* find highest actual salary for each employee */
SELECT first_name, last_name, MAX(salary)
FROM emp.employees e
JOIN emp.salaries s
ON e.emp_no = s.emp_no
WHERE current_date BETWEEN from_date AND to_date -- filter date
GROUP BY first_name, last_name;

With these basic rules, you’re SQL will easily reach at least a moderate readability level.
As your SQL gets larger and more complex you will reach certain limits, where you need more structure than the style guide provides.

Subqueries, or where the pain begins

You know subqueries? Subqueries are great to encapsulate problems within a block of your query. Basically there are three types of subqueries in SELECT statements:

  • In the FROM clause, as a derived table with an alias name
SELECT COUNT(*)
FROM (SELECT gender
FROM emp.employees) e
  • Within the WHERE clause as a qualifier
SELECT *
FROM emp.employees e
WHERE 0 < (SELECT COUNT(*)
FROM emp.salaries s
WHERE s.emp_no=e.emp_no
AND salary > 3000)

Please note that this is also a correlated subquery, which should get used carefully (see https://en.wikipedia.org/wiki/Correlated_subquery).

  • Scalar subqueries (also a correlated subquery)
SELECT last_name, (SELECT MAX(salary)
FROM emp.salaries
WHERE emp_no=e.emp_no)
FROM emp.employees e

So where is the problem?

Subqueries can have more subqueries in any depth. And when you start with subqueries, the probability is high that more subqueries will follow over time. Other people will follow your example. Every subquery needs to get indented and the lines get longer and longer. This is a mayor issue on readability in more complex statements.

And even more important, every subquery splits the surrounding query in code “before” and “after”.

Example for a subquery in the middle of a statement
Subquery in the middle of the query

One of the golden rules for coding in procedural times were:

Don’t write blocks larger than your screen

(and a screen had 25 lines). But with a subquery in the middle, you can end with hundreds of lines between your columns from the SELECT block and your WHERE clause.

This leads to an interesting way of reading SQL queries. Typically, you start somewhere in the middle and slowly work yourself to the edges.

Example for the reading direction with subqueries
Reading direction with subqueries

This “inside-out” direction is in contrast to our normal reading direction from top to bottom, which most of us are comfortable with.

The power of WITH

Some of us write SQL for years and never used the WITH clause. Give it a try, it’s worth it!

In the WITH clause, you can give your subqueries a symbolic name, just like in a FROM clause, while moving it to the beginning of your statement.

WITH t_avg_salary AS (
--averagy salary for every employee
SELECT emp_no, AVG(salary) avg_salary
FROM emp.salaries
GROUP BY emp_no
)
SELECT e.last_name, avg_salary
FROM emp.employees e
JOIN t_avg_salary
ON e.emp_no = t_avg_salary.emp_no
WHERE avg_salary < 3000;

And you can construct a “pipe” of subqueries, which build up on each other

WITH t_avg_salary AS (
-- average salary for every employee
SELECT emp_no, AVG(salary) avg_salary
FROM emp.salaries
GROUP BY emp_no
),
t_avg_salary_3k AS (
-- limit to avg. salary smaller 3k
SELECT *
FROM t_avg_salary
WHERE avg_salary < 3000
),
t_employee AS (
-- Join employee and salary
SELECT e.last_name, avg_salary
FROM emp.employees e
JOIN t_avg_salary_3k
ON e.emp_no = t_avg_salary_3k.emp_no
)
SELECT *
FROM t_employee;

This leads to the following advances:

  • As each subquery is a block in itself, you have the functionality separated from each other
  • Every subquery can get an own headline comment, which clearly improves understandability
  • You can read your code top-down, which is great for readability
  • Each subquery starts at the beginning of the line with a fixed indent. So you have much fewer problems to follow your teams coding style guide, which usually limit line length to 80 characters
  • Adding functionality (or further subqueries) to the query does not force you to reformat the whole query. This leads to small nice changes in your version control system. E.g. for removing the average salary limit:
7,12d6
< t_avg_salary_3k AS (
< -- limit to avg. salary smaller 3k
< SELECT *
< FROM t_avg_salary
< WHERE avg_salary < 3000
< ),
17c11
< JOIN t_avg_salary_3k
---
> JOIN t_avg_salary

Keep your subquery blocks small, with a strong focus on a single functionality in mind. Don’t mix functionality, e.g. joining, filtering, grouping, mapping.

And last but not least, make the final block dead simple, with only SELECT and FROM. Think of it as a layer model. The first blocks are for data acquisition, the last block is only the output. If you put more functionality in it, you’ll end up as you started.

Example for layering with multiple subqueries
Layering with multiple subqueries

Bonus: Build feature toggles in SQL

When using SQL within a larger system, there is often variable substitution involved. If you try to switch between two functionalities within SQL, you quickly end up with ugly code.

Using WITH, you can easily put your database names, table names or different config parameters into a config section. And, you can use it to toggle block usage on and off.

Think you have a stream of subqueries as discussed before. One of them implements a limit on “salary < 3000” which should only be active in a later version of your code.

Replace the join table from the last subquery with a variable “t_avg_salary” (here in python f-string format):

WITH

t_emp_join AS (
-- Join employee and salary
SELECT last_name, avg_salary
FROM t_employee e
JOIN t_avg_salary_3k s
ON e.emp_no = s.emp_no
)
SELECT *
FROM t_employee;

The possible values for “t_avg_salary” are now “t_avg_salary” and “t_avg_salary_3k”.

With the change of your “t_avg_salary” toggle variable, you can change the output of your SQL, without changing the code. Whether you create a dummy derived table, or simply use the following derived table is up to you.

The subquery in between is simply skipped.

--

--