The world’s leading publication for data science, AI, and ML professionals.

10 SQL standards to make your code more readable in 2021

Make a NY resolution for 2021: Easy to read, and maintain SQL

Image licensed to author
Image licensed to author

This is my 20th year working with Sql (eek!), and I’ve shared my 10 key learnings to help make sure your SQL code is easy to read, debug, and maintain.

The key to success is, of course, to enforce these (or your own standards) across your Enterprise. Tip 10 discusses how you can do this.

So, in no particular order, let’s begin.


1 Choose a CASE and Stick to_it

If I was given £1 every time I saw something like this, I think I’d be sitting on a tidy sum:

select first employee_first_name,
       surname employee_last_name,
       title,
       CASE WHEN employment = 1 THEN 'FT' WHEN employment = 2 THEN 'PT' ELSE 'T' END 
       as EmploymentStatus,
       'Y' AS isValid
       ,"HR" Employee-source
from employees
WHERE Valid = 1

SQL keywords lowercase

First, choose a case for your SQL keywords. Historically, these tended to be uppercase, but I have seen a gradual movement to lowercase, and I actually prefer this. Having keywords in lowercase is quicker to type, and easier to read.

Camel, Pascal, Snake, or Kebab?

No, it’s not an exotic lunch menu! And yes, kebab-case is a thing!

Decide on your case for naming tables, columns, variables, etc. Your options are:

camelCase (all words following the first word have title case, no separator)

snake_case (all lowercase, _ separator)

PascalCase (all words in title case, no separator)

kebab-case (all lowercase, -separator)

The choice is very much a personal one here, and you may risk decent in the ranks if you claim one is better than the other. But, consistency is key.

Personally, I like snake_case.

select first employee_first_name,
       surname employee_last_name,
       title,
       case when employment = 1 then 'FT' 
       when employment = 2 then 'PT' 
       else 'T' end as employment_status,
       'Y' as is_valid
       ,"HR" employee_source
from employees
where valid = 1

2 commas in the select clause

This is another thing I often see, and causes me some internal distress (last line):

select first employee_first_name,
       surname employee_last_name,
       title,
       case when employment = 1 then 'FT' 
       when employment = 2 then 'PT' 
       else 'T' end as employment_status,
       'Y' as is_valid
       ,"HR" employee_source

Agree on a standard. Personally I find commas at the end of the line easier to read. However, the "leading commas camp" may argue it makes shifting lines up and down easier (moving the last line up means adding a comma to the end).

Now my internal peace is restored (at least on the last line):

select first employee_first_name,
       surname employee_last_name,
       title,
       case when employment = 1 then 'FT' 
       when employment = 2 then 'PT' 
       else 'T' end as employment_status,
       'Y' as is_valid,
       "HR" employee_source

#3 Single or double quotes for strings

Again, make this a standard and stick to it. I prefer single quotes, I feel they are easier to read:

select first employee_first_name,
       surname employee_last_name,
       title,
       case when employment = 1 then 'FT' 
       when employment = 2 then 'PT' 
       else 'T' end as employment_status,
       'Y' as is_valid,
       'HR' employee_source

4 Column alias formatting

Make a standard for formatting your select clause – this may sound trivial, but trust me, once your codebase grows you will be thankful you did it.

Here are my personal formatting rules I enforce:

Use the as keyword for column aliases

So I prefer this:

select first as employee_first_name,

To this:

select surname employee_first_name,

2. Line up column names

This one I feel really strongly about, and it makes SQL so much easier to read.

Instead of this:

select first employee_first_name,
       surname employee_last_name,
       title,
       case when employment = 1 then 'FT' 
       when employment = 2 then 'PT' 
       else 'T' end as employment_status,
       'Y' as is_valid,
       'HR' employee_source

Line up aliases, like this:

select first      as employee_first_name,
       surname    as employee_last_name,
       title,
       case 
           when employment = 1 then 'FT' 
           when employment = 2 then 'PT'
           else 'T' 
       end        as employment_status,
       'Y'        as is_valid,
       'HR'       as employee_source

Astute readers will note there is no alias for title. My reasoning here, is I prefer to only alias a column if the name is changing.

Also, note the formatting of the case statement. I prefer this layout, as it allows the column alias to line up. Also, nested case statements are much easier to read.

3. Break up case statements

Sometimes, if case statements are complex, or follow each other like this, I put line breaks in. I think it just helps with readability:

select first      as employee_first_name,
       surname    as employee_last_name,
       title,

       case 
           when employment = 1 then 'FT' 
           when employment = 2 then 'PT'
           else 'T' 
       end        as employment_status,

       case 
           when foo= 1 then 'foo1' 
           when foo= 2 then 'foo2'
           else 'foo3' 
       end        as foo,
       'Y'        as is_valid,
       'HR'       as employee_source

#5 Column naming standards

This is potentially an article in itself, but for column names, as well as consistent case and layout, I also use these standards:

Binary columns

I prefix with is_ Here is our is_valid in our example:

'Y'        as is_valid,

I also prefer 1 and 0 to Y and N. It makes aggregation a lot easier.

So use this:

1        as is_valid,

Instead of this:

'Y'        as is_valid,

Date and times

To help consumers know a column is a date, datetime, or timestamp, I suffix columns like this:

select   date_hired       as hire_date,
         last_modified    as last_modified_dtime,
         timestamp_foo    as foo_time

For timezones (which can cause a lot of confusion), I tend to suffix with _utc to signify they are UTC time.

select   date_hired       as hire_date,
         last_modified    as last_modified_dtime_utc,
         timestamp_foo    as foo_time_utc,
         timestamp_local  as foo_time

Counts, currencies

These I like to suffix with _count and _amount, again just to help consumers. With these standards, names then should be very clear:

select   salary           as salary_amount,
         0                as is_salary,
         date_hired       as hire_date,
         months_active    as months_active_count,

6 Table aliases

Another rule I feel strongly about is the consistent use of table aliases. We want to avoid this:

select employee_id,
       first        as employee_first_name,
       surname      as employee_last_name,
       title,

       case 
           when employment = 1 then 'FT' 
           when employment = 2 then 'PT'
           else 'T' 
       end          as employment_status,
       'Y'          as is_valid,
       'HR'         as employee_source,
       salary       as salary_amount,
       0            as is_salary,
       date_hired   as hire_date
from employee e
  left outer join salary s
  on e.employee_id = s.emp_id
where valid = 1
and valid_rec = 1

Note using table aliases (the ‘e’ and ‘s’ in bold) when referencing columns, makes it difficult to know which tables columns are from. For example:

select employee_id,
...
where valid = 1
and valid_rec = 1

Are these referencing the employee or salary tables? The same applies, in fact, to all columns in the select clause.

Also, note the lack of the as keyword in the table alias:

from employee e
  left outer join salary s

Enforcing table alias standards makes this SQL statement a lot clearer to read and easier to modify if needed:

select e.employee_id,
       e.first        as employee_first_name,
       e.surname      as employee_last_name,
       e.title,

       case 
           when e.employment = 1 then 'FT' 
           when e.employment = 2 then 'PT'
           else 'T' 
       end            as employment_status,
       'Y'            as is_valid,
       'HR'           as employee_source,
       s.salary       as salary_amount,
       0              as is_salary,
       s.date_hired   as hire_date
from employee as e
  left outer join salary as s
  on e.employee_id = s.emp_id
where e.valid = 1
and s.valid_rec = 1

7 Use CTEs, or temp tables to break up complexity

Try to avoid overly large, single blocks of SQL. Instead, use CTEs (or temp tables if these are not available on your database platform) to divide up complexity into smaller blocks. This is good practice because:

  1. SQL logic is easier to read
  2. Changes can be made in a more modulised way
  3. Avoids repeating code
  4. When debugging, you can output CTEs to mateliarised tables, helping you pinpoint issues to a particular place

CTE, temp table tips

To format CTEs, this is nice and readable:

lunar as
(
   select full_moon_date,
          is_total_eclipse,
          is_partial_eclipse,
          region,
          is_first_full_moon_on_or_after_march_21
   from dim_lunar_cycle
)
select ... 
from   lunar l

select * (star)

Avoid select (this applies to all code). Instead, you should list all columns explicitly. Select makes code hard to maintain (you have to know the structure of the source tables) and also opens up the risk of breakages, should additional columns be added to source tables.

So, using our lunary CTE as an example, instead of this:

lunar as
(
   select full_moon_date,
          is_total_eclipse,
          is_partial_eclipse,
          region,
          is_first_full_moon_on_or_after_march_21
   from dim_lunar_cycle
)
select l.*,
       d.*
from   dim_date d 
   left outer join lunar l
   on l.full_moon_date = d.date

Specify column names instead:

lunar as
(
   select full_moon_date,
          is_total_eclipse,
          is_partial_eclipse,
          region,
          is_first_full_moon_on_or_after_march_21
   from dim_lunar_cycle
)
select l.full_moon_date,
       l.is_total_eclipse,
       l.is_partial_eclipse,
       l.region,
       l.is_first_full_moon_on_or_after_march_21
       d.day_name,
       d.month_name,
       ...
from   dim_date d 
   left outer join lunar l
   on l.full_moon_date = d.date

8 Plurality of table names

Again, another contentious topic here, and I don’t think there is a right or wrong answer; although many data modelling purists will tell otherwise! (hint, singular).

But again, make sure you agree on a standard and stick to it.

Either opt for this:

select count(1) as employee_count 
from employee

Or this:

select count(1) as employee_count 
from employees

But, don’t mix and match!

9 Table naming

Our penultimate tip – make sure you bring consistency into your table names. For example, when working with dimensional data models, I always prefix dimensions with dim and facts with fact. Anything meta related, is suffixed with meta_ and so on…

select ...
from dim_date                as dim_date,
     dim_customer            as dim_cust,
     fact_order              as fact_order,
     fact_snapshot_customer  as fact_ss_cust,
     meta_batch              as meta_batch
where ...

Poor table naming can really hurt the organisation of data, and make it very difficult for people to find.

Don’t be afraid to use verbose table names

And this applies to column names too. In fact, longer names have a real advantage in that it can self-document your SQL code.

Let’s take the lunar example again.

I used this:

select full_moon_date,
       is_total_eclipse,
       is_partial_eclipse,
       region,
       is_first_full_moon_on_or_after_march_21
from lunar

Instead of this:

select full_moon_date,
       is_total_eclipse,
       is_partial_eclipse,
       region,
       is_first_full_moon
from lunar

Some would fall off their chair at seeing such a long column name. But, I would argue this is very effective at documenting exactly what is going on.

Now, of course, should this logic change – "Houston, we have a problem…".

So apply some judgment with this rule. In my case, lunar cycles are going to be with us for a while.

10 Making it happen

The SQL standards document

The first step is to review and document with your team, a set of SQL standards. This document should form the baseline for all future SQL development. Note this should cover formatting and naming, but not necessarily best practices (e.g. avoid views on views) – these are often better described in a separate document.

But what about my existing SQL?

Many of you will have an existing SQL code base that of course, won’t adhere to your new standards. That’s ok. You may wish to completely leave this alone, or, perhaps if at any point a SQL script is modified, perhaps part of that change involves ensuring it applies to standards.

Table and column naming is more difficult to change retrospectively, another reason to put these SQL standards in early. You could consider a layer of views that apply renaming – so at least consumers have a consistent experience.

How do we ensure developers follow standards?

This is the relatively easy part. You should already be doing peer code reviews, so make sure reviewers are well aware of your SQL standards (and best practices) documents. Have this part of the review.

You may consider automating the process – for example having scripts that interrogate database metatables and verify your standards. For example, I remember during my years at Goldman Sachs, they did exactly that. Hunting out, and alerting on, select * occurrences, as an example.

Another option (again Goldman used this) was in data modelling tools such as Erwin, or IBM DataSphere, write macros to enforce your standards. These can easily check for things like column formatting, index naming, etc. The aim is to catch it early before it ever even hits a database.


Conclusions

I have presented 9 SQL standards that you may want to adopt, or should at least consider, together with a final tip for how to put them into place.

Of course, I have merely presented my own preferences in this article, and your standards will no doubt be different. They may extend to other areas (standard comment headers in views, for example), or fewer areas should you wish to be less prescriptive. But it should lead to a healthy debate, at the very least, and hopefully a set of standards for your team or organisation.

I can’t emphasise how much easier database code is to maintain when these standards are properly defined and enforced. It also discourages "sloppy", or rushed code, that can quickly descend into a series of hacks, tape, and glue.

Start 2021 with clean SQL!

(that includes data scientists, analysts, or anybody in fact, who is writing or contributing SQL code).


Next Steps

  1. Learn more about Ancoris Data, Analytics & AI

Related Articles

Some areas of this page may shift around if you resize the browser window. Be sure to check heading and document order.