The world’s leading publication for data science, AI, and ML professionals.

Common Table Expressions: 5 Tips for Data Scientists to Write Better SQL

From SQL Server to Google BigQuery: Learn why I love CTEs.

CTEs in SQL: 5 Tips for Data Scientists to Write Better SQL

Image licensed to author
Image licensed to author

This is my 20th year working with SQL (eek!). To mark this milestone, I recently wrote an article on the 10 SQL naming standards I live by, which I feel, produce SQL that is easy to read, debug, and maintain.

10 SQL standards to make your code more readable in 2021

Following this theme, I thought I’d explore some of the other elements of the SQL language that I feel, help differentiate great SQL, from simply functional SQL.

One element at the top of my list is the common table expression, or CTE for short. First launched around 2000, CTEs are now widely available in most modern database platforms, including MS SQL Server, Postgres, MySQL, and Google BigQuery.

I have used Google BigQuery for my examples, but the syntax for CTEs will be very similar to other database platforms that you might be using. I’ve used the Santander cycle hire scheme dataset, part of Google’s publicly available datasets.

A quick note for BigQuery users – in case you are wondering why my BigQuery UI looks different from yours, I’m using the updated UI for 2021, just released in preview. If you want to see what I think of the new features, have a read of the article I published a few days ago.

5 great features in BigQuery’s new UI for 2021 (yes, it has tabs!)

What are Common Table Expressions?

For those of you who are not familiar with CTEs, or perhaps just starting out in your Data Science careers, a CTE is a temporary result set from a SQL query, that you can reference in other parts of a wider SQL query as if it were a permanent table in your database. The CTE is temporary and only exists for the duration of the wider query being executed. You can alias (name) your CTEs, in a similar way to how you alias tables or columns names and can have multiple CTEs per query.

Here is the basic syntax:

with employee_cte as (select first_name, last_name from employee)
select e.first_name,
       e.last_name,
       h.hire_date
from employee_cte e
  inner join employee_hire h
  on h.employee_id = e.employee_id

Follow my 5 tips and hopefully, by the end, you’ll have some good best practices to begin exploring them for yourself.

So, in no particular order, let’s begin.


1 Avoid duplicate column transformations

In this query, I want to find all cycle stations that are within 500m of London’s city centre, and return the station name and the distance from the centre:

select name as station_name,

       st_distance
       (
        st_geogpoint( longitude, latitude),
        st_geogpoint(-0.118092, 51.509865)
       )    as distance_from_city_centre_m
from `bigquery-public-data.london_bicycles.cycle_stations`
where st_distance
(
 st_geogpoint( longitude, latitude),
 st_geogpoint(-0.118092, 51.509865)
) <= 500
order by distance_from_city_centre_m

Whilst this is perfectly functional and returns the correct result (below in case you were curious!), you will note the logic in bold to calc the distance is duplicated. This is something we want to avoid; logic should be defined once, making it easier to read and modify later on if required.

Here is where a CTE comes to the rescue:

with stations
as
(
  select name as station_name,
         st_distance
         (
           st_geogpoint( longitude, latitude),
           st_geogpoint(-0.118092, 51.509865)
         ) as distance_from_city_centre_m
  from `bigquery-public-data.london_bicycles.cycle_stations`
)
select station_name,
       distance_from_city_centre_m
from stations
where distance_from_city_centre_m <= 500
order by distance_from_city_centre_m

No more duplicated logic. And, I think this much easier to read than the first query too – what do you think?

And because my stations CTE is temporary, I haven’t had to clutter my database with any materialised (permanent) tables, or views.

Cycle hire stations, within 500m of the center of London
Cycle hire stations, within 500m of the center of London

2 Simplify complex SQL (KISS)

As in other areas of programming, and life for that matter, it really helps if you try and keep things simple where you can. One way we can use CTEs to achieve this in our Sql code is to use CTEs as a way of breaking up complex SQL queries into smaller, bite-size chunks. This produces SQL that is:

  1. Easier to read and understand
  2. Easier to debug; you build your SQL in stages, testing the output of each CTE before continuing to the next.
  3. Easier to maintain; changing logic can now be done in a more modulised way, and in a single place.

In this example, I want to further explore the stations within 500m of the city centre, and single out the station that has seen the most journeys terminate there over the past 12 months. I then want to extract all other stations that have seen more journeys than this and analyse their distance from the city centre.

One way to achieve this (without CTEs) is the following:

select  s.name    as station_name,
        st_distance
        (
          st_geogpoint( s.longitude, s.latitude),
          st_geogpoint(-0.118092, 51.509865)
        )         as distance_from_city_centre_m,
        count(j.rental_id) as journey_count
from `bigquery-public-data.london_bicycles.cycle_stations` s
  inner join  `bigquery-public-data.london_bicycles.cycle_hire` j
        on j.end_station_id = s.id
        and cast(j.end_date as date) >= date_sub('2017-1-1', interval 1 year)
group by station_name, s.longitude, s.latitude
having count(j.rental_id) > 
(
  select journey_count
  from
  (
    select  dense_rank() 
            over (order by count(j.rental_id) desc) as rank,
            s.id                                    as station_id,
            count(j.rental_id)                      as journey_count
    from `bigquery-public-data.london_bicycles.cycle_stations` s
        inner join  `bigquery-public-data.london_bicycles.cycle_hire` j
        on j.end_station_id = s.id
        and cast(j.end_date as date) >= date_sub('2017-1-1', interval 1 year)
    where j.end_station_id in
    (
      select  s.id     as station_id
      from `bigquery-public-data.london_bicycles.cycle_stations` s
      where st_distance(
              st_geogpoint( s.longitude, s.latitude),
              st_geogpoint(-0.118092, 51.509865)
            ) <= 500
    )
    group by station_id
  )
  where rank = 1
) 
order by journey_count desc

Again, this is perfectly functional code. However, there is duplicate logic (in bold), and I find this quite difficult to read. Also, if I had to debug this, it would be difficult – for example, validating the busiest station, as everything is intertwined.

Applying CTEs, and we can greatly improve this (CTEs in bold):

with station_proximity
as
(
  select  id              as station_id,
          name            as station_name,
          st_distance(
            st_geogpoint( longitude, latitude),
            st_geogpoint(-0.118092, 51.509865)
          )               as distance_from_city_centre_m
  from `bigquery-public-data.london_bicycles.cycle_stations`
),
station_journeys 
as
(
select  s.station_id,
        s.station_name,
        s.distance_from_city_centre_m,
        count(1)    as journey_count
from station_proximity s
  inner join  `bigquery-public-data.london_bicycles.cycle_hire` j
  on j.end_station_id = s.station_id
  and cast(j.end_date as date) >= date_sub('2017-1-1', interval 1 year)
group by s.station_id, s.station_name, s.distance_from_city_centre_m
),
stations_near_centre
as
(
  select  sp.station_id,
          sj.journey_count,
          dense_rank() 
            over (order by sj.journey_count desc) journey_rank,
  from    station_proximity sp

  inner join station_journeys sj
  on sj.station_id = sp.station_id
  where   sp.distance_from_city_centre_m <= 500
)
select station_name,
       distance_from_city_centre_m,
       journey_count
from   station_journeys 
where  journey_count >
(
  select journey_count
  from   stations_near_centre s
  where  s.journey_rank = 1
)
order by journey_count desc

I think this is much easier to read. If I wanted to debug this, I can easily change the final select to select from each CTE in turn, validating before I move onto the next one.

(If you are interested, here are the results. As you’d expect, the majority of these stations are close to mainline railway links)

3 Naming is important

My third tip is to take care when naming your CTEs.

In my previous example, I took care to name my CTEs with meaningful names:

with station_proximity
as
( --- ),
station_journeys,
as
( -- ),
stations_near_centre
as
( -- )

I often see this, and I’d always avoid if possible:

with cte1
as
( --- ),
cte2,
as
( -- ),
cte3
as
( -- )

4 Perfect for SQL unions

When working with data, we often have to consolidate data from different sources into a single, conformed table.

CTEs lend themselves really well for this use case.

Say we had cycle journeys held in two systems, A and B. We can effectively use CTEs when writing code to populate a single journeys table like this:

with system_a_journeys
as
( some data wrangling of system a ),
system_b_journeys,
as
( some data wrangling of system b),
all_journeys
as
(
  select *
  from system_a_journeys
  union all
  select *
  from system_b_journeys
)

note I’d never advocate using select (list columns explicitly).

5 Use CTEs to show dependencies

Another novel use of CTEs that can improve the maintainability of your SQL, is to use them to make it easy to see which tables/views your SQL is referencing. This is especially useful for complex queries, spanning many lines, and referencing a lot of tables/views.

To do this, at the start of your SQL statement, create CTE "wrappers" for all of the tables/views you are selecting from. You can also restrict the columns to just those you actually use – not only does this help identify which columns you are consuming, but can also improve performance.

So in my example from tip 2, I would start the query with:

with cycle_stations as 
(
  select id    as station_id,
         name  as station_name,
         longitude,
         latitude
  from `bigquery-public-data.london_bicycles.cycle_stations`
),
cycle_journeys as
(
  select station_id,
         end_date
  from `bigquery-public-data.london_bicycles.cycle_hire`
)

Conclusions

Hopefully from reading this you now have an understanding of common table expressions, and how they can be used to really improve your SQL writing.

CTEs are something I personally use on a daily basis and are one of the best tools in the SQL toolbox for aiding readability and maintainability. They also make writing SQL, a much more pragmatic, and therefore enjoyable experience.

I hope you agree 🙂


Next Steps

  1. Learn more about Ancoris Data, Analytics & AI

Related Articles