The world’s leading publication for data science, AI, and ML professionals.

Understanding SQL: Order of Execution

A brief guide on how databases interpret your SQL queries

Photo by Wengang Zhai on Unsplash
Photo by Wengang Zhai on Unsplash

Introduction

Writing efficient SQL queries is an essential skill for any data analyst working with large volumes of data. I’m sure many of us have endured the pain of having developed a query that runs well on small scales, only to have it slowly grind out the results when applied to a much larger dataset.

Oftentimes, query performance can be significantly improved by simply understanding how a Database interprets the query. Not only will this help you optimize queries for speed and performance, but it will also help debug and troubleshoot erroneous scripts.

So today, I’m going to step you through the order in which a SQL query is executed and touch on some common errors that arise when building queries.

Declarative vs. Procedural Languages

First, it’s important to understand that Sql is a declarative programming language. This means that we define the result we want but provide no instruction on how it is achieved. This contrasts with imperative, or procedural, languages that require each step to produce the output to be explicitly defined. The implication of working with a declarative language like SQL is that, while SQL expects statements to be written in a specified order, the sequence in which the statements are evaluated differs.

To demonstrate, here is the list of the seven common clauses used when building SQL queries and the order they should be used in:

1. SELECT
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. ORDER BY
7. LIMIT

Now compare this with the order of execution:

1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY
7. LIMIT

As you can see, the statements have been shuffled around a bit. For example, note that while the SELECT clause is written first it is much lower down the list when it comes to execution. As we’ll see shortly, it’s the order of execution that matters most and is something an analyst must be keenly aware of.

The FROM Clause

Naturally, the database needs to know where data is coming from so this is a logical first step. While simpler queries might only reference a single table, more often the information you need exists across several tables. Accordingly, the JOIN statement is used in conjunction with FROM to combine source tables. If a join is required then the first thing the database will do is bring everything together.

This means that you should think about the size of the source tables, the type of join used, and the number of predicates used in the join. For example, reducing the size of source tables by selecting only the necessary columns, filtering out unnecessary rows, and ensuring there are common identifiers to complete the join, will all improve efficiency. Additionally, INNER JOINs should be preferred over OUTER JOINs as the former are generally faster.

Ultimately, you don’t want to process data that isn’t needed so minimizing the set to work with should be a primary objective, as much as possible.

The WHERE Clause

This clause is used to filter a base table, or joined output, by retuning only those rows that meet a given condition. Any supported data type can be used to filter records. For example, consider the table below which lists a small number of Commonwealth cities, along with their populations:

A small example table called 'cities' (image by author).
A small example table called ‘cities’ (image by author).

If we were to filter this table down to only those cities in New Zealand we could write the following query, for example:

SELECT 
   city
  ,country
FROM 
  cities
WHERE 
  country = 'New Zealand';

which would return the rows containing Auckland, Christchurch, and Wellington. Alternatively, if we wanted to return all cities that have a population larger than half a million then the query would look like this:

SELECT
   city
  ,country
FROM
  cities
WHERE 
  population > 500000;

We could also combine these filters using the AND operator, which would return Auckland only:

SELECT
   city
  ,country
FROM
  cities
WHERE 
  country = 'New Zealand' 
  AND population > 500000;

One important thing to remember about the WHERE clause is that it cannot be used to filter aggregated columns. For example, take a look at the modified query below:

SELECT 
   country
  ,SUM(popualtion) 
FROM 
  cities
WHERE 
  SUM(popualtion) > 5000000
GROUP BY 
  country;

The intent of the above query is to obtain all countries where the combined city population is larger than 5 million. Unfortunately, this query will fail because an aggregation function has been used in the WHERE statement. The issue is that aggregation functions require the GROUP BY clause which is executed after the WHERE clause. This means the WHERE condition cannot be evaluated because the database is not yet aware of any aggregated variables.

We’ll see how to remedy this shortly, but before we do let’s quickly touch on the GROUP BY clause.

The GROUP BY Clause

As you’ve probably recognized, this clause allows us to aggregate, or summarise, a quantity and is used in combination with functions such as COUNT(), SUM(), MIN(), MAX(), and the like. In effect, GROUP BY collapses the variable, or variables, and returns a single value for each distinct element, or combination of elements. For example, if we wanted to tally the city populations for each country, we can group them by country like so:

SELECT 
   country
  ,SUM(popualtion) 
FROM 
  cities
GROUP BY 
  country;

The output will then return four rows – one for each country – along with the aggregated population for each country listed in the table.

The HAVING Clause

This clause solves the problem encountered earlier when trying to filter using an aggregation function with the WHERE clause. The HAVING clause allows us to filter results using grouped and aggregated data because it is executed after the GROUP BY statement. The database is now aware of the aggregations meaning they can be used in all statements that follow. We can now amend the earlier query like so:

SELECT 
   country
  ,SUM(population)
FROM 
  cities
GROUP BY 
  country
HAVING 
  SUM(popualtion) > 5000000;

This will return just two countries: Australia and England.

The SELECT Clause

The SELECT clause is where we define the columns we want in our table, along with any grouped and aggregated fields. This is also where we can apply column aliases using the AS operator. Now, while the select statement comes first when building our queries, it isn’t executed until the data have been sourced and filtered. This is important to recognize because what this means is that aggregated variables and aliases cannot be used in WHERE, GROUP BY, or HAVING statements.

For example, consider the following query that creates a column alias total_pop which is then used with the HAVING clause. This query will throw an error because the alias has not yet been created. The HAVING clause precedes the SELECT clause so there is nothing called total_pop to ** reference**.

SELECT 
   country
  ,SUM(population) AS total_pop
FROM 
  cities
GROUP BY 
  country
HAVING 
  total_pop > 5000000;

I won’t go into detail on these, but the DISTINCT and UNION statements are executed after SELECT and before the ORDER BY clause, with DISTINCT executed before UNION.

The ORDER BY Clause

We’re now reaching the end and much of the heavy lifting has been done. We have sourced (and possibly joined) tables, applied some filtering, grouped and aggregated some fields, and specified the columns we want to be included in our final table.

At this point, you’re likely thinking about how you want the data arranged in the target table. For example, you might have rows ordered chronologically, or perhaps ordered based on some ranking value. This is precisely what the ORDER BY clause does.

The nice about this statement is that, because it’s at the backend of the order, we can use aggregations and column aliases in our GROUP BY statements. For example, suppose we wanted to order country by total city population. We could write a query like the following:

SELECT 
   country
  ,SUM(population) AS total_pop
FROM 
  cities
GROUP BY 
  country
ORDER BY
  total_pop;

Note here that we can use the column alias total_pop in the ORDER BY statement. By default, this will return records in ascending order (i.e., smallest to largest). To return the rows in descending order we can use the DESC operator like so:

SELECT 
   country
  ,SUM(population) AS total_pop
FROM 
  cities
GROUP BY 
  country
ORDER BY
  total_pop DESC;

The LIMIT Clause

When dealing with large tables it’s often not optimal to have the query return all rows, particularly when you’re only developing and testing. The LIMIT clause is exceedingly useful here and allows us to define the number of rows we want back. It’s also used in combination with the ORDER BY clause to return the n-top or n-bottom records. For example, suppose we wanted the top three most populous cities in the table. We could use the ORDER BY and LIMIT clauses as follows:

SELECT 
   city
  ,country
  ,population AS city_pop
FROM 
  cities
ORDER BY
  city_pop DESC
LIMIT
  3;

Note that not all databases support the LIMIT statement, but they will have equivalents that perform a similar function.

Wrapping Up

The order in which statements are executed is an important concept to grasp when building SQL queries, and we’ve touched on some common gotchas that can occur. Though I haven’t provided in-depth examples I hope this brief little primer gets you thinking about how to improve your query performance, and if you’re just starting out with SQL, I hope this article helps you along your journey.


Thanks for reading!

If you enjoyed this post and would like to stay up to date then please consider following me on Medium. This will ensure you don’t miss out on any new content.

To get unlimited access to all content consider signing up for a Medium subscription.

You can also follow me on Twitter, LinkedIn, or check out my GitHub if that’s more your thing 😉


Related Articles