The world’s leading publication for data science, AI, and ML professionals.

Collaboration-Friendly SQL Queries

A few tips to improve the quality of your SQL queries and collaborate better

DATA ENGINEERING AND DATA SCIENCES

Photo by the author.
Photo by the author.

A data pipeline is stuck and needs someone’s attention. Or a query needs to be refactored using a new table because one of the used tables is getting deprecated or unavailable. Or we need to add more to our Sql logic of a very old data pipeline.

In all those situations, we have a feeling after a few minutes. We either appreciate the previous code authors, or we get stuck in hundreds of lines of SQL queries. I bet most people (if not all) would like to get appreciated for their work. However, many data engineers and data scientists don’t know how to craft SQL queries in collaboration-friendly ways.

This article will give you some of the essential tips that help you develop professional and collaboration-friendly SQL queries and data pipelines.

Avoid SELECT *

Don’t get me wrong, I use SELECT often, but only in my temporary ad-hoc queries to investigate tables or to do quick data check. Rarely (there are a few exceptions that I explain later) use SELECT in your pipelines or queries that are non-temporary.

The reason that it is important not to use SELECT * is when you want to update or refactor your pipeline or query later. The first thing to update a query (for example, to replace a table with another table) is to understand what columns are used in the query/pipeline. Let’s consider an example.

Suppose we are maintaining a table for recognizing our senior employees based on some metrics. For some reason, we cannot use the performance_review table anymore (for example, because of some privacy concerns). Our task is to replace the performance_review table with other accessible tables.

WITH current_senior_employees AS (
    SELECT *
    FROM employee
    WHERE age>=45
        AND is_current = true
        AND years_in_company >= 6
),
performance_last_5years AS (
    SELECT *
    FROM performance_review
    WHERE year BETWEEN 2016 AND 2020
)
SELECT
    ARBITRARY(a.first_name), 
    ARBITRARY(a.last_name), 
    ARBITRARY(a.email),    
    SUM(b.num_of_managed_employees) AS total_num_managed_employees,
    SUM(b.num_of_managed_projects) AS total_num_of_managed_projects,
    SUM(b.revenue_generated_by_managed_projects) AS total_revenue
FROM current_senior_employees AS a
LEFT JOIN avg_performance_last_5years AS b
ON a.id = b.employee_id
GROUP BY b.employee_id

As you can see, due to misuse of SELECT , we cannot easily recognize what columns from which tables are required to maintain this table. We need to go deeper into the code to find out that num_of_managed_employees, num_of_managed_projects, revenue_generated_by_managed_projects, and employee_id are the columns that we should find in other tables. Of course, my example is super small compared to regular queries in real jobs, and it is not hard to find out which columns are used. But a single query could be hundreds of lines and tens of joined tables in real-life projects. In those situations, it is super hard to go deep into the code to find out which columns are coming from which tables if the programmer uses SELECT to read everything first and then select the columns from a bunch of loaded columns.

In my example, a better way to write this query is:

WITH current_senior_employees AS (
    SELECT 
        id, 
        first_name, 
        last_name, 
        email
    FROM employee
    WHERE age>=45
        AND is_current = true
        AND years_in_company >= 6
),
performance_last_5years AS (
    SELECT 
        employee_id, 
        num_of_managed_employees, 
        num_of_managed_projects,
        revenue_generated_by_managed_projects
    FROM performance_review
    WHERE year BETWEEN 2016 AND 2020
)
SELECT
    ARBITRARY(a.first_name), 
    ARBITRARY(a.last_name), 
    ARBITRARY(a.email),    
    SUM(b.num_of_managed_employees) AS total_num_managed_employees,
    SUM(b.num_of_managed_projects) AS total_num_of_managed_projects,
    SUM(b.revenue_generated_by_managed_projects) AS total_revenue
FROM current_senior_employees AS a
LEFT JOIN avg_performance_last_5years AS b
ON a.id = b.employee_id
GROUP BY b.employee_id

Use aliases

One of the bad practices is to avoid using aliases when you join multiple tables. Look a the following example.

WITH current_senior_employees AS (
    ...
),
performance_last_5years AS (
    ...
)
SELECT
    ARBITRARY(first_name), 
    ARBITRARY(last_name), 
    ARBITRARY(email),    
    SUM(num_of_managed_employees) AS total_num_managed_employees,
    SUM(num_of_managed_projects) AS total_num_of_managed_projects,
    SUM(revenue_generated_by_managed_projects) AS total_revenue
FROM current_senior_employees
LEFT JOIN avg_performance_last_5years
ON id = employee_id
GROUP BY employee_id

It is the same example as the previous section. The only difference is that I did not use aliases (e.g., a and b) here. In this example, it is difficult to figure out the email column is coming from which table. Just because the column name is unique, you should not stop using aliases (especially when you join multiple tables).

Now imagine if I used SELECT * in addition to not using aliases.

WITH current_senior_employees AS (
    SELECT *
    FROM employee
    WHERE age>=45
        AND is_current = true
        AND years_in_company >= 6
),
performance_last_5years AS (
    SELECT *
    FROM performance_review
    WHERE year BETWEEN 2016 AND 2020
)
SELECT
    ARBITRARY(first_name), 
    ARBITRARY(last_name), 
    ARBITRARY(email),    
    SUM(num_of_managed_employees) AS total_num_managed_employees,
    SUM(num_of_managed_projects) AS total_num_of_managed_projects,
    SUM(revenue_generated_by_managed_projects) AS total_revenue
FROM current_senior_employees
LEFT JOIN avg_performance_last_5years
ON id = employee_id
GROUP BY employee_id

Using SELECT * and avoiding aliases even for a simple query can lead to big confusion.

Remember, use aliases to make your query more readable and easier to understand.

No meaningless aliases

Stop using meaningless aliases like a, b, c, and so on. The next bad thing after not using aliases is to use meaningless aliases. It is very common among data engineers and scientists to use simple aliases like a and b when they join tables. I am not against using a and b for temporary ad-hoc queries, but I do not use them for long-term queries or pipelines. When you share a query with someone else or go back to your query after a long time, it is not easy to understand it with meaningless aliases. Again for super short queries like the previous example, it is not hard to find out what a and b refer to, but when the query becomes long, it takes too much of our short-term memory to understand the query and fix it when necessary. For example, in my query, I could use:

WITH current_senior_employees AS (
    SELECT 
        id, 
        first_name, 
        last_name, 
        email
    FROM employee
    WHERE age>=45
        AND is_current = true
        AND years_in_company >= 6
),
performance_last_5years AS (
    SELECT 
        employee_id, 
        num_of_managed_employees, 
        num_of_managed_projects,
        revenue_generated_by_managed_projects
    FROM performance_review
    WHERE year BETWEEN 2016 AND 2020
)
SELECT
    ARBITRARY(employee.first_name), 
    ARBITRARY(employee.last_name), 
    ARBITRARY(employee.email),    
    SUM(perform.num_of_managed_employees) 
        AS total_num_managed_employees,
    SUM(perform.num_of_managed_projects) 
        AS total_num_of_managed_projects,
    SUM(perform.revenue_generated_by_managed_projects) 
        AS total_revenue
FROM current_senior_employees AS employee
LEFT JOIN avg_performance_last_5years AS perform
ON employee.id = perform.employee_id
GROUP BY perform.employee_id

CTEs are friendly

Don’t be afraid to use CTEs. I have seen Data Engineers and Scientists who use complex nested subqueries to avoid using CTEs (for no obvious reason!). Let me show you how my example query becomes complex if I use subqueries instead of CTEs.

SELECT
    ARBITRARY(employee.first_name), 
    ARBITRARY(employee.last_name), 
    ARBITRARY(employee.email),    
    SUM(perform.num_of_managed_employees) 
        AS total_num_managed_employees,
    SUM(perform.num_of_managed_projects) 
        AS total_num_of_managed_projects,
    SUM(perform.revenue_generated_by_managed_projects) 
        AS total_revenue
FROM (
    SELECT 
        id, 
        first_name, 
        last_name, 
        email
    FROM employee
    WHERE age>=45
        AND is_current = true
        AND years_in_company >= 6
) AS employee
LEFT JOIN(
    SELECT 
        employee_id, 
        num_of_managed_employees, 
        num_of_managed_projects,
        revenue_generated_by_managed_projects
    FROM performance_review
    WHERE year BETWEEN 2016 AND 2020
) AS perform
ON employee.id = perform.employee_id
GROUP BY perform.employee_id

Imagine how complex this could be if we have tens of joined tables (sometimes nested). Remember, there are multiple benefits of using CTEs:

  1. CTEs are friendly to human short memory (more organized).
  2. You can use a CTE multiple times in a query and make your query more organized (opposed to subqueries).

Base query using JINJA2

Jinja2 is very popular among data engineers and scientists to develop data pipelines. It enables them to make dynamic queries and compact large ETL queries via Template() function. Here I only focus on one of the hundreds of Jinja2 Template applications. My focus is to show you how Jinja2 Template can help you to write more clear SQL queries.

If you have multiple queries in your pipeline that use the same CTE or subquery, consider using defining a base query and replacing it in your queries via Jinja2. Here is an example of a pseudo query.

WITH cte1 AS (
        SELECT ... 
        FROM table3
        JOIN (    
            SELECT 
                ...
            FROM table1
            JOIN table2
            ON table1.col1 = table2.col2
            WHERE ... 
            AND ... 
        ) AS table_x
        ON ...
        WHERE ...),
    cte2 AS (
        SELECT ... 
        FROM table4
        JOIN (
            SELECT 
                ...
            FROM table1
            JOIN table2
            ON table1.col1 = table2.col2
            WHERE ... 
            AND ...
        ) AS table_x
        ON ...
        WHERE ...),
    ) 
    SELECT 
        ... 
    FROM cte1 
    JOIN cte2 
    ON ... 
    WHERE ...

If you look closely, you notice that the bolded subquery is repeated in both cte1 and cte2. Using Jinja2 Template (for example, in Python), you can define a base query for the repeated query and replace it within the main query. You might argue that we can define a CTE instead of using the Jinja2 Template in this example. You are absolutely right! If there is only one query in your pipeline with such a subquery, using a CTE is a much better solution. Still, if you have multiple individual queries in your pipeline (for example, for different tasks) that use the same subquery, then you can use Jinja2 Template to share the same query between them.

If you don’t know how to use Jinja2 Template, here is an example in Python.

from jinja2 import Template
base_query = """ 
    SELECT 
        ... 
    FROM table1
    JOIN table2
    ON table1.col1 = table2.col2
    WHERE ... 
        AND ... 
"""
main_query = Template("""
    WITH cte1 AS (
        SELECT ... 
        FROM table3
        JOIN ({{subquery}}) AS table_x
        ON ...
        WHERE ...),
    cte2 AS (
        SELECT ... 
        FROM table4
        JOIN ({{subquery}}) AS table_x
        ON ...
        WHERE ...),
    ) 
    SELECT 
        ... 
    FROM cte1 
    JOIN cte2 
    ON ... 
    WHERE ...
""").render(subquery=base_query)

Some final (but important!!!) tips

In the end, I would like to have some crucial general advice for you on how to craft your queries or pipeline for better collaboration.

Thanks to the query optimization feature of most relational database systems, the query you write get optimized before it gets executed. Therefore, in many cases, you may write an inefficient query (for example, loading all columns in a CTE and then selecting columns). However, the optimizer finds an optimum query plan for execution, and the machine performance is still good. However, always remember, just because the optimizer is helping you to have a better performance, it does not mean that you can write queries any way you like. We must be considerate of people who collaborate with us now and in the future to maintain the code.

Finally, if you think something might take your time at the query development stage, but it will save time later for those who read it, you must do it. For example, it might take you a little bit of time to find and use a proper alias name instead of random "a" or "b," but it will save much time later for those who read it later (even yourself).


Follow me on Medium and Twitter for the latest stories.


Related Articles