DATA ENGINEERING AND DATA SCIENCES

A data pipeline is stuck and needs someone’s attention. Or a query needs to be refactored using a new table because one of the used tables is getting deprecated or unavailable. Or we need to add more to our Sql logic of a very old data pipeline.
In all those situations, we have a feeling after a few minutes. We either appreciate the previous code authors, or we get stuck in hundreds of lines of SQL queries. I bet most people (if not all) would like to get appreciated for their work. However, many data engineers and data scientists don’t know how to craft SQL queries in collaboration-friendly ways.
This article will give you some of the essential tips that help you develop professional and collaboration-friendly SQL queries and data pipelines.
Avoid SELECT *
Don’t get me wrong, I use SELECT often, but only in my temporary ad-hoc queries to investigate tables or to do quick data check. Rarely (there are a few exceptions that I explain later) use SELECT in your pipelines or queries that are non-temporary.
The reason that it is important not to use SELECT * is when you want to update or refactor your pipeline or query later. The first thing to update a query (for example, to replace a table with another table) is to understand what columns are used in the query/pipeline. Let’s consider an example.
Suppose we are maintaining a table for recognizing our senior employees based on some metrics. For some reason, we cannot use the performance_review
table anymore (for example, because of some privacy concerns). Our task is to replace the performance_review
table with other accessible tables.
WITH current_senior_employees AS (
SELECT *
FROM employee
WHERE age>=45
AND is_current = true
AND years_in_company >= 6
),
performance_last_5years AS (
SELECT *
FROM performance_review
WHERE year BETWEEN 2016 AND 2020
)
SELECT
ARBITRARY(a.first_name),
ARBITRARY(a.last_name),
ARBITRARY(a.email),
SUM(b.num_of_managed_employees) AS total_num_managed_employees,
SUM(b.num_of_managed_projects) AS total_num_of_managed_projects,
SUM(b.revenue_generated_by_managed_projects) AS total_revenue
FROM current_senior_employees AS a
LEFT JOIN avg_performance_last_5years AS b
ON a.id = b.employee_id
GROUP BY b.employee_id
As you can see, due to misuse of SELECT , we cannot easily recognize what columns from which tables are required to maintain this table. We need to go deeper into the code to find out that num_of_managed_employees
, num_of_managed_projects
, revenue_generated_by_managed_projects
, and employee_id
are the columns that we should find in other tables. Of course, my example is super small compared to regular queries in real jobs, and it is not hard to find out which columns are used. But a single query could be hundreds of lines and tens of joined tables in real-life projects. In those situations, it is super hard to go deep into the code to find out which columns are coming from which tables if the programmer uses SELECT to read everything first and then select the columns from a bunch of loaded columns.
In my example, a better way to write this query is:
WITH current_senior_employees AS (
SELECT
id,
first_name,
last_name,
email
FROM employee
WHERE age>=45
AND is_current = true
AND years_in_company >= 6
),
performance_last_5years AS (
SELECT
employee_id,
num_of_managed_employees,
num_of_managed_projects,
revenue_generated_by_managed_projects
FROM performance_review
WHERE year BETWEEN 2016 AND 2020
)
SELECT
ARBITRARY(a.first_name),
ARBITRARY(a.last_name),
ARBITRARY(a.email),
SUM(b.num_of_managed_employees) AS total_num_managed_employees,
SUM(b.num_of_managed_projects) AS total_num_of_managed_projects,
SUM(b.revenue_generated_by_managed_projects) AS total_revenue
FROM current_senior_employees AS a
LEFT JOIN avg_performance_last_5years AS b
ON a.id = b.employee_id
GROUP BY b.employee_id
Use aliases
One of the bad practices is to avoid using aliases when you join multiple tables. Look a the following example.
WITH current_senior_employees AS (
...
),
performance_last_5years AS (
...
)
SELECT
ARBITRARY(first_name),
ARBITRARY(last_name),
ARBITRARY(email),
SUM(num_of_managed_employees) AS total_num_managed_employees,
SUM(num_of_managed_projects) AS total_num_of_managed_projects,
SUM(revenue_generated_by_managed_projects) AS total_revenue
FROM current_senior_employees
LEFT JOIN avg_performance_last_5years
ON id = employee_id
GROUP BY employee_id
It is the same example as the previous section. The only difference is that I did not use aliases (e.g., a
and b
) here. In this example, it is difficult to figure out the email
column is coming from which table. Just because the column name is unique, you should not stop using aliases (especially when you join multiple tables).
Now imagine if I used SELECT * in addition to not using aliases.
WITH current_senior_employees AS (
SELECT *
FROM employee
WHERE age>=45
AND is_current = true
AND years_in_company >= 6
),
performance_last_5years AS (
SELECT *
FROM performance_review
WHERE year BETWEEN 2016 AND 2020
)
SELECT
ARBITRARY(first_name),
ARBITRARY(last_name),
ARBITRARY(email),
SUM(num_of_managed_employees) AS total_num_managed_employees,
SUM(num_of_managed_projects) AS total_num_of_managed_projects,
SUM(revenue_generated_by_managed_projects) AS total_revenue
FROM current_senior_employees
LEFT JOIN avg_performance_last_5years
ON id = employee_id
GROUP BY employee_id
Using SELECT * and avoiding aliases even for a simple query can lead to big confusion.
Remember, use aliases to make your query more readable and easier to understand.
No meaningless aliases
Stop using meaningless aliases like a, b, c, and so on. The next bad thing after not using aliases is to use meaningless aliases. It is very common among data engineers and scientists to use simple aliases like a and b when they join tables. I am not against using a and b for temporary ad-hoc queries, but I do not use them for long-term queries or pipelines. When you share a query with someone else or go back to your query after a long time, it is not easy to understand it with meaningless aliases. Again for super short queries like the previous example, it is not hard to find out what a and b refer to, but when the query becomes long, it takes too much of our short-term memory to understand the query and fix it when necessary. For example, in my query, I could use:
WITH current_senior_employees AS (
SELECT
id,
first_name,
last_name,
email
FROM employee
WHERE age>=45
AND is_current = true
AND years_in_company >= 6
),
performance_last_5years AS (
SELECT
employee_id,
num_of_managed_employees,
num_of_managed_projects,
revenue_generated_by_managed_projects
FROM performance_review
WHERE year BETWEEN 2016 AND 2020
)
SELECT
ARBITRARY(employee.first_name),
ARBITRARY(employee.last_name),
ARBITRARY(employee.email),
SUM(perform.num_of_managed_employees)
AS total_num_managed_employees,
SUM(perform.num_of_managed_projects)
AS total_num_of_managed_projects,
SUM(perform.revenue_generated_by_managed_projects)
AS total_revenue
FROM current_senior_employees AS employee
LEFT JOIN avg_performance_last_5years AS perform
ON employee.id = perform.employee_id
GROUP BY perform.employee_id
CTEs are friendly
Don’t be afraid to use CTEs. I have seen Data Engineers and Scientists who use complex nested subqueries to avoid using CTEs (for no obvious reason!). Let me show you how my example query becomes complex if I use subqueries instead of CTEs.
SELECT
ARBITRARY(employee.first_name),
ARBITRARY(employee.last_name),
ARBITRARY(employee.email),
SUM(perform.num_of_managed_employees)
AS total_num_managed_employees,
SUM(perform.num_of_managed_projects)
AS total_num_of_managed_projects,
SUM(perform.revenue_generated_by_managed_projects)
AS total_revenue
FROM (
SELECT
id,
first_name,
last_name,
email
FROM employee
WHERE age>=45
AND is_current = true
AND years_in_company >= 6
) AS employee
LEFT JOIN(
SELECT
employee_id,
num_of_managed_employees,
num_of_managed_projects,
revenue_generated_by_managed_projects
FROM performance_review
WHERE year BETWEEN 2016 AND 2020
) AS perform
ON employee.id = perform.employee_id
GROUP BY perform.employee_id
Imagine how complex this could be if we have tens of joined tables (sometimes nested). Remember, there are multiple benefits of using CTEs:
- CTEs are friendly to human short memory (more organized).
- You can use a CTE multiple times in a query and make your query more organized (opposed to subqueries).
Base query using JINJA2
Jinja2 is very popular among data engineers and scientists to develop data pipelines. It enables them to make dynamic queries and compact large ETL queries via Template()
function. Here I only focus on one of the hundreds of Jinja2 Template applications. My focus is to show you how Jinja2 Template can help you to write more clear SQL queries.
If you have multiple queries in your pipeline that use the same CTE or subquery, consider using defining a base query and replacing it in your queries via Jinja2. Here is an example of a pseudo query.
WITH cte1 AS (
SELECT ...
FROM table3
JOIN (
SELECT
...
FROM table1
JOIN table2
ON table1.col1 = table2.col2
WHERE ...
AND ...
) AS table_x
ON ...
WHERE ...),
cte2 AS (
SELECT ...
FROM table4
JOIN (
SELECT
...
FROM table1
JOIN table2
ON table1.col1 = table2.col2
WHERE ...
AND ...
) AS table_x
ON ...
WHERE ...),
)
SELECT
...
FROM cte1
JOIN cte2
ON ...
WHERE ...
If you look closely, you notice that the bolded subquery is repeated in both cte1
and cte2
. Using Jinja2 Template (for example, in Python), you can define a base query for the repeated query and replace it within the main query. You might argue that we can define a CTE instead of using the Jinja2 Template in this example. You are absolutely right! If there is only one query in your pipeline with such a subquery, using a CTE is a much better solution. Still, if you have multiple individual queries in your pipeline (for example, for different tasks) that use the same subquery, then you can use Jinja2 Template to share the same query between them.
If you don’t know how to use Jinja2 Template, here is an example in Python.
from jinja2 import Template
base_query = """
SELECT
...
FROM table1
JOIN table2
ON table1.col1 = table2.col2
WHERE ...
AND ...
"""
main_query = Template("""
WITH cte1 AS (
SELECT ...
FROM table3
JOIN ({{subquery}}) AS table_x
ON ...
WHERE ...),
cte2 AS (
SELECT ...
FROM table4
JOIN ({{subquery}}) AS table_x
ON ...
WHERE ...),
)
SELECT
...
FROM cte1
JOIN cte2
ON ...
WHERE ...
""").render(subquery=base_query)
Some final (but important!!!) tips
In the end, I would like to have some crucial general advice for you on how to craft your queries or pipeline for better collaboration.
Thanks to the query optimization feature of most relational database systems, the query you write get optimized before it gets executed. Therefore, in many cases, you may write an inefficient query (for example, loading all columns in a CTE and then selecting columns). However, the optimizer finds an optimum query plan for execution, and the machine performance is still good. However, always remember, just because the optimizer is helping you to have a better performance, it does not mean that you can write queries any way you like. We must be considerate of people who collaborate with us now and in the future to maintain the code.
Finally, if you think something might take your time at the query development stage, but it will save time later for those who read it, you must do it. For example, it might take you a little bit of time to find and use a proper alias name instead of random "a" or "b," but it will save much time later for those who read it later (even yourself).