The world’s leading publication for data science, AI, and ML professionals.

Decrypting complex SQL

An effective approach to analyse and understand complex SQL

Photo by @alexmotoc on Unsplash
Photo by @alexmotoc on Unsplash

Tips and Tricks

In this story I want to share a possible way to tackle the tricky understanding of complex (or complicated) SQL queries. When we need to handle some existing SQL code, maybe non-documented, poorly written, or uses SQL versions that we don’t master just yet, it can be frustrating to deal with it.

Whatever thing we need to do with it, just running it as it is, using it in a Data Science or Analytics process, modifying it, etc., in any case we need to understand it.

Content:

· What can make Sql tricky
· Breaking SQL complexity
  ∘ 1. Improve the code format and structure
  ∘ 2. High level SQL review
  ∘ 3. Clarify unknown SQL clauses, syntaxes and functions
  ∘ 4. Split and analyse inner to outer SQL
· In a nutshell...

What can make SQL tricky

When we look a new SQL query the majors obstacles to our full understanding are

  • Intensive use of sub-queries and nested logic
  • Unknown clauses, functions, syntaxes, database-specific commands
  • Poorly structured or poorly documented code

So, supposing that we need to understand a SQL query quickly, can we do anything to optimize our SQL analysis approach?

Breaking SQL complexity

In this section I will present the 4 main things that can make our SQL analysis more efficient.

  1. Improve the code format and structure

2. High level SQL review

3. Clarify unknown SQL clauses, syntaxes and functions

4. Split and analyse inner to outer SQL

I will explain each concept and apply it to a SQL example, which actually isn’t a very complex nor a long one, but the very same ideas apply to any SQL query.

Here we go, somebody gave us to use this piece of SQL for our new Data Mining workflow, and told us that it runs on the SQL Server database. Yeah, great stuff, and now?

WITH tmp_1 AS
(
SELECT Calc1 =
( (SELECT TOP 1 DataValue
FROM (
SELECT TOP 50 PERCENT DataValue
FROM SOMEDATA
WHERE DataValue IS NOT NULL
ORDER BY DataValue
) AS A
ORDER BY DataValue DESC
) + (SELECT TOP 1 DataValue
FROM (
SELECT TOP 50 PERCENT DataValue
FROM SOMEDATA
WHERE DataValue Is NOT NULL
ORDER BY DataValue DESC
)AS A
ORDER BY DataValue ASC))/2
 ),tmp_2 AS
(SELECT AVG(DataValue) Mean, MAX(DataValue) - MIN(DataValue) AS MaxMinRange
FROM SOMEDATA
),tmp_3 AS
(
SELECT TOP 1 DataValue AS Calc2, COUNT(*) AS Calc2Count
FROM SOMEDATA
GROUP BY DataValue
ORDER BY Calc2Count DESC
)
SELECT Mean, Calc1, Calc2 , MaxMinRange AS [Range]
FROM tmp_1 CROSS JOIN tmp_2 CROSS JOIN tmp_3;

1. Improve the code format and structure

It could be silly, but as our eyes appreciate order, the first thing do it is structure the SQL script in a nice and readable way. It takes 5 minutes, by using any SQL client’s script formatting built-in tool, or there are plenty free online pages as well.

Once done on the Example below, we’d get something like this:

Screenshot by Author: example of SQL query formatted
Screenshot by Author: example of SQL query formatted

The big added value is that now we can clearly see all the sub-queries, how deeply they are nested, and each SELECT is very easily identifiable. Those will be the bricks of our next steps.

2. High level SQL review

2.1 Identify the main SQL layer

Any query can be seen as composed by layers of other queries, and sub-subqueries. We can find those layers by looking at the SELECT clauses and their dependencies.

Now, complex queries might have many layers, but to have a general understanding we can just look at Layer 1, and see how it combines the data coming from its inner layers : does it do JOINs, UNIONs, INTERSECTs, etc.?

This gives you some hints on the general purpose of that query.

In our example

Screenshot by Author
Screenshot by Author

So we can understand that LEVEL #1 combines the data of 3 sub-queries tmp_1, tmp_2, tmp_3 by CROSS JOIN.

Let’s go on and learn more on the tables that feed our query.

2.2 Find out tables and data model

To understand what our SQL query does, we need to identify the tables that provide data to each SELECT clause, and also to deduct the relations among them, mentally building a sort of data model implicitly used by the query.

To do so, we just need to look at the FROM clauses of all the possible layers and their sub-queries.

In our simple example, all the sub-queries use only one table named "SOMEDATA". We should have a look to its data, which will be helpful for the next step.

Screenshot by Author
Screenshot by Author

3. Clarify unknown SQL clauses, syntaxes and functions

Now we want to understand in more detail what our SQL does. To do so we need to learn more about all the SQL functions, operators, clauses and objects that we are not familiar with. It’s perfectly normal if we don’t know everything about all the existing SQL flavours, of all possible database vendors and versions.

So just find out any piece of code that you don’t know and lookup the documentation.

For instance, the query in our example here is written to run on the database SQL Server.

In our query example, let’s suppose that we don’t understand two parts:

"..what is "PERCENT"? Might it be some SQL Server’s-specific operator?"

" .. I’ve seen WITH before, but I can’t remember what it does"

After reading the doc, we learned that

TOP 50 PERCENT splits the data in column DataValue in 2 sets, each with half of the data, and takes the top half. Using it in combination with the ORDER BY clause, allows us to consider the top and bottom 50% of the data values.

WITH .. AS creates record sets on the fly, and names it "tmp_1", which can be then referred to in other parts of the same query.

4. Split and analyse inner to outer SQL

Now we have all the elements to understand the whole query. To do so we need to analyse SQL part by part, staring with the inner query layers, then moving up towards the layer #1.

Every time that we have understood all the sub-queries belonging to a layer, we should be able to understand what their outer layer does as well.

Let’s apply this principle to our query example: we start by looking at the inner sub-queries of section WITH tmp_1 (see the comments 1.), then we move to 2., then 3. combines the knowledge of 1. and 2. together, and that allows to fully understand the whole tmp_1 section.

Once we got to 7. , we understood that the SQL script purpose is just to calculate median, mean, range and mode on a single table’s column.

Screenshot by Author: example of SQL analysis
Screenshot by Author: example of SQL analysis

In a nutshell…

In this story I presented a very effective way to analyze complex SQL queries. The 4 main steps are very practical and can be applied to query of any complexity. The approach works well for anybody from SQL beginners to experts, it will just change the time needed to complete it.

Thanks for reading.


Feel free to subscribe to my "Sharing Data Knowledge" Newsletter.

If you wish to subscribe to Medium, feel free to use my referral link https://medium.com/@maw-ferrari/membership : it costs the same for you, but it contributes indirectly to my stories.


Related Articles