
If you are a developer or analyst working with data day by day, SQL queries would be one of the essential skills for many different roles. When I was a tutor at Uni, a student complained that SQL was the worst programming language. Well, whether SQL is a "programming language" I guess could be controversial. However, it is no doubt that it is quite structured and rigid in terms of its syntax. Therefore, it will be relatively easier to implement such a tool to fully understand its meaning. No, we don’t need ChatGPT 🙂
In this article, I’m going to introduce a library that helps us parse SQL queries and even build a query programmatically. If you don’t know why we need to do so, that means it would be even more important to read through the rest of the content.
1. Installation

Boring part as usual, but I have to have this section anyway. To install this library, just simply run the following.
pip install sqlglot
Additionally, if you are in a more flexible environment such as your laptop, it is recommended to install the package together with its Rust tokenizer. This will make the parsing process more efficient.
pip3 install "sqlglot[rs]"
Then, in our Python code, we should import the library before use. For all my examples in this article, I will use the alias sg
for the library sqlglot
, as we need to use several different functions in this package. If you want to run my examples, please don’t forget to run the line of code below.
import sqlglot as sg
2. Transpiling Between Different SQL Dialects

This is the first use case that this library will offer for today. I’ll give you a real use case in my work.
This happened two years ago. We have a legacy Data Visualisation tool that connects to our Azure Synapse and then runs lots of SQL queries to get the results to be rendered into charts. One of our projects is to decommission the Azure Synapse and embrace Databricks. However, the problem is that there are over 500 dashboards and thousands of SQL queries in that Visualisation tool that need to be converted from T-SQL to Databricks SQL for compatibility.
We have spent lots of effort to do such a job. At that time, unfortunately, we didn’t know this sqlglot
library.
Now, let’s have a simple example. Suppose we have a Databricks SQL Query below. Most of the syntax should be generic, but the LIMIT 10
is specifically for some Database Management Systems.
databricks_query = "SELECT * FROM Customer LIMIT 10"
If you are familiar with SQL Server or Azure Synapse, they use different syntax to limit the result set to a certain number of rows. That is SELECT TOP n ... FROM ...
. Now, let’s see how to use sqlglot
to transpile the query very easily.
sg.transpile(databricks_query, read="databricks", write="tsql")[0]
In the code above, we call the transpile()
function in sqlglot
. Then, we provide the SQL statement databricks_query
. We then tell the function that this is Databricks Dialect read="databricks"
, and we want it to transpile it into T-SQL write="tsql"
.
It is worth mentioning that the function will actually return a list. This is in case there are multiple SQL statements included in one query string, such as the following.
SELECT * FROM Customer LIMIT 10;
SELECT * FROM Product LIMIT 10;
OK, coming back to our transpiling. We just need to print the transpiled SQL query as follows.
print(sg.transpile(databricks_query, read="databricks", write="tsql")[0])
We can see that the query has been transpiled to T-SQL correctly!

Of course, we can also transpile it from T-SQL back to Databricks SQL as follows.
tsql_query = "SELECT TOP 10 * FROM Customer"
print(sg.transpile(tsql_query, read="tsql", write="databricks")[0])

Another point that needs to be mentioned is that, sqlglot
will try its best to retain all the elements in the original query when transpiling. For example, if there is a comment in the query, it will reserve it.
tsql_query = """
-- Get 10 Customers
SELECT TOP 10 * FROM Customer
"""
print(sg.transpile(tsql_query, read="tsql", write="databricks")[0])
The above code gives the following output.

3. Pretty Formatting

It is not surprising at all to know that sqlglot
can format a SQL query and output it in a pretty format, because it can understand it after all.
Just continue the previous example, although it outputs the transpiled query with comments, everything is in one line. Now, if we want to improve the readability, we can simply add pretty=True
to the function.
print(sg.transpile(tsql_query, read="tsql", write="databricks", pretty=True)[0])

See, not only does the comment have a new line, but the query itself was formatted in pretty style.
Of course, we can have a more complicated SQL query as follows. It can be formatted, too.
sql = """
WITH baz AS (SELECT a, c FROM foo WHERE a = 1) SELECT f.a, b.b, baz.c, CAST("b"."a" AS REAL) d FROM foo f JOIN bar b ON f.a = b.a LEFT JOIN baz ON f.a = baz.a
"""
print(sg.transpile(sql, write="databricks", identify=True, pretty=True)[0])
The above function, identity=True
means that the target dialect is not specified and the source dialect will also be used as the target dialect.

4. Getting Metadata of the Query

Here is another real use case from my work. On our Data Platform, hundreds of users send ad hoc queries to our databases every day. There is a strong requirement that we want to know "Who at a certain time queried which table". This information is quite useful. For example, once we want to change any tables, we know who might be impacted. Also, some tables that have not been used for more than a year, might be a potential opportunity to reduce or remove the data ingestion frequency to save our cloud cost.
Unfortunately, the logs of Azure Synapse do not provide such information, and it is not easy to extract the tables from a SQL query, too. We have written a whole branch of regular expressions to achieve this feature since we didn’t know sqlglot
at that time.
Suppose we have a query as follows. It is kind of complex because there is CTE (Common Table Expression) in it. If sqlglot
can handle this, it will be able to handle any kind of sub-queries, too.
# SQL query
sql = """
with cte1 as (
SELECT a.col1, a.col2, b.col3, b.col4, c.col5
FROM Table_A a LEFT JOIN Table_B b ON a.id = b.id
)
SELECT *
FROM cte1 LEFT JOIN Table_C c on cte1.col1 = c.col1
"""
# Function to extract table names
parsed = sg.parse_one(sql)
Let me sample why we use the parse_one()
function. There is another function called parse()
in sqlglot
. Similar as the transpile()
function above, it will be able to parse multiple statements in one query and return an array. In our context, I don’t want to make the example over-complicated. So, let’s use only one statement and the parse_one()
function gives us one parsed tree, which is easier to demonstrate.
After parsing the SQL query into the variable parsed
, we can have a look at what’s inside.

sqlglot
has its terminology to mark down the query in a tree-liked hierarchical object. It is NOT required to understand this terminology to use sqlglot
. So, don’t worry, I’ll show you.
Now, let’s get all the tables from the parsed object.
for table_exp in parsed.find_all(sg.exp.Table):
print(table_exp)

In the above code, we just called the find_all()
function to the parsed object, it returns a list of everything we are looking for. sg.exp.Table
is an enumeration just telling the function that we are looking for tables particularly. Similarly, there are sg.exp.Column
, sg.exp.Schema
and so on.
Well, there are still two more problems in the above output. Let’s assume we only care about the table names, not the table alias and the CTE names. Although it won’t be fully out-of-the-box this time, it is still not difficult and pretty intuitive to achieve.
The simple idea is that we want to get all the CTE names, and then get all the table names, whichever in the table name set but not in the CTE name set are of interest.
Firstly, let’s find all the CTE names.
ctes = set()
for cte in parsed.find_all(sg.exp.CTE):
ctes.add(cte.alias)
The enumeration item sg.exp.CTE
tells the find_all()
function to get all CTE. Then, we add the cte.alias
to the ctes
set.
After that, let’s get all the tables in a set. We can get the name-only string by calling the table_exp.text("this")
to get the identifier only. Before we add each of the table names into the set, just check if it is in the CTE set. Do not add if so.
tables = set()
for table_exp in parsed.find_all(sg.exp.Table):
table_name = table_exp.text("this")
if table_name not in ctes:
tables.add(table_name)
Let’s execute the code and get the set of the table names in the query.

5. Error Handling

So far, you may wonder what if the SQL language is not valid. Don’t worry, sqlglot
will pick it up and tell you where the error is and why it is not a valid SQL query.
Suppose we have the following query.
SELECT foo( FROM bar
We can tell it is not a valid SQL quickly. Let’s try to parse it and catch the error. The error is sg.errors.ParseError
provided by sqlglot
.
import pprint
try:
sg.parse("SELECT foo( FROM bar")
except sg.errors.ParseError as e:
pprint.pprint(e.errors)
In this example, I want to leverage pprint
just for demonstration purposes. It will output the error message in a pretty format. Otherwise, the list will be printed in one line. Don’t worry about your case. You don’t have to use pprint
.

We see that the error message contains enough information regarding the position and error types.
6. Building SQL Statement Programmatically

Why do we want to use sqlglot
to build SQL statements for us? There could be many potential reasons based on your actual scenarios.
- Reduce the risk of SQL injection attacks
- Error handling and validation
- Readability
- Maintainability
- Extensibility
I’ll only give one example here. It is not uncommon to see people using WHERE 1=1
in their code. Indeed, it is a clever solution when dealing with the scenario that we don’t if we will have filter conditions. However, this is a good example to prove that using string construction techniques for building SQL statements has limitations. That is, we need to handle the cases that the syntax doesn’t allow consistent patterns to be appended to the query string.
Now, let’s build a simple query to try this feature of sqlglot
.
query = sg.select("*").from_("Customer").where(
sg.condition("age > 18").and_("is_vip = 'Y'")
)
print(query.sql())
The function from_()
and and_()
with the underscores are merely avoiding the naming conflicts with Python native functions, so please don’t be confused and there is nothing magic here regarding the underscores.
The .sql()
can be used to convert the built-up query to a string.

Another use case of this feature could be the polyglot requirements. That is, once we build the query, we can output it to different SQL dialects easily.
Let’s use the same example we had at the beginning, building a query as follows.
# Building the query
query = sg.select("*").from_("Customer").limit(10)
Now, we can output the query in Databricks SQL.
# Generate query for Databricks SQL dialect
sql_databricks = query.sql(dialect="databricks")
print(sql_databricks)

Of course, we can also easily output it in T-SQL dialect, without the need to re-building or transpiling.
# Generate query for T-SQL dialect
sql_tsql = query.sql(dialect="tsql")
print(sql_tsql)

7. Query Optimisation

Last but not least, the library also has some sort of query optimisation capabilities. For example, if we have defined very complex conditions as follows.
bad_sql = sg.parse_one("""
SELECT *
FROM my_table
WHERE a=1 OR (b=2 OR (c=3 AND d=4))
""")
I would say it’s not even readable. Now, let’s see how sqlglot
can optimise it.
good_sql = sg.optimizer.optimize(bad_sql)
print(good_sql.sql(pretty=True))

It is still complex, but much more readable now.
Apart from optimising the syntax expressions, it can also achieve common query optimisation for the performance. However, it may require lots of context to be provided. You may try it by yourself if that’s of interest.
Summary

So far, we have seen so many features provided by sqlglot
. I’m sure that one or more of them should have a chance to improve our productivity or even solve some problems that we have never been able to.
This library shows us in the generative AI era, some sort of certainty is a gem. Yes, I’m saying that ChatGPT may give us errors sometimes. However, for a language like SQL which has more rigid syntax, we certainly don’t have to verify the SQL queries generated by sqlglot
. Surely, it will work anyway 🙂
Unless otherwise noted all images are by the author