The 5 Most Popular SQL Transforms

Analyzing the usage of SQL Generator

Josh Berry
Towards Data Science

--

Photo by Austin Distel on Unsplash

Introduction

In prior posts, I have referred to the SQL Generator website as a helpful tool for analysts to generate complicated SQL quickly. This week, I got access to the data so we can look at which are the most popular transformations.

Basic Analysis

The data itself is quite limited. Since the website is free & open to everyone, there is no user data available to analyze. We only have the timestamp of when somebody clicks “Generate SQL,” along with the type of transformation it was.

As of the time of this post, the site has 76 different transformations. Let’s take a quick look at the most popular transformations and talk about the SQL.

#5: Correlation (3%)

I was surprised to see that Correlation made the Top 5. This transform creates every possible pair from the numeric columns in your table and run CORR() SQL function.

It is rather simple, but I suppose that the act of writing the SQL for every single pair of columns is very time-consuming and tedious. Therefore it makes sense that users are saving themselves time by using the SQL Generator for this type of transformation.

The table structure (right) is automatically analyzed for numeric columns, and each pair-wise combination is built into the SQL (bottom)

#4: Clean (3%)

The clean transformation is a combination of impute, filter, cast, and rename. It allows you to do one or multiple of these transformations to each column of your data.

In this example, I mixed and matched some different transformations using Clean():

  • removed Load_Datetime
  • Cast Record_Deleted_Flag to a number and filtered for where it is not equal to 1
  • Cast Order and Tax Amounts to FLOAT and replaced missing with 0
  • Renamed column Ref1 to MY_NEW_COLUMN_NAME
Clean lets you apply Drops, Renames, Cast, Impute, and Filters at the same time

#3: Join (4%)

The fact that join is so popular was also a surprise to me. I assume that joins are quite common and easy to configure, but perhaps that is just because I have been writing SQL for the past 20 years.

Unless you’ve connected Rasgo to your data-warehouse, the SQL Generator will have no visibility into your real database, which requires that you configure your table structure on the right-hand side of the page. However, you can setup as many tables as you’d like and then use the interface to setup your joins and have the SQL written for you.

Setting up Joins requires table definitions

#2: Pivot (5%)

The fact that pivot is the second most popular transform is not surprising. Pivot is notoriously difficult to deal with, especially because every RDBMS decides to use a slightly different syntax.

This one still requires some setup, however, because you must configure the LIST_OF_VALS which is the unique values that you want to pivot across as columns. If you omit this argument, Rasgo will try and query the distinct values for you. (See: connecting Rasgo to your data-warehouse above)

LIST_OF_VALS must be configured if Rasgo is not connected to your data-warehouse.

#1: Aggregate (17%)

Aggregate is by far the most popular transformation, and it’s easy to see why. The interface lets you easily select multiple types of aggregations and generates the SQL for you.

Furthermore, some RDBMS (I’m looking at you, BigQuery) have neglected to add popular functions such as MEDIAN and MODE, which makes writing the SQL confusing and tedious.

Try it for yourself! Open this example in the SQL Generator, change the dialect drop-down to BigQuery, and watch as BigQuery SQL gets generated for you.

Aggregate has an easy to use interface which makes it the most popular SQL transformation

A side note about the Pareto Principle

One of the things that surprised me the most about this data was how spread out all of the usage was. The classic “80/20 rule” does not apply here.

The 80/20 rule, commonly referred to as the Pareto Principle, suggests that 80% of the usage would account for 20% of the transformations. Specifically, this would mean that 15 of the 76 SQL transformations should account for 80% of the usage.

In reality, SQL usage seems to be much more diverse than that. We see that the Top 15 transformations only accounts for 50% of the usage. So, instead of the 80/20 rule we have a 50/20 rule.

It is unfortunate that the Pareto Principle gets confused with the “80/20 rule”. In reality, the Pareto Principle simply teaches us that most things are not evenly distributed. The fact that 80 and 20 add up to 100 is a common misconception. If you want to learn about the Pareto Principle, I highly recommend this blog/video by one of my favorite writers Kalid Azad.

Conclusion

I see two main takeaways with this analysis:

  1. SQL Generator is more popular for automating tedious SQL rather than complex logic
  2. SQL usage is diverse — in other words, we can’t just learn 5 things and suddenly become experts.

I hope that you have found this as interesting as I have. If you have suggestions or comments about other SQL Transformations that you think are common, you can find me hanging out in Locally Optimistic and DataTalks.Club.

--

--

Data scientist @ Rasgo, DataRobot, Comcast. Passionate about teaching and helping others.