The world’s leading publication for data science, AI, and ML professionals.

Tricks for Advanced SQL Querying

One of the most valuable technical skills I use working on a data science team in the retail space is SQL…

Taking advantage of all that SQL offers can improve and enhance your data.

Photo by Markus Spiske on Unsplash
Photo by Markus Spiske on Unsplash

One of the most valuable technical skills I use working on a data science team in the retail space is SQL. Users of legacy relational database systems and hybrid/cloud data management solutions alike can use SQL to flexibly access business data and transform or display it in creative and innovative ways; see Resources for an excellent review. A thorough understanding of a business’ database structure, combined with crafty solutions that allow for upstream data transformation and aggregation, are crucial to efficient & sound ETL. Here are a few tricks I’ve learned while building complex pipelines that make my job both easier and more interesting.

1) Calculate Rolling Averages & Attach Historical Data By Joining A Table Onto Itself

When working with time series data, it can be helpful to calculate a rolling average for an observation, or attach a historical value. Let’s say I want to get the number of widgets sold by day for a company. I may want to include a 7-day moving average, or attach last week’s weekday widgets sold to see how the business has performed vs. last week. I can do that by joining a dataset onto itself, and using operations on the date column to select a single value or range of observations.

First, let’s start with just getting widgets sold by day using the below code from a table called WIDGET_ORDER_HISTORY in the Db2 library:

select t1.date
  , sum(t1.widgets_sold) as total_widgets_sold
from db2.widget_order_history t1
where t1.date between '2021–04–05' and '2021–05–01'
group by t1.date

If we wanted to add in any filtering condition on another variable on the table, such as specific stores, we can add a simple WHERE statement:

select t1.date
   , sum(t1.widgets_sold) as total_widgets_sold
from db2.widget_order_history t1
where (t1.date between '2021–04–05' and '2021–05–01')
   and t1.store = 1234
group by t1.date

Other conditions located on other tables (i.e. STORE_DATA) would require an additional INNER JOIN:

select t1.date
   , sum(t1.widgets_sold) as total_widgets_sold
from db2.widget_order_history t1
inner join (
   select store
   from db2.store_data
   where state = 'NY'
) t2
on t1.store = t2.store
where t1.date between '2021–04–05' and '2021–05–01'
group by t1.date
Example time series data, generated from the provided code. Image by author
Example time series data, generated from the provided code. Image by author

Now if I wanted to attach a rolling 7-day average for the week of 4/25/21–5/1/21, I can do this by joining the table onto itself and take advantage of inserting a CASE() statement within the SUM() function. You can use this technique to use grouping functions (i.e. SUM(), COUNT(), MAX()) when you only want to meet specific criteria from a table. It will only sum values that meet the rules encompassed in your WHEN clauses. In our example below, if the values for table B are within the previous 7 days of the current observation’s date on table A, we can add those sales and divide by 7 to get a rolling weekly average for each row of table A:

select a.date
, a.total_widgets_sold
   , sum(
   case when (b.date between a.date-7 and a.date-1) 
   then b.total_widgets_sold 
   else 0 
   end)/7 as seven_day_avg
from (
   select date
      , sum(widgets_sold) as total_widgets_sold
   from db2.widget_order_history
   where date between '2021–04–25' and '2021–05–01'
   group by date
) a
left join (
   select date
      , sum(widgets_sold) as total_widgets_sold
   from db2.widget_order_history
   where date between '2021–04–05' and '2021–05–01'
   group by date
) b
on a.date = b.date
group by a.date
   , a.total_widgets_sold
order by a.date
Widget Sales for Week 17 Calendar Year 2021 with rolling 7-day average values. Image by author
Widget Sales for Week 17 Calendar Year 2021 with rolling 7-day average values. Image by author

If you want to attach a historical value to each observation, then you can avoid aggregation and simply join on the table based off of a date that is a specified number of intervals away. The example below joins table B on table A to get the previous weekday’s widget sales by going 7 days back:

select a.date
   , a.total_widgets_sold
   , b.total_widgets_sold as prev_wkday_sales
from (
   select date
      , sum(widgets_sold) as total_widgets_sold
   from db2.widget_order_history
   where date between '2021–04–25' and '2021–05–01'
   group by date
) a
left join (
   select date
      , sum(widgets_sold) as total_widgets_sold
   from db2.widget_order_history
   where date between '2021–04–04' and '2021–05–01'
   group by date
) b
on a.date -7 = b.date
Widget Sales for Week 17 Calendar Year 2021 with last week's weekday widget sales. Image by author
Widget Sales for Week 17 Calendar Year 2021 with last week’s weekday widget sales. Image by author

Joining a table onto itself is an incredibly flexible way to add summary and calculated columns to a dataset. The creative use of grouping functions such as SUM() and COUNT() with CASE() statements leads to tremendous opportunities for feature engineering, Analytics reporting, and a variety of other use cases.

In practice, if a query joins onto itself via sub-querying and is significantly large, long run-times can be expected. One way around this is to use a temporary table to hold preliminary results with problem-specific criteria; for example, creating a table of widget sales for an entire time range in the WORK library in SAS and querying that table multiple times. Efficient code structure such as the use of indexing can also increase efficiency; see Resources for some ideas.

2) Dealing with Complex Logic Using CASE Statements

The syntax of CASE statements differ from logic in other commonly used programming languages throughout data science (read: Python/R). Thoughtful design of logic rules through the use of pseudo-code can help avoid errors due to incorrect/inconsistent rules. Understanding how to encode nested logic in SQL is crucial to unlocking the potential within your data.

Say there is a table of shoppers with a large amount of characteristics like age, home state, and sales for a given time frame. A targeted marketing campaign is being used to try and increase sales for average shoppers (Marketing has identified an average shopper as someone who spends between $100-$200). Once identified, Gen Z/Millennial shoppers will receive a digital coupon, and all other shoppers will be mailed a print coupon that will differ depending on which state they live in. For simplicity’s sake, there are only three states shoppers live in.

Here is how this logic could be encoded in R, as well as in SQL:

## Example of Nested Logic in R
if(shoppers$sales<=0){ print("Error: Negative/No Sales")}
else if(shoppers&amp;sales<=100){ print("Shopper has below-average sales.")}
else if(shoppers&amp;sales<=200){ 
   if(shopper$age<41){print("Shopper has average sales and is Gen Z/Millennial.")}
   else{
 if(shopper$state=='NJ'){print("Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in New Jersey.")}
 else if(shopper$state=='NY'){print("Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in New York.")
 else(shopper$state=='CT'){print("Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in Connecticut.")}
   }
}
else{print("Shopper has above-average sales.")}
*Example of nested logic in SQL. No need to actually nest statements!;
, case when sales < 0
 then 'Error: Negative/No Sales.'
when sales <=100
 then 'Shopper has below-average sales.'
when sales <=200 and age <41
 then 'Shopper has average sales and is Gen Z/Millennial.'
when sales <=200 and state = 'NJ'
 then 'Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in New Jersey.'
when sales <=200 and state = 'NY'
 then 'Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in New York.'
when sales <=200 and state = 'CT'
 then 'Shopper has average sales, is Gen X/Boomer/Greatest Gen., and lives in Connecticut.'
else 'Shopper has above-average sales.'
end as shopper_classification

Using CASE statements thoughtfully will allow you to build any combination of complex Business logic. However, SQL logic requires a slightly different mindset than other programming languages. Combined with grouping functions, these tools can provide a data scientist with a competitive advantage in obtaining and transforming data sources for feature engineering, business intelligence, analytical reporting, and more!

Resources

A great breakdown of the enterprise DBMS landscape in 2021:

Top Vendors of Database Management Software for 2021 | eWEEK

A summary of some helpful Sql query optimization techniques:

Query optimization techniques in SQL Server: tips and tricks


Related Articles