Snowflake is a cloud-computing data solution which allows users to store data and run queries directly in their cloud platform, available to be accessed directly via web browser. It’s often used for its data storage and its autoscaling capabilities, where clusters are started and stopped automatically to manage query workload.
In part 1 of this series, we explored how Snowflake also features unique query syntax which may not be available in other database systems such as PostgreSQL or MySQL. When used correctly, these clauses help improve syntax and readability, and, most importantly, reduce both compute costs and execution time.
In this article below we’ll walk through 5 more of these clauses, including some functions for string processing which you might initially assume you’d need to go to Python or R for.
1. MAX_BY / MIN_BY
The min_by and corresponding max_by clauses allow us to find the rows corresponding to the minimum/maximum value for a given column and return the value of another column in that row.
A common approach to do this would be to find the minimum value in a subquery, and then use it as a filter. For example, we might want to get the earliest products sold
select product_id
from product_sales
where sold_at = (select min(sold_at) from product_sales)
The issue with this approach is that it requires an additional subquery, which when performed multiple times in a query can make it difficult to follow. In Snowflake we can achieve this in a single operation
select min_by(sold_at, product_id)
from product_sales
This is much cleaner, we performed the same query in half the lines and half the steps!
2. LAST_DAY
The last_day clause is used to return the last day of a particular date period for a given date.
A common use case for this is if we have a denormalized table of the running number of products in inventory by day. If we want to look at how numbers trend by quarter, rather than averaging the number over the whole quarter which can be prone to fluctuations over the period, it might be more informative to see the final number at the end of the quarter
select
date_day,
product_count,
last_day('quarter', date_day) = date_day as is_last_day_of_quarter
from product_inventory
Now further in our data modelling, we can identify if a date is the last day of a week/month/quarter/period, which can be particularly useful when building reporting in our BI tool.
Another approach to this might be to create a denormalized table of dates with columns day, month, quarter, year, is_last_day_of_quarter, etc. which can be joined to any date to pull the relevant useful date dimensions. This approach is also much easier to understand than hardcoding any dates or trying to find the last day using a combination of date_trunc and subtracting a day.
3. EDIT DISTANCE
The edit_distance clause implements the Levenshtein distance between two strings, a common metric used in NLP tasks such as string similarity and pattern matching.
It works by considering how many edits in terms of adding, subtracting, or replacing are required to get from one word to another. For example, the edit distance from "mitten" to "knitting" is 4:
1. mitten → kitten (substitute "k" for "m")
2. kitten → knitten (insert "n" after "k")
3. knitten → knittin (substitute "i" for "e")
4. knittin → knitting (insert "g" at the end)
This metric can be particularly helpful for tasks like finding duplicate names in a database or matching names (or other strings) across different datasets.
select
u.full_name,
u2.full_name as user2_full_name
from
users u
join users u2
on u.last_name = u2.last_name
where edit_distance(u.full_name, u2.full_name) < 1
Here we try to find users with very similar names in signup to check for multiple signups by the same person. This can be used for compliance and other KYC checks.
Another use case for this might be to check for a cache of blacklisted or otherwise cached words. We don’t have to check for an exact match, we can check if it matches any cases with a small edit_distance (e.g. < 1) in which case it is likely to be a close match.
However, keep in mind that although this function is very powerful for string-matching requirements, performing this operation multiple times can be expensive compute-wise, as it has to compare every letter between the two strings. You can speed this up using the max_distance
parameter to stop the compute once a certain edit_distance is reached.
4. REGEXP
The regexp clause enables us to perform regular expression pattern matching which is a very powerful tool used in many string processing tasks. Snowflake users are probably used to similar operations using the SQL ‘ilike’ operation, but the use of regular expressions enables many more possible operations (you can use regexp_like for SQL ‘like’ syntax).
This type of operation might be particularly helpful for searching or filtering a table for a certain type of pattern. For example, in a list of product references I might want all products which contain ‘ANTHONY’ and a set of numbers.
select * from products
where product_ref regexp'ANTHONY.*sREF[0-9]+'
Here the regex looks for a product_ref containing ANTHONY
, then any number of a wildcard .*
(can be anything) characters, then a space s
, then REF
and any number of numbers from 0–9. You can have a play with this using a site like regex101.
Hopefully from just this example above you can see just how powerful regular expressions can be! There’s a joke in the machine learning world where a team spent 6 months developing an AI model to identify spam emails and were able to achieve a great 90% accuracy. They were doing the final checks with another team before deployment and together they did a quick mockup of a regex solution as a baseline, but this mockup performed better than the machine learning model!
I suggest getting to grips with regex if you have any pattern-matching problems to tackle.
5. STARTSWITH / ENDSWITH
Keeping in the domain of string processing, the startswith (and corresponding endswith) clause allows us to perform some types of simple regex in a much more readable fashion.
A use case for this might be that after extracting the domain part from an email [email protected]
using split_part, we want to get all the emails with are associated with Deel or some other company.
select
split_part(email, '@', -1) as domain,
startswith(domain, 'deel') as is_deel_domain
from product_sales
where is_deel_domain = true
Here we used the power of snowflake variables covered in part 1 to be able to select with clear column naming and without having to perform computes multiple times.
Other such use cases might be to save questionnaire responses and check which ones start with identifiable keywords. The key takeaway is that when full regular expressions are overkill, the startswith syntax is much clearer than performing some kind of regex or ilike operation.
Final Thoughts
Snowflake features some useful query options including those for string processing and NLP-based tasks. We looked at some clauses which can help us to more succinctly reach the same output, and explored others which can help us complete string processing tasks such as pattern matching.
If you enjoyed this article you can find more articles on my and drop me a follow on my profile!