4 Principles to Learn SQL for Data Science

Tips for improving your data science skillset

Nate Coleman
Towards Data Science

--

Photo by Mayer Maged on Unsplash

It goes without saying that the ability to code up SQL is necessary to landing and succeeding in any data science role. There isn’t any evidence that this is changing anytime soon, SQL is here to stay, so mastering this skill is an unequivocally worthwhile investment. Whether you’re looking for your first role in data science, brushing up on your SQL skills for your next round in the job market, or looking to level up your career I’ve got four principles that will help you succeed in all your SQL-related endeavors.

Let’s dive right in.

1. Before You Write Any Code, Determine the Right Grain

What do I mean by “grain?”

The grain of your SQL query is the level of aggregation you want your output dataset to be.

So for example, if you’re looking for a dataset aggregating daily user actions, your “grain” can be described as being at “the user-day level” (i.e. each row represents for every user, each day they took an action.)

It’s a great idea to think about this grain ahead of time for two main reasons.

First, it helps you plan a structure for your query

This is especially important when you’re hitting the data science job market since candidates are often asked to live-code SQL during this process. Creating a mental framework for your query before you start coding will help get to the answer in a clear, efficient way that will impress your interviewers.

Second, it can help you identify obvious errors in your code

Usually you’ll have some context as to the number of rows you’d expect to be returned by a query you write. For example, let’s say my planned grain is at the “user” level, and I know we have ~1,000 users. When my query returns a dataset with 100,000 rows, I immediately know that there’s an issue with one or many joins in my query.

2. Formatting Matters

SQL is a language, and our queries tell a story. Let’s make that a story our future selves and colleagues can easily read. Some queries for quick-and-dirty checks don’t need pristine formatting, but any query that has a chance at being read by someone else or a future-self should follow a consistent pattern. Below is an example of a simple but well-formatted query.

I’ll note a few deliberate style choices I made here.

  • All keywords are in CAPS
  • Keywords on new lines are right aligned. Their succeeding expression is left-aligned
  • Each table has a simple, but meaningful alias
  • Column numbers are used in the GROUP BY and ORDER BY clauses. I know some consider this the “lazy” way, but the benefits of speed, and cleanly grouping by an aliased column outweighs the cost of not being as explicit as one could be. For example, if I wanted to use column names in the GROUP BY clause, I would have to do the following (this is mostly true, some databases allow alias names in a GROUP BY)

There are a lot of opinions on SQL formatting out there, and I don’t think there is a clear winner among them. Just choose one, and be consistent.

3. Use CTEs, not Subqueries

CTEs or Common Table Expressions, are temporary datasets that can be referenced in subsequent SQL statements within a given query. CTEs are best explained with an example. The query below identifies the sales rep that went from their first order to $100,000 in orders the fastest.

Let’s break down the components of this query

  • First, you’ll notice the keyword WITH. All queries using CTEs must include this statement preceding the first CTE.
  • first_order_day is our first CTE. It defines a dataset that finds the first day a sales rep completed an order. first_order_day can be referenced in subsequent expressions just like you’d reference any other table in your database.
  • Next we define another CTE, running_order_value. This calculates the running total dollar amount of sales for each sales rep.
  • Finally, we reference both CTEs to return the row corresponding to the shortest time_to_goal value.

Queries written using subqueries can be rewritten using CTEs, and I suggest doing so — CTEs are much cleaner and easier to follow for your colleagues and future self. We can see what the query above looks like using subqueries. Pretty overwhelming, right?

We can achieve the same result, with improved readability and logic flow by using CTEs.

4. Tell a Story with Code Comments

At the end of the day, SQL is now treated like a programming language used in software engineering, so why is it not standard practice to follow the same commenting conventions as other languages (e.g. python)? I believe this is largely due to the way folks have been taught SQL. While most programming languages used in software engineering are taught with these conventions in mind, SQL is often taught with style as an afterthought. It’s usually seen as just the means to an end (i.e. a dataset → analysis).

However, some of the most critical decisions we make in an analysis happen within a SQL query, and we owe it to ourselves, stakeholders, and those who review our code to explain our intentions through descriptive comments.

Especially, in more complex queries, comments are hugely beneficial for thorough analysis.

They help future you, and your code-reviewers to understand your intentions

This saves ourselves and our colleagues time and headaches when re-reading or reviewing code. When we’re generating a dataset for analysis, data scientists do so with an end-goal in mind (e.g. we want to see all help-site visits by logged-in users over the past 6 months.) Documenting what we mean to do with our query, helps guide us in writing our query, and reviewers in checking the accuracy of our work.

It documents assumptions made about the data

Garbage in Garbage Out (GIGO) is an expression that should be top of mind when performing analysis. In short, if you feed an analysis or model nonsense data, you’re going to get nonsense results.

This is important because raw data needs to be modeled and aggregated before it can be used in analysis. And this happens to some degree in SQL (more-so now with the rise of tools like dbt). So when you make assumptions about your data, or choices for a type of aggregation (e.g. using median vs. mean), explain why you did this in your code’s comments.There is nothing worse than the feeling of not being able to explain to a colleague or stakeholder why you included an expression in your query.

It’s All About Trust

Ultimately, these four principles are all rooted in the idea that trust is the most important asset a data scientist has. Following these four principles when learning and writing SQL will help you build trust from everyone from interviewers to colleagues and stakeholders you work alongside.

Pro-tip: If you are new to SQL, or looking to brush up your skills, I think SQL School from Mode Analytics is the best free resource out there. Check it out!

--

--