How To Ace Data Science Interviews: SQL

Carson Forter
Towards Data Science
9 min read · Sep 12, 2017


This is part of an ongoing series on interviewing for data science roles. You can check out the next part, covering statistics, here, and the third part on R and Python here.

Data science interviews can be tough to navigate. The fact that it’s such a multi-disciplinary field means that the sheer volume of material you need to cover to feel properly prepared can become overwhelming.

But I’m here to help. I’ve broken down what you need to know into four areas and will cover one in each part of this series: SQL, Statistics, Scripting, and Product. Each post will cover resources for learning the basics, an overview of important technical areas along with examples, and some notes on my general approach to interviewing in that area.

If this is your first time interviewing for a data science role, going through this series should give you a solid idea of what to expect and how to prepare. But even if you’ve done this all before, I’m hoping these articles will be useful in focusing your preparation on the skills that are most likely to be tested.

SQL Basics

Every data science interview I’ve been a part of has involved SQL in some way. I can almost guarantee you’ll have to write some SQL at some point during your job search. As much as data scientists like to pretend that we’ve moved beyond relational databases to distributed systems like Hadoop, the fact is most modern data stores still rely on (or at least allow) SQL syntax to retrieve data.

The basics of SQL are straightforward, but the SQL questions you get asked in data science interviews can get pretty complex. I’m guessing most of you reading this have had some experience with the language, but if you’ve never touched SQL before you can work through Codecademy’s brief introduction.

The rest of this post assumes you know the following like the back of your hand:

SELECT
FROM
JOIN
WHERE
(AND)
GROUP BY
ORDER BY
LIMIT

One SELECT...FROM statement will return one set of rows and columns from the specified table, the details of which are determined by what you put after the remaining keywords. For example, to get a count of today’s webpage views by username ordered alphabetically you might write something like:

SELECT username, count(1) as number_of_views
FROM pageviews
WHERE day = '2017-09-08'
GROUP BY username
ORDER BY username;

Got it? Great, let’s get to the interesting stuff.

Subqueries and Common Table Expressions

You now know how to retrieve one set of rows and columns, and that might get you past your interviewer’s first question. But the more advanced problems will require subqueries or common table expressions (CTEs). Let’s talk about what these are and how to use them.

CTEs and subqueries allow you to grab a subset of data and store that data with a name, which you can then select from and perform more operations on. The functions of these two methods are almost identical, so here I’ll just focus on CTEs, which I find to have more readable syntax.
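
To make that equivalence concrete, here’s the same query written both ways, reusing the pageviews table from the earlier example (a sketch; your table and column names will differ):

-- As a subquery: the filtered subset is nested inside the outer SELECT
SELECT username, count(1) as num_views
FROM (
SELECT username
FROM pageviews
WHERE day = '2017-09-08') recent_views
GROUP BY username;

-- As a CTE: the same subset gets a name up front, then the query reads top to bottom
-- (recent_views is just a label I made up for the filtered subset)
with recent_views as (
SELECT username
FROM pageviews
WHERE day = '2017-09-08')
SELECT username, count(1) as num_views
FROM recent_views
GROUP BY username;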

Let’s say you’ve been asked to calculate the average time between transactions by a particular user. You have a table called transactions that contains a username and the time of each transaction. To solve this you’ll want each transaction in a row along with the time of that user’s next transaction. A single query won’t get you all the way there: you’ll need to pull a user’s transactions, store them in a CTE, and then join the later transactions to them so that you can calculate the time in between. Here’s how that might work:

-- First, find all of user_a's transactions today
with user_a_trans as (
SELECT username, time
FROM transactions
WHERE day = '2017-09-08'
AND username = 'user_a'),
-- Join each transaction to all transactions occurring after it
joined_trans as (
SELECT a.username, a.time, b.time as future_time
FROM user_a_trans a
INNER JOIN user_a_trans b
ON b.time > a.time),
-- Find the immediate next transaction using MIN()
next_trans as (
SELECT username, time, MIN(future_time) as next_time
FROM joined_trans
GROUP BY username, time)
-- Average difference of the time and the next transaction's time
SELECT AVG(next_time - time) as avg_time_to_next_transaction
FROM next_trans;

It’s totally fine if you didn’t quite get all of the details of how this query works — some practice will make it easy. The important point is that more complex questions need to be broken down into pieces and solved with a series of CTEs. Which leads to…

Filter, Aggregate, Join

When you get a difficult question like the one above, take a minute and ask yourself what the ideal table would have to look like to allow you to answer your question with one SELECT statement. In the above example, the ideal table was one that included one record for each transaction, and a column that gave the time of the next transaction.

Once you know what form your final table should take, you can work backwards, and one step at a time determine how to use a series of CTEs to transform your original table into the final output your interviewer asked for.

Generally, you want to follow these steps with your string of CTEs: filter, aggregate, join. Filter using WHERE, aggregate using GROUP BY, and join using, well, JOIN. Rinse and repeat.
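
As a rough skeleton of the pattern (the purchases and users tables, and columns like signup_day, are invented here purely for illustration):

-- 1. Filter: cut the rows down with WHERE before anything else
with filtered_purchases as (
SELECT username, amount
FROM purchases
WHERE day >= '2017-09-01'),
-- 2. Aggregate: collapse to one row per join key
purchase_totals as (
SELECT username, SUM(amount) as total_amount
FROM filtered_purchases
GROUP BY username)
-- 3. Join: only now combine with the other table
SELECT u.username, u.signup_day, p.total_amount
FROM users u
INNER JOIN purchase_totals p
ON u.username = p.username;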

By filtering and aggregating your data before joining, you write the most efficient SQL. Joins are expensive to process so you want the fewest possible rows before joining two tables together. Sometimes aggregating first won’t be possible, but usually you’ll be able to limit the size of the tables you’re joining with at least a WHERE clause or two.

It’s important to note that if you have a JOIN and a WHERE clause in the same CTE, SQL logically processes the JOIN first. In other words, unless the query optimizer steps in on your behalf, the following is very inefficient, because the entirety of your tables would be joined together and only then filtered to data from 9/1/2017 onward:

SELECT *
FROM table_a a
INNER JOIN table_b b
ON a.username = b.username
WHERE a.day >= '2017-09-01';

The proper way to express this uses CTEs to filter before joining, like this:

with a as (
SELECT *
FROM table_a
WHERE day >= '2017-09-01'),
b as (
SELECT *
FROM table_b
WHERE day >= '2017-09-01')
SELECT *
FROM a
INNER JOIN b
ON a.username=b.username;

Again, the efficiency gain here comes from the fact that the join is performed only after you’ve filtered to the fewest possible number of rows.

Window Functions

You’ll probably be able to solve the majority of problems with just the filter, aggregate, join process described above. Occasionally, though, you’ll get a tricky problem in SQL that needs a window function. Like a GROUP BY clause, a window function separates your table into several chunks and operates on each chunk individually. But unlike GROUP BY, the rows are not combined. An example is the easiest way to grasp this.

Imagine you have a table with some line items that report revenue along with the U.S. state it came from. Your task is to determine the percent of revenue each line item contributed to its state’s total revenue.

The trick here is to realize that you need to compare an individual value (the revenue from a particular line item) with an aggregated value (the sum of all line item revenue in a particular state). Any time you need to do something like this a window function is a good bet. Here’s how that query might look:

with state_totals as (
SELECT state, revenue,
SUM(revenue) OVER (PARTITION BY state) as state_revenue
FROM state_line_items)
SELECT state,
revenue/state_revenue as percent_of_state_revenue
FROM state_totals;

The window function is specified by the OVER clause. By summing revenue partitioned by state you get an aggregated value for each state associated with each individual line item. This makes getting that percent contribution number you care about a matter of simple division (if revenue is stored as an integer, cast it to a float first so the division doesn’t truncate to zero).

Window functions work with most aggregations, like SUM, COUNT, or AVG, but there are also some special keywords that only work as window functions. Some good examples of those are RANK, FIRST_VALUE, and LAG. You can read up on additional window functions and how they work in the PostgreSQL docs. The six functions I’ve named above are probably all you need to successfully complete a typical data science SQL interview.
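
As one illustration of why these are worth knowing, here’s a sketch of how LAG() could replace the self-join from the transactions example earlier. Partitioned by user and ordered by time, LAG() pulls each row’s previous transaction time onto the same row, so the average gap falls out in a single pass (this assumes the same transactions table as before):

-- LAG() grabs the previous row's time within each user's ordered transactions,
-- standing in for the self-join-plus-MIN() approach above
with gaps as (
SELECT username, time,
LAG(time) OVER (PARTITION BY username ORDER BY time) as prev_time
FROM transactions
WHERE day = '2017-09-08'
AND username = 'user_a')
SELECT AVG(time - prev_time) as avg_time_to_next_transaction
FROM gaps
WHERE prev_time IS NOT NULL;  -- the first transaction has no predecessor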

Union and Case Statements

To round out your SQL arsenal you just need a couple more tools. The first one is easy to grasp: unions.

Unions are just the vertical version of joins: whereas joins combine tables or CTEs horizontally using a join key, a union stacks the tables on top of each other to form one table containing all the rows from the two originals. The requirement is that the two tables being union-ed have exactly the same columns; otherwise there would be no way to logically combine them.

An example of when a union might be useful is when you have separate tables for two types of transactions, but want a single query to tell you how many of each type of transaction you have.

with sales as (
SELECT 'sale' as type
FROM sale_transactions
WHERE day >= '2017-09-01'),
buys as (
SELECT 'buy' as type
FROM buy_transactions
WHERE day >= '2017-09-01'),
unioned as (
SELECT type
FROM buys
UNION ALL
SELECT type
FROM sales)
SELECT type, count(1) as num_transactions
FROM unioned
GROUP BY type;

By selecting a constant type field (in this case just ‘sale’ or ‘buy’) from each of your two tables and then union-ing them, you end up with one big table where you can group and count in a single query.

Union-ing is the answer when you wish that two tables had instead been structured as one combined table.

Case statements are another fairly simple concept: they work just like ifelse() in R or IF() in Excel. They’re useful for mapping from one set of predefined values to another.

For example, you might want to turn a day of week column into a variable representing whether or not that day was a weekend.

SELECT 
CASE WHEN day_of_week IN ('Sat', 'Sun')
THEN 'Weekend' ELSE 'Weekday' END as day_type
FROM table_a;

Similarly, you could convert a string column like day of week to a binary variable and sum it up to count the number of weekend days in your table.

SELECT 
SUM(
CASE WHEN day_of_week IN ('Sat', 'Sun')
THEN 1 ELSE 0 END) as num_weekend_days
FROM table_a;

Case statements are flexible — you can string together a bunch of WHENs to map any value to any other value and then use an ELSE to catch the rest.
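
For instance, a chain of WHENs can bucket the day-of-week column from above into more than two categories (a sketch; the 'Almost weekend' bucket is made up for illustration):

SELECT
CASE WHEN day_of_week IN ('Sat', 'Sun') THEN 'Weekend'
-- each WHEN is checked in order; the first match wins
WHEN day_of_week = 'Fri' THEN 'Almost weekend'
ELSE 'Weekday' END as day_type
FROM table_a;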

Wrapping Up

You now have all the pieces you need to successfully pass the SQL portion of a data science interview. Of course, practice is the key here. Try setting yourself a few problems and working through them using the tools I’ve described above. Better yet, find a whiteboard and practice on that so that you’re even more comfortable in an interview setting. If you get stuck, Google is your friend. Stack Overflow and similar sites have detailed walk-throughs of how to solve almost any particular SQL problem, and Google is generally good at turning up what you need from them.

I’ll leave you with a few pieces of advice for the interview itself.

Break each problem into its smallest possible parts. This is the best way to think clearly through a SQL problem, and it will allow you to map each step to a concrete SQL command.

Talk through your process out loud. Just like in school, you get credit for showing your work. If you just start writing on the whiteboard and don’t let your interviewer know what you’re doing, they’ll have a hard time evaluating your skill, especially if you don’t quite make it to the final answer.

Ask for help. Believe it or not, most interviewers want you to succeed and are happy to provide some help, as long as you’re able to articulate precisely where it is you’re having trouble. In other words, it’s fine to ask something like what the right syntax is for converting an integer to a float, but avoid asking vague questions that might indicate you don’t know how to approach the problem you’re trying to solve.

If you have questions about the data science interview process you can find me on Linkedin or Twitter. And look for the next part in this series, covering statistics, in the coming weeks.
