Crack Data Science Interview

Introduction
Structured Query Language, SQL, is the go-to programming language that Data practitioners use to retrieve data stored in a relational database. Writing effective query requests is no longer considered a nice-to-have but an essential skill for data scientists. The trend can be supported by the specific inclusion of SQL experience in DS job postings and the interview loop.
In addition to Programming (Python), Machine Learning, A/B Tests, and Statistics, Data Scientists are frequently tasked to define and pull data from multiple sources to construct the metrics of interest. Unfortunately, SQL is still an under-appreciated skill, and not enough data scientists have fully recognized its versatility and importance.
For the last few months, I’ve been in close contact with senior data scientists and hiring managers at major tech. One of the top-rated suggestions and recommendations that I’ve received is to master Sql skills and know how to pull data efficiently.
To that purpose, I’ve created a trilogy post that is tailored to beginner coders. It progresses from easy to medium level of difficulty LeetCode questions with step-to-step walk-throughs. If you recently embark on your SQL journey, I’d suggest checking these three posts before reading this one: 1. Easy, 2. Easy/[Medium](https://towardsdatascience.com/crack-data-science-interviews-advanced-sql-skills-in-2021-322fef819c6c?sk=cd796b2b53f0b9106e59967d3e8fb8f3), and 3. Medium.
Question 1: Running Total for Different Genders
Write an SQL query to find the total score for each gender at each day.
Order the result table by gender and day
The query result format is in the following example:
Scores table:
+-------------+--------+------------+--------------+
| player_name | gender | day | score_points |
+-------------+--------+------------+--------------+
| Aron | F | 2020-01-01 | 17 |
| Alice | F | 2020-01-07 | 23 |
| Bajrang | M | 2020-01-07 | 7 |
| Khali | M | 2019-12-25 | 11 |
| Slaman | M | 2019-12-30 | 13 |
| Joe | M | 2019-12-31 | 3 |
| Jose | M | 2019-12-18 | 2 |
| Priya | F | 2019-12-31 | 23 |
| Priyanka | F | 2019-12-30 | 17 |
+-------------+--------+------------+--------------+
Result table:
+--------+------------+-------+
| gender | day | total |
+--------+------------+-------+
| F | 2019-12-30 | 17 |
| F | 2019-12-31 | 40 |
| F | 2020-01-01 | 57 |
| F | 2020-01-07 | 80 |
| M | 2019-12-18 | 2 |
| M | 2019-12-25 | 13 |
| M | 2019-12-30 | 26 |
| M | 2019-12-31 | 29 |
| M | 2020-01-07 | 36 |
+--------+------------+-------+
Walking Through My Thinking
This is the type of query question that you either solve at first glance or feel completely lost. It can easily lead you astray.
Here is why.
The question asks for "the total score for each gender at each day," and my first reaction is to calculate the cumulative sum and then group it by gender and day. So, I was wrongfully trying to find a syntax that calculates the cumulative sum in SQL, which doesn’t exist.
After struggling for a little bit, I realize we could apply the SUM() OVER( PARTITION BY…) statement to calculate the cumulative sum grouped by gender. Then, ORDER the result BY day.
Solution
# Write your MySQL query statement below
SELECT gender, day, SUM(score_points) OVER(PARTITION BY gender ORDER BY day) AS total
FROM Scores
Rule of Thumb:
- Many SQL commands follow the same syntax, such as SUM() OVER(), ROW_NUMBER() OVER(), LEAD() OVER, etc. Understand the usage of one command, and you are pretty much set for the others.
- Do you have to group your test results? If so, use PARTITION BY.
- How about ranking the results? If so, use ORDER BY.
Question 2: Find the Start and End Number of Continuous Ranges
Since some IDs have been removed from Logs
. Write an SQL query to find the start and end number of continuous ranges in table Logs
.
_Order the result table by start_id
._
The query result format is in the following example:
Logs table:
+------------+
| log_id |
+------------+
| 1 |
| 2 |
| 3 |
| 7 |
| 8 |
| 10 |
+------------+
Result table:
+------------+--------------+
| start_id | end_id |
+------------+--------------+
| 1 | 3 |
| 7 | 8 |
| 10 | 10 |
+------------+--------------+
Walking Through My Thinking
Amazon includes this question in its interview process.
This question asks for the start and end of continuous ranges with missing numbers. It’s relatively easy to identify the start and end points of a continuous range, which can be found using _MIN(logid) and _MAX(logid), respectively. The tricky part is to identify the discontinuity between two ranges. For example, how can we tell a discontinuity between these two continuous ranges: 1–3 and 7–8? Here, we are missing 5, 6, and 7.
The solution lies in how we group our results. If there is a discontinuity, we will observe a gap between log_id and the row ranking for each observation. For continuous ranges, the observations should have the same range; if there is any gap, then we observe a jump in the group difference.
Logs table:
+------------+
| log_id |
+------------+
| 1 |
| 2 |
| 3 |
| 7 |
| 8 |
| 10 |
+------------+
In the Logs table, numbers 1, 2, and 3 form a continuous range, and the difference between the log_id and the row ranking is consistent within the same range, which is 0 (i.e., log_id – row ranking, 1–1 = 2–2 = 3–3 =0). The same rule applies to 7 and 8 with a difference of 3 (i.e., log_id – row ranking = 7 – 4 = 8–5 = 3). As seen, we observe a jump (discontinuity) in the difference (from 0 to 3) when we move from the first range to the second range.
Solution
# Write your MySQL query statement below
SELECT MIN(log_id) AS start_id, MAX(log_id) AS end_id
FROM (
SELECT log_id, ROW_NUMBER() OVER(ORDER BY log_id) AS row_id
FROM Logs
) sub
GROUP BY log_id - row_id
ORDER BY start_id
Rule of Thumb:
- Analyze the question at hand and break it down into different components. If needed, use a subquery to complete the minor task and move to the outer query.
- ROW_NUMBER() OVER(ORDER BY …): obtain the row number for each unit.
- GROUP BY: the difference between log_id and row number to identify discontinuity.

Question 3: Department Highest Salary
The Employee
table holds all employees. Every employee has an Id, a salary, and there is also a column for the department Id.
+----+-------+--------+--------------+
| Id | Name | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1 | Joe | 70000 | 1 |
| 2 | Jim | 90000 | 1 |
| 3 | Henry | 80000 | 2 |
| 4 | Sam | 60000 | 2 |
| 5 | Max | 90000 | 1 |
+----+-------+--------+--------------+
The Department
table holds all departments of the company.
+----+----------+
| Id | Name |
+----+----------+
| 1 | IT |
| 2 | Sales |
+----+----------+
Write a SQL query to find employees who have the highest salary in each of the departments. For the above tables, your SQL query should return the following rows (order of rows does not matter).
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT | Max | 90000 |
| IT | Jim | 90000 |
| Sales | Henry | 80000 |
+------------+----------+--------+
Walking Through My Thinking
Amazon includes this question in its interview process.
Let me walk through the overall analysis before giving specific details. The question asks for the employees with the highest salary in each department. So, the natural question is there may be multiple employees with the highest salary, and we need to return the matched employees.
Here is a specific breakdown. I’ll use a subquery to obtain the highest (maximum) salary grouped by department ID and then join Employee and Department tables before matching on DepartmentId and Salary.
Solution
# Write your MySQL query statement below
SELECT d.Name AS Department, e.Name AS Employee, Salary
FROM Employee e
JOIN Department d
ON d.Id = e.DepartmentId
WHERE (DepartmentId,Salary) IN (
SELECT DepartmentId, MAX(Salary)
FROM Employee
GROUP BY DepartmentId
)
Rule of Thumb:
Department
Table only contains two columns, and its sole function is to return the department names for department Ids. This should turn into a second instinct when reading the question prompt.- If the question is to return the max/min value for each group (e.g., department), a typical approach is to apply a subquery to create a pair of max/min value per group and then match the pair to the outer query (as in the WHERE clause).
Question 4: Second Highest Salary
Write a SQL query to get the second highest salary from the Employee
table.
+----+--------+
| Id | Salary |
+----+--------+
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
+----+--------+
For example, given the above Employee table, the query should return 200
as the second highest salary. If there is no second highest salary, then the query should return null
.
+---------------------+
| SecondHighestSalary |
+---------------------+
| 200 |
+---------------------+
Walking Through My Thinking
Amazon and Apple include this question in their interview loops.
I’ve introduced how to approach this question in another blog post:
It’s a straightforward question, and it can be solved in the following steps:
1. Rank Salary in a descending order
2. LIMIT result by 1 and OFFSET 1, which returns the second highest salary
3. Use DISTINCT to deal with duplicates
4. Include the IFNULL statement in the outer query
Check the other post for detailed explanations. The sole reason for its inclusion is to provide a baseline analysis for Question 5 and allow us to compare and contrast the solutions.
Solution
SELECT IFNULL(
(SELECT DISTINCT Salary
FROM Employee
ORDER BY Salary DESC
LIMIT 1
OFFSET 1),
NULL) AS SecondHighestSalary
Rule of Thumb:
- Understand how OFFSET and IFNULL work.
Question 5: Nth Highest Salary
Write a SQL query to get the nth highest salary from the Employee
table.
+----+--------+
| Id | Salary |
+----+--------+
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
+----+--------+
For example, given the above Employee table, the nth highest salary where n = 2 is 200
. If there is no nth highest salary, then the query should return null
.
+------------------------+
| getNthHighestSalary(2) |
+------------------------+
| 200 |
+------------------------+
Walking Through My Thinking
This is a popular question that almost all FAANG companies expect job candidates to excel.
It asks for the nth highest salary and returns null
if no such value. For example, if we input n=2, return the second-highest value; n=3, the third-highest.
As a first step, we need to create a ranking index in a subquery according to the salary value (descending). In an interview, we need to clarify how we should handle ties: shall we skip the next index or return a consecutive number?
For our case, we choose DENSE_RANK to ensure consecutive rankings; Name the rankings _rankingssalary.
In the outer query, we select DISTINCT Salary and set _n = rankingssalary in the WHERE clause, which returns the nth ranking (highest) salary.
Solution
CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
RETURN (
# Write your MySQL query statement below.
SELECT IFNULL((SELECT DISTINCT Salary
FROM (
SELECT Salary, DENSE_RANK() OVER(ORDER BY Salary DESC) AS rankings_salary
FROM Employee
) AS aaa
WHERE n = rankings_salary), NULL) AS getNthHighestSalary)
;
END
Rule of Thumb:
- Don’t panic if you have to write nested query requests. The recommended approach is to write the innermost part in a sub-query first and move to the outer queries layer by layer.
- A common mistake: forget to name your subquery.
- Don’t forget to use DISTINCT and IFNULL.
Takeaway
Practice Makes Perfect.
When I start coding in SQL, the only query I can perform is to count unique values from a table. Fast forward to today, I’m comfortable with more advanced coding tasks that require subqueries, window functions, and other advanced applications.
The biggest takeaway that I want to share with my fellow data folks is to start small and simple, learn from your mistakes, and trust in the value of incremental learning. Be the type of person who grows and becomes 1% better than yesterday. You will be rewarded in the long run.
Appreciate your readership, and good luck with everything.
_If you find my post useful and want to learn more about my other content, plz check out the entire content repository here: https://linktr.ee/leihua_ye._
My Data Science Interview Sequence
Enjoy reading this one?
Please find me on LinkedIn and Youtube.
Also, check my other posts on Artificial Intelligence and Machine Learning.