
It is indisputable that SQL is a must-have skill for data scientists. Most companies use relational databases to store their data and SQL is what we need to interact with them.
SQL stands for Structured Query Language, but it can do much more than query a database. We can use SQL to perform the tasks in a typical exploratory data analysis process. In this sense, it is similar to other data analysis tools such as Pandas in Python and data.table in R.
In this article, we will go over two tricky SQL questions that you might be asked in data scientist interviews. I know the first one was actually asked at an interview, and I think the other one is highly likely to be asked as well. If not, the procedure it follows will still be helpful.
I will assume you have a basic understanding of writing SQL queries. If not, you can still follow along because I will try to explain what each step does.
Note: There are several relational database management systems, such as MySQL, SQL Server, and PostgreSQL. Although they mostly share the same SQL syntax, there may be small differences. In this article, we will use PostgreSQL.
Question 1
Consider you have a weather table that contains daily temperature values in different cities.

How can you create a table that also contains the temperature difference between the current day and the next day? For the first row, the difference column should contain the value 1.2.
We can accomplish this task in two steps:
- Create a common table expression (CTE) that has the temperature of the next day as a new column
- Using the common table expression created in step 1, calculate the temperature difference
The table defined in the first step can be created by shifting the temperature column by one row and joining it back to the original table. We also need to take the city into account: we cannot use the temperature of city B as the next-day temperature for city A.
The shifting can be done with the lag or lead window functions. Since we are interested in the next day, we will use the lead function. We also need to partition by city and order the rows by date, which ensures that each row gets the appropriate next-day temperature.
In the second step, we will select all the columns from the table created in the first step and calculate the difference between the temperature and next-day temperature columns.
Here is the SQL query that performs this operation:
WITH cte AS (
    SELECT
        W1.*,
        LEAD(W2.temperature, 1) OVER (PARTITION BY W2.city ORDER BY W2.date) AS temp_next
    FROM weather W1
    LEFT JOIN weather W2
        ON W1.date = W2.date AND W1.city = W2.city
)
SELECT
    *,
    COALESCE((temp_next - temperature), 0) AS temp_diff
FROM cte;

The coalesce function can be used to replace missing values. Since the last day of each city does not have a next-day value, lead produces NULL there, and we can simply put 0 as the difference.
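If you want to try this pattern without a PostgreSQL server, here is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25+, bundled with recent Python builds, supports the same window functions). The city names and temperature values are invented for illustration, and the sketch lets lead alone do the shifting, without the self-join:

```python
import sqlite3

# Invented sample data for illustration; columns mirror the article's weather table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, date TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("A", "2021-01-01", 10.0), ("A", "2021-01-02", 11.2), ("A", "2021-01-03", 9.7),
     ("B", "2021-01-01", 20.0), ("B", "2021-01-02", 19.5)],
)

# Step 1: lead shifts temperature up by one row within each city partition;
# the last row of each city gets NULL. Step 2: COALESCE turns that NULL
# difference into 0.
rows = conn.execute("""
    WITH cte AS (
        SELECT *,
               LEAD(temperature, 1) OVER (PARTITION BY city ORDER BY date) AS temp_next
        FROM weather
    )
    SELECT *, COALESCE(temp_next - temperature, 0) AS temp_diff
    FROM cte
    ORDER BY city, date
""").fetchall()
for row in rows:
    print(row)
```

Note that city B's temperatures are never used as next-day values for city A, which is exactly what the PARTITION BY clause guarantees.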
Question 2
Consider we have the following table that contains some features of houses in Melbourne, along with their prices.

The dataset contains a date column, which I think represents the date a house is posted for sale. We want to see if the day of the week has an effect on the house price in general. For instance, houses posted on weekends might have a higher price than the ones posted during the week.
This task involves a grouping operation based on a derived column. We will create a categorical variable that contains the day-of-week information extracted from the date column.
This column will then be used to group the observations (i.e., rows).
SELECT
    EXTRACT(dow FROM date) AS day_of_week,
    ROUND(CAST(AVG(price / 1000000) AS numeric), 3) AS avg_price_mio
FROM melb
GROUP BY day_of_week
ORDER BY avg_price_mio DESC;
day_of_week | avg_price_mio
------------+---------------
5 | 1.123
0 | 1.120
4 | 1.093
6 | 1.077
3 | 1.073
2 | 1.048
1 | 1.023
The aggregation is applied to the price column. I wanted to display the averages with three decimal places, so I have also used the cast and round functions; in PostgreSQL, round with a precision argument requires a numeric value, which is why the average is cast first.
Finally, the results are ordered by the average price in descending order. Note that dow ranges from 0 (Sunday) to 6 (Saturday), so the highest average belongs to houses posted on Fridays.
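The same grouping-by-a-derived-column idea can be sketched with Python's built-in sqlite3 module; there, strftime('%w', date) plays the role of PostgreSQL's EXTRACT(dow FROM date) and uses the same 0 = Sunday numbering. The table and prices below are invented, not the actual Melbourne data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE melb (date TEXT, price REAL)")
# Invented prices; 2021-01-02 is a Saturday, 2021-01-04 a Monday.
conn.executemany(
    "INSERT INTO melb VALUES (?, ?)",
    [("2021-01-02", 1500000.0), ("2021-01-02", 1300000.0),
     ("2021-01-04", 900000.0)],
)

# Group by the derived day-of-week column and average the price in millions.
rows = conn.execute("""
    SELECT strftime('%w', date) AS day_of_week,
           ROUND(AVG(price / 1000000), 3) AS avg_price_mio
    FROM melb
    GROUP BY day_of_week
    ORDER BY avg_price_mio DESC
""").fetchall()
for row in rows:
    print(row)
```

In SQLite, round already accepts a floating-point value, so the explicit cast needed in PostgreSQL can be dropped here.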
By the way, the dataset used in this example can be downloaded from Kaggle.
Conclusion
We have gone over two examples that demonstrate somewhat tricky SQL queries. These kinds of operations show that SQL can perform data analysis and manipulation operations just like Pandas in Python and data.table in R.
SQL is an essential tool in the data science ecosystem. In fact, it is often preferable to perform data analysis and manipulation tasks with SQL rather than retrieving all the data from a database and processing it with an external tool.
Thank you for reading. Please let me know if you have any feedback.