
2 Tricky SQL Questions You Might Be Asked at Data Scientist Interviews

SQL is a powerful data analysis and manipulation tool.

Photo by Diana Polekhina on Unsplash

It is indisputable that SQL is a must-have skill for data scientists. Most companies use relational databases to store their data and SQL is what we need to interact with them.

SQL stands for Structured Query Language, but it can do much more than query a database. We can use SQL to do the tasks in a typical exploratory data analysis process. In this sense, it is similar to other data analysis tools such as Pandas in Python and data.table in R.

In this article, we will go over 2 tricky SQL questions that you might be asked at data scientist interviews. I know the first one was actually asked at an interview, and I think the second one is highly likely to be asked as well. If not, the procedure it follows will definitely be helpful.

I will assume you have a basic understanding of writing SQL queries. If not, you can still follow through because I will try to explain what each step does.

Note: There are several relational database management systems, such as MySQL, SQL Server, PostgreSQL, and so on. Although they mostly adopt the same SQL syntax, there might be small differences. In this article, we will use PostgreSQL.


Question 1

Consider you have a weather table that contains daily temperature values in different cities.

The weather table (image by author)

How can you create a table that also contains the temperature difference between the current day and the next day? For the first row, the difference column should contain the value 1.2.

We can accomplish this task in two steps:

  1. Create a common table expression (cte) that has the temperature of the next day as a new column
  2. Using the common table expression created in step 1, calculate the temperature difference

The table defined in the first step can be created by shifting the temperature column by one row and joining the result back to the original table. We also need to take the city names into account: we cannot use the temperature of city B as the next-day temperature for city A.

The shifting can be done with the lag or lead window functions. Since we are interested in the next day, we will use the lead function. We also need to partition by city and order the rows by date, which ensures that each row is matched with the appropriate next-day temperature.

In the second step, we will select all the columns from the table created in the first step and calculate the difference between the temperature and next day temperature columns.

Here is the SQL query that performs this operation:

WITH cte AS (
   SELECT
      W1.*,
      LEAD(W2.temperature, 1) OVER (PARTITION BY W2.city ORDER BY W2.date) AS temp_next
   FROM weather W1
   LEFT JOIN weather W2
      ON W1.date = W2.date AND W1.city = W2.city
)
SELECT
   *,
   COALESCE(temp_next - temperature, 0) AS temp_diff
FROM cte;
(image by author)

The coalesce function can be used to replace the missing values that arise as a result of the shift. Since the last day of each city has no next-day value, we can just put 0 as the difference.
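If you want to sanity-check this logic without a PostgreSQL server, the same window-function approach runs in SQLite (3.25+), which ships with Python. The table and values below are made up for illustration, with city A's first two temperatures chosen so that the first difference is 1.2, as in the question; lead performs the one-row shift on its own, so the self-join can be dropped here.

```python
import sqlite3

# In-memory database with a made-up weather table (city, date, temperature).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, date TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("A", "2021-01-01", 10.0),
        ("A", "2021-01-02", 11.2),
        ("A", "2021-01-03", 9.5),
        ("B", "2021-01-01", 20.0),
        ("B", "2021-01-02", 21.5),
    ],
)

# LEAD shifts the next row's temperature into the current row, per city.
query = """
WITH cte AS (
    SELECT
        *,
        LEAD(temperature, 1) OVER (PARTITION BY city ORDER BY date) AS temp_next
    FROM weather
)
SELECT city, date, temperature,
       COALESCE(temp_next - temperature, 0) AS temp_diff
FROM cte
ORDER BY city, date;
"""
for row in conn.execute(query):
    print(row)
```

Each city's last day has a NULL temp_next, which coalesce turns into a difference of 0.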


Question 2

Consider we have the following table that contains some features about the houses in Melbourne and their prices.

The first 5 rows of the melb table (image by author)

The dataset contains a date column, which I think represents the date the house was posted for sale. We want to see if the day of the week has an effect on the house price in general. For instance, houses posted on weekends might have a higher price than the ones posted during the week.

This task involves a grouping operation based on a derived column. We will create a categorical variable that contains the day-of-week information extracted from the date column.

This column will then be used to group the observations (i.e. rows).

SELECT
   EXTRACT(dow FROM date) AS day_of_week,
   ROUND(CAST(AVG(price/1000000) AS numeric), 3) AS avg_price_mio
FROM melb
GROUP BY day_of_week
ORDER BY avg_price_mio DESC;
day_of_week | avg_price_mio
------------+---------------
          5 |         1.123
          0 |         1.120
          4 |         1.093
          6 |         1.077
          3 |         1.073
          2 |         1.048
          1 |         1.023

The aggregation is applied to the price column. I wanted to display the averages with three decimal places, so I have also used the cast and round functions.

Finally, the results are ordered by the average price in descending order.
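This query can also be tried out in SQLite via Python. SQLite has no EXTRACT, but strftime('%w', date) returns the day of the week with the same 0 = Sunday convention as PostgreSQL's dow. The rows below are invented stand-ins; the real melb table has many more columns.

```python
import sqlite3

# Made-up stand-in for the melb table: only date and price matter here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE melb (date TEXT, price REAL)")
conn.executemany(
    "INSERT INTO melb VALUES (?, ?)",
    [
        ("2017-05-06", 1200000),  # a Saturday (day_of_week = 6)
        ("2017-05-06", 1000000),  # a Saturday
        ("2017-05-08", 900000),   # a Monday (day_of_week = 1)
    ],
)

# strftime('%w', ...) plays the role of EXTRACT(dow FROM ...): 0 = Sunday.
query = """
SELECT
    strftime('%w', date) AS day_of_week,
    ROUND(AVG(price / 1000000.0), 3) AS avg_price_mio
FROM melb
GROUP BY day_of_week
ORDER BY avg_price_mio DESC;
"""
for row in conn.execute(query):
    print(row)
```

Dividing by 1000000.0 before averaging keeps the result in millions, just as in the PostgreSQL version; the CAST to numeric is not needed in SQLite.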

By the way, the dataset used in this example can be downloaded from Kaggle.


Conclusion

We have gone over 2 examples that demonstrate somewhat tricky SQL queries. These kinds of operations show that SQL can perform data analysis and manipulation just like Pandas in Python and data.table in R.

SQL is an essential tool in the Data Science ecosystem. In fact, it is often preferable to perform data analysis and manipulation tasks with SQL rather than retrieving all the data from a database and using an external tool.

Thank you for reading. Please let me know if you have any feedback.

