How to get more insights using SQL IN operator?

Use IN operator to merge data across tables

Bao Nguyen
Towards Data Science

--

Photo by Kristina Wagner on Unsplash

We all know that SQL IN is a helpful operator, which allows you to choose multiple conditions within the WHERE clause. However, the IN operator is even more helpful than you think. You can merge data across tables with IN.

An example using a Google BigQuery public data set

To illustrate, I will go through a quick tutorial. I use a Google BigQuery’s public data set called austin_bikeshare. Therefore, you can easily follow my codes to practice.

There are two tables in this data set. bikeshare_stations table includes information about stations’ names, statuses, and locations. bikeshare_trips table includes information about trips’ start time, end time, start station, end station, duration, and type of the subscriber.

As a data analyst, I might want to know the performance of each start station through the number of trips taken. However, I am only interested in stations that are still active. Therefore, I can frame my question as follows:

How many trips were taken per each active start station?

You can easily find the start stations and the associated trips from the bikeshare_trips table. However, the status of each station lies in the bikeshare_stations table. A solution might be using JOIN:

There is another solution using IN, in which you don’t have to use JOIN. First, you select the name of all start stations and the number of trips by counting trip IDs from the bikeshare_trips table. Then, in the WHERE clause, you write a subquery selecting all stations that are active from the bikeshare_stations table and use the IN operator. As a result, the selected stations are limited within a subset of active ones instead of all stations. The codes are as follows:

Another example from a Leetcode question

I use question number 607 (Sales Person) to further illustrate. The question is to select all the names in the table salesperson, who didn’t have sales to the company 'RED'.

Table: salesperson

+----------+------+--------+-----------------+-----------+
| sales_id | name | salary | commission_rate | hire_date |
+----------+------+--------+-----------------+-----------+
| 1 | John | 100000 | 6 | 4/1/2006 |
| 2 | Amy | 120000 | 5 | 5/1/2010 |
| 3 | Mark | 65000 | 12 | 12/25/2008|
| 4 | Pam | 25000 | 25 | 1/1/2005 |
| 5 | Alex | 50000 | 10 | 2/3/2007 |
+----------+------+--------+-----------------+-----------+

Table: company

+---------+--------+------------+
| com_id | name | city |
+---------+--------+------------+
| 1 | RED | Boston |
| 2 | ORANGE | New York |
| 3 | YELLOW | Boston |
| 4 | GREEN | Austin |
+---------+--------+------------+

Table: orders

+----------+------------+---------+----------+--------+
| order_id | order_date | com_id | sales_id | amount |
+----------+------------+---------+----------+--------+
| 1 | 1/1/2014 | 3 | 4 | 100000 |
| 2 | 2/1/2014 | 4 | 5 | 5000 |
| 3 | 3/1/2014 | 1 | 1 | 50000 |
| 4 | 4/1/2014 | 1 | 4 | 25000 |
+----------+----------+---------+----------+--------+

Since the question only asks for names of the salespeople, so it makes sense to select name from the salesperson table. However, how about the ones that didn’t have sales to the company ‘RED’?

The IN operator comes into play in this scenario. Instead of using only IN, we use NOT IN. First, we choose a subset of employees who had sales with the company ‘RED’ by joining orders table with company table, selecting sales IDs, and using the condition that company name is ‘RED’. Then, we simply select the sales IDs that don’t belong to that subset. Here are the codes to follow:

By following these two examples, I hope you have a better understanding of how to use IN operator to merge data across tables without using JOIN.

--

--

Data Analyst based in Boston. I enjoy talking about data analytics, tech, and job search tips.