SQL is used by most relational database management systems (RDBMS) to manage data stored in tabular form (i.e. tables). It is a fundamental skill expected from data scientists and analysts.
You might argue that it is the job of a data engineer but the data scientist roles are inclined to being full-stack. Besides, as a data scientist, you wouldn’t want to depend on a data engineer to retrieve data from a database.
Long story short, having SQL in your skill set will put you one step ahead in searching and securing a job in the field of Data Science.
Databases usually contain very large amount of data. We do not need all the data for every task. Thus, we can either retrieve all the data and then filter what we need or only retrieve only what we need. The latter is obviously the better approach.
We use the select statement to apply some conditions so that we only retrieve what is needed. There are many operators used with the select statement to enhance the filtering capabilities such as where, group by, having and so on.
In this article, we will focus on a specific operator used along with the select statements. The case operator is a way to put conditions in the queries. It can be considered as a flow control function just like if/else in Python.
I have previously created a sample sales database that consists of 4 relational tables. We will do the examples using these tables.
mysql> show tables;
+-----------------+
| Tables_in_sales |
+-----------------+
| customer |
| item |
| purchase |
| store |
+-----------------+
Let us first take a look at the purchase table.
mysql> select purchase_id, item_qty, item_id from purchase;
+-------------+----------+---------+
| purchase_id | item_qty | item_id |
+-------------+----------+---------+
| 1 | 2 | 3 |
| 2 | 1 | 4 |
| 3 | 3 | 6 |
| 4 | 1 | 8 |
| 5 | 4 | 5 |
| 6 | 2 | 4 |
| 7 | 2 | 6 |
| 8 | 2 | 10 |
| 9 | 1 | 6 |
| 10 | 1 | 7 |
+-------------+----------+---------+
I have selected only three columns of the purchase table. The columns show the item id and item quantity for each purchase.
A typical task for the case operator can be classifying the purchases into two groups as purchases with single item and multiple items.
mysql> select purchase_id,
-> (case when item_qty > 1 then "multiple items"
-> else "single item" end ) as purchase_group
-> from purchase;
+-------------+----------------+
| purchase_id | purchase_group |
+-------------+----------------+
| 1 | multiple items |
| 2 | single item |
| 3 | multiple items |
| 4 | single item |
| 5 | multiple items |
| 6 | multiple items |
| 7 | multiple items |
| 8 | multiple items |
| 9 | single item |
| 10 | single item |
+-------------+----------------+
The query result contains two columns. The first one is the purchase_id column. The second one is derived based on the item_qty column using the case operator. If the value in the item_qty column is greater than 1, the row in the derived column becomes "multiple items". Otherwise, it is "single item".
Since we have created a new column using the case operator, we can use it in different operators or clauses. For instance, we may want to see the number of purchases in each group.
mysql> select count(purchase_id),
-> (case when item_qty > 1 then "multiple items"
-> else "single item" end) as purchase_group
-> from purchase
-> group by purchase_group;
+--------------------+----------------+
| count(purchase_id) | purchase_group |
+--------------------+----------------+
| 6 | multiple items |
| 4 | single item |
+--------------------+----------------+
We counted the number of purchases by using the derived column in the group by statement.
Multiple conditions can be applied in the case operator. We will do an example on the item table in the sales Database.
Here is the item table:
mysql> select * from item;
+---------+----------------+-------+----------+
| item_id | description | price | store_id |
+---------+----------------+-------+----------+
| 1 | apple | 2.45 | 1 |
| 2 | banana | 3.45 | 1 |
| 3 | cereal | 4.20 | 2 |
| 4 | milk 1 liter | 3.80 | 2 |
| 5 | cheddar cheese | 4.50 | 2 |
| 6 | icecream | 6.10 | 2 |
| 7 | water 2 liters | 1.10 | 3 |
| 8 | tomato | 0.95 | 1 |
| 9 | egg 15 | 4.40 | 3 |
| 10 | sprite 1 liter | 1.60 | 3 |
+---------+----------------+-------+----------+
Consider a case where we need to identify the items that cost higher than a certain amount and are sold at store 2.
mysql> select description,
-> (case when price > 4 and store_id = 2
-> then "high at store 2"
-> else null end) as new_col
-> from item;
+----------------+-----------------+
| description | new_col |
+----------------+-----------------+
| apple | NULL |
| banana | NULL |
| cereal | high at store 2 |
| milk 1 liter | NULL |
| cheddar cheese | high at store 2 |
| icecream | high at store 2 |
| water 2 liters | NULL |
| tomato | NULL |
| egg 15 | NULL |
| sprite 1 liter | NULL |
+----------------+-----------------+
Since our focus is the expensive items at store 2, we do not case the other ones so those can be marked as null.
If we don’t want to display the null items, we can filter them out by adding the having statement as follows:
mysql> select description,
-> (case when price > 4 and store_id = 2
-> then "high at store 2"
-> else null end) as new_col
-> from item
-> having new_col is not null;
+----------------+-----------------+
| description | new_col |
+----------------+-----------------+
| cereal | high at store 2 |
| cheddar cheese | high at store 2 |
| icecream | high at store 2 |
+----------------+-----------------+
Bonus
The select statement of SQL is quite flexible and powerful. It can be combined with many additional statements, operators, and aggregations to create highly complex queries.
I just want to give an example of using different functions with the select statement. Consider the customer table we have previously worked on. We want to sort the customers based on the length of their full names (first name and last name).
Here is a way to accomplish this task:
mysql> select cust_id,
-> char_length(concat(f_name, " ", l_name)) as name_len
-> from customer
-> order by name_len desc;
+---------+----------+
| cust_id | name_len |
+---------+----------+
| 3 | 12 |
| 4 | 11 |
| 5 | 10 |
| 7 | 9 |
| 1 | 8 |
| 2 | 8 |
| 6 | 8 |
+---------+----------+
We first combine the first and last names by using the "concat" function. Then the number of characters is counted with the "char_length" function. The "order by" statement is used to sort the query results based on the derived name_len column.
Conclusion
We have seen how the case operator can be implemented in the select statements. The case operator is one of the tools that are used to enhance the capabilities of the select statement.
By creating complex queries, we can only retrieve the data we need instead of getting all the data and then applying filtering and calculations.
Since real-life databases contain much more data and many relational tables, it is very important to be able to query the desired data using SQL.
Thank you for reading. Please let me know if you have any feedback.