The world’s leading publication for data science, AI, and ML professionals.

Towards Being Tool-Agnostic in Data Science: SQL Case When and Pandas Where

Explained with examples

Your customer data is in a SQL database. You’re assigned a task that involves retrieving data from some tables, doing some data cleaning and manipulation, and writing the results to a different table.

Unfortunately, you don’t know how to do those operations with SQL. No worries! You’re great at using Pandas for data cleaning and manipulation. So, you come up with a solution, which is:

  • Retrieve all the data from SQL tables
  • Download the data as CSV files
  • Read the CSV files into Pandas DataFrames
  • Perform the required data cleaning and manipulation operations
  • Write the results to a different CSV file
  • Upload the data in the CSV file to a SQL table

Nice plan, right?

If you actually execute this plan, I’m sure your manager will have a talk with you, which can be pleasant or unpleasant depending on your manager’s personality. In any case, I don’t think you’ll execute this awesome plan anymore after the talk.

I know there are usually many different ways of doing a task in Data Science. You should always aim for the most efficient one because you’ll typically work with very large datasets. Making things more complicated than necessary costs you extra money and time.

"I’m great at Pandas so I’ll do everything with Pandas" is not a desired attitude. If your task involves reading data from SQL tables and writing results to SQL tables, the best way is usually doing the steps in between using SQL.

SQL is not just a query language. It can be used as a highly efficient data analysis and manipulation tool as well.

I remember writing SQL jobs to do very complex data preprocessing operations and they worked just fine.

Data science is still an evolving field. New tools and concepts are introduced all the time. You should not be dependent on a single tool and should always be open to learning new ones.


Pandas vs SQL

Pandas and SQL have a lot in common in terms of what they’re capable of. Both are extremely good at data analysis and manipulation.

In this article, we’ll go through examples to learn how the SQL CASE WHEN statement and the Pandas where function can be used for the same operation. As a bonus, we’ll also learn about the NumPy where function, which is more flexible than Pandas where.

I have the following data stored in a SQL table and a Pandas DataFrame, both named product_inventory.

You can download this dataset as a CSV file from my datasets repo. It’s called product_inventory.csv.


The first task

We’ll first create a new column called has_enough_stock, which takes the value 1 if the stock quantity is more than 500 and 0 otherwise.

The SQL version:

SELECT
 product_id,
 product_description,
 price,
 stock_qty,
 next_shipment,
 CASE
  WHEN stock_qty > 500 THEN 1
  ELSE 0
 END AS has_enough_stock
FROM product_inventory
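If you want to try this query without setting up a database server, here is a minimal, self-contained sketch using Python’s built-in sqlite3 module. The table and column names follow the article, but the sample rows are made up for illustration:

```python
import sqlite3

# in-memory database with a tiny, made-up product_inventory table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_inventory (product_id INTEGER, stock_qty INTEGER)"
)
conn.executemany(
    "INSERT INTO product_inventory VALUES (?, ?)",
    [(1, 700), (2, 350), (3, 120)],
)

# the CASE WHEN expression labels each row with 1 or 0
rows = conn.execute(
    """
    SELECT
        product_id,
        stock_qty,
        CASE
            WHEN stock_qty > 500 THEN 1
            ELSE 0
        END AS has_enough_stock
    FROM product_inventory
    """
).fetchall()

print(rows)  # [(1, 700, 1), (2, 350, 0), (3, 120, 0)]
```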

The following part creates the has_enough_stock column according to the specified condition:

CASE
  WHEN stock_qty > 500 THEN 1
  ELSE 0
END AS has_enough_stock

The Pandas version includes the where function, which takes a condition and replaces values where the condition is false.

In order to use it for our task, we first need to create the has_enough_stock column with all 0s. Then, we’ll change the 0s to 1s where the stock quantity is more than 500.

Since the where function allows for changing values where the condition is false, we’ll define the condition as stock_qty <= 500 (i.e. stock quantity is less than or equal to 500). The rows for which this condition evaluates to false are the ones with a stock quantity of more than 500.

# create the has_enough_stock column with all 0s
product_inventory.loc[:, "has_enough_stock"] = 0

# change values to 1 where stock quantity is more than 500
product_inventory.loc[:, "has_enough_stock"] = product_inventory["has_enough_stock"].where(product_inventory["stock_qty"] <= 500, 1)

# display the first 5 rows
product_inventory.head()
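To make the two-step where approach easy to verify, here is a self-contained version on a small DataFrame. The column names follow the article; the values are made up:

```python
import pandas as pd

# toy data standing in for product_inventory
df = pd.DataFrame({"product_id": [1, 2, 3], "stock_qty": [700, 350, 120]})

# step 1: initialize the column with all 0s
df["has_enough_stock"] = 0

# step 2: where() keeps values where the condition is True
# and replaces them (with 1) where it is False
df["has_enough_stock"] = df["has_enough_stock"].where(df["stock_qty"] <= 500, 1)

print(df["has_enough_stock"].tolist())  # [1, 0, 0]
```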


NumPy where function

When using the Pandas where function, we need to first create the has_enough_stock column with all 0s. This is because it only allows for updating values that do not fit the given condition.

The NumPy where function lets you specify values both for rows that fit the condition and for rows that don’t, so there is no need to initialize the column first.

Here is how we can create the has_enough_stock column with NumPy where function:

import numpy as np

# create the column
product_inventory.loc[:, "has_enough_stock"] = np.where(product_inventory["stock_qty"] > 500, 1, 0)

The first parameter is the condition, the second one is the value to be used for rows that fit the condition, and the third one is the value for rows that don’t fit the condition.
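The same toy example as before, this time with np.where and no initialization step (the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"stock_qty": [700, 350, 120]})

# np.where(condition, value_if_true, value_if_false)
df["has_enough_stock"] = np.where(df["stock_qty"] > 500, 1, 0)

print(df["has_enough_stock"].tolist())  # [1, 0, 0]
```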


What if we need more than 2 categories?

The example we did involves creating a column with only 2 values (i.e. categories). Let’s also do an example that requires creating more than 2 categories.




We’ll create a new column named stock_situation based on the following criteria:

  • Stock quantity > 500, "enough"
  • Stock quantity ≤ 500 and stock quantity > 200, "critical"
  • Stock quantity ≤ 200, "order asap"

The SQL version is quite similar. We’ll just need to add a new line with WHEN condition THEN value.

SELECT
 product_id,
 product_description,
 price,
 stock_qty,
 next_shipment,
 CASE
  WHEN stock_qty > 500 THEN 'enough'
  WHEN stock_qty <= 500 AND stock_qty > 200 THEN 'critical'
  ELSE 'order asap'
 END AS stock_situation
FROM product_inventory

When it comes to creating this column in a Pandas DataFrame, we need a different function, which is the select function of NumPy.

It takes a set of conditions and lets you assign a separate value for each one.

# define conditions
conditions = [
    product_inventory["stock_qty"] > 500,
    (product_inventory["stock_qty"] <= 500) & (product_inventory["stock_qty"] > 200),
    product_inventory["stock_qty"] <= 200
]

# define values associated with the conditions
values = ["enough", "critical", "order asap"]

# create the column
product_inventory.loc[:, "stock_situation"] = np.select(conditions, values)

# display the first 5 rows
product_inventory.head()

The conditions and values are created using Python lists. Then, we just pass them to the select function as arguments.
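As a quick self-contained check on made-up values: np.select also accepts a default argument, used for rows that match none of the conditions (here the three conditions cover every row, so the default never fires):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"stock_qty": [700, 350, 120]})

conditions = [
    df["stock_qty"] > 500,
    (df["stock_qty"] <= 500) & (df["stock_qty"] > 200),
    df["stock_qty"] <= 200,
]
values = ["enough", "critical", "order asap"]

# np.select evaluates the conditions in order and picks the first match;
# default= sets the value when no condition matches (unused here)
df["stock_situation"] = np.select(conditions, values, default="unknown")

print(df["stock_situation"].tolist())  # ['enough', 'critical', 'order asap']
```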


Final words

The examples we did are not very complex, but they support my point about not sticking to a particular tool, especially when working with large datasets.

Spending a few hours to learn a new tool can save you hours in the long run. It’ll also save you from doing unnecessary computations and thus spending extra money.

Thank you for reading. Please let me know if you have any feedback.

