
Creating a new column or modifying an existing column in a Pandas data frame – based on a set of if-else
conditions – is probably one of the most frequently encountered problems among all different types of data wrangling tasks. In this post, I’d like to share with you my notepad which summarizes the 5 popular ways of applying if-else
conditional statements in Pandas dataframes with handy and practical code snippets. For simplicity, I created a small sample dataset and will use it for demonstration purposes throughout the tutorial.
Let’s say we have a pandas dataframe as shown below. The column ‘visits_30days’ shows how many times a customer visited a website in the last 30 days. We want to create a new column that either categorizes these customers into ‘non-visitor’ or ‘visitor’ (a binary categorization) or categorizes them into multiple buckets such as ‘0 visits’, ‘1–5 visits’, ‘6-10 visits’, etc. We’ll name this new column ‘visits_category’.

Method 1: Use the numpy.where() function
The numpy.where()
function is an elegant and efficient Python function that you can use to add a new column based on ‘true’ or ‘false’ binary conditions. The syntax looks like this:
np.where(condition, value if condition is true, value if condition is false)
Applying the syntax to our dataframe, our code would look like this. The new column ‘visits_category’ has the value of either ‘Yes’ or ‘No’ depending on the condition of whether the value of the ‘visits_30days’ column is equal to 0 or not.
import pandas as pd
import numpy as np
df['visits_category'] = np.where(df['visits_30days']== 0, 'YES', 'NO')

Method 2: Use the lambda function
Like np.where()
, the lambda
function is another superb choice when you need to add a column based on a simple binary if-else
condition. The generic structure of the code using lambda
function is:
df['new column name'] = df['column name'].apply(lambda x: 'value if condition is true' if x condition else 'value if condition is false')
For our specific example, the code would look like this:
df['visits_category'] = df['visits_30days'].apply(lambda x: 'YES' if x == 0 else 'NO')
Method 3: Use the numpy.select() function
Now let’s move on to a more complicated scenario. Let’s say we want to create multiple categories for ‘visits_category’ involving multiple conditions. You won’t be able to achieve it by using np.where()
. Instead, you can use np.select()
to apply more complicated conditions.
First, create a python list
that specifies your conditions in the order of which one needs to be executed first, second, etc.
#Create a python list to specify all the conditions
conditions = [
(df['visits_30days'] == 0),
(df['visits_30days'] > 0) & (df['visits_30days'] <= 5),
(df['visits_30days'] > 5) & (df['visits_30days'] <= 10),
(df['visits_30days'] > 10)
]
Then, create a python list
of values that we want to assign to each condition. Make sure each value corresponds to the condition created in the first step.
#create a python list of values to be assigned to the conditions
values = ['0 visits', '1-5 visits', '6-10 visits', '>10 visits']
Finally, create the new column using np.select()
by using the two lists
you created before as the arguments.
# use np.select() to create a new column
df['visits_category'] = np.select(conditions, values)
Putting the three pieces of code together you get the following:
df['visits_category'] = np.select([
(df['visits_30days'] == 0),
(df['visits_30days'] > 0) & (df['visits_30days'] <= 5),
(df['visits_30days'] > 5) & (df['visits_30days'] <= 10),
(df['visits_30days'] > 10)
], ['0 visits', '1-5 visits', '6-10 visits', '>10 visits'])

Method 4: Use the Pandas apply()
function
With this method, we can first define a function that specifies the conditions. We then apply that function along columns (axis=1
).
def conditions(row):
if row['visits_30days'] == 0:
val = '0 visits'
elif row['visits_30days'] >0 & row['visits_30days']<=5:
val = '1-5 visits'
elif row['visits_30days'] >5 & row['visits_30days']<=10:
val = '5-10 visits'
elif row['visits_30days'] >10:
val = '>10 visits'
else:
val = 'NA'
return val
#Apply the function to each data point in the data frame
df['visits_category']= df.apply(conditions, axis=1)
Method 5: Use DataFrame.loc()
Pandas DataFrame.loc()
selects rows and columns by label(s) in a given DataFrame
. For example, in the code below, the first line of code selects the rows in the dataframe where the value of ‘visits_30days’ is equal to zero and assigns ‘0 visits’ to the new column ‘visits_category’ for only those rows that meet this specific condition. You then do the same thing for other conditions as well.
df.loc[(df['visits_30days'] == 0), 'visits_category'] = '0 visits'
df.loc[(df['visits_30days'] > 0) & (df['visits_30days'] <= 5), 'visits_category'] = '1-5 visits'
df.loc[(df['visits_30days'] > 5) & (df['visits_30days'] <= 10), 'visits_category'] = '5-10 visits'
df.loc[(df['visits_30days'] > 10) , 'visits_category'] = '>10 visits'
Learning and summarizing the most common pandas’ data wrangling techniques has always been a fun and helpful exercise in my daily data scientist life. Data wrangling is an important element in a data scientist’s day-to-day work and accounts for nearly 80% of the time spent on a data analytics project. Having a notepad nearby with all your favorite code snippets sorted and summarized is a great and efficient way to improve your productivity. I hope you enjoyed this tutorial and thanks for reading.
Data Source: The sample dataset used in this tutorial was created by the author for demonstration purposes.
You can unlock full access to my writing and the rest of Medium by signing up for Medium membership ($5 per month) through this referral link. By signing up through this link, I will receive a portion of your membership fee at no additional cost to you. Thank you!