The data preparation process is the pillar of every analysis, since data almost never comes ready. With the magic of Python and pandas, this process can be simplified and rescued from repetitive code.
In this post, I will walk you through some lesser-used but very effective functions (applymap, apply, and groupby) and the potential use cases where they can save you time while coding and increase the readability of your code.
For illustration, I will analyse the profitability and CAPEX needs of an imaginary store chain located across many cities. However, the code can be applied to a wide variety of fields and use cases.
Contents
- Creating a Data Frame for Analysis
- Applymap: Looping Through an Entire DataFrame
- Nested Apply: The Inception
- Groupby Multiple DataFrames for Arithmetic Operations
- Groupby with Apply: Customised Functions
- Groupby to Slice a DataFrame
Importing Libraries
We import the following libraries and set a few display options to create data frames with random string and numeric data fields.
import pandas as pd
import numpy as np
import string
pd.set_option('display.max_rows', 400)
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 150)
# ['A', 'B', ..., 'AA', 'AB', ..., 'ZY', 'ZZ']
ALPHABETICAL_LIST = (list(string.ascii_uppercase) +
[letter1+letter2 for letter1 in string.ascii_uppercase
for letter2 in string.ascii_uppercase])
Creating a Data Frame for Analysis
Let’s create a list of 40 stores that opened between 1980 and 2010 across 10 different cities, with a 20-year trajectory of revenues and costs.
PERIOD_COLUMNS = [i for i in range(2021,2041)]
CITIES = np.array(sorted([ALPHABETICAL_LIST[x] for x in np.random.choice(range(10), 40)]))
STORES = np.array(sorted(['Street_' +
ALPHABETICAL_LIST[x] for x in np.random.choice(range(100), 40, replace=False)]))
STORE_OPENING_YEARS = np.random.choice(range(1980, 2010), 40)
STORE_REVENUES = np.random.uniform(low=500, high=1000, size=(40, 20))
STORE_COSTS = np.random.uniform(low=300, high=600, size=(40, 20))
df_Stores = pd.DataFrame(
np.column_stack((
CITIES,
STORES,
STORE_OPENING_YEARS,
)),
columns = (['City', 'Store', 'Opened'])
).astype({'Opened': np.int64})
df_Stores

df_Store_Revenues = df_Stores.drop(['Opened'], axis=1).reset_index(drop=True)
df_Store_Revenues[PERIOD_COLUMNS] = pd.DataFrame(STORE_REVENUES, columns=PERIOD_COLUMNS)
df_Store_Costs = df_Stores.drop(['Opened'], axis=1).reset_index(drop=True)
df_Store_Costs[PERIOD_COLUMNS] = pd.DataFrame(STORE_COSTS, columns=PERIOD_COLUMNS)
1- Applymap: Looping Through an Entire DataFrame
When used, applymap applies the lambda function inside it to every cell of the dataframe (in newer pandas versions, the equivalent method is DataFrame.map).


This one-liner is very handy even when the dataframe contains fields of multiple data types.
For instance, we have store revenue trajectories like the picture below.
df_Store_Revenues.head(10)

If we don’t want to see lots of decimal points, we can simply do an applymap loop over all the cells that are not object (string) types.
df_Store_Costs = df_Store_Costs.applymap(lambda x: np.round(x,2) if type(x)!=str else x)
df_Store_Revenues = df_Store_Revenues.applymap(lambda x: np.round(x,2) if type(x)!=str else x)
df_Store_Revenues.head(10)

Use Cases: This function is helpful when we need to apply a one-liner rule to every cell (e.g. scaling with a log / multiplier, rounding up / down, replacing values, etc.), as sketched below.
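As a hedged sketch of the scaling use case (df_Log_Revenues is a hypothetical name, assuming we want natural-log revenues while leaving the string fields untouched):
df_Log_Revenues = df_Store_Revenues.applymap(
    # log-scale numeric cells only; leave the City / Store strings as they are
    lambda x: np.log(x) if type(x) != str else x)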
2- Nested Apply: The Inception
I really like this one as it reminds me of one of my favourite movies: "Inception". By applying a nested lambda function on a data frame, we can get both the index and the column name of each cell.
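Before the business scenario, here is a minimal sketch (df_demo is a hypothetical 2x2 frame) of how the nesting exposes both labels: the outer x.name is the column label, while the inner y.name is the row index.
df_demo = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
# every cell becomes 'rowindex-columnname', e.g. '0-a', '1-b'
df_demo.apply(lambda x: pd.DataFrame(x).apply(
    lambda y: '{}-{}'.format(y.name, x.name), axis=1))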

Let’s consider a scenario where our stores need a renewal of major equipment in the 25th year after opening, followed by an enhancement of the real estate (painting / flooring) for the subsequent 3 years.
RENEWAL_YEAR_AFTER_OPENING = 25 #years
ENHANCEMENT_DURATION_AFTER_RENEWAL = 3 #years
df_Store_CAPEX = df_Stores.copy().reset_index(drop=True)
df_Store_CAPEX[PERIOD_COLUMNS] = pd.DataFrame(np.array([[np.nan ]* 20] * 40), columns=PERIOD_COLUMNS)
df_Store_CAPEX.head(5)

We can determine whether a particular year is a typical maintenance, renewal, or enhancement year as follows:
df_Store_CAPEX[PERIOD_COLUMNS] = df_Store_CAPEX[PERIOD_COLUMNS].apply(
    # outer apply: x is a year column, so x.name is the year label
    lambda x: pd.DataFrame(x).apply(
        # inner apply: y is a row, so y.name is the store's row index
        lambda y:
        'Maintenance' if x.name < (RENEWAL_YEAR_AFTER_OPENING +
                                   df_Store_CAPEX.loc[y.name, 'Opened'])
        else ('Renewal' if x.name == (RENEWAL_YEAR_AFTER_OPENING +
                                      df_Store_CAPEX.loc[y.name, 'Opened'])
              else ('Enhancement' if x.name < (RENEWAL_YEAR_AFTER_OPENING +
                                               df_Store_CAPEX.loc[y.name, 'Opened'] +
                                               1 + ENHANCEMENT_DURATION_AFTER_RENEWAL)
                    else 'Maintenance')),
        axis=1))
For instance, Street_AR (opened in 1997) needs maintenance before 2022 and again after 2025, a renewal in 2022 (1997 + 25), and enhancements from 2023 to 2025.
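We can inspect that store's row to verify (store names are randomly generated, so 'Street_AR' may differ on your run; this is just an inspection sketch):
df_Store_CAPEX[df_Store_CAPEX['Store'] == 'Street_AR']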

Use Cases: This method is very similar to Excel tables, where header cells are referenced by the formulas inside the table body.
In addition, when data frames are almost identical in index order and column names, this nested apply method can be preferred for making complex calculations across multiple data frames.
3- Groupby Multiple DataFrames for Arithmetic Operations
The previous method was effective because we knew that the index order (the order of certain data fields) was the same across data frames. However, what if it wasn’t us who created these tables in the first place, and we don’t know whether anything has been added to or removed from these almost-identical tables?
If so, and if we need to do the four basic arithmetic operations (addition, subtraction, multiplication, division), then there is a safer, better way.

The key point is setting the object type fields as indexes so that the numerical columns can be summed / multiplied with each other.
INDEX_KEYS = ['City', 'Store']
# we can put as many dfs as we like inside this list;
# by multiplying by -1 and summing at the end,
# we're basically subtracting the negative table from the positive one.
LIST_OF_DATAFRAMES_TO_SUM = [
df_Store_Revenues.set_index(INDEX_KEYS),
df_Store_Costs.set_index(INDEX_KEYS) * -1,
]
df_Store_Profit = pd.concat(LIST_OF_DATAFRAMES_TO_SUM,
sort=False,
keys=range(len(LIST_OF_DATAFRAMES_TO_SUM))
).groupby(
level=[i+1 for i in range(len(INDEX_KEYS))]
).apply(lambda x : (x.sum())
).reset_index()
df_Store_Profit
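For the simple two-frame case, pandas’ built-in index alignment gives the same result and can serve as a cross-check (df_Profit_Check is a hypothetical name, not part of the original pattern):
# subtract costs from revenues, aligning on the (City, Store) index
df_Profit_Check = (df_Store_Revenues.set_index(INDEX_KEYS)
                   .sub(df_Store_Costs.set_index(INDEX_KEYS))
                   .reset_index())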
Street_A’s revenue and cost view is as follows:


Street_A’s gross profit view is as follows:

Multiplication (or division) is also very similar to the previous method (addition / subtraction), with only a few differences.
INDEX_KEYS = ['City', 'Store']
# we can put as many dfs as we like inside this list;
# by raising to the power of -1 and multiplying at the end,
# we're basically dividing the first table by the second.
LIST_OF_DATAFRAMES_TO_MULTIPLY = [
df_Store_Profit.set_index(INDEX_KEYS),
df_Store_Revenues.set_index(INDEX_KEYS) ** -1,
]
df_Store_ProfitMargin = pd.concat(LIST_OF_DATAFRAMES_TO_MULTIPLY,
sort=False,
keys=range(len(LIST_OF_DATAFRAMES_TO_MULTIPLY))
).groupby(
level=[i+1 for i in range(len(INDEX_KEYS))]
).apply(lambda x : (x.cumprod(skipna=False).iloc[-1])
).reset_index()
df_Store_ProfitMargin
Street_A’s gross profit margin view is as follows:

Use Cases: These arithmetic operations are preferable when there are many dataframes of multi-level data to sum / subtract or multiply / divide.
Please note that sums and multiplications cannot be blended in the same dataframe list; they must be run separately. For the two-frame case, an index-aligned cross-check is sketched below.
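Similarly to the subtraction cross-check earlier, a hedged sketch for the two-frame division (df_Margin_Check is a hypothetical name):
# divide profit by revenues, aligning on the (City, Store) index
df_Margin_Check = (df_Store_Profit.set_index(INDEX_KEYS)
                   .div(df_Store_Revenues.set_index(INDEX_KEYS))
                   .reset_index())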
4- Groupby with Apply: Customised Functions
We might need tailor-made functions that suit only our specific needs.

For instance, what if we wanted to find the stores that have a negative gross profit margin for every year?

The key point is to create a function with an empty dictionary inside and to return it as a pandas Series indexed by the dictionary keys.
def find_stores_that_lose_money(x):
    d = {}
    for COLUMN in PERIOD_COLUMNS:
        MONEY_LOSING_STORES = x[x[COLUMN] < 0]
        if len(MONEY_LOSING_STORES) > 0:
            d['Loss_{}'.format(COLUMN)] = ' , '.join(list(
                MONEY_LOSING_STORES['Store'].values))
        else:
            d['Loss_{}'.format(COLUMN)] = np.nan
    return pd.Series(d, index=d.keys())
df_Stores_Losing_Money = df_Store_ProfitMargin.groupby(['City']).apply(find_stores_that_lose_money).reset_index()
df_Stores_Losing_Money
For every city per year, we found the stores that have a negative gross profit margin.

Use Cases: The sky is the limit. Just create a function that best meets your business needs and combine it with groupby magic, as in the sketch below.
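As another sketch of the same pattern, a hypothetical helper that counts (rather than lists) the money-losing stores per city:
def count_stores_losing_money(x):
    d = {}
    for COLUMN in PERIOD_COLUMNS:
        # number of stores in this city with a negative margin that year
        d['LossCount_{}'.format(COLUMN)] = int((x[COLUMN] < 0).sum())
    return pd.Series(d, index=d.keys())
df_Loss_Counts = df_Store_ProfitMargin.groupby(['City']).apply(count_stores_losing_money).reset_index()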
5- Groupby to Slice a DataFrame
We sometimes need to slice a dataframe at multiple levels to proceed with a subset.

For every city, we filter the rows of the stores with the highest profitability in 2021.
idx_max = df_Store_ProfitMargin.groupby(['City'])[2021].transform('max') == df_Store_ProfitMargin[2021]
df_Most_Profitable_Stores_2021 = df_Store_ProfitMargin[idx_max]
df_Most_Profitable_Stores_2021
Please note the filter’s effect on the index numbers.

Use Cases: This method can be very helpful when we need to keep the most recent row per category in a database history (i.e. the max of an Updated_At column per user and product), as sketched below.
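A hedged sketch of that database use case, assuming a hypothetical df_History frame with User, Product and Updated_At columns:
# keep only each (User, Product) pair's most recently updated row
idx_latest = (df_History.groupby(['User', 'Product'])['Updated_At']
              .transform('max') == df_History['Updated_At'])
df_Latest_Rows = df_History[idx_latest]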
Conclusions
In this post, I wanted to walk you through some lesser-used but effective looping and grouping functions that can increase the readability of your code as well as save you time and add certainty. However, please note that loops are not preferred when dealing with dataframes of millions of rows, and they should be approached cautiously.