
Fantastic Functions and Where to Use Them

Looped and Grouped

Photo by Rhii Photography on Unsplash

The data preparation process is the pillar of every analysis, since almost no data comes ready. With the magic of Python and pandas, this process can be simplified and rescued from repetitive code.

In this post, I will walk you through some of the lesser-used but very effective applymap, apply, and groupby functions, and the potential use cases where they can save you time while coding and increase the readability of your code.

For illustrative purposes, I will analyse the profitability and CAPEX needs of an imaginary store chain located across many cities. However, the code could be applied to a variety of fields and use cases.

Contents

Importing Libraries

Creating a Data Frame for Analysis

  1. Applymap: Looping Through an Entire DataFrame
  2. Nested Apply: The Inception
  3. Groupby Multiple DataFrames for Arithmetic Operations
  4. Groupby with Apply: Customised Functions
  5. Groupby to Slice a DataFrame

Conclusions

Importing Libraries

We import the following libraries to create data frames with random string and numeric data fields.

import pandas as pd
import numpy as np
import string
pd.set_option('display.max_rows', 400)
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 150)
# ['A', 'B', ..., 'AA', 'AB', ..., 'ZY', 'ZZ']
ALPHABETICAL_LIST = (list(string.ascii_uppercase) + 
                     [letter1+letter2 for letter1 in string.ascii_uppercase 
                      for letter2 in string.ascii_uppercase])

Creating a Data Frame for Analysis

Let’s create a list of 40 stores, that were opened between 1980 and 2010 across 10 different cities with a 20-year trajectory of revenues and costs.

PERIOD_COLUMNS = list(range(2021, 2041))
CITIES = np.array(sorted([ALPHABETICAL_LIST[x] for x in np.random.choice(range(10), 40)]))
STORES = np.array(sorted(['Street_' + ALPHABETICAL_LIST[x]
                          for x in np.random.choice(range(100), 40, replace=False)]))
STORE_OPENING_YEARS = np.random.choice(range(1980, 2010), 40)
STORE_REVENUES = np.random.uniform(low=500, high=1000, size=(40, 20))
STORE_COSTS = np.random.uniform(low=300, high=600, size=(40, 20))
df_Stores = pd.DataFrame(
    np.column_stack((CITIES, STORES, STORE_OPENING_YEARS)),
    columns=['City', 'Store', 'Opened']
).astype({'Opened': np.int64})
df_Stores
Dummy Data Frame for Stores, Photo by the author
df_Store_Revenues = df_Stores.drop(['Opened'], axis=1).reset_index(drop=True)
df_Store_Revenues[PERIOD_COLUMNS] = pd.DataFrame(STORE_REVENUES, columns=PERIOD_COLUMNS)
df_Store_Costs = df_Stores.drop(['Opened'], axis=1).reset_index(drop=True)
df_Store_Costs[PERIOD_COLUMNS] = pd.DataFrame(STORE_COSTS, columns=PERIOD_COLUMNS)

1- Applymap: Looping Through an Entire DataFrame

When used, applymap applies the lambda function passed to it to every cell of the DataFrame. (In recent pandas versions, applymap has been deprecated in favour of the equivalent DataFrame.map.)

Photo by Önder Örtel and Tine Ivanič on Unsplash

This "One-Liner" is very handy despite dealing with multiple data type fields.

For instance, we have store revenue trajectories like the picture below.

df_Store_Revenues.head(10)
Store Revenues, Photo by the author

If we don’t want to see lots of decimal points, we can simply do an applymap loop over all the cells that are not object types.

df_Store_Costs = df_Store_Costs.applymap(lambda x: np.round(x, 2) if not isinstance(x, str) else x)
df_Store_Revenues = df_Store_Revenues.applymap(lambda x: np.round(x, 2) if not isinstance(x, str) else x)
df_Store_Revenues.head(10)
Store Revenues, Rounded. Photo by the author

Use Cases: This function is helpful when we need to apply a one-line rule to every cell (e.g. scaling with a log or a multiplier, rounding up/down, replacing values, etc.).
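As a hedged sketch of the scaling use case (reusing df_Store_Revenues from above; the df_Log_Revenues name is just for illustration), a log transform can be looped over every non-string cell in exactly the same way:

# A minimal sketch of the log-scaling use case.
# Only non-string cells are transformed; the object-type City/Store fields pass through.
df_Log_Revenues = df_Store_Revenues.applymap(
    lambda x: np.round(np.log(x), 4) if not isinstance(x, str) else x)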

2- Nested Apply: The Inception

I really like this one as it reminds me of one of my favourite movies: "Inception". By nesting one apply inside another on a data frame, we can access both the index and the column name of each cell.

Photo by Christophe Hautier on Unsplash
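Before the full scenario, here is a minimal, self-contained sketch of the mechanics (the demo frame below is hypothetical and not part of the store data):

# Hypothetical 2x2 frame, just to show how nested apply exposes both coordinates.
demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])
# The outer lambda receives a column (x.name is the column label);
# the inner lambda receives a row of that column (y.name is the index label).
coords = demo.apply(lambda x: pd.DataFrame(x).apply(
    lambda y: '{}/{}'.format(y.name, x.name), axis=1))
# coords now holds 'r1/a', 'r2/a', 'r1/b' and 'r2/b' in the matching cells.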

Let’s consider a scenario where our stores need to go through a renewal of major equipment in the 25th year after opening, followed by an enhancement of the real estate (painting/flooring) over the subsequent 3 years.

RENEWAL_YEAR_AFTER_OPENING = 25  # years
ENHANCEMENT_DURATION_AFTER_RENEWAL = 3  # years
df_Store_CAPEX = df_Stores.copy().reset_index(drop=True)
df_Store_CAPEX[PERIOD_COLUMNS] = pd.DataFrame(np.full((40, 20), np.nan), columns=PERIOD_COLUMNS)
df_Store_CAPEX.head(5)
Photo by the author

We can classify whether a particular year is a typical maintenance, renewal, or enhancement year as follows:

df_Store_CAPEX[PERIOD_COLUMNS] = df_Store_CAPEX[PERIOD_COLUMNS].apply(
    lambda x: pd.DataFrame(x).apply(
        lambda y:
            # x.name is the period (column label); y.name is the store's row index
            'Maintenance' if x.name < (RENEWAL_YEAR_AFTER_OPENING
                                       + df_Store_CAPEX.loc[y.name, 'Opened'])
            else ('Renewal' if x.name == (RENEWAL_YEAR_AFTER_OPENING
                                          + df_Store_CAPEX.loc[y.name, 'Opened'])
                  else ('Enhancement' if x.name < (RENEWAL_YEAR_AFTER_OPENING
                                                   + df_Store_CAPEX.loc[y.name, 'Opened']
                                                   + 1 + ENHANCEMENT_DURATION_AFTER_RENEWAL)
                        else 'Maintenance')),
        axis=1))

For instance, Street_AR (opened in 1997) needs regular maintenance until 2022 and again after 2025, a renewal in 2022 (1997 + 25), and enhancement from 2023 to 2025.

Annual CAPEX Type, Photo by the author

Use Cases: This method is very similar to using Excel tables, where header cells are referenced in formulas within the table body.

In addition, when we have almost identical data frames in terms of index order and column names, this nested apply method can be used to make complex calculations across multiple data frames, as sketched below.
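As a hedged sketch of that multi-frame case (assuming, as holds here, that df_Store_Revenues and df_Store_Costs share the same row order; df_Profit_Nested is an illustrative name), one frame can reference another via y.name:

# Sketch: profit per cell, looking up the matching cell of the cost frame
# through the shared positional index. This only works if the row order matches.
df_Profit_Nested = df_Store_Revenues.copy()
df_Profit_Nested[PERIOD_COLUMNS] = df_Store_Revenues[PERIOD_COLUMNS].apply(
    lambda x: pd.DataFrame(x).apply(
        lambda y: y[x.name] - df_Store_Costs.loc[y.name, x.name], axis=1))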

3- Groupby Multiple DataFrames for Arithmetic Operations

The previous method was effective because we knew that the index order (the order of certain data fields) was the same across data frames. However, what if we weren’t the ones who created these tables in the first place, and we don’t know whether rows were added to or removed from these almost-alike tables?

If so, and if we need to do the four basic arithmetic operations (addition, subtraction, multiplication, division), then there is a safer, better way.

Photo by Recha Oktaviani on Unsplash

The key point is setting the object-type fields as indexes so that the numerical columns can be summed / multiplied with each other.

INDEX_KEYS = ['City', 'Store']
# We can put as many dfs as we like inside this list.
# By multiplying by -1 and summing at the end,
# we are effectively subtracting the cost table from the revenue table.
LIST_OF_DATAFRAMES_TO_SUM = [
    df_Store_Revenues.set_index(INDEX_KEYS),
    df_Store_Costs.set_index(INDEX_KEYS) * -1,
]
df_Store_Profit = pd.concat(
    LIST_OF_DATAFRAMES_TO_SUM,
    sort=False,
    keys=range(len(LIST_OF_DATAFRAMES_TO_SUM))
).groupby(
    level=[i + 1 for i in range(len(INDEX_KEYS))]
).apply(lambda x: x.sum()).reset_index()
df_Store_Profit

Street_A’s revenue and cost view is as follows:

Street_A’s Revenue, Photo by the author
Street_A’s Cost, Photo by the author

Street_A’s gross profit view is as follows:

Stores’ Gross Profit, Photo by the author

Multiplication (or division) is very similar to the previous method (addition / subtraction), with only a few differences.

INDEX_KEYS = ['City', 'Store']
# We can put as many dfs as we like inside this list.
# By raising to the power of -1 and multiplying at the end,
# we are effectively dividing the first table by the second table.
LIST_OF_DATAFRAMES_TO_MULTIPLY = [
    df_Store_Profit.set_index(INDEX_KEYS),
    df_Store_Revenues.set_index(INDEX_KEYS) ** -1,
]
df_Store_ProfitMargin = pd.concat(
    LIST_OF_DATAFRAMES_TO_MULTIPLY,
    sort=False,
    keys=range(len(LIST_OF_DATAFRAMES_TO_MULTIPLY))
).groupby(
    level=[i + 1 for i in range(len(INDEX_KEYS))]
).apply(lambda x: x.cumprod(skipna=False).iloc[-1]).reset_index()
df_Store_ProfitMargin

Street_A’s gross profit margin view is as follows:

Stores’ Gross Profit Margin, Photo by the author

Use Cases: These arithmetic operations are preferable when there are many data frames of multi-level data to sum / subtract or multiply / divide.

Please note that addition and multiplication cannot be blended in the same data frame list and must be run separately.
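To illustrate why this is safer than relying on row order, here is a quick, purely illustrative check (shuffled_costs and df_Profit_Check are hypothetical names): shuffling the rows of one input does not change the result, because the grouping is driven by the index keys.

# The concat / groupby sum is driven by the index keys,
# so a cost table in random row order yields the same profit figures.
shuffled_costs = df_Store_Costs.sample(frac=1, random_state=0)
df_Profit_Check = pd.concat(
    [df_Store_Revenues.set_index(INDEX_KEYS),
     shuffled_costs.set_index(INDEX_KEYS) * -1],
    sort=False, keys=[0, 1]
).groupby(level=[1, 2]).apply(lambda x: x.sum()).reset_index()
# df_Profit_Check matches df_Store_Profit (groupby sorts the group keys).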

4- Groupby with Apply: Customised Functions

We might need tailor-made functions that suit only our specific needs.

Photo by Salvador Godoy on Unsplash

For instance, what if we wanted to find, for every year, the stores that have a negative gross profit margin?

Stores’ Gross Profit Margin, Photo by the author

The key point is to create a function that builds a dictionary and returns it as a pandas Series indexed with the dictionary keys.

def find_stores_that_lose_money(x):
    d = {}
    for COLUMN in PERIOD_COLUMNS:
        MONEY_LOSING_STORES = x[x[COLUMN] < 0]
        if len(MONEY_LOSING_STORES) > 0:
            d['Loss_{}'.format(COLUMN)] = ' , '.join(list(MONEY_LOSING_STORES['Store'].values))
        else:
            d['Loss_{}'.format(COLUMN)] = np.nan
    return pd.Series(d, index=d.keys())

df_Stores_Losing_Money = df_Store_ProfitMargin.groupby(['City']).apply(find_stores_that_lose_money).reset_index()
df_Stores_Losing_Money

For every city and year, we found the stores that have a negative gross profit margin.

Loss Making Stores, Photo by the author

Use Cases: The sky is the limit. Just create a function that best meets your business needs and combine it with groupby magic, as in the sketch below.
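Here is one more hedged mini-example on the same frames (the metric and the count_loss_years name are hypothetical): counting the loss-making store-years per city takes only a slightly different function.

# Hypothetical companion metric: loss-making store-years per city.
def count_loss_years(x):
    # Count the negative-margin cells across all period columns of the group.
    return pd.Series({'Loss_Years': int((x[PERIOD_COLUMNS] < 0).sum().sum())})

df_Loss_Years = df_Store_ProfitMargin.groupby(['City']).apply(count_loss_years).reset_index()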

5- Groupby to Slice a DataFrame

We sometimes need to slice a data frame in a multi-level way to proceed with a subset.

Photo by Kabir Kotwal on Unsplash

For every city, we filter the rows of the stores with the highest profitability in 2021.

idx_max = df_Store_ProfitMargin.groupby(['City'])[2021].transform('max') == df_Store_ProfitMargin[2021]
df_Most_Profitable_Stores_2021 = df_Store_ProfitMargin[idx_max]
df_Most_Profitable_Stores_2021

Note the effect of the filter on the index numbers.

Stores with the Highest 2021 Profit per City, Photo by the author

Use Cases: This method can be very helpful when we need to keep the most recent row per category in a database history (e.g. the max of an Updated_At column per user and product), as sketched below.
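Here is a minimal sketch of that database use case (the df_History frame, its Updated_At column, and all values are hypothetical):

# Hypothetical history table: keep the most recent row per (User, Product).
df_History = pd.DataFrame({
    'User':       ['u1', 'u1', 'u2', 'u2'],
    'Product':    ['p1', 'p1', 'p1', 'p2'],
    'Updated_At': pd.to_datetime(['2021-01-01', '2021-03-01',
                                  '2021-02-01', '2021-02-15']),
})
idx_latest = (df_History.groupby(['User', 'Product'])['Updated_At']
              .transform('max') == df_History['Updated_At'])
df_Latest = df_History[idx_latest]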

Conclusions

In this post, I wanted to walk you through some of the lesser-used but effective looping and grouping functions that can increase the readability of your code, as well as save you time and add certainty. However, please note that loops are not preferred when dealing with data frames of millions of rows and should be approached cautiously; vectorised alternatives are usually faster, as sketched below.
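For example, the rounding in section 1 can be done with pandas’ built-in vectorised method on the numeric columns only (a sketch on the same revenue frame):

# Vectorised alternative to the applymap rounding in section 1:
# round only the numeric period columns, leaving the object columns untouched.
df_Store_Revenues[PERIOD_COLUMNS] = df_Store_Revenues[PERIOD_COLUMNS].round(2)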

