Measuring User Engagement with Cohort Analysis

SQL for heavy lifting, Pandas for deep diving

Mesut Can ALKAN
Towards Data Science



Companies need to measure user engagement and retention to quantify things such as (i) how engagement evolves over time, (ii) how sensitive it is to external events and seasonality, and (iii) how different user groups engage with the product. The product is often an app that users interact with to receive a service, so engagement can simply be defined as the user's activity in the app.

1-Introduction

In this post, I will show you an example of how user engagement can be investigated with “Cohort Analysis”.

“A cohort is a group of people who share a common characteristic over a certain period of time.”

To make things simpler, let’s assume this common characteristic is the week a user opens the company’s app for the first time. This sets the user’s status to active. If that user doesn’t open the app the next week, their status becomes inactive. Building on this, we can define the following statuses for active and inactive users, which will help us track different cohorts over time.

[Image by author: weekly status definitions (new, retained, regained, churned, active)]

With this logic, we can see that at any given week (both points are sketched in pandas right below),

  • The total of retained, regained, and churned users equals the cumulative sum of all new users from the weeks before that week.
  • Retention Rate is (sum of retained users) / (sum of retained users + sum of churned users).
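To make these two identities concrete, here is a minimal pandas sketch. It assumes the per-user weekly table with 0/1 flag columns that we will build in section 2.b; the function name and the extra column names are illustrative, not part of the original pipeline.

import pandas as pd

def weekly_identity_check(df_weekly_activity: pd.DataFrame) -> pd.DataFrame:
    # Weekly totals of each 0/1 flag across all users
    weekly = df_weekly_activity.groupby('year_week')[
        ['new_flag', 'retained_flag', 'regained_flag', 'churned_flag']].sum()
    # Identity 1: retained + regained + churned in a given week equals the
    # cumulative number of new users up to (but not including) that week
    weekly['returning_total'] = (weekly['retained_flag'] +
                                 weekly['regained_flag'] +
                                 weekly['churned_flag'])
    weekly['prev_cum_new'] = weekly['new_flag'].cumsum().shift(fill_value=0)
    # Identity 2: weekly retention rate
    weekly['retention_rate'] = weekly['retained_flag'] / (
        weekly['retained_flag'] + weekly['churned_flag'])
    return weekly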

Steps

a) Dummy Database

b) SQL Query

c) Returning to Pandas

d) Table Transformation

e) Time Cohorts

f) Behaviour Cohorts

Conclusion

2-Process

Let’s generate a dummy user activity database and start crunching it.

2.a) Dummy Database

Normally, companies have their own databases where they collect user activity data. However, for illustrative purposes, we will generate our own user activity table (with pandas and NumPy) and push it to a SQL database so that we can use SQL queries for the heavy lifting. Our activity table will have 3 basic columns: (i) user_id, (ii) activity_day, and (iii) app_opens for that user on that day.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import string
import seaborn as sns

pd.set_option('display.max_rows', 400)
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 150)

# ['A', 'B', ..., 'AA', 'AB', ..., 'ZY', 'ZZ']
ALPHABETICAL_LIST = (list(string.ascii_uppercase) +
                     [letter1 + letter2 for letter1 in string.ascii_uppercase
                      for letter2 in string.ascii_uppercase])

%matplotlib inline
plt.rcParams["figure.figsize"] = (40, 20)

The dummy user activity table is generated as follows:

userid_list = [letter + '{:02d}'.format(i) for letter in ALPHABETICAL_LIST
               for i in range(1, 21)][:14000]

np.random.seed(0)

# Every user gets two initial weeks of activity: 10-12 opens in week 1,
# 8-10 opens in week 2
app_opens = np.concatenate((
    np.random.choice(range(10, 13), size=len(userid_list)),
    np.random.choice(range(8, 11), size=len(userid_list)),
))
user_ids = np.concatenate((
    userid_list,
    userid_list,
))

# Each user activates on a random day within 180 days of 2020-01-06
app_activations = np.datetime64('2020-01-06') + np.random.choice(np.arange(0, 180), len(userid_list))
activity_days = np.concatenate((
    app_activations,
    app_activations + 7,
))

# From week 3 to week 16, keep a random 98% of the remaining users each
# week (i.e. 2% churn weekly), with 3-6 app opens per kept user
random_user_ids = [x for x in userid_list]
for days_passed in range(14, 106, 7):
    user_id_sample = int(np.floor(len(random_user_ids) * 0.98))
    if user_id_sample != 0:
        random_user_ids = [random_user_ids[x] for x in
                           np.random.choice(range(len(random_user_ids)),
                                            size=user_id_sample, replace=False)]
        user_ids = np.concatenate((user_ids, random_user_ids))
        app_opens = np.concatenate((app_opens,
                                    np.random.choice(range(3, 7),
                                                     size=user_id_sample,
                                                     p=[0.2, 0.3, 0.3, 0.2])))
        activity_days = np.concatenate((activity_days,
                                        [app_activations[userid_list.index(x)] + days_passed
                                         for x in random_user_ids]))

df_user_activity = pd.DataFrame(zip(user_ids, activity_days, app_opens),
                                columns=['user_id', 'activity_day', 'app_opens'])

# Drop 2 random activity rows per user to make the journeys less uniform
# (users with only 2 rows disappear from the table entirely)
df_user_activity = df_user_activity.groupby(['user_id'], group_keys=False
                                            ).apply(lambda df: df.sample(len(df) - 2)
                                                    ).sort_index().reset_index(drop=True)

# Cap app_opens at 5 for users left with a single activity row
df_user_activity['count'] = df_user_activity.groupby(['user_id'])['activity_day'].transform(len)
df_user_activity.loc[df_user_activity['count'] == 1, 'app_opens'] = np.minimum(5, df_user_activity['app_opens'])
df_user_activity = df_user_activity.drop(['count'], axis=1)
df_user_activity
[Image by author: preview of df_user_activity]

Let’s write this table to a SQL database.

from sqlalchemy import create_engine

SQL_ENGINE = create_engine('sqlite:///' + 'medium_project.db',
                           connect_args={'check_same_thread': False}).connect()

df_user_activity.to_sql(name='user_activity',
                        index=False,
                        con=SQL_ENGINE,
                        if_exists='replace')

2.b) SQL Query

The following query produces the view I needed for the cohort analysis. For the above-mentioned period (42 weeks), I created a table that shows user_id, week, and weekly flags (new, retained, regained, and churned). This table has 13,720 unique user_ids × 42 weeks = 576,240 rows.

In the query,

  • I create a weekly period table and join it with the daily user_activity data to fill in the missing weeks for every user. The goal is a view in which, for every user_id, we have 42 weekly rows and their total weekly app opens.
  • I start by turning every activity day into a year-week key and grouping by user to sum the app opens.
  • I then join this onto the period table and convert app_opens to 0 for the weeks with no activity.
  • Lastly, I calculate lifetime (cumulative sum), previous-week (lagged), and previous-lifetime (lagged cumulative sum) weekly_app_opens to assign the new, retained, active, churned, and regained flags.

Note: the row_number() columns are there to repeat the same daily_activity row 42 times.

[SQL query for user activity, by author]
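In spirit, the query looks like the minimal sketch below. It is illustrative rather than the exact production query: the weekly grid is built with a plain CROSS JOIN instead of the row_number() repetition trick, it assumes SQLite 3.25+ for window functions, and it reads the result straight into a data frame instead of going through a CSV.

WEEKLY_ACTIVITY_QUERY = """
WITH daily AS (  -- total app opens per user per year-week
    SELECT user_id,
           strftime('%Y-%W', activity_day) AS year_week,
           SUM(app_opens) AS weekly_app_opens
    FROM user_activity
    GROUP BY user_id, year_week
),
grid AS (  -- every user x every week in the period
    SELECT u.user_id, p.year_week
    FROM (SELECT DISTINCT user_id FROM daily) AS u
    CROSS JOIN (SELECT DISTINCT year_week FROM daily) AS p
),
weekly AS (  -- 0 app opens for the weeks with no activity
    SELECT g.user_id, g.year_week,
           COALESCE(d.weekly_app_opens, 0) AS weekly_app_opens
    FROM grid AS g
    LEFT JOIN daily AS d
      ON g.user_id = d.user_id AND g.year_week = d.year_week
),
lagged AS (  -- lifetime, previous-week and previous-lifetime app opens
    SELECT *,
           SUM(weekly_app_opens) OVER (PARTITION BY user_id ORDER BY year_week) AS lifetime_opens,
           COALESCE(LAG(weekly_app_opens) OVER (PARTITION BY user_id ORDER BY year_week), 0) AS prev_week_opens,
           COALESCE(SUM(weekly_app_opens) OVER (PARTITION BY user_id ORDER BY year_week) - weekly_app_opens, 0) AS prev_lifetime_opens
    FROM weekly
)
SELECT user_id, year_week, weekly_app_opens,
       CASE WHEN weekly_app_opens > 0 AND prev_lifetime_opens = 0 THEN 1 ELSE 0 END AS new_flag,
       CASE WHEN weekly_app_opens > 0 AND prev_week_opens > 0 THEN 1 ELSE 0 END AS retained_flag,
       CASE WHEN weekly_app_opens > 0 THEN 1 ELSE 0 END AS active_flag,
       CASE WHEN weekly_app_opens = 0 AND lifetime_opens > 0 THEN 1 ELSE 0 END AS churned_flag,
       CASE WHEN weekly_app_opens > 0 AND prev_week_opens = 0 AND prev_lifetime_opens > 0 THEN 1 ELSE 0 END AS regained_flag
FROM lagged
ORDER BY user_id, year_week
"""

df_weekly_activity = pd.read_sql(WEEKLY_ACTIVITY_QUERY, SQL_ENGINE)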

After saving this table as a CSV, I return to Pandas for the rest of the analysis.
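A quick shape check back in pandas confirms the grid described above:

# One row per user per week: 13,720 unique users x 42 weeks = 576,240 rows
assert df_weekly_activity.shape[0] == df_weekly_activity['user_id'].nunique() * 42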

Although it’s not in this post’s scope, if needed, this table could be further joined with other SQL tables to bring user segment-related information into the view.

[Image by author: preview of the weekly user activity table with the new/retained/active/churned/regained flags]

2.c) Returning to Pandas

The table below shows the overall weekly counts of new, retained, active, churned, and regained users.

df_overall_summary = df_weekly_activity.groupby(
    ['year_week']
)[['new_flag',
   'retained_flag',
   'active_flag',
   'churned_flag',
   'regained_flag']].sum().reset_index()
df_overall_summary
[Image by author: df_overall_summary, weekly totals of each flag]

For instance, if we look at the first 3 rows of the data frame above, we can see that in the first week of 2020, 435 users opened the app for the first time ever. The next week, out of these 435, 72 churned (for that week) and 363 were still active, in addition to the 523 new joiners. Plugging these into the retention-rate definition from the introduction gives 363 / (363 + 72) ≈ 83.4% for that week.

Now that we have our weekly activity view on a user basis, let’s analyse further and visualize cohorts.

2.d) Table Transformation

I will look at all users’ activity journeys by grouping same-week joiners together. My goal is to identify whether there is a drop in activity for everyone after the Nth week. For the users who have a new flag on Week X, I will track their active-user average throughout Week X (100% activity), Week X+1, Week X+2, Week X+3, Week X+4, and so on.

df_weekly_activity['new_flag_cumsum'] = df_weekly_activity.groupby(['user_id'])['new_flag'].cumsum()

# Filtering out the rows before each user started their journey
# (they weren't even a new user yet).
df_cohort_prep = df_weekly_activity[df_weekly_activity['new_flag_cumsum'] > 0].reset_index(drop=True)

# Attach each user's joining week ('joined_on') to all of their rows
df_cohort = df_cohort_prep.merge(
    df_cohort_prep[df_cohort_prep['new_flag'] == 1][['user_id', 'year_week']].rename(columns={'year_week': 'joined_on'}),
    how='left', on=['user_id']
)

# Nth_Week: how many weeks into their journey each row falls
# (Week 1 = the joining week; assumes 52 numbered weeks per year)
df_cohort['Nth_Week'] = (df_cohort['year_week'].str[:4].astype(int) * 52 +
                         df_cohort['joined_on'].str[:4].astype(int) * 52 * -1 +
                         df_cohort['year_week'].str[-2:].astype(int) +
                         df_cohort['joined_on'].str[-2:].astype(int) * -1
                         + 1
                         )

2.e) Time Cohorts

Week 1 is always 100% because that is the week in which every user in a cohort is new.

df_heatmap = pd.pivot_table(
    df_cohort[df_cohort['Nth_Week'] < 20],
    index=['joined_on'],
    columns=['Nth_Week'],
    values=['active_flag'],
    aggfunc=lambda x: np.sum(x) / len(x)
)

# Blank rows at the top and bottom to pad the heatmap
df_heatmap.loc[''] = np.nan
df_heatmap = df_heatmap.sort_index(axis=0)
df_heatmap.loc[' '] = np.nan

fig2 = plt.figure(figsize=(24, 16))
ax1 = sns.heatmap(df_heatmap.values,
                  xticklabels=[b for a, b in df_heatmap.columns],
                  yticklabels=df_heatmap.index.tolist(),
                  vmin=df_heatmap.min(axis=1).min(),
                  vmax=df_heatmap.max(axis=1).max(),
                  annot=True,
                  fmt=".2%",
                  cmap='YlGnBu',
                  linewidths=.005,
                  linecolor='#029386',
                  cbar=False,
                  )
ax1.set_title('Active Users %')
plt.subplots_adjust(hspace=0.35, wspace=0.2)
ax1.set_yticklabels(ax1.get_yticklabels(), rotation=0)
ax1.set_xlabel('Nth Week at [Company X]')
ax1.set_ylabel('Joined On Week')
fig2.savefig("Weekly Active Users % Medium.png", bbox_inches='tight')
[Image by author: heatmap of weekly active users % by joining week]

The heatmap shows that user activity drops to 80–85% in the 2nd week and keeps declining; for many of the cohorts, activity reaches 0% by the 16th week.

df_heatmap = pd.pivot_table(
    df_cohort[df_cohort['Nth_Week'] < 20],
    index=['joined_on'],
    columns=['Nth_Week'],
    values=['weekly_app_opens'],
    aggfunc=lambda x: np.mean(x)
)

df_heatmap.loc[''] = np.nan
df_heatmap = df_heatmap.sort_index(axis=0)
df_heatmap.loc[' '] = np.nan

fig2 = plt.figure(figsize=(24, 16))
ax1 = sns.heatmap(df_heatmap.values,
                  xticklabels=[b for a, b in df_heatmap.columns],
                  yticklabels=df_heatmap.index.tolist(),
                  vmin=df_heatmap.min(axis=1).min(),
                  vmax=df_heatmap.max(axis=1).max(),
                  annot=True,
                  cmap='YlGnBu',
                  linewidths=.005,
                  linecolor='#029386',
                  cbar=False,
                  )
ax1.set_title('App Opens Avg.')
plt.subplots_adjust(hspace=0.35, wspace=0.2)
ax1.set_yticklabels(ax1.get_yticklabels(), rotation=0)
ax1.set_xlabel('Nth Week at [Company X]')
ax1.set_ylabel('Joined On Week')
fig2.savefig("Weekly App Opens Avg Medium.png", bbox_inches='tight')

On average, users open the app 10–11 times in their first week and 6–7 times in their second. Before churning completely, they open the app 2–2.5 times a week.

[Image by author: heatmap of average weekly app opens by joining week]

2.f) Behaviour Cohorts

First, we find the row index at which each user churned for good and was never regained:

# Count active weeks backwards from the end of each user's journey; rows
# where the running count is still 0 come after their last-ever active week
df_churned_forever_index = df_cohort[
    df_cohort['active_flag'].eq(1)[::-1].astype(int).groupby(df_cohort['user_id']).cumsum().eq(0).sort_index(axis=0)
].groupby('user_id', sort=False
          )['Nth_Week'].idxmin().reset_index().rename(columns={'Nth_Week': 'index_value'})
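To see what the reversed cumulative sum is doing, here is a toy example on a hypothetical six-week journey (the user_id and flag values below are made up for illustration):

# Hypothetical user who is active in weeks 1, 2 and 4, then gone for good
toy = pd.DataFrame({'user_id': ['A01'] * 6,
                    'Nth_Week': [1, 2, 3, 4, 5, 6],
                    'active_flag': [1, 1, 0, 1, 0, 0]})
after_last_active = (toy['active_flag'].eq(1)[::-1].astype(int)
                     .groupby(toy['user_id']).cumsum().eq(0)
                     .sort_index(axis=0))
toy[after_last_active]['Nth_Week'].tolist()                     # [5, 6]
toy[after_last_active].groupby('user_id')['Nth_Week'].idxmin()  # index 4, i.e. Week 5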

We then keep only the rows from each user’s first week up to (and including) the week they churned for good:

df_cohort = df_cohort.merge(
    df_churned_forever_index,
    how='left', on=['user_id']
)
df_cohort_churned_forever = df_cohort[df_cohort.index <= df_cohort['index_value']].reset_index(drop=True)

For all churned users: how many churned after exactly 2 weeks, 3 weeks, … N weeks, and what are their weekly_app_opens averages?

df_users_churning_week = df_cohort_churned_forever.groupby(
    ['user_id']
)['Nth_Week'].max().reset_index()

# For every churning week N, average the weekly app opens of the users
# whose journey ended exactly on week N
list_to_append_leaving_users = []
for Nth_WEEK in range(2, 15):
    list_of_users_leaving_n_week = df_users_churning_week[
        (df_users_churning_week['Nth_Week'] == Nth_WEEK)]['user_id'].tolist()
    list_to_append_leaving_users.append(df_cohort_churned_forever[
        df_cohort_churned_forever['user_id'].isin(list_of_users_leaving_n_week)
    ].groupby(['Nth_Week'])['weekly_app_opens'].mean().reset_index().rename(
        columns={'weekly_app_opens': '{} Users Leaving in Week {}'.format(
            len(list_of_users_leaving_n_week), Nth_WEEK)}).set_index(['Nth_Week']))

df_churned_users_weekly_app_opens = pd.concat(list_to_append_leaving_users, sort=False, axis=1)
df_churned_users_weekly_app_opens.reset_index()
[Image by author: average weekly app opens for users leaving in week N]

As the table above shows, the users who churn in their 2nd week and are never regained are the ones who open the app only 4.7 times on average in their first week.

[Image by author]

3-Conclusion

Companies want to maintain, if not increase, their users’ activity. They also want early warning of any potential decline in user engagement, which means inspecting trends. In this post, I shared some Python and SQL code to demonstrate how retention can be analysed from a user activity table.

Multiple cohorts have been created as an example:

1- Time cohorts (Section 2.e, grouping users by their joining week and tracking their activity over time)

2- Behaviour cohorts (Section 2.f, grouping the users who churned and were never regained)
