Visualizing Cohort Retention Rates with pycohort Package

Tutorial about how to create a Python package on PyPI

Published in Towards Data Science, Jun 26, 2021

The main revenue generator for subscription businesses is recurring payments. There might be additional one-time offerings, but the number of subscribers opting in for additional charges is limited and harder to predict. That’s why the financial health of a subscription business is best measured by its retention rate. If users are satisfied with the product or service, they will likely stay until their needs change. It therefore matters which retention features an app offers and how they are presented to users, especially to those entering a cancellation funnel to end their recurring payments.

To quickly visualize retention rates in more detail, businesses use time-based cohort analysis. Each cohort consists of users who signed up in a particular time frame, with a granularity anywhere from weekly to quarterly. For each cohort, the percentage of retained customers is calculated for every time period up to the most recent one.

How to read time-based cohort analysis charts?

Below is an example of a time-based cohort analysis chart. The users are clustered into cohorts based on their subscription start month. Each row represents a cohort of subscribers. For each cohort, there is a starting retention rate in the first column. This is the percentage of users who remained one month after the subscription start date. The rates along each row represent the retention rate in a given month of the subscription journey. Each rate is the number of subscribers left after a certain month divided by the number of subscribers the cohort started with. Let’s take the cohort of users who signed up in January 2020. Two months after their subscription start date, 75% of the subscribers are still there. In other words, the churn rate is 25% in the first two months. Since the data pull ends in April 2021, there are 16 months for this cohort. After 16 months, 44% of the users are still subscribed. (Since this is an artificial dataset, the numbers might be higher than what is expected in a real case.)
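To make the calculation concrete, here is a minimal sketch of how the rates in one row of the chart could be derived. The subscriber counts are hypothetical and chosen only so that the second month matches the 75% mentioned above; they are not taken from the dataset behind the chart.

```python
# Hypothetical subscriber counts for one cohort (not the article's dataset).
cohort_size = 400                    # subscribers who signed up in the cohort month
still_active = [340, 300, 280, 264]  # subscribers remaining after months 1, 2, 3, 4


def retention_rates(cohort_size, still_active):
    """Return the retention rate (in percent) for each period of one cohort."""
    return [round(100 * n / cohort_size, 1) for n in still_active]


rates = retention_rates(cohort_size, still_active)
churn = [round(100 - r, 1) for r in rates]  # cumulative churn per period

print(rates)  # [85.0, 75.0, 70.0, 66.0]
print(churn)  # [15.0, 25.0, 30.0, 34.0]
```

Every rate in a row shares the same denominator, the size of the cohort at the start, which is why the values in a row can only stay flat or decrease over time.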

Can this view summarize everything about retention?

No! This chart sums up the retention rate for every cohort in each period of their subscription journey. But it is important to keep in mind that these are only rates. Without seeing the numerator and denominator, we shouldn’t jump to conclusions about whether the future of a company is in danger. For example, a new offering might increase traffic to the portal, which increases the number of subscriptions and therefore revenue. That is very desirable, yet the rates might suggest a different trend: although the absolute numbers grow, the new subscribers might churn faster. That’s why this visualization presents only a partial view of the business and might be misleading if used alone!
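A small numerical illustration of this point, using two made-up cohorts (the cohort labels and counts are assumptions for the sake of the example): the later cohort has a clearly lower retention rate, yet it keeps four times as many paying subscribers.

```python
# Hypothetical numbers for two cohorts, chosen only to illustrate the point.
cohorts = {
    "2020-01": {"signed_up": 200, "active_after_3_months": 150},
    "2020-06": {"signed_up": 1000, "active_after_3_months": 600},
}


def retention_rate(signed_up, still_active):
    """Retention rate in percent: numerator over denominator."""
    return 100 * still_active / signed_up


summary = {
    name: retention_rate(c["signed_up"], c["active_after_3_months"])
    for name, c in cohorts.items()
}

for name, rate in summary.items():
    retained = cohorts[name]["active_after_3_months"]
    print(f"{name}: retention {rate:.0f}%, retained subscribers {retained}")
# 2020-01: retention 75%, retained subscribers 150
# 2020-06: retention 60%, retained subscribers 600
```

Looking at the rates alone, June looks worse than January. Looking at the counts, June is the cohort that brings in more recurring revenue.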

How can I create a similar chart for my company?

To answer that question I built a Python package called pycohort. To use this package, the data needs to be in a certain format; otherwise, the functions will not work. First, there needs to be a unique identifier for each subscription created by a user, in a column called id. Then the start date of each subscription needs to be entered in the creation_date field. A metric_month row is added for every month in which the subscription is active, so every active month produces a new data point. The month when the subscription was last active needs to be in the last_active_month field. This is not necessarily equal to the cancellation month. If the subscription is still active, then last_active_month will be equal to the most recent month. The last column needed is active_duration, which is the time spent between creation_date and last_active_month.

Format of the dataframe that needs to be loaded into the pycohort functions

In a data format like this, the same subscription is repeated across several rows. For instance, for a subscription that is created in March 2020 and cancelled in May 2020, there will be three rows: the first with metric_month equal to March 2020, the second with metric_month equal to April 2020 and the third with metric_month equal to May 2020.
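The March to May example can be sketched in a few lines of plain Python. This is only an illustration of the row layout: the helper names, the "YYYY-MM" date strings, the id value and the choice to count active_duration in months are my assumptions here, not something pycohort prescribes.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def expand_subscription(sub_id, creation, last_active):
    """Build one row per active month for a single subscription."""
    # Assumption: active_duration is the number of months between the two dates.
    duration = (last_active[0] - creation[0]) * 12 + (last_active[1] - creation[1])
    return [
        {
            "id": sub_id,
            "creation_date": f"{creation[0]}-{creation[1]:02d}",
            "metric_month": f"{y}-{m:02d}",
            "last_active_month": f"{last_active[0]}-{last_active[1]:02d}",
            "active_duration": duration,
        }
        for y, m in month_range(creation, last_active)
    ]


# A subscription created in March 2020 and last active in May 2020.
rows = expand_subscription("sub_001", (2020, 3), (2020, 5))
for row in rows:
    print(row["id"], row["creation_date"], row["metric_month"], row["last_active_month"])
# sub_001 2020-03 2020-03 2020-05
# sub_001 2020-03 2020-04 2020-05
# sub_001 2020-03 2020-05 2020-05
```

A list of dictionaries like this can be turned into a dataframe with the same column names before it is passed to the package.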

Once the data is in the right format, there are five functions that help to understand retention. The first three calculate the mean, median and standard deviation, and are mainly meant for the active_duration field. The next function is called cohort_preprocess. It converts the data frame format explained above into a matrix where the retention rates of the different cohorts are calculated for each period. The last function, cohort_viz, takes the output of cohort_preprocess as input and prints the cohort analysis visualization.

# to download the package:
!pip install pycohort

# importing necessary functions from the package
from pycohort import calculate_mean
from pycohort import calculate_median
from pycohort import calculate_stdev
from pycohort import cohort_preprocess
from pycohort import cohort_viz

# calculating mean
calculate_mean(df.active_duration)
>>> 13.315

# calculating median
calculate_median(df.active_duration)
>>> 12

# calculating standard deviation
calculate_stdev(df.active_duration)
>>> 7.256

# printing the visualization
cohort_viz(cohort_preprocess(df))

Output of the cohort_preprocess function
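To give a feeling for what a preprocessing step of this kind does, here is a rough standard-library sketch that turns rows in the format described above into a cohort-by-period matrix of retention rates. It is not the code inside pycohort: the function names, the "YYYY-MM" date strings and the toy rows are assumptions, and unlike the chart above it keeps the signup month as period 0 (always 100%).

```python
from collections import defaultdict


def months_between(start, end):
    """Number of months from 'YYYY-MM' start to 'YYYY-MM' end."""
    sy, sm = map(int, start.split("-"))
    ey, em = map(int, end.split("-"))
    return (ey - sy) * 12 + (em - sm)


def retention_matrix(rows):
    """Map each cohort month to a list of retention rates (percent) per period."""
    # cohort -> period index -> set of subscription ids active in that period
    active = defaultdict(lambda: defaultdict(set))
    for row in rows:
        cohort = row["creation_date"]
        period = months_between(cohort, row["metric_month"])
        active[cohort][period].add(row["id"])

    matrix = {}
    for cohort, periods in active.items():
        size = len(periods[0])  # every subscription is active in its signup month
        matrix[cohort] = [
            round(100 * len(periods[p]) / size, 1) for p in sorted(periods)
        ]
    return matrix


# Toy data: (id, creation_date, metric_month), one tuple per active month.
toy = [
    ("a", "2020-01", "2020-01"), ("a", "2020-01", "2020-02"), ("a", "2020-01", "2020-03"),
    ("b", "2020-01", "2020-01"), ("b", "2020-01", "2020-02"),
    ("c", "2020-02", "2020-02"), ("c", "2020-02", "2020-03"),
    ("d", "2020-02", "2020-02"),
]
rows = [{"id": i, "creation_date": c, "metric_month": m} for i, c, m in toy]

print(retention_matrix(rows))
# {'2020-01': [100.0, 100.0, 50.0], '2020-02': [100.0, 50.0]}
```

Later cohorts naturally have shorter rows, which is what gives the cohort chart its triangular shape.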

How to bundle all the functions in a Python package?

Before publishing this code, I created a working directory called pycohort. Under that folder, there were readme.md, setup.py, license.txt and another folder called pycohort. In setup.py, the author needs to enter specific information about the package, as shown below. Note that the package name must not already be taken by another developer on PyPI. Also, if you want to upload a new version, you need to increment the version number. As you can see, it took me a couple of versions to adjust the code, add comments and increase the efficiency of the package.

# content of setup.py for pycohort
from setuptools import setup

setup(name='pycohort',
      version='2.6',
      description='pycohort package',
      url='https://github.com/demirkeseny',
      author='Yalim Demirkesen',
      author_email='yalimdemirkesen@gmail.com',
      license='MIT',
      packages=['pycohort'],
      zip_safe=False)

Under the pycohort folder in the working directory there are two files: __init__.py, which defines what users can import directly from the package, and pycohort.py, where all the above-mentioned functions are stored.

# content of  __init__.py for pycohort
from .pycohort import calculate_mean
from .pycohort import calculate_stdev
from .pycohort import calculate_median
from .pycohort import cohort_preprocess
from .pycohort import cohort_viz
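If the role of __init__.py is unclear, the following self-contained sketch may help. It builds a throwaway package in a temporary folder and shows that a function re-exported in __init__.py can be reached directly from the package. The names demo_pkg and core.py are hypothetical and have nothing to do with pycohort itself.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package named demo_pkg (a hypothetical name) in a temp folder.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()

# The module that holds the actual function.
(pkg / "core.py").write_text(
    "def calculate_mean(values):\n"
    "    return sum(values) / len(values)\n"
)

# __init__.py re-exports the function at the package level.
(pkg / "__init__.py").write_text("from .core import calculate_mean\n")

# Make the temp folder importable and load the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

# Thanks to __init__.py, the function is reachable directly from the package.
print(demo_pkg.calculate_mean([10, 12, 14]))  # 12.0
```

Without the line in __init__.py, users would have to write the longer from demo_pkg.core import calculate_mean instead.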

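The final file layout looked like this (reconstructed from the description above):

pycohort/
    readme.md
    setup.py
    license.txt
    pycohort/
        __init__.py
        pycohort.py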

Where and how do these files need to be uploaded?

The next step is to upload our files so that others can install them. The site to which we will upload our package is called PyPI. Whenever you install a Python package using pip install, you usually pull it from PyPI. It is best practice to first try uploading your package to test PyPI, which is a testing repository for PyPI. If it works fine, we can upload everything to PyPI. In order to upload our files, we need to create an account on both test PyPI and PyPI.

Once every file is ready and the accounts are created, we need to navigate in the console to the working directory where pycohort lies, using the cd command. Then we need to run python setup.py sdist. This command builds a source distribution and adds two folders to our working directory (dist and pycohort.egg-info) so that our package can be uploaded to and downloaded from PyPI. Next, we need to install the twine package, which is needed to push our files to the PyPI repository. Then we can upload our package to test PyPI and try to install it from the test repository. If there are no issues while uploading to test PyPI, we can move on to the regular PyPI. You can find all the necessary code below:

# navigate to pycohort working directory:
cd pycohort
# build the source distribution needed to upload your package:
python setup.py sdist
# install twine package:
pip install twine
# upload package to test PyPI:
twine upload --repository-url https://test.pypi.org/legacy/ dist/*
# install from test PyPI:
pip install --index-url https://test.pypi.org/simple/ pycohort
# upload package to PyPI:
twine upload dist/*
# install from PyPI:
pip install pycohort

Conclusion

If you want to visualize the retention trends of different cohorts, this time-based cohort analysis is a perfect option. Once your data is in the above-explained format, then it is even easier with the pycohort package.

Before making any decisions looking into this chart, please remember that this visualization is based on rates that don’t reflect sudden changes in numerator or denominator.

Please find the GitHub repo here and PyPI link here for a more detailed view on the package and the functions.

Special thanks to Gunsu Altindag for all the inspiration!