Generating product usage data with Pandas

How can we approximate high-level user behavior?

Jan Osolnik
Towards Data Science



Insight into the behavioral usage of our product is crucial, as it indicates not only how well we acquire users but also how well we retain them. It helps us answer the question: how many users are really using our product? This is often referred to as the one metric that matters (OMTM), which helps us resist the temptation of using vanity metrics to quantify the growth of our business. More on the OMTM and product growth here (article) and here (book).

Whereas I normally prefer to explore existing datasets, I took a different approach this time by trying to approximate a real dataset from scratch.

There are two main reasons for this:

  • A deeper understanding of behavioral data. Just as there’s value in coding up an algorithm from scratch to understand its moving parts, I believe the same holds for a dataset. It answers the question: “How can I approximate the usage of a product?” There will certainly be some simplifying, but necessary, assumptions built into our behavioral model.
  • Exploration of Pandas to learn how to model data better. Effective data manipulation speeds up the data science workflow substantially and lets us dig deeper into the data we want to explore.

Our goal is to produce a dataset of 1000 users, each identified by a device uuid, with up to 20 months of usage. As we’ll see later, the simulated data converges nicely to the expected values. We add two processes to get closer to a real dataset:

  • Define a user’s cohort (when the user first used the product)
  • Define the user’s churn (when the user last used the product)

That enables us to get a more accurate representation of a dataset that we might work on in a company.

Final result

We want the final dataset to include data for all months of the device’s activity:

  • Numerical (usage) features: How much did the user use each feature in that month (we remove entries when the user wasn’t active)
  • Categorical features: Defined per device (e.g. platform, country)

To simplify the model, the usage features change per device from month to month, whereas the categorical features are static.

Preview of the final dataset

The final dataset looks like something we might get from querying our business’s relational database (e.g. MySQL, PostgreSQL): a users table containing the categorical features and an events table containing the usage features, joined on device_uuid to produce something like the preview above. The data infrastructure obviously varies by company, but that’s one of the ways we might arrive at this point.

This post focuses only on generating the data, but later we can use it for cohort analysis, for example a user retention curve, which will be explored in upcoming posts:

User retention plot based on the final data frame

When do we lose most of our users? Did the change to our onboarding process in the last product version improve retention? Is there a difference in retention between different segments of our users? With this kind of analysis we can answer questions like these and better understand how our users use the product.

You can find the notebook with all code used in this post here.

1. Defining the main parameters

We start by importing all the necessary dependencies and defining the main parameters mentioned above (1000 devices, 20 months, starting month in January 2016).

Importing the dependencies
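In outline, the setup looks roughly like this (the snippets in this post are simplified sketches of the notebook code, and variable names such as n_devices and start_month are illustrative):

```python
import uuid

import numpy as np
import pandas as pd

# Main parameters of the simulation
n_devices = 1000            # number of device uuids to generate
n_months = 20               # months of potential usage per device
start_month = "2016-01-01"  # first month of data (January 2016)
```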

2. Generating device uuids and combining them with datetime data

Generating device uuids

Generating device uuids
An example of two device uuids

Device uuids are unique identifiers for each device that’s used by a user.
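Generating them is roughly a one-liner with Python’s built-in uuid module:

```python
# One random identifier per device
device_uuids = [str(uuid.uuid4()) for _ in range(n_devices)]

device_uuids[:2]  # two identifiers of the shape shown above (values differ per run)
```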

Generating datetime

Generating datetime

Output:

('2016-01-01 00:00:00', '2017-08-01 00:00:00')

We can see that with the specified parameters (starting month, number of months) we will generate data from January 2016 to August 2017.
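In outline, the datetime generation can be done with pandas.date_range at monthly granularity, roughly:

```python
# Month-start timestamps covering the whole simulated period
dates = pd.date_range(start=start_month, periods=n_months, freq="MS")

str(dates.min()), str(dates.max())
# ('2016-01-01 00:00:00', '2017-08-01 00:00:00')
```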

Adding datetime
Adding datetime data

We get a dataframe with two columns, device_uuid and date. There are 1000 * 20 rows (number of device uuids * number of periods).
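One way to build this dataframe is a cross join of devices and months, for example via a MultiIndex (a simplified sketch, not necessarily the exact approach in the notebook):

```python
# Every device gets one row per month: 1000 * 20 = 20000 rows
df = pd.MultiIndex.from_product(
    [device_uuids, dates], names=["device_uuid", "date"]
).to_frame(index=False)

df.shape  # (20000, 2)
```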

3. Generating usage features

Now that our dataframe includes device uuid and datetime data, we generate the features that will describe product usage.

The first thing we do is to define:

  • The number of usage features we would like to generate
  • The ratio of how often the users use each feature relative to other features

While the number of features is self-explanatory, the feature ratio requires some further explanation.

We set the number of features to 3 and the ratio to 0.8. The first feature keeps the randomly generated number as it is (multiplied by 1). That’s where the ratio comes in: the second feature is used 0.8 times as much as the first, and the third 0.8 times as much as the second (roughly 0.6 of the first, after rounding).

That means that if the first feature is used 10 times, the second feature is used 8 times and the third 6 times. It’s just a simple way of not having all the usage features used to the same extent. In this case feature1 corresponds to the core feature of the product. This is of course chosen completely arbitrarily.

This tweak produces features that are consistently used in a similar manner, just like in a real product. A strong assumption here is that usage is homogeneous, i.e. all users use the features in the same proportions, which is sufficient for our purposes.

3.1. Calculate feature ratios

We then apply the ratio that we defined above.

Calculate feature ratios

Output:

Ratio per feature: {'feature1': 1.0, 'feature2': 0.8, 'feature3': 0.6}
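A sketch of how these ratios can be computed; here I assume they are rounded to one decimal, which matches the printed output above:

```python
n_features = 3
ratio = 0.8

# Each feature is used `ratio` times as much as the previous one
feature_ratios = {
    f"feature{i + 1}": round(ratio ** i, 1) for i in range(n_features)
}

feature_ratios  # {'feature1': 1.0, 'feature2': 0.8, 'feature3': 0.6}
```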

3.2. Assigning cohort groups to devices

Now we pick an arbitrary number of cohorts and assign each device to a specific cohort. Each cohort corresponds to the month in which the user started using the product. The number of cohorts is half the number of months in our data, in this case 20 / 2 = 10 cohorts. With 1000 users that means ~100 users per cohort.

Assigning cohort groups to devices
Cohort per device uuid
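A simplified sketch of the assignment, drawing a cohort index uniformly at random so each of the 10 cohorts ends up with roughly 100 devices (the seeded rng is reused in the later snippets):

```python
rng = np.random.default_rng(42)  # seeded for reproducibility

n_cohorts = n_months // 2  # 20 / 2 = 10 cohorts

# Cohort i corresponds to the i-th month in `dates`
cohort_per_device = {
    device: int(rng.integers(n_cohorts)) for device in device_uuids
}
```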

3.3. Adding cohort data

First we define a function that concatenates a series of dataframes into a single dataframe. We will use it whenever we get output as a series of dataframes, each representing the data for a specific device:

Concatenate series of dataframes into one dataframe

Then we define a function and add the assigned cohort groups to the main dataframe:

Adding cohorts to the main dataframe

Next, we use both functions to add the cohorts:

Applying both functions to all devices
Adding cohort data

As we can see in the example above, the first user’s cohort group is April 2016, so that’s when they started using the product. We removed the first three months of their data to simulate this.
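In outline, the two helpers look roughly like this (a sketch that assumes a device’s cohort month simply truncates the start of its history):

```python
def concat_device_frames(frames):
    """Concatenate per-device dataframes into a single dataframe."""
    return pd.concat(list(frames), ignore_index=True)


def add_cohort(device_df, cohort_idx):
    """Keep only the months from the device's cohort month onwards."""
    cohort_month = dates[cohort_idx]
    out = device_df[device_df["date"] >= cohort_month].copy()
    out["cohort"] = cohort_month
    return out


df = concat_device_frames(
    add_cohort(group, cohort_per_device[device])
    for device, group in df.groupby("device_uuid")
)
```

A quick sanity check such as df.groupby("device_uuid").size().mean() should land around 15 months per device, as shown below.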

Average number of months per device

Output:

15

Because we assigned cohort groups to users we removed quite a few rows in the process. We can see that now we’ve got on average 15 months of usage per user, down from 20.

3.4. Generating usage data

Here we generate a random usage number (0 to 14) for feature1. We then use the feature ratios to calculate the usage of feature2 and feature3 from the number generated for feature1.

Generating usage data
Generating usage data
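A sketch of this step, drawing uniformly random integers for feature1 and rounded multiples of it for the other features:

```python
# Random monthly usage between 0 and 14 for the core feature
base_usage = rng.integers(0, 15, size=len(df))

# Scale each feature by its ratio and round to whole counts
# (rows where a device wasn't active at all could additionally be dropped,
# as mentioned earlier)
for name, r in feature_ratios.items():
    df[name] = np.rint(base_usage * r).astype(int)
```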

3.5. Generating churn behavior

While we defined the lower bound of the time period of usage by assigning a cohort group, we will define the upper bound with churn. Churn happens when a user stops using the product.

In general we have 2 simplifying principles on top of which we determine how a user uses the product:

  • Churner stays a churner (after a month of no usage the user doesn’t use the product again)
  • 50% of users churn in the first month (about 50% of users have 1 month of data, 50% have more than that, thus longer usage)

The first principle simplifies the simulation; the second makes the data more realistic, as most users churn after just trying the product and never return.

We define the function for simulating churn:

Simulating churn

We apply it to our dataframe:

Applying the simulating_churn function to all devices

The dataframe has the same structure as before. The only difference is that we’ve removed additional rows to simulate each user’s churn.
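In outline, a churn simulation that satisfies both principles looks roughly like this: with 50% probability a device keeps only its first active month, otherwise it keeps a random prefix of its remaining months, so once it stops it never comes back. The function body is a sketch:

```python
def simulating_churn(device_df, rng):
    """Truncate a device's history at a randomly chosen churn month."""
    n_active = len(device_df)
    if n_active == 1 or rng.random() < 0.5:
        keep = 1  # ~50% of devices churn after their first month
    else:
        keep = int(rng.integers(2, n_active + 1))  # churn at some later month
    return device_df.sort_values("date").head(keep)


df = concat_device_frames(
    simulating_churn(group, rng) for _, group in df.groupby("device_uuid")
)
```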

Average number of months per device

Output:

5

We can see that we now have about 25% of all the months that we’ve initially generated for each user (5 / 20).

4. Generating categorical features

After the usage features, we define some categorical features to enrich our dataset. We will then be able to segment the users by platform, country or registration status, and try to find some patterns.

Do registered users use the product differently than unregistered ones (and if so, how)? What about users from a specific country? Where are most of our power users located, and what do they have in common?

Insights like these can help us focus our marketing efforts on the users that are most profitable for our business (highest LTV). More on that here.

Defining categorical features

Output:

{'country': ['NL', 'AU', 'FR'], 'platform': ['iOS', 'Android'], 'user_registered': [False, True]}
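In outline, this is just a dictionary mapping each feature to its possible variants:

```python
categorical_features = {
    "country": ["NL", "AU", "FR"],
    "platform": ["iOS", "Android"],
    "user_registered": [False, True],
}
```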

4.1. Generating categorical feature weights

We define weights for how likely each variant of a categorical feature is to be assigned to an individual device.

Defining weights for categorical features

If a categorical variable has two possible variants (e.g. 'iOS' and 'Android'), we expect the generated data to contain 70% of the first variant and 30% of the second.

If a categorical variable has three possible variants (e.g. 'NL', 'AU', 'FR'), we expect the generated data to contain 60% of the first variant, 30% of the second and 10% of the third.
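One way to express this, keyed by the number of variants (a sketch; the exact structure in the notebook may differ):

```python
# Probability of each variant, depending on how many variants a feature has
categorical_weights = {
    2: [0.7, 0.3],
    3: [0.6, 0.3, 0.1],
}
```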

4.2. Applying the categorical feature weights

We create a function to produce categorical features based on the defined parameters.

Generating categorical features

Next we apply the function:

Applying the generation of categorical features to all devices
Applying categorical feature weights

Each device has its own corresponding country, platform and registration status.
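A sketch of the generation step, drawing one weighted variant per feature for every device (the function name is illustrative):

```python
def generate_categorical_features(device_uuid, rng):
    """Draw one variant of every categorical feature for a single device."""
    row = {"device_uuid": device_uuid}
    for feature, variants in categorical_features.items():
        weights = categorical_weights[len(variants)]
        row[feature] = rng.choice(variants, p=weights)
    return row


categorical_df = pd.DataFrame(
    [generate_categorical_features(device, rng) for device in device_uuids]
)
```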

5. Merging all features into the final dataframe

We now have two dataframes, one containing datetime data and usage features and one containing categorical features. We next merge them together into the final dataframe.

Merging categorical and usage features

We then apply the function to all devices:

Merging all features

As we can see, the first device (da5e3464-d356-4572-b962-f2b57e730732) started using the product in May 2016, used it for three months and then churned. The user is from the Netherlands, uses an Android phone and hasn’t registered in the product.
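The merge itself is roughly a single join on device_uuid, assuming the usage dataframe df and the per-device categorical_df from the earlier sketches:

```python
# Attach the static per-device attributes to every monthly usage row
final_df = df.merge(categorical_df, on="device_uuid", how="left")

final_df.head()
```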

6. Checking simulation output in our final dataframe

We can check whether the parameters that we set manifested in the final dataframe as well.

Did we generate the data that we intended to? Why should this work? Because of the law of large numbers: the average of the results obtained from a large number of trials should be close to the expected value, and tends to get closer as more trials are performed. More on the law of large numbers here.

6.1. Checking usage features simulation

The average usage for each feature:

Checking simulating of usage features

The averages are approximately the same for both registered and unregistered users.

We then check the proportions for usage features (each feature 80% of the previous one).

Checking proportion feature 1 / feature 2

Output:

Feature 1 / Feature 2: 79% (expected value is 80%)

Checking proportion feature 2 / feature 3

Output:

Feature 2 / Feature 3: 75% (expected value is 80%)

As we can see in the output above, the proportions converge well to the expected values (note that feature3’s ratio is rounded to 0.6 of feature1, so feature 3 / feature 2 is expected to be 0.75 rather than 0.8). The numbers would get even closer if we increased the number of devices.
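In outline, the check looks something like this (the quantities computed are the second feature’s mean usage relative to the first’s, and the third’s relative to the second’s):

```python
means = final_df[["feature1", "feature2", "feature3"]].mean()

print(f"Feature 2 / Feature 1: {means['feature2'] / means['feature1']:.0%}")
print(f"Feature 3 / Feature 2: {means['feature3'] / means['feature2']:.0%}")
```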

6.2. Checking categorical features simulation

We check the proportions of categorical features with two possible variants. We set the weights for two-variant features to [0.7, 0.3]; user_registered, with its variants [False, True], is one of them.

Checking user_registered:

Checking the proportion for user_registered

Output:

32% of users are registered (expected value is 30%)
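A sketch of this check, counting each device once:

```python
# One row per device, then the share of registered devices
devices = final_df.drop_duplicates("device_uuid")

print(f"{devices['user_registered'].mean():.0%} of users are registered")
```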

Checking country:

We check the proportions of categorical features with three possible variants.

Plotting proportion of rows per country
Checking the simulation of the country feature

The plot above shows the percentage of all events per country. The proportions approximate the defined parameters ([0.6, 0.3, 0.1]).
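The proportions behind the plot can be computed over all rows (events), roughly:

```python
print(final_df["country"].value_counts(normalize=True))  # roughly 0.6 / 0.3 / 0.1
```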

6.3. Checking churn simulation

How many users churn in the first month, and how many keep using the product longer?

Output:

Number of m1 churners / Number of non-m1 churners: 489 / 455

Output:

48% of users churn in the first month (expected value is 50%)

As expected, approximately 50% of users churn in their first month of usage; the other 50% use the product longer.
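A sketch of this check, treating devices with exactly one month of data as first-month (m1) churners:

```python
months_per_device = final_df.groupby("device_uuid").size()
m1_churners = (months_per_device == 1).sum()

print(f"Number of m1 churners / Number of non-m1 churners: "
      f"{m1_churners} / {(months_per_device > 1).sum()}")
print(f"{m1_churners / len(months_per_device):.0%} of users churn in the first month")
```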

7. Next steps

The next step is to further manipulate the final dataframe to get a clearer understanding of the behavioral patterns of users in the product. We can produce visualizations such as the retention curve above to answer questions relating to some of the most crucial business problems: how successful are we at growing our business?
