Firebase Event Analytics with Google BigQuery

An extensive guide to help you get started with app analytics

Kritika Jalan
Towards Data Science


I recently started working with event analytics. Like most mobile apps, the app I work on is linked to Firebase for event tracking. It was difficult to find my way around at first, especially because there are limited resources online on the topic. After four months in the system, I have gathered some understanding and experience, which I am sharing in this easy-to-digest post on Firebase Analytics. (Some information in this post is taken as is, directly from Google's docs.) Let's get started!

How beautiful is this,
Sitting here at my desk,
Looking out the window, and seeing the entire city,
Sprawled in front of my eyes

The end of concrete jungle, marked by the river,
Running through the entire length of the city

A world, a very different world lives across the river.
Better or worse, you ask.
Let’s not label everything on the scale of goodness,
When all things really are, are just different.

What is Firebase Analytics?

As Wikipedia puts it, Firebase provides a real-time database and backend as a service. The service gives application developers an API that allows application data to be synchronized across clients (users, essentially) and stored on Firebase's cloud.

At the heart of Firebase is Google Analytics, an analytics solution that integrates across Firebase features and provides reporting for up to 500 distinct event types that can be defined using the Firebase SDK.

In short, Firebase lets you track what your users are doing on the app (events), saves it on its cloud (real-time database) and also lets you visualize this data.

Below is a screenshot of a Firebase Analytics Dashboard. You can choose to see performance for different time periods. On the dashboard, you'll get information about your active users, their engagement (how long they spend on which screen), conversion events (when an in-app purchase is made, for example), your audience's geography, their retention, their mobile platform and a lot more.

Firebase Analytics Dashboard showing real-time information

What do we mean by events?

As Google’s documentation says, events provide insight on what is happening in your app, such as user actions, system events, or errors.

Automatic Events

There are some events (and user properties) that Firebase tracks automatically, for example the first open of the app after installation, a notification being dismissed, an OS update, an app update and so on.

Custom Events

Apart from the automatic events, developers can create their own events to track user activity or system events/errors. There is no limit on the total volume of events your app logs (the 500 limit applies to distinct event types). For example, if you run a dating app, you can choose to record an event when a user swipes right on someone, or when a user changes their search preference from a radius of 5 km to 500 km. If you run an e-commerce app, you can choose to record an event when a user adds or removes an item from the cart, when a user checks out, and so on.
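Once the BigQuery export described later in this post is set up, a quick way to see which automatic events (like first_open or session_start) and custom events are actually being logged is to count them by name. A minimal sketch, assuming the analytics_xxxxxxxxx dataset and events_YYYYMMDD table naming covered in the export section below:

SELECT event_name, count(*) AS event_count
FROM `analytics_xxxxxxxxx.events_*`
WHERE _table_suffix BETWEEN '20200101' AND '20200131'
GROUP BY event_name
ORDER BY event_count DESC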

Event Logging

With each event, Firebase lets you track a bunch of other relevant information. It can be grouped into event properties (the screen from which the event originated, which user ID was swiped right on, what new search distance was set, etc.), user properties (user ID, age, gender, etc.), device properties (device type, brand, model, OS, language, etc.), location properties (the country, region and city where the event originated, etc.) and app properties (app version, the store through which the app was installed, etc.).

List of events configured for this mobile app

If you need answers to questions that the Firebase dashboard doesn't provide, you can connect your Analytics data to BigQuery, which allows for more complex analysis like querying large data sets, joining with other data sources, creating business reports, etc. That is what we will be talking about next.


Bring Data to BigQuery

Since you want to import your raw data from Firebase into BigQuery, you are likely interested in doing user-level analysis, and for this you must set a User ID in order to track users across platforms and devices. Why?

Firebase doesn't identify unique users on your platform. All it does is track activity per unique app instance, identified by user_pseudo_id. This means that if you uninstall your dating app and reinstall it, if you change your device, or if you use multiple apps from the same provider, Firebase has no automatic way of knowing that it's all the same person. To be able to identify an individual on your platform, you need to set user IDs yourself. After the user ID is set, all your events are automatically tagged with this value.
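Once the user ID is being set, a quick sanity check in the export (described in the next section; the dataset name is a placeholder) is to count how many app instances (user_pseudo_id) each user_id maps to. Reinstalls and device switches show up as multiple instances for the same user:

SELECT user_id, count(DISTINCT user_pseudo_id) AS app_instances
FROM `analytics_xxxxxxxxx.events_*`
WHERE _table_suffix BETWEEN '20200101' AND '20200131'
AND user_id IS NOT NULL
GROUP BY user_id
ORDER BY app_instances DESC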

Questions! How do I set user ID? How do I connect data from Firebase to BigQuery?
Answer — Well, this isn’t something you will do as a data scientist. The app developer is better equipped to do it all for you. Here is the guide from Google they can use to connect Firebase data to BigQuery. Now that we have our raw data pushed into BQ, let’s hop on to BigQuery to see how we can work with it!

Raw Events Export in BigQuery

(I know it looks scary :p)

I had never seen a table with repeated records before this one. A few things to know about the exported data -

  • After linking a BQ project with Firebase, the first export creates a dataset named analytics_xxx, where xxx represents your Analytics Property ID (don't worry about this).
  • In this dataset, a table named events_YYYYMMDD is exported for each day (date-sharded BQ tables, one per day).
  • You will also see an events_intraday_YYYYMMDD table, which is the real-time table being updated with the current day's raw events.
  • Since there is one table per day, you can either query a single day's table or filter across multiple tables with a wildcard, using
    FROM `analytics_xxxxxxxxx.events_20200131`
    or
    FROM `analytics_xxxxxxxxx.events_*`
    WHERE _table_suffix BETWEEN '20190201' AND '20200131'
  • These are event-level tables with repeated records for event properties, user properties, device properties, etc. What this means is that each event carries multiple event properties, which are saved as key-value pairs (an array) instead of as columns in a flat table, as shown below.
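To get a feel for this structure, here is a minimal peek at one day's export (the dataset name analytics_xxxxxxxxx is a placeholder). The flat columns come back as ordinary values, while event_params comes back as repeated key-value records:

SELECT
event_name,
event_timestamp,
event_params, -- repeated key/value records, one per event parameter
user_pseudo_id,
device.category,
device.operating_system,
geo.country,
app_info.version
FROM `analytics_xxxxxxxxx.events_20200131`
LIMIT 10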

Due to this structure, two things happen,

  • You won't be able to query these columns the way you would a regular flat table. For example, when you want to filter your engagement events on the user_gender property, you will not be able to simply write WHERE user_gender = 'Male'
  • You won't be able to export this kind of nested table structure to Google Sheets or to your local machine as a flat file.

To deal with this table format, let's talk about some querying styles and functions.

UNNEST Function

As you might have guessed, the UNNEST function helps in flattening our (at first) confusing table. UNNEST takes an array and breaks it out into its individual elements. For example, the query to count male users on your app for the week of 20th January will look something like -

SELECT count(DISTINCT user_id) FROM
(SELECT user_id,
(SELECT value.string_value FROM UNNEST(user_properties) WHERE key = "user_gender") AS gender
FROM `analytics_xxxxxxxxx.events_*`
WHERE _table_suffix BETWEEN '20200120' AND '20200126'
)
WHERE gender = "Male"

Here, the UNNEST function took the user_properties array and unfolded it. From all the key-value pairs for user properties, we filtered for user_gender and extracted the string value where we had saved 'Male' or 'Female'. We aliased this extraction as gender and then finally filtered for Male in the enclosing query.
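The same pattern works for event parameters. As a sketch, assume a hypothetical custom event search_preference_changed with an integer parameter search_distance (both names are made up, following the dating-app example earlier). The new distances set during the same week could be pulled out like this:

SELECT
user_pseudo_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = "search_distance") AS search_distance
FROM `analytics_xxxxxxxxx.events_*`
WHERE _table_suffix BETWEEN '20200120' AND '20200126'
AND event_name = "search_preference_changed"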

Complex Queries in WITH … AS Style

While creating a report that involves multiple steps, say when calculating retention, you will need to create a few intermediate tables before you can finally write something like select users and their day-7 retention for a period of 30 days from table x.

Now, BigQuery isn't SAS or Python, where you can write multiple steps and run them all at once. The platform only runs one query at a time. Here the WITH … AS style comes to your rescue. It lets you create multiple intermediate tables within the same query using table aliases. For example, retention -

with
cohort_items as (
  SELECT user_pseudo_id,
    MIN(TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY)) as cohort_day
  FROM `analytics_xxx.events_*`
  WHERE _table_suffix BETWEEN '20170101' AND '20191231'
    AND event_name = 'first_open'
  GROUP BY 1
),

user_activities AS (
  SELECT A.user_pseudo_id, DATE(C.cohort_day) as cohort_day,
    DATE_DIFF(DATE(TIMESTAMP_TRUNC(TIMESTAMP_MICROS(event_timestamp), DAY)), DATE(C.cohort_day), DAY) AS day_number
  FROM `analytics_xxx.events_*` A
  LEFT JOIN cohort_items C ON A.user_pseudo_id = C.user_pseudo_id
  WHERE _table_suffix BETWEEN '20170101' AND '20191231'
  GROUP BY 1, 2, 3
),

cohort_size AS (
  SELECT cohort_day, count(1) as number_of_users
  FROM cohort_items
  GROUP BY 1
),

retention_table AS (
  SELECT C.cohort_day, A.day_number, COUNT(1) AS number_of_users
  FROM user_activities A
  LEFT JOIN cohort_items C ON A.user_pseudo_id = C.user_pseudo_id
  GROUP BY 1, 2
)

SELECT
  B.cohort_day,
  ifnull(B.number_of_users, 0) as total_users,
  B.day_number,
  safe_divide(ifnull(B.number_of_users, 0), S.number_of_users) as retention
FROM retention_table B
LEFT JOIN cohort_size S ON B.cohort_day = S.cohort_day
WHERE B.cohort_day IS NOT NULL AND B.day_number >= 0 AND B.day_number < 31

Here, we created four intermediate tables, named cohort_items, user_activities, cohort_size and retention_table, before using them to aggregate and get our final retention numbers.

cohort_items — Identifying the new user cohort based on the date they first opened the application

user_activities — For these identified users, getting their daily activity on the app. If a user logged a single event on Firebase, we consider them active for that day.

cohort_size — Counting how many users fall into each cohort day; this is the denominator for the retention percentage.

retention_table — Aggregating the activity table at the cohort-day level (cohort_day) and identifying, for each day since install (day_number), how many users were retained on day 0, day 1, day 7, … up to day n.

Finally, in the last select statement, we calculate the retention percent for each cohort day and day_number combination and keep only day 0 to day 30 retentions.

Some Best Practices

After you start creating reports using the exported data in BigQuery, you will encounter discrepancies between the Firebase Analytics dashboard and your reports. The following best practices help keep the two closer together:

  1. Use the table for a specific day instead of a date comparison on the event timestamp in the WHERE clause. The event timestamp is based on client device time, which is often skewed (see the sketch after this list).
  2. Avoid comparing counts of events like session_start or user_engagement that are triggered very often; these are the most likely to diverge between the dashboard and BigQuery.
  3. If a query runs out of allocated resources, remove the ORDER BY clause. Ordering is an expensive operation that cannot be processed in parallel, so avoid it where possible or apply it only to a limited result set.
  4. Report on data that is at least three days old so that the export from GA4F to BigQuery has settled. Data is only exported to BigQuery once per day, so queries which include the most recent three days will show different results in Firebase Analytics and in BigQuery.
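As a sketch of the first point (the dataset name is a placeholder), prefer the single-day table over filtering the wildcard tables on the client-side event_timestamp:

-- Recommended: query the export table for the specific day
SELECT count(*) AS events_on_jan_20
FROM `analytics_xxxxxxxxx.events_20200120`;

-- Discouraged: filter on event_timestamp, which reflects (possibly skewed) device time
SELECT count(*) AS events_on_jan_20
FROM `analytics_xxxxxxxxx.events_*`
WHERE DATE(TIMESTAMP_MICROS(event_timestamp)) = DATE '2020-01-20';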

I hope this post helped you learn a bit about Firebase Analytics. There will be more posts following this one, covering more specific tasks like using event data to map user behaviour, build user journeys, etc.

If you have any questions or suggestions, feel free to reach out in the comments and connect with me on LinkedIn or Twitter.
