How to Accurately Calculate Age in BigQuery

James Cerwin Ly
Towards Data Science
3 min readMar 21, 2020

--

Photo by Helloquence on Unsplash

In analysing customer data, age is one of the basic and important demographic data fields. Usually, age is not a data point collected from the customer but rather their date of birth. Age is then calculated from this — either the customer’s age from a moment in time or their age today.

Unfortunately, there’s no standard way of calculating age on Bigquery and this leads to different ways of deriving a basic concept. With each unique calculation among developers and analysts, there would be discrepancies between results.

I introduce different age calculations below and how to accurately calculate it in BigQuery using SQL. In these examples, I’ll be using the same date of birth (mine).

Basic Age Calculation

The most basic way of calculating age is to use DATE_DIFF function to get the number of years between two dates. However, this function just subtracts the years, regardless if the date of birth has already passed or not, which is entirely inaccurate.

WITH data AS (SELECT CAST('1993-04-29' AS DATE) AS date_of_birth)SELECT 
DATE_DIFF('2020-03-21',date_of_birth, YEAR) AS age
FROM data

Result: 27

Instead of using the difference in years, we can use the same function to calculate the difference in days between a date and the date of birth, then dividing this by 365. FLOOR is used to remove the decimal places. This is useful when it comes to quick analyses and when accuracy is less important.

SELECT 
FLOOR(DATE_DIFF('2020-03-21',date_of_birth, DAY)/365) AS age
FROM data

Result: 26

This calculation assumes that all years have 365 days. Since this does not take into account leap years, the age will get rounded up a few days before the actual birth date. The number of days when the age is rounded up is the number of leap years that have passed.

FLOOR(DATE_DIFF('2020-04-27',date_of_birth, DAY)/365) AS age

Result: 27

Adjusting for Leap Year

Replacing 365 with 365.25 in the divisor gives a more accurate result.

FLOOR(DATE_DIFF('2020-04-27',date_of_birth, DAY)/365.25) AS age

Result: 26

However, the result gets less accurate at certain ages where the leap year has an impact.

FLOOR(DATE_DIFF('2011-04-29',date_of_birth, DAY)/365.25) AS age

Result: 17

Here, the age is actually being rounded down. The result should have been 18 in this scenario. While the error has minimal effect and this proves to be a better alternative, there could be cases where accuracy is important.

Being Accurate

In marketing, calculating age incorrectly can have legal implications or just really bad customer experience. In this seemingly complex (not really) calculation, it uses several functions. The first part of the query subtracts the years between the two dates. The second part will subtract 1 year if the date of birth has already passed the other date by comparing just the month and day.

DATE_DIFF('2020-03-21',date_of_birth, YEAR)
-
IF(EXTRACT(MONTH FROM date_of_birth)*100 + EXTRACT(DAY FROM date_of_birth) > EXTRACT(MONTH FROM '2020-03-21')*100 + EXTRACT(DAY FROM '2020-03-21'),1,0) AS age

Result: 26

The downside of using this calculation is that it is long and quite a hassle to redo every time you need age calculated (except when you have the query saved somewhere).

Persistent UDF

BigQuery now allows persistent user-defined functions. This makes it easier to reuse the same code and keep it consistent throughout the project. It’s a struggle as someone who has been coding for a few years now that every time age needs to be derived, I would have to use different functions and write a long query just to get an accurate age. Instead, I created a UDF such that I can reference the same code for every analysis I make.

CREATE OR REPLACE FUNCTION workspace.age_calculation(as_of_date DATE, date_of_birth DATE) AS (
DATE_DIFF(as_of_date,date_of_birth, YEAR) -
IF(EXTRACT(MONTH FROM date_of_birth)*100 + EXTRACT(DAY FROM date_of_birth) > EXTRACT(MONTH FROM as_of_date)*100 + EXTRACT(DAY FROM as_of_date),1,0))

In this UDF, there are two inputs — the date of birth and the as of date you want to calculate the age. This gives the user the flexibility of reusing the code and for different use cases.

SELECT workspace.age_calculation('2020-03-21','1993-04-29')

Result: 26

It’s interesting that there is no standard way of calculating age when this is such an important attribute, whether in analysing or targeting customers. However, using a combination of BigQuery functions gives better alternatives. I hope this helps analysts use a standard and accurate way of calculating age in their data.

--

--

Customer Insights Analyst. MSc Operations Research and Analytics, LSE 2018. Working with data one day at a time. https://www.linkedin.com/in/james-cerwin-ly/