The world’s leading publication for data science, AI, and ML professionals.

Counting unique elements faster in Google BigQuery with HyperLogLog

With HyperLogLog algorithm, we trade off some accuracy to get the job done faster. But how much faster and at the expense of what level of…

Photo by Crissy Jarvis on Unsplash
Photo by Crissy Jarvis on Unsplash

HyperLogLog (HLL) is an algorithm that estimates how many unique elements the dataset contains. Google Bigquery has leveraged this algorithm to approximately count unique elements for a very large dataset with 1 billion rows and above.

In this article, we’ll cover 2 points.

  1. What’s HLL?
  2. How does HLL compare with other methods?

What’s HLL?

The HyperLogLog algorithm can estimate cardinalities well beyond 10⁹ with a relative accuracy (standard error) of 2% while only using 1.5kb of memory.

-HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm by Philippe Flajolet et al.

HLL is one of many cardinality estimation algorithms that are all invented to solve one problem: how to count the number of unique elements as fast and accurately as possible. As you might have guessed, cardinality means the number of unique elements in a dataset with duplicated entries.

These algorithms are extremely useful when you have a very large dataset, far too large to sort and count how many unique entries within seconds. Below are a few real-life use cases.

  • Detection of network attacks (e.g. Denial of Service) by counting the unique header pairs of data packets transmitted in various time slices
  • Estimate the count of unique records in Redis, AWS Redshift and Druid datasets
  • Count the number of unique users during a time period for real-time activity reports of mobile apps

HLL is the product of various enhancements of the Flajolet-Martin algorithm introduced by Philippe Flajolet and G. Nigel Martin in 1984. Since then, Google has adopted and improved on it to become HyperLogLog++ functions. Apart from Google, many other technology platforms have implemented their own data structures based on HLL. Nevertheless, they all share the same basic rule.

If all numbers in the set are represented in binary and let n be the maximum number of leading 0s observed in all numbers, 2^n is the estimated unique count of numbers.

In short, with HLL, we trade off some accuracy to get the job done faster. But how much faster and at the expense of what level of accuracy? Let’s run some experiments in Google BigQuery.

How does HLL compare with other methods?

Let’s start with a big dataset. I am using the Wikipedia dataset which is available via Google BigQuery Public dataset. To extract only 2020 data, I have created a permanent table with the below query.

# Create a table containing all hourly views per Wikipedia page in 2020 
CREATE OR REPLACE TABLE `bigquery-301504.wikipedia.view_2020` AS SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2020` 
WHERE DATE(datehour) < "2021-01-01" AND 
DATE(datehour) > "2019-12-31";

This large table (2.24TB with more than 5 billion rows) contains the number of views for each Wikipedia page title every hourly in 2020. You can take a look at the schema below.

Photo by author
Photo by author

Let say we want to find out how many unique titles viewed every month in 2020, here is the simplest SQL query we can write.

Query

# Count unique titles viewed per month with DISTINCT method

SELECT EXTRACT(MONTH FROM datehour) AS month, 
COUNT(DISTINCT title) AS title_count 
FROM `bigquery-301504.wikipedia.view_2020` 
GROUP BY month 
ORDER BY month;

Result

Photo by author
Photo by author

When using DISTINCT function, Google BigQuery needs to keep track of every unique date it has ever seen. Hence it took more than 4 minutes to run this query.

Luckily we can go faster if we are willing to trade off some accuracy. There are 2 options to arrive at an approximate count.

Option #1 – APPROX_COUNT_DISTINCT

Query

# Count unique titles viewed per month with APPROX_COUNT_DISTINCT method

SELECT EXTRACT(MONTH FROM datehour) AS month, APPROX_COUNT_DISTINCT(title) AS title_count 
FROM `bigquery-301504.wikipedia.view_2020` 
GROUP BY month 
ORDER BY month;

Result

Photo by author
Photo by author

Way better, but can we push it a little harder? Yes, we can.

Option #2 – HLL

Let’s see how we can use HLL to solve the same challenge of counting unique titles viewed every month.

Query

# Count unique titles viewed per month with HLL method 
WITH hll_count AS (
SELECT EXTRACT(MONTH FROM datehour) AS month, 
HLL_COUNT.INIT(title) title_count 
FROM `bigquery-301504.wikipedia.view_2020` 
GROUP BY month) 
SELECT month, hll_count. MERGE (title_count) 
FROM hll_count 
GROUP BY month order by month;

Result

Photo by author
Photo by author

The moment we gave up even less than 1% accuracy, we speed up the query by 6 times. Of course, there will be scenarios whereby accuracy is required such as financial reporting or scientific research calculations. Otherwise, both APPROX_COUNT_DISTINCT and HLL will come in handy if performance is more important than absolute accuracy.

Although the queries for APPROX_COUNT_DISTINCT and HLL look different, it’s worth to note that both utilise the same HLL algorithm. Since the second method is more complicated, you might want to use APPROX_COUNT_DISTINCT to keep the code cleaner and more readable.

Parting Thoughts

If you are interested in learning more about the evolution of HLL, do check out the following papers (arranged in chronological order).

Probabilistic Counting Algorithms for Database Applications

LogLog Counting of Large Cardinalities

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm

That’s all I have for today. Thank you for reading and do let me know if you have any feedback. Take care and stay safe everyone!


Originally published at http://thedigitalskye.com on January 13, 2021.


Related Articles