
HyperLogLog implemented using SQL

We look at an implementation of the HyperLogLog cardinality estimation algorithm written entirely in declarative SQL

Photo by Vidar Smits on Unsplash

The HyperLogLog algorithm is an extremely popular algorithm used to estimate (approximate) the number of unique elements in a given dataset. There are some differences between this and exact unique counting in SQL.

COUNT(DISTINCT x)

  1. Performs an exact unique count.
  2. Uses O(unique elements) additional memory.

HyperLogLog

  1. Performs an approximate unique count.
  2. Uses O(1) additional memory. This is a significant reduction in memory requirements when you have millions or billions of unique values to track.

If you want to know the details about the LogLog and HyperLogLog algorithms, you can find them here.


Basic Intuition Behind The Working Of HyperLogLog

Q. How many coin flips would you need to get 3 heads in a row?

We expect to see the following when we flip a coin once.

H
T

We expect to see the following when we flip a coin twice.

HH
HT
TH
TT

We expect to see the following when we flip a coin thrice.

HHH
HHT
HTH
HTT
THH
THT
TTH
TTT

That’s roughly 8 (2³) three-flip sequences (in expectation) before you see 3 heads in a row, since only one of the 8 equally likely outcomes is HHH.

Q. How many coin flips would you need to get 4 heads in a row?

Similarly, it would take roughly 16 (2⁴) four-flip sequences (in expectation) before you see 4 heads in a row.
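This intuition can be checked with a quick simulation (a sketch; the seed and trial count are arbitrary):

```python
import random

def sequences_until_all_heads(k, rng):
    """Count how many k-flip sequences we run before one comes up all heads."""
    attempts = 0
    while True:
        attempts += 1
        if all(rng.random() < 0.5 for _ in range(k)):
            return attempts

rng = random.Random(42)
trials = 20_000
avg = sum(sequences_until_all_heads(3, rng) for _ in range(trials)) / trials
print(avg)  # close to 2**3 = 8
```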

Input Preprocessing

Take every element of your input and hash it using a hashing algorithm that produces 64-bit hashes both randomly and uniformly, i.e. the chance of seeing a 0 (or 1) at any bit position is 1/2.

Now, start from the least significant bit of this bit string and count the number of consecutive 0 bits. This should remind you of the heads-in-a-row problem we discussed above. Using the same intuition, we expect to see a run of K 0s once for every 2^K unique hashes. We assume that every element hashes to a unique hash, and that hashes have a random bit stream distribution.
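In code, the run of trailing zeros can be found by isolating the lowest set bit (a sketch; the function name is mine):

```python
def trailing_zeros(h: int) -> int:
    """Number of consecutive 0 bits, counting from the least significant bit."""
    if h == 0:
        return 0  # simplification for this sketch: treat an all-zero hash as a run of 0
    return (h & -h).bit_length() - 1  # h & -h isolates the lowest set bit

print(trailing_zeros(0b10110000))  # → 4
```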

First Version

An example bitstring (Image by author)

We just track the maximum position (starting from position 1) of the lowest 1 bit, i.e. the position where the run of consecutive trailing 0s ends. For example, if the numbers we see have their lowest 1 bit at positions 3, 2, 2, 2, 1, 4, then we just track 4.

A maximum position of 4 means a run of 3 trailing zeros, so the approximation of the number of unique values seen is 2³ = 8.
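A sketch of this first version (function name is mine; positions are 1-based, matching the example above):

```python
def first_one_position(h: int) -> int:
    """1-based position of the lowest 1 bit; 0 if h is all zeros."""
    return (h & -h).bit_length() if h else 0

max_pos = 0
for h in [0b100, 0b10, 0b10, 0b10, 0b1, 0b1000]:  # lowest 1 bits at positions 3, 2, 2, 2, 1, 4
    max_pos = max(max_pos, first_one_position(h))

# A maximum position of max_pos means a run of (max_pos - 1) trailing zeros.
estimate = 2 ** (max_pos - 1)
print(max_pos, estimate)  # → 4 8
```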

Drawback

The main drawbacks of the above are:

  1. We can estimate the count only for powers of 2. i.e. once we see a string of zeroes we estimate the number to be 2^K. We can’t estimate any number between 2 consecutive powers of 2. This is a pretty wide range for larger powers of 2.
  2. We could get really unlucky and see a string of (say) 6 zeroes for just 1 unique element. This means that we would estimate having seen 2⁶ = 64 unique elements, when in fact we’ve seen just 1.

Second Version (LogLog estimator)

To address both the drawbacks mentioned above, we can make 2 changes.

  1. Maintain 2^B counters instead of just one counter. Each counter is updated by a different hash applied to the same input. If we assume a total of N elements in the input, this means that each counter roughly counts the number of unique elements in N/(2^B) elements of the input.
  2. When calculating 2^K as the number of unique elements, take the average across all the 2^B counters, and then do 2^avg(K). Multiply this result by 2^B (the number of buckets).

As an optimization, we can take the last B bits of the hash and use that to compute the bucket and then use the rest of the bits (shifted right B bits) in the way we used them above.
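The split described above is plain bit arithmetic (a sketch; B and the example value are illustrative):

```python
B = 6  # low B bits select one of 2**B = 64 buckets

def split(h):
    bucket_id = h & ((1 << B) - 1)  # keep the low B bits
    rest = h >> B                   # remaining bits feed the zero-run counter
    return bucket_id, rest

bucket_id, rest = split(0b1011010110101)
print(bin(bucket_id), bin(rest))  # → 0b110101 0b1011010
```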

Splitting a bit string into a bucket and a random hash bit string (Image by author)
B buckets each containing a single LogLog estimator (Image by author)

This is basically the LogLog estimator.

Trivia: This is called LogLog, since it requires log2(log2(n)) bits to store the number of bits needed to represent the number "n". For example, if n == 2^b, then "n" has "b" bits, and b == log2(n). We need log2(b) == log2(log2(n)) bits to store the value "b". "b" also represents the largest position that can have a 1 bit with all 0 bits after it (this is what we want to measure for each hash).
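For example, with 64-bit hashes the largest position we might record is 64, and storing that value needs only log2(log2(2^64)) = 6 bits:

```python
from math import log2

n_bits = 64                       # hash width
max_position = log2(2 ** n_bits)  # == 64.0: largest 1-bit position we might record
print(log2(max_position))         # → 6.0 bits needed per counter
```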

Third Version (HyperLogLog)

Instead of taking the arithmetic mean (average) of the values before raising to the power of 2, we take the harmonic mean of the values.

Since the harmonic mean of a list of numbers tends strongly toward the least elements of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large outliers and aggravate the impact of small ones.
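A quick comparison makes this concrete (the numbers are illustrative):

```python
def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

values = [4, 5, 4, 20]  # one bucket got unlucky and saw a long zero run
print(arithmetic_mean(values))  # 8.25 — dragged up by the outlier
print(harmonic_mean(values))    # ~5.33 — stays close to the typical buckets
```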

That’s it!


SQL Input Schema

The input table is very simple. It’s basically a table with a single integer column (the input integer).

CREATE TABLE input_data(n INT NOT NULL);

HyperLogLog in SQL

INSERT INTO input_data

WITH RECURSIVE fill_n(ctr, n) AS (
  SELECT 4000, (RANDOM() * 2000)::INT

  UNION ALL

  SELECT ctr-1, (RANDOM() * 2000)::INT
  FROM fill_n
  WHERE ctr > 0
)

SELECT n FROM fill_n;

SELECT COUNT(DISTINCT n) AS num_unique_actual FROM input_data;

WITH hashed_list AS (
  SELECT
    ('x' || SUBSTR(md5(n::TEXT), 1, 16))::BIT(64)::BIGINT AS h
  FROM input_data
),

bucketed AS (
  SELECT
    -- Keep 64 buckets
    h & (63::BIGINT) AS bucket_id,
    (h >> 6) & 2147483647::BIGINT AS h_key
  FROM hashed_list
),

loglog_mapped AS (
  SELECT
    bucket_id,
    -- The following computes the position of the first 1 bit
    -- when looking at the number from right (least significant
    -- bit) to left.
    LOG(h_key & -h_key) / LOG(2.0) AS fsb,
    h_key
  FROM bucketed
),

max_in_bucket AS (
  SELECT
    bucket_id,
    -- Retain the largest bit position in any bucket.
    MAX(fsb) AS fsb
  FROM loglog_mapped
  GROUP BY 1
),

harmonic_mean_mapped AS (
  SELECT
    AVG(fsb) AS avg_fsb,
    COUNT(1) / SUM(1.0 / (CASE WHEN fsb = 0 THEN 1 ELSE fsb END)) hm_fsb,
    -- Number of unique elements using the LogLog estimator.
    COUNT(1) * POW(2, AVG(fsb)) / 0.77351 AS num_unique_log_log,
    -- Number of unique elements using the HyperLogLog estimator.
    COUNT(1) * POW(
      2, COUNT(1) / SUM(1.0 / (CASE WHEN fsb = 0 THEN 1 ELSE fsb END))
    ) / 0.77351 AS num_unique_hll
  FROM max_in_bucket
  WHERE fsb > 0
)

SELECT * FROM harmonic_mean_mapped;
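For readers who find imperative code easier to follow, here is a rough Python translation of the query above (same md5 prefix, 64 buckets, same 0.77351 constant; it mirrors this toy SQL, not a production HyperLogLog):

```python
import hashlib

def hll_sql_style(values, b=6, alpha=0.77351):
    m = 1 << b                # 64 buckets, as in the query
    maxes = [0] * m
    for v in values:
        # First 16 hex chars of md5, read as a 64-bit integer.
        h = int(hashlib.md5(str(v).encode()).hexdigest()[:16], 16)
        bucket = h & (m - 1)          # h & 63: low bits pick the bucket
        key = (h >> b) & 0x7FFFFFFF   # mirrors (h >> 6) & 2147483647
        if key == 0:
            continue
        fsb = (key & -key).bit_length() - 1  # LOG(key & -key) / LOG(2)
        maxes[bucket] = max(maxes[bucket], fsb)
    live = [x for x in maxes if x > 0]       # WHERE fsb > 0
    hm = len(live) / sum(1.0 / x for x in live)
    return len(live) * 2 ** hm / alpha

print(hll_sql_style(range(2000)))  # in the same ballpark as the true count, 2000
```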

Here’s the result of running the query above.

Actual v/s estimated unique counts of numbers (Image by author)

Error Rate: The actual number of unique elements in the input table is 1729. The LogLog estimator is way off in this case with an estimate of 2795 unique elements. The HyperLogLog estimator has a 3.6% error with an estimate of 1792 unique elements.

We can get a much better estimate (smaller error) by using a larger number of buckets.

Buckets: The implementation we showed above uses 64 buckets. Each bucket stores a number between 0 and 63, which needs only 6 bits but is most practically stored in an 8-bit (single byte) number, so we use 64 bytes for the implementation shown above.

According to the Wikipedia article above,

The HyperLogLog algorithm is able to estimate cardinalities of > 10⁹ with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.

Space: The implementation above uses O(n) additional space due to the way the query has been written. An actual (non-toy) implementation would not store the computed hashes but would just update in-memory buckets.

Hash Computation: We use md5 in the query above, but you can use any hash that produces uniform random 64-bit numbers.

There are some details about the correction factor constant 0.77351 that I have omitted for brevity. You can read the paper for details about this constant.


SQL Fiddle

The SQL Fiddle for the queries above can be found here.


Conclusion

The HyperLogLog algorithm is surprisingly simple yet powerful at the same time. It stores a lot of information per bit of storage!


