Big Data with Sketchy Structures, Part 2 — HyperLogLog and Bloom Filters

Karan Shukla
Towards Data Science
9 min read · Jul 17, 2018


This article is a direct follow-up to my earlier article, “Big Data with Sketchy Structures, Part 1”, in which I introduced the concept of space-optimal sketch data structures. Last time, we talked about Count-Min Sketch, a structure to count the number of times an element appears in a set. In this article, I introduce two other popular sketch structures.

How Many Users are Logged-In? An O(n)-space Solution

We’ve already established a space-efficient method to approximate video view counts. This is great — with this data, we can begin learning about what kind of content draws users to our platform and keeps them engaged.

Suppose your team wants to begin tracking other relevant user engagement metrics. An obvious one is the number of active users during any given unit of time. For example, you could track the number of monthly users, daily users, or even hourly users. Is your product more popular on weekdays, or on weekends? Do users prefer visiting in the daytime, or at night?

Let’s start by implementing a daily active user counter. We could use a hash set, adding a user’s id to the set whenever they log in. If a user logs in multiple times per day, or on different devices, or in different browsers, we won’t double-count them, because a set stores each id only once. At the end of every day, we save the set’s size (also known as its cardinality) to some table of daily counts. We then reset the set, and begin counting for the next day.
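To make this concrete, here is a minimal sketch of that exact (non-probabilistic) counter in Python. The names here are illustrative, not from any particular codebase:

```python
daily_counts = {}     # date -> number of distinct active users
active_users = set()  # ids of the users seen so far today

def record_login(user_id):
    # Duplicate logins are free: a set stores each id only once.
    active_users.add(user_id)

def end_of_day(date):
    daily_counts[date] = len(active_users)  # the set's cardinality
    active_users.clear()                    # start fresh tomorrow

# Five logins, but only three distinct users:
for uid in [1, 2, 2, 3, 1]:
    record_login(uid)
end_of_day("2018-07-17")
print(daily_counts["2018-07-17"])  # 3
```

This gives exact counts; the catch, as we are about to see, is the memory the set consumes.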

If we get a billion users every day, and use a 32-bit integer to store user ids, that’s 4 GB just to store a set of active user ids. Even a bitmap, which represents each user with a single bit, still needs one bit for every possible id: 2^32 bits, or 512 MB. These numbers are actually pretty good. However, there’s still room for improvement. Both hash sets and bitmaps have O(n) space complexity, meaning they will consume space in direct proportion to the amount of data you track.

Although you’re content with tracking the number of active users today, tomorrow you may want to track the number of active users per country, or the number of active users of each age. The more metrics you decide to monitor, the more you need to invest in storage. O(n) data structures also make it harder to do real-time analytics on big data, since the size of data ingested may exceed the throughput of your analytics pipeline.

If you are willing to accept a less precise user count, you can use HyperLogLog, a sketch data structure, for much better space efficiency.

HyperLogLog — Approximately How Many Users are Logged-In?

HyperLogLog is a sketch data structure with sub-logarithmic O(log log n) space complexity and constant O(1) time complexity. Like the Count-Min Sketch, HyperLogLog is concerned with multisets: sets where each unique element may occur more than once. Count-Min Sketches can tell you approximately how many times each element in a multiset occurs. HyperLogLog, on the other hand, can tell you how many unique elements a multiset has. This is known as the Cardinality Estimation problem.

Let’s see how we might use HyperLogLog. Whenever a user logs into our app, the HyperLogLog algorithm hashes that user’s id into a bitstring. Assuming a uniform distribution, we can conclude that approximately half of the bitstrings begin with a 0, and the other half begin with a 1. Similarly, approximately one-fourth of the bitstrings begin with 00, and approximately one-eighth begin with 000.

You can see the pattern. In general, approximately 1 out of every 2^z uniformly distributed bitstrings will begin with a run of z 0s.

Suppose that, for some given minute, we hash the ids of all active users, and we discover that the longest sequence of leading 0s is 0000. The probability of this happening on any random bitstring is 1/16. Another way of saying that is, we would need 16 random bitstrings on average to find one that begins with 0000. So we can guess that our app probably had about 16 visitors in that minute, give or take.
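Here is a small Python sketch of that single-observation estimator. The helper names are illustrative, and md5 is just a convenient stand-in for any well-mixed hash:

```python
import hashlib

def hash32(item):
    # Hash an item to a 32-bit integer (md5 truncated to 4 bytes).
    return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")

def leading_zeros(h, bits=32):
    # Number of 0s before the first 1 in a `bits`-wide bitstring.
    return bits - h.bit_length()

def crude_estimate(items):
    # Longest observed run of leading 0s is z -> guess about 2^z items.
    z = max(leading_zeros(hash32(x)) for x in items)
    return 2 ** z

print(crude_estimate(range(1000)))  # a power of two, roughly near 1000
```

This estimate is extremely noisy: it can only ever be a power of two, and a single lucky hash skews it badly. That noise is exactly what the full HyperLogLog algorithm is designed to average away.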

HyperLogLog is built on this idea, though the math is a bit more complicated. In the original paper, the first few bits of each hash are used to route it to one of many registers, each register remembers the longest run of leading 0s it has seen, and the cardinality is computed as a normalized harmonic mean of 2^z across those registers. Bias from hash collisions is handled by a multiplicative constant, and linear counting is used to give more precise counts for smaller sets. The full method is a bit too complicated for a Medium article (and my brain), so I recommend reading Philippe Flajolet’s original 2007 paper if you really want to get into the guts of how it works.
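That said, the core of the idea fits in a short sketch. The class below keeps the bucketing and harmonic-mean averaging but deliberately omits the linear-counting and large-range corrections of the real algorithm, so treat it as an illustration rather than a production implementation:

```python
import hashlib

class TinyHLL:
    """Stripped-down HyperLogLog core: bucketing plus harmonic mean,
    without the small/large-range corrections of the full algorithm."""

    def __init__(self, b=10):
        self.b = b            # first b hash bits pick a register
        self.m = 1 << b       # number of registers
        self.registers = [0] * self.m

    def _hash64(self, item):
        return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

    def add(self, item):
        h = self._hash64(item)
        j = h >> (64 - self.b)                     # register index
        w = h & ((1 << (64 - self.b)) - 1)         # remaining bits
        rank = (64 - self.b) - w.bit_length() + 1  # position of first 1
        self.registers[j] = max(self.registers[j], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias-correction constant
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / harmonic

hll = TinyHLL(b=10)
for i in range(100_000):
    hll.add(i)
print(hll.count())  # close to 100,000, using only 1,024 small registers
```

With 1,024 registers the standard error is about 1.04/sqrt(1024), or roughly 3%, which is the kind of accuracy-for-space trade the real implementations make.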

Despite being a relatively new algorithm, HyperLogLog has already become quite successful in industry. Redis Labs used HyperLogLog to compress a 12 MB set to only 12 KB, while experiencing only 0.81% error. Google uses it to count their number of daily search queries. Several databases, such as Elasticsearch and AWS Redshift, use HyperLogLog under the hood. Like most sketch data structures, HyperLogLog is also fully parallelizable, making it efficient for high-throughput parallel applications. The main takeaway: HyperLogLog is an excellent and space-efficient approach to approximating the number of unique items in a set.

Recommendation System Pitfalls

At this point, your team has some sweet user engagement analytics going on. Thanks to the Count-Min Sketch, you know approximately how many times each video has been watched, and thanks to HyperLogLog, you know approximately how many active users logged on every day for the past month. You can use this data to make business decisions, like producing new content based on the most popular videos, or making projections for future growth.

You’ve probably considered creating a recommendation system. People love custom-tailored experiences based on their own personal preferences. So, you roll a recommendation system out, and at first, things are going great.

But after a while, you notice user retention starts dropping. What’s going on? You do some research, and discover that many of your users leave the site after consuming the recommendations on their homepage. It turns out that your recommendation engine keeps suggesting the same videos over and over, ignoring the fact that your users have already watched those videos. Your users are led to believe that they have exhausted your catalog of interesting content, so they begin to leave.

How do we fix this? We need to keep track of what videos each user has watched, and make sure that those videos are not recommended to that user again. This is an instance of the classic set membership problem. Given an item (a recommended video), we want to determine if that item belongs to a particular set (the user’s history).

We already discussed how hash sets and bitmaps have linear space complexity. If a billion users each watched a hundred videos, and each of those watched videos cost us a single bit, we would require a hundred billion bits, or about 12.5 GB, just for watch histories. Let’s try to do better.

Bloom Filters — approximately which videos has a user already seen?

The Bloom Filter is a data structure that answers set membership queries in O(1) time, using only a handful of bits per element regardless of how large the elements themselves are (around 10 bits per element buys a 1% false positive rate). The Bloom Filter’s internal structure is an m-bit array. Initially, all of the bits are set to 0. Like the Count-Min Sketch, the Bloom Filter uses k distinct hash functions, each of which returns a bit position between 0 and m-1. Bloom Filters support two operations: put(x), which represents adding an element x to the set, and get(x), which tells us whether x is a member of the set or not.

When put(x) is called, each of the k hash functions independently hashes x to a bit position, and the k selected bits are then set to 1. Similarly, get(x) calls the same hash functions and checks the resulting k bits. If each bit is 1, get(x) tells us that x is probably in the set. If at least one bit is 0, get(x) tells us that x cannot be in the set, because put(x) would have set all of x’s bits to 1.

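Here is a minimal Python sketch of those two operations. Deriving the k hash functions by salting a single hash, as below, is one common trick, not the only option:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m           # number of bits in the array
        self.k = k           # number of hash functions
        self.bits = [0] * m  # all bits start at 0

    def _positions(self, x):
        # Simulate k independent hash functions by salting one hash.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def put(self, x):
        for pos in self._positions(x):
            self.bits[pos] = 1  # set (never clear) each selected bit

    def get(self, x):
        # All k bits set -> "probably in the set"; any 0 -> definitely not.
        return all(self.bits[pos] for pos in self._positions(x))

watched = BloomFilter(m=1024, k=3)
watched.put("History of Japan")
print(watched.get("History of Japan"))  # True
print(watched.get("Hello"))             # almost certainly False
```

Note that get() can only err in one direction: a 0 bit is definitive proof of absence, while all-1s is merely strong evidence of presence.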

Let’s see how this can be applied to our recommendation system. Suppose that our recommender system thinks that some user would really like to watch “Hello”, “History of Japan”, and “What are Blockchain Smart Contracts?”. However, this user has already watched “History of Japan”, so we really don’t want to recommend it.

Behind the scenes, each of the three titles is hashed, and those hashes are checked against the user’s watch history, represented as a Bloom Filter set. We see that “History of Japan” is included in the set, so we filter it out. In the end, only “Hello” and “What are Blockchain Smart Contracts?” are sent to the user’s homepage. Your user sees fresh content, your advertisers get more eyeballs, your manager gets more money, you get a positive employee review, and everyone is happy. Yay, Bloom Filters!

Bias in Bloom Filters

Like all sketch data structures, Bloom Filters trade accuracy for efficiency. The type of error returned by Bloom Filters is known as a False Positive. This happens when the set membership query, get(x), incorrectly returns “true”, even though put(x) was never called for that x.

How could this happen? Suppose the user only watched “Japan” and “Blockchain”, but the bits set by this particular combination of videos happen to cover every one of the bit positions for “Hello”. Consequently, get(“Hello”) returns true, and “Hello” will be incorrectly omitted from that user’s recommendation list. Sounds bad, right?

Turns out it’s not so bad after all. So long as your program’s success is unaffected by the presence of false positives, you can safely take advantage of space-efficient Bloom Filters. In our recommendation system example, we’re not going to lose a user just because a particular video didn’t appear in his or her recommendation feed. It’s not the end of the world if a few users miss out on Adele’s latest tunes. As long as you supply a constant stream of fresh content that appeals to your users’ interests, you’ll be able to maintain steady engagement metrics, and that’s what really matters to your business.

The false positives in Bloom Filters are actually quite similar to the overcounting errors in Count-Min Sketches. In fact, both of these are one-sided, biased errors, and they both result from the same phenomenon: hash collisions. To minimize the false positive rate, you need to minimize the probability of collision. The solution is simple: use a larger bit array, and pick a number of hash functions to match.
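The standard back-of-the-envelope analysis makes this tradeoff concrete: with m bits, k hash functions, and n inserted elements, the false positive rate is approximately (1 − e^(−kn/m))^k, minimized by choosing k ≈ (m/n)·ln 2. A couple of helper functions show how the numbers work out:

```python
import math

def false_positive_rate(m, k, n):
    # Approximate FP rate for m bits, k hash functions, n elements.
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    # The k that minimizes the FP rate for a given bits-per-element ratio.
    return max(1, round((m / n) * math.log(2)))

# 10 bits per element, with the optimal k, gives under 1% error:
m, n = 10_000_000, 1_000_000
k = optimal_k(m, n)
print(k, false_positive_rate(m, k, n))  # 7, ~0.008
```

So the memory cost scales with how many elements you insert and how low an error rate you demand, not with how big the elements themselves are.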

Bloom Filters are perhaps the most popular sketch data structure today. They are used in malicious URL detection, web page caching, database lookups, and even spell checkers. Bloom Filters are also an active area of research, with many variants like Counting and Sliding Bloom Filters being published in recent years.

The Future is Probabilistic

In these two articles, we looked at three probabilistic data structures — Count-Min Sketch, HyperLogLog, and Bloom Filters — that are being used to tackle today’s big data problems. Probabilistic approaches to problem solving have already been embraced in the artificial intelligence community, and now, they’re becoming more popular among software engineers as well. Software is all about tradeoffs — time versus space, reads versus writes, clarity versus brevity, consistency versus availability — and now, precision versus efficiency.
