Counting the number of distinct elements in a data set seems like a trivial problem: iterate over the data set, adding each element to a list if it's not already there and doing nothing if it is. Typically a hash set is used instead of a list because its contains operation is O(1) versus O(n) for a list.
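The naive approach can be sketched in a few lines of Java (a minimal illustration, not production code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NaiveDistinct {
    // Count distinct elements by keeping every unique value in memory.
    static int countDistinct(List<String> data) {
        Set<String> seen = new HashSet<>(); // O(1) contains/add per element
        for (String s : data) {
            seen.add(s); // add() is a no-op if the value is already present
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDistinct(List.of("a", "b", "a", "c", "b"))); // 3
    }
}
```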
This approach, however, doesn’t scale well.
Because every unique element must be kept in memory, data sets with large numbers of unique elements occupy large amounts of memory. The counts are also not additive. For example, if a website has 3 unique visitors on Monday and 3 unique visitors on Tuesday, all you know about the total number of unique visitors across the two days is that it's somewhere between 3 and 6.
To accurately get the number of unique visitors you must re-scan the base data. In fact every time you want a unique count you must re-scan the base data. This scales poorly when your data is large and you want to count the uniques in many different ways (e.g. the total number of site visitors vs the number of visitors to a specific page). At Conversant we see 120 billion bid requests a day and analytics has dozens of unique counts based on this data. For many of the counts an accurate estimation is more than sufficient. This is a strong use case for HyperLogLog.
HyperLogLog was first described in the paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, published by Flajolet, Fusy, Gandouet and Meunier in 2007. HyperLogLog is both a data structure and an algorithm to evaluate that data structure (I will often refer to the data structure as an estimator or counter). The estimators are constructed from base data and represent a set that has two operations: union and estimate cardinality.
The intuition behind the HyperLogLog algorithm is that if you track the rarest event you have seen, the rarer that event is, the more likely it is that you have seen many events.
For example, consider a person repeatedly flipping a coin and keeping track of the longest sequence of consecutive heads. If that person says that they have flipped 2 heads in a row, you would assume they haven't been flipping the coin for very long. However, if they have flipped 100 heads in a row it's highly likely they have been flipping that coin for an incredibly long time! (At one flip per second, that would take on the order of 2^100 seconds, or roughly 4 × 10^22 years!)
Instead of flipping coins HyperLogLog uses a hash function to generate a uniformly distributed sequence of 0’s and 1’s for each element added and tracks the maximum number of leading zeroes, much like keeping track of the longest sequence of consecutive leading heads.
In the example above the element “Steve” is added to an empty HyperLogLog counter.
- “Steve” is hashed into a sequence of 1’s and 0’s
- The number of leading zeroes is found (4)
- The value is stored ( max(4,0) = 4 )
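The steps above can be sketched in Java with a single stored max. Here String.hashCode() stands in for a proper uniformly distributed hash function; it is not uniform, but it is deterministic, which is the property the sketch relies on:

```java
public class SingleRegisterHLL {
    int maxLeadingZeros = 0; // the single stored "max"

    // Stand-in for a proper uniform hash (an assumption for brevity);
    // a real implementation would use something like MurmurHash.
    static int hash(String key) {
        return key.hashCode();
    }

    void add(String key) {
        int h = hash(key);
        // Count the leading zeroes of the hash, like a run of leading heads.
        int zeros = Integer.numberOfLeadingZeros(h);
        // Keep only the rarest event seen so far.
        maxLeadingZeros = Math.max(maxLeadingZeros, zeros);
    }

    long estimate() {
        return 1L << maxLeadingZeros; // 2 ^ value
    }
}
```

Because the hash is deterministic, adding the same key twice leaves the counter unchanged.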
Hash functions have a very important property: if the same key is hashed multiple times, the hash function always produces the same sequence. Thus only unique values are seen as separate events.
In this example “Steve” is added twice but since “Steve” always hashes to the same value the counter remains unchanged.
Counters can also be easily merged by taking the max of their respective values. In fact, the max operation is both associative and commutative, which means counters can be constructed in any order and with arbitrary grouping patterns. This property is key for parallelism.
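Since a merge is just an element-wise max, it can be sketched in a few lines (representing a counter's registers as an int array is an assumption about layout):

```java
public class HLLMerge {
    // Merge two counters by taking the element-wise max of their registers.
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
        return out;
    }
}
```

Because max is associative and commutative, merge(a, b) equals merge(b, a), and any grouping of merges over the same counters yields the same result.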
Evaluating this counter, i.e. turning it into an estimate of the number of uniques, is simple: 2 ^ value. This alone, though, results in rather poor accuracy.
Testing the algorithm
To measure the accuracy, a simple Java prototype was developed. Random keys were fed into the program, which recorded the number of keys processed until encountering one whose hash had 4 leading 0's. This was repeated 500 times. It took an average of 15 keys before one with 4 leading 0's was seen, which is close to what we expect mathematically, since the probability of this event is
(1/2) ^ 4 = 1/16. However, a standard deviation of 14 was observed, which indicates that the estimate can vary by up to 100% of its value ~68% of the time.
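The experiment is easy to reproduce. The sketch below draws random 32-bit values as stand-ins for key hashes and counts draws until one has at least 4 leading zeroes (the seed and structure are my own, not the original prototype's):

```java
import java.util.Random;

public class LeadingZeroExperiment {
    // Count how many random hashes are drawn before one has at least
    // 4 leading zeroes; each draw succeeds with probability 1/16.
    static int trialsUntilRareHash(Random rng) {
        int count = 0;
        while (true) {
            count++;
            int h = rng.nextInt(); // stand-in for hashing a random key
            if (Integer.numberOfLeadingZeros(h) >= 4) {
                return count;
            }
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed for reproducibility
        int runs = 500;
        double sum = 0, sumSq = 0;
        for (int i = 0; i < runs; i++) {
            int n = trialsUntilRareHash(rng);
            sum += n;
            sumSq += (double) n * n;
        }
        double mean = sum / runs;
        double stdDev = Math.sqrt(sumSq / runs - mean * mean);
        System.out.printf("mean=%.1f stdDev=%.1f%n", mean, stdDev);
    }
}
```

A geometric distribution with p = 1/16 has mean 16 and standard deviation of about 15.5, consistent with the averages observed above.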
The weakness of this approach is that a single outlier can heavily skew the estimate. Reusing the original coin-flipping example, a single extra heads beyond the norm skews the estimate by 100%. Improving on this requires a second bit of intuition: if you had 10 people flipping coins and tracking their longest observed sequences of heads, and only one of them reported seeing a sequence 100 long while the others reported much lower maximums, the outlier could be discredited.
HyperLogLog leverages this by using the first bits of the hash to select a bucket and the remaining bits for the number of leading zeroes. By storing many maxes, their values can be averaged to reduce the impact of outliers.
The example above shows how a value is added to a 2-bucket counter.
- “Steve” is hashed into a sequence of 1’s and 0’s
- The first bit is 0 selecting the first bucket
- The remaining bits are evaluated for the number of leading zeroes (3)
- The value is stored in the bucket ( max(3,0) = 3 )
The details of how to evaluate this are a bit complicated and better explained in Flajolet's paper, but the conclusion is that the relative error of the estimate is
1.04/sqrt(# of maxes stored) and that the estimate maintains this accuracy for up to
2^2^(bit length of each max) unique elements. In practice this means that an estimator with 16,384 6-bit buckets (a bucket holds one max) has an accuracy of 0.8125% for up to 2^64 unique elements while using only 12KB.
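The sizing arithmetic is easy to check directly:

```java
public class HLLSizing {
    // Relative standard error for m registers: 1.04 / sqrt(m)
    static double relativeError(int registers) {
        return 1.04 / Math.sqrt(registers);
    }

    // Memory in bytes for m registers of the given bit width
    static int memoryBytes(int registers, int bitsPerRegister) {
        return registers * bitsPerRegister / 8;
    }

    public static void main(String[] args) {
        int m = 16_384;
        System.out.println(relativeError(m));  // ~0.008125, i.e. 0.8125%
        System.out.println(memoryBytes(m, 6)); // 12288 bytes = 12KB
    }
}
```

Since sqrt(16,384) = 128, the error works out to exactly 1.04/128 = 0.8125%, and 16,384 registers × 6 bits = 98,304 bits = 12KB.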
HyperLogLog on GreenPlum
To make use of this data structure and algorithm, we developed a datatype and corresponding SQL functions for the GreenPlum/PostgreSQL database to construct, manipulate, and evaluate HyperLogLog estimators.
The following code creates an estimator from the integers 1 through 100.
The estimator itself is a custom type named
hyperloglog_estimator (whose contents are printed out in base64) that can be persisted, unioned with other estimators, or evaluated to see the estimated cardinality of the base data.
A key thing to note is that an estimator generated entirely from the base data is identical to an estimator generated by unioning estimators created from portions of that base data.
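This equivalence follows from merge being an element-wise max, and can be demonstrated with the register-array sketch used earlier (String.hashCode() again standing in for a uniform hash):

```java
import java.util.List;

public class UnionEquivalence {
    // Build a register array directly from a list of keys.
    static int[] build(List<String> keys, int bucketBits) {
        int[] buckets = new int[1 << bucketBits];
        for (String key : keys) {
            int h = key.hashCode(); // stand-in for a uniform hash
            int bucket = h >>> (32 - bucketBits);
            int zeros = Integer.numberOfLeadingZeros(h << bucketBits);
            buckets[bucket] = Math.max(buckets[bucket], zeros);
        }
        return buckets;
    }

    // Union two counters via element-wise max.
    static int[] union(int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
        return out;
    }
}
```

Building one counter from all the data and unioning counters built from any partition of the data yield byte-for-byte identical registers, which is what makes estimators safe to precompute per day, per page, and so on, and combine later.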
In part 2 we’ll discuss the design specifics of how this was implemented along with the source code so you can try it out in your own environment.