Hash data to find duplicates

(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: lawrence@krubner.com, or follow me on Twitter.

This is a very cool algorithm:

Cardinality estimation algorithms like the ones we’ve just discussed make it possible to get a very good estimate – within a few percent – of the total number of unique values in a dataset, typically using less than a kilobyte of state. We can do this regardless of the nature of the data, and the work can be distributed over multiple machines with minimum coordination overhead and data transfer. The resulting estimates can be useful for a range of things, such as traffic monitoring (how many unique IPs is a host contacting?) and database query optimization (should we sort and merge, or construct a hashtable of unique values?).

…The first set of refinements comes from the paper Probabilistic Counting Algorithms for Data Base Applications by Flajolet and Martin, with further refinements in the papers LogLog counting of large cardinalities by Durand-Flajolet, and HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm by Flajolet et al. It’s interesting to watch the development and improvement of the ideas from paper to paper, but I’m going to take a slightly different approach and demonstrate how to build and improve a solution from the ground up, omitting some of the algorithm from the original paper. Interested readers are advised to read through all three; they contain a lot of mathematical insights I won’t go into in detail here.
First, Flajolet and Martin observe that given a good hash function, we can take any arbitrary set of data and turn it into one of the sort we need, with evenly distributed, (pseudo-)random values. With this simple insight, we can apply our earlier procedure to whatever data we want, but they’re far from done.
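Here is a rough Python sketch of that first step. The choice of SHA-1 and the helper name are purely illustrative, not anything prescribed by the papers:

```python
# Illustrative sketch: hash arbitrary items to roughly uniform 32-bit integers.
import hashlib

def hash_value(item):
    """Map any item to a 32-bit, evenly distributed pseudo-random integer."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # keep the first 32 bits

hashed = [hash_value(x) for x in ["alice", "bob", "alice", "carol"]]
```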
Next, they observe that there are other patterns we can use to estimate the number of unique values, and some of them perform better than recording the minimum value of the hashed elements. The metric Flajolet and Martin pick is counting the number of 0 bits at the beginning of the hashed values. It’s easy to see that in random data, a sequence of k zero bits will occur once in every 2^k elements, on average; all we need to do is look for these sequences and record the length of the longest sequence to estimate the total number of unique elements. This still isn’t a great estimator, though – at best it can give us a power-of-two estimate of the number of elements, and much like the min-value based estimate, it’s going to have a huge variance. On the plus side, the state we need to keep is very small: to record sequences of leading 0s of up to 32 bits, we only need a 5-bit number.
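In Python, this bare single-estimator idea might look like the following sketch, with none of the refinements from the papers included:

```python
# Sketch of the bare leading-zeros estimator: track the longest run of leading
# 0 bits across all hashed values and estimate the cardinality as 2^k.
def leading_zero_bits(x, bits=32):
    """Count the 0 bits at the start of a `bits`-wide integer."""
    return bits - x.bit_length()

def estimate_cardinality_single(hashed_values, bits=32):
    max_zeros = 0
    for h in hashed_values:
        max_zeros = max(max_zeros, leading_zero_bits(h, bits))
    return 2 ** max_zeros  # power-of-two estimate, with huge variance
```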
As a side note, the original Flajolet-Martin paper deviates here and uses a bitmap-based procedure to get a more accurate estimate from a single value. I won’t go into this in detail, since it’s soon obsoleted by improvements in subsequent papers; interested readers can read the original paper for more details.
So we now have a rather poor estimate of the number of values in the dataset based on bit patterns. How can we improve on it? One straightforward idea is to use multiple independent hash functions. If each hash produces its own set of random outputs, we can record the longest observed sequence of leading 0s from each; at the end we can average our values for a more accurate estimate.
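A sketch of that multi-hash variant, using seeded SHA-1 hashes purely for illustration, might look like this:

```python
# Sketch of the multiple-hash-function idea: k independently seeded hashes,
# one longest-leading-zeros record per hash, averaged at the end.
import hashlib

def seeded_hash(item, seed):
    data = ("%d:%s" % (seed, item)).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "big")

def estimate_with_k_hashes(items, k=16, bits=32):
    max_zeros = [0] * k
    for item in items:
        for seed in range(k):
            h = seeded_hash(item, seed)
            max_zeros[seed] = max(max_zeros[seed], bits - h.bit_length())
    # Averaging the k exponents reduces the variance, but costs k hashes per item.
    return 2 ** (sum(max_zeros) / k)
```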
This actually gives us a pretty good result statistically speaking, but hashing is expensive. A better approach is one known as stochastic averaging. Instead of using multiple hash functions, we use just a single hash function, but use part of its output to split values into one of many buckets. Supposing we want 1024 buckets, we can take the first 10 bits of the hash as a bucket number, and use the remainder of the hash to count leading 0s. This loses us nothing in terms of accuracy, but saves us a lot of redundant computation of hashes.
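Put together, a rough, uncorrected version of that bucketed scheme looks like this in Python; the bias-correction constants from the LogLog and HyperLogLog papers are deliberately left out, so the raw estimate is biased:

```python
# Sketch of stochastic averaging: one hash per item, the top 10 bits pick one
# of 1024 buckets, the remaining 22 bits feed the leading-zeros counter.
import hashlib

BUCKET_BITS = 10
NUM_BUCKETS = 1 << BUCKET_BITS     # 1024 buckets
VALUE_BITS = 32 - BUCKET_BITS      # bits left over for counting leading zeros

def estimate_cardinality(items):
    max_zeros = [0] * NUM_BUCKETS
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode("utf-8")).digest()[:4], "big")
        bucket = h >> VALUE_BITS                  # first 10 bits select the bucket
        rest = h & ((1 << VALUE_BITS) - 1)        # remaining 22 bits
        zeros = VALUE_BITS - rest.bit_length()    # leading 0s in the remainder
        max_zeros[bucket] = max(max_zeros[bucket], zeros)
    # Average the per-bucket exponents and scale by the bucket count.
    # Without the papers' correction constants this estimate is biased.
    return NUM_BUCKETS * 2 ** (sum(max_zeros) / NUM_BUCKETS)
```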

Post external references

  1. http://blog.notdot.net/2012/09/Dam-Cool-Algorithms-Cardinality-Estimation