Source

Bloom filter


p. “a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.”

p. “a query returns either “possibly in set” or “definitely not in set”.”

p. “an impractically large amount of memory if “conventional” error-free hashing techniques were applied.”

p. “fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.”

Algorithm description

p. “m bits”

p. “k different hash functions”

p. “k is a small constant which depends on the desired false error rate ε, while m is proportional to k and the number of elements to be added.”

p. “To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.”

p. “To test whether an element is in the set, feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set”

p. “If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive.”

p. “For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple “different” hash functions by slicing its output into multiple bit fields.”

p. “one can pass k different initial values (such as 0, 1, …, k − 1) to a hash function that takes an initial value; or add (or append) these values to the key.”

p. “Removing an element from this simple Bloom filter is impossible”

Space and time advantages

p. “a substantial space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries.”

p. “A Bloom filter with a 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements.”

p. “Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set.”

p. “In a hardware implementation, however, the Bloom filter shines because its k lookups are independent and can be parallelized.”