Bloom filter
p. “a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.”
p. “a query returns either “possibly in set” or “definitely not in set”.”
p. “an impractically large amount of memory if “conventional” error-free hashing techniques were applied.”
p. “fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.”
Algorithm description
p. “m bits”
p. “k different hash functions”
p. “k is a small constant which depends on the desired false error rate ε, while m is proportional to k and the number of elements to be added.”
Ref How much is
M
proportional toK
?
- In general,
K
=~ (M/N) * ln2 (=0.693)N
is the number of elements to be inserted (stored)
p. “To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.”
Element > Hash Function > Array Position
- What kind of hash functions are used? How is the position from the array is decided?
- Any hash functions can be used because it always results in binary data, which can be interpreted as 32 / 64 bits Integer.
p. “To test whether an element is in the set, feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set”
p. “If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive.”
p. “For a good hash function with a wide output, there should be little if any correlation between different bit-fields of such a hash, so this type of hash can be used to generate multiple “different” hash functions by slicing its output into multiple bit fields.”
Good Hash Function's 64bits Hash Result: 0x123456789ABCDEF0
- 1st Hash Function: 16bits (0x1234)
- 2nd Hash Function 16bits (0x5678)
- 3rd Hash Function: 16bits (0x9ABC)
- 4th Hash Function: 16bits (0xDEF0)
p. “one can pass k different initial values (such as 0, 1, …, k − 1) to a hash function that takes an initial value; or add (or append) these values to the key.”
Use a same hash function, but with different values as seeds.
p. “Removing an element from this simple Bloom filter is impossible”
Because...
- There’s no way to tell which of the k bits should be cleared.
- 비트 하나만 지워도 값을 지우는 효과를 낼 수 있지만, 다른 값을 지울 수 있다는 부작용도 함께 존재.
Space and time advantages
p. “a substantial space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries.”
p. “A Bloom filter with a 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element, regardless of the size of the elements.”
1% Error ⇒ 9.6bits / Elements
p. “Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set.”
This can be boosted with aid of hardware support, since applying each hash functions can be done in parallel
p. “In a hardware implementation, however, the Bloom filter shines because its k lookups are independent and can be parallelized.”