Tuesday, March 12, 2013

Bloom Filters

The bloom filter is a data structure designed in 1970. It is similar to a hash table. They are more efficient than a hash table, but they do raise false positives at times.

A bloom filter supports super fast lookup and insertion, just like a hash table. So why have this data structure?

  1. Pros: It is much more space efficient. 
  2. Cons: You can't store an associated object. You are only storing a value to indicate whether or not an object is in the data structure. Deletions are also not allowed. There is also a small false positive probability. This means that even if you haven't inserted object x in the bloom filter, it will say x has been inserted. 
Applications
  1. Early spellcheckers - you add every possible word into the filter. Then you just check word by word if something is in the filter or not. If it is not in the filter, then that is an incorrect word. With small probability, this will say that an incorrectly spelled word is actually correct.
  2. List of forbidden passwords - This could be used to prevent the use of common or too-easily guessed passwords. 
  3. Network routers - There is limited memory and it needs a super-fast data structure. 
To implement this data structure, you use an array of n bits. We also need k has functions, somewhere around 2-5 (You could also make 2 hash functions and make k linear combinations of those 2 functions). Then, 
    for i=1,2,...,k 
        set A[hi(x)]=1
Then for lookup, you just return true if A[hi(x)]=1 for i=1,2,...,k. Here, you can't get a false negative since we never return an entry in the array back to 0. However, you can get a false positive if other insertions set all k hi(x) to 1.

For this data structure to be good, we need to see that the probability of a false positive is relatively low, while keeping the space needed small. 

No comments:

Post a Comment