Sunday, April 21, 2013

Pathological Data Sets and Universal Hashing Motivation

The load of a hash table is the total number of elements in the hash table divided by total amount of space in the table. This is usually denoted as alpha. Alpha=O(1) is necessary for operations to run in constant time. With open addressing we need alpha << 1. For good performance, we need to control load.

Secondly, for good performance, we need a good hash function, i.e. spreads the data evenly across buckets. However, a hash table that spreads out every data set does not exist. That is, for every hash function, there is a pathological data set.

Pathological Data in the Real World
Reference: Crosby and Wallach, USENIX 2003.
You can paralyze several real-world systems (like network intrusion detection) by exploiting badly designed hash functions. Two characteristics of these breakable systems were:
  1. Open source
  2. Overly simplistic hash function designed for speed (easy to reverse engineer a pathological data set)
So what should we do to solve this problem? 

No comments:

Post a Comment