Tuesday, March 12, 2013

Bloom Filters

The bloom filter is a data structure designed in 1970. It is similar to a hash table, but more space efficient; the trade-off is that it sometimes raises false positives.

A bloom filter supports super fast lookup and insertion, just like a hash table. So why have this data structure?

  1. Pros: It is much more space efficient. 
  2. Cons: You can't store an associated object. You are only storing a value to indicate whether or not an object is in the data structure. Deletions are also not allowed. There is also a small false positive probability. This means that even if you haven't inserted object x into the bloom filter, it may say x has been inserted. 
Applications
  1. Early spellcheckers - you add every valid word into the filter. Then you just check, word by word, whether something is in the filter or not. If it is not in the filter, then that word is misspelled. With small probability, this will say that an incorrectly spelled word is actually correct.
  2. List of forbidden passwords - This could be used to prevent the use of common or too-easily guessed passwords. 
  3. Network routers - routers have limited memory and need a super-fast data structure. 
To implement this data structure, you use an array of n bits. We also need k hash functions, somewhere around 2-5 (you could also take 2 hash functions and form k linear combinations of those 2 functions). Then, to insert x,
    for i=1,2,...,k
        set A[h_i(x)] = 1
Then for lookup, you just return true if A[h_i(x)]=1 for i=1,2,...,k. Here, you can't get a false negative since we never reset an entry in the array back to 0. However, you can get a false positive if other insertions happen to set all k of the A[h_i(x)] to 1.
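
Here is a minimal Python sketch of insert and lookup (the class name, the bit-array size, and the trick of deriving the k hash values from a single SHA-256 digest are just illustrative choices):

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1000, k=4):
        self.n = n_bits          # size of the bit array A
        self.k = k               # number of hash functions
        self.bits = [0] * n_bits

    def _hashes(self, x):
        # Derive k hash values from one digest; a real implementation would
        # use k independent hash functions (or linear combinations of two).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def insert(self, x):
        for idx in self._hashes(x):
            self.bits[idx] = 1   # set A[h_i(x)] = 1

    def lookup(self, x):
        # True can be a false positive; False is always correct,
        # since bits are never reset back to 0.
        return all(self.bits[idx] == 1 for idx in self._hashes(x))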

For this data structure to be good, we need to see that the probability of a false positive is relatively low, while keeping the space needed small. 

Rotations

Rotations are common to all balanced search trees. The idea is to locally rebalance subtrees at a node in O(1) time without violating the search tree property. When you invoke a rotation, it is always on a parent-child node pair. If it is a right child, you need a left rotation and vice-versa for a left child.

Consider a left rotation on a node X and its right child Y, where P is X's parent, A is X's left subtree, B is the subtree between X and Y, and C is Y's right subtree. We want to rotate the pair (X,Y) so that Y is the new child of P. Since X is less than Y, X must become Y's left child. Since C is greater than Y, it must remain Y's right child. Since A is less than X, it must remain X's left child. Now we know X<B<Y, so B will be X's right child. This also satisfies that B stays in Y's left subtree.

A right rotation is very similar to this.
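
A quick sketch of a left rotation in Python, using a hypothetical Node class with key, left, right, and parent fields (the names are illustrative; the pointer updates follow the description above):

class Node:
    def __init__(self, key, left=None, right=None, parent=None):
        self.key = key
        self.left = left
        self.right = right
        self.parent = parent

def left_rotate(x):
    # Rotate the pair (x, y), where y is x's right child, so that y
    # becomes the new child of x's parent and x becomes y's left child.
    y = x.right
    x.right = y.left            # B (y's left subtree) becomes x's right subtree
    if y.left is not None:
        y.left.parent = x
    y.parent = x.parent         # y takes x's place under the parent P
    if x.parent is not None:
        if x.parent.left is x:
            x.parent.left = y
        else:
            x.parent.right = y
    y.left = x                  # x becomes y's left child
    x.parent = y
    return y                    # new root of the rotated subtree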

Red Black Trees

There are lots of different trees for the same set of keys. The height of the tree could range anywhere between log_2 n and n. Since most of the operations on a search tree are governed by its height, we want to minimize the height. This is where balanced binary trees come in.

We want to ensure that the height is always O(logn) even after insertions and deletions. Some examples are red-black trees, AVL trees, and splay trees. There are also B-trees, which have multiple keys per node.

A red-black tree is the same as a binary search tree but it has a few invariants.

  1. Each node is either red or black.
  2. The root is black.
  3. No 2 reds in a row - if you have a red node, its children must be black. This implies if a child is red, its parent must be black.
  4. Every root-null path has the same number of black nodes.
Every red-black tree with n nodes has a height less than or equal to 2log_2(n+1). Roughly: if every root-null path has at least k nodes, the top part of the tree is a perfectly balanced tree, so n >= 2^k - 1, i.e. k <= log_2(n+1); by invariant 3, at least half the nodes on any root-null path are black, which gives the factor of 2.

Sunday, March 10, 2013

Binary Search Tree

The search tree property is as follows: everything to the left of a node is less than it and everything to the right is greater. There are many possible trees for the same set of keys. The height can range anywhere from log_2 n to n. Therefore, the worst possible running time for searching or inserting a key is O(height).

Getting Minimum
You can just traverse left down the tree since everything left is less than what you are at now.

Getting Maximum
This is the same concept as minimum but you move to the right.

Finding Predecessor
Easy Case - If k's left subtree is not empty, just return the max key in its left subtree.
Harder Case - Go up parent pointers until you reach one that is less than you.

In-Order Traversal
Let R be the root node of the search tree, with subtrees T_l and T_r. Recurse on T_l. Next, print out R's key. Lastly, recurse on T_r.
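
A Python sketch of these query operations, reusing the Node class from the rotation sketch above (predecessor follows the easy/hard cases just described):

def tree_min(x):
    # Keep following left children: everything to the left is smaller.
    while x.left is not None:
        x = x.left
    return x

def tree_max(x):
    # Mirror image: keep following right children.
    while x.right is not None:
        x = x.right
    return x

def predecessor(x):
    # Easy case: the predecessor is the max of the left subtree.
    if x.left is not None:
        return tree_max(x.left)
    # Harder case: go up parent pointers until we reach a node
    # whose key is less than x's (i.e. until we leave a left subtree).
    p = x.parent
    while p is not None and x is p.left:
        x, p = p, p.parent
    return p

def in_order(x):
    # Recurse left, visit the root, recurse right: keys come out sorted.
    if x is None:
        return []
    return in_order(x.left) + [x.key] + in_order(x.right)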

Deletion
First we need to search for the node. Then we have three cases:
(1) k has no children (easy) - Then we can just delete the node without any consequences
(2) k has one child (medium) - The unique child assumes the place of the parent node
(3) k has two children (hard) - We can sneakily compute the predecessor. That is, go to k's non-null left child and then follow right child pointers until no longer possible, giving a node l. Swap k and l. In its new position, k must have no right child, since we followed right pointers as far as they go. Now we know how to delete it, since it has 0 or 1 children.
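
A recursive Python sketch of the three deletion cases, using the same Node class as before (this version rebuilds child links top-down and ignores parent pointers for simplicity):

def delete(root, key):
    # Returns the new root of the subtree with `key` removed.
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        # Cases (1) and (2): zero or one child -- splice the node out.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Case (3): two children -- find the predecessor l (rightmost node
        # of the left subtree), copy it into this node, then delete l,
        # which is guaranteed to have at most one child.
        l = root.left
        while l.right is not None:
            l = l.right
        root.key = l.key
        root.left = delete(root.left, l.key)
    return root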

Note:
We can store the size of the subtree at each node. If x has children y and z, then size(x) = size(y) + size(z) + 1. We also have to maintain this value each time we do an insertion or deletion.

Select
The goal is to select the ith order statistic. Start with root x with children y and z.
(1) If i==size(y)+1, then return x's key.
(2) If i<=size(y), then recurse on y.
(3) if i-1 > size(y), then recursively compute (i-size(y)-1)th order statistic on search tree rooted at z.
The running time for this is still O(height).
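
A Python sketch of select, assuming each node additionally stores a size field maintained as described in the note above:

def select(x, i):
    # Return the node holding the ith smallest key in the subtree rooted at x.
    left_size = x.left.size if x.left is not None else 0
    if i == left_size + 1:
        return x                               # case (1): the root is the answer
    if i <= left_size:
        return select(x.left, i)               # case (2): recurse left
    return select(x.right, i - left_size - 1)  # case (3): skip left subtree and root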

Saturday, March 9, 2013

Balanced Binary Search Trees

We can think about a balanced binary search tree as a dynamic sorted array. If we have a sorted array, what kinds of operations can we perform on it?

  1. Binary search in O(logn) time.
  2. Select element given ith order statistic in O(1)
  3. Computing min/max of array - O(1)
  4. Predecessor/Successor - O(1), just find that element and return one before/after it
  5. Rank - how many keys stored are less than or equal to the key - O(logn)
  6. Output in sorted order - O(n)
Simply using a sorted array would be unacceptable for insertion/deletion because it could use O(n) time. This is where the binary search tree comes in! A BST is like a sorted array, but with logarithmic insertion and deletion. Now, the operations:
  1. Binary search - O(log n)
  2. Select - O(log n)
  3. Min/max - O(log n)
  4. Pred/succ - O(log n)
  5. Rank - O(log n)
  6. Output in sorted order - O(n)
  7. Insert/Deletion - O(log n)
If you only need min/max, you should use a heap instead. If you only need fast insertion and lookup, use a hash table instead. 

Thursday, March 7, 2013

Hash Tables cont.

So how is a hash table implemented? We could use an array for quick insert, deletion, and lookup, but it would take up a lot of space. On the other hand, we could use a linked list. This would only take up as much space as there are objects, but the 3 operations would not be O(1). To resolve this, we can use both.

The first way to implement a hash table could be to use an array of linked lists. We use a hash function to determine which index the object goes into. If there is already something in that index, this is called a "collision." To resolve this, we just make a chain at that index of the array.
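
A small Python sketch of chaining (the bucket count and the use of Python's built-in hash() are arbitrary choices here):

class ChainedHashTable:
    def __init__(self, n_buckets=64):
        # One chain per array index; Python lists stand in for linked lists.
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash function decides which index the object goes into.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # key already present: overwrite
                return
        bucket.append((key, value))            # a collision just grows the chain

    def lookup(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        bucket[:] = [(k, v) for k, v in bucket if k != key]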

The "Birthday Paradox" states that given 23 people, there is a 50% chance that two people will share the same birthday. By this principle, we can see that it doesn't take relatively many objects to cause a collision.

The second way to implement a hash table would be to use open addressing. When there is a collision, simply put the object elsewhere. This could be the next available index, which is called linear probing. Or, you could use a second hash function to determine the offset, which is called double hashing. Implementing deletion with this method is much trickier.
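
A sketch of linear probing in Python (fixed table size, no deletion, and it assumes the table never fills up):

class LinearProbingTable:
    EMPTY = object()   # sentinel for unused slots

    def __init__(self, n_slots=64):
        self.keys = [self.EMPTY] * n_slots
        self.values = [None] * n_slots

    def _probe(self, key):
        # Start at the hashed index and scan forward one slot at a time.
        i = hash(key) % len(self.keys)
        while True:
            yield i
            i = (i + 1) % len(self.keys)

    def insert(self, key, value):
        for i in self._probe(key):
            if self.keys[i] is self.EMPTY or self.keys[i] == key:
                self.keys[i], self.values[i] = key, value
                return

    def lookup(self, key):
        for i in self._probe(key):
            if self.keys[i] is self.EMPTY:
                return None                    # hit an empty slot: not present
            if self.keys[i] == key:
                return self.values[i]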

Wednesday, March 6, 2013

Hash Tables

A hash table is primarily used to keep track of an evolving set of stuff by using a key. It is good for insertions, deletions, and lookups. It is also sometimes referred to as a "dictionary." A hash table should not be used if you need to keep things in some particular order.


The operations that it supports are all blazingly fast, O(1) running time. However, this is only if it is implemented properly.

Applications
De-duplication: You are given a stream of objects, such as a linear scan through a large file or objects arriving in real time. The goal is to remove duplicates (only keep track of unique objects). This could be used to report only unique visitors to a website and to avoid duplicates in search results.

When a new object x arrives, look it up in hash table H. If it is not found, then insert x into H.
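
In Python this is a couple of lines, using the built-in (hash-table-backed) set:

def unique(stream):
    seen = set()
    for x in stream:
        if x not in seen:    # O(1) expected lookup
            seen.add(x)
            yield x          # report x only the first time it appears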

The 2-SUM Problem: You are given an array of n integers and a target sum T. The goal is to determine whether or not there are two numbers x,y in A with x+y=T.

The naive way to do this would be to check all combinations (n choose 2). This exhaustive search is O(n^2). A better way would be to sort the array. Then for each x in array A, use binary search to look for T-x. This will take O(nlogn). The best way would be to insert every element into a hash table. Then for every x, we can just look up its complement, T-x, which is O(1). This will be O(n).
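
A sketch of the hash-table version (whether x and y must be distinct entries depends on the exact problem statement; this version requires either two different values or one value that occurs twice):

def two_sum(A, T):
    counts = {}
    for x in A:                      # build the hash table: O(n)
        counts[x] = counts.get(x, 0) + 1
    for x in A:
        y = T - x                    # one O(1) expected lookup per element
        if y in counts and (y != x or counts[x] > 1):
            return True
    return False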

Hash tables were first designed for and used in symbol table lookups for a compiler. They can also be used to block network traffic. Another application is in search algorithms, such as game tree exploration (there are more chess configurations than pages on the web). You can add a configuration to a hash table to avoid exploring it more than once.

Tuesday, March 5, 2013

Implementing a Heap

We can think of a heap as a tree. It is rooted, binary, and as complete as possible. This means every level is full except possibly the lowest one, where the leaves are filled in from left to right.

The heap property is:
(1) At every node x, key[x] <= all keys of x's children.
Consequently, the root node must have the minimum key value.

Heaps

A heap is a data structure that contains objects which have comparable keys. Some uses for a heap include: keeping track of employee records where the social security number is the key, network edges, events where there is a priority, etc.

There are two basic operations for a heap.
(1) Insert - add a new object to the heap
(2) Extract-min - remove an object in the heap with a minimum key value (with ties broken arbitrarily). You could also make this an extract-max operation instead, or negate every key so that extract-min gives you the maximum.

The running time for both operations is O(logn) where n is the number of objects in the heap.

Another operation is heapify, which inserts a batch of objects into the heap. This seems like it should run in O(nlogn) but it can actually be done in O(n). Deletion can also be done in O(logn) time.
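
A sketch of an array-backed min-heap in Python (the complete tree lives in a list, with the children of index i at indices 2i+1 and 2i+2):

class MinHeap:
    def __init__(self):
        self.a = []

    def insert(self, key):
        # Append at the next leaf position, then bubble up: O(log n).
        a = self.a
        a.append(key)
        i = len(a) - 1
        while i > 0 and a[(i - 1) // 2] > a[i]:
            a[(i - 1) // 2], a[i] = a[i], a[(i - 1) // 2]
            i = (i - 1) // 2

    def extract_min(self):
        # Assumes the heap is non-empty. Swap the root with the last leaf,
        # remove it, then bubble the new root down: O(log n).
        a = self.a
        a[0], a[-1] = a[-1], a[0]
        smallest = a.pop()
        i = 0
        while True:
            l, r = 2 * i + 1, 2 * i + 2
            m = i
            if l < len(a) and a[l] < a[m]:
                m = l
            if r < len(a) and a[r] < a[m]:
                m = r
            if m == i:
                return smallest    # heap property restored
            a[i], a[m] = a[m], a[i]
            i = m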

Applications
(1) One application is for heap sort. In selection sort, you compute the minimum multiple times, resulting in O(n^2) running time. However, you can always heapify the set and then extract-min n times to get the elements in order. The running time for this algorithm is O(nlogn).
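
With Python's heapq module, heap sort is just a heapify followed by n extract-mins:

import heapq

def heap_sort(xs):
    heap = list(xs)
    heapq.heapify(heap)                                     # batch insertion: O(n)
    return [heapq.heappop(heap) for _ in range(len(heap))]  # n extract-mins: O(n log n)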

(2) A heap can also be called a priority queue. It can be used to schedule events and to check which event is going to occur next, using the timestamp as key. This makes it much faster than keeping an unordered list and finding the max each time.

(3) Thirdly, we can use a heap for "median maintenance." You are given a sequence x_1, ..., x_n and at each time step i, you need to give the median of {x_1, ..., x_i}, using O(log i) time at each step i. We can solve this by maintaining two heaps, H_low and H_high, to keep track of the smaller half of the elements and the larger half of the elements respectively. H_low supports extract-max while H_high supports extract-min. The key idea is to maintain the invariant that roughly the i/2 smallest (largest) elements are in H_low (H_high).

When you add a new element, if it is smaller than the max element of H_low, then add it there. Otherwise, add it to H_high. What if there is an imbalance? That is, what if H_low has two more elements than H_high? Then you simply extract the max of H_low and add it to the other heap.

So how do we compute the median? It is the max of H_low or the min of H_high. If i is even, then they are both medians. Otherwise, the median comes from the heap that has one more element than the other.
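
A Python sketch using heapq (which only provides min-heaps, so H_low stores negated keys to simulate extract-max):

import heapq

def running_medians(xs):
    low, high = [], []   # H_low (max-heap via negation) and H_high (min-heap)
    for x in xs:
        # Route the new element to the correct half.
        if low and x <= -low[0]:
            heapq.heappush(low, -x)
        else:
            heapq.heappush(high, x)
        # Rebalance so the two halves differ in size by at most one.
        if len(low) > len(high) + 1:
            heapq.heappush(high, -heapq.heappop(low))
        elif len(high) > len(low) + 1:
            heapq.heappush(low, -heapq.heappop(high))
        # The median is the max of H_low or the min of H_high (either if tied).
        yield -low[0] if len(low) >= len(high) else high[0]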

Monday, March 4, 2013

Strongly Connected Graph

A strongly connected graph is a directed graph in which there is a path from every vertex to every other vertex. The strongly connected components (SCCs) of a directed graph G are the equivalence classes of the relation u~v <-> there is a path u->v AND a path v->u. The ~ relation is an equivalence relation, that is, ~ is reflexive, symmetric, and transitive.

Computing the strongly connected components allows you to use the for-free primitives (BFS, DFS, etc.) on the components in linear time, which is blazingly fast.

Calling DFS on the right nodes will compute all the SCCs, but on the wrong nodes will just give back the union of multiple SCC's or even the whole graph.

With Kosaraju's Two-Pass Algorithm, we can compute the SCCs in linear time, with just two passes of DFS. First, make a new graph Grev with all the edges of the original graph reversed. The naive way of implementing this would be to run through the graph linearly and construct a new one. An optimization would be to just run DFS on the original graph, but going across arcs backwards.

Secondly, run DFS-loop on Grev. Thirdly, run DFS-loop on G. The first DFS-loop computes the "magic ordering" of the nodes while the second DFS-loop discovers the SCCs one by one.

DFS-loop(graph G)
     global variable t = 0 (for finishing times in first pass, number of nodes finished so far)
     global variable s = null (current source vertex, for leaders in second pass)
     Assume nodes labeled 1 to n
     for i:=n down to 1
          if i not yet explored
               s := i
               DFS(G, i)

DFS(graph G, node i)
     mark i as explored
     set leader(i) := node s
     for each arc(i,j) in G
          if j not yet explored
               DFS(G, j)
     t++
     set f(i) := t [i's finishing time]
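
A compact Python version of the two passes (iterative DFS to avoid recursion limits; `graph` is assumed to map every node to the list of nodes its outgoing arcs point to):

from collections import defaultdict

def kosaraju(graph):
    # Reverse every arc for the first pass.
    rev = defaultdict(list)
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)

    def dfs(g, start, explored, out):
        # Iterative DFS; appends nodes to `out` in order of finishing time.
        stack = [(start, iter(g[start]))]
        explored.add(start)
        while stack:
            node, arcs = stack[-1]
            for j in arcs:
                if j not in explored:
                    explored.add(j)
                    stack.append((j, iter(g[j])))
                    break
            else:
                out.append(node)   # all arcs of `node` processed: it finishes
                stack.pop()

    # First pass on Grev computes the "magic ordering" (finishing times).
    explored, finish_order = set(), []
    for node in graph:
        if node not in explored:
            dfs(rev, node, explored, finish_order)

    # Second pass on G, in decreasing finishing time, peels off the SCCs.
    leader, explored = {}, set()
    for node in reversed(finish_order):
        if node not in explored:
            members = []
            dfs(graph, node, explored, members)
            for m in members:
                leader[m] = node   # this whole DFS tree is one SCC
    return leader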
   

Dijkstra's Shortest Path Algorithm

You are given as input a directed graph G=(V,E) where m=|E| and n=|V|. Each edge also has a nonnegative length. For each v in V, compute L(v), the length of the shortest path from the source node s to v.

One might ask why use Dijkstra's Algorithm if you can already find the shortest path via Breadth First Search. However, BFS actually only works when all the edge lengths are one. It only tells you how many edges it takes to get from s to v.

You could replace each edge e with a directed path of length(e) unit edges, but this will make the graph way too big. It is too wasteful and the algorithm won't run in linear time anymore.

Thirdly, if there are negative edge lengths, you could always just add a large constant to each edge length so that none are negative. This seems plausible, but it will skew the shortest path results. If we add 5 to every edge length, a path with 10 edges will be increased by 50 while a path with only 5 edges will only be increased by 25.

Initialize:
X = {S} (vertices processed so far)
A[S] = 0 (computed shortest path distances)
B[S] = empty path [completed shortest paths]
Main Loop:
while (X != V) [when not all vertices have been processed]
     - among all edges (v, w) in E, with v in X and w not in X, pick one that minimizes A[v] + l(vw) [Dijkstra's greedy criterion]
     - call this edge (v*, w*)
     - add w* to X
     - set A[w*] := A[v*] + l(v*w*)
     - set B[w*] := B[v*] U (v*, w*)
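
A direct Python translation of this pseudocode (quadratic, because each iteration rescans all crossing edges; a heap-based version brings it down to O(m log n)). Here `graph` is assumed to map each vertex to a list of (neighbour, length) pairs with nonnegative lengths:

def dijkstra(graph, s):
    X = {s}             # vertices processed so far
    A = {s: 0}          # computed shortest-path distances
    B = {s: [s]}        # computed shortest paths
    while len(X) < len(graph):
        best = None
        # Among edges (v, w) with v in X and w not in X,
        # pick the one minimizing A[v] + l(v, w) -- Dijkstra's greedy criterion.
        for v in X:
            for w, length in graph[v]:
                if w not in X:
                    score = A[v] + length
                    if best is None or score < best[0]:
                        best = (score, v, w)
        if best is None:
            break       # remaining vertices are unreachable from s
        score, v_star, w_star = best
        X.add(w_star)
        A[w_star] = score
        B[w_star] = B[v_star] + [w_star]
    return A, B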