Tirgul 10 Review of Universal Hashing. Solving two problems from the theoretical exercises: –T2 q. 1 –T3 q. 2
Universal Hashing Starting point: for every hash function, there is a “really bad” input. A possible solution: choose the hash function randomly from a family of hash functions. The logic behind it: for any given input, we want “most” of the hash functions in our family to handle it with few collisions. [Diagram: our family of hash functions, e.g. h_10,5, h_2,13, h_24,82, h_68,53, from which one specific hash function is chosen.]
Demonstration Let us conduct an experiment: –A family of about 10,000 hash functions (the family you saw in class, details later on). –One fixed input (50 keys), inserted into a table of size 70 (student grades of exercises - dast 2003). –Question: how many functions will behave really badly? The next slide shows the results: the x-axis gives the number of collisions (in this case it was also equal to the number of pairs that collide), and the y-axis gives how many functions had that number of collisions.
Results [histogram: number of collisions (x-axis) vs. number of functions with that many collisions (y-axis)]
Most functions perform close to the average performance. The average number of collisions is 8-9 [all entries of the hash table always contained at most two elements, so the number of collisions is actually the number of entries with more than one element]. Very few functions had more than twice this number of collisions (or less than half of it). This is no accident! –We constructed the family of functions so that the average performance of the functions over any input is good. –Probability laws (e.g. the Markov inequality you saw in class) tell us that very few elements of a universe can behave much worse (or much better) than the average.
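To make the probabilistic statement concrete (a standard bound, added here for completeness): if X is the number of collisions when the hash function is chosen uniformly from the family, Markov's inequality gives Pr[X >= t] <= E[X]/t for every t > 0; taking t = 2*E[X] shows that at most half of the functions can have more than twice the average number of collisions, and larger values of t bound the tail even further.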
A good family of hash functions Conclusion: designing a family with good average performance is enough. We need to know two things: –A criterion that guarantees good average performance. –How to construct a family that satisfies this criterion.
Ensuring good average performance Definition: A family of hash functions H is universal if for any two distinct keys k1 and k2, and any two slots in the table y1 and y2, the probability that h(k1) = y1 and h(k2) = y2 is at most 1/m^2 (m is the size of the hash table). Remark: This means that the chance that two keys fall into the same slot is at most 1/m - just as if the hash function were truly random! Claim: When using a universal hash family H, the average time of any hash operation is at most n/m + 1 (n is the number of elements we insert into the table).
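To see why the remark follows from the definition, sum over the m slots: Pr[h(k1) = h(k2)] = sum over all slots y of Pr[h(k1) = y and h(k2) = y] <= m * (1/m^2) = 1/m.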
Is this better than a balanced tree? If we have an estimate of n, the number of elements we will insert into the table, we can choose m accordingly and get constant (expected) time performance - no matter how many elements we have: 10^6, 10^8, 10^10, or more... In contrast, the performance of a balanced tree, O(log n), is affected by the number of elements: the more elements we have, the slower the operations. For very large numbers, like 10^10, this makes a difference.
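For a rough sense of scale (a back-of-the-envelope comparison, not from the original slide): with n = 10^10 elements a balanced tree performs about log2(10^10) ≈ 33 comparisons per operation, while a hash table with m chosen proportional to n keeps the expected cost at about n/m + 1, a small constant.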
Constructing a universal family Choose p - a prime larger than all keys. For any a in Z_p* = {1,...,p-1} and b in Z_p = {0,...,p-1}, define a hash function: h_a,b(k) = ((a * k + b) mod p) mod m. The universal family: H_p,m = { h_a,b | a in Z_p*, b in Z_p }. Theorem: H_p,m is a universal family of hash functions. In our demonstration, the set of keys was all possible grades. We chose p = 101 and inserted 50 (real) grades into a hash table of size 70, doing this for every hash function in H_101,70 and counting collisions.
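The following Python sketch is not part of the original tirgul; it implements the construction above with the demonstration's parameters p = 101 and m = 70, but the 50 real grades are replaced by made-up values for illustration:

import random

def make_hash(a, b, p, m):
    # The function h_a,b(k) = ((a*k + b) mod p) mod m
    return lambda k: ((a * k + b) % p) % m

def random_hash(p, m):
    # Choose a member of H_p,m uniformly at random
    a = random.randrange(1, p)    # a in {1, ..., p-1}
    b = random.randrange(0, p)    # b in {0, ..., p-1}
    return make_hash(a, b, p, m)

def count_collisions(h, keys, m):
    # Number of pairs of keys that land in the same slot of a table of size m
    slots = [0] * m
    for k in keys:
        slots[h(k)] += 1
    return sum(c * (c - 1) // 2 for c in slots)

p, m = 101, 70                             # prime larger than all keys (grades 0..100), table size
grades = random.sample(range(101), 50)     # placeholder input; the real 2003 grades are not reproduced here
print(count_collisions(random_hash(p, m), grades, m))

# Averaging over the whole family (about 10,000 functions), as in the demonstration:
family = [make_hash(a, b, p, m) for a in range(1, p) for b in range(p)]
print(sum(count_collisions(h, grades, m) for h in family) / len(family))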
A second approach - average over inputs In Universal Hashing we make no assumptions about the input (i.e., for any input, most hash functions will handle it well). For example, we don't know a priori how the grades are distributed (surely they are not uniform over 0-100). If we know that the input has a specific distribution, we might want to use this. For example, if the input is uniformly distributed, then the simple division method already achieves simple uniform hashing. In general, we don't know the input's distribution, and so Universal Hashing is superior!
T2 q.1 Reminder - quicksort: quicksort(A[1..n]) 1. Choose a pivot p from A. 2. Re-arrange A so that all elements smaller than p are located before it in A, and all larger elements after it. 3. Suppose p is now in slot k. 4. Recursively sort A[1..k-1] and A[k+1..n]. The connection to the previous discussion: if we choose the pivot randomly, we actually have a family of algorithms from which we choose one. The average performance is good, and so, for any input, most algorithms will perform well!
T2 q.1 - continued Question: How many calls to the random number generator will we have in the worst case, and in the best case? Answer: The number of these calls is always Θ(n) (between n/3 and n)! Proof: Let us draw the recursion tree: an internal node represents a call to quicksort with an array of size at least 2; a leaf represents a call to quicksort with an array of size 1. Any internal node is also the parent of a leaf that represents the pivot it used.
The recursion tree Any leaf represents a single element of the array, therefore the number of leaves is exactly n [the sorted array is actually the sequence of leaves, read from left to right]. The random number generator is called once in every internal node, therefore we are actually asking: how many internal nodes are there? Let X be the number of internal nodes.
Proof (continued) Observation 1: X is at most n, since every internal node is the parent of at least one leaf (its pivot), and no leaf has more than one parent. Observation 2: X is at least n/3: –Divide the set of leaves into subsets according to their parent. –Each subset contains at most 3 leaves, and therefore there are at least n/3 subsets. Therefore: X = no. of subsets >= n/3. Conclusion: n/3 <= X <= n, so the number of calls is always Θ(n). Q.E.D.
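A small Python sketch (illustrative, not from the exercise) that runs randomized quicksort on 50 distinct keys, counts the calls to the random number generator, and checks that the count indeed falls between n/3 and n:

import random

def quicksort(A):
    # Randomized quicksort on a list of distinct keys.
    # Returns the sorted list and the number of calls to the random number generator.
    if len(A) <= 1:
        return list(A), 0                 # leaf of the recursion tree: no pivot is chosen
    pivot = A[random.randrange(len(A))]   # one call to the generator per internal node
    smaller = [x for x in A if x < pivot]
    larger = [x for x in A if x > pivot]
    left, c1 = quicksort(smaller)
    right, c2 = quicksort(larger)
    return left + [pivot] + right, 1 + c1 + c2

n = 50
A = random.sample(range(1000), n)         # distinct keys
sorted_A, calls = quicksort(A)
assert sorted_A == sorted(A)
assert n / 3 <= calls <= n                # the Theta(n) bound proved above
print(calls)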
T3 q.2 Reminder - Red-Black trees: A binary search tree with the following properties: 1. Every node has a color - either red or black. 2. The root is black. 3. Every leaf (empty child) is black. 4. Both children of a red node are black. 5. Every path from a node to a descendant leaf contains the same number of black nodes. The black height of a tree is the number of black nodes on a path from the root to some leaf (not counting the root).
T3 q.2 - first part Question: What is the maximal number of internal nodes of an RB tree with black height h? First intuition: The path from the root to a leaf must contain exactly h black nodes. We want it to be long, so we can put a red node between every two black nodes. That's the most we can do, since otherwise we would violate property 4. Important: This is just intuition, not a proof! A proof must show, by one or more arguments with no gaps between them, that the claim must be true. For example, in the above intuition there are two gaps: how do we know there is no way to make it even longer? And can we actually construct such a tree?
A maximal tree First part: showing we can actually construct such a tree. Take a complete binary tree with 2h+1 levels. Color the root black, the second level red, the third level black, and so on. Number of internal nodes: 2^(2h) - 1. Notice that this is a valid RB tree (with black height h): Properties 1, 2 & 4 hold immediately. 3 - The number of levels is odd and we colored the first level black, so the last level (the leaves) is black too. 5 - All paths have alternating red & black nodes and have the same length. [Diagram: levels colored black, red, black, ... alternately]
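A short Python sketch (illustrative, not part of the exercise) that builds this alternately-colored tree, checks properties 2, 4 and 5, and confirms the count 2^(2h) - 1 for h = 3:

class Node:
    def __init__(self, color, left=None, right=None):
        self.color, self.left, self.right = color, left, right   # None stands for an empty (black) leaf

def build_max_tree(levels, color='B'):
    # Complete tree with `levels` levels of internal nodes, colors alternating
    # black, red, black, ... starting from a black root.
    if levels == 0:
        return None
    nxt = 'R' if color == 'B' else 'B'
    return Node(color, build_max_tree(levels - 1, nxt), build_max_tree(levels - 1, nxt))

def count_black(t):
    # Black nodes on a path from t down to (and including) an empty leaf;
    # returns -1 if two paths disagree (property 5 violated).
    if t is None:
        return 1
    l, r = count_black(t.left), count_black(t.right)
    if l == -1 or l != r:
        return -1
    return l + (1 if t.color == 'B' else 0)

def no_red_red(t):
    # Property 4: both children of a red node are black (None counts as black).
    if t is None:
        return True
    if t.color == 'R' and any(c is not None and c.color == 'R' for c in (t.left, t.right)):
        return False
    return no_red_red(t.left) and no_red_red(t.right)

def internal_count(t):
    return 0 if t is None else 1 + internal_count(t.left) + internal_count(t.right)

h = 3
t = build_max_tree(2 * h)                  # 2h levels of internal nodes (plus black empty leaves)
assert t.color == 'B' and no_red_red(t)    # properties 2 and 4
assert count_black(t) - 1 == h             # black height h, not counting the root
assert internal_count(t) == 2 ** (2 * h) - 1
print(internal_count(t))                   # 63 for h = 3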
This tree is indeed maximal Claim: Any RB tree with black height h has at most 2h levels (ignoring the leaves). Proof: How many nodes are on a path from the root to a leaf? All paths contain the same number of black nodes, h in our case. Including the root, we have h+1 black nodes. By property 4 there can be at most h red nodes between them, therefore the path has at most 2h+1 nodes, or 2h if we ignore the leaf. Remark: A binary tree with 2h levels contains at most 2^(2h) - 1 nodes; this maximum is achieved when the tree is complete. Answer: An RB tree with black height h has at most 2^(2h) - 1 internal nodes.
T3 q.2 - second part Question: What is the minimal number of internal nodes of an RB tree with black height h? Claim: There cannot be any red nodes in a minimal RB tree. Proof: Suppose there is a red node x. It must have two black children. We can delete x and one of its sub-trees, T1, and connect x's parent to the other sub-tree, T2. The only property we need to check is 5 - but since x is red and any path in T1 contains the same number of black nodes as any path in T2, property 5 still holds. Therefore the original tree was not minimal.
T3 q.2 - second part (continued) Claim: An RB tree with no red nodes must be a complete binary tree. Proof: Consider only the internal nodes. If this tree is not complete, there is a node with two downward paths to leaves of different lengths. Since all nodes are black, this violates property 5. Answer: There is a single RB tree of black height h with the minimal number of internal nodes: the complete all-black tree, which has 2^h - 1 internal nodes.
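For instance, evaluating the two bounds for small black heights: h = 1 gives between 2^1 - 1 = 1 and 2^2 - 1 = 3 internal nodes, h = 2 gives between 3 and 15, and h = 3 gives between 7 and 63.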