CSE 326: Data Structures Lecture #11 B-Trees Alon Halevy Spring Quarter 2001.

1 CSE 326: Data Structures Lecture #11 B-Trees Alon Halevy Spring Quarter 2001

2 B-Tree Properties
Properties:
–maximum branching factor of M
–the root has between 2 and M children
–other internal nodes have between ⌈M/2⌉ and M children
–internal nodes contain only search keys (no data)
–smallest datum between search keys x and y equals x
–each (non-root) leaf contains between ⌈L/2⌉ and L keys
–all leaves are at the same depth
Result:
–tree is log⌈M/2⌉(n/⌈L/2⌉) ± 1 deep, which is Θ(log n)
–all operations run in time proportional to depth
–operations pull in at least M/2 or L/2 items at a time
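The node-size invariants above can be sketched as plain records (a minimal sketch in Python for illustration; `Internal`, `Leaf`, and the checker functions are hypothetical names, not from the slides):

```python
import math

class Internal:
    """Internal node: search keys only (no data), with subtrees between them."""
    def __init__(self, keys, children):
        self.keys = keys          # i search keys
        self.children = children  # i + 1 subtrees

class Leaf:
    """Leaf node: the actual data keys."""
    def __init__(self, keys):
        self.keys = keys

def internal_ok(node, M, is_root=False):
    # root: 2..M children; other internal nodes: ceil(M/2)..M children;
    # always exactly one fewer search key than children
    lo = 2 if is_root else math.ceil(M / 2)
    return lo <= len(node.children) <= M and len(node.keys) == len(node.children) - 1

def leaf_ok(node, L, is_root=False):
    # non-root leaves hold between ceil(L/2) and L keys
    lo = 1 if is_root else math.ceil(L / 2)
    return lo <= len(node.keys) <= L
```

With M = 3 and L = 2 (the parameters used in the worked example later in the deck), a root holding search key 14 over leaves 1 3 and 14 passes both checks.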

3 When Big-O is Not Enough
A B-Tree is about log⌈M/2⌉(n/⌈L/2⌉) = log⌈M/2⌉ n − log⌈M/2⌉ ⌈L/2⌉ = O(log⌈M/2⌉ n) = O(log n) steps per operation (same as a BST!). Where's the beef?!
log2(10,000,000) ≈ 24 disk accesses
log200/2(10,000,000) < 4 disk accesses
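The two disk-access counts on this slide can be checked directly (Python for illustration; the choice M = L = 200, giving branching of at least M/2 = 100, is the slide's):

```python
import math

n = 10_000_000

# A balanced BST visits about log2(n) nodes, one disk access each:
bst_depth = math.ceil(math.log(n, 2))       # about 24 accesses

# A B-Tree branching by at least 100 is about log100(n) deep,
# i.e. 3.5, so at most 4 accesses reach any key:
btree_depth = math.ceil(math.log(n, 100))

print(bst_depth, btree_depth)
```

The constant factor hidden by Big-O is exactly what matters here: both are O(log n), but 24 disk seeks versus 4 is a sixfold difference per operation.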

4 B-Tree Nodes (figure)
Internal node: i search keys k1 … ki, i + 1 subtrees, and M − i − 1 inactive entries (out of slots 1 … M − 1).
Leaf: j data keys k1 … kj and L − j inactive entries (out of slots 1 … L).

5 Example B-Tree with M = 4 and L = 4 (figure): the root holds search keys 10 and 40; its three children hold search keys 3; 15, 20, 30; and 50; the leaves hold, in order, 1 2 | 3 5 6 9 | 10 11 12 | 15 17 | 20 25 26 | 30 32 33 36 | 40 42 | 50 60 70.

6 Making a B-Tree (M = 3, L = 2) Start from the empty B-Tree. Insert(3): a single leaf holding 3. Insert(14): that leaf now holds 3 and 14. Now, Insert(1)?

7 Splitting the Root Insert(1): too many keys in a leaf! So, split the leaf into 1 3 and 14, and create a new root with search key 14.

8 Insertions and Split Ends Insert(59): the leaf 14 becomes 14 59. Insert(26): too many keys in a leaf! So, split the leaf 14 26 59 into 14 26 and 59, and add the new child to the root (search keys 14 and 59).

9 Propagating Splits Insert(5): too many keys in the leaf 1 3 5! Split it into 1 3 and 5, and add the new child to the parent. Now the internal node has too many subtrees (four, with search keys 5, 14, 59)! So, split the node in two and create a new root with search key 14.

10 Insertion in Boring Text
Insert the key in its leaf
If the leaf ends up with L+1 items, overflow!
–Split the leaf into two nodes: the original with ⌈(L+1)/2⌉ items, the new one with ⌊(L+1)/2⌋ items
–Add the new child to the parent
–If the parent ends up with M+1 items, overflow!
If an internal node ends up with M+1 items, overflow!
–Split the node into two nodes: the original with ⌈(M+1)/2⌉ items, the new one with ⌊(M+1)/2⌋ items
–Add the new child to the parent
–If the parent ends up with M+1 items, overflow!
Split an overflowed root in two and hang the new nodes under a new root
This makes the tree deeper!
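The split step above can be sketched on bare key lists (a minimal sketch in Python; `split` is a hypothetical helper, and real nodes would also carry child pointers and a parent link):

```python
import math

def split(keys):
    """Split an overflowed node's key list: the original node keeps
    ceil(k/2) items and the new sibling gets floor(k/2) items."""
    k = len(keys)
    mid = math.ceil(k / 2)
    return keys[:mid], keys[mid:]

# With L = 2, inserting 5 into the leaf [1, 3] overflows it to 3 keys:
left, right = split([1, 3, 5])
# The parent then gains the new sibling as a child, with the sibling's
# smallest key (5) as a new search key; if the parent now has M + 1
# subtrees, the same split applies to it, and so on up to the root.
print(left, right)
```

This is why insertion never violates the half-full invariant: both halves of an L+1 (or M+1) overflow hold at least ⌈L/2⌉ (or ⌈M/2⌉) items.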

11 Deletion in B-trees Come to section tomorrow. Slides follow.

12 After More Routine Inserts Insert(89): the leaf 59 becomes 59 89. Insert(79): that leaf overflows; split it into 59 79 and 89, adding search key 89 to the parent. The tree now has root key 14 over an internal node with key 5 (leaves 1 3 and 5) and an internal node with keys 59 and 89 (leaves 14 26, 59 79, and 89).

13 Deletion Delete(59): remove 59 from its leaf, leaving the leaf 79; the parent's search keys become 79 and 89.

14 Deletion and Adoption Delete(5)? The leaf has too few keys! So, borrow from a neighbor: the leaf 1 3 gives up 3, and the parent's search key changes from 5 to 3 (leaves 1 and 3).

15 Deletion with Propagation Delete(3)? The leaf has too few keys, and no neighbor has a surplus! So, delete the leaf. But now the internal node above (holding only the leaf 1) has too few subtrees!

16 Finishing the Propagation (More Adoption) The underfull internal node adopts a subtree from its neighbor: the leaf 14 26 moves over (the left node becomes key 14 over leaves 1 and 14 26), and the root's search key becomes 79.

17 A Bit More Adoption Delete(1) (adopt from a neighbor): the leaf underflows and borrows 14 from the leaf 14 26. The tree is now root key 79 over key 26 (leaves 14 and 26) and key 89 (leaves 79 and 89).

18 Pulling out the Root Delete(26)? A leaf has too few keys, and no neighbor with surplus! So, delete the leaf. Now an internal node (over just the leaf 14) has too few subtrees and no neighbor with surplus! Delete that node too, handing its leaf to the neighbor (keys 79 and 89 over leaves 14, 79, and 89). But now the root has just one subtree!

19 Pulling out the Root (continued) The root has just one subtree! But that's silly! Just make the one child (keys 79 and 89 over leaves 14, 79, and 89) the new root!

20 Deletion in Two Boring Slides of Text
Remove the key from its leaf
If the leaf ends up with fewer than ⌈L/2⌉ items, underflow!
–Adopt data from a neighbor; update the parent
–If borrowing won't work, delete the node and divide its keys between the neighbors
–If the parent ends up with fewer than ⌈M/2⌉ items, underflow!
Why will dumping keys always work if borrowing doesn't?

21 Deletion Slide Two
If a node ends up with fewer than ⌈M/2⌉ items, underflow!
–Adopt subtrees from a neighbor; update the parent
–If borrowing won't work, delete the node and divide its subtrees between the neighbors
–If the parent ends up with fewer than ⌈M/2⌉ items, underflow!
If the root ends up with only one child, make the child the new root of the tree
This reduces the height of the tree!
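The borrow-or-merge decision in these two slides can be sketched on bare key lists (a minimal sketch; `fix_underflow` is a hypothetical helper, and a real implementation also moves child pointers and updates the parent's search keys):

```python
def fix_underflow(node, sibling, min_keys):
    """If `node` has too few keys, first try to adopt from `sibling`;
    otherwise merge the two lists (the caller then deletes the emptied
    node and underflow-checks the parent, which lost a child)."""
    if len(node) >= min_keys:
        return node, sibling              # no underflow: nothing to do
    if len(sibling) > min_keys:           # neighbor has surplus: borrow
        node.append(sibling.pop(0))       # assumes sibling is to the right
        return node, sibling
    # Merging always fits: it only happens when the neighbor is at its
    # minimum, so the combined list has at most the maximum node size.
    return sorted(node + sibling), None
```

This also answers the slide's question: dumping keys always works because a merge is only attempted when the neighbor has no surplus, i.e. when both nodes together cannot overflow.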

22 Thinking about B-Trees
B-Tree insertion can cause (expensive) splitting and propagation
B-Tree deletion can cause (cheap) borrowing or (expensive) deletion and propagation
Propagation is rare if M and L are large (Why?)
Repeated insertions and deletions can cause thrashing
If M = L = 128, then a B-Tree of height 4 will store at least 30,000,000 items
–height 5: 2,000,000,000!
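The capacity claims can be checked from the invariants: the root has at least 2 children, every other internal node at least M/2, and every leaf at least L/2 keys (a small Python check; `min_items` is a hypothetical name, with height counted as hops from the root down to the leaves):

```python
def min_items(M, L, height):
    # root contributes a factor of 2; each remaining internal level
    # multiplies the leaf count by at least M/2; each leaf holds >= L/2 keys
    leaves = 2 * (M // 2) ** (height - 1)
    return leaves * (L // 2)

print(min_items(128, 128, 4))   # at least 30,000,000 items
print(min_items(128, 128, 5))   # around 2,000,000,000
```

With M = L = 128 this gives 33,554,432 items at height 4 and 2,147,483,648 at height 5, matching the slide's "at least 30,000,000" and "2,000,000,000" figures.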

23 Tree Summary BST: fast finds, inserts, and deletes O(log n) on average (if data is random!) AVL trees: guaranteed O(log n) operations B-Trees: also guaranteed O(log n), but shallower depth makes them better for disk-based databases What would be even better? –How about: O(1) finds and inserts?

24 Hash Table Approach (figure: keys Zasha, Steve, Nic, Brad, and Ed each sent through a hash function f(x) to a slot in an array) But… is there a problem in this pipe-dream?

25 Hash Table Dictionary Data Structure
Hash function: maps keys to integers
–result: can quickly find the right spot for a given entry
Unordered and sparse table
–result: cannot efficiently list all entries, find the min or max, or find all items within a specified range

26 Hash Table Terminology (figure) hash function; keys; collision (two keys mapping to the same slot); load factor λ = (# of entries in table) / tableSize

27 Hash Table Code, First Pass

Value & find(Key & key) {
    int index = hash(key) % tableSize;
    return Table[index];
}

What should the hash function be? What should the table size be? How should we resolve collisions?

28 A Good Hash Function…
…is easy (fast) to compute (O(1) and practically fast).
…distributes the data evenly (hash(a) ≠ hash(b) for a ≠ b, as much as possible).
…uses the whole hash table (for all 0 ≤ k < size, there's an i such that hash(i) % size = k).

29 Good Hash Function for Integers
Choose:
–tableSize is prime
–hash(n) = n % tableSize
Example, with tableSize = 7: insert(4), insert(17), find(12), insert(9), delete(17)
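Tracing the slide's operation sequence with this hash function (Python for illustration):

```python
table_size = 7          # prime

def h(n):
    return n % table_size

# The slide's sequence of operations lands in these slots:
assert h(4) == 4        # insert(4)  -> slot 4
assert h(17) == 3       # insert(17) -> slot 3
assert h(12) == 5       # find(12)   -> probe slot 5 (empty: not found)
assert h(9) == 2        # insert(9)  -> slot 2
assert h(17) == 3       # delete(17) -> slot 3
```

A prime table size helps because key sets with a common stride (e.g. all multiples of some constant) share no factor with the modulus, so they still spread across the table.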

30 Good Hash Function for Strings? Ideas?

31 Good Hash Function for Strings?
Sum the ASCII values of the characters?
Consider only the first 3 characters?
–Uses only 2,871 out of 17,576 entries in the table on English words.
Better: think of the string as a base 128 number. Let s = s1 s2 s3 s4 … sn, and choose
–hash(s) = s1 + s2·128 + s3·128^2 + s4·128^3 + … + sn·128^(n−1)
Problems:
–hash("really, really big") = well… something really, really big
–hash("one thing") % 128 = hash("other thing") % 128

32 Making the String Hash Easy to Compute
Use Horner's Rule:

int hash(String s) {
    int h = 0;
    for (int i = s.length() - 1; i >= 0; i--) {
        h = (s[i] + 128 * h) % tableSize;
    }
    return h;
}
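A runnable version of the same rule (Python for illustration; `table_size` is an assumed parameter). Taking the remainder inside the loop keeps h small, so the huge base-128 number from the previous slide is never actually formed:

```python
def string_hash(s, table_size):
    """Horner's rule: evaluates s[0] + s[1]*128 + s[2]*128^2 + ...
    (mod table_size) one character at a time, last character first."""
    h = 0
    for c in reversed(s):
        h = (ord(c) + 128 * h) % table_size
    return h

# The previous slide's second problem, demonstrated: with a table size
# of 128, the hash depends only on the first character.
print(string_hash("one thing", 128) == string_hash("other thing", 128))
```

This also shows why the table size should not be a power of 2 here: % 128 throws away every character except the first.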

33 Universal Hashing
For any fixed hash function, there will be some pathological sets of inputs
–everything hashes to the same cell!
Solution: Universal Hashing
–Start with a large (parameterized) class of hash functions: no sequence of inputs is bad for all of them!
–When your program starts up, pick one of the hash functions to use at random (for the entire time)
–Now: no bad inputs, only unlucky choices! If the universal class is large, the odds of making a bad choice are very low
–If you do find you are in trouble, just pick a different hash function and re-hash the previous inputs

34 Universal Hash Function: "Random" Vector Approach
Parameterized by a prime size and a vector a = ⟨a0, a1, …, ar⟩ where 0 ≤ ai < size
Represent each key as r + 1 integers ⟨k0, k1, …, kr⟩ where ki < size
–size = 11, key = 39752 ⇒ ⟨2, 7, 9, 5, 9⟩ (39752 written in base 11)
–size = 29, key = "hello world" ⇒ one ki per character
h_a(k) = (Σ ai·ki) mod size: a dot product with a "random" vector!
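A sketch of this scheme for integer keys (Python for illustration; `key_digits` and `make_universal_hash` are hypothetical names, and the digit/vector ordering is an arbitrary choice of this sketch):

```python
import random

def key_digits(key, size):
    """Write an integer key in base `size`: the ki's, most significant first."""
    digits = []
    while key > 0:
        digits.append(key % size)
        key //= size
    return list(reversed(digits)) or [0]

def make_universal_hash(size, r):
    # Pick the "random" vector a once, when the program starts up.
    a = [random.randrange(size) for _ in range(r + 1)]
    def h(key):
        ks = key_digits(key, size)
        # Dot product of a with the key's digits, mod size.
        return sum(ai * ki for ai, ki in zip(a, ks)) % size
    return h

print(key_digits(39752, 11))   # the slide's example: [2, 7, 9, 5, 9]
```

No fixed input set is bad for most choices of a, which is the whole point: the adversary cannot know a in advance.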

35 Universal Hash Function
Strengths:
–works on any type as long as you can form the ki's
–if we're building a static table, we can try many a's
–a random a has guaranteed good properties no matter what we're hashing
Weaknesses:
–must choose a prime table size larger than any ki

36 Hash Function Summary Goals of a hash function –reproducible mapping from key to table entry –evenly distribute keys across the table –separate commonly occurring keys (neighboring keys?) –complete quickly Hash functions –h(n) = n % size –h(n) = string as base 128 number % size –Universal hash function #1: dot product with random vector

37 How to Design a Hash Function Know what your keys are Study how your keys are distributed Try to include all important information in a key in the construction of its hash Try to make “neighboring” keys hash to very different places Prune the features used to create the hash until it runs “fast enough” (very application dependent)

38 Collisions Pigeonhole principle says we can’t avoid all collisions –try to hash without collision m keys into n slots with m > n –try to put 6 pigeons into 5 holes What do we do when two keys hash to the same entry? –open hashing: put little dictionaries in each entry –closed hashing: pick a next entry to try
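The first of the two collision strategies, open hashing, can be sketched directly (a minimal sketch; `ChainedHashTable` is a hypothetical name, and Python's built-in `hash` stands in for the hash functions discussed above):

```python
class ChainedHashTable:
    """Open hashing: each table entry holds a little dictionary
    (here, a list of key/value pairs) of everything that hashed there."""
    def __init__(self, size=7):
        self.size = size
        self.table = [[] for _ in range(size)]

    def insert(self, key, value):
        bucket = self.table[hash(key) % self.size]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def find(self, key):
        for k, v in self.table[hash(key) % self.size]:
            if k == key:
                return v
        return None

t = ChainedHashTable(size=5)
for i, name in enumerate(["Zasha", "Steve", "Nic", "Brad", "Ed"]):
    t.insert(name, i)
```

Five keys in five slots guarantees at least the possibility of a collision, and chaining absorbs it: the colliding entries simply share a bucket. Closed hashing (probing for a next entry) is the other option the slide names.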

