Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Tree-Structured Indexes Chapter 10. 2 Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice.

Similar presentations


Presentation on theme: "1 Tree-Structured Indexes Chapter 10. 2 Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice."— Presentation transcript:

1 1 Tree-Structured Indexes Chapter 10

2 2 Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice is orthogonal to the indexing technique used to locate data entries k* :  Hash  Tree

3 3 Introduction  Tree-structured indexing techniques support both range searches and equality searches.

4 4 Motivation for Tree Index  Range search : `` Find all students with gpa > 3.0 ’’  Given Sorted file  USE: do binary search to find first such student, then scan to find others.  Cost of binary search can be quite high.

5 5 Motivation for Tree Index  Range search : `` Find all students with gpa > 3.0 ’’  Simple idea: Create an `index’ file. * Can do binary search on (smaller) index file! Page 1 Page 2 Page N Page 3 Data File k2 kN k1 Index File

6 6 Motivation for Tree Index  Range search : `` Find all students with gpa > 3.0 ’’  Sorted file, do binary search to find first such student, then scan to find others.  Cost of binary search can be quite high.  Simple idea: Create an `index’ file. * Can do binary search on (smaller) index file! Page 1 Page 2 Page N Page 3 Data File k2 kN k1 Index File

7 7 Alternative Tree Index Structures  ISAM (Indexed Sequential Access Method) :  static structure.  B+ tree :  dynamic structure.  Adjust gracefully under inserts and deletes.

8 8 ISAM Index Page 1 Page 2 Page N Page 3 Data File k2 kN k1 Index File P 0 K 1 P 1 K 2 P 2 K m P m index entry

9 9 ISAM  Index file may still be quite large.  But we can apply the idea repeatedly! * Only leaf pages contain data entries. Non-leaf Pages Overflow page Primary pages Leaf

10 10 Example ISAM Tree  Each node can hold 2 entries.  No need for `next-leaf-page’ pointers. (?) 10*15*20*27*33*37*40* 46* 51* 55* 63* 97* 20335163 40 Root

11 11 Inserting 23*, 48*, 41*, 42*... 10*15*20*27*33*37*40* 46* 51* 55* 63* 97* 20335163 40 Root

12 12 After Inserting 23*, 48*, 41*... 10*15*20*27*33*37*40* 46* 51* 55* 63* 97* 20335163 40 Root 23* 48* 41* Overflow Pages Leaf Index Pages Primary

13 13 After Inserting 23*, 48*, 41*, 42*... 10*15*20*27*33*37*40* 46* 51* 55* 63* 97* 20335163 40 Root 23* 48* 41* 42* Overflow Pages Leaf Index Pages Primary

14 14 Now Let’s Delete: 42*, 51*, 97*, … 10*15*20*27*33*37*40* 46* 51* 55* 63* 97* 20335163 40 Root 23* 48* 41* 42* Overflow Pages Leaf Index Pages Primary

15 15... After Deleting 42*, 51*, 97* * Note that 51* appears in index levels, but not in leaf! * Index still is “fully balanced”, but with many empty slots. 10*15*20*27*33*37*40* 46*55* 63* 20335163 40 Root 23* 48* 41*

16 16 Comments on ISAM Data Pages Index Pages Overflow pages

17 17 Comments on ISAM  File creation : Leaf (data) pages allocated sequentially sorted by search key index pages allocated space for overflow pages later created  Index entries : ; they `direct’ search for data entries, which are in leaf pages or even in overflow pages. Data Pages Index Pages Overflow pages

18 18 Comments on ISAM  Search :  Start at root; use key comparisons to go to leaf. Cost log F N ; F = # entries/index pg, N = # leaf pgs  Insert :  Find leaf that data entry belongs to, and put it there.  Delete :  Find and remove from leaf; if empty overflow page, de-allocate. If empty primary page, leave it. Data Pages Index Pages Overflow pages

19 19 Comments on ISAM : Pros? Cons? Static tree structure : inserts/deletes affect only leaf pages. - Long overflow chains may appear. - May affect retrieve performance. Keep 20% free in leaf page. Problem might still appear! + Concurrent access. index page is not locked since it is never changed. ? What to do instead ????

20 20 B+ Tree: Most Widely Used Index  Guarantee Search/Insert/delete at log F N cost with F = fanout, N = # leaf pages  Keep tree height-balanced.  Supports equality and range-searches efficiently. Index Entries Data Entries ("Sequence set") (Direct search)

21 21 Example B+ Tree  Search begins at root, and key comparisons direct it to a leaf (as in ISAM).  Search for 5*; 15*, for all data entries >= 24*. * leaf nodes form a sequence to answer range query Root 1724 30 2* 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13

22 22 B+ Tree Concepts  Insert/delete at log F N cost; keep tree height- balanced.  Each node contains d <= m <= 2 d entries. The parameter d is called order of tree.  Minimum 50% occupancy or fill-factor (except for root). Index Entries Data Entries ("Sequence set") (Direct search)

23 23 B+ Trees in Practice  Typical order: 100. Typical fill-factor: 67%.  average fanout = 133  Typical capacities:  Height 4: 133 4 = 312,900,700 records  Height 3: 133 3 = 2,352,637 records  Can often hold top levels in buffer pool:  Level 1 = 1 page = 8 Kbytes  Level 2 = 133 pages = 1 Mbyte  Level 3 = 133 2 = 17,689 pages = 133 MBytes

24 24 Inserting Data Entry into B+ Tree  Find correct leaf L.  Put data entry onto L.  If L has enough space, done !  Else, must split L (into L and a new node L2) Redistribute entries evenly, copy up middle key. Insert index entry pointing to L2 into parent of L.  This can happen recursively  To split index node, redistribute entries evenly, and push up middle key. (Contrast with leaf splits.)

25 25 Inserting Data Entry into B+ Tree  Splits “grow” the tree in width.  A root split increases height of tree.  Tree growth: gets wider or one level taller at top.

26 26 Root 1724 30 2* 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13 Inserting 8* into Example B+ Tree

27 27 Root 1724 30 2* 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13 Inserting 8* into Example B+ Tree 2* 3*5* 7* 8* 5 (Note that 5 is continues to appear in the leaf.) s copied up and

28 28 Root 1724 30 2* 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13 Inserting 8* into Example B+ Tree 2* 3*5* 7* 8* 5 52430 17 13 Entry to be inserted in parent node. (Note that 17 is pushed up and only once Appears in the index.)

29 29 Inserting 8* into Example B+ Tree  Observe how minimum occupancy is guaranteed in both leaf and index pg splits.  Note difference between copy- up and push-up ; be sure you understand the reasons for this. 2* 3*5* 7* 8* 5 Entry to be inserted in parent node. (Note that 5 is continues to appear in the leaf.) s copied up and appears once in the index. Contrast 52430 17 13 Entry to be inserted in parent node. (Note that 17 is pushed up and only this with a leaf split.)

30 30 Inserting 8* into Example B+ Tree  Observe how minimum occupancy is guaranteed in both leaf and index pg splits.  Note difference between copy- up and push-up ; be sure you understand the reasons for this. 2* 3*5* 7* 8* 5 Entry to be inserted in parent node. (Note that 5 is continues to appear in the leaf.) s copied up and appears once in the index. Contrast 52430 17 13 Entry to be inserted in parent node. (Note that 17 is pushed up and only this with a leaf split.)

31 31 Example B+ Tree After Inserting 8*  Notice that root was split, leading to increase in height. 2*3* Root 17 24 30 14*16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8*

32 32 Example B+ Tree After Inserting 8*  In this example, we could avoid split by re-distributing entries; however, this is usually not done in practice. 2*3* Root 17 24 30 14*16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8*

33 33 Root 1724 30 2* 3*5* 7*14*16* 19*20*22*24*27* 29*33*34* 38* 39* 13 Inserting 8* into Example B+ Tree using Redistribution  Avoid split by re-distributing entries : -Checking sibling to decide whether redistribution is possible. -Usually redistribution in leaf, not in non leaf nodes.

34 34 Example B+ Tree After Inserting 8* using Redistribution Root 1724 30 2* 3*5* 7*14* 16* 19*20*22*24*27* 29*33*34* 38* 39* 8 8*

35 35 Deleting a Data Entry from a B+ Tree  Start at root, find leaf L where entry belongs.  Remove the entry.  If L is at least half-full, done!  If L has only d-1 entries, Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, merge L and sibling.  If merge occurred, must delete entry (pointing to L or sibling) from parent of L.  Merge could propagate to root, decreasing height.

36 36 Now Try to Delete 19* and then 20* 2*3* Root 17 24 30 14*16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8*

37 37 Now Try to Delete 19* … 2*3* Root 17 24 30 14*16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8*  Deleting 19* is easy.

38 38 Now Delete 20* … 2*3* Root 17 24 30 14*16* 20*22* 24*27* 29*33*34* 38* 39* 135 7*5*8*

39 39 Now Delete 20* … 2*3* Root 17 24 30 14*16* 20*22* 24*27* 29*33*34* 38* 39* 135 7*5*8* Deleting 20* can be done with re-distribution.

40 40 We have deleted 20*...  Deleting 20* done with re-distribution.  Notice how middle key is copied up. 2*3* Root 17 30 14*16* 33*34* 38* 39* 135 7*5*8*22*24* 27 27*29*

41 41 Now Deleting 24*...  Can re-distribution among sibling work again? 2*3* Root 17 30 14*16* 33*34* 38* 39* 135 7*5*8*22*24* 27 27*29*

42 42 Now Deleting 24*... ?  If not, so let’s try merging of siblings … 2*3* Root 17 30 14*16* 33*34* 38* 39* 135 7*5*8*22*24* 27 27*29*

43 43 Now Deleting 24* via Merging. 2*3* Root 17 30 14*16* 33*34* 38* 39* 135 7*5*8*22* 27 27*29*  Who link to who ? 22*27* 29*

44 44 Now Deleting 24* via Merging. 2*3* Root 17 30 14*16* 33*34* 38* 39* 135 7*5*8*22*24* 27 27*29*  Observe ` toss ’ of index entry (on right) 30 22*27* 29*33*34* 38* 39*

45 45 Now Deleting 24* via Merging. 2*3* Root 17 14*16* 135 7*5*8* 30 22*27* 29*33*34* 38* 39* Are we done ?

46 46 Now Deleting 24* via Merging. 2*3* Root 17 14*16* 135 7*5*8* 30 22*27* 29*33*34* 38* 39* First try re-distributing If not possible, try merging.

47 47 Now Deleting 24* via Merging. 2*3* Root 17 14*16* 135 7*5*8* 30 22*27* 29*33*34* 38* 39* Observe ` pull down ’ of index entry. 2* 3* 7* 14*16* 22* 27* 29* 33*34* 38* 39* 5*8* Root 30 135 17

48 48... And Then Deleting 24*  Must merge.  Observe ` toss ’ of index entry (on right), and ` pull down ’ of index entry (below). 30 22*27* 29*33*34* 38* 39* 2* 3* 7* 14*16* 22* 27* 29* 33*34* 38* 39* 5*8* Root 30 135 17

49 49 Example of Non-leaf Re-distribution  Tree is shown below during deletion of 24*.  In contrast to previous example, can re-distribute entry from left child of root to right child. Root 135 1720 22 30 14*16* 17*18* 20*33*34* 38* 39* 22*27*29*21* 7*5*8* 3*2*

50 50 Example of Non-leaf Re-distribution  Tree is shown below during deletion of 24*. Root 135 17 2020 2 30 14*16* 17*18* 20*33*34* 38* 39* 22*27*29*21* 7*5*8* 3*2*

51 51 Miscellaneous about B-Trees

52 52 Duplicate keys in B+ Tree  Alternative 1: overflow page  Alternative 2: contain duplicates in leaf page and fetch using leaf sequence pointer not efficient in delete! (check duplicate entries) Solution:, use rid as part of searching key. Index includes both key and rid fields.  Alternative 3:, similar with 2.

53 53 Prefix Key Compression  Important to increase fan-out.  Key values in index entries only `direct traffic’; can often compress them.  Insert/delete must be suitably modified.

54 54 Prefix Key Compression of Key Values  Adjacent index entries with search key values: Dannon Yogurt, David Smith and Devarakonda Murthy, We can abbreviate David Smith to Dav. Is this correct? Not quite! What if there is a data entry Davey Jones ? So can only compress David Smith to Davi  In general, while compressing, leave each index entry greater than every key value (in any subtree) to its left.

55 55 Loading of a B+ Tree  If we have a large collection of records, and want to create a B+ tree on some field, how do we load these records into index?

56 56 Bulk Loading of a B+ Tree  Repeatedly inserting records is very slow.  Bulk Loading can be done more efficiently.  Step 1: Initialization :  Sort all data entries,  Insert pointer to first (leaf) page in a new (root) page. 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Sorted pages of data entries; not yet in B+ tree Root

57 57 Bulk Loading (Contd.)  Index entries for leaf pages always entered into right-most index page just above leaf level.  When this fills up, it splits.  Split may go up right-most path to the root. 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Root Data entry pages not yet in B+ tree 3523126 1020 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* 6 Root 10 12 23 20 35 38 not yet in B+ tree Data entry pages

58 58 Summary of Bulk Loading  Option 1: multiple inserts.  Slow.  Does not give sequential storage of leaves.  Option 2: Bulk Loading  Has advantages for concurrency control.  Fewer I/Os during build.  Leaves will be stored sequentially (and linked)  Can control “fill factor” on pages.

59 59 A Note on `Order’  Order ( d ) concept replaced by physical space criterion in practice (` at least half-full ’).  Idea: Merge when more than half of space in node is unused.  Why?  Index pages can typically hold many more entries than leaf pages.  Variable sized records and search keys mean different nodes will contain different numbers of entries.  Even with fixed length fields, multiple records with the same search key value ( duplicates ) can lead to variable-sized data entries (if we use Alternative (3)).

60 60 Summary  Tree-structured indexes are ideal for range- searches, also good for equality searches.  ISAM is a static structure.  Only leaf pages modified; overflow pages needed.  Overflow chains can degrade performance unless size of data set and data distribution stay constant.  B+ tree is a dynamic structure.  Inserts/deletes leave tree height-balanced; log F N cost.  High fanout ( F ) means depth rarely more than 3 or 4.  Almost always better than maintaining a sorted file.

61 61 Summary (Contd.)  B+ Tree :  Typically, 67% occupancy on average.  Usually preferable to ISAM, modulo locking considerations  Adjusts to growth gracefully.  Key compression increases fanout and reduces height.  Bulk loading of B+ tree faster than repeated inserts  Most widely used index in database management systems because of its versatility => One of the most optimized components of a DBMS.


Download ppt "1 Tree-Structured Indexes Chapter 10. 2 Introduction  As for any index, 3 alternatives for data entries k* :  Data record with key value k   Choice."

Similar presentations


Ads by Google