Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 44321 CS4432: Database Systems II Basic indexing.

Similar presentations


Presentation on theme: "CS 44321 CS4432: Database Systems II Basic indexing."— Presentation transcript:

1 CS 44321 CS4432: Database Systems II Basic indexing

2 CS 44322 Indexing : helps to retrieve data quicker for certain queries value= 1,000,000 Select * FROM Emp WHERE salary = 1,000,000; Chapter 13  value record

3 CS 44323 Topics Sequential Index Files (chap 13.1) Secondary Indexes (chap 13.2)

4 CS 44324 Sequential File 20 10 40 30 60 50 80 70 100 90

5 CS 44325 Sequential File 20 10 40 30 60 50 80 70 100 90 Dense Index 10 20 30 40 50 60 70 80 90 100 110 120 Every record is in index.

6 CS 44326 Sequential File 20 10 40 30 60 50 80 70 100 90 Sparse Index 10 30 50 70 90 110 130 150 170 190 210 230 Only first record per block in index.

7 CS 44327 Sequential File 20 10 40 30 60 50 80 70 100 90 Sparse 2nd level 10 30 50 70 90 110 130 150 170 190 210 230 10 90 170 250 330 410 490 570

8 CS 44328 Note : DATA FILE or INDEX are “ordered files”. Question: How would we lay them out on disk ? - contiguous layout on disk ? - block-chained layout on disk ?

9 CS 44329 Questions: Do we want to build a dense 2 nd -level index for a dense index? Can we even do this ? Sequential File 20 10 40 30 60 50 80 70 100 90 2nd level? 10 30 50 70 90 110 130 150 170 190 210 230 10 90 170 250 330 410 490 570 1st level?

10 CS 443210 Notes on pointers: (1)Block pointer (used in sparse index) can be smaller than record pointer (used in dense index) BP RP

11 CS 443211 K1 K3 K4 K2 R1R2R3R4 say: 1024 B per block if we want K3 block: get it at offset (3-1)*1024 = 2048 bytes Note : If file is contiguous, then we can omit pointers

12 CS 443212 Sparse vs. Dense Tradeoff Sparse: Less index space per record can keep more of index in memory (Later: sparse better for insertions) Dense: Can tell if any record exists without accessing file (Later: dense needed for secondary indexes)

13 CS 443213 Terms Index sequential file Search key (  primary key) Primary index (on sequencing field) Secondary index Dense index (contains all search key values) Sparse index Multi-level index

14 CS 443214 Next: Duplicate keys Deletion/Insertion Secondary indexes

15 CS 443215 Duplicate keys 10 20 10 30 20 30 45 40

16 CS 443216 10 20 10 30 20 30 45 40 10 20 10 30 20 30 45 40 10 20 30 10 20 30 Dense index ! Point to each value ! Duplicate keys

17 CS 443217 10 20 10 30 20 30 45 40 Dense index. Point to each distinct value! 10 20 30 40 Duplicate keys

18 CS 443218 10 20 10 30 20 30 45 40 10 20 30 Sparse index: point to start of block ! Duplicate keys careful if looking for 20 or 30!

19 CS 443219 10 20 10 30 20 30 45 40 10 20 30 Sparse index, another way ? Duplicate keys – place first new key from block should this be 40?

20 CS 443220 Duplicate values, primary index Index may point to first instance of each value only File Index Summary a a a b 

21 CS 443221 Next: Duplicate keys Deletion/Insertion Secondary indexes

22 CS 443222 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150

23 CS 443223 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150 – delete record 40

24 CS 443224 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150 – delete record 30 40

25 CS 443225 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150 – delete records 30 & 40 50 70

26 CS 4432lecture #826 Deletion from dense index 20 10 40 30 60 50 80 70 10 20 30 40 50 60 70 80

27 CS 443227 Deletion from dense index 20 10 40 30 60 50 80 70 10 20 30 40 50 60 70 80 – delete record 30 40

28 CS 443228 Insertion, sparse index case 20 1030 50 4060 10 30 40 60

29 CS 443229 Insertion, sparse index case 20 1030 50 4060 10 30 40 60 – insert record 34 34 our lucky day! we have free space where we need it!

30 CS 443230 Insertion, sparse index case 20 1030 50 4060 10 30 40 60 – insert record 15 15 20 30 20 Immediate reorganization Other variations?

31 CS 443231 Just Illustrated: -Immediate reorganization Now Variation: – insert new block (chained file)

32 CS 443232 Insertion, sparse index case 20 1030 50 4060 10 30 40 60 – insert record 25 25 overflow blocks (reorganize later...)

33 CS 443233 Insertion, dense index case Similar Often more expensive...

34 CS 443234 Next: Duplicate keys Deletion/Insertion Secondary indexes

35 CS 443235 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Can I make a secondary index sparse ?

36 CS 443236 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Sparse index 30 20 80 100 90... does not make sense!

37 CS 443237 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Must be dense index ! 10 20 30 40 50 60 70... 10 50 90... sparse high level allowed?

38 CS 443238 With secondary indexes: Lowest level is dense Other levels are sparse Also: Pointers are record pointers (not block pointers; not computed)

39 CS 443239 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30

40 CS 443240 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40... one option... Problem: excess overhead! disk space search time

41 CS 443241 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 another option... 403020 Problem: variable size records in index!

42 CS 443242 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40 50 60... Another idea : Chain records with same key ! Problems: Need to add fields to data records for each index Need to follow chain to know records

43 CS 443243 Summary : Conventional Indexes –Basic Ideas: sparse, dense, multi-level… –Duplicate Keys –Deletion/Insertion –Secondary indexes

44 CS 443244 Multi-level Index Structures Sequence field 50 30 70 20 40 80 10 100 60 90 first level (dense, if non- sequential) 10 20 30 40 50 60 70... 10 50 90... high Level (always sparse) 1 2 5 4 3

45 CS 443245 Sequential indexes : pros/cons ? Advantage: - Simple - Index is sequential file good for scans - Search efficient for static data Disadvantage: - Inserts expensive, and/or - Lose sequentiality & balance - Then search time unpredictable

46 CS 443246 ExampleSequential Index continuous free space 10 20 30 40 50 60 70 80 90 39 31 35 36 32 38 34 33 overflow area (not sequential)

47 CS 443247 Another type of index Give up “sequentiality” of index Predictable performance under updates Achieve always balance of “tree” Automate restructuring under updates

48 CS 443248 Root B+Tree Examplen=3 100 120 150 180 30 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200

49 CS 443249 Sample non-leaf to keysto keysto keys to keys < 5757  k<8181  k<95  95 57 81 95

50 CS 443250 Sample leaf node: From non-leaf node to next leaf in sequence 57 81 95 To record with key 57 To record with key 81 To record with key 85

51 CS 443251 In textbook’s notationn=3 Leaf: Non-leaf: 30 35 30 35 30

52 CS 443252 Size of nodes:n+1 pointers n keys (fixed)

53 CS 443253 Don’t want nodes to be too empty Use at least Non-leaf:  (n+1)/2  pointers Leaf:  (n+1)/2  pointers to data

54 CS 443254 Full nodemin. node Non-leaf Leaf n=3 120 150 180 30 3 5 11 30 35 counts even if null Non-leaf:  (n+1)/2  pointers Leaf:  (n+1)/2  pointers to data

55 CS 443255 B+tree rulestree of order n (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

56 CS 443256 (3) Number of pointers/keys for B+tree Non-leaf (non-root) n+1n  (n+1)/ 2   (n+1)/ 2  - 1 Leaf (non-root) n+1n Rootn+1n11 Max Max Min Min ptrs keys ptrs  data keys  (n+ 1) / 2 

57 CS 443257 Root B+Tree Example : Searches 100 120 150 180 30 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200

58 CS 443258 Insert into B+tree (a) simple case –space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root

59 CS 443259 (a) Insert key = 32 n=3 3 5 11 30 31 30 100 32

60 CS 443260 (a) Insert key = 7 n=3 3 5 11 30 31 30 100 3535 7 7

61 CS 443261 (c) Insert key = 160 n=3 100 120 150 180 150 156 179 180 200 160 180 160 179

62 CS 443262 (d) New root, insert 45 n=3 10 20 30 123123 10 12 20 25 30 32 40 45 4030 new root

63 CS 443263 Recap: Insert Data into B+ Tree Find correct leaf L. Put data entry onto L. – If L has enough space, done! – Else, must split L (into L and a new node L2) Redistribute entries evenly, copy up middle key. Insert index entry pointing to L2 into parent of L. This can happen recursively – To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits “grow” tree; root split increases height. – Tree growth: gets wider or one level taller at top.

64 CS 443264 (a) Simple case (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf Deletion from B+tree

65 CS 443265 (a) Delete key = 11 n=3 3 5 11 30 31 30 100

66 CS 443266 (b) Coalesce with sibling –Delete 50 10 40 100 10 20 30 40 50 n=4 40

67 CS 443267 (c) Redistribute keys –Delete 50 10 40 100 10 20 30 35 40 50 n=4 35

68 CS 443268 40 45 30 37 25 26 20 22 10 14 1313 10 2030 40 (d) Coalese and Non-leaf coalese –Delete 37 n=4 40 30 25 new root

69 CS 443269 B+tree deletions in practice –Often, coalescing is not implemented –Too hard and not worth it!

70 CS 443270 Delete Data from B+ Tree Start at root, find leaf L where entry belongs. Remove the entry. – If L is at least half-full, done! – If L has only d-1 entries, Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height.

71 CS 443271 Concurrency control harder in B-Trees B-tree consumes more space DBA does not know when to reorganize DBA does not know how full to load pages of new index Buffering –B-tree: has fixed buffer requirements –Static index: must read several overflow blocks to be efficient (large & variable size buffers needed) Comparison: B-trees vs. static indexed sequential file

72 CS 443272 Speaking of buffering… Is LRU a good policyfor B+tree buffers?  Of course not!  Should try to keep root in memory at all times (and perhaps some nodes from second level)

73 CS 443273 Comparison B-tree vs. indexed seq. file Less space, so lookup faster Inserts managed by overflow area Requires temporary restructuring Unpredictable performance Consumes more space, so lookup slower Each insert/delete potentially restructures Build-in restructuring Predictable performance

74 CS 443274 Interesting problem: For B+tree, how large should n be? … n is number of keys / node

75 CS 443275 assumptions: n children per node and N records in database (1)Time to read B-Tree node from disk is (tseek + tread*n) msec. (2)Once in main memory, use binary search to locate key, (a + b log_2 n) msec (3)Need to search (read) log_n (N) tree nodes (4)t-search = (tseek + tread*n + (a + b*log_2(n)) * log n (N)

76 CS 443276  Can get: f(n) = time to find a record f(n) n opt n  FIND n opt by f’(n) = 0 øWhat happens to n opt as: Disk gets faster? CPU get faster? …

77 CS 443277 Bulk Loading of B+ Tree For large collection of records, create B+ tree. Method 1: Repeatedly insert records  slow. Method 2: Bulk Loading  more efficient.

78 CS 443278 Bulk Loading of B+ Tree Initialization: – Sort all data entries – Insert pointer to first (leaf) page in new (root) page. 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Sorted pages of data entries; not yet in B+ tree Root

79 CS 4432 79 Bulk Loading (Contd.) Index entries for leaf pages always entered into right- most index page When this fills up, it splits. Split may go up right-most path to root. 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* Root Data entry pages not yet in B+ tree 3523126 1020 3* 4* 6*9*10*11*12*13* 20*22* 23*31* 35* 36*38*41*44* 6 Root 10 12 23 20 35 38 not yet in B+ tree Data entry pages

80 CS 443280 Summary of Bulk Loading Method 1: multiple inserts. – Slow. – Does not give sequential storage of leaves. Method 2: Bulk Loading – Has advantages for concurrency control. – Fewer I/Os during build. – Leaves will be stored sequentially (and linked) – Can control “fill factor” on pages.


Download ppt "CS 44321 CS4432: Database Systems II Basic indexing."

Similar presentations


Ads by Google