CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina.

CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina

CS 245Notes 42 Indexing & Hashing value Chapter 4  value record

CS 245Notes 43 Topics Conventional indexes B-trees Hashing schemes

CS 245Notes 44 Sequential File 20 10 40 30 60 50 80 70 100 90

CS 245Notes 45 Sequential File 20 10 40 30 60 50 80 70 100 90 Dense Index 10 20 30 40 50 60 70 80 90 100 110 120

CS 245Notes 46 Sequential File 20 10 40 30 60 50 80 70 100 90 Sparse Index 10 30 50 70 90 110 130 150 170 190 210 230

CS 245Notes 47 Sequential File 20 10 40 30 60 50 80 70 100 90 Sparse 2nd level 10 30 50 70 90 110 130 150 170 190 210 230 10 90 170 250 330 410 490 570

CS 245Notes 48 Comment: {FILE,INDEX} may be contiguous or not (blocks chained)

CS 245Notes 49 Question: Can we build a dense, 2nd level index for a dense index?

CS 245Notes 410 Notes on pointers: (1) Block pointer (sparse index) can be smaller than record pointer BP RP

CS 245Notes 411 (2) If file is contiguous, then we can omit pointers (i.e., compute them) Notes on pointers:

CS 245Notes 412 K1 K3 K4 K2 R1R2R3R4

CS 245Notes 413 K1 K3 K4 K2 R1R2R3R4 say: 1024 B per block if we want K3 block: get it at offset (3-1)1024 = 2048 bytes

CS 245Notes 414 Sparse vs. Dense Tradeoff Sparse: Less index space per record can keep more of index in memory Dense: Can tell if any record exists without accessing file (Later: –sparse better for insertions –dense needed for secondary indexes)

CS 245Notes 415 Terms Index sequential file Search key (  primary key) Primary index (on Sequencing field) Secondary index Dense index (all Search Key values in) Sparse index Multi-level index

CS 245Notes 416 Next: Duplicate keys Deletion/Insertion Secondary indexes

CS 245Notes 417 Duplicate keys 10 20 10 30 20 30 45 40

CS 245Notes 418 10 20 10 30 20 30 45 40 10 20 30 10 20 10 30 20 30 45 40 10 20 30 Dense index, one way to implement? Duplicate keys

CS 245Notes 419 10 20 10 30 20 30 45 40 10 20 30 40 Dense index, better way? Duplicate keys

CS 245Notes 420 10 20 10 30 20 30 45 40 10 20 30 Sparse index, one way? Duplicate keys

CS 245Notes 421 10 20 10 30 20 30 45 40 10 20 30 Sparse index, one way? Duplicate keys careful if looking for 20 or 30!

CS 245Notes 422 10 20 10 30 20 30 45 40 10 20 30 Sparse index, another way? Duplicate keys – place first new key from block

CS 245Notes 423 10 20 10 30 20 30 45 40 10 20 30 Sparse index, another way? Duplicate keys – place first new key from block should this be 40?

CS 245Notes 424 Duplicate values, primary index Index may point to first instance of each value only File Index Summary a a a b 

CS 245Notes 425 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150

CS 245Notes 426 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150 – delete record 40

CS 245Notes 429 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150 – delete record 30 40

CS 245Notes 430 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150 – delete records 30 & 40

CS 245Notes 431 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150 – delete records 30 & 40

CS 245Notes 432 Deletion from sparse index 20 10 40 30 60 50 80 70 10 30 50 70 90 110 130 150 – delete records 30 & 40 50 70

CS 245Notes 433 Deletion from dense index 20 10 40 30 60 50 80 70 10 20 30 40 50 60 70 80

CS 245Notes 434 Deletion from dense index 20 10 40 30 60 50 80 70 10 20 30 40 50 60 70 80 – delete record 30

CS 245Notes 435 Deletion from dense index 20 10 40 30 60 50 80 70 10 20 30 40 50 60 70 80 – delete record 30 40

CS 245Notes 436 Deletion from dense index 20 10 40 30 60 50 80 70 10 20 30 40 50 60 70 80 – delete record 30 40

CS 245Notes 437 Insertion, sparse index case 20 1030 50 4060 10 30 40 60

CS 245Notes 438 Insertion, sparse index case 20 1030 50 4060 10 30 40 60 – insert record 34

CS 245Notes 439 Insertion, sparse index case 20 1030 50 4060 10 30 40 60 – insert record 34 34 our lucky day! we have free space where we need it!

CS 245Notes 441 Insertion, sparse index case 20 1030 50 4060 10 30 40 60 – insert record 15 15 20 30 20

CS 245Notes 442 Insertion, sparse index case 20 1030 50 4060 10 30 40 60 – insert record 15 15 20 30 20 Illustrated: Immediate reorganization Variation: – insert new block (chained file) – update index

CS 245Notes 444 Insertion, sparse index case 20 1030 50 4060 10 30 40 60 – insert record 25 25 overflow blocks (reorganize later...)

CS 245Notes 445 Insertion, dense index case Similar Often more expensive...

CS 245Notes 446 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90

CS 245Notes 447 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Sparse index 30 20 80 100 90...

CS 245Notes 448 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Sparse index 30 20 80 100 90... does not make sense!

CS 245Notes 449 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Dense index

CS 245Notes 450 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Dense index 10 20 30 40 50 60 70...

CS 245Notes 451 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Dense index 10 20 30 40 50 60 70... 10 50 90... sparse high level

CS 245Notes 452 With secondary indexes: Lowest level is dense Other levels are sparse Also: Pointers are record pointers (not block pointers; not computed)

CS 245Notes 453 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30

CS 245Notes 454 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40... one option...

CS 245Notes 455 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40... one option... Problem: excess overhead! disk space search time

CS 245Notes 456 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 another option... 403020

CS 245Notes 457 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 another option... 403020 Problem: variable size records in index!

CS 245Notes 458 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40 50 60... Another idea (suggested in class): Chain records with same key?

CS 245Notes 459 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40 50 60... Another idea (suggested in class): Chain records with same key? Problems: Need to add fields to records Need to follow chain to know records

CS 245Notes 460 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40 50 60... buckets

CS 245Notes 461 Why “bucket” idea is useful IndexesRecords Name: primary EMP (name,dept,floor,...) Dept: secondary Floor: secondary

CS 245Notes 462 Query: Get employees in (Toy Dept) ^ (2nd floor) Dept. indexEMP Floor index Toy 2nd

CS 245Notes 463 Query: Get employees in (Toy Dept) ^ (2nd floor) Dept. indexEMP Floor index Toy 2nd  Intersect toy bucket and 2nd Floor bucket to get set of matching EMP’s

CS 245Notes 464 This idea used in text information retrieval Documents...the cat is fat......was raining cats and dogs......Fido the dog...

CS 245Notes 465 This idea used in text information retrieval Documents...the cat is fat......was raining cats and dogs......Fido the dog... Inverted lists cat dog

CS 245Notes 466 IR QUERIES Find articles with “cat” and “dog” Find articles with “cat” or “dog” Find articles with “cat” and not “dog”

CS 245Notes 467 IR QUERIES Find articles with “cat” and “dog” Find articles with “cat” or “dog” Find articles with “cat” and not “dog” Find articles with “cat” in title Find articles with “cat” and “dog” within 5 words

CS 245Notes 468 Common technique: more info in inverted list cat Title5 100 Author10 Abstract57 Title12 d3d3 d2d2 d1d1 dog type position location

CS 245Notes 469 Posting: an entry in inverted list. Represents occurrence of term in article Size of a list:1Rare words or (in postings) miss-spellings 10 6 Common words Size of a posting: 10-15 bits (compressed)

CS 245Notes 470 IR DISCUSSION Stop words Truncation Thesaurus Full text vs. Abstracts Vector model

CS 245Notes 471 Vector space model w1 w2 w3 w4 w5 w6 w7 … DOC = Query=

CS 245Notes 472 Vector space model w1 w2 w3 w4 w5 w6 w7 … DOC = Query= PRODUCT = 1 + ……. = score

CS 245Notes 473 Tricks to weigh scores + normalize e.g.: Match on common word not as useful as match on rare words...

CS 245Notes 474 How to process V.S. Queries? w1 w2 w3 w4 w5 w6 … Q =

CS 245Notes 475 Try Stanford Libraries Try Google, Yahoo,...

CS 245Notes 476 Summary so far Conventional index –Basic Ideas: sparse, dense, multi-level… –Duplicate Keys –Deletion/Insertion –Secondary indexes –Buckets of Postings List

CS 245Notes 477 Conventional indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Inserts expensive, and/or - Lose sequentiality & balance

CS 245Notes 478 ExampleIndex (sequential) continuous free space 10 20 30 40 50 60 70 80 90

CS 245Notes 479 ExampleIndex (sequential) continuous free space 10 20 30 40 50 60 70 80 90 39 31 35 36 32 38 34 33 overflow area (not sequential)

CS 245Notes 480 Outline: Conventional indexes B-Trees  NEXT Hashing schemes

CS 245Notes 481 NEXT: Another type of index –Give up on sequentiality of index –Try to get “balance”

CS 245Notes 482 Root B+Tree Examplen=3 100 120 150 180 30 3 5 11 30 35 100 101 110 120 130 150 156 179 180 200

CS 245Notes 483 Sample non-leaf to keysto keysto keys to keys < 5757  k<8181  k<95  95 57 81 95

CS 245Notes 484 Sample leaf node: From non-leaf node to next leaf in sequence 57 81 95 To record with key 57 To record with key 81 To record with key 85

CS 245Notes 485 In textbook’s notationn=3 Leaf: Non-leaf: 30 35 30 35 30

CS 245Notes 486 Size of nodes:n+1 pointers n keys (fixed)

CS 245Notes 487 Don’t want nodes to be too empty Use at least Non-leaf:  (n+1)/2  pointers Leaf:  (n+1)/2  pointers to data

CS 245Notes 488 Full nodemin. node Non-leaf Leaf n=3 120 150 180 30 3 5 11 30 35 counts even if null

CS 245Notes 489 B+tree rulestree of order n (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

CS 245Notes 490 (3) Number of pointers/keys for B+tree Non-leaf (non-root) n+1n  (n+1)/ 2   (n+1)/ 2  - 1 Leaf (non-root) n+1n Rootn+1n11 Max Max Min Min ptrs keys ptrs  data keys  (n+ 1) / 2 

CS 245Notes 491 Insert into B+tree (a) simple case –space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root

CS 245Notes 492 (a) Insert key = 32 n=3 3 5 11 30 31 30 100

CS 245Notes 493 (a) Insert key = 32 n=3 3 5 11 30 31 30 100 32

CS 245Notes 494 (a) Insert key = 7 n=3 3 5 11 30 31 30 100

CS 245Notes 495 (a) Insert key = 7 n=3 3 5 11 30 31 30 100 3535 7

CS 245Notes 496 (a) Insert key = 7 n=3 3 5 11 30 31 30 100 3535 7 7

CS 245Notes 497 (c) Insert key = 160 n=3 100 120 150 180 150 156 179 180 200

CS 245Notes 498 (c) Insert key = 160 n=3 100 120 150 180 150 156 179 180 200 160 179

CS 245Notes 499 (c) Insert key = 160 n=3 100 120 150 180 150 156 179 180 200 180 160 179

CS 245Notes 4100 (c) Insert key = 160 n=3 100 120 150 180 150 156 179 180 200 160 180 160 179

CS 245Notes 4101 (d) New root, insert 45 n=3 10 20 30 123123 10 12 20 25 30 32 40

CS 245Notes 4102 (d) New root, insert 45 n=3 10 20 30 123123 10 12 20 25 30 32 40 45

CS 245Notes 4103 (d) New root, insert 45 n=3 10 20 30 123123 10 12 20 25 30 32 40 45 40

CS 245Notes 4104 (d) New root, insert 45 n=3 10 20 30 123123 10 12 20 25 30 32 40 45 4030 new root

CS 245Notes 4105 (a) Simple case - no example (b) Coalesce with neighbor (sibling) (c) Re-distribute keys (d) Cases (b) or (c) at non-leaf Deletion from B+tree

CS 245Notes 4106 (b) Coalesce with sibling –Delete 50 10 40 100 10 20 30 40 50 n=4

CS 245Notes 4107 (b) Coalesce with sibling –Delete 50 10 40 100 10 20 30 40 50 n=4 40

CS 245Notes 4108 (c) Redistribute keys –Delete 50 10 40 100 10 20 30 35 40 50 n=4

CS 245Notes 4109 (c) Redistribute keys –Delete 50 10 40 100 10 20 30 35 40 50 n=4 35

CS 245Notes 4110 40 45 30 37 25 26 20 22 10 14 1313 10 2030 40 (d) Non-leaf coalese –Delete 37 n=4 25

CS 245Notes 4111 40 45 30 37 25 26 20 22 10 14 1313 10 2030 40 (d) Non-leaf coalese –Delete 37 n=4 30 25

CS 245Notes 4112 40 45 30 37 25 26 20 22 10 14 1313 10 2030 40 (d) Non-leaf coalese –Delete 37 n=4 40 30 25

CS 245Notes 4113 40 45 30 37 25 26 20 22 10 14 1313 10 2030 40 (d) Non-leaf coalese –Delete 37 n=4 40 30 25 new root

CS 245Notes 4114 B+tree deletions in practice –Often, coalescing is not implemented –Too hard and not worth it!

CS 245Notes 4115 Comparison: B-trees vs. static indexed sequential file Ref #1: Held & Stonebraker “B-Trees Re-examined” CACM, Feb. 1978

CS 245Notes 4116 Ref # 1 claims: - Concurrency control harder in B-Trees - B-tree consumes more space For their comparison: block = 512 bytes key = pointer = 4 bytes 4 data records per block

CS 245Notes 4117 Example: 1 block static index 127 keys (127+1)4 = 512 Bytes -> pointers in index implicit!up to 127 blocks k1 k2 k3 k1k2k3 1 data block

CS 245Notes 4118 Example: 1 block B-tree 63 keys 63x(4+4)+8 = 512 Bytes -> pointers needed in B-treeup to 63 blocks because index isblocks not contiguous k1 k2... k63 k1k2k3 1 data block next -

CS 245Notes 4119 Size comparison Ref. #1 Static Index B-tree # data blocks height 2 -> 1272 2 -> 63 2 128 -> 16,129 3 64 -> 3968 3 16,130 -> 2,048,3834 3969 -> 250,047 4 250,048 -> 15,752,961 5

CS 245Notes 4120 Ref. #1 analysis claims For an 8,000 block file, after 32,000 inserts after 16,000 lookups  Static index saves enough accesses to allow for reorganization

CS 245Notes 4121 Ref. #1 analysis claims For an 8,000 block file, after 32,000 inserts after 16,000 lookups  Static index saves enough accesses to allow for reorganization Ref. #1 conclusion Static index better!!

CS 245Notes 4122 Ref #2: M. Stonebraker, “Retrospective on a database system,” TODS, June 1980 Ref. #2 conclusion B-trees better!!

CS 245Notes 4123 DBA does not know when to reorganize DBA does not know how full to load pages of new index Ref. #2 conclusion B-trees better!!

CS 245Notes 4124 Buffering –B-tree: has fixed buffer requirements –Static index: must read several overflow blocks to be efficient (large & variable size buffers needed for this) Ref. #2 conclusion B-trees better!!

CS 245Notes 4125 Speaking of buffering… Is LRU a good policyfor B+tree buffers?

CS 245Notes 4126 Speaking of buffering… Is LRU a good policyfor B+tree buffers?  Of course not!  Should try to keep root in memory at all times (and perhaps some nodes from second level)

CS 245Notes 4127 Interesting problem: For B+tree, how large should n be? … n is number of keys / node

CS 245Notes 4128 Sample assumptions: (1) Time to read node from disk is (S+Tn) msec.

CS 245Notes 4129 Sample assumptions: (1) Time to read node from disk is (S+Tn) msec. (2) Once block in memory, use binary search to locate key: (a + b LOG 2 n) msec. For some constants a,b; Assume a << S

CS 245Notes 4130 Sample assumptions: (1) Time to read node from disk is (S+Tn) msec. (2) Once block in memory, use binary search to locate key: (a + b LOG 2 n) msec. For some constants a,b; Assume a << S (3) Assume B+tree is full, i.e., # nodes to examine is LOG n N where N = # records

CS 245Notes 4131  Can get: f(n) = time to find a record f(n) n opt n

CS 245Notes 4132  FIND n opt by f’(n) = 0 Answer is n opt = “few hundred” (see homework for details)

CS 245Notes 4133  FIND n opt by f’(n) = 0 Answer is n opt = “few hundred” (see homework for details)  What happens to n opt as Disk gets faster? CPU get faster?

Exercise f(n)= log n N*[S+T*n+a+b*log 2 (n)] CS 245Notes 4134 S=14000 T=0.2 b=0.002 a=0 N=10000000

N=10 million records CS 245Notes 4135 S=14000 T=0.2 b=0.002 a=0 N=10000000 times in microseconds

N=100 million records CS 245Notes 4136 S=14000 T=0.2 b=0.002 a=0 N=10000000 times in microseconds

CS 245Notes 4137 Variation on B+tree: B-tree (no +) Idea: –Avoid duplicate keys –Have record pointers in non-leaf nodes

CS 245Notes 4138 to record to record to record with K1 with K2 with K3 to keys to keys to keys to keys k3 K1 P1K2 P2K3 P3

CS 245Notes 4139 B-tree examplen=2 65 125 145 165 85 105 25 45 10 20 30 40 110 120 90 100 70 80 170 180 50 60 130 140 150 160

CS 245Notes 4140 B-tree examplen=2 65 125 145 165 85 105 25 45 10 20 30 40 110 120 90 100 70 80 170 180 50 60 130 140 150 160 sequence pointers not useful now! (but keep space for simplicity)

CS 245Notes 4141 Note on inserts Say we insert record with key = 25 10 20 30 n=3 leaf

CS 245Notes 4142 Note on inserts Say we insert record with key = 25 10 20 30 n=3 leaf 10 – 20 – 25 30 Afterwards:

CS 245Notes 4143 So, for B-trees: MAXMIN Tree Rec Keys PtrsPtrs Ptrs Ptrs Non-leaf non-root n+1nn  (n+1)/2   (n+1)/2  -1  (n+1)/2  -1 Leaf non-root 1nn 1  n/2   n/2  Root non-leaf n+1nn 2 1 1 Root Leaf 1nn 1 1 1

CS 245Notes 4144 Tradeoffs: B-trees have faster lookup than B+trees  in B-tree, non-leaf & leaf different sizes  in B-tree, deletion more complicated

CS 245Notes 4145 Tradeoffs: B-trees have faster lookup than B+trees  in B-tree, non-leaf & leaf different sizes  in B-tree, deletion more complicated  B+trees preferred!

CS 245Notes 4146 But note: If blocks are fixed size (due to disk and buffering restrictions) Then lookup for B+tree is actually better!!

CS 245Notes 4147 Example: - Pointers 4 bytes - Keys4 bytes - Blocks100 bytes (just example) - Look at full 2 level tree

CS 245Notes 4148 Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 100 bytes B-tree:

CS 245Notes 4149 Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 100 bytes B-tree: Each of 9 sons: 12 rec. pointers (+12 keys) = 12x(4+4) + 4 = 100 bytes

CS 245Notes 4150 Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 100 bytes B-tree: Each of 9 sons: 12 rec. pointers (+12 keys) = 12x(4+4) + 4 = 100 bytes 2-level B-tree, Max # records = 12x9 + 8 = 116

CS 245Notes 4151 Root has 12 keys + 13 son pointers = 12x4 + 13x4 = 100 bytes B+tree:

CS 245Notes 4152 Root has 12 keys + 13 son pointers = 12x4 + 13x4 = 100 bytes B+tree: Each of 13 sons: 12 rec. ptrs (+12 keys) = 12x(4 +4) + 4 = 100 bytes

CS 245Notes 4153 Root has 12 keys + 13 son pointers = 12x4 + 13x4 = 100 bytes B+tree: Each of 13 sons: 12 rec. ptrs (+12 keys) = 12x(4 +4) + 4 = 100 bytes 2-level B+tree, Max # records = 13x12 = 156

CS 245Notes 4154 So... ooooooooooooo ooooooooo 156 records 108 records Total = 116 B+ B 8 records

CS 245Notes 4155 So... ooooooooooooo ooooooooo 156 records 108 records Total = 116 B+ B 8 records Conclusion: –For fixed block size, –B+ tree is better because it is bushier

CS 245Notes 4156 An Interesting Problem... What is a good index structure when: –records tend to be inserted with keys that are larger than existing values? (e.g., banking records with growing data/time) –we want to remove older data

CS 245Notes 4157 One Solution: Multiple Indexes Example: I1, I2 day days indexed days indexed I1 I2 101,2,3,4,56,7,8,9,10 1111,2,3,4,56,7,8,9,10 1211,12,3,4,56,7,8,9,10 1311,12,13,4,56,7,8,9,10 advantage: deletions/insertions from smaller index disadvantage: query multiple indexes

CS 245Notes 4158 Another Solution (Wave Indexes) dayI1I2I3I4 101,2,34,5,67,8,910 111,2,34,5,67,8,910,11 121,2,34,5,67,8,910,11, 12 13134,5,67,8,910,11, 12 1413,144,5,67,8,910,11, 12 1513,14,154,5,67,8,910,11, 12 1613,14,15167,8,910,11, 12 advantage: no deletions disadvantage: approximate windows

CS 245Notes 4159 Outline/summary Conventional Indexes Sparse vs. dense Primary vs. secondary B trees B+trees vs. B-trees B+trees vs. indexed sequential Hashing schemes-->Next

CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina.

Similar presentations

Presentation on theme: "CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina.

Similar presentations

Presentation on theme: "CS 245Notes 41 CS 245: Database System Principles Notes 4: Indexing Hector Garcia-Molina."— Presentation transcript:

Similar presentations

About project

Feedback