1 CS232A: Database System Principles INDEXING

2 Given a condition on an attribute, find the qualified records. The condition may be Attr = value, and may also be a range condition such as Attr > value or Attr >= value. An index maps a value to the qualified records.

3 Indexing Data structures used for quickly locating tuples that meet a specific type of condition –Equality condition: find Movie tuples where Director=X –Other conditions possible, e.g., range conditions: find Employee tuples where Salary>40 AND Salary<50 Many types of indexes. Evaluate them on –Access time –Insertion time –Deletion time –Disk space needed (esp. as it affects access time)

4 4 Topics Conventional indexes B-trees Hashing schemes

5 Terms and Distinctions Primary index –the index on the attribute (a.k.a. search key) that determines the sequencing of the table Secondary index –index on any other attribute Dense index –every value of the indexed attribute appears in the index Sparse index –many values do not appear (Figure: a dense primary index over a sequential file)

6 Dense and Sparse Primary Indexes Dense Primary Index, Sparse Primary Index. To look up a value in a sparse index, find the index record with the largest value that is less than or equal to the value we are looking for. + can tell if a value exists without accessing the file (consider projection) + better access to overflow records + less index space (more pros and cons in a while)
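
A minimal sketch of that sparse-index lookup rule in Python; the index contents and block numbers below are made up for illustration, not from the slides:

```python
from bisect import bisect_right

# Hypothetical sparse index: one (key, block_no) entry per data block,
# sorted by key. Entry key = smallest search key stored in that block.
index = [(10, 0), (40, 1), (70, 2), (100, 3)]

def lookup_block(search_key):
    keys = [k for k, _ in index]
    pos = bisect_right(keys, search_key) - 1   # largest indexed key <= search_key
    if pos < 0:
        return None                            # smaller than every key in the file
    return index[pos][1]                       # block to fetch and scan

print(lookup_block(55))   # -> 1 (block whose first key is 40)
```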

7 7 Sparse vs. Dense Tradeoff Sparse: Less index space per record can keep more of index in memory Dense: Can tell if any record exists without accessing file (Later: –sparse better for insertions –dense needed for secondary indexes)

8 Multi-Level Indexes Treat the index as a file and build an index on it “Two levels are usually sufficient. More than three levels are rare.” Q: Can we build a dense second level index for a dense index ?

9 A Note on Pointers Record pointers consist of a block pointer and the position of the record in the block Using only the block pointer saves space at no extra cost in disk accesses

10 Representation of Duplicate Values in Primary Indexes Index may point to first instance of each value only

11 Deletion from Dense Index Delete 40, 80 Lists of available entries Deletion from a dense primary index file with no duplicate values is handled in the same way as deletion from a sequential file Q: What about deletion from a dense primary index with duplicates?

12 Deletion from Sparse Index if the deleted entry does not appear in the index do nothing Delete 40

13 Deletion from Sparse Index (cont’d) if the deleted entry does not appear in the index do nothing if the deleted entry appears in the index replace it with the next search-key value –comment: we could leave the deleted value in the index assuming that no part of the system may assume it still exists without checking the block Delete 30

14 Deletion from Sparse Index (cont’d) if the deleted entry does not appear in the index do nothing if the deleted entry appears in the index replace it with the next search-key value unless the next search key value has its own index entry. In this case delete the entry Delete 40, then 30

15 Insertion in Sparse Index if no new block is created then do nothing Insert 35

16 Insertion in Sparse Index if no new block is created then do nothing else create an overflow record –Reorganize periodically –Could we claim the space of the next block? –How often do we reorganize, and how expensive is it? –B-trees offer convincing answers Insert 15

17 17 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 File not sorted on secondary search key

18 18 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Sparse index 30 20 80 100 90... does not make sense!

19 19 Secondary indexes Sequence field 50 30 70 20 40 80 10 100 60 90 Dense index 10 20 30 40 50 60 70... 10 50 90... sparse high level First level has to be dense, next levels are sparse (as usual)

20 20 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30

21 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40... one option... Problem: excess overhead! (disk space, search time)

22 22 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 another option: lists of pointers 403020 Problem: variable size records in index!

23 23 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40 50 60... Yet another idea : Chain records with same key? Problems: Need to add fields to records, messes up maintenance Need to follow chain to know records

24 24 Duplicate values & secondary indexes 10 20 40 20 40 10 40 10 40 30 10 20 30 40 50 60... buckets

25 Why the “bucket” idea is useful. EMP (name, dept, year, ...): Name: primary index; Dept: secondary index; Year: secondary index. Enables the processing of queries working with pointers only. Very common technique in Information Retrieval.

26 Advantage of Buckets: Process Queries Using Pointers Only Find employees of the Toys dept with 4 years in the company SELECT Name FROM Employee WHERE Dept=“Toys” AND Year=4 Dept Index Year Index Intersect the Toys bucket and the Year=4 bucket to get the set of matching EMPs
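
A minimal sketch of the pointer-intersection step in Python; the record ids below are made up for illustration:

```python
# Hypothetical record ids (pointers) returned by the two secondary indexes.
dept_bucket = {101, 104, 107, 110}    # EMPs with Dept = 'Toys'
year_bucket = {104, 105, 110, 230}    # EMPs with Year = 4

# Intersect the pointer sets first; only then fetch the matching records.
matching = dept_bucket & year_bucket
print(sorted(matching))               # -> [104, 110]
```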

27 27 This idea used in text information retrieval Documents...the cat is fat......my cat and my dog like each other......Fido the dog... Buckets known as Inverted lists cat dog

28 28 Information Retrieval (IR) Queries Find articles with “cat” and “dog” –Intersect inverted lists Find articles with “cat” or “dog” –Union inverted lists Find articles with “cat” and not “dog” –Subtract list of dog pointers from list of cat pointers Find articles with “cat” in title Find articles with “cat” and “dog” within 5 words
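
A minimal sketch of the first three operations in Python, with made-up document-id lists; sets are used here for brevity, whereas real systems merge the sorted inverted lists directly:

```python
# Hypothetical inverted lists: sorted document ids per term.
cat = [1, 2, 5, 8, 9]
dog = [2, 3, 8, 10]

cat_and_dog = sorted(set(cat) & set(dog))   # intersection -> [2, 8]
cat_or_dog  = sorted(set(cat) | set(dog))   # union        -> [1, 2, 3, 5, 8, 9, 10]
cat_not_dog = sorted(set(cat) - set(dog))   # difference   -> [1, 5, 9]
print(cat_and_dog, cat_or_dog, cat_not_dog)
```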

29 Common technique: more info in the inverted list. Each posting can record the type of occurrence (Title, Author, Abstract, ...), the position, and the location within the document (figure shows the postings of cat and dog in documents d1, d2, d3).

30 Posting: an entry in an inverted list; represents an occurrence of a term in an article. Size of a list: 1 posting for rare words or (in postings) misspellings, up to about 10^6 postings for common words. Size of a posting: 10-15 bits (compressed).

31 Vector space model: a document and a query are each represented as a vector of term weights (w1, w2, w3, w4, w5, w6, w7, ...); the score of the document for the query is the product (inner product) of the two vectors.
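
A minimal sketch of this scoring in Python; the weight vectors below are made up for illustration:

```python
# Hypothetical term-weight vectors over the same vocabulary (w1..w7).
doc   = [0.0, 2.0, 0.0, 1.0, 0.0, 3.0, 0.5]
query = [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

# Score = inner product of the document and query vectors.
score = sum(d * q for d, q in zip(doc, query))
print(score)    # -> 5.0
```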

32 Tricks to weight and normalize scores, e.g.: a match on a common word is not as useful as a match on a rare word...

33 Summary of Indexing So Far Basic topics in conventional indexes –multiple levels –sparse/dense –duplicate keys and buckets –deletion/insertion similar to sequential files Advantages –simple algorithms –index is a sequential file Disadvantages –eventually sequentiality is lost because of overflows, so reorganizations are needed

34 Example: a sequential index (keys 10, 20, ..., 90) with continuous free space; overflow records (31-39) go to an overflow area that is not sequential.

35 35 Outline: Conventional indexes B-Trees  NEXT Hashing schemes

36 36 NEXT: Another type of index –Give up on sequentiality of index –Try to get “balance”

37 B+Tree example (n=3): the root holds key 100; its children hold 30 and 120, 150, 180; the leaves hold 3, 5, 11 | 30, 35 | 100, 101, 110 | 120, 130 | 150, 156, 179 | 180, 200.

38 Sample non-leaf node with keys 57, 81, 95: its four pointers lead to keys k < 57, 57 <= k < 81, 81 <= k < 95, and k >= 95.

39 Sample leaf node with keys 57, 81, 95: pointers to the records with keys 57, 81, and 95, plus a sequence pointer to the next leaf; the node itself is reached from a non-leaf node.

40 In the textbook's notation (n=3), leaf and non-leaf nodes are drawn as alternating pointers and keys (example keys 30, 35).

41 Size of nodes: n+1 pointers, n keys (fixed)

42 Non-root nodes have to be at least half-full. Use at least: Non-leaf: ⌈(n+1)/2⌉ pointers; Leaf: ⌊(n+1)/2⌋ pointers to data

43 Example (n=3): a full non-leaf node holds keys 120, 150, 180; a minimum non-leaf node holds 30; a full leaf holds 3, 5, 11; a minimum leaf holds 30, 35.

44 44 B+tree rulestree of order n (1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”

45 (3) Number of pointers/keys for a B+tree:

                      Max ptrs   Max keys   Min ptrs (to data for leaves)   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉                        ⌈(n+1)/2⌉ - 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋                        ⌊(n+1)/2⌋
Root                  n+1        n          1                                1
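
As an illustration of how the keys and pointers above are used, here is a minimal B+tree lookup sketch in Python; the Node class and the sample tree are assumptions for illustration, and the insertion/deletion logic of the following slides is omitted:

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, records=None, leaf=False):
        self.keys = keys            # up to n keys, sorted
        self.children = children    # n+1 child nodes (non-leaf)
        self.records = records      # pointers to records (leaf)
        self.leaf = leaf

def bplus_lookup(node, key):
    # Descend from the root to the leaf that could hold `key`.
    while not node.leaf:
        # Child i covers keys k with keys[i-1] <= k < keys[i].
        node = node.children[bisect_right(node.keys, key)]
    # In the leaf, the i-th key is paired with the i-th record pointer.
    for k, rec in zip(node.keys, node.records):
        if k == key:
            return rec
    return None

# Tiny usage: a root over four leaves, following the key ranges of slide 38.
leaves = [
    Node(keys=[3, 5],   records=["r3", "r5"],   leaf=True),
    Node(keys=[57, 60], records=["r57", "r60"], leaf=True),
    Node(keys=[81, 85], records=["r81", "r85"], leaf=True),
    Node(keys=[95, 99], records=["r95", "r99"], leaf=True),
]
root = Node(keys=[57, 81, 95], children=leaves)
print(bplus_lookup(root, 85))   # -> "r85"
print(bplus_lookup(root, 90))   # -> None
```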

46 46 Insert into B+tree (a) simple case –space available in leaf (b) leaf overflow (c) non-leaf overflow (d) new root

47 (a) Insert key = 32 (n=3): the parent holds 30, 100 and the leaves hold 3, 5, 11 and 30, 31; there is space in the leaf 30, 31, so 32 is simply added.

48 (b) Insert key = 7 (n=3): the leaf 3, 5, 11 is full, so it splits into 3, 5 and 7, 11, and key 7 is copied up into the parent.

49 (c) Insert key = 160 (n=3): the leaf 150, 156, 179 splits into 150, 156 and 160, 179; the key pushed up causes the non-leaf node 120, 150, 180 to overflow and split as well, propagating a key toward the root (100).

50 (d) New root, insert 45 (n=3): leaves 1, 2, 3 | 10, 12 | 20, 25 | 30, 32, 40 sit under the non-leaf node 10, 20, 30; inserting 45 splits the last leaf into 30, 32 and 40, 45, the non-leaf node overflows as well, and a new root with key 30 is created (40 goes into the new right non-leaf node).

51 Deletion from B+tree: (a) simple case (no example); (b) coalesce with neighbor (sibling); (c) redistribute keys; (d) cases (b) or (c) at non-leaf level

52 (b) Coalesce with sibling (n=4): delete 50; the leaf 40, 50 becomes too small, so it coalesces with its sibling 10, 20, 30, and the key 40 is removed from the parent 10, 40, 100.

53 (c) Redistribute keys (n=4): delete 50; the leaf 40, 50 becomes too small, so it borrows 35 from its sibling 10, 20, 30, 35, and the parent entry 40 is replaced by 35.

54 (d) Non-leaf coalesce (n=4): delete 37; its leaf coalesces with a sibling, the parent non-leaf node becomes too small and coalesces in turn, and the tree shrinks, producing a new root.

55 55 B+tree deletions in practice –Often, coalescing is not implemented –Too hard and not worth it!

56 56 Comparison: B-trees vs. static indexed sequential file Ref #1: Held & Stonebraker “B-Trees Re-examined” CACM, Feb. 1978

57 57 Ref # 1 claims: - Concurrency control harder in B-Trees - B-tree consumes more space For their comparison: block = 512 bytes key = pointer = 4 bytes 4 data records per block

58 Example: a 1-block static index holds 127 keys; (127+1) x 4 = 512 bytes, so the pointers in the index are implicit, and the index covers up to 127 contiguous data blocks.

59 Example: a 1-block B-tree node holds 63 keys; 63 x (4+4) + 8 = 512 bytes, so explicit pointers are needed in the B-tree; one block covers up to 63 data blocks, because the index and data blocks are not contiguous.

60 Size comparison (Ref. #1):
Static index: 2 to 127 data blocks -> height 2; 128 to 16,129 -> height 3; 16,130 to 2,048,383 -> height 4.
B-tree: 2 to 63 data blocks -> height 2; 64 to 3,968 -> height 3; 3,969 to 250,047 -> height 4; 250,048 to 15,752,961 -> height 5.

61 Ref. #1 analysis claims: for an 8,000-block file, after 32,000 inserts and 16,000 lookups, the static index saves enough accesses to allow for a reorganization. Ref. #1 conclusion: static index better!!

62 62 Ref #2: M. Stonebraker, “Retrospective on a database system,” TODS, June 1980 Ref. #2 conclusion B-trees better!!

63 63 DBA does not know when to reorganize –Self-administration is important target DBA does not know how full to load pages of new index Ref. #2 conclusion B-trees better!!

64 64 Buffering –B-tree: has fixed buffer requirements –Static index: large & variable size buffers needed due to overflow Ref. #2 conclusion B-trees better!!

65 Speaking of buffering… Is LRU a good policy for B+tree buffers?  Of course not!  Should try to keep the root in memory at all times (and perhaps some nodes from the second level)

66 66 Interesting problem: For B+tree, how large should n be? … n is number of keys / node

67 Assumptions You have the right to set the disk page size for the disk where a B-tree will reside. Compute the optimum page size n assuming that –The items are 4 bytes long and the pointers are also 4 bytes long. –Time to read a node from disk is 12 + .003n. –Time to process a block in memory is unimportant. –The B+tree is full (i.e., every page has the maximum number of items and pointers).

68 Can get: f(n) = time to find a record. (Figure: plotting f(n) against n shows a minimum at n_opt.)

69 Find n_opt by setting f'(n) = 0. The answer should be n_opt = “few hundred”. What happens to n_opt as the disk gets faster? As the CPU gets faster?
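
A rough numeric version of this optimization in Python, as a sketch only: the file size N = 1,000,000 records and the continuous-height approximation are assumptions, not from the slides.

```python
import math

N = 1_000_000                                      # assumed number of records

def f(n):
    # Approximate number of nodes touched on a lookup (continuous log),
    # each costing 12 + 0.003*n to read from disk, as the slide assumes.
    height = math.log(N) / math.log(n)
    return height * (12 + 0.003 * n)

n_opt = min(range(10, 5000), key=f)
print(n_opt, round(f(n_opt), 2))   # lands around 700, i.e. a "few hundred"
```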

70 70 Variation on B+tree: B-tree (no +) Idea: –Avoid duplicate keys –Have record pointers in non-leaf nodes

71 (Figure: a B-tree non-leaf node K1 P1 | K2 P2 | K3 P3, where P1, P2, P3 point to the records with keys K1, K2, K3, and the remaining pointers lead to keys < K1, between K1 and K2, between K2 and K3, and > K3.)

72 B-tree example (n=2): the root holds 65, 125; the next level holds 25, 45 | 85, 105 | 145, 165; the leaves hold the remaining keys in pairs (10, 20 | 30, 40 | 50, 60 | 70, 80 | 90, 100 | 110, 120 | 130, 140 | 150, 160 | 170, 180). Sequence pointers are not useful now! (but keep the space for simplicity)

73 So, for B-trees:

                     Max                              Min
                     Tree ptrs  Rec ptrs  Keys        Tree ptrs   Rec ptrs      Keys
Non-leaf non-root    n+1        n         n           ⌈(n+1)/2⌉   ⌈(n+1)/2⌉-1   ⌈(n+1)/2⌉-1
Leaf non-root        1          n         n           1           ⌊(n+1)/2⌋     ⌊(n+1)/2⌋
Root non-leaf        n+1        n         n           2           1             1
Root leaf            1          n         n           1           1             1

74 Tradeoffs: B-trees have marginally faster average lookup than B+trees (assuming the height does not change);  but in a B-tree, non-leaf and leaf nodes have different sizes,  the fan-out is smaller,  and deletion is more complicated  -> B+trees preferred!

75 Example: - Pointers 4 bytes - Keys 4 bytes - Blocks 100 bytes (just an example) - Look at a full 2-level tree

76 76 Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 100 bytes B-tree: Each of 9 sons: 12 rec. pointers (+12 keys) = 12x(4+4) + 4 = 100 bytes 2-level B-tree, Max # records = 12x9 + 8 = 116

77 77 Root has 12 keys + 13 son pointers = 12x4 + 13x4 = 100 bytes B+tree: Each of 13 sons: 12 rec. ptrs (+12 keys) = 12x(4 +4) + 4 = 100 bytes 2-level B+tree, Max # records = 13x12 = 156

78 So... the 2-level B+tree holds 156 records, while the 2-level B-tree holds 108 records in the leaves plus 8 in the root = 116 total. Conclusion: –For fixed block size, –B+tree is better because it is bushier

79 79 Outline/summary Conventional Indexes Sparse vs. dense Primary vs. secondary B trees B+trees vs. B-trees B+trees vs. indexed sequential Hashing schemes-->Next

80 Hashing A hash function h(key) returns the address of a bucket or record. For a secondary index, buckets are required. If the keys for a specific hash value do not fit into one page, the bucket is a linked list of pages.

81 Example hash function Key = ‘x1 x2 … xn’, an n-byte character string; have b buckets; h: add x1 + x2 + … + xn and compute the sum modulo b
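
A minimal sketch of this hash function in Python:

```python
# Sum the bytes of the key and take the result modulo the number of buckets b.
def h(key: str, b: int) -> int:
    return sum(key.encode()) % b

print(h("smith", 1024))   # -> 549
```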

82 This may not be the best function… Read Knuth Vol. 3 if you really need to select a good function. Good hash function: the expected number of keys/bucket is the same for all buckets

83 83 Within a bucket: Do we keep keys sorted? Yes, if CPU time critical & Inserts/Deletes not too frequent

84 84 Next: example to illustrate inserts, overflows, deletes h(K)

85 EXAMPLE 2 records/bucket INSERT: h(a)=1, h(b)=2, h(c)=1, h(d)=0 fill buckets 0-3 as d | a, c | b | (empty); then h(e)=1, so e goes to an overflow block of bucket 1.

86 EXAMPLE: deletion. Delete e, then f; after a deletion frees a slot in a primary block, maybe move “g” up from the overflow block.

87 Rule of thumb: Try to keep space utilization between 50% and 80%. Utilization = (# keys used) / (total # keys that fit). If < 50%, we are wasting space; if > 80%, overflows become significant (depends on how good the hash function is & on # keys/bucket)

88 88 How do we cope with growth? Overflows and reorganizations Dynamic hashing Extensible Linear

89 Extensible hashing: two ideas (a) Use i of the b bits output by the hash function (e.g., the leading bits of h(K) = 00110101); i grows over time….

90 (b) Use a directory: h(K)[0-i] selects a directory entry, which points to the bucket.
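
A minimal lookup sketch for these two ideas in Python, assuming a 4-bit hash; the split/insert logic of the following slides is omitted, and the sample directory mirrors the picture only loosely:

```python
# Directory has 2**i entries; entry j points to the bucket for all keys whose
# top i hash bits equal j (several entries may share one bucket).
HASH_BITS = 4

def ext_lookup(directory, i, hval):
    top_bits = hval >> (HASH_BITS - i)   # use i of the 4 hash bits
    return directory[top_bits]

# Example with i = 2 (entries 00, 01, 10, 11); entries 00 and 01 share a bucket.
shared = [0b0001]
directory = [shared, shared, [0b1001, 0b1010], [0b1100]]
print(ext_lookup(directory, 2, 0b1010))   # -> [0b1001, 0b1010]
```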

91 Example: h(k) is 4 bits; 2 keys/bucket; i = 1. Buckets: 0001 | 1001, 1100. Insert 1010: the bucket for prefix 1 overflows, so it splits into 1001, 1010 and 1100, and the directory doubles to i = 2 with entries 00, 01, 10, 11.

92 Example continued (i = 2): insert 0111 and 0000; they go to the bucket holding 0001 (local depth 1), which then splits into 0000, 0001 and 0111 (local depth 2).

93 Example continued: insert another 1001; the bucket 1001, 1010 (local depth 2) overflows, and since the directory only has i = 2 bits it doubles to i = 3 (entries 000-111); the bucket splits into 1001, 1001 and 1010.

94 Extensible hashing: deletion. Either no merging of blocks, or merge blocks and cut the directory if possible (reverse of the insert procedure).

95 95 Deletion example: Run thru insert example in reverse!

96 Extensible hashing summary + Can handle growing files with less wasted space and with no full reorganizations - Indirection (not bad if the directory fits in memory) - Directory doubles in size (now it fits in memory, now it does not)

97 Linear hashing Another dynamic hashing scheme. Two ideas: (a) Use the i low-order bits of the hash (e.g., of 01110101); i grows over time. (b) The file grows linearly.

98 Example: b = 4 bits, i = 2, 2 keys/bucket. Only buckets 00 and 01 are in use (m = 01, the max used block); bucket 00 holds 0000, 1010 and bucket 01 holds 0101, 1111; buckets 10 and 11 are future growth buckets. Rule: if h(k)[i] <= m, then look at bucket h(k)[i]; else, look at bucket h(k)[i] - 2^(i-1). Insert 0101: bucket 01 is full, so buckets can have overflow chains!
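
A minimal sketch of this bucket-selection rule in Python; the bit values in the usage lines come from the slide's example, while the function itself is illustrative:

```python
# Use the i low-order bits of h(k); if that bucket number is beyond the last
# used bucket m, drop the top of those i bits (subtract 2**(i-1)).
def linear_bucket(hval, i, m):
    b = hval & ((1 << i) - 1)        # i low-order bits of h(k)
    if b <= m:
        return b
    return b - (1 << (i - 1))        # bucket not created yet: use its "parent"

# With i = 2 and m = 0b01 (only buckets 00 and 01 exist):
print(linear_bucket(0b1010, 2, 0b01))   # low bits 10 > m -> bucket 00
print(linear_bucket(0b0101, 2, 0b01))   # low bits 01 <= m -> bucket 01
```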

99 Example continued: as the file grows, bucket 10 is added and 1010 moves there from bucket 00; then bucket 11 is added and 1111 moves there from bucket 01.

100 Example continued: how to grow beyond m = 11 (i = 2)? Increase i to 3: bucket numbers get a third bit (100, 101, 110, 111, ...), and the file keeps growing one bucket at a time.

101 When do we expand the file? Keep track of U = (# used slots) / (total # of slots). If U > threshold, then increase m (and maybe i).

102 Linear hashing summary + Can handle growing files with less wasted space and with no full reorganizations + No indirection like extensible hashing - Can still have overflow chains

103 Example: BAD CASE. One bucket is very full while others are very empty; we would need to move m all the way to the full bucket, which would waste space...

104 104 Hashing - How it works - Dynamic hashing - Extensible - Linear Summary

105 105 Next: Indexing vs Hashing Index definition in SQL Multiple key access

106 106 Hashing good for probes given key e.g., SELECT … FROM R WHERE R.A = 5 Indexing vs Hashing

107 INDEXING (including B-trees) good for range searches, e.g.: SELECT … FROM R WHERE R.A > 5 Indexing vs Hashing

108 Index definition in SQL CREATE INDEX name ON rel (attr) CREATE UNIQUE INDEX name ON rel (attr) (defines a candidate key) DROP INDEX name

109 109 CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...)... at least in SQL... Note

110 110 ATTRIBUTE LIST  MULTIKEY INDEX (next) e.g., CREATE INDEX foo ON R(A,B,C) Note

111 111 Motivation: Find records where DEPT = “Toy” AND SAL > 50k Multi-key Index

112 112 Strategy I: Use one index, say Dept. Get all Dept = “Toy” records and check their salary I1I1

113 Strategy II: Use 2 indexes; manipulate pointers. Intersect the Toy pointers and the Sal > 50k pointers.

114 Strategy III: Multiple key index. One idea: an index I1 on the first attribute whose entries point to indexes I2, I3, ... on the second attribute.

115 Example: the Dept index (Art, Sales, Toy) points to a Salary index per department (e.g., 10k, 15k, 17k, 21k for one and 12k, 15k, 19k for another), whose entries point to records such as Name=Joe, DEPT=Sales, SAL=15k.

116 For which queries is this index good? Find RECs with Dept = “Sales” and SAL = 20k; Dept = “Sales” and SAL > 20k; Dept = “Sales”; SAL = 20k

117 117 Interesting application: Geographic Data DATA: x y...

118 118 Queries: What city is at ? What is within 5 miles from ? Which is closest point to ?

119 Example: points a through o plotted on an x-y grid (x up to about 40, y up to about 35). Search for points near f; search for points near b.

120 120 Queries Find points with Yi > 20 Find points with Xi < 5 Find points “close” to i = Find points “close” to b =

121 121 Many types of geographic index structures have been suggested Quad Trees R Trees

122 122 Two more types of multi key indexes Grid Partitioned hash

123 Grid Index: a two-dimensional array with Key 1 values V1, V2, …, Vn along one axis and Key 2 values X1, X2, …, Xn along the other; each cell points to the records with that key combination, e.g., the cell (V3, X2) points to records with key1 = V3 and key2 = X2.

124 CLAIM Can quickly find records with –key1 = Vi AND key2 = Xj –key1 = Vi –key2 = Xj And also ranges…. –E.g., key1 >= Vi AND key2 < Xj

125 But there is a catch with Grid Indexes! How is the grid stored on disk? Like an array: the cells for V1 (X1 X2 X3 X4), then for V2, then for V3, ... Problem: we need regularity so we can compute the position of an entry.

126 Solution: Use indirection. The grid (rows V1-V4, columns X1-X3) contains only pointers to buckets; the buckets hold the actual record pointers.

127 127 With indirection: Grid can be regular without wasting space We do have price of indirection

128 Can also index the grid on value ranges, using a linear scale, e.g.: Dept: 1 = Toy, 2 = Sales, 3 = Personnel; Salary: 1 = 0-20K, 2 = 20K-50K, 3 = 50K+.

129 Grid files + Good for multiple-key search - Space, management overhead (nothing is free) - Need partitioning ranges that evenly split keys

130 Idea: Partitioned hash function. Hash Key1 with h1 and Key2 with h2, and concatenate the resulting bits (e.g., 010110 and 1110010) to form the bucket address.

131 EX: Insert. h1(toy)=0000, h1(sales)=1001, h1(art)=1010, …; h2(10k)=01100, h2(20k)=11101, h2(30k)=01110, h2(40k)=00111, …. A record's bucket address combines bits from h1(Dept) and h2(Sal).

132 (Same hash values as before.) Find Emp. with Dept. = Sales AND Sal = 40k: both h1(sales) and h2(40k) are known, so only one bucket needs to be examined.

133 (Same hash values as before.) Find Emp. with Sal = 30k: only the h2(30k) bits of the address are fixed, so look in all buckets whose address contains those bits.

134 (Same hash values as before.) Find Emp. with Dept. = Sales: only the h1(sales) bits of the address are fixed, so look in all buckets whose address contains those bits.
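
A minimal sketch of the idea behind these three lookups in Python, assuming, for illustration only, a 3-bit bucket address built from 1 bit of h1(Dept) and 2 bits of h2(Sal); these bit widths and hash functions are not the slides' exact ones:

```python
def h1(dept):               # 1 address bit from Dept (illustrative hash)
    return hash(dept) & 0b1

def h2(sal):                # 2 address bits from Sal (illustrative hash)
    return hash(sal) & 0b11

def bucket(dept, sal):      # concatenate the bit groups into a 3-bit address
    return (h1(dept) << 2) | h2(sal)

# Dept = Sales AND Sal = 40k: all address bits known -> exactly one bucket.
print(bucket("Sales", 40))

# Sal = 30k only: the 2 h2 bits are fixed, the h1 bit is free -> 2 of 8 buckets.
print(sorted({(b1 << 2) | h2(30) for b1 in (0, 1)}))

# Dept = Sales only: the h1 bit is fixed, the h2 bits are free -> 4 of 8 buckets.
print(sorted({(h1("Sales") << 2) | b2 for b2 in range(4)}))
```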

135 135 Post hashing discussion: - Indexing vs. Hashing - SQL Index Definition - Multiple Key Access - Multi Key Index Variations: Grid, Geo Data - Partitioned Hash Summary

136 136 Reading Chapter 5 Skim the following sections: –5.3.6, 5.3.7, 5.3.8 –5.4.2, 5.4.3, 5.4.4 Read the rest

137 137 The BIG picture…. Chapters 2 & 3: Storage, records, blocks... Chapter 4 & 5: Access Mechanisms - Indexes - B trees - Hashing - Multi key Chapter 6 & 7: Query Processing NEXT

