1
CS232A: Database System Principles INDEXING
2
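Indexing
Given a condition on an attribute, find the qualified records
Condition: Attr = value
Condition may also be: Attr > value, Attr >= value
[Figure: the condition on the left, an unknown structure ("?") in the middle, and the qualified records holding the matching value on the right.]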
3
Indexing
Data structures used for quickly locating tuples that meet a specific type of condition
- Equality condition: find Movie tuples where Director=X
- Other conditions possible, e.g., range conditions: find Employee tuples where Salary>40 AND Salary<50
Many types of indexes. Evaluate them on
- Access time
- Insertion time
- Deletion time
- Disk space needed (esp. as it affects access time)
4
Topics
- Conventional indexes
- B-trees
- Hashing schemes
5
Terms and Distinctions
Primary index: the index on the attribute (a.k.a. search key) that determines the sequencing of the table
Secondary index: index on any other attribute
Dense index: every value of the indexed attribute appears in the index
Sparse index: many values do not appear
[Figure: a dense primary index over a sequential file.]
6
Dense and Sparse Primary Indexes
Dense Primary Index
+ can tell if a value exists without accessing file (consider projection)
+ better access to overflow records
Sparse Primary Index
Find the index record with the largest value that is less than or equal to the value we are looking for.
+ less index space
More + and - in a while.
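The sparse-index lookup rule above can be sketched in a few lines. This is only an illustration: the function name `sparse_lookup`, the block labels, and the sample keys are made up, and the index is assumed to hold the first key of each block in sorted order.

```python
from bisect import bisect_right

def sparse_lookup(index_keys, block_ids, search_key):
    """Return the block that may hold search_key, or None if the key is
    smaller than every key in the index.

    Assumption: index_keys[i] is the first (smallest) key stored in
    block_ids[i], and index_keys is sorted.
    """
    # Position of the largest index key that is <= search_key.
    pos = bisect_right(index_keys, search_key) - 1
    if pos < 0:
        return None
    return block_ids[pos]

# Hypothetical sparse index: one entry per data block.
index_keys = [10, 30, 50, 70]
block_ids = ["B0", "B1", "B2", "B3"]
print(sparse_lookup(index_keys, block_ids, 40))  # 40 can only live in the block that starts at 30
```

Note that the lookup only names the block to read; with a sparse index the block itself must still be fetched to learn whether the key exists.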
7
Sparse vs. Dense Tradeoff
Sparse: Less index space per record, so more of the index can be kept in memory
Dense: Can tell if any record exists without accessing file
(Later: sparse better for insertions; dense needed for secondary indexes)
8
Multi-Level Indexes Treat the index as a file and build an index on it
“Two levels are usually sufficient. More than three levels are rare.” Q: Can we build a dense second level index for a dense index ?
9
A Note on Pointers
Record pointers consist of a block pointer and the position of the record in the block.
Using the block pointer only saves space at no extra disk-access cost.
10
Representation of Duplicate Values in Primary Indexes
Index may point to first instance of each value only
11
Deletion from Dense Index
Delete 40, 80
Deletion from a dense primary index file with no duplicate values is handled in the same way as deletion from a sequential file.
Q: What about deletion from a dense primary index with duplicates?
Lists of available entries
12
Deletion from Sparse Index
Delete 40 if the deleted entry does not appear in the index do nothing
13
Deletion from Sparse Index (cont’d)
Delete 30
- if the deleted entry does not appear in the index, do nothing
- if the deleted entry appears in the index, replace it with the next search-key value
Comment: we could leave the deleted value in the index, assuming that no part of the system may assume it still exists without checking the block
14
Deletion from Sparse Index (cont’d)
Delete 40, then 30
- if the deleted entry does not appear in the index, do nothing
- if the deleted entry appears in the index, replace it with the next search-key value
- unless the next search-key value has its own index entry; in this case delete the entry
15
Insertion in Sparse Index
if no new block is created then do nothing
16
Insertion in Sparse Index
if no new block is created then do nothing, else create overflow record
Insert overflow record 15
- Reorganize periodically
- Could we claim space of next block?
- How often do we reorganize and how expensive is it?
- B-trees offer convincing answers
17
Secondary indexes
File not sorted on secondary search key
[Figure: a file ordered on its sequence field; the secondary search key values (10 through 100) are stored in no sorted order.]
18
Secondary indexes
Sparse index does not make sense!
[Figure: a sparse index over the same unsorted file.]
19
Secondary indexes
Dense index
[Figure: a dense first-level index (10, 20, 30, 40, 50, 60, 70, ...) over the unsorted file, with a sparse high level (10, 50, 90, ...) on top of it.]
First level has to be dense, next levels are sparse (as usual)
20
Duplicate values & secondary indexes
10 20 40 20 40 10 40 10 40 30
21
Duplicate values & secondary indexes
one option...
[Figure: a dense index with one entry per record, so duplicate values (10, 10, 20, 20, 30, 40, 40, 40, ...) are repeated in the index.]
Problem: excess overhead!
- disk space
- search time
22
Duplicate values & secondary indexes
another option: lists of pointers
[Figure: one index entry per distinct value (10, 20, 30, 40), each followed by a list of pointers to the matching records.]
Problem: variable size records in index!
23
Duplicate values & secondary indexes
Yet another idea: Chain records with same key?
[Figure: one index entry per distinct value (10, 20, 30, 40, 50, 60, ...), pointing to the first record of a chain of records with that key.]
Problems:
- Need to add fields to records, messes up maintenance
- Need to follow chain to know records
24
Duplicate values & secondary indexes
[Figure: index entries (10, 20, 30, 40, 50, 60, ...) point into buckets of record pointers, which in turn point to the records.]
25
Why “bucket” idea is useful
Enables the processing of queries working with pointers only.
Very common technique in Information Retrieval
Indexes on records EMP (name,dept,year,...):
- Name: primary
- Dept: secondary
- Year: secondary
26
Advantage of Buckets: Process Queries Using Pointers Only
Find employees of the Toys dept with 4 years in the company
SELECT Name FROM Employee WHERE Dept=“Toys” AND Year=4
[Figure: the Dept index and the Year index, each pointing to buckets of record pointers.]
Intersect the Toys bucket and the Year=4 bucket to get the set of matching EMPs
27
This idea used in text information retrieval
Documents:
...the cat is fat ...
...my cat and my dog like each other...
...Fido the dog ...
Buckets, known as inverted lists: one list per word (cat, dog), pointing to the documents that contain it
28
Information Retrieval (IR) Queries
- Find articles with “cat” and “dog”: intersect inverted lists
- Find articles with “cat” or “dog”: union inverted lists
- Find articles with “cat” and not “dog”: subtract list of dog pointers from list of cat pointers
- Find articles with “cat” in title
- Find articles with “cat” and “dog” within 5 words
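The first three query types map directly onto set operations over inverted lists. The sketch below is illustrative only: it reuses the three sample documents from the earlier slide, and the document IDs d1 to d3 and the helper name `build_inverted_lists` are assumptions made for the example.

```python
# Sample documents from the inverted-list slide; the IDs are assumed.
docs = {
    "d1": "the cat is fat",
    "d2": "my cat and my dog like each other",
    "d3": "Fido the dog",
}

def build_inverted_lists(documents):
    """Map each word to the set of document IDs that contain it."""
    lists = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            lists.setdefault(word, set()).add(doc_id)
    return lists

inv = build_inverted_lists(docs)
cat, dog = inv["cat"], inv["dog"]

both = cat & dog          # "cat" and "dog": intersect the lists
either = cat | dog        # "cat" or "dog": union the lists
cat_not_dog = cat - dog   # "cat" and not "dog": subtract dog pointers
```

The last two query types (word in title, words within 5 positions) need more than document IDs in each posting, which is what the next slide adds.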
29
Common technique: more info in inverted list
[Figure: inverted lists for cat and dog in which each posting carries a type (Title, Author, Abstract), a position (e.g., 5, 10, 57, 100, 12) and a location (document d1, d2, d3).]
30
Size of a posting: 10-15 bits (compressed)
Posting: an entry in inverted list. Represents occurrence of term in article
Size of a list (in postings): from 1 (rare words or misspellings) to 10^6 (common words)
Size of a posting: 10-15 bits (compressed)
31
Vector space model w1 w2 w3 w4 w5 w6 w7 …
DOC = < …> Query= < …> PRODUCT = ……. = score
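As a rough illustration of the product on this slide, the score can be computed as a dot product of the document and query weight vectors. The weights below are invented for the example, since the slide's own numbers did not survive extraction.

```python
def score(doc_weights, query_weights):
    """Dot product of a document vector and a query vector."""
    return sum(d * q for d, q in zip(doc_weights, query_weights))

# Hypothetical weights for words w1..w7 (not taken from the slide).
doc = [1, 0, 0, 1, 1, 0, 0]
query = [0, 0, 0, 1, 1, 0, 0]
print(score(doc, query))
```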
32
Tricks to weigh scores + normalize
e.g.: Match on common word not as useful as match on rare words...
33
Summary of Indexing So Far
Basic topics in conventional indexes
- multiple levels
- sparse/dense
- duplicate keys and buckets
- deletion/insertion similar to sequential files
Advantages
- simple algorithms
- index is sequential file
Disadvantages
- eventually sequentiality is lost because of overflows; reorganizations are needed
34
Example Index (sequential)
[Figure: a sequential index with keys 10, 20, 30, ..., 90 stored in order, followed by continuous free space; later keys (39, 31, 35, 36, 32, 38, 34, 33) sit in an overflow area that is not sequential.]
35
Outline:
- Conventional indexes
- B-Trees (NEXT)
- Hashing schemes
36
NEXT: Another type of index
Give up on sequentiality of index Try to get “balance”
37
B+Tree Example n=3
Root: [100]
Non-leaf nodes: [30] and [120 150 180]
Leaves, in sequence: [3 5 11] [30 35] [100 101 110] [120 130] [150 156 179] [180 200]
38
Sample non-leaf
Keys: 57 81 95
Pointers: to keys k < 57; to keys 57 <= k < 81; to keys 81 <= k < 95; to keys k >= 95
39
Sample leaf node:
Keys: 57 81 95
Reached from a non-leaf node; the last pointer goes to the next leaf in sequence
Record pointers: to record with key 57; to record with key 81; to record with key 95
40
In textbook’s notation n=3
Leaf: Non-leaf: 30 35 30 35 30 30
41
Size of nodes: n+1 pointers n keys
(fixed)
42
Non-root nodes have to be at least half-full
Use at least
Non-leaf: ⌈(n+1)/2⌉ pointers
Leaf: ⌊(n+1)/2⌋ pointers to data
43
n=3
           Full node      Min. node
Non-leaf   120 150 180    30
Leaf       3 5 11         30 35
44
B+tree rules tree of order n
(1) All leaves at same lowest level (balanced tree) (2) Pointers in leaves point to records except for “sequence pointer”
45
(3) Number of pointers/keys for B+tree
                      Max ptrs   Max keys   Min ptrs->data   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉        ⌈(n+1)/2⌉ - 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋        ⌊(n+1)/2⌋
Root                  n+1        n          1                1
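The limits in this table can be tabulated for any order n. The sketch below assumes the usual textbook rounding (ceiling of (n+1)/2 for non-leaf nodes, floor for leaves); the function name and dictionary keys are illustrative only.

```python
def bplus_occupancy(n):
    """Pointer/key limits for a B+tree of order n (n keys per node).

    Assumption: non-leaf minimum uses the ceiling of (n+1)/2 and the
    leaf minimum uses the floor, as in the standard textbook definition.
    """
    ceil_half = (n + 2) // 2    # ceiling of (n+1)/2
    floor_half = (n + 1) // 2   # floor of (n+1)/2
    return {
        "non-leaf": {"max_ptrs": n + 1, "max_keys": n,
                     "min_ptrs": ceil_half, "min_keys": ceil_half - 1},
        "leaf": {"max_ptrs": n + 1, "max_keys": n,
                 "min_data_ptrs": floor_half, "min_keys": floor_half},
        "root": {"max_ptrs": n + 1, "max_keys": n,
                 "min_ptrs": 1, "min_keys": 1},
    }

limits = bplus_occupancy(3)
```

For n=3 this gives a minimum non-leaf node with 2 pointers and 1 key, and a minimum leaf with 2 keys, which matches the "min. node" examples on the n=3 slide (30 and 30 35).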
46
Insert into B+tree
(a) simple case: space available in leaf
(b) leaf overflow
(c) non-leaf overflow
(d) new root
47
(a) Insert key = 32, n=3
[Figure: root 100, non-leaf 30, leaves [3 5 11] and [30 31]. The leaf [30 31] has space, so it becomes [30 31 32].]
48
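(b) Insert key = 7, n=3
[Figure: root 100, non-leaf 30, leaves [3 5 11] and [30 31]. The leaf [3 5 11] is full, so it splits into [3 5] and [7 11], and key 7 is added to the parent, which becomes [7 30].]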
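The simple case and the leaf-overflow case can be sketched as one small function. This is a simplified illustration, not the full insertion algorithm: it handles a single leaf only, the name `insert_into_leaf` is made up, and it assumes the left half keeps the ceiling of (n+1)/2 keys on a split.

```python
from bisect import insort

def insert_into_leaf(keys, key, n):
    """Insert key into a sorted leaf that holds at most n keys.

    Returns (left, right, separator). In the simple case right and
    separator are None. On overflow the leaf is split and separator is
    the first key of the new right leaf, to be inserted into the parent.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= n:
        return keys, None, None          # (a) space available in leaf
    mid = (len(keys) + 1) // 2           # left keeps ceil((n+1)/2) keys
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]         # (b) leaf overflow

print(insert_into_leaf([30, 31], 32, 3))
print(insert_into_leaf([3, 5, 11], 7, 3))
```

If the parent has no room for the separator, the same idea repeats one level up, which is what cases (c) and (d) show.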
49
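(c) Insert key = 160, n=3
[Figure: root 100, non-leaf [120 150 180], leaves [150 156 179] and [180 200] among others. The leaf [150 156 179] is full and splits into [150 156] and [160 179]; the non-leaf [120 150 180] is also full and splits into [120 150] and [180]; key 160 moves up, so the root becomes [100 160].]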
50
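(d) New root, insert 45, n=3
[Figure: the old root is [10 20 30] with leaves [1 2 3] [10 12] [20 25] [30 32 40]. The leaf [30 32 40] splits into [30 32] and [40 45]; the old root overflows and splits into [10 20] and [40]; key 30 moves up into a new root.]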
51
Deletion from B+tree
(a) Simple case - no example
(b) Coalesce with neighbor (sibling)
(c) Re-distribute keys
(d) Cases (b) or (c) at non-leaf
52
(b) Coalesce with sibling
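Delete 50, n=4
[Figure: parent [10 40 100], leaves [10 20 30] and [40 50]. After deleting 50 the leaf [40] is under-full, so it is coalesced with its sibling into [10 20 30 40] and key 40 is removed from the parent.]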
53
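(c) Redistribute keys
Delete 50, n=4
[Figure: parent [10 40 100], leaves [10 20 30 35] and [40 50]. After deleting 50 the leaf [40] is under-full and the sibling is full, so key 35 moves over, giving [10 20 30] and [35 40]; the parent key 40 is replaced by 35.]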
54
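(d) Non-leaf coalesce
Delete 37, n=4
[Figure: root [25], non-leaf nodes [10 20] and [30 40], leaves [1 3] [10 14] [20 22] [25 26] [30 37] [40 45]. After deleting 37 the leaf holding 30 is under-full and is coalesced with a sibling; the non-leaf node above it then becomes under-full and is coalesced with [10 20], pulling key 25 down, so the merged node becomes the new root.]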
55
B+tree deletions in practice
Often, coalescing is not implemented Too hard and not worth it!
56
Comparison: B-trees vs. static indexed sequential file
Ref #1: Held & Stonebraker “B-Trees Re-examined” CACM, Feb. 1978
57
Ref #1 claims:
- Concurrency control harder in B-Trees
- B-tree consumes more space
For their comparison:
- block = 512 bytes
- key = pointer = 4 bytes
- 4 data records per block
58
Example: 1 block static index
127 keys: (127+1) x 4 = 512 Bytes
-> pointers in index implicit! Up to 127 contiguous data blocks
[Figure: an index block with keys k1, k2, k3, ... over contiguous data blocks, 1 data block per key.]
59
Example: 1 block B-tree
63 keys: 63x(4+4)+8 = 512 Bytes
-> pointers needed in B-tree because index and data blocks are not contiguous; up to 63 data blocks
[Figure: a B-tree block with keys k1, k2, ..., k63, a pointer from each key to its data block, and a next pointer.]
60
Size comparison Ref. #1 Static Index B-tree # data # data
blocks height blocks height 2 -> > 128 -> 16, > 16,130 -> 2,048, > 250, 250,048 -> 15,752,
61
Ref. #1 analysis claims For an 8,000 block file, after 32,000 inserts
after 16,000 lookups Static index saves enough accesses to allow for reorganization Ref. #1 conclusion Static index better!!
62
Ref #2: M. Stonebraker, “Retrospective on a database system,” TODS, June 1980 Ref. #2 conclusion B-trees better!!
63
Ref. #2 conclusion: B-trees better!!
- DBA does not know when to reorganize
- Self-administration is important target
- DBA does not know how full to load pages of new index
64
Ref. #2 conclusion: B-trees better!!
Buffering
- B-tree: has fixed buffer requirements
- Static index: large & variable size buffers needed due to overflow
65
Speaking of buffering… Is LRU a good policy for B+tree buffers?
Of course not! Should try to keep root in memory at all times (and perhaps some nodes from second level)
66
Interesting problem: For B+tree, how large should n be?
n is number of keys / node
67
Assumptions
You have the right to set the disk page size for the disk where a B-tree will reside.
Compute the optimum page size n assuming that
- The items are 4 bytes long and the pointers are also 4 bytes long.
- Time to read a node from disk is n
- Time to process a block in memory is unimportant
- B+tree is full (i.e., every page has the maximum number of items and pointers)
68
Can get: f(n) = time to find a record
[Figure: plot of f(n) against n, with the minimum at nopt.]
69
FIND nopt by f’(n) = 0
Answer should be nopt = “few hundred”
What happens to nopt as
- Disk gets faster?
- CPU gets faster?
70
Variation on B+tree: B-tree (no +)
Idea: Avoid duplicate keys Have record pointers in non-leaf nodes
71
Non-leaf node: K1 P1 K2 P2 K3 P3
Record pointers: P1 to record with K1; P2 to record with K2; P3 to record with K3
Tree pointers: to keys x < K1; to keys K1 < x < K2; to keys K2 < x < K3; to keys x > K3
72
B-tree example n=2
Root: [65 125]
Non-leaf nodes: [25 45] [85 105] [145 165]
Leaves: [10 20] [30 40] [50 60] [70 80] [90 100] [110 120] [130 140] [150 160] [170 180]
Sequence pointers not useful now! (but keep space for simplicity)
73
So, for B-trees: MAX MIN Tree Rec Keys Tree Rec Keys
Ptrs Ptrs Ptrs Ptrs Non-leaf non-root n+1 n n (n+1)/2 (n+1)/2-1 (n+1)/2-1 Leaf non-root 1 n n (n+1)/2 (n+1)/2 Root non-leaf n+1 n n Leaf 1 n n
74
Tradeoffs: B+trees preferred!
- B-trees have marginally faster average lookup than B+trees (assuming the height does not change)
- in B-tree, non-leaf & leaf different sizes
- Smaller fan-out
- in B-tree, deletion more complicated
B+trees preferred!
75
Example: - Pointers 4 bytes - Keys 4 bytes - Blocks 100 bytes (just example) - Look at full 2 level tree
76
B-tree:
Root has 8 keys + 8 record pointers + 9 son pointers = 8x4 + 8x4 + 9x4 = 100 bytes
Each of 9 sons: 12 rec. pointers (+12 keys) = 12x(4+4) + 4 = 100 bytes
2-level B-tree, Max # records = 12x9 + 8 = 116
77
B+tree: Root has 12 keys + 13 son pointers = 12x4 + 13x4 = 100 bytes Each of 13 sons: 12 rec. ptrs (+12 keys) = 12x(4 +4) + 4 = 100 bytes 2-level B+tree, Max # records = 13x12 = 156
78
So...
B+: 13 leaves, 156 records
B: 8 records in the root + 108 records in 9 leaves, Total = 116
Conclusion: For fixed block size, B+ tree is better because it is bushier
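The arithmetic behind the 156 versus 116 comparison can be reproduced directly from the stated sizes (4-byte keys, 4-byte pointers, 100-byte blocks). The variable names below are illustrative, and leaves in both trees are assumed to reserve one 4-byte sequence pointer, as in the byte counts on the slides.

```python
BLOCK, KEY, PTR = 100, 4, 4  # bytes, as given in the example

# B+tree non-leaf: k keys and k+1 son pointers must fit in a block.
bplus_root_keys = (BLOCK - PTR) // (KEY + PTR)
# Leaf (both trees): k keys, k record pointers, 1 sequence pointer.
leaf_keys = (BLOCK - PTR) // (KEY + PTR)
# B-tree non-leaf: k keys, k record pointers, k+1 son pointers.
btree_root_keys = (BLOCK - PTR) // (KEY + 2 * PTR)

# Full 2-level trees: one root plus one leaf per son pointer.
bplus_records = (bplus_root_keys + 1) * leaf_keys
btree_records = (btree_root_keys + 1) * leaf_keys + btree_root_keys

print(bplus_records, btree_records)
```

The B-tree root spends a third of its space on record pointers, which is why its fan-out drops from 13 to 9.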
79
Logistics Extension students: pls check out www.db.ucsd.edu/CSE232W10
Send me your s
80
Outline/summary
Conventional Indexes
- Sparse vs. dense
- Primary vs. secondary
B trees
- B+trees vs. B-trees
- B+trees vs. indexed sequential
Hashing schemes --> Next
81
Hashing
hash function h(key) returns address of bucket
if the keys for a specific hash value do not fit into one page, the bucket is a linked list of pages
[Figure: key -> h(key) -> bucket -> records.]
82
Example hash function Key = ‘x1 x2 … xn’ n byte character string
Have b buckets h: add x1 + x2 + ….. xn compute sum modulo b
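A direct transcription of this example function might look as follows. The function name and the choice of UTF-8 bytes are assumptions made for the sketch.

```python
def h(key: str, b: int) -> int:
    """Add the byte values of the key and take the sum modulo b buckets."""
    return sum(key.encode("utf-8")) % b

print(h("abc", 4), h("cab", 4))
```

Because addition ignores order, any two keys made of the same characters land in the same bucket, which is one reason this simple function may not be the best choice.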
83
This may not be best function …
Read Knuth Vol. 3 if you really need to select a good function.
Good hash function: expected number of keys/bucket is the same for all buckets
84
Within a bucket: Do we keep keys sorted? Yes, if CPU time critical
& Inserts/Deletes not too frequent
85
Next: example to illustrate inserts, overflows, deletes
h(K)
86
EXAMPLE 2 records/bucket
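INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0, then h(e) = 1
[Figure: buckets 0 to 3; d is in bucket 0, a and c fill bucket 1, b is in bucket 2; e does not fit in bucket 1 and goes to an overflow block.]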
87
EXAMPLE: deletion Delete: e f c a b d c d e f g 1 2 3 maybe move
1 2 3 a d b d c c e maybe move “g” up f g
88
Rule of thumb: Try to keep space utilization between 50% and 80%
Utilization = (# keys used) / (total # keys that fit)
- If < 50%, wasting space
- If > 80%, overflows significant
  depends on how good hash function is & on # keys/bucket
89
How do we cope with growth?
- Overflows and reorganizations
- Dynamic hashing
  - Extensible
  - Linear
90
Extensible hashing: two ideas
(a) Use i of b bits output by hash function
[Figure: h(K) is b bits long; i of those bits are used.]
i grows over time….
91
(b) Use directory h(K)[0-i ] to bucket . .
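Ideas (a) and (b) together amount to a very small lookup step. The sketch below assumes the directory is indexed by the first (high-order) i bits of the b-bit hash value, which is how the 4-bit example that follows appears to group its keys; `directory_slot` is a made-up name.

```python
def directory_slot(hash_value: int, b: int, i: int) -> int:
    """Directory entry for a b-bit hash value when the first i bits are used."""
    return hash_value >> (b - i)

# With b = 4: key 1010 uses entry 1 when i = 1 and entry 10 when i = 2.
print(directory_slot(0b1010, 4, 1), directory_slot(0b1010, 4, 2))
```

Growing i by one doubles the directory, but each old entry simply becomes two entries pointing at the same bucket until that bucket actually splits.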
92
Example: h(k) is 4 bits; 2 keys/bucket
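[Figure: i = 1. Directory entry 0 points to bucket [0001]; entry 1 points to bucket [1001 1100]; both buckets use 1 bit.]
Insert 1010
[Figure: the bucket [1001 1100] overflows and splits into [1001 1010] and [1100], each using 2 bits. New directory with i = 2 and entries 00, 01, 10, 11.]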
93
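Example continued
Insert: 0111, 0000
[Figure: i = 2, entries 00, 01, 10, 11. Entries 00 and 01 share the bucket [0001], which uses 1 bit; after inserting 0111 it holds [0001 0111]. Inserting 0000 overflows it, so it splits into [0000 0001] and [0111], each using 2 bits. The buckets [1001 1010] and [1100] are unchanged.]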
94
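Example continued
Insert: 1001
[Figure: i = 2 with buckets [0000 0001], [0111], [1001 1010], [1100], each using 2 bits. Inserting 1001 overflows [1001 1010], which splits into [1001 1001] and [1010], each using 3 bits. The directory doubles to i = 3 with entries 000 through 111.]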
95
Extensible hashing: deletion
Two options:
- No merging of blocks
- Merge blocks and cut directory if possible (reverse insert procedure)
96
Deletion example: Run thru insert example in reverse!
97
Extensible hashing Summary
+ Can handle growing files: with less wasted space, with no full reorganizations
- Indirection (not bad if directory in memory)
- Directory doubles in size (now it fits, now it does not)
98
Linear hashing Another dynamic hashing scheme Two ideas:
(a) Use i low order bits of hash
[Figure: h(K) is b bits long; the last i bits are used, and i grows.]
(b) File grows linearly
99
Example b=4 bits, i =2, 2 keys/bucket
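[Figure: buckets 00 and 01 are in use; bucket 00 holds 0000 and 1010, bucket 01 holds 0101 and 1111. Buckets 10 and 11 are future growth buckets.]
insert 0101: can have overflow chains!
m = 01 (max used block)
Rule: If h(k)[i] <= m, then look at bucket h(k)[i]; else, look at bucket h(k)[i] - 2^(i-1)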
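The lookup rule can be sketched as below, reading h(k)[i] as the i low-order bits of the hash value and assuming the comparison in the rule is "less than or equal to m" and the subtracted term is 2^(i-1). The function name is illustrative.

```python
def linear_hash_bucket(hash_value: int, i: int, m: int) -> int:
    """Bucket number for hash_value when i low-order bits are in use
    and m is the highest bucket number that currently exists."""
    low = hash_value & ((1 << i) - 1)   # h(k)[i]: the i low-order bits
    if low <= m:
        return low
    # Bucket does not exist yet: drop the leading bit of the i-bit value.
    return low - (1 << (i - 1))

# b = 4 bits, i = 2, m = 01, as in the example above.
for key in (0b0000, 0b1010, 0b0101, 0b1111):
    print(format(key, "04b"), "->", format(linear_hash_bucket(key, 2, 0b01), "02b"))
```

With m = 01, keys ending in 10 and 11 are redirected to buckets 00 and 01, which is why 1010 sits next to 0000 and 1111 next to 0101 until the file grows.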
100
Example b=4 bits, i =2, 2 keys/bucket
[Figure: the file grows one bucket at a time. When bucket 10 is added, 1010 moves into it from bucket 00; when bucket 11 is added, 1111 moves into it from bucket 01. m advances from 01 to 10 to 11.]
101
Example Continued: How to grow beyond this?
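[Figure: with m = 11 (max used block) all buckets for i = 2 exist, so i becomes 3; the existing buckets are renumbered 000 through 011 and new buckets 100, 101, ... are added one at a time.]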
102
When do we expand file?
Keep track of: U = (# used slots) / (total # of slots)
If U > threshold then increase m (and maybe i)
103
Linear Hashing Summary
+ Can handle growing files: with less wasted space, with no full reorganizations
+ No indirection like extensible hashing
- Can still have overflow chains
104
Example: BAD CASE Very full Very empty Need to move m here…
Would waste space...
105
Summary
Hashing
- How it works
- Dynamic hashing
  - Extensible
  - Linear
106
Next:
- Indexing vs Hashing
- Index definition in SQL
- Multiple key access
107
Indexing vs Hashing Hashing good for probes given key e.g., SELECT …
FROM R WHERE R.A = 5
108
Indexing vs Hashing INDEXING (Including B Trees) good for
Range Searches: e.g., SELECT FROM R WHERE R.A > 5
109
Index definition in SQL
Create index name on rel (attr)
Create unique index name on rel (attr)
    defines candidate key
Drop INDEX name
110
Note: CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash,...)
... at least in SQL...
111
Note: ATTRIBUTE LIST => MULTIKEY INDEX (next)
e.g., CREATE INDEX foo ON R(A,B,C)
112
Multi-key Index Motivation: Find records where
DEPT = “Toy” AND SAL > 50k
113
Strategy I: Use one index, say Dept.
Get all Dept = “Toy” records and check their salary I1
114
Strategy II: Use 2 Indexes; Manipulate Pointers Toy Sal > 50k
115
Strategy III: Multiple Key Index One idea: I2 I3 I1
116
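Example
[Figure: a Dept index (Art, Sales, Toy) whose entries point to per-department Salary indexes (e.g., 10k, 15k, 17k, 21k and 12k, 15k, 15k, 19k), which point to the records.]
Example Record: Name=Joe, DEPT=Sales, SAL=15k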
117
For which queries is this index good?
- Find RECs with Dept = “Sales” AND SAL = 20k
- Find RECs with Dept = “Sales” AND SAL > 20k
- Find RECs with Dept = “Sales”
- Find RECs with SAL = 20k
118
Interesting application:
Geographic Data DATA: <X1,Y1, Attributes> <X2,Y2, Attributes> y x . . .
119
Queries: What city is at <Xi,Yi>?
What is within 5 miles from <Xi,Yi>? Which is closest point to <Xi,Yi>?
120
Example
[Figure: points a through o plotted in the x-y plane, with the plane partitioned at coordinate values between 5 and 40 and an index tree over the partitions.]
Search points near f
Search points near b
121
Queries Find points with Yi > 20 Find points with Xi < 5
Find points “close” to i = <12,38> Find points “close” to b = <7,24>
122
Many types of geographic index structures have been suggested
Quad Trees R Trees
123
Two more types of multi key indexes
Grid Partitioned hash
124
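Grid Index
[Figure: a grid with Key 1 values V1, V2, ..., Vn on one axis and Key 2 values X1, X2, ..., Xn on the other; the cell at (V3, X2) points to records with key1=V3, key2=X2.]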
125
CLAIM
Can quickly find records with
- key 1 = Vi AND key 2 = Xj
- key 1 = Vi
- key 2 = Xj
And also ranges…. E.g., key 1 >= Vi AND key 2 < Xj
126
But there is a catch with Grid Indexes!
How is Grid Index stored on disk?
Like Array...
[Figure: rows V1, V2, V3 laid out one after another, each with cells X1, X2, X3, X4.]
Problem: Need regularity so we can compute position of <Vi,Xj> entry
127
Solution: Use Indirection
[Figure: a grid with rows V1, V2, V3, ... and columns X1, X2, X3; each cell points to a bucket.]
* Grid only contains pointers to buckets
128
With indirection: Grid can be regular without wasting space
We do have price of indirection
129
Can also index grid on value ranges
[Figure: a grid indexed through linear scales. Salary: 0-20K -> 1, 20K-50K -> 2, 50K-∞ -> 3. Dept: Toy -> 1, Sales -> 2, Personnel -> 3.]
130
Grid files
+ Good for multiple-key search
- Space, management overhead (nothing is free)
- Need partitioning ranges that evenly split keys
131
Partitioned hash function
Idea: Key Key2 h1 h2
132
<Joe><Sally>
EX: h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . h2(10k) = h2(20k) = h2(30k) = h2(40k) = <Fred,toy,10k>,<Joe,sales,10k> <Sally,art,30k> <Joe><Sally> <Fred> Insert
133
Find Emp. with Dept. = Sales Sal=40k
h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . h2(10k) = h2(20k) = h2(30k) = h2(40k) = Find Emp. with Dept. = Sales Sal=40k <Fred> <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy>
134
h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . 011 . h2(10k) =01 100
. h2(10k) = h2(20k) = h2(30k) = h2(40k) = Find Emp. with Sal=30k <Fred> look here <Joe><Jan> <Mary> <Sally> <Tom><Bill> <Andy>
135
Find Emp. with Dept. = Sales
h1(toy) =0 000 h1(sales) =1 001 h1(art) =1 010 . h2(10k) = h2(20k) = h2(30k) = h2(40k) = Find Emp. with Dept. = Sales <Fred> <Joe><Jan> <Mary> look here <Sally> <Tom><Bill> <Andy>
136
Summary Post hashing discussion: - Indexing vs. Hashing
- SQL Index Definition - Multiple Key Access - Multi Key Index Variations: Grid, Geo Data - Partitioned Hash
137
Reading Chapter 5
Skim the following sections: 5.3.6, 5.3.7, 5.3.8, 5.4.2, 5.4.3, 5.4.4
Read the rest
138
Revisit: Processing queries without accessing records until last step
Find employees of the Toys dept with 4 years in the company SELECT Name FROM Employee WHERE Dept=“Toys” AND Year=4 Year Index Dept Index
139
Bitmap indices: Alternate structure, heavily used in OLAP
Assume the tuples of the Employees table are ordered.
[Figure: the Year index and the Dept index stored as bitmaps.]
+ Find even more quickly intersections and unions (e.g., Dept=“Toys” AND Year=4)
? Seems it needs too much space -> We’ll do compression
? How do we deal with insertions and deletions -> Easier than you think
140
2logm Compression
Naive solution needs mn bits, where m is #distinct values and n is #tuples
But there are just n 1’s => let’s utilize this
Bit encoding of sequence of runs (e.g. [3,0,1])
Toys: 00011010
- First run says: the first 1 appears after 3 zeros
- Second run says: the 2nd 1 appears immediately after the 1st
- Third run says: the 3rd 1 appears after 1 zero after the 2nd
Encoding: 1011 00 01
- 10 says: the binary encoding of the first number needs 1+1 digits
- 11 says: the first number is 3
141
2log m compression Example Pens: 01000001 Sequence [1,5]
Encoding:
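The run encoding described on the previous slide can be sketched as follows. The scheme assumed here is the one suggested by the Toys example: each run length j is written in binary with k digits, preceded by k-1 ones and a zero. Function names are illustrative, and trailing zeros after the last 1 are simply not recorded.

```python
def runs(bitmap: str):
    """Number of zeros before each 1 in the bitmap."""
    out, zeros = [], 0
    for bit in bitmap:
        if bit == "0":
            zeros += 1
        else:
            out.append(zeros)
            zeros = 0
    return out

def encode_run(j: int) -> str:
    """Length prefix (k-1 ones, then a zero) followed by j in k binary digits."""
    binary = format(j, "b")
    return "1" * (len(binary) - 1) + "0" + binary

def encode(bitmap: str) -> str:
    return "".join(encode_run(j) for j in runs(bitmap))

print(runs("00011010"), encode("00011010"))
print(runs("01000001"), encode("01000001"))
```

Under these assumptions the Toys bitmap yields the runs [3, 0, 1] and the code 1011 00 01, and the Pens bitmap yields [1, 5], with 1 coded as 01 and 5 as 110101.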
142
Insertions and deletions & miscellaneous engineering
Assume tuples are inserted in order
- Deletions: Do nothing
- Insertions: If tuple t with value v is inserted, add one more run in v’s sequence (compact bitmap)
143
The BIG picture….
Chapters 2 & 3: Storage, records, blocks...
Chapter 4 & 5: Access Mechanisms
- Indexes
- B trees
- Hashing
- Multi key
Chapter 6 & 7: Query Processing (NEXT)