Database Management Systems (CS 564)

Fall 2017 Lecture 18

Indexing: Faster Access to Data for a Price
Use responsibly! CS 564 (Fall'17)

(Ubiquitous) B+tree
Height-balanced (dynamic) tree structure
Insert/delete at log_F N cost, where F = fan-out and N = number of leaf pages
Each node contains d ≤ m ≤ 2d entries (except for the root, where 1 ≤ m ≤ 2d), i.e. minimum 50% occupancy
d is called the order of the tree
Supports equality and range searches efficiently
Each node corresponds to a disk page (think about page and record organization)
Node types (figure labels): root node, non-leaf nodes, leaf nodes
Index entries
  Exist in all the non-leaf nodes
  Have the form (search key value, pid)
Data entries
  Exist only in the leaf nodes
  Have the form (search key value, rid) or (search key value, record)
  Are sorted according to the search key

Example (Height = 1)
Root (index entries): 13 17 24 30
Leaves (data entries): [2 3 5 7] [14 16] [19 20 22] [24 27 29] [33 34 38 39]
Figure labels: Page 1, Page 2, Page 3, Page 4

Review Exercise
Assume a file that has 950,000 records, indexed with a B+ tree.
In this particular B+ tree, the average page holds 100 pointers (i.e. the tree's average branching factor is 100).
Assume further that 150 blocks of memory are set aside for storing the index, and that all records of the file reside on disk.
No duplicate search keys exist.
Given a search key K, compute the minimal number of disk I/Os needed to retrieve the record with that search key.

Review Exercise Answer
Assume each leaf node points to data records, i.e. leaf nodes do not contain data records.
We have 950,000 records, so we need 950,000 pointers from leaf nodes.
Each leaf node can point to 100 records, so we need 9,500 leaf nodes.
Hence the tree has three levels: the root at level 0, at most 100 nodes at level 1 (9,500 / 100 = 95 at the average fan-out), and the 9,500 leaf nodes at level 2.
We have 150 memory pages for the index, so we can store the root node, all of level 1, and some of level 2.
Thus the minimum cost is 1 disk I/O: when the needed leaf is already in memory, only the page holding the data record must be read from disk.
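The arithmetic in this answer can be checked with a short Python sketch. It assumes the average fan-out of 100 applies uniformly at every level, which gives 95 pages at level 1 (within the bound of 100 quoted above); the variable names are illustrative only.

```python
import math

records = 950_000     # records in the file (from the exercise)
fanout = 100          # average pointers per index page (from the exercise)
memory_pages = 150    # memory blocks set aside for the index (from the exercise)

# Build the level sizes bottom-up: leaf pages first, then each parent level.
levels = [math.ceil(records / fanout)]
while levels[-1] > 1:
    levels.append(math.ceil(levels[-1] / fanout))
levels.reverse()      # levels[0] is now the root

# Assumption: the root and all inner levels are kept in memory first,
# and whatever memory remains holds some of the leaf pages.
upper_pages = sum(levels[:-1])
assert upper_pages <= memory_pages
cached_leaves = min(levels[-1], memory_pages - upper_pages)

# Best case: the needed leaf is cached, so only the data page is read.
# Worst case: the leaf must be read from disk as well.
best_case_ios = 1 if cached_leaves > 0 else 2
worst_case_ios = 1 if cached_leaves == levels[-1] else 2

print(levels, cached_leaves, best_case_ios, worst_case_ios)
```

Only 54 of the 9,500 leaf pages fit in memory under these assumptions, so the minimum of 1 I/O is the lucky case; most lookups would need 2.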

Hash(-based) Indexes
Best for equality searches
Don't support range searches
Different flavors
  Static hashing
  Dynamic hashing
    Extendible hashing
    Linear hashing

Static Hashing
Person(name, zipcode, phone)
Search key: zipcode
Hash function: h(k) = k % 100 (last two digits of k)
N = 4 (number of buckets)
Each bucket holds two data entries (records)
Figure labels: primary pages, overflow pages
Bucket 0: (John, 53400, ) (Navneet, 54768, )
Bucket 1: (Zuyu, 53409, ) (Han, 53633, )
Bucket 2:
Bucket 3: (Theo, 34411, )
Find 53633*: h(53633) = 53633 % 100 = 33, bucket# = 33 % 4 = 1 …
How do we know h, N, etc. for each index?
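As a quick check of the bucket assignments on this slide, the following Python sketch applies h(k) = k % 100 and then takes the result modulo N = 4. The function name `bucket_of` is an illustrative choice, not something from the lecture.

```python
def bucket_of(zipcode: int, num_buckets: int = 4) -> int:
    """Return the bucket number for a zipcode under the slide's scheme."""
    h = zipcode % 100          # h(k) = k % 100, i.e. the last two digits
    return h % num_buckets     # bucket# = h(k) mod N

# (name, zipcode) pairs taken from the slide; phone numbers are omitted there too.
people = [("John", 53400), ("Navneet", 54768), ("Zuyu", 53409),
          ("Han", 53633), ("Theo", 34411)]

placement = {name: bucket_of(zipcode) for name, zipcode in people}
print(placement)
```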

Static Hashing (Cont.)
A hash index is a collection of buckets
  A bucket consists of a primary page and possibly some overflow pages
  The number of primary bucket pages is fixed; they are allocated sequentially and never de-allocated
  Overflow pages are allocated (and de-allocated) as needed
Each bucket contains zero or more data entries
To find the bucket for each record, we apply a hash function h to the search key k
  N = number of buckets
  h(k) mod N = bucket to which the data entry belongs
  e.g. h(k) = a * k + b, where a and b are constants
Records with different search keys may belong to the same bucket

Operations on Static Hash Indexes
Equality search
  Apply the hash function to the search key value to locate the appropriate bucket
  Search through the primary page (and the overflow pages, if any) to find the matching record(s)
Deletion
  Find the appropriate bucket and delete the record
  Possibly delete the overflow page
Insertion
  Find the appropriate bucket and insert the record
  If there is no space, create a new overflow page
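The three operations can be sketched in Python as follows. This is a minimal in-memory model, not a disk-based implementation: pages are plain lists, the hash function is the k % 100 used in the earlier example, and the class and method names are assumptions made for illustration.

```python
class StaticHashIndex:
    """Fixed number of buckets; each bucket is a list of pages.

    pages[0] is the primary page; any further pages are overflow pages.
    """

    def __init__(self, num_buckets=4, page_capacity=2):
        self.num_buckets = num_buckets
        self.page_capacity = page_capacity
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _pages(self, key):
        return self.buckets[(key % 100) % self.num_buckets]

    def search(self, key):
        # Scan the primary page, then any overflow pages.
        return [rec for page in self._pages(key) for k, rec in page if k == key]

    def insert(self, key, record):
        pages = self._pages(key)
        for page in pages:
            if len(page) < self.page_capacity:
                page.append((key, record))
                return
        pages.append([(key, record)])   # no space: allocate an overflow page

    def delete(self, key):
        pages = self._pages(key)
        for page in pages:
            page[:] = [(k, rec) for k, rec in page if k != key]
        # Free overflow pages that became empty; the primary page always stays.
        pages[1:] = [page for page in pages[1:] if page]


idx = StaticHashIndex()
idx.insert(53400, "John")
idx.insert(54768, "Navneet")
idx.insert(53404, "Ann")            # hypothetical record that also lands in bucket 0
pages_before = len(idx.buckets[0])  # primary page plus one overflow page
found = idx.search(53404)
idx.delete(53404)
pages_after = len(idx.buckets[0])   # overflow page has been freed
```

A lookup in a bucket with overflow pages touches every page in the chain, which is why long chains hurt performance.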

Hash Functions
A good hash function is uniform, i.e. each bucket is assigned the same number of search key values (or records)
A bad hash function maps all search key values to the same bucket
Examples of good hash functions:
  h(k) = a * k + b, where a and b are constants
  h(k) = (a * k + b) % p, where p is a prime number
Examples of bad hash functions?
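One way to explore the closing question is to count how many keys land in each bucket. The Python sketch below uses a hypothetical key set (multiples of 100) and hypothetical constants a = 31, b = 7, p = 10007; none of these come from the lecture.

```python
from collections import Counter

def bucket_loads(keys, h, num_buckets):
    """Number of keys assigned to each bucket under h(k) mod num_buckets."""
    counts = Counter(h(k) % num_buckets for k in keys)
    return [counts.get(b, 0) for b in range(num_buckets)]

# Hypothetical workload: 400 keys that are all multiples of 100.
keys = range(0, 40_000, 100)

# h(k) = k % 100 is constant on these keys, so one bucket receives everything.
bad = bucket_loads(keys, lambda k: k % 100, 4)

# h(k) = (a * k + b) % p with a prime p spreads the same keys across buckets.
better = bucket_loads(keys, lambda k: (31 * k + 7) % 10_007, 4)

print(bad, better)
```

Whether a hash function is good depends on the key distribution: k % 100 is fine for zipcodes with varied last digits but degenerate for this key set.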

Static Hashing Problems
The number of buckets in the index is fixed, hence:
  If the database grows, the number of buckets will be too small
    Long overflow chains can degrade performance (why?)
  If the database shrinks, space is wasted
  Reorganizing the index is expensive and can block query execution
Fix: dynamic hashing (e.g. extendible and linear hashing)

Extendible Hashing
Keep a directory of pointers to buckets
On overflow, double the directory (not the number of buckets)
h(k) = k % 100, N = 4
Global depth: 2; directory entries: 00 01 10 11
Bucket A (local depth 2): (John, 53400, ) (Navneet, 54768, )
Bucket B (local depth 2): (Zuyu, 53409, )
Bucket C (local depth 2):
Bucket D (local depth 2): (Theo, 34411, )
Find 53409*

Extendible Hashing (Cont.)
Insert (Paul, 54717, )
Global depth: 2; directory entries: 00 01 10 11
Bucket A (local depth 2): (John, 53400, ) (Navneet, 54768, )
Bucket B (local depth 2): (Zuyu, 53409, ) (Paul, 54717, )
Bucket C (local depth 2):
Bucket D (local depth 2): (Theo, 34411, )

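Insert (Meera, , ): state before the insert
Global depth: 2; directory entries: 00 01 10 11
Bucket A (local depth 2): (John, 53400, ) (Navneet, 54768, )
Bucket B (local depth 2): (Zuyu, 53409, ) (Paul, 54717, )
Bucket C (local depth 2):
Bucket D (local depth 2): (Theo, 34411, )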

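Insert (Meera, , ): Bucket A is full, so it splits into A and A2 and the directory doubles
Global depth: 3; directory entries: 000 001 010 011 100 101 110 111
Bucket A (local depth 3): (John, 53400, )
Bucket B (local depth 2): (Zuyu, 53409, ) (Paul, 54717, )
Bucket C (local depth 2):
Bucket D (local depth 2): (Theo, 34411, )
Bucket A2 (local depth 3): (Meera, , ) (Navneet, 54768, )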

Pros and Cons of Extendible Hashing
Benefits:
  The directory is much smaller than the entire index file
  Only one page of data entries is split
Drawbacks:
  Overflow pages are still needed on key collisions, i.e. when multiple data entries have the same hash value

Operations on Extendible Hashing
Search
  Apply the hash function h(k)
  Take the last (global depth) bits of h(k), …
Insert
  Find the target bucket
  If the bucket has space, insert, done!
  If the bucket is full, split it and redistribute its entries
  If necessary, double the directory
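The search and insert steps above can be sketched in Python as follows. This is a simplified in-memory model under stated assumptions: the directory is a list indexed by the last global-depth bits of the hash value, buckets hold two entries, and overflow pages are not implemented. The class names, and the hash value 4 used for Meera (whose zipcode is not given on the slides), are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.entries = []                 # list of (hash_value, record)


class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=2):
        self.global_depth = global_depth
        self.capacity = capacity
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, h):
        return h & ((1 << self.global_depth) - 1)   # last global_depth bits

    def search(self, h):
        return [rec for hv, rec in self.directory[self._slot(h)].entries if hv == h]

    def insert(self, h, record):
        while True:
            bucket = self.directory[self._slot(h)]
            if len(bucket.entries) < self.capacity:
                bucket.entries.append((h, record))
                return
            if all(hv == h for hv, _ in bucket.entries):
                # Splitting can never separate identical hash values.
                raise NotImplementedError("an overflow page would be needed")
            self._split(bucket)

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory: slot i and slot i + 2^depth share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)          # the newly examined bit
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image
        old = bucket.entries
        bucket.entries = [e for e in old if not e[0] & bit]
        image.entries = [e for e in old if e[0] & bit]


idx = ExtendibleHash()
for zipcode, name in [(53400, "John"), (54768, "Navneet"), (53409, "Zuyu"),
                      (34411, "Theo"), (54717, "Paul")]:
    idx.insert(zipcode % 100, name)                  # h(k) = k % 100
depth_before = idx.global_depth
idx.insert(4, "Meera")                               # hypothetical hash value
split_image = sorted(rec for _, rec in idx.directory[0b100].entries)
```

Only the overflowing bucket is split; the other buckets keep local depth 2 and are simply referenced from two directory slots each after the doubling.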

Insert: Example (Revisited)
Insert (Salma, , )
Global depth: 3; directory entries: 000 001 010 011 100 101 110 111
Bucket A (local depth 3): (John, 53400, )
Bucket B (local depth 2): (Zuyu, 53409, ) (Paul, 54717, )
Bucket C (local depth 2):
Bucket D (local depth 2): (Theo, 34411, )
Bucket A2 (local depth 3): (Meera, , ) (Navneet, 54768, )

Operations on Extendible Hashing (Cont.)
Delete
  Locate the bucket of the record and remove the record
  If the bucket becomes empty, remove it (and update the directory)
  Two buckets can also be coalesced if their combined entries fit in a single bucket
  The directory can also be shrunk, but doing so is expensive

Operations on Extendible Hashing (Cont.)
How many disk accesses for an equality search?
  One if the directory fits in memory, else two
The directory grows in spurts
  If the distribution of hash values is skewed, the directory can grow very large
Can you think of an example where the directory suddenly grows?

h(k) = k % 1000, N = 8
How about inserting the following rows in order?
  (Cecilia, 79768, )
  (Petr, 80896, )
  (Sajika, 80832, )
Global depth: 3; directory entries: 000 001 010 011 100 101 110 111
Bucket A (local depth 3): (John, 79744, ) (Meera, , )
Bucket B (local depth 2): (Zuyu, 53409, ) (Paul, 54717, )
Bucket C (local depth 2):
Bucket D (local depth 2): (Theo, 34411, )
Bucket A2 (local depth 3): (Navneet, 54772, )
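To see why these inserts make the directory grow suddenly, the following Python sketch finds the smallest depth at which no group of keys sharing the same low-order bits exceeds the bucket capacity of two. It is a back-of-the-envelope calculation rather than a full extendible hashing simulation, it assumes distinct hash values, and it leaves out Meera because the slide gives no zipcode for that row; including it could only keep or raise the result.

```python
from collections import Counter

def required_depth(hashes, capacity, start_depth):
    """Smallest depth >= start_depth at which every group of hash values
    sharing the same last `depth` bits fits in one bucket.

    Assumes the hash values are distinct; otherwise this would not terminate.
    """
    depth = start_depth
    while True:
        groups = Counter(h % (1 << depth) for h in hashes)
        if max(groups.values()) <= capacity:
            return depth
        depth += 1

# John plus the three new rows, hashed with h(k) = k % 1000.
hashes = [z % 1000 for z in (79744, 79768, 80896, 80832)]

depth = required_depth(hashes, capacity=2, start_depth=3)
directory_size = 2 ** depth
print(hashes, depth, directory_size)
```

The hash values 768, 896, and 832 agree in their last six bits, so under these assumptions the directory has to jump from 8 entries to 128 just to separate three records.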

Recap
Hash indexes
  Efficient equality search
Static hashing
  Simple, limited
Dynamic hashing
  Example: extendible hashing

Composite Search Keys
The search key is composed of multiple attributes, e.g. (Name, Address)
Hash index
  Define the hash function to map each combination (e.g. of Name and Address) to a hash code
B+tree
  Sort keys by Name, then Address
Figure (B+tree on a two-attribute key): index entries (3, 15) (3, 112) (5, 8); data entries (100, 3)* (700, 5)* (2, 5)* (3, 13)* (3, 14)* (3, 17)* (3, 100)* (4, 1)* (5, 2)*
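The difference between the two index types on a composite key can be illustrated with Python tuples, which compare lexicographically just like a B+tree sorted on the first attribute and then the second. The sketch below reuses the data entries from the figure; the sorted list and the dictionary are stand-ins for a B+tree leaf level and a hash index, not real index structures.

```python
import bisect

# Data entries from the figure, as (first attribute, second attribute) pairs.
entries = [(100, 3), (700, 5), (2, 5), (3, 13), (3, 14),
           (3, 17), (3, 100), (4, 1), (5, 2)]

# B+tree order: sort by the first attribute, then by the second.
ordered = sorted(entries)

# A sorted composite key supports a search on the first attribute alone:
# every entry with first attribute 3 is contiguous.
lo = bisect.bisect_left(ordered, (3,))
hi = bisect.bisect_left(ordered, (4,))
first_attr_is_3 = ordered[lo:hi]

# A hash index (modelled here by a dict) hashes the whole combination,
# so it answers equality lookups on the full key only.
hash_index = {key: position for position, key in enumerate(ordered)}
full_key_hit = (3, 17) in hash_index
```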

Indexes in SQL
CREATE INDEX statement
CREATE INDEX UserNameInx ON User(Name);
CREATE INDEX NameAgeInx ON User(Name, Age);
CREATE UNIQUE INDEX EvNameInx ON Event(Name);
CREATE INDEX EvNHash ON Event USING hash (Name);
CREATE INDEX AnotherNameAgeInx ON User(Name ASC, Age DESC);
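Most of these statements can be tried out with Python's built-in sqlite3 module, as in the sketch below. The table schemas are assumptions made for the demonstration, and the USING hash statement is left out because SQLite does not offer hash indexes (that syntax is the one PostgreSQL accepts).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schemas; the slide only shows the index definitions.
    CREATE TABLE "User"(Name TEXT, Age INTEGER);
    CREATE TABLE Event(Name TEXT);

    CREATE INDEX UserNameInx ON "User"(Name);
    CREATE INDEX NameAgeInx ON "User"(Name, Age);
    CREATE UNIQUE INDEX EvNameInx ON Event(Name);
    CREATE INDEX AnotherNameAgeInx ON "User"(Name ASC, Age DESC);
""")

# List the indexes that now exist in the catalog.
index_names = sorted(
    row[0] for row in
    conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)

# A UNIQUE index also enforces uniqueness of the indexed column.
conn.execute("INSERT INTO Event VALUES ('Homecoming')")
try:
    conn.execute("INSERT INTO Event VALUES ('Homecoming')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(index_names, duplicate_rejected)
```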

Classification of Indexes
A table can have multiple indexes
  Primary vs secondary
  Clustered vs unclustered
Primary index: the search key of the index contains the primary key of the table
Secondary index: any other index that is not a primary index
Unique index: the search key contains a candidate key

Classification of Indexes (Cont.)
Clustered index: the order of records (in the data file) is the same as, or "close to", the order of data entries in the index
If k* is the actual record, then the index is clustered
A table can be clustered on at most one search key
Data retrieval cost varies significantly depending on whether or not the index/table is clustered
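A rough cost model makes the last point concrete. The Python sketch below counts data-page reads for a lookup that matches many records, assuming matching records are packed contiguously under a clustered index and, in the worst case, each sit on a different page under an unclustered one. The function name and the numbers are hypothetical, and index-page reads are ignored.

```python
import math

def data_page_reads(matches, records_per_page, clustered):
    """Approximate number of data pages read to fetch `matches` records."""
    if clustered:
        # Matching records are stored next to each other.
        return math.ceil(matches / records_per_page)
    # Worst case for an unclustered index: one page read per matching record.
    return matches

# Hypothetical query: 1,000 matching records, 50 records per data page.
clustered_cost = data_page_reads(1_000, 50, clustered=True)
unclustered_cost = data_page_reads(1_000, 50, clustered=False)
print(clustered_cost, unclustered_cost)
```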

Next Up: External Sorting
Questions?
