Database Management Systems (CS 564)


Database Management Systems (CS 564) Fall 2017 Lecture 18

Indexing: Faster Access to Data for a Price Use responsibly! CS 564 (Fall'17)

(Ubiquitous) B+tree
- Height-balanced (dynamic) tree structure
- Insert/delete at log_F N cost, where F = fan-out and N = number of leaf pages
- Each node contains m entries with d ≤ m ≤ 2d (except for the root, where 1 ≤ m ≤ 2d), i.e. minimum 50% occupancy
  - d is called the order of the tree
- Supports equality and range searches efficiently
- Each node corresponds to a disk page (think about page and record organization)
- Index entries
  - In all the non-leaf nodes
  - (search key value, pid)
- Data entries
  - Exist only in the leaf nodes
  - (search key value, rid) or (search key value, record)
  - Are sorted according to the search key
[Figure labels: root node, non-leaf nodes, leaf nodes]

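Example
[Figure: B+tree of height 1. Root index entries: 13, 17, 24, 30. Leaf data entries, grouped by the root separators: (2, 3, 5, 7), (14, 16), (19, 20, 22), (24, 27, 29), (33, 34, 38, 39). Page labels: Page 1, Page 2, Page 3, Page 4.]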

Review Exercise
- Assume a file with 950,000 records
- Say we index this file using a B+ tree
- In this particular B+ tree, the average page has an occupancy of 100 pointers (i.e. the tree's average branching factor is 100)
- Assume further that the amount of memory set aside for storing the index is 150 blocks, and that all 950,000 records of the file reside on disk
- No duplicate search keys exist
- Given a search key K, compute the minimal number of disk I/Os needed to retrieve the record with that search key

Review Exercise Answer
- Assume each leaf node points to data records, i.e. leaf nodes do not contain data records
- We have 950,000 records, so we need 950,000 pointers from leaf nodes
- Each leaf node can point to 100 records, so we need 9,500 leaf nodes
- Hence the tree has three levels: the root at level 0, up to 100 nodes at level 1 (95 are needed), and up to 10,000 nodes at level 2 (9,500 are needed)
- We have 150 memory pages for the index, so we can store the root node, all of level 1, and some of level 2
- Thus the minimum cost is 1 I/O: when the leaf for K is already in memory, only the page holding the data record must be read (2 I/Os when the leaf is not in memory)
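The arithmetic behind this answer can be checked with a short script. This is only a sketch under the exercise's stated assumptions (fan-out 100, 150 buffer pages); the function name `btree_levels` is made up for illustration. With exact ceilings the level sizes come out as 95 and 9,500, where the slide rounds up to the capacities 100 and 10,000.

```python
import math

def btree_levels(num_records, fanout):
    """Return node counts per level, from the root down to the leaves."""
    levels = [math.ceil(num_records / fanout)]  # number of leaf pages
    while levels[0] > 1:
        # Each level above needs one pointer per node of the level below.
        levels.insert(0, math.ceil(levels[0] / fanout))
    return levels

levels = btree_levels(950_000, 100)
print(levels)  # [1, 95, 9500]

buffer_pages = 150
cached_inner = sum(levels[:-1])               # root + level 1
cached_leaves = buffer_pages - cached_inner   # leaf pages that also fit
print(cached_inner, cached_leaves)            # 96 54

# Best case: the leaf for K is cached, so only the data page is read.
best_case_io = 1
# Otherwise the leaf is read first, then the data page.
uncached_leaf_io = 2
```

Only 54 of the 9,500 leaves fit in memory, so the best case of 1 I/O applies to a small fraction of keys.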

Hash(-based) Indexes
- Best for equality searches
- Don't support range searches
- Different flavors
  - Static hashing
  - Dynamic hashing
    - Extendible hashing
    - Linear hashing

Static Hashing
- Table: Person(name, zipcode, phone)
- Search key: zipcode
- Hash function: h(k) = k % 100 (last two digits of k)
- N = 4 (number of buckets)
- Each bucket holds two data entries (records)
- Example: find 53633*: h(53633) = 53633 % 100 = 33, bucket # = 33 % 4 = 1
- How do we know h, N, etc. for each index?
[Figure: primary pages, with overflow pages chained behind them (…)]
  Bucket 0: (John, 53400, 23218564), (Navneet, 54768, 60743111)
  Bucket 1: (Zuyu, 53409, 23200564), (Han, 53633, 23209964)
  Bucket 2: (empty)
  Bucket 3: (Theo, 34411, 29010533)

Static Hashing (Cont.)
- A hash index is a collection of buckets
  - A bucket consists of a primary page and possibly some overflow pages
  - The number of primary bucket pages is fixed; they are allocated sequentially and never de-allocated
  - Overflow pages are allocated (and de-allocated) as needed
  - Each bucket contains zero or more data entries
- To find the bucket for each record, we apply a hash function h to the search key k
  - N = number of buckets
  - h(k) mod N = bucket to which the data entry belongs
  - e.g. h(k) = a * k + b, where a and b are constants
- Records with different search keys may belong to the same bucket

Operations on Static Hash Indexes
- Equality search
  - Apply the hash function to the search key value to locate the appropriate bucket
  - Search through the primary page (and the overflow pages, if any) to find the matching record(s)
- Deletion
  - Find the appropriate bucket, delete the record
  - Possibly delete the overflow page
- Insertion
  - Find the appropriate bucket, insert the record
  - If there is no space, create a new overflow page
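The three operations above can be sketched as a small in-memory class. This is a minimal illustration, not the course's implementation: the class name `StaticHashIndex`, the `page_capacity` parameter, and the use of Python lists in place of disk pages are assumptions. The hash function h(k) = k % 100 and N = 4 follow the earlier Person example, with records shaped as (name, zipcode, phone).

```python
class StaticHashIndex:
    """Toy static hash index: N primary pages, overflow pages chained on demand."""

    def __init__(self, num_buckets=4, page_capacity=2):
        self.n = num_buckets
        self.cap = page_capacity
        # Each bucket is a list of pages; page 0 is the primary page.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _bucket(self, key):
        return (key % 100) % self.n  # h(k) = k % 100, then mod N

    def search(self, key):
        # Scan the primary page and every overflow page of the bucket.
        return [rec for page in self.buckets[self._bucket(key)]
                for rec in page if rec[1] == key]

    def insert(self, record):
        pages = self.buckets[self._bucket(record[1])]
        for page in pages:
            if len(page) < self.cap:
                page.append(record)
                return
        pages.append([record])  # no space: allocate a new overflow page

    def delete(self, key):
        pages = self.buckets[self._bucket(key)]
        for page in pages:
            page[:] = [rec for rec in page if rec[1] != key]
        # Drop empty overflow pages; the primary page is never de-allocated.
        pages[1:] = [p for p in pages[1:] if p]


idx = StaticHashIndex()
for rec in [("John", 53400, 23218564), ("Navneet", 54768, 60743111),
            ("Zuyu", 53409, 23200564), ("Han", 53633, 23209964),
            ("Theo", 34411, 29010533)]:
    idx.insert(rec)

print(idx.search(53633))  # [('Han', 53633, 23209964)]
```

Inserting a third record that hashes to bucket 0 would append an overflow page to that bucket, and every later search in bucket 0 would have to scan it as well.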

Hash Functions
- A good hash function is uniform, i.e. each bucket is assigned the same number of search key values (or records)
- A bad hash function maps all search key values to the same bucket
- Examples of good hash functions:
  - h(k) = a * k + b, where a and b are constants
  - h(k) = (a * k + b) % p, where p is a prime number
- Example of bad hash functions?
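To make "uniform" concrete, the sketch below counts how many keys land in each bucket under two hash functions. The key range, the constants a = 31, b = 7, p = 101, and the helper name `bucket_counts` are assumptions chosen for illustration. The constant function is one possible answer to the slide's closing question.

```python
from collections import Counter

def bucket_counts(keys, h, n_buckets):
    """Count how many keys land in each bucket under hash function h."""
    counts = Counter(h(k) % n_buckets for k in keys)
    return [counts.get(b, 0) for b in range(n_buckets)]

keys = range(53000, 54000)  # hypothetical zipcodes

def good(k):
    return (31 * k + 7) % 101  # (a * k + b) % p with p prime

def bad(k):
    return 42  # constant: every key collides

good_counts = bucket_counts(keys, good, 4)
bad_counts = bucket_counts(keys, bad, 4)
print(good_counts)  # roughly a quarter of the keys per bucket
print(bad_counts)   # [0, 0, 1000, 0]
```

With the constant function, every lookup in a static hash index would have to scan one long overflow chain.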

Static Hashing Problems
- The number of buckets in the index is fixed, hence:
  - If the database grows, the number of buckets will be too small
    - Long overflow chains can degrade performance (why?)
  - If the database shrinks, space is wasted
- Reorganizing the index is expensive and can block query execution
- Fix: dynamic hashing (e.g. extendible and linear hashing)

Extendible Hashing
- Keep a directory of pointers to buckets
- On overflow, double the directory (not the number of buckets)
- h(k) = k % 100, N = 4
- Example: find 53409*
[Figure: directory with global depth 2 and entries 00, 01, 10, 11; every bucket has local depth 2]
  Bucket A (00): (John, 53400, 23218564), (Navneet, 54768, 60743111)
  Bucket B (01): (Zuyu, 53409, 23200564)
  Bucket C (10): (empty)
  Bucket D (11): (Theo, 34411, 29010533)

Extendible Hashing (Cont.)
- Insert (Paul, 54717, 69967743)
[Figure: directory with global depth 2 and entries 00, 01, 10, 11; every bucket has local depth 2]
  Bucket A (00): (John, 53400, 23218564), (Navneet, 54768, 60743111)
  Bucket B (01): (Zuyu, 53409, 23200564), (Paul, 54717, 69967743)
  Bucket C (10): (empty)
  Bucket D (11): (Theo, 34411, 29010533)

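Extendible Hashing (Cont.)
- Insert (Meera, 561104, 60055657)
- h(561104) = 4 = 100 in binary, so the last two bits (00) point to Bucket A, which is already full
[Figure: state before the insert; directory with global depth 2 and entries 00, 01, 10, 11; every bucket has local depth 2]
  Bucket A (00): (John, 53400, 23218564), (Navneet, 54768, 60743111)
  Bucket B (01): (Zuyu, 53409, 23200564), (Paul, 54717, 69967743)
  Bucket C (10): (empty)
  Bucket D (11): (Theo, 34411, 29010533)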

Extendible Hashing (Cont.)
- Insert (Meera, 561104, 60055657)
[Figure: state after the insert; directory with global depth 3 and entries 000, 001, 010, 011, 100, 101, 110, 111]
  Bucket A (local depth 3): (John, 53400, 23218564)
  Bucket B (local depth 2): (Zuyu, 53409, 23200564), (Paul, 54717, 69967743)
  Bucket C (local depth 2): (empty)
  Bucket D (local depth 2): (Theo, 34411, 29010533)
  Bucket A2 (local depth 3): (Meera, 561104, 60055657), (Navneet, 54768, 60743111)
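The bucket assignments in these figures follow from the low-order bits of h(k). The sketch below recomputes them for global depths 2 and 3; the helper name `directory_slot` is made up for illustration, and it assumes, consistent with the figures, that the directory is indexed by the last bits of h(k) = k % 100.

```python
def directory_slot(key, global_depth, modulus=100):
    """Directory index: the last `global_depth` bits of h(k) = k % modulus."""
    return (key % modulus) & ((1 << global_depth) - 1)

zipcodes = {"John": 53400, "Navneet": 54768, "Zuyu": 53409,
            "Paul": 54717, "Theo": 34411, "Meera": 561104}

slots = {}
for name, z in zipcodes.items():
    depth2 = format(directory_slot(z, 2), "02b")
    depth3 = format(directory_slot(z, 3), "03b")
    slots[name] = (depth2, depth3)
    print(name, depth2, depth3)
# John 00 000
# Navneet 00 100
# Zuyu 01 001
# Paul 01 001
# Theo 11 011
# Meera 00 100
```

At depth 2, John, Navneet, and Meera all map to 00, which is why Bucket A overflows. At depth 3 the extra bit separates John (000) from Navneet and Meera (100).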

Pros and Cons of Extendible Hashing
- Benefits:
  - The directory is much smaller than the entire index file
  - Only one page of data entries is split
- Drawbacks:
  - Overflow pages are still needed on key collisions, i.e. when multiple data entries have the same hash value

Operations on Extendible Hashing
- Search
  - Apply the hash function h(k)
  - Take the last (global depth) bits of h(k), …
- Insert
  - Find the target bucket
  - If the bucket has space, insert, done!
  - If the bucket is full, split it and redistribute its entries
  - If necessary, double the directory
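The search and insert steps above can be sketched as follows. This is a toy in-memory version under stated assumptions, not the lecture's implementation: the class and method names are hypothetical, buckets hold two records as drawn on the slides, records are (name, zipcode, phone) tuples, and records whose hash values are all identical are simply appended to the bucket as a stand-in for an overflow page.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.records = []


class ExtendibleHashIndex:
    """Toy extendible hash index keyed on record[1] (the zipcode)."""

    def __init__(self, global_depth=2, capacity=2, modulus=100):
        self.global_depth = global_depth
        self.capacity = capacity
        self.modulus = modulus
        # Start with one distinct bucket per directory slot.
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _hash(self, key):
        return key % self.modulus

    def _slot(self, key):
        # Last `global_depth` bits of h(k).
        return self._hash(key) & ((1 << self.global_depth) - 1)

    def search(self, key):
        bucket = self.directory[self._slot(key)]
        return [r for r in bucket.records if r[1] == key]

    def insert(self, record):
        bucket = self.directory[self._slot(record[1])]
        if len(bucket.records) < self.capacity:
            bucket.records.append(record)
            return
        hashes = {self._hash(r[1]) for r in bucket.records}
        hashes.add(self._hash(record[1]))
        if len(hashes) == 1:
            # Identical hash values can never be split apart;
            # appending here stands in for an overflow page.
            bucket.records.append(record)
            return
        if bucket.local_depth == self.global_depth:
            # Double the directory; the new half mirrors the old half.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        self._split(bucket)
        self.insert(record)  # retry; may split again if the keys still collide

    def _split(self, bucket):
        bit = 1 << bucket.local_depth  # the newly examined bit
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        # Slots that pointed to `bucket` and have the new bit set move to the image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image
        old, bucket.records = bucket.records, []
        for r in old:
            self.directory[self._slot(r[1])].records.append(r)


idx = ExtendibleHashIndex()
for rec in [("John", 53400, 23218564), ("Navneet", 54768, 60743111),
            ("Zuyu", 53409, 23200564), ("Theo", 34411, 29010533),
            ("Paul", 54717, 69967743), ("Meera", 561104, 60055657)]:
    idx.insert(rec)

print(idx.global_depth)                          # 3
print([r[0] for r in idx.directory[0].records])  # ['John']
print([r[0] for r in idx.directory[4].records])  # ['Navneet', 'Meera']
```

Doubling the directory copies pointers only; just the overflowing bucket is split, which matches the benefit listed on the pros-and-cons slide.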

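Insert: Example (Revisited)
- Insert (Salma, 561121, 64837757)
- h(561121) = 21 = 10101 in binary, so the last three bits (101) point to Bucket B, which is full
- Bucket B has local depth 2, below the global depth of 3, so it can be split without doubling the directory
[Figure: state before the insert; directory with global depth 3 and entries 000, 001, 010, 011, 100, 101, 110, 111]
  Bucket A (local depth 3): (John, 53400, 23218564)
  Bucket B (local depth 2): (Zuyu, 53409, 23200564), (Paul, 54717, 69967743)
  Bucket C (local depth 2): (empty)
  Bucket D (local depth 2): (Theo, 34411, 29010533)
  Bucket A2 (local depth 3): (Meera, 561104, 60055657), (Navneet, 54768, 60743111)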

Operations on Extendible Hashing (Cont.)
- Delete
  - Locate the bucket of the record and remove the record
  - If the bucket becomes empty, remove it (and update the directory)
  - Two buckets can also be coalesced if their combined entries fit in a single bucket
  - Decreasing the size of the directory can also be done, but it is expensive

Operations on Extendible Hashing (Cont.)
- How many disk accesses for equality search?
  - One if the directory fits in memory, else two
- The directory grows in spurts
  - If the distribution of hash values is skewed, the directory can grow very large
- Can you think of an example where the directory suddenly grows?

Extendible Hashing (Cont.)
- h(k) = k % 1000, N = 8
- How about inserting the following rows in order?
  - (Cecilia, 79768, 69386254)
  - (Petr, 80896, 69386255)
  - (Sajika, 80832, 69386256)
[Figure: directory with global depth 3 and entries 000, 001, 010, 011, 100, 101, 110, 111]
  Bucket A (local depth 3): (John, 79744, 23218564), (Meera, 561104, 60055657)
  Bucket B (local depth 2): (Zuyu, 53409, 23200564), (Paul, 54717, 69967743)
  Bucket C (local depth 2): (empty)
  Bucket D (local depth 2): (Theo, 34411, 29010533)
  Bucket A2 (local depth 3): (Navneet, 54772, 60743111)
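One way to work through this question is to ask how many low-order bits are needed before the colliding hash values stop looking identical. The sketch below does that; the function name `depth_to_separate` is hypothetical, and it assumes a bucket capacity of two and a directory indexed by the last bits of h(k) = k % 1000, as in the figure.

```python
def depth_to_separate(hash_values, start_depth):
    """Smallest depth >= start_depth at which the values differ in their low-order bits."""
    if len(set(hash_values)) == 1:
        raise ValueError("identical hash values can never be separated")
    depth = start_depth
    while len({h & ((1 << depth) - 1) for h in hash_values}) == 1:
        depth += 1
    return depth

# Bucket A holds John (h = 744) and Meera (h = 104); Cecilia (h = 768) arrives.
after_cecilia = depth_to_separate([744, 104, 768], 3)
print(after_cecilia, 2 ** after_cecilia)  # 4 16

# Petr (h = 896) fits next to Cecilia. Sajika (h = 832) then overflows that bucket.
after_sajika = depth_to_separate([768, 896, 832], 4)
print(after_sajika, 2 ** after_sajika)    # 7 128
```

Under these assumptions the third insert alone pushes the global depth from 4 to 7, so the directory jumps from 16 to 128 entries: the hash values 768, 896, and 832 agree in their last six bits.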

Recap
- Hash indexes
  - Efficient equality search
- Static hashing
  - Simple, limited
- Dynamic hashing
  - Example: extendible hashing

Composite Search Keys
- The search key is composed of multiple attributes, e.g. (Name, Address)
- Hash index
  - Define the hash function to map each combination (e.g. of Name and Address) to a hash code
- B+tree
  - Sort keys by Name, then Address
[Figure: B+tree on a two-attribute key. Index entries: (3, 15), (3, 112), (5, 8). Data entries: (2, 5)*, (3, 13)*, (3, 14)*, (3, 17)*, (3, 100)*, (4, 1)*, (5, 2)*, (100, 3)*, (700, 5)*]
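The difference between the two index types on a composite key can be illustrated with Python tuples, which compare attribute by attribute just like the B+tree sort order described above. The entries reuse values from the figure; the helper name `composite_bucket` and the bucket count are assumptions for illustration.

```python
from bisect import bisect_left

entries = [(3, 17), (2, 5), (5, 2), (3, 13), (4, 1), (3, 100), (3, 14)]

# B+tree order: compare on the first attribute, then on the second.
sorted_entries = sorted(entries)
print(sorted_entries)
# [(2, 5), (3, 13), (3, 14), (3, 17), (3, 100), (4, 1), (5, 2)]

# Because of that order, all entries with first attribute 3 are contiguous,
# so a lookup on the first attribute alone is a single range scan.
lo = bisect_left(sorted_entries, (3,))
hi = bisect_left(sorted_entries, (4,))
prefix_matches = sorted_entries[lo:hi]
print(prefix_matches)  # [(3, 13), (3, 14), (3, 17), (3, 100)]

# Hash index: the whole combination is hashed, so a lookup needs
# every attribute of the key to find the bucket.
def composite_bucket(key, n_buckets=4):
    return hash(key) % n_buckets

same_bucket = composite_bucket((3, 13)) == composite_bucket((3, 13))
```

A hash on the full tuple gives no way to find all entries whose first attribute is 3 short of scanning every bucket.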

Indexes in SQL
- CREATE INDEX statement:
    CREATE INDEX UserNameInx ON User(Name);
    CREATE INDEX NameAgeInx ON User(Name, Age);
    CREATE UNIQUE INDEX EvNameInx ON Event(Name);
    CREATE INDEX EvNHash ON Event USING hash (Name);
    CREATE INDEX AnotherNameAgeInx ON User(Name ASC, Age DESC);
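Most of these statements can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table definitions and the sample event name are invented, since the slide does not show them, and the `USING hash` statement is left out because it is PostgreSQL-style syntax and SQLite does not offer hash indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table definitions; the slide shows only the indexes.
    CREATE TABLE User (Name TEXT, Age INTEGER);
    CREATE TABLE Event (Name TEXT);

    CREATE INDEX UserNameInx ON User(Name);
    CREATE INDEX NameAgeInx ON User(Name, Age);
    CREATE UNIQUE INDEX EvNameInx ON Event(Name);
    CREATE INDEX AnotherNameAgeInx ON User(Name ASC, Age DESC);
""")

index_names = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'"))
print(index_names)
# ['AnotherNameAgeInx', 'EvNameInx', 'NameAgeInx', 'UserNameInx']

# A UNIQUE index rejects a second row with the same search key value.
conn.execute("INSERT INTO Event VALUES ('Homecoming')")
try:
    conn.execute("INSERT INTO Event VALUES ('Homecoming')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```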

Classification of Indexes
- A table can have multiple indexes
- Two distinctions: primary vs secondary, clustered vs unclustered
- Primary index: the search key of the index contains the primary key of the table
- Secondary index: any index that is not a primary index
- Unique index: the search key contains a candidate key

Classification of Indexes (Cont.)
- Clustered index: the order of records (in the data file) is the same as, or "close to", the order of data entries in the index
  - If k* is the actual record, then the index is clustered
- A table can be clustered on at most one search key
- Data retrieval cost varies significantly depending on whether or not the index/table is clustered
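The cost difference can be illustrated by counting how many data pages a range query touches under two record layouts. This is a simplified sketch: the page size, the record count, and the round-robin layout standing in for an unclustered file are all assumptions, and the function name `pages_touched` is made up.

```python
def pages_touched(page_of_record, lo, hi):
    """Count distinct data pages holding records with lo <= key <= hi."""
    return len({page_of_record[k] for k in range(lo, hi + 1)})

num_records, per_page = 1000, 10
num_pages = num_records // per_page

# Clustered: records are stored in key order, 10 per page.
clustered = {k: k // per_page for k in range(num_records)}
# Unclustered (one hypothetical layout): keys are scattered round-robin.
unclustered = {k: k % num_pages for k in range(num_records)}

clustered_cost = pages_touched(clustered, 200, 249)
unclustered_cost = pages_touched(unclustered, 200, 249)
print(clustered_cost)    # 5
print(unclustered_cost)  # 50
```

In this layout the 50 matching records sit on 5 consecutive pages when the file is clustered, but on 50 different pages when it is not.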

Next Up: External Sorting
Questions?