Database Design and Programming

Slides:



Advertisements
Similar presentations
CS4432: Database Systems II Hash Indexing 1. Hash-Based Indexes Adaptation of main memory hash tables Support equality searches No range searches 2.
Advertisements

©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
DBMS 2001Notes 4.2: Hashing1 Principles of Database Management Systems 4.2: Hashing Techniques Pekka Kilpeläinen (after Stanford CS245 slide originals.
Hashing and Indexing John Ortiz.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 12: Indexing and.
Index tuning Hash Index. overview Introduction Hash-based indexes are best for equality selections. –Can efficiently support index nested joins –Cannot.
Dr. Kalpakis CMSC 661, Principles of Database Systems Index Structures [13]
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
COMP 451/651 Indexes Chapter 1.
Hash Table indexing and Secondary Storage Hashing.
1 Indexing and Hashing Indexing and Hashing Basic Concepts Dense and Sparse Indices B+Trees, B-trees Dynamic Hashing Comparison of Ordered Indexing and.
Spring 2003 ECE569 Lecture ECE 569 Database System Engineering Spring 2003 Yanyong Zhang
CPSC-608 Database Systems Fall 2010 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #8.
CPSC-608 Database Systems Fall 2011 Instructor: Jianer Chen Office: HRBB 315C Phone: Notes #11.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
CS 4432lecture #10 - indexing & hashing1 CS4432: Database Systems II Lecture #10 Professor Elke A. Rundensteiner.
CS 277 – Spring 2002Notes 51 CS 277: Database System Implementation Arthur Keller Notes 5: Hashing and More.
Primary Indexes Dense Indexes
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
CS CS4432: Database Systems II. CS Index definition in SQL Create index name on rel (attr) (Check online for index definitions in SQL) Drop.
1 CS143: Index. 2 Topics to Learn Important concepts –Dense index vs. sparse index –Primary index vs. secondary index (= clustering index vs. non-clustering.
Ch12: Indexing and Hashing  Basic Concepts  Ordered Indices B+-Tree Index Files B+-Tree Index Files B-Tree Index Files B-Tree Index Files  Hashing Static.
CS4432: Database Systems II
Database Management 8. course. Query types Equality query – Each field has to be equal to a constant Range query – Not all the fields have to be equal.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15,
Basic Concepts Indexing mechanisms used to speed up access to desired data. E.g., author catalog in library Search Key - attribute to set of attributes.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
CS 245Notes 51 CS 245: Database System Principles Hector Garcia-Molina Notes 5: Hashing and More.
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
1 CPS216: Advanced Database Systems Notes 05: Operators for Data Access (contd.) Shivnath Babu.
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 111 Database Systems II Index Structures.
1 Ullman et al. : Database System Principles Notes 5: Hashing and More.
1 Query Processing Part 3: B+Trees. 2 Dense and Sparse Indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Insertions.
CPSC 8620Notes 61 CPSC 8620: Database Management System Design Notes 6: Hashing and More.
Access Structures COMP3211 Advanced Databases Dr Nicholas Gibbins
COMP3017 Advanced Databases
CS 245: Database System Principles
CPS216: Data-intensive Computing Systems
Relational Database Systems 2
Multiway Search Trees Data may not fit into main memory
Azita Keshmiri CS 157B Ch 12 indexing and hashing
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
Hashing CENG 351.
CPSC-608 Database Systems
Database Management Systems (CS 564)
Introduction to Database Systems
Yan Huang - CSCI5330 Database Implementation – Access Methods
(Slides by Hector Garcia-Molina,
External Memory Hashing
Indexing and Hashing Basic Concepts Ordered Indices
Index tuning Hash Index.
Indexing and Hashing B.Ramamurthy Chapter 11 2/5/2019 B.Ramamurthy.
Database Systems (資料庫系統)
2018, Spring Pusan National University Ki-Joune Li
CPS216: Advanced Database Systems
Storage and Indexing.
Chapter 11: Indexing and Hashing
CS4433 Database Systems Indexing.
CS4432: Database Systems II
Index Structures Chapter 13 of GUW September 16, 2019
Presentation transcript:

Database Design and Programming Jan Baumbach jan.baumbach@imada.sdu.dk http://www.baumbachlab.net

Sparse vs Dense Sparse uses less index space per record (can keep more of index in memory) Sparse allows multi-level indexes Dense can tell if record exists without accessing it Dense needed for secondary indexes Primary index = order of records in storage Secondary index = impose different order

Secondary Index 2nd level Secondary Index Sequential File 40 20 10 30 50 Secondary Index 10 20 30 40 50 60 Sequential File 40 20 10 30 50 60 Careful when Looking for 20

Secondary Index 2nd level Secondary Index Sequential File 40 20 10 30 50 Secondary Index Sequential File 40 20 10 30 50 60 10 20 30 40 50 60

Combining Indexes SELECT * FROM Sells WHERE beer = “Od.Cl.“ AND price = “20“ Beer index Sells Price index OC 20 C.Ch. Just intersect buckets in memory!

Conventional Indexes Sparse, Dense, Multi-level, ... Advantages: Simple Sequential index is good for scans Disadvantage: Inserts expensive Lose sequentiality and balance

Example: Unbalanced Index 10 33 39 31 35 36 32 38 34 overflow area (not sequential) 20 30 40 50 60 70 80 90

B+Trees

Idea Conventional indexes are fixed-level Give up sequentiality of the index in favour of balance B+Tree = variant of B-Tree Allows index tree to grow as needed Ensures that all blocks are between half used and completely full

Characteristics Parameter n determines number of keys per node Example: Key size 4 bytes, pointer size 8 bytes Memory consumption for n records: 4 bytes x n + 8 bytes x (n+1) Example: Block size 4096 bytes 4n + 8(n+1) < 4096  n=340

Characteristics Leafs contain at least n/2 key-pointer pairs to records and a pointer to the next leaf Inner nodes contain at least (n-1)/2 keys and at least n/2 pointers to other nodes No restrictions for the root node

Example: B+Tree (n=3) 42 11 23 64 3 6 9 11 15 17 23 31 37 42 57 64 85

Example: Leaf node 42 57 To next leaf To record With key 42 To record

Example: Interior node 11 23 To keys K < 11 To keys 11 ≤ K < 23 To keys 23 ≤ K

Restrictions Full node at least (n-1)/2 keys Non-leaf Leaf 11 23 42 64 11 15 17 64 85 Counts even when null

Insertion If there is free space in the appropriate leaf, just insert it there Otherwise: Split the leaf in two and divide the keys Insert the smallest value reachable through the right node into the parent node Recurse until there is enough room Special case: Splitting the root results in a new root

Example: Insertion Insert 85 11 23 42 3 6 9 11 17 23 31 37 42 57 85 42

Example: Insertion Insert 15 11 23 42 3 6 9 11 15 17 11 17 23 31 37 42 57 85

Example: Insertion Insert 64 42 42 11 23 42 11 23 64 3 6 9 11 15 17 23 31 37 42 57 85 42 57 64 85

Deletion If there are enough keys left in the appropriate leaf, just delete the key Otherwise: If there is a direct sibling with more than minimum key, steal one! If not, join the node with a direct sibling and delete the smallest value reachable through the former right sibling from its parent Special case: If the root contains only one pointer after deletion, delete it

Example: Deletion Delete 9 42 11 23 64 3 6 9 3 6 3 6 9 11 15 17 23 31 37 42 57 64 85

Example: Deletion Delete 3 42 11 23 15 23 64 6 11 3 6 6 3 6 15 17 11 31 37 42 57 64 85

Example: Deletion Delete 11 42 15 23 23 64 6 11 6 6 11 6 15 17 15 17 31 37 42 57 64 85

Example: Deletion Delete 17, 37 42 23 64 6 15 6 15 17 23 31 23 31 37 57 64 85

Example: Deletion Delete 31 42 42 64 23 64 6 15 6 15 23 6 15 23 23 31 57 64 85

Efficiency Need to load one block for each level! Example: With n = 340 and an average fill of 255 pointers, we can index 255^3 = 16.6 million records in only 3 levels There are at most 342 blocks in the first two levels First two levels can be kept in memory using less than 1.4 Mbyte Need to access only one block per level!

Range Queries Queries often restrict an attribute to a range of values Example: SELECT * FROM Sells WHERE price > 20; Records are found efficiently by searching for value 20 and then traversing the leafs Can also be used if there is both an upper and a lower limit

Summary 10 More things you should know: Dense Index, Sparse Index Multi-Level Indexes Primary vs Secondary Index Structure of B+Trees Insertion and Deletion in B+Trees

Hash Tables

Hash Table in Primary Storage Main parameter B = number of buckets Hash function h maps key to numbers from 0 to B-1 Bucket array indexed from 0 to B-1 Each bucket contains exactly one value Strategy for handling conflicts

Example: B = 4 Insert c (h(c) = 3) Insert a (h(a) = 1) Insert e (h(e) = 1) Alternative 1: Search for free bucket, e.g. by Linear Probing Alternative 2: Add overflow bucket Conflict! 1 2 3 a e e c .

Hash Function Hash function should ensure hash values are equally distributed For integer key K, take h(K) = K modulo B For string key, add up the numeric values of the characters and compute the remainder modulo B For really good hash functions, see Donald Knuth, The Art of Computer Programming: Volume 3 – Sorting and Searching

Hash Table in Secondary Storage Each bucket is a block containing f key-pointer pairs Conflict resolution by probing potentially leads to a large number of I/Os Thus, conflict resolution by adding overflow buckets Need to ensure we can directly access bucket i given number i

Example: Insertion, B=4, f=2 Insert a Insert b Insert c Insert d Insert e Insert g Insert i d a e 1 a 1 1 i 2 b 2 c g 3 c 3 3

Efficiency Very efficient if buckets use only one block: one I/O per lookup Space utilization is #keys in hash divided by total #keys that fit Try to keep between 50% and 80%: < 50% wastes space > 80% significant number of overflows

Dynamic Hashing How to grow and shrink hash tables? Alternative 1: Use overflows and reorganizations Alternative 2: Use dynamic hashing Extensible Hash Tables Linear Hash Tables

Extensible Hash Tables Hash function computes sequence of k bits for each key k = 8 At any time, use only the first i bits Introduce indirection by a pointer array Pointer array grows and shrinks (size 2i ) Pointers may share data blocks (store number of bits used for block in j ) 00110101 i = 3

Example: k = 4, f = 2 i = 2 i = 1 0001 0111 1 00 01 10 11 1001 1010 2 1100 2

Insertion Find destination block B for key-pointer pair If there is room, just insert it Otherwise, let j denote the number of bits used for block B If j = i, increment i by 1: Double the length of the bucket array to 2i+1 Adjust pointers such that for old bit strings w, w0 and w1 point to the same bucket Retry insertion

Insertion If j < i, add a new block B‘: Key-pointer pairs with (j+1)st bit = 0 stay in B Key-pointer pairs with (j+1)st bit = 1 go to B‘ Set number of bits used to j+1 for B and B‘ Adjust pointers in bucket array such that if for all w where previously w0 and w1 pointed to B, now w1 points to B‘ Retry insertion

Example: Insert, k = 4, f = 2 Insert 1010 i = 2 i = 1 0001 1001 1001 11 1 1001 1 1001 1010 2 1001 1100 1 1001 2 1100 2 1100 1

Example: Insert, k = 4, f = 2 Insert 0111 i = 1 i = 2 0001 0001 0111 10 11 1001 1010 2 1100 2

Example: Insert, k = 4, f = 2 Insert 0000 i = 2 i = 1 0001 0001 0001 0111 1 0001 0000 2 00 01 10 11 0111 2 0111 1 1001 1010 2 1100 2

Deletion Find destination block B for key-pointer pair Delete the key-pointer pair If two blocks B referenced by w0 and w1 contain at most f/2 keys, merge them, decrease their j by 1, and adjust pointers If there is no block with j = i, reduce the pointer array to size 2i-1 and decrease i by 1

Example: Delete, k = 4, f = 2 Delete 0000 i = 2 i = 1 0001 0001 0111 10 11 0111 2 1001 1010 2 1100 2

Example: Delete, k = 4, f = 2 Delete 0111 i = 2 i = 1 0001 0001 0111 10 11 1001 1010 2 1100 2

Example: Delete, k = 4, f = 2 Delete 1010 i = 2 i = 1 0001 1001 1001 11 1001 2 1001 1100 2 1001 1010 2 1001 1100 1 1100 2

Efficiency As long as pointer array fits into memory and hash function behaves nicely, just need one I/O per lookup Overflows can still happen if many key-pointer pairs hash to the same bit string Solve by adding overflow blocks

Extensible Hash Tables Advantage: Not too much waste of space No full reorganizations needed Disadvantages: Doubling the pointer array is expensive Performance degrades abruptly For f = 2, k = 32, if there are 3 keys for which the first 20 bits agree, we already need a pointer array of size 1048576

Linear Hash Tables Number of records: r Number of buckets: n Choose n such that on average between 50% and 80% of a block filled with records pmin = 0.5, pmax = 0.8 Use ceiling (log2 n) lower bits for addressing If the bit string used for addressing corresponds to integer m and m≥n, use m-2i-1 instead

Example: k = 4, f = 2 i = 2 i = 1 1100 n = 4 0001 1001 0101 r = 6 1010 1 2 3 1 2 1100 n = 4 0001 1001 0101 r = 6 1010 0111

Example: k = 4, f = 2 i = 1 i = 2 1100 n = 4 0001 1001 0101 r = 6 1010 nr. of key-pointer pairs per bucket Example: k = 4, f = 2 size of the key i = 1 i = 2 1 2 3 1 2 1100 nr. of bits used n = 4 nr. of buckets 0001 1001 0101 r = 6 nr. of records 1010 0111

Insertion Find appropriate bucket (h(K) or h(K)-2i-1) If there is room, insert the key-pointer pair Otherwise, create an overflow block and insert the key-pointer pair there Increase r by 1; if r/n>f∙pmax, add bucket: If the binary representation of n is 1a2...ai, split bucket 0a2...ai according to the i -th bit Increase n by 1 If n > 2i, increase i by 1

Example: Insert, f = 2, pmax = 0.8 nr. of key-pointer pairs per bucket max. average nr. of records per block Example: Insert, f = 2, pmax = 0.8 Insert 1010 i = 1 i = 1 1 1100 1010 1100 nr. of bits used n = 2 nr. of buckets 0001 1001 r = 3 r = 4 nr. of records

Example: Insert, f = 2, pmax = 0.8 nr. of key-pointer pairs per bucket max. average nr. of records per block Example: Insert, f = 2, pmax = 0.8 Check: r/n>f∙pmax? YES!  4/2 > 1.6 i = 2 i = 1 i = 1 1 2 1 1100 1010 1100 nr. of bits used n = 3 n = 2 nr. of buckets 0001 1001 r = 4 r = 3 nr. of records 1010

Example: Insert, f = 2, pmax = 0.8 nr. of key-pointer pairs per bucket max. average nr. of records per block Example: Insert, f = 2, pmax = 0.8 Insert 0111 i = 2 i = 1 1 2 1100 nr. of bits used n = 3 nr. of buckets 0001 1001 0111 r = 4 r = 3 r = 5 nr. of records 1010

Example: Insert, f = 2, pmax = 0.8 nr. of key-pointer pairs per bucket max. average nr. of records per block Example: Insert, f = 2, pmax = 0.8 Attention: 5/3 > 1.6 i = 1 i = 2 1 2 3 1 2 1100 nr. of bits used n = 3 n = 4 nr. of buckets 0001 1001 0111 r = 5 nr. of records 1010 0111

Example: Insert, f = 2, pmax = 0.8 nr. of key-pointer pairs per bucket max. average nr. of records per block Example: Insert, f = 2, pmax = 0.8 Insert 0101 i = 2 i = 1 1 2 1 2 3 1100 nr. of bits used n = 4 nr. of buckets 0001 1001 0101 r = 5 r = 6 r = 6 nr. of records 1010 0111

Linear Hash Tables Advantage: Disadvantages: Not too much waste of space No full reorganizations needed No indirections needed Disadvantages: Can still have overflow chains

B+Trees vs Hashing Hashing good for given key values Example: SELECT * FROM Sells WHERE price = 20; B+Trees and conventional indexes good for range queries: Example: SELECT * FROM Sells WHERE price > 20;

Summary 11 More things you should know: Hashing in Secondary Storage Extensible Hashing Linear Hashing

THE END