CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.


Today's Topic
- Recap of B+-tree
- File and index organization
Jinze Liu @ University of Kentucky, 3/17/2016

Example B+ Tree: Inserting 8*
Before the insert:
  Root: 13 | 17 | 24 | 30
  Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
After inserting 8*:
  Root: 17
  Index nodes: [5 | 13] [24 | 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
- Notice that the root was split, leading to an increase in height.
- In this example, we could avoid the split by redistributing entries; however, this is usually not done in practice.
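The two splits in this example can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names and the leaf capacity of 4 entries are assumptions chosen to match the figure. A leaf split copies the separator key up, while an index-node split pushes the middle key up.

```python
from bisect import insort

def insert_into_leaf(leaf, key, capacity=4):
    """Insert key into a sorted leaf (hypothetical helper).

    Returns (left, right, separator). right and separator are None
    when the leaf does not overflow.
    """
    entries = list(leaf)
    insort(entries, key)
    if len(entries) <= capacity:
        return entries, None, None
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    # Copy up: the separator stays in the right leaf as well.
    return left, right, right[0]

def split_index_node(keys):
    """Split an overflowing index node (hypothetical helper).

    Push up: the middle key moves to the parent and is kept in neither half.
    """
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# Inserting 8* into the leaf [2* 3* 5* 7*] splits it and copies up 5.
left, right, sep = insert_into_leaf([2, 3, 5, 7], 8)
# The old root [13 17 24 30] receives 5, overflows, and pushes up 17.
left_keys, pushed_up, right_keys = split_index_node(sorted([13, 17, 24, 30] + [sep]))
```

Pushing 17 up into a new root is what increases the height of the tree.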


Example Tree (including 8*): Delete 19* and 20*...
Before the deletes:
  Root: 17
  Index nodes: [5 | 13] [24 | 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]
After deleting 19* and 20*:
  Root: 17
  Index nodes: [5 | 13] [27 | 30]
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]
- Deleting 19* is easy.
- Deleting 20* is done with redistribution. Notice how the middle key (27) is copied up.
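The redistribution step can be sketched as follows. The helper name and the minimum occupancy of 2 entries per leaf are assumptions that match the figure; the sketch also assumes the right sibling has entries to spare.

```python
def redistribute_from_right(leaf, right_sibling, min_entries=2):
    """Borrow entries from the right sibling until the leaf is at least
    half full (hypothetical helper).

    Returns (leaf, right_sibling, new_separator). The new separator is the
    first key of the right sibling, copied up into the parent.
    """
    leaf, right = list(leaf), list(right_sibling)
    while len(leaf) < min_entries and len(right) > min_entries:
        leaf.append(right.pop(0))
    return leaf, right, right[0]

# After deleting 19* and 20*, the leaf holds only 22* and underflows.
# Its right sibling [24* 27* 29*] can spare one entry.
leaf, sibling, separator = redistribute_from_right([22], [24, 27, 29])
```

The parent replaces its old separator 24 with 27, which is why its keys change from 24 | 30 to 27 | 30.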

... And Then Deleting 24*
- Must merge.
- Observe the 'toss' of the index entry (key 27, on right) and the 'pull down' of the index entry (below).
After merging the leaves (right subtree):
  Index node: [30]
  Leaves: [22* 27* 29*] [33* 34* 38* 39*]
After pulling down the index entry:
  Root: 5 | 13 | 17 | 30
  Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]
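A rough sketch of the two merges, with hypothetical helper names. Merging leaves tosses the separator between them; merging index nodes pulls the parent's separator down into the merged node.

```python
def merge_leaves(left, right):
    """Merge two sibling leaves (hypothetical helper).

    The separator key between them is simply tossed from the parent.
    """
    return list(left) + list(right)

def merge_index_nodes(left_keys, parent_key, right_keys):
    """Merge two sibling index nodes (hypothetical helper).

    The separator key from the parent is pulled down between them.
    """
    return list(left_keys) + [parent_key] + list(right_keys)

# Deleting 24* leaves [22*], and the sibling [27* 29*] has nothing to spare.
merged_leaf = merge_leaves([22], [27, 29])
# The index node [27 | 30] loses 27 and underflows to [30]; it merges with
# its sibling [5 | 13], pulling down the root key 17.
new_root = merge_index_nodes([5, 13], 17, [30])
```

Because the old root held only the key 17, pulling it down empties the root and the tree loses one level.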

Index and Data File Organization
- Heap file vs. B+-tree index file
[Figure: heap file with a header page, data pages with free space, and full data pages.]

Alternatives for Data Entry in Index
- Three alternatives: primary indexes, secondary indexes, clustering indexes.
- Can have multiple (different) indexes per file.
- E.g., Employee = (EID, name, age, salary). The primary key is EID, name is unique, and age and salary are non-key attributes.
- We may (or may not) sort the file by EID, with a B+-tree index on EID (primary), name (secondary), and age (secondary), and a hash index on salary.

Primary Indexes
- Index for the primary key: the file is sorted according to the primary key.
- The first record in each block is called the anchor record and is used to build the index.
- Sometimes it works the other way around (called B+-tree based sorting).
[Figure: CLUSTERED index over a data file with records (123, Susan, 30, 50K), (124, John, 40, 80K), ...]
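As a small illustration of how anchor records give a sparse index, the sketch below keeps one key per block and uses binary search to pick the block. The function names are hypothetical, and the records in the second block are invented for illustration; only the first two records come from the slide.

```python
from bisect import bisect_right

# Data file sorted by EID, two records per block.
blocks = [
    [(123, "Susan", 30, "50K"), (124, "John", 40, "80K")],  # from the slide
    [(200, "Ann", 35, "60K"), (205, "Bob", 28, "45K")],     # hypothetical records
]

def build_sparse_index(blocks):
    """One index entry per block: the key of its anchor (first) record."""
    return [block[0][0] for block in blocks]

def lookup(blocks, anchors, eid):
    """Find the last block whose anchor key is <= eid, then scan that block."""
    i = bisect_right(anchors, eid) - 1
    if i < 0:
        return None
    for record in blocks[i]:
        if record[0] == eid:
            return record
    return None

anchors = build_sparse_index(blocks)
found = lookup(blocks, anchors, 124)
```

The index stays sparse because the sorted file order makes one entry per block sufficient.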

Secondary Index
- Alternative 2: easier to maintain (no need to sort the data file).
- k must be a candidate key of the relation, e.g., a unique attribute.
- The data pointer is the RID of the record indexed by k, e.g., (Block#, slot#).
[Figure: UNCLUSTERED index; leaf node entries point into a data file with records (123, Susan, 30, 50K), (124, John, 40, 80K), ...]

Clustering Index
- Alternative 3: the data file also needs to be sorted.
- k must be a non-key attribute of the relation.
- The data pointer is the BID of the first RID that is indexed by k.
[Figure: CLUSTERED index; leaf node with entries (Block 2, slot 1), (Block 1, slot 1), (Block 1, slot 2), ...]

Index Classification
- Clustered vs. unclustered: if the order of data records is the same as, or 'close to', the order of index data entries, then the index is called clustered.
- A file can be clustered on at most one search key.
- The cost of retrieving data records through an index varies greatly based on whether the index is clustered or not!

Clustered vs. Unclustered Index
- Suppose that a clustering index is used for data entries, and that the data records are stored in a heap file.
- To build a clustering index, first sort the heap file (with some free space on each block for future inserts).
- Overflow blocks may be needed for inserts. (Thus, the order of data records is 'close to', but not identical to, the sort order.)
[Figure: CLUSTERED vs. UNCLUSTERED. In both, index entries (index file) direct the search for data entries, which point to data records (data file).]

Comparison

Index        Dense (D) / Sparse (S)   Clustered (C) or not (N)   Sorting required
Primary      S                        C                          Yes
Secondary    D                        N                          No
Clustering   S                        C                          Yes

Cost of Operations
- B: the number of data pages
- R: the number of records per page
- D: (average) time to read or write a disk page

                    Heap File     Sorted File              Clustered File
Scan all records    BD            BD                       1.5 BD
Equality search     0.5 BD        (log_2 B) * D            (log_F 1.5B) * D
Range search        BD            search + #match pg * D   search + #match pg * D
Insert              2D            search + BD              search + D
Delete              search + D    search + BD              search + D
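To get a feel for the magnitudes, the scan and equality-search rows can be evaluated for concrete values. This is a sketch under assumptions: F is taken to be the fan-out of the index, which the slide does not define, and the default F = 100 and the function names are illustrative only.

```python
import math

def scan_cost(kind, B, D):
    """Cost of scanning all records, per the table above."""
    if kind in ("heap", "sorted"):
        return B * D
    if kind == "clustered":
        return 1.5 * B * D  # the table charges 1.5B pages for a clustered file
    raise ValueError(f"unknown file organization: {kind}")

def equality_search_cost(kind, B, D, F=100):
    """Cost of an equality search, per the table above.

    F is assumed to be the index fan-out (not defined on the slide).
    """
    if kind == "heap":
        return 0.5 * B * D            # on average, half the pages are read
    if kind == "sorted":
        return math.log2(B) * D       # binary search over the pages
    if kind == "clustered":
        return math.log(1.5 * B, F) * D
    raise ValueError(f"unknown file organization: {kind}")

# Example: B = 1024 data pages, D = 1 time unit per page I/O.
heap = equality_search_cost("heap", 1024, 1)
sorted_file = equality_search_cost("sorted", 1024, 1)
clustered = equality_search_cost("clustered", 1024, 1)
```

Even for a modest file, the heap-file search reads hundreds of pages while the sorted and clustered files read only a handful.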

Why Sort?
- A classic problem in computer science!
- Data is requested in sorted order, e.g., find students in increasing GPA order.
- Sorting is the first step in bulk loading a B+ tree index.
- Sorting is useful for eliminating duplicate copies in a collection of records. (Why?)
- Sorting is useful for summarizing related groups of tuples.
- The sort-merge join algorithm involves sorting.
- Problem: sort 100 GB of data with 1 GB of RAM. Why not virtual memory?

2-Way Sort: Requires 3 Buffers
- Pass 0: read a page, sort it, write it. Only one buffer page is used (as in the previous slide).
- Pass 1, 2, 3, ..., etc.: merge pairs of runs into runs twice as long. Three buffer pages are used.
[Figure: main memory buffers INPUT 1, INPUT 2, and OUTPUT, with pages read from and written to disk.]

Two-Way External Merge Sort
- Idea: divide and conquer: sort subfiles and merge.
- Each pass, we read and write each page in the file.
- N pages in the file => the number of passes is $\lceil \log_2 N \rceil + 1$.
- So the total cost is $2N(\lceil \log_2 N \rceil + 1)$ page I/Os.

Example (N = 7 pages, up to 2 records per page):
  Input file:             3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
  PASS 0 (1-page runs):   3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
  PASS 1 (2-page runs):   2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
  PASS 2 (4-page runs):   2,3 4,4 6,7 8,9 | 1,2 3,5 6
  PASS 3 (8-page run):    1,2 2,3 3,4 4,5 6,6 7,8 9
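The passes in this example can be simulated in memory. This is only a sketch of the control flow: a real external sort streams pages through three buffers instead of holding whole runs in Python lists, and the function name is an assumption. The input must contain at least one page.

```python
from heapq import merge

def two_way_external_sort(pages):
    """Simulate two-way external merge sort on a non-empty list of pages.

    Returns (sorted_records, number_of_passes).
    """
    # Pass 0: sort each page on its own, producing 1-page runs.
    runs = [sorted(page) for page in pages]
    passes = 1
    # Later passes: merge pairs of runs into runs twice as long.
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                next_runs.append(list(merge(runs[i], runs[i + 1])))
            else:
                next_runs.append(runs[i])  # odd run out is carried forward
        runs = next_runs
        passes += 1
    return runs[0], passes

pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
records, passes = two_way_external_sort(pages)
# Every pass reads and writes each of the N pages once.
total_io = 2 * len(pages) * passes
```

For the 7-page file this takes 4 passes, matching ceil(log2 7) + 1, for a total of 56 page I/Os.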

Using B+ Trees for Sorting
- Scenario: the table to be sorted has a B+ tree index on the sorting column(s).
- Idea: we can retrieve records in order by traversing the leaf pages.
- Is this a good idea? Cases to consider:
  - B+ tree is clustered: good idea!
  - B+ tree is not clustered: could be a very bad idea!

Clustered B+ Tree Used for Sorting
- Cost: go from the root to the left-most leaf, then retrieve all leaf pages (primary index).
- If a clustering index is used? There is the additional cost of retrieving data records: each page is fetched just once.
- Always better than external sorting!
[Figure: index (directs search), data entries ("sequence set"), and data records.]

Unclustered B+ Tree Used for Sorting
- Unclustered index for data entries; each data entry contains the rid of a data record.
- In general, one I/O per data record!
[Figure: index (directs search), data entries ("sequence set"), and data records.]
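The difference between the two cases can be illustrated by counting data-page fetches while following rids in key order. The sketch assumes a single buffer page for data records (a deliberately pessimistic simplification), and the rid lists are made-up examples.

```python
def count_page_fetches(rids_in_key_order):
    """Count data-page fetches when records are visited in index order.

    Assumes only one data page is held in memory at a time, so a page is
    re-fetched whenever the next rid points to a different page.
    """
    fetches = 0
    current_page = None
    for page, _slot in rids_in_key_order:
        if page != current_page:
            fetches += 1
            current_page = page
    return fetches

# Hypothetical rids (page, slot) for six records on three data pages.
clustered = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
unclustered = [(0, 0), (2, 1), (1, 0), (0, 1), (2, 0), (1, 1)]

clustered_io = count_page_fetches(clustered)
unclustered_io = count_page_fetches(unclustered)
```

With the clustered order each page is fetched once; with the unclustered order every record costs its own fetch.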

Summary
- An index can be used to organize files (clustered file).
- External sorting is an important tool for many applications, including building indexes and answering queries (next time).
- External merge sort minimizes disk I/O cost:
  - Pass 0: produces sorted runs of size B (# buffer pages).
  - Later passes: merge runs.
  - The # of runs merged at a time depends on B and the block size.
  - Larger block size means less I/O cost per page.
  - Larger block size means a smaller # of runs merged.
  - In practice, the # of passes is rarely more than 2 or 3.
