CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.

Slides:



Advertisements
Similar presentations
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
Advertisements

File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,
B+-Trees and Hashing Techniques for Storage and Index Structures
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting “There it was, hidden in alphabetical order.” Rita Holt R&G Chapter 13.
External Sorting CS634 Lecture 10, Mar 5, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 11 External Sorting.
1 Overview of Storage and Indexing Chapter 8 (part 1)
External Sorting R & G Chapter 13 One of the advantages of being
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
File Organizations and Indexing R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears, Roebuck,
1 Lecture 20: Indexes Friday, February 25, Outline Representing data elements (12) Index structures (13.1, 13.2) B-trees (13.3)
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
1 Overview of Storage and Indexing Chapter 8 1. Basics about file management 2. Introduction to indexing 3. First glimpse at indices and workloads.
DBMS Internals: Storage February 27th, Representing Data Elements Relational database elements: A tuple is represented as a record CREATE TABLE.
Tree-Structured Indexes. Range Searches ``Find all students with gpa > 3.0’’ –If data is in sorted file, do binary search to find first such student,
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
Storage and Indexing February 26 th, 2003 Lecture 19.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8.
Indexing - revisited CS 186, Fall 2012 R & G Chapter 8.
Overview of File Organizations and Indexing Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
BTrees & Sorting 11/3. Announcements I hope you had a great Halloween. Regrade requests were due a few minutes ago…
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 “How index-learning turns no student pale Yet holds.
Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
1 Overview of Storage and Indexing Chapter 8 (part 1)
Storage and Indexing1 Overview of Storage and Indexing.
Database Management Systems, R. Ramakrishnan and J. Gehrke 1 External Sorting Chapter 13.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
ZA classic problem in computer science! zData requested in sorted order ye.g., find students in increasing cap order zSorting is used in many applications.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Storage and Indexing. How do we store efficiently large amounts of data? The appropriate storage depends on what kind of accesses we expect to have to.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapters 13: 13.1—13.5.
CPSC Why do we need Sorting? 2.Complexities of few sorting algorithms ? 3.2-Way Sort 1.2-way external merge sort 2.Cost associated with external.
Indexing. 421: Database Systems - Index Structures 2 Cost Model for Data Access q Data should be stored such that it can be accessed fast q Evaluation.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
Data on External Storage – File Organization and Indexing – Cluster Indexes - Primary and Secondary Indexes – Index data Structures – Hash Based Indexing.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
External Sorting Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 File Organizations and Indexing Chapter 8 Jianping Fan Dept of Computer Science UNC-Charlotte.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Overview of Storage and Indexing Chapter 8.
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
External Sorting Chapter 13
CS222P: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.
Database Management Systems (CS 564)
Lecture#12: External Sorting (R&G, Ch13)
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
External Sorting Chapter 13
Selected Topics: External Sorting, Join Algorithms, …
CS222P: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Indexing 1.
CS222/CS122C: Principles of Data Management Notes #6 Index Overview and ISAM Tree Index Instructor: Chen Li.
CS222: Principles of Data Management Lecture #10 External Sorting
General External Merge Sort
External Sorting.
CS222P: Principles of Data Management Lecture #10 External Sorting
Database Systems (資料庫系統)
EECS 647: Introduction to Database Systems
External Sorting Chapter 13
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #05 Index Overview and ISAM Tree Index Instructor: Chen Li.
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Presentation transcript:

CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007

3/17/2016Jinze University of Kentucky2 Today’s Topic Recap of B + -tree File and Index organization

Example B+ Tree - Inserting 8* In this example, we can avoid split by re-distributing entries; however, this is usually not done in practice. Root * 3*5* 7*14*16* 19*20* 22*24*27* 29*33*34* 38* 39* 13 2*3* Root *16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8* Notice that root was split, leading to increase in height. 3/17/20163Jinze University of Kentucky

Root * 3*5* 7*14*16* 19*20* 22*24*27* 29*33*34* 38* 39* 13 2*3* Root *16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8* Example Tree (including 8*) Delete 19* and 20*... 3/17/20164Jinze University of Kentucky

Example Tree (including 8*) Delete 19* and 20*... Deleting 19* is easy. Deleting 20* is done with re-distribution. Notice how middle key is copied up. 2*3* Root *16* 19*20*22*24*27* 29*33*34* 38* 39* 135 7*5*8* 2*3* Root *16* 33*34* 38* 39* 135 7*5*8* 22*24* 27 27*29* 3/17/20165Jinze University of Kentucky

... And Then Deleting 24* Must merge. Observe `toss’ of index entry (key 27 on right), and `pull down’ of index entry (below) *27* 29*33*34* 38* 39* 2* 3* 7* 14*16* 22* 27* 29* 33*34* 38* 39* 5*8* Root /17/20166Jinze University of Kentucky

3/17/2016Jinze University of Kentucky7 Index and Data File Organization Heap file vs. B + -tree Index file Header Page Data Page Dat a Pag e Data Page Data Page Data Page Data Page Pages with Free Space Full Pages

3/17/2016Jinze University of Kentucky8 Alternatives for Data Entry in Index Three alternatives:  Primary indexes  Secondary indexes  Clustering indexes Can have multiple (different) indexes per file. E.g. Employee = (EID, name, age, salary) Primary key is EID, name is unique, age and salary are non-key attributes We may (or not) sort the file by EID, with a B + -tree index on EID (primary), name (secondary) and age (secondary), and a hash index on salary.

Primary indexes Index for primary key: File is sorted according to the primary key The first record in each block is called the anchor record and is used to build the index Sometime the thing goes the other way around (called B + - tree based sorting) 123, Susan, 30, 50K 124, John, 40, 80K …. CLUSTERED 3/17/20169Jinze University of Kentucky

Secondary Index Alternative 2: Easier to maintain (do not need to sort the data file) k must be a candidate key of the relation, e.g. unique attribute Data pointer is the RID of the record indexed by k E.g. (Block#, slot#) Leaf node UNCLUSTERED 123, Susan, 30, 50K 124, John, 40, 80K …. Data file 3/17/201610Jinze University of Kentucky

Clustering Index Alternative 3: Also need to sort the data file k must be a non-key attribute of the relation Data pointer is the BID of the first RID that is indexed by k Block 2, slot 1 Block 1, slot 1 Block 1, slot 2 …. Leaf node CLUSTERED 3/17/201611Jinze University of Kentucky

Index Classification Clustered vs. unclustered: If order of data records is the same as, or `close to’, order of index data entries, then called clustered index. A file can be clustered on at most one search key. Cost of retrieving data records through index varies greatly based on whether index is clustered or not! 3/17/201612Jinze University of Kentucky

Clustered vs. Unclustered Index Suppose that clustering index is used for data entries, and that the data records are stored in a Heap file. To build clustering index, first sort the Heap file (with some free space on each block for future inserts). Overflow blocks may be needed for inserts. (Thus, order of data recds is `close to’, but not identical to, the sort order.) Index entries Data entries direct search for (Index File) (Data file) data entries Data entries CLUSTERED UNCLUSTERED 3/17/201613Jinze University of Kentucky

3/17/2016Jinze University of Kentucky14 Comparison IndexDense (D)/Sparse (S) Clustered (C) or not (N) Sorting required PrimarySCYes SecondaryDNNo ClusteringSCYes

Cost of Operations B: The number of data pages R: Number of records per page D: (Average) time to read or write disk page Heap FileSorted FileClustered File Scan all records BD 1.5 BD Equality Search 0.5 BD(log 2 B) * D(log F 1.5B) * D Range Search BDsearch + #match pg*D Insert 2Dsearch + BDsearch+ D Delete search + Dsearch + BDsearch + D 3/17/201615Jinze University of Kentucky

Why Sort? A classic problem in computer science! Data requested in sorted order e.g., find students in increasing gpa order Sorting is first step in bulk loading B+ tree index. Sorting useful for eliminating duplicate copies in a collection of records (Why?) Sorting is useful for summarizing related groups of tuples Sort-merge join algorithm involves sorting. Problem: sort 100Gb of data with 1Gb of RAM. why not virtual memory? 3/17/201616Jinze University of Kentucky

2-Way Sort: Requires 3 Buffers Pass 0: Read a page, sort it, write it. only one buffer page is used (as in previous slide) Pass 1, 2, 3, …, etc.: requires 3 buffer pages merge pairs of runs into runs twice as long three buffer pages used. Main memory buffers INPUT 1 INPUT 2 OUTPUT Disk 3/17/201617Jinze University of Kentucky

Two-Way External Merge Sort Each pass we read + write each page in file. N pages in the file => the number of passes So total cost is: Idea: Divide and conquer: sort subfiles and merge Input file 1-page runs 2-page runs 4-page runs 8-page runs PASS 0 PASS 1 PASS 2 PASS 3 9 3,4 6,2 9,48,75,63,1 2 3,45,62,64,97,8 1,32 2,3 4,6 4,7 8,9 1,3 5,62 2,3 4,4 6,7 8,9 1,2 3,5 6 1,2 2,3 3,4 4,5 6,6 7,8 3/17/201618Jinze University of Kentucky

Using B+ Trees for Sorting Scenario: Table to be sorted has B+ tree index on sorting column(s). Idea: Can retrieve records in order by traversing leaf pages. Is this a good idea? Cases to consider: B+ tree is clusteredGood idea! B+ tree is not clusteredCould be a very bad idea! 3/17/201619Jinze University of Kentucky

Clustered B+ Tree Used for Sorting Cost: root to the left-most leaf, then retrieve all leaf pages (primary index) If clustering index is used? Additional cost of retrieving data records: each page fetched just once. * Always better than external sorting! (Directs search) Data Records Index Data Entries ("Sequence set") 3/17/201620Jinze University of Kentucky

Unclustered B+ Tree Used for Sorting Unclustered index for data entries; each data entry contains rid of a data record. In general, one I/O per data record! (Directs search) Data Records Index Data Entries ("Sequence set") 3/17/201621Jinze University of Kentucky

Summary Index can be used to organize files (clustered file) External sorting is important tool for many applications including building index and answering query (next time) External merge sort minimizes disk I/O cost: Pass 0: Produces sorted runs of size B (# buffer pages). Later passes: merge runs. # of runs merged at a time depends on B, and block size. Larger block size means less I/O cost per page. Larger block size means smaller # runs merged. In practice, # of runs rarely more than 2 or 3. 3/17/201622Jinze University of Kentucky