1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.

Slides:



Advertisements
Similar presentations
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
Advertisements

Database Management Systems, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 12, Part A.
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Database Management Systems 3ed, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 14, Part B.
Database Management Systems, R. Ramakrishnan and Johannes Gehrke1 Evaluation of Relational Operations: Other Techniques Chapter 12, Part B.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting “There it was, hidden in alphabetical order.” Rita Holt R&G Chapter 13.
External Sorting CS634 Lecture 10, Mar 5, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 11 External Sorting.
External Sorting R & G Chapter 13 One of the advantages of being
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
CS 4432lecture #31 CS4432: Database Systems II Lecture #3 Professor Elke A. Rundensteiner.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
6/27/20151 PSU’s CS External Sorting  Motivation  2-way External Sort: Memory, passes,cost  General External Sort: Memory, passes, cost  Optimizations.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
1 Evaluation of Relational Operations Yanlei Diao UMass Amherst March 01, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
Query Processing 2: Sorting & Joins
BTrees & Sorting 11/3. Announcements I hope you had a great Halloween. Regrade requests were due a few minutes ago…
Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke 1 External Sorting Chapter 13.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec ): Ramakrishnan & Gehrke and Chapter 11 (Sec ): G-M et al. (R2) OR.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
1 Database Systems ( 資料庫系統 ) December 7, 2011 Lecture #11.
ZA classic problem in computer science! zData requested in sorted order ye.g., find students in increasing cap order zSorting is used in many applications.
Relational Operator Evaluation. Overview Application Programmer (e.g., business analyst, Data architect) Sophisticated Application Programmer (e.g.,
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapters 13: 13.1—13.5.
CPSC Why do we need Sorting? 2.Complexities of few sorting algorithms ? 3.2-Way Sort 1.2-way external merge sort 2.Cost associated with external.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)
External Sorting Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.
Database Systems (資料庫系統)
External Sorting Chapter 13
External Sorting Chapter 13 (Sec ): Ramakrishnan & Gehrke and
Evaluation of Relational Operations
Relational Operations
CS222P: Principles of Data Management Notes #11 Selection, Projection
Database Management Systems (CS 564)
Lecture#12: External Sorting (R&G, Ch13)
Database Applications (15-415) DBMS Internals- Part VI Lecture 15, Oct 23, 2016 Mohammad Hammoud.
External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.
External Sorting Chapter 13
Selected Topics: External Sorting, Join Algorithms, …
CS222P: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
CS222: Principles of Data Management Lecture #10 External Sorting
CS222: Principles of Data Management Notes #11 Selection, Projection
Evaluation of Relational Operations: Other Techniques
External Sorting.
CS222P: Principles of Data Management Lecture #10 External Sorting
Database Systems (資料庫系統)
External Sorting Chapter 13
CS222/CS122C: Principles of Data Management UCI, Fall 2018 Notes #09 External Sorting Instructor: Chen Li.
Presentation transcript:

1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke

2 Query Parser Query Rewriter Query Optimizer Query Executor Query Processor Transactional Storage Manager Disk Space Manager DB Access Methods Buffer Manager Lock Manager Log Manager Disk Manager Overview of Query Processing

3 Relational Operations  We will consider how to implement:  Selection ( ) Selects a subset of rows from relation.  Projection ( ) Deletes unwanted columns from relation.  Join ( ) Allows us to combine two relations.  Set-difference ( ) Tuples in reln. 1, but not in reln. 2.  Union ( ) Tuples in reln. 1 and in reln. 2.  Aggregation ( SUM, MIN, etc.) and GROUP BY  Order By Returns tuples in specified order.  Since each op returns a relation, ops can be composed ! After we cover the operations, we will discuss how to optimize queries formed by composing them.

4 Topics for Query Processing  Sorting  Implementations of relational operators  Joins  Other operators  Query optimization  System R optimizer  Advanced optimization techniques

5 Why Sort?  A classic problem in computer science!  Important utility in DBMS:  Data requested in sorted order (e.g., ORDER BY ) e.g., find students in increasing gpa order  Sorting useful for eliminating duplicate copies in a collection of records (e.g., SELECT DISTINCT )  Sort-merge join algorithm involves sorting.  Sorting is first step in bulk loading B+ tree index.  Problem: sort 1GB of data with 1MB of RAM. Key is to minimize # I/Os!

6 2-Way Sort: Requires 3 Buffers  Pass 1: Read a page, sort it, write it.  only one buffer page is used  Pass 2, 3, …, etc.:  three buffer pages used. Main memory buffers INPUT 1 INPUT 2 OUTPUT Disk

7 Two-Way External Merge Sort A file of N pages: Pass 0: N sorted runs of 1 page each Pass 1: N/2 sorted runs of 2 pages each Pass 2: N/4 sorted runs of 4 pages each … Pass P: 1 sorted run of 2 P pages 2 P  N  P  log 2 N  Divide and conquer, sort subfiles (runs) and merge

8 Two-Way External Merge Sort  Each pass, we read + write N pages in file  2N.  Number of passes is:  So total cost is:  Divide and conquer, sort subfiles (runs) and merge

9 General External Merge Sort  To sort a file with N pages using B buffer pages:  Pass 0: use B buffer pages. Produce  N/B  sorted runs of B pages each.  Pass 2, 3…, etc.: merge B-1 runs. B Main memory buffers INPUT 1 INPUT B-1 OUTPUT Disk INPUT 2... More than 3 buffer pages. How can we utilize them?

10 Cost of External Merge Sort  Number of passes = 1 +  log B-1  N/B   Cost = 2N * (# of passes)  E.g., with 5 (B) buffer pages, sort 108 (N) page file: Pass 0  108/5  = 22 sorted runs of 5 pages each (last run is only 3 pages)  N/B  sorted runs of B pages each Pass 1  22/4  = 6 sorted runs of 20 pages each (last run is only 8 pages)  N/B  /(B-1) sorted runs of B(B-1) pages each Pass 2 2 sorted runs, 80 pages and 28 pages  N/B  /(B-1) 2 sorted runs of B(B-1) 2 pages Pass 3 Sorted file of 108 pages  N/B  /(B-1) 3 sorted runs of B(B-1) 3 (  N) pages

11 Number of Passes of External Sort

12 Output (1 buffer) Input (1 buffer) Current Set (B-2 buffers) Replacement Sort  Produces initial sorted runs as long as possible.  Replacement Sort : when used in Pass 0 for sorting, can write out sorted runs of size 2B on average.  Affects calculation of the number of passes accordingly.

13 Output (1 buffer) Input (1 buffer) Current Set (B-2 buffers) Replacement Sort  Organize B available buffers:  1 buffer for input  B-2 buffers for current set  1 buffer for output  Pick tuple r in the current set with the smallest value that is  largest value in output, e.g. 8, to extend the current run.  Fill the space in current set by adding tuples from input.  Write output buffer out if full, extending the current run.  Current run terminates if every tuple in the current set is smaller than the largest tuple in output.

14 I/O Cost versus Number of I/Os  Cost metric has so far been the number of I/Os.  Issue 1: effect of sequential (blocked) I/O?  Refine external sorting using blocked I/O  Issue 2: parallelism between CPU and I/O?  Refine external sorting using double buffering

15 Blocked I/O for External Merge Sort Disk behavior of external sorting: sequential or random I/O for input, output?  To reduce I/O cost, make each input buffer a block of pages.  But this will reduce fan-out during merge passes! E.g. from B- 1 inputs to (B-1)/2 inputs.  In practice, most files still sorted in 2-3 passes. B Main memory buffers INPUT 1 INPUT B-1 OUTPUT Disk INPUT 2...

16 Double Buffering  To reduce wait time for I/O request to complete, can prefetch into `shadow block’.  Potentially, more passes.  In practice, most files still sorted in 2-3 passes. OUTPUT' INPUT 1' INPUT 2' INPUT k' Disk OUTPUT INPUT 1 INPUT k INPUT 2 block size b B main memory buffers, k-way merge What happens when an input block has been consumed?

17 Sorting Records!  Sorting has become a big game!  Parallel sorting is the name of the game...  Datamation sort benchmark: Sort 1M records of size 100 bytes  Typical DBMS: 15 minutes  World record: 1.18 seconds (1998 record) 16 off-the-shelf PC, each with 2 Pentium processor, two hard disks, running NT4.0.  New benchmarks proposed:  Minute Sort: How many can you sort in 1 minute?  Dollar Sort: How many can you sort for $1.00?

18 Using B+ Trees for Sorting  Scenario: Table to be sorted has B+ tree index on sorting column(s).  Idea: Can retrieve records in order by traversing leaf pages.  Is this a good idea? Cases to consider:  B+ tree is clustered  B+ tree is not clustered Good idea! Could be a very bad idea!

19 Clustered B+ Tree Used for Sorting  Cost: root to the left- most leaf, then retrieve all leaf pages (Alternative 1)  Almost always better than external sorting! (Directs search) Data Records Index Data Entries ("Sequential")   If Alternative 2 is used? Additional cost of retrieving data records: each page fetched just once.

20 Unclustered B+ Tree Used for Sorting  Alternative (2) for data entries; each data entry contains rid of a data record. In general, one I/O per data record! Worse case I/O: RN R : # records per page N : # pages in file 

21 External Sorting vs. Unclustered Index * R : # of records per page R= 100 is the more realistic value. * Worse case numbers (RN) here ! For sorting  B =1,000 * Block size =32

22 Summary  External sorting is important; DBMS may dedicate part of buffer pool for sorting!  External merge sort minimizes disk I/O cost:  Pass 0: Produces sorted runs of size B (# buffer pages). Later passes: merge runs.  # of runs merged at a time depends on B, and block size.  Larger block size means less I/O cost per page.  Larger block size means smaller # runs merged.  In practice, # of runs rarely more than 2 or 3.  Clustered B+ tree is good for sorting; unclustered tree is usually very bad.