CS 400/600 – Data Structures External Sorting.

Slides:



Advertisements
Similar presentations
COSC 2007 Data Structures II Chapter 14 External Methods.
Advertisements

Sorting Really Big Files Sorting Part 3. Using K Temporary Files Given  N records in file F  M records will fit into internal memory  Use K temp files,
Equality Join R X R.A=S.B S : : Relation R M PagesN Pages Relation S Pr records per page Ps records per page.
Algorithms Analysis Lecture 6 Quicksort. Quick Sort Divide and Conquer.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 11.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
External Sorting “There it was, hidden in alphabetical order.” Rita Holt R&G Chapter 13.
Sorting Chapter Sorting Consider list x 1, x 2, x 3, … x n We seek to arrange the elements of the list in order –Ascending or descending Some O(n.
Nyhoff, ADTs, Data Structures and Problem Solving with C++, Second Edition, © 2005 Pearson Education, Inc. All rights reserved Sorting.
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
Disk Access Model. Using Secondary Storage Effectively In most studies of algorithms, one assumes the “RAM model”: –Data is in main memory, –Access to.
FALL 2004CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
Cosequential Processing Chapter 8. Cosequential processing model Two or more input files sorted the same way on the same keys set current record to first.
External Sorting R & G Chapter 13 One of the advantages of being
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
External Sorting R & G Chapter 11 One of the advantages of being disorderly is that one is constantly making exciting discoveries. A. A. Milne.
CPSC 231 Sorting Large Files (D.H.)1 LEARNING OBJECTIVES Sorting of large files –merge sort –performance of merge sort –multi-step merge sort.
External Sorting Access to secondary storage is orders of magnitude slower than memory access. Minimize access to secondary storage (tape or disk).
Using Secondary Storage Effectively In most studies of algorithms, one assumes the "RAM model“: –The data is in main memory, –Access to any item of data.
1 External Sorting Chapter Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing.
CSC 2300 Data Structures & Algorithms March 20, 2007 Chapter 7. Sorting.
1 CSE 326: Data Structures: Sorting Lecture 17: Wednesday, Feb 19, 2003.
CS 202, Spring 2003 Fundamental Structures of Computer Science II Bilkent University1 Sorting - 3 CS 202 – Fundamental Structures of Computer Science II.
External Sorting Chapter 13.. Why Sort? A classic problem in computer science! Data requested in sorted order  e.g., find students in increasing gpa.
Chapter 8 File Processing and External Sorting. Primary vs. Secondary Storage Primary storage: Main memory (RAM) Secondary Storage: Peripheral devices.
External Sorting Problem: Sorting data sets too large to fit into main memory. –Assume data are stored on disk drive. To sort, portions of the data must.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
CHAPTER 09 Compiled by: Dr. Mohammad Omar Alhawarat Sorting & Searching.
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
Adapted from instructor resource slides Nyhoff, ADTs, Data Structures and Problem Solving with C++, Second Edition, © 2005 Pearson Education, Inc. All.
Sorting.
 … we have been assuming that the data collections we have been manipulating were entirely stored in memory.
Indexing.
External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary.
Sorting Chapter Sorting Consider list x 1, x 2, x 3, … x n We seek to arrange the elements of the list in order –Ascending or descending Some O(n.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13 (Sec ): Ramakrishnan & Gehrke and Chapter 11 (Sec ): G-M et al. (R2) OR.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
1 External Sorting. 2 Why Sort?  A classic problem in computer science!  Data requested in sorted order  e.g., find students in increasing gpa order.
Chapter 12 Query Processing (1) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Heapsort Idea: two phases: 1. Construction of the heap 2. Output of the heap For ordering number in an ascending sequence: use a Heap with reverse.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
Review 1 Selection Sort Selection Sort Algorithm Time Complexity Best case Average case Worst case Examples.
Lecture 6 : External Sorting Bong-Soo Sohn Assistant Professor School of Computer Science and Engineering Chung-Ang University.
Chapter 15 A External Methods. © 2004 Pearson Addison-Wesley. All rights reserved 15 A-2 A Look At External Storage External storage –Exists beyond the.
Internal and External Sorting External Searching
FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.
Pass the Buck Every good programmer is lazy, arrogant, and impatient. In the game “Pass the Buck” you try to do as little work as possible, by making your.
Selection Sort Given an array[0-N], place the smallest item in the array in position 0, the second smallest in position 1, and so forth. We do thisby comparing.
Chapter 4, Part II Sorting Algorithms. 2 Heap Details A heap is a tree structure where for each subtree the value stored at the root is larger than all.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
Chapter 9: Sorting1 Sorting & Searching Ch. # 9. Chapter 9: Sorting2 Chapter Outline  What is sorting and complexity of sorting  Different types of.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
© 2006 Pearson Addison-Wesley. All rights reserved15 A-1 Chapter 15 External Methods.
CENG 3511 External Sorting. CENG 3512 Outline Introduction Heapsort Multi-way Merging Multi-step merging Replacement Selection in heap-sort.
External Sorting Sort n records/elements that reside on a disk.
External Sort Any sort algorithm which uses external memory, such as tape or disk, during the sort. The best algorithms for processing large amounts of.
External Sorting Chapter 13
Database Management Systems (CS 564)
External Sorting Chapter 13
Improve Run Generation
Lecture 2- Query Processing (continued)
Chapter 12 Query Processing (1)
Sorting We may build an index on the relation, and then use the index to read the relation in sorted order. May lead to one disk block access for each.
These notes were largely prepared by the text’s author
CENG 351 Data Management and File Structures
Database Systems (資料庫系統)
External Sorting Chapter 13
Presentation transcript:

CS 400/600 – Data Structures External Sorting

Motivation Sometimes the data to sort are too large to fit in memory. Recall from our discussions on disk access speed that the primary rule is: Minimize the number of disk accesses! External Sorting

Double Buffering It is often the case that a system will maintain at least two input and two output buffers: While one input buffer is being read, the previous one is being processed While one output buffer is being filled, the other is being written External Sorting

Model of External Computation Secondary memory is divided into equal-sized blocks (512, 1024, etc…) A basic I/O operation transfers the contents of one disk block to/from main memory. Under certain circumstances, reading blocks of a file in sequential order is more efficient. (When?) Primary goal is to minimize I/O operations. Assume only one disk drive is available. For sequential search to be more efficient than random access to the file adjacent logical blocks of the file must be physically adjacent. There must be no competition for the I/O head, either with another process, or within the same process (such as thrashing between and input buffer and an output buffer). External Sorting

Key Sorting Often, records are large, keys are small. Ex: Payroll entries keyed on ID number Approach 1: Read in entire records, sort them, then write them out again. Approach 2: Read only the key values, store with each key the location on disk of its associated record. After keys are sorted the records can be read and rewritten in sorted order. Usually the records are not rewritten in sorted order. (1) It is expensive (random access to all records). (2) If there are multiple keys, there is no single “correct” order. External Sorting

Simple External Mergesort (1) Quicksort requires random access to the entire set of records. Better: Modified Mergesort algorithm. Process n elements in Q(log n) passes. A group of sorted records is called a run. Usually the records are not rewritten in sorted order. (1) It is expensive (random access to all records). (2) If there are multiple keys, there is no single “correct” order. External Sorting

Simple External Mergesort (2) Split the file into two files. Read in a block from each file. Take first record from each block, output them in sorted order. Take next record from each block, output them to a second file in sorted order. Repeat until finished, alternating between output files. Read new input blocks as needed. Repeat steps 2-5, except this time input files have runs of two sorted records that are merged together. Each pass through the files provides larger runs. Usually the records are not rewritten in sorted order. (1) It is expensive (random access to all records). (2) If there are multiple keys, there is no single “correct” order. External Sorting

Simple External Mergesort (3) Usually the records are not rewritten in sorted order. (1) It is expensive (random access to all records). (2) If there are multiple keys, there is no single “correct” order. External Sorting

Problems with Simple Mergesort Is each pass through input and output files sequential? What happens if all work is done on a single disk drive? How can we reduce the number of Mergesort passes? In general, external sorting consists of two phases: Break the files into initial runs Merge the runs together into a single run. The passes are sequential. But, on a single disk drive, competition for the I/O head eliminates this advantage. But, we can get a massive improvement by reading in a block, or several blocks, and using an in-memory sort to create the initial runs. External Sorting

Breaking a File into Runs General approach: Read as much of the file into memory as possible. Perform an in-memory sort. Output this group of records as a single run. External Sorting

Replacement Selection (1) Break available memory into an array for the heap, an input buffer, and an output buffer. Fill the array from disk. Make a min-heap. Send the smallest value (root) to the output buffer. External Sorting

Replacement Selection (2) If the next key in the file is greater than the last value output, then Replace the root with this key else Replace the root with the last key in the array Add the next record in the file to a new heap (actually, stick it at the end of the array). External Sorting

Replacement Selection Example External Sorting

Initial runs It can be shown that the average run length for this algorithm will be 2M (where M is the size of memory) Snowplow argument Just reading in and sorting an array would only get a run of size M Now we just need to merge the initial runs External Sorting

Multiway Merge (1) With replacement selection, each initial run is several blocks long. Assume each run is placed in separate file. Read the first block from each file into memory and perform an r-way merge. When a buffer becomes empty, read a block from the appropriate run file. Each record is read only once from disk during the merge process. External Sorting

Multiway Merge (2) In practice, use only one file and seek to appropriate block. External Sorting