CPSC 231 Sorting Large Files (D.H.)

LEARNING OBJECTIVES
Sorting of large files
– merge sort
– performance of merge sort
– multi-step merge sort


Sorting of Large Files
If a file is too large to be sorted in main memory, it has to be sorted on the disk.
Example: if a file consists of 8,000,000 records and each record is 100 bytes long, the file size is approximately 800 MB. If a computer has 8 MB of RAM available for sorting, only a small part of this file fits into main memory at one time.

Merge Sort
If we do not have enough available RAM to sort the entire file, we can sort parts of the file, save the sorted sub-files (runs) on the disk, and then use a K-way merge to sort the entire file.
A run is a sorted subset of the file that is used later to sort the entire file. Runs can be created using a heap sort.
What is the maximum size of a run in the example on the previous slide?
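The run-creation and K-way merge steps described above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the function names and the `max_records_per_run` parameter are hypothetical, records are assumed to be newline-terminated text lines, and Python's built-in `sort` stands in for the heap sort mentioned on the slide.

```python
import heapq
import os
import tempfile
from itertools import islice


def write_run(sorted_lines, directory):
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as run_file:
        run_file.writelines(sorted_lines)
    return path


def external_merge_sort(input_path, output_path, max_records_per_run):
    """Sort a file of newline-terminated records; return the number of runs."""
    with tempfile.TemporaryDirectory() as directory:
        run_paths = []
        # Phase 1: read the input sequentially, sort each memory-sized
        # chunk, and save it to disk as a run.
        with open(input_path) as source:
            while True:
                chunk = list(islice(source, max_records_per_run))
                if not chunk:
                    break
                chunk.sort()  # assumption: stand-in for the slide's heap sort
                run_paths.append(write_run(chunk, directory))

        # Phase 2: K-way merge. heapq.merge pulls lines lazily from each
        # run, so only a small part of every run is in memory at a time.
        run_files = [open(path) for path in run_paths]
        try:
            with open(output_path, "w") as output:
                output.writelines(heapq.merge(*run_files))
        finally:
            for run_file in run_files:
                run_file.close()
    return len(run_paths)
```

With the numbers from the example (100-byte records, 8 MB of memory, taking 1 MB as 1,000,000 bytes), a run can hold at most 80,000 records, so `max_records_per_run=80_000` would split the 8,000,000-record file into 100 runs.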

Pros of the Merge Sort
– Can sort very large files.
– Reading the input file is sequential.
– Reading each run and writing the output file are also sequential.
– If heap sort is used to sort the runs, we can overlap I/O and sorting.
– Since I/O is largely sequential, this method can be used for sorting files on tapes.
See fig p.320

Performance of Merge Sort
Merge sort requires I/O time for the following operations:
– reading all records into memory for sorting and forming runs
– writing sorted runs to disk
– reading sorted runs into main memory for merging
– writing the sorted file to disk

Merge Sort versus Key Sort
It takes approximately 6 minutes to merge sort the 800 MB file from our example on a Seagate Cheetah 9 hard disk (track-to-track seek time = 11 ms).
Sorting the same file with the Key Sort algorithm would take approximately 24 hours.

Sorting a File that is Even Larger
To sort a file that is ten times larger, we need more seeks on the disk: since the main memory is the same size, we have to create more runs and perform more seeks to merge those runs.
It takes approximately 2 hours and 6 minutes to merge sort an 8 GB file on the Seagate Cheetah 9 disk drive.

The cost of merging a bigger file
The number of seeks needed to merge a file that is 10 times larger than the original file is 100 times larger. Why?
In general, for a K-way merge of K runs where each run is as large as the memory space available, the buffer size for each of the runs is:
(1/K) × size of each run

The number of seeks needed to merge a big file
Each buffer holds 1/K of a run, so K seeks are needed to read all the records in each individual run. Since there are K runs altogether, the merge operation requires K^2 seeks.
Thus, if a file is N times bigger, N^2 times as many seeks are needed to merge it.
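The quadratic growth can be checked with a short calculation. The sketch below is only a rough model under stated assumptions: the helper names are hypothetical, 1 MB is taken as 1,000,000 bytes and 8 GB as 8,000 MB, every run is exactly as large as the 8 MB of memory, and only the seeks for reading runs during the merge are counted.

```python
def runs_needed(file_bytes, memory_bytes):
    """Number of memory-sized runs needed to cover the file (ceiling division)."""
    return -(-file_bytes // memory_bytes)


def single_step_merge_seeks(k):
    """Read seeks for a K-way merge of K runs: K buffer refills per run, K runs."""
    return k * k


MB = 1_000_000       # assumption: decimal megabytes
memory = 8 * MB      # RAM available for sorting, from the example

small = runs_needed(800 * MB, memory)    # the 800 MB file
large = runs_needed(8000 * MB, memory)   # the 8 GB file, ten times larger

print(small, single_step_merge_seeks(small))   # 100 10000
print(large, single_step_merge_seeks(large))   # 1000 1000000
print(single_step_merge_seeks(large) // single_step_merge_seeks(small))  # 100
```

Ten times as many runs means each run's buffer is one tenth the size, so each run needs ten times as many refills, and there are ten times as many runs to refill: a factor of 100 overall.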

How to improve performance of merge sort?
– Allocate more hardware: more main memory, multiple disk drives and I/O channels.
– Perform the merge in more than one step.
– Algorithmically increase the lengths of the initial sorted runs.
– Find ways to overlap I/O operations.

Multi-Step Merge
A multi-step merge is a merge in which not all runs are merged in one step. Rather, several sets of runs are merged separately, each set producing one long run consisting of the records from all its runs. These new, longer runs are then merged, either all together or in several sets.
See example of a two-step merge fig p.330

Pros and Cons of Multi-Step Merge
Con: each record must be read twice (once to form the intermediate runs and again to form the final sorted file).
Pros:
– We can create large runs by using bigger buffers, and thus reduce the number of disk accesses.
– In some cases, multi-step merge is the only reasonable way to perform a merge on tape if the number of tape drives is limited.
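The reduction in disk accesses can be estimated with the same seek-counting argument used for the single-step merge. The following is a sketch under simplifying assumptions, not a full cost model: the function names are hypothetical, every initial run is as large as memory, the runs divide evenly into groups, memory is split equally among the input buffers in each step, and only read seeks are counted (writing the intermediate runs is ignored).

```python
def single_step_seeks(k):
    """Read seeks for one K-way merge of K memory-sized runs."""
    return k * k


def two_step_seeks(k, groups):
    """Read seeks for a two-step merge of K runs split into equal groups."""
    if k % groups != 0:
        raise ValueError("this sketch assumes the runs divide evenly into groups")
    runs_per_group = k // groups
    # Step 1: each group is a (K/groups)-way merge of memory-sized runs,
    # which costs (K/groups)^2 seeks per group.
    step_one = groups * runs_per_group ** 2
    # Step 2: each intermediate run is K/groups memory loads long and gets
    # 1/groups of memory as its buffer, so it needs K seeks to read.
    step_two = groups * k
    return step_one + step_two


# Assumed example: 1000 initial runs (an 8 GB file with 8 MB of memory),
# merged as 25 groups of 40 runs and then one 25-way merge.
print(single_step_seeks(1000))    # 1000000
print(two_step_seeks(1000, 25))   # 65000
```

Under these assumptions the two-step merge reads every record twice but needs far fewer seeks, which is the trade-off named in the pros and cons above.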