External Sorting Sort n records/elements that reside on a disk.


External Sorting
Sort n records/elements that reside on a disk.
The space needed by the n records is very large, either because n is very large (each record may be large or small) or because n is small but each record is very large.
So it is not feasible to input the n records, sort them in memory, and output them in sorted order.

Small n But Large File
Input the record keys.
Sort the n keys to determine the sorted order for the n records.
Permute the records into the desired order (possibly several fields at a time).
Alternatively, input the front part (including the key) of each of the n records so as to fill memory, then permute the remainder.
Cost: inputting the records and compacting the keys together takes 1 input pass, after which the keys are sorted. Inputting the records again and writing each one to its proper place on disk, one record at a time, takes 1 more input pass and 1 output pass. This requires n random-access writes, but n is small and each record is large, so that is acceptable.
Even with large n and a large file, it may be necessary to first determine the sort permutation and then rearrange the records.
We focus on the case of large n and a large file.
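
The key-sort idea on this slide can be sketched as follows. This is only an illustration under stated assumptions: the `Record` layout and the names `sortPermutation` and `permuteRecords` are hypothetical, and in-memory vectors stand in for the disk file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical record layout: a small key plus a large payload.
struct Record {
    int key;
    std::string payload;  // stands in for the "very large" part of a record
};

// Returns order such that order[r] is the index of the record with rank r.
// Only the keys are compared; the payloads are never touched.
std::vector<std::size_t> sortPermutation(const std::vector<int>& keys) {
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    return order;
}

// Copies each record to its final position, one record at a time.
// On a real disk, each copy would be one random-access write.
std::vector<Record> permuteRecords(const std::vector<Record>& records,
                                   const std::vector<std::size_t>& order) {
    assert(records.size() == order.size());
    std::vector<Record> sorted(records.size());
    for (std::size_t rank = 0; rank < order.size(); ++rank) {
        sorted[rank] = records[order[rank]];
    }
    return sorted;
}
```

The sort touches only n small keys, so it fits in memory even when the records do not; the expensive record movement happens exactly once per record.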

New Data Structures/Concepts
Tournament trees.
Huffman trees.
Double-ended priority queues.
Buffering.
These ideas may also be used to speed up algorithms on small instances by using the cache more efficiently.

External Sort Computer Model
[Figure: ALU, main memory (MAIN), and DISK.]

Disk Characteristics
[Figure: disk with tracks and a read/write head.]
Seek time: approx. 100,000 arithmetics.
Latency time: approx. 25,000 arithmetics.
Transfer time.
Data access is by block.

Traditional Internal Memory Model
[Figure: ALU and main memory (MAIN).]
Let's digress for a moment and take a closer look at the computer model sans disk.
Classical algorithm analysis is done using this model. The number of operations is assumed to be closely related to the number of memory accesses, so we may simply count operations.
This is not a very accurate model for internal memory algorithms: it has severe limitations when trying to explain observed performance numbers.

Matrix Multiplication
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
The ijk, ikj, jik, jki, kij, and kji orders of the loops yield the same result.
All perform the same number of operations.
But run time may differ significantly!
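
To make the claim about loop orders concrete, here is a minimal sketch of two of the six orders, ijk and ikj. The `Matrix` alias and the function names are assumptions made for this example; only the loop bodies come from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // array-of-arrays, n x n

// ijk order: the innermost loop walks down a column of b.
Matrix multiplyIjk(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    assert(b.size() == n);
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = 0; j < n; j++)
            for (std::size_t k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// ikj order: the innermost loop walks along a row of b and a row of c.
Matrix multiplyIkj(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    assert(b.size() == n);
    Matrix c(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t k = 0; k < n; k++)
            for (std::size_t j = 0; j < n; j++)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}
```

Both functions add the same products into each c[i][j] in the same order of k, so they return identical matrices; only the order in which memory is touched differs.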

More Accurate Memory Model
[Figure: memory hierarchy from the ALU to main memory (MAIN).]
Level          Size     Access cost
Registers      8-32     1C
L1 cache       32KB     2C
L2 cache       256KB    10C
Main memory    1GB      100C
The cost of an L2 miss may be much larger on new processors: more than 100x, rather than 10x, the cost of an L1 miss.
Cache organization: cache lines.
In certain applications (e.g., packet routing), people count only cache misses and assume arithmetics are free.

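2D Array Representation In Java, C, and C++
int x[3][4];
[Figure: x[] refers to three row arrays holding a b c d, e f g h, and i j k l.]
Array of Arrays Representation
The 4 separate arrays are x, x[0], x[1], and x[2]. (In C and C++, a declared int x[3][4] is stored as one contiguous row-major block rather than as separately allocated rows; each row is still contiguous, which is what the cache analysis relies on.)
When you access x[0][0], a cache-line load is brought into L2 from main memory.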

ijk Order
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
[Figure: C = A * B.]
For analysis, assume a one-level cache with an LRU replacement policy.

ijk Analysis
[Figure: C = A * B.]
Block size = width of cache line = w. Assume a one-level cache.
C => $n^2/w$ cache misses.
A => $n^3/w$ cache misses, when n is large.
B => $n^3$ cache misses, when n is large.
Total cache misses = $n^2/w + n^3/w + n^3 = (n^3/w)(1/n + 1 + w)$.
This analysis is rather simplistic. An accurate analysis would need to know the cache organization and line replacement strategy in use (direct mapped, random, LRU, etc.).

ikj Order
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i][j] += a[i][k] * b[k][j];
[Figure: C = A * B.]

ikj Analysis
[Figure: C = A * B.]
C => $n^3/w$ cache misses, when n is large.
A => $n^2/w$ cache misses.
B => $n^3/w$ cache misses, when n is large.
Total cache misses = $n^3/w + n^2/w + n^3/w = (n^3/w)(2 + 1/n)$.

ijk Vs. ikj Comparison
ijk cache misses = $(n^3/w)(1/n + 1 + w)$.
ikj cache misses = $(n^3/w)(2 + 1/n)$.
ijk/ikj $\approx (1 + w)/2$, when n is large.
w = 4 (32-byte cache line, double precision data): ratio $\approx$ 2.5.
w = 8 (64-byte cache line, double precision data): ratio $\approx$ 4.5.
w = 16 (64-byte cache line, integer data): ratio $\approx$ 8.5.
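
The miss counts from the two analyses can be evaluated directly, which makes it easy to check the quoted ratios. The sketch below only evaluates the simplified one-level-cache model from the slides; it does not simulate a real cache, and the names `MissCounts`, `modelMisses`, and `missRatio` are hypothetical.

```cpp
#include <cassert>

// Modeled cache misses for the two loop orders (simplified one-level cache).
struct MissCounts {
    double ijk;
    double ikj;
};

// n = matrix dimension, w = number of matrix elements per cache line.
MissCounts modelMisses(double n, double w) {
    assert(n > 0 && w > 0);
    const double n2 = n * n;
    const double n3 = n2 * n;
    MissCounts m;
    m.ijk = n2 / w + n3 / w + n3;      // C + A + B for the ijk order
    m.ikj = n3 / w + n2 / w + n3 / w;  // C + A + B for the ikj order
    return m;
}

// Ratio of ijk misses to ikj misses; tends to (1 + w) / 2 as n grows.
double missRatio(double n, double w) {
    const MissCounts m = modelMisses(n, w);
    return m.ijk / m.ikj;
}
```

For example, `missRatio(1e6, 8)` comes out just under 4.5, in line with the w = 8 entry above.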

Prefetch
Prefetch can hide memory latency.
Successful prefetch requires the ability to predict a memory access well in advance.
Prefetch cannot reduce energy, because it does not reduce the number of memory accesses.

Faster Internal Sorting
External sorting ideas may be applied to internal sorting.
An internal tiled merge sort gives a 2x (or more) speedup over a traditional merge sort.

External Sort Methods
Base the external sort method on a fast internal sort method.
Judged by average run time: quick sort.
Judged by worst-case run time: merge sort.

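Internal Quick Sort
Example keys: 2 8 5 11 10 4 1 9 7 3, plus the pivot 6.
Use 6 as the pivot. Partitioning puts the keys smaller than the pivot in the left group, the pivot in the middle group, and the keys larger than the pivot in the right group.
Middle group size is 1. In the external sort adaptation, we make the middle group as large as possible.
Sort the left and right groups recursively.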
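
The three groups can be made explicit in code. The sketch below is deliberately out-of-place for clarity, rather than the in-place partition a production quick sort would use; the names `Groups`, `partitionAroundPivot`, and `quickSort`, and the choice of the first key as pivot, are assumptions for this example.

```cpp
#include <cassert>
#include <vector>

// The three groups produced by one partitioning step.
struct Groups {
    std::vector<int> left;    // keys smaller than the pivot
    std::vector<int> middle;  // keys equal to the pivot
    std::vector<int> right;   // keys larger than the pivot
};

// Splits keys around pivot, preserving the input order inside each group.
Groups partitionAroundPivot(const std::vector<int>& keys, int pivot) {
    Groups g;
    for (int key : keys) {
        if (key < pivot) g.left.push_back(key);
        else if (key > pivot) g.right.push_back(key);
        else g.middle.push_back(key);
    }
    return g;
}

// Quick sort built on the partition above; the first key serves as the pivot.
std::vector<int> quickSort(const std::vector<int>& keys) {
    if (keys.size() <= 1) return keys;
    const Groups g = partitionAroundPivot(keys, keys.front());
    assert(!g.middle.empty());  // the pivot itself always lands in the middle group
    std::vector<int> sorted = quickSort(g.left);
    sorted.insert(sorted.end(), g.middle.begin(), g.middle.end());
    const std::vector<int> right = quickSort(g.right);
    sorted.insert(sorted.end(), right.begin(), right.end());
    return sorted;
}
```

With the slide's keys and pivot 6, the left group is 2 5 4 1 3, the middle group is just 6, and the right group is 8 11 10 9 7.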

Quick Sort – External Adaptation
[Figure: DISK, with the input, small, and large buffers and the middle group in memory.]
Use 3 input/output buffers: input, small, and large. Each buffer is, say, 1 block long.
The rest of memory is used for the middle group.

Quick Sort – External Adaptation
[Figure: DISK, with the input, small, and large buffers and the middle group in memory.]
    fill middle group from disk
    if next record <= middlemin, send to small
    if next record >= middlemax, send to large
    else remove middlemin or middlemax from middle and add new record to middle group
A removed middlemin goes to small; a removed middlemax goes to large.
Fill the input buffer. Processing is delayed when the input buffer is empty or the small/large buffer is full.
Alternate between removing the min and max elements of the middle group so as to balance the sizes of the small and large groups.

Quick Sort – External Adaptation
[Figure: DISK, with the input, small, and large buffers and the middle group in memory.]
Fill the input buffer when it becomes empty.
Write the small/large buffer when it is full.
Write the middle group in sorted order when done.
Represent the middle group as a double-ended priority queue.
Use additional buffers to reduce I/O wait time.
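
One partitioning pass of the external adaptation might be sketched as follows. Several things here are assumptions rather than part of the slides: in-memory vectors stand in for the disk and the block buffers, a `std::multiset` stands in for the double-ended priority queue, the min/max alternation is a simple toggle, and the names `Partition` and `externalPartition` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Result of one pass: small and large still need sorting, middle is sorted.
struct Partition {
    std::vector<int> small;   // every key here is <= every key in middle
    std::vector<int> middle;  // written in sorted order at the end of the pass
    std::vector<int> large;   // every key here is >= every key in middle
};

// capacity = number of keys that fit in the memory reserved for the middle group.
Partition externalPartition(const std::vector<int>& input, std::size_t capacity) {
    assert(capacity > 0);
    Partition out;
    std::multiset<int> middle;  // ordered set used as a double-ended priority queue
    bool evictMin = true;       // toggled so small and large grow at a similar rate
    for (int key : input) {
        if (middle.size() < capacity) {
            middle.insert(key);  // still filling the middle group
        } else if (key <= *middle.begin()) {
            out.small.push_back(key);
        } else if (key >= *middle.rbegin()) {
            out.large.push_back(key);
        } else if (evictMin) {
            out.small.push_back(*middle.begin());  // removed min goes to small
            middle.erase(middle.begin());
            middle.insert(key);
            evictMin = false;
        } else {
            auto last = std::prev(middle.end());
            out.large.push_back(*last);  // removed max goes to large
            middle.erase(last);
            middle.insert(key);
            evictMin = true;
        }
    }
    out.middle.assign(middle.begin(), middle.end());  // already in sorted order
    return out;
}
```

After the pass, the small and large groups would be sorted recursively in the same way, and the final output is small, then middle, then large. A larger capacity means a larger middle group and therefore less data left for the recursive passes.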