8. External Sorting Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer. What shall we do?


When the file does not fit in internal memory, we need to use an external storage device. External sorting takes two forms:
- Disk sort
- Tape sort
What is the major difference between the two external sorts?

Sorting with Disk: an internal sort produces sorted runs, which are then combined by k-way merging ("mergesort").

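Example: 4500 records, 250 records/block, available memory = 3 blocks.
Def'n: A segment of a file is said to be a run if all the records in the segment are sorted.
[Figure: the input file I is sorted internally into runs; runs 1, 3, 5, ... are written to disk D1 and runs 2, 4, 6, ... to disk D2.]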

[Figure: runs on D1 and D2 are merged onto D3 and D4; the labels give the size of a run at each stage.]

Run size. How many passes? $1 + \lceil \log_2 r \rceil$, where $r$ = # of initial runs.
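As a quick check of the pass count, the sketch below counts runs and passes for the 4500-record example. It assumes, which the slides do not state explicitly, that each initial run fills the available memory (3 blocks of 250 records, so 750 records); `initial_runs` and `passes` are illustrative names, not part of the original material.

```python
import math


def initial_runs(n_records, run_size):
    """Number of sorted runs produced by the internal sort."""
    return math.ceil(n_records / run_size)


def passes(r, k=2):
    """One run-generation pass plus the merge passes needed to reduce r runs to 1.

    The loop merges k runs at a time, which equals 1 + ceil(log_k r)
    without relying on floating-point logarithms.
    """
    merge_passes = 0
    runs = r
    while runs > 1:
        runs = math.ceil(runs / k)
        merge_passes += 1
    return 1 + merge_passes


r = initial_runs(4500, 3 * 250)   # assumed run size: all of memory
print(r, passes(r, k=2), passes(r, k=3))
```

Under that assumption there are 6 initial runs, so 2-way merging needs 3 merge passes after the run-generation pass.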

k-way merging
[Figure: the $r$ initial runs are merged $k$ at a time, level by level, over about $\log_k r$ levels.]
# of passes: $1 + \lceil \log_k r \rceil$
# of I/O operations? $O(n \log_k r)$, which is better than 2-way merging!
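One merge pass of a k-way merge can be sketched as follows. This is a minimal in-memory illustration, assuming runs are Python lists and using the standard `heapq` module as the selection structure; a real disk sort would read and write blocks instead. The names `k_way_merge` and `merge_pass` are hypothetical.

```python
import heapq


def k_way_merge(runs):
    """Merge already-sorted lists into one sorted list with a min-heap of size k."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, pos = heapq.heappop(heap)      # smallest remaining key
        out.append(key)
        if pos + 1 < len(runs[i]):             # refill from the same run
            heapq.heappush(heap, (runs[i][pos + 1], i, pos + 1))
    return out


def merge_pass(runs, k):
    """One pass over the data: merge the runs k at a time."""
    return [k_way_merge(runs[j:j + k]) for j in range(0, len(runs), k)]


runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9], [0, 10]]
print(merge_pass(runs, 2))
print(k_way_merge(runs))
```

Each record passes through the heap once per pass, which is where the roughly $\log_2 k$ comparisons per record come from.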

How about # of comparisons? Is k-way merging always better than 2-way merging?

Replacement Selection
# of passes: $\#(P) = 1 + \lceil \log_k r \rceil$
$\#(P)$ decreases when $k$ increases or $r$ decreases, and $r$ decreases when the run size increases.

# of comparisons(k-way merge)

How many comparisons in a pass? $n \log_2 k$. Why?
Total # of comparisons = (# of merge passes) × (# of comparisons in a pass) = $(\log_k r)(n \log_2 k) = n \log_2 r$, independent of $k$!
So the number of comparisons $\#(c)$ decreases only when $r$ decreases.
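The cancellation of $k$ follows from the change-of-base rule; a worked version of the step, ignoring rounding of the pass count, is:

```latex
\begin{align*}
(\log_k r)\,(n \log_2 k)
  &= \frac{\log_2 r}{\log_2 k} \cdot n \log_2 k \\
  &= n \log_2 r .
\end{align*}
```

A larger $k$ therefore reduces I/O but not the comparison count.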

How to increase the (initial) run size
$x_1, x_2, x_3, \ldots, x_m \mid x_{m+1}, x_{m+2}, x_{m+3}, \ldots, x_{2m} \mid x_{2m+1}, x_{2m+2}, x_{2m+3}, \ldots$ ($m$ keys in each group)
$r$ = # of runs = $\lceil n/m \rceil$
Any improvement? Observation: see p.94 in textbook.

4,2,32,12,18,24,91,11 (record size >> the size of pointer) why do we need this?

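A tree of losers
[Figure: loser tree over the leaf keys 4, 2, 32, 12, 18, 24, 91, 11; each internal node has a parent pointer and a loser pointer, and the overall winner is kept separately.]
Updating pointers:
    ptr := winner↑.parent;
    while ptr ≠ nil do
        if (ptr↑.loser↑.key < winner↑.key) then
            interchange(ptr↑.loser, winner);
        end {if}
        ptr := ptr↑.parent;
    end {while}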
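The pointer-update loop can be sketched in Python as below. This is an illustration under assumptions, not the textbook's code: the tree is stored in arrays (leaf `i` sits at node `k + i`, the parent of node `p` is `p // 2`) instead of pointer records, exhausted runs are represented by an infinite sentinel key, and the names `LoserTree` and `merge_with_loser_tree` are hypothetical.

```python
INF = float("inf")  # sentinel key for an exhausted run (assumption)


class LoserTree:
    """Tree of losers over k >= 1 leaves, stored in arrays."""

    def __init__(self, keys):
        self.k = k = len(keys)
        self.keys = list(keys)        # current key of each leaf
        self.loser = [None] * k       # loser[1..k-1]: leaf index losing at that node
        w = [None] * (2 * k)          # winners, used only while building
        for i in range(k):
            w[k + i] = i
        for node in range(k - 1, 0, -1):
            a, b = w[2 * node], w[2 * node + 1]
            if self.keys[a] <= self.keys[b]:
                w[node], self.loser[node] = a, b
            else:
                w[node], self.loser[node] = b, a
        self.winner = w[1]

    def replace(self, new_key):
        """Replace the winner's key, then replay matches up to the root."""
        winner = self.winner
        self.keys[winner] = new_key
        node = (self.k + winner) // 2          # parent of the winner's leaf
        while node >= 1:
            if self.keys[self.loser[node]] < self.keys[winner]:
                # interchange(ptr.loser, winner)
                self.loser[node], winner = winner, self.loser[node]
            node //= 2
        self.winner = winner


def merge_with_loser_tree(runs):
    """k-way merge of a non-empty list of sorted runs."""
    its = [iter(r) for r in runs]
    tree = LoserTree([next(it, INF) for it in its])
    out = []
    while tree.keys[tree.winner] != INF:
        out.append(tree.keys[tree.winner])
        tree.replace(next(its[tree.winner], INF))
    return out


tree = LoserTree([4, 2, 32, 12, 18, 24, 91, 11])
print(tree.keys[tree.winner])   # smallest of the eight keys
```

Only the path from one leaf to the root is revisited per output record, so each step costs about $\log_2 k$ comparisons.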

Explain p , textbook !!! Exercise : In a complete 2-tree(T) with n leaf nodes, show that total # of nodes in T = 2n -1

Performance Analysis (Average size of runs)
$m_0$ = # of records in (real) memory.
- H. Seward (M.S. Thesis, MIT, 1954) gave a good reason to believe that a run contains more than $1.5 m_0$ records (no proof).
- E. Friend (JACM 3, 1956) found by experiment that a run contains about $2 m_0$ records.
- E. Moore (1961) proved that $2 m_0$ is the expected run length.

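Sketch of Moore's Proof
Snowplow argument: snow (the incoming keys) falls at a uniform rate while the plow (the current run) clears it. The memory holds $m_0$ at any time, yet each circuit of the plow removes $2 m_0$. Hence, for uniformly distributed keys, the expected run length is $2 m_0$.
[Figure: snowplow moving through falling snow.]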
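The $2 m_0$ figure can be checked empirically with a small simulation of replacement selection. This is a sketch under assumptions: it uses a binary heap from `heapq` in place of the tree of losers, tags each key with its run number, and draws keys uniformly at random; `replacement_selection` and `average_run_length` are illustrative names.

```python
import heapq
import random
from itertools import islice

_END = object()  # marks the end of the input


def replacement_selection(records, m):
    """Split records into sorted runs using memory for m >= 1 records."""
    it = iter(records)
    heap = [(0, key) for key in islice(it, m)]   # (run number, key)
    heapq.heapify(heap)
    runs = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no == len(runs):                  # first record of a new run
            runs.append([])
        runs[run_no].append(key)
        nxt = next(it, _END)
        if nxt is not _END:
            # A key smaller than the one just output must wait for the next run.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return runs


def average_run_length(n, m, seed=0):
    """Average run length for n uniformly random keys and memory size m."""
    rng = random.Random(seed)
    runs = replacement_selection((rng.random() for _ in range(n)), m)
    return n / len(runs)


print(replacement_selection([4, 2, 32, 12, 18, 24, 91, 11], 3))
print(average_run_length(20000, 100))   # expected to be close to 2 * 100
```

On the eight keys from the loser-tree slide, memory for 3 records already yields a first run of 7 records.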

Tape Sorting
- Balanced k-way merging (similar to disk sorting)
- Polyphase merging
- Cascade merging

Polyphase Merging (Motivation)
- Records $(R_1, R_2, \ldots, R_{5000})$
- length$(R_i)$ = 20 bytes
- Only 1000 records fit in the internal memory at one time (= 20k bytes)
- 4 tapes available

Balanced 2-way merge (R i,j denotes the sorted run of records i through j; ∅ denotes an empty tape):

Step          T1                                T2                       T3                    T4
Initial runs  R1,1000; R2001,3000; R4001,5000   R1001,2000; R3001,4000   ∅                     ∅
Pass 1        ∅                                 ∅                        R1,2000; R4001,5000   R2001,4000
Pass 2        R1,4000                           R4001,5000               ∅                     ∅
Pass 3        ∅                                 ∅                        R1,5000               ∅

Total # of operations = 15000

Step          Tape 1                 Tape 2                   Tape 3       Tape 4
Initial runs  R1,1000; R3001,4000    R1001,2000; R4001,5000   R2001,3000   ∅
Step 1        R3001,4000             R4001,5000               ∅            R1,3000 (rewind)
Step 2        ∅                      ∅                        R1,5000      ∅

Total # of I/O operations = 8000
Balanced merge is not always best!

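What if only 3 tapes are available? Balanced 2-way merging on Tape 1, Tape 2, Tape 3 has to copy runs back between merge passes:
Initial runs: R1,1000, R1001,2000, R2001,3000, R3001,4000, R4001,5000 on two tapes; the third tape is empty (∅).
Merge: R1,2000, R2001,4000, R4001,5000.
Redistribute the merged runs over two tapes.
Merge: R1,4000, R4001,5000.
Redistribute: R4001,5000 on one tape, R1,4000 on another.
Merge: R1,5000.
Total # of I/O operations = 21,000!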

Step          Tape 1                            Tape 2                         Tape 3
Initial runs  R1,1000; R2001,3000; R4001,5000   R1001,2000; R3001,4000         ∅
Step 1        R4001,5000                        ∅                              R1,2000; R2001,4000 (rewind)
Step 2        ∅                                 R1,2000; 4001,5000 (rewind)    R2001,4000
Step 3        R1,5000                           ∅                              ∅

Total # of I/O operations = 11,000!

Polyphase merge
[Table: number of runs on each tape after each phase; the entries are not recoverable.]
How to assign initial runs?

Cascade Merge
[Table: number of runs on each tape after each pass; the entries are not recoverable.]

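Polyphase Merge (tapes T1 to T6), Gilstad (1960)
[Table: number of runs on each tape by phase; the entries are not recoverable.]
Initial runs on the five input tapes, by level:
Level 0: {1, 0, 0, 0, 0}
Level 1: {1, 1, 1, 1, 1}
Level 2: {2, 2, 2, 2, 1}
Level 3: {4, 4, 4, 3, 2}
Level 4: {8, 8, 7, 6, 4}
Level 5: {16, 15, 14, 12, 8}
Level 6: {31, 30, 28, 24, 16}
Perfect Fibonacci Distribution! What is the underlying rule?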

[Table header: $i$, $a_i$, $b_i$, $c_i$, $d_i$, $e_i$; the entries are not recoverable.]

Level   T1          T2          T3          T4          T5
1       a_0 + b_0   a_0 + c_0   a_0 + d_0   a_0 + e_0   a_0
2       a_1 + b_1   a_1 + c_1   a_1 + d_1   a_1 + e_1   a_1
3       a_2 + b_2   a_2 + c_2   a_2 + d_2   a_2 + e_2   a_2
n       a_n         b_n         c_n         d_n         e_n
n+1     a_n + b_n   a_n + c_n   a_n + d_n   a_n + e_n   a_n

where $a_n \ge b_n \ge c_n \ge d_n \ge e_n$.
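The level-to-level rule in this table can be turned into a short generator. The sketch below assumes the level-0 distribution is one run on the first tape, as in the list of distributions given earlier; `next_level` and `perfect_distributions` are illustrative names.

```python
def next_level(dist):
    """Apply the rule (a, b, c, ...) -> (a + b, a + c, ..., a)."""
    a = dist[0]
    return [a + x for x in dist[1:]] + [a]


def perfect_distributions(k, levels):
    """Perfect polyphase distributions for k input tapes, levels 0..levels."""
    dist = [1] + [0] * (k - 1)
    out = [dist]
    for _ in range(levels):
        dist = next_level(dist)
        out.append(dist)
    return out


for level, dist in enumerate(perfect_distributions(5, 6)):
    print(level, dist, sum(dist))
```

For 5 input tapes the totals per level come out as 1, 5, 9, 17, 33, 65, 129.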

[Table header: $i$, $a_i$, $b_i$, $c_i$, $d_i$, $e_i$, output tape, for tapes T1 to T5; the entries are not recoverable.]

Level $n-1$: $a_{n-1},\ b_{n-1},\ c_{n-1},\ d_{n-1},\ e_{n-1}$
Level $n$: $a_{n-1}+b_{n-1},\ a_{n-1}+c_{n-1},\ a_{n-1}+d_{n-1},\ a_{n-1}+e_{n-1},\ a_{n-1}$, which are $a_n, b_n, c_n, d_n, e_n$ respectively.
Therefore:
$e_n = a_{n-1}$
$d_n = a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3}$
...
In summary:
$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$
($a_0 = 1$; $a_i = 0$ for $i = -1, -2, -3, -4$)

$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$

[Table: $i$, $a_i$, $b_i$, $c_i$, $d_i$, $e_i$, with initial entries 0; the remaining entries are not recoverable.]

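$a_i$ = 0, 0, 0, 0, 1, 1, 2, 4, 8, 16, 31, ... for $i = -4, -3, -2, -1, 0, 1, 2, \ldots$
"The $k$-th order Fibonacci number":
$F^{(k)}_n = F^{(k)}_{n-1} + F^{(k)}_{n-2} + \cdots + F^{(k)}_{n-k}$ for $n \ge k$, with $F^{(k)}_n = 0$ for $0 \le n \le k-2$ and $F^{(k)}_n = 1$ for $n = k-1$.
e.g.) The second order Fibonacci number: $F^{(2)}_n = F^{(2)}_{n-1} + F^{(2)}_{n-2}$, with $F^{(2)}_0 = 0$ and $F^{(2)}_1 = 1$. These are the ordinary Fibonacci numbers!
$a_n = F^{(k)}_{n+k-1}$ if $k$ input tapes are used. Why?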
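A short sketch for computing $k$-th order Fibonacci numbers, useful for checking the identity $a_n = F^{(k)}_{n+k-1}$ on small cases. The function name `fib_k` is an assumption made for this illustration.

```python
def fib_k(k, n_max):
    """Return [F_0, ..., F_n_max] for the k-th order Fibonacci numbers.

    Initial values: F_n = 0 for 0 <= n <= k-2 and F_{k-1} = 1;
    afterwards each term is the sum of the previous k terms.
    """
    f = [0] * (k - 1) + [1]
    while len(f) <= n_max:
        f.append(sum(f[-k:]))
    return f[:n_max + 1]


print(fib_k(2, 6))     # ordinary Fibonacci numbers
f5 = fib_k(5, 10)
print([f5[n + 4] for n in range(7)])   # a_0 .. a_6 for k = 5 input tapes
```

For $k = 5$ the shifted values reproduce the first column of the perfect distributions: 1, 1, 2, 4, 8, 16, 31.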

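What if the number of initial runs is not a perfect Fibonacci distribution? Use dummy runs!
Example: 5 input tapes and 53 initial runs. Level 4 holds (8, 8, 7, 6, 4) = 33 runs, which is too few; level 5 holds (16, 15, 14, 12, 8) = 65 > 53 runs, so level 5 is used and the 12 missing runs are dummy runs.
[Table: assignment of runs (34) through (53) to tapes T1 to T5 when moving from level 4 to level 5; the layout is not recoverable.]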
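Finding the level and the number of dummy runs can be sketched as below. It only computes how many dummy runs are needed; how they are spread over the tapes is a separate question that this sketch does not address. The name `level_for` is hypothetical.

```python
def level_for(runs, k):
    """Smallest perfect polyphase level whose total is at least `runs`.

    Returns (level, distribution, number of dummy runs) for k input tapes.
    """
    dist = [1] + [0] * (k - 1)          # level 0
    level = 0
    while sum(dist) < runs:
        a = dist[0]
        dist = [a + x for x in dist[1:]] + [a]
        level += 1
    return level, dist, sum(dist) - runs


print(level_for(53, 5))
```

With 53 runs and 5 input tapes this selects level 5 and reports 12 dummy runs.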

[Figure: tapes T1 to T6 with run counts and the placement of dummy runs; not recoverable.]
Not best, but simple and good! For a better one, see Knuth.

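Example (3 tapes). (s)^c denotes c runs of size s; ∅ denotes an empty tape. The run counts follow the Fibonacci numbers 0, 1, 1, 2, 3, 5, 8.

T1        T2         T3
(k)^8     (k)^5      ∅
(k)^3     ∅          (2k)^5
∅         (3k)^3     (2k)^2
(5k)^2    (3k)^1     ∅
(5k)^1    ∅          (8k)^1
∅         (13k)^1    ∅

Runs on the two input tapes (sizes and I/O counts in units of k):

# of runs   run size   # of pairs   # of I/O's
8, 5        1, 1       5            10
5, 3        2, 1       3            9
3, 2        3, 2       2            10
2, 1        5, 3       1            8
1, 1        8, 5       1            13

How many passes over the data?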
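The phases of this 3-tape example can be replayed with a small simulation. It is a sketch under assumptions: each tape is modelled as a (run count, run size) pair, only records read during merging are counted as I/O, and the initial run counts are assumed to be consecutive Fibonacci numbers; `polyphase_3tape` is an illustrative name.

```python
def polyphase_3tape(n1, n2, k=1):
    """Simulate 3-tape polyphase merging with n1 and n2 initial runs of size k.

    Returns (total I/O, passes over the data, [(pairs, merged run size), ...]).
    """
    tapes = [(n1, k), (n2, k), (0, 0)]          # (run count, run size)
    total = (n1 + n2) * k
    io = 0
    phases = []
    while sum(count for count, _ in tapes) > 1:
        out = next(i for i, (count, _) in enumerate(tapes) if count == 0)
        ins = [i for i in range(3) if i != out]
        pairs = min(tapes[i][0] for i in ins)
        if pairs == 0:
            raise ValueError("run counts are not consecutive Fibonacci numbers")
        size = sum(tapes[i][1] for i in ins)    # size of each merged run
        io += pairs * size
        phases.append((pairs, size))
        for i in ins:
            count, s = tapes[i]
            tapes[i] = (count - pairs, s) if count > pairs else (0, 0)
        tapes[out] = (pairs, size)
    return io, io / total, phases


io, passes, phases = polyphase_3tape(8, 5)
print(io, passes, phases)
```

For 8 and 5 initial runs the I/O total is 50k against 13k of data, a little under 4 passes.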

Total number of initial runs = $F_s$ for some $s$, the $s$-th Fibonacci number ($F_s = F_{s-1} + F_{s-2}$).

T1         T2         T3
F_{s-1}    F_{s-2}    ∅
F_{s-3}    ∅          F_{s-2}
∅          F_{s-3}    F_{s-4}
...

See Fig. p.107, textbook.
Total # of I/O operations = [formula not recoverable]
# of passes = [formula not recoverable]

Lemma: [statement not recoverable]
[Proof] By induction on $s$.
($s = 2$) LHS = RHS
($s = 3$) LHS = RHS
($s = k$) Suppose that the claim holds.
($s = k+1$) Exercise! See the textbook.

From the previous lemma, # of passes = [formula not recoverable], where $F_s = r$ (1). Why? Golden Ratio! From (1), [derivation not recoverable].

Theorem: For polyphase merge with 3 tapes, where the initial runs are distributed as $F_{s-1}$ and $F_{s-2}$ and $F_s = r$ = # of initial runs, # of passes ≈ $1.04 \log_2 r$.
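The approximation can be compared numerically with an exact count. The sketch below rests on an assumption derived from the 3-tape example rather than taken from the slides: with $F_s$ initial runs, phase $j$ merges $F_{s-1-j}$ pairs into runs of size $F_{j+2}$, and only merge I/O is counted. `polyphase_passes` is a hypothetical name.

```python
import math


def polyphase_passes(s):
    """Merge passes for 3-tape polyphase with F_s initial runs (s >= 3)."""
    if s < 3:
        raise ValueError("need at least two initial runs (s >= 3)")
    F = [0, 1]
    while len(F) <= s:
        F.append(F[-1] + F[-2])
    # Phase j merges F[s-1-j] pairs into runs of size F[j+2].
    io = sum(F[s - 1 - j] * F[j + 2] for j in range(1, s - 1))
    return io / F[s], F[s]


for s in (7, 10, 20):
    passes, r = polyphase_passes(s)
    print(s, r, round(passes, 3), round(1.04 * math.log2(r), 3))
```

For the sizes tried here the two columns agree to within a few hundredths of a pass.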

APPROXIMATED BEHAVIOR OF POLYPHASE MERGE SORTING
[Table: for each number of tapes, the phases and passes as multiples of ln S, the pass/phase percent, and the growth ratio; the numeric entries are not recoverable.]

APPROXIMATED BEHAVIOR OF CASCADE MERGE SORTING
[Table: for each number of tapes, the phases and passes as multiples of ln S, and the growth ratio; the numeric entries are not recoverable.]

Cascade Merge

Level   a_i                                      b_i             c_i             d_i             e_i
n       a_n                                      b_n             c_n             d_n             e_n
n+1     a_n + b_n + c_n + d_n + e_n (= a_{n+1})  a_{n+1} - e_n   b_{n+1} - d_n   c_{n+1} - c_n   d_{n+1} - b_n (= a_n)

Perfect dist'n: for detail see Knuth Vol. III.
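The cascade rule amounts to taking prefix sums of the previous level and reversing them. The sketch below assumes a level-0 distribution of one run on the first tape, by analogy with the polyphase case; the function names are illustrative.

```python
from itertools import accumulate


def next_cascade_level(dist):
    """(a, b, c, d, e) -> (a+b+c+d+e, a+b+c+d, a+b+c, a+b, a)."""
    return list(accumulate(dist))[::-1]


def cascade_distributions(k, levels):
    """Perfect cascade distributions for k input tapes, levels 0..levels."""
    dist = [1] + [0] * (k - 1)
    out = [dist]
    for _ in range(levels):
        dist = next_cascade_level(dist)
        out.append(dist)
    return out


for level, dist in enumerate(cascade_distributions(5, 4)):
    print(level, dist)
```

Each entry equals the previous one in the same row minus one term of level $n$, which matches the differences shown in the table.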