8. External Sorting
Suppose that a file is so large that it cannot fit in the internal memory of a computer. What shall we do? We need to use an EXTERNAL STORAGE DEVICE !!!
External Sorting
- Disk Sort
- Tape Sort
What is the major difference between the two external sorts?
Sorting with Disk: an internal sort produces sorted runs, which are then merged by k-way merging ("mergesort").
Example
- 4500 records
- 250 records/block
- available memory = 3 blocks
Def'n: A segment of a file is said to be a run if all the records in the segment are sorted.
[Figure: initial runs distributed over disks D1 and D2 and merged onto D3 and D4; one label marks the size of a run]
Run size. How many passes? $1 + \lceil \log_2 r \rceil$ ($r$ = # of initial runs)
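The pass count can be checked on the 4500-record example. This is a minimal sketch, assuming the internal sort fills all of available memory for each initial run and that one pass is spent creating the runs; the function names are hypothetical.

```python
import math

def initial_runs(n_records, records_per_block, memory_blocks):
    """Number of initial runs, assuming each run fills all of memory."""
    run_size = records_per_block * memory_blocks
    return math.ceil(n_records / run_size)

def merge_passes(r, k=2):
    """Passes of k-way merging needed to reduce r runs to a single run."""
    passes = 0
    while r > 1:
        r = math.ceil(r / k)  # every group of up to k runs becomes one run
        passes += 1
    return passes

r = initial_runs(4500, 250, 3)      # 750 records per run
total_passes = 1 + merge_passes(r)  # run creation + 2-way merge passes
print(r, total_passes)
```

For these numbers there are 6 initial runs and 1 + 3 = 4 passes, which agrees with $1 + \lceil \log_2 6 \rceil$.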
k-way merging
[Figure: merge tree of depth $\log_k r$ over the initial runs]
# of passes: $1 + \lceil \log_k r \rceil$
# of I/O operations? $O(n \log_k r)$, better than 2-way merging !!!
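A k-way merge step can be sketched in memory as follows. Python lists stand in for runs on disk, and the standard-library `heapq` module plays the role of the selection structure; this is an illustration, not the textbook's buffering scheme, and `k_way_merge` is a hypothetical name.

```python
import heapq

def k_way_merge(runs):
    """Merge k sorted lists into one sorted list using a min-heap of size <= k."""
    # Each heap entry is (key, run index, position inside that run).
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        key, i, j = heapq.heappop(heap)
        out.append(key)
        if j + 1 < len(runs[i]):
            # Refill from the run the smallest key came from.
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out

merged = k_way_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
```

Each output record costs about $\log_2 k$ comparisons in the heap, which is the quantity the comparison count for a pass depends on.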
How about # of comparisons? Is k-way merging always better than 2-way merging?
Replacement Selection
[Figure: k-way merge tree over the initial runs]
# of passes: $1 + \lceil \log_k r \rceil$
[Plots: # of passes #(P) versus k; #(P) versus r; r versus run size]
# of comparisons (k-way merge)
How many comparisons in a pass? $n \log_2 k$. Why?
Total # of comparisons?
(# of passes) × (# of comparisons in a pass) $= (\log_k r)(n \log_2 k) = n \log_2 r$, independent of k !!!
[Plot: # of comparisons #(c) versus r]
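The step that makes $k$ drop out is the change of base $\log_k r = \log_2 r / \log_2 k$. Written out, treating the logarithms as real numbers (no rounding up to whole passes) and assuming about $\log_2 k$ comparisons per output record in a selection tree with $k$ leaves:

```latex
\begin{align*}
\text{total comparisons}
  &= (\log_k r)\,(n \log_2 k) \\
  &= \frac{\log_2 r}{\log_2 k}\; n \log_2 k \\
  &= n \log_2 r .
\end{align*}
```

So a larger $k$ reduces the number of passes, and therefore the I/O, without increasing the total number of comparisons.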
How to increase run size (initial run size)
$x_1, x_2, x_3, \dots, x_m \;\mid\; x_{m+1}, x_{m+2}, x_{m+3}, \dots, x_{2m} \;\mid\; x_{2m+1}, x_{2m+2}, x_{2m+3}, \dots$ (m keys per run)
$r$ = # of runs = $\lceil n/m \rceil$
Any improvement? Observation: see p.94 in textbook !!!
Keys: 4, 2, 32, 12, 18, 24, 91, 11 (record size >> the size of a pointer). Why do we need this?
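A tree of losers
[Figure: tree of losers over the leaves 4, 2, 32, 12, 18, 24, 91, 11; each internal node has a parent pointer and a loser pointer, and a separate winner pointer is kept]
Updating pointers:
    ptr := winner^.parent;
    while ptr <> nil do
      if (ptr^.loser^.key < winner^.key) then
        interchange(ptr^.loser, winner);
      end {if}
      ptr := ptr^.parent;
    end {while}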
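The pointer update above can be tried out with an array-based tree of losers. This is a sketch under assumptions not in the slides: the tree is stored heap-style (internal nodes 1..k-1, leaf i at position k+i) instead of with explicit pointers, and an exhausted leaf is refilled with infinity. The class and function names are hypothetical.

```python
import math

INF = math.inf

class LoserTree:
    def __init__(self, keys):
        self.k = len(keys)
        self.keys = list(keys)        # current key held by each leaf
        self.loser = [None] * self.k  # loser[node] = leaf index that lost at node
        self.winner = self._build(1)  # leaf index of the overall winner

    def _build(self, node):
        """Play the sub-tournament rooted at node and return the winning leaf."""
        if node >= self.k:            # positions k..2k-1 are leaves
            return node - self.k
        left = self._build(2 * node)
        right = self._build(2 * node + 1)
        if self.keys[left] <= self.keys[right]:
            self.loser[node] = right
            return left
        self.loser[node] = left
        return right

    def replace_winner(self, new_key):
        """Give the winner's leaf a new key and replay its leaf-to-root path."""
        w = self.winner
        self.keys[w] = new_key
        node = (w + self.k) // 2      # parent of the winner's leaf
        while node >= 1:
            l = self.loser[node]
            if self.keys[l] < self.keys[w]:
                self.loser[node], w = w, l  # interchange(ptr.loser, winner)
            node //= 2                # ptr := ptr.parent
        self.winner = w

def tournament_sort(keys):
    """Repeatedly output the winner and replace it by infinity."""
    tree = LoserTree(keys)
    out = []
    for _ in range(len(keys)):
        out.append(tree.keys[tree.winner])
        tree.replace_winner(INF)
    return out

result = tournament_sort([4, 2, 32, 12, 18, 24, 91, 11])
```

Only the nodes on one leaf-to-root path are visited per update, and each visit compares against the stored loser without having to look at a sibling.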
Explain p , textbook !!!
Exercise: In a complete 2-tree $T$ with $n$ leaf nodes, show that the total # of nodes in $T$ is $2n - 1$.
Performance Analysis (Average size of runs)
$m_0$: # of records in (real) memory.
- H. Seward (M.S. Thesis, MIT, 1954) gave a good reason to believe that a run contains more than $1.5\,m_0$ records (no proof).
- E. Friend (JACM, 3, (1956)): experiment, $2m_0$.
- E. Moore (1961) proved that $2m_0$ is the expected run length.
Sketch of Moore's Proof: Snowplow
[Figure: a snowplow clearing falling snow; labels: $2m_0$, $m_0$, uniform distribution, $2m_0$]
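Replacement selection, the run-generation method whose expected run length is analyzed above, can be sketched with a heap in place of the tree of losers. The heap is a substitution made here for brevity, and `replacement_selection` is a hypothetical name. A key that is smaller than the last key written cannot join the current run, so it is tagged for the next one.

```python
import heapq

_END = object()  # sentinel marking exhausted input

def replacement_selection(keys, m):
    """Split keys into sorted runs using a memory that holds m records."""
    it = iter(keys)
    # Heap entries are (run number, key), so every key tagged for the next
    # run sorts after all keys still belonging to the current run.
    heap = [(0, x) for _, x in zip(range(m), it)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no == len(runs):
            runs.append([])          # the current run is finished; start a new one
        runs[run_no].append(key)
        nxt = next(it, _END)
        if nxt is not _END:
            # A smaller key cannot extend the current run; defer it.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return runs

runs = replacement_selection([4, 2, 32, 12, 18, 24, 91, 11], 3)
```

With a memory of 3 records the eight keys form a first run of length 7, already sorted input yields a single run, and reverse-sorted input yields runs of length exactly m.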
Tape Sorting
- Balanced k-way merging (similar to disk sorting)
- Polyphase merging
- Cascade merging
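Polyphase Merging (Motivation)
- Records $(R_1, R_2, \dots, R_{5000})$
- length$(R_i)$ = 20 bytes
- Only 1000 records fit in the internal memory at one time (20k bytes).
- 4 tapes available
Balanced 2-way merge ($R_{i,j}$ denotes a sorted run of records $i$ through $j$):
Step 1 (initial runs): T1: $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$; T2: $R_{1001,2000}$, $R_{3001,4000}$
Step 2: T3: $R_{1,2000}$, $R_{4001,5000}$; T4: $R_{2001,4000}$
Step 3: T1: $R_{1,4000}$; T2: $R_{4001,5000}$
Step 4: T3: $R_{1,5000}$
Total # of operations = 15000 (three merge passes over 5000 records)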
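An alternative with the same 4 tapes:
Step 1 (initial runs): Tape 1: $R_{1,1000}$, $R_{3001,4000}$; Tape 2: $R_{1001,2000}$, $R_{4001,5000}$; Tape 3: $R_{2001,3000}$
Step 2 (3-way merge, then rewind): Tape 4: $R_{1,3000}$; Tape 1: $R_{3001,4000}$; Tape 2: $R_{4001,5000}$
Step 3: Tape 3: $R_{1,5000}$
Total # of I/O operations = 8000 (3000 + 5000)
Balanced merge is not always best !!!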
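What if only 3 tapes are available?
Step 1 (initial runs): Tape 1: $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$; Tape 2: $R_{1001,2000}$, $R_{3001,4000}$
Step 2 (merge): Tape 3: $R_{1,2000}$, $R_{2001,4000}$, $R_{4001,5000}$
Step 3 (redistribute): Tape 1: $R_{1,2000}$, $R_{4001,5000}$; Tape 2: $R_{2001,4000}$
Step 4 (merge): Tape 3: $R_{1,4000}$, $R_{4001,5000}$
Step 5 (redistribute): $R_{4001,5000}$ and $R_{1,4000}$ are placed on different tapes
Step 6 (merge): $R_{1,5000}$
Total # of I/O operations = 21,000 !!!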
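A better way with 3 tapes:
Step 1 (initial runs): Tape 1: $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$; Tape 2: $R_{1001,2000}$, $R_{3001,4000}$
Step 2 (merge, then rewind): Tape 3: $R_{1,2000}$, $R_{2001,4000}$; Tape 1: $R_{4001,5000}$
Step 3 (merge, then rewind): Tape 2: $R_{1,2000;\,4001,5000}$; Tape 3: $R_{2001,4000}$
Step 4 (merge): Tape 1: $R_{1,5000}$
Total # of I/O operations = 11,000 !!!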
Polyphase merge
[Table of run counts over tapes T1, T2, T3, T4, T5, T; entries not recoverable]
How to assign initial runs?
Cascade Merge
[Table of run counts over tapes T1, T2, T3, T4, T5, T after each pass; entries not recoverable]
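Polyphase Merge (Gilstad, 1960)
[Figure: phases of a polyphase merge over tapes T1 ... T6]
Perfect Fibonacci Distribution !!! Run counts on the five input tapes, by level:
Level 0: 1, 0, 0, 0, 0
Level 1: 1, 1, 1, 1, 1
Level 2: 2, 2, 2, 2, 1
Level 3: 4, 4, 4, 3, 2
Level 4: 8, 8, 7, 6, 4
Level 5: 16, 15, 14, 12, 8
Level 6: 31, 30, 28, 24, 16
What is the underlying rule?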
Level $i$: run counts $a_i, b_i, c_i, d_i, e_i$
Level 0: $a_0,\ b_0,\ c_0,\ d_0,\ e_0$
Level 1: $a_0 + b_0,\ a_0 + c_0,\ a_0 + d_0,\ a_0 + e_0,\ a_0$
Level 2: $a_1 + b_1,\ a_1 + c_1,\ a_1 + d_1,\ a_1 + e_1,\ a_1$
Level 3: $a_2 + b_2,\ a_2 + c_2,\ a_2 + d_2,\ a_2 + e_2,\ a_2$
...
Level $n$: $a_n,\ b_n,\ c_n,\ d_n,\ e_n$
Level $n+1$: $a_n + b_n,\ a_n + c_n,\ a_n + d_n,\ a_n + e_n,\ a_n$
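The level-to-level rule can be applied mechanically to reproduce the perfect distributions. A minimal sketch, with a hypothetical function name, starting from level 0 = (1, 0, ..., 0):

```python
def perfect_distributions(tapes, levels):
    """Run counts on the input tapes for levels 0..levels of a polyphase merge."""
    dist = [1] + [0] * (tapes - 1)
    out = [tuple(dist)]
    for _ in range(levels):
        a = dist[0]
        # (a, b, c, d, e) -> (a+b, a+c, a+d, a+e, a)
        dist = [a + x for x in dist[1:]] + [a]
        out.append(tuple(dist))
    return out

levels = perfect_distributions(5, 6)
```

With five input tapes this gives back the sequence from (1, 0, 0, 0, 0) up to (31, 30, 28, 24, 16); with two input tapes the counts are consecutive Fibonacci numbers.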
[Table: for each level $i$, the run counts $a_i, b_i, c_i, d_i, e_i$ on T1 ... T5 and the output tape; entries not recoverable]
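Level $n-1$: $a_{n-1},\ b_{n-1},\ c_{n-1},\ d_{n-1},\ e_{n-1}$
Level $n$: $a_n = a_{n-1} + b_{n-1}$, $b_n = a_{n-1} + c_{n-1}$, $c_n = a_{n-1} + d_{n-1}$, $d_n = a_{n-1} + e_{n-1}$, $e_n = a_{n-1}$
Substituting from the bottom up:
$e_n = a_{n-1}$
$d_n = a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3}$
...
Therefore:
$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$
($a_0 = 1$; $a_i = 0$ for $i = -1, -2, -3, -4$)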
$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$
[Table of $a_i, b_i, c_i, d_i, e_i$ by level $i$, starting from $b_0 = c_0 = d_0 = e_0 = 0$; remaining entries not recoverable]
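$a_i$ for $i = -4, -3, -2, -1, 0, 1, 2, \dots$
"The $k$-th order Fibonacci number":
$F_n^{(k)} = F_{n-1}^{(k)} + F_{n-2}^{(k)} + \cdots + F_{n-k}^{(k)}$ for $n \ge k$
$F_n^{(k)} = 0$ for $0 \le n \le k-2$
$F_n^{(k)} = 1$ for $n = k-1$
e.g.) The second order Fibonacci number:
$F_n^{(2)} = F_{n-1}^{(2)} + F_{n-2}^{(2)}$, with $F_n^{(2)} = 0$ if $n = 0$ and $F_n^{(2)} = 1$ if $n = 1$. Fibonacci number !!!
$a_n = F_{n+k-1}^{(k)}$ if $k$ (input) tapes are used. Why?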
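The relation $a_n = F_{n+k-1}^{(k)}$ can be checked numerically for $k = 5$ against the first column of the perfect distributions (1, 1, 2, 4, 8, 16, 31). A small sketch; `fib_k` is a hypothetical helper name.

```python
def fib_k(n, k):
    """k-th order Fibonacci number: 0 for 0 <= n <= k-2, 1 for n = k-1,
    and the sum of the previous k terms afterwards."""
    f = [0] * (k - 1) + [1]
    while len(f) <= n:
        f.append(sum(f[-k:]))
    return f[n]

# a_n = F_{n+k-1}^{(k)} with k = 5 input tapes
a = [fib_k(n + 4, 5) for n in range(7)]
```

For $k = 2$ the same function returns the ordinary Fibonacci numbers, for example `fib_k(10, 2)` is 55.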
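What if the number of initial runs is not a perfect Fibonacci distribution? Use dummy runs !!!
Example: 5 input tapes and 53 initial runs. The level-4 distribution (8, 8, 7, 6, 4) holds only 33 runs, so the level-5 distribution (16, 15, 14, 12, 8), which holds 65 > 53 runs, is needed; the 12 missing runs are dummies.
[Figure: level table over T1 ... T5 and the order in which runs (34) through (53) are placed on the tapes; layout not recoverable]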
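The number of dummy runs follows from the smallest perfect distribution that can hold all the initial runs. A sketch with a hypothetical function name; it only counts the dummies and does not decide on which tapes they should go, which is the harder part of the problem.

```python
def dummy_runs(initial_runs, tapes):
    """Smallest perfect level holding initial_runs, and how many dummies it needs."""
    dist = [1] + [0] * (tapes - 1)
    level = 0
    while sum(dist) < initial_runs:
        a = dist[0]
        dist = [a + x for x in dist[1:]] + [a]  # next perfect distribution
        level += 1
    return level, tuple(dist), sum(dist) - initial_runs

level, dist, dummies = dummy_runs(53, 5)
```

For 53 runs on 5 input tapes this selects level 5 with 65 slots, leaving 12 dummy runs.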
[Figure: assignment over tapes T1 ... T6; surviving labels: (2)(2)(2)(3)(3), 5, 8, (2)(2)(2)(3)]
Not the best, but simple and good !!! For a better one, see Knuth !!!
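Example (3 tapes). $(x)^c$ denotes $c$ runs of size $x$. Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8
Phase   T1        T2         T3
0       (k)^8     (k)^5      -
1       (k)^3     -          (2k)^5
2       -         (3k)^3     (2k)^2
3       (5k)^2    (3k)^1     -
4       (5k)^1    -          (8k)^1
5       -         (13k)^1    -
Runs on the two input tapes:
# of runs   run size (k)   # of pairs   # of I/O's
8, 5        1, 1           …            …
5, 3        2, 1           …            …
3, 2        3, 2           …            …
2, 1        5, 3           …            …
1, 1        8, 5           …            …
How many passes over the data?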
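The phases of this example can be replayed to count how many records each phase moves. This is a sketch under a stated assumption: a record counts once for every merge it takes part in (its read and its write are counted together), which may differ from the counting convention of the lost "# of I/O's" column. The function name is hypothetical, and the run counts are assumed to be consecutive Fibonacci numbers.

```python
def polyphase_3tape(count_a, count_b, k=1):
    """Simulate a 3-tape polyphase merge.
    Returns one (pairs merged, merged run size, records moved) tuple per phase."""
    a = (count_a, k)  # (number of runs, size of each run) on one input tape
    b = (count_b, k)  # the same for the other input tape
    phases = []
    while a[0] > 0 and b[0] > 0:
        pairs = min(a[0], b[0])  # merge until the shorter tape is empty
        size = a[1] + b[1]
        phases.append((pairs, size, pairs * size))
        # The tape that still holds runs stays an input; the output tape joins it.
        rest = (a[0] - pairs, a[1]) if a[0] > pairs else (b[0] - pairs, b[1])
        a, b = (pairs, size), rest
    return phases

phases = polyphase_3tape(8, 5)
moved = sum(p[2] for p in phases)
```

Under this convention the five merge phases move 10k, 9k, 10k, 8k and 13k records, 50k in total, so the 13k records are passed over about 50/13 ≈ 3.85 times, not counting the initial distribution.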
Total number of initial runs = $F_s$ for some $s$, the $s$-th Fibonacci number ($F_s = F_{s-1} + F_{s-2}$).
T1          T2          T3
F_{s-1}     F_{s-2}     -
F_{s-3}     -           F_{s-2}
-           F_{s-3}     F_{s-4}
…
See Fig. p.107, textbook !!!
Total # of I/O operations = [expression not recoverable]
# of passes = [expression not recoverable]
Lemma: [statement not recoverable]
[proof] (By induction on $s$)
($s = 2$) LHS = RHS =
($s = 3$) LHS = RHS =
($s = k$) Suppose that
($s = k+1$) Exercise !!! See page in textbook !!!
From the previous lemma, # of passes = [expression not recoverable]
$F_s = r$ (1). Why? Golden Ratio !!!
From (1), [expression not recoverable]
Theorem: For a polyphase merge on 3 tapes with $F_s = r$ = # of initial runs (distributed as $F_{s-1}$ and $F_{s-2}$ runs on the two input tapes), # of passes $\approx 1.04 \log_2 r$.
APPROXIMATED BEHAVIOR OF POLYPHASE MERGE SORTING
[Table; columns: Tapes, Phases, Passes, Pass/phase (percent), Growth ratio; the Phases and Passes entries are expressions in $\ln S$; numeric entries not recoverable]
APPROXIMATED BEHAVIOR OF CASCADE MERGE SORTING
[Table; columns: Tapes, Phases, Passes, Growth ratio; entries are expressions in $\ln S$; numeric entries not recoverable]
Cascade Merge
Level $n$: $a_n,\ b_n,\ c_n,\ d_n,\ e_n$
Level $n+1$:
$a_{n+1} = a_n + b_n + c_n + d_n + e_n$
$b_{n+1} = a_{n+1} - e_n$
$c_{n+1} = b_{n+1} - d_n$
$d_{n+1} = c_{n+1} - c_n$
$e_{n+1} = d_{n+1} - b_n = a_n$
Perfect dist'n: for details see Knuth Vol III !!!
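Unrolling the differences above, each count at level $n+1$ is a prefix sum of the level-$n$ counts ($b_{n+1} = a_n + b_n + c_n + d_n$, and so on down to $e_{n+1} = a_n$). A minimal sketch that generates the perfect cascade distributions, assuming level 0 is (1, 0, 0, 0, 0) as in the polyphase case; the function name is hypothetical.

```python
def cascade_distributions(tapes, levels):
    """Run counts on the input tapes for levels 0..levels of a cascade merge."""
    dist = [1] + [0] * (tapes - 1)
    out = [tuple(dist)]
    for _ in range(levels):
        # The j-th new count is the sum of the first (tapes - j) old counts.
        dist = [sum(dist[:tapes - j]) for j in range(tapes)]
        out.append(tuple(dist))
    return out

cascade = cascade_distributions(5, 3)
```

With five input tapes the levels are (1, 1, 1, 1, 1), (5, 4, 3, 2, 1), (15, 14, 12, 9, 5), which grow faster per level than the polyphase distributions.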