# 8. External Sorting


Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer. What shall we do? We need to use an EXTERNAL STORAGE DEVICE !!!

External Sorting
- Disk Sort
- Tape Sort

What is a major difference between the two external sorts?

Sorting with Disk: an internal sort creates the initial sorted segments, which are then combined by k-way merging ("mergesort").

[Figure: input file, internal sort, merge.]

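Example: 4500 records, 250 records/block, available memory = 3 blocks. The file occupies 4500 / 250 = 18 blocks, so sorting 3 blocks at a time gives 6 initial runs.

Def'n: A segment of a file is said to be a run if all the records in the segment are sorted.

[Figure: the internal sort (I) produces runs 1 to 6; runs 1, 3, 5 go to disk D1 and runs 2, 4, 6 go to disk D2.]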

[Figure: runs of size 3 on D1 and D2 are merged into runs of size 6 on D3 and D4. The number shown is the size of a run.]

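2-way merging of 8 initial runs (each digit names an initial run):

| Run size | Runs |
|---|---|
| 3 | 1 3 5 7, 2 4 6 8 |
| 6 | 12 34 56 78 |
| 12 | 1256 3478 |
| 24 | 12345678 |

How many passes? $1 + \lceil \log_2 r \rceil$, where $r$ = # of initial runs.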

k-way merging

[Figure: k-way merge tree over the $r$ initial runs, of height $\log_k r$.]

- # of passes: $1 + \lceil \log_k r \rceil$
- # of I/O operations? $O(n \log_k r)$, better than 2-way merging !!!
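The pass count above can be checked with a small in-memory sketch. It is only an illustration under stated assumptions: Python lists stand in for disk files, `heapq.merge` stands in for the k-way merge step, and the names `make_runs`, `merge_pass`, and `external_sort` are hypothetical, not taken from the slides.

```python
import heapq

def make_runs(records, m):
    """Pass 1: sort each chunk of m records in 'memory' to form initial runs."""
    return [sorted(records[i:i + m]) for i in range(0, len(records), m)]

def merge_pass(runs, k):
    """One merge pass: merge each group of k consecutive runs into one run."""
    return [list(heapq.merge(*runs[i:i + k])) for i in range(0, len(runs), k)]

def external_sort(records, m, k):
    """Return (sorted records, number of passes including run creation)."""
    runs = make_runs(records, m)
    passes = 1  # the run-creation pass
    while len(runs) > 1:
        runs = merge_pass(runs, k)
        passes += 1
    return (runs[0] if runs else []), passes

data = list(range(24, 0, -1))          # 24 records, so r = 8 runs when m = 3
result_2way, passes_2way = external_sort(data, m=3, k=2)
result_8way, passes_8way = external_sort(data, m=3, k=8)
```

With 8 initial runs, the 2-way version needs 1 + 3 passes and the 8-way version needs 1 + 1, matching $1 + \lceil \log_k r \rceil$.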

How about # of comparisons? Is k-way merging always better than 2-way merging?

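Replacement Selection

[Figure: k-way merge tree over the initial runs.]

- # of passes: $\#(P) = 1 + \lceil \log_k r \rceil$
- $\#(P)$ decreases as $k$ increases or $r$ decreases.
- $r$ decreases as the run size increases.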

Number of comparisons (k-way merge)

[Figure: selection tree for an 8-way merge; the tree nodes are numbered 1 to 15 and the smallest key, 8, is at the root.]

How many comparisons in a pass? $n \log_2 k$. Why?

Total # of comparisons? (# of passes) × (# of comparisons in a pass) $= (\log_k r)(n \log_2 k) = n \log_2 r$, independent of $k$ !!!

The number of comparisons $\#(c)$ decreases as $r$ decreases.
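The step from the product to $n \log_2 r$ uses only the change-of-base rule $\log_k r = \log_2 r / \log_2 k$. Written out:

$$
(\log_k r)\,(n \log_2 k) \;=\; \frac{\log_2 r}{\log_2 k}\cdot n \log_2 k \;=\; n \log_2 r .
$$

The factor $\log_2 k$ cancels, which is why increasing $k$ reduces the number of passes (and I/O) but not the total number of comparisons.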

How to increase the run size (initial run size)?

$x_1, x_2, x_3, \dots, x_m \;\mid\; x_{m+1}, x_{m+2}, x_{m+3}, \dots, x_{2m} \;\mid\; x_{2m+1}, x_{2m+2}, x_{2m+3}, \dots$

Each group of $m$ keys forms one run, so $r$ = # of runs = $\lceil n/m \rceil$.

Any improvement? Observation: see p.94 in textbook !!!

Keys: 4, 2, 32, 12, 18, 24, 91, 11 (record size >> the size of a pointer). Why do we need this?

[Figure: selection tree over the keys, with internal nodes numbered 2 to 7.]

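A tree of losers

[Figure: the keys 4, 2, 32, 12, 18, 24, 91, 11 at the leaves; each internal node has a `parent` pointer and a `loser` pointer, and a separate `winner` pointer identifies the overall winner.]

Updating pointers:

    ptr := winner↑.parent;
    while ptr ≠ nil do
        if (ptr↑.loser↑.key < winner↑.key) then
            interchange(ptr↑.loser, winner);
        end {if}
        ptr := ptr↑.parent;
    end {while}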

Explain p.97-101, textbook !!! Exercise : In a complete 2-tree(T) with n leaf nodes, show that total # of nodes in T = 2n -1

Performance Analysis (Average size of runs)

$m_0$ = # of records in (real) memory.

- H. Seward (M.S. Thesis, MIT, 1954) gave a good reason to believe that a run contains more than $1.5\,m_0$ records (no proof).
- E. Friend (JACM 3, 1956): experiments suggested about $2\,m_0$.
- E. Moore (1961) proved that $2\,m_0$ is the expected run length.
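The run-lengthening effect can be seen in a short sketch of replacement selection. Note the substitution: this version uses Python's `heapq` binary heap in place of the tree of losers from the slides, and the function name `replacement_selection` and the memory size `m = 3` in the usage line are assumptions made for illustration.

```python
import heapq

def replacement_selection(keys, m):
    """Split keys into sorted runs using a 'memory' that holds m records."""
    it = iter(keys)
    # Heap entries are (run_number, key); keys held back for the next run
    # carry a larger run number, so they sort after every current-run key.
    heap = [(0, x) for _, x in zip(range(m), it)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no == len(runs):
            runs.append([])          # first key of a new run
        runs[run_no].append(key)
        nxt = next(it, None)         # assumes None never occurs as a key
        if nxt is not None:
            # A key smaller than the one just written cannot join this run.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return runs

runs = replacement_selection([4, 2, 32, 12, 18, 24, 91, 11], m=3)
```

On these eight keys with room for three records, simple chunking would give three runs, while this sketch produces two: one of length 7 and one of length 1.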

Sketch of Moore's Proof (snowplow argument)

[Figure: a snowplow clearing falling snow, labeled with $m_0$ and $2\,m_0$.]

Uniform distribution ⇒ expected run length $2\,m_0$.

Tape Sorting
- Balanced k-way merging (similar to disk sorting)
- Polyphase merging
- Cascade merging

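Polyphase Merging (Motivation)

- Records $(R_1, R_2, \dots, R_{5000})$
- length$(R_i)$ = 20 bytes
- Only 1000 records fit in the internal memory at one time (≈ 20k bytes).
- 4 tapes available

Balanced 2-way merge ($R_{i,j}$ denotes the sorted run of records $i$ through $j$):

| Step | T1 | T2 | T3 | T4 |
|---|---|---|---|---|
| Initial runs | $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$ | $R_{1001,2000}$, $R_{3001,4000}$ | - | - |
| Merge 1 | - | - | $R_{1,2000}$, $R_{4001,5000}$ | $R_{2001,4000}$ |
| Merge 2 | $R_{1,4000}$ | $R_{4001,5000}$ | - | - |
| Merge 3 | - | - | $R_{1,5000}$ | - |

Total # of operations = 3 × 5000 = 15000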

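A better schedule with the same 4 tapes:

| Step | Tape 1 | Tape 2 | Tape 3 | Tape 4 |
|---|---|---|---|---|
| Initial runs | $R_{1,1000}$, $R_{3001,4000}$ | $R_{1001,2000}$, $R_{4001,5000}$ | $R_{2001,3000}$ | - |
| 3-way merge | $R_{3001,4000}$ | $R_{4001,5000}$ | - | $R_{1,3000}$ (rewind) |
| 3-way merge | - | - | $R_{1,5000}$ | - |

Total # of I/O operations = 3000 + 5000 = 8000

Balanced merge is not always best !!!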

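What if only 3 tapes are available?

Initial runs: Tape 1 holds $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$; Tape 2 holds $R_{1001,2000}$, $R_{3001,4000}$; Tape 3 is empty.

| Step | Operation | Result | I/O |
|---|---|---|---|
| 1 | Merge the runs of Tape 1 and Tape 2 pairwise onto the empty tape | $R_{1,2000}$, $R_{2001,4000}$, $R_{4001,5000}$ | 5000 |
| 2 | Copy $R_{1,2000}$ to another tape | $R_{1,2000}$ alone on one tape; $R_{2001,4000}$, $R_{4001,5000}$ on the other | 2000 |
| 3 | Merge $R_{1,2000}$ with $R_{2001,4000}$, then copy $R_{4001,5000}$ | $R_{1,4000}$, $R_{4001,5000}$ | 5000 |
| 4 | Copy $R_{1,4000}$ to another tape | $R_{1,4000}$ and $R_{4001,5000}$ on separate tapes | 4000 |
| 5 | Merge | $R_{1,5000}$ | 5000 |

Total # of I/O operations = 5000 + 2000 + 5000 + 4000 + 5000 = 21,000 !!!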

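A better schedule with 3 tapes:

| Step | Tape 1 | Tape 2 | Tape 3 |
|---|---|---|---|
| Initial runs | $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$ | $R_{1001,2000}$, $R_{3001,4000}$ | - |
| Merge two pairs of runs | $R_{4001,5000}$ | - | $R_{1,2000}$, $R_{2001,4000}$ (rewind) |
| Merge $R_{4001,5000}$ with $R_{1,2000}$ | - | $R_{1,2000;\,4001,5000}$ (rewind) | $R_{2001,4000}$ |
| Merge | $R_{1,5000}$ | - | - |

Total # of I/O operations = 4000 + 3000 + 5000 = 12,000 !!!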

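Polyphase merge ($s^n$ denotes $n$ runs of size $s$; - denotes an empty tape)

| T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|
| $1^{31}$ | $1^{30}$ | $1^{28}$ | $1^{24}$ | $1^{16}$ | - |
| $1^{15}$ | $1^{14}$ | $1^{12}$ | $1^{8}$ | - | $5^{16}$ |
| $1^{7}$ | $1^{6}$ | $1^{4}$ | - | $9^{8}$ | $5^{8}$ |
| $1^{3}$ | $1^{2}$ | - | $17^{4}$ | $9^{4}$ | $5^{4}$ |
| $1^{1}$ | - | $33^{2}$ | $17^{2}$ | $9^{2}$ | $5^{2}$ |
| - | $65^{1}$ | $33^{1}$ | $17^{1}$ | $9^{1}$ | $5^{1}$ |
| $129^{1}$ | - | - | - | - | - |

How to assign initial runs?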

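Cascade Merge ($s^n$ denotes $n$ runs of size $s$; - denotes an empty tape)

| Pass | T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|---|
| Initial | $1^{55}$ | $1^{50}$ | $1^{41}$ | $1^{29}$ | $1^{15}$ | - |
| 1 | $1^{40}$ | $1^{35}$ | $1^{26}$ | $1^{14}$ | - | $5^{15}$ |
| 1 | $1^{26}$ | $1^{21}$ | $1^{12}$ | - | $4^{14}$ | $5^{15}$ |
| 1 | $1^{14}$ | $1^{9}$ | - | $3^{12}$ | $4^{14}$ | $5^{15}$ |
| 1 | $1^{5}$ | - | $2^{9}$ | $3^{12}$ | $4^{14}$ | $5^{15}$ |
| (1) | ( - | $1^{5}$ | $2^{9}$ | $3^{12}$ | $4^{14}$ | $5^{15}$ ) |
| 2 | $15^{5}$ | - | $2^{4}$ | $3^{7}$ | $4^{9}$ | $5^{10}$ |
| 2 | $15^{5}$ | $14^{4}$ | - | $3^{3}$ | $4^{5}$ | $5^{6}$ |
| 2 | $15^{5}$ | $14^{4}$ | $12^{3}$ | - | $4^{2}$ | $5^{3}$ |
| 2 | $15^{5}$ | $14^{4}$ | $12^{3}$ | $9^{2}$ | - | $5^{1}$ |
| (2) | ( $15^{5}$ | $14^{4}$ | $12^{3}$ | $9^{2}$ | $5^{1}$ | - ) |
| 3 | $15^{4}$ | $14^{3}$ | $12^{2}$ | $9^{1}$ | - | $55^{1}$ |
| 3 | $15^{3}$ | $14^{2}$ | $12^{1}$ | - | $50^{1}$ | $55^{1}$ |
| 3 | $15^{2}$ | $14^{1}$ | - | $41^{1}$ | $50^{1}$ | $55^{1}$ |
| 3 | $15^{1}$ | - | $29^{1}$ | $41^{1}$ | $50^{1}$ | $55^{1}$ |
| (3) | ( - | $15^{1}$ | $29^{1}$ | $41^{1}$ | $50^{1}$ | $55^{1}$ ) |
| 4 | $190^{1}$ | - | - | - | - | - |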

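Polyphase Merge, Gilstad (1960) ($s^n$ denotes $n$ runs of size $s$; - denotes an empty tape)

| Phase | T1 | T2 | T3 | T4 | T5 | T6 |
|---|---|---|---|---|---|---|
| 1 | $1^{31}$ | $1^{30}$ | $1^{28}$ | $1^{24}$ | $1^{16}$ | - |
| 2 | $1^{15}$ | $1^{14}$ | $1^{12}$ | $1^{8}$ | - | $5^{16}$ |
| 3 | $1^{7}$ | $1^{6}$ | $1^{4}$ | - | $9^{8}$ | $5^{8}$ |
| 4 | $1^{3}$ | $1^{2}$ | - | $17^{4}$ | $9^{4}$ | $5^{4}$ |
| 5 | $1^{1}$ | - | $33^{2}$ | $17^{2}$ | $9^{2}$ | $5^{2}$ |
| 6 | - | $65^{1}$ | $33^{1}$ | $17^{1}$ | $9^{1}$ | $5^{1}$ |
| 7 | $129^{1}$ | - | - | - | - | - |

Run counts per tape, read from the last phase back to the first:

{{1,0,0,0,0}, {1,1,1,1,1}, {2,2,2,2,1}, {4,4,4,3,2}, {8,8,7,6,4}, {16,15,14,12,8}, {31,30,28,24,16}}

Perfect Fibonacci Distribution !!! What is the underlying rule?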

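| $i$ | $a_i$ | $b_i$ | $c_i$ | $d_i$ | $e_i$ |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 1 |
| 3 | 4 | 4 | 4 | 3 | 2 |
| 4 | 8 | 8 | 7 | 6 | 4 |
| 5 | 16 | 15 | 14 | 12 | 8 |
| 6 | 31 | 30 | 28 | 24 | 16 |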

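Each level is obtained from the previous one:

| Level | $a$ | $b$ | $c$ | $d$ | $e$ |
|---|---|---|---|---|---|
| 1 | $a_0 + b_0$ | $a_0 + c_0$ | $a_0 + d_0$ | $a_0 + e_0$ | $a_0$ |
| 2 | $a_1 + b_1$ | $a_1 + c_1$ | $a_1 + d_1$ | $a_1 + e_1$ | $a_1$ |
| 3 | $a_2 + b_2$ | $a_2 + c_2$ | $a_2 + d_2$ | $a_2 + e_2$ | $a_2$ |
| $n$ | $a_n$ | $b_n$ | $c_n$ | $d_n$ | $e_n$ |
| $n+1$ | $a_n + b_n$ | $a_n + c_n$ | $a_n + d_n$ | $a_n + e_n$ | $a_n$ |

with $a_n \ge b_n \ge c_n \ge d_n \ge e_n$.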
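The level-to-level rule can be applied mechanically. The following is a minimal sketch; the function names `next_level` and `perfect_distributions` are hypothetical, and the default of 5 input tapes simply matches the example in the slides.

```python
def next_level(dist):
    """Apply (a, b, c, d, e) -> (a+b, a+c, a+d, a+e, a) for any number of tapes."""
    a = dist[0]
    return tuple(a + x for x in dist[1:]) + (a,)

def perfect_distributions(levels, tapes=5):
    """Return the perfect distributions for levels 0..levels."""
    dist = (1,) + (0,) * (tapes - 1)   # level 0: a single run on one tape
    out = [dist]
    for _ in range(levels):
        dist = next_level(dist)
        out.append(dist)
    return out

table = perfect_distributions(7)
```

Here `table[6]` reproduces the level-6 row (31, 30, 28, 24, 16), which sums to the 129 initial runs of the polyphase example.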

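| $i$ | $a_i$ | $b_i$ | $c_i$ | $d_i$ | $e_i$ | output |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | T6 |
| 1 | 1 | 1 | 1 | 1 | 1 | T1 |
| 2 | 2 | 2 | 2 | 2 | 1 | T2 |
| 3 | 4 | 4 | 4 | 3 | 2 | T3 |
|   | 2 | 2 | 2 | 1 | 0 | 2 |
|   | 1 | 1 | 1 | 0 | 1 | 1 |
| 4 | 8 | 8 | 7 | 6 | 4 | T4 |
| 5 | 16 | 15 | 14 | 12 | 8 | T5 |
| 6 | 31 | 30 | 28 | 24 | 16 | T6 |
| 7 | 61 | 59 | 55 | 47 | 31 |   |

Tape labels shown beneath the table: T1, T2, T3, T4, T5.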

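| Level | $a$ | $b$ | $c$ | $d$ | $e$ |
|---|---|---|---|---|---|
| $n-1$ | $a_{n-1}$ | $b_{n-1}$ | $c_{n-1}$ | $d_{n-1}$ | $e_{n-1}$ |
| $n$ | $a_{n-1} + b_{n-1}$ | $a_{n-1} + c_{n-1}$ | $a_{n-1} + d_{n-1}$ | $a_{n-1} + e_{n-1}$ | $a_{n-1}$ |

Reading the level-$n$ row as $(a_n, b_n, c_n, d_n, e_n)$ and substituting from right to left:

$$
\begin{aligned}
e_n &= a_{n-1} \\
d_n &= a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2} \\
c_n &= a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3} \\
&\;\;\vdots
\end{aligned}
$$

Therefore

$$
\begin{aligned}
e_n &= a_{n-1} \\
d_n &= a_{n-1} + a_{n-2} \\
c_n &= a_{n-1} + a_{n-2} + a_{n-3} \\
b_n &= a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} \\
a_n &= a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}
\end{aligned}
$$

with $a_0 = 1$ and $a_i = 0$ for $i = -1, -2, -3, -4$.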

$$
\begin{aligned}
e_n &= a_{n-1} \\
d_n &= a_{n-1} + a_{n-2} \\
c_n &= a_{n-1} + a_{n-2} + a_{n-3} \\
b_n &= a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} \\
a_n &= a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}
\end{aligned}
$$

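| $i$ | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $a_i$ | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 4 | 8 | 16 | 31 | 61 |
| $b_i$ |   |   |   |   | 0 | 1 | 2 | 4 | 8 | 15 | 30 | 59 |
| $c_i$ |   |   |   |   | 0 | 1 | 2 | 4 | 7 | 14 | 28 | 55 |
| $d_i$ |   |   |   |   | 0 | 1 | 2 | 3 | 6 | 12 | 24 | 47 |
| $e_i$ |   |   |   |   | 0 | 1 | 1 | 2 | 4 | 8 | 16 | 31 |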

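The sequence $a_i$, $i = -4, -3, -2, -1, 0, 1, 2, \dots$, is a Fibonacci-type sequence.

"The $k$-th order Fibonacci number":

$$
F_n^{(k)} = F_{n-1}^{(k)} + F_{n-2}^{(k)} + \dots + F_{n-k}^{(k)},
\qquad
F_n^{(k)} =
\begin{cases}
0, & 0 \le n \le k-2 \\
1, & n = k-1
\end{cases}
$$

e.g.) The second order Fibonacci number: 0, 1, 1, 2, 3, 5, ...

$$
F_n^{(2)} = F_{n-1}^{(2)} + F_{n-2}^{(2)},
\qquad
F_n^{(2)} =
\begin{cases}
0, & \text{if } n = 0 \\
1, & \text{if } n = 1
\end{cases}
$$

Fibonacci number !!!

$a_n = F_{n+k-1}^{(k)}$ if $k$ (input) tapes are used. Why?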
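The definition can be turned into a few lines of code and compared against the $a_i$ values tabulated earlier. This is a sketch for checking the numbers only; the function name `kth_order_fibonacci` is an assumption.

```python
def kth_order_fibonacci(k, count):
    """Return the first `count` k-th order Fibonacci numbers F_0, F_1, ..."""
    # Initial conditions: F_n = 0 for 0 <= n <= k-2 and F_{k-1} = 1.
    f = [0] * (k - 1) + [1]
    while len(f) < count:
        f.append(sum(f[-k:]))   # each later term is the sum of the previous k
    return f[:count]

second_order = kth_order_fibonacci(2, 7)
fifth_order = kth_order_fibonacci(5, 12)
# With k = 5 input tapes, a_n = F_{n+4}: skip the first four terms.
a_values = fifth_order[4:]
```

For $k = 5$ the shifted sequence `a_values` matches $a_0, \dots, a_7$ = 1, 1, 2, 4, 8, 16, 31, 61.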

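What if the number of initial runs is not a perfect Fibonacci distribution? Use dummy runs !!!

Example: 5 input tapes and 53 initial runs. Rows in parentheses give the increase over the previous level.

| Level | T1 | T2 | T3 | T4 | T5 | Total |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 5 |
| 2 | 2 | 2 | 2 | 2 | 1 | 9 |
|   | (1) | (1) | (1) | (1) | (0) |   |
| 3 | 4 | 4 | 4 | 3 | 2 | 17 |
|   | (2) | (2) | (2) | (1) | (1) |   |
| 4 | 8 | 8 | 7 | 6 | 4 | 33 |
|   | (4) | (4) | (3) | (3) | (2) |   |
| 5 | 16 | 15 | 14 | 12 | 8 | 65 > 53 |
|   | (8) | (7) | (7) | (6) | (4) |   |
| ... | ... | ... | ... | ... | ... | ... |

[Figure: runs (34) through (53) are placed on T1 to T5 on top of the level-4 distribution; the positions left unfilled at level 5 become dummy runs.]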
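Finding the level and the number of dummy runs is a short loop over the perfect distributions. The sketch below assumes the rule $(a, b, c, d, e) \to (a+b, a+c, a+d, a+e, a)$ from the earlier slides; the function name `level_for_runs` is hypothetical, and the sketch does not decide which tape each dummy run goes to.

```python
def level_for_runs(runs, tapes=5):
    """Find the first perfect-distribution level whose total is >= runs.

    Returns (level, distribution, number_of_dummy_runs).
    """
    dist = (1,) + (0,) * (tapes - 1)   # level 0
    level = 0
    while sum(dist) < runs:
        a = dist[0]
        dist = tuple(a + x for x in dist[1:]) + (a,)
        level += 1
    return level, dist, sum(dist) - runs

level, dist, dummies = level_for_runs(53)
```

For 53 initial runs this stops at level 5, whose total of 65 leaves 12 positions to be filled with dummy runs.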

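Dummy runs assigned to T1 to T5: (2) (2) (2) (3) (3).

[Figure: tape contents T1 to T6 after the first merge phase, with $1^8$, $1^7$, $1^6$, $1^4$ remaining on T1 to T4 and the merged runs on T6.]

Not best, but simple and good !!! For a better one, see Knuth !!!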

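Example (3 tapes). $(s)^n$ denotes $n$ runs of size $s$; the run counts follow the Fibonacci numbers 0, 1, 1, 2, 3, 5, 8.

| T1 | T2 | T3 |
|---|---|---|
| $(k)^8$ | $(k)^5$ | - |
| $(k)^3$ | - | $(2k)^5$ |
| - | $(3k)^3$ | $(2k)^2$ |
| $(5k)^2$ | $(3k)^1$ | - |
| $(5k)^1$ | - | $(8k)^1$ |
| - | $(13k)^1$ | - |

Runs on the two input tapes (sizes and I/O counts in units of $k$):

| # of runs | Run size (k) | # of pairs | # of I/O's (k) |
|---|---|---|---|
| 8, 5 | 1, 1 | 5 | 10 |
| 5, 3 | 2, 1 | 3 | 9 |
| 3, 2 | 3, 2 | 2 | 10 |
| 2, 1 | 5, 3 | 1 | 8 |
| 1, 1 | 8, 5 | 1 | 13 |
| 1 | 13 |   |   |

How many passes over the data?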
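The phase-by-phase table can be reproduced by a small simulation that tracks only run counts and run sizes, not the records themselves. This is a sketch under that simplification; the function name `polyphase_3tape` is an assumption, and sizes are in units of $k$.

```python
def polyphase_3tape(runs_a, runs_b):
    """Simulate a 3-tape polyphase merge on run counts only.

    Each tape is (number_of_runs, run_size). Returns one tuple per phase:
    (runs on fuller tape, runs on other tape, pairs merged, I/O in units of k).
    """
    a, b = (runs_a, 1), (runs_b, 1)
    phases = []
    while a[0] > 0 and b[0] > 0:
        if a[0] < b[0]:
            a, b = b, a                      # keep the fuller tape in `a`
        pairs = b[0]                         # merge until the smaller tape empties
        merged_size = a[1] + b[1]
        phases.append((a[0], b[0], pairs, pairs * merged_size))
        # Leftover runs stay on `a`; the merged runs sit on the output tape.
        a, b = (a[0] - pairs, a[1]), (pairs, merged_size)
    return phases

phases = polyphase_3tape(8, 5)
io_per_phase = [p[3] for p in phases]
total_io = sum(io_per_phase)
```

For the 8 and 5 starting runs, the per-phase I/O comes out as 10, 9, 10, 8, 13, a total of 50 units of $k$. Since the file holds 13 units, the merge phases amount to 50/13, roughly 3.85 passes over the data, not counting the initial distribution.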

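Total number of initial runs = $F_s$ for some $s$, where $F_s$ is the $s$-th Fibonacci number ($F_s = F_{s-1} + F_{s-2}$).

| T1 | T2 | T3 |
|---|---|---|
| $F_{s-1}$ | $F_{s-2}$ | - |
| $F_{s-3}$ | - | $F_{s-2}$ |
| - | $F_{s-3}$ | $F_{s-4}$ |
| ... | ... | ... |

See Fig. p.107, textbook !!!

- Total # of I/O operations = …
- # of passes = …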

Lemma: …

[proof] (By induction on $s$)
- ($s = 2$) LHS = …, RHS = …
- ($s = 3$) LHS = …, RHS = …
- ($s = k$) Suppose that …
- ($s = k+1$) Exercise !!!

See pages 106-107 in textbook !!!

From the previous lemma, # of passes = F s = r (1) why?. Golden Ratio !!! From (1),

Theorem: For a polyphase merge on 3 tapes with the initial runs distributed as $F_{s-1}$ and $F_{s-2}$, where $F_s = r$ = # of initial runs,

\# of passes ≈ $1.04 \log_2 r$

APPROXIMATED BEHAVIOR OF POLYPHASE MERGE SORTING

| Tapes | Phases | Passes | Pass/phase (percent) | Growth ratio |
|---|---|---|---|---|
| 3 | 2.078 ln S + 0.672 | 1.504 ln S + 0.992 | 72 | 1.6180340 |
| 4 | 1.641 ln S + 0.364 | 1.015 ln S + 0.965 | 62 | 1.8392868 |
| 5 | 1.524 ln S + 0.078 | 0.863 ln S + 0.921 | 57 | 1.9275620 |
| 6 | 1.479 ln S + 0.185 | 0.795 ln S + 0.864 | 54 | 1.9659482 |
| 7 | 1.460 ln S + 0.424 | 0.762 ln S + 0.797 | 52 | 1.9835828 |
| 8 | 1.451 ln S + 0.642 | 0.744 ln S + 0.723 | 51 | 1.9919642 |
| 9 | 1.447 ln S + 0.838 | 0.734 ln S + 0.646 | 51 | 1.9960312 |
| 10 | 1.445 ln S + 1.017 | 0.728 ln S + 0.568 | 50 | 1.9980295 |
| 20 | 1.443 ln S + 2.170 | 0.721 ln S - 0.030 | 50 | 1.9999981 |

APPROXIMATED BEHAVIOR OF CASCADE MERGE SORTING

| Tapes | Phases | Passes | Growth ratio |
|---|---|---|---|
| 3 | 2.078 ln S + 0.672 | 1.504 ln S + 0.992 | 1.6180340 |
| 4 | 1.235 ln S + 0.754 | 1.012 ln S + 0.820 | 2.2469796 |
| 5 | 0.946 ln S + 0.796 | 0.897 ln S + 0.800 | 2.8793852 |
| 6 | 0.796 ln S + 0.821 | 0.773 ln S + 0.808 | 3.5133371 |
| 7 | 0.703 ln S + 0.839 | 0.691 ln S + 0.822 | 4.1481149 |
| 8 | 0.639 ln S + 0.852 | 0.632 ln S + 0.834 | 4.7833861 |
| 9 | 0.592 ln S + 0.861 | 0.587 ln S + 0.845 | 5.4189757 |
| 10 | 0.555 ln S + 0.869 | 0.552 ln S + 0.854 | 6.0547828 |
| 20 | 0.397 ln S + 0.905 | 0.397 ln S + 0.901 | 12.4174426 |

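Cascade Merge: perfect distribution

| Level | $a_i$ | $b_i$ | $c_i$ | $d_i$ | $e_i$ |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 5 | 4 | 3 | 2 | 1 |
| 3 | 15 | 14 | 12 | 9 | 5 |
| 4 | 55 | 50 | 41 | 29 | 15 |
| $n$ | $a_n$ | $b_n$ | $c_n$ | $d_n$ | $e_n$ |
| $n+1$ | $a_n + b_n + c_n + d_n + e_n$ | $a_{n+1} - e_n$ | $b_{n+1} - d_n$ | $c_{n+1} - c_n$ | $d_{n+1} - b_n = a_n$ |

For details, see Knuth Vol. III !!!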
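The cascade rule can be checked the same way as the polyphase one. A minimal sketch, with the hypothetical name `next_cascade_level`, that builds each level from the previous one by the chain of subtractions in the table:

```python
def next_cascade_level(dist):
    """(a, b, c, d, e) -> next level of the cascade perfect distribution.

    a' = a+b+c+d+e, b' = a' - e, c' = b' - d, d' = c' - c, e' = d' - b.
    """
    nxt = [sum(dist)]
    for prev in reversed(dist[1:]):    # subtract e, then d, then c, then b
        nxt.append(nxt[-1] - prev)
    return tuple(nxt)

levels = [(1, 0, 0, 0, 0)]
for _ in range(4):
    levels.append(next_cascade_level(levels[-1]))
```

The level-4 entry is (55, 50, 41, 29, 15), which sums to the 190 initial runs used in the cascade merge example.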
