8. External Sorting Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer. What shall we do?

1 8. External Sorting. Suppose that a file is so large that the whole file cannot be accommodated in the internal memory of a computer. What shall we do? We need to use an external storage device. External sorting: disk sort and tape sort. What is the major difference between the two external sorts?

2 Sorting with Disk: an internal sort creates the runs, then k-way merging ("mergesort") merges them.

3 Example: 4500 records, 250 records/block, available memory = 3 blocks. Definition: a segment of a file is said to be a run if all the records in the segment are sorted. [Figure: initial runs 1, 3, 5 on disk D1 and runs 2, 4, 6 on disk D2.]
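A minimal sketch of run creation for the example on this slide, assuming each memory load (3 blocks of 250 records = 750 records) is sorted internally and written out as one run. The function name `make_initial_runs` and the in-memory list standing in for the file are illustrative assumptions, not part of the slides.

```python
import random

def make_initial_runs(records, records_per_block=250, memory_blocks=3):
    """Split `records` into memory-sized chunks and sort each chunk (one run per chunk)."""
    capacity = records_per_block * memory_blocks  # records that fit in memory at once
    return [sorted(records[i:i + capacity]) for i in range(0, len(records), capacity)]

random.seed(0)
data = [random.randrange(10**6) for _ in range(4500)]  # stand-in for the 4500-record file
runs = make_initial_runs(data)
print(len(runs), [len(r) for r in runs])
```

With these parameters the file splits into 4500 / 750 = 6 runs of 750 records each.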

4 [Figure: runs on disks D1 and D2 merged onto disks D3 and D4; label: the size of a run.]

5 Run size. How many passes? $1 + \log_2 r$, where r = # of initial runs.

6

7 k-way merging. [Figure: merge tree of height $\log_k r$.] # of passes: $1 + \log_k r$. # of I/O operations? $O(n \log_k r)$, which is better than 2-way merging.
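The pass count on this slide can be checked with a short sketch. It assumes one pass to create the r initial runs and then repeated k-way merge passes until a single run remains; `passes` is a hypothetical helper name, and `heapq.merge` from the Python standard library stands in for one k-way merge step.

```python
import heapq

def passes(r, k):
    """One pass to create r initial runs, plus k-way merge passes until one run is left."""
    merge_passes = 0
    while r > 1:
        r = -(-r // k)      # ceil(r / k): each merge pass combines up to k runs into one
        merge_passes += 1
    return 1 + merge_passes

# One k-way merge step (k = 3) over three sorted runs.
merged = list(heapq.merge([1, 4, 7], [2, 5, 8], [3, 6, 9]))

print(passes(6, 2), passes(6, 3), merged)
```

For the 6 initial runs of the earlier example, 2-way merging needs more passes over the data than 3-way merging, which is the point of the slide.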

8 How about # of comparisons? Is k-way merging always better than 2-way merging?

9 Replacement Selection. [Figure: k-way merge tree.] # of passes: #(P) = $1 + \lceil \log_k r \rceil$. #(P) decreases as k increases or as r decreases; r decreases as the run size increases.

10 # of comparisons (k-way merge)

11 How many comparisons in a pass? $n \log_2 k$. Why? Total # of comparisons = (# of passes) × (# of comparisons in a pass) = $(\log_k r)(n \log_2 k) = n \log_2 r$, independent of k. So #(C) decreases as r decreases.
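The step that makes k disappear is a change of base in the logarithm. Written out, using only the symbols on this slide (n records, r initial runs, k-way merging):

```latex
\begin{align*}
(\log_k r)\,(n \log_2 k)
  &= \frac{\log_2 r}{\log_2 k}\cdot n \log_2 k \\
  &= n \log_2 r .
\end{align*}
```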

12 How to increase the run size (initial run size)? Keys are read m at a time: $x_1, x_2, x_3, \ldots, x_m$ (m keys), $x_{m+1}, x_{m+2}, x_{m+3}, \ldots, x_{2m}$ (m keys), $x_{2m+1}, x_{2m+2}, x_{2m+3}, \ldots$ (m keys), so r = # of runs = $\lceil n/m \rceil$. Any improvement? Observation: see p.94 in the textbook.

13 Keys: 4, 2, 32, 12, 18, 24, 91, 11 (record size >> pointer size). Why do we need this?

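14 A tree of losers. Leaf keys: 4, 2, 32, 12, 18, 24, 91, 11. Each internal node holds a parent pointer and a loser pointer. [Figure: the tree of losers for these keys, with the winner at the top.]
Updating pointers:
    ptr := winner^.parent;
    while ptr ≠ nil do
        if (ptr^.loser^.key < winner^.key) then
            interchange(ptr^.loser, winner);
        end {if}
        ptr := ptr^.parent;
    end {while}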
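The pointer update on this slide can be sketched in Python. This is an array-based variant, not the slide's pointer structure: it assumes the tree is stored heap-style, with `tree[0]` holding the winner's leaf index and `tree[1..k-1]` holding the loser's leaf index at each internal node. The function names are hypothetical.

```python
import math

def build_loser_tree(keys):
    """tree[0] = leaf index of the winner; tree[1..k-1] = leaf index of each node's loser."""
    k = len(keys)
    tree = [0] * k
    winner_at = list(range(2 * k))           # slots k..2k-1 are the leaves
    for i in range(k):
        winner_at[k + i] = i
    for node in range(k - 1, 0, -1):         # play the matches bottom-up
        left, right = winner_at[2 * node], winner_at[2 * node + 1]
        if keys[left] <= keys[right]:
            winner_at[node], tree[node] = left, right
        else:
            winner_at[node], tree[node] = right, left
    tree[0] = winner_at[1]
    return tree

def replay(tree, keys, leaf):
    """Replay matches on the path from `leaf` (the previous winner) to the root."""
    k = len(keys)
    winner = leaf
    node = (k + leaf) // 2                   # ptr := winner^.parent
    while node >= 1:                         # while ptr <> nil
        if keys[tree[node]] < keys[winner]:
            tree[node], winner = winner, tree[node]   # interchange(ptr^.loser, winner)
        node //= 2                           # ptr := ptr^.parent
    tree[0] = winner

def tournament_sort(keys):
    """Repeatedly output the winner, mark its leaf as exhausted, and replay."""
    keys = list(keys)
    if not keys:
        return []
    tree = build_loser_tree(keys)
    out = []
    for _ in range(len(keys)):
        w = tree[0]
        out.append(keys[w])
        keys[w] = math.inf                   # this leaf has no more records
        replay(tree, keys, w)
    return out

print(tournament_sort([4, 2, 32, 12, 18, 24, 91, 11]))
```

Each replay touches only the nodes on one leaf-to-root path, so selecting the next record costs about $\log_2 k$ comparisons for k leaves.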

15 Explain p , textbook. Exercise: in a complete 2-tree T with n leaf nodes, show that the total # of nodes in T is 2n - 1.

16 Performance Analysis (average size of runs). $m_0$ = # of records in (real) memory. H. Seward (M.S. thesis, MIT, 1954) gave a good reason to believe that a run contains more than $1.5 m_0$ records (no proof). E. Friend (JACM 3, 1956) found experimentally that the run length is about $2 m_0$. E. Moore (1961) proved that $2 m_0$ is the expected run length.

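17 Sketch of Moore's proof: the snowplow argument. [Figure: a snowplow clearing a track while snow falls uniformly; labels $2m_0$ and $m_0$.] Uniform distribution ⇒ expected run length $2m_0$.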
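The $2m_0$ claim can be checked empirically with a small simulation of replacement selection. This sketch uses Python's `heapq` rather than a tree of losers, and tags each key with a run number so that keys too small for the current run wait for the next one; the function name and the uniform random input are assumptions made for illustration.

```python
import heapq
import itertools
import random

def replacement_selection(records, m):
    """Split `records` into sorted runs using a heap that holds at most m records."""
    it = iter(records)
    # Heap entries are (run number, key), so keys of a later run sort after the current run.
    heap = [(0, x) for x in itertools.islice(it, m)]
    heapq.heapify(heap)
    runs, current, current_run = [], [], 0
    while heap:
        run, x = heapq.heappop(heap)
        if run != current_run:               # the current run is finished
            runs.append(current)
            current, current_run = [], run
        current.append(x)
        for nxt in itertools.islice(it, 1):  # read one more record, if any is left
            # A key smaller than the one just written must wait for the next run.
            heapq.heappush(heap, (run if nxt >= x else run + 1, nxt))
    if current:
        runs.append(current)
    return runs

random.seed(1)
m = 100
data = [random.random() for _ in range(20000)]
runs = replacement_selection(data, m)
average_run_length = len(data) / len(runs)
print(len(runs), average_run_length / m)
```

On uniformly random keys the printed ratio should come out close to 2, in line with Moore's result; on already sorted input the whole file becomes a single run.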

18 Tape Sorting: balanced k-way merging (similar to disk sorting), polyphase merging, cascade merging.

19 Polyphase Merging (Motivation)
– Records (R1, R2, ..., R5000)
– length(Ri) = 20 bytes
– Only 1000 records fit in the internal memory at one time (= 20k bytes)
– 4 tapes available
Balanced 2-way merge (Ri,j denotes a sorted run holding records i through j):
          T1                               T2                       T3                    T4
initial   R1,1000 R2001,3000 R4001,5000    R1001,2000 R3001,4000    empty                 empty
merge 1   empty                            empty                    R1,2000 R4001,5000    R2001,4000
merge 2   R1,4000                          R4001,5000               empty                 empty
merge 3   empty                            empty                    R1,5000               empty
Total # of operations = 15000

20 The same file, merged 3-way first:
          Tape 1                Tape 2                  Tape 3        Tape 4
initial   R1,1000 R3001,4000    R1001,2000 R4001,5000   R2001,3000    empty
(rewind)
merge 1   R3001,4000            R4001,5000              empty         R1,3000
merge 2   empty                 empty                   R1,5000       empty
Total # of I/O operations = 8000. Balanced merge is not always best.

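21 What if only 3 tapes are available?
Initial distribution: Tape 1 = R1,1000; R2001,3000; R4001,5000. Tape 2 = R1001,2000; R3001,4000. Tape 3 = empty.
After the first merge: Tape 1 = empty. Tape 2 = empty. Tape 3 = R1,2000; R2001,4000; R4001,5000.
[Table residue: the remaining rows show the runs R1,2000, R2001,4000 and R4001,5000 spread over the tapes again, then R1,4000 and R4001,5000, and finally R1,5000; the exact tape of each run is not recoverable.]
Total # of I/O operations = 21,000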

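22 Another schedule with 3 tapes:
Initial distribution: Tape 1 = R1,1000; R2001,3000; R4001,5000. Tape 2 = R1001,2000; R3001,4000. Tape 3 = empty.
After merging the first two pairs of runs: Tape 1 = R4001,5000. Tape 2 = empty. Tape 3 = R1,2000; R2001,4000. (rewind)
Next: Tape 1 = empty. Tape 2 = the merged run "R1,2000; 4001,5000". (rewind)
Finally: Tape 1 = R1,5000. Tape 2 = empty. Tape 3 = empty.
Total # of I/O operations = 11,000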

23 Polyphase merge. [Table: number of runs on tapes T1, T2, T3, T4, T5, T… after each phase; cell values lost in extraction.] How to assign initial runs?

24 Cascade Merge. [Table: number of runs on tapes T1, T2, T3, T4, T5, T… after each pass; only the fragments "5 15", "15 5" and "5 1" survive extraction.]

25 Polyphase Merge (Gilstad, 1960). [Table: runs on T1 to T6 in each phase; cell values lost in extraction.] Perfect distributions on the five input tapes, level by level: {1,0,0,0,0}, {1,1,1,1,1}, {2,2,2,2,1}, {4,4,4,3,2}, {8,8,7,6,4}, {16,15,14,12,8}, {31,30,28,24,16}. Perfect Fibonacci distribution. What is the underlying rule?

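26 [Table: perfect distribution by level i, with columns $a_i$, $b_i$, $c_i$, $d_i$, $e_i$; cell values lost in extraction.]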

27 Level   a            b            c            d            e
   1       a_0 + b_0    a_0 + c_0    a_0 + d_0    a_0 + e_0    a_0
   2       a_1 + b_1    a_1 + c_1    a_1 + d_1    a_1 + e_1    a_1
   3       a_2 + b_2    a_2 + c_2    a_2 + d_2    a_2 + e_2    a_2
   n       a_n          b_n          c_n          d_n          e_n
   n+1     a_n + b_n    a_n + c_n    a_n + d_n    a_n + e_n    a_n
where a_n ≥ b_n ≥ c_n ≥ d_n ≥ e_n.
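The level-to-level rule on this slide, (a, b, c, d, e) → (a+b, a+c, a+d, a+e, a), can be turned into a small generator. This is a sketch; the function name and the `tapes` parameter are illustrative assumptions, with level 0 taken to be (1, 0, 0, 0, 0) as in the list of perfect distributions.

```python
def perfect_distributions(levels, tapes=5):
    """Yield the perfect polyphase distribution for levels 0..levels on `tapes` input tapes."""
    dist = [1] + [0] * (tapes - 1)           # level 0: a single run on the first tape
    yield tuple(dist)
    for _ in range(levels):
        a = dist[0]
        # (a, b, c, d, e) -> (a+b, a+c, a+d, a+e, a)
        dist = [a + x for x in dist[1:]] + [a]
        yield tuple(dist)

for level, dist in enumerate(perfect_distributions(6)):
    print(level, dist, sum(dist))
```

The printed rows reproduce the perfect Fibonacci distributions listed for five input tapes, together with the total number of runs at each level.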

28 [Table: level i, $a_i$, $b_i$, $c_i$, $d_i$, $e_i$ and the output tape (T1 to T5) at each level; cell values lost in extraction.]

29 Level n-1: $a_{n-1}, b_{n-1}, c_{n-1}, d_{n-1}, e_{n-1}$.
Level n: $a_n = a_{n-1} + b_{n-1}$, $b_n = a_{n-1} + c_{n-1}$, $c_n = a_{n-1} + d_{n-1}$, $d_n = a_{n-1} + e_{n-1}$, $e_n = a_{n-1}$.
Therefore:
$e_n = a_{n-1}$
$d_n = a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3}$
...
Summary:
$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$
with $a_0 = 1$ and $a_i = 0$ for $i = -1, -2, -3, -4$.

30 $e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$

31 [Table: $a_i$, $b_i$, $c_i$, $d_i$, $e_i$ by level i; only zero entries for $b_i$ through $e_i$ survive extraction.]

32

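33 $a_i$ for $i = -4, -3, -2, -1, 0, 1, 2, \ldots$ [values lost in extraction].
"The k-th order Fibonacci number":
$F_n^{(k)} = F_{n-1}^{(k)} + F_{n-2}^{(k)} + \cdots + F_{n-k}^{(k)}$
$F_n^{(k)} = 0$ for $0 \le n \le k-2$
$F_n^{(k)} = 1$ for $n = k-1$
E.g., the second-order Fibonacci number:
$F_n^{(2)} = F_{n-1}^{(2)} + F_{n-2}^{(2)}$, with $F_n^{(2)} = 0$ if $n = 0$ and $F_n^{(2)} = 1$ if $n = 1$.
This is the ordinary Fibonacci number.
$a_n = F_{n+k-1}^{(k)}$ if k (input) tapes are used. Why?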
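A short sketch that generates the k-th order Fibonacci numbers from the definition on this slide, so the claim about $a_n$ can be checked for k = 5 input tapes. The function name `kth_order_fibonacci` is a hypothetical helper, not something from the slides.

```python
def kth_order_fibonacci(k, count):
    """Return [F_0, F_1, ..., F_{count-1}] of the k-th order Fibonacci numbers."""
    f = [0] * (k - 1) + [1]          # F_0 .. F_{k-2} = 0 and F_{k-1} = 1
    while len(f) < count:
        f.append(sum(f[-k:]))        # each term is the sum of the previous k terms
    return f[:count]

k = 5
fib5 = kth_order_fibonacci(k, 11)
# a_n = F_{n+k-1} for n = 0, 1, ..., 6
a = [fib5[n + k - 1] for n in range(7)]
print(kth_order_fibonacci(2, 8), fib5, a)
```

For k = 5 the values of `a` match the first components of the perfect distributions: 1, 1, 2, 4, 8, 16, 31.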

34 What if the number of runs does not give a perfect Fibonacci distribution? Use dummy runs. Example: 5 input tapes and 53 initial runs. [Table: perfect distribution by level on T1 to T5, up to the first level whose total is > 53; that row is annotated "(8 7 7 6 4)".] [Table: the order in which runs (34) through (53) are placed on T1 to T5; tape positions lost in extraction.]
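For the example on this slide, the number of dummy runs can be worked out by climbing the perfect distributions until the total is at least 53. The sketch below assumes the level rule (a, b, c, d, e) → (a+b, a+c, a+d, a+e, a) from the earlier slides; `level_for` is a hypothetical helper, and it only counts the dummy runs without deciding which tapes receive them.

```python
def level_for(runs, tapes=5):
    """Return (level, perfect distribution, number of dummy runs) for `runs` initial runs."""
    dist = [1] + [0] * (tapes - 1)
    level = 0
    while sum(dist) < runs:              # climb until the perfect total covers all runs
        a = dist[0]
        dist = [a + x for x in dist[1:]] + [a]
        level += 1
    return level, tuple(dist), sum(dist) - runs

level, dist, dummies = level_for(53)
print(level, dist, dummies)
```

With 53 runs the first sufficient level holds 65 runs, so 12 dummy runs are needed; the step from the previous level (8, 8, 7, 6, 4) to this one adds (8, 7, 7, 6, 4) runs per tape.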

35 [Table: T1 to T6 with dummy runs; surviving fragments: (2)(2)(2)(3)(3), 5, 8, (2)(2)(2)(3).] Not the best, but simple and good. For a better one, see Knuth.

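36 Example (3 tapes). Notation: (sk)^c means c runs, each of size sk.
Phase     T1        T2        T3
initial   (k)^8     (k)^5     empty
1         (k)^3     empty     (2k)^5
2         empty     (3k)^3    (2k)^2
3         (5k)^2    (3k)^1    empty
4         (5k)^1    empty     (8k)^1
5         empty     (13k)^1   empty
The run counts follow the Fibonacci numbers 0, 1, 1, 2, 3, 5, 8.
# of runs on the two input tapes   run sizes (k)   # of pairs   # of I/O's (k) = pairs × merged run size
8, 5                               1, 1            5            10
5, 3                               2, 1            3            9
3, 2                               3, 2            2            10
2, 1                               5, 3            1            8
1, 1                               8, 5            1            13
How many passes over the data?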
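The question on this slide can be answered by simulating the phases. The sketch below assumes the two input tapes start with consecutive Fibonacci numbers of runs, all of size 1 (in units of k), and it counts one I/O per record read in a merge; the function name and this counting convention are assumptions for illustration.

```python
def polyphase3_io(runs_a, runs_b):
    """Simulate 3-tape polyphase merging; return per-phase I/O counts and the total.

    Each tape is modelled as (number of runs, run size); sizes are in units of k.
    """
    a, b = (runs_a, 1), (runs_b, 1)
    per_phase = []
    while a[0] > 0 and b[0] > 0:
        if a[0] < b[0]:
            a, b = b, a                      # let `a` be the tape with more runs
        pairs = b[0]                         # merge until the shorter tape is empty
        merged_size = a[1] + b[1]
        per_phase.append(pairs * merged_size)
        # The leftover runs stay on `a`; the merged runs form the new second input tape.
        a, b = (a[0] - pairs, a[1]), (pairs, merged_size)
    return per_phase, sum(per_phase)

per_phase, total = polyphase3_io(8, 5)
passes = total / (8 + 5)                     # total I/O divided by the file size (13k)
print(per_phase, total, passes)
```

For 8 and 5 initial runs the phases move 10k, 9k, 10k, 8k and 13k records, 50k in total, which is 50/13 (just under 4) passes over the 13k records.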

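37 Total number of initial runs = $F_s$ for some s, where $F_s$ is the s-th Fibonacci number and $F_s = F_{s-1} + F_{s-2}$.
Phase     T1         T2         T3
initial   F_{s-1}    F_{s-2}    empty
1         F_{s-3}    empty      F_{s-2}
2         empty      F_{s-3}    F_{s-4}
...
See Fig. p.107, textbook. Total # of I/O operations = [formula lost in extraction]. # of passes = [formula lost in extraction].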

38 Lemma: [statement lost in extraction]. [Proof] By induction on s. (s = 2) LHS = [lost], RHS = [lost]. (s = 3) LHS = [lost], RHS = [lost]. (s = k) Suppose that [hypothesis lost]. (s = k + 1) Exercise. See page in textbook.

39 From the previous lemma, # of passes = [formula lost in extraction], where $F_s = r$ (1). Why? Golden ratio. From (1), [derivation lost in extraction].

40 Theorem: for a polyphase merge on 3 tapes, with $F_{s-1}$ and $F_{s-2}$ initial runs on the two input tapes and $F_s = r$ = # of initial runs, # of passes is approximately $1.04 \log_2 r$.

41 [Table: Approximate behavior of polyphase merge sorting. Columns: Tapes, Phases, Passes, Pass/phase (percent), Growth ratio. Phases and passes are given as multiples of ln S; the coefficients are lost in extraction.] [Table: Approximate behavior of cascade merge sorting. Columns: Tapes, Phases, Passes, Growth ratio. Entries are given as multiples of ln S; the coefficients are lost in extraction.]

42 Cascade Merge
Level   a_i     b_i     c_i     d_i     e_i
n       a_n     b_n     c_n     d_n     e_n
n+1     a_{n+1} = a_n + b_n + c_n + d_n + e_n
        b_{n+1} = a_{n+1} - e_n
        c_{n+1} = b_{n+1} - d_n
        d_{n+1} = c_{n+1} - c_n
        e_{n+1} = d_{n+1} - b_n (= a_n)
Perfect distribution; for details see Knuth, Vol. III.
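A sketch of the cascade level rule, under the assumption that the table on this slide reads as: the first tape gets $a_n + b_n + c_n + d_n + e_n$ runs and each later tape drops one more trailing term, down to $a_n$ on the last tape. The function name is hypothetical, and level 0 is assumed to be (1, 0, 0, 0, 0) as for polyphase merging.

```python
import itertools

def cascade_distributions(levels, tapes=5):
    """Yield the perfect cascade distribution for levels 0..levels on `tapes` input tapes."""
    dist = [1] + [0] * (tapes - 1)
    yield tuple(dist)
    for _ in range(levels):
        prefix_sums = list(itertools.accumulate(dist))   # a, a+b, a+b+c, ...
        dist = prefix_sums[::-1]                          # largest total goes on the first tape
        yield tuple(dist)

for level, dist in enumerate(cascade_distributions(4)):
    print(level, dist, sum(dist))
```

The totals grow faster per level than in the polyphase case, which is why the two methods are compared by their growth ratios.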

