Published by Elijah Jewison. Modified about 1 year ago.

1
8. External Sorting
Suppose that a file is so large that it cannot fit in the internal memory of a computer. What shall we do? We need to use an external storage device!
External Sorting
- Disk Sort
- Tape Sort
What is the major difference between the two external sorts?

2
Sorting with Disk
k-way merging (“mergesort”): internal sort, then merge

3
Example
- 4500 records
- 250 records/block
- available memory = 3 blocks
Definition: A segment of a file is said to be a run if all the records in the segment are sorted.
[Figure: the input file I is split into runs; runs 1, 3, 5 go to disk D1 and runs 2, 4, 6 go to disk D2]
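The run-creation step of this example can be sketched in Python. This is a minimal illustration, not the textbook's procedure: it assumes all 3 memory blocks hold records during the internal sort (so one run holds 3 × 250 = 750 records), and the function name `make_initial_runs` and the random test data are hypothetical.

```python
import random

def make_initial_runs(records, run_capacity):
    """Cut the input into memory-sized chunks and sort each chunk into a run."""
    runs = []
    for start in range(0, len(records), run_capacity):
        chunk = records[start:start + run_capacity]  # what fits in memory at once
        runs.append(sorted(chunk))                   # internal sort
    return runs

random.seed(0)
records_per_block = 250
memory_blocks = 3            # assumption: every block is used for sorting
records = [random.randrange(10**6) for _ in range(4500)]

runs = make_initial_runs(records, records_per_block * memory_blocks)
print(len(runs), "runs of", len(runs[0]), "records")
```

Under that assumption the 4500 records yield 6 runs, which matches the six runs spread over D1 and D2 on the slide.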

4
[Figure: runs of size 3 on D1 and D2 are merged into runs of size 6 on D3 and D4; n: the size of a run]

5
Run size
How many passes? $1 + \log_2 r$ ($r$ = # of initial runs)
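As a worked instance of the pass count, take the numbers from the example on slide 3. Two assumptions are made here that the slide does not state: all 3 blocks hold records during the internal sort, and the logarithm is rounded up when $r$ is not a power of 2.

```latex
r = \frac{4500}{3 \times 250} = 6,
\qquad
\#\text{passes} = 1 + \lceil \log_2 6 \rceil = 1 + 3 = 4 .
```

The leading 1 is the pass that creates the initial runs; the three merge passes take 6 runs to 3, then 2, then 1.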

6

7
k-way merging
[Figure: merge tree of height $\log_k r$ over the initial runs]
# of passes: $1 + \log_k r$
# of I/O operations? $O(n \log_k r)$: better than 2-way merging!
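A small in-memory sketch of k-way merging, assuming the runs are ordinary Python lists rather than disk blocks. It uses the standard library's `heapq.merge` for the k-way merge itself; `merge_pass`, `merge_all`, and the six sample runs are hypothetical names and data for illustration.

```python
import heapq

def merge_pass(runs, k):
    """One pass: merge each group of k consecutive runs into one longer run."""
    return [list(heapq.merge(*runs[i:i + k])) for i in range(0, len(runs), k)]

def merge_all(runs, k):
    """Repeat passes until one run is left; return (sorted run, # of merge passes)."""
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, k)
        passes += 1
    return (runs[0] if runs else []), passes

# Six hypothetical sorted runs of three keys each.
runs = [[i, i + 6, i + 12] for i in range(6)]
for k in (2, 3):
    merged, passes = merge_all(runs, k)
    print("k =", k, "merge passes =", passes)
```

With 6 runs, 2-way merging needs 3 merge passes and 3-way merging needs 2, so a larger k means fewer passes over the data.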

8
How about # of comparisons? Is k-way merging always better than 2-way merging?

9
Replacement Selection
# of passes: $1 + \log_k r$
[Plots: #(P) versus $k$; #(P) versus $r$; $r$ versus run size]

10
# of comparisons (k-way merge)

11
How many comparisons in a pass? $n \log_2 k$. Why?
Total # of comparisons = (# of passes) × (# of comparisons in a pass) = $(\log_k r)(n \log_2 k) = n \log_2 r$, independent of $k$!
[Plot: #(c) versus $r$]
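The step behind "independent of $k$" is the change-of-base rule for logarithms. Written out, with the same symbols as the slide:

```latex
\log_k r = \frac{\log_2 r}{\log_2 k}
\quad\Longrightarrow\quad
(\log_k r)\,(n \log_2 k)
  = \frac{\log_2 r}{\log_2 k}\; n \log_2 k
  = n \log_2 r .
```

So a larger $k$ reduces the number of passes (and the I/O), but each pass costs more comparisons by the same factor.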

12
How to increase run size (initial run size)
$x_1, x_2, x_3, \dots, x_m \mid x_{m+1}, x_{m+2}, x_{m+3}, \dots, x_{2m} \mid x_{2m+1}, x_{2m+2}, x_{2m+3}, \dots$ ($m$ keys in each group)
r = # of runs
Any improvement?
Observation: see p. 94 in the textbook!

13
4,2,32,12,18,24,91,11 (record size >> the size of pointer) why do we need this?

14
A tree of losers
Keys at the leaves: 4, 2, 32, 12, 18, 24, 91, 11. Node fields: parent, loser. The overall winner is held separately.
Updating pointers:
    ptr := winner^.parent;
    while ptr <> nil do
        if (ptr^.loser^.key < winner^.key) then
            interchange(ptr^.loser, winner);
        end {if}
        ptr := ptr^.parent;
    end {while}
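Replacement selection can be sketched in a few lines of Python. Note the substitution: this sketch uses the standard library's binary heap (`heapq`) in place of the tree of losers, which changes the comparison count but not the runs produced. The function name, the `_DONE` sentinel, and the choice of m = 4 for the slide's eight keys are assumptions made for illustration.

```python
import heapq
from itertools import islice

_DONE = object()  # sentinel marking the end of the input

def replacement_selection(keys, m):
    """Return the runs produced by replacement selection with room for m keys."""
    it = iter(keys)
    # Heap entries are (run number, key), so keys frozen for the next run sort last.
    heap = [(0, key) for key in islice(it, m)]
    heapq.heapify(heap)
    runs, run, current = [], [], 0
    while heap:
        run_no, key = heapq.heappop(heap)
        if run_no != current:            # nothing left that fits the current run
            runs.append(run)
            run, current = [], run_no
        run.append(key)
        nxt = next(it, _DONE)
        if nxt is not _DONE:
            # A key smaller than the one just output must wait for the next run.
            heapq.heappush(heap, (current if nxt >= key else current + 1, nxt))
    if run:
        runs.append(run)
    return runs

print(replacement_selection([4, 2, 32, 12, 18, 24, 91, 11], 4))
```

With only 4 keys in memory the first run already holds 7 of the 8 keys, whereas sorting memory-sized chunks would give two runs of 4.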

15
Explain p , textbook !!!
Exercise: In a complete 2-tree T with n leaf nodes, show that the total # of nodes in T = 2n - 1.

16
Performance Analysis (Average size of runs)
$m_0$ = # of records in (real) memory.
- H. Seward (M.S. thesis, MIT, 1954) gave a good reason to believe that a run contains more than $1.5 m_0$ records (no proof).
- E. Friend (JACM 3, 1956) found $2 m_0$ by experiment.
- E. Moore (1961) proved that $2 m_0$ is the expected run length.

17
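Sketch of Moore’s Proof
[Figure: a snowplow on a circular track with snow falling at a uniform rate; $m_0$ of snow lies on the track at any time and the plow removes $2 m_0$ per circuit]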
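The $2 m_0$ figure can be checked empirically. The following is a rough simulation sketch, not a proof: it assumes uniformly random keys, uses a binary heap (`heapq`) rather than a loser tree, and the function name, the seed, and the sizes (m = 100, 50,000 keys) are arbitrary choices for illustration.

```python
import heapq
import random
from itertools import islice

def average_run_length(keys, m):
    """Average length of the runs replacement selection makes with m keys in memory."""
    it = iter(keys)
    heap = [(0, key) for key in islice(it, m)]   # entries are (run number, key)
    heapq.heapify(heap)
    total = 0
    last_run = 0
    while heap:
        run_no, key = heapq.heappop(heap)
        total += 1
        last_run = run_no
        for nxt in islice(it, 1):                # read at most one replacement key
            # Smaller than the key just output: it has to wait for the next run.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
    return total / (last_run + 1) if total else 0.0

random.seed(1)
m = 100
keys = [random.random() for _ in range(50_000)]
avg = average_run_length(keys, m)
print("average run length / m =", avg / m)
```

On random input the printed ratio should come out near 2, in line with Moore's result; the first run and the final runs are shorter, so the average sits slightly below it.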

18
Tape Sorting
- Balanced k-way merging (similar to disk sorting)
- Polyphase merging
- Cascade merging

19
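Polyphase Merging (Motivation)
– Records $(R_1, R_2, \dots, R_{5000})$
– length($R_i$) = 20 bytes
– Only 1000 records fit in the internal memory at one time (20k bytes)
– 4 tapes available
Balanced 2-way merge ($R_{i,j}$ is the sorted run holding records $i$ through $j$):
Initial runs: T1: $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$; T2: $R_{1001,2000}$, $R_{3001,4000}$
After the first merge: T3: $R_{1,2000}$, $R_{4001,5000}$; T4: $R_{2001,4000}$
After the second merge: T1: $R_{1,4000}$; T2: $R_{4001,5000}$
After the third merge: T3: $R_{1,5000}$
Total # of operations = 15000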

20
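Initial runs: Tape 1: $R_{1,1000}$, $R_{3001,4000}$; Tape 2: $R_{1001,2000}$, $R_{4001,5000}$; Tape 3: $R_{2001,3000}$; Tape 4: empty
Merge the first run of Tapes 1, 2, and 3 onto Tape 4: $R_{1,3000}$ (rewind)
Now Tape 1: $R_{3001,4000}$; Tape 2: $R_{4001,5000}$; Tape 4: $R_{1,3000}$
Merge onto Tape 3: $R_{1,5000}$
Total # of I/O operations = 8000
Balanced merge is not always best!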

21
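What if only 3 tapes are available?
Initial runs: Tape 1: $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$; Tape 2: $R_{1001,2000}$, $R_{3001,4000}$
Merge onto Tape 3: $R_{1,2000}$, $R_{2001,4000}$, $R_{4001,5000}$
Redistribute: Tape 1: $R_{1,2000}$, $R_{4001,5000}$; Tape 2: $R_{2001,4000}$
Merge onto Tape 3: $R_{1,4000}$, $R_{4001,5000}$
Redistribute: Tape 1: $R_{4001,5000}$; Tape 2: $R_{1,4000}$
Merge onto Tape 3: $R_{1,5000}$
Total # of I/O operations = 21,000 !!!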

22
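Initial runs: Tape 1: $R_{1,1000}$, $R_{2001,3000}$, $R_{4001,5000}$; Tape 2: $R_{1001,2000}$, $R_{3001,4000}$
Merge two pairs onto Tape 3: $R_{1,2000}$, $R_{2001,4000}$; Tape 1 keeps $R_{4001,5000}$ (rewind)
Merge $R_{4001,5000}$ (Tape 1) with $R_{1,2000}$ (Tape 3) onto Tape 2: R 1,2000; 4001,5000; Tape 3 keeps $R_{2001,4000}$ (rewind)
Merge onto Tape 1: $R_{1,5000}$
Total # of I/O operations = 11,000 !!!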

23
Polyphase merge
[Table: run counts on tapes T1 to T6 at each phase]
How to assign initial runs?

24
Cascade Merge
[Table: run counts on tapes T1 to T6 after each pass; surviving entries: 5, 15, 15, 5, 5, 1]

25
Polyphase Merge (tapes T1 to T6), Gilstad (1960)
Run counts on the five input tapes, level by level:
{{1,0,0,0,0}, {1,1,1,1,1}, {2,2,2,2,1}, {4,4,4,3,2}, {8,8,7,6,4}, {16,15,14,12,8}, {31,30,28,24,16}}
Perfect Fibonacci distribution! What is the underlying rule?

26
[Table: columns $i$, $a_i$, $b_i$, $c_i$, $d_i$, $e_i$]

27
Level: $a_i$, $b_i$, $c_i$, $d_i$, $e_i$
1: $a_0 + b_0$, $a_0 + c_0$, $a_0 + d_0$, $a_0 + e_0$, $a_0$
2: $a_1 + b_1$, $a_1 + c_1$, $a_1 + d_1$, $a_1 + e_1$, $a_1$
3: $a_2 + b_2$, $a_2 + c_2$, $a_2 + d_2$, $a_2 + e_2$, $a_2$
…
$n$: $a_n$, $b_n$, $c_n$, $d_n$, $e_n$
$n+1$: $a_n + b_n$, $a_n + c_n$, $a_n + d_n$, $a_n + e_n$, $a_n$
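The level-to-level rule in this table, (a, b, c, d, e) → (a+b, a+c, a+d, a+e, a), is easy to iterate. A minimal sketch, assuming level 0 is a single run on the first tape as in Gilstad's list on slide 25; the function name `polyphase_levels` is hypothetical.

```python
def polyphase_levels(tapes, levels):
    """Perfect polyphase distributions for `tapes` input tapes, levels 0..levels."""
    dist = [1] + [0] * (tapes - 1)        # level 0: one run on T1
    out = [tuple(dist)]
    for _ in range(levels):
        a = dist[0]
        # (a, b, c, d, e) -> (a+b, a+c, a+d, a+e, a)
        dist = [a + x for x in dist[1:]] + [a]
        out.append(tuple(dist))
    return out

for level, dist in enumerate(polyphase_levels(5, 6)):
    print(level, dist, "total =", sum(dist))
```

For 5 input tapes this reproduces the seven distributions listed on slide 25, from (1, 0, 0, 0, 0) up to (31, 30, 28, 24, 16).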

28
[Table: columns $i$, $a_i$, $b_i$, $c_i$, $d_i$, $e_i$, and the output tape (one of T1 to T5) at each level]

29
Level $n-1$: $a_{n-1}$, $b_{n-1}$, $c_{n-1}$, $d_{n-1}$, $e_{n-1}$
Level $n$: $a_n = a_{n-1} + b_{n-1}$, $b_n = a_{n-1} + c_{n-1}$, $c_n = a_{n-1} + d_{n-1}$, $d_n = a_{n-1} + e_{n-1}$, $e_n = a_{n-1}$
Working from the right:
$e_n = a_{n-1}$
$d_n = a_{n-1} + e_{n-1} = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + d_{n-1} = a_{n-1} + (a_{n-2} + e_{n-2}) = a_{n-1} + a_{n-2} + a_{n-3}$
…
Summary:
$e_n = a_{n-1}$
$d_n = a_{n-1} + a_{n-2}$
$c_n = a_{n-1} + a_{n-2} + a_{n-3}$
$b_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a_n = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$
($a_0 = 1$; $a_i = 0$ for $i = -1, -2, -3, -4$)

30
$e = a_{n-1}$
$d = a_{n-1} + a_{n-2}$
$c = a_{n-1} + a_{n-2} + a_{n-3}$
$b = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4}$
$a = a_{n-1} + a_{n-2} + a_{n-3} + a_{n-4} + a_{n-5}$

31
[Table: columns $i$, $a_i$, $b_i$, $c_i$, $d_i$, $e_i$; surviving entries: 0 under $b_i$, $c_i$, $d_i$, $e_i$]

32

33
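$a_i$ for $i = -4, -3, -2, -1, 0, 1, 2, \dots$
“The $k$th order Fibonacci number”
$F_n^{(k)} = F_{n-1}^{(k)} + F_{n-2}^{(k)} + \dots + F_{n-k}^{(k)}$ for $n \ge k$
$F_n^{(k)} = 0$ for $0 \le n \le k-2$
$F_n^{(k)} = 1$ for $n = k-1$
e.g.) The second order Fibonacci number: $F_n^{(2)} = F_{n-1}^{(2)} + F_{n-2}^{(2)}$, with $F_n^{(2)} = 0$ if $n = 0$ and $F_n^{(2)} = 1$ if $n = 1$. The Fibonacci number!
$a_n = F_{n+k-1}^{(k)}$ if $k$ (input) tapes are used. Why?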
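The identity $a_n = F_{n+k-1}^{(k)}$ can be spot-checked numerically. A small sketch, assuming the definition above with $F_0, \dots, F_{k-2} = 0$ and $F_{k-1} = 1$; the function name `fib_order_k` is hypothetical, and the comparison values are the first components of the distributions listed on slide 25.

```python
def fib_order_k(k, count):
    """First `count` k-th order Fibonacci numbers F_0, F_1, ..."""
    f = [0] * (k - 1) + [1]          # F_0 .. F_{k-2} = 0, F_{k-1} = 1
    while len(f) < count:
        f.append(sum(f[-k:]))        # each term is the sum of the previous k terms
    return f[:count]

k = 5
f = fib_order_k(k, 11)
a = [1, 1, 2, 4, 8, 16, 31]          # a_0 .. a_6 from slide 25 (5 input tapes)
print(f)
print(all(a[n] == f[n + k - 1] for n in range(len(a))))
```

For k = 2 the same function gives the ordinary Fibonacci numbers 0, 1, 1, 2, 3, 5, 8, …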

34
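What if the number of runs is not a perfect Fibonacci distribution? Use dummy runs!
5 input tapes and 53 initial runs.
[Table: level-by-level run counts on T1 to T5, up to the first level with a total > 53; surviving entries: (8 7 7 6 4)]
[Figure: runs placed on T1 to T5 in groups: (34) (35)(36)(37) (38)(39)(40)(41) (42)(43)(44)(45) (46)(47)(48)(49)(50) (51)(52)(53)]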
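How many dummy runs are needed follows from the first perfect distribution whose total reaches the number of initial runs. A minimal sketch under that assumption; it only counts the dummies and does not decide which tape each one goes on, and the function name `dummy_runs_needed` is hypothetical.

```python
def dummy_runs_needed(initial_runs, tapes):
    """Smallest perfect polyphase level that holds `initial_runs` runs.

    Returns (level, distribution, number of dummy runs).
    """
    dist = [1] + [0] * (tapes - 1)        # level 0
    level = 0
    while sum(dist) < initial_runs:
        a = dist[0]
        dist = [a + x for x in dist[1:]] + [a]   # next perfect distribution
        level += 1
    return level, tuple(dist), sum(dist) - initial_runs

print(dummy_runs_needed(53, 5))
```

For 5 input tapes the perfect totals are 1, 5, 9, 17, 33, 65, so 53 runs need the level with 65 runs, (16, 15, 14, 12, 8), and 12 dummy runs.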

35
[Figure: tapes T1 to T6 with dummy-run counts; surviving entries: (2)(2)(2)(3)(3), 5, 8, (2)(2)(2)(3)]
Not the best, but simple and good! For a better method, see Knuth.

36
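Example (3 tapes)
$(k)^c$ means $c$ runs of size $k$. Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8
Start: T1: $(k)^8$; T2: $(k)^5$; T3: empty
Phase 1: T1: $(k)^3$; T3: $(2k)^5$
Phase 2: T2: $(3k)^3$; T3: $(2k)^2$
Phase 3: T1: $(5k)^2$; T2: $(3k)^1$
Phase 4: T1: $(5k)^1$; T3: $(8k)^1$
Phase 5: T2: $(13k)^1$
Runs on the two input tapes ($k$):
# of runs: 8,5; run sizes ($k$): 1,1
# of runs: 5,3; run sizes ($k$): 2,1
# of runs: 3,2; run sizes ($k$): 3,2
# of runs: 2,1; run sizes ($k$): 5,3
# of runs: 1,1; run sizes ($k$): 8,5
[Table columns also include # of pairs and # of I/O's]
How many passes over the data?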
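The phases of this 3-tape example can be traced mechanically. A sketch under stated assumptions: each tape is tracked only as (number of runs, run size in units of k), and the cost of a phase is counted as the amount of data written to the output tape, which may not be the counting convention behind the slide's "# of I/O's" column. The function name is hypothetical.

```python
def polyphase_three_tapes(runs_a, runs_b):
    """Trace a 3-tape polyphase merge; run sizes are in units of k.

    Returns (trace, moved), where trace lists (pairs merged, output run size,
    data written) for each phase and moved is the total data written.
    """
    tapes = [(runs_a, 1), (runs_b, 1), (0, 0)]    # (number of runs, run size)
    trace, moved = [], 0
    while sum(count for count, _ in tapes) > 1:
        out = next(i for i, (count, _) in enumerate(tapes) if count == 0)
        ins = [i for i in range(3) if i != out]
        pairs = min(tapes[i][0] for i in ins)      # merge until one input empties
        if pairs == 0:
            raise ValueError("run counts are not consecutive Fibonacci numbers")
        size = sum(tapes[i][1] for i in ins)       # merged run size
        for i in ins:
            left = tapes[i][0] - pairs
            tapes[i] = (left, tapes[i][1] if left else 0)
        tapes[out] = (pairs, size)
        moved += pairs * size
        trace.append((pairs, size, pairs * size))
    return trace, moved

trace, moved = polyphase_three_tapes(8, 5)
print(trace)
print("passes over the data =", moved / 13)
```

Under this counting the five phases write 10k, 9k, 10k, 8k, and 13k records, 50k in total over 13k records of data, or about 3.85 passes.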

37
Total number of initial runs = $F_s$ for some $s$ (the $s$th Fibonacci number), $F_s = F_{s-1} + F_{s-2}$
Runs on the two input tapes among T1, T2, T3, phase by phase: $(F_{s-1}, F_{s-2})$, $(F_{s-2}, F_{s-3})$, $(F_{s-3}, F_{s-4})$, …
See Fig. p.107, textbook !!!
Total # of I/O operations =
# of passes =

38
Lemma:
[proof] (By induction on $s$)
($s = 2$) LHS = RHS =
($s = 3$) LHS = RHS =
($s = k$) Suppose that
($s = k+1$) Exercise !!!
See page in textbook !!!

39
From the previous lemma, # of passes =
$F_s = r$ (1) Why?
Golden ratio!
From (1),

40
Theorem: In a polyphase merge with 3 tapes, where $F_s = r$ = # of initial runs ($F_{s-1}$ and $F_{s-2}$ runs on the two input tapes), # of passes ≈ $1.04 \log_2 r$.

41
[Table: Approximated behavior of polyphase merge sorting. Columns: Tapes, Phases, Passes, Pass/phase (percent), Growth ratio. Phases and passes are given in terms of ln S.]
[Table: Approximated behavior of cascade merge sorting. Columns: Tapes, Phases, Passes, Growth ratio. Entries are given in terms of ln S.]

42
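Cascade Merge
Level: $a_i$, $b_i$, $c_i$, $d_i$, $e_i$
$n$: $a_n$, $b_n$, $c_n$, $d_n$, $e_n$
$n+1$: $a_{n+1} = a_n + b_n + c_n + d_n + e_n$, $b_{n+1} = a_{n+1} - e_n$, $c_{n+1} = b_{n+1} - d_n$, $d_{n+1} = c_{n+1} - c_n$, $e_{n+1} = d_{n+1} - b_n = a_n$
Perfect distribution: for details see Knuth Vol. III.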
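Expanding the differences in the cascade rule shows that level n+1 is just the prefix sums of level n, listed largest first: (a+b+c+d+e, a+b+c+d, a+b+c, a+b, a). A minimal sketch of that rule, assuming level 0 is a single run on the first tape; the function name `cascade_levels` is hypothetical.

```python
from itertools import accumulate

def cascade_levels(tapes, levels):
    """Perfect cascade distributions for `tapes` input tapes, levels 0..levels."""
    dist = [1] + [0] * (tapes - 1)        # level 0: one run on the first tape
    out = [tuple(dist)]
    for _ in range(levels):
        # Prefix sums a, a+b, a+b+c, ... reversed so the largest comes first.
        dist = list(accumulate(dist))[::-1]
        out.append(tuple(dist))
    return out

for level, dist in enumerate(cascade_levels(5, 4)):
    print(level, dist, "total =", sum(dist))
```

The totals grow faster per level than in the polyphase case, but each cascade level is a full pass over the data.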
