Chap 8. Cosequential Processing and the Sorting of Large Files

Chap 8. Cosequential Processing and the Sorting of Large Files

Chapter Objectives(1) Describe a class of frequently used processing activities known as cosequential process Provide a general object-oriented model for implementing varieties of cosequential processes Illustrate the use of the model to solve a number of different kinds of cosequential processing problems, including problems other than simple merges and matches Introduce heapsort as an approach to overlapping I/O with sorting in RAM

Chapter Objectives(2) Show how merging provides the basis for sorting very large files Examine the costs of K-way merges on disk and find ways to reduce those costs Introduce the notion of replacement selection Examine some of the fundamental concerns associated with sorting large files using tapes rather than disks Introduce UNIX utilities for sorting, merging, and cosequential processing

Contents 8.1 Cosequential operations
8.2 Application of the OO Model to a General Ledger Program 8.3 Extension of the OO Model to Include Multiway Merging 8.4 A Second Look at Sorting in Memory 8.5 Merging as a Way of Sorting Large Files on Disk 8.6 Sorting Files on Tape 8.7 Sort-Merge Packages 8.8 Sorting and Cosequential Processing in Unix

Cosequential operations
Coordinated processing of two or more sequential lists to produce a single list Kinds of operations merging, or union matching, or intersection combination of above

Matching Names in Two Lists(1)
So called “intersection operation” Output the names common to two lists Things that must be dealt with to make match procedure work reasonably initializing that is to arrange things methods that are getting and accessing the next list item synchronizing between two lists handling EOF conditions recognizing errors e.g. duplicate names or names out of sequence

Matching Names in Two Lists(2)
In comparing two names if Item(1) is less than Item(2), read the next from List 1 if Item(1) is greater than Item(2), read the next name from List 2 if the names are the same, output the name and read the next names from the two lists

Cosequential match procedure(1)
PROGRAM: match Item(1) Item(1) < Item(2) List 1 same name use input() & initialize() procedure List 2 Item(1) > Item(2) Item(2)

Cosequential match procedure(2)
int Match(char * List1, char List2, char *OutputList) { int MoreItems; // true if items remain in both of the lists // initialize input and output lists InitializeList(1, List1); InitializeList(2, List2); InitializeOutput(OutputList); // get first item from both lists MoreItems = NextItemInLIst(1) && NextItemInList(2); while (MoreItems) { // loop until no items in one of the lists if(Item(1) < Item(2) ) MoreItems = NextItemInList(1); else if (Item(1) == Item (2) ) { ProcessItem(1); // match found MoreItems = NextItemInList(1) && NextItemInList(2); } else MoreItems = NextItemInList(2); // Item(1) > Item(2) FinishUp(); return 1;

General Class for Cosequential Processing(1)
template <class ItemType> class CosequentialProcess // base class for cosequential processing { public: // the following methods provide basic list processing // these must be defined in subclasses virtual int InitializeList (int ListNumber, char *LintName) = 0; virtual int InitializeOutput (char * OutputListName) = 0; virtual int NextItemInList (int ListNumber) = 0; // advance to next item in this list virtual ItemType Item(int ListNumber) = 0; // return current item from this list virtual int ProcessItem(int ListNumber) = 0; // process the item in this list virtual int FinishUp() = 0; // complete the processing // 2-way cosequential match method virtual int Match2Lists (char *List1, char * List2, char *OutputList); };

A Subclass to support lists that are files of strings, one per line class StringListProcess : public CosequentialProcess<String &> { public: StringListProcess (int NumberOfLists); // constructor // Basic list processing methods int InitializeList (int ListNumber, char * List1); int InitializeOutput(char * OutputList); int NextItemInList (int ListNumber); // get next String & Item (int ListNumber); // return current int ProcessItem (int ListNumber); // process the item int FinishUp(); // complete the processing protected: ifstream * List; // array of list files String * Items; // array of current Item from each list ofstream OutputLsit; static const char * LowValue; //used so that NextItemInList() doesn’t // have to get the first item in an special way static const char * HighValue; };

Appendix H: full implementation An example of main #include “coseq.h” int main() { StringListProcess ListProcess(2); // process with 2 lists ListProces.Match2Lists (“list1.txt”, “list2.txt”, “match.txt”); }

Merging Two Lists(1) Based on matching operation Difference must read each of the lists completely must change MoreItems behavior keep this flag set to true as long as there are records in either list HighValue the special value (we use “\xFF”) come after all legal input values in the files to ensure both input files are read to completion

Merging Two Lists(2) Cosequential merge procedure based on a single loop This method has been added to class CosequentialProcess No modifications are required to class StringListProcess template <class ItemType> int CosequentialProcess<ItemType> :: Merge2Lists (char * List1Name, char * List2Name, char * OutputList) { int MoreItems1, MoreItems2; // true if more items in list InitializeList (1 List1Name); InitializeList (2, List2Name); InitializeOutput (OutputListName); MoreItems1 = NextItemInLIst(1); MoreItems2 = NextItemInList(2); while(MoreItems1 || MoreItems2) { (continued … )

Merging Two Lists(3) while (MoreItems(1) || MoreItems(2) ) { // if either file has more if (Item(1) < Item(2)) { // list 1 has next item to be processed ProcessItem(1); MoreItem1 = NextItemInList(1); } else if (Item(1) == Item(2) ) { MoreItems1 = NextItemInList(1); MoreItems2 = NextItemInList(2); else // Item(1) > Item(2) { ProcessItem(2); MoreItem2 = NextItemInList(2); FinishUp(); return 1;

Cosequential merge procedure(1)
PROGRAM: merge NAME_1 NAME_2 List 1 List 2 OutputList (Item(1) < Item(2) )or match Item(1) > Item(2)

Summary of the Cosequential Processing Model(1)
Assumptions two or more input files are processed in a parallel fashion each file is sorted in some cases, there must exist a high key value or a low key records are processed in a logical sorted order for each file, there is only one current record records should be manipulated only in internal memory

Essential Components initialization - reads from first logical records one main synchronization loop - continues as long as relevant records remain selection in main synchronization loop Input files & Output files are sequence checked by comparing the previous item value with new one if (Item(1) > Item(2) then else if ( Item(1) < Item(2)) then else /* current keys equal */ endif

Essential components (cont’d) substitute high values for actual key when EOF main loop terminates when high values have occurred for all relevant input files no special code to deal with EOF I/O or error detection are to be relegated to supporting method so the details of these activities do not obscure the principal processing logic

8.2 The General Ledger Program (1)
Figure 8.7 Sample journal entries Acct.No. Check No. Date Description Debit/credit /02/97 Auto expense /02/97 Tune-up and minor repair 78.70 /02/97 Rent /02/97 Rent for April /04/97 Advertising /04/97 Newspaper ad re:new product 87.50 /02/97 Office expense /02/97 Printer cartridge /02/97 Auto expense /09/97 Oil change

Figure 8.8 Sample ledger printout showing the effect of posting from the journal Checking account #1 /02/97 Auto expense /02/97 Rent /04/97 Advertising /02/97 Auto expense Prev.bal: New bal: Checking account # /02/97 Office expense Prev.bal: New bal: Advertising expense /04/97 Newspaper ad re:new product Prev.bal:25.00 New bal: Auto expenses 04/02/97 Tune-up and minor repair 78.70 /09/97 Oil change Prev.bal: New bal: 이전 잔액 현재 잔액

Figure 8.9 List of journal transactions sorted by account number Acct.No. Check No. Date Description Debit/credit /02/97 Auto expense /02/97 Rent /04/97 Advertising /02/97 Auto expense /02/97 Office expense /04/97 Newspaper ad re:new product 87.50 /02/97 Tune-up and minor repair 78.70 /09/97 Oil change /02/97 Printer cartridge /02/97 Rent for April

Figure 8.10 Conceptual view of cosequential matching of the ledger and journal files Ledger List Journal List Checking account # Auto expense Rent Advertising Auto expense Checking account # Office expense Advertising expense Newspaper ad re: new product 510 Auto expenses Tune-up and minor repair Oil change

Figure 8.11 Sample ledger printout for the first six accounts Checking account #1 /02/97 Auto expense 04/02/97 Rent /02/97 Auto expense /04/97 Advertising Prev.bal: New bal: Checking account # /02/97 Office expense Prev.bal: New bal: Advertising expense /04/97 Newspaper ad re:new product Prev.bal:25.00 New bal: Auto expenses 04/02/97 Tune-up and minor repair 78.70 /09/97 Oil change Prev.bal: New bal: Bank charges Prev.bal: 0.00 New bal: 0.00 520 Books and publications Prev.bal: New bal: 87.40

8.2 The General Ledger Program(6)
The ledger (master) account number The journal (transaction) account number Class MasterTransactionProcess (Fig 8.12) Subclass LedgeProcess (Fig 8.14)

Template <class ItemType> class MasterTransactionProcess: Public CosequentialProcess<ItemType> // a cosequential process that supports master/transaction processing {public: MasterTransactionProcess(); // constructor Virtual int ProcessNewMaster() = 0; //processing when new master read Virtual int ProcessCurrentMaster() = 0; Virtual int ProcessEndMaster() = 0; Virtual int ProcessTransactionError()= 0; //cosequential processing of master and transaction records int PostTransactions (char * MasterFileName, char * TransactionFileName, char * OutputListName); };

while (MoreMasters || MoreTransactions) if (Item(1) < Item(2)) { // 이 마스터 레코드를 끝낸다 ProcessEndMaster(); MoreMasters = NextItemInList(1); if (MoreMasters) ProcessNewMaster(); } else if (Item(1) == Item(2)) { // 마스터와 일치하는 트랜잭션 ProcessCurrentMaster(); // 마스터를 위한 또다른 트랜잭션 ProcessItem(2); // 트랜잭션 레코드 출력 MoreTransactions = NextItemInList(2); } else { // Item(1) > Item(2) 마스터를 갖지 않는 트랜잭션 ProcessTransactionError(); }

int LedgerProcess::ProcessNewMaster() { // 헤더를 프린트하고 마지막 달의 잔액을 설정 ledger.PrintHeader(OutputList); ledger.Balances[MonthNumber] = ledger.Balances[MonthNumber-1]; } int LedgerProcess::ProcessCurrentMaster() { // 이 달의 잔액에 트랜잭션의 양을 더한다 ledger.Balances[MonthNumber] += journal.Amount; int LedgerProcess::ProcessEndMaster() { // 잔액을 출력한다 PrintBalances(OutputList, ledger.Balances[MonthNumber-1], ledger.Balances[MonthNumber]);

8.3 A K-way Merge Algorithm
A very general form of cosequential file processing Merge K input lists to create a single, sequentially ordered output list Algorithm begin loop determine which list has the key with the lowest value output that key move ahead one key in that list in duplicate input entries, move ahead in each list loop again

8.3 Selection Tree for Merging Large Number of Lists
K-way merge nice if K is no larger than 8 or so if K > 8, the set of comparisons for minimum key is expensive loop of comparison (computing) Selection Tree (if K > 8) time vs. space trade off a kind of “tournament” tree the minimum value is at root node the depth of tree is log2 K

8.3 Selection Tree 7, 10, 17....List 0 7 9, 19, 23....List 1 7
11 18, 22, List 3 input 5 12, 14, List 4 5 5, 6, List 5 5 15, 20, List 6 8 8, 16, List 7

8.4 A Second Look at Sorting in Memory
Read the whole file from into memory, perform sorting, write the whole file into disk Can we improve on the time that it takes for this RAM sort? perform some of parts in parallel selection sort is good but cannot be used to sort entire file Using Heap technique! processing and I/O can occur in parallel keep all the keys in heap Heap building while reading a block Heap rebuilding while writing a block

8.4 Overlapping processing and I/O : Heapsort
a kind of binary tree, complete binary tree each node has a single key, that key is less than or equal to the key at its parent node storage for tree can be allocated sequentially so there is no need for pointers or other dynamic overhead for maintaining the heap

8.4 A heap in both its tree form and as it would be stored in an array
(1) * n, 2n, 2n+1 positions B (2) c (3) E (4) H (5) I (6) D (7) F (9) G (8) A B C E H I D G F

8.4 Class Heap and Method Insert(1)
{ public: Heap(int maxElements); int Insert (char * newKey); char * Remove(); protected: int MaxElements; int NumElements; char ** HeapArray; void Exchange (int i, int j); // exchange element i and j int Compare (int i, int j) // compare element i and j { return strcmp(Heaparray[i], HeapArray[j]); } };

8.4 Class Heap and Method Insert(2)
int Heap::Insert(char * newKey) { if (NumElements == MaxElements) return FALSE; NumElements++; // add the new key at the last position HeapAray[NumElements] = newKey; // re-order the heap int k = NumElements; int parent; while(k > 1) { // k has a parent parent = k/2; if (Compare(k, parent) >= 0) break; // HeapArray[k] is in the right place // else exchange k and parent Exchange(k, parent); k = parent; } return;

Heap Building Algorithm(1)
input key order : F D C G H I B E A New key to be inserted Heap, after insertion of the new key Selected heaps in tree form F F D D F C C C F D D F G C F D G H C F D G H (continued....)

input key order : F D C G H B E A New key to be inserted Heap, after insertion of the new key Selected heaps in tree form I C F D G H I C F D B B F C G H I D G H I E B E C F H I D G B A A B C E H I D G F F C H G I D (continued....)

input key order : F D C G H B E A New key to be inserted Heap, after insertion of the new key Selected heaps in tree form A A B C E H I D G F A C B D H I E G F

Illustration for overlapping input with heap building(1)
(Free ride of main memory processing: heap building is faster than IO!) Total RAM area allocated for heap First input buffer. First part of heap is built here. The first record is added to the heap, then the second record is added, and so forth Second input buffer. This buffer is being filled while heap is being built in first buffer.

Illustration for overlapping input with heap building(2)
(One Heap is growing during IO time!) Second part of heap is built here. The first record is added to the heap, then the second record, etc Third input buffer. This buffer is filled while heap is being built in second buffer Third part of heap is built here Fourth input buffer is filled while heap is being built in third buffer

8.4 Sorting while Writing to the File
Heap rebuilding while writing a block (Free ride of main memory processing) Retrieving the keys in order (Fig 8.20) while( there is no elements) get the smallest value put largest value into root decrease the # of elements reorder the heap Overlapping retrieve-in-order with I/O retrieve-in-order a block of records while writing this block, retrieve-in-order the next block

8.4 Method Remove char * Heap::Remove()
{ // 가장 작은 원소를 제거하고, 힙을 재정렬시키고, 가장 작은 원소를 리턴한다 // 리턴하기 위해 가장 작은 값을 val에 넣는다 char * val = HeapArray[1]; HeapArray[1] = HapArray[NumElements]; // 루트에 가장 큰 값을 넣는다 NumElements--; // 원소의 수를 감소시킨다 // 교환과 하향이동에 의해 힙을 재정렬한다 int k = 1; // 가장 큰 값을 포함하는 힙의 노드 int newK; // 가장 큰 값으로 교환되는 힙의 노드 while (2*k <= NumElements) { // k는 최소 하나의 자식을 가짐 // k의 가장 작은 자식 인덱스로 newK를 설정 if (Comapre(2*k, 2*k+1) > 0) newK = 2*k; else newK = 2*k+1; // k와 newK가 순서대로 되어 있는지를 비교 if (Compare(k, newK) < 0) break; // 순서대로 되어 있는 경우 Exchange(k, newK); // k와 newK가 순서대로 되지 않은 경우 k = newK; //트리의 아래 방향으로 진행} return val; } Figure 8.20

8.5 Merging as a Way of Sorting Large Files on Disk
Keysort: holding keys in memory Two Shortcomings of Keysort substantial cost of seeking may happen after keysort cannot sort really large files e.g. a file with 800,000 records, size of each record: 100 bytes, size of key part: 10 bytes, then 800,000 X 10 => 8G bytes! cannot even sort all the keys in RAM Multiway merge algorithm small overhead for maintaining pointers, temporary variables run: sorted subfile using heap sort for each run split, read-in, heap sort, write-back

............. Sorting through the creation of runs
and subsequential merging of runs 800,000 unsorted records 80 internal sorts 80runs, each containing 10,000 sorted records Merge 800,000 records in sorted order

Multiway merging (K-way merge-sort)
Can be extended to files of any size Reading during run creation is sequential no seeking due to sequential reading Reading & writing is sequential Sort each run: Overlapping I/O using heapsort K-way merges with k runs Since I/O is largely sequential, tapes can be used

How Much Time Does a Merge Sort Take?
Assumptions only one seek is required for any sequential access only one rotational delay is required per access Four I/Os during the sort phase reading all records into RAM for sorting, forming runs writing sorted runs out to disk during the merge phase reading sorted runs into RAM for merging writing sorted file out to disk

Four Steps(1) Step1: Reading records into RAM for sorting and forming runs assume: 10MB input buffer, 800MB file size seek time --> 8msec, rotational delay --> 3msec transmission rate --> MB/msec Time for step1: access 80 blocks (80 X 11)msec + transfer 80 blocks (800/0.0145)msec Step2: Writing sorted runs out to disk writing is reverse of reading time that it takes for step2 equals to time of step1

Four Steps(2) Step3: Reading sorted runs into RAM for merging
10 MB of RAM is for storing runs. 80 runs reallocate each of 80 buffers 10MB RAM as 80 input buffers access each run 80 buffers to read all of it Each buffer holds 1/80 of a run (0.125MB) total seek & rotational time --> 80 runs X 80 seeks --> 6400 seeks X 11 msec = 70 seconds transfer time --> 60 seconds (800MB  MB/msec) total time = total seek & rotation time + transfer time

Four Steps(3) Step4: Writing sorted file out to disk
need to know how big output buffers are with 20,000-byte output buffers, total seek & rotation time = 4,000 x 11 msec transfer time is still 60 seconds Consider Table 8.1 (359pp) What if we use keysort for 800M file? --> 24hrs 26mins 40secs 80,000,000 bytes 4,000 seeks 20,000 bytes per seek

: Effect of buffering on the number of seeks required 800MB file
1st run = 80 buffers’ worth(80 accesses) 2nd run = 80 buffers’ worth(80 accesses) 800MB file 800,000 sorted records : 80 buffers(10MB) 80th run = 80 buffers’ worth(80 accesses)

Sorting a Very Large File
Two kinds of I/O Sort phase I/O is sequential if using heapsort Since sequential access is minimal seeking, we cannot algorithmically speed up I/O Merge phase RAM buffers for each run get loaded, reloaded at predictable times -> random access For performance, look for ways to cut down on the number of random accesses that occur while reading runs you can have some chance here!

The Cost of Increasing the File Size
K-way merge of K runs Merge sort = O(K2) ( merge op. -> K2 seeks ) If K is a big number, you are in trouble! Some ways to reduce time!! (8.5.4, , 8.5.6) more hardware (disk drives, RAM, I/O channel) reducing the order of merge (k), increasing buffer size of each run increase the lengths of the initial sorted runs find the ways to overlap I/O operations

Hardware-base Improvements
Increasing the amount of RAM longer & fewer initial runs fewer seeks Increasing the number of disk drives no delay due to seek time after generation of runs assign input and output to separate drives Increasing the number of I/O channels separate I/O channels, I/O can overlap Improve transmission time

Decreasing the Num of Seeks Using Multiple-step Merges
K-way merge characteristics a selection tree is used the number of comparisons is N*log K (K-way merge with N records) K is proportional to N O(N*log N) : reasonably efficient Reducing seeks is to reduce the number of runs give each run a bigger buffer space multiple-step merge provides the way without more RAM

Multiple-step merge(1)
Do not merge all runs at one time Break the original set of runs into small groups and Merge runs in these group separately Leads fewer seeks, but extra transmission time in second pass Reads every record twice to form the intermediate runs & the final sorted file Similar to have selection tree in merging n lists!!

Two-step merge of 800 runs ...... 25 sets of 32 runs each 32 runs
(25 sets X 32 runs) = 800 runs ...... 32 runs 25 sets of 32 runs each

Multiple-step merge(2)
Essence of multiple-step merging increase the available buffer space for each run extra pass vs. random access decrease Can we do even better with more than two steps? trade-offs between the seek&rotation time and the transmission time major cost in merge sort seek, rotation time, transmission time, buffer size, number of runs

Increasing Run Lengths Using Replacement Selection(1)
Facts of Life Want to use the heap sort in memory Want to allocate longer output runs Can we pack the longer output runs using the heap sort in memory? Replacement Selection Idea always select the key from memory that has the lowest value output the key replace it with a new key from the input list use 2 heaps in the memory buffer (continued...)

Increasing Run Lengths Using Replacement Selection(2)
Implementation step1: read records and sort using heap sort this heap is the primary heap step2: write out only the record with the lowest value step3: bring in new record and compare its key with that of record just output step3-a: if the new key is higher, insert new record into its proper in the primary heap along with the other records selected for output step3-b: if the new key is lower, place the record in a secondary heap with key values lower than already written out step4: repeat step 3 while there are records in the primary heap and there are records to be read in. When the primary heap is empty, make the secondary heap into the primary heap and repeat step2 & step3

Example of the principle underlying replacement selection
Input: 21, 67, 12, 5, 47, 16 Front of input string (Heap sort!) Remaining input Memory(p=3) Output run - 5 12, 5 16, 12, 5 21, 16, 12, 5 47, 21, 16, 12, 5 67, 47, 21, 16, 12, 5 21, 67, 12 21, 67 21 -

Replacement Selection(1)
What happens if a key arrives in memory too late to be output into ins proper position relative to the other keys? (if 4th key is 2 rather than 12) use of second heap, to be included in next run refer to page 335 Figure 8.25 Two questions Given P locations in memory, how long a run can we expect replacement selection to produce, on the average? On the average, we can expect a run length of 2P Knuth provides an excellent description (page ) (continued...)

Comparisons of access times required to sort 8 million records
both RAM sort and replacement selection Total Seek & Rotation Delay Time # of Records per Seek to Form Runs Size of Runs Formed # of Seeks Required to Form Runs Merge Order Used Total Number of Seeks Approach (hr) (min) 800 RAM sorts followed by an 800-way merge 10,000 10,000 800 1,600 681,600 4 58 Replacement selection followed by 534-way merge (records in random order) 2,500 15,000 534 6,400 521,134 3 48 Replacement selection followed by 200-way merge (records partially ordered) 2,500 40,000 200 200 206,400 1 30

Step-by-step op. of replacement selection with 2 heaps
working to form two sorted runs(1) Input 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16 Front of input string (Heap sort!) Remaining input Memory(P=3) Output run(A) 33, 18, 24, 58, 14, 17, 7, 21, 67, 12 33, 18, 24, 58, 14, 17, 7, 21, 67 33, 18, 24, 58, 14, 17, 7, 21 33, 18, 24, 58, 14, 17, 7 33, 18, 24, 58, 14, 17 33, 18, 24, 58, 14 33, 18, 24, 58 ( 7) 67 (17) ( 7) (14) (17) ( 7) - 5 12, 5 16, 12, 5 21, 16, 12, 5 47, 21, 16, 12, 5 67, 47, 21, 16, 12, 5

Step-by-step op. of replacement selection
working to form two sorted runs(2) Remaining input Memory(P=3) Output run(B) First run complete; start building the second 33, 18, 24, 58 33, 18, 24 33, 18 - - - 7 14, 7 17, 14, 7 18, 17, 14, 7 24, 18, 17, 14, 7 33, 24, 18, 17, 14, 7 58, 33, 24, 18, 17, 14, 7

Replacement Selection Plus Multiple Merging
8.5 Merging as a Way of Sorting Large Files on Disk Replacement Selection Plus Multiple Merging Total number of seeks is less than for the one-step merges The two-step merge requires transferring the data two more times than do the one-step merge the two-step merges & replacement selection are still better, but the results are less dramatic refer to table of the next slide

Comparison of merges, considering transmission times(1) :1-step merge
8.5 Merging as a Way of Sorting Large Files on Disk Comparison of merges, considering transmission times(1) :1-step merge Approach Number of Records per Seek to Form Runs Merge Pattern Used Number of Seeks for Sorts and Merges Seek + Rotational Delay Time(min) Total Passes over the File Total Trans- mission Time(min) Total of Seek, Rotation, and Transmission Times(min) RAM sorts 800- way 10,000 681,700 298 4 43 341 replacement selection (records in random order) 534- way 2,500 228 4 43 341 521,134 replacement selection (records part -ially ordered) 2,500 200- way 90 4 43 206,400 341 (continued...)

Comparison of merges, considering transmission times(2) :2-step merge
8.5 Merging as a Way of Sorting Large Files on Disk Comparison of merges, considering transmission times(2) :2-step merge Approach Number of Records per Seek to Form Runs Merge Pattern Used Number of Seeks for Sorts and Merges Seek + Rotational Delay Time(min) Total Passes over the File Total Trans- mission Time(min) Total of Seek, Rotation, and Transmission Times(min) 25 x 32 -way (one 25-way) RAM sorts 10,000 127,200 56 6 65 121 replacement selection (records in random order) 19 x 28 -way (one 19-way) 2,500 55 6 65 120 124,438 20 x 10 -way (one 20-way) replacement selection (records part -ially ordered) 2,500 110,400 48 6 65 113

Using Two Disks with Replacement Selection
8.5 Merging as a Way of Sorting Large Files on Disk Using Two Disks with Replacement Selection Two disk drives input & output can overlap reduce transmission by 50% seeking is virtually eliminated Sort phase the run selection & output can overlap Merge phase output disk becomes input disk, and vice versa seeking will occur on input disk, output is sequential substantially reducing merge & transmission time

heap Memory organization for replacement selection disk1 input buffers
8.5 Merging as a Way of Sorting Large Files on Disk Memory organization for replacement selection disk1 input buffers heap disk2 output buffers

More Drives? More Processors?
8.5 Merging as a Way of Sorting Large Files on Disk More Drives? More Processors? More drives? Until I/O becomes so fast that processing cannot keep up with it More processors? mainframes vector and array processors massively parallel machines very fast local area networks

Effects of Multiprogramming
8.5 Merging as a Way of Sorting Large Files on Disk Effects of Multiprogramming Increase the efficiency of overall system by overlapping processing and I/O Effects are very hard to predict

A Concept Toolkit for External Sorting
8.5 Merging as a Way of Sorting Large Files on Disk A Concept Toolkit for External Sorting For in-RAM sorting, use heapsort Use as much RAM as possible Use a multiple-step merge when the number of initial runs is so long that seek and rotation time is much greater than transmission time Use replacement selection when possibility of partially ordered Use more than one disk drive and I/O channel read/write can overlap Look for ways to take advantage of new architecture and systems parallel processing or high-speed networks

Sorting Files on Tape Tape contains runs
Balanced Merge with several tape drivers Tape contains runs T R1 R3 R5 R7 R9 Step T R2 R4 R6 R8 R10 T T Figure 8.28 (2 way-balanced 4 tape merge) P is the number of passes, N is the number of runs, k is the number of input drivers ==> then, P = ceiling of (logkN) 4 tape drivers (2 for input, 2 for output), 10 runs ==> 4 passes 20 tape drivers (10 for input, 10 for output), 200 runs ==> 3 passes

Sorting Files on Tape Other ways of Balanced Merge
(Fig 8.30) T T T T4 Step Step Step Step (Fig 8.31) T T T T4 Step Step … Step … Step … Step

K-way Balanced Merge on Tapes
Some difficult questions How does one choose an initial distribution that leads readily to an efficient merge pattern? Are there algorithmic descriptions of the merge patterns, given an initial distribution? Given N runs and J tape drives, is there some way to compute the optimal merging performance so we have a yardstick against which to compare the performance of any specific algorithm?

Unix: Sorting and Cosequential Processing
Sorting in Unix The Unix sort command The qsort library routine Cosequential processing utilities in Unix Compares: cmp Difference: diff Common: comm

Let’s Review !! 8.1 Cosequential operations
8.2 Application of the Model to a General Ledger Program 8.3 Extension of the Model to Include Multiway Merging 8.4 A Second Look at Sorting in Memory 8.5 Merging as a Way of Sorting Large Files on Disk 8.6 Sorting Files on Tape 8.7 Sort-Merge Packages 8.8 Sorting and Cosequential Processing in Unix

Chap 8. Cosequential Processing and the Sorting of Large Files

Similar presentations

Presentation on theme: "Chap 8. Cosequential Processing and the Sorting of Large Files"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Chap 8. Cosequential Processing and the Sorting of Large Files

Similar presentations

Presentation on theme: "Chap 8. Cosequential Processing and the Sorting of Large Files"— Presentation transcript:

Similar presentations

About project

Feedback