Presentation is loading. Please wait.

Presentation is loading. Please wait.

FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8.

Similar presentations


Presentation on theme: "FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8."— Presentation transcript:

1 FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8

2 FALL 2005CENG 351 Data Management and File Structures2 Outline Introduction Heapsort Multi-way Merging Multi-step merging Replacement Selection

3 FALL 2005CENG 351 Data Management and File Structures3 External Sort External sorting is used when the file to be sorted is too big to be in the main memory in full. In this case the file is sorted using hard disk storage as well as main memory. External sorting can be done using one disk or more than one, depending on the size of the file and the availability of hard disk drives. What sort algorithm is suitable for the external sort? The algorithm that allows large files to be sorted! Heap sort seems to be appropriate for external sort, with average complexity of O(nlogn) Quick sort is appropriate for in memory sort, with avrage complexity of O(nlogn).

4 FALL 2005CENG 351 Data Management and File Structures4 Why Heap sort Large files are handled easily: –First phase involves sorting the large file in memory size chunks (file segments) –Second phase involves merging the sorted segments, on the disk, each time creating a twice as large sorted segments –Also overlapping of processing with I/O is possible with the Heap Sort… –In two disk drive case Input and output can even be overlapped as well, with further advantage…

5 FALL 2005CENG 351 Data Management and File Structures5 Sorting Segments 1.Heapsort, using two sorted segments each time Heap sort is also known as to be based on a “priority sort queue”. It is optimal, if only one disk drive is available. It can be executed by overlapping the input/output with processing. Each sorted segment will be the size of the available memory. 2.Heapsort using Replacement selection: Allows more than two sorted segments to produce a single sorted segment… Optimal for two or more disk drives. Sorted segments are twice the size of memory. Reading in and writing out can be overlapped

6 FALL 2005CENG 351 Data Management and File Structures6 Heapsort Algorithm What is a heap? A heap is a binary tree with the following properties: 1.Each node has a single key and that key is greater than or equal to the key at its parent node. 2.It is a complete binary tree. i.e. All leaves are on at most 2 levels, leaves on the lowest level are at the leftmost position. 3.Can be stored in an array; the root is at index 1, the children of node i are at indexes 2*i, and 2*i+1. Conversely, the parent of node j is stored at index  j/2  (very compact: no need to store pointers)

7 FALL 2005CENG 351 Data Management and File Structures7 How big is a heap? As big as the largest heap that can fit in memory, called one segment What is the complexity of the heap sort? –Each time a new record is added to the tree it requires a maximum of logn comparisons to reestablish the heap, if n is the number of the nodes in the tree at that moment. For example, for a tree of 1024 records, logn will be 10.

8 FALL 2005CENG 351 Data Management and File Structures8 How big is a heap? Overlapping of block input and in memory heap processing is normal: If a block contains 10 records; while the next block is being read, 100 (=10*10) comparison can be conducted, for a tree of 1024 nodes. So, larger block, better concurrency… How to conduct the second stage of the heap sort, i.e., writing the segment out to the disk? –Write the root of the heap established during the first stage, to the output buffer –Replace the root by the last record in the tree –Reestablish the heap, which has the complexity of O(logn)

9 FALL 2005CENG 351 Data Management and File Structures9 10 3520 4540 605055 2530 Heap as a binary tree: Height =  log n  Example Heap as an array: 10352045402530605055

10 FALL 2005CENG 351 Data Management and File Structures10 Heapsort Algorithm First Stage: Building the heap while reading the file: –While there is available space Get the next record from current input buffer Put the new record at the end of the heap Reestablish the heap by exchanging the new node with its parent, if it is smaller than the parent: otherwise leave it, where it should be. Repeat this step as long as heap property is violated.

11 FALL 2005CENG 351 Data Management and File Structures11 Heapsort Algorithm Second stage: Sorting while writing the heap out to the file: –While there are records in heap Put the root record in the current output buffer. Replace the root by the last record in the heap. Restore the heap again, which has the complexity of O(log n) and repeat this procedure, so overall complexity is O(nlogn).

12 FALL 2005CENG 351 Data Management and File Structures12 Example Trace the algorithm with: 48703019504510015

13 FALL 2005CENG 351 Data Management and File Structures13 Merge of sorted lists

14 FALL 2005CENG 351 Data Management and File Structures14 Multiway Merging K-way merge: we want to merge K input lists to create a single sequentially ordered output list. (K is the order of a K-way merge) K-way merge algorithm: –Keep an array of lists: list[0], list[1], … list[k-1] –Keep an array of the items that are being used from each list: item[0], item[1], … item[k-1] –The merge processing requires a call to a function (say MinIndex) to find the index of the item with the minimum value.

15 FALL 2005CENG 351 Data Management and File Structures15 Finding the minimum item When the number of lists is small sequential search among items works nicely. (O(K)) When the number of lists is large, we could place the items in a priority queue (an array heap). The min value will be at the root (1 st position in array) Replace the root with the next value from the associated list. This insert operation is O(log K)

16 FALL 2005CENG 351 Data Management and File Structures16 Merging as a way of Sorting Large Files Let us consider the following example: File to be sorted: –8,000,000 records –R = 100 bytes –Size of the key = 10 bytes Memory available as a work area: 10MB (not counting memory used to hold program, O.S., I/O buffers etc.)  Total file size = 800MB  Total number of bytes for all keys = 80MB So, we cannot do internal sorting nor key-sorting.

17 FALL 2005CENG 351 Data Management and File Structures17 Basic idea 1.Forming segments (i.e. sorted sub-files): bring as many records as possible to the main memory, sort them using heapsort, save the sorted data into a file. Repeat this until we have read all records from the original file. As a result a number of sorted files will be formed… 2.Do a multi-way merge on the sorted sub- files to form larger sorted files.

18 FALL 2005CENG 351 Data Management and File Structures18 Cost of Merge Sort-1 I/O operations are performed in the following times: 1.Reading each record into main memory for sorting and forming the sorted segments. 2.Writing sorted segments to disk. These two steps are done as follows: –Read a chunk of 10MB, write a chunk of 10Mb (repeat this 80 times) –In terms of basic disk operations, we spend: For reading: 80 seeks and rotational latency + transfer time for 800 MB Same for writing

19 FALL 2005CENG 351 Data Management and File Structures19 3.Reading segments into memory for merging. Read one chunk of each segment, so 80 chunks. Since available memory is 10MB each chunk can have (10,000,000/80)bytes = 125,000 bytes = 1250 records. Total number of s and r = Total number of chunks (counting all segments) is 80 segments * 80 chunks/segmemts = 80 2 chunks = 6400 s and r. Cost of Merge Sort-2

20 FALL 2005CENG 351 Data Management and File Structures20 Cost of Merge Sort-3: Sorting a File that is 10 times larger How is the time for merge phase affected if the file of 80 million records? –More segments: 800 segments –800-way merge in 10MB memory –i.e. divide the memory into 800 buffers. –Each buffer holds 1/800 th of a segment –So, 800 segments * 800 reads/segment = 640,000 reads.

21 FALL 2005CENG 351 Data Management and File Structures21 The cost of increasing the file size In general, for a K-way merge of K segments, the buffer size for each segment is –(1/K) * size of memory space = (1/K) * size of each segment So K reads are required for each segment. Since there are K segments, merge requires K 2 seeks. Because K is directly proportional to N it also follows that the sort merge is an O(N 2 ) operation.

22 FALL 2005CENG 351 Data Management and File Structures22 Improvements There are several ways to reduce the time: 1.Allocate more hardware (e.g. Disk drives, memory) 2.Perform merge in more than one step. 3.Algorithmically increase the lengths of the initial sorted segments 4.Find ways to overlap I/O operations.

23 FALL 2005CENG 351 Data Management and File Structures23 Multiple-step merges Instead of merging all segments at once, we break the original set of segments into small groups and merge the segments in these groups separately. –more buffer space is available for each segment; hence fewer seeks are required per segment. When all of the smaller merges are completed, a second pass merges the larger segments.

24 FALL 2005CENG 351 Data Management and File Structures24 ………… … … 25 sets of 32 segments each Two-step merge of 800 segments

25 FALL 2005CENG 351 Data Management and File Structures25 Cost of multi-step merge 25 sets of 32 segments, followed by 25-way merge: –Disadvantage: we read every record twice. –Advantage: we can use larger buffers and avoid a large number of disk seeks. Calculations: First Merge Step: –Each time read 1/32 of a segment for 32 segments => 32*32 = 1024 reads –For 25 32-way merges=> 25 * 1024 = 25,600 reads

26 FALL 2005CENG 351 Data Management and File Structures26 For each of 25 segments to read, allocate 1/25th of buffer space only. So, from each segment 4000 records (=10MB/25/100 or 1/800 segment) are read Hence, 800 reads per segment, –so we end up making 25 * 800 = 20,000 reads. Second Merge Step:

27 FALL 2005CENG 351 Data Management and File Structures27 Total number of reads for two steps: 25600 + 20000 = 45,600 Compute the total number of writes for two steps? see sections 8.5.1-8.5.5 for actual times

28 FALL 2005CENG 351 Data Management and File Structures28 Increasing segment Lengths Assume initial segments contain 200000 records.Then instead of 800-way merge we need 400-way merge. A longer initial segment means –fewer total segments, –a lower-order merge, – bigger buffers, – fewer seeks. How can we create initial segments that are twice as large as the number of records that we can hold in memory? => Replacement selection

29 FALL 2005CENG 351 Data Management and File Structures29 Merging Sorted Segments One disk drive –Merging requires all the sorted segments to be combined into a sorted file. –So, to sort an unordered file requires segmented sorting followed by a merging process to finish the job. –The merge could be k-way, where the k=2,…,m; where m is the number of segments

30 FALL 2005CENG 351 Data Management and File Structures30 k-way Merge of Sorted Segments Two-way (k=2) merge –Normally the memory size is limited to the size of a segment, thus two full segments cannot be read into the memory, to merge. –So, halves of the two segments read in first, they are merged, while being written out at the same time, then the remaining of the exhausted segment is read in, followed by the remaining of the other segment…

31 FALL 2005CENG 351 Data Management and File Structures31 2-way Merge of 2 Sorted Segments Assume each segment contain b’ blocks: 2*2*b’*btt+4(s+r)+3(s+r) –the first term includes read and write of two segments, each b’ blocks; second term includes 4 seek and rotational latency for four segment halves; 3 seek and rotational latency for writing the merged segment...

32 FALL 2005CENG 351 Data Management and File Structures32 k-way Merge Sort After each merging pass, the number of the sorted segments will be halved. So, if total numbr of blocks is b and original number of segments is m, logm passes are required, in total. –For large files, where segments become bigger and bigger, the time for one 2-way merge pass can be approximated as: 2*b*btt+2*2*m(r+s) two read and two writes for each original segment, each time requiring 2( s + r) for reading 2 (s+r) for writing, repeating for m segments

33 FALL 2005CENG 351 Data Management and File Structures33 k-way Merge Sort of entire file k-way merge time required for each pass 2*b*btt+2*k*m(r+s) Total time for sorting a pile file using one disk drive, using k-way merge: TT=2*b*btt+(log k m)*[2*k*m*(s+r)+2*b*btt] where the first term is for the creation of sorted segments, the second term is time taken by the log k m k-way merge passes. Note that m is the number of memory size segments.

34 FALL 2005CENG 351 Data Management and File Structures34 Which value of k is the best for a given m, r, and s Find the value of the k for which the above formula (total file sort time) is minimum. After replacing log k m by lnm/lnk, and taking the derivative of TT with respect to k, after some simplifications we obtain the formula k*(lnk-1)=(m*btt)/(r+s) –Solving this equation for k, for a fixed set of m, r, and s, should give optimal value… this may be a very rough estimate!

35 FALL 2005CENG 351 Data Management and File Structures35 Two Disk Drives or more for merge With two disk drives and large memory merging, under ideal conditions the input and output are overlapped, with proper arrangement of buffering. Can more than two disk drives help improving the performance? –Yes! Overlapping of seek times for different segments on different disk drives

36 FALL 2005CENG 351 Data Management and File Structures36 Heap Sort with Replacement Selection Idea –always select the key from memory that has the lowest value –output the key –replacing it with a new key from the input list

37 FALL 2005CENG 351 Data Management and File Structures37 Input: 21,67,12, 5, 47, 16 Remaining inputMemory (size P=3) Output segment 21,67,1254716_ 21,671247165 2167471612,5 _67472116,12,5 _6747_21,16,12,5 _67__47, 21,16,12,5 ____67,47, 21,16,12,5 What about a key arriving in memory too late to be output into its proper position? => use of second heap Front of input

38 FALL 2005CENG 351 Data Management and File Structures38 Trace of replacement selection Input: 33, 18, 24,58,14,17,7,21,67,12,5,47,16 Assume memory of size P=3

39 FALL 2005CENG 351 Data Management and File Structures39 Replacement Selection with two disks Algorithm: 1.Construct a heap (primary heap) in the memory, while reading records block by block from the first disk drive, 2.As we move records from the heap to output buffer, we replace those records with records from the input buffer. If some new records have keys smaller than those already written out, a secondary heap is created for them. The other new records are inserted to the primary heap. 3.Repeat step 2 as long as there are records left in the primary heap and there are records to be read. 4.When the primary heap is empty make the secondary heap into primary heap and repeat steps 1-3.


Download ppt "FALL 2005CENG 351 Data Management and File Structures1 External Sorting Reference: Chapter 8."

Similar presentations


Ads by Google