Download presentation

Presentation is loading. Please wait.

Published byLisandro Cosby Modified over 3 years ago

1
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 1 Asynchronous Parallel Disk Sorting Roman Dementiev and Peter Sanders Max-Planck-Institut für Informatik, Saarbrücken, Germany Thanks to: A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter

2
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 2 Motivation & Related Work Sorting is important for EM problems Disks are cheaper than computers Prefetch buffers for load disk balancing and overlapping have been studied but no results which guarantee – overlapping during merging of runs – asymptotically optimal I/O volume Here: algorithm & implementation – I/O cost lower bound – guarantees almost perfect overlapping between I/O and computation

3
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 3 Motivation & Related Work [ChaudhryCormen0102] impl. of distributed memory, parallel disk Theory: Implementations: Differences: – Optimal prefetching algorithm – Overlapping of I/O and computation – Completely asynchronous implementation – Part of library, compatible with STL. Use as simple as: stxxl::vector v; long long int i=vec_size; // long long int is a 64-bit data type while(i--) v.push_back(f(i)); // fill vector with some values stxxl::sort(v.begin(), v.end()); // sort // check order using STL predicate assert(std::is_sorted(v.begin(), v.end())); [BarvGrovVit97][SandEgnKorst00][HutchVitt01][HutchSandVitt02] Here [DementievSanders03] [BarveVitter99]

4
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 4 External Multi-way Merge Sort run 0 input : O(M) chunk 0chunk 1chunk 2chunk 3chunk 4 run 1run 2run 3 chunk 5chunk 6 … B buffers BBB k=O(M/B) k-merger run 4run 5run 6run 7 BBBB k-merger chunk 7chunk 8 k run 0123 B run 4567 B buffers … These algorithms do not guarantee overlapping D write buffers 1 2 3 4 k ….. merge elements D blocks merge buffers merging D prefetch buffers D-head disk D-head model N – input size M – main memory size B – block size D – number of disks

5
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 5 Multi-way Merge Sort with Overlapped I/Os Theorem 1: Theorem 1: If I/O and computation can be overlapped, N elements can be sorted in time where is the time needed for run formation, is the time needed for merging k= (M/B) : # sequences, k = O(N/M): total # runs.

6
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 6 Overlapping Run Formation Easy: Corollary 2: Corollary 2: Input of size N => sorted runs of size ~ M/2 in time 2M in the paper time read sort write read sort write 12 1 1 2 3 3 3 4 2 4 4 k-1k-1 k-1 k k k... 12 1 1 2 3 3 2 4 4 34 k-1 k k k I/O bound case compute bound case run numbers

7
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 7 Simple External Merging Smallest element of a block is a trigger k sorted runs …. merger sort pairs prediction sequence : …. k-merger ….

8
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 8 Overlapping I/O and Merging Prediction of delay between issuing two reads is not easy: 1 B-1 23 B-1 45 B-1 6… 1 B-1 23 B-1 45 B-1 6… 1 B-1 23 B-1 45 B-1 6… k … merger

9
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 9 Overlapping I/O and Merging Prediction of delay between issuing two reads is not easy: Solution: 23 B-1 45 B-1 6… 23 B-1 45 B-1 6… 23 B-1 45 B-1 6… k … merger D write buffers 1 2 3 4 k ….. merge elements D blocks merge buffers merging D prefetch buffers D-head disk 2D write buffers 1 2 3 4 k ….. merge elements D blocks merge buffers merging D prefetch buffers D-head disk k+3D overlap buffers

10
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 10 Overlapping I/O and Merging [contd.] Theorem 4: Theorem 4: Merging k sequences with a total of N elements takes time – : time to produce one element of output – L: time to input or output D arbitrary blocks Our I/O thread strategy: – DB elements in write buffer => output step – input step To prove Theorem 4 we distinguish two cases – I/O bound case: 2L DB – Compute bound case: 2L< DB

11
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 11 Compute Bound Case Lemma 6: Lemma 6: In compute bound case after k/D+1 steps, the merging thread never blocks until all elements are merged. Notation: – # elem. in write buffer – r # of elem. in the overlap and merge buffers Our I/O thread strategy: – DB elements in write buffer => output step – input step r kB kB+DB kB+2DB kB+3DB kB+y L/ kB+2 L/ 0DB 2DB- L/ 2DB merge thread blocked I/O blocking fetch output

12
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 12 I/O Bound Case Lemma 7: Lemma 7: In I/O bound case the I/O thread never blocks until all input blocks are fetched. Our I/O thread strategy (the same): – DB elements in write buffer => output step – input step Proof: Proof: similar to compute bound case r blocking fetch output DB- L/ DB2DB- L/ 2DB kB+3DB kB+2DB+ L/ kB+2DB kB+DB+ L/ kB+DB 0 kB+ L/

13
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 13 Multi-head Model Multi-disk Model Randomized Cyclic Allocation [VitHut01] makes accesses in balanced Optimal prefetching from [HutchSanVitt01,SanEgnKor00] Corollary 8: Corollary 8: For any > 0, prefetch buffer of size m= (D/ ) merging k sequences with a total N elements can be implemented to run in time … … … … -ping 2D write buffers … O(D) prefetch buffers k+O(D) overlap buffers 1 2 3 4 k ….. merge elements read buffers D blocks merge buffers merging overlap-disk scheduling

14
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 14 Implementation Issues Implementation (part of library) – Forming runs: Key sorting, efficient two passes MSD radix sort – Multi-way merging: [Sanders00] – prefetch buffer + overlap buffer = read buffer – Asynchronous I/O (POSIX threads) – No superfluous copying User blocks are transferred by DMA ( O_DIRECT flag) Buffers are passed by pointer between pools Applications Sorting, Priority queues, Search trees OS Caching, Mem. management Disk scheduling POSIX threads, File system STL + more structure

15
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 15 Hardware 3000 Euro, 375 MByte/s, July 2002

16
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 16 Single Disk Performance LEDA-SM, TPIE 2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 –O6 I/O bandwidth 45.4 Mbyte/s = 95% of peak bandwidth of the disk

17
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 17 Multiple Disk Performance 2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 –O6 TPIE, LEDA-SM – Linux Soft-RAID 8X128KByte stripes Bandwidth of 315 MByte/s

18
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 18 Element Size 16GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 –O6 64 merging is I/O bound 128 run formation is I/O bound Small elements require special treatment

19
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 19 Optimal Prefetching 2D write buffers … k+O(D) overlap buffers 1 2 3 4 k ….. merge read buffers D blocks merge buffers merging 2D write buffers … O(D) prefetch buffers k+O(D) overlap buffers 1 2 3 4 k ….. merge read buffers D blocks merge buffers merging

20
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 20 Block Size Block sizes of several MB => good performance Leave room for read and write buffers Naive striping implies factor 8 block size reduction Dramatic run time increase

21
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 21 Large Inputs One-pass, curves go up because of – Slower zones – Seek times

22
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 22 Summary Parallel disk sorting with high performance on state of art hardware with theoretical performance guarantees Bandwidth is no longer a limiting factor for external memory algorithms Price performance ratio can improve by adding disks

23
R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting 23 ?

Similar presentations

OK

1 Joe Meehean. Ordered collection of items Not necessarily sorted 0-index (first item is item 0) Abstraction 2 Item 0 Item 1 Item 2 … Item N.

1 Joe Meehean. Ordered collection of items Not necessarily sorted 0-index (first item is item 0) Abstraction 2 Item 0 Item 1 Item 2 … Item N.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Download ppt on nutrition in plants and animals Ppt on fibonacci series Ppt on classical economics today Marketing mix ppt on sony blu-ray Ppt on low level language goals Ppt on eia report in malaysia Ppt on machine translation vs human Ppt on national education day images Ppt on economic order quantity pdf Ppt on youth leadership