
1 Design and Evaluation of Non-Blocking Collective I/O Operations
Vishwanath Venkatesan 1, Edgar Gabriel 1
1 Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston, @cs.uh.edu

2 Outline
– I/O Challenge in HPC
– MPI File I/O
– Non-blocking Collective Operations
– Non-blocking Collective I/O Operations
– Experimental Results
– Conclusions

3 I/O Challenge in HPC
A 2005 paper from LLNL [1] states that applications on leadership-class machines require 1 GB/s of I/O bandwidth per teraflop of compute capability.
– Jaguar at ORNL (fastest in 2008): in excess of 250 teraflops peak compute performance, with a peak I/O performance of 72 GB/s [3]
– K computer (fastest in 2011): nearly 10 petaflops peak compute performance, with a realized I/O bandwidth of 96 GB/s [2]
By the 1 GB/s-per-teraflop rule of thumb, Jaguar would call for roughly 250 GB/s and K for roughly 10 TB/s, so delivered I/O bandwidth lags compute capability by a growing margin.

[1] Richard Hedges, Bill Loewe, T. McLarty, and Chris Morrone. Parallel File System Testing for the Lunatic Fringe: The Care and Feeding of Restless I/O Power Users. In Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
[2] Shinji Sumimoto. An Overview of Fujitsu's Lustre Based File System. Technical report, Fujitsu, 2011.
[3] M. Fahey, J. Larkin, and J. Adams. I/O Performance on a Massively Parallel Cray XT3/XT4. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008.

4 MPI File I/O
MPI has been the de-facto standard for parallel programming over the last decade.
MPI I/O provides:
– File views: the portion of a file visible to a process
– Individual and collective I/O operations
Example illustrating the advantage of collective I/O: 4 processes access a 2D matrix stored in row-major format. MPI I/O can detect this access pattern, issue one large I/O request, and follow it with a distribution step for the data among the processes (sketched below).
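As a concrete illustration of this pattern (a sketch, not code from the talk): each rank describes its block of rows of a shared matrix file through a subarray file view, so the collective write exposes the global access pattern and MPI-IO can aggregate the per-process requests. The file name and matrix size are assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1024;                  /* assumed size; N % size == 0 */
        int gsizes[2] = { N, N };            /* whole matrix                */
        int lsizes[2] = { N / size, N };     /* one block of rows per rank  */
        int starts[2] = { rank * (N / size), 0 };

        /* File view: this rank sees only its own block of rows */
        MPI_Datatype filetype;
        MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                                 MPI_ORDER_C, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        double *buf = calloc((size_t)lsizes[0] * lsizes[1], sizeof(double));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "matrix.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

        /* Collective write: MPI-IO sees the global pattern and can merge
           the per-process requests into one large contiguous access      */
        MPI_File_write_all(fh, buf, lsizes[0] * lsizes[1], MPI_DOUBLE,
                           MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(buf);
        MPI_Finalize();
        return 0;
    }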

5 Non-blocking Collective Operations
Non-blocking point-to-point operations:
– Asynchronous data transfer operations
– Hide communication latency by overlapping it with computation
– Demonstrated benefits for a number of applications [1]
Non-blocking collective communication operations were implemented in LibNBC [2]:
– Schedule-based design: a process-local schedule of point-to-point operations is created
– Schedule execution is represented as a state machine (with dependencies)
– State and schedule are attached to every request
Non-blocking collective communication operations have been voted into the upcoming MPI-3 specification [2]. Non-blocking collective I/O operations have not (yet) been added to the document.

[1] D. Buettner, J. Kunkel, and T. Ludwig. Using Non-blocking I/O Operations in High Performance Computing to Reduce Execution Times. In Proceedings of the 16th European PVM/MPI Users' Group Meeting, 2009.
[2] T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI. In Supercomputing 2007.
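To make the schedule-and-progress idea concrete, here is a minimal sketch of the overlap pattern, written with the MPI-3 name MPI_Ibcast (the prototype described on this slide used LibNBC's NBC_* equivalents); do_independent_work() is a hypothetical placeholder.

    #include <mpi.h>

    static void do_independent_work(void) { /* application computation */ }

    /* Start a non-blocking broadcast, then interleave computation with
       MPI_Test calls: each test lets the library advance the schedule's
       state machine toward completion.                                  */
    void bcast_with_overlap(double *data, int n, MPI_Comm comm)
    {
        MPI_Request req;
        int done = 0;

        MPI_Ibcast(data, n, MPI_DOUBLE, 0, comm, &req);

        while (!done) {
            do_independent_work();
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }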

6 Non-blocking Collective I/O Operations
MPI_File_iwrite_all(MPI_File file, void *buf, int cnt, MPI_Datatype dt, MPI_Request *request)
Different from non-blocking collective communication operations:
– Every process is allowed to provide a different amount of data per collective read/write operation
– No process has a 'global' view of how much data is read or written
Create a schedule for a non-blocking Allgather(v):
– Determine the overall amount of data written across all processes
– Determine the offsets for each data item within each group
Upon its completion:
– Create a new schedule for the shuffle and I/O steps
– The schedule can consist of multiple cycles
A minimal usage sketch follows.
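This sketch is written against the signature shown above (the call was later standardized in MPI-3.1, after this talk); overlap_computation() is a hypothetical placeholder, and each rank may legally pass a different count.

    #include <mpi.h>

    static void overlap_computation(void) { /* work while I/O proceeds */ }

    /* Each rank contributes my_count elements; the target offsets follow
       from the file view, and the library internally gathers the counts
       (the non-blocking allgather step above) to build the shuffle/I/O
       schedule.                                                          */
    void write_async(MPI_File fh, double *buf, int my_count)
    {
        MPI_Request req;

        MPI_File_iwrite_all(fh, buf, my_count, MPI_DOUBLE, &req);
        overlap_computation();
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* buf may be reused only now */
    }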

7 Experimental Evaluation
Crill cluster at the University of Houston:
– Distributed PVFS2 file system with 16 I/O servers
– 4x SDR InfiniBand message-passing network (2 ports per node)
– Gigabit Ethernet I/O network
– 18 nodes, 864 compute cores
LibNBC integrated with the Open MPI trunk (rev. 24640). The evaluation focuses on collective write operations.

8 Latency / I/O Overlap Tests
Overlap a non-blocking collective I/O operation with an equally expensive compute operation:
– Best case: overall time = max(I/O time, compute time)
Strong dependence on the ability to make progress:
– Best case: the time between subsequent calls to NBC_Test equals the time to execute one cycle of the collective I/O
The computation is sized to take as long as the I/O, so in the best case the overall time barely exceeds the I/O time:

    No. of processes    I/O time      Overall time
    64                   85.69 sec     85.80 sec
    128                 205.39 sec    205.91 sec
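A sketch of how such an overlap test can be structured (illustrative, assuming the later-standardized MPI-3.1 name for the call and a hypothetical compute_chunk()): the computation is split into chunks with a test call between them, so the collective's cycles can make progress.

    #include <mpi.h>

    static void compute_chunk(int c) { /* one slice of synthetic work */ }

    /* Measure the overall time of a write overlapped with chunked
       computation. In the best case this approaches
       max(I/O time, compute time): every MPI_Test gives the library a
       chance to execute the next cycle of the I/O schedule.           */
    double overlapped_time(MPI_File fh, double *buf, int count, int nchunks)
    {
        MPI_Request req;
        int done = 0;
        double t0 = MPI_Wtime();

        MPI_File_iwrite_all(fh, buf, count, MPI_DOUBLE, &req);
        for (int c = 0; c < nchunks; c++) {
            compute_chunk(c);
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        return MPI_Wtime() - t0;   /* overall (overlapped) time */
    }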

9 Parallel Image Segmentation Application
– Used to assist in diagnosing thyroid cancer
– Based on microscopic images obtained through Fine Needle Aspiration (FNA) [1]
– Executes a convolution operation for different filters and writes the resulting data
– The code was modified to overlap the write of iteration i with the computations of iteration i+1 (see the sketch below)
Two code versions were generated:
– NBC: additional calls to the progress engine added between code blocks
– NBC w/FFTW: FFTW modified to insert further calls to the progress engine

[1] Edgar Gabriel, Vishwanath Venkatesan, and Shishir Shah. Towards High Performance Cell Segmentation in Multispectral Fine Needle Aspiration Cytology of Thyroid Lesions. Computer Methods and Programs in Biomedicine, 2009.
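A sketch of that pipelining with double buffering, so the buffer currently being written is never overwritten; all names are illustrative rather than the application's actual code, and the progress-engine calls mentioned above would sit inside compute_filter().

    #include <mpi.h>
    #include <stdlib.h>

    static void compute_filter(int iter, double *out) { /* convolution */ }

    /* The write of iteration i-1 is still in flight while iteration i
       computes into the other buffer; the wait before the next iwrite
       guarantees the previous request (and its buffer) has completed.  */
    void filter_pipeline(MPI_File fh, int niters, int count)
    {
        double *buf[2] = { malloc(count * sizeof(double)),
                           malloc(count * sizeof(double)) };
        MPI_Request req = MPI_REQUEST_NULL;  /* first wait is a no-op */

        for (int i = 0; i < niters; i++) {
            double *cur = buf[i % 2];
            compute_filter(i, cur);             /* overlaps write of i-1 */
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* finish write of i-1   */
            MPI_File_iwrite_all(fh, cur, count, MPI_DOUBLE, &req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(buf[0]);
        free(buf[1]);
    }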

10 Application Results
– 8192 x 8192 pixels, 21 spectral channels
– 1.3 GB input data, ~3 GB output data
– 32 aggregators with a 4 MB cycle buffer size

11 Conclusions
– The specification of non-blocking collective I/O operations is straightforward
– The implementation is challenging, but doable
– Results show a strong dependence on the ability to make progress:
  – (Nearly) perfect overlap for the micro-benchmark
  – Mostly good results for the application scenario
– The proposal is up for its first vote in the MPI Forum

