
1 Design and Evaluation of Non-Blocking Collective I/O Operations
Vishwanath Venkatesan 1, Edgar Gabriel 1
1 Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston, @cs.uh.edu

2 Outline
– I/O Challenge in HPC
– MPI File I/O
– Non-blocking Collective Operations
– Non-blocking Collective I/O Operations
– Experimental Results
– Conclusions

3 I/O Challenge in HPC
A 2005 paper from LLNL [1] states that applications on leadership-class machines require 1 GB/s of I/O bandwidth per teraflop of compute capability.
– Jaguar at ORNL (fastest in 2008): in excess of 250 teraflops peak compute performance, with a peak I/O performance of 72 GB/s [3]
– K computer (fastest in 2011): nearly 10 petaflops peak compute performance, with a realized I/O bandwidth of 96 GB/s [2]
By the 1 GB/s-per-teraflop rule of thumb, Jaguar would call for roughly 250 GB/s and K for roughly 10 TB/s, so delivered I/O bandwidth lags compute capability by a growing margin.

[1] Richard Hedges, Bill Loewe, T. McLarty, and Chris Morrone. Parallel File System Testing for the Lunatic Fringe: The Care and Feeding of Restless I/O Power Users. In Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
[2] Shinji Sumimoto. An Overview of Fujitsu's Lustre Based File System. Technical report, Fujitsu, 2011.
[3] M. Fahey, J. Larkin, and J. Adams. I/O Performance on a Massively Parallel Cray XT3/XT4. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2008.

4 MPI File I/O
MPI has been the de-facto standard for parallel programming over the last decade.
MPI I/O provides:
– File views: the portion of a file visible to a process
– Individual and collective I/O operations
Example illustrating the advantage of collective I/O: 4 processes access a 2D matrix stored in row-major format. MPI I/O can detect this access pattern, issue one large I/O request, and follow it with a distribution step for the data among the processes (sketched below).
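As a concrete illustration of this pattern (a sketch, not code from the talk): each rank describes its block of rows of a shared matrix file through a subarray file view, so the collective write exposes the global access pattern and MPI-IO can aggregate the per-process requests. The file name and matrix size are assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1024;                  /* assumed size; N % size == 0 */
        int gsizes[2] = { N, N };            /* whole matrix                */
        int lsizes[2] = { N / size, N };     /* one block of rows per rank  */
        int starts[2] = { rank * (N / size), 0 };

        /* File view: this rank sees only its own block of rows */
        MPI_Datatype filetype;
        MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                                 MPI_ORDER_C, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        double *buf = calloc((size_t)lsizes[0] * lsizes[1], sizeof(double));

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "matrix.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

        /* Collective write: MPI-IO sees the global pattern and can merge
           the per-process requests into one large contiguous access      */
        MPI_File_write_all(fh, buf, lsizes[0] * lsizes[1], MPI_DOUBLE,
                           MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        free(buf);
        MPI_Finalize();
        return 0;
    }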

5 Non-blocking Collective Operations
Non-blocking point-to-point operations:
– Asynchronous data transfer operations
– Hide communication latency by overlapping it with computation
– Demonstrated benefits for a number of applications [1]
Non-blocking collective communication operations were implemented in LibNBC [2]:
– Schedule-based design: a process-local schedule of point-to-point operations is created
– Schedule execution is represented as a state machine (with dependencies)
– State and schedule are attached to every request
Non-blocking collective communication operations have been voted into the upcoming MPI-3 specification [2]. Non-blocking collective I/O operations have not (yet) been added to the document.

[1] D. Buettner, J. Kunkel, and T. Ludwig. Using Non-blocking I/O Operations in High Performance Computing to Reduce Execution Times. In Proceedings of the 16th European PVM/MPI Users' Group Meeting, 2009.
[2] T. Hoefler, A. Lumsdaine, and W. Rehm. Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI. In Supercomputing 2007.
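To make the schedule-and-progress idea concrete, here is a minimal sketch of the overlap pattern, written with the MPI-3 name MPI_Ibcast (the prototype described on this slide used LibNBC's NBC_* equivalents); do_independent_work() is a hypothetical placeholder.

    #include <mpi.h>

    static void do_independent_work(void) { /* application computation */ }

    /* Start a non-blocking broadcast, then interleave computation with
       MPI_Test calls: each test lets the library advance the schedule's
       state machine toward completion.                                  */
    void bcast_with_overlap(double *data, int n, MPI_Comm comm)
    {
        MPI_Request req;
        int done = 0;

        MPI_Ibcast(data, n, MPI_DOUBLE, 0, comm, &req);

        while (!done) {
            do_independent_work();
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }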

6 Non-blocking Collective I/O Operations
MPI_File_iwrite_all(MPI_File file, void *buf, int cnt, MPI_Datatype dt, MPI_Request *request)
Different from non-blocking collective communication operations:
– Every process is allowed to provide a different amount of data per collective read/write operation
– No process has a 'global' view of how much data is read or written
Create a schedule for a non-blocking Allgather(v):
– Determine the overall amount of data written across all processes
– Determine the offsets for each data item within each group
Upon its completion:
– Create a new schedule for the shuffle and I/O steps
– The schedule can consist of multiple cycles
A minimal usage sketch follows.
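This sketch is written against the signature shown above (the call was later standardized in MPI-3.1, after this talk); overlap_computation() is a hypothetical placeholder, and each rank may legally pass a different count.

    #include <mpi.h>

    static void overlap_computation(void) { /* work while I/O proceeds */ }

    /* Each rank contributes my_count elements; the target offsets follow
       from the file view, and the library internally gathers the counts
       (the non-blocking allgather step above) to build the shuffle/I/O
       schedule.                                                          */
    void write_async(MPI_File fh, double *buf, int my_count)
    {
        MPI_Request req;

        MPI_File_iwrite_all(fh, buf, my_count, MPI_DOUBLE, &req);
        overlap_computation();
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* buf may be reused only now */
    }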

7 Experimental Evaluation
Crill cluster at the University of Houston:
– Distributed PVFS2 file system with 16 I/O servers
– 4x SDR InfiniBand message-passing network (2 ports per node)
– Gigabit Ethernet I/O network
– 18 nodes, 864 compute cores
LibNBC integrated with the Open MPI trunk (rev. 24640). The evaluation focuses on collective write operations.

8 Latency / I/O Overlap Tests
Overlap a non-blocking collective I/O operation with an equally expensive compute operation:
– Best case: overall time = max(I/O time, compute time)
Strong dependence on the ability to make progress:
– Best case: the time between subsequent calls to NBC_Test equals the time to execute one cycle of the collective I/O
The computation is sized to take as long as the I/O, so in the best case the overall time barely exceeds the I/O time:

    No. of processes    I/O time      Overall time
    64                   85.69 sec     85.80 sec
    128                 205.39 sec    205.91 sec
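A sketch of how such an overlap test can be structured (illustrative, assuming the later-standardized MPI-3.1 name for the call and a hypothetical compute_chunk()): the computation is split into chunks with a test call between them, so the collective's cycles can make progress.

    #include <mpi.h>

    static void compute_chunk(int c) { /* one slice of synthetic work */ }

    /* Measure the overall time of a write overlapped with chunked
       computation. In the best case this approaches
       max(I/O time, compute time): every MPI_Test gives the library a
       chance to execute the next cycle of the I/O schedule.           */
    double overlapped_time(MPI_File fh, double *buf, int count, int nchunks)
    {
        MPI_Request req;
        int done = 0;
        double t0 = MPI_Wtime();

        MPI_File_iwrite_all(fh, buf, count, MPI_DOUBLE, &req);
        for (int c = 0; c < nchunks; c++) {
            compute_chunk(c);
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        return MPI_Wtime() - t0;   /* overall (overlapped) time */
    }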

9 Parallel Image Segmentation Application
– Used to assist in diagnosing thyroid cancer
– Based on microscopic images obtained through Fine Needle Aspiration (FNA) [1]
– Executes a convolution operation for different filters and writes the resulting data
– The code was modified to overlap the write of iteration i with the computations of iteration i+1 (see the sketch below)
Two code versions were generated:
– NBC: additional calls to the progress engine added between code blocks
– NBC w/FFTW: FFTW modified to insert further calls to the progress engine

[1] Edgar Gabriel, Vishwanath Venkatesan, and Shishir Shah. Towards High Performance Cell Segmentation in Multispectral Fine Needle Aspiration Cytology of Thyroid Lesions. Computer Methods and Programs in Biomedicine, 2009.
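A sketch of that pipelining with double buffering, so the buffer currently being written is never overwritten; all names are illustrative rather than the application's actual code, and the progress-engine calls mentioned above would sit inside compute_filter().

    #include <mpi.h>
    #include <stdlib.h>

    static void compute_filter(int iter, double *out) { /* convolution */ }

    /* The write of iteration i-1 is still in flight while iteration i
       computes into the other buffer; the wait before the next iwrite
       guarantees the previous request (and its buffer) has completed.  */
    void filter_pipeline(MPI_File fh, int niters, int count)
    {
        double *buf[2] = { malloc(count * sizeof(double)),
                           malloc(count * sizeof(double)) };
        MPI_Request req = MPI_REQUEST_NULL;  /* first wait is a no-op */

        for (int i = 0; i < niters; i++) {
            double *cur = buf[i % 2];
            compute_filter(i, cur);             /* overlaps write of i-1 */
            MPI_Wait(&req, MPI_STATUS_IGNORE);  /* finish write of i-1   */
            MPI_File_iwrite_all(fh, cur, count, MPI_DOUBLE, &req);
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(buf[0]);
        free(buf[1]);
    }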

10 Application Results
– 8192 x 8192 pixels, 21 spectral channels
– 1.3 GB input data, ~3 GB output data
– 32 aggregators with a 4 MB cycle buffer size

11 Conclusions
– The specification of non-blocking collective I/O operations is straightforward
– The implementation is challenging, but doable
– Results show a strong dependence on the ability to make progress:
  – (Nearly) perfect overlap for the micro-benchmark
  – Mostly good results for the application scenario
– The proposal is up for its first vote in the MPI Forum

