Introduction I/O is one of the main bottlenecks in HPC applications. Many applications or higher level libraries rely on MPI-I/O for doing parallel I/O. Several optimizations have been introduced in MPI-I/O to meet the needs of application – Non-blocking individual I/O – Different collective I/O algorithms
Motivation Routines for non-blocking individual I/O operations exist (MPI_File_i(read/write)(_at) Non-blocking point-to-point (existing) and collective (to be added) communication operations have demonstrated benefits. Split collective I/O operations have their restrictions and limitations. What’s keeping us from adding non-blocking collective I/O operations? – Implementation
Usecase (I) HDF5 operations that modify metadata: – Collective to keep the cache among all processes synchronized. – The Metadata cache uses an LRU eviction scheme. – Items at the bottom of the list are evicted in a collective write call to disk. The amount of data written is usually small (< 1KB). – Non blocking collective I/O would allow us to fire off those writes and go do other stuff avoiding the I/O overhead.
Usecase (II) HDF5 Raw data operations: – Chunking data in file is a key optimization HDF5 uses for parallel I/O. – If HDF5 can detect a pattern in the way chunks are accessed, we can pre-fetch those chunks from disk. – Asynchronous I/O operations would hide the cost of I/O operations. Chunk cache for writes (currently disabled for parallel HDF5): – Similar concept to the metadata cache
New Routines MPI_File_iread_all (MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Request *req); MPI_File_iwrite_all (MPI_File fh, void *buf, int count, MPI_Datatype type, MPI_Request *req); MPI_File_iread_at_all (MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype type, MPI_Request *req); MPI_File_iwrite_at_all (MPI_File fh, MPI_Offset offset, void *buf, int count, MPI_Datatype type, MPI_Request *req); Ordered read/write (add non blocking or deprecate ordered) Deprecate split collectives Straw Vote: 22 - 0 - 0
Challenges Major difference between collective communication and collective I/O operations: – Each process is allowed to provide different volumes of data to a collective I/O operation, without having knowledge on the data volumes provided by other processes. Collective I/O algorithms do aggregation.
Implementation Need non-blocking collective communication Integrate with the progress engine – Test/Wait on the request like other non-blocking operations – Explicit or Implicit progress? Different collective I/O algorithms
Implementation A recent implementation was done within an Open MPI specific I/O library (OMPIO) and uses LibNBC: – leverages the same concept of a schedule for non blocking collective communication operations – work is still at preliminary stages so a large scale evaluation is not available – done at PSTL at the University of Houston (Edgar Gabriel) in collaboration with Torsten – paper accepted at EuroMPI 2011: Design and Evaluation of Nonblocking Collective I/O Operations
Other MPI I/O Operations Several MPI I/O functions are considered expensive other than read/write functions: – Open/Close – Sync – Set view – Set/Get size It would be valuable to have non-blocking versions of some of those functions too.
Usecases Applications that open a file but don’t touch the file until a certain amount of computation has been done – Cost of opening a file will be hidden Non-blocking sync would also provide great advantages in case we flush data items to disk before we go do computation. The intention is to hide the cost (whenever possible) of all the expensive MPI I/O operations.
Conclusion The need for non-blocking collective I/O is fairly high. Implementation is the non-easy part. Performance benefits can be substantial. Users would also benefit from non-blocking versions of some MPI I/O operations that are considered fairly time consuming.