Presentation on theme: "Large Scale Parallel I/O with HDF5 Darren Adams, NCSA Considerations for Parallel File Systems on Distributed HPC Platforms."— Presentation transcript:
Large Scale Parallel I/O with HDF5 Darren Adams, NCSA Considerations for Parallel File Systems on Distributed HPC Platforms
Survey of Parallel I/O Techniques Root Process Gather and write: Yes, this is still common… N-N: Write independent file from each process N-1: Write to a single file from all processes. – MPIIO based Parallel I/0 – Higher level I/O Library HDF5,NetCFD,Silo,etc.
Advantages of N-1 The resulting file (can be) independent of process count. Single file is easier to analyze with other applications More conceptually compatible with the “global” data set that has been computed in a distributed fashion.
Disadvantages of N-1 Usually results in a write performance penalty vs. N-N. Without particular care, can loose information about how the data was distributed when it was created. File size can become a problem. Some approaches can result in pathologically bad IO patterns vs. N-N.
A Couple of New Approaches to Optimize Performance PLFS –Virtual Parallel Log Structured File System – Sits in between the application and the file system and intercepts the application's I/O requests. – Optimizes I/O for the underlying file system by aligning stripe boundaries ADIOS – ADaptable IO System – Change the IO behavior though an XML file that is independent of the application. – Allows Fortran READ and WRITE statements to be used in the application.
Other Considerations (besides parallel I/O performance) Portability across architecture and OS. Ability to read and manipulate data with other applications. Stability of software (library) and availability used to create file format. Sharing data with collaborators.
Reasons to use HDF5 Provides a rich API that is capable of creating a file system in a file Supports Parallel IO Is highly stable and backward compatible. Platform Independent. Widely used and supported (will likely be around for a long time)
Reasons Not to Use HDF5 (directly) Complicated, general purpose API No standard is imposed on the data format created. – While HDF5 has the capability to create a self- describing well-formed data format, it does not require these practices. More of a general toolkit to build a higher application-level API (like NetCDF). More time and effort is needed for development of I/O routines.
Reasons to Use HDF5 (directly) Maximum control over file format. Access to all of the performance settings in the underlying MPIIO layer. Ability to restructure the data layout if needed to improve performance of either the application writing the data OR visualization or other analysis software Can incrementally expose portions of the HDF5 API that are needed to the application rather than only having access to what a higher level API exposes (square hole – round peg).
Remarks on (P)HDF5 HDF5 can be compiled with parallel MPIIO support. – PHDF5 must be linked separately from a serial (non-mpi) HDF5 build. PHDF5 opens up an interface to the MPI I/O layer via MPIIO Hints and various internal MPI- specific properties. Using HDF5 directly gives full access to all of these settings
Data Requirements for DNS Application The science code is a CFD solver that used domain- decomposed MPI processes to parallelize the computation. Visualization Data – The main data product for analyzing the scalar and vector field data included 3D scalar and 3D – 3vector field data. – Important to have a workable data format that can be read with tools such as Visit Restart Data – State dump, only used by the simulation code can be highly optimized, but should still be portable across architectures. Statistics Data – Smaller data sets and Integral quantities
Parallel I/O Approach for DNS Code Chose to use HDF5 directly – Developed simple I/O library that encapsulates the HDF5 file structure Exposes a simpler interface to the simulation code. Allows I/O routines to be re-used in analysis programs Can use the simple library to implement a Visit Database reader. – Exposes only a very well maintained and widely supported library dependency to the application.
Parallel I/O Approach for DNS Code Need to mitigate performance degradation of N-1 file I/O. – As long as I/O performance is “reasonable”, code execution time will not be significantly impacted. There simply is not enough storage capacity on the file system to allow I/O to dominate. Hmmm, what does “reasonable” mean… – Structured usable files are worth some level of performance sacrifice to researchers.
HDF5 Datasets in Parallel Applications Use Hyperslabs to define a subset of a global data set for each process. – In CFD domain decomposition, each MPI rank has data that is physically adjacent to another rank’s. – The “global” data set, in this context, is achieved when data from all rank is “glued” together. – This approach creates a single glued-together data set which can be supplemented by MPI rank info, but does not need to be. Let each process write to a separate data set. – Information about how to glue the data together MUST be provided.
Considerations for the Cray XT System Kraken Lustre File System with ~160 OSTs ?. Experience shows that Collective N-1 I/O can lead to pathological performance degradation. – This is often due to network contention introduced when processes overlap the Lustre stripe boundaries. In effect each process writes to several OST’s rather than striking a balance where all OSTs work in concert.
Lustre and MPI-IO: ADIO to the Rescue Recent additions to the MPICH ROMIO layer upon which most MPI distributions are based allow Lustre stripe size and count to be set via the MPI_Info object. – Setting these parameters will set the actual Lustre striping parameters to the new file when it is created. The only other way to do this is to set striping parameter ahead of time to the I/O directory, or use the lustre c library an c-style open calls. – Perhaps equally as important these parameters are sued by the MPIIO collective buffering stack. The MPI hint “CO” is introduced to set the client to OST ratio. – These improvements directly address the shortcomings of N-1 file I/O cited by developers of PLFS and ADIOS. But, do they work?
Lustre MPI-IO Tweaks From “man mpi” on Kraken: MPICH_MPIIO_CB_ALIGN If set to 2, an algorithm is used to divide the I/O workload into Lustre stripe-sized pieces and assigns them to collective buffering nodes (aggregators), so that each aggregator always accesses the same set of stripes and no other aggregator accesses those stripes. This is generally the optimal collective buffering mode as it minimizes the Lustre file system extent lock contention and thus reduces system I/O time. However, the overhead associated with dividing the I/O workload can in some cases exceed the time otherwise saved by using this method.
striping_factor Specifies the number of Lustre file system stripes (stripe count) to assign to the file. This has no effect if the file already exists when the MPI_File_open() call is made. File striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2. Default: the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe count of the directory to a value other than the system default.
striping_unit Specifies in bytes the size of the Lustre file system stripes (stripe size) assigned to the file. This has no effect if the file already exists when the MPI_File_open() call is made. File striping cannot be changed after a file is created. Currently this hint applies only when MPICH_MPIIO_CB_ALIGN is set to 2. Default: the default value for the Lustre file system, or the value for the directory in which the file is created if the lfs setstripe command was used to set the stripe size of the directory to a value other than the system default.
Conclusions Log File formats may be the best approach to optimize I/O performance if performance is the only goal. More traditional file formats may not achieve the performance peaks of a log file format, but are more usable after the run. Recent Lustre ADIO improvements provide tools to mitigate poor performance for N-1 I/O patterns enough to still justify their use. Additional improvements can be made when approaching Peta- scale while maintaining a consistent file format. – Deploy separate I/O aggregator processes with (perhaps) very large buffer settings. – Implement a “N-M” approach where the single file is broken up, but not to full N-N per-process. Can use internal HSDF5 file structure and linking to preserve the “global” process-independent dataset.
Future Work Develop routines for complete state dumps for DNS restarts using HDF5 and a N-1 approach – Want single file should be readable from different process counts across platforms – Pull out all stops to get achieve peak write performance – Is a log file format or I/O imposition layer really needed? Document optimal settings for large scale grids on Kraken at 1024 processes and beyond.
References PLFS: A Checkpoint Filesystem for Parallel Applications http://institute.lanl.gov/plfs/plfs.pdfhttp://institute.lanl.gov/plfs/plfs.pdf ADIOS http://adiosapi.org/index.php5?title=Main_Page http://adiosapi.org/index.php5?title=Main_Page Lustre Technical White Paper: Lustre ADIO collective write driver http://wiki.lustre.org/images/6/65/Lustre_ADIO_ Driver_Whitepaper_0926.pdf http://wiki.lustre.org/images/6/65/Lustre_ADIO_ Driver_Whitepaper_0926.pdf