1 HDF5 collective chunk IO: A Working Report

2 Motivation for this project
► Found extremely bad performance of parallel HDF5 when implementing the WRF-Parallel HDF5 IO module with chunked storage.
► Found that parallel HDF5 does not support MPI-IO collective write and read for chunked storage.
► Had some time left in the MEAD project.

3 Software Stack for a Parallel HDF5 Application
Application
Parallel HDF5
MPI-IO (ROMIO, etc.)
Parallel file system (GPFS, PVFS, Lustre)
Hardware (Myrinet, InfiniBand, etc.)

4 Why collective chunk IO?
► Why use chunked storage?
1. Better performance when subsetting
2. Datasets with unlimited dimensions
3. Filters can be applied
► Why collective IO? To take advantage of the performance optimizations provided by MPI-IO. (A sketch of both settings follows below.)
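The following is a minimal sketch, assuming the modern HDF5 1.8+ C API, of how an application might request both settings: chunked storage via H5Pset_chunk and collective MPI-IO transfers via H5Pset_dxpl_mpio. The file name, dataset name, and sizes are made-up values for illustration only.

    /* Assumed example: chunked storage plus collective MPI-IO transfers. */
    #include <mpi.h>
    #include "hdf5.h"

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        hsize_t dims[2]  = {1024, 1024};
        hsize_t chunk[2] = {256, 256};

        /* Open the file through the MPI-IO virtual file driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Chunked dataset layout: 256 x 256 chunks. */
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        hid_t space = H5Screate_simple(2, dims, NULL);
        hid_t dset  = H5Dcreate(file, "data", H5T_NATIVE_INT, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* Ask for collective MPI-IO on the data transfer. */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        /* ... each process selects its hyperslab in `space`, then calls
           H5Dwrite(dset, H5T_NATIVE_INT, memspace, space, dxpl, buf); ... */

        H5Pclose(dxpl); H5Dclose(dset); H5Sclose(space);
        H5Pclose(dcpl); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }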

5 MPI-IO Basic Concepts
► Collective IO: in contrast to independent IO, all processes must participate in the IO operation. MPI-IO can optimize IO performance when MPI_FILE_SET_VIEW is combined with collective IO. (A sketch of a collective write with a file view follows below.)
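Here is a minimal sketch, using only standard MPI-IO calls, of the pattern described above: each process sets a file view describing its interleaved portion of the file and then joins a collective write so the MPI-IO layer can merge the requests. The file name and the interleaving parameters are assumed for illustration.

    /* Assumed example: file view plus collective write. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "view.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each process owns every nprocs-th block of 4 ints. */
        MPI_Datatype filetype;
        MPI_Type_vector(8, 4, 4 * nprocs, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        MPI_Offset disp = (MPI_Offset)rank * 4 * sizeof(int);
        MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

        int buf[32];
        for (int i = 0; i < 32; i++) buf[i] = rank;
        MPI_File_write_all(fh, buf, 32, MPI_INT, MPI_STATUS_IGNORE); /* collective */

        MPI_Type_free(&filetype);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }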

6 An example with 4 processes (figure: the interleaved file views of P0, P1, P2, and P3). When doing independent IO, the worst case may require 8 individual IO accesses.

7 With collective IO (figure: the requests of P0, P1, P2, and P3 merged into one), it may need only one IO access to the disk. See http://hdf.ncsa.uiuc.edu/apps/WRF-ROMS/parallel-netcdf.pdf and the references of that report for more information.

8 Challenges to supporting collective IO with chunked storage inside HDF5
► Have to fully understand how chunking is implemented inside HDF5.
► Have to fully understand how MPI-IO is supported inside HDF5, especially how collective IO works with contiguous storage.
► Have to find out how difficult it is to implement collective chunk IO inside HDF5.

9 Strategy for the project
► First, see whether we can implement collective chunk IO for some special cases, such as one big chunk covering all singular hyperslab selections.
► Then, gradually increase the complexity of the problem until we can solve the general case.

10 Case 1: One chunk covers all singular hyperslabs (figure: the selections of P0, P1, P2, and P3 all fall within a single chunk).

11 Progress made so far
► Found an unexpectedly easy connection between the HDF5 chunk code and the collective IO code.
► Found that this easy connection works for more general test cases than expected.
► Wrote the test suite and checked it into HDF5 CVS in both the 1.6 and 1.7 branches.
► Tackled more general cases.

12 Special cases to work with
► One chunk covers all singular hyperslab selections from different processes.
► One chunk covers all regular hyperslab selections from different processes.
► All hyperslab selections are singular, and the number of chunks inside each hyperslab selection is the same.

13 Case 1: One chunk covers all singular hyperslabs (figure: the selections of P0, P1, P2, and P3 all fall within a single chunk). This case can be used in the WRF-PHDF5 module, and it was verified to work.

14 Case 2: One chunk covers all regular hyperslabs (figure: a single chunk containing regular hyperslab selections from P0, P1, P2, and P3). Whether MPI collective chunk IO can optimize this pattern is another question and is outside the scope of this discussion.

15 Case 3: Multiple chunks cover singular hyperslabs (figure: each process's singular hyperslab spans several equally sized chunks). Condition for this case: the number of chunks for each process must be equal.

16 More general case
► The hyperslab does not need to be singular.
► One chunk does not need to cover all hyperslab selections for one process.
► The number of chunks covering the hyperslab selections does NOT have to be the same across processes.
► What about irregular hyperslab selections?

17 What does it look like? (figure: a hyperslab selection overlaid on the chunk boundaries).

18 More details: in each chunk the overall selection becomes irregular, so we cannot use the contiguous MPI collective IO code to describe the shape above.

19 A little more thought
► The current HDF5 implementation needs an individual IO access for the data stored in each chunk; with a large number of chunks, that causes bad performance.
► Can we avoid this in the parallel HDF5 layer?
► Is it possible to do some optimization and push the problem down into the MPI-IO layer?

20 What should we do?
► Build an MPI derived datatype to describe this pattern for each chunk; we hope that once MPI-IO obtains the whole picture, it will recognize the regular hyperslab selection and perform optimized IO.
► To understand how MPI derived datatypes work, see “Derived Data Types with MPI” from the Supercomputing Institute of the University of Minnesota: http://www.msi.umn.edu/tutorial/scicomp/general/MPI/content6.html

21 MPI Derived Datatypes
► Why? To provide a portable and efficient way to describe non-contiguous or mixed types in a message.
► What? Built from the basic MPI datatypes: a sequence of basic datatypes and displacements.

22 How to construct the DDT
► MPI_Type_contiguous
► MPI_Type_vector
► MPI_Type_indexed
► MPI_Type_struct
(A small MPI_Type_vector sketch follows below; MPI_Type_indexed is covered on the next two slides.)
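As a quick illustration of one constructor from the list above (an assumed example, not taken from the slides), MPI_Type_vector is enough to describe a regular 2-D sub-block in a few lines; the array and block sizes are made up.

    /* Assumed example: describe a 4 x 6 sub-block of a 2-D array whose rows
       hold 100 ints, using MPI_Type_vector. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* 4 blocks (rows) of 6 ints, with a stride of 100 ints between rows. */
        MPI_Datatype subblock;
        MPI_Type_vector(4, 6, 100, MPI_INT, &subblock);
        MPI_Type_commit(&subblock);

        /* `subblock` could now be used in point-to-point messages or as the
           filetype argument of MPI_File_set_view. */
        MPI_Type_free(&subblock);
        MPI_Finalize();
        return 0;
    }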

23 MPI_TYPE_INDEXED
► Parameters: count, blocklens[], offsets[], oldtype, *newtype
count: number of blocks
blocklens: number of elements in each block
offsets: displacement of each block, in units of the oldtype extent
oldtype: datatype of each element
newtype: handle (pointer) for the new derived type

24 MPI_TYPE_INDEXED example (figure: an array of 15 elements, numbered 1–15, with two selected blocks)
count = 2;
blocklengths[0] = 4;  displacements[0] = 5;
blocklengths[1] = 2;  displacements[1] = 12;
MPI_Type_indexed(count, blocklengths, displacements, MPI_INT, &indextype);
MPI_Type_commit(&indextype);  /* commit before using the new type */

25 Approach
► Build an MPI derived datatype for each chunk, using MPI_TYPE_STRUCT or MPI_TYPE_INDEXED.
► Then use MPI_TYPE_STRUCT to generate the final MPI derived datatype for each process.
► Set the MPI file view.
► Let the MPI-IO layer figure out how to optimize the IO.

26 Approach (continued)
► Start by building the “basic” MPI derived datatype inside one chunk.
► Use the “basic” MPI derived datatypes to build an “advanced” MPI derived datatype for each process.
► Use MPI_File_set_view to glue this together. Done!
Flow: obtain the hyperslab selection information → build a “basic” MPI derived datatype PER CHUNK based on the selection information → build an “advanced” MPI derived datatype PER PROCESS based on the “basic” datatypes → set the MPI file view based on the “advanced” datatype → send to the MPI-IO layer, done. (A sketch of this flow follows below.)
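The following is a minimal sketch of the flow above, an assumed illustration rather than the actual HDF5 internals: one indexed datatype per chunk describing that chunk's part of the selection, one struct datatype per process gluing the chunk types together at their file offsets, and a file view so the write can be collective. All counts, offsets, lengths, and the file name are made up.

    /* Assumed example: per-chunk "basic" types glued into a per-process
       "advanced" type, then used as a file view for a collective write. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* "Basic" datatype for chunk 0: two blocks selected inside the chunk. */
        int blocklens0[2] = {4, 2};
        int offsets0[2]   = {5, 12};          /* in units of MPI_INT */
        MPI_Datatype chunk0_type;
        MPI_Type_indexed(2, blocklens0, offsets0, MPI_INT, &chunk0_type);

        /* "Basic" datatype for chunk 1: one block selected inside the chunk. */
        int blocklens1[1] = {6};
        int offsets1[1]   = {0};
        MPI_Datatype chunk1_type;
        MPI_Type_indexed(1, blocklens1, offsets1, MPI_INT, &chunk1_type);

        /* "Advanced" per-process datatype: place each chunk datatype at the
           byte offset of its chunk in the file. */
        int          counts[2] = {1, 1};
        MPI_Aint     displs[2] = {0, 1024};   /* assumed chunk offsets in bytes */
        MPI_Datatype types[2]  = {chunk0_type, chunk1_type};
        MPI_Datatype proc_type;
        MPI_Type_create_struct(2, counts, displs, types, &proc_type);
        MPI_Type_commit(&proc_type);

        /* File view: MPI-IO now sees the whole per-process pattern at once. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "chunked.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_set_view(fh, 0, MPI_INT, proc_type, "native", MPI_INFO_NULL);

        int buf[12];                           /* 4 + 2 + 6 selected ints */
        for (int i = 0; i < 12; i++) buf[i] = i;
        MPI_File_write_all(fh, buf, 12, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&proc_type);
        MPI_Type_free(&chunk0_type);
        MPI_Type_free(&chunk1_type);
        MPI_Finalize();
        return 0;
    }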

27 Schematic of the MPI derived datatypes supporting collective chunk IO inside parallel HDF5 (figure: P0 covers chunk 1, chunk 2, ..., chunk i, ..., chunk n; P1 covers chunk n+1, chunk n+2, ..., chunk n+i, ..., chunk n+m).

28 How to start
► HDF5 uses a span tree to implement general hyperslab selections.
► The starting point is to build an MPI derived datatype for an irregular hyperslab selection with contiguous layout. (A sketch of such a datatype follows below.)
► After this step is finished, we will build an MPI derived datatype for chunked storage following the approach above.
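As a hedged sketch (assumed, not HDF5's actual code): suppose the span tree has been flattened into a list of contiguous byte ranges (offset, length) of the irregular selection within a contiguously stored dataset. A single hindexed datatype can then describe the whole selection; the ranges below are made up.

    /* Assumed example: one datatype describing an irregular selection over a
       contiguous dataset layout. */
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* Flattened (offset, length) pairs, in bytes, from the selection. */
        MPI_Aint offsets[3] = {0, 4096, 9000};
        int      lengths[3] = {256, 128, 512};

        MPI_Datatype selection_type;
        MPI_Type_create_hindexed(3, lengths, offsets, MPI_BYTE, &selection_type);
        MPI_Type_commit(&selection_type);

        /* `selection_type` could now serve as the filetype in
           MPI_File_set_view for a collective read or write. */
        MPI_Type_free(&selection_type);
        MPI_Finalize();
        return 0;
    }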

29 Now a little off-track from the original project
► We are trying to build an MPI derived datatype for irregular hyperslabs with contiguous storage. If this is solved, HDF5 can support collective IO for irregular hyperslab selections.
► It may also improve the performance of independent IO.
► Then we will build an “advanced” MPI derived datatype for chunked storage.

30 How do we describe this hyperslab selection? The span tree should handle it well.

31 Span tree handling of overlapping hyperslab selections (figure: the overlapping selections combined, shown as a sum of regions).

32 Some Performance Hints
► It was well known that performance with MPI derived datatypes was not very good; people used MPI_Pack and MPI_Unpack to gain performance in real applications.
► A recent performance study shows that MPI derived datatypes can achieve performance comparable to MPI_Pack and MPI_Unpack (http://nowlab.cis.ohio-state.edu/publications/tech-reports/2004/TR19.pdf).
