HDF5 Datasets and I/O: Dataset storage and its effect on performance
The HDF Group (www.hdfgroup.org)
HDF5 Workshop at PSI, May 30-31, 2012


1 HDF5 Datasets and I/O: Dataset storage and its effect on performance

2 Outline
- Dataset metadata and array data storage layouts
- Types of dataset storage layouts
- Factors affecting I/O performance
- I/O with compact datasets
- I/O with contiguous datasets
- I/O with chunked datasets
- Variable length data and I/O

3 HDF5 Layers
[Figure: path of the data from the application buffer to the HDF5 file]
- HDF5 Application: application buffer
- HDF5 Object Layer (API): H5Dwrite is called
- HDF5 Internals: data is prepared for I/O
- VFD Layer: SEC2 driver performs I/O
- HDF5 file

4 Goal of this talk
- Present what happens to data inside the HDF5 library
- Show how an application can control the HDF5 library's behavior
- Specifically:
  - Describe some basic operations and data structures, and explain how they affect performance and storage size
  - Give some “recipes” for improving performance

5 HDF5 DATASET METADATA

6 HDF5 Dataset
- Data array (also called raw data)
- Metadata
  - Dataspace: rank and dimensions of the dataset array
  - Datatype: information on how to interpret the data
  - Storage properties: how the array is organized on disk
  - Attributes: user-defined metadata (optional)

7 HDF5 dataset components
[Figure: a dataset consists of a dataset header (metadata) and a dataset data array (raw data)]
- Dataspace: rank 3; dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
- Datatype: IEEE 32-bit float
- Storage info: chunked, compressed
- Attributes: Time = 32.4, Pressure = 987, Temp = 56

8 HDF5 metadata
- Information about HDF5 objects, used by the HDF5 library
- Examples: object headers, B-tree nodes for groups, B-tree nodes for chunks, heaps, superblock, etc.
- Usually small compared to raw data (KB vs. MB-GB)

9 HDF5 metadata cache
[Figure: application memory holds the metadata cache (MDC) with the dataset header; the HDF5 file holds HDF5 metadata and dataset array data]
- The dataset header resides in the MDC
- The MDC is handled by the HDF5 library
- Metadata is mixed with raw data in the HDF5 file

10 HDF5 metadata cache
- Space allocated to handle pieces of the HDF5 metadata
- Allocated by the HDF5 library in the application's memory space
- Allocated per file; released when the file is closed
- Metadata cache behavior affects overall performance
- The metadata cache implementation prior to HDF5 1.6.5 could cause performance degradation for some applications

11 HDF5 DATASET STORAGE LAYOUTS

12 HDF5 dataset storage layouts
- Contiguous
- External
- Chunked
- Compact

13 Contiguous storage layout
- Contiguous storage is the default layout for an HDF5 dataset
- Dataset raw data is stored in one contiguous block in the HDF5 file

14 Contiguous storage layout
[Figure: the dataset header is held in the metadata cache (MDC) in application memory; the HDF5 file holds the dataset header and the dataset array data]
- Raw data is stored in one contiguous block in the HDF5 file

15 External storage layout
- Dataset raw data is stored in one or more external files, which should be kept together with the HDF5 file
- The layout in the external file is specified by the application
- An easy way to make legacy data available to the HDF5 library

16 External storage layout
[Figure: the dataset header is held in the metadata cache (MDC) in application memory and stored in the HDF5 file; the dataset array data is stored in a separate Unix/Windows file]
- Metadata is stored in the HDF5 file
- Raw data is stored in a separate file, as specified by the application

17 Chunked storage layout
- Chunking: a storage layout in which a dataset is partitioned into fixed-size multi-dimensional tiles, or chunks
- Each chunk is stored as a contiguous block
- The HDF5 library treats each chunk as an atomic object for I/O
- Greatly affects performance and file sizes
- Use for extendible datasets and for datasets with filters applied (checksum, compression)
- Use for sub-setting of big datasets
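
To make the effect of chunk dimensions on file size concrete, here is a minimal sketch of the chunk-count arithmetic. It is not HDF5 code: the function names are made up for illustration, and it assumes that every chunk, including partially filled edge chunks, occupies a full chunk's worth of elements before any filter is applied.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed to cover a dataset: per dimension this is
 * ceil(dims[i] / chunk[i]); the total is the product over all dimensions. */
size_t chunk_count(int rank, const size_t *dims, const size_t *chunk)
{
    size_t total = 1;
    assert(rank > 0);
    for (int i = 0; i < rank; i++) {
        assert(chunk[i] > 0);
        total *= (dims[i] + chunk[i] - 1) / chunk[i];
    }
    return total;
}

/* Elements covered by all chunks when every chunk is counted at full size
 * (assumption: edge chunks are padded out to the chunk dimensions). */
size_t chunked_elements(int rank, const size_t *dims, const size_t *chunk)
{
    size_t per_chunk = 1;
    for (int i = 0; i < rank; i++)
        per_chunk *= chunk[i];
    return chunk_count(rank, dims, chunk) * per_chunk;
}
```

For a 4 x 5 x 7 dataset and a hypothetical 2 x 2 x 2 chunk, this gives 2 x 3 x 4 = 24 chunks covering 192 elements, although the dataset itself holds only 140.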

18 Chunked storage layout
[Figure: the dataset header and chunk index are held in the metadata cache (MDC) in application memory; the dataset's chunks A, B, C, D are stored in the HDF5 file in the order C, A, B, D]
- Raw data is stored in separate chunks in the HDF5 file

19 Compact storage layout
- Raw data is stored in the dataset object header
- Raw data is read/written together with the header
- Use for small datasets (a few KB) to minimize small I/O operations

20 Compact storage layout
[Figure: the dataset header, including the dataset array data, is held in the metadata cache (MDC) in application memory and stored in the HDF5 file]
- Raw data is stored in the dataset object header

21 FACTORS AFFECTING I/O PERFORMANCE

22 HDF5 data structures
- Data structures used by the HDF5 library
  - B-trees (groups, dataset chunks)
  - Hash tables
  - Local and global heaps (variable-length data: link names, strings, etc.)
- Other concepts
  - HDF5 metadata cache
  - HDF5 chunk cache
  - Free-space management data structure
  - Etc.

23 Operations on data inside the HDF5 library
- Copying to/from internal buffers
- Datatype conversion, e.g.,
  - Float to integer
  - Little-endian to big-endian
  - 64-bit integer to 16-bit integer
- Variable-length data conversion from memory to file
- Scattering and gathering
  - Data is gathered from application buffers into internal buffers, and scattered back, for datatype conversion and partial I/O
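
The operations listed on this slide can be illustrated in a few lines of plain C. This is a sketch of the kind of per-element work involved, not the library's own conversion code. All function names are hypothetical, and clamping out-of-range values when narrowing is an assumption about the desired overflow behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Little-endian to big-endian (or back): reverse the bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/* 64-bit integer to 16-bit integer; out-of-range values are clamped
 * (assumed overflow policy for this sketch). */
int16_t narrow_i64_to_i16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Gather: copy n elements spaced `stride` apart in src into a packed buffer. */
void gather_u32(const uint32_t *src, size_t stride, size_t n, uint32_t *dst)
{
    assert(stride > 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];
}
```

Each of these touches every element once, which is why conversion and scatter/gather add a cost that grows with the amount of data transferred.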

24 Operations on data inside the HDF5 library
- Data transformation (filters, compression)
  - Checksum on raw data and metadata
  - Algebraic transform
  - GZIP and SZIP compression
  - HDF5 and user-defined data transformations

25 I/O performance
I/O performance depends on many factors:
- Storage layouts
- Dataset storage properties
- Chunking strategy
- Metadata cache performance
- Datatype conversion performance
- Other filters, such as compression
- Access patterns

26 I/O WITH DIFFERENT STORAGE LAYOUTS

27 WRITING COMPACT DATASET

28 Writing compact dataset
[Figure: dataset array data moves from application memory into the dataset header in the metadata cache (MDC), and from there into the HDF5 file]
- Raw data is written when the object header is written

29 WRITING CONTIGUOUS DATASET

30 Writing contiguous dataset
[Figure: dataset array data moves from application memory to the HDF5 file; the dataset header is held in the metadata cache (MDC)]
- Raw data is written first
- The header is written when it is flushed to the file (H5Dclose, H5Fflush, or an MDC flush done by the HDF5 library)

31 Writing contiguous dataset with conversion
[Figure: dataset array data moves from application memory through a 1 MB conversion buffer to the HDF5 file; the dataset header is held in the metadata cache (MDC)]
- Raw data goes through the conversion buffer
- The header is written when it is flushed to the file (H5Dclose, H5Fflush, or an MDC flush done by the HDF5 library)
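
A rough way to see why the conversion buffer size matters is to count how many times the buffer has to be filled. The sketch below is a simplified model, not the library's algorithm: the function names are hypothetical, and it assumes each element needs room for the larger of its source and destination sizes while it sits in the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Elements that fit in the conversion buffer at once, assuming each element
 * needs space for the wider of its source and destination representations. */
size_t elements_per_pass(size_t buf_bytes, size_t src_size, size_t dst_size)
{
    size_t widest = src_size > dst_size ? src_size : dst_size;
    assert(widest > 0 && buf_bytes >= widest);
    return buf_bytes / widest;
}

/* Number of passes through the conversion buffer for nelmts elements. */
size_t conversion_passes(size_t nelmts, size_t buf_bytes,
                         size_t src_size, size_t dst_size)
{
    size_t per_pass = elements_per_pass(buf_bytes, src_size, dst_size);
    return (nelmts + per_pass - 1) / per_pass;
}
```

Under this model, converting one million 64-bit integers to 16-bit integers through a 1 MB (1,048,576-byte) buffer takes 8 passes of at most 131,072 elements each; a larger buffer reduces the number of passes.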

32 PARTIAL I/O FOR CONTIGUOUS DATASET

33 Sub-setting of contiguous dataset: series of adjacent rows
[Figure: M adjacent rows of N elements each, shown in application memory and in the HDF5 file]
- The subset is contiguous in the file
- One I/O operation

34 Sub-setting of contiguous dataset: adjacent, partial rows
[Figure: M partial rows of N elements each, shown in application memory and in the HDF5 file]
- The subset is in M contiguous blocks in the file
- Several I/O operations

35 Sub-setting of contiguous dataset: extreme case, writing a column
[Figure: M rows of 1 element each, shown in application memory and in the HDF5 file]
- The subset data is scattered in the file in M different locations
- Several small I/O operations
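
The three sub-setting cases above follow from how a row-major array is laid out in the file. The sketch below counts the contiguous file blocks for a rectangular selection of adjacent rows in a 2-D contiguous dataset. The function name is hypothetical and the model ignores any buffering done by the library.

```c
#include <assert.h>
#include <stddef.h>

/* Number of contiguous byte ranges in the file covered by a selection of
 * sel_rows adjacent rows by sel_cols adjacent columns, in a row-major 2-D
 * dataset with ncols elements per row. */
size_t file_blocks(size_t ncols, size_t sel_rows, size_t sel_cols)
{
    assert(sel_cols > 0 && sel_cols <= ncols);
    if (sel_rows == 0)
        return 0;
    /* Full rows that are adjacent merge into a single block. */
    if (sel_cols == ncols)
        return 1;
    /* Otherwise every selected row contributes one separate block. */
    return sel_rows;
}
```

With 100 elements per row, selecting 10 full rows yields 1 block, 10 partial rows yield 10 blocks, and a single column over 10 rows yields 10 blocks of one element each.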

36 Sub-setting of contiguous dataset: data sieve buffer
[Figure: M single elements in application memory are copied by memcopy into a sieve buffer, which is then written to the HDF5 file]
- Data is copied to a sieve buffer in memory (64K)
- One write operation
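
As a back-of-the-envelope check on when one write is enough, the sketch below compares the byte span of a column selection with the sieve buffer size. It is a simplified model rather than the library's actual sieving logic, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes between the first and last selected element (inclusive) when one
 * element is selected in each of `rows` consecutive rows. */
size_t selection_span_bytes(size_t rows, size_t row_stride_elems,
                            size_t elem_size)
{
    assert(rows > 0 && row_stride_elems > 0 && elem_size > 0);
    return ((rows - 1) * row_stride_elems + 1) * elem_size;
}

/* Number of sieve-buffer-sized windows needed to cover that span. */
size_t sieve_windows(size_t span_bytes, size_t sieve_bytes)
{
    assert(sieve_bytes > 0);
    return (span_bytes + sieve_bytes - 1) / sieve_bytes;
}
```

For a column of 100 four-byte elements in a dataset with 100 elements per row, the span is 39,604 bytes, which fits in a single 64K (65,536-byte) window. A column of 1,000 such elements spans 399,604 bytes and needs 7 windows, still far fewer than 1,000 separate small writes.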

37 Performance tuning for contiguous dataset
- Datatype conversion
  - Avoid for better performance
  - Use the H5Pset_buffer function to customize the conversion buffer size
- Partial I/O
  - Write/read in big contiguous blocks
  - Use H5Pset_sieve_buf_size to improve performance for complex sub-setting
  - Caution: the sieve buffer is allocated when the first write occurs and is released when the dataset is closed. Memory use will grow if many datasets are open.

38 I/O FOR CHUNKED DATASET

39 Recall: Chunked storage layout
[Figure: the dataset header and chunk index are held in the metadata cache (MDC) in application memory; the dataset's chunks A, B, C, D are stored in the HDF5 file in the order C, A, B, D]
- Raw data is stored in separate chunks in the HDF5 file

40 HDF5 chunking
- The HDF5 library treats each chunk as an atomic object
  - Compression is applied to each chunk
  - Datatype conversion and other filters are applied per chunk
- Chunk size greatly affects performance
  - Chunk overhead adds to file size
  - Chunk processing involves many steps

41 HDF5 chunk cache
Chunk cache (general points, details later):
- Caches chunks for better performance; remains allocated across multiple calls
- Created for each chunked dataset
- The size of the chunk cache is set for the file (default size 1MB)
- Each chunked dataset has its own chunk cache
- A chunk may be too big to fit into the cache
- Memory use may grow if the application keeps opening datasets
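
Whether a chunk fits into the cache is simple arithmetic on the chunk dimensions and the element size. The sketch below uses hypothetical function names and treats 1MB as 1,048,576 bytes, which is an assumption about how the default is counted.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one uncompressed chunk. */
size_t chunk_bytes(int rank, const size_t *chunk_dims, size_t elem_size)
{
    size_t bytes = elem_size;
    assert(rank > 0 && elem_size > 0);
    for (int i = 0; i < rank; i++)
        bytes *= chunk_dims[i];
    return bytes;
}

/* How many whole chunks of that size fit in a cache of cache_bytes.
 * A result of 0 means a single chunk is already too big for the cache. */
size_t chunks_that_fit(size_t cache_bytes, size_t one_chunk_bytes)
{
    assert(one_chunk_bytes > 0);
    return cache_bytes / one_chunk_bytes;
}
```

A 256 x 256 chunk of 4-byte floats is 262,144 bytes, so four such chunks fit in a 1MB cache. A 1024 x 1024 chunk of the same type is 4 MB and does not fit at all.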

42 HDF5 chunk cache
[Figure: contents of application memory]
- Metadata cache (MDC): dataset header, chunking B-tree nodes
- Chunk caches (per dataset); default size is 1MB

43 Writing chunked dataset
[Figure: chunks A, B, C of a chunked dataset pass from application memory space through the conversion buffer into the chunk cache, then through the filter pipeline into the HDF5 file]
- Datatype conversion is performed before a chunk is placed in the cache
- A chunk is written when it is evicted from the cache
- Compression and other filters are applied on eviction

44 PARTIAL I/O FOR CHUNKED DATASET

45 Partial I/O for chunked dataset
Example: write the green subset of the dataset, converting the data.
[Figure: a dataset of six chunks; the green subset overlaps the chunks numbered 1-4]
- The dataset is stored as six chunks in the file.
- The subset spans four chunks, numbered 1-4 in the figure.
- Hence four chunks must be written to the file.
- But first, the four chunks must be read from the file, to preserve those parts of each chunk that are not to be overwritten.
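
The number of chunks a selection touches can be computed from its start, its extent, and the chunk dimensions. The sketch below is an illustration with a hypothetical function name; the concrete sizes in the usage note are invented, since the slide does not give the dataset or chunk dimensions.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks overlapped by a rectangular selection that begins at
 * start[] and covers count[] elements in each dimension. */
size_t chunks_touched(int rank, const size_t *start, const size_t *count,
                      const size_t *chunk)
{
    size_t total = 1;
    assert(rank > 0);
    for (int i = 0; i < rank; i++) {
        assert(count[i] > 0 && chunk[i] > 0);
        size_t first = start[i] / chunk[i];                 /* first chunk index */
        size_t last = (start[i] + count[i] - 1) / chunk[i]; /* last chunk index  */
        total *= last - first + 1;
    }
    return total;
}
```

For example, in a 20 x 30 dataset stored as six 10 x 10 chunks, a 10 x 10 selection starting at (5, 5) touches four chunks, while the same selection starting at (0, 0) touches only one. Aligning selections with chunk boundaries therefore avoids the read-before-write step for partially covered chunks.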

46 Partial I/O for chunked dataset
For each of the four chunks:
- Read the chunk from the file into the chunk cache, unless it is already there.
- Determine which part of the chunk will be replaced by the selection.
- Move those elements to the conversion buffer and perform the conversion.
- Move the data elements to be written from the application buffer to the conversion buffer.
- Move those elements back from the conversion buffer to the chunk cache.
- Apply filters (compression) when the chunk is flushed from the chunk cache.
Three memcopy operations are performed for each element.

47 Partial I/O for chunked dataset
[Figure: for chunk 3, data moves by memcopy between application memory, the conversion buffer, and the chunk cache; the chunk is then compressed and written to the HDF5 file]

48 I/O FOR VARIABLE-LENGTH DATASET

49 Examples of variable length data
- String
  A[0] “the first string we want to write”
  ...
  A[N-1] “the N-th string we want to write”
- Each element is a record of variable length
  A[0] (1,1,0,0,0,5,6,7,8,9) [length = 10]
  A[1] (0,0,110,2005) [length = 4]
  ...
  A[N] (1,2,3,4,5,6,7,8,9,10,11,12,...,M) [length = M]

50 Variable length data in HDF5
- Variable length description in an HDF5 application:
  typedef struct { size_t len; void *p; } hvl_t;
- The base type can be any HDF5 type: H5Tvlen_create(base_type)
- About 20 bytes of overhead for each element
- Data cannot be compressed
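
To get a feel for what the per-element overhead means, the sketch below estimates the stored size of a few variable-length records. It uses a local stand-in struct with the same two-field shape as the descriptor on the slide rather than including the HDF5 headers. The function name is hypothetical, and the overhead is passed in as a parameter because the slide only gives an approximate figure.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in with the same shape as the slide's descriptor:
 * an element count plus a pointer to the data. */
typedef struct {
    size_t len; /* number of base-type elements in this record */
    void  *p;   /* pointer to the record's data                */
} vl_elem_t;

/* Rough size estimate: payload bytes plus a fixed overhead per record. */
size_t vl_estimated_bytes(const vl_elem_t *elems, size_t n,
                          size_t base_size, size_t overhead_per_elem)
{
    size_t total = 0;
    assert(base_size > 0);
    for (size_t i = 0; i < n; i++)
        total += elems[i].len * base_size + overhead_per_elem;
    return total;
}
```

For two records of lengths 10 and 4 with a 4-byte base type, the payload is 56 bytes and an overhead of 20 bytes per record adds another 40, so short records can cost nearly as much in overhead as in data.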

51 How variable length data is stored in HDF5
[Figure: in the HDF5 file, the dataset header describes a dataset with variable length elements; each element is a pointer into a global heap, which holds the actual variable length data]

52 Variable length datasets and I/O
- Elements from the application buffer are “transferred” to/from heaps in the metadata cache during I/O
[Figure: application buffer, raw data (pointers), and a global heap in the metadata cache]

53 There may be more than one global heap
[Figure: application buffer and raw data (pointers) referring to two global heaps]

54 VL dataset and I/O
[Figure: in memory, data moves between the application buffer and the global heaps through conversion buffers; the global heaps are stored in the HDF5 file]

55 Hints for variable length data I/O
- Avoid closing/opening a file while writing VL datasets
  - Global heap information is lost
  - Global heaps may have unused space
- Avoid alternately writing different VL datasets
  - Data from different datasets will go into the same heap
- If the maximum length of the record is known, consider using fixed-length records and compression
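
The last hint trades padding for the ability to compress. The sketch below, with hypothetical function names, computes the uncompressed size of fixed-length records and how much of it is padding; how well that padding compresses depends on the data and the filter, so the numbers are only a starting point.

```c
#include <assert.h>
#include <stddef.h>

/* Uncompressed size of n_records records, each padded to max_len elements. */
size_t fixed_bytes(size_t n_records, size_t max_len, size_t base_size)
{
    return n_records * max_len * base_size;
}

/* Bytes of padding added when records of the given lengths are stored
 * as fixed-length records of max_len elements. */
size_t padding_bytes(const size_t *lengths, size_t n,
                     size_t max_len, size_t base_size)
{
    size_t pad = 0;
    for (size_t i = 0; i < n; i++) {
        assert(lengths[i] <= max_len);
        pad += (max_len - lengths[i]) * base_size;
    }
    return pad;
}
```

For records of lengths 10 and 4 with a 4-byte base type and a maximum length of 10, the fixed-length form takes 80 bytes, of which 24 are padding. If the padding is a constant fill value, it is the kind of repetitive data that compression filters usually handle well.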

56 Thank You! Questions?

