HDF5 Advanced Topics. Elena Pourmal, The HDF Group (www.hdfgroup.org). The 13th HDF and HDF-EOS Workshop (HDF/HDF-EOS Workshop XIII), November 3-5, 2009.

HDF5 Advanced Topics
Elena Pourmal, The HDF Group
The 13th HDF and HDF-EOS Workshop, November 3-5, 2009

Chunking in HDF5

Goal
To help you understand how HDF5 chunking works, so that you can store and retrieve data from HDF5 efficiently.

Recall from Intro: HDF5 Dataset
[Figure: an HDF5 dataset consists of dataset data and metadata.]
Metadata shown in the figure:
- Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
- Datatype: IEEE 32-bit float
- Attributes: Time = 32.4, Pressure = 987, Temp = 56
- Storage info: Chunked, Compressed

Contiguous storage layout
- Metadata header is separate from dataset data
- Data is stored in one contiguous block in the HDF5 file
[Figure: the dataset header (datatype, dataspace, attributes, ...) is held in the metadata cache in application memory; the dataset data is one block in the file.]

What is HDF5 Chunking?
- Data is stored in chunks of predefined size
- The two-dimensional instance may be referred to as data tiling
- HDF5 library always writes/reads the whole chunk
[Figure: contiguous layout vs. chunked layout.]

What is HDF5 Chunking?
- Dataset data is divided into equally sized blocks (chunks).
- Each chunk is stored separately as a contiguous block in the HDF5 file.
[Figure: the dataset header and the chunk index are held in the metadata cache in application memory; chunks A, B, C and D are stored in the file as separate blocks, not necessarily in order.]

Why HDF5 Chunking?
Chunking is required for several HDF5 features:
- Enabling compression and other filters like checksum
- Extendible datasets

Why HDF5 Chunking?
- If used appropriately, chunking improves partial I/O for big datasets
[Figure: a selection for which only two chunks are involved in I/O.]

Creating Chunked Dataset
1. Create a dataset creation property list.
2. Set property list to use chunked storage layout.
3. Create dataset with the above property list.

    dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
    rank = 2;
    ch_dims[0] = 100;
    ch_dims[1] = 200;
    H5Pset_chunk(dcpl_id, rank, ch_dims);
    dset_id = H5Dcreate(…, dcpl_id);
    H5Pclose(dcpl_id);

Creating Chunked Dataset
Things to remember:
- A chunk always has the same rank as the dataset
- Chunk dimensions do not need to be factors of the dataset’s dimensions
- Caution: chunks that extend past the dataset boundary may cause more I/O than desired
[Figure: chunks overlapping the dataset boundary; the white portions of the edge chunks lie outside the dataset.]
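The caution about chunk dimensions that are not factors of the dataset dimensions can be made concrete with a little arithmetic. The following is a minimal sketch in plain C (no HDF5 calls); the helper names `chunks_along` and `padding_along` are hypothetical and only illustrate the ceiling division involved, assuming edge chunks occupy a full chunk each.

```c
#include <stddef.h>

/* Number of chunks needed to cover `dim` elements with chunks of
 * `chunk` elements along one dimension (ceiling division).
 * Assumes chunk > 0. */
size_t chunks_along(size_t dim, size_t chunk)
{
    return (dim + chunk - 1) / chunk;
}

/* Elements covered by chunks but lying outside the dataset extent
 * along one dimension, i.e. the unused part of the last chunk. */
size_t padding_along(size_t dim, size_t chunk)
{
    return chunks_along(dim, chunk) * chunk - dim;
}
```

For example, a dimension of 1000 elements with a chunk dimension of 300 needs 4 chunks, and the last chunk extends 200 elements past the dataset; with a chunk dimension of 100 there is no overhang.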

Quiz time
- Why shouldn’t I make a chunk with dimension sizes equal to one?
- Can I change the chunk size after the dataset has been created?

Writing or Reading Chunked Dataset
1. The chunking mechanism is transparent to the application.
2. Use the same set of operations as for a contiguous dataset, for example:
    H5Dopen(…);
    H5Sselect_hyperslab(…);
    H5Dread(…);
3. Selections do not need to coincide precisely with the chunk boundaries.

HDF5 Chunking and compression
- Chunking is required for compression and other filters
- HDF5 filters modify data during I/O operations
- Filters provided by HDF5:
  - Checksum (H5Pset_fletcher32)
  - Data transformation (in 1.8.*)
  - Shuffling filter (H5Pset_shuffle)
- Compression (also called filters) in HDF5:
  - Scale + offset (in 1.8.*) (H5Pset_scaleoffset)
  - N-bit (in 1.8.*) (H5Pset_nbit)
  - GZIP (deflate) (H5Pset_deflate)
  - SZIP (H5Pset_szip)

HDF5 Third-Party Filters
Compression methods supported by the HDF5 user community:
- LZO lossless compression (PyTables)
- BZIP2 lossless compression (PyTables)
- BLOSC lossless compression (PyTables)
- LZF lossless compression (h5py)

Creating Compressed Dataset
1. Create a dataset creation property list
2. Set property list to use chunked storage layout
3. Set property list to use filters
4. Create dataset with the above property list

    crp_id = H5Pcreate(H5P_DATASET_CREATE);
    rank = 2;
    ch_dims[0] = 100;
    ch_dims[1] = 100;
    H5Pset_chunk(crp_id, rank, ch_dims);
    H5Pset_deflate(crp_id, 9);
    dset_id = H5Dcreate(…, crp_id);
    H5Pclose(crp_id);

Performance Issues, or What everyone needs to know about chunking, compression and chunk cache

Accessing a row in contiguous dataset
- One seek is needed to find the starting location of the row of data.
- Data is read/written using one disk access.

Accessing a row in chunked dataset
- Five seeks are needed, one to find each chunk.
- Data is read/written using five disk accesses.
- For this access pattern, chunked storage is less efficient than contiguous storage.

Quiz time
- How might I improve this situation, if it is common to access my data in this way?

Accessing data in contiguous dataset
- M seeks are needed, one to find the starting location of each element.
- Data is read/written using M disk accesses.
- Performance may be very bad.
[Figure: a selection spanning M rows.]

Motivation for chunking storage
- Two seeks are needed to find two chunks.
- Data is read/written using two disk accesses.
- For this pattern chunking helps with I/O performance.
[Figure: the same selection spanning M rows, covered by two chunks.]
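The seek counts on the last few slides follow from how many chunks a selection overlaps. As a rough sketch in plain C (no HDF5 calls), the hypothetical helper `chunks_touched` below counts the chunks overlapped by a rectangular selection; the chunk and selection sizes in the usage note are illustrative assumptions, since the slide figures are not part of this transcript.

```c
#include <stddef.h>

/* Count the chunks that a rectangular selection overlaps.
 * start[] and count[] describe the selection (each count[d] >= 1),
 * chunk[] holds the chunk dimensions (each > 0); all arrays have
 * `rank` entries. */
size_t chunks_touched(int rank, const size_t *start,
                      const size_t *count, const size_t *chunk)
{
    size_t total = 1;
    for (int d = 0; d < rank; d++) {
        size_t first = start[d] / chunk[d];                   /* first chunk hit */
        size_t last  = (start[d] + count[d] - 1) / chunk[d];  /* last chunk hit  */
        total *= last - first + 1;
    }
    return total;
}
```

With assumed 100 x 200 chunks, a row of 1000 elements overlaps 5 chunks, while a column of 200 elements overlaps only 2, which is the kind of difference the slides describe.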

Quiz time
- If I know I shall always access a column at a time, what size and shape should I make my chunks?

Motivation for chunk cache
- The selection shown is written by two H5Dwrite calls (one for each row).
- Chunks A and B are accessed twice (one time for each row).
- If both chunks fit into cache, only two I/O accesses are needed to write the shown selections.
[Figure: two rows, each crossing chunks A and B, written with H5Dwrite.]

Motivation for chunk cache
Question: What happens if there is space for only one chunk at a time?
[Figure: two rows, each crossing chunks A and B, written with H5Dwrite.]

HDF5 raw data chunk cache
- Improves performance whenever the same chunks are read or written multiple times.
- Current implementation doesn’t adjust parameters automatically (cache size, size of hash table).
- Chunks are indexed with a simple hash table.
- Hash function = (cindex mod nslots), where cindex is the linear index into a hypothetical array of chunks and nslots is the size of the hash table.
- Only one of several chunks with the same hash value stays in cache.
- nslots should be a prime number to minimize the number of hash value collisions.
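To illustrate the hash scheme described on this slide, here is a small sketch in plain C. It is not the library's source code: the function names are hypothetical, and the row-major ordering of the "hypothetical array of chunks" is an assumption made for the example.

```c
#include <stddef.h>

/* Linear index of a chunk in the hypothetical array of chunks,
 * assuming row-major order. coord[] is the chunk position in chunk
 * units, nchunks[] the number of chunks along each dimension. */
size_t chunk_linear_index(int rank, const size_t *coord,
                          const size_t *nchunks)
{
    size_t cindex = 0;
    for (int d = 0; d < rank; d++)
        cindex = cindex * nchunks[d] + coord[d];
    return cindex;
}

/* Hash slot as described on the slide: cindex mod nslots (nslots > 0). */
size_t chunk_slot(size_t cindex, size_t nslots)
{
    return cindex % nslots;
}
```

With a 10 x 10 grid of chunks and nslots = 10, every chunk in the same column lands in the same slot (for example, indices 3, 13 and 23 all map to slot 3), so only one of them can stay cached. With the prime nslots = 11 those three indices map to slots 3, 2 and 1.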

HDF5 Chunk Cache APIs
- H5Pset_chunk_cache sets raw data chunk cache parameters for a dataset:
    H5Pset_chunk_cache(dapl, rdcc_nslots, rdcc_nbytes, rdcc_w0);
- H5Pset_cache sets raw data chunk cache parameters for all datasets in a file:
    H5Pset_cache(fapl, 0, nslots, 5*1024*1024, rdcc_w0);

Hints for Chunk Settings
- Chunk dimension sizes should align as closely as possible with hyperslab dimensions for read/write
- Chunk cache size (rdcc_nbytes) should be large enough to hold all the chunks in a selection
  - If this is not possible, it may be best to disable chunk caching altogether (set rdcc_nbytes to 0)
- rdcc_nslots should be a prime number that is at least 10 to 100 times the number of chunks that can fit into rdcc_nbytes
- rdcc_w0 should be set to 1 if chunks that have been fully read/written will never be read/written again
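The rdcc_nslots hint can be turned into a small calculation. The sketch below is plain C with hypothetical helper names (`next_prime`, `suggest_nslots`); it simply applies the rule of thumb from the slide and is not an HDF5 API.

```c
#include <stddef.h>

/* Trial-division primality test; adequate for table sizes of this scale. */
static int is_prime(size_t n)
{
    if (n < 2)
        return 0;
    for (size_t d = 2; d * d <= n; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Smallest prime greater than or equal to n. */
size_t next_prime(size_t n)
{
    while (!is_prime(n))
        n++;
    return n;
}

/* Suggest a value for rdcc_nslots: a prime at least `factor` times the
 * number of whole chunks that fit into rdcc_nbytes. The slide suggests
 * a factor between 10 and 100. Assumes chunk_bytes > 0. */
size_t suggest_nslots(size_t rdcc_nbytes, size_t chunk_bytes, size_t factor)
{
    size_t chunks_in_cache = rdcc_nbytes / chunk_bytes;
    return next_prime(chunks_in_cache * factor);
}
```

For a 5 MB cache (5*1024*1024 bytes) and 2,000,000-byte chunks, two chunks fit, so a factor of 100 gives 211 and a factor of 10 gives 23.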

The Good and The Ugly: Reading a row
- The Good: If both chunks fit into cache, 2 disk accesses are needed to read the data.
- The Ugly: If only one chunk fits into cache, 2M disk accesses are needed to read the data (compare with M accesses for contiguous storage).
[Figure: M rows crossing chunks A and B; each row is read by a separate call to H5Dread.]
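The 2 versus 2M figures can be reproduced with a toy cache simulation. The sketch below is plain C and uses a least-recently-used eviction policy as a simplifying assumption; the real chunk cache uses the hash table and preemption policy described earlier, so this only models the counting argument on the slide. The function name and the `MAX_CACHED` limit are hypothetical.

```c
#include <stddef.h>

#define MAX_CACHED 16  /* upper bound on cache capacity in this sketch */

/* Count chunk reads from disk when each of `rows` read calls touches
 * chunks 0..nchunks-1 in order, with a cache that holds at most
 * `capacity` whole chunks and evicts the least recently used one. */
size_t simulate_row_reads(size_t rows, size_t nchunks, size_t capacity)
{
    size_t cache[MAX_CACHED];   /* chunk ids, least recently used first */
    size_t used = 0, disk_reads = 0;

    if (capacity > MAX_CACHED)
        capacity = MAX_CACHED;

    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < nchunks; c++) {
            size_t pos = 0;
            while (pos < used && cache[pos] != c)
                pos++;

            if (pos == used) {          /* miss: chunk comes from disk */
                disk_reads++;
                if (capacity == 0)
                    continue;           /* caching disabled */
                if (used < capacity)
                    used++;             /* take the free slot at the end */
                else
                    pos = 0;            /* evict the least recently used */
            }
            /* Close the gap at pos and store c as most recently used. */
            for (size_t i = pos; i + 1 < used; i++)
                cache[i] = cache[i + 1];
            cache[used - 1] = c;
        }
    }
    return disk_reads;
}
```

With M = 10 rows over two chunks, a cache that holds both chunks needs 2 disk reads, while a cache that holds only one chunk needs 20, that is 2M.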

Case study: Writing Chunked Dataset
- 1000x100x100 dataset
  - 4 byte integers
  - Random values
- 50x100x100 chunks (20 total)
  - Chunk size: 2 MB
- Write the entire dataset using 1x100x100 slices
  - Slices are written sequentially
- Chunk cache size of 1 MB (default) compared with chunk cache size of 5 MB

Test Setup
- 20 chunks
- 1000 slices
- Chunk size ~ 2 MB
- Total size ~ 40 MB
- Each plane ~ 40 KB

Aside: Writing dataset with contiguous storage
- 1000 disk accesses to write 1000 planes
- Total size written 40 MB

Writing chunked dataset
Example: Chunk fits into cache
- Chunk is filled in cache and then written to disk
- 20 disk accesses are needed (compared with 1000 disk accesses for contiguous storage)
- Total size written 40 MB

Writing chunked dataset
Example: Chunk doesn’t fit into cache
For each chunk (20 total)
    1. Fill chunk in memory with the first plane and write it to the file
    2. Write 49 new planes to file directly
End For
Total disk accesses: 20 x (1 + 49) = 1000
Total data written: ~80 MB (vs. 40 MB)
[Figure: chunks A and B in the file and in the chunk cache.]
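The totals for the uncompressed case can be checked with a short calculation. This is a sketch in plain C under the assumptions stated on the slides (the first plane of a chunk triggers a write of the whole chunk, later planes are written directly); the `write_cost` struct and function name are hypothetical.

```c
#include <stddef.h>

typedef struct {
    size_t accesses;       /* number of write operations */
    size_t bytes_written;  /* total bytes sent to the file */
} write_cost;

/* Cost of writing a dataset plane by plane without compression.
 * nchunks: number of chunks; planes_per_chunk: planes in one chunk
 * (>= 1); plane_bytes: size of one plane; chunk_fits: nonzero if a
 * whole chunk fits into the chunk cache. */
write_cost uncompressed_write_cost(size_t nchunks, size_t planes_per_chunk,
                                   size_t plane_bytes, int chunk_fits)
{
    size_t chunk_bytes = planes_per_chunk * plane_bytes;
    size_t later_planes = planes_per_chunk - 1;
    write_cost cost;

    if (chunk_fits) {
        /* Each chunk is filled in cache and written once. */
        cost.accesses = nchunks;
        cost.bytes_written = nchunks * chunk_bytes;
    } else {
        /* Whole chunk written with the first plane, then one direct
         * write per remaining plane. */
        cost.accesses = nchunks * (1 + later_planes);
        cost.bytes_written =
            nchunks * (chunk_bytes + later_planes * plane_bytes);
    }
    return cost;
}
```

For 20 chunks of 50 planes with 40,000-byte planes, this gives 20 accesses and 40,000,000 bytes when the chunk fits, and 1000 accesses and 79,200,000 bytes (the "~80 MB" on the slide) when it does not.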

Writing compressed chunked dataset
Example: Chunk fits into cache
For each chunk (20 total)
    1. Fill chunk in memory, compress it and write it to file
End For
Total disk accesses: 20
Total data written: less than 40 MB
[Figure: chunks A and B in the chunk cache and the corresponding compressed chunks in the file.]

Writing compressed chunked dataset
Example: Chunk doesn’t fit into cache
For each chunk (20 total)
    Fill chunk with the first plane, compress, write to a file
    For each new plane (49 planes)
        Read chunk back
        Fill chunk with the plane
        Compress
        Write chunk to a file
    End For
End For
Total disk accesses: 20 x (1 + 2 x 49) = 1980
Total data written and read: ? (see next slide)
Note: HDF5 can probably detect such behavior and increase cache size
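The access count for the compressed case follows the same pattern, with a read and a write for every plane after the first. A minimal sketch in plain C, with a hypothetical function name and the per-plane behavior taken from the pseudocode above:

```c
#include <stddef.h>

/* Disk accesses needed to write a compressed chunked dataset plane by
 * plane. chunk_fits is nonzero if a whole chunk fits into the chunk
 * cache; planes_per_chunk must be at least 1. */
size_t compressed_write_accesses(size_t nchunks, size_t planes_per_chunk,
                                 int chunk_fits)
{
    if (chunk_fits)
        return nchunks;  /* one compressed write per chunk */

    /* First plane: one write. Each later plane: read the chunk back,
     * then write it again after recompression. */
    return nchunks * (1 + 2 * (planes_per_chunk - 1));
}
```

With 20 chunks of 50 planes this returns 20 when the chunk fits into the cache and 1980 when it does not.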

Effect of Chunk Cache Size on Write
(-- marks a value that is missing from this transcript)

No compression, chunk size is 2 MB
Cache size        I/O operations   Total data written   File size
1 MB (default)    --               -- MB                38.15 MB
5 MB              --               -- MB                38.15 MB

Gzip compression
Cache size        I/O operations   Total data written   File size
1 MB (default)    --               -- MB (-- MB read)   -- MB
5 MB              --               -- MB                -- MB

Effect of Chunk Cache Size on Write
With the 1 MB cache size, a chunk will not fit into the cache
- All writes to the dataset must be immediately written to disk
- With compression, the entire chunk must be read and rewritten every time a part of the chunk is written to
  - Data must also be decompressed and recompressed each time
  - Non-sequential writes could result in a larger file
- Without compression, the entire chunk must be written when it is first written to the file
  - If the selection were not contiguous on disk, it could require as much as 1 I/O disk access for each element

Effect of Chunk Cache Size on Write
With the 5 MB cache size, the chunk is written only after it is full
- Drastically reduces the number of I/O operations
- Reduces the amount of data that must be written (and read)
- Reduces processing time, especially with the compression filter

Conclusion
- It is important to make sure that a chunk will fit into the raw data chunk cache
- If you will be writing to multiple chunks at once, you should increase the cache size even more
- Try to design chunk dimensions to minimize the number of chunks you will be writing to at once

Reading Chunked Dataset
- Read the same dataset, again by slices, but the slices cross through all the chunks
- 2 orientations for read plane
  - Plane includes fastest changing dimension
  - Plane does not include fastest changing dimension
- Measure total read operations, and total size read
- Chunk sizes of 50x100x100, and 10x100x100
- 1 MB cache

Test Setup
[Figure: the chunks and the read slices, vertical and horizontal.]

Aside: Reading from contiguous dataset
Repeat 100 times, once for each plane
    Repeat 1000 times
        Read a row
        Seek to the beginning of the next read
Total: 10^5 disk accesses

Reading chunked dataset
No compression; chunk fits into cache
For each plane (100 total)
    For each chunk (20 total)
        Read chunk
        Extract 50 rows
    End For
End For
Total: 2000 disk accesses
Chunk doesn’t fit into cache
- Data is read directly from the file
- 10^5 disk accesses

Reading chunked dataset
Compression; cache size doesn’t matter in this case
For each plane (100 total)
    For each chunk (20 total)
        Read chunk, uncompress
        Extract 50 rows
    End For
End For
Total: 2000 disk accesses

Results
Read slice includes fastest changing dimension
(-- marks a value that is missing from this transcript)

Chunk size   Compression   I/O operations   Total data read
50           Yes           --               -- MB
10           Yes           --               -- MB
50           No            --               -- MB
10           No            --               -- MB

Aside: Reading from contiguous dataset
Repeat for each plane (100 total)
    Repeat for each column (1000 total)
        Repeat for each element (100 total)
            Read element
            Seek to the next one
Total: 10^7 disk accesses

Reading chunked dataset
No compression; chunk fits into cache
For each plane (100 total)
    For each chunk (20 total)
        Read chunk
        Extract 50 columns
    End For
End For
Total: 2000 disk accesses
Chunk doesn’t fit into cache
- Data is read directly from the file
- 10^7 disk operations

Reading chunked dataset
Compression; cache size doesn’t matter
For each plane (100 total)
    For each chunk (20 total)
        Read chunk, uncompress
        Extract 50 columns
    End For
End For
Total: 2000 disk accesses
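The access counts quoted in the read case study are products of the loop bounds in the pseudocode. The sketch below, in plain C with hypothetical names, reproduces them for a d0 x d1 x d2 dataset, assuming d2 is the fastest changing dimension and that every read plane crosses every chunk.

```c
#include <stddef.h>

/* Dataset extent d0 x d1 x d2, with d2 the fastest changing dimension. */
typedef struct {
    size_t d0, d1, d2;
} extent3;

/* Contiguous storage, planes that include the fastest changing
 * dimension: d1 planes, each read as d0 rows of contiguous elements. */
size_t contiguous_reads_by_row(extent3 e)
{
    return e.d1 * e.d0;
}

/* Contiguous storage, planes that exclude the fastest changing
 * dimension: d2 planes, each read one element at a time. */
size_t contiguous_reads_by_element(extent3 e)
{
    return e.d2 * e.d0 * e.d1;
}

/* Chunked storage with whole-chunk reads: each plane reads each chunk. */
size_t chunked_reads(size_t nplanes, size_t nchunks)
{
    return nplanes * nchunks;
}
```

For the 1000x100x100 dataset these give 10^5 and 10^7 accesses for contiguous storage, and 100 x 20 = 2000 accesses for 20 chunks read whole.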

Results (continued)
Read slice does not include fastest changing dimension
(-- marks a value that is missing from this transcript)

Chunk size   Compression   I/O operations   Total data read
50           Yes           --               -- MB
10           Yes           --               -- MB
50           No            --               -- MB
10           No            --               -- MB

Effect of Cache Size on Read
- When compression is enabled, the library must always read the entire chunk once for each call to H5Dread (unless it is in cache)
- When compression is disabled, the library’s behavior depends on the cache size relative to the chunk size.
  - If the chunk fits in cache, the library reads the entire chunk once for each call to H5Dread
  - If the chunk does not fit in cache, the library reads only the data that is selected
    - More read operations, especially if the read plane does not include the fastest changing dimension
    - Less total data read

Conclusion
- On read, cache size does not matter when compression is enabled.
- Without compression, the cache must be large enough to hold all of the chunks to get good performance.
- The optimum cache size depends on the exact shape of the data, the hardware, and the access pattern.

Thank You!

Acknowledgements
This work was supported by cooperative agreement number NNX08AO77A from the National Aeronautics and Space Administration (NASA). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author[s] and do not necessarily reflect the views of the National Aeronautics and Space Administration.

Questions/comments?