Copyright © 2010 The HDF Group. All Rights Reserved1 Data Storage and I/O in HDF5.

Slides:



Advertisements
Similar presentations
More on File Management
Advertisements

Part IV: Memory Management
File Systems.
The HDF Group November 3-5, 2009HDF/HDF-EOS Workshop XIII1 HDF5 Advanced Topics Elena Pourmal The HDF Group The 13 th HDF and HDF-EOS.
NetCDF An Effective Way to Store and Retrieve Scientific Datasets Jianwei Li 02/11/2002.
Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.
File System Implementation CSCI 444/544 Operating Systems Fall 2008.
FALL 2006CENG 351 Data Management and File Structures1 External Sorting.
METU Department of Computer Eng Ceng 302 Introduction to DBMS Disk Storage, Basic File Structures, and Hashing by Pinar Senkul resources: mostly froom.
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Chapter 13 Disk Storage, Basic File Structures, and Hashing.
Chapter 12: File System Implementation
HDF4 and HDF5 Performance Preliminary Results Elena Pourmal IV HDF-EOS Workshop September
NetCDF4 Performance Benchmark. Part I Will the performance in netCDF4 comparable with that in netCDF3? Will the performance in netCDF4 comparable with.
Chapter 5 Part 2 Secondary Storage Mgt. File Mgt. in Popular OSs
File Systems (1). Readings r Silbershatz et al: 10.1,10.2,
HDF 1 HDF5 Advanced Topics Object’s Properties Storage Methods and Filters Datatypes HDF and HDF-EOS Workshop VIII October 26, 2004.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 17 Disk Storage, Basic File Structures, and Hashing.
The HDF Group April 17-19, 2012HDF/HDF-EOS Workshop XV1 Introduction to HDF5 Barbara Jones The HDF Group The 15 th HDF and HDF-EOS Workshop.
1 High level view of HDF5 Data structures and library HDF Summit Boeing Seattle September 19, 2006.
HDF5 A new file format & software for high performance scientific data management.
Sep , 2010HDF/HDF-EOS Workshop XIV1 HDF5 Advanced Topics Neil Fortner The HDF Group The 14 th HDF and HDF-EOS Workshop September 28-30, 2010.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
February 2-3, 2006SRB Workshop, San Diego P eter Cao, NCSA Mike Wan, SDSC Sponsored by NLADR, NFS PACI Project in Support of NCSA-SDSC Collaboration Object-level.
The HDF Group HDF5 Datasets and I/O Dataset storage and its effect on performance May 30-31, 2012HDF5 Workshop at PSI 1.
File System Implementation Chapter 12. File system Organization Application programs Application programs Logical file system Logical file system manages.
HDF 1 New Features in HDF Group Revisions HDF and HDF-EOS Workshop IX November 30, 2005.
April 28, 2008LCI Tutorial1 Introduction to HDF5 Tools Tutorial Part II.
CSC 322 Operating Systems Concepts Lecture - 20: by Ahmed Mumtaz Mustehsan Special Thanks To: Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall,
1 CS 430 Database Theory Winter 2005 Lecture 16: Inside a DBMS.
©Silberschatz, Korth and Sudarshan11.1Database System Concepts Chapter 11: Storage and File Structure File Organization Organization of Records in Files.
File Storage Organization The majority of space on a device is reserved for the storage of files. When files are created and modified physical blocks are.
1 HDF5 Life cycle of data Boeing September 19, 2006.
Chapter 13 Disk Storage, Basic File Structures, and Hashing. Copyright © 2004 Pearson Education, Inc.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 11: File System Implementation.
HDF Hierarchical Data Format Nancy Yeager Mike Folk NCSA University of Illinois at Urbana-Champaign, USA
The HDF Group November 3-5, 2009HDF/HDF-EOS Workshop XIII1 HDF5 Advanced Topics Elena Pourmal The HDF Group The 13 th HDF and HDF-EOS.
File Structures. 2 Chapter - Objectives Disk Storage Devices Files of Records Operations on Files Unordered Files Ordered Files Hashed Files Dynamic and.
CS 241 Section Week #9 (11/05/09). Topics MP6 Overview Memory Management Virtual Memory Page Tables.
1/14/2005Yan Huang - CSCI5330 Database Implementation – Storage and File Structure Storage and File Structure II Some of the slides are from slides of.
September 9, 2008SPEEDUP Workshop - HDF5 Tutorial1 Introduction to HDF5 Command-line Tools.
The HDF Group HDF5 Chunking and Compression Performance tuning 10/17/15 1 ICALEPCS 2015.
Lecture 10 Page 1 CS 111 Summer 2013 File Systems Control Structures A file is a named collection of information Primary roles of file system: – To store.
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems File systems.
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems File systems.
March 9, th International LCI Conference - HDF5 Tutorial1 HDF5 Advanced Topics.
The HDF Group 10/17/15 1 HDF5 vs. Other Binary File Formats Introduction to the HDF5’s most powerful features ICALEPCS 2015.
April 28, 2008LCI Tutorial1 Parallel HDF5 Tutorial Tutorial Part IV.
Chapter 5 Record Storage and Primary File Organizations
FILE SYSTEM IMPLEMENTATION 1. 2 File-System Structure File structure Logical storage unit Collection of related information File system resides on secondary.
The HDF Group Introduction to HDF5 Session 7 Datatypes 1 Copyright © 2010 The HDF Group. All Rights Reserved.
NetCDF Data Model Details Russ Rew, UCAR Unidata NetCDF 2009 Workshop
The HDF Group Introduction to HDF5 Session ? High Performance I/O 1 Copyright © 2010 The HDF Group. All Rights Reserved.
File System Implementation
HDF and HDF-EOS Workshop XII
Module 11: File Structure
Chapter 11: File System Implementation
Introduction to HDF5 Session Five Reading & Writing Raw Data Values
FileSystems.
Chapter 11: Storage and File Structure
HDF5 Metadata and Page Buffering
File System Structure How do I organize a disk into a file system?
Database Management Systems (CS 564)
9/12/2018.
Main Memory Management
What NetCDF users should know about HDF5?
Lecture 10: Buffer Manager and File Organization
Chapter 8: Main Memory.
Chapter 11: File System Implementation
Disk Storage, Basic File Structures, and Hashing
File System B. Ramamurthy B.Ramamurthy 11/27/2018.
Presentation transcript:

Copyright © 2010 The HDF Group. All Rights Reserved1 Data Storage and I/O in HDF5

Copyright © 2010 The HDF Group. All Rights Reserved2 Software stack Life cycle: What happens to data when it is transferred from application buffer to HDF5 file and from HDF5 file to application buffer? File or other “storage” Virtual file I/O Library internals Object API Application Data buffer H5Dwrite ? Unbuffered I/O Data in a file

Copyright © 2010 The HDF Group. All Rights Reserved3 Goals Understanding of what is happening to data inside the HDF5 library will help to write efficient applications Goals of this talk: Describe some basic operations and data structures, and explain how they affect performance and storage sizes Give some “recipes” for how to improve performance

Copyright © 2010 The HDF Group. All Rights Reserved4 Topics Dataset metadata and array data storage layouts Types of dataset storage layouts Factors affecting I/O performance I/O with compact datasets I/O with contiguous datasets I/O with chunked datasets Variable length data and I/O

Copyright © 2010 The HDF Group. All Rights Reserved5 HDF5 dataset metadata and array data storage layouts

Copyright © 2010 The HDF Group. All Rights Reserved6 HDF5 Dataset Metadata Dataspace: Rank, dimensions of dataset array Datatype: Information on how to interpret data Storage Properties: How array is organized on disk Attributes: User-defined metadata (optional) Data array Ordered collection of identically typed data items distinguished by their indices

Copyright © 2010 The HDF Group. All Rights Reserved7 HDF5 Dataset Dataset dataMetadata Dataspace 3 Rank Dim_2 = 5 Dim_1 = 4 Dimensions Time = 32.4 Pressure = 987 Temp = 56 Attributes Chunked Compressed Dim_3 = 7 Storage info IEEE 32-bit float Datatype

Copyright © 2010 The HDF Group. All Rights Reserved8 Metadata cache and dataset data Dataset data typically kept in application memory Dataset header in separate space – metadata cache Application memory Metadata cache File Dataset data Dataset header Dataset data Dataset header …………. Datatype Dataspace …………. Attributes …

Copyright © 2010 The HDF Group. All Rights Reserved9 Metadata and metadata cache HDF5 metadata Information about HDF5 objects used by the library Examples: object headers, B-tree nodes for group, B-Tree nodes for chunks, heaps, super-block, etc. Usually small compared to raw data sizes (KB vs. MB-GB)

Copyright © 2010 The HDF Group. All Rights Reserved10 Metadata and metadata cache Metadata cache Space allocated to handle pieces of the HDF5 metadata Allocated by the HDF5 library in application’s memory space Cache behavior affects overall performance

Copyright © 2010 The HDF Group. All Rights Reserved11 Types of data storage layouts

Copyright © 2010 The HDF Group. All Rights Reserved12 HDF5 datasets storage layouts Contiguous Chunked Compact External

Copyright © 2010 The HDF Group. All Rights Reserved13 Contiguous storage layout Metadata header separate from dataset data Data stored in one contiguous block in HDF5 file Application memory Metadata cache Dataset header …………. Datatype Dataspace …………. Attributes … File Dataset data

Copyright © 2010 The HDF Group. All Rights Reserved14 Chunked storage Chunking – storage layout where a dataset is partitioned in fixed-size multi-dimensional tiles or chunks Used for extendible datasets and datasets with filters applied (checksum, compression) HDF5 library treats each chunk as atomic object Greatly affects performance and file sizes

Copyright © 2010 The HDF Group. All Rights Reserved15 Chunked storage layout Dataset data divided into equal sized blocks (chunks) Each chunk stored separately as a contiguous block in HDF5 file Application memory Metadata cache Dataset header …………. Datatype Dataspace …………. Attributes … File Dataset data ADCB header Chunk index A B CD

Copyright © 2010 The HDF Group. All Rights Reserved16 Compact storage layout Dataset data and metadata stored together in the object header File Application memory Dataset header …………. Datatype Dataspace …………. Attributes … Metadata cacheDataset data

Copyright © 2010 The HDF Group. All Rights Reserved17 External storage layout Metadata header separate from dataset data Data stored in one or more blocks in another file(s) HDF5 File Application memory Dataset header …………. Datatype Dataspace …………. Attributes … Metadata cacheDataset data External File List File1 1 File2 2 File3 3 File4 4 header External File List

Copyright © 2010 The HDF Group. All Rights Reserved18 Factors affecting I/O performance

Copyright © 2010 The HDF Group. All Rights Reserved19 What goes on inside the library? Operations on data inside the library Copying to/from internal buffers Datatype conversion Scattering - gathering Data transformation (filters, compression) Data structures used B-trees (groups, dataset chunks) Hash tables Local and Global heaps (variable length data: link names, strings, etc.) Other concepts HDF5 metadata, metadata cache Chunking, chunk cache

Copyright © 2010 The HDF Group. All Rights Reserved20 Operations on data inside the library Copying to/from internal buffers Datatype conversion, such as float  integer Little-endian  big-endian 64-bit integer to 16-bit integer Scattering - gathering Data is scattered/gathered from/to application buffers into internal buffers for datatype conversion and partial I/O Data transformation (filters, compression) Checksum on raw data and metadata (in 1.8.0) Algebraic transform GZIP and SZIP compressions User-defined filters

Copyright © 2010 The HDF Group. All Rights Reserved21 I/O performance I/O performance depends on Storage layouts Dataset storage properties Chunking strategy Metadata cache performance Datatype conversion performance Other filters, such as compression Access patterns

Copyright © 2010 The HDF Group. All Rights Reserved22 I/O with different storage layouts

Copyright © 2010 The HDF Group. All Rights Reserved23 Writing a compact dataset Application memory Dataset header …………. Datatype Dataspace …………. Attributes … File Metadata cache Dataset data One write to store header and dataset data Dataset data

Copyright © 2010 The HDF Group. All Rights Reserved24 Writing contiguous dataset – no conversion Application memory Metadata cache Dataset header …………. Datatype Dataspace …………. Attributes … File Dataset data No sub-setting in memory or a file is performed

Copyright © 2010 The HDF Group. All Rights Reserved25 Writing a contiguous dataset with datatype conversion Dataset header …………. Datatype Dataspace …………. Attribute 1 Attribute 2 ………… Application memory Metadata cache File Conversion buffer 1MB Dataset data No sub-setting in memory or a file is performed

Copyright © 2010 The HDF Group. All Rights Reserved26 Partial I/O with contiguous datasets

Copyright © 2010 The HDF Group. All Rights Reserved27 Writing whole dataset – contiguous rows File Application data in memory Data is contiguous in a file One I/O operation M rows M N

Copyright © 2010 The HDF Group. All Rights Reserved28 Sub-setting of contiguous dataset Series of adjacent rows File N Application data in memory Subset – contiguous in a file One I/O operation M rows M Entire dataset – contiguous in a file

Copyright © 2010 The HDF Group. All Rights Reserved29 Sub-setting of contiguous dataset Adjacent, partial rows File N M … Application data in memory Data is scattered in a file in M contiguous blocks Several small I/O operation N elements

Copyright © 2010 The HDF Group. All Rights Reserved30 Sub-setting of contiguous dataset Extreme case: writing a column N M Application data in memory Subset data is scattered in a file in M different locations Several small I/O operation … 1 element

Copyright © 2010 The HDF Group. All Rights Reserved31 Sub-setting of contiguous dataset Data sieve buffer File N M … Application data in memory Data is scattered in a file 1 element Data in a sieve buffer (64K) in memory memcopy

Copyright © 2010 The HDF Group. All Rights Reserved32 Performance tuning for contiguous dataset Datatype conversion Avoid for better performance Use H5Pset_buffer function to customize conversion buffer size Partial I/O Write/read in big contiguous blocks Use H5Pset_sieve_buf_size to improve performance for complex subsetting

Copyright © 2010 The HDF Group. All Rights Reserved33 I/O with Chunking

Copyright © 2010 The HDF Group. All Rights Reserved34 Chunked storage layout Raw data divided into equal sized blocks (chunks) Each chunk stored separately as a contiguous block in a file Application memory Metadata cache Dataset header …………. Datatype Dataspace …………. Attributes … File Dataset data AD CB header Chunk index A B CD

Copyright © 2010 The HDF Group. All Rights Reserved35 Information about chunking HDF5 library treats each chunk as atomic object Compression and other filters are applied to each chunk Datatype conversion is performed on each chunk Chunk size greatly affects performance Chunk overhead adds to file size Chunk processing involves many steps Chunk cache Caches chunks for better performance Size of chunk cache is set for file (default size 1MB) Each chunked dataset has its own chunk cache Chunk may be too big to fit into cache Memory may grow if application keeps opening datasets

Copyright © 2010 The HDF Group. All Rights Reserved36 Chunk cache Dataset_1 header ………… Application memory Metadata cache Chunking B-tree nodes Chunk cache Default size is 1MB Dataset_N header ………… ………

Copyright © 2010 The HDF Group. All Rights Reserved37 Writing chunked dataset CB A ………….. Filters including compression are applied when chunk is evicted from cache ABC C File Chunk cacheChunked dataset Filter pipeline

Copyright © 2010 The HDF Group. All Rights Reserved38 Partial I/O with Chunking

Copyright © 2010 The HDF Group. All Rights Reserved39 Partial I/O for chunked dataset Example: write the green subset from the dataset, converting the data Dataset is stored as six chunks in the file. The subset spans four chunks, numbered 1-4 in the figure. Hence four chunks must be written to the file. But first, the four chunks must be read from the file, to preserve those parts of each chunk that are not to be overwritten

Copyright © 2010 The HDF Group. All Rights Reserved40 Partial I/O for chunked dataset For each of four chunks on writing: Read chunk from file into chunk cache, unless it’s already there Determine which part of the chunk will be replaced by the selection Move those elements from application buffer to conversion buffer Perform conversion Replace that part of the chunk in the cache with the corresponding elements from the conversion buffer Apply filters (compression) when chunk is flushed from chunk cache For each element 3 (or more) memcopy operations are performed 12 34

Copyright © 2010 The HDF Group. All Rights Reserved41 Partial I/O for chunked dataset 3 Application memory conversion buffer Application buffer Chunk Elements participating in I/O are gathered into corresponding chunk after going through conversion buffer Chunk cache 3

Copyright © 2010 The HDF Group. All Rights Reserved42 Partial I/O for chunked dataset 3 Conversion buffer Application memory Chunk cache File Chunk Apply filters and write to file Application buffer

Copyright © 2010 The HDF Group. All Rights Reserved43 Variable length data and I/O

Copyright © 2010 The HDF Group. All Rights Reserved44 Examples of variable length data String A[0] “the first string we want to write” ………………………………… A[N-1] “the N-th string we want to write” Each element is a record of variable-length A[0] (1,1,0,0,0,5,6,7,8,9) [length = 10] A[1] (0,0,110,2005) [length = 4] ……………………….. A[N] (1,2,3,4,5,6,7,8,9,10,11,12,….,M) [length = M]

Copyright © 2010 The HDF Group. All Rights Reserved45 Variable length data in HDF5 Variable length description in HDF5 application typedef struct { size_t length; void *p; }hvl_t; Base type can be any HDF5 type H5Tvlen_create(base_type) ~ 20 bytes overhead for each element Data cannot be compressed

Copyright © 2010 The HDF Group. All Rights Reserved46 Variable length data storage in HDF5 Globa l heap Actual variable length data Dataset with variable length elements Pointer into global heap File Dataset header

Copyright © 2010 The HDF Group. All Rights Reserved47 Variable length datasets and I/O When writing variable length data, elements in application buffer always go through conversion and are copied to the global heaps in a metadata cache before ending in a file Global heap Application buffer Metadata cache conversion buffer

Copyright © 2010 The HDF Group. All Rights Reserved48 There may be more than one global heap Global heap Raw data Global heap Metadata cache Application memory Raw VL data Application buffer Conversion buffer Raw VL data On a write request, VL data goes through conversion and is written to a global heap; elements of the same dataset may be written to different heaps.

Copyright © 2010 The HDF Group. All Rights Reserved49 Variable length datasets and I/O File Global heap Raw data Global heap Metadata cache Application memory Raw VL data Application buffer Conversion buffer Raw VL data

Copyright © 2010 The HDF Group. All Rights Reserved50 VL chunked dataset in a file File Dataset header Chunk B-tree Dataset chunksHeaps with VL data

Copyright © 2010 The HDF Group. All Rights Reserved51 Writing chunked VL datasets Dataset header ………… Application memory Metadata cache B-tree nodes Chunk cache ……… Conversion buffer VL data Raw dat a Global heap Chunk cache Data in application buffers File Filter pipeline hvl_t pointers 1 4

Copyright © 2010 The HDF Group. All Rights Reserved52 Hints for variable length data I/O Avoid closing/opening a file while writing VL datasets Global heap information is lost Global heaps may have unused space Avoid alternately writing different VL datasets Data from different datasets will go into to the same heap If maximum length of the record is known, consider using fixed-length records and compression

Copyright © 2010 The HDF Group. All Rights Reserved53 Questions?