
1 Indexing HDF5: A Survey (DM_PPT_NP_v01 SESIP_0715_JP)
Joel Plutchak
The HDF Group
Champaign, Illinois, USA
This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C.

2 The Technology
The HDF5 hierarchical data file format and API are flexible: they support self-describing, portable, and compact storage, as well as efficient I/O.
HDF5 is a well-described and well-supported format that is used in a wide variety of disciplines.
July 14, 2015

3 The Problem
The HDF5 API does not include mechanisms to efficiently find and access data based on data values, in the way one would query a relational database.
Members of the HDF community have developed this capability so that their applications can quickly access targeted pieces of data, that is, rapidly search and select interesting portions of data based on ad hoc search criteria.

4 A Solution
Solutions to this problem are called indexing. This is done by adding a layer between the HDF5 API and the application. The layer builds an index on one or more parameters and saves enough information in it to find and retrieve specific parts of one or more datasets in an HDF5 file more efficiently.
Diagram labels: HDF5 File, Application, HDF5 API, Index, Query
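As an illustration of the idea on this slide, here is a minimal Python sketch of a value index. It keeps a sorted copy of the values together with their original positions, so a range query can use binary search instead of scanning every element. The class name SortedValueIndex and its query method are hypothetical; they are not part of HDF5 or of any package surveyed here, and the data is made up.

```python
from bisect import bisect_left, bisect_right


class SortedValueIndex:
    """Hypothetical index: sorted values plus the position each one came from."""

    def __init__(self, values):
        pairs = sorted((value, position) for position, value in enumerate(values))
        self._keys = [value for value, _ in pairs]
        self._positions = [position for _, position in pairs]

    def query(self, low, high):
        """Return positions whose value satisfies low < value < high."""
        start = bisect_right(self._keys, low)   # first key strictly above low
        stop = bisect_left(self._keys, high)    # first key at or above high
        return sorted(self._positions[start:stop])


# Made-up dataset values; in practice these would be read from an HDF5 dataset.
values = [40.2, 12.0, 41.7, 39.1, 55.3, 42.6]
index = SortedValueIndex(values)
matches = index.query(39.1, 42.6)
print(matches)
```

Building the index costs one sort; each query afterwards needs only two binary searches plus the cost of returning the matches.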

5 Implementations
Several implementations add indexed access to HDF5 files. A few of them are:
- PyTables
- FastQuery / FastBit
- Alacrity
- HDF5 (prototype)
- Other experimental work in progress

6 PyTables
- Uses the Python programming language
- Built on top of the HDF5 library and the NumPy package
- Uses Optimized Partially Sorted Index (OPSI) technology, designed for fast access to very large (>100M rows) tables

7 PyTables Example
- Create a table:
    table = h5file.create_table(group, 'readout', Particle, "Readout example")
- Query a table:
    condition = '(name == "Particle: 5") | (name == "Particle: 7")'
    for record in table.where(condition):
        pass  # do something with "record"

8 PyTables
Limitations:
- No support for relationships between datasets
Future work:
- No specifics; a continuing effort that welcomes additional developers, testers, and users
- Proposals for future maintenance and extended development are underway
- The HDF Group is very interested in taking a significant role in this work as it moves forward

9 Alacrity
- The name stands for Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying
- Exploits the representation of floating-point values by binning on significant bits, using an inverted index to map each bin to the positions of its values
- The software is a research vehicle for a group at North Carolina State University
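The binning idea can be sketched in a few lines of Python. This is not the ALACRITY implementation: it is a simplified illustration that bins IEEE 754 doubles on their top 16 bits (an arbitrary choice made here) and keeps an inverted index from each bin to the positions of its values. The function names are hypothetical, and the range query assumes non-negative finite values, for which the bit pattern grows with the value.

```python
import struct
from collections import defaultdict

SIGNIFICANT_BITS = 16  # assumed: sign bit + 11 exponent bits + top 4 mantissa bits


def bin_key(value):
    """Return the top SIGNIFICANT_BITS bits of the IEEE 754 double encoding."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    return bits >> (64 - SIGNIFICANT_BITS)


def build_inverted_index(values):
    """Map each bin key to the list of positions whose value falls in that bin."""
    index = defaultdict(list)
    for position, value in enumerate(values):
        index[bin_key(value)].append(position)
    return index


def range_query(values, index, low, high):
    """Return positions with low <= value < high (non-negative values assumed)."""
    low_key, high_key = bin_key(low), bin_key(high)
    hits = []
    for key, positions in index.items():
        if low_key < key < high_key:
            hits.extend(positions)  # whole bin lies inside the range
        elif key in (low_key, high_key):
            # Boundary bin: only some values qualify, so check each one.
            hits.extend(p for p in positions if low <= values[p] < high)
    return sorted(hits)


# Made-up sample data.
values = [0.5, 39.5, 40.0, 41.9, 42.6, 100.0, 39.1]
index = build_inverted_index(values)
hits = range_query(values, index, 39.1, 42.6)
print(hits)
```

Only the two boundary bins need a per-value check; every bin strictly between them is accepted or skipped as a whole.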

10 FastQuery / FastBit
- FastQuery is an extension to HDF5 from the Visualization Group at Lawrence Berkeley National Laboratory (LBNL)
- Based on LBNL's FastBit, an efficient searching technology that uses bitmap indexing to process complex, multi-dimensional ad hoc queries on read-only numeric data
- Extends HDF5's hyperslab selection mechanism to allow arbitrary range conditions on the data values contained in the datasets
- Compound queries can span multiple datasets
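To make bitmap indexing concrete, the following Python sketch builds one bitmap per value bin, using a Python integer as the bit set, and answers a compound query over two datasets by ANDing the per-dataset results. It is a simplified illustration, not FastBit or FastQuery code: the function names, bin edges, and sample data are all assumptions, and real bitmap indexes also compress their bitmaps.

```python
from bisect import bisect_right


def build_bitmap_index(values, edges):
    """One bitmap per bin; bit i is set when edges[b] <= values[i] < edges[b + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for i, value in enumerate(values):
        if not edges[0] <= value < edges[-1]:
            raise ValueError("value outside the binned range")
        bitmaps[bisect_right(edges, value) - 1] |= 1 << i
    return bitmaps


def range_bitmap(values, edges, bitmaps, low, high):
    """Return a bitmap of the rows with low <= value < high."""
    result = 0
    for b, bitmap in enumerate(bitmaps):
        bin_low, bin_high = edges[b], edges[b + 1]
        if low <= bin_low and bin_high <= high:
            result |= bitmap  # bin lies entirely inside the range
        elif bin_low < high and low < bin_high:
            # Boundary bin: check each candidate row against the raw value.
            for i in range(bitmap.bit_length()):
                if bitmap >> i & 1 and low <= values[i] < high:
                    result |= 1 << i
    return result


def rows(bitmap):
    """List the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]


# Two made-up datasets with the same number of rows.
temperature = [35.0, 39.5, 41.0, 44.2, 40.3]
pressure = [1.0, 2.5, 0.4, 2.9, 3.1]
t_edges, p_edges = [30, 35, 40, 45, 50], [0, 1, 2, 3, 4]
t_index = build_bitmap_index(temperature, t_edges)
p_index = build_bitmap_index(pressure, p_edges)

# Compound query: 39.1 <= temperature < 42.6 AND 2.0 <= pressure < 4.0
t_hits = range_bitmap(temperature, t_edges, t_index, 39.1, 42.6)
p_hits = range_bitmap(pressure, p_edges, p_index, 2.0, 4.0)
selected = rows(t_hits & p_hits)
print(selected)
```

Because each condition yields a bitmap over the same row numbers, combining conditions across datasets reduces to a bitwise AND or OR.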

11 FastQuery / FastBit
Assumptions:
- Data is:
  - 0-3 dimensional block-structured
  - Limited datatypes: float, double, int32, int64, byte
- Two-level hierarchical organization: TimeStep, VariableName
Future work:
- Arbitrary nesting
- More data schemas (unstructured, AMR, etc.)

12 HDF5 Data Analysis Extensions
The HDF Group is developing support for indexing and querying so that application developers can create complex, high-performance queries on both metadata and data elements within an HDF5 container. This support takes the form of objects and associated APIs:
- Query objects: the H5Q API is used to define a query and apply it to an HDF5 container
- View objects: the H5V API is used to generate a selection from a query
- Index objects: the H5X API is used to attach or build an index on data; it is plug-in based to leverage multiple technologies
Note: These extensions were developed under Intel's subcontract with Lawrence Livermore National Security, LLC under U.S. Department of Energy contract DE-AC52-07NA27344.

13 HDF5 Data Analysis Extensions Example
Add index to existing dataset:
    dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
    /* Add indexing information */
    H5Xcreate(dataset, H5X_PLUGIN_FASTBIT, H5P_DEFAULT);
    H5Dclose(dataset);
Create and apply query:
    float query_lb = 39.1f, query_ub = 42.6f;
    hid_t query, query1, query2;
    /* Create a simple query: 39.1 < x */
    query1 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_GREATER_THAN,
                       H5T_NATIVE_FLOAT, &query_lb);
    /* Create a second simple query: x < 42.6 */
    query2 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_LESS_THAN,
                       H5T_NATIVE_FLOAT, &query_ub);
    /* Combine query: 39.1 < x < 42.6 */
    query = H5Qcombine(query1, H5Q_COMBINE_AND, query2);
    /* Use query to get selection */
    dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
    H5Dquery(dataset, query, &dataspace);
    /* Read data here using dataspace */
    H5Dclose(dataset);
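The create, combine, and apply pattern on this slide can be mimicked in plain Python to show what the combined query selects. This is only an analogy: create_query, combine_and, and apply_query are hypothetical helpers, not bindings to the prototype H5Q API, and the selection here comes from a linear scan rather than an index. The data values are made up.

```python
import operator


def create_query(match_op, bound):
    """Return a predicate that compares one data element against a bound."""
    return lambda element: match_op(element, bound)


def combine_and(first, second):
    """Return a predicate that holds only when both sub-queries hold."""
    return lambda element: first(element) and second(element)


def apply_query(data, query):
    """Return the positions of the elements that satisfy the query."""
    return [i for i, element in enumerate(data) if query(element)]


query_lb, query_ub = 39.1, 42.6
query1 = create_query(operator.gt, query_lb)  # 39.1 < x
query2 = create_query(operator.lt, query_ub)  # x < 42.6
query = combine_and(query1, query2)           # 39.1 < x < 42.6

data = [38.0, 39.1, 40.5, 42.6, 41.0]
selection = apply_query(data, query)
print(selection)
```

Both bounds are strict, so elements equal to 39.1 or 42.6 are left out of the selection.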

14 HDF5 Data Analysis Extensions Status
Phase I status (2014):
- Prototype implementations of the H5Q, H5V, and H5X APIs
- H5X API plugins for Alacrity and FastBit technologies
- Incremental update of data is not supported by indexing packages
Current work (started July 1):
- Views generated from queries to abstract selection results on multiple objects
- Support for indexing on chunked datasets
- Support for compound types
- Support for parallel indexing
- Query optimization
- Additional indexing plugins

15 Summary
- A variety of indexing methods exist that can speed targeted access to data in HDF5 files.
- Capabilities and underlying technologies differ, so use the best fit for your application.
- Work is ongoing; let developers know of your needs and experiences!

16 References & Sources
PyTables
- http://www.pytables.org/index.html
Alacrity
- J. Jenkins, I. Arkatkar, S. Lakshminarasimhan, D. A. Boyuka II, E. Schendel, N. Shah, S. Ethier, C.-S. Chang, J. Chen, H. Kolla, R. Ross, S. Klasky, N. Samatova, "ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying," Transactions on Large-Scale Data- and Knowledge-Centered Systems, Vol. 10 (2013).
FastQuery / FastBit
- http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/
- K. Wu, "FastBit: an efficient indexing technology for accelerating data-intensive science," Journal of Physics: Conference Series, vol. 16, no. 1 (2005).
- "HDF5-FastQuery: An API for Simplifying Access to Data Storage, Retrieval, Indexing and Querying," Report Number LBNL/PUB-958 (2006).
HDF5 Data Analysis Extensions
- J. Soumagne, Q. Koziol, "RFC: Data Analysis Extensions," RFC THG 2014-07-17.v4, The HDF Group (2014).


