HDF5 FastQuery Accelerating Complex Queries on HDF Datasets using Fast Bitmap Indices John Shalf, Wes Bethel LBNL Visualization Group Kensheng Wu, Kurt.

Slides:



Advertisements
Similar presentations
Introduction to Database Systems1 Records and Files Storage Technology: Topic 3.
Advertisements

1 Projection Indexes in HDF5 Rishi Rakesh Sinha The HDF Group.
Recent Work in Progress
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
File Systems.
Multidimensional Data. Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data.
Multidimensional Data Rtrees Bitmap indexes. R-Trees For “regions” (typically rectangles) but can represent points. Supports NN, “where­am­I” queries.
Presented by Russell Myers Paper by Ming-Chuan Wu and Alejandro P. Buchmann.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
BTrees & Bitmap Indexes
Multiple-key indexes Index on one attribute provides pointer to an index on the other. If V is a value of the first attribute, then the index we reach.
B+-tree and Hashing.
NetCDF An Effective Way to Store and Retrieve Scientific Datasets Jianwei Li 02/11/2002.
Quick Review of Apr 17 material Multiple-Key Access –There are good and bad ways to run queries on multiple single keys Indices on Multiple Attributes.
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.
CS561-S2004 strategies for processing ad hoc queries 1 Strategies for Processing Ad Hoc Queries on Large Data Warehouses Presented by Fan Wu Instructor:
COMP 451/651 Multiple-key indexes
File Organizations and Indexing Lecture 4 R&G Chapter 8 "If you don't find it in the index, look very carefully through the entire catalogue." -- Sears,
1.1 CAS CS 460/660 Introduction to Database Systems File Organization Slides from UC Berkeley.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.
Status of netCDF-3, netCDF-4, and CF Conventions Russ Rew Community Standards for Unstructured Grids Workshop, Boulder
Searching Technology For a Large Number Of Objects Kurt Stockinger and John Wu Lawrence Berkeley National Laboratory.
DM_PPT_NP_v01 SESIP_0715_JP Indexing HDF5: A Survey Joel Plutchak The HDF Group Champaign Illinois USA This work was supported by NASA/GSFC under Raytheon.
July, 2001 High-dimensional indexing techniques Kesheng John Wu Ekow Otoo Arie Shoshani.
HDF5 A new file format & software for high performance scientific data management.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Approximate Encoding for Direct Access and Query Processing over Compressed Bitmaps Tan Apaydin – The Ohio State University Guadalupe Canahuate – The Ohio.
Int. Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT2005), Zeuthen, Germany, May 2005 Bitmap Indices for Fast End-User.
The HDF Group HDF5 Datasets and I/O Dataset storage and its effect on performance May 30-31, 2012HDF5 Workshop at PSI 1.
Bitmap Indices for Data Warehouse Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY.
The netCDF-4 data model and format Russ Rew, UCAR Unidata NetCDF Workshop 25 October 2012.
Light-Weight Data Management Solutions for Scientific Datasets Gagan Agrawal, Yu Su Ohio State Jonathan Woodring, LANL.
ICPP 2012 Indexing and Parallel Query Processing Support for Visualizing Climate Datasets Yu Su*, Gagan Agrawal*, Jonathan Woodring † *The Ohio State University.
Multidimensional Indexes Applications: geographical databases, data cubes. Types of queries: –partial match (give only a subset of the dimensions) –range.
The STAR Grid Collector and TBitmapIndex John Wu Kurt Stockinger, Rene Brun, Philippe Canal – TBitmapIndex Junmin Gu, Jerome Lauret, Arthur M. Poskanzer,
 Three-Schema Architecture Three-Schema Architecture  Internal Level Internal Level  Conceptual Level Conceptual Level  External Level External Level.
Features-based Object Recognition P. Moreels, P. Perona California Institute of Technology.
Using Bitmap Index to Speed up Analyses of High-Energy Physics Data John Wu, Arie Shoshani, Alex Sim, Junmin Gu, Art Poskanzer Lawrence Berkeley National.
File Storage Organization The majority of space on a device is reserved for the storage of files. When files are created and modified physical blocks are.
September, 2002 Efficient Bitmap Indexes for Very Large Datasets John Wu Ekow Otoo Arie Shoshani Lawrence Berkeley National Laboratory.
HPDC 2013 Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices Yu Su*, Gagan Agrawal*, Jonathan Woodring # Kary Myers #, Joanne Wendelberger.
1 HDF5 Life cycle of data Boeing September 19, 2006.
NetCDF Data Model Issues Russ Rew, UCAR Unidata NetCDF 2010 Workshop
Sec 14.7 Bitmap Indexes Shabana Kazi. Introduction A bitmap index is a special kind of index that stores the bulk of its data as bit arrays (commonly.
SUPPORTING SQL QUERIES FOR SUBSETTING LARGE- SCALE DATASETS IN PARAVIEW SC’11 UltraVis Workshop, November 13, 2011 Yu Su*, Gagan Agrawal*, Jon Woodring†
The HDF Group Introduction to netCDF-4 Elena Pourmal The HDF Group 110/17/2015.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Lecture 10 Page 1 CS 111 Summer 2013 File Systems Control Structures A file is a named collection of information Primary roles of file system: – To store.
Introduction to Active Directory
March, 2002 Efficient Bitmap Indexing Techniques for Very Large Datasets Kesheng John Wu Ekow Otoo Arie Shoshani.
Indexing OLAP Data Sunita Sarawagi Monowar Hossain York University.
Thomas Heinis* Eleni Tzirita Zacharatou ‡ Farhan Tauheed § Anastasia Ailamaki ‡ RUBIK: Efficient Threshold Queries on Massive Time Series § Oracle Labs,
File Systems May 12, 2000 Instructor: Gary Kimura.
NetCDF Data Model Details Russ Rew, UCAR Unidata NetCDF 2009 Workshop
Multidimensional Access Structures COMP3017 Advanced Databases Dr Nicholas Gibbins –
Copyright © 2010 The HDF Group. All Rights Reserved1 Data Storage and I/O in HDF5.
DS.H.1 Hashing Chapter 5 Overview The General Idea Hash Functions Separate Chaining Open Addressing Rehashing Extendible Hashing Application Example: Geometric.
File Systems and Disk Management
Moving from HDF4 to HDF5/netCDF-4
Multidimensional Access Structures
File System Implementation
Database Management Systems (CS 564)
COMP 430 Intro. to Database Systems
Hash-Based Indexes Chapter 11
Introduction to Database Systems
File Systems and Disk Management
Presentation transcript:

HDF5 FastQuery Accelerating Complex Queries on HDF Datasets using Fast Bitmap Indices John Shalf, Wes Bethel LBNL Visualization Group Kensheng Wu, Kurt Stockinger LBNL SDM Center Luke Gosink, Ken Joy UC Davis IDAV

Motivation and Problem Statement Too much data. Visualization “meat grinders” not especially responsive to needs of scientific research community. What scientific users want: –Scientific Insight –Quantitative results –Feature detection, tracking, characterization –(lots of bullets here omitted) See: s-LBNL pdf workshop-04/Final-report.pdf

Motivation and Problem Statement Too much data. Visualization “meat grinders” not especially responsive to needs of scientific research community. What scientific users want: –Scientific Insight –Quantitative results –Feature detection, tracking, characterization –(lots of bullets here omitted) See: s-LBNL pdf workshop-04/Final-report.pdf

What is FastBit? (what is it’s role in data analysis?)

Using Indexing Technology to Accelerate Data Analysis Use cases for indexed datasets –Support Compound Range Queries: eg. Get me all cells where Temperature > 300k AND Pressure is < 200 millibars –Subsetting: Only load data that corresponds to the query. Get rid of visual “clutter” Reduce load on data analysis pipeline –Quickly find and label connected regions –Do it really fast! Applications –Astrophysics: Remove clutter from messy supernova explosions –Combustion: Locate and track ignition kernels –Particle Accelerator Modeling: identify and select errant electrons –Network Security Data: Pose queries against enormous packet logs Identify candidate security events

Architecture Overview: Generic Visualization Pipeline Data Vis / Analysis Display

Architecture Overview: Query-Driven Vis. Pipeline Vis / Analysis Display Index Data Query FastBit

Query-Driven Subsetting of Combustion Data Set b) Q: temp < 3 c) Q: CH4 > 0.3 AND temp < 3 d) Q: CH4 > 0.3 AND temp < 4 a) Query: CH4 > 0.3

DEX Visualization Pipeline Data Query Visualization Toolkit (VTK) 3D visualization of a Supernova explosion

Architecture Overview: Query-Driven Analysis Pipeline Vis / Analysis Display Index Data Query FastBit HDF4 NetCDF Binary

Architecture Overview: Query-Driven Analysis Pipeline Vis / Analysis Display HDF5 Data+Index Query FastBit

How do Fast Bitmap Indices Work?

Why Bitmap Indices? Goal: efficient search of multi-dimensional read-only (append- only) data: –E.g. temp 10 7 AND density < 45.6 Commonly-used indices are designed to be updated quickly B-Trees –E.g. family of B-Trees –Sacrifice search efficiency to permit dynamic update Most multi-dimensional indices suffer curse of dimensionality R-tree, Quad-trees, KD-trees –E.g. R-tree, Quad-trees, KD-trees, … –Don’t scale to large number of dimensions ( < 10) –Are efficient only if all dimensions are queried Bitmap indices –Sacrifice update efficiency to gain more search efficiency –Are efficient for multi-dimensional queries –Query response time scales linearly in the actual number of dimensions in the query

What is a Bitmap Index? Compact: one bit per distinct value per object. Easy and fast to build: O(n) vs. O(n log n) for trees. Efficient to query: use bitwise logical operations. (0.0 < H 2 O < 0.1) AND (1000 < temp < 2000) Efficient for multidimensional queries. –No “curse of dimensionality” What about floating-point data? –Binning strategies. Data values b0b0 b1b1 b2b2 b3b3 b4b4 b5b5

Bitmap Index Encoding Equality encoding compresses very well Range encoding optimized for one-sided range queries, e.g. temp < 3 List of Attributes Equality EncodingRange Encoding

Performance

Bitmap Index Query Complexity and Space Requirements How Fast are Queries Answered? –Let N denote the number of objects and H denote the number of hits of a condition. –Using uncompressed bitmap indices, search time is O(N) –With a good compression scheme, the search time is O(H) – the theoretical optimum How Big are the Indices? –In the worst case (completely random data), the bitmap index requires about 2x in data size for one variable (typically 0.3x). –In contrast, 4x space requirement not uncommon for tree-based methods for one variable. –Curse of dimensionality: for N points in D dimensions: Bitmap index size: O(D*N) Tree-based method: O(N**D)!!!

Compressed Bitmap Index Query Performance FastBit Word-Aligned Hybrid (WAH) compression performance better than commercial systems. Different bitmap compression technologies have different performance characteristics.

Queries Using Bitmap Indices are Fast Log-log plot of query processing time for different size queries The compressed bitmap index is at least 10X faster than B-tree and 3X faster than the projection index

Size of Bitmap Index vs. Base Data (Combustion) Compressed bitmap index with 100 range-encoded bins is about same size as base data. Note: B-tree index is about 3 times the size of the base data. Building the index takes ~5 seconds for 100Megs on P4 2.4GHz workstation

Size of Bitmap Index vs. Base Data (Astrophysics) Size of compressed bitmap index is only 57% of base data. Building an index for all attributes takes ~17 seconds for 340 Megs.

Region Growing and Connected Component Labeling The result of the bitmap index query is a set of blocks. Given a set of blocks, find connected regions and label them. Region growing scales linearly with the number of cells selected.

HDF5 - FastQuery File Organization

File Organization Current –Data in HDF4, NetCDF converted to raw binary –One file per species + one file per index –ASCII file for metadata –One directory per timestep –Non-portable binary (must byte-swap data) HDF5 FastQuery –Indices + data all in same file –Machine independent binary representation –Multiple time-steps per file –Pose queries against data stored in “indexed” HDF5 file

Some Simplifying Assumptions Block structured data –0-3 Dimensional topology (arbitrary geometry) –Limited Datatypes: float, double, int32, int64, byte –Vectors and Tensors identified via metadata Two Level hierarchical organization –TimeStep –VariableName –Queries can be posed implicitly across time dimension Future –Arbitrary nesting AMR “Level” CalibrationSet –More Data Schemas Unstructured AMR NetLogs

/H5_UC Descriptor for Variable 1 Descriptor for Variable X Descriptor for Variable 0 … Time Step 1Time Step YTime Step 0 … Variable Descriptors Variable Data D0 D1 DX D0 D1 DX D0 D1 DX ……… HDF5 Data Organization to Support FastQuery HDF5 ROOT Group Symbol Key HDF5 group: Contains user retrievable information about sub-groups and datasets HDF5 dataset: Contains the actual data array of a given variable X at a time step Y X variable descriptors for X datasets TOC

File Organization Bitmap Indices D2 (Base Data) Time Step 0 D0 D1

File Organization Bitmap Indices D2 (Base Data) Name=“Pressure” Dims={64,64,64} Type=Double Name=“Pressure.idx” Dims=0.3*datasize Type=Int32 Time Step 0 D0 D1

File Organization Bitmap Indices Bins Base Data Offsets

File Organization Bitmap Indices Bins Base Data Offsets Attribute Name=“offsets” Dims=nbins-1 Type=uInt64 Attribute Name=“bins” Dims=2*nbins or nbins Type=Double (same type as data)

File Organization Base Data idx binsoffsetsbitmaps Query: Pressure <

File Organization Base Data idx binsoffsetsbitmaps Query: Pressure <

File Organization Base Data idx binsoffsetsbitmaps Query: Pressure <

File Organization Base Data idx binsoffsetsbitmaps Query: Pressure <

Final Notes Need for Higher level data organization –Demonstrated simple convention for index storage –Require higher level data organization to support more complex queries demanded by our scientific applications –Adoption of higher-level schema is a sociological problem rather than a technical problem Top Down (the Grand Unified Data Model) –DMF: Describe everything in the known universe Bottom up (community building) –Research Group: Store data fro Cactus –Scientific Community: eg. HDF-EOS, NetCDF, FITS –?

Final Notes Need for Higher level data organization –Demonstrated simple convention for index storage –Require higher level data organization to support more complex queries demanded by our scientific applications –Adoption of higher-level schema is a sociological problem rather than a technical problem Top Down (the Grand Unified Data Model) –DMF: Describe everything in the known universe Bottom up (community building) –Research Group: Store data fro Cactus –Scientific Community: eg. HDF-EOS, NetCDF, FITS –World Domination

Questions?

Performance of Event Catalog The Event Catalog uses compressed bitmap indices –The most commonly used index is B-tree –The most efficient one is often the projection index The following table reports the size and the average query processing time –1-attribute, 2-attribute, and 5-attribute refer to the number of attributes in a query Compressed bitmap indices are about half the size of B-trees, and are 10 times faster Compressed bitmap indices are larger than projection indices, but are 3 times faster 2.2 Million Events 12 common attributes B-treeProjection index Bitmap index Size (MB) Query processing (seconds) 1-attribute attribute attribute