SUPPORTING SQL QUERIES FOR SUBSETTING LARGE-SCALE DATASETS IN PARAVIEW
SC’11 UltraVis Workshop, November 13, 2011
Yu Su, Gagan Agrawal, Jon Woodring

Presentation transcript:

SUPPORTING SQL QUERIES FOR SUBSETTING LARGE-SCALE DATASETS IN PARAVIEW
SC’11 UltraVis Workshop, November 13, 2011
Yu Su*, Gagan Agrawal*, Jon Woodring†
*The Ohio State University
†Los Alamos National Laboratory

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

Subsetting Large-Scale Data
• Data subsetting is needed for efficient large-scale scientific visualization and analysis
• Post-processing is still needed for some visualization and analysis scenarios (global time analysis, exploratory visualization, etc.)
• I/O and network bandwidth are slow
• Memory per core is decreasing at exascale
• New data generated during the analysis process increases the memory footprint

Data Subsetting in ParaView
• Load an entire data set (or a spatial subset)
• Apply a series of filters and/or the data selection interface
    • Multiple filters keep multiple data copies in memory
    • New grids may be created, increasing memory and time
• Rectilinear Grid readers: Slow (can be fast for spatial subsets)
    • Extract Subset/Slice/Clip Filter: Fast
    • Threshold Filter: Very Slow (new unstructured grid)
• Unstructured Grid readers: Slow
    • Extract Subset/Slice/Clip Filter: Medium
    • Threshold Filter: Slow (new unstructured grid)

A Faster Solution
• Subset at the I/O level
    • Reduced I/O times and memory footprint
    • User specifies the subset in one query for both space and value ranges
• SQL queries in ParaView reader modules
    • Standard: A flexible language for specifying subsets
    • Efficiency: Smaller data load from disk to memory
    • Functionality: One query is equal to multiple filters

Broader Context
• Automatic Data Virtualization Research at Ohio State
• Virtual Relational/XML view on low-level scientific data
• Support for flat files and HDF5 in the past
• A light-weight database management solution: no need to load the data into a specific database


ParaView Reader and SQL Queries
[Diagram: user input query → parse query → retrieve data → the rest of the VTK pipeline]
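The "parse query" stage of this pipeline can be illustrated with a minimal Python sketch. It is not the authors' parser: it assumes a tiny SQL dialect (one selected variable, `AND`-joined numeric comparisons), and the function name `parse_subset_query` and the coordinate list `COORDINATE_NAMES` are hypothetical. The point is only to show how one query can be split into spatial predicates (for the file-format API) and value predicates (for the index).

```python
import re

# Assumption: these names are the coordinate (spatial) variables of the dataset.
COORDINATE_NAMES = {"t_lat", "t_lon", "depth_t"}

QUERY_RE = re.compile(
    r"^\s*SELECT\s+(\w+)\s+FROM\s+(\w+)\s+WHERE\s+(.+?)\s*;?\s*$",
    re.IGNORECASE,
)
PREDICATE_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_subset_query(sql):
    """Split a simple subsetting query into spatial and value predicates."""
    match = QUERY_RE.match(sql)
    if match is None:
        raise ValueError("unsupported query: " + sql)
    variable, dataset, where = match.groups()

    spatial, value = [], []
    for clause in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE):
        pred = PREDICATE_RE.match(clause)
        if pred is None:
            raise ValueError("unsupported predicate: " + clause)
        name, op, number = pred.group(1), pred.group(2), float(pred.group(3))
        # Coordinate predicates go to the spatial side, everything else to values.
        (spatial if name in COORDINATE_NAMES else value).append((name, op, number))

    return {"variable": variable, "dataset": dataset,
            "spatial": spatial, "value": value}


parsed = parse_subset_query(
    "SELECT temp FROM DATASET WHERE t_lat=0.5 AND t_lon = 0.5 AND temp<100;"
)
print(parsed)
```

A reader module built along these lines would hand `parsed["spatial"]` to the NetCDF/HDF5 hyperslab calls and `parsed["value"]` to the bitmap index before passing the result to the rest of the VTK pipeline.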

Retrieval/Indexing Functionality
• Spatial queries are relatively easy
    • NetCDF and HDF5 support spatial queries in their APIs
    • Unstructured data is harder (but feasible)
• Value queries are harder: Bitmap Indexing!
    • One truth bit for each data element and value range (1 if the datum is in the value range, 0 if not): an N x B bit matrix
    • The value bin cardinality B (range quantization) is a tradeoff between indexing time and space, and there are bitmap compression techniques (WAH, multi-hashing, etc.)
• Fast bitwise operations on the bitmap index determine the data point selection sets from queries
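A binned bitmap index of this kind can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation behind the slides: each of the B bins is stored as one Python integer used as an N-bit vector, the bin edges and sample values are made up, and the query is assumed to align with bin edges so that no candidate check against the raw data is needed.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Return one bit vector per bin; bin b covers [bin_edges[b], bin_edges[b+1])."""
    n_bins = len(bin_edges) - 1
    bitmaps = [0] * n_bins
    for i, v in enumerate(values):
        b = bisect_right(bin_edges, v) - 1
        if 0 <= b < n_bins:
            bitmaps[b] |= 1 << i  # set the truth bit for element i in bin b
    return bitmaps


def query_range(bitmaps, bin_edges, lo, hi):
    """OR together the bins lying inside [lo, hi); lo and hi must be bin edges."""
    result = 0
    for b, bitmap in enumerate(bitmaps):
        if lo <= bin_edges[b] and bin_edges[b + 1] <= hi:
            result |= bitmap
    return result


def selected_ids(bitmap):
    """List the element IDs whose bit is set, lowest first."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical data: N = 6 elements, B = 4 bins of width 10.
values = [3.0, 12.5, 27.0, 8.0, 19.9, 30.0]
edges = [0, 10, 20, 30, 40]

index = build_bitmap_index(values, edges)
hits = query_range(index, edges, 10, 30)  # 10 <= value < 30
print(selected_ids(hits))
```

Conjunctions of predicates become a bitwise AND of such result vectors, which is why the selection step is cheap compared with reading and thresholding the full array.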

Experimental Test Setup
• POP (Parallel Ocean Program) NetCDF data files
    • 3600x2400x42 structured grid
    • 1.4 GB per variable, 4 variables (5.6 GB)
• SQL + NetCDF API + Bitmap indexing vs. reader modules + multiple VTK filters (single threaded)
    • Type 1: Spatial queries (skipped: they are as fast as the NetCDF API can service a spatial query)
    • Type 2: Value queries (100 random queries)
    • Type 3: Space + Value queries (100 random queries)

Experiment: Value Only Queries
• Spatial query ignored in this test
    • Two-level bitmap indexing
    • One-level bitmap indexing
    • Read whole data + multiple VTK threshold filters
• 2-level indexing
    • A coarse-grained index over values is used for a first pass (fewer bins), followed by a finer-grained index per bin in a second pass
    • Throws out a large number of candidates on the first pass, so it is more efficient than 1-level indexing in many cases
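The two-pass idea can be sketched as follows. The slides do not give the index layout, so this Python sketch makes several assumptions: every coarse bin is split into the same number of equal-width fine bins, bit vectors are plain Python integers, and the query range aligns with fine-bin edges. All names are hypothetical.

```python
from bisect import bisect_right


def bin_bitmaps(values, edges):
    """One bit vector per bin; values outside the edges are ignored."""
    bitmaps = [0] * (len(edges) - 1)
    for i, v in enumerate(values):
        b = bisect_right(edges, v) - 1
        if 0 <= b < len(bitmaps):
            bitmaps[b] |= 1 << i
    return bitmaps


def build_two_level(values, coarse_edges, fine_per_coarse):
    """Coarse bitmaps plus, for each coarse bin, its own fine edges and bitmaps."""
    coarse = bin_bitmaps(values, coarse_edges)
    fine = []
    for b in range(len(coarse)):
        lo, hi = coarse_edges[b], coarse_edges[b + 1]
        step = (hi - lo) / fine_per_coarse
        edges = [lo + k * step for k in range(fine_per_coarse)] + [hi]
        fine.append((edges, bin_bitmaps(values, edges)))
    return coarse, fine


def query_two_level(coarse_edges, coarse, fine, lo, hi):
    """Answer lo <= value < hi, touching fine bins only where a coarse bin straddles."""
    result = 0
    for b, bitmap in enumerate(coarse):
        c_lo, c_hi = coarse_edges[b], coarse_edges[b + 1]
        if c_hi <= lo or c_lo >= hi:
            continue                      # first pass: whole coarse bin rejected
        if lo <= c_lo and c_hi <= hi:
            result |= bitmap              # first pass: whole coarse bin accepted
            continue
        edges, bitmaps = fine[b]          # second pass: only for partial overlap
        for f, fine_bitmap in enumerate(bitmaps):
            if lo <= edges[f] and edges[f + 1] <= hi:
                result |= fine_bitmap
    return result


# Hypothetical data: 2 coarse bins of width 20, each split into 4 fine bins.
values = [3, 12, 27, 8, 19, 33, 21, 16]
coarse_edges = [0, 20, 40]
coarse, fine = build_two_level(values, coarse_edges, fine_per_coarse=4)
hits = query_two_level(coarse_edges, coarse, fine, 15, 40)
print([i for i in range(len(values)) if (hits >> i) & 1])
```

In this layout most coarse bins are either accepted or rejected outright, so only the bins at the two ends of the query range need their fine bitmaps read.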

[Figure: value-only query results. Legend: m1 = 2-level indexing and read only query data; m2 = 1-level indexing and read only query data; m3 = read all data and use multiple thresholds. Annotations: query data size compared to whole data size; grid creation]

Memory Usage on Value Queries
[Figure annotation: memory occupied by reading the entire data set]

Experiment: Space + Value Queries
• Subsetting on IDs (space) combined with values
    • 2-level indexing + NetCDF API
    • 1-level indexing + NetCDF API
    • VTK NetCDF reader + subset filter + threshold filter (the reader is smart enough to read only the requested spatial subsets)
• The dominant factor for indexing is still the values
    • SELECT temp FROM DATASET WHERE t_lat=0.5 AND t_lon = 0.5 AND temp<100;
    • SELECT temp FROM DATASET WHERE temp=0 AND t_lat>0;
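Combining a spatial subset with a value predicate amounts to intersecting two selection sets over element IDs. The Python sketch below illustrates that intersection on a toy grid; it does not call NetCDF, the value bitmap is computed by a scan as a stand-in for an index lookup, and the grid, function names, and ranges are all hypothetical.

```python
def spatial_bitmap(n_cols, row_range, col_range):
    """Bit i is set when row-major cell i lies inside the requested index box."""
    bitmap = 0
    for r in range(*row_range):
        for c in range(*col_range):
            bitmap |= 1 << (r * n_cols + c)
    return bitmap


def value_bitmap(values, upper):
    """Stand-in for an index lookup: bit i is set when values[i] < upper."""
    bitmap = 0
    for i, v in enumerate(values):
        if v < upper:
            bitmap |= 1 << i
    return bitmap


def ids_of(bitmap):
    """List the positions of the set bits, lowest first."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical 3 x 4 temperature grid, stored row-major.
N_COLS = 4
temp = [120, 80, 95, 130,
        60, 150, 70, 99,
        101, 40, 200, 10]

space = spatial_bitmap(N_COLS, row_range=(1, 3), col_range=(0, 2))  # rows 1-2, cols 0-1
value = value_bitmap(temp, upper=100)                               # temp < 100
hits = ids_of(space & value)                                        # both conditions
print(hits, [temp[i] for i in hits])
```

Only the elements in `hits` would then need to be read from disk, which is where the reduction in I/O and memory comes from.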

[Figure: % of values vs. % of IDs (space) in query result]

Remove Unstructured Grid Generation
• The first experiment showed that ParaView/VTK spends a lot of time generating new grids
    • The same could be said of many filters: VTK tends to generate new grids too often, wasting time and space
• New filtering idea (seems obvious but it was “Aha!”)
    • Instead of generating new grids after filtering, mark “unqualified” data values as NaN
    • Data stay on the original grid
    • No extra grid generation (saves time and space)
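The NaN-marking idea can be shown with a small Python sketch. This is not the VTK filter from the talk; `threshold_to_nan` is a hypothetical helper that operates on a flat list of values and assumes an inclusive value range. What it illustrates is that the output has the same length and ordering as the input, so the original grid topology can be reused unchanged.

```python
import math


def threshold_to_nan(values, lo, hi):
    """Keep values inside [lo, hi]; replace all others with NaN, preserving positions."""
    nan = float("nan")
    return [v if lo <= v <= hi else nan for v in values]


# Hypothetical point data on an existing grid.
temp = [3.5, 12.0, 27.25, 8.0]
masked = threshold_to_nan(temp, 5.0, 20.0)

# The array still has one entry per grid point; downstream code skips NaN entries.
kept = [i for i, v in enumerate(masked) if not math.isnan(v)]
print(len(masked), kept)
```

Because no points are dropped, no new unstructured grid has to be built; a renderer or later filter only needs to treat NaN as "not selected".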

[Figure: Indexing vs. new Threshold filter (set unqualified values to NaN)]

Conclusion
• Use SQL queries to support flexible data subsetting
    • Translate the query into operations
    • Use the NetCDF/HDF5 API to support spatial subsets
    • Use bitmap indexing to support value subsets
• Reduced memory and time for queries
    • Multi-level indexing can drastically improve time
    • Skipping extra grid generation steps can improve time and memory usage as well


Bitmap Compression (Bitmaps can be large)
• WAH compression
    • Compresses the bit vectors based on runs of consecutive 0s or 1s
    • Can’t subset by the IDs (spatial dimensions) before the value indexing operation (assuming we don’t add x, y, z to the bitmap index)
• Multi-Hash compression
    • Uses multiple hash functions to set 1s at hash(id, value) for each 1 in a bit vector
    • Supports subsetting over both IDs (space) and values
    • Hash clashes of (id, value) pairs to the same array position cause false positives
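The run-based idea behind WAH (Word-Aligned Hybrid) compression can be sketched in Python. This is a simplified illustration rather than a bit-exact WAH encoder: it uses the 31-bit grouping of 32-bit WAH words, but represents each word as a tuple instead of packing it, and the function names are hypothetical.

```python
GROUP_BITS = 31  # 32-bit WAH words carry 31 payload bits each


def wah_encode(bits):
    """Encode a list of 0/1 into ("fill", bit, n_groups) and ("literal", group) words."""
    words = []
    for start in range(0, len(bits), GROUP_BITS):
        group = bits[start:start + GROUP_BITS]
        group = group + [0] * (GROUP_BITS - len(group))  # zero-pad the last group
        if all(b == group[0] for b in group):
            # Uniform group: extend the previous fill word if it has the same bit.
            if words and words[-1][0] == "fill" and words[-1][1] == group[0]:
                words[-1] = ("fill", group[0], words[-1][2] + 1)
            else:
                words.append(("fill", group[0], 1))
        else:
            words.append(("literal", tuple(group)))
    return words


def wah_decode(words, n_bits):
    """Expand the words back into the original bit list of length n_bits."""
    bits = []
    for word in words:
        if word[0] == "fill":
            bits.extend([word[1]] * (GROUP_BITS * word[2]))
        else:
            bits.extend(word[1])
    return bits[:n_bits]


# Hypothetical sparse bit vector: a long run of 0s, a mixed stretch, a run of 1s.
bits = [0] * 93 + [1, 0, 1] + [1] * 62
encoded = wah_encode(bits)
print(len(bits), "bits ->", len(encoded), "words")
assert wah_decode(encoded, len(bits)) == bits
```

Long runs collapse into single fill words, which is why sparse bin bitmaps compress well. Note that the encoding is positional over the whole vector, which is consistent with the slide's point that WAH does not directly support subsetting by ID before the value lookup.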