Ohio State University Department of Computer Science and Engineering Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets.

Ohio State University Department of Computer Science and Engineering Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets Swarup Kumar Sahoo Gagan Agrawal

Roadmap
Motivation
Introduction
System Overview
XQuery, Low and High Level schema and HDF5 storage
Compiler Analysis and Algorithm
Experiment
Summary and Future Work

Motivation
Emergence of grid-based data repositories
  – Can enable sharing of data
Emergence of applications that process large datasets
  – Complicated by complex and specialized storage formats
Need for easily portable applications
  – Compatibility with web/grid services

Data Virtualization
[Figure labels: An abstract view of data; dataset; Data Service; Data Virtualization]
By Global Grid Forum's DAIS working group:
  – A Data Virtualization describes an abstract view of data.
  – A Data Service implements the mechanism to access and process data through the Data Virtualization.

Introduction: Automatic Data Virtualization
Goal: Enable automatic creation of efficient data services
  – Support a high-level or abstract view of data
  – Data is stored in low-level format
Application development:
  – Assume a high-level or virtual view
Application execution:
  – On actual low-level layout

Overview of Our Automatic Data Virtualization Work
Previous work on XML-based virtualization
  – Techniques for XQuery compilation (Li and Agrawal, ICS 2003, DBPL 2003)
  – Supporting XML-based high-level abstraction on flat-file datasets (LCPC 2003, XIME-P 2004)
Relational table/SQL-based implementation
  – Supporting SQL Select and Where (HPDC 2004)
  – Supporting SQL-3 aggregations (LCPC 2004)

XML-based Virtualization
[Figure labels: XQuery; XML; ???; TEXT, …, NetCDF, RDBMS, HDF5]

Challenges and Contributions
Challenges
  – Compiler generates efficient data processing code
    » Uses the information about the low-level layout and mapping between virtual and low-level layout
  – Challenge in compilation
    » High level to low level
    » To ensure high locality in processing of large datasets
Contributions of this paper
  – An improved data-centric transformation algorithm
  – An implementation specific to HDF5 as the low-level format

System Overview
[Figure labels: High-level XML Schema; Mapping Schema; Low-level XML Schema; XQuery Source Code; Compiler; Generated Code; HDF5 Library; Processor and Disk]

XQuery and HDF5
High-level declarative languages ease application development
  – XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages!
HDF5:
  – Hierarchical Data Format
  – Widely used in scientific communities
  – A case study with a format which has optimized access libraries

Use of XML Schemas
High-level schema
  – XML is used to provide a virtual view of the dataset
Low-level schema
  – Reflects actual physical layout in HDF5
Mapping schema:
  – Describes mapping between each element of high-level schema and low-level schema

Oil Reservoir Simulation
Support cost-effective oil production
Simulations on a 3-D grid
  – 17 variables and cell locations in 3-D grid at each time step
Computation of bypassed regions
  – Expression to determine if a cell is bypassed for a time-step
  – Within a spatial region and range of time steps
  – Grid cells that are bypassed for every time-step in the range
Oil Reservoir management
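The bypassed-region computation above amounts to a logical AND over the time range for each cell. A minimal Python sketch follows; the slide does not give the actual expression, so the `is_bypassed` predicate, the `velocity` threshold, and the dictionary layout of `data` are all assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int, int]   # (x, y, z) grid location
Record = Dict[str, float]     # variable name -> value at one time step


def bypassed_cells(
    data: Dict[Tuple[Cell, int], Record],
    cells: List[Cell],
    tmin: int,
    tmax: int,
    is_bypassed: Callable[[Record], bool],
) -> List[Cell]:
    """Return the cells for which is_bypassed holds at every time step in [tmin, tmax]."""
    result = []
    for cell in cells:
        # A cell qualifies only if the per-time-step test succeeds for the whole range.
        if all(is_bypassed(data[(cell, t)]) for t in range(tmin, tmax + 1)):
            result.append(cell)
    return result


# Hypothetical data: two cells, two time steps, one variable.
data = {
    ((0, 0, 0), 0): {"velocity": 0.05},
    ((0, 0, 0), 1): {"velocity": 0.02},
    ((1, 0, 0), 0): {"velocity": 0.05},
    ((1, 0, 0), 1): {"velocity": 0.50},
}
# Hypothetical predicate: "bypassed" means the velocity stays below 0.1.
result = bypassed_cells(data, [(0, 0, 0), (1, 0, 0)], 0, 1,
                        lambda rec: rec["velocity"] < 0.1)
print(result)
```

Only the first cell passes the test at both time steps, so only it is reported.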

High-Level Schema

High-Level XQuery Code of Oil Reservoir Management
unordered(
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  for $k in ($z1 to $z2)
  let $p := document("OilRes.xml")/data
    where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
      and ($p/time >= $tmin) and ($p/time <= $tmax)
  return {$i, $j, $k} { analyze($p) }
)

Low-Level Schema
integer 1 [1] float 1 [x]

Mapping Schema
//high/data/velocity    //low/info/data/velocity
//high/data/time        //low/info/data/time
//high/data/mom         //low/info/data/mom [index(//low/info/data/velocity, 1)]
//high/data/x           //low/coord/x [index(//low/info/data/velocity, 1)]
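One way to picture the mapping schema is as a lookup table from high-level paths to low-level paths. The sketch below is only an illustration: the pairing of each `index(...)` annotation with the low-level path that precedes it is a reading of the flattened slide text, the annotation is kept as an opaque string because the slide does not define its semantics, and `MAPPING` and `resolve` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

# high-level path -> (low-level path, optional index annotation kept verbatim)
MAPPING: Dict[str, Tuple[str, Optional[str]]] = {
    "//high/data/velocity": ("//low/info/data/velocity", None),
    "//high/data/time": ("//low/info/data/time", None),
    "//high/data/mom": ("//low/info/data/mom", "index(//low/info/data/velocity, 1)"),
    "//high/data/x": ("//low/coord/x", "index(//low/info/data/velocity, 1)"),
}


def resolve(high_path: str) -> Tuple[str, Optional[str]]:
    """Translate a high-level element path into its low-level location."""
    try:
        return MAPPING[high_path]
    except KeyError:
        raise KeyError(f"no mapping entry for {high_path}") from None


low_path, annotation = resolve("//high/data/x")
print(low_path, annotation)
```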

Compiler Analysis
Problem with direct translation:
  – Each let expression involves a complete scan over the dataset
  – So the final code will need several passes over the data
Solution:
  – Apply data-centric transformations to read each portion of the HDF5 dataset only once

Naïve Strategy
[Figure labels: Dataset; Output]
Requires 3 scans

Data Centric Strategy
[Figure labels: Datasets; Output]
Requires just one scan

Data Centric Transformation
Overall idea in data-centric transformation
  – Iterate over each data element in actual storage
  – Find out the iterations of the original loop in which it is accessed
  – Execute the computation corresponding to those iterations
Previous work
  – Pingali et al.: blocking
  – Ferreira and Agrawal: data-parallel Java on disk-resident datasets
  – Li and Agrawal: XQuery, invert getData functions
Our contribution:
  – Use low-level and mapping schema
  – Extend the idea when multiple datasets need to be accessed
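The difference between a direct translation and the data-centric form can be sketched on a toy in-memory dataset. This is not the compiler's generated code: `DATASET`, `naive`, and `data_centric` are hypothetical names, and the reduction (a sum per `x` value) is just a stand-in for the real per-iteration computation.

```python
# Toy "stored" dataset: one record per (x, t) pair, listed in storage order.
DATASET = [{"x": x, "t": t, "v": float(x * 10 + t)}
           for t in range(3) for x in range(4)]


def naive(x_range):
    """Direct translation: every loop iteration rescans the whole dataset."""
    reads = 0
    out = {}
    for i in x_range:
        total = 0.0
        for rec in DATASET:          # full scan per iteration
            reads += 1
            if rec["x"] == i:
                total += rec["v"]
        out[i] = total
    return out, reads


def data_centric(x_range):
    """Data-centric form: visit each stored element once and dispatch it."""
    wanted = set(x_range)
    reads = 0
    out = {i: 0.0 for i in x_range}
    for rec in DATASET:              # single pass in storage order
        reads += 1
        i = rec["x"]                 # iteration in which this element is accessed
        if i in wanted:
            out[i] += rec["v"]       # run that iteration's computation
    return out, reads


naive_out, naive_reads = naive(range(1, 4))
dc_out, dc_reads = data_centric(range(1, 4))
print(naive_reads, dc_reads)
```

Both versions produce the same output, but the naive one reads every element once per iteration while the data-centric one reads each element exactly once.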

Data Centric Transformation
Mapping function $T$: Iteration space $\to$ High-Level data
Mapping function $C$: High-Level data $\to$ Low-Level data
Mapping function $C \cdot T = M$: Iteration space $\to$ Low-Level data
Our goal is to compute $M^{-1}$.
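A small sketch may help make the three mappings concrete. Everything specific here is an assumption for illustration: the grid extents, the row-major layout with time as the outermost dimension, and the function names `T`, `C`, `M`, and `M_inv`. The slide states only that $M$ is the composition of $C$ and $T$ and that the goal is its inverse.

```python
NX, NY, NZ, NT = 2, 2, 2, 3   # hypothetical grid and time extents


def T(iteration):
    """Iteration (i, j, k) -> high-level elements: that cell at every time step."""
    i, j, k = iteration
    return [(i, j, k, t) for t in range(NT)]


def C(element):
    """High-level element (x, y, z, t) -> low-level offset (assumed row-major layout)."""
    x, y, z, t = element
    return ((t * NX + x) * NY + y) * NZ + z


def M(iteration):
    """M = C . T: iteration -> low-level offsets it touches."""
    return [C(e) for e in T(iteration)]


def M_inv(offset):
    """Inverse mapping: low-level offset -> the iteration that accesses it."""
    z = offset % NZ
    y = (offset // NZ) % NY
    x = (offset // (NZ * NY)) % NX
    return (x, y, z)


offsets = M((1, 0, 1))
print(offsets, [M_inv(o) for o in offsets])
```

Several stored elements (one per time step) map back to the same iteration, which is why the inverse is what lets a single pass over storage drive the computation.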

Data Centric Transformation
Choose one dataset as base dataset $S_1$ from the $n$ datasets to be accessed
Apply $M_1^{-1}$ to compute the set of iterations.
The expression $M_i \cdot M_1^{-1}$ gives the portion of dataset $S_i$ that needs to be accessed along with $S_1$
Choice of base dataset might impact the data locality.
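The composition $M_i \cdot M_1^{-1}$ can be sketched for two datasets. The layouts are assumptions made for this example: $S_1$ is taken to be a time-varying variable stored with time outermost, and $S_2$ a per-cell dataset without a time dimension; `m1_inv`, `m2`, and `companion_offsets` are hypothetical names.

```python
NX, NY, NZ = 2, 2, 2   # hypothetical grid extents


def m1_inv(offset):
    """S1 offset -> iteration (i, j, k); the time step is dropped."""
    z = offset % NZ
    y = (offset // NZ) % NY
    x = (offset // (NZ * NY)) % NX
    return (x, y, z)


def m2(iteration):
    """Iteration (i, j, k) -> offsets of S2 accessed in that iteration."""
    i, j, k = iteration
    return [(i * NY + j) * NZ + k]


def companion_offsets(s1_chunk):
    """Apply M2 . M1^-1 to every offset of a chunk of the base dataset S1."""
    needed = set()
    for off in s1_chunk:
        needed.update(m2(m1_inv(off)))
    return sorted(needed)


# Offsets 8..11 of S1 are the first four cells at the second time step.
needed = companion_offsets(range(8, 12))
print(needed)
```

Under these assumed layouts, a contiguous chunk of $S_1$ maps to a contiguous run of $S_2$; with other layouts the companion portion could be scattered, which is one way the choice of base dataset can affect locality.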

Choice of Base Dataset
Min-IO-Volume Strategy
  – Minimize repeated access to any dataset
Min-Seek-Time Strategy
  – Minimize any discontinuity in access

Template for Generated Code
Generated_Query {
  Create an abstract iteration space using the source code.
  Allocate and initialize an array of output elements corresponding to the iteration space.
  For k = 1, …, NO_OF_CHUNKS {
    Read the k-th chunk of dataset S1 using HDF5 functions and the structural tree.
    Foreach of the other datasets S2, …, Sn
      access the required chunk of the dataset.
    Foreach data element in the chunks of data {
      compute the iteration instance.
      apply the reduction computation and update the output.
    }
  }
}
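The template can be mimicked in plain Python to show the control flow. This is a sketch under stated assumptions, not the generated code: Python lists stand in for datasets that real code would read through the HDF5 library, the layouts (time outermost for `S1`, no time dimension for `S2`), the chunk size, and the reduction (a weighted sum) are all hypothetical, and the chunk size is assumed to divide the number of cells so that no chunk straddles a time step.

```python
NX, NY, NZ, NT = 2, 2, 2, 3
CELLS = NX * NY * NZ
CHUNK = 4   # hypothetical read-chunk size in elements; assumed to divide CELLS

# Stand-ins for two stored datasets.
S1 = [float(off % 5) for off in range(CELLS * NT)]   # time-varying, time outermost
S2 = [float(c) for c in range(CELLS)]                # one value per cell


def read_chunk(dataset, start, count):
    """Placeholder for a chunked read from storage."""
    return dataset[start:start + count]


def generated_query():
    # Iteration space: one iteration per cell; one output slot per iteration.
    output = [0.0] * CELLS
    n_chunks = (len(S1) + CHUNK - 1) // CHUNK
    for k in range(n_chunks):
        start = k * CHUNK
        chunk1 = read_chunk(S1, start, CHUNK)             # k-th chunk of base dataset
        chunk2 = read_chunk(S2, start % CELLS, len(chunk1))  # matching portion of S2
        for pos, value in enumerate(chunk1):
            cell = (start + pos) % CELLS                  # compute the iteration instance
            output[cell] += value * chunk2[pos]           # reduction update
    return output


result = generated_query()
print(result)
```

Each element of `S1` is read exactly once; the output array accumulates partial results across chunks, which is what allows the reduction to be completed in a single pass.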

Experiment
200*200*200 grid with 10 time steps (1.28 GB)
50*50*50 storage chunk size

Experiment
50*50*50 grid with 200 time steps (400 MB)
25*25*25 storage chunk size

Key Observations
Overall minimum execution time
  – Min-IO-Volume strategy when read chunk size matches storage chunk size
Execution time
  – Very sensitive to read chunk size in Min-IO-Volume strategy
  – Not sensitive to read chunk size in Min-Seek-Time strategy, due to buffering of storage chunks

Summary
Compiler techniques
  – Support high-level abstractions on complex low-level data formats
  – Enable use of the same source code across a variety of data formats
  – Perform data-centric transformations automatically
  – Experimental results show that a minor change in strategy can affect performance significantly
Future work
  – Cost models to guide strategy and chunk size selection
  – Compare performance with manual implementations
  – Parallelize data processing
  – Extend applicability of the algorithm to a more general class of queries