Automatic and Efficient Data Virtualization System on Scientific Datasets Li Weng.


Outline
- Introduction
  - Motivation
  - Contributions
  - Overall system framework
- System design, algorithm and experimental evaluation
  - Automatic data virtualization system
    - Data virtualization through data services over scientific datasets
    - Data analysis of data virtualization system
  - Replica selection module
    - Performance optimization using partial replicas
    - Generalizing the work of partial replication optimization
  - Efficient execution of multiple queries on scientific datasets
- Related research
- Conclusions
5/27/2019

Data Grids
- Datasets
  - Large volume: gigabyte, terabyte, petabyte
  - Distributed datasets: generated/collected by scientific simulations or instruments
  - Multi-dimensional datasets: dimension attributes, measure attributes
- Data-intensive applications
  - Data specification
  - Data organization
  - Data extraction
  - Data movement
  - Data analysis
  - Data visualization

Motivating Applications
- Pictured: Digitized Microscopy Image Analysis, Oil Reservoir Management
- Data-driven applications from science, engineering, biomedicine:
  - Oil Reservoir Management
  - Water Contamination Studies
  - Cancer Studies using MRI
  - Telepathology with Digitized Slides
  - Satellite Data Processing
  - Virtual Microscope
  - …

Two Challenges
In view of large dataset sizes, the geographic distribution of users and resources, and complex analysis, we concentrated on two critical challenges:
- Low-level and specialized data formats
- Various query types and an increasing number of clients

Contributions
- Data virtualization system
  - Realizing data virtualization through automatically generated data services (HPDC2004)
  - Supporting complex data analysis processing with SQL-3 queries and aggregations (LCPC2004)
- Replica selection module
  - Designing new techniques for efficient execution of data analysis queries using partial replicas (CCGRID2005)
  - Generalizing the functionality of the replica selection module with two significant extensions (ICPP2006)
- Efficient execution of multiple queries
  - Exploring the performance optimization potential of multiple queries (this paper)

Automatic Data Virtualization System (HPDC2004)
Query template:
    SELECT <Data Elements>
    FROM <Dataset Name>
    WHERE <Expression> AND Filter(<Data Element>);
Example query:
    SELECT *
    FROM IPARS
    WHERE REL in (0,6,26,27) AND TIME>1000 AND TIME<1100 AND SOIL>0.7
      AND SPEED(OILVX, OILVY, OILVZ)<30.0;
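To make the example query on the IPARS dataset concrete, the following is a minimal Python sketch that applies the same predicate to in-memory rows. The row layout, the sample values, and the definition of SPEED as the Euclidean norm of the three velocity components are assumptions for illustration; the slides do not define the filter function.

```python
import math

def speed(vx, vy, vz):
    """Hypothetical SPEED filter: magnitude of the oil velocity vector (assumed)."""
    return math.sqrt(vx * vx + vy * vy + vz * vz)

def matches(row):
    """Predicate equivalent to the WHERE clause of the example query."""
    return (
        row["REL"] in (0, 6, 26, 27)
        and 1000 < row["TIME"] < 1100
        and row["SOIL"] > 0.7
        and speed(row["OILVX"], row["OILVY"], row["OILVZ"]) < 30.0
    )

# Made-up sample rows; only the first satisfies every condition.
rows = [
    {"REL": 6, "TIME": 1050, "SOIL": 0.8, "OILVX": 3.0, "OILVY": 4.0, "OILVZ": 0.0},
    {"REL": 6, "TIME": 1200, "SOIL": 0.8, "OILVX": 3.0, "OILVY": 4.0, "OILVZ": 0.0},  # TIME out of range
    {"REL": 1, "TIME": 1050, "SOIL": 0.8, "OILVX": 3.0, "OILVY": 4.0, "OILVZ": 0.0},  # REL not selected
    {"REL": 0, "TIME": 1010, "SOIL": 0.9, "OILVX": 30.0, "OILVY": 0.0, "OILVZ": 0.0},  # speed is not < 30
]

selected = [r for r in rows if matches(r)]
```

In the system described by the slides, this filtering would be done by generated data services over the low-level files rather than over Python dictionaries; the sketch only shows the query semantics.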

Data Analysis in Data Virtualization System (LCPC2004)

Replica Selection Module (CCGRID2005, ICPP2006)

Automatic Data Virtualization System
- Data virtualization: a data service presents an abstract view of the data over the underlying dataset
- Design a meta-data descriptor
- Automatic data virtualization using our meta-data descriptor

Problem
- The requirements of efficient access and high-performance processing
- The challenge of various query types and an increasing number of clients
- Harnessing an optimization technique: partial replication

Our Approach: Using Partial Replicas
- How to assemble queried data efficiently from replicas and the original dataset
  - Computing a goodness value
  - Replica selection algorithm comprising a greedy strategy and one extension
- The Replica Selection Module is tightly coupled with our prior work on supporting SQL Select queries on scientific datasets in a cluster environment.
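The slides name a greedy replica selection strategy driven by goodness values but do not spell out its steps. One plausible reading, sketched here under stated assumptions, repeatedly picks the candidate with the highest ratio of still-needed data to cost, then removes the covered part of the query. The set-of-cells representation, the fixed per-candidate cost, and all names (`greedy_select`, `replica_a`, and so on) are hypothetical simplifications, not the authors' implementation.

```python
def greedy_select(query_cells, candidates):
    """Greedily cover `query_cells` using candidates of the form name -> (cells, cost).

    Goodness is recomputed each round as (cells still needed) / cost.
    Returns the chosen names in order and any cells left uncovered.
    """
    remaining = set(query_cells)
    plan = []
    while remaining:
        best_name, best_goodness = None, 0.0
        for name, (cells, cost) in candidates.items():
            goodness = len(cells & remaining) / cost
            if goodness > best_goodness:
                best_name, best_goodness = name, goodness
        if best_name is None:  # no candidate covers anything still needed
            break
        plan.append(best_name)
        remaining -= candidates[best_name][0]
    return plan, remaining

# Made-up example: two partial replicas plus the original dataset as a fallback.
candidates = {
    "replica_a": (set(range(0, 6)), 2.0),
    "replica_b": (set(range(4, 8)), 2.0),
    "original": (set(range(0, 10)), 10.0),
}
plan, uncovered = greedy_select(range(0, 10), candidates)
```

Including the original dataset as a candidate guarantees full coverage in this sketch, which mirrors the slide's point that queried data is assembled from both replicas and the original dataset.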

Partial Replicas Considered
- A replica information file describes the replicas created by users.
- Space-partitioned partial replicas
  - Contain all data attributes of a hot portion of the original dataset.
- Hot range
  - Use a group of representative queries to identify the portions of the dataset to be replicated.
- Chunking
  - Allow flexible chunk shapes and sizes.
  - Affects data read cost.
- Dimension order
  - Lay out chunks following different dimension sequences.
  - Affects data seek cost.
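The effect of dimension order on seek cost can be illustrated with a small sketch. It assumes chunks are stored contiguously on disk in row-major fashion following a chosen dimension sequence, and it counts one seek per run of adjacent chunks touched by a query. The function names and the 4x4 chunk grid are hypothetical, chosen only for illustration.

```python
from itertools import product

def linear_position(coord, grid, order):
    """Disk position of a chunk when chunks are laid out following `order`.

    `order` lists dimensions from slowest-varying to fastest-varying.
    """
    pos = 0
    for d in order:
        pos = pos * grid[d] + coord[d]
    return pos

def count_seeks(query, grid, order):
    """Number of contiguous runs of chunks needed to answer `query`.

    `query` gives a half-open range of chunk indices (lo, hi) per dimension.
    """
    coords = product(*(range(lo, hi) for lo, hi in query))
    positions = sorted(linear_position(c, grid, order) for c in coords)
    # A new seek is needed whenever the next chunk is not adjacent on disk.
    return 1 + sum(1 for a, b in zip(positions, positions[1:]) if b != a + 1)

grid = (4, 4)             # 4 x 4 chunks
query = ((0, 4), (1, 2))  # all of dimension 0, a single index of dimension 1

seeks_dim0_slowest = count_seeks(query, grid, order=(0, 1))
seeks_dim1_slowest = count_seeks(query, grid, order=(1, 0))
```

For this query, the layout whose fastest-varying dimension matches the queried extent reads the four chunks in one contiguous run, while the other layout needs a separate seek per chunk.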

Motivating Application
Mouse Placenta Data Analysis
- Analyzing digital microscopic images and studying phenotype changes
- Querying an irregular polygonal region
- Using five adjacent query regions to approximate the boundary of the mouse placenta
- Two overlapping regions are of interest due to the density of red cells
- 160GB of data in total

Problem
- Characteristics of scientific datasets and applications
  - Large size of distributed multidimensional data
  - Large number of I/O retrieval operations
- Two scenarios of interest
  - An irregular sub-region of a multi-dimensional data space
  - Multiple different exploratory queries over overlapping regions

Our Approach
Building on our previous work on performance optimization using partial replicas:
- Propose a cost model incorporating the effect of data locality
- Design the greedy algorithm using the cost model
- Implement three important sub-procedures for generating execution plans

Computing Goodness Value
- Exploiting two sources of chunk reuse across different queries: temporal locality and spatial locality
- $\text{goodness}_{\text{per-chunk}} = \text{useful data}_{\text{per-chunk}} / \text{cost}_{\text{per-chunk}}$
- $\text{cost}_{\text{per-chunk}} = t_{read} \cdot n_{read} + t_{seek} + t_{filter} \cdot n_{filter} + t_{split} \cdot n_{split}$
- $n_{split}$: number of useful tuples if the chunk exhibits locality, or 0 if the chunk does not show any locality
- $t_{split}$: average comparison time for determining which query range a tuple belongs to
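As a worked illustration of the cost formula, the sketch below computes the cost and goodness of a single chunk with and without locality. It reads each t as a per-unit time and each n as a count, which is an assumption about the notation, and all numeric values are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class ChunkStats:
    n_read: int         # units of data read from the chunk
    n_filter: int       # tuples that must be filtered
    n_split: int        # useful tuples to split across queries (0 if no locality)
    useful_tuples: int  # tuples that actually answer the query

def chunk_cost(c, t_read, t_seek, t_filter, t_split):
    """cost = t_read*n_read + t_seek + t_filter*n_filter + t_split*n_split."""
    return t_read * c.n_read + t_seek + t_filter * c.n_filter + t_split * c.n_split

def chunk_goodness(c, **times):
    """goodness = useful data per chunk / cost per chunk."""
    return c.useful_tuples / chunk_cost(c, **times)

# Hypothetical per-unit times, in arbitrary units.
times = dict(t_read=2.0, t_seek=10.0, t_filter=1.0, t_split=0.5)

no_locality = ChunkStats(n_read=100, n_filter=100, n_split=0, useful_tuples=80)
with_locality = ChunkStats(n_read=100, n_filter=100, n_split=80, useful_tuples=80)

cost_plain = chunk_cost(no_locality, **times)     # 200 + 10 + 100 + 0
cost_shared = chunk_cost(with_locality, **times)  # 200 + 10 + 100 + 40
```

The split term raises the cost of a chunk that is shared between queries, but that chunk is read once rather than once per query, which is where the reuse benefit comes from.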

One Example: Using partial replicas for answering multiple queries {Q1, Q2, Q3}
- 4 chunks show temporal locality
- 2 chunks show spatial locality
- A coalesced, aggregated global query space

Three sub-procedures
- Calculate the global query range for multiple queries
- Detecting interesting fragments
- Generating execution plans
  - Divide the single output list of candidate fragments into multiple lists, one per query.
  - Generate and index memory-stored replicas
  - Avoid the buffering of duplicate data attributes

Detecting interesting fragments
Notation:
    Q  : multiple queries
    R  : partial replica set
    D  : original dataset
    F  : interesting fragment set
    F' : output of interesting fragment set with calculated goodness values
Algorithm:
    Input Q, R, D
    Find the interesting fragment set F for the global query range
    Foreach Fi in F
        Identify whether Fi has locality
        Tuple(Fi) = 0, Cost(Fi) = 0
        Foreach chunk C in Fi
            Factor in the cost of split operation
            Tuple(Fi) = Tuple(Fi) + Tuple(C)
            Cost(Fi) = Cost(Fi) + Cost(C)
        Goodness(Fi) = Tuple(Fi) / Cost(Fi)
    Output F'
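The fragment-scoring loop can be sketched in Python as follows. Several details are assumptions for illustration: the global query range is taken to be the bounding box of all queries, a chunk is treated as having locality when it overlaps more than one query, and the split cost is modeled as a per-tuple time multiplied by the chunk's tuple count. The names `Chunk`, `Fragment`, and `score_fragments` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[Tuple[int, int], ...]  # one half-open (lo, hi) range per dimension

def overlaps(a: Box, b: Box) -> bool:
    return all(alo < bhi and blo < ahi for (alo, ahi), (blo, bhi) in zip(a, b))

def global_range(queries: List[Box]) -> Box:
    """Bounding box of all queries (assumed form of the global query range)."""
    dims = range(len(queries[0]))
    return tuple((min(q[d][0] for q in queries), max(q[d][1] for q in queries)) for d in dims)

@dataclass
class Chunk:
    box: Box
    tuples: int  # useful tuples in the chunk
    cost: float  # read + seek + filter cost, excluding the split term

@dataclass
class Fragment:
    name: str
    chunks: List[Chunk]

def score_fragments(queries, fragments, t_split):
    """Return (name, goodness) for each interesting fragment, best first."""
    g = global_range(queries)
    scored = []
    for frag in fragments:
        chunks = [c for c in frag.chunks if overlaps(c.box, g)]
        if not chunks:
            continue  # fragment lies outside the global query range
        tuples, cost = 0, 0.0
        for c in chunks:
            hits = sum(overlaps(c.box, q) for q in queries)
            split = t_split * c.tuples if hits > 1 else 0.0  # split cost only with locality
            tuples += c.tuples
            cost += c.cost + split
        scored.append((frag.name, tuples / cost))
    return sorted(scored, key=lambda item: item[1], reverse=True)

queries = [((0, 10), (0, 10)), ((5, 15), (0, 10))]
fragments = [
    Fragment("A", [Chunk(((0, 5), (0, 10)), 50, 25.0), Chunk(((5, 10), (0, 10)), 50, 25.0)]),
    Fragment("B", [Chunk(((20, 30), (0, 10)), 50, 25.0)]),
    Fragment("C", [Chunk(((10, 15), (0, 10)), 40, 10.0)]),
]
scored = score_fragments(queries, fragments, t_split=0.5)
```

In this made-up example, fragment B falls outside the global range and is dropped, and the second chunk of A overlaps both queries and so pays the split cost.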

Experimental Setup & Design
- A Linux cluster connected via Switched Fast Ethernet
- Each node has two AMD Opteron(tm) 2411MHz CPUs, 8GB of main memory, and two 250GB SATA disks
- Performance improvement using our proposed approach
- Scalability test when increasing the number of nodes hosting the dataset
- Performance test when data query sizes are varied

160GB Mouse Data

Scalability is affected by unbalanced filtering and splitting operations. As the number of nodes increases, seek operations start to dominate the I/O cost.

Conclusions
- We have designed and implemented our automatic data virtualization system and our replica selection module to provide a lightweight layer over large distributed scientific datasets.
- The complexity of manipulating scientific datasets and processing them efficiently is hidden from users and handled by the underlying system.
- Our experimental results demonstrated the efficiency of our system, the performance improvement from partial replication, and good scalability under parallel configurations.