
1 Automatic and Efficient Data Virtualization System on Scientific Datasets
Li Weng

2 Outline
Introduction
    Motivation
    Contributions
    Overall system framework
System design, algorithm and experimental evaluation
    Automatic data virtualization system
        Data virtualization through data services over scientific datasets
        Data analysis of data virtualization system
    Replica selection module
        Performance optimization using partial replicas
        Generalizing the work of partial replication optimization
    Efficient execution of multiple queries on scientific datasets
Related research
Conclusions

3 Data Grids
Datasets
    Large volume: gigabytes, terabytes, petabytes
    Distributed datasets: generated/collected by scientific simulations or instruments
    Multi-dimensional datasets: dimension attributes, measure attributes
Data-intensive applications
    Data specification, data organization, data extraction, data movement, data analysis, data visualization

4 Motivating Applications
Pictured: digitized microscopy image analysis and oil reservoir management
Data-driven applications from science, engineering, and biomedicine:
    Oil reservoir management
    Water contamination studies
    Cancer studies using MRI
    Telepathology with digitized slides
    Satellite data processing
    Virtual microscope

5 Two Challenges
Given large dataset sizes, the geographic distribution of users and resources, and complex analysis requirements, we concentrate on two critical challenges:
    Low-level and specialized data formats
    Various query types and an increasing number of clients

6 Contributions
Data virtualization system
    Realizing data virtualization through automatically generated data services (HPDC2004)
    Supporting complex data analysis processing with SQL-3 queries and aggregations (LCPC2004)
Replica selection module
    Designing new techniques for efficient execution of data analysis queries using partial replicas (CCGRID2005)
    Generalizing the functionality of the replica selection module through two significant extensions (ICPP2006)
Efficient execution of multiple queries
    Exploring the performance optimization potential of multiple queries (this work)

7 Automatic Data Virtualization System (HPDC2004)
Query template:
    SELECT <Data Elements>
    FROM <Dataset Name>
    WHERE <Expression> AND Filter(<Data Element>);
Example query against the IPARS dataset:
    SELECT *
    FROM IPARS
    WHERE REL IN (0, 6, 26, 27) AND TIME > 1000 AND TIME < 1100
        AND SOIL > 0.7 AND SPEED(OILVX, OILVY, OILVZ) < 30.0;

8 Data Analysis in Data Virtualization System (LCPC2004)

9 Replica Selection Module (CCGRID2005, ICPP2006)

10 Automatic Data Virtualization System
Data virtualization: a data service that provides an abstract view of the data over the underlying dataset
Design a meta-data descriptor
Automatic data virtualization using our meta-data descriptor (an illustrative sketch of such a descriptor follows)
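The descriptor format itself is not shown on this slide. As a purely illustrative sketch, under the assumption that the descriptor records a dataset's dimension and measure attributes along with its on-disk layout, it might look like the following; all field names here are hypothetical, not the actual descriptor grammar from the HPDC2004 work:

    # Hypothetical meta-data descriptor sketch; field names are illustrative,
    # not the actual descriptor format used by the system.
    ipars_descriptor = {
        "dataset": "IPARS",
        "dimension_attributes": ["REL", "TIME", "X", "Y", "Z"],      # index the data space
        "measure_attributes": ["SOIL", "OILVX", "OILVY", "OILVZ"],   # values stored per cell
        "layout": {
            "directory": "/data/ipars",           # assumed location
            "file_per": "TIME",                   # e.g., one file per time step (assumption)
            "ordering": ["REL", "Z", "Y", "X"],   # assumed storage order within a file
        },
    }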

11 Problem
Requirements: efficient access and high-performance processing
Challenge: various query types and an increasing number of clients
Optimization technique harnessed: partial replication

12 Our Approach: Using Partial Replicas
How to assemble the queried data efficiently from replicas and the original dataset:
    Compute a goodness value for each candidate
    Apply a replica selection algorithm comprising a greedy strategy and one extension (a sketch of the greedy step follows)
The replica selection module is tightly coupled with our prior work on supporting SQL SELECT queries over scientific datasets in a cluster environment.
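As a minimal sketch of the greedy strategy, assuming each candidate fragment (from a replica or from the original dataset) has already been assigned a goodness value and the set of query sub-ranges it can supply; the Candidate class and the coverage test below are illustrative stand-ins, not the module's actual interface:

    # Minimal sketch of goodness-driven greedy selection; candidates are assumed
    # to be scored beforehand (goodness = useful data per unit cost).
    from dataclasses import dataclass
    from typing import Set, List

    @dataclass
    class Candidate:
        name: str
        goodness: float       # useful data per unit retrieval cost
        covers: Set[str]      # query sub-ranges this candidate can supply

    def greedy_select(candidates: List[Candidate], needed: Set[str]):
        """Pick candidates in decreasing goodness until the query range is covered."""
        plan, remaining = [], set(needed)
        for cand in sorted(candidates, key=lambda c: c.goodness, reverse=True):
            if remaining & cand.covers:        # contributes something not yet covered
                plan.append(cand.name)
                remaining -= cand.covers
            if not remaining:
                break
        return plan, remaining                 # any leftover falls back to the original dataset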

13 Partial Replicas Considered
A replica information file describes the replicas created by users (a hypothetical example entry follows this list).
Space-partitioned partial replicas contain all data attributes of a hot portion of the original dataset.
    Hot range: a group of representative queries identifies the portions of the dataset to replicate.
    Chunking: allows flexible chunk shapes and sizes; affects data read cost.
    Dimension order: chunks are laid out following different dimension sequences; affects data seek cost.
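For illustration only, a single entry of such a replica information file might record the hot range, chunk shape, and dimension order of one space-partitioned replica; the keys below are invented for this sketch and are not the file format actually used by the system:

    # Hypothetical replica information entry; keys are illustrative only.
    replica_entry = {
        "replica_id": 1,
        "source_dataset": "mouse_placenta",               # assumed dataset name
        "hot_range": {"x": (0, 4096), "y": (0, 2048)},    # region derived from representative queries
        "chunk_shape": (512, 512),                        # chunking: affects data read cost
        "dimension_order": ("y", "x"),                    # layout order: affects data seek cost
        "location": "node3:/replicas/r1",                 # assumed placement
    }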

14 Motivating Application
Mouse placenta data analysis
    Analyzing digital microscopic images and studying phenotype change
    Querying an irregular polygonal region
    Using five adjacent query regions to approximate the boundary of the mouse placenta
    Two overlapping regions are of interest due to the density of red cells
    160 GB of data in total

15 Problem
Characteristics of scientific datasets and applications
    Large volumes of distributed multi-dimensional data
    Large numbers of I/O retrieval operations
Two scenarios of interest
    An irregular sub-region of the multi-dimensional data space
    Multiple different exploratory queries over overlapping regions

16 Our Approach
Building on our previous work on performance optimization using partial replicas, we:
    Propose a cost model that incorporates the effect of data locality
    Design a greedy algorithm based on the cost model
    Implement three important sub-procedures for generating execution plans

17 Computing Goodness Value
Exploiting two sources of chunk reuse across different queries: temporal locality and spatial locality
    goodness_chunk = useful_data_chunk / cost_chunk
    cost_chunk = t_read * n_read + t_seek + t_filter * n_filter + t_split * n_split
    n_split: the number of useful tuples if the chunk exhibits locality, or 0 if it does not
    t_split: the average comparison time for deciding which query range a tuple belongs to
(A sketch of this cost model follows.)
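A small sketch of this per-chunk cost model, assuming the timing constants (t_read, t_seek, t_filter, t_split) are placeholders to be calibrated on the target system:

    # Sketch of the per-chunk cost model:
    #   cost = t_read*n_read + t_seek + t_filter*n_filter + t_split*n_split
    def chunk_cost(n_read, n_filter, n_split,
                   t_read=1e-6, t_seek=5e-3, t_filter=2e-7, t_split=3e-7):
        return t_read * n_read + t_seek + t_filter * n_filter + t_split * n_split

    def chunk_goodness(useful_tuples, n_read, n_filter, has_locality):
        # n_split is the number of useful tuples when the chunk shows temporal or
        # spatial locality (its tuples must be split among query ranges), else 0.
        n_split = useful_tuples if has_locality else 0
        return useful_tuples / chunk_cost(n_read, n_filter, n_split)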

18 One Example: Using Partial Replicas to Answer Multiple Queries
Query batch {Q1, Q2, Q3}
    4 chunks show temporal locality
    2 chunks show spatial locality
    The per-query ranges are coalesced and aggregated into a global query space (a small sketch of this coalescing follows)
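The coalescing of the three query ranges into a global query space can be illustrated with a short sketch; the query regions below are hypothetical axis-aligned boxes, not the actual regions from the example:

    # Sketch: coalesce per-query rectangular ranges into one global query range.
    def global_range(queries):
        """queries: list of dicts mapping dimension name -> (low, high)."""
        merged = {}
        for q in queries:
            for dim, (lo, hi) in q.items():
                cur_lo, cur_hi = merged.get(dim, (lo, hi))
                merged[dim] = (min(cur_lo, lo), max(cur_hi, hi))
        return merged

    # Three overlapping 2-D query regions (hypothetical coordinates):
    q1 = {"x": (0, 100),   "y": (0, 50)}
    q2 = {"x": (80, 200),  "y": (20, 90)}
    q3 = {"x": (150, 300), "y": (40, 120)}
    print(global_range([q1, q2, q3]))    # {'x': (0, 300), 'y': (0, 120)}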

19 Detecting interesting fragments
Notation: Q: multiple queries; R: partial replica set; D: original dataset; F: interesting fragment set; F': interesting fragment set annotated with goodness values
Pseudocode (a Python sketch follows):
    Input: Q, R, D
    Calculate the global query range for the multiple queries
    Find the interesting fragment set F for the global query range
    Foreach Fi in F
        Identify whether Fi has locality
        Tuple(Fi) = 0, Cost(Fi) = 0
        Foreach chunk C in Fi
            Factor in the cost of the split operation
            Tuple(Fi) = Tuple(Fi) + Tuple(C)
            Cost(Fi) = Cost(Fi) + Cost(C)
        Goodness(Fi) = Tuple(Fi) / Cost(Fi)
    Output F'
Generating execution plans:
    Divide the single output list of candidate fragments into per-query lists
    Generate and index memory-stored replicas
    Avoid buffering duplicate data attributes
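The fragment-scoring loop above could be realized roughly as in the following sketch; Chunk and Fragment are illustrative containers, and each chunk's useful-tuple count and cost are assumed to come from the per-chunk cost model on slide 17:

    # Sketch of the fragment-scoring loop from the pseudocode above.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Chunk:
        useful_tuples: int     # tuples falling inside the global query range
        cost: float            # per-chunk cost (includes split cost if the chunk has locality)

    @dataclass
    class Fragment:
        chunks: List[Chunk] = field(default_factory=list)
        goodness: float = 0.0

    def score_fragments(fragments: List[Fragment]) -> List[Fragment]:
        """Compute Goodness(Fi) = Tuple(Fi) / Cost(Fi) for each interesting fragment."""
        for frag in fragments:
            tuples = sum(c.useful_tuples for c in frag.chunks)
            cost = sum(c.cost for c in frag.chunks)
            frag.goodness = tuples / cost if cost > 0 else 0.0
        return fragments       # F': fragments annotated with goodness values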

20 Experimental Setup and Design
A Linux cluster connected via switched Fast Ethernet; each node has two 2411 MHz AMD Opteron CPUs, 8 GB of main memory, and two 250 GB SATA disks.
Experiments evaluate the performance improvement of the proposed approach:
    Scalability test while increasing the number of nodes hosting the dataset
    Performance test while varying the data query sizes

21 160GB Mouse Data

22

23 Scalability is affected by unbalanced filtering and splitting operations.
As the number of nodes increases, the seek operation starts to dominate the I/O cost.

24 Conclusions
We have designed and implemented an automatic data virtualization system and a replica selection module that provide a lightweight layer over large distributed scientific datasets.
The complexity of manipulating scientific datasets and processing them efficiently is shielded from users and handled by the underlying system.
Our experimental results demonstrate the efficiency of the system, the performance improvement from partial replication, and good scalability under parallel configurations.

