High-level Interfaces and Abstractions for Data-Driven Applications in a Grid Environment Gagan Agrawal Department of Computer Science and Engineering.


1 High-level Interfaces and Abstractions for Data-Driven Applications in a Grid Environment
Gagan Agrawal, Department of Computer Science and Engineering
(joint work with Liang Chen, Wei Du, Leo Glimcher, Ruoming Jin, Xiaogang Li, Swarup Sahoo, Li Weng, and Xuan Zhang)
(Funded by ACI-9733520, EIA-9703088, ACI-9982087, ACI-0130437, EIA-0203846, ACI-0234273, DoD PET program)

2 Data-Driven Applications
- Becoming increasingly important
- Can be extremely hard to develop for a grid environment
- We need to focus on end-users who have used Matlab- or SQL-like systems for data retrieval and analysis
- Some issues to consider:
  - Different data layouts and formats
  - Flexibly exploiting different forms of parallelism
  - Adapting to available resources

3 Research Projects
- Automatic Data Virtualization
  - XML-based high-level abstractions and use of XQuery
  - SQL-based front-end for the STORM system
- FREERIDE (Framework for Rapid Implementation of Datamining Engines)
  - High-level specification of a parallel data mining algorithm
  - Flexibly exploit different forms of parallelism
- GATES (Grid-based AdapTive Execution on Streams)
  - OGSA-based
  - Support for processing distributed streams in a grid environment
  - Self-adaptation to meet real-time constraints
- Compiler-based front-end to DataCutter
  - Includes support for program adaptation
- More details through four student posters this afternoon

4 Automatic Data Virtualization
- Data virtualization refers to an abstract view of data for access and processing
- Data services are methods that implement a virtual view of data
- Our focus: using compiler techniques to automatically generate data services to support data virtualization
- Two separate ongoing implementations:
  - Using XML Schema based high-level abstractions and XQuery (ICS 2003, LCPC 2003, DBPL 2003, prior compiler work in ICS 2002, PACT 2001)
  - Supporting an SQL front-end for data subsetting operations (jointly with Saltz, Kurc, Catalyurek, et al.)
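To make the idea of a data service concrete, here is a minimal hand-written sketch of the kind of code the project aims to generate automatically. It exposes a record-oriented virtual view over a flat binary layout, and a subsetting query is written against that view rather than against the bytes. The physical layout (three little-endian float32 fields per record), the field names, and the functions `virtual_view` and `select` are assumptions made for illustration; none of them come from the presentation.

```python
import struct
from typing import Iterator, List, NamedTuple

# Hypothetical physical layout: consecutive records of three little-endian
# float32 values (x, y, value). This layout is assumed for illustration only.
RECORD = struct.Struct("<fff")


class Point(NamedTuple):
    """Logical record seen by the application (the virtual view)."""
    x: float
    y: float
    value: float


def virtual_view(raw: bytes) -> Iterator[Point]:
    """Data service: present the raw bytes as a stream of logical records."""
    for fields in RECORD.iter_unpack(raw):
        yield Point(*fields)


def select(raw: bytes, x_min: float, x_max: float) -> List[Point]:
    """Subsetting query written against the virtual view, not the layout."""
    return [p for p in virtual_view(raw) if x_min <= p.x <= x_max]


# Build a small dataset in the assumed physical layout and query it.
rows = [(0.0, 1.0, 10.0), (2.0, 3.0, 20.0), (5.0, 1.0, 30.0)]
raw = b"".join(RECORD.pack(*row) for row in rows)
result = select(raw, 1.0, 5.0)
```

If the physical layout changed, only `RECORD` and `virtual_view` would need to change; `select` and any other code written against `Point` would stay the same, which is the separation that data virtualization is after.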

5 Project Overview
[Figure; labels: TEXT, …, NetCDF, RMDB, HDF5; XML; XQuery; ???]

6 System Architecture
[Figure; labels: External Schema, XQuery/XPath, Compiler, XML Mapping Service, logical XML schema, physical XML schema, C++/C]

7 SQL-Based Front-end

8 FREERIDE Overview
- Framework for Rapid Implementation of Datamining Engines
- Demonstrated for a variety of standard mining algorithms
- Targets distributed memory parallelism, shared memory parallelism, and their combination
- Can be used as a basis for scalable grid-based data mining implementations
- Developed on top of Active Data Repository (ADR) from Saltz's group at Maryland
- Publications: SDM 01, 02, 03; SIGMETRICS 02; IPDPS 04; TKDE 04

9 Key Observation from Mining Algorithms
- Most popular algorithms have a common canonical loop
- Can be used as the basis for supporting a common middleware
- Parallelism of different forms and execution on disk-resident datasets

    While ( ) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
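The canonical loop can be sketched with one concrete instance. The code below uses 1-D k-means as the example: `process(d)` picks the nearest center, and `op` accumulates a (sum, count) pair in the reduction object R. Each data chunk is reduced independently and the partial reduction objects are then merged, which is one way the same loop could map to distributed or shared memory parallelism. This is an illustrative sketch, not the FREERIDE API; the names `local_reduce`, `merge`, and `kmeans_1d` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Reduction object R: maps an index I to an accumulated (sum, count) pair.
Reduction = Dict[int, Tuple[float, int]]


def local_reduce(data: Iterable[float], process: Callable[[float], int]) -> Reduction:
    """One pass of the canonical loop: I = process(d); R(I) = R(I) op d."""
    r: Reduction = {}
    for d in data:
        i = process(d)
        s, n = r.get(i, (0.0, 0))
        r[i] = (s + d, n + 1)  # 'op' here is sum-and-count accumulation
    return r


def merge(a: Reduction, b: Reduction) -> Reduction:
    """Combine reduction objects computed independently on two chunks."""
    out = dict(a)
    for i, (s, n) in b.items():
        s0, n0 = out.get(i, (0.0, 0))
        out[i] = (s0 + s, n0 + n)
    return out


def kmeans_1d(chunks: List[List[float]], centers: List[float], iters: int = 10) -> List[float]:
    """Outer 'While' loop: repeat the reduction and update the centers."""
    for _ in range(iters):
        current = list(centers)

        def nearest(d: float) -> int:
            return min(range(len(current)), key=lambda j: abs(d - current[j]))

        r: Reduction = {}
        for chunk in chunks:  # each chunk could run on a separate node or thread
            r = merge(r, local_reduce(chunk, nearest))
        # Centers with no assigned points keep their previous position.
        centers = [r[j][0] / r[j][1] if j in r else current[j]
                   for j in range(len(current))]
    return centers


chunks = [[1.0, 2.0, 9.0], [10.0, 3.0, 11.0]]
centers = kmeans_1d(chunks, [0.0, 5.0])
```

Because the accumulation is associative and commutative, the merged result does not depend on how the data is split into chunks, which is what lets a middleware choose the form of parallelism without changing the algorithm specification.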

10 Applications of FREERIDE
- Apriori and FP-tree based association mining: distributed memory, shared memory, combination
- K-means and EM clustering: distributed memory, shared memory, combination
- Nearest-neighbor search
- RainForest-based decision tree construction: shared memory
- A new decision tree algorithm, Statistical Pruning of Intervals for Enhanced Scalability (SPIES): distributed memory, shared memory, combination

11 Applying FREERIDE for Scientific Data Mining
- Joint work with Machiraju and Parthasarathy
- Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
- A feature is a region of interest in a dataset
- A suite of algorithms for extracting and tracking features
- Vortex Extraction on FREERIDE

12 FREERIDE forms a basis for supporting high-level interfaces
- Data Parallel Java (LCPC 2002, IPDPS 2003)
- Matlab / mining operators (planned in the future)

13 GATES
- Grid-based AdapTive Execution on Streams
- Targets (distributed) processing of (distributed) data streams
- Built on the OGSA model
- Self-adaptation to meet real-time constraints on processing

14 GATES: Motivation
- Many applications involve high-volume data streams
  - Data from large-scale experiments / simulations
  - Digitized images from a movie camera
  - Network traffic
- Data may arise from distributed sources
- Analysis / consumption of results may be distributed
  - Many users wanting different analyses/results
  - Insufficient compute power at one site

15 Self Adaptation in GATES
- Goal: achieve the best accuracy with available resources, subject to the real-time constraint
- GATES approach:
  - Programmer exposes certain parameters in the processing of each stage; examples include rate of sampling and size of summary structure
  - Programmer also specifies the direction of sensitivity, e.g. a larger summary structure means more computation/communication
  - Parameters are adjusted at runtime
  - Currently based upon size of buffers: signal previous stage to become faster/slower if buffer too small / too large
  - Future possibilities: use profiling / performance models …
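The buffer-based adjustment described on this slide can be sketched as a small feedback loop. The code below is a toy simulation, not the GATES implementation: the class `AdaptiveParameter`, the functions `feedback` and `simulate`, the 25% / 75% buffer thresholds, and the choice of a sampling percentage as the exposed parameter are all assumptions for illustration. The `increases_load` flag stands in for the programmer-specified direction of sensitivity.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveParameter:
    """A tunable parameter exposed by the programmer (here, a sampling percentage)."""
    value: int
    low: int
    high: int
    step: int
    increases_load: bool = True  # direction of sensitivity: larger value means more work

    def adjust(self, load_change: int) -> None:
        """Raise (+1) or lower (-1) the load by one step, clamped to [low, high]."""
        direction = load_change if self.increases_load else -load_change
        self.value = min(self.high, max(self.low, self.value + direction * self.step))


def feedback(buffer_len: int, capacity: int) -> int:
    """Signal for the previous stage: -1 send less, +1 send more, 0 no change."""
    if buffer_len * 4 > capacity * 3:   # buffer more than 75% full (assumed threshold)
        return -1
    if buffer_len * 4 < capacity:       # buffer less than 25% full (assumed threshold)
        return +1
    return 0


def simulate(arrivals_per_tick: int, service_per_tick: int, ticks: int) -> AdaptiveParameter:
    """Two-stage toy pipeline: upstream samples the stream, downstream consumes it."""
    rate = AdaptiveParameter(value=100, low=10, high=100, step=10)
    buffered, capacity = 0, 100
    for _ in range(ticks):
        buffered += arrivals_per_tick * rate.value // 100   # upstream forwards a sample
        buffered = min(capacity, buffered)                  # items beyond capacity are dropped
        buffered = max(0, buffered - service_per_tick)      # downstream consumes
        rate.adjust(feedback(buffered, capacity))           # runtime adjustment
    return rate


final = simulate(arrivals_per_tick=50, service_per_tick=20, ticks=200)
```

In this toy run the downstream stage can only consume 20 items per tick while 50 arrive, so the sampling percentage is pushed below its initial 100% and then oscillates around the level the consumer can sustain. A controller based on profiling or performance models, as the slide suggests for future work, would replace `feedback`.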

16 Summary
- Application development in a grid environment is hard
- Need novel runtime techniques and middleware
- Innovative applications of compiler technology can help
- Equipment needs:
  - Need a controlled distributed environment
  - Need high-bandwidth connectivity; need to simulate:
    - High rate of data arrival
    - External clients with ability to receive data at high rates
  - Scaling work to systems with terabytes of storage

