
1 FREERIDE-G: Framework for Developing Grid-Based Data Mining Applications
L. Glimcher, R. Jin, G. Agrawal
Presented by: Leo Glimcher
ICPP'06

2 Distributed Data-Intensive Science
[Diagram: a user's compute cluster on one side, a remote data repository cluster on the other, with a question mark over how to connect the two]

3 Challenges for Application Development
Analysis of large amounts of disk-resident data
Incorporating parallel processing into the analysis
Processing needs to be independent of other elements and easy to specify
Coordination of storage, network, and computing resources required
Transparency of data retrieval, staging, and caching is desired

4 FREERIDE-G Goals
Support High-End Processing
  Enable efficient processing of large-scale data mining computations
Ease Use of Parallel Configurations
  Support shared- and distributed-memory parallelization starting from a common high-level interface
Hide Details of Data Movement and Caching
  Data staging and caching (when feasible/appropriate) need to be transparent to the application developer

5 Presentation Road Map
Motivation and goals
System architecture and overview
Applications used for evaluation
Experimental evaluation
Related work in distributed data-intensive science
Conclusions and future work

6 FREERIDE-G Architecture
[Architecture diagram: a data server runs on each data repository node (data retrieval, data distribution, communication); a compute server runs on each compute node of the user cluster (communication, caching and retrieval, data processing)]

7 Data Server Functionality
Data retrieval: data chunks read from repository disks
Data distribution: each chunk assigned a destination processing node in the user cluster
Data communication: each chunk forwarded to its destination processing node
A data server runs on every on-line data repository node, automating data delivery to the end user
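The three steps above compose into a simple per-chunk pipeline. Below is a minimal C++ sketch of that pipeline; the types and helpers (Chunk, read_next_chunk, send_to) are hypothetical stand-ins for illustration, not FREERIDE-G's actual interface:

    #include <cstdint>
    #include <vector>

    struct Chunk {
        uint64_t id;               // unique chunk identifier
        std::vector<char> bytes;   // payload read from a repository disk
    };

    // Assumed helpers, declared only for the sketch:
    Chunk read_next_chunk();                        // data retrieval
    void send_to(int compute_node, const Chunk& c); // data communication

    void data_server_loop(int num_compute_nodes, int num_chunks) {
        for (int i = 0; i < num_chunks; ++i) {
            Chunk c = read_next_chunk();
            // Data distribution: hash on the unique chunk id (slide 12)
            // to pick a destination compute node.
            int dest = static_cast<int>(c.id % num_compute_nodes);
            send_to(dest, c);
        }
    }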

8 Compute Node Functionality
Data communication: data chunks received from the corresponding data node
Computation: application-specific processing performed on each chunk
Data caching & retrieval: for multi-pass algorithms, data is cached locally on the 1st pass and retrieved locally for subsequent passes
A compute server runs on every processing node to receive data and process it in an application-specific way
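The cache-on-first-pass behavior can be pictured as below. This is a sketch under assumed names (fetch_remote_chunk, cache_to_local_disk, read_cached_chunk, and process are placeholders, not the middleware's API):

    #include <cstdint>
    #include <optional>
    #include <vector>

    struct Chunk { uint64_t id; std::vector<char> bytes; };

    // Hypothetical stand-ins for the real communication/IO layer:
    std::optional<Chunk> fetch_remote_chunk();   // from a data server (pass 1)
    void cache_to_local_disk(const Chunk& c);    // write to the local file system
    std::optional<Chunk> read_cached_chunk();    // local retrieval; nullopt at end of pass
    void process(const Chunk& c);                // application-specific reduction

    void compute_node_pass(int pass) {
        if (pass == 1) {
            // First pass: chunks arrive over the network and are cached locally.
            while (auto c = fetch_remote_chunk()) {
                cache_to_local_disk(*c);
                process(*c);
            }
        } else {
            // Later passes: no network traffic; read chunks back from the local cache.
            while (auto c = read_cached_chunk()) {
                process(*c);
            }
        }
    }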

9 Processing structure of FREERIDE-G
Built on FREERIDE
Key observation: most data mining algorithms follow a canonical reduction loop
Middleware API:
  Subset of data to be processed
  Reduction object
  Local and global reduction operations
  Iterator
Supports:
  Disk-resident datasets
  Shared & distributed memory
Canonical loop:
  while (not done) {
      forall (data instances d) {
          I = process(d)
          R(I) = R(I) op d
      }
  }
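To make the loop concrete, here is how k-means clustering fits the pattern: process(d) picks the nearest centroid, and op accumulates the point into that centroid's slot of the reduction object R. This is an illustrative C++ sketch, not FREERIDE's actual class interface:

    #include <cstddef>
    #include <vector>

    using Point = std::vector<double>;

    // Reduction object for k-means: one accumulator per cluster.
    struct ClusterAcc {
        std::vector<double> sum;  // coordinate-wise sum of assigned points
        long count = 0;           // number of assigned points
    };

    double sq_dist(const Point& a, const Point& b) {
        double s = 0;
        for (std::size_t i = 0; i < a.size(); ++i)
            s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // One local reduction pass over a chunk of points.
    // R must have one entry per centroid.
    // I = process(d) finds the nearest centroid; R(I) = R(I) op d accumulates d.
    void local_reduction(const std::vector<Point>& chunk,
                         const std::vector<Point>& centroids,
                         std::vector<ClusterAcc>& R) {
        for (const Point& d : chunk) {
            std::size_t I = 0;
            for (std::size_t k = 1; k < centroids.size(); ++k)
                if (sq_dist(d, centroids[k]) < sq_dist(d, centroids[I]))
                    I = k;
            if (R[I].sum.empty()) R[I].sum.assign(d.size(), 0.0);
            R[I].count += 1;
            for (std::size_t j = 0; j < d.size(); ++j)
                R[I].sum[j] += d[j];
        }
    }

The global reduction (slide 12) then merges the per-node R objects and recomputes each centroid at the master node.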

10 Summary of implementation issues
Managing and communicating remote data: 2-way coordination required
Load distribution: needed if the compute cluster is bigger than the data cluster
Parallel processing on the compute cluster: FREERIDE-G supports generalized reductions
Caching: benefits multi-pass algorithms

11 Remote Data Issues
Managing data communication:
  ADR library used for scheduling and performing data retrieval at the repository site
  Communication timing coordinated between source and destination
Caching:
  Local file system used for caching
  Avoids redundant communication of data for (P-1)/P of the iterations (see the check below)
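A quick check of the (P-1)/P figure, under the simplifying assumption that each of the P passes reads the same data volume D over a network of bandwidth B_net, with local disk bandwidth B_disk:

$$ T_{\text{no-cache}} = P \cdot \frac{D}{B_{\text{net}}}, \qquad T_{\text{cache}} = \frac{D}{B_{\text{net}}} + (P-1) \cdot \frac{D}{B_{\text{disk}}} $$

Only 1 of the P passes crosses the network, so (P-1)/P of the remote communication is avoided; a 10-pass algorithm, for example, saves 9/10 of its remote transfers, consistent with the factor-of-P decrease in communication time reported on slide 19.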

12 Parallel data processing issues
Load distribution:
  Needed when more compute nodes are available than data nodes
  Hashing on unique chunk ID
Parallel processing on compute cluster:
  After data is distributed, local reduction performed on every node
  Reduction objects gathered at the master node
  Global combination (reduction) performed on the master node (see the sketch below)
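Continuing the k-means sketch from slide 9, the global combination reduces to a pairwise merge of the gathered reduction objects. Again a sketch with the same hypothetical types, not the middleware's interface:

    #include <cstddef>
    #include <vector>

    struct ClusterAcc { std::vector<double> sum; long count = 0; };

    // Merge one node's local reduction object into the master's copy;
    // called once per compute node after the gather step.
    // Assumes master and local have the same number of clusters.
    void global_combine(std::vector<ClusterAcc>& master,
                        const std::vector<ClusterAcc>& local) {
        for (std::size_t k = 0; k < master.size(); ++k) {
            if (local[k].sum.empty()) continue;  // node saw no points for cluster k
            if (master[k].sum.empty()) master[k].sum.assign(local[k].sum.size(), 0.0);
            master[k].count += local[k].count;
            for (std::size_t j = 0; j < local[k].sum.size(); ++j)
                master[k].sum[j] += local[k].sum[j];
        }
    }
    // New centroid k = master[k].sum / master[k].count once every node is merged in.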

13 Application Summary
Data Mining:
  K-Nearest Neighbor search
  K-means clustering
  EM clustering
Scientific Feature Mining:
  Vortex detection in the fluid flow dataset
  Molecular defect detection in the molecular dynamics dataset

14 Goals for Experimental Evaluation
Evaluating parallel scalability of the applications developed: numbers of data and compute nodes kept equal across varying parallel configurations
Evaluating scalability of compute nodes: number of compute nodes varied independently of the number of data nodes
Evaluating benefits of caching: multi-pass algorithms evaluated

15 Evaluating Overall Scalability
Cluster of 700 MHz Pentiums
Connected through Myrinet LANai 7.0 (no access to a high-bandwidth network)
Equal number of repository and compute nodes

16 Overall Scalability
All 5 applications tested: high parallel efficiency
Good scalability with respect to:
  Problem size
  Number of processing nodes

17 Evaluating Scalability of Compute Nodes
Compute cluster size is greater than data repository cluster size
Applications (single-pass only): kNN search, molecular defect detection, vortex detection (next slide)
Parallel configurations:
  Data nodes: 1 to 8
  Compute nodes: 1 to 16

18 Compute Node Scalability
Only the data processing work is parallelized
Data retrieval and communication times are not affected, so speedups are sub-linear
Better resource utilization still leads to a decrease in analysis time
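The sub-linear shape follows directly from the fixed costs: if retrieval and communication times stay constant while the processing work divides across N compute nodes, the overall speedup is (a simple model for intuition, not a formula from the paper)

$$ \text{Speedup}(N) = \frac{T_{\text{retr}} + T_{\text{comm}} + T_{\text{proc}}}{T_{\text{retr}} + T_{\text{comm}} + T_{\text{proc}}/N} $$

which is bounded above by (T_retr + T_comm + T_proc) / (T_retr + T_comm) no matter how many compute nodes are added.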

19 Evaluating effects of caching
Network bandwidth simulated: 500 KB/sec
Caching vs. non-caching versions compared
Data communication times over P passes: factor of P decrease from caching
Caching benefit depends on:
  The application
  Network bandwidth

20 Related Work
Support for grid-based data mining:
  Knowledge Grid toolset
  Grid-Miner toolkit
  Discovery Net layer
  DataMiningGrid framework
  None of these offers an interface for easing parallelization and abstracting data movement
GRIST: supports astronomy-related mining on the grid, but is specific to the astronomical domain
FREERIDE-G is built directly on top of FREERIDE

21 Conclusions
FREERIDE-G supports remote data analysis from a high-level interface
Evaluated on a variety of algorithms
Demonstrated scalability in terms of:
  Even data-compute scale-up
  Compute-node scale-up (processing time only)
Multi-pass algorithms benefit from data caching

22 Continuing Work on FREERIDE-G
High-bandwidth network evaluation
Performance-prediction-based resource selection
Resource allocation
More sophisticated caching and data communication mechanisms (SRB)
Data format issues: wrapper integration
A higher-level front end to further ease development of data analysis tools for the grid
