FREERIDE-G: Framework for Developing Grid-Based Data Mining Applications
L. Glimcher, R. Jin, G. Agrawal
Presented by: Leo Glimcher {glimcher@cse.ohio-state.edu}
ICPP’06

Distributed Data-Intensive Science
[Diagram: a user's compute cluster retrieves and analyzes data held on a remote data repository cluster]

Challenges for Application Development
- Analysis of large amounts of disk-resident data
- Incorporating parallel processing into the analysis
- Processing needs to be independent of other elements and easy to specify
- Coordination of storage, network, and computing resources is required
- Transparency of data retrieval, staging, and caching is desired

FREERIDE-G Goals
- Support high-end processing: enable efficient processing of large-scale data mining computations
- Ease use of parallel configurations: support shared- and distributed-memory parallelization starting from a common high-level interface
- Hide details of data movement and caching: data staging and caching (when feasible/appropriate) need to be transparent to the application developer

Presentation Road Map
- Motivation and goals
- System architecture and overview
- Applications used for evaluation
- Experimental evaluation
- Related work in distributed data-intensive science
- Conclusions and future work

FREERIDE-G Architecture
[Diagram: a data server (data retrieval, data distribution, communication, caching) runs on each node of the data repository; compute nodes in the user cluster run communication, data processing, and caching/retrieval]

Data Server Functionality
- Data retrieval: data chunks are read from repository disks
- Data distribution: each chunk is assigned a destination processing node in the user cluster
- Data communication: each chunk is forwarded to its destination processing node
A data server runs on every on-line data repository node, automating data delivery to the end user; a sketch of this loop follows below.
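To make the retrieval / distribution / communication pipeline concrete, below is a minimal C++ sketch of a data server's main loop. All types and helpers here (Chunk, read_chunk, send_to) are hypothetical stand-ins with trivial stub bodies; the actual middleware delegates retrieval scheduling to the ADR library, so treat this as an illustration of the flow, not the implementation.

    #include <cstdint>
    #include <vector>

    // Hypothetical stand-ins for FREERIDE-G internals.
    struct Chunk {
        std::uint64_t id;
        std::vector<char> bytes;
    };

    Chunk read_chunk(std::uint64_t id) {          // data retrieval (stub)
        return Chunk{id, {}};                     // real code reads repository disks
    }

    void send_to(int /*node*/, const Chunk&) {}   // data communication (stub):
                                                  // real code forwards over the network

    // Data distribution: hash the unique chunk id to a compute node,
    // as described on the load-distribution slide later in the deck.
    int destination(std::uint64_t id, int num_compute_nodes) {
        return static_cast<int>(id % num_compute_nodes);
    }

    void data_server_loop(const std::vector<std::uint64_t>& my_chunks,
                          int num_compute_nodes) {
        for (std::uint64_t id : my_chunks) {
            Chunk c = read_chunk(id);                        // retrieval
            send_to(destination(id, num_compute_nodes), c);  // distribution + communication
        }
    }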

Compute Node Functionality
- Data communication: data chunks are received from the corresponding data node
- Computation: application-specific processing is performed on each chunk
- Data caching & retrieval: for multi-pass algorithms, data is cached locally on the 1st pass and retrieved locally for subsequent passes
A compute server runs on every processing node to receive data and process it in an application-specific way; a matching sketch follows below.
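The compute-node side, in the same hypothetical style as the data-server sketch above: chunks arrive over the network only on the first pass, and subsequent passes are served from the local cache. Again, every name here is an illustrative assumption with a stub body.

    #include <cstdint>
    #include <vector>

    struct Chunk { std::uint64_t id; std::vector<char> bytes; };

    // Trivial stubs; the real middleware receives over the network and
    // caches through the local file system.
    Chunk receive_chunk() { return Chunk{}; }
    void  store_in_cache(const Chunk&) {}
    Chunk load_from_cache(std::uint64_t id) { return Chunk{id, {}}; }
    void  process(const Chunk&) {}   // application-specific reduction

    void compute_node_pass(int pass, const std::vector<std::uint64_t>& chunk_ids) {
        for (std::uint64_t id : chunk_ids) {
            Chunk c = (pass == 0) ? receive_chunk()       // 1st pass: over the network
                                  : load_from_cache(id);  // later passes: local disk
            process(c);
            if (pass == 0) store_in_cache(c);             // cache for subsequent passes
        }
    }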

Processing Structure of FREERIDE-G
Built on FREERIDE. Key observation: most data mining algorithms follow a canonical loop.
Middleware API:
- Subset of data to be processed
- Reduction object
- Local and global reduction operations
- Iterator
Supports:
- Disk-resident datasets
- Shared & distributed memory

    while (!done) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        ...
    }

A concrete instantiation of this loop appears below.
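As one example, here is a rough k-means sketch (k-means is among the applications evaluated later): process(d) finds the nearest cluster center, and the "op" accumulates the instance into per-cluster sums and counts held in the reduction object. The class layout and names are illustrative assumptions, not the actual FREERIDE interface.

    #include <array>
    #include <cmath>

    constexpr int K = 8;    // number of clusters (assumed for illustration)
    constexpr int DIM = 3;  // dimensionality of the data

    struct Point { std::array<double, DIM> x; };

    // Reduction object: per-cluster running sums and counts.
    struct ReductionObject {
        std::array<std::array<double, DIM>, K> sum{};
        std::array<long, K> count{};
    };

    // process(d): identify which element of the reduction object this
    // data instance updates (the nearest cluster center).
    int process(const Point& d, const std::array<Point, K>& centers) {
        int best = 0;
        double best_dist = INFINITY;
        for (int k = 0; k < K; ++k) {
            double dist = 0.0;
            for (int j = 0; j < DIM; ++j) {
                double diff = d.x[j] - centers[k].x[j];
                dist += diff * diff;
            }
            if (dist < best_dist) { best_dist = dist; best = k; }
        }
        return best;
    }

    // R(I) = R(I) op d: accumulate the instance into the reduction object.
    void local_reduce(ReductionObject& R, const Point& d,
                      const std::array<Point, K>& centers) {
        int I = process(d, centers);
        for (int j = 0; j < DIM; ++j) R.sum[I][j] += d.x[j];
        R.count[I] += 1;
    }

Because all updates funnel through the reduction object, the same local_reduce works unchanged whether chunks are processed by threads on shared memory or by processes across the cluster.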

Summary of Implementation Issues
- Managing and communicating remote data: two-way coordination required
- Load distribution: needed if the compute cluster is bigger than the data cluster
- Parallel processing on the compute cluster: FREERIDE-G supports generalized reductions
- Caching: benefits multi-pass algorithms

Remote Data Issues
Managing data communication:
- ADR library used for scheduling and performing data retrieval at the repository site
- Communication timing coordinated between source and destination
Caching:
- Local file system used for caching
- Avoids redundant communication of data for (P-1)/P of the iterations (worked out below)
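To spell out the (P-1)/P figure: for a P-pass algorithm over a dataset of total size D, the non-caching version moves the data across the network on every pass, while the caching version moves it only on the first pass:

    without caching:  P · D bytes communicated
    with caching:         D bytes communicated
    fraction avoided: (P·D − D) / (P·D) = (P − 1) / P

For example, a 3-pass run avoids 2/3 of the remote data movement.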

Parallel Data Processing Issues
Load distribution:
- Needed when more compute nodes are available than data nodes
- Hashing on unique chunk ID
Parallel processing on the compute cluster:
- After data is distributed, local reduction is performed on every node
- Reduction objects gathered at the master node
- Global combination (reduction) performed on the master node (sketched below)
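A minimal sketch of the gather-and-combine step, written with MPI as an illustrative transport (an assumption; the slides do not say which communication layer FREERIDE-G uses here). It works for any reduction whose "op" is commutative and associative, such as the k-means sums and counts above.

    #include <mpi.h>
    #include <vector>

    // Global combination: element-wise sum of every node's reduction
    // object, delivered to the master (rank 0).
    void global_combine(std::vector<double>& local_R) {
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        std::vector<double> global_R(local_R.size(), 0.0);
        MPI_Reduce(local_R.data(), global_R.data(),
                   static_cast<int>(local_R.size()),
                   MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) local_R.swap(global_R);  // master now holds the combined object
    }

For k-means, the master would then divide each cluster's summed coordinates by its count to obtain the new centers before broadcasting them for the next pass.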

Application Summary
Data mining:
- K-nearest neighbor search
- K-means clustering
- EM clustering
Scientific feature mining:
- Vortex detection in a fluid flow dataset
- Molecular defect detection in a molecular dynamics dataset

Goals for Experimental Evaluation
- Evaluating parallel scalability of the applications developed: numbers of data and compute nodes kept equal across varying parallel configurations
- Evaluating scalability of compute nodes: number of compute nodes varied independently of the number of data nodes
- Evaluating benefits of caching: multi-pass algorithms evaluated

Evaluating Overall Scalability
- Cluster of 700 MHz Pentiums
- Connected through Myrinet LANai 7.0 (no access to a high-bandwidth network)
- Equal number of repository and compute nodes

Overall Scalability
All 5 applications tested:
- High parallel efficiency
- Good scalability with respect to problem size and number of processing nodes

Evaluating Scalability of Compute Nodes
- Compute cluster size is greater than data repository cluster size
- Applications (single-pass only): kNN search, molecular defect detection, vortex detection
- Parallel configurations: data nodes 1 to 8, compute nodes 1 to 16

Compute Node Scalability
- Only the data processing work is parallelized
- Data retrieval and communication times are not affected
- Speedups are therefore sub-linear
- Better resource utilization leads to a decrease in analysis time

Evaluating Effects of Caching
- Simulated network bandwidth: 500 KB/sec
- Caching vs. non-caching versions compared
- Comparing data communication times over P passes: factor-of-P decrease from caching
- Caching benefit depends on the application and on network bandwidth

Related Work
Support for grid-based data mining:
- Knowledge Grid toolset
- GridMiner toolkit
- Discovery Net layer
- DataMiningGrid framework
- None offer an interface for easing parallelization and abstracting data movement
- GRIST: support for astronomy-related mining on the grid, but specific to the astronomical domain
FREERIDE-G is built directly on top of FREERIDE.

Conclusions
- FREERIDE-G supports remote data analysis through a high-level interface
- Evaluated on a variety of algorithms
- Demonstrated scalability in terms of: even data-compute scale-up; compute-node scale-up (processing time only)
- Multi-pass algorithms benefit from data caching

Continuing Work on FREERIDE-G
- High-bandwidth network evaluation
- Performance-prediction-based resource selection
- Resource allocation
- More sophisticated caching and data communication mechanisms (SRB)
- Data format issues: wrapper integration
- Higher-level front end to further ease development of data analysis tools for the grid
