Computer Science and Engineering Parallelizing Feature Mining Using FREERIDE Leonid Glimcher P. 1 ipdps’04 Scaling and Parallelizing a Scientific Feature.


Presentation transcript:

Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware
Leonid Glimcher, Xuan Zhang, Gagan Agrawal
Computer Science and Engineering. Parallelizing Feature Mining Using FREERIDE (Leonid Glimcher), IPDPS'04.

Presentation Road Map
- Motivation.
- Description of middleware and functionality.
- Description of sequential vortex detection algorithm.
- Parallelization challenges and solution.
- Experimental results.
- Conclusions and future work.

Motivation for Middleware
Problem:
- Data is growing in size exponentially.
- Extracting knowledge from data is increasingly difficult.
Solution:
- Parallelize data mining algorithms to make knowledge extraction more efficient.
But developing parallel data mining applications is not a routine task: it is tedious and time-consuming.

FREERIDE
Framework for Rapid Implementation of Datamining Engines (created by Jin, Agrawal, et al.)
- Distributed and shared memory parallelization functionality.
- Support for efficient processing of disk-resident datasets.
- Based on a key observation: the core processing follows a common loop structure.

    While( ) {
        forall( data instances d ) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
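The loop above can be made concrete with a small sketch. The Python below is an illustration only and is not FREERIDE code: it assumes a 1-D k-means pass, where `process(d)` picks the nearest centroid and the reduction object `R` holds one (sum, count) accumulator per centroid. The function names `process`, `reduction_pass`, and `kmeans_1d` are hypothetical.

```python
from typing import List, Tuple


def process(d: float, centroids: List[float]) -> int:
    """Map a data instance to the index of the reduction element it updates."""
    return min(range(len(centroids)), key=lambda i: abs(d - centroids[i]))


def reduction_pass(data: List[float], centroids: List[float]) -> List[Tuple[float, int]]:
    # R is the reduction object: one (sum, count) accumulator per centroid.
    R = [(0.0, 0) for _ in centroids]
    for d in data:                        # forall(data instances d)
        i = process(d, centroids)         # I = process(d)
        s, n = R[i]
        R[i] = (s + d, n + 1)             # R(I) = R(I) op d, with op = accumulate
    return R


def kmeans_1d(data: List[float], centroids: List[float], iterations: int = 10) -> List[float]:
    for _ in range(iterations):           # the outer While( ) loop
        R = reduction_pass(data, centroids)
        # Update step between passes; an empty cluster keeps its old centroid.
        centroids = [s / n if n else c for (s, n), c in zip(R, centroids)]
    return centroids
```

Because the update to `R` is an accumulation whose result does not depend on the order in which instances are visited, the inner loop can be split across workers, which is the property the parallelization relies on.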

Parallelization in FREERIDE
Distributed Memory Setting:
- Data is divided between processors.
- The reduction object is replicated.
- Each node performs a local reduction on its data.
- The master node performs the global reduction.
Shared Memory Setting:
- Different data items are assigned to different threads.
- Synchronization techniques are used to avoid race conditions when accessing the reduction object.
- Synchronization involves replication and/or locking.
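The two synchronization styles listed above can be sketched side by side. This is a minimal illustration under stated assumptions, not the middleware's implementation: Python threads stand in for both cluster nodes and shared-memory threads, no real message passing takes place, and the reduction is a simple histogram. The names `local_reduction`, `global_reduction`, `replicated`, and `locked` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

NUM_BINS = 4  # size of the (assumed) histogram reduction object


def local_reduction(chunk: List[int]) -> List[int]:
    """Reduce one partition into a private, replicated reduction object."""
    R = [0] * NUM_BINS
    for d in chunk:
        R[d % NUM_BINS] += 1
    return R


def global_reduction(replicas: List[List[int]]) -> List[int]:
    """Combine the per-worker replicas element-wise, as a master node would."""
    return [sum(column) for column in zip(*replicas)]


def replicated(data: List[int], workers: int) -> List[int]:
    chunks = [data[i::workers] for i in range(workers)]   # divide data between workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        replicas = list(pool.map(local_reduction, chunks))
    return global_reduction(replicas)


def locked(data: List[int], workers: int) -> List[int]:
    """Alternative: one shared reduction object guarded by a lock."""
    R = [0] * NUM_BINS
    lock = threading.Lock()

    def work(chunk: List[int]) -> None:
        for d in chunk:
            with lock:                    # serialize updates to avoid a race on R
                R[d % NUM_BINS] += 1

    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, chunks))      # consume the iterator to wait for completion
    return R
```

Replication trades memory (one copy of the reduction object per worker) for lock-free updates, while locking keeps a single copy at the cost of contention on every update.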

Application Specific Functionality
To be specified by the developer using this interface:
- Subset of data to be processed
- Local reductions
- Global reductions
- Iterator
In addition, an application-specific reduction object needs to be defined.
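To show how these pieces could fit together, here is a hypothetical Python interface. It is not FREERIDE's actual API: every class and method name below is invented for illustration, and the "Iterator" item, which the slide does not define, is interpreted here as a hook that decides whether another pass over the data is needed.

```python
from abc import ABC, abstractmethod
from typing import Generic, Iterable, List, Sequence, TypeVar

D = TypeVar("D")   # data instance type
R = TypeVar("R")   # application-specific reduction object type


class ReductionApp(ABC, Generic[D, R]):
    """Hypothetical developer-facing interface (illustrative names only)."""

    @abstractmethod
    def new_reduction_object(self) -> R:
        """Create an empty reduction object."""

    @abstractmethod
    def select(self, data: Sequence[D]) -> Iterable[D]:
        """Return the subset of data to be processed."""

    @abstractmethod
    def local_reduce(self, robj: R, d: D) -> None:
        """Fold one data instance into a local reduction object."""

    @abstractmethod
    def global_reduce(self, robjs: List[R]) -> R:
        """Combine the local reduction objects into one result."""

    def keep_iterating(self, robj: R, iteration: int) -> bool:
        """Assumed meaning of 'Iterator': run a single pass by default."""
        return iteration < 1


def run(app: ReductionApp, partitions: List[Sequence]) -> object:
    """Sequential driver standing in for the middleware's runtime."""
    iteration = 0
    while True:
        replicas = []
        for part in partitions:
            robj = app.new_reduction_object()
            for d in app.select(part):
                app.local_reduce(robj, d)
            replicas.append(robj)
        result = app.global_reduce(replicas)
        iteration += 1
        if not app.keep_iterating(result, iteration):
            return result


class PositiveSum(ReductionApp[int, List[int]]):
    """Toy application: sum the positive values across all partitions."""

    def new_reduction_object(self) -> List[int]:
        return [0]

    def select(self, data: Sequence[int]) -> Iterable[int]:
        return (d for d in data if d > 0)

    def local_reduce(self, robj: List[int], d: int) -> None:
        robj[0] += d

    def global_reduce(self, robjs: List[List[int]]) -> List[int]:
        return [sum(r[0] for r in robjs)]
```

For example, `run(PositiveSum(), [[1, -2, 3], [4, -5]])` sums 1, 3, and 4. The developer writes only the four hooks and the reduction object; the driver owns the loop, which is the division of labor the slide describes.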

Previously on FREERIDE
FREERIDE has been used for efficient shared and distributed memory parallelization of:
- Decision tree construction
- Apriori and FP-tree frequent item set mining
- K-nearest neighbor classification
- K-means clustering
- EM clustering
Conclusion: FREERIDE can be used to efficiently and quickly parallelize well-known data mining algorithms.

Vortex Detection
The sequential version was implemented by Machiraju et al.
Classify-aggregate paradigm:
- Detection
- Binary classification
- Aggregation
- De-noising
- Ranking

Mapping vortex detection to FREERIDE
- Detection and classification are performed as part of local processing.
- Aggregation is performed as a combination of the local processing, node processing, and global combination steps.
- De-noising and ranking are performed in the post-processing step.
[Figure: the stages Detect, Classify, Aggregate, De-noise, and Rank mapped onto the steps Local processing, Node processing, Global Combine, and Post-processing.]

Vortex detection challenges
Challenge:
- Classification of boundary points requires communication.
- Multi-step aggregation is complex and requires special data structures for efficiency.
Solution:
- Replication of boundary points for every chunk.
- Saving “face imprints” for every incomplete core region.
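The transcript does not describe the "face imprint" data structure, so the following is only a 1-D analogy of the idea, with all names (`Region`, `local_regions`, `global_combine`) invented for illustration. Each chunk aggregates its flagged cells into regions and records, for every region, whether it touches a chunk face and is therefore possibly incomplete. The global combine step then joins regions that meet at a shared face.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    start: int         # global index of the first cell in the region
    end: int           # global index one past the last cell
    open_left: bool    # imprint: region touches the chunk's left face
    open_right: bool   # imprint: region touches the chunk's right face


def local_regions(flags: List[bool], offset: int) -> List[Region]:
    """Aggregate consecutive flagged cells of one chunk into regions."""
    regions, i, n = [], 0, len(flags)
    while i < n:
        if flags[i]:
            j = i
            while j < n and flags[j]:
                j += 1
            regions.append(Region(offset + i, offset + j, i == 0, j == n))
            i = j
        else:
            i += 1
    return regions


def global_combine(per_chunk: List[List[Region]]) -> List[Region]:
    """Join incomplete regions that meet at a face shared by adjacent chunks."""
    merged: List[Region] = []
    for regions in per_chunk:             # chunks are visited in spatial order
        for r in regions:
            prev = merged[-1] if merged else None
            if prev and prev.open_right and r.open_left and prev.end == r.start:
                merged[-1] = Region(prev.start, r.end, prev.open_left, r.open_right)
            else:
                merged.append(r)
    return merged
```

Only the imprints of incomplete regions need to be compared during the combine step, so complete regions never require cross-chunk work. A real 3-D version would have to match cells on a shared face rather than a single index, which this sketch does not attempt.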

Partitioning and boundary replication.
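Boundary replication can be sketched in one dimension. The example below is an assumption-laden illustration, not the vortex detection code: the classification rule (a point is flagged when it exceeds both neighbours) is a stand-in for the real criterion, and the function names are hypothetical. Each chunk carries a copy of the points just outside its range, so its own points can be classified without communication.

```python
from typing import List, Tuple


def classify(values: List[float]) -> List[bool]:
    """Sequential reference: flag interior points that exceed both neighbours."""
    n = len(values)
    return [0 < i < n - 1 and values[i] > values[i - 1] and values[i] > values[i + 1]
            for i in range(n)]


def partition_with_halo(values: List[float], num_chunks: int,
                        halo: int = 1) -> List[Tuple[int, int, int, List[float]]]:
    """Split into chunks; each chunk also holds replicated boundary points."""
    n = len(values)
    size = max(1, -(-n // num_chunks))    # ceiling division, at least one point
    chunks = []
    for lo in range(0, n, size):
        hi = min(lo + size, n)
        ext_lo, ext_hi = max(lo - halo, 0), min(hi + halo, n)
        chunks.append((lo, hi, ext_lo, values[ext_lo:ext_hi]))
    return chunks


def classify_chunk(lo: int, hi: int, ext_lo: int,
                   padded: List[float], n: int) -> List[bool]:
    """Classify only the owned points [lo, hi), reading neighbours from the halo."""
    out = []
    for g in range(lo, hi):               # g is the global index of an owned point
        k = g - ext_lo                    # position of that point in the padded chunk
        out.append(0 < g < n - 1
                   and padded[k] > padded[k - 1] and padded[k] > padded[k + 1])
    return out


def classify_parallel(values: List[float], num_chunks: int) -> List[bool]:
    result: List[bool] = []
    for lo, hi, ext_lo, padded in partition_with_halo(values, num_chunks):
        result.extend(classify_chunk(lo, hi, ext_lo, padded, len(values)))
    return result
```

The chunks here are processed in a plain loop, but each call to `classify_chunk` reads only its own padded data, so the calls are independent and could run on separate nodes. The cost is the extra storage for replicated points, which grows with the number of chunks.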

Experimental Results
Experimental results for up to 8 and 16 nodes.
Experimental Platform:
- Cluster (1-16 nodes) of 700 MHz Pentium machines
- Connected through Myrinet LANai 7.0
- 1 GB of memory on each node
- Datasets ranging in size from 30 MB to 1.8 GB

Experimental Results (Cont’d.)
- Scalability confirmed by tests (up to 16 nodes).
- Partitioning overhead: more chunks mean a smaller speedup compared to the sequential application.
- Parallelization overhead is high for smaller datasets, but shrinks for larger ones.

Absolute Speedups (710 MB)
- If the data is partitioned into a larger number of chunks, the overhead grows.
- Speedups are sub-linear when measured against the 1-chunk, 1-node configuration, but that configuration does not support parallel execution.
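The difference between the two baselines can be written down explicitly. The notation below is an assumption introduced for this note and does not appear in the slides: T(p, c) denotes the execution time on p nodes with the data partitioned into c chunks, S_abs is the speedup against the 1-chunk, 1-node run, and S_rel is the speedup against a 1-node run that uses the same number of chunks.

```latex
S_{\mathrm{abs}}(p, c) = \frac{T(1, 1)}{T(p, c)}, \qquad
S_{\mathrm{rel}}(p, c) = \frac{T(1, c)}{T(p, c)}, \qquad
S_{\mathrm{abs}}(p, c) = S_{\mathrm{rel}}(p, c) \cdot \frac{T(1, 1)}{T(1, c)}
```

If partitioning adds overhead, then T(1, c) >= T(1, 1), so the last factor is at most 1 and the absolute speedup is bounded by the relative speedup. Under this reading, an absolute speedup can be sub-linear even when the speedup at a fixed chunk count is close to linear.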

Conclusions & Future Work
Currently: working on parallelizing a feature mining application that detects molecular defects in crystalline grids created by physics and materials science simulations.
Conclusions:
- FREERIDE can be used to implement a variety of data mining and scientific data mining algorithms, creating scalable parallel implementations.
- Such parallelization can be performed more easily than “hand-coding” a parallel application.
- An overhead is incurred as the partitioning becomes finer (more chunks), but the parallelization overhead is usually quite small.
- If the number of chunks remains constant, speedups are linear, which indicates that communication and I/O overheads are very small.
- Parallel applications created using FREERIDE can work efficiently with disk-resident datasets.

Questions?