
Computer Science and Engineering Parallelizing Feature Mining Using FREERIDE Leonid Glimcher P. 1 ipdps’04 Scaling and Parallelizing a Scientific Feature.


1 Scaling and Parallelizing a Scientific Feature Mining Application Using a Cluster Middleware
Leonid Glimcher, Xuan Zhang, Gagan Agrawal

2 Presentation Road Map
Motivation.
Description of middleware and functionality.
Description of sequential vortex detection algorithm.
Parallelization challenges and solution.
Experimental results.
Conclusions and future work.

3 Motivation for Middleware
Problem:
– Data is growing in size exponentially.
– Extracting knowledge out of data is increasingly difficult.
Solution:
– Parallelizing data mining algorithms to make extracting knowledge out of data more efficient.
But developing parallel data mining applications is no routine task (it is tedious and time-consuming).

4 FREERIDE
Framework for Rapid Implementation of Datamining Engines (created by Jin-Agrawal et al.)
Distributed and shared memory parallelization functionality.
Support for efficient processing of disk-resident datasets.
Based on a key observation…
    While( ) {
        forall( data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
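The loop on this slide can be sketched in Python as follows. This is a minimal illustration of the pattern only, not FREERIDE code: the names `generalized_reduction`, `process`, and `op` are hypothetical, the reduction object is assumed to be a plain dictionary, and the outer `While( )` loop is left out because the slide does not give its condition. Processing the instances in any order is safe only if `op` is associative and commutative, which is assumed here.

```python
def generalized_reduction(data, process, op):
    """One pass of the slide's inner loop: R[I] = R[I] op d for every instance d."""
    reduction = {}                       # the reduction object R (assumed: a dict)
    for d in data:                       # forall(data instances d)
        i = process(d)                   # I = process(d)
        if i in reduction:
            reduction[i] = op(reduction[i], d)   # R(I) = R(I) op d
        else:
            reduction[i] = d             # first contribution to this element
    return reduction


# Toy use: sum the values that fall into each parity bucket.
sums = generalized_reduction(
    [1, 2, 3, 4, 5],
    process=lambda d: d % 2,
    op=lambda a, b: a + b,
)
```

Because every update touches only the reduction object, the same loop body can later be run on separate portions of the data and the partial results combined.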

5 Parallelization in FREERIDE
Distributed Memory Setting:
– Data is divided between processors.
– Reduction object is replicated.
– Each node performs local reduction on its data.
– Master node performs global reduction.
Shared Memory Setting:
– Different data items are assigned to different threads.
– Synchronization techniques are used to avoid race conditions in accessing the reduction object.
– Synchronization involves replication and/or locking.
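The two settings can be illustrated with a toy histogram reduction. This is a sketch under stated assumptions rather than the middleware's implementation: the "nodes" are simulated sequentially inside one process, the function names are hypothetical, and a single lock guards the whole shared object, whereas the slide only says that replication and/or locking are used.

```python
import threading


def local_reduction(chunk, num_bins):
    """A node fills its own replica of the reduction object (a histogram here)."""
    replica = [0] * num_bins
    for d in chunk:
        replica[d % num_bins] += 1
    return replica


def global_reduction(replicas):
    """The master combines all replicas element by element."""
    return [sum(column) for column in zip(*replicas)]


def distributed_style(data, num_nodes, num_bins):
    # Divide the data between the (simulated) nodes, reduce locally, then combine.
    chunks = [data[n::num_nodes] for n in range(num_nodes)]
    return global_reduction([local_reduction(c, num_bins) for c in chunks])


def shared_memory_locking(data, num_threads, num_bins):
    """Threads update one shared reduction object; a lock prevents races."""
    shared = [0] * num_bins
    lock = threading.Lock()

    def worker(items):
        for d in items:
            with lock:
                shared[d % num_bins] += 1

    threads = [
        threading.Thread(target=worker, args=(data[t::num_threads],))
        for t in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


data = list(range(100))
hist_distributed = distributed_style(data, num_nodes=4, num_bins=5)
hist_shared = shared_memory_locking(data, num_threads=4, num_bins=5)
```

Both variants produce the same histogram. Replication trades memory for lock-free updates, while locking keeps one copy at the cost of contention.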

6 Application Specific Functionality
To be specified by the developer using this interface:
– Subset of data to be processed
– Local reductions
– Global reductions
– Iterator
In addition, an application-specific reduction object needs to be defined.
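One way to picture these developer-supplied pieces is as an abstract class with one method per item on the slide. Everything below is hypothetical: the class `ReductionApp`, its method names, and the driver `run` are not FREERIDE's real API. The slide also does not say what the iterator does, so it is assumed here to decide whether another pass over the data is needed.

```python
from abc import ABC, abstractmethod


class ReductionApp(ABC):
    """Hypothetical stand-in for the developer-facing interface."""

    @abstractmethod
    def new_reduction_object(self):
        """Create the application-specific reduction object."""

    @abstractmethod
    def select_subset(self, data, node, num_nodes):
        """Return the subset of data this node should process."""

    @abstractmethod
    def local_reduce(self, robj, item):
        """Fold one data item into a node's reduction object."""

    @abstractmethod
    def global_reduce(self, robjs):
        """Combine the reduction objects of all nodes."""

    @abstractmethod
    def should_continue(self, robj, iteration):
        """Assumed role of the iterator: request another pass or stop."""


def run(app, data, num_nodes):
    """Toy driver: simulates the nodes one after another in a single process."""
    iteration = 0
    while True:
        robjs = []
        for node in range(num_nodes):
            robj = app.new_reduction_object()
            for item in app.select_subset(data, node, num_nodes):
                app.local_reduce(robj, item)
            robjs.append(robj)
        result = app.global_reduce(robjs)
        iteration += 1
        if not app.should_continue(result, iteration):
            return result


class MeanApp(ReductionApp):
    """Example application: accumulate a sum and a count to compute a mean."""

    def new_reduction_object(self):
        return {"sum": 0.0, "count": 0}

    def select_subset(self, data, node, num_nodes):
        return data[node::num_nodes]

    def local_reduce(self, robj, item):
        robj["sum"] += item
        robj["count"] += 1

    def global_reduce(self, robjs):
        return {
            "sum": sum(r["sum"] for r in robjs),
            "count": sum(r["count"] for r in robjs),
        }

    def should_continue(self, robj, iteration):
        return False  # a mean needs only one pass


result = run(MeanApp(), [2.0, 4.0, 6.0, 8.0], num_nodes=2)
mean = result["sum"] / result["count"]
```

The application code contains no threading or messaging. In this sketch all of that would live in the driver, which is the division of labor the slide describes.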

7 Previously on FREERIDE
FREERIDE has been used for efficient shared and distributed memory parallelization of:
– Decision tree construction,
– Apriori and FP-tree frequent item set mining,
– K-nearest neighbor classification,
– K-means clustering,
– EM clustering
Conclusion: FREERIDE can be used to efficiently and quickly parallelize well-known data mining algorithms.

8 Vortex Detection
Sequential version was implemented by Machiraju et al.
Classify-aggregate paradigm:
– Detection,
– Binary Classification,
– Aggregation,
– De-noising,
– Ranking

9 Mapping vortex detection to FREERIDE
Detection and classification are performed as part of local processing.
Aggregation is performed as a combination of local processing, node processing, and global combination steps.
De-noising and ranking are performed in the post-processing step.
[Diagram: the stages Detect, Classify, Aggregate, De-noise, Rank mapped onto Local processing, Node processing, Global Combine, Post-processing.]
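The mapping can be illustrated on a toy one-dimensional field. This sketch does not reproduce the vortex detection algorithm: detection and classification are replaced by a simple threshold test, a "region" is assumed to be a run of consecutive flagged points, the node processing step is folded into local processing, and all function names are hypothetical.

```python
def local_processing(chunk, offset, threshold):
    """Detect and classify each point, then aggregate runs of flagged points.

    Returns regions as (start, end) pairs in global indices, end inclusive.
    """
    regions = []
    start = None
    for i, value in enumerate(chunk):
        flagged = abs(value) > threshold      # toy detection + binary classification
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            regions.append((offset + start, offset + i - 1))
            start = None
    if start is not None:                     # region runs to the end of the chunk
        regions.append((offset + start, offset + len(chunk) - 1))
    return regions


def global_combine(per_chunk_regions):
    """Join regions that touch each other across a chunk boundary."""
    merged = []
    for region in sorted(r for regions in per_chunk_regions for r in regions):
        if merged and region[0] == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], region[1])
        else:
            merged.append(region)
    return merged


def post_processing(regions, min_size):
    """De-noise (drop small regions), then rank (largest first)."""
    def size(r):
        return r[1] - r[0] + 1

    kept = [r for r in regions if size(r) >= min_size]
    return sorted(kept, key=size, reverse=True)


field = [0, 5, 6, 0, 0, 7, 8, 9, 9, 0, 4, 0]
chunk_size = 6
chunks = [(s, field[s:s + chunk_size]) for s in range(0, len(field), chunk_size)]
local = [local_processing(chunk, offset, threshold=3) for offset, chunk in chunks]
ranked = post_processing(global_combine(local), min_size=2)
```

In this example the region spanning indices 5 to 8 is split by the chunk boundary, found as two partial regions locally, and rejoined in the global combine step before de-noising and ranking.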

10 Vortex detection challenges
Challenge:
– Classification of boundary points requires communication.
– Multi-step aggregation is complex and requires special data structures for efficiency.
Solution:
– Replication of boundary points for every chunk.
– Saving “face imprints” for every incomplete core region.
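Boundary replication can be sketched on a one-dimensional array, assuming a classification test that looks at a point's immediate neighbors. The test used here (a central difference compared against a threshold) is a placeholder, not the classifier from the vortex detection code, and the function names and the `halo` parameter are hypothetical. The "face imprints" structure is not covered, since the slide does not describe it.

```python
def classify(values, i, threshold):
    """Toy neighbor-based test: is the central difference larger than threshold?"""
    left = values[i - 1] if i > 0 else values[i]
    right = values[i + 1] if i < len(values) - 1 else values[i]
    return abs(right - left) > threshold


def partition_with_halo(field, chunk_size, halo=1):
    """Split the field into chunks, replicating `halo` boundary points per side."""
    chunks = []
    for start in range(0, len(field), chunk_size):
        end = min(start + chunk_size, len(field))
        lo = max(0, start - halo)
        hi = min(len(field), end + halo)
        chunks.append({
            "values": field[lo:hi],       # owned points plus replicated neighbors
            "first_owned": start - lo,    # index of the first owned point
            "num_owned": end - start,
        })
    return chunks


def classify_chunk(chunk, threshold):
    """Classify only the owned points; replicated points serve as neighbors."""
    first = chunk["first_owned"]
    return [
        classify(chunk["values"], first + k, threshold)
        for k in range(chunk["num_owned"])
    ]


field = [0, 1, 5, 9, 9, 8, 2, 1, 1, 6]
whole = [classify(field, i, 3) for i in range(len(field))]
chunked = [
    flag
    for chunk in partition_with_halo(field, chunk_size=4)
    for flag in classify_chunk(chunk, 3)
]
```

Each chunk carries a copy of the points just outside its boundary, so the chunk-by-chunk result matches the whole-array result without any exchange of data between chunks during classification.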

11 Partitioning and boundary replication.

12 Experimental Results
Experimental results for up to 8 and 16 nodes.
Experimental Platform:
– Cluster of 700 MHz Pentium machines (1 to 16 nodes)
– Connected through Myrinet LANai 7.0
– 1 GB of memory on each node
– Datasets ranging in size from 30 MB to 1.8 GB

13 Experimental Results (Cont’d.)
Scalability confirmed by tests (up to 16 nodes).
Partitioning overhead: more chunks means less speedup compared to the sequential application.
Parallelization overhead is high for smaller datasets, but shrinks for larger data.

14 Absolute Speedups (710 MB)
If the data is partitioned into a larger number of chunks, the overhead will grow.
Speedups are sub-linear when based on the 1 chunk, 1 node configuration, but such a configuration doesn’t support parallel execution.
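The effect of the baseline choice can be shown with a short calculation. The timings below are made-up placeholders chosen only to illustrate the arithmetic. They are not measurements from the talk, and the variable names are hypothetical.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Speedup is the baseline time divided by the parallel time."""
    return baseline_seconds / parallel_seconds


# Made-up timings in seconds (illustration only, not results from the talk).
one_chunk_one_node = 100.0        # sequential run, data kept as a single chunk
eight_chunks_one_node = 120.0     # same node, data split into 8 chunks
eight_chunks_eight_nodes = 15.0   # 8 chunks spread over 8 nodes

# Baseline: 1 chunk on 1 node. Partitioning overhead counts against the speedup.
absolute = speedup(one_chunk_one_node, eight_chunks_eight_nodes)

# Baseline: same number of chunks on 1 node. Only parallel overhead remains.
relative = speedup(eight_chunks_one_node, eight_chunks_eight_nodes)
```

With these placeholder numbers the speedup against the single-chunk baseline is below 8 on 8 nodes, while the speedup against the 8-chunk baseline is exactly 8. That is the kind of gap between the two baselines that the slide describes.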

15 Conclusions & Future Work
Currently: working on parallelizing a feature mining application that detects molecular defects in crystalline grids created by physics and materials science simulations.
Conclusions:
– FREERIDE can be used to implement a variety of data and scientific mining algorithms, creating scalable parallel implementations.
– Such parallelization can be performed more easily than “hand-coding” a parallel application.
– An overhead is incurred with increasing granularity, but parallelization overhead is usually quite small.
– If the number of chunks remains constant, speedups are linear, showing that communication and I/O overheads are very small.
– Parallel applications created using FREERIDE allow working efficiently with disk-resident datasets.

16 Questions?

