System Support for High Performance Data Mining
Ruoming Jin, Leo Glimcher, Xuan Zhang, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University

Motivation: Data Mining Problem
- Datasets available for mining are often large
- Our understanding of which algorithms and parameters will give the desired insights is limited
- The time required to implement different algorithms and run them with different parameters on large datasets slows down the data mining process

Project Overview
- FREERIDE (Framework for Rapid Implementation of Datamining Engines) as the base system
- Already demonstrated for a variety of standard mining algorithms
- Currently working on end applications such as feature analysis and mining of simulation data

FREERIDE offers:
- The ability to rapidly prototype a high-performance mining implementation
- Distributed memory parallelization
- Shared memory parallelization
- Ability to process large and disk-resident datasets
- Only modest modifications to a sequential implementation for the above three

Key Observation from Mining Algorithms
- Popular algorithms have a common canonical loop
- Can be used as the basis for supporting a common middleware

    While( ) {
        forall( data instances d ) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
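The canonical loop can be made concrete with a small sketch. The code below is not FREERIDE's interface; `generalized_reduction` and `kmeans_1d` are hypothetical names chosen for illustration, and one-dimensional k-means is used only because it is the simplest algorithm that fits the pattern. Here `process(d)` picks the nearest centroid, the reduction object `R` holds a (sum, count) pair per centroid, and `op` adds the point into that pair.

```python
def generalized_reduction(data, process, op, reduction_object):
    """Inner loop of the canonical form: R(I) = R(I) op d for every instance d."""
    for d in data:
        i = process(d)                                  # I = process(d)
        reduction_object[i] = op(reduction_object[i], d)
    return reduction_object


def kmeans_1d(points, centroids, max_iters=100):
    """1-D k-means written against the canonical loop (illustrative only)."""
    centroids = list(centroids)
    for _ in range(max_iters):                          # outer While loop
        def nearest(d):
            # process(d): index of the closest current centroid
            return min(range(len(centroids)), key=lambda i: abs(d - centroids[i]))

        def accumulate(acc, d):
            # op: add the point into the (sum, count) pair
            return (acc[0] + d, acc[1] + 1)

        r = [(0.0, 0) for _ in centroids]               # fresh reduction object
        r = generalized_reduction(points, nearest, accumulate, r)

        # Post-processing after the pass: recompute centroids from R.
        updated = [s / n if n else c for (s, n), c in zip(r, centroids)]
        if updated == centroids:                        # assignments are stable
            break
        centroids = updated
    return centroids
```

Because `op` here is associative and commutative, the order in which instances are visited does not affect the counts, which is the property a middleware can exploit when it splits the `forall` across threads or nodes.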

Performance of Shared Memory Parallelization K-means clustering

Performance on Cluster of SMPs
- Apriori Association Mining
- Experiments on a cluster of SUN SMPs, purchased under an NSF RI grant to the University of Delaware in 1997

SPIES On (a) FREERIDE
- Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
- Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume
- Does not require sorting of data, or partitioning and writing-back of records
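One way to see why intervals remove the need for sorting is the sketch below. It is not the SPIES algorithm: the statistical pruning step is omitted entirely, and the equal-width intervals, the Gini criterion, and every function name are assumptions made for illustration. It only shows that per-interval class counts can be built in a single unsorted pass, merged by addition, and then used to score candidate split points at interval boundaries.

```python
def interval_class_histogram(records, lo, hi, num_intervals, num_classes):
    """Count class labels per equal-width interval in one pass (no sorting).

    Assumes every value lies in [lo, hi] and labels are 0..num_classes-1.
    """
    width = (hi - lo) / num_intervals
    hist = [[0] * num_classes for _ in range(num_intervals)]
    for value, label in records:
        idx = min(int((value - lo) / width), num_intervals - 1)
        hist[idx][label] += 1
    return hist


def merge_histograms(a, b):
    """Histograms from different data partitions combine by element-wise addition."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def best_boundary_split(hist, lo, hi):
    """Return (weighted Gini, threshold) for the best split at an interval boundary.

    Assumes the histogram is non-empty and has at least two intervals.
    """
    num_intervals, num_classes = len(hist), len(hist[0])
    width = (hi - lo) / num_intervals
    total = [sum(row[c] for row in hist) for c in range(num_classes)]
    n = sum(total)
    left = [0] * num_classes
    best = None
    for i in range(num_intervals - 1):
        left = [l + h for l, h in zip(left, hist[i])]   # records below boundary
        right = [t - l for t, l in zip(total, left)]    # records at or above it
        score = (sum(left) * gini(left) + sum(right) * gini(right)) / n
        threshold = lo + (i + 1) * width
        if best is None or score < best[0]:
            best = (score, threshold)
    return best
```

The histogram size depends on the number of intervals and classes rather than on the number of distinct attribute values, which is the kind of saving in memory and communication volume the slide refers to.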

Results from EM Clustering Algorithm
- EM is a popular data mining algorithm
- Can we parallelize it using the same support that worked for another clustering algorithm (k-means) and for algorithms for other mining tasks?
- Research supported by an NSF REU supplement (Leo Glimcher, computer science senior)
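A rough sketch of why EM fits the same reduction pattern as k-means follows. It assumes a one-dimensional Gaussian mixture, and the function names are hypothetical rather than taken from FREERIDE. The E-step accumulates three sufficient statistics per component into a reduction object; unlike k-means, each instance updates every element of that object, weighted by its responsibility. Statistics from separate data chunks merge by addition before the M-step.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_pass(chunk, weights, means, variances):
    """Local E-step over one chunk; returns [sum r, sum r*x, sum r*x^2] per component.

    Assumes no point has zero likelihood under every component.
    """
    k = len(weights)
    stats = [[0.0, 0.0, 0.0] for _ in range(k)]        # the reduction object
    for x in chunk:
        likelihoods = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                       for j in range(k)]
        total = sum(likelihoods)
        for j in range(k):
            r = likelihoods[j] / total                  # responsibility of j for x
            stats[j][0] += r
            stats[j][1] += r * x
            stats[j][2] += r * x * x
    return stats


def merge(a, b):
    """Combine reduction objects from two chunks (threads or nodes)."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def m_step(stats, n, min_var=1e-6):
    """Recompute mixture parameters from the merged statistics of n points."""
    weights = [s[0] / n for s in stats]
    means = [s[1] / s[0] for s in stats]
    variances = [max(s[2] / s[0] - m * m, min_var) for s, m in zip(stats, means)]
    return weights, means, variances
```

Running `em_pass` on each chunk, folding the results with `merge`, and calling `m_step` once per iteration gives the same parameters as a sequential pass, up to floating-point rounding.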

Broader Research Agenda

Applying FREERIDE for Scientific Data Mining
- Joint work with Machiraju and Parthasarathy
- Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
- A feature is a region of interest in a dataset
- A suite of algorithms for extracting and tracking features

Future Work
- Support for multiple and/or heterogeneous clusters
- Combine with DataCutter's filter-streaming programming model
- Still allow high-level interface for mining algorithms
- Use FREERIDE support for reductions on SMPs
- More scalable shared memory parallelization
- Impact of modern memory hierarchies

Publications and Funding

Refereed publications
- Ruoming Jin and Gagan Agrawal, "A Middleware for Scalable Data Mining", proceedings of the first SIAM Conference on Data Mining, April 2001
- Ruoming Jin and Gagan Agrawal, "Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance", proceedings of the second SIAM Conference on Data Mining, April 2002
- Ruoming Jin and Gagan Agrawal, "Performance Prediction of Random Write Reductions", proceedings of the ACM SIGMETRICS conference, June 2002
- Ruoming Jin and Gagan Agrawal, "Memory and Communication Efficient Parallel Decision Tree Construction", proceedings of the third SIAM Conference on Data Mining, May 2003

Funding
- Research funding provided by NSF through CAREER award ACI-9733520, ACI-9982087, ACI-0130437, ACI-0203846, and ACI-0234273
- Equipment provided under NSF grants EIA-9703088 (RI grant at Delaware) and EIA-9986052 (Instrumentation grant at OSU)
- Participation of Leo Glimcher facilitated by an NSF REU supplement
