Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater,


1 Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, U.S.A.

2 Who am I?
- Min Cai: cmin@cs.okstate.edu
- Ph.D. student, Computer Science Department, Oklahoma State University
- Research interests: parallel and distributed computing; data mining

3 Introduction
- Data warehouses of increasing size
- Data mining: a technique for discovering interesting properties in data
- Structural data mining: data represented as a graph
- Aim: substructure discovery, i.e. finding “interesting” and recurring subgraphs in a labeled graph

4 SUBDUE (1)
- Discovers substructures using the minimum description length (MDL) principle
  Cook, D.J., Holder, L.B., Galal, G., Maglothin, R.: Approaches to Parallel Graph-Based Knowledge Discovery. Journal of Parallel and Distributed Computing, 61(3) (2001) 427-446
- Data objects → graph vertices
- Relationships → graph edges
- Substructure → connected subgraph
- NOTE: graph algorithms are notorious for long execution times
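As a hedged reminder of what the MDL principle means in this setting: the slides give no formula, but the SUBDUE literature commonly scores a candidate substructure S by how much it compresses the input graph G. The notation below (DL, G | S) is an assumption introduced for illustration.

```latex
\[
  \mathrm{value}(S, G) \;=\; \frac{DL(G)}{DL(S) + DL(G \mid S)}
\]
```

Here DL(G) is the number of bits needed to describe G, DL(S) the bits needed to describe the substructure, and DL(G | S) the bits needed to describe G after every instance of S has been replaced by a single vertex. A larger value means better compression, so the "best" substructure is the one that maximizes it.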

5 SUBDUE (2)
- Algorithm: two basic steps
  1. Substructure discovery: apply the minimum description length (MDL) principle to find the “best” / “most important” structure in the graph; possibly stop here, this is the answer
  2. Substructure replacement: replace the substructure found in the first step with a single vertex and repeat the process
- Results: a single substructure, or a hierarchy of substructures
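The replacement step can be sketched as follows. This is a minimal illustration, not SUBDUE's actual code: the graph representation (a dict of vertex labels plus a list of labeled edges), the function name `compress`, and the assumption that instances are disjoint sets of vertex ids are all hypothetical.

```python
def compress(labels, edges, instances, sub_label):
    """Replace each instance (a set of vertex ids) with one new vertex.

    labels:    dict vertex_id -> label (integer ids assumed)
    edges:     list of (u, v, edge_label)
    instances: list of disjoint sets of vertex ids (assumption)
    """
    labels = dict(labels)          # work on a copy
    owner = {}                     # old vertex id -> replacing vertex id
    next_id = max(labels) + 1
    for inst in instances:
        new_v = next_id
        next_id += 1
        for v in inst:
            owner[v] = new_v
            del labels[v]
        labels[new_v] = sub_label

    new_edges = []
    for u, v, lab in edges:
        cu, cv = owner.get(u, u), owner.get(v, v)
        if cu == cv and u in owner:
            continue               # edge was internal to an instance
        new_edges.append((cu, cv, lab))
    return labels, new_edges


# Toy graph: two C-O pairs, both attached to one N vertex.
labels = {0: "C", 1: "O", 2: "C", 3: "O", 4: "N"}
edges = [(0, 1, "bond"), (2, 3, "bond"), (1, 4, "bond"), (3, 4, "bond")]
new_labels, new_edges = compress(labels, edges, [{0, 1}, {2, 3}], "SUB_1")
print(new_labels)  # {4: 'N', 5: 'SUB_1', 6: 'SUB_1'}
print(new_edges)   # [(5, 4, 'bond'), (6, 4, 'bond')]
```

Running discovery again on the compressed graph is what produces the hierarchy of substructures mentioned on the slide.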

6 Parallel SUBDUE
- Data-parallel approach
- Graph divided into subgraphs, each sent to a separate processor
- Processors find their best structure and communicate it to the rest
- The best overall substructure is found
- Hierarchical process can be repeated

7 MPI-SUBDUE
- Graph divided into subgraphs using METIS
- Point-to-point communication (MPI_Send and MPI_Recv) used between processors
- NOTE: the best structure in data set “7” may be dreadful when confronted with data set “18”
  Galal, G.M., Cook, D.J., Holder, L.B.: Improving Scalability in a Knowledge Discovery System by Exploiting Parallelism. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (1997) 171-174

8 NEW-MPI-SUBDUE
- Improvements:
  - use PARMETIS to divide the initial graph
  - use global communication (MPI_Allgatherv)
  - use binary summation
- YES, these changes do not look like much
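The slides do not define "binary summation". Assuming it refers to a pairwise tree reduction, where P partial results are combined in about log2(P) rounds instead of P - 1 sequential additions, the idea can be simulated without MPI as below. The function name `binary_summation` and the list-of-values model of processors are hypothetical.

```python
def binary_summation(values):
    """Sum one value per simulated processor by pairwise tree reduction.

    Returns (total held by rank 0, number of communication rounds).
    Assumes at least one value.
    """
    vals = list(values)
    p = len(vals)
    step, rounds = 1, 0
    while step < p:
        # In each round, rank r receives from rank r + step and adds.
        for rank in range(0, p, 2 * step):
            partner = rank + step
            if partner < p:
                vals[rank] += vals[partner]
        step *= 2
        rounds += 1
    return vals[0], rounds


total, rounds = binary_summation(range(1, 33))  # 32 simulated processors
print(total, rounds)  # 528 5
```

With 32 processors the sum completes in 5 rounds, and the additions within a round are independent, so they could run concurrently on a real cluster.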

9 NEW-MPI-SUBDUE
Spawn P(0), P(1), P(2), ..., P(n)
Apply PARMETIS to partition G into n partitions
for all P(i) where 1 ≤ i ≤ n do
    discover the best substructure in partition
    broadcast best substructure to all other processors
    evaluate best substructure and broadcast results
parallel-binary summation of results to find the best overall partition
P(0) finds the best overall structure
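The flow on this slide can be mimicked in a single process. The sketch below is a toy under loud assumptions: each partition is just a list of pattern names, "best substructure" means the most frequent name, and evaluation means counting occurrences. Real SUBDUE evaluates subgraphs with MDL, and a real run would use MPI collectives rather than Python lists. The function name `new_mpi_subdue_round` is hypothetical.

```python
from collections import Counter


def new_mpi_subdue_round(partitions):
    """Simulate one discovery round over already-partitioned data."""
    # Step 1: each simulated processor finds its local best pattern.
    local_best = [Counter(part).most_common(1)[0][0] for part in partitions]

    # Step 2: all-gather, so every processor sees every candidate.
    candidates = sorted(set(local_best))

    # Step 3: each processor scores all candidates on its own partition.
    local_scores = [[part.count(c) for c in candidates] for part in partitions]

    # Step 4: element-wise sum of the score vectors across processors
    # (a plain sum here; on a cluster this would be a parallel reduction).
    totals = [sum(column) for column in zip(*local_scores)]

    # Step 5: rank 0 picks the candidate with the best global score.
    best = max(range(len(candidates)), key=totals.__getitem__)
    return candidates[best], dict(zip(candidates, totals))


partitions = [
    ["A", "A", "B"],
    ["B", "B", "B", "A"],
    ["C", "C", "A", "A", "A"],
]
print(new_mpi_subdue_round(partitions))  # ('A', {'A': 6, 'B': 4})
```

In this toy input, "B" wins locally on the second partition but loses globally, which is why every candidate is re-evaluated on every partition before the final choice.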

10 EXPERIMENTAL SETUP
- Mutagenesis data from OxUni: datasets collected in order to predict mutagenicity of aromatic and heteroaromatic nitro compounds
  - Graph 1: 2844 vertices and 2883 edges
  - Graph 2: 2896 vertices and 2934 edges
  - Graph 3: 22268 vertices and 22823 edges
- 16-node cluster (32 processors): two AMD Athlon MP 1800+ (1.6GHz) CPUs, 2 GB of DDR SDRAM, full-backplane Gigabit Ethernet switch
- RedHat Linux 9.0, MPICH, Portland Group C compiler 5.0-2

11 EXPERIMENTAL RESULTS I

12 EXPERIMENTAL RESULTS II

13 EXPERIMENTAL RESULTS III

14 COMMENTS
- Graphs 1 and 2, which were large in 2000, are small and “useless” today
- Graph 3 gives a realistic performance picture: gains about 33%
- Speedup over original SUBDUE: 268 on 32 processors for Graph 3
  - this IS “cheating”, as some information may be lost due to graph partitioning, but …
- Graph partitioning and balancing matter
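To put the speedup figure in context, the standard definitions of speedup and parallel efficiency can be applied to the reported number. The symbols T_1 (run time of original SUBDUE) and T_P (run time on P processors) are assumptions; the slides report only the ratio.

```latex
\[
  S(P) = \frac{T_1}{T_P}, \qquad
  E(P) = \frac{S(P)}{P}, \qquad
  E(32) = \frac{268}{32} \approx 8.4
\]
```

An efficiency above 1 means the speedup is superlinear. That is consistent with the slide's "cheating" remark: each processor searches a much smaller graph, and substructures that span partition boundaries may be lost, so the parallel version does less total work than the serial one.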

15 THE END THANK YOU!

