Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater,

Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, U.S.A.

Who am I?  Min Cai: cmin@cs.okstate.educmin@cs.okstate.edu Ph.D. Student of Computer Science Department at Oklahoma State University Research Interests:  Parallel and distributed computing  Data Mining

Introduction  Data warehouses of increasing size  Data mining  technique for discovering interesting properties in data structural data mining  data represented as a graph  aim  substructure discovery  finding “interesting” and recurring subgraphs in a labeled graph

SUBDUE (1)  Discovers substructures utilizing minimum description length (MDL) principle Cook, D.J., Holder, L.B., G alal, G., Maglothin, R.: Approaches to Parallel Graph-Based Knowledge Discovery. Journal of Parallel and Distributed Computing, 61(3) (2001) 427-446  Data objects  graph vertices  Relationships  graph edges  Substructure  connected subgraph  NOTE  graph algorithms are notorious for long execution times

SUBDUE (2)  Algorithm  two basic steps substructure discovery  apply minimal description length (MDL) principle to find the “best” / “most important” structure in the graph  possibly stop here  this is the answer substructure replacement  replace the substructure found in the first step by a single vertex and repeat the process results  single substructure  hierarchy of substructures

Parallel SUBDUE  Data-parallel approach  Graph divided into subgraphs and send to separate processor  Processors find their best structure and communicate with the rest  The best overall substructure is found  Hierarchical process can be repeated

MPI-SUBDUE  Graph divided into subgraphs using METIS  point-to-point communications (MPI_Send and MPI_Recv) used to communicate between processors NOTE  best structure in data set “7” may be dreadful when confronted with data set “18”  Galal, G.M., Cook, D.J., Holder, L.B.: Improving Scalability in a Knowledge Discovery System by Exploiting Parallelism In the Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (1997) 171-174

NEW-MPI-SUBDUE  Improvements use PARMETIS to divide the initial graph use global communication (MPI_Allgatherv) use binary summation  YES, these changes do not look like much

NEW-MPI-SUBDUE Spawn P(0), P(1), P(2),..., P(n) Apply PARMETIS to partition G into n partitions for all P(i) where 1 ≤ i ≤ n do discover the best substructure in partition broadcast best substructure to all other processors evaluate best substructure and broadcast results parallel-binary summation of results to find the best overall partition P(0) finds the best overall structure

EXPERIMENTAL SETUP  Mutagenesis data from OxUni datasets collected in order to predict mutagenicity of aromatic and heteroaromatic nitro compounds  Graph 1  2844 vertices and 2883 edges  Graph 2  2896 vertices and 2934 edges  Graph 3  22268 vertices and 22823 edges  16 node cluster (32 processors)  two AMD Athlon MP 1800+ (1.6GHz) CPUs, 2 GB of DDR SDRAM, full-backplane Gigabit Ethernet switch  RedHat Linux 9.0, MPICH, Portland Group C compiler 5.0-2

EXPERIMENTAL RESULTS I

EXPERIMENTAL RESULTS II

EXPERIMENTAL REULTS III

COMMENTS  Graphs 1 and 2 that were large in 2000 are small and “ useless ” today  Graph 3 gives realistic performance picture  gains about 33%  Speedup over original SUBDUE  268 on 32 processors for Graph 3 this IS “ cheating ” as some information may be lost due to graph partitioning but …  Graph partitioning and balancing matter

THE END THANK YOU!

Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater,

Similar presentations

Presentation on theme: "Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater,

Similar presentations

Presentation on theme: "Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater,"— Presentation transcript:

Similar presentations

About project

Feedback