Large-Scale Graph Mining Using Backbone Refinement Classes (05/2009). Andreas Maunz (1), Christoph Helma (1,2), and Stefan Kramer (3). (1) FDM, Universität Freiburg (D); (2) in-silico toxicology, Basel (CH); (3) Technische Universität München (D)
BACKBONE REFINEMENT CLASS MINING: Efficient, diverse substructure mining from a large class-labelled graph database
BBRC Rationale. Trees are the most frequent substructure type, yet they can be enumerated efficiently. However, excessively large result sets are obtained even under high correlation and minimum frequency constraints. [Figure: typical substructure frequencies for databases of small molecules]
BBRC Definitions, following GASTON (GrAph, Sequence and Tree ExtractiON) by Nijssen and Kok [1]. Backbone of a tree: the longest path with the lowest sequence (assuming a canonical sequence ordering). Since every tree has exactly one backbone, backbones partition the partial order of trees into disjoint parts. Backbone Refinement Class (BBRC): all tree refinements growing from a specific backbone. Within each partition, structures are refined by pre-order (depth-first) traversal. [1] Nijssen S. & Kok J.N.: A Quickstart in Frequent Structure Mining can make a Difference. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA: ACM, 2004, pp. 647–652.
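The backbone definition can be illustrated with a small sketch. It is not taken from the slides or from GASTON: it assumes that the canonical sequence ordering is plain lexicographic order on node labels, ignores edge labels, and enumerates all paths by brute force, which is only reasonable for tiny trees. The names `tree_paths`, `backbone`, `example_adj`, and `example_label` are hypothetical.

```python
from itertools import combinations

def tree_paths(adj):
    """Yield the unique simple path between every pair of distinct nodes of a tree."""
    def path(u, v):
        # Iterative depth-first search from u that records each node's parent.
        parent = {u: None}
        stack = [u]
        while stack:
            x = stack.pop()
            if x == v:
                break
            for y in adj[x]:
                if y not in parent:
                    parent[y] = x
                    stack.append(y)
        # Walk back from v to u along the parent pointers.
        nodes = []
        while v is not None:
            nodes.append(v)
            v = parent[v]
        return nodes[::-1]

    for u, v in combinations(sorted(adj), 2):
        yield path(u, v)

def backbone(adj, label):
    """Label sequence of the longest path; ties go to the lowest sequence.

    Assumption: "lowest" means lexicographically smallest list of node labels.
    """
    if len(adj) == 1:
        return [label[next(iter(adj))]]
    best = None
    for p in tree_paths(adj):
        seq = [label[node] for node in p]
        seq = min(seq, seq[::-1])   # a path can be read in either direction
        key = (-len(seq), seq)      # longer first, then lower sequence
        if best is None or key < best:
            best = key
    return best[1]

# Toy tree: chain 0-1-2 with two leaves (3 and 4) attached to node 2.
example_adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2], 4: [2]}
example_label = {0: "C", 1: "C", 2: "C", 3: "O", 4: "N"}
print(backbone(example_adj, example_label))
```

In the toy tree, two paths of maximal length exist (ending in O and in N); the tie is broken by the lower label sequence, so every tree ends up with exactly one backbone.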
BBRC Example. Backbone: c:c:c-C=C-O-C. [Figure: three structures sharing this backbone (backbones drawn in gray): C-C(-O-C)(=C-c:c:c), C-C(=C(-O-C)(-C))(-c:c:c), and C-C(=C-O-C)(-c:c:c); panel labels: Class 1, Class 2, Refinement]
BBRC Properties (1). BBRCs partition the search space structurally (as opposed to occurrence-based methods, such as open/closed features). There are two types of BBRCs: (I) within a backbone, classes are not disjoint; (II) across backbones, classes are disjoint. A given backbone spans a maximum search tree: no node may be added without changing the backbone. [Figure: search space for two BBRCs within the same backbone]
BBRC Properties (2): The Number of BBRCs. Consider the special case of a rooted perfect binary tree of height h. The number of Backbone Refinement Classes is governed by the (recursive) branches on this backbone. [Figure: perfect binary tree of height 3; backbone with branches in gray]
BBRC Properties (3): The Number of BBRCs (unpublished). The number of backbone refinement classes of a branch of length l is [formula not preserved in the transcript]. The full set of subtrees containing the root has size [formula not preserved in the transcript], where q ≈ 1.50284 [1]. The set of BBRCs containing the root has size [formula not preserved in the transcript]. [1] L.A. Székely, Hua Wang: On subtrees of trees. Advances in Applied Mathematics, Volume 34, Issue 1, January 2005, pp. 138–155.
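The size of the full set of root-containing subtrees in a perfect binary tree can be checked with a short sketch. It does not reproduce the slide's formulas, which are not preserved here; it only uses the elementary recurrence r(0) = 1, r(h) = (1 + r(h-1))^2 (each of the two child branches is either absent or carries one of r(h-1) subtrees) and confirms it by brute force. The function names are hypothetical.

```python
from itertools import product

def root_subtrees(h):
    """Number of subtrees containing the root of a perfect binary tree of height h."""
    count = 1  # height 0: the root alone
    for _ in range(h):
        # Each of the two child branches is absent or one of `count` subtrees.
        count = (1 + count) ** 2
    return count

def root_subtrees_brute_force(h):
    """Count node sets that contain the root and are closed under 'parent of'."""
    n = 2 ** (h + 1) - 1  # heap numbering: node i has parent i // 2, root is node 1
    total = 0
    for bits in product((False, True), repeat=n - 1):
        chosen = (True,) + bits  # chosen[i - 1] tells whether node i is selected
        if all(chosen[i // 2 - 1] for i in range(2, n + 1) if chosen[i - 1]):
            total += 1
    return total

for h in range(4):
    print(h, root_subtrees(h), root_subtrees_brute_force(h))

# Growth constant implied by the height-3 count; it can be compared with q on the slide.
print((root_subtrees(3) + 1) ** (1 / 2 ** 4))
```

The counts grow doubly exponentially in h, and the constant implied by the height-3 value agrees with q ≈ 1.50284 to several decimal places. This illustrates why representing each class by a single member compresses the result set so strongly.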
BBRC Implementation. Use paths as candidate backbones. Idea: mine BBRCs and represent each BBRC by its most (χ²-)significant member. χ² thresholds cannot be used for anti-monotonic pruning; however, an upper bound for the χ² values of the refinements of a pattern exists [1] (Statistical Metric Pruning). Dynamic Upper Bound Pruning: the χ² threshold may be increased during depth-first traversal, since we only search for the maximal elements of classes. If several members are most significant, use the most general one. [1] S. Morishita and J. Sese: Traversing Itemset Lattices with Statistical Metric Pruning. In Symposium on Principles of Database Systems, pages 226–236, 2000.
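The pruning idea can be sketched as follows. This is not the libfminer implementation: the pattern tree, its support counts, and the names `chi2`, `chi2_upper_bound`, and `most_significant` are made up for illustration. The bound is used in the form commonly attributed to Morishita and Sese, max(χ²(y, y), χ²(x − y, 0)), for a pattern occurring in x graphs of which y are active. The starting threshold of 3.84 (95% level, one degree of freedom) is likewise an assumption.

```python
def chi2(x, y, n, m):
    """Chi-square of a pattern occurring in x of n graphs, y of them among the m actives."""
    a, b = y, x - y                    # pattern present: active / inactive
    c, d = m - y, (n - x) - (m - y)    # pattern absent:  active / inactive
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def chi2_upper_bound(x, y, n, m):
    """Upper bound on chi2 of any refinement (refinements can only lose occurrences)."""
    return max(chi2(y, y, n, m), chi2(x - y, 0, n, m))

def most_significant(root, n, m, min_chi2):
    """Depth-first search for the most significant pattern with a rising threshold."""
    best = {"chi2": min_chi2, "name": None}
    visited = []

    def visit(node):
        name, x, y, children = node
        visited.append(name)
        score = chi2(x, y, n, m)
        # Strict '>' keeps the pattern found first (the more general one) on ties.
        if score > best["chi2"]:
            best["chi2"], best["name"] = score, name
        # Dynamic pruning: no refinement can beat the current best, cut the branch.
        if chi2_upper_bound(x, y, n, m) <= best["chi2"]:
            return
        for child in children:
            visit(child)

    visit(root)
    return best["name"], best["chi2"], visited

# Toy class: (pattern, occurrences x, active occurrences y, refinements); values are made up.
toy = ("C-C", 60, 40, [
    ("C-C-O", 40, 38, [
        ("C-C-O-C", 30, 29, [("C-C-O-C-C", 20, 20, [])]),
    ]),
    ("C-C=N", 20, 2, [("C-C=N-C", 18, 0, [])]),
])
name, score, visited = most_significant(toy, n=100, m=50, min_chi2=3.84)
print(name, score, visited)
```

In this toy run the branch below C-C=N has an upper bound above the static threshold but below the best value already found, so it is skipped only because the threshold was raised during traversal.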
BBRC Experiments (1). Investigation of BBRCs regarding time efficiency, feature set sizes, and expressiveness.
Class-balanced CPDB datasets: Salmonella Mutagenicity (SM, 388 active / 810 compounds); Rat Carcinogenicity (RC, 459 active / 1145 compounds); Mouse Carcinogenicity (MoC, 428 active / 927 compounds); Multicell Call (MuC, 553 active / 1067 compounds).
Feature sets compared: BBRC Representatives: the most significant representatives of the backbone refinement classes. Significant Trees: all trees that are frequent and significant. Open Trees [1]: the most general significant trees with the same occurrences.
[1] B. Bringmann, A. Zimmermann, L. de Raedt, and S. Nijssen: Don't Be Afraid of Simpler Patterns. In Proceedings 10th PKDD, pages 55–66. Springer-Verlag, 2006.
BBRC Experiments (3): Time Efficiency (minimum frequency: 6)
Dataset   No statistical pruning   Static UB pruning   Dynamic UB pruning
SM        2.63                     2.55                0.44
RC        21.23                    21.11               6.63
MoC       3.71                     2.98                2.13
MuC       5.17                     4.76                1.76
BBRC Experiments (4): Accuracy, Sensitivity, Specificity. [Figure: bar charts; black: Sign. Trees, dark gray: BBRC-R., light gray: Open Trees]
Instance-based predictions:
Dataset   Mode   Sign. Tr.   Open Tr.   BBRC-R.
SM        all    74.6        75.5       74.6
SM        AD     80.7        80.6       79.4
SM        wt.    86.8        84.5       85.4
RC        all    64.4        64.5       67.2
RC        AD     70.0        68.7       70.4
RC        wt.    81.8        80.0       82.2
MoC       all    73.3        71.5       71.7
MoC       AD     75.7        74.4       76.5
MoC       wt.    83.7        80.8       82.0
MuC       all    71.9        70.2       70.3
MuC       AD     75.6        73.5       74.1
MuC       wt.    83.5        81.3       84.9
all: all predictions; AD: top 80% confidence predictions; wt.: predictions weighted by confidence.
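The three evaluation modes can be made concrete with a small sketch. The slides do not define them precisely, so the definitions below are assumptions: "AD" keeps the 80% of predictions with the highest confidence, and "wt." weights each prediction by its confidence. The function names and the toy predictions are hypothetical.

```python
def accuracy_all(preds):
    """Fraction of correct predictions; preds holds (predicted, actual, confidence)."""
    return sum(p == a for p, a, _ in preds) / len(preds)

def accuracy_ad(preds, keep=0.8):
    """Accuracy on the `keep` fraction of predictions with the highest confidence."""
    ranked = sorted(preds, key=lambda t: t[2], reverse=True)
    top = ranked[:max(1, round(keep * len(ranked)))]
    return accuracy_all(top)

def accuracy_weighted(preds):
    """Confidence-weighted accuracy: confident predictions count more."""
    total = sum(c for _, _, c in preds)
    return sum(c for p, a, c in preds if p == a) / total

# Toy predictions (made up): 1 = active, 0 = inactive.
toy_preds = [
    (1, 1, 0.95), (0, 0, 0.90), (1, 1, 0.85), (0, 0, 0.80), (1, 0, 0.75),
    (1, 1, 0.70), (0, 0, 0.65), (1, 1, 0.60), (0, 1, 0.30), (1, 0, 0.20),
]
print(accuracy_all(toy_preds), accuracy_ad(toy_preds), accuracy_weighted(toy_preds))
```

When low-confidence predictions are wrong more often, as in the toy data, both the AD and the weighted figures exceed the plain accuracy, which matches the ordering seen in the table.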
BBRC Experiments (5). Euclidean embedding based on co-occurrences and entropy [1]. [Figure: embedding of active / inactive compounds and activating / deactivating features] Differently colored features are nearly perfectly separated. Features are well distributed, with few clusters. [1] Hannes Schulz, Kristian Kersting, Andreas Karwath: ILP, the Blind, and the Elephant: Euclidean Embedding of Co-Proven Queries. In Proceedings of the 19th International Conference on Inductive Logic Programming (ILP 2009), forthcoming.
Large-Scale Analysis (1). NCI Yeast Anticancer Drug Screen datasets (April 2002 release): 1. AC-One (stage 0): 87,264 compounds, 12,068 active; 2. AC-All (stage 0): 87,264 compounds, 5,777 active; 3. AC-All (stage 1): 10,924 compounds, 5,433 active. To the best of the authors' knowledge, datasets 1 and 2 are the largest labelled datasets that have been considered in correlated graph mining.
Large-Scale Analysis (2): Effects of Minimum Frequency on Dataset Coverage. AC-One (stage 0), 87,264 compounds:
Min. Freq.      Coverage   Time eff.
100 (~0.12 %)   47.1       36m40s
200 (~0.23 %)   44.7       19m40s
Similar results were obtained for the other datasets*. [Figure: BBRC descriptors are more probable in lighter regions.]
* The effects of not using aromatic perception, i.e. no special node and edge labels for aromatic bonds, were much greater. The number of descriptors per compound in this setting was > 80 for both thresholds.
Large-Scale Analysis (3): Feature Count for Balanced Datasets (downsampling)
Feature set      AC-one (stage 0), 23,400 comp.   AC-all (stage 1), 10,548 comp.
Sign. Trees      1,190,763                        291,729
Open Trees       Memory alloc. error              216,206
Max. Trees [1]   556,673                          148,562
BBRC Repr.       31,450                           14,381
Max. Trees: the positive border as implied by minimum frequency and significance constraints [1].
[1] M. Al Hasan et al.: Origami: Mining Representative Orthogonal Graph Patterns. In ICDM 2007: Seventh IEEE International Conference on Data Mining, pages 153–162, Oct. 2007.
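As a quick arithmetic check on the feature counts in this table, the relative reduction achieved by the BBRC representatives can be computed as follows. The sketch assumes that reduction means 1 − |BBRC Repr.| / |other feature set| and uses only the counts listed on this slide; `feature_counts` and `reduction` are hypothetical names.

```python
feature_counts = {
    "AC-one (stage 0)": {"Sign. Trees": 1_190_763, "Max. Trees": 556_673, "BBRC Repr.": 31_450},
    "AC-all (stage 1)": {"Sign. Trees": 291_729, "Open Trees": 216_206,
                         "Max. Trees": 148_562, "BBRC Repr.": 14_381},
}

def reduction(dataset, other):
    """Share of features saved by BBRC representatives relative to another feature set."""
    counts = feature_counts[dataset]
    return 1 - counts["BBRC Repr."] / counts[other]

for dataset, counts in feature_counts.items():
    for other in counts:
        if other != "BBRC Repr.":
            print(f"{dataset}: {reduction(dataset, other):.1%} fewer features than {other}")
```

Open Trees are omitted for AC-one (stage 0) because mining failed there with a memory allocation error, so no count is available.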
Large-Scale Analysis (4): Time Efficiency.
Time efficiency (mining): AC-one (st. 0), 23,400 compounds: 4m52s; AC-all (st. 1), 10,548 compounds: 1m13s. Open Trees: mining times of ~12h.
Time efficiency (prediction): AC-one (st. 0): 11.1s; AC-all (st. 1): 4.7s. Open Trees: prediction times of >60s, impractical RAM demand.
[Figure: accuracy; all: all predictions, AD: top 80% confidence predictions, wt.: predictions weighted by confidence]
Backbone Refinement Class Representatives: Summary (1). Structurally heterogeneous descriptors; compression by a structural invariant (the backbone constraint). Good dataset coverage, robust against increasing minimum frequencies. Applicable to large-scale graph databases through a novel statistical pruning technique.
Backbone Refinement Class Representatives: Summary (2). Compression of 90% compared to all trees and 31% compared to open trees. Time efficiency improved by 85% and 83% versus no statistical pruning and static upper bound pruning, respectively. Discriminative potential similar to the complete set of trees, but significantly better than open trees.
Acknowledgements. The authors would like to thank Björn Bringmann for providing a binary and for his friendly cooperation in dataset testing, and Ulrich Rückert for providing datasets. The research was (partially) supported by the EU Seventh Framework Programme under contract no. Health-F5-2008-200787 (OpenTox). http://www.opentox.org C++ implementation: http://www.maunz.de/libfminer-doc