
1 Data Mining: Concepts and Techniques — Chapter 9 — Graph Mining, Part II: Graph Classification and Clustering. Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. ©2006 Jiawei Han and Micheline Kamber. All rights reserved.

2 Mining and Searching Graphs in Graph Databases

3 Graph Mining II
  - Graph Classification
    - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
    - Machine learning approaches: kernel-based approach; boosting
  - Graph Clustering
    - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
  - Summary

4 Substructure-Based Graph Classification
  - Data: graph data with class labels, e.g., chemical compounds, software behavior graphs, social networks
  - Basic idea
    - Extract graph substructures g_1, ..., g_n
    - Represent a graph with a feature vector x = (x_1, ..., x_n), where x_i is the frequency of substructure g_i in that graph (see the sketch below)
    - Build a classification model on the feature vectors
  - Different features and representative work
    - Fingerprints
    - MACCS keys
    - Tree and cyclic patterns [Horvath et al., KDD'04]
    - Minimal contrast subgraphs [Ting and Bailey, SDM'06]
    - Frequent subgraphs [Deshpande et al., TKDE'05; Liu et al., SDM'05]
    - Graph fragments [Wale and Karypis, ICDM'06]
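As a hedged illustration of this basic idea (not code from the slides), the sketch below encodes one graph as a feature vector over a set of already-mined feature subgraphs. The use of networkx, the "label" node attribute, and the function names are assumptions for illustration; for simplicity it records presence/absence rather than embedding counts.

```python
# Minimal sketch: substructure-based feature vectors with networkx.
# `feature_subgraphs` is assumed to come from some frequent-subgraph miner.
import networkx as nx
from networkx.algorithms import isomorphism

def to_feature_vector(graph, feature_subgraphs):
    """x_i = 1 if feature subgraph g_i occurs in `graph`, else 0
    (enumerating all embeddings would yield frequencies instead)."""
    same_label = lambda a, b: a.get("label") == b.get("label")
    vec = []
    for g_i in feature_subgraphs:
        gm = isomorphism.GraphMatcher(graph, g_i, node_match=same_label)
        vec.append(1 if gm.subgraph_is_isomorphic() else 0)
    return vec
```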

5 Fingerprints (fp-n)
  - Enumerate all paths up to length l, plus certain cycles, in each chemical compound
  - Hash each feature to one or more positions in a fixed-length bit vector (a sketch follows below)
  (Courtesy of Nikil Wale)
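A small sketch of the fingerprint idea, under the assumption that compounds are networkx graphs with a "label" node attribute. The path enumeration, hash function, and vector size are illustrative choices, not the specific fp-n implementation.

```python
# Sketch: hash label sequences of simple paths into a fixed-length bit vector.
import hashlib

def fingerprint(graph, max_len=7, n_bits=1024):
    bits = [0] * n_bits
    for start in graph.nodes:
        stack = [(start, [start])]
        while stack:
            v, path = stack.pop()
            key = "-".join(str(graph.nodes[u]["label"]) for u in path)
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1                # set the hashed bit position
            if len(path) < max_len:
                stack.extend((w, path + [w])
                             for w in graph.neighbors(v)
                             if w not in path)  # keep paths simple
    return bits
```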

6 MACCS Keys (MK)
  - A domain expert identifies fragments that are "important" for bioactivity
  - Each fragment forms a fixed dimension in the descriptor space
  (Courtesy of Nikil Wale)

7 Cycles and Trees (CT) [Horvath et al., KDD'04]
  - Identify the biconnected components of the chemical compound
  - Bounded cyclicity: a fixed number of cycles, obtained from the biconnected components
  - Delete the biconnected components from the compound; the left-over pieces are trees
  (Courtesy of Nikil Wale)
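This decomposition maps directly onto networkx's biconnected-components routine. The sketch below is my illustration of the idea, not Horvath et al.'s code: it separates the cyclic blocks from the left-over trees.

```python
# Sketch: split a compound graph into cyclic blocks and left-over trees.
import networkx as nx

def cycles_and_trees(graph):
    # biconnected components with more than two nodes contain a cycle;
    # two-node components are just bridge edges
    cyclic_blocks = [c for c in nx.biconnected_components(graph) if len(c) > 2]
    trees = graph.copy()
    for block in cyclic_blocks:
        trees.remove_nodes_from(block)   # what remains is a forest
    return cyclic_blocks, trees
```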

8 Frequent Subgraphs (FS) [Deshpande et al., TKDE'05]
  - Discovering features: run frequent subgraph discovery over the chemical compounds with a minimum support threshold
  - Topological features are captured by the graph representation
  - Discovered subgraphs carry class-conditional supports, e.g., one subgraph with 30% support in the positive class and 5% in the negative, another with 40% positive and 0% negative, another with 1% positive and 30% negative
  (Courtesy of Nikil Wale)
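To make those support numbers concrete, here is a toy filter (illustrative only; the threshold, helper names, and `contains` callback are assumptions) that keeps subgraphs whose class-conditional supports differ sharply, as with the 40% vs. 0% pattern above.

```python
# Toy sketch: keep patterns whose support gap between classes is large.
def support(pattern, graphs, contains):
    """contains(g, pattern) -> bool, e.g., a subgraph-isomorphism test."""
    return sum(contains(g, pattern) for g in graphs) / len(graphs)

def discriminative_patterns(patterns, pos, neg, contains, min_gap=0.25):
    keep = []
    for p in patterns:
        gap = support(p, pos, contains) - support(p, neg, contains)
        if abs(gap) >= min_gap:   # e.g., 0.40 vs. 0.00 passes
            keep.append(p)
    return keep
```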

9 Graph Fragments (GF) [Wale and Karypis, ICDM'06]
  - Tree fragments (TF): at least one node of the fragment has degree greater than 2; no cycles
  - Path fragments (PF): all nodes have degree at most 2; no cycles
  - Acyclic fragments (AF) = TF ∪ PF; acyclic fragments are also termed free trees
  (Courtesy of Nikil Wale)

10 Comparison of Different Features [Wale and Karypis, ICDM’06]

11 Minimal Contrast Subgraphs [Ting and Bailey, SDM'06]
  - A contrast subgraph appears in one class of graphs and never in the other class
  - It is minimal if none of its proper subgraphs is a contrast subgraph
  - It may be disconnected, which allows a succinct description of the differences but requires a larger search space
  - Mining contrast subgraphs:
    - Find the maximal common edge sets (these may be disconnected)
    - Apply a minimal hypergraph transversal operation to derive the minimal contrast edge sets from the maximal common edge sets
    - Compute the minimal contrast vertex sets separately, then take the minimal union with the minimal contrast edge sets
  (Courtesy of Bailey and Dong)

12 Frequent Subgraph-Based Classification [Deshpande et al., TKDE'05]
  - Frequent subgraphs: a subgraph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
  - Feature generation: frequent topological subgraphs mined by FSG; frequent geometric subgraphs with 3D shape information
  - Feature selection: sequential covering paradigm
  - Classification: use an SVM to learn a classifier over the feature vectors, assigning different misclassification costs to the classes to address skewed class distributions (see the sketch below)
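The cost-sensitive SVM step can be expressed with scikit-learn's `class_weight` option. The toy data and the 1:9 cost ratio below are assumptions for illustration, not the paper's settings.

```python
# Sketch: SVM over subgraph-feature vectors with per-class costs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)  # binary subgraph features
y = np.zeros(200, dtype=int)
y[:20] = 1                       # ~10% positives: a skewed class distribution

# penalize errors on the rare positive class more heavily (ratio assumed)
clf = SVC(kernel="linear", class_weight={0: 1.0, 1: 9.0})
clf.fit(X, y)
print(clf.predict(X[:5]))
```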

13 Varying Minimum Support

14 Varying Misclassification Cost

15 Graph Fragments [Wale and Karypis, ICDM'06]
  - All graph substructures up to a given length (size, or number of bonds)
  - Determined dynamically → dataset-dependent descriptor space
  - Complete coverage → descriptors for every compound
  - Precise representation → one-to-one mapping
  - Complex fragments → arbitrary topology
  - A recurrence relation generates the graph fragments of length l
  (Courtesy of Nikil Wale)

16 Performance Comparison

17 Re-examination of Pattern-Based Classification
  - Pipeline: pattern-based feature construction on positive/negative training instances → feature space transformation → model learning → prediction on test instances
  - The pattern-based feature construction step is computationally expensive!

18 The Computational Bottleneck
  - Two-step approach (expensive): mine the whole set of frequent patterns from the data (typically 10^4 to 10^6 patterns), then filter them down to the discriminative ones
  - Direct mining (efficient): transform the data (e.g., into an FP-tree) and mine the discriminative patterns directly

19 Challenge: Non-Anti-Monotonicity
  - Discriminative scores are neither monotonic nor anti-monotonic as a subgraph grows
  - Enumerate all subgraphs, from small size to large size, and then check their scores?

20 Direct Mining of Discriminative Patterns
  - Avoid mining the whole set of patterns
    - Harmony [Wang and Karypis, SDM'05]
    - DDPMine [Cheng et al., ICDE'08]
    - LEAP [Yan et al., SIGMOD'08]
    - MbT [Fan et al., KDD'08]
  - Find the most discriminative pattern: a search problem? an optimization problem?
  - Extensions: mining top-k discriminative patterns; mining approximate/weighted discriminative patterns

21 Mining the Most Significant Graph with Leap Search [Yan et al., SIGMOD'08]
  - Objective functions: score a pattern by its frequencies in the positive and negative datasets

22 Upper-Bound

23 Upper-Bound: Anti-Monotonic
  - Rule of thumb: if the frequency difference of a graph pattern between the positive and negative datasets increases, the pattern becomes more interesting
  - This lets us recycle existing graph mining algorithms to accommodate non-monotonic objective functions (a toy sketch of the pruning idea follows)
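A toy sketch of how such an upper bound enables pruning: support can only shrink as a pattern grows, so a score that increases with the positive/negative frequency gap is bounded, over all supergraphs, by scoring the current positive frequency against zero negative frequency. The `score` function here is a deliberately simple stand-in, not the paper's objective.

```python
# Toy sketch of branch-and-bound pruning with an anti-monotonic bound.
def score(p, q):
    return p - q          # stand-in for a real objective (e.g., G-test)

def upper_bound(p_g):
    # any supergraph g' of g satisfies p_{g'} <= p_g and q_{g'} >= 0,
    # so score(p_g, 0) bounds the score of every descendant of g
    return score(p_g, 0.0)

def should_prune(p_g, best_so_far):
    return upper_bound(p_g) <= best_so_far
```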

24 Structural Similarity
  - Sibling patterns in the search tree are structurally similar, and structural similarity implies significance similarity
  - Illustrated on size-4, size-5, and size-6 graphs

25 Structural Leap Search
  - g: a discovered graph; g': a sibling of g
  - Leap on the subtree rooted at g' if g' is similar enough to g, i.e., if their structure/frequency dissimilarity is within the leap length σ (the tolerance parameter)
  - Leaping skips siblings whose scores are likely near-identical to patterns already mined, splitting the search into a mining part and a leap part

26 Frequency Association
  - There is an association between a pattern's frequency and its objective score
  - Exploit it: start with a high frequency threshold, then gradually decrease it

27 LEAP Algorithm
  1. Structural leap search with a frequency threshold
  2. Support-descending mining, repeated until F(g*) converges
  3. Branch-and-bound search with bound F(g*)
  A hedged sketch of this loop follows.
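This is a high-level rendering of the three steps above, my paraphrase rather than the authors' code. The caller supplies `search(theta, sigma)` (assumed to run structural leap search at support threshold theta with leap length sigma) and the objective `F`; the starting threshold and stopping point are illustrative.

```python
# Skeleton of the LEAP loop; `search` and `F` are caller-supplied stand-ins.
def leap(search, F, sigma, theta=0.5, min_theta=0.01):
    g_star = search(theta, sigma)          # 1. structural leap search
    while theta > min_theta:               # 2. support-descending mining
        theta /= 2
        g = search(theta, sigma)
        if F(g) <= F(g_star):
            break                          # F(g*) has converged
        g_star = g
    return g_star  # 3. a final branch-and-bound pass with bound F(g_star)
                   #    would recover the exact optimum
```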

28 Branch-and-Bound vs. LEAP

                      Branch-and-Bound                   LEAP
  Pruning base        parent-child bound ("vertical");   sibling similarity ("horizontal");
                      strict pruning                     approximate pruning
  Feature optimality  guaranteed                         near-optimal
  Efficiency          good                               better

29 NCI Anti-Cancer Screen Datasets

  Name     Assay ID  Size    Tumor Description
  MCF-7    83        27,770  Breast
  MOLT-4   123       39,765  Leukemia
  NCI-H23  1         40,353  Non-Small Cell Lung
  OVCAR-8  109       40,516  Ovarian
  P388     330       41,472  Leukemia
  PC-3     41        27,509  Prostate
  SF-295   47        40,271  Central Nervous System
  SN12C    145       40,004  Renal
  SW-620   81        40,532  Colon
  UACC257  33        39,988  Melanoma
  YEAST    167       79,601  Yeast anti-cancer

30 Efficiency Tests
  - Search efficiency
  - Search quality, measured by the G-test

31 Mining Quality: Graph Classification (AUC)

  Name     OA Kernel*  LEAP  OA Kernel (6x)  LEAP (6x)
  MCF-7    0.68        0.67  0.75            0.76
  MOLT-4   0.65        0.66  0.69            0.72
  NCI-H23  0.79        0.76  0.77            0.79
  OVCAR-8  0.67        0.72  0.79            0.78
  P388     0.79        0.82  0.81
  PC-3     0.66        0.69  0.79            0.76
  Average  0.70        0.72  0.75            0.77

  * OA Kernel: Optimal Assignment Kernel [Fröhlich et al., ICML'05]; LEAP: leap search [Yan et al., SIGMOD'08]
  On runtime, the OA Kernel has a scalability problem!

32 Graph Mining II
  - Graph Classification
    - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
    - Machine learning approaches: kernel-based approach; boosting
  - Graph Clustering
    - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
  - Summary

33 Kernel-Based Classification
  - Random walk: marginalized kernels (Gärtner '02; Kashima et al. '02, ICML'03; Mahé et al. ICML'04)
  - K(G, G') = \sum_{h} \sum_{h'} p(h \mid G)\, p'(h' \mid G')\, K_z(h, h'), where h and h' are paths in graphs G and G', p(· | G) and p'(· | G') are probability distributions on paths, and K_z is a kernel between paths, e.g., a product of delta kernels matching the vertex and edge labels along the two paths
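To ground the formula, here is a small walk-counting kernel in the spirit of these papers: a geometric random-walk kernel computed on the direct-product graph. The damping factor, step count, and function names are illustrative assumptions, not any of the cited implementations.

```python
# Sketch: random-walk graph kernel via the direct-product graph.
import numpy as np

def product_adjacency(A1, labels1, A2, labels2):
    n1, n2 = len(labels1), len(labels2)
    W = np.zeros((n1 * n2, n1 * n2))
    for i in range(n1):
        for j in range(n2):
            if labels1[i] != labels2[j]:
                continue                       # vertex labels must match
            for k in range(n1):
                for l in range(n2):
                    if labels1[k] == labels2[l] and A1[i, k] and A2[j, l]:
                        W[i * n2 + j, k * n2 + l] = 1.0
    return W

def walk_kernel(A1, labels1, A2, labels2, lam=0.1, steps=6):
    # K = sum_k lam^k * 1^T W^k 1, truncated after `steps` terms
    W = product_adjacency(A1, labels1, A2, labels2)
    cur = np.ones(W.shape[0])
    total = cur.sum()
    for _ in range(steps):
        cur = lam * (W @ cur)
        total += cur.sum()
    return total
```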

34 Kernel-Based Classification
  - Optimal local assignment (Fröhlich et al., ICML'05): k(G, G') = max_π Σ_i k_v(v_i, v'_π(i)), the best one-to-one assignment of the vertices of one graph to those of the other
  - Can be extended to include neighborhood information, e.g., with an added term in which an RBF kernel measures the similarity of the neighborhoods of the matched vertices v_i and v'_π(i), down-weighted by a damping parameter λ

35 Boosting in Graph Classification
  - Decision stumps: simple classifiers in which the final decision is made by a single feature
  - A rule is a tuple (t, y): if a molecule contains substructure t, it is classified as y
  - Gain measures how well a rule fits the weighted training data; boosting repeatedly selects the highest-gain stump and reweights the examples

36 Boosting an Associative Classifier [Sun et al., TKDE'06]
  - Apply AdaBoost to associative classification with low-order rules
  - Three weighting strategies for combining classifiers:
    - Classifier-based weighting (AdaBoost)
    - Sample-based weighting (evaluated to be the best)
    - Hybrid weighting

37 Graph Classification with Boosting [Kudo, Maeda and Matsumoto, NIPS'04]
  - Decision stump: if a molecule x contains subgraph t, it is classified as y, otherwise as -y
  - At each round, find the decision stump (subgraph) that maximizes the gain under the current boosting weight vector d, then update the weights as in AdaBoost (a stump-search sketch follows)
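A minimal sketch of the stump search under boosting weights. The gain used here is the weighted agreement Σ_i d_i y_i h_{t,y}(x_i), which follows the stump definition on this slide; the matrix encoding of substructure containment is an assumption for illustration.

```python
# Sketch: pick the substructure stump maximizing weighted gain.
import numpy as np

def best_stump(X, y, d):
    """X[i, j] = True iff graph i contains substructure j;
    y in {-1, +1}; d: boosting weight vector. Returns ((j, y_rule), gain)."""
    best, best_gain = None, -np.inf
    for j in range(X.shape[1]):
        for y_rule in (+1, -1):
            h = np.where(X[:, j], y_rule, -y_rule)  # h_{t,y}(x)
            gain = float(np.sum(d * y * h))          # weighted agreement
            if gain > best_gain:
                best, best_gain = (j, y_rule), gain
    return best, best_gain
```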

38 Graph Mining II: Graph Clustering
  - Graph similarity measures
    - Feature-based similarity: each graph is represented as a feature vector (frequent subgraphs can serve as features); similarity is defined by the distance between the corresponding vectors (see the sketch below)
    - Structure-based similarity: maximal common subgraph; graph edit distance (insertion, deletion, and relabeling); graph alignment distance
  - Graph/network clustering: a link-density-based approach, "SCAN: A Structural Clustering Algorithm for Networks"
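The feature-based measure reduces to a vector comparison once graphs are encoded; a minimal cosine-similarity sketch (illustrative, not from the slides) follows. For the structure-based measures, networkx offers `graph_edit_distance`, though it is expensive on large graphs.

```python
# Sketch: feature-based graph similarity as cosine similarity of
# subgraph-frequency vectors.
import numpy as np

def feature_similarity(vec_a, vec_b):
    a, b = np.asarray(vec_a, float), np.asarray(vec_b, float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# structure-based alternative: networkx.graph_edit_distance(G1, G2)
# computes an edit distance exactly, but is costly on large graphs
```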

39 Graph Compression Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes

40 Graph/Network Clustering Problem
  - X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", Proc. 2007 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'07), San Jose, CA, Aug. 2007
  - Networks built from the mutual relationships of data elements usually have an underlying structure, but because the relationships are complex, these structures are difficult to discover. How can the structure be made clear?
  - Given only information about who associates with whom, can one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?

41 An Example of Networks How many clusters? What size should they be? What is the best partitioning? Should some points be segregated?

42 A Social Network Model
  - Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group.
  - Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups.
  - Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group.

43 The Neighborhood of a Vertex
  - Define Γ(v) as the immediate neighborhood of a vertex v, i.e., the set of people that an individual knows (in SCAN, taken to include v itself).

44 Structure Similarity
  - The desired features tend to be captured by a measure we call structural similarity: σ(u, v) = |Γ(u) ∩ Γ(v)| / √(|Γ(u)| · |Γ(v)|)
  - Structural similarity is large for members of a clique and small for hubs and outliers (a small implementation sketch follows).
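Written out in code, reconstructed from the SCAN paper cited on slide 40 and assuming a networkx graph:

```python
# Structural similarity from SCAN: Gamma(v) is v's neighborhood including
# v itself; sigma(u, v) = |Γ(u) ∩ Γ(v)| / sqrt(|Γ(u)| · |Γ(v)|).
import math

def gamma(G, v):
    return set(G.neighbors(v)) | {v}

def sigma(G, u, v):
    gu, gv = gamma(G, u), gamma(G, v)
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))
```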

45 Structural Connectivity [1]
  - ε-neighborhood: N_ε(v) = { w ∈ Γ(v) | σ(v, w) ≥ ε }
  - Core: v is a core vertex if |N_ε(v)| ≥ μ
  - Directly structure-reachable: w is directly structure-reachable from v if v is a core and w ∈ N_ε(v)
  - Structure-reachable: the transitive closure of direct structure reachability
  - Structure-connected: u and w are structure-connected if both are structure-reachable from some common vertex v
  [1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", KDD'96
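These definitions translate directly into code. The sketch below reuses `gamma` and `sigma` from the previous block; the ε and μ parameter names follow the slides.

```python
# SCAN connectivity predicates, building on gamma() and sigma() above.
def eps_neighborhood(G, v, eps):
    return {w for w in gamma(G, v) if sigma(G, v, w) >= eps}

def is_core(G, v, eps, mu):
    return len(eps_neighborhood(G, v, eps)) >= mu

def directly_reachable(G, v, w, eps, mu):
    # w is directly structure-reachable from a core vertex v
    return is_core(G, v, eps, mu) and w in eps_neighborhood(G, v, eps)
```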

46 Structure-Connected Clusters
  - A structure-connected cluster C satisfies:
    - Connectivity: every pair of vertices in C is structure-connected
    - Maximality: if v ∈ C and w is structure-reachable from v, then w ∈ C
  - Hubs: belong to no cluster, but bridge many clusters
  - Outliers: belong to no cluster, and connect to fewer clusters

47-59 Algorithm: A Running Example
  Slides 47-59 step through SCAN on a 14-vertex example network (vertices 0-13) with ε = 0.7 and μ = 2. Vertex by vertex, the algorithm computes structural similarities to neighbors (the edge values shown include 0.63, 0.75, 0.67, 0.82, 0.73, 0.51, and 0.68), marks vertices with at least μ neighbors of similarity ≥ ε as cores, grows structure-connected clusters from the cores, and leaves the remaining vertices to be classified as hubs or outliers.
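Putting the walkthrough together, here is a compact, hedged sketch of the SCAN clustering loop, reusing `is_core` and `eps_neighborhood` from the sketches above; hub/outlier post-processing is only noted, not implemented.

```python
# End-to-end SCAN sketch over a networkx graph G.
from collections import deque

def scan(G, eps=0.7, mu=2):
    labels = {}                        # vertex -> cluster id
    cid = 0
    for v in G.nodes:
        if v in labels or not is_core(G, v, eps, mu):
            continue
        cid += 1                       # start a new structure-connected cluster
        queue = deque([v])
        while queue:
            u = queue.popleft()
            for w in eps_neighborhood(G, u, eps):
                if w not in labels:
                    labels[w] = cid
                    if is_core(G, w, eps, mu):
                        queue.append(w)   # only cores expand the cluster
    # vertices left unlabeled are hubs (if they bridge several clusters)
    # or outliers (if they do not)
    return labels
```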

60 Running Time
  - Running time is O(|E|); for sparse networks this is O(|V|)
  - [2] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004)

61 Summary: Graph Classification and Clustering
  - Graph Classification
    - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
    - Machine learning approaches: kernel-based approach; boosting
  - Graph Clustering
    - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
  - Lots more to be explored

62 References (1)
  - G. Cong, K. Tan, A. Tung, and X. Xu. Mining Top-k Covering Rule Groups for Gene Expression Data, SIGMOD'05.
  - M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent Substructure-Based Approaches for Classifying Chemical Compounds, TKDE'05.
  - G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences, KDD'99.
  - G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns, DS'99.
  - R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.), John Wiley & Sons, 2001.
  - W. Fan, K. Zhang, H. Cheng, J. Gao, X. Yan, J. Han, P. S. Yu, and O. Verscheure. Direct Mining of Discriminative and Essential Graphical and Itemset Features via Model-Based Search Tree, KDD'08.
  - D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community Mining from Multi-Relational Networks, PKDD'05.
  - H. Fröhlich, J. Wegner, F. Sieker, and A. Zell. Optimal Assignment Kernels for Attributed Molecular Graphs, ICML'05.
  - T. Gärtner, P. Flach, and S. Wrobel. On Graph Kernels: Hardness Results and Efficient Alternatives, COLT/Kernel'03.
  - H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized Kernels Between Labeled Graphs, ICML'03.
  - T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS'04.
  - C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs, SDM'05.
  - P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert. Extensions of Marginalized Graph Kernels, ICML'04.

63 References (2)
  - T. Horvath, T. Gärtner, and S. Wrobel. Cyclic Pattern Kernels for Predictive Graph Mining, KDD'04.
  - T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS'04.
  - W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, ICDM'01.
  - B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining, KDD'98.
  - H. Liu, J. Han, D. Xin, and Z. Shao. Mining Frequent Patterns on Very High Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
  - S. Nijssen and J. Kok. A Quickstart in Frequent Structure Mining Can Make a Difference, KDD'04.
  - F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets, KDD'03.
  - F. Pan, A. Tung, G. Cong, and X. Xu. COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery, SSDBM'04.
  - Y. Sun, Y. Wang, and A. K. C. Wong. Boosting an Associative Classifier, TKDE'06.
  - P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns, KDD'02.
  - R. Ting and J. Bailey. Mining Minimal Contrast Subgraph Patterns, SDM'06.
  - N. Wale and G. Karypis. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification, ICDM'06.
  - H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by Pattern Similarity in Large Data Sets, SIGMOD'02.
  - J. Wang and G. Karypis. HARMONY: Efficiently Mining the Best Rules for Classification, SDM'05.
  - X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A Structural Clustering Algorithm for Networks, KDD'07.
  - X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining Significant Graph Patterns by Scalable Leap Search, SIGMOD'08.

64 Mining and Searching Graphs in Graph Databases

