
1 Data Mining: Concepts and Techniques — Chapter 9 — Graph Mining, Part II: Graph Classification and Clustering. Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. ©2006 Jiawei Han and Micheline Kamber. All rights reserved.

2 Mining and Searching Graphs in Graph Databases

3 Graph Mining II
  - Graph Classification
    - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
    - Machine learning approaches: kernel-based approach; boosting
  - Graph Clustering
    - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
  - Summary

4 Substructure-Based Graph Classification
  - Data: graph data with class labels, e.g., chemical compounds, software behavior graphs, social networks
  - Basic idea
    - Extract graph substructures g_1, ..., g_n
    - Represent a graph with a feature vector x = (x_1, ..., x_n), where x_i is the frequency of substructure g_i in that graph (see the sketch below)
    - Build a classification model on the feature vectors
  - Different features and representative work
    - Fingerprints
    - MACCS keys
    - Tree and cyclic patterns [Horvath et al., KDD'04]
    - Minimal contrast subgraphs [Ting and Bailey, SDM'06]
    - Frequent subgraphs [Deshpande et al., TKDE'05; Liu et al., SDM'05]
    - Graph fragments [Wale and Karypis, ICDM'06]
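As a hedged illustration of this basic idea (not code from the slides), the sketch below encodes one graph as a feature vector over a set of already-mined feature subgraphs. The use of networkx, the "label" node attribute, and the function names are assumptions for illustration; for simplicity it records presence/absence rather than embedding counts.

```python
# Minimal sketch: substructure-based feature vectors with networkx.
# `feature_subgraphs` is assumed to come from some frequent-subgraph miner.
import networkx as nx
from networkx.algorithms import isomorphism

def to_feature_vector(graph, feature_subgraphs):
    """x_i = 1 if feature subgraph g_i occurs in `graph`, else 0
    (enumerating all embeddings would yield frequencies instead)."""
    same_label = lambda a, b: a.get("label") == b.get("label")
    vec = []
    for g_i in feature_subgraphs:
        gm = isomorphism.GraphMatcher(graph, g_i, node_match=same_label)
        vec.append(1 if gm.subgraph_is_isomorphic() else 0)
    return vec
```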

5 Fingerprints (fp-n)
  - Enumerate all paths up to length l, plus certain cycles, in each chemical compound
  - Hash each feature to one or more positions in a fixed-length bit vector (a sketch follows below)
  (Courtesy of Nikil Wale)
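A small sketch of the fingerprint idea, under the assumption that compounds are networkx graphs with a "label" node attribute. The path enumeration, hash function, and vector size are illustrative choices, not the specific fp-n implementation.

```python
# Sketch: hash label sequences of simple paths into a fixed-length bit vector.
import hashlib

def fingerprint(graph, max_len=7, n_bits=1024):
    bits = [0] * n_bits
    for start in graph.nodes:
        stack = [(start, [start])]
        while stack:
            v, path = stack.pop()
            key = "-".join(str(graph.nodes[u]["label"]) for u in path)
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1                # set the hashed bit position
            if len(path) < max_len:
                stack.extend((w, path + [w])
                             for w in graph.neighbors(v)
                             if w not in path)  # keep paths simple
    return bits
```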

6 MACCS Keys (MK)
  - A domain expert identifies fragments that are "important" for bioactivity
  - Each fragment forms a fixed dimension in the descriptor space
  (Courtesy of Nikil Wale)

7 Cycles and Trees (CT) [Horvath et al., KDD'04]
  - Identify the biconnected components of the chemical compound
  - Bounded cyclicity: a fixed number of cycles, obtained from the biconnected components
  - Delete the biconnected components from the compound; the left-over pieces are trees
  (Courtesy of Nikil Wale)
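This decomposition maps directly onto networkx's biconnected-components routine. The sketch below is my illustration of the idea, not Horvath et al.'s code: it separates the cyclic blocks from the left-over trees.

```python
# Sketch: split a compound graph into cyclic blocks and left-over trees.
import networkx as nx

def cycles_and_trees(graph):
    # biconnected components with more than two nodes contain a cycle;
    # two-node components are just bridge edges
    cyclic_blocks = [c for c in nx.biconnected_components(graph) if len(c) > 2]
    trees = graph.copy()
    for block in cyclic_blocks:
        trees.remove_nodes_from(block)   # what remains is a forest
    return cyclic_blocks, trees
```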

8 Frequent Subgraphs (FS) [Deshpande et al., TKDE'05]
  - Discovering features: run frequent subgraph discovery over the chemical compounds with a minimum support threshold
  - Topological features are captured by the graph representation
  - Discovered subgraphs carry class-conditional supports, e.g., one subgraph with 30% support in the positive class and 5% in the negative, another with 40% positive and 0% negative, another with 1% positive and 30% negative
  (Courtesy of Nikil Wale)
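To make those support numbers concrete, here is a toy filter (illustrative only; the threshold, helper names, and `contains` callback are assumptions) that keeps subgraphs whose class-conditional supports differ sharply, as with the 40% vs. 0% pattern above.

```python
# Toy sketch: keep patterns whose support gap between classes is large.
def support(pattern, graphs, contains):
    """contains(g, pattern) -> bool, e.g., a subgraph-isomorphism test."""
    return sum(contains(g, pattern) for g in graphs) / len(graphs)

def discriminative_patterns(patterns, pos, neg, contains, min_gap=0.25):
    keep = []
    for p in patterns:
        gap = support(p, pos, contains) - support(p, neg, contains)
        if abs(gap) >= min_gap:   # e.g., 0.40 vs. 0.00 passes
            keep.append(p)
    return keep
```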

9 Graph Fragments (GF) [Wale and Karypis, ICDM'06]
  - Tree fragments (TF): at least one node of the fragment has degree greater than 2; no cycles
  - Path fragments (PF): all nodes have degree at most 2; no cycles
  - Acyclic fragments (AF) = TF ∪ PF; acyclic fragments are also termed free trees
  (Courtesy of Nikil Wale)

10 Comparison of Different Features [Wale and Karypis, ICDM’06]

11 Minimal Contrast Subgraphs [Ting and Bailey, SDM'06]
  - A contrast subgraph appears in one class of graphs and never in the other class
  - It is minimal if none of its proper subgraphs is a contrast subgraph
  - It may be disconnected, which allows a succinct description of the differences but requires a larger search space
  - Mining contrast subgraphs:
    - Find the maximal common edge sets (these may be disconnected)
    - Apply a minimal hypergraph transversal operation to derive the minimal contrast edge sets from the maximal common edge sets
    - Compute the minimal contrast vertex sets separately, then take the minimal union with the minimal contrast edge sets
  (Courtesy of Bailey and Dong)

12 Frequent Subgraph-Based Classification [Deshpande et al., TKDE'05]
  - Frequent subgraphs: a subgraph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold
  - Feature generation: frequent topological subgraphs mined by FSG; frequent geometric subgraphs with 3D shape information
  - Feature selection: sequential covering paradigm
  - Classification: use an SVM to learn a classifier over the feature vectors, assigning different misclassification costs to the classes to address skewed class distributions (see the sketch below)
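The cost-sensitive SVM step can be expressed with scikit-learn's `class_weight` option. The toy data and the 1:9 cost ratio below are assumptions for illustration, not the paper's settings.

```python
# Sketch: SVM over subgraph-feature vectors with per-class costs.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)  # binary subgraph features
y = np.zeros(200, dtype=int)
y[:20] = 1                       # ~10% positives: a skewed class distribution

# penalize errors on the rare positive class more heavily (ratio assumed)
clf = SVC(kernel="linear", class_weight={0: 1.0, 1: 9.0})
clf.fit(X, y)
print(clf.predict(X[:5]))
```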

13 Varying Minimum Support

14 Varying Misclassification Cost

15 Graph Fragments [Wale and Karypis, ICDM'06]
  - All graph substructures up to a given length (size, or number of bonds)
  - Determined dynamically → dataset-dependent descriptor space
  - Complete coverage → descriptors for every compound
  - Precise representation → one-to-one mapping
  - Complex fragments → arbitrary topology
  - A recurrence relation generates the graph fragments of length l
  (Courtesy of Nikil Wale)

16 Performance Comparison

17 Re-examination of Pattern-Based Classification
  - Pipeline: pattern-based feature construction on positive/negative training instances → feature space transformation → model learning → prediction on test instances
  - The pattern-based feature construction step is computationally expensive!

18 The Computational Bottleneck
  - Two-step approach (expensive): mine the whole set of frequent patterns from the data (typically 10^4 to 10^6 patterns), then filter them down to the discriminative ones
  - Direct mining (efficient): transform the data (e.g., into an FP-tree) and mine the discriminative patterns directly

19 Challenge: Non-Anti-Monotonicity
  - Discriminative scores are neither monotonic nor anti-monotonic as a subgraph grows
  - Enumerate all subgraphs, from small size to large size, and then check their scores?

20 Direct Mining of Discriminative Patterns
  - Avoid mining the whole set of patterns
    - Harmony [Wang and Karypis, SDM'05]
    - DDPMine [Cheng et al., ICDE'08]
    - LEAP [Yan et al., SIGMOD'08]
    - MbT [Fan et al., KDD'08]
  - Find the most discriminative pattern: a search problem? an optimization problem?
  - Extensions: mining top-k discriminative patterns; mining approximate/weighted discriminative patterns

21 Mining the Most Significant Graph with Leap Search [Yan et al., SIGMOD'08]
  - Objective functions: score a pattern by its frequencies in the positive and negative datasets

22 Upper-Bound

23 Upper-Bound: Anti-Monotonic
  - Rule of thumb: if the frequency difference of a graph pattern between the positive and negative datasets increases, the pattern becomes more interesting
  - This lets us recycle existing graph mining algorithms to accommodate non-monotonic objective functions (a toy sketch of the pruning idea follows)
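A toy sketch of how such an upper bound enables pruning: support can only shrink as a pattern grows, so a score that increases with the positive/negative frequency gap is bounded, over all supergraphs, by scoring the current positive frequency against zero negative frequency. The `score` function here is a deliberately simple stand-in, not the paper's objective.

```python
# Toy sketch of branch-and-bound pruning with an anti-monotonic bound.
def score(p, q):
    return p - q          # stand-in for a real objective (e.g., G-test)

def upper_bound(p_g):
    # any supergraph g' of g satisfies p_{g'} <= p_g and q_{g'} >= 0,
    # so score(p_g, 0) bounds the score of every descendant of g
    return score(p_g, 0.0)

def should_prune(p_g, best_so_far):
    return upper_bound(p_g) <= best_so_far
```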

24 Structural Similarity
  - Sibling patterns in the search tree are structurally similar, and structural similarity implies significance similarity
  - Illustrated on size-4, size-5, and size-6 graphs

25 Structural Leap Search
  - g: a discovered graph; g': a sibling of g
  - Leap on the subtree rooted at g' if g' is similar enough to g, i.e., if their structure/frequency dissimilarity is within the leap length σ (the tolerance parameter)
  - Leaping skips siblings whose scores are likely near-identical to patterns already mined, splitting the search into a mining part and a leap part

26 Frequency Association
  - There is an association between a pattern's frequency and its objective score
  - Exploit it: start with a high frequency threshold, then gradually decrease it

27 LEAP Algorithm
  1. Structural leap search with a frequency threshold
  2. Support-descending mining, repeated until F(g*) converges
  3. Branch-and-bound search with bound F(g*)
  A hedged sketch of this loop follows.
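This is a high-level rendering of the three steps above, my paraphrase rather than the authors' code. The caller supplies `search(theta, sigma)` (assumed to run structural leap search at support threshold theta with leap length sigma) and the objective `F`; the starting threshold and stopping point are illustrative.

```python
# Skeleton of the LEAP loop; `search` and `F` are caller-supplied stand-ins.
def leap(search, F, sigma, theta=0.5, min_theta=0.01):
    g_star = search(theta, sigma)          # 1. structural leap search
    while theta > min_theta:               # 2. support-descending mining
        theta /= 2
        g = search(theta, sigma)
        if F(g) <= F(g_star):
            break                          # F(g*) has converged
        g_star = g
    return g_star  # 3. a final branch-and-bound pass with bound F(g_star)
                   #    would recover the exact optimum
```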

28 Branch-and-Bound vs. LEAP

                      Branch-and-Bound                   LEAP
  Pruning base        parent-child bound ("vertical");   sibling similarity ("horizontal");
                      strict pruning                     approximate pruning
  Feature optimality  guaranteed                         near-optimal
  Efficiency          good                               better

29 NCI Anti-Cancer Screen Datasets

  Name     Assay ID  Size    Tumor Description
  MCF-7    83        27,770  Breast
  MOLT-4   123       39,765  Leukemia
  NCI-H23  1         40,353  Non-Small Cell Lung
  OVCAR-8  109       40,516  Ovarian
  P388     330       41,472  Leukemia
  PC-3     41        27,509  Prostate
  SF-295   47        40,271  Central Nervous System
  SN12C    145       40,004  Renal
  SW-620   81        40,532  Colon
  UACC257  33        39,988  Melanoma
  YEAST    167       79,601  Yeast anti-cancer

30 Efficiency Tests
  - Search efficiency
  - Search quality, measured by the G-test

31 Mining Quality: Graph Classification (AUC)

  Name     OA Kernel*  LEAP  OA Kernel (6x)  LEAP (6x)
  MCF-7    0.68        0.67  0.75            0.76
  MOLT-4   0.65        0.66  0.69            0.72
  NCI-H23  0.79        0.76  0.77            0.79
  OVCAR-8  0.67        0.72  0.79            0.78
  P388     0.79        0.82  0.81
  PC-3     0.66        0.69  0.79            0.76
  Average  0.70        0.72  0.75            0.77

  * OA Kernel: Optimal Assignment Kernel [Fröhlich et al., ICML'05]; LEAP: leap search [Yan et al., SIGMOD'08]
  On runtime, the OA Kernel has a scalability problem!

32 Graph Mining II
  - Graph Classification
    - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
    - Machine learning approaches: kernel-based approach; boosting
  - Graph Clustering
    - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
  - Summary

33 Kernel-Based Classification
  - Random walk: marginalized kernels (Gärtner '02; Kashima et al. '02, ICML'03; Mahé et al. ICML'04)
  - K(G, G') = \sum_{h} \sum_{h'} p(h \mid G)\, p'(h' \mid G')\, K_z(h, h'), where h and h' are paths in graphs G and G', p(· | G) and p'(· | G') are probability distributions on paths, and K_z is a kernel between paths, e.g., a product of delta kernels matching the vertex and edge labels along the two paths
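To ground the formula, here is a small walk-counting kernel in the spirit of these papers: a geometric random-walk kernel computed on the direct-product graph. The damping factor, step count, and function names are illustrative assumptions, not any of the cited implementations.

```python
# Sketch: random-walk graph kernel via the direct-product graph.
import numpy as np

def product_adjacency(A1, labels1, A2, labels2):
    n1, n2 = len(labels1), len(labels2)
    W = np.zeros((n1 * n2, n1 * n2))
    for i in range(n1):
        for j in range(n2):
            if labels1[i] != labels2[j]:
                continue                       # vertex labels must match
            for k in range(n1):
                for l in range(n2):
                    if labels1[k] == labels2[l] and A1[i, k] and A2[j, l]:
                        W[i * n2 + j, k * n2 + l] = 1.0
    return W

def walk_kernel(A1, labels1, A2, labels2, lam=0.1, steps=6):
    # K = sum_k lam^k * 1^T W^k 1, truncated after `steps` terms
    W = product_adjacency(A1, labels1, A2, labels2)
    cur = np.ones(W.shape[0])
    total = cur.sum()
    for _ in range(steps):
        cur = lam * (W @ cur)
        total += cur.sum()
    return total
```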

34 Kernel-Based Classification
  - Optimal local assignment (Fröhlich et al., ICML'05): k(G, G') = max_π Σ_i k_v(v_i, v'_π(i)), the best one-to-one assignment of the vertices of one graph to those of the other
  - Can be extended to include neighborhood information, e.g., with an added term in which an RBF kernel measures the similarity of the neighborhoods of the matched vertices v_i and v'_π(i), down-weighted by a damping parameter λ

35 Boosting in Graph Classification
  - Decision stumps: simple classifiers in which the final decision is made by a single feature
  - A rule is a tuple (t, y): if a molecule contains substructure t, it is classified as y
  - Gain measures how well a rule fits the weighted training data; boosting repeatedly selects the highest-gain stump and reweights the examples

36 Boosting an Associative Classifier [Sun et al., TKDE'06]
  - Apply AdaBoost to associative classification with low-order rules
  - Three weighting strategies for combining classifiers:
    - Classifier-based weighting (AdaBoost)
    - Sample-based weighting (evaluated to be the best)
    - Hybrid weighting

37 Graph Classification with Boosting [Kudo, Maeda and Matsumoto, NIPS'04]
  - Decision stump: if a molecule x contains subgraph t, it is classified as y, otherwise as -y
  - At each round, find the decision stump (subgraph) that maximizes the gain under the current boosting weight vector d, then update the weights as in AdaBoost (a stump-search sketch follows)
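A minimal sketch of the stump search under boosting weights. The gain used here is the weighted agreement Σ_i d_i y_i h_{t,y}(x_i), which follows the stump definition on this slide; the matrix encoding of substructure containment is an assumption for illustration.

```python
# Sketch: pick the substructure stump maximizing weighted gain.
import numpy as np

def best_stump(X, y, d):
    """X[i, j] = True iff graph i contains substructure j;
    y in {-1, +1}; d: boosting weight vector. Returns ((j, y_rule), gain)."""
    best, best_gain = None, -np.inf
    for j in range(X.shape[1]):
        for y_rule in (+1, -1):
            h = np.where(X[:, j], y_rule, -y_rule)  # h_{t,y}(x)
            gain = float(np.sum(d * y * h))          # weighted agreement
            if gain > best_gain:
                best, best_gain = (j, y_rule), gain
    return best, best_gain
```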

38 Graph Mining II: Graph Clustering
  - Graph similarity measures
    - Feature-based similarity: each graph is represented as a feature vector (frequent subgraphs can serve as features); similarity is defined by the distance between the corresponding vectors (see the sketch below)
    - Structure-based similarity: maximal common subgraph; graph edit distance (insertion, deletion, and relabeling); graph alignment distance
  - Graph/network clustering: a link-density-based approach, "SCAN: A Structural Clustering Algorithm for Networks"
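The feature-based measure reduces to a vector comparison once graphs are encoded; a minimal cosine-similarity sketch (illustrative, not from the slides) follows. For the structure-based measures, networkx offers `graph_edit_distance`, though it is expensive on large graphs.

```python
# Sketch: feature-based graph similarity as cosine similarity of
# subgraph-frequency vectors.
import numpy as np

def feature_similarity(vec_a, vec_b):
    a, b = np.asarray(vec_a, float), np.asarray(vec_b, float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# structure-based alternative: networkx.graph_edit_distance(G1, G2)
# computes an edit distance exactly, but is costly on large graphs
```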

39 Graph Compression Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes

40 Graph/Network Clustering Problem
  - X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", Proc. 2007 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'07), San Jose, CA, Aug. 2007
  - Networks built from the mutual relationships of data elements usually have an underlying structure, but because the relationships are complex, these structures are difficult to discover. How can the structure be made clear?
  - Given only information about who associates with whom, can one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?

41 An Example of Networks How many clusters? What size should they be? What is the best partitioning? Should some points be segregated?

42 A Social Network Model
  - Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group.
  - Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups.
  - Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group.

43 The Neighborhood of a Vertex
  - Define Γ(v) as the immediate neighborhood of a vertex v, i.e., the set of people that an individual knows (in SCAN, taken to include v itself).

44 Structure Similarity
  - The desired features tend to be captured by a measure we call structural similarity: σ(u, v) = |Γ(u) ∩ Γ(v)| / √(|Γ(u)| · |Γ(v)|)
  - Structural similarity is large for members of a clique and small for hubs and outliers (a small implementation sketch follows).
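Written out in code, reconstructed from the SCAN paper cited on slide 40 and assuming a networkx graph:

```python
# Structural similarity from SCAN: Gamma(v) is v's neighborhood including
# v itself; sigma(u, v) = |Γ(u) ∩ Γ(v)| / sqrt(|Γ(u)| · |Γ(v)|).
import math

def gamma(G, v):
    return set(G.neighbors(v)) | {v}

def sigma(G, u, v):
    gu, gv = gamma(G, u), gamma(G, v)
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))
```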

45 Structural Connectivity [1]
  - ε-neighborhood: N_ε(v) = { w ∈ Γ(v) | σ(v, w) ≥ ε }
  - Core: v is a core vertex if |N_ε(v)| ≥ μ
  - Directly structure-reachable: w is directly structure-reachable from v if v is a core and w ∈ N_ε(v)
  - Structure-reachable: the transitive closure of direct structure reachability
  - Structure-connected: u and w are structure-connected if both are structure-reachable from some common vertex v
  [1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", KDD'96
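These definitions translate directly into code. The sketch below reuses `gamma` and `sigma` from the previous block; the ε and μ parameter names follow the slides.

```python
# SCAN connectivity predicates, building on gamma() and sigma() above.
def eps_neighborhood(G, v, eps):
    return {w for w in gamma(G, v) if sigma(G, v, w) >= eps}

def is_core(G, v, eps, mu):
    return len(eps_neighborhood(G, v, eps)) >= mu

def directly_reachable(G, v, w, eps, mu):
    # w is directly structure-reachable from a core vertex v
    return is_core(G, v, eps, mu) and w in eps_neighborhood(G, v, eps)
```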

46 Structure-Connected Clusters
  - A structure-connected cluster C satisfies:
    - Connectivity: every pair of vertices in C is structure-connected
    - Maximality: if v ∈ C and w is structure-reachable from v, then w ∈ C
  - Hubs: belong to no cluster, but bridge many clusters
  - Outliers: belong to no cluster, and connect to fewer clusters

47-59 Algorithm: A Running Example
  Slides 47-59 step through SCAN on a 14-vertex example network (vertices 0-13) with ε = 0.7 and μ = 2. Vertex by vertex, the algorithm computes structural similarities to neighbors (the edge values shown include 0.63, 0.75, 0.67, 0.82, 0.73, 0.51, and 0.68), marks vertices with at least μ neighbors of similarity ≥ ε as cores, grows structure-connected clusters from the cores, and leaves the remaining vertices to be classified as hubs or outliers.
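Putting the walkthrough together, here is a compact, hedged sketch of the SCAN clustering loop, reusing `is_core` and `eps_neighborhood` from the sketches above; hub/outlier post-processing is only noted, not implemented.

```python
# End-to-end SCAN sketch over a networkx graph G.
from collections import deque

def scan(G, eps=0.7, mu=2):
    labels = {}                        # vertex -> cluster id
    cid = 0
    for v in G.nodes:
        if v in labels or not is_core(G, v, eps, mu):
            continue
        cid += 1                       # start a new structure-connected cluster
        queue = deque([v])
        while queue:
            u = queue.popleft()
            for w in eps_neighborhood(G, u, eps):
                if w not in labels:
                    labels[w] = cid
                    if is_core(G, w, eps, mu):
                        queue.append(w)   # only cores expand the cluster
    # vertices left unlabeled are hubs (if they bridge several clusters)
    # or outliers (if they do not)
    return labels
```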

60 Running Time
  - Running time is O(|E|); for sparse networks this is O(|V|)
  - [2] A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004)

61 Summary: Graph Classification and Clustering
  - Graph Classification
    - Graph pattern-based approach: subgraph patterns from data mining (LEAP)
    - Machine learning approaches: kernel-based approach; boosting
  - Graph Clustering
    - Link-density-based approach: "SCAN: A Structural Clustering Algorithm for Networks"
  - Lots more to be explored

62 References (1)
  - G. Cong, K. Tan, A. Tung, and X. Xu. Mining Top-k Covering Rule Groups for Gene Expression Data, SIGMOD'05.
  - M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent Substructure-Based Approaches for Classifying Chemical Compounds, TKDE'05.
  - G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences, KDD'99.
  - G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns, DS'99.
  - R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.), John Wiley & Sons, 2001.
  - W. Fan, K. Zhang, H. Cheng, J. Gao, X. Yan, J. Han, P. S. Yu, and O. Verscheure. Direct Mining of Discriminative and Essential Graphical and Itemset Features via Model-Based Search Tree, KDD'08.
  - D. Cai, Z. Shao, X. He, X. Yan, and J. Han. Community Mining from Multi-Relational Networks, PKDD'05.
  - H. Fröhlich, J. Wegner, F. Sieker, and A. Zell. Optimal Assignment Kernels for Attributed Molecular Graphs, ICML'05.
  - T. Gärtner, P. Flach, and S. Wrobel. On Graph Kernels: Hardness Results and Efficient Alternatives, COLT/Kernel'03.
  - H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized Kernels Between Labeled Graphs, ICML'03.
  - T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS'04.
  - C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining Behavior Graphs for "Backtrace" of Noncrashing Bugs, SDM'05.
  - P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert. Extensions of Marginalized Graph Kernels, ICML'04.

63 References (2)
  - T. Horvath, T. Gärtner, and S. Wrobel. Cyclic Pattern Kernels for Predictive Graph Mining, KDD'04.
  - T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS'04.
  - W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules, ICDM'01.
  - B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining, KDD'98.
  - H. Liu, J. Han, D. Xin, and Z. Shao. Mining Frequent Patterns on Very High Dimensional Data: A Top-Down Row Enumeration Approach, SDM'06.
  - S. Nijssen and J. Kok. A Quickstart in Frequent Structure Mining Can Make a Difference, KDD'04.
  - F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets, KDD'03.
  - F. Pan, A. Tung, G. Cong, and X. Xu. COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery, SSDBM'04.
  - Y. Sun, Y. Wang, and A. K. C. Wong. Boosting an Associative Classifier, TKDE'06.
  - P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns, KDD'02.
  - R. Ting and J. Bailey. Mining Minimal Contrast Subgraph Patterns, SDM'06.
  - N. Wale and G. Karypis. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification, ICDM'06.
  - H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by Pattern Similarity in Large Data Sets, SIGMOD'02.
  - J. Wang and G. Karypis. HARMONY: Efficiently Mining the Best Rules for Classification, SDM'05.
  - X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A Structural Clustering Algorithm for Networks, KDD'07.
  - X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining Significant Graph Patterns by Scalable Leap Search, SIGMOD'08.

64 Mining and Searching Graphs in Graph Databases

