
1 Data Mining: Principles and Algorithms. Mining Homogeneous Networks. Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. ©2013 Jiawei Han. All rights reserved.

2 Mining and Searching Graphs in Graph Databases (June 6, 2016)

3 Mining Homogeneous Networks: A Taxonomy of Link Mining Tasks; Graph Pattern Mining; Graph Classification; Graph Clustering; Summary

4 Graph, Graph, Everywhere: aspirin (a chemical compound), a yeast protein interaction network (from H. Jeong et al., Nature 411:41, 2001), the Internet, and a co-author network

5 A Taxonomy of Link Mining Tasks. Object-Related Tasks: link-based object ranking; link-based object classification; object clustering (group detection); object identification (entity resolution). Link-Related Tasks: link prediction. Graph-Related Tasks: subgraph discovery; graph classification; generative models for graphs.

6 Link-Based Object Ranking (LBR). Exploit the link structure of a graph to order or prioritize the set of objects within the graph. Focused on graphs with a single object type and a single link type. A primary focus of the link analysis community. Web information analysis: PageRank and HITS are typical LBR approaches. In social network analysis (SNA), LBR is a core analysis task. Objective: rank individuals in terms of “centrality” (degree centrality vs. eigenvector/power centrality); rank objects relative to one or more relevant objects in the graph, or rank objects over time in dynamic graphs.
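The PageRank idea mentioned above can be sketched with a few lines of power iteration (a minimal illustration, not the production algorithm; the toy graph and the damping factor d = 0.85 are assumptions, and dangling nodes are ignored for brevity):

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank: every node starts with rank 1/N; each
    round, a node keeps (1-d)/N and receives a d-weighted share of the
    rank of every node linking to it."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in adj}
        for u, outs in adj.items():
            for v in outs:
                new[v] += d * rank[u] / len(outs)
        rank = new
    return rank

# Toy web graph: 'c' is linked to by both 'a' and 'b', so it ranks highest.
g = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
```

Running `pagerank(g)` ranks 'c' above 'a' and 'a' above 'b', illustrating how in-links from well-ranked pages raise a page's own rank.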

7 Block-Level Link Analysis (Cai et al., 2004). Most existing link analysis algorithms, e.g., PageRank and HITS, treat a web page as a single node in the web graph. However, in most cases a web page contains multiple semantics and hence should not be treated as an atomic, homogeneous node. A web page is partitioned into blocks using the vision-based page segmentation algorithm; page-to-block and block-to-page relationships are extracted, yielding block-level PageRank and block-level HITS.

8 Link-Based Object Classification (LBC). Predicting the category of an object based on its attributes, its links, and the attributes of linked objects. Web: predict the category of a web page based on the words that occur on the page, links between pages, anchor text, HTML tags, etc. Citation: predict the topic of a paper based on word occurrence, citations, and co-citations. Epidemics: predict the disease type based on characteristics of the patients infected by the disease. Communication: predict whether a communication contact is by email, phone call, or mail.

9 Challenges in Link-Based Classification. Labels of related objects tend to be correlated. Collective classification: exploit such correlations and jointly infer the categorical values associated with the objects in the graph. Ex.: classifying related news items in the Reuters data sets (Chak’98): simply incorporating words from neighboring documents is not helpful. Multi-relational classification is another solution for link-based classification.

10 10 Group Detection Cluster the nodes in the graph into groups that share common characteristics Web: identifying communities Citation: identifying research communities Methods Hierarchical clustering Blockmodeling of SNA Spectral graph partitioning Stochastic blockmodeling Multi-relational clustering 10

11 11 Entity Resolution Predicting when two objects are the same, based on their attributes and their links Also known as: deduplication, reference reconciliation, co- reference resolution, object consolidation Applications Web: predict when two sites are mirrors of each other Citation: predicting when two citations are referring to the same paper Epidemics: predicting when two disease strains are the same Biology: learning when two names refer to the same protein 11

12 Entity Resolution Methods. Earlier work viewed it as a pair-wise resolution problem: pairs are resolved based on the similarity of their attributes. Importance of considering links: coauthor links in bibliographic data, hierarchical links between spatial references, co-occurrence links between name references in documents. Use of links in resolution: collective entity resolution (one resolution decision affects another if they are linked); propagating evidence over links in a dependency graph; probabilistic models that capture interactions between different entity resolution decisions.

13 13 Link Prediction Predict whether a link exists between two entities, based on attributes and other observed links Applications Web: predict if there will be a link between two pages Citation: predicting if a paper will cite another paper Epidemics: predicting who a patient’s contacts are Methods Often viewed as a binary classification problem Local conditional probability model, based on structural and attribute features Difficulty: sparseness of existing links Collective prediction, e.g., Markov random field model 13
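As a concrete illustration of a structural feature that a link-prediction classifier might use, the sketch below scores non-adjacent node pairs by their number of common neighbors (a hypothetical toy, not the models cited on the slide; real systems would combine several such structural and attribute features):

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Score each non-adjacent node pair by its number of common
    neighbors, a simple structural feature for link prediction."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(set(adj[u]) & set(adj[v]))
    return scores

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Here the candidate links (0, 3) and (1, 3) each get score 1; sparseness of existing links is visible even in this tiny example, which is why collective prediction methods are attractive.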

14 Link Cardinality Estimation. Predicting the number of links to an object. Web: predict the authority of a page based on the number of in-links; identify hubs based on the number of out-links. Citation: predicting the impact of a paper based on the number of citations. Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease. Predicting the number of objects reached along a path from an object. Web: predicting the number of pages retrieved by crawling a site. Citation: predicting the number of citations of a particular author in a specific journal.

15 15 Subgraph Discovery Find characteristic subgraphs Focus of graph-based data mining Applications Biology: protein structure discovery Communications: legitimate vs. illegitimate groups Chemistry: chemical substructure discovery Methods Subgraph pattern mining Graph classification Classification based on subgraph pattern analysis 15

16 Information Diffusion, Evolution and Contagion. Network structure and information diffusion: how information propagates across a network. Contagion and opinion formation: opinion formation based on network structures. Evolution of network structures: community formation, merging, splitting, fading, and disappearance.

17 Metadata Mining. Schema mapping, schema discovery, schema reformulation. Citation: matching between two bibliographic sources. Web: discovering schema from unstructured or semi-structured data. Bio: mapping between two medical ontologies.

18 18 Link Mining Challenges Logical vs. statistical dependencies Feature construction: Aggregation vs. selection Instances vs. classes Collective classification and collective consolidation Effective use of labeled & unlabeled data Link prediction Closed vs. open world Challenges common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, Stochastic Logic Programming to name a few) 18

19 Mining and Searching Graphs in Graph Databases

20 Mining Homogeneous Networks: A Taxonomy of Link Mining Tasks; Graph Pattern Mining; Graph Classification; Graph Clustering; Summary

21 Why Graph Pattern Mining? Graphs are ubiquitous: chemical compounds (cheminformatics); protein structures and biological pathways/networks (bioinformatics); program control flow, traffic flow, and workflow analysis; XML databases, the Web, and social network analysis. Graph is a general model: trees, lattices, sequences, and items are degenerate graphs. Diversity of graphs: directed vs. undirected, labeled vs. unlabeled (edges and vertices), weighted, with angles and geometry (topological vs. 2-D/3-D). Complexity of algorithms: many problems are of high complexity.

22 Graph Pattern Mining Frequent subgraphs A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold Applications of graph pattern mining Mining biochemical structures Program control flow analysis Mining XML structures or Web communities Building blocks for graph classification, clustering, compression, comparison, and correlation analysis 22

23 Example: Frequent Subgraphs. (Figure: a graph dataset of three graphs (A), (B), (C) and the frequent patterns (1), (2) mined with minimum support 2.)

24 Example (II). (Figure: another graph dataset and its frequent patterns with minimum support 2.)

25 Graph Mining Algorithms Incomplete beam search – Greedy (Subdue) Inductive logic programming (WARMR) Graph theory-based approaches Apriori-based approach Pattern-growth approach 25

26 SUBDUE (Holder et al., KDD’94). Start with single vertices. Expand the best substructures with a new edge. Limit the number of best substructures. Substructures are evaluated based on their ability to compress input graphs, using minimum description length (DL): the best substructure S in graph G minimizes DL(S) + DL(G\S). Terminate when no new substructure is discovered.

27 WARMR (Dehaspe et al. KDD’98) Graphs are represented by Datalog facts atomel(C, A1, c), bond (C, A1, A2, BT), atomel(C, A2, c) : a carbon atom bound to a carbon atom with bond type BT WARMR: the first general purpose ILP system Level-wise search Simulate Apriori for frequent pattern discovery 27

28 Frequent Subgraph Mining Approaches Apriori-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) PATH # : Vanetik and Gudes (ICDM’02, ICDM’04) FFSM: Huan, et al. (ICDM’03) Pattern growth approach MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) Gaston: Nijssen and Kok (KDD’04) 28

29 Properties of Graph Mining Algorithms Search order breadth vs. depth Generation of candidate subgraphs apriori vs. pattern growth Elimination of duplicate subgraphs passive vs. active Support calculation embedding store or not Discover order of patterns path  tree  graph 29

30 Apriori-Based Approach. (Figure: two frequent k-edge graphs G’ and G’’ are joined to generate (k+1)-edge candidates G1, G2, …, Gn.)

31 Apriori-Based, Breadth-First Search. Methodology: breadth-first search, joining two graphs. AGM (Inokuchi, et al., PKDD’00) generates new graphs with one more node. FSG (Kuramochi and Karypis, ICDM’01) generates new graphs with one more edge.

32 PATH (Vanetik and Gudes, ICDM’02, ’04). Apriori-based approach; the building blocks are edge-disjoint paths (e.g., a graph with 3 edge-disjoint paths). Procedure: construct frequent paths; construct frequent graphs with 2 edge-disjoint paths; construct graphs with k+1 edge-disjoint paths from graphs with k edge-disjoint paths; repeat.

33 FFSM (Huan, et al. ICDM’03) Represent graphs using canonical adjacency matrix (CAM) Join two CAMs or extend a CAM to generate a new graph Store the embeddings of CAMs All of the embeddings of a pattern in the database Can derive the embeddings of newly generated CAMs 33

34 Pattern Growth Method. (Figure: a frequent k-edge graph G is extended by one edge into (k+1)-edge graphs G1, G2, …, Gn, then into (k+2)-edge graphs, and so on; the same graph may be generated more than once, producing duplicate graphs.)

35 MoFa (Borgelt and Berthold, ICDM’02). Extend graphs by adding a new edge. Store embeddings of discovered frequent graphs: fast support calculation (also used in later algorithms such as FFSM and Gaston), but expensive in memory usage. Local structural pruning.

36 gSpan (Yan and Han, ICDM’02). Right-most extension. Theorem (completeness): the enumeration of graphs using right-most extension is complete.

37 DFS Code. Flatten a graph into a sequence using depth-first search. Example (vertices 0–4): e0: (0,1), e1: (1,2), e2: (2,0), e3: (2,3), e4: (3,1), e5: (2,4).
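The flattening can be sketched as a small DFS that emits forward edges in discovery order and backward edges as soon as they close a cycle (a simplified illustration of a DFS code; real gSpan codes use labeled 5-tuples and the rightmost-path rule, and the adjacency lists here are assumptions chosen to reproduce the slide's example):

```python
def dfs_code(adj, start=0):
    """Flatten a graph into a DFS edge sequence: forward edges in
    discovery order, backward edges emitted as soon as they close a
    cycle to an already-visited vertex."""
    order = {start: 0}            # discovery order of each vertex
    code, seen = [], set()
    def visit(u):
        # visit already-discovered neighbors (backward edges) first
        for v in sorted(adj[u], key=lambda w: order.get(w, len(adj))):
            e = frozenset((u, v))
            if e in seen:
                continue
            seen.add(e)
            if v in order:        # backward edge: closes a cycle
                code.append((u, v))
            else:                 # forward edge: discovers a new vertex
                order[v] = len(order)
                code.append((u, v))
                visit(v)
    visit(start)
    return code

# Adjacency lists for the 5-vertex example graph on the slide.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3, 4], 3: [1, 2], 4: [2]}
```

On this graph the function reproduces the sequence e0 through e5 listed above.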

38 DFS Lexicographic Order. Let Z be the set of DFS codes of all graphs. Two DFS codes a = (a0, a1, …, am) and b = (b0, b1, …, bn) satisfy a <= b (DFS lexicographic order in Z) if and only if one of the following conditions is true: (i) there exists t, 0 <= t <= min(m, n), such that ak = bk for all k < t and at < bt; (ii) ak = bk for all k, 0 <= k <= m, and m <= n (a is a prefix of b).
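Conditions (i) and (ii) are exactly standard lexicographic order on sequences, which Python's tuple comparison implements directly (the edge 5-tuples below are made-up examples, not from the deck):

```python
def dfs_code_leq(a, b):
    """DFS lexicographic order: a <= b iff (i) a is strictly smaller at
    the first position where the codes differ, or (ii) a is a prefix of
    b. Python's built-in tuple comparison implements exactly this."""
    return tuple(a) <= tuple(b)

# Edges written as (i, j, label_i, edge_label, label_j) 5-tuples.
a = ((0, 1, 'C', '-', 'C'), (1, 2, 'C', '-', 'O'))
b = ((0, 1, 'C', '-', 'C'), (1, 2, 'C', '-', 'O'), (2, 0, 'O', '-', 'C'))
c = ((0, 1, 'C', '-', 'N'),)
```

Here a <= b by condition (ii) (prefix) and a <= c by condition (i) ('C' < 'N' at the first difference).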

39 DFS Code Extension. Let a be the minimum DFS code of a graph G and b a non-minimum DFS code of G. For any DFS code d generated from b by one right-most extension: (i) d is not a minimum DFS code; (ii) min_dfs(d) cannot be extended from b; (iii) min_dfs(d) is either less than a or can be extended from a. Theorem (right-extension): the DFS code of a graph extended from a non-minimum DFS code is not minimum.

40 GASTON (Nijssen and Kok KDD’04) Extend graphs directly Store embeddings Separate the discovery of different types of graphs path  tree  graph Simple structures are easier to mine and duplication detection is much simpler 40

41 Graph Pattern Explosion Problem. If a graph is frequent, all of its subgraphs are frequent (the Apriori property). An n-edge frequent graph may have 2^n subgraphs. Among 422 chemical compounds confirmed active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%.

42 Closed Frequent Graphs Motivation: Handling graph pattern explosion problem Closed frequent graph A frequent graph G is closed if there exists no supergraph of G that carries the same support as G If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) Lossless compression: still ensures that the mining result is complete 42
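The closedness test can be illustrated with itemsets standing in for subgraphs (a toy sketch; for graphs, the strict-subset test would be a subgraph-isomorphism check):

```python
def closed_patterns(patterns):
    """Keep only closed patterns: drop any pattern that has a strict
    superset with the same support."""
    return [(p, sup) for p, sup in patterns
            if not any(p < q and sup == sup_q for q, sup_q in patterns)]

patterns = [
    (frozenset('A'), 3),
    (frozenset('B'), 3),
    (frozenset('AB'), 3),    # same support as A and B, so A and B are not closed
    (frozenset('ABC'), 2),
]
```

Only {A,B} (support 3) and {A,B,C} (support 2) survive; {A} and {B} are dropped because {A,B} carries the same support, which is exactly the lossless compression the slide describes.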

43 CloseGraph (Yan & Han, KDD’03): a pattern-growth approach. Under what condition can we stop growing a frequent k-edge graph G into its (k+1)-edge children, i.e., terminate early? Suppose G and G’ are frequent and G is a subgraph of G’. If, in every part of every graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’.

44 Handling Tricky Exception Cases. (Figure: two example graphs and two patterns over vertices a, b, c, d illustrating the exception cases.)

45 Experimental Result. The AIDS antiviral screen compound dataset from NCI/NIH contains 43,905 chemical compounds. Among these 43,905 compounds, 423 belong to class CA, 1,081 to class CM, and the remainder to class CI.

46 Discovered Patterns. (Figure: example patterns discovered at minimum support 20%, 10%, and 5%.)

47 Performance (1): Run Time. (Figure: run time per pattern, in msec, vs. minimum support, in %.)

48 Performance (2): Memory Usage. (Figure: memory usage, in GB, vs. minimum support, in %.)

49 Number of Patterns: Frequent vs. Closed. (Figure: number of patterns vs. minimum support on the CA dataset.)

50 Runtime: Frequent vs. Closed. (Figure: run time, in seconds, vs. minimum support on the CA dataset.)

51 Do the Odds Beat the Curse of Complexity? Potentially exponential number of frequent patterns: the worst-case complexity vs. the expected probability. Ex.: suppose Walmart has 10^4 kinds of products. The chance to pick up one product: 10^-4. The chance to pick up a particular set of 10 products: 10^-40. What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions? Have we solved the NP-hard problem of subgraph isomorphism testing? No. But the real graphs in biology and chemistry are not so bad: a carbon atom has only 4 bonds, and most proteins in a network have distinct labels.
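The arithmetic behind the Walmart example can be checked directly (plain calculation, nothing beyond the numbers on the slide):

```python
# Back-of-envelope check of the Walmart example.
kinds = 10 ** 4                    # 10^4 kinds of products
p_item = 1 / kinds                 # chance of picking one product: 10^-4
p_set = p_item ** 10               # a particular set of 10 products: 10^-40
transactions = 10 ** 9
expected = transactions * p_set    # expected occurrences of that set
# expected is about 10^-31, astronomically below the 10^3 occurrences
# the set would need in order to be frequent.
```

This is the sense in which the expected probability beats the worst-case complexity: a random large pattern is effectively never frequent.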

52 Graph Search. Querying graph databases: given a graph database and a query graph, find all the graphs containing this query graph. (Figure: a query graph and a graph database.)

53 Scalability Issue. A sequential scan incurs disk I/Os and subgraph isomorphism testing, so an indexing mechanism is needed. DayLight: Daylight.com (commercial). GraphGrep: Dennis Shasha, et al., PODS’02. Grace: Srinath Srinivasa, et al., ICDE’03.

54 Indexing Strategy. Key observation: if graph G contains query graph Q, then G contains every substructure of Q. Remark: index substructures of a query graph to prune graphs that do not contain these substructures.

55 Indexing Framework Two steps in processing graph queries Step 1. Index Construction Enumerate structures in the graph database, build an inverted index between structures and graphs Step 2. Query Processing Enumerate structures in the query graph Calculate the candidate graphs containing these structures Prune the false positive answers by performing subgraph isomorphism test 55
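The two steps can be sketched with an inverted index over opaque string features (a toy illustration; real systems enumerate paths or discriminative subgraphs, and the surviving candidates still require a subgraph isomorphism test):

```python
def build_index(db_features):
    """Step 1 (index construction): inverted index mapping each feature
    to the set of graph ids that contain it."""
    index = {}
    for gid, feats in db_features.items():
        for f in feats:
            index.setdefault(f, set()).add(gid)
    return index

def candidates(index, query_feats, all_ids):
    """Step 2 (query processing): intersect posting lists; a graph
    missing any query feature cannot contain the query. The survivors
    C_q still need a subgraph isomorphism test (not shown)."""
    cand = set(all_ids)
    for f in query_feats:
        cand &= index.get(f, set())
    return cand

# Toy database of three graphs described by their path features.
db = {'a': {'C', 'C-C', 'C-C-C'},
      'b': {'C', 'C-C'},
      'c': {'C', 'C-C', 'C-C-C', 'C-N'}}
idx = build_index(db)
```

A query with features {'C-C-C'} keeps candidates {a, c}; adding 'C-N' narrows the candidates to {c}, showing how each extra indexed feature prunes the answer set.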

56 Cost Analysis. Query response time = (time to fetch the index) + |Cq| × (time per subgraph isomorphism test), where Cq is the set of candidate graphs. Remark: make |Cq| as small as possible.

57 Path-based Approach. Build an inverted index between paths and graphs. (Graph database: graphs (a), (b), (c).) Paths: 0-length: C, O, N, S; 1-length: C-C, C-O, C-N, C-S, N-N, S-O; 2-length: C-C-C, C-O-C, C-N-C, …; 3-length: …

58 Path-based Approach (cont.). For the query graph: 0-edge: S_C = {a, b, c}, S_N = {a, b, c}; 1-edge: S_C-C = {a, b, c}, S_C-N = {a, b, c}; 2-edge: S_C-N-C = {a, b}, … Intersecting these sets, we obtain the candidate answers, graphs (a) and (b), which may contain this query graph.

59 Problems: Path-based Approach. (Graph database: graphs (a), (b), (c); a query graph.) Only graph (c) contains the query graph. However, if we only index the paths C, C-C, C-C-C, C-C-C-C, we cannot prune graphs (a) and (b).

60 gIndex: Indexing Graphs by Data Mining Our methodology on graph index: Identify frequent structures in the database, the frequent structures are subgraphs that appear quite often in the graph database Prune redundant frequent structures to maintain a small set of discriminative structures Create an inverted index between discriminative frequent structures and graphs in the database 60

61 IDEAS: Indexing with Two Constraints. All structures (>10^6) → frequent structures (~10^5) → discriminative structures (~10^3).

62 Why Discriminative Subgraphs? All graphs contain the structures C, C-C, C-C-C; why bother indexing these redundant frequent structures? Only index structures that provide more information than existing structures. (Figure: a sample database of graphs (a), (b), (c).)

63 Discriminative Structures. Pinpoint the most useful frequent structures: given a set of already-selected structures and a new structure x, measure the extra indexing power provided by x; when the presence of x is poorly predicted by the selected structures, x is a discriminative structure and should be included in the index. Index discriminative frequent structures only: this reduces the index size by an order of magnitude.

64 Why Frequent Structures? We cannot index (or even search) all substructures. Large structures will likely be indexed well by their substructures. Size-increasing support threshold. (Figure: the minimum support threshold increases with structure size.)

65 Experimental Setting The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds Query graphs are randomly extracted from the dataset GraphGrep: maximum length (edges) of paths is set at 10 gIndex: maximum size (edges) of structures is set at 10 65

66 Experiments: Index Size. (Figure: number of features vs. database size.)

67 Experiments: Answer Set Size. (Figure: number of candidates vs. query size.)

68 Experiments: Incremental Maintenance. Frequent structures are stable under database updates. The index can be built from a small portion of a graph database but used for the whole database.

69

70 Mining Homogeneous Networks: A Taxonomy of Link Mining Tasks; Graph Pattern Mining; Graph Classification; Graph Clustering; Summary

71 Substructure-Based Graph Classification  Data: graph data with labels, e.g., chemical compounds, software behavior graphs, social networks  Basic idea  Extract graph substructures  Represent a graph with a feature vector x = (x1, x2, …, xn), where xi is the frequency of the i-th substructure in that graph  Build a classification model  Different features and representative work  Fingerprint  Maccs keys  Tree and cyclic patterns [Horvath et al., KDD’04]  Minimal contrast subgraphs [Ting and Bailey, SDM’06]  Frequent subgraphs [Deshpande et al., TKDE’05; Liu et al., SDM’05]  Graph fragments [Wale and Karypis, ICDM’06]
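The feature-vector construction can be sketched in a few lines (a toy illustration; the vocabulary and occurrence list below are made up, and a real pipeline would obtain occurrences from a subgraph miner such as gSpan):

```python
def feature_vector(occurrences, vocabulary):
    """Map a graph to x = (x1, ..., xn), where xi counts how often the
    i-th substructure occurs in the graph. Occurrences are given as a
    list of substructure names."""
    counts = {}
    for f in occurrences:
        counts[f] = counts.get(f, 0) + 1
    return [counts.get(f, 0) for f in vocabulary]

# Hypothetical substructure vocabulary and one graph's occurrences.
vocab = ['C-C', 'C-O', 'C-N', 'benzene']
x = feature_vector(['C-C', 'C-C', 'C-O', 'benzene'], vocab)
```

The resulting vector feeds any standard classifier; the choice of vocabulary (fingerprints, Maccs keys, frequent subgraphs, graph fragments) is what distinguishes the methods listed above.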

72 Fingerprints (fp-n). Enumerate all paths up to length l and certain cycles in each chemical compound, then hash each feature to one or more positions in a fixed-length bit-vector (positions 1 … n). (Courtesy of Nikil Wale.)

73 Maccs Keys (MK). A domain expert identifies fragments “important” for bioactivity; each fragment forms a fixed dimension in the descriptor space. (Courtesy of Nikil Wale.)

74 Cycles and Trees (CT) [Horvath et al., KDD’04]. Identify the bi-connected components of the chemical compound (bounded cyclicity, using a fixed number of cycles), delete the bi-connected components, and keep the left-over trees. (Courtesy of Nikil Wale.)

75 Frequent Subgraphs (FS) [Deshpande et al., TKDE’05]. Discover features by frequent subgraph discovery over the chemical compounds with a minimum support; topological features are captured by the graph representation. (Figure: discovered subgraphs with supports such as 40% in positives vs. 0% in negatives, 30% vs. 5%, and 1% vs. 30%.) (Courtesy of Nikil Wale.)

76 Graph Fragments (GF) [Wale & Karypis, ICDM’06]. Tree Fragments (TF): at least one node of the tree fragment has a degree greater than 2 (no cycles). Path Fragments (PF): all nodes have degree less than or equal to 2, with no cycles. Acyclic Fragments (AF): TF ∪ PF; acyclic fragments are also termed free trees. (Courtesy of Nikil Wale.)

77 Comparison of Different Features [Wale & Karypis, ICDM’06] 77

78 Minimal Contrast Subgraphs [Ting and Bailey, SDM’06] A contrast graph is a subgraph appearing in one class of graphs and never in another class of graphs Minimal if none of its subgraphs are contrasts May be disconnected Allows succinct description of differences But requires larger search space Mining Contrast Subgraphs Find the maximal common edge sets These may be disconnected Apply a minimal hypergraph transversal operation to derive the minimal contrast edge sets from the maximal common edge sets Must compute minimal contrast vertex sets separately and then minimal union with the minimal contrast edge sets Courtesy of Bailey and Dong 78

79 Frequent Subgraph-Based Classification [Deshpande et al., TKDE’05] Frequent subgraphs A graph is frequent if its support (occurrence freq.) in a given dataset is no less than a min. support threshold Feature generation Frequent topological subgraphs by FSG Frequent geometric subgraphs with 3D shape information Feature selection Sequential covering paradigm Classification Use SVM to learn a classifier based on feature vectors Assign different misclassification costs for different classes to address skewed class distribution 79

80 Varying Minimum Support 80

81 Varying Misclassification Cost 81

82 All graph substructures up to a given length (size or # of bonds) Determined dynamically → Dataset dependent descriptor space Complete coverage → Descriptors for every compound Precise representation → One to one mapping Complex fragments → Arbitrary topology Recurrence relation to generate graph fragments of length l Graph Fragment [Wale and Karypis, ICDM’06] Courtesy of Nikil Wale 82

83 Performance Comparison (ICDM’08 tutorial)

84 Re-examination of Pattern-Based Classification. Positive and negative training instances, together with test instances, pass through pattern-based feature construction (computationally expensive!) into a transformed feature space; a prediction model is then learned and applied.

85 Computational Bottleneck. Two-step approach (expensive): mine the full set of frequent patterns (10^4 to 10^6) from the data, then filter for discriminative patterns. Direct mining (efficient): transform the data (e.g., into an FP-tree) and mine discriminative patterns directly.

86 Challenge: Non-Anti-Monotonic. Discriminative measures are neither monotonic nor anti-monotonic in pattern size. Should we enumerate all subgraphs, from small size to large size, and then check their scores?

87 Direct Mining of Discriminative Patterns Avoid mining the whole set of patterns Harmony [Wang and Karypis, SDM’05] DDPMine [Cheng et al., ICDE’08] LEAP [Yan et al., SIGMOD’08] MbT [Fan et al., KDD’08] Find the most discriminative pattern A search problem? An optimization problem? Extensions Mining top-k discriminative patterns Mining approximate/weighted discriminative patterns 87

88 Mining Most Significant Graph with Leap Search [Yan et al., SIGMOD’08] Objective functions 88

89 Upper-Bound 89

90 Upper-Bound: Anti-Monotonic. Rule of thumb: if the frequency difference of a graph pattern between the positive dataset and the negative dataset increases, the pattern becomes more interesting. We can recycle existing graph mining algorithms to accommodate non-monotonic functions.

91 Structural Similarity. Sibling patterns (e.g., size-4, size-5, and size-6 graphs) that are structurally similar tend to have similar significance: structural similarity  significance similarity.

92 Structural Leap Search. Let g be a discovered graph and g’ a sibling of g. Leap over the g’ subtree when g’ is sufficiently similar to g, as controlled by the leap length, a tolerance on structure/frequency dissimilarity. The search alternates a mining part and a leap part.

93 Frequency Association. Association between a pattern’s frequency and its objective score: start with a high frequency threshold and gradually decrease it.

94 LEAP Algorithm. 1. Structural leap search with a frequency threshold. 2. Support-descending mining, until F(g*) converges. 3. Branch-and-bound search with F(g*).

95 Branch-and-Bound vs. LEAP. Pruning base: branch-and-bound uses the parent-child bound (“vertical”), strict pruning; LEAP uses sibling similarity (“horizontal”), approximate pruning. Feature optimality: guaranteed vs. near optimal. Efficiency: good vs. better.

96 NCI Anti-Cancer Screen Datasets (data description)
Name | Assay ID | Size | Tumor Description
MCF-7 | 83 | 27,770 | Breast
MOLT-4 | 123 | 39,765 | Leukemia
NCI-H23 | 1 | 40,353 | Non-Small Cell Lung
OVCAR-8 | 109 | 40,516 | Ovarian
P388 | 330 | 41,472 | Leukemia
PC-3 | 41 | 27,509 | Prostate
SF-295 | 47 | 40,271 | Central Nerve System
SN12C | 145 | 40,004 | Renal
SW-620 | 81 | 40,532 | Colon
UACC257 | 33 | 39,988 | Melanoma
YEAST | 167 | 79,601 | Yeast anti-cancer

97 Efficiency Tests. (Figures: search efficiency, and search quality measured by the G-test.)

98 Mining Quality: Graph Classification (AUC and runtime; OA Kernel has a scalability problem)
Name | OA Kernel* | LEAP | OA Kernel (6x) | LEAP (6x)
MCF-7 | 0.68 | 0.67 | 0.75 | 0.76
MOLT-4 | 0.65 | 0.66 | 0.69 | 0.72
NCI-H23 | 0.79 | 0.76 | 0.77 | 0.79
OVCAR-8 | 0.67 | 0.72 | 0.79 | 0.78
P388 | 0.79 | 0.82 | - | 0.81
PC-3 | 0.66 | 0.69 | 0.79 | 0.76
Average | 0.70 | 0.72 | 0.75 | 0.77
* OA Kernel: Optimal Assignment Kernel [Frohlich et al., ICML’05]; LEAP: leap search.

99

100 Mining Homogeneous Networks: A Taxonomy of Link Mining Tasks; Graph Pattern Mining; Graph Classification; Graph Clustering; Summary

101 Graph Compression Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes 101

102 Graph/Network Clustering Problem X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, Proc. 2007 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'07), San Jose, CA, Aug. 2007 Networks made up of the mutual relationships of data elements usually have an underlying structure Because relationships are complex, it is difficult to discover these structures. How can the structure be made clear? Given simply information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)? 102

103 Graph Clustering: Sparsest Cut. G = (V, E). A cut partitions V into S and T; its cut set is the set of edges {(u, v) ∈ E | u ∈ S, v ∈ T}. Size of the cut: the number of edges in the cut set. A min-cut (e.g., C1) is not necessarily a good partition. A better measure is sparsity: the sparsity of a cut is its size divided by min(|S|, |T|), and a cut is sparsest if its sparsity is not greater than that of any other cut. Ex.: the cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut. For k clusters, the modularity of a clustering assesses its quality: the modularity of a clustering of a graph is the difference between the fraction of all edges that fall into individual clusters and the fraction that would do so if the graph vertices were randomly connected, Q = Σi (li/|E| − (di/(2|E|))^2), where li is the number of edges between vertices in the i-th cluster and di is the sum of the degrees of the vertices in the i-th cluster. The optimal clustering of a graph maximizes the modularity.
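The modularity definition can be checked on a toy graph of two triangles joined by a single edge (a minimal sketch; the vertex and cluster labels are made up):

```python
def modularity(edges, clusters):
    """Q = sum_i (l_i/|E| - (d_i/(2|E|))^2): fraction of edges inside
    each cluster minus the fraction expected under random wiring.
    `clusters` maps each vertex to its cluster id."""
    m = len(edges)
    l, d = {}, {}
    for u, v in edges:
        d[clusters[u]] = d.get(clusters[u], 0) + 1
        d[clusters[v]] = d.get(clusters[v], 0) + 1
        if clusters[u] == clusters[v]:
            l[clusters[u]] = l.get(clusters[u], 0) + 1
    return sum(l.get(i, 0) / m - (d[i] / (2 * m)) ** 2 for i in d)

# Two triangles joined by one edge; the natural 2-clustering.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (0, 3)]
clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

The natural two-cluster split scores Q = 5/14 ≈ 0.357, while lumping all six vertices into one cluster scores 0, matching the intuition that modularity rewards edges concentrated inside clusters.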

104 Graph Clustering: Challenges of Finding Good Cuts. High computational cost: many graph cut problems are computationally expensive; the sparsest cut problem is NP-hard; we need to trade off between efficiency/scalability and quality. Sophisticated graphs: may involve weights and/or cycles. High dimensionality: a graph can have many vertices; in a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph. Sparsity: a large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices; a similarity matrix from a large sparse graph can also be sparse.

105 Two Approaches for Graph Clustering Two approaches for clustering graph data Use generic clustering methods for high-dimensional data Designed specifically for clustering graphs Using clustering methods for high-dimensional data Extract a similarity matrix from a graph using a similarity measure A generic clustering method can then be applied on the similarity matrix to discover clusters Ex. Spectral clustering: approximate optimal graph cut solutions Methods specific to graphs Search the graph to find well-connected components as clusters Ex. SCAN (Structural Clustering Algorithm for Networks) X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, KDD'07 105

106 An Example of Networks How many clusters? What size should they be? What is the best partitioning? Should some points be segregated? 106

107 A Social Network Model Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group. Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example bridge multiple groups. Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group. 107

108 The Neighborhood of a Vertex. Define Γ(v) as the immediate neighborhood of a vertex v, i.e., the set of people that an individual knows (in SCAN, the adjacent vertices of v together with v itself).

109 Structure Similarity. The desired features tend to be captured by a measure we call structural similarity: σ(v, w) = |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| · |Γ(w)|). Structural similarity is large for members of a clique and small for hubs and outliers.
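The similarity measure is a few lines of code (a sketch; the toy graph of a triangle plus a pendant vertex is an assumption chosen to show the clique-vs-outlier contrast):

```python
import math

def gamma(adj, v):
    """Closed neighborhood: vertex v together with its neighbors."""
    return set(adj[v]) | {v}

def sigma(adj, v, w):
    """SCAN structural similarity: shared neighborhood size, normalized
    by the geometric mean of the two neighborhood sizes. Large within a
    clique, small for hubs and outliers."""
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

# Triangle 0-1-2 plus a pendant vertex 3 attached only to 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
```

Inside the triangle, sigma(0, 1) ≈ 0.87; toward the pendant vertex, sigma(0, 3) ≈ 0.71, so thresholding on σ separates clique members from peripheral vertices.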

110 Structural Connectivity [1]. ε-neighborhood: Nε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}. Core: v is a core if |Nε(v)| ≥ μ. Direct structure reachable: w is directly structure reachable from a core v if w ∈ Nε(v). Structure reachable: the transitive closure of direct structure reachability. Structure connected: u and v are structure connected if both are structure reachable from some vertex x. [1] M. Ester, H. P. Kriegel, J. Sander, and X. Xu (KDD’96), “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases”.

111 Structure-Connected Clusters. A structure-connected cluster C satisfies connectivity (any two vertices in C are structure connected) and maximality (C contains all vertices structure reachable from its cores). Hubs: belong to no cluster but bridge many clusters. Outliers: belong to no cluster and connect to fewer clusters.

112-124 Algorithm (animated example). SCAN run on a 14-vertex example network (vertices 0-13) with parameters μ = 2 and ε = 0.7: slide by slide, the algorithm computes structural similarities edge by edge (values such as 0.63, 0.75, 0.67, 0.82, 0.73, 0.51, and 0.68 appear in the animation), grows clusters around core vertices, and leaves the remaining vertices as hubs or outliers.

125 Running Time Running time = O(|E|) For sparse networks = O(|V|) [2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004). 125

126 Summary: Mining Homogeneous Networks A Taxonomy of Link Mining Tasks Graph Pattern Mining Graph Classification Graph Clustering 126

127

128 References: Graph Pattern Mining (1) T. Asai, et al. “Efficient substructure discovery from large semi-structured data”, SDM'02 C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02 M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds”, ICDM 2003 M. Deshpande, M. Kuramochi, and G. Karypis. “Automated approaches for classifying structures”, BIOKDD'02 L. Dehaspe, H. Toivonen, and R. King. “Finding frequent substructures in chemical compounds”, KDD'98 C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of 'Connection Subgraphs”, KDD'04 L. Holder, D. Cook, and S. Djoko. “Substructure discovery in the subdue system”, KDD'94 J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. “Mining spatial motifs from protein structure graphs”, RECOMB’04 J. Huan, W. Wang, and J. Prins. “Efficient mining of frequent subgraph in the presence of isomorphism”, ICDM'03 H. Hu, X. Yan, Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, ISMB'05 A. Inokuchi, T. Washio, and H. Motoda. “An apriori-based algorithm for mining frequent substructures from graph data”, PKDD'00 C. James, D. Weininger, and J. Delany. “Daylight Theory Manual Daylight Version 4.82”. Daylight Chemical Information Systems, Inc., 2003. G. Jeh, and J. Widom, “Mining the Space of Graph Properties”, KDD'04 M. Koyuturk, A. Grama, and W. Szpankowski. “An efficient algorithm for detecting frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207, 2004. 128

129 References: Graph Pattern Mining (2) M. Kuramochi and G. Karypis. “Frequent subgraph discovery”, ICDM'01 M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery Algorithm”, ICDM’04 B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45--87, 1981. S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference. KDD'04 J. Prins, J. Yang, J. Huan, and W. Wang. “Spin: Mining maximal frequent subgraphs from graph databases”. KDD'04 D. Shasha, J. T.-L. Wang, and R. Giugno. “Algorithmics and applications of tree and graph searching”, PODS'02 J. R. Ullmann. “An algorithm for subgraph isomorphism”, J. ACM, 23:31--42, 1976. N. Vanetik, E. Gudes, and S. E. Shimony. “Computing frequent graph patterns from semistructured data”, ICDM'02 C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. “Scalable mining of large disk-based graph databases”, KDD'04 T. Washio and H. Motoda, “State of the art of graph-based data mining”, SIGKDD Explorations, 5:59-68, 2003 X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, ICDM'02 X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03 X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-based Approach”, SIGMOD'04 X. Yan, X. J. Zhou, and J. Han, “Mining Closed Relational Graphs with Connectivity Constraints”, KDD'05 X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05 X. Yan, F. Zhu, J. Han, and P. S. Yu, “Searching Substructures with Superimposed Distance”, ICDE'06 M. J. Zaki. “Efficiently mining frequent trees in a forest”, KDD'02 P. Zhao and J. Han, “On Graph Query Optimization in Large Networks”, VLDB'10 129

130 References: Graph Classification (1) G. Cong, K. Tan, A. Tung, and X. Xu. Mining Top-k Covering Rule Groups for Gene Expression Data, SIGMOD’05. M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent Substructure-based Approaches for Classifying Chemical Compounds, TKDE’05. G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences, KDD’99. G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns, DS’99. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.), John Wiley & Sons, 2001. W. Fan, K. Zhang, H. Cheng, J. Gao, X. Yan, J. Han, P. S. Yu, and O. Verscheure. Direct Mining of Discriminative and Essential Graphical and Itemset Features via Model-based Search Tree, KDD’08. D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational Networks”, PKDD'05. H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed Molecular Graphs”, ICML’05. T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives”, COLT/Kernel’03. H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between Labeled Graphs”, ICML’03. T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph Classification”, NIPS’04. C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for ‘Backtrace’ of Noncrashing Bugs”, SDM'05. P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph Kernels”, ICML’04. 130

131 References: Graph Classification (2) T. Horvath, T. Gärtner, and S. Wrobel. Cyclic Pattern Kernels for Predictive Graph Mining, KDD’04. T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS’04. W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification based on Multiple Class-association Rules, ICDM’01. B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining, KDD’98. H. Liu, J. Han, D. Xin, and Z. Shao. Mining Frequent Patterns on Very High Dimensional Data: A Top-down Row Enumeration Approach, SDM’06. S. Nijssen and J. Kok. A Quickstart in Frequent Structure Mining Can Make a Difference, KDD’04. F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets, KDD’03. F. Pan, A. Tung, G. Cong, and X. Xu. COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery, SSDBM’04. Y. Sun, Y. Wang, and A. K. C. Wong. Boosting an Associative Classifier, TKDE’06. P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns, KDD’02. R. Ting and J. Bailey. Mining Minimal Contrast Subgraph Patterns, SDM’06. N. Wale and G. Karypis. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification, ICDM’06. H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by Pattern Similarity in Large Data Sets, SIGMOD’02. J. Wang and G. Karypis. HARMONY: Efficiently Mining the Best Rules for Classification, SDM’05. X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A Structural Clustering Algorithm for Networks, KDD'07. X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining Significant Graph Patterns by Scalable Leap Search, SIGMOD’08. 131

