
1 Data Mining: Principles and Algorithms. Mining Homogeneous Networks. Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign, www.cs.uiuc.edu/~hanj. ©2013 Jiawei Han. All rights reserved.

2 Mining and Searching Graphs in Graph Databases (June 6, 2016)

3 Mining Homogeneous Networks: A Taxonomy of Link Mining Tasks; Graph Pattern Mining; Graph Classification; Graph Clustering; Summary

4 Graph, Graph, Everywhere: aspirin (a chemical compound), a yeast protein interaction network (from H. Jeong et al., Nature 411:41, 2001), the Internet, and a co-author network

5 A Taxonomy of Link Mining Tasks. Object-Related Tasks: link-based object ranking; link-based object classification; object clustering (group detection); object identification (entity resolution). Link-Related Tasks: link prediction. Graph-Related Tasks: subgraph discovery; graph classification; generative models for graphs.

6 Link-Based Object Ranking (LBR). Exploit the link structure of a graph to order or prioritize the set of objects within the graph. Focused on graphs with a single object type and a single link type. A primary focus of the link analysis community. Web information analysis: PageRank and HITS are typical LBR approaches. In social network analysis (SNA), LBR is a core analysis task. Objective: rank individuals in terms of “centrality” (degree centrality vs. eigenvector/power centrality); rank objects relative to one or more relevant objects in the graph, or rank objects over time in dynamic graphs.
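The PageRank idea mentioned above can be sketched with a few lines of power iteration (a minimal illustration, not the production algorithm; the toy graph and the damping factor d = 0.85 are assumptions, and dangling nodes are ignored for brevity):

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank: every node starts with rank 1/N; each
    round, a node keeps (1-d)/N and receives a d-weighted share of the
    rank of every node linking to it."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in adj}
        for u, outs in adj.items():
            for v in outs:
                new[v] += d * rank[u] / len(outs)
        rank = new
    return rank

# Toy web graph: 'c' is linked to by both 'a' and 'b', so it ranks highest.
g = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
```

Running `pagerank(g)` ranks 'c' above 'a' and 'a' above 'b', illustrating how in-links from well-ranked pages raise a page's own rank.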

7 Block-Level Link Analysis (Cai et al., 2004). Most existing link analysis algorithms, e.g., PageRank and HITS, treat a web page as a single node in the web graph. However, in most cases a web page contains multiple semantics and hence should not be treated as an atomic, homogeneous node. A web page is partitioned into blocks using the vision-based page segmentation algorithm; page-to-block and block-to-page relationships are extracted, yielding block-level PageRank and block-level HITS.

8 Link-Based Object Classification (LBC). Predicting the category of an object based on its attributes, its links, and the attributes of linked objects. Web: predict the category of a web page based on the words that occur on the page, links between pages, anchor text, HTML tags, etc. Citation: predict the topic of a paper based on word occurrence, citations, and co-citations. Epidemics: predict the disease type based on characteristics of the patients infected by the disease. Communication: predict whether a communication contact is by email, phone call, or mail.

9 Challenges in Link-Based Classification. Labels of related objects tend to be correlated. Collective classification: exploit such correlations and jointly infer the categorical values associated with the objects in the graph. Ex.: classifying related news items in the Reuters data sets (Chak’98): simply incorporating words from neighboring documents is not helpful. Multi-relational classification is another solution for link-based classification.

10 10 Group Detection Cluster the nodes in the graph into groups that share common characteristics Web: identifying communities Citation: identifying research communities Methods Hierarchical clustering Blockmodeling of SNA Spectral graph partitioning Stochastic blockmodeling Multi-relational clustering 10

11 11 Entity Resolution Predicting when two objects are the same, based on their attributes and their links Also known as: deduplication, reference reconciliation, co- reference resolution, object consolidation Applications Web: predict when two sites are mirrors of each other Citation: predicting when two citations are referring to the same paper Epidemics: predicting when two disease strains are the same Biology: learning when two names refer to the same protein 11

12 Entity Resolution Methods. Earlier work viewed it as a pair-wise resolution problem: pairs are resolved based on the similarity of their attributes. Importance of considering links: coauthor links in bibliographic data, hierarchical links between spatial references, co-occurrence links between name references in documents. Use of links in resolution: collective entity resolution (one resolution decision affects another if they are linked); propagating evidence over links in a dependency graph; probabilistic models that capture interactions between different entity resolution decisions.

13 13 Link Prediction Predict whether a link exists between two entities, based on attributes and other observed links Applications Web: predict if there will be a link between two pages Citation: predicting if a paper will cite another paper Epidemics: predicting who a patient’s contacts are Methods Often viewed as a binary classification problem Local conditional probability model, based on structural and attribute features Difficulty: sparseness of existing links Collective prediction, e.g., Markov random field model 13
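As a concrete illustration of a structural feature that a link-prediction classifier might use, the sketch below scores non-adjacent node pairs by their number of common neighbors (a hypothetical toy, not the models cited on the slide; real systems would combine several such structural and attribute features):

```python
from itertools import combinations

def common_neighbor_scores(adj):
    """Score each non-adjacent node pair by its number of common
    neighbors, a simple structural feature for link prediction."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(set(adj[u]) & set(adj[v]))
    return scores

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Here the candidate links (0, 3) and (1, 3) each get score 1; sparseness of existing links is visible even in this tiny example, which is why collective prediction methods are attractive.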

14 Link Cardinality Estimation. Predicting the number of links to an object. Web: predict the authority of a page based on the number of in-links; identify hubs based on the number of out-links. Citation: predicting the impact of a paper based on the number of citations. Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease. Predicting the number of objects reached along a path from an object. Web: predicting the number of pages retrieved by crawling a site. Citation: predicting the number of citations of a particular author in a specific journal.

15 15 Subgraph Discovery Find characteristic subgraphs Focus of graph-based data mining Applications Biology: protein structure discovery Communications: legitimate vs. illegitimate groups Chemistry: chemical substructure discovery Methods Subgraph pattern mining Graph classification Classification based on subgraph pattern analysis 15

16 Information Diffusion, Evolution and Contagion. Network structure and information diffusion: how information propagates across a network. Contagion and opinion formation: opinion formation based on network structures. Evolution of network structures: community formation, merging, splitting, fading, and disappearance.

17 Metadata Mining. Schema mapping, schema discovery, schema reformulation. Citation: matching between two bibliographic sources. Web: discovering schema from unstructured or semi-structured data. Bio: mapping between two medical ontologies.

18 18 Link Mining Challenges Logical vs. statistical dependencies Feature construction: Aggregation vs. selection Instances vs. classes Collective classification and collective consolidation Effective use of labeled & unlabeled data Link prediction Closed vs. open world Challenges common to any link-based statistical model (Bayesian Logic Programs, Conditional Random Fields, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, Stochastic Logic Programming to name a few) 18

19 Mining and Searching Graphs in Graph Databases

20 Mining Homogeneous Networks: A Taxonomy of Link Mining Tasks; Graph Pattern Mining; Graph Classification; Graph Clustering; Summary

21 Why Graph Pattern Mining? Graphs are ubiquitous: chemical compounds (cheminformatics); protein structures and biological pathways/networks (bioinformatics); program control flow, traffic flow, and workflow analysis; XML databases, the Web, and social network analysis. Graph is a general model: trees, lattices, sequences, and items are degenerate graphs. Diversity of graphs: directed vs. undirected, labeled vs. unlabeled (edges and vertices), weighted, with angles and geometry (topological vs. 2-D/3-D). Complexity of algorithms: many problems are of high complexity.

22 Graph Pattern Mining Frequent subgraphs A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold Applications of graph pattern mining Mining biochemical structures Program control flow analysis Mining XML structures or Web communities Building blocks for graph classification, clustering, compression, comparison, and correlation analysis 22

23 Example: Frequent Subgraphs. (Figure: a graph dataset of three graphs (A), (B), (C) and the frequent patterns (1), (2) mined with minimum support 2.)

24 Example (II). (Figure: another graph dataset and its frequent patterns with minimum support 2.)

25 Graph Mining Algorithms Incomplete beam search – Greedy (Subdue) Inductive logic programming (WARMR) Graph theory-based approaches Apriori-based approach Pattern-growth approach 25

26 SUBDUE (Holder et al., KDD’94). Start with single vertices. Expand the best substructures with a new edge. Limit the number of best substructures. Substructures are evaluated based on their ability to compress input graphs, using minimum description length (DL): the best substructure S in graph G minimizes DL(S) + DL(G\S). Terminate when no new substructure is discovered.

27 WARMR (Dehaspe et al. KDD’98) Graphs are represented by Datalog facts atomel(C, A1, c), bond (C, A1, A2, BT), atomel(C, A2, c) : a carbon atom bound to a carbon atom with bond type BT WARMR: the first general purpose ILP system Level-wise search Simulate Apriori for frequent pattern discovery 27

28 Frequent Subgraph Mining Approaches Apriori-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) PATH # : Vanetik and Gudes (ICDM’02, ICDM’04) FFSM: Huan, et al. (ICDM’03) Pattern growth approach MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) Gaston: Nijssen and Kok (KDD’04) 28

29 Properties of Graph Mining Algorithms Search order breadth vs. depth Generation of candidate subgraphs apriori vs. pattern growth Elimination of duplicate subgraphs passive vs. active Support calculation embedding store or not Discover order of patterns path  tree  graph 29

30 Apriori-Based Approach. (Figure: two frequent k-edge graphs G’ and G’’ are joined to generate (k+1)-edge candidates G1, G2, …, Gn.)

31 Apriori-Based, Breadth-First Search. Methodology: breadth-first search, joining two graphs. AGM (Inokuchi, et al., PKDD’00) generates new graphs with one more node. FSG (Kuramochi and Karypis, ICDM’01) generates new graphs with one more edge.

32 PATH (Vanetik and Gudes, ICDM’02, ’04). Apriori-based approach; the building blocks are edge-disjoint paths (e.g., a graph with 3 edge-disjoint paths). Procedure: construct frequent paths; construct frequent graphs with 2 edge-disjoint paths; construct graphs with k+1 edge-disjoint paths from graphs with k edge-disjoint paths; repeat.

33 FFSM (Huan, et al. ICDM’03) Represent graphs using canonical adjacency matrix (CAM) Join two CAMs or extend a CAM to generate a new graph Store the embeddings of CAMs All of the embeddings of a pattern in the database Can derive the embeddings of newly generated CAMs 33

34 Pattern Growth Method. (Figure: a frequent k-edge graph G is extended by one edge into (k+1)-edge graphs G1, G2, …, Gn, then into (k+2)-edge graphs, and so on; the same graph may be generated more than once, producing duplicate graphs.)

35 MoFa (Borgelt and Berthold, ICDM’02). Extend graphs by adding a new edge. Store embeddings of discovered frequent graphs: fast support calculation (also used in later algorithms such as FFSM and Gaston), but expensive in memory usage. Local structural pruning.

36 gSpan (Yan and Han, ICDM’02). Right-most extension. Theorem (completeness): the enumeration of graphs using right-most extension is complete.

37 DFS Code. Flatten a graph into a sequence using depth-first search. Example (vertices 0–4): e0: (0,1), e1: (1,2), e2: (2,0), e3: (2,3), e4: (3,1), e5: (2,4).
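The flattening can be sketched as a small DFS that emits forward edges in discovery order and backward edges as soon as they close a cycle (a simplified illustration of a DFS code; real gSpan codes use labeled 5-tuples and the rightmost-path rule, and the adjacency lists here are assumptions chosen to reproduce the slide's example):

```python
def dfs_code(adj, start=0):
    """Flatten a graph into a DFS edge sequence: forward edges in
    discovery order, backward edges emitted as soon as they close a
    cycle to an already-visited vertex."""
    order = {start: 0}            # discovery order of each vertex
    code, seen = [], set()
    def visit(u):
        # visit already-discovered neighbors (backward edges) first
        for v in sorted(adj[u], key=lambda w: order.get(w, len(adj))):
            e = frozenset((u, v))
            if e in seen:
                continue
            seen.add(e)
            if v in order:        # backward edge: closes a cycle
                code.append((u, v))
            else:                 # forward edge: discovers a new vertex
                order[v] = len(order)
                code.append((u, v))
                visit(v)
    visit(start)
    return code

# Adjacency lists for the 5-vertex example graph on the slide.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3, 4], 3: [1, 2], 4: [2]}
```

On this graph the function reproduces the sequence e0 through e5 listed above.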

38 DFS Lexicographic Order. Let Z be the set of DFS codes of all graphs. Two DFS codes a = (a0, a1, …, am) and b = (b0, b1, …, bn) satisfy a <= b (DFS lexicographic order in Z) if and only if one of the following conditions is true: (i) there exists t, 0 <= t <= min(m, n), such that ak = bk for all k < t and at < bt; (ii) ak = bk for all k, 0 <= k <= m, and m <= n (a is a prefix of b).
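Conditions (i) and (ii) are exactly standard lexicographic order on sequences, which Python's tuple comparison implements directly (the edge 5-tuples below are made-up examples, not from the deck):

```python
def dfs_code_leq(a, b):
    """DFS lexicographic order: a <= b iff (i) a is strictly smaller at
    the first position where the codes differ, or (ii) a is a prefix of
    b. Python's built-in tuple comparison implements exactly this."""
    return tuple(a) <= tuple(b)

# Edges written as (i, j, label_i, edge_label, label_j) 5-tuples.
a = ((0, 1, 'C', '-', 'C'), (1, 2, 'C', '-', 'O'))
b = ((0, 1, 'C', '-', 'C'), (1, 2, 'C', '-', 'O'), (2, 0, 'O', '-', 'C'))
c = ((0, 1, 'C', '-', 'N'),)
```

Here a <= b by condition (ii) (prefix) and a <= c by condition (i) ('C' < 'N' at the first difference).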

39 DFS Code Extension. Let a be the minimum DFS code of a graph G and b a non-minimum DFS code of G. For any DFS code d generated from b by one right-most extension: (i) d is not a minimum DFS code; (ii) min_dfs(d) cannot be extended from b; (iii) min_dfs(d) is either less than a or can be extended from a. Theorem (right-extension): the DFS code of a graph extended from a non-minimum DFS code is not minimum.

40 GASTON (Nijssen and Kok KDD’04) Extend graphs directly Store embeddings Separate the discovery of different types of graphs path  tree  graph Simple structures are easier to mine and duplication detection is much simpler 40

41 Graph Pattern Explosion Problem. If a graph is frequent, all of its subgraphs are frequent (the Apriori property). An n-edge frequent graph may have 2^n subgraphs. Among 422 chemical compounds confirmed active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%.

42 Closed Frequent Graphs Motivation: Handling graph pattern explosion problem Closed frequent graph A frequent graph G is closed if there exists no supergraph of G that carries the same support as G If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) Lossless compression: still ensures that the mining result is complete 42
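The closedness test can be illustrated with itemsets standing in for subgraphs (a toy sketch; for graphs, the strict-subset test would be a subgraph-isomorphism check):

```python
def closed_patterns(patterns):
    """Keep only closed patterns: drop any pattern that has a strict
    superset with the same support."""
    return [(p, sup) for p, sup in patterns
            if not any(p < q and sup == sup_q for q, sup_q in patterns)]

patterns = [
    (frozenset('A'), 3),
    (frozenset('B'), 3),
    (frozenset('AB'), 3),    # same support as A and B, so A and B are not closed
    (frozenset('ABC'), 2),
]
```

Only {A,B} (support 3) and {A,B,C} (support 2) survive; {A} and {B} are dropped because {A,B} carries the same support, which is exactly the lossless compression the slide describes.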

43 CloseGraph (Yan & Han, KDD’03): a pattern-growth approach. Under what condition can we stop growing a frequent k-edge graph G into its (k+1)-edge children, i.e., terminate early? Suppose G and G’ are frequent and G is a subgraph of G’. If, in every part of every graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’.

44 Handling Tricky Exception Cases. (Figure: two example graphs and two patterns over vertices a, b, c, d illustrating the exception cases.)

45 Experimental Result. The AIDS antiviral screen compound dataset from NCI/NIH contains 43,905 chemical compounds. Among these 43,905 compounds, 423 belong to class CA, 1,081 to class CM, and the remainder to class CI.

46 Discovered Patterns. (Figure: example patterns discovered at minimum support 20%, 10%, and 5%.)

47 Performance (1): Run Time. (Figure: run time per pattern, in msec, vs. minimum support, in %.)

48 Performance (2): Memory Usage. (Figure: memory usage, in GB, vs. minimum support, in %.)

49 Number of Patterns: Frequent vs. Closed. (Figure: number of patterns vs. minimum support on the CA dataset.)

50 Runtime: Frequent vs. Closed. (Figure: run time, in seconds, vs. minimum support on the CA dataset.)

51 Do the Odds Beat the Curse of Complexity? Potentially exponential number of frequent patterns: the worst-case complexity vs. the expected probability. Ex.: suppose Walmart has 10^4 kinds of products. The chance to pick up one product: 10^-4. The chance to pick up a particular set of 10 products: 10^-40. What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions? Have we solved the NP-hard problem of subgraph isomorphism testing? No. But the real graphs in biology and chemistry are not so bad: a carbon atom has only 4 bonds, and most proteins in a network have distinct labels.
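The arithmetic behind the Walmart example can be checked directly (plain calculation, nothing beyond the numbers on the slide):

```python
# Back-of-envelope check of the Walmart example.
kinds = 10 ** 4                    # 10^4 kinds of products
p_item = 1 / kinds                 # chance of picking one product: 10^-4
p_set = p_item ** 10               # a particular set of 10 products: 10^-40
transactions = 10 ** 9
expected = transactions * p_set    # expected occurrences of that set
# expected is about 10^-31, astronomically below the 10^3 occurrences
# the set would need in order to be frequent.
```

This is the sense in which the expected probability beats the worst-case complexity: a random large pattern is effectively never frequent.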

52 Graph Search. Querying graph databases: given a graph database and a query graph, find all the graphs containing this query graph. (Figure: a query graph and a graph database.)

53 Scalability Issue. A sequential scan incurs disk I/Os and subgraph isomorphism testing, so an indexing mechanism is needed. DayLight: Daylight.com (commercial). GraphGrep: Dennis Shasha, et al., PODS’02. Grace: Srinath Srinivasa, et al., ICDE’03.

54 Indexing Strategy. Key observation: if graph G contains query graph Q, then G contains every substructure of Q. Remark: index substructures of a query graph to prune graphs that do not contain these substructures.

55 Indexing Framework Two steps in processing graph queries Step 1. Index Construction Enumerate structures in the graph database, build an inverted index between structures and graphs Step 2. Query Processing Enumerate structures in the query graph Calculate the candidate graphs containing these structures Prune the false positive answers by performing subgraph isomorphism test 55
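The two steps can be sketched with an inverted index over opaque string features (a toy illustration; real systems enumerate paths or discriminative subgraphs, and the surviving candidates still require a subgraph isomorphism test):

```python
def build_index(db_features):
    """Step 1 (index construction): inverted index mapping each feature
    to the set of graph ids that contain it."""
    index = {}
    for gid, feats in db_features.items():
        for f in feats:
            index.setdefault(f, set()).add(gid)
    return index

def candidates(index, query_feats, all_ids):
    """Step 2 (query processing): intersect posting lists; a graph
    missing any query feature cannot contain the query. The survivors
    C_q still need a subgraph isomorphism test (not shown)."""
    cand = set(all_ids)
    for f in query_feats:
        cand &= index.get(f, set())
    return cand

# Toy database of three graphs described by their path features.
db = {'a': {'C', 'C-C', 'C-C-C'},
      'b': {'C', 'C-C'},
      'c': {'C', 'C-C', 'C-C-C', 'C-N'}}
idx = build_index(db)
```

A query with features {'C-C-C'} keeps candidates {a, c}; adding 'C-N' narrows the candidates to {c}, showing how each extra indexed feature prunes the answer set.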

56 Cost Analysis. Query response time = (time to fetch the index) + |Cq| × (time per subgraph isomorphism test), where Cq is the set of candidate graphs. Remark: make |Cq| as small as possible.

57 Path-based Approach. Build an inverted index between paths and graphs. (Graph database: graphs (a), (b), (c).) Paths: 0-length: C, O, N, S; 1-length: C-C, C-O, C-N, C-S, N-N, S-O; 2-length: C-C-C, C-O-C, C-N-C, …; 3-length: …

58 Path-based Approach (cont.). For the query graph: 0-edge: S_C = {a, b, c}, S_N = {a, b, c}; 1-edge: S_C-C = {a, b, c}, S_C-N = {a, b, c}; 2-edge: S_C-N-C = {a, b}, … Intersecting these sets, we obtain the candidate answers, graphs (a) and (b), which may contain this query graph.

59 Problems: Path-based Approach. (Graph database: graphs (a), (b), (c); a query graph.) Only graph (c) contains the query graph. However, if we only index the paths C, C-C, C-C-C, C-C-C-C, we cannot prune graphs (a) and (b).

60 gIndex: Indexing Graphs by Data Mining Our methodology on graph index: Identify frequent structures in the database, the frequent structures are subgraphs that appear quite often in the graph database Prune redundant frequent structures to maintain a small set of discriminative structures Create an inverted index between discriminative frequent structures and graphs in the database 60

61 IDEAS: Indexing with Two Constraints. All structures (>10^6) → frequent structures (~10^5) → discriminative structures (~10^3).

62 Why Discriminative Subgraphs? All graphs contain the structures C, C-C, C-C-C; why bother indexing these redundant frequent structures? Only index structures that provide more information than existing structures. (Figure: a sample database of graphs (a), (b), (c).)

63 Discriminative Structures. Pinpoint the most useful frequent structures: given a set of already-selected structures and a new structure x, measure the extra indexing power provided by x; when the presence of x is poorly predicted by the selected structures, x is a discriminative structure and should be included in the index. Index discriminative frequent structures only: this reduces the index size by an order of magnitude.

64 Why Frequent Structures? We cannot index (or even search) all substructures. Large structures will likely be indexed well by their substructures. Size-increasing support threshold. (Figure: the minimum support threshold increases with structure size.)

65 Experimental Setting The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds Query graphs are randomly extracted from the dataset GraphGrep: maximum length (edges) of paths is set at 10 gIndex: maximum size (edges) of structures is set at 10 65

66 Experiments: Index Size. (Figure: number of features vs. database size.)

67 Experiments: Answer Set Size. (Figure: number of candidates vs. query size.)

68 Experiments: Incremental Maintenance. Frequent structures are stable under database updates. The index can be built from a small portion of a graph database but used for the whole database.

69

70 Mining Homogeneous Networks: A Taxonomy of Link Mining Tasks; Graph Pattern Mining; Graph Classification; Graph Clustering; Summary

71 Substructure-Based Graph Classification  Data: graph data with labels, e.g., chemical compounds, software behavior graphs, social networks  Basic idea  Extract graph substructures  Represent a graph with a feature vector x = (x1, x2, …, xn), where xi is the frequency of the i-th substructure in that graph  Build a classification model  Different features and representative work  Fingerprint  Maccs keys  Tree and cyclic patterns [Horvath et al., KDD’04]  Minimal contrast subgraphs [Ting and Bailey, SDM’06]  Frequent subgraphs [Deshpande et al., TKDE’05; Liu et al., SDM’05]  Graph fragments [Wale and Karypis, ICDM’06]
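The feature-vector construction can be sketched in a few lines (a toy illustration; the vocabulary and occurrence list below are made up, and a real pipeline would obtain occurrences from a subgraph miner such as gSpan):

```python
def feature_vector(occurrences, vocabulary):
    """Map a graph to x = (x1, ..., xn), where xi counts how often the
    i-th substructure occurs in the graph. Occurrences are given as a
    list of substructure names."""
    counts = {}
    for f in occurrences:
        counts[f] = counts.get(f, 0) + 1
    return [counts.get(f, 0) for f in vocabulary]

# Hypothetical substructure vocabulary and one graph's occurrences.
vocab = ['C-C', 'C-O', 'C-N', 'benzene']
x = feature_vector(['C-C', 'C-C', 'C-O', 'benzene'], vocab)
```

The resulting vector feeds any standard classifier; the choice of vocabulary (fingerprints, Maccs keys, frequent subgraphs, graph fragments) is what distinguishes the methods listed above.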

72 Fingerprints (fp-n). Enumerate all paths up to length l and certain cycles in each chemical compound, then hash each feature to one or more positions in a fixed-length bit-vector (positions 1 … n). (Courtesy of Nikil Wale.)

73 Maccs Keys (MK). A domain expert identifies fragments “important” for bioactivity; each fragment forms a fixed dimension in the descriptor space. (Courtesy of Nikil Wale.)

74 Cycles and Trees (CT) [Horvath et al., KDD’04]. Identify the bi-connected components of the chemical compound (bounded cyclicity, using a fixed number of cycles), delete the bi-connected components, and keep the left-over trees. (Courtesy of Nikil Wale.)

75 Frequent Subgraphs (FS) [Deshpande et al., TKDE’05]. Discover features by frequent subgraph discovery over the chemical compounds with a minimum support; topological features are captured by the graph representation. (Figure: discovered subgraphs with supports such as 40% in positives vs. 0% in negatives, 30% vs. 5%, and 1% vs. 30%.) (Courtesy of Nikil Wale.)

76 Graph Fragments (GF) [Wale & Karypis, ICDM’06]. Tree Fragments (TF): at least one node of the tree fragment has a degree greater than 2 (no cycles). Path Fragments (PF): all nodes have degree less than or equal to 2, with no cycles. Acyclic Fragments (AF): TF ∪ PF; acyclic fragments are also termed free trees. (Courtesy of Nikil Wale.)

77 Comparison of Different Features [Wale & Karypis, ICDM’06] 77

78 Minimal Contrast Subgraphs [Ting and Bailey, SDM’06] A contrast graph is a subgraph appearing in one class of graphs and never in another class of graphs Minimal if none of its subgraphs are contrasts May be disconnected Allows succinct description of differences But requires larger search space Mining Contrast Subgraphs Find the maximal common edge sets These may be disconnected Apply a minimal hypergraph transversal operation to derive the minimal contrast edge sets from the maximal common edge sets Must compute minimal contrast vertex sets separately and then minimal union with the minimal contrast edge sets Courtesy of Bailey and Dong 78

79 Frequent Subgraph-Based Classification [Deshpande et al., TKDE’05] Frequent subgraphs A graph is frequent if its support (occurrence freq.) in a given dataset is no less than a min. support threshold Feature generation Frequent topological subgraphs by FSG Frequent geometric subgraphs with 3D shape information Feature selection Sequential covering paradigm Classification Use SVM to learn a classifier based on feature vectors Assign different misclassification costs for different classes to address skewed class distribution 79

80 Varying Minimum Support 80

81 Varying Misclassification Cost 81

82 All graph substructures up to a given length (size or # of bonds) Determined dynamically → Dataset dependent descriptor space Complete coverage → Descriptors for every compound Precise representation → One to one mapping Complex fragments → Arbitrary topology Recurrence relation to generate graph fragments of length l Graph Fragment [Wale and Karypis, ICDM’06] Courtesy of Nikil Wale 82

83 Performance Comparison (ICDM’08 tutorial)

84 Re-examination of Pattern-Based Classification. Positive and negative training instances, together with test instances, pass through pattern-based feature construction (computationally expensive!) into a transformed feature space; a prediction model is then learned and applied.

85 Computational Bottleneck. Two-step approach (expensive): mine the full set of frequent patterns (10^4 to 10^6) from the data, then filter for discriminative patterns. Direct mining (efficient): transform the data (e.g., into an FP-tree) and mine discriminative patterns directly.

86 Challenge: Non-Anti-Monotonic. Discriminative measures are neither monotonic nor anti-monotonic in pattern size. Should we enumerate all subgraphs, from small size to large size, and then check their scores?

87 Direct Mining of Discriminative Patterns Avoid mining the whole set of patterns Harmony [Wang and Karypis, SDM’05] DDPMine [Cheng et al., ICDE’08] LEAP [Yan et al., SIGMOD’08] MbT [Fan et al., KDD’08] Find the most discriminative pattern A search problem? An optimization problem? Extensions Mining top-k discriminative patterns Mining approximate/weighted discriminative patterns 87

88 Mining Most Significant Graph with Leap Search [Yan et al., SIGMOD’08] Objective functions 88

89 Upper-Bound 89

90 Upper-Bound: Anti-Monotonic. Rule of thumb: if the frequency difference of a graph pattern between the positive dataset and the negative dataset increases, the pattern becomes more interesting. We can recycle existing graph mining algorithms to accommodate non-monotonic functions.

91 Structural Similarity. Sibling patterns (e.g., size-4, size-5, and size-6 graphs) that are structurally similar tend to have similar significance: structural similarity  significance similarity.

92 Structural Leap Search. Let g be a discovered graph and g’ a sibling of g. Leap over the g’ subtree when g’ is sufficiently similar to g, as controlled by the leap length, a tolerance on structure/frequency dissimilarity. The search alternates a mining part and a leap part.

93 Frequency Association. Association between a pattern’s frequency and its objective score: start with a high frequency threshold and gradually decrease it.

94 LEAP Algorithm. 1. Structural leap search with a frequency threshold. 2. Support-descending mining, until F(g*) converges. 3. Branch-and-bound search with F(g*).

95 Branch-and-Bound vs. LEAP. Pruning base: branch-and-bound uses the parent-child bound (“vertical”), strict pruning; LEAP uses sibling similarity (“horizontal”), approximate pruning. Feature optimality: guaranteed vs. near optimal. Efficiency: good vs. better.

96 NCI Anti-Cancer Screen Datasets (data description)
Name | Assay ID | Size | Tumor Description
MCF-7 | 83 | 27,770 | Breast
MOLT-4 | 123 | 39,765 | Leukemia
NCI-H23 | 1 | 40,353 | Non-Small Cell Lung
OVCAR-8 | 109 | 40,516 | Ovarian
P388 | 330 | 41,472 | Leukemia
PC-3 | 41 | 27,509 | Prostate
SF-295 | 47 | 40,271 | Central Nerve System
SN12C | 145 | 40,004 | Renal
SW-620 | 81 | 40,532 | Colon
UACC257 | 33 | 39,988 | Melanoma
YEAST | 167 | 79,601 | Yeast anti-cancer

97 Efficiency Tests. (Figures: search efficiency, and search quality measured by the G-test.)

98 Mining Quality: Graph Classification (AUC and runtime; OA Kernel has a scalability problem)
Name | OA Kernel* | LEAP | OA Kernel (6x) | LEAP (6x)
MCF-7 | 0.68 | 0.67 | 0.75 | 0.76
MOLT-4 | 0.65 | 0.66 | 0.69 | 0.72
NCI-H23 | 0.79 | 0.76 | 0.77 | 0.79
OVCAR-8 | 0.67 | 0.72 | 0.79 | 0.78
P388 | 0.79 | 0.82 | - | 0.81
PC-3 | 0.66 | 0.69 | 0.79 | 0.76
Average | 0.70 | 0.72 | 0.75 | 0.77
* OA Kernel: Optimal Assignment Kernel [Frohlich et al., ICML’05]; LEAP: leap search.

99

100 Mining Homogeneous Networks: A Taxonomy of Link Mining Tasks; Graph Pattern Mining; Graph Classification; Graph Clustering; Summary

101 Graph Compression Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes 101

102 Graph/Network Clustering Problem X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, Proc. 2007 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'07), San Jose, CA, Aug. 2007 Networks made up of the mutual relationships of data elements usually have an underlying structure Because relationships are complex, it is difficult to discover these structures. How can the structure be made clear? Given simply information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)? 102

103 Graph Clustering: Sparsest Cut. G = (V, E). A cut partitions V into S and T; its cut set is the set of edges {(u, v) ∈ E | u ∈ S, v ∈ T}. Size of the cut: the number of edges in the cut set. A min-cut (e.g., C1) is not necessarily a good partition. A better measure is sparsity: the sparsity of a cut is its size divided by min(|S|, |T|), and a cut is sparsest if its sparsity is not greater than that of any other cut. Ex.: the cut C2 = ({a, b, c, d, e, f, l}, {g, h, i, j, k}) is the sparsest cut. For k clusters, the modularity of a clustering assesses its quality: the modularity of a clustering of a graph is the difference between the fraction of all edges that fall into individual clusters and the fraction that would do so if the graph vertices were randomly connected, Q = Σi (li/|E| − (di/(2|E|))^2), where li is the number of edges between vertices in the i-th cluster and di is the sum of the degrees of the vertices in the i-th cluster. The optimal clustering of a graph maximizes the modularity.
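The modularity definition can be checked on a toy graph of two triangles joined by a single edge (a minimal sketch; the vertex and cluster labels are made up):

```python
def modularity(edges, clusters):
    """Q = sum_i (l_i/|E| - (d_i/(2|E|))^2): fraction of edges inside
    each cluster minus the fraction expected under random wiring.
    `clusters` maps each vertex to its cluster id."""
    m = len(edges)
    l, d = {}, {}
    for u, v in edges:
        d[clusters[u]] = d.get(clusters[u], 0) + 1
        d[clusters[v]] = d.get(clusters[v], 0) + 1
        if clusters[u] == clusters[v]:
            l[clusters[u]] = l.get(clusters[u], 0) + 1
    return sum(l.get(i, 0) / m - (d[i] / (2 * m)) ** 2 for i in d)

# Two triangles joined by one edge; the natural 2-clustering.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (0, 3)]
clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

The natural two-cluster split scores Q = 5/14 ≈ 0.357, while lumping all six vertices into one cluster scores 0, matching the intuition that modularity rewards edges concentrated inside clusters.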

104 Graph Clustering: Challenges of Finding Good Cuts. High computational cost: many graph cut problems are computationally expensive; the sparsest cut problem is NP-hard; we need to trade off between efficiency/scalability and quality. Sophisticated graphs: may involve weights and/or cycles. High dimensionality: a graph can have many vertices; in a similarity matrix, a vertex is represented as a vector (a row in the matrix) whose dimensionality is the number of vertices in the graph. Sparsity: a large graph is often sparse, meaning each vertex on average connects to only a small number of other vertices; a similarity matrix from a large sparse graph can also be sparse.

105 Two Approaches for Graph Clustering Two approaches for clustering graph data Use generic clustering methods for high-dimensional data Designed specifically for clustering graphs Using clustering methods for high-dimensional data Extract a similarity matrix from a graph using a similarity measure A generic clustering method can then be applied on the similarity matrix to discover clusters Ex. Spectral clustering: approximate optimal graph cut solutions Methods specific to graphs Search the graph to find well-connected components as clusters Ex. SCAN (Structural Clustering Algorithm for Networks) X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, “SCAN: A Structural Clustering Algorithm for Networks”, KDD'07 105

106 An Example of Networks How many clusters? What size should they be? What is the best partitioning? Should some points be segregated? 106

107 A Social Network Model Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group. Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example bridge multiple groups. Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group. 107

108 The Neighborhood of a Vertex. Define Γ(v) as the immediate neighborhood of a vertex v, i.e., the set of people that an individual knows (in SCAN, the adjacent vertices of v together with v itself).

109 Structure Similarity. The desired features tend to be captured by a measure we call structural similarity: σ(v, w) = |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| · |Γ(w)|). Structural similarity is large for members of a clique and small for hubs and outliers.
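The similarity measure is a few lines of code (a sketch; the toy graph of a triangle plus a pendant vertex is an assumption chosen to show the clique-vs-outlier contrast):

```python
import math

def gamma(adj, v):
    """Closed neighborhood: vertex v together with its neighbors."""
    return set(adj[v]) | {v}

def sigma(adj, v, w):
    """SCAN structural similarity: shared neighborhood size, normalized
    by the geometric mean of the two neighborhood sizes. Large within a
    clique, small for hubs and outliers."""
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / math.sqrt(len(gv) * len(gw))

# Triangle 0-1-2 plus a pendant vertex 3 attached only to 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
```

Inside the triangle, sigma(0, 1) ≈ 0.87; toward the pendant vertex, sigma(0, 3) ≈ 0.71, so thresholding on σ separates clique members from peripheral vertices.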

110 Structural Connectivity [1]. ε-neighborhood: Nε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}. Core: v is a core if |Nε(v)| ≥ μ. Direct structure reachable: w is directly structure reachable from a core v if w ∈ Nε(v). Structure reachable: the transitive closure of direct structure reachability. Structure connected: u and v are structure connected if both are structure reachable from some vertex x. [1] M. Ester, H. P. Kriegel, J. Sander, and X. Xu (KDD’96), “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases”.

111 Structure-Connected Clusters. A structure-connected cluster C satisfies connectivity (any two vertices in C are structure connected) and maximality (C contains all vertices structure reachable from its cores). Hubs: belong to no cluster but bridge many clusters. Outliers: belong to no cluster and connect to fewer clusters.

112-124 Algorithm (animated example). SCAN run on a 14-vertex example network (vertices 0-13) with parameters μ = 2 and ε = 0.7: slide by slide, the algorithm computes structural similarities edge by edge (values such as 0.63, 0.75, 0.67, 0.82, 0.73, 0.51, and 0.68 appear in the animation), grows clusters around core vertices, and leaves the remaining vertices as hubs or outliers.

125 Running Time Running time = O(|E|) For sparse networks = O(|V|) [2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004). 125

126 Summary: Mining Homogeneous Networks A Taxonomy of Link Mining Tasks Graph Pattern Mining Graph Classification Graph Clustering 126

127

128 References: Graph Pattern Mining (1) T. Asai, et al. “Efficient substructure discovery from large semi-structured data”, SDM'02 C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02 M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds”, ICDM 2003 M. Deshpande, M. Kuramochi, and G. Karypis. “Automated approaches for classifying structures”, BIOKDD'02 L. Dehaspe, H. Toivonen, and R. King. “Finding frequent substructures in chemical compounds”, KDD'98 C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of 'Connection Subgraphs”, KDD'04 L. Holder, D. Cook, and S. Djoko. “Substructure discovery in the subdue system”, KDD'94 J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. “Mining spatial motifs from protein structure graphs”, RECOMB’04 J. Huan, W. Wang, and J. Prins. “Efficient mining of frequent subgraph in the presence of isomorphism”, ICDM'03 H. Hu, X. Yan, Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, ISMB'05 A. Inokuchi, T. Washio, and H. Motoda. “An apriori-based algorithm for mining frequent substructures from graph data”, PKDD'00 C. James, D. Weininger, and J. Delany. “Daylight Theory Manual Daylight Version 4.82”. Daylight Chemical Information Systems, Inc., 2003. G. Jeh, and J. Widom, “Mining the Space of Graph Properties”, KDD'04 M. Koyuturk, A. Grama, and W. Szpankowski. “An efficient algorithm for detecting frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207, 2004. 128

129 References: Graph Pattern Mining (2) M. Kuramochi and G. Karypis. “Frequent subgraph discovery”, ICDM'01 M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery Algorithm”, ICDM’04 B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45--87, 1981. S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference. KDD'04 J. Prins, J. Yang, J. Huan, and W. Wang. “Spin: Mining maximal frequent subgraphs from graph databases”. KDD'04 D. Shasha, J. T.-L. Wang, and R. Giugno. “Algorithmics and applications of tree and graph searching”, PODS'02 J. R. Ullmann. “An algorithm for subgraph isomorphism”, J. ACM, 23:31--42, 1976. N. Vanetik, E. Gudes, and S. E. Shimony. “Computing frequent graph patterns from semistructured data”, ICDM'02 C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. “Scalable mining of large disk-based graph databases”, KDD'04 T. Washio and H. Motoda, “State of the art of graph-based data mining”, SIGKDD Explorations, 5:59-68, 2003 X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, ICDM'02 X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03 X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-based Approach”, SIGMOD'04 X. Yan, X. J. Zhou, and J. Han, “Mining Closed Relational Graphs with Connectivity Constraints”, KDD'05 X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05 X. Yan, F. Zhu, J. Han, and P. S. Yu, “Searching Substructures with Superimposed Distance”, ICDE'06 M. J. Zaki. “Efficiently mining frequent trees in a forest”, KDD'02 P. Zhao and J. Han, “On Graph Query Optimization in Large Networks”, VLDB'10 129

130 References: Graph Classification (1) G. Cong, K. Tan, A. Tung, and X. Xu. Mining Top-k Covering Rule Groups for Gene Expression Data, SIGMOD’05. M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis. Frequent Substructure-based Approaches for Classifying Chemical Compounds, TKDE’05. G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences, KDD’99. G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns, DS’99. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.), John Wiley & Sons, 2001. W. Fan, K. Zhang, H. Cheng, J. Gao, X. Yan, J. Han, P. S. Yu, and O. Verscheure. Direct Mining of Discriminative and Essential Graphical and Itemset Features via Model-based Search Tree, KDD’08. D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational Networks”, PKDD'05. H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed Molecular Graphs”, ICML’05. T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives”, COLT/Kernel’03. H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between Labeled Graphs”, ICML’03. T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph Classification”, NIPS’04. C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for ‘Backtrace’ of Noncrashing Bugs”, SDM'05. P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph Kernels”, ICML’04. 130

131 References: Graph Classification (2) T. Horvath, T. Gärtner, and S. Wrobel. Cyclic Pattern Kernels for Predictive Graph Mining, KDD’04. T. Kudo, E. Maeda, and Y. Matsumoto. An Application of Boosting to Graph Classification, NIPS’04. W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification based on Multiple Class-association Rules, ICDM’01. B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining, KDD’98. H. Liu, J. Han, D. Xin, and Z. Shao. Mining Frequent Patterns on Very High Dimensional Data: A Top-down Row Enumeration Approach, SDM’06. S. Nijssen and J. Kok. A Quickstart in Frequent Structure Mining Can Make a Difference, KDD’04. F. Pan, G. Cong, A. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets, KDD’03. F. Pan, A. Tung, G. Cong, and X. Xu. COBBLER: Combining Column and Row Enumeration for Closed Pattern Discovery, SSDBM’04. Y. Sun, Y. Wang, and A. K. C. Wong. Boosting an Associative Classifier, TKDE’06. P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns, KDD’02. R. Ting and J. Bailey. Mining Minimal Contrast Subgraph Patterns, SDM’06. N. Wale and G. Karypis. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification, ICDM’06. H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by Pattern Similarity in Large Data Sets, SIGMOD’02. J. Wang and G. Karypis. HARMONY: Efficiently Mining the Best Rules for Classification, SDM’05. X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. SCAN: A Structural Clustering Algorithm for Networks, KDD'07. X. Yan, H. Cheng, J. Han, and P. S. Yu. Mining Significant Graph Patterns by Scalable Leap Search, SIGMOD’08. 131

