Presentation is loading. Please wait.

Presentation is loading. Please wait.

The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.

Similar presentations


Presentation on theme: "The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006."— Presentation transcript:

1 The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006

2 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2 9/18/2006 Graph Data Analysis Graph Data Analysis Overview Methods for Mining Frequent Subgraphs Mining Variant and Constrained Substructure Patterns Graph Classification Graph Clustering Summary

3 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide3 9/18/2006 Graph Data Analysis Why Graph Mining? Graphs are ubiquitous Chemical compounds (Cheminformatics) Protein structures, biological pathways/networks (Bioinformactics) Program control flow, traffic flow, and workflow analysis XML databases, Web, and social network analysis Graph is a general model Trees, lattices, sequences, and items are degenerated graphs Diversity of graphs Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D) Complexity of algorithms: many problems are of high complexity

4 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide4 9/18/2006 Graph Data Analysis Graph Everywhere Aspirin Yeast protein interaction network from H. Jeong et al Nature 411, 41 (2001) Internet Co-author network

5 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide5 9/18/2006 Graph Data Analysis Labeled Graphs A labeled graph is a graph where each node and each edge has a label. G1G1 p2p2 p5p5 a b b d y x y y y p1p1 p3p3 p4p4 c a b b y x y G2G2 q1q1 q3q3 q2q2 a b b y y G3G3 s1s1 s3s3 s2s2 c s4s4 y

6 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide6 9/18/2006 Graph Data Analysis Pattern Matching A graph G is subgraph isomorphic to a graph G’, denoted by G  G’, if there exists a 1-1 mapping from nodes in G to G’ such that node labels, edges, and edge labels are preserved with the mapping. A pattern is a graph. Pattern G matches G’ if G  G’ G occurs in G’ if G  G’. With a label set, a graph space is a collection of graphs whose labels are from the set. G1G1 p2p2 p5p5 a b b d y x y y y p1p1 p3p3 p4p4 c a b b y x y G2G2 q1q1 q3q3 q2q2 a b b y y G3G3 s1s1 s3s3 s2s2 c s4s4 y g1g1 a b y g2g2 c g3g3 y G

7 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide7 9/18/2006 Graph Data Analysis Graph Pattern Mining Frequent subgraphs A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold

8 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide8 9/18/2006 Graph Data Analysis Examples G1G1 p2p2 p5p5 a b b d y x y y y p1p1 p3p3 p4p4 c a b b y x y G2G2 q1q1 q3q3 q2q2 a b b y y G3G3 s1s1 s3s3 s2s2 c s4s4 y f = 3/3 +: induced frequent subgraphs P1P1 a b y y f=2/3 a b b y x y P3P3 a b b y x P2P2 P4P4 b c y b y f=3/3 f=2/3 P5P5 b b x P6P6 a  = 2/3 The induced subgraph isomorphism penalizes any unmatched edges b + + + + f = 1/3 f=0/3

9 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide9 9/18/2006 Graph Data Analysis Examples G1G1 p2p2 p5p5 a b b d y x y y y p1p1 p3p3 p4p4 c a b b y x y G2G2 q1q1 q3q3 q2q2 a b b y y G3G3 s1s1 s3s3 s2s2 c s4s4 y ! !: Maximal frequent subgraphs a b b y x y f=2/3 P3P3  = 2/3 Maximal frequent subgraph are ones that none of their supergraphs are frequent Other criteria for selecting subgraphs may be incorporated f=2/3 P4P4 b c y b y f=3/3 f=2/3 P5P5 b b x P6P6 a a b b y x P2P2 P1P1 a b y y b

10 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide10 9/18/2006 Graph Data Analysis Example: Frequent Subgraphs GRAPH DATASET FREQUENT PATTERNS (MIN SUPPORT IS 2) (A)(B)(C) (1)(2)

11 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide11 9/18/2006 Graph Data Analysis Applications Mining biomolecular structures Program control flow analysis Mining XML structures or Web communities Building blocks for graph classification, clustering, compression, comparison, and correlation analysis

12 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide12 9/18/2006 Graph Data Analysis Graph Mining Algorithms Incomplete beam search – Greedy (Subdue) Inductive logic programming (WARMR) Graph theory-based approaches Edge based Path based Tree based

13 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide13 9/18/2006 Graph Data Analysis SUBDUE (Holder et al. KDD’94) Start with single vertices Expand best substructures with a new edge Limit the number of best substructures Substructures are evaluated based on their ability to compress input graphs Using minimum description length (DL) Best substructure S in graph G minimizes: DL(S) + DL(G\S) Terminate until no new substructure is discovered

14 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide14 9/18/2006 Graph Data Analysis WARMR (Dehaspe et al. KDD’98) Graphs are represented by Datalog facts atomel(C, A1, c), bond (C, A1, A2, BT), atomel(C, A2, c) : a carbon atom bound to a carbon atom with bond type BT WARMR: the first general purpose ILP system Level-wise search Simulate Apriori for frequent pattern discovery

15 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide15 9/18/2006 Graph Data Analysis Frequent Subgraph Mining Approaches Edge-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) FFSM: Huan, et al. (ICDM’03) Path-based approach PATH # : Vanetik and Gudes (ICDM’02, ICDM’04) Tree-based approach Gaston: Nijssen and Kok (KDD’04) SPIN: Huan, et al. (KDD’04)

16 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide16 9/18/2006 Graph Data Analysis Properties of Graph Mining Algorithms Search order breadth vs. depth Generation of candidate subgraphs apriori vs. pattern growth Elimination of duplicate subgraphs passive vs. active Support calculation embedding store or not Discover order of patterns path  tree  graph

17 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide17 9/18/2006 Graph Data Analysis Search DAG Task: identify all frequently occurring subgraphs from a group of graphs, or a graph database Support anti-monotonicity Any supergraph of an infrequent subgraph is infrequent Known as the Apriori property Level-wise search Keep all patterns with the same size in memory (poor memory utilization) Depth-first search Better memory utilization May repeatedly search patterns in the DAG (redundant candidates)

18 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide18 9/18/2006 Graph Data Analysis Apriori-Based Approach … G G1G1 G2G2 GnGn k-edge (k+1)-edge G’ G’’ JOIN

19 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide19 9/18/2006 Graph Data Analysis Apriori-Based, Breadth-First Search AGM (Inokuchi, et al. PKDD’00) generates new graphs with one more node Methodology: breadth-search, joining two graphs FSG (Kuramochi and Karypis ICDM’01) generates new graphs with one more edge

20 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide20 9/18/2006 Graph Data Analysis FSG Algorithm K = 1 F 1 = all frequent edges Repeat K = K + 1; C K = join(F K-1 ) F K = frequent patterns in C K Until F K is empty

21 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide21 9/18/2006 Graph Data Analysis Join: Key Operation Join(L) =  join(P, Q) for all P, Q  L Join(P, Q) = {G | P, Q,  G, |G| = |P| + 1, |P| = |Q|} Two graphs P and Q are joinable if the join of the two graphs produces an non-empty set Theorem: two graphs P and Q are joinable if P ∩ Q is a graph with size |P| -1 or share a common “core” with size P-1

22 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide22 9/18/2006 Graph Data Analysis Multiplicity of Candidates Case 1: identical vertex labels a b e a a b e a + a b e a e a b e a e a b e a

23 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide23 9/18/2006 Graph Data Analysis Multiplicity of Candidates Case 2: Core contains identical labels Core: The (k-1) subgraph that is common between the joint graphs a a a a c b a a a a c b a a a a c b + a a a a b a a a a c

24 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide24 9/18/2006 Graph Data Analysis Multiplicity of Candidates Case 3: Core multiplicity a ab + a a aab aab a a ab aa ab a b aab aa

25 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide25 9/18/2006 Graph Data Analysis PATH (Vanetik and Gudes ICDM’02, ’04) Apriori-based approach Building blocks: edge-disjoint path Identify all frequent paths Construct frequent graphs with 2 edge-disjoint paths Construct graphs with k+1 edge-disjoint paths from graphs with k edge-disjoint paths Repeat A graph with 3 edge-disjoint paths

26 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide26 9/18/2006 Graph Data Analysis PATH Algorithm K = 1 F 1 = all frequent paths Repeat K = K + 1; C K = join(F K-1 ) F K = frequent patterns in C K Until F K is empty

27 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide27 9/18/2006 Graph Data Analysis Challenges Graph isomorphism Two graphs may have the same topology though their layouts are different Subgraph isomorphism How to compute the support value of a pattern

28 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide28 9/18/2006 Graph Data Analysis Graph Isomorphism A graph is isomorphic if it is topologically equivalent to another graph

29 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide29 9/18/2006 Graph Data Analysis Why Redundant Candidates? All the algorithms may propose the same candidate several times. We need to keep track of the identical candidates to Avoid redundancy in results Avoid redundant search

30 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide30 9/18/2006 Graph Data Analysis Intuitions for Graph Normalization A Graph Space An arbitrary set  A 1-1 mapping A partial order defined on 

31 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide31 9/18/2006 Graph Data Analysis GSPAN A graph normalization is a 1-1 mapping of a graph space to an arbitrary space (usually a string space) Deal with graph isomorphism using DFS code Start with a singe edge Depth first enumeration of a pattern space Add one edge a time Yan & Han, ICDM’02

32 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide32 9/18/2006 Graph Data Analysis DFS Code Flatten a graph into a sequence using depth first search 0 1 2 3 4 e0: (0,1) e1: (1,2) e2: (2,0) e3: (2,3) e4: (3,1) e5: (1,4) DFS code: (0, 1, x, a, y), (1, 2, y, b, x), (2, 0, x, a, x), (2, 3, x, c, z), (3, 1, z, b, y), (1, 4, y, d, z)

33 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide33 9/18/2006 Graph Data Analysis DFS Code

34 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide34 9/18/2006 Graph Data Analysis DFS Lexicographic Order Let Z be the set of DFS codes of all graphs. Two DFS codes a and b have the relation a  b (DFS Lexicographic Order in Z) if and only if one of the following conditions is true. Let a = (x 0, x 1, …, x m ) and b = (y 0, y 1, …, y n ), (i) if there exists t, 0  t  min(m,n), x k =y k for all k, s.t. k<t, and x t < y t (ii) x k =y k for all k, s.t. 0  k  m and m  n.

35 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide35 9/18/2006 Graph Data Analysis DFS Code Example We have γ < β < α

36 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide36 9/18/2006 Graph Data Analysis DFS Code Extension Let a be the minimum DFS code of a graph G and b be a non-minimum DFS code of G. For any graph G’  G, we have minDFS(G’) < b a is a prefix of minDFS(G’) or minDFS(G’) < a There is a 1-1 mapping from a graph to its minimum DFS code. For every graph G’, there exists a G such that G’  G and minDFS(G) is a prefix of minDFS(G’)

37 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide37 9/18/2006 Graph Data Analysis gSpan Code Input: A graph database and a support threshold t Output: All frequent patterns F gSpan: F 1 = { frequent node labels}, K=1, gSpan_enumeration (F 1, K, F) gSpan_enumeration (F K, K, F) K = K + 1; For each pattern P in F k C = Candidates(P, K); F = F  C; gSpan_enumeration (C, K, F)

38 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide38 9/18/2006 Graph Data Analysis How to Propose Candidates Generate all supergraphs: Candidate = {G | P  G, sup(G) >= t, |G| = k} gSpan method: Candidate = {G | P  G, sup(G) >= t, |G| = k, minDSP(P) is a prefix of minDSP(G) } Right-most expansion

39 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide39 9/18/2006 Graph Data Analysis Graph Canonical Code in FFSM The Canonical Code (θ) maps a graph G to a string. Code(M 1 ): (1, 1, a) (2, 1, x) (2, 2, b) (3, 1, x) (3, 2, y) (3, 3, b) (4, 2, x) (4, 4, c) < Code(M 2 ): (1, 1, a) (2, 1, x) (2, 2, b) (3, 2, x) (3, 3, c) (4, 1, x) (4, 2, y) (4, 4, b) < Code(M 3 ): (1, 1, a) (2, 2, c) (3, 1, x) (3, 2, x) (3, 3, b) (4, 1, x) (4, 3, y) (4, 4, b) θ(P) = (1, 1, a) (2, 1, x) (2, 2, b) (3, 1, x) (3, 2, y) (3, 3, b) (4, 2, x) (4, 4, c) (i, j, M i,j )  (k, l, M k,l ) if i < k, or i = k, j < l, or i =k, j = l, M i,j  M k,l p2p2 p4p4 a b b x y x x (P) p1p1 p3p3 c p’ 2 a b b c x y x x (P’) p’ 1 p’ 3 p’ 4 M1 M1 b bx cx0 yx a 0 P 1 P 2 P 3 P 4 M2 M2 c bx byx x0 a 0 P 1 P 2 P 4 P 3 M3 M3 b c0 b0x xx a y P 1 P 4 P 2 P 3 Code(M 1 ): (1, 1, a) (2, 1, x) (2, 2, b) (3, 1, x) (3, 2, y) (3, 3, b) (4, 2, x) (4, 4, c)

40 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide40 9/18/2006 Graph Data Analysis The Power of Graph Normalization A Graph Space An arbitrary set  A 1-1 mapping A partial order defined on  A partial order defined on the graph space

41 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide41 9/18/2006 Graph Data Analysis MoFa (Borgelt and Berthold ICDM’02) Extend graphs by adding a new edge Store embeddings of discovered frequent graphs Fast support calculation Also used in other later developed algorithms such as FFSM and GASTON Local structural pruning

42 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide42 9/18/2006 Graph Data Analysis GASTON (Nijssen and Kok KDD’04) Extend graphs directly Store embeddings Separate the discovery of different types of graphs path  tree  graph Simple structures are easier to mine and duplication detection is much simpler

43 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide43 9/18/2006 Graph Data Analysis Graph Pattern Explosion Problem If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property An n-edge frequent graph may have 2 n subgraphs Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%

44 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide44 9/18/2006 Graph Data Analysis Closed Frequent Graphs Motivation: Handling graph pattern explosion problem Closed frequent graph A frequent graph G is closed if there exists no supergraph of G that carries the same support as G If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs) Lossless compression: still ensures that the mining result is complete

45 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide45 9/18/2006 Graph Data Analysis CLOSEGRAPH (Yan & Han, KDD’03) … A Pattern-Growth Approach G G1G1 G2G2 GnGn k-edge (k+1)-edge At what condition, can we stop searching their children i.e., early termination? If G and G’ are frequent, G is a subgraph of G’. If in any part of the graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’.

46 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide46 9/18/2006 Graph Data Analysis Handling Tricky Exception Cases (graph 1) a c b d (pattern 2) (pattern 1) (graph 2) a c b d ab a c d

47 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide47 9/18/2006 Graph Data Analysis Experimental Result The AIDS antiviral screen compound dataset from NCI/NIH The dataset contains 43,905 chemical compounds Among these 43,905 compounds, 423 of them belongs to CA, 1081 are of CM, and the remaining are in class CI

48 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide48 9/18/2006 Graph Data Analysis Discovered Patterns 20% 10% 5%

49 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide49 9/18/2006 Graph Data Analysis Do the Odds Beat the Curse of Complexity? Potentially exponential number of frequent patterns The worst case complexty vs. the expected probability Ex.: Suppose Walmart has 10 4 kinds of products The chance to pick up one product 10 -4 The chance to pick up a particular set of 10 products: 10 -40 What is the chance this particular set of 10 products to be frequent 10 3 times in 10 9 transactions? Have we solved the NP-hard problem of subgraph isomorphism testing? No. But the real graphs in bio/chemistry is not so bad A carbon has only 4 bounds and most proteins in a network have distinct labels

50 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide50 9/18/2006 Graph Data Analysis Constrained Patterns Density Diameter Connectivity Degree Min, Max, Avg

51 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide51 9/18/2006 Graph Data Analysis Constraint-Based Graph Pattern Mining Highly connected subgraphs in a large graph usually are not artifacts (group, functionality) Recurrent patterns discovered in multiple graphs are more robust than the patterns mined from a single graph

52 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide52 9/18/2006 Graph Data Analysis No Downward Closure Property A minimal cut of a graph G is the (minimal) set of edges, once removed from G, G becomes an unconnected graphs. The connectivity of G is defined as the size of the minimal cut of the graph. Given two graphs G and G’, if G is a subgraph of G’, it does not imply that the connectivity of G is less than that of G’, and vice versa. G G’

53 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide53 9/18/2006 Graph Data Analysis Pattern-Growth Approach Find a small frequent candidate graph Remove vertices (shadow graph) whose degree is less than the connectivity Decompose it to extract the subgraphs satisfying the connectivity constraint Stop decomposing when the subgraph has been checked before Extend this candidate graph by adding new vertices and edges Repeat

54 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide54 9/18/2006 Graph Data Analysis Subgraph Patterns from Data Mining Sequence patterns (De Raedt and Kramer IJCAI ’ 01) Frequent subgraphs (Deshpande et al, ICDM’03) Coherent frequent subgraphs (Huan et al. RECOMB’04) A graph G is coherent if the mutual information between G and each of its own subgraphs is above some threshold Closed frequent subgraphs (Liu et al. SDM ’ 05)

55 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide55 9/18/2006 Graph Data Analysis Graph Clustering Graph similarity measure Feature-based similarity measure Each graph is represented as a feature vector The similarity is defined by the distance of their corresponding vectors Frequent subgraphs can be used as features Structure-based similarity measure Maximal common subgraph Graph edit distance: insertion, deletion, and relabel Graph alignment distance

56 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide56 9/18/2006 Graph Data Analysis Graph Classification Local structure based approach Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length Graph pattern-based approach Subgraph patterns from domain knowledge Subgraph patterns from data mining Kernel-based approach Random walk (Gärtner ’02, Kashima et al. ’02, ICML’03, Mahé et al. ICML’04) Optimal local assignment (Fröhlich et al. ICML’05) Boosting (Kudo et al. NIPS’04)

57 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide57 9/18/2006 Graph Data Analysis Graph Pattern-Based Classification Subgraph patterns from domain knowledge Molecular descriptors Subgraph patterns from data mining General idea Each graph is represented as a feature vector x = {x 1, x 2, …, x n }, where x i is the frequency of the i-th pattern in that graph Each vector is associated with a class label Classify these vectors in a vector space

58 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide58 9/18/2006 Graph Data Analysis Kernel Based Methods Graphs kernels: how to measure the similarity between two graphs?

59 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide59 9/18/2006 Graph Data Analysis Kernel-based Classification Random walk Marginalized Kernels (Gärtner ’02, Kashima et al. ’02, ICML’03, Mahé et al. ICML’04) and are paths in graphs and and are probability distributions on paths is a kernel between paths, e.g.,

60 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide60 9/18/2006 Graph Data Analysis Kernel-based Classification Optimal local assignment (Fröhlich et al. ICML’05) Can be extended to include neighborhood information e.g., where could be an RBF-kernel to measure the similarity of neighborhoods of vertices and, is a damping parameter

61 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide61 9/18/2006 Graph Data Analysis Boosting in Graph Classification Decision stumps Simple classifiers in which the final decision is made by single features. A rule is a tuple. If a molecule contains substructure, it is classified as. Gain Applying boosting

62 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide62 9/18/2006 Graph Data Analysis Graph Search Querying graph databases: Given a graph database and a query graph, find all the graphs containing this query graph query graph graph database

63 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide63 9/18/2006 Graph Data Analysis Scalability Issue Sequential scan Disk I/Os Subgraph isomorphism testing An indexing mechanism is needed DayLight: Daylight.com (commercial) GraphGrep: Dennis Shasha, et al. PODS'02 Grace: Srinath Srinivasa, et al. ICDE'03

64 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide64 9/18/2006 Graph Data Analysis Summary: Graph Mining Graph mining has wide applications Frequent and closed subgraph mining methods gSpan and CloseGraph: pattern-growth depth-first search approach Similarity search in graph databases Indexing and feature-based matching Further development and application exploration

65 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide65 9/18/2006 Graph Data Analysis References (1) T. Asai, et al. “Efficient substructure discovery from large semi-structured data”, SDM'02 C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02 D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational Networks”, PKDD'05. M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds”, ICDM 2003 M. Deshpande, M. Kuramochi, and G. Karypis. “Automated approaches for classifying structures”, BIOKDD'02 L. Dehaspe, H. Toivonen, and R. King. “Finding frequent substructures in chemical compounds”, KDD'98 C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of 'Connection Subgraphs”, KDD'04 H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed Molecular Graphs”, ICML’05 T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives”, COLT/Kernel’03

66 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide66 9/18/2006 Graph Data Analysis References (2) L. Holder, D. Cook, and S. Djoko. “Substructure discovery in the subdue system”, KDD'94 J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. “Mining spatial motifs from protein structure graphs”, RECOMB’04 J. Huan, W. Wang, and J. Prins. “Efficient mining of frequent subgraph in the presence of isomorphism”, ICDM'03 J. Huan, J. Prins, J. Yang, and W. Wang. “Spin: Mining maximal frequent subgraphs from graph databases”. KDD'04 H. Hu, X. Yan, Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, ISMB'05 A. Inokuchi, T. Washio, and H. Motoda. “An apriori-based algorithm for mining frequent substructures from graph data”, PKDD'00 C. James, D. Weininger, and J. Delany. “Daylight Theory Manual Daylight Version 4.82”. Daylight Chemical Information Systems, Inc., 2003. G. Jeh, and J. Widom, “Mining the Space of Graph Properties”, KDD'04 H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between Labeled Graphs”, ICML’03

67 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide67 9/18/2006 Graph Data Analysis References (3) M. Koyuturk, A. Grama, and W. Szpankowski. “An efficient algorithm for detecting frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207, 2004. T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph Classification”, NIPS’04 M. Kuramochi and G. Karypis. “Frequent subgraph discovery”, ICDM'01 M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery Algorithm”, ICDM’04 C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for ‘Backtrace'' of Noncrashing Bugs’'', SDM'05 P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph Kernels”, ICML’04 B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45--87, 1981. S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference. KDD'04

68 Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide68 9/18/2006 Graph Data Analysis References (4) D. Shasha, J. T.-L. Wang, and R. Giugno. “ Algorithmics and applications of tree and graph searching ”, PODS'02 J. R. Ullmann. “ An algorithm for subgraph isomorphism ”, J. ACM, 23:31--42, 1976. N. Vanetik, E. Gudes, and S. E. Shimony. “ Computing frequent graph patterns from semistructured data ”, ICDM'02 C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. “ Scalable mining of large disk-base graph databases ”, KDD'04 T. Washio and H. Motoda, “ State of the art of graph-based data mining ”, SIGKDD Explorations, 5:59- 68, 2003 X. Yan and J. Han, “ gSpan: Graph-Based Substructure Pattern Mining ”, ICDM'02 X. Yan and J. Han, “ CloseGraph: Mining Closed Frequent Graph Patterns ”, KDD'03 X. Yan, P. S. Yu, and J. Han, “ Graph Indexing: A Frequent Structure-based Approach ”, SIGMOD'04 X. Yan, X. J. Zhou, and J. Han, “ Mining Closed Relational Graphs with Connectivity Constraints ”, KDD'05 X. Yan, P. S. Yu, and J. Han, “ Substructure Similarity Search in Graph Databases ”, SIGMOD'05 X. Yan, F. Zhu, J. Han, and P. S. Yu, “ Searching Substructures with Superimposed Distance ”, ICDE'06 M. J. Zaki. “ Efficiently mining frequent trees in a forest ”, KDD'02


Download ppt "The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006."

Similar presentations


Ads by Google