Presentation is loading. Please wait.

Presentation is loading. Please wait.

Mining and Searching Graphs in Biological Databases

Similar presentations


Presentation on theme: "Mining and Searching Graphs in Biological Databases"— Presentation transcript:

1 Mining and Searching Graphs in Biological Databases
Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign In collaboration with Xifeng Yan (UIUC), Philip S. Yu (IBM), and others December 8, 2018 Mining and Searching Graphs in Biological Databases

2 Mining and Searching Graphs in Biological Databases
References of the Talk X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, Proc Int. Conf. on Data Mining (ICDM'02) X. Yan and J. Han, CloseGraph: Mining Closed Frequent Graph Patterns, Proc ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD'03) X. Yan, P. S. Yu, and J. Han, Graph Indexing: A Frequent Structure-based Approach, Proc ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'04) X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, Proc ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'05) H. Hu, X. Yan, Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, Proc Int. Conf. on Intelligent Systems for Molecular Biology (ISMB'05) December 8, 2018 Mining and Searching Graphs in Biological Databases

3 Mining and Searching Graphs in Biological Databases
Outline Mining frequent and closed subgraphs Graph indexing methods Similairty search in graph databases Exploration with graph mining and search Biological network analysis December 8, 2018 Mining and Searching Graphs in Biological Databases

4 Why Graph Mining and Searching?
Graphs are ubiquitous Web databases, XML databases Cheminformatics (chemical compound) Bioinformactics (protein structure, pathway) Workflow analysis Social network analysis Graph is a general model Trees, lattices, sequences, and items are degenerated graphs Graph algorithms: many have high complexity Diversity of graphs Directed vs. undirected, labeled vs. unlabeled (edges and vertices), angles/geometry, weighted edges December 8, 2018 Mining and Searching Graphs in Biological Databases

5 Frequent Graph Pattern Mining
Frequent subgraphs A (sub)graph is frequent if its support in a given dataset is no less than a minimum support threshold Applications Mining biochemical structures Mining XML structures or Web communities Trees, lattices, sequences, and items are degenerated graphs Based for graph classification, clustering, comparison, and correlation analysis December 8, 2018 Mining and Searching Graphs in Biological Databases

6 Mining and Searching Graphs in Biological Databases
Example Graph Dataset (A) (B) (C) Frequent Patterns (min support is 2) (1) (2) December 8, 2018 Mining and Searching Graphs in Biological Databases

7 Subgraph Explosion Problem
An n-edge frequent graph may have 2n subgraphs. All of them are frequent too. Among 422 chemical compounds which are confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5% December 8, 2018 Mining and Searching Graphs in Biological Databases

8 Mining and Searching Graphs in Biological Databases
Observation If a graph is frequent, all of its subgraphs are frequent ─ the Apriori property If some of its subgraphs have the same support, it is unnecessary to output these subgraphs (closed graphs): This still ensures the mining result is complete—“lossless compression” December 8, 2018 Mining and Searching Graphs in Biological Databases

9 Mining and Searching Graphs in Biological Databases
gSpan (Yan & Han, ICDM’02) K edges Enumerate G0 in the graph dataset; At the same time, Count the frequency of G1, G2, … Gn. (K-1) edges G1 G0 G2 Gn Apriori: Step1: Join two k-1 edge graphs (these two graphs share a same k-2 edge subgraph) to generate a k-edge graph Step2: Join the tid-list of these two k-1 edge graphs, then see whether its count is larger than the minimum support Step3: Check all k-1 subgraph of this k-edge graph to see whether all of them are frequent Step4: After G successfully pass Step1-3, do support computation of G in the graph dataset, See whether it is really frequent. gSpan: Step1: Right-most extend a k-1 edge graph to several k edge graphs. Step2: Enumerate the occurrence of this k-1 edge graph in the graph dataset, meanwhile, counting these k edge graphs. Step3: Output those k edge graphs whose support is larger than the minimum support. Pros: 1: gSpan avoid the costly candidate generation and testing some infrequent subgraphs. 2: No complicated graph operations, like joining two graphs and calculating its k-1 subgraphs. 3. gSpan is very simple The key is how to do right most extension efficiently in graph. We invented DFS code for graph. Output Frequent Ones in G1, G2, …, Gn. December 8, 2018 Mining and Searching Graphs in Biological Databases

10 Mining and Searching Graphs in Biological Databases
gSpan (cont.) gSpan extends a k-edge graph to a (k+1)-edge graph by adding one edge gSpan chooses only one depth-first search tree in a graph which produces the minimum DFS code, the canonical label of the graph gSpan only adds new edges on the vertices along the right-most path of the depth-first search tree December 8, 2018 Mining and Searching Graphs in Biological Databases

11 Mining and Searching Graphs in Biological Databases
Right-Most Extension 1 1 2 2 3 3 5 5 4 4 1: Given a graph in (a), suppose, the DFS traverse the graph in the order 0, 1, 2, 3, 4 2: The vertices 0, 1, 2 and 4 constitute the right-most path, and vertex 4 is called the right-most vertex (the last discovered vertex) 3: Any extension must take place on the right-most path: first, the backward edge can only extend from the right most vertex; Second, the forward edge can extend from any vertices on the right-most path. 4: The extension of a graph is equal to the extension of its DFS code by appending a new edge in the end. (a) (b) Backward-edge extension December 8, 2018 Mining and Searching Graphs in Biological Databases

12 Right-Most Extension (cont.)
1 1 2 2 3 3 5 5 4 4 1: Given a graph in (a), suppose, the DFS traverse the graph in the order 0, 1, 2, 3, 4 2: The vertices 0, 1, 2 and 4 constitute the right-most path, and vertex 4 is called the right-most vertex (the last discovered vertex) 3: Any extension must take place on the right-most path: first, the backward edge can only extend from the right most vertex; Second, the forward edge can extend from any vertices on the right-most path. 4: The extension of a graph is equal to the extension of its DFS code by appending a new edge in the end. (a) (b) Forward-edge extension December 8, 2018 Mining and Searching Graphs in Biological Databases

13 Avoid Duplicate Extension
(1) (2) (2) (1) They are the same graph, but they grow in different ways. The cost to detect them is huge if there are lots of such duplicate growths. December 8, 2018 Mining and Searching Graphs in Biological Databases

14 Mining and Searching Graphs in Biological Databases
Right-Most Extension (1) (2) (2) (1) (1) Theorem: Any graph can be grown by right-most extension We will never generate this December 8, 2018 Mining and Searching Graphs in Biological Databases

15 Mining and Searching Graphs in Biological Databases
Search Space in gSpan 1-edge ... 2-edge ... G ... ... ... G1 Gn 3-edge ... ... ... December 8, 2018 Mining and Searching Graphs in Biological Databases

16 Mining and Searching Graphs in Biological Databases
DFS Code Flatten a graph into a sequence such that any right-most extension takes place in the end of the sequence. X a e0: (0,1,X,a,X) X a b b e2: (2,0,Y,b,X) c e4: (3,1,Z,c,X) Y a e1: (1,2,X,a,Y) X c a d Y Z d e5: (2,4,Y,d,Z) 1: Show how to construct a DFS code, which is related with right most extension 2: One graph has lots of DFS code, among them, there is a minimum DFS code. We invent a DFS lexicographic ordering system. Among this ordering, we take minimum DFS code as canonical label. b Z b e3: (2,3,Y,b,Z) Z Z December 8, 2018 Mining and Searching Graphs in Biological Databases

17 Mining Closed Frequent Graphs
Closed graphs—Only register those subgraphs whose support is different from its supergraphs A concept derived from closed itemset mining (Nasquier et al. ICDT’99) How much can we save? For the same dataset, only 2,000 graphs among 1,000,000 frequent graphs are closed frequent graphs. December 8, 2018 Mining and Searching Graphs in Biological Databases

18 A Naïve Algorithm for Mining Closed Graphs
Run frequent graph mining algorithm first Prune non-closed graphs then Question: Can we prune the search space in the Step 1 directly? That is, can we generate a partial set of frequent graphs and then prune non-closed graphs in it? December 8, 2018 Mining and Searching Graphs in Biological Databases

19 CloseGraph (Yan & Han, KDD’03)
K edges Enumerate G0 in the graph dataset; At the same time, Count the frequency of G1, G2, … Gn. (K-1) edges G1 G0 G2 Gi If it is frequent, in which condition, can we stop searching its children and all its descendants ? Output frequent ones in G1, G2, …, Gn. December 8, 2018 Mining and Searching Graphs in Biological Databases

20 Mining and Searching Graphs in Biological Databases
Search Space in gSpan 1-edge ... 2-edge ... G ... ... ... G1 Gn 3-edge ... ... ... December 8, 2018 Mining and Searching Graphs in Biological Databases

21 Search Space and Pruning
1-edge ... 2-edge ... ... If it is not closed, can we do it? ... 3-edge G1 1: Show that G1=G2  Minimum DFS code of G1 should be equal to Minimum DFS code of G2, thus by calculating the minimum DFS code, we can tell whether two graphs are the same. 2: MOST IMPORTANT RESULT !!!: if two graphs G1 and G2 are the same, assume G1 is discovered first, then any graph which is right most extension of graph G2 must be discovered before. Therefore, we need not check the whole branch under G2 in the search tree. (This is the efficiency comes from) 3: ANOTHER MOST IMPORTANT RESULT: G2 is presented as a code C, thus we need not even calculate Min(G2), we need only to know whether C is equal to Min(G2). If not, we know G2 must be discovered before. Very critical for the efficiency. ... ... PRUNED ... December 8, 2018 Mining and Searching Graphs in Biological Databases

22 Mining and Searching Graphs in Biological Databases
Experimental Result The AIDS antiviral screen compound dataset from NCI/NIH We select the up-to-date release, March 2002 Release The dataset contains 43,905 chemical compounds Among these 43,905 compounds, 422 of them belongs to CA, 1081 are of CM, and the remaining are in class CI December 8, 2018 Mining and Searching Graphs in Biological Databases

23 Experimental Result (cont.)
20% 10% 5% December 8, 2018 Mining and Searching Graphs in Biological Databases

24 Experimental Result (cont.)
December 8, 2018 Mining and Searching Graphs in Biological Databases

25 Experimental Result (cont.)
CA December 8, 2018 Mining and Searching Graphs in Biological Databases

26 Mining and Searching Graphs in Biological Databases
Outline Mining frequent and closed subgraphs Graph indexing methods Similairty search in graph databases Exploration with graph mining and search Biological network analysis December 8, 2018 Mining and Searching Graphs in Biological Databases

27 Querying Graph Database
Graph query problem Given a graph database and a query graph, find all graphs containing this query graph “Graph”: broadly includes any form of data comprising one or more features, such as topological graphs, images, sequences and their combinations query graph graph database December 8, 2018 Mining and Searching Graphs in Biological Databases

28 Mining and Searching Graphs in Biological Databases
Scalability Issue Sequential scan Disk I/O Subgraph isomorphism testing An indexing mechanism is needed DayLight: Daylight.com (commercial) GraphGrep: Dennis Shasha, et al. PODS'02 Grace: Srinath Srinivasa, et al. ICDE'03 (a) (b) (c) Query graph Sample database Two issues have to be solved: Disk I/O for false positive and subgraph isomorphism testing for false positive. Therefore, an indexing mechanism is needed. The list just shows a partial list of the previous research. DayLight system is a commercial product targeted for chemical information system. The following is copied from Once a large graph database is formed, how can we do database management, and further data warehouse and data mining. 2. GraphGrep uses a path-based approach to do graph indexing. Its implementation is available publicly. 3. Grace develops a hierarchical vector space method. It tries to extract features from the graphs in a hierarchical way. DayLight System “At Infinity Pharmaceuticals we use the Daylight toolkits for in-house development and have implemented DayCart (as a central component of our chemical registration system. This system has been fully integrated within all aspects of our drug discovery processes and software applications. We have successfully "pushed" the registration process to the chemists; they can quickly and easily register compounds within the context of their electronic notebooks. This has provided a significant enhancement to both the efficiency of data capture and the quality of the data within the chemistry database(s). DayCart integrates well into our overall application and data environment providing a flexible chemical data model which facilitates a data architecture that is tailored to Infinity's processes and scientific approach while at the same time allows speed and scalability. The benefit delivered to Infinity is measured in the strong capability to rapidly and efficiently develop intellectual property and access corporate knowledge effectively. “ December 8, 2018 Mining and Searching Graphs in Biological Databases

29 Mining and Searching Graphs in Biological Databases
Indexing Strategy Query graph (Q) Graph (G) If graph G contains query graph Q, G should contain any substructure of Q Substructure Remarks Index substructures of a query graph to prune graphs that do not contain these substructures Our work, also with all the previous work follows this indexing strategy. December 8, 2018 Mining and Searching Graphs in Biological Databases

30 Mining and Searching Graphs in Biological Databases
Framework Two steps in processing graph queries Step 1. Index Construction Enumerate structures in the graph database, build an inverted index between structures and graphs Step 2. Query Processing Enumerate structures in the query graph Calculate the candidate graphs containing these structures Prune the false positive answers by performing subgraph isomorphism test December 8, 2018 Mining and Searching Graphs in Biological Databases

31 Mining and Searching Graphs in Biological Databases
Cost Analysis Query Response Time Disk I/O time Isomorphism testing time Graph index access time T_io is the time of fetching a graph from a disk, which may result in one disk block IO. I did not show it in the paper. If the graph dataset can not be held in the main memory, then we need also count the IO time Our goal is to minimize |Cq| Size of candidate answer set Remark: make |Cq| as small as possible December 8, 2018 Mining and Searching Graphs in Biological Databases

32 Mining and Searching Graphs in Biological Databases
Path-Based Approach Sample database (a) (b) (c) Paths 0-length: C, O, N, S 1-length: C-C, C-O, C-N, C-S, N-N, S-O 2-length: C-C-C, C-O-C, C-N-C, ... 3-length: ... Built an inverted index between paths and graphs December 8, 2018 Mining and Searching Graphs in Biological Databases

33 Path-Based Approach (cont.)
Query graph 0-length: SC={a, b, c}, SN={a, b, c} 1-length: SC-C={a, b, c}, SC-N={a, b, c} 2-length: SC-N-C = {a, b}, … Intersect these sets, we obtain the candidate answers - graph (a) and graph (b) - which may contain this query graph. December 8, 2018 Mining and Searching Graphs in Biological Databases

34 Problems of Path-Based Approach
Sample database (a) (b) (c) Query graph Only graph (c) contains this query graph. However, if we only index paths: C, C-C, C-C-C, C-C-C-C, we cannot prune graph (a) and (b). Only graph (C) contains the query graph. Unfortunately, if we use path only, we cannot prune graphs (a) and (b) The question is why we use path. The reason is path is very simple, easier to manipulate. December 8, 2018 Mining and Searching Graphs in Biological Databases

35 gIndex: Indexing Graphs by Data Mining
Identify frequent structures in the database, the frequent structures are subgraphs that appear quite often in the graph database Prune redundant frequent structures to maintain a small set of discriminative structures Create an inverted index between discriminative frequent structures and graphs in the database The reason of doing frequent graphs is just because the number of frequent graphs is much less than the number of all subgraphs in the graph database although this number is still too large for indexing. December 8, 2018 Mining and Searching Graphs in Biological Databases

36 Frequent Structures: Threshold Issue
How to set up the minimum support threshold? If it is too low, it may generate too many frequent graphs If it is too high, it may miss important structures Should we enforce a uniform threshold for the different size of structures? Size-increasing support threshold December 8, 2018 Mining and Searching Graphs in Biological Databases

37 Frequent Structures: Threshold Issue
Intuition: large structures with low support will likely be indexed well by their substructures that have the similar support Size-increasing support threshold The support threshold increases when the indexed structures become larger In our current implementation, the curve is determine manually, the experiments show it is not sensitive. We are implementing one which can be determined automatically. The size-increasing support threshold has two effects: Reduce the number of frequent patterns Save space for indexing low-support small structures, which is more important than low-support large structures Size-increasing support threshold is a special case of discriminative structures which will be introduced in the next two slides. The reason of choosing the size-increasing support threshold is that the low-support large structures are not as discriminative as the low-support small structures. December 8, 2018 Mining and Searching Graphs in Biological Databases

38 Frequent Structures: Volume Issue
The number of frequent structures may exceed the number of graphs in the database when the support is low 1,000 graphs may generate 1,000,000 frequent structures It is time and memory expensive to compute and index all frequent structures Even with size-increase support threshold, we still have lots of frequent structures to index. discriminative structures December 8, 2018 Mining and Searching Graphs in Biological Databases

39 Mining and Searching Graphs in Biological Databases
Redundant Structures Sample database (a) (b) (c) All graphs contain structures: C, C-C, C-C-C Why bother indexing these redundant frequent structures? Remove these redundant structures Only index structures that provide more information than existing structures December 8, 2018 Mining and Searching Graphs in Biological Databases

40 Discriminative Structures
Pinpoint the most useful frequent structures Given a set of sturctures and a new structure , we measure the extra indexing power provided by , When is small enough, is a discriminative structure and should be included in the index Index discriminative frequent structures only Reduce the index size by an order of magnitude Achieve good performance December 8, 2018 Mining and Searching Graphs in Biological Databases

41 Mining and Searching Graphs in Biological Databases
Experimental Setting The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds Query graphs are randomly extracted from the dataset. GraphGrep: maximum length (edges) of paths is set at 10 gIndex: maximum size (edges) of structures is set at 10 The fingerprint set size is set as large as possible until the program takes 1G memory. Here it is 10K. Other parameter setting is provided in the paper December 8, 2018 Mining and Searching Graphs in Biological Databases

42 Experiments: Index Size
Number of features Database Size Remarks: It shows the raw feature set of path-based approach and structure-based approach. Of course, we can compress the paths into an arbitrary size of index with performance loss. In our experiment setting, we do not compress the path-based index in order to illustrate the best performance it can achieve. It is also possible to apply discriminative filtering on paths. We have not done experiments on it. The number of frequent structures or discriminative frequent structures does not change over the different size of databases. December 8, 2018 Mining and Searching Graphs in Biological Databases

43 Experiments: Low Support Queries
Answer Set Size Query Size 1. That is, the number of graphs containing the query graph is less than 50. 2. Query size are about the number of edges in a query. 3. Answer set size is the number of candidate graphs that may contain the query graph. 4. The actual match is the lower bound that all algorithms can achieve. That is, it is the actual number of graphs that really contain the query graph. The performance of queries whose support is less than 50 December 8, 2018 Mining and Searching Graphs in Biological Databases

44 Experiments: High Support Queries
Answer Set Size Query Size The performance of queries whose support is between 50 and 500 December 8, 2018 Mining and Searching Graphs in Biological Databases

45 Experiments: Incremental Maintenance
The fact that the number of frequent structures does not change over the different size of databases inspires us to design an algorithm which construct the index incrementally. 2. The black line shows the performance of indexing from the scratch, for example, from a database with 2,000 graphs, 4,000 graphs, 6,000 graphs, 8,000 graphs and 10,000 graphs. 3. The red line shows the performance of incremental indexing, for example, we first built an index for 2,000 graphs, then we add another 2,000 graphs into the database, update the indexing of the original discriminative frequent graphs, then we add another 2,000, … and so on Frequent structures are stable to database updating. Index can be built based on a small portion of a graph database, but be used for the whole database December 8, 2018 Mining and Searching Graphs in Biological Databases

46 Mining and Searching Graphs in Biological Databases
Outline Mining frequent and closed subgraphs Graph indexing methods Similairty search in graph databases Exploration with graph mining and search Biological network analysis December 8, 2018 Mining and Searching Graphs in Biological Databases

47 Graph Query Processing
Chemical Compounds (a) caffeine (b) diurobromine (c) viagra Query Graph December 8, 2018 Mining and Searching Graphs in Biological Databases

48 Precise vs. Approximate Search in Graphs
Given a graph database and a query graph Q, Find graphs containing Q exactly (Precise Matching, gIndex, SIGMOD’04) Find graphs containing Q approximately (Approximate Matching, Grafil, SIGMOD’05) December 8, 2018 Mining and Searching Graphs in Biological Databases

49 Mining and Searching Graphs in Biological Databases
Method I Compute the similarity between the graphs in the database and the query graph directly (costly) sequential scan subgraph similarity computation December 8, 2018 Mining and Searching Graphs in Biological Databases

50 Mining and Searching Graphs in Biological Databases
Method II Form a set of subgraph queries from the original query graph and use the exact subgraph search (costly) If we allow 3 edges to be missed in a 20-edge query graph, it may generate 1,140 subgraphs. December 8, 2018 Mining and Searching Graphs in Biological Databases

51 Index: Precise vs. Appr. Search
Precise Search Use frequent patterns as indexing features Select features in the database space based on their selectivity Build the index Approximate Search Hard to build indices covering similar subgraphs – explosive number of subgraphs in databases Idea: (1) keep the index structure (2) select features in the query space December 8, 2018 Mining and Searching Graphs in Biological Databases

52 Substructure Similarity Measure
Structure-based similarity measure The largest overlapping part of two graphs Relaxation: the number of edges that can be relabeled or deleted (relaxation of the query graph) G Q December 8, 2018 Mining and Searching Graphs in Biological Databases

53 Mining and Searching Graphs in Biological Databases
Structural Features Graph Database (a) (b) (c) Structural Features (small fragments) atom path bond subgraph December 8, 2018 Mining and Searching Graphs in Biological Databases

54 Substructure Similarity Measure
Feature-based similarity measure Each graph is represented as a feature vector X = {x1, x2, …, xn} The similarity is defined by the distance of their corresponding vectors Easy to index Very fast Rough measure December 8, 2018 Mining and Searching Graphs in Biological Databases

55 Substructure Similarity Search
Structure-based similarity Accurate measure Slow Can we transform structure-based to feature-based? Feature-based similarity Rough measure Fast December 8, 2018 Mining and Searching Graphs in Biological Databases

56 Mining and Searching Graphs in Biological Databases
Intuition Graph (G1) If graph G contains the major part of a query graph Q, G should share a number of common features with Q Query (Q) Graph (G2) Given a relaxation ratio, calculate the maximal number of features that can be missed ! Suppose we can delete any red edge in the query graph Q. Substructure At least one of them should be contained December 8, 2018 Mining and Searching Graphs in Biological Databases

57 Mining and Searching Graphs in Biological Databases
Feature-Graph Matrix An occurrence table between feature and graph G1 G2 G3 G4 G5 f1 1 f2 2 f3 f4 Assume a query graph has 4 features and only 1 feature to miss due to the relaxation threshold December 8, 2018 Mining and Searching Graphs in Biological Databases

58 Query Processing Framework
Three steps in processing approximate graph queries Step 1. Index Construction Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database. December 8, 2018 Mining and Searching Graphs in Biological Databases

59 Mining and Searching Graphs in Biological Databases
Framework (cont.) Step 2. Feature Miss Estimation Determine the indexed features belonging to the query graph Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J On the query graph, not the graph database December 8, 2018 Mining and Searching Graphs in Biological Databases

60 Mining and Searching Graphs in Biological Databases
Framework (cont.) Step 3. Query Processing Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, FG – FQ If FG – FQ > J, discard G. The remaining graphs constitute a candidate answer set December 8, 2018 Mining and Searching Graphs in Biological Databases

61 Mining and Searching Graphs in Biological Databases
Experimental Results Database Chemical compounds of Anti-Aids Drug from NCI/NIH, randomly select 10,000 compounds Query Randomly select 30 graphs with 16 and 20 edges as query graphs Our algorithm, Grafil: Graph Filter Competitive algorithms Edge: use edges only All: use all the features December 8, 2018 Mining and Searching Graphs in Biological Databases

62 Experimental Results (cont.)
When the number of edge relaxations increases, the performance of Edge and Grafil will be similar, as expected. Queries with 16 edges December 8, 2018 Mining and Searching Graphs in Biological Databases

63 Mining and Searching Graphs in Biological Databases
Outline Mining frequent and closed subgraphs Graph indexing methods Similairty search in graph databases Exploration with graph mining and search Biological network analysis December 8, 2018 Mining and Searching Graphs in Biological Databases

64 Mining and Searching Graphs in Biological Databases
Biological Networks Protein-protein interaction network Metabolic network Transcriptional regulatory network Co-expression network Genetic Interaction network December 8, 2018 Mining and Searching Graphs in Biological Databases

65 Data Mining Across Multiple Networks
f f a b c d e f g h i j k j j a a h c c h e e b b k k d d i g i g f f f j j a j a h a h c h c c e e e b b k b k k d d g i d g i g i December 8, 2018 Mining and Searching Graphs in Biological Databases

66 Data Mining Across Multiple Networks
b d e f g h i j k c a b c d e f g h i j k a b c d e f g h i j k a b c d e f g h i j k f f j a j c h a h c e e b k b k d g i d g i December 8, 2018 Mining and Searching Graphs in Biological Databases

67 Mining and Searching Graphs in Biological Databases
Identify frequent co-expression clusters across multiple microarray data sets a b c d e f g h i j k . a b c d e f g h i j k . c1 c2… cm g … .2 g … .4 g … .2 g … .4 g … .1 g … .5 g … .8 g … .3 . December 8, 2018 Mining and Searching Graphs in Biological Databases

68 Mining and Searching Graphs in Biological Databases
Our Solution We develop a novel algorithm, called CODENSE, to mine frequent coherent dense subgraphs. The target subgraphs have three characteristics: All edges occur in >= k graphs (frequency) All edges should exhibit correlated occurrences in the given graph set (coherency) The subgraph is dense, where density d is higher than a threshold  and d=2m/(n(n-1)) (density) m: #edges, n: #nodes December 8, 2018 Mining and Searching Graphs in Biological Databases

69 CODENSE: Mine Coherent Dense Subgraphs
(1) Builds a summary graph by eliminating infrequent edges f a b c d e g h i G3 G2 G6 G5 G4 f a b d e g h i c G1 a b d e g h i c f summary graph Ĝ December 8, 2018 Mining and Searching Graphs in Biological Databases

70 Mining and Searching Graphs in Biological Databases
CODENSE: Mine Coherent Dense Subgraphs (2) Identify dense subgraphs of the summary graph a b d e g h i c f summary graph Ĝ Sub(Ĝ) Step 2 MODES Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse is not true. December 8, 2018 Mining and Searching Graphs in Biological Databases

71 Mining and Searching Graphs in Biological Databases
CODENSE: Mine Coherent Dense Subgraphs (3) Construct the edge occurrence profiles for each dense summary subgraph e g h i c f Sub(Ĝ) Step 3 1 e-f c-i c-h c-f c-e G6 G5 G4 G3 G2 G1 E edge occurrence profiles December 8, 2018 Mining and Searching Graphs in Biological Databases

72 Mining and Searching Graphs in Biological Databases
CODENSE: Mine Coherent Dense Subgraphs (4) builds a second-order graph for each dense summary subgraph 1 e-f c-i c-h c-f c-e G6 G5 G4 G3 G2 G1 E edge occurrence profiles Step 4 e-h f-h e-i e-g g-i h-i second-order graph S g-h f-i December 8, 2018 Mining and Searching Graphs in Biological Databases

73 CODENSE: Mine coherent dense subgraphs
(5) Identify dense subgraphs of the second-order graph c-f c-h c-e e-h e-f f-h c-i e-i e-g g-i h-i second-order graph S g-h f-i Step 4 Sub(S) Observation: If a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its 2nd-order graph must be dense. December 8, 2018 Mining and Searching Graphs in Biological Databases

74 CODENSE: Mine coherent dense subgraphs
(6) Identify the coherent dense subgraphs c-f c-h c-e e-h e-f f-h e-i e-g g-i h-i Sub(S) g-h Step 5 c e f h g i Sub(G) December 8, 2018 Mining and Searching Graphs in Biological Databases

75 CODENSE: Mine coherent dense subgraphs
1 e-f c-i c-h c-f c-e G6 G5 G4 G3 G2 G1 E edge occurrence profiles c e f h g i Step 4 Step 5 Sub(G) a b d e-h f-h e-i e-g g-i h-i second-order graph S g-h f-i Step 1 Step 3 summary graph Ĝ Sub(Ĝ) Step 2 Sub(S) Step 6 MODES Add/Cut Restore G and MODES December 8, 2018 Mining and Searching Graphs in Biological Databases

76 Mining and Searching Graphs in Biological Databases
Applying CoDense to 39 yeast microarray data sets f f a j a h j c1 c2… cm g … .2 g … .4 c h c e e b k b k d g i d g i f f c1 c2… cm g … .2 g … .4 a e j a j c h c h e b b k k d g i d g i c1 c2… cm g … .1 g … .5 f f j a j a c h c h e e b b k k d g i d g i f f c1 c2… cm g … .8 g … .3 a j h a j c h c e e b b k k d g i d g i December 8, 2018 Mining and Searching Graphs in Biological Databases

77 Mining and Searching Graphs in Biological Databases
YDR115W MRP49 PHB1 MRPL51 PET100 ATP12 ATP17 MRPL37 MRPL38 ACN9 MRPL32 MRPL39 MRPS18 FMC1 December 8, 2018 Mining and Searching Graphs in Biological Databases

78 Mining and Searching Graphs in Biological Databases
ATP17 MRP49 MRPL51 PHB1 PET100 ATP12 PET100 YDR115W MRPL38 ACN9 MRPL32 MRPL39 MRPS18 FMC1 Yellow: YDR115W, FMC1, ATP12,MRPL37,MRPS18 GO: (protein metabolism; pvalue = ) December 8, 2018 Mining and Searching Graphs in Biological Databases

79 Mining and Searching Graphs in Biological Databases
YDR115W MRP49 PHB1 MRPL51 PET100 ATP12 ATP17 MRPL37 MRPL38 ACN9 MRPL32 MRPL39 MRPS18 FMC1 Red:PHB1,ATP17,MRPL51,MRPL39, MRPL49, MRPL51,PET100 GO: (generation of precursor metabolites and energy; pvalue= ) December 8, 2018 Mining and Searching Graphs in Biological Databases

80 Mining and Searching Graphs in Biological Databases
Conclusions Graph mining has wide applications Frequent and closed subgraph mining methods gSpan and CloseGraph: pattern-growth depth-first search approach Graph indexing techniques Frequent and discirminative subgraphs are high-quality indexing fatures Similairty search in graph databases Indexing and approximate matching help similar subgraph search Biological network analysis Mining coherent, dense, multiple biological networks December 8, 2018 Mining and Searching Graphs in Biological Databases


Download ppt "Mining and Searching Graphs in Biological Databases"

Similar presentations


Ads by Google