1 Seminar in Bioinformatics An efficient algorithm for detecting frequent subgraphs in biological networks Paper by: M. Koyuturk, A. Grama and W. Szpankowski.


1 1 Seminar in Bioinformatics An efficient algorithm for detecting frequent subgraphs in biological networks Paper by: M. Koyuturk, A. Grama and W. Szpankowski Appeared in: Bioinformatics, Vol. 20, Sup. 1, 2004, pages i200-i207. Presented by: Royi Ronen

2 2 Abstract Motivation –Network interaction data is abundant –Analyzing this data is important –Problems are close to the subgraph isomorphism problem, which is hard! Results –An efficient algorithm for detecting frequently occurring patterns in biological networks –The algorithm simplifies the subgraph isomorphism problem to a different, tractable problem with biological applications –Mining the KEGG database yields positive empirical results

3 3 Outline Introduction Model Approach: Graph Mining –Related Work –Formalism for metabolic pathways –The Algorithm Discussion and Empiric Results Conclusion Future Work

4 4 Introduction Experimental data relating to biological sequences (that are highly available and accessible) play an important role in tasks such as discovering common sequences and motifs Biomolecular interaction data are abstracted as graphs –Example: A hypergraph can represent a metabolic pathway where nodes represent compounds –Can be reduced to a directed graph where nodes are enzymes and edges relate them

5 5 Introduction Key problems in this context: –Aligning multiple graphs –Finding frequently occurring sub-graphs in a collection of graphs A solution can lead to the understanding of –Motifs of cellular interactions –Evolutionary relationships –Differences between networks in different organisms –Patterns of gene regulation

6 6 Introduction In the paper –Finding frequently occurring subgraphs in a collection of graphs, each representing a metabolic pathway –Close to the NP-Hard subgraph isomorphism problem –End of story? No! –The problem can be simplified and made tractable and still capture the biological information –Nodes will be “uniquely labeled”, according to the represented enzyme –Experimental results: discovering “interesting” patterns from KEGG takes seconds

7 7 Outline Introduction☺ Model Approach: Graph Mining –Related Work –Formalism for metabolic pathways –The Algorithm Discussion and Empiric Results Conclusion Future Work

8 8 Metabolic Pathways Oldest kind of biological network Group the reactions that belong to a process Publicly available (e.g., KEGG) Chemical compounds are linked to each other by a product-substrate relationship In a hypergraph –Nodes are compounds –A hyperedge is a reaction (or an enzyme) –Hyperedge direction is important to distinguish between substrates and products

9 9 Metabolic Pathways Simplification: –Regular graph: nodes represent enzymes, and an edge connects enzyme a to enzyme b iff a’s product is b’s substrate (more accurately, if such a relation exists) –Edges may be labeled by the compound that relates a to b –A specific enzyme may appear more than once in the same pathway, but we merge its occurrences into one node at the price of losing temporal information Various problems related to understanding molecular interaction in the cell can be solved using graph-related frameworks, mostly to provide a means of investigating units with well-defined functionality Paper focus: Mining pathways for frequent connected subgraphs, which is important because functional modules are expected to repeat among several pathways or organisms (or both)

10 10 Outline Introduction☺ Model ☺ Approach: Graph Mining –Related Work –Formalism for metabolic pathways –The Algorithm Discussion and Empiric Results Conclusion Future Work

11 11 Related Work Subgraph isomorphism –Unlabeled version. Hardness usually “tackled” by ordering nodes and edges for efficient processing –Labeled version. Easier, suitable for biological networks Frequent itemset mining –Multiple sets of items (transactions) from domain D are given –Itemset X implies itemset Y with confidence c if c% of the sets containing X also contain Y –X→Y has support s if s% of the sets contain both X and Y
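The support and confidence definitions on this slide can be illustrated with a minimal sketch. The function names and the toy transactions below are assumptions made for illustration; they do not come from the paper or the slides, and values are returned as fractions rather than percentages.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= frozenset(t))
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    containing_x = [frozenset(t) for t in transactions if x <= frozenset(t)]
    if not containing_x:
        return 0.0  # convention chosen here: no evidence means zero confidence
    return sum(1 for t in containing_x if y <= t) / len(containing_x)


# Hypothetical toy data: four transactions over the items a, b, c.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]

rule_support = support({"a", "b"}, transactions)          # support of a -> b
rule_confidence = confidence({"a"}, {"b"}, transactions)  # confidence of a -> b
```

Here the rule a→b is supported by two of the four transactions, and two of the three transactions containing a also contain b.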

12 12 Graph Formalism for Metabolic Pathways A Metabolic Pathway is a triplet, P(M,Z,R) –M, a set of metabolites –Z, a set of enzymes –R, a set of reactions, where each reaction r is associated with: a set of enzymes Z(r) from Z; a set of substrates S(r) from M; a set of products T(r) from M

13 13 Graph Formalism for Metabolic Pathways A Graph G(V,E) for P(M,Z,R) is defined –For every enzyme z_i in Z, a node v_i exists –(v_i, v_j) is in E iff z_j consumes the product of z_i [Figure: example with enzyme and metabolite nodes]
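As a rough sketch of this construction, the code below derives the enzyme graph from a list of reactions. The `Reaction` class, the `enzyme_graph` function and the two-reaction example are hypothetical names and data introduced here, not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Reaction:
    enzymes: frozenset      # Z(r), a subset of Z
    substrates: frozenset   # S(r), a subset of M
    products: frozenset     # T(r), a subset of M


def enzyme_graph(reactions):
    """Return the directed edges (z_i, z_j) such that z_j consumes a product of z_i."""
    edges = set()
    for r1 in reactions:
        for r2 in reactions:
            # Some product of r1 is a substrate of r2.
            if r1.products & r2.substrates:
                for zi, zj in product(r1.enzymes, r2.enzymes):
                    edges.add((zi, zj))  # self-loops are kept in this sketch
    return edges


# Hypothetical pathway: E1 turns m1 into m2, E2 turns m2 into m3.
example = [
    Reaction(frozenset({"E1"}), frozenset({"m1"}), frozenset({"m2"})),
    Reaction(frozenset({"E2"}), frozenset({"m2"}), frozenset({"m3"})),
]
edges = enzyme_graph(example)
```

The quadratic loop over reaction pairs is chosen for readability; indexing reactions by substrate would avoid it on larger pathways.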

14 14 Mining Metabolic Pathways The Problem: Given a collection of n graphs and a support threshold ε, find all maximal connected subgraphs that are contained in at least εn of the graphs. The support of a subgraph that appears in n’ graphs is n’/n. A frequent subgraph is maximal if it is not contained in another frequent subgraph.

15 15 Subgraph Isomorphism Simplified Nodes are labeled by enzyme identifiers. Only edges are needed to define a graph; their labels conceptually identify the nodes. Edges are items, uniquely specified by labels which refer to enzymes. The problem can therefore be reduced to frequent itemset mining. The graph G_1 here is {ab,ac,de}. Connectivity has to be considered.
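With uniquely labeled nodes, testing whether a pattern occurs in a graph becomes a plain subset test on edge sets. The sketch below illustrates this under stated assumptions: G1 is the edge set given on the slide, while G2, G3 and G4 are hypothetical graphs invented here, since the slide's figure is not reproduced in the transcript. Function names are also illustrative.

```python
# Each graph is a set of edge labels; an edge "ab" joins enzymes a and b.
G1 = frozenset({"ab", "ac", "de"})        # from the slide
G2 = frozenset({"ab", "ac", "bd", "de"})  # hypothetical
G3 = frozenset({"ab", "ac", "ce", "de"})  # hypothetical
G4 = frozenset({"ab", "bd", "ce"})        # hypothetical
graphs = [G1, G2, G3, G4]


def support(pattern, graphs):
    """Fraction of graphs whose edge set contains every edge of `pattern`."""
    pattern = frozenset(pattern)
    return sum(1 for g in graphs if pattern <= g) / len(graphs)


def frequent_edges(graphs, threshold):
    """Single edges whose support reaches the threshold, in sorted order."""
    all_edges = set().union(*graphs)
    return sorted(e for e in all_edges if support({e}, graphs) >= threshold)


pair_support = support({"ab", "ac"}, graphs)
singles = frequent_edges(graphs, 0.75)
```

The subset test says nothing about connectivity, so a pattern such as {ab, de} can be frequent as an itemset without being a connected subgraph.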

16 16 Subgraph Isomorphism Simplified A connected edgeset corresponds to a connected subgraph –A unique edge is a set of two node labels –A set of unique edges ES={e_1, e_2, …, e_k} is called connected iff every nonempty proper subset ES’ of ES shares at least one node with the remaining edges ES\ES’. Connection to frequent itemset mining –Input graphs correspond to transactions –Connected edgesets correspond to itemsets –Approach: build frequent sets bottom up (small to large) –Edge addition preserves connectivity
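The connectivity condition on edgesets can be checked without enumerating subsets. The sketch below grows a set of reached nodes from one edge and tests whether every edge is eventually absorbed. The function name is an assumption, and edges are written as two-character labels such as "ab".

```python
def is_connected(edges):
    """True if the edges form one connected component.

    Each edge is any 2-element iterable of node labels, e.g. "ab" or ("a", "b").
    """
    remaining = [frozenset(e) for e in edges]
    if not remaining:
        return True  # convention chosen here: the empty edgeset is connected
    reached = set(remaining.pop())
    changed = True
    while changed:
        changed = False
        still_out = []
        for e in remaining:
            if e & reached:      # edge touches a node already reached
                reached |= e
                changed = True
            else:
                still_out.append(e)
        remaining = still_out
    return not remaining


connected_pair = is_connected({"ab", "ac"})
split_set = is_connected({"ab", "ac", "de"})
```

Adding an edge that shares a node with a connected edgeset keeps it connected, which is why a bottom-up search only needs to look at neighboring edges.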

17 17 Subgraph Isomorphism Simplified Throughout the search, only connected edgesets are considered –Captures the connected nature of pathways It is important to avoid the redundancy that comes from considering the same set in different orders.

18 18 The Algorithm

19 19 The Algorithm The procedure is invoked for each frequent edge e_i – Mine({}, {e_i}, N(e_i), {e_1, e_2, …, e_k}), where N(e_i) is the set of edges neighboring e_i The support threshold is embodied in the “if frequent” statement Example: consider 5 enzymes, a, b, c, d and e, which participate (vacuously or not) in 4 pathways G_1, G_2, G_3, G_4. We mine with support threshold ε = ¾.

20 20 Example ab, ac and de are the only frequent edges. Mine({}, {ab}, N(ab), {ab,ac,bd,de,ce}) Mine({}, {ac}, N(ac), {ab,ac,bd,de,ce}) Mine({}, {de}, N(de), {ab,ac,bd,de,ce}) {ab,ac} and {de} are the frequent subgraphs.

21 21 Example {ab,ac} and {de} are the maximal frequent subgraphs. [Figure: mining development]
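The example can be reproduced with a simplified bottom-up search, sketched below. This is not the paper's Mine procedure: it avoids revisiting the same edgeset by remembering visited sets rather than by ordering edges, and it filters maximal sets at the end. G1 comes from the slides; G2, G3 and G4 are hypothetical graphs chosen so that ab, ac and de are the only frequent edges at threshold ¾.

```python
def mine_maximal(graphs, threshold):
    """Maximal connected edgesets contained in at least threshold * n graphs."""
    n = len(graphs)

    def frequent(edgeset):
        return sum(1 for g in graphs if edgeset <= g) >= threshold * n

    all_edges = set().union(*graphs)
    freq_edges = sorted(e for e in all_edges if frequent(frozenset([e])))

    seen = set()
    stack = [frozenset([e]) for e in freq_edges]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        nodes = set().union(*(set(e) for e in current))
        # Extend only with frequent edges that touch the current subgraph,
        # so every candidate stays connected.
        for e in freq_edges:
            if e not in current and set(e) & nodes:
                bigger = current | {e}
                if bigger not in seen and frequent(bigger):
                    stack.append(bigger)

    # Keep the sets that are not strictly contained in another frequent set.
    return [s for s in seen if not any(s < other for other in seen)]


graphs = [
    frozenset({"ab", "ac", "de"}),        # G1, from the slides
    frozenset({"ab", "ac", "bd", "de"}),  # G2, hypothetical
    frozenset({"ab", "ac", "ce", "de"}),  # G3, hypothetical
    frozenset({"ab", "bd", "ce"}),        # G4, hypothetical
]
result = mine_maximal(graphs, 0.75)
```

Because bd and ce fall below the threshold, nothing frequent links de to the rest, and the search ends with the two maximal sets named on the slide.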

22 22 Polynomial Bound The paper does not prove a complexity bound; it justifies “efficiency” only empirically We show a polynomial bound for time complexity –The frequent edges can be determined by sorting –Determining the neighbors of an edge is linear (requires one pass) –At every level of the recursion, the algorithm extends a frequent subgraph with a new frequent edge, which gives a linear number of procedure calls –Each such call runs in time polynomial in n, where n is the number of edges in the input

23 23 Outline Introduction☺ Model: ☺ Approach: Graph Mining ☺ –Related Work ☺ –Formalism for metabolic pathways ☺ –The Algorithm ☺ Discussion and Empiric Results Conclusion Future Work

24 24 Empiric Results (Glutamate) The bold subgraph was mined and appears in 29% of the organisms in KEGG. The solid subgraph appears in 19.3%. The entire graph appears in 14.2%.

25 25 Empiric Results Alanine-aspartate: 32.1%, 19.2%, 11.5%. Pyrimidine: 25.6%, 21.8%, 15.4%.

26 26 Empiric Results Run-time results on a Pentium 4, 2 GHz, 0.5 GB of RAM: a sub-pathway of 16 edges was discovered in 3 sec. The entire graph appears in 14.2%.

27 27 Outline Introduction☺ Model: ☺ Approach: Graph Mining ☺ –Related Work ☺ –Formalism for metabolic pathways ☺ –The Algorithm ☺ Discussion and Empiric Results ☺ Conclusion Future Work

28 28 Conclusion Framework for mining biological networks Graph simplification without losing biological meaning Efficient graph mining Good response times

29 29 Outline Introduction☺ Model: ☺ Graph Mining ☺ –Related Work ☺ –Formalism for metabolic pathways ☺ –The Algorithm ☺ Discussion and Empiric Results ☺ Conclusion ☺ Future Work

30 30 Future Work Adding flexibility for capturing biologically meaningful info and concepts, such as probabilistic methods Probabilistic models for investigating the significance of discovered patterns (but unlike the previous case, probability does not model biology) Approximate matching rather than exact –What is an approximation in this case? Suitable definition needed

31 31 NEXT PAPER (IN BRIEF)…

32 32 Seminar in Bioinformatics Pairwise Local Alignment of Protein Interaction Networks Guided by Models of Evolution Paper by: M. Koyuturk, A. Grama and W. Szpankowski Appeared in: Journal of Comp. Biology, 13(2), 182-199, 2006. Presented by: Royi Ronen

33 33 The Problem Protein-Protein-Interaction networks are modeled as graphs A PPI network is an undirected graph (V,E) –Elements in V represent proteins –Elements in E represent pairs which interact The paper solves the problem of aligning two graphs (rather than many)

34 34 Homology Function S(,) Consider two graphs: G(U,E), H(V,F). For each pair of proteins from the union of U and V, S assigns a score: –If the pair belongs to the same species, the confidence that they are paralogous; if to different species, the confidence that they are orthologous. 0 is the lowest value –Values of S are determined by an algorithm outside the scope of the paper (INPARANOID) Some definitions: –Match: A conserved interaction between orthologous pairs –Mismatch: A lack of interaction between a pair whose orthologs interact –Duplication: Paralogous proteins (tend to diverge in the long run)

35 35 Proposed Solution Every pair of node subsets induces an alignment {M,N,D}, which is associated with a score. M: pairs of edges, with positive S values for their nodes, which exist in both graphs. Each is associated with a positive score. N: pairs of edges, with positive S values for their nodes, which exist in one graph but not in the other. Each is associated with a negative score. D: pairs of nodes from the same graph with positive S. Each is associated with a negative score. The total score is the sum of all the scores, and we wish to find alignments with locally maximal scores.
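A toy version of this scoring scheme is sketched below. It only counts matches, mismatches and duplications and combines them with constant weights. The weights, the function name and the tiny networks are hypothetical, and the scoring in the paper may differ from this simplification. Node names are assumed to be distinct across the two graphs.

```python
from itertools import combinations


def alignment_score(g_edges, h_edges, s, u_nodes, v_nodes,
                    match_w=1.0, mismatch_w=0.5, dup_w=0.25):
    """Score the alignment induced by node subsets u_nodes (of G) and v_nodes (of H).

    g_edges, h_edges: sets of frozenset({x, y}) undirected edges.
    s: dict mapping frozenset({x, y}) to a similarity value (missing means 0).
    The three weights are illustrative constants, not values from the paper.
    """
    def sim(x, y):
        return s.get(frozenset((x, y)), 0.0)

    matches = mismatches = 0
    for u1, u2 in combinations(sorted(u_nodes), 2):
        for v1, v2 in combinations(sorted(v_nodes), 2):
            # Try both ways of pairing (u1, u2) with (v1, v2).
            for a, b in ((v1, v2), (v2, v1)):
                if sim(u1, a) > 0 and sim(u2, b) > 0:
                    in_g = frozenset((u1, u2)) in g_edges
                    in_h = frozenset((a, b)) in h_edges
                    if in_g and in_h:
                        matches += 1       # interaction conserved in both
                    elif in_g != in_h:
                        mismatches += 1    # interaction present in one only

    # Duplications: similar pairs of nodes inside the same graph.
    dups = sum(1 for nodes in (u_nodes, v_nodes)
               for x, y in combinations(sorted(nodes), 2) if sim(x, y) > 0)
    return match_w * matches - mismatch_w * mismatches - dup_w * dups


# Hypothetical toy networks: h* are proteins of G, m* are proteins of H.
g_edges = {frozenset(("h1", "h2")), frozenset(("h2", "h3"))}
h_edges = {frozenset(("m1", "m2"))}
s = {frozenset(("h1", "m1")): 1.0,
     frozenset(("h2", "m2")): 1.0,
     frozenset(("h3", "m3")): 1.0}
score = alignment_score(g_edges, h_edges, s,
                        {"h1", "h2", "h3"}, {"m1", "m2", "m3"})
```

In this toy input the interaction h1-h2 is conserved as m1-m2, which gives one match, while h2-h3 has no counterpart m2-m3, which gives one mismatch, and no duplications occur.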

36 36 Proposed Solution An algorithm is proposed in order to avoid considering all possible subsets. The heuristic tries to expand a set so that its score increases. Rings a bell?

37 37 Experimental Results Using this alignment method and a scoring algorithm for S(,) called INPARANOID, PPI networks of Human and Mouse were aligned Data taken from the DIP Database Details: –Homo sapiens: 1369 interactions between 1065 proteins –Mus musculus: 286 interactions between 329 proteins

38 38 Experimental Results INPARANOID discovered 237 ortholog clusters 305 matched interactions were discovered; 205 mismatches, 536 duplications in Human; 149 mismatches, 384 duplications in Mouse. Examples: –Conserved subnet with one-way mismatches –Conserved subnet with two-way mismatches –Duplications

39 39 Example 1 Graphs aligned Biological meaning –Similarity and differences between the species –Insight on evolutionary events

40 40 Example 2 Another graph alignment result with local maximum score

41 41 Example 3 Instance of duplication between mouse and human The regulator regulates homologs

