Network motifs: discovery and applications Guy Zinman Seminar in Bioinformatics Technion, Spring 2005.


1 Network motifs: discovery and applications Guy Zinman Seminar in Bioinformatics Technion, Spring 2005

2 Outline: Theory of network motifs (definition, algorithm); application to the E. coli transcription network; the dynamic behavior of the motifs; finding active subnetworks (simulated annealing, experiments).

3 Network

4 Dictionary definition: A group or system of (electric) components and connecting circuitry designed to function in a specific manner. A network is the backbone of a complex system. Studying networks is similar to paleontology: learning about an animal from its backbone.

5 Network motifs The notion of motif, widely used for sequence analysis, is generalized to the level of networks. Network Motifs are defined as patterns of interconnections that recur in many different parts of a network at frequencies much higher than those found in randomized networks.

6 Network motifs (cont.) Such motifs are found in networks from: biochemistry (transcriptional regulation networks); neurobiology (neuron connectivity); ecology (food webs); engineering (electronic circuits, World Wide Web).

7 Network motifs (cont.)

8

9 Schematic view of motif detection Occurrence of the FFL motif:

10 Random vs designed/evolved features Large networks may contain information about the design principles and/or evolution of the complex system. Which features are there for a reason: design principles (e.g. feed-forward loops); constraints (e.g. all nodes on the Internet must be connected to each other); evolution and growth dynamics (e.g. network growth is mainly due to gene duplication).

11 Network motifs Milo R. et al.: “Network Motifs: Simple Building Blocks of Complex Networks”; Science, 2002. Different motifs were found in different classes of networks. The motifs reflect the underlying processes that generate each type of network.

12 Motifs detected Two significant motifs: both appeared numerous times in non-homologous gene systems that perform diverse biological functions.

13 Motifs detected

14

15 Main tasks for detecting network motifs There are two main tasks in detecting network motifs: (1) generating an ensemble of proper random networks (2) counting the subgraphs in the real network and in random networks.

16 The algorithm Starting point: a graph with directed edges. Scan for n-node subgraphs (n = 3, 4) and count the number of occurrences of each. Compare to randomized graphs that preserve the in-, out- and in+out-degree of each node (a plain Erdős–Rényi random graph would not preserve these degrees).

17 All 3-node connected subgraphs There are 13 non-isomorphic types of 3-node connected directed subgraph. The count grows quickly with size: 199 types of 4-node subgraph, 9364 types of 5-node subgraph, and so on.
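
The count of 13 can be checked by brute force. The sketch below is only an illustration, not part of the original slides: it enumerates every directed graph on 3 labelled nodes (no self-loops), keeps the weakly connected ones, and merges graphs that differ only by a relabelling of nodes. All function names are hypothetical.

```python
from itertools import permutations, product

NODES = (0, 1, 2)
# The 6 possible directed edges between 3 distinct nodes (no self-loops).
POSSIBLE_EDGES = [(i, j) for i in NODES for j in NODES if i != j]


def is_weakly_connected(edges):
    """True if the 3 nodes are connected when edge direction is ignored."""
    undirected = {frozenset(e) for e in edges}
    # On 3 nodes, any two distinct undirected links share a node,
    # so two or more links are enough for connectivity.
    return len(undirected) >= 2


def canonical_form(edges):
    """Smallest relabelled edge tuple over all 3! node permutations."""
    forms = []
    for perm in permutations(NODES):
        forms.append(tuple(sorted((perm[i], perm[j]) for i, j in edges)))
    return min(forms)


def count_three_node_types():
    """Count isomorphism classes of weakly connected 3-node digraphs."""
    types = set()
    for mask in product((0, 1), repeat=len(POSSIBLE_EDGES)):
        edges = [e for e, keep in zip(POSSIBLE_EDGES, mask) if keep]
        if is_weakly_connected(edges):
            types.add(canonical_form(edges))
    return len(types)


print(count_three_node_types())
```

The same brute-force idea becomes impractical for larger n, which is why the number of subgraph types is usually taken from tables.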

18 Generation of randomized network Algorithm A Employ a Markov-chain algorithm that starts with the real network and repeatedly swaps randomly chosen pairs of connections (X1 => Y1, X2 => Y2 is replaced by X1 => Y2, X2 => Y1) until the network is well randomized. Switching is prohibited if either of the connections X1 => Y2 or X2 => Y1 already exists.
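
A minimal sketch of the switching step, assuming the network is given as a list of (source, target) pairs. The function name, the `max_attempts` cap and the extra ban on self-loops are assumptions added for illustration; the slides only specify the swap and the duplicate-edge rule.

```python
import random


def switch_randomize(edges, n_switches, seed=None):
    """Randomize a directed graph by repeated edge switching.

    Every node keeps its number of incoming and outgoing edges.
    """
    rng = random.Random(seed)
    edge_list = sorted(set(edges))
    edge_set = set(edge_list)
    done = 0
    attempts = 0
    max_attempts = 100 * n_switches  # assumption: stop if switches keep failing
    while done < n_switches and attempts < max_attempts:
        attempts += 1
        a, b = rng.sample(range(len(edge_list)), 2)
        x1, y1 = edge_list[a]
        x2, y2 = edge_list[b]
        new1, new2 = (x1, y2), (x2, y1)
        # Switching is prohibited if either new connection already exists.
        if new1 in edge_set or new2 in edge_set:
            continue
        # Assumption: also refuse switches that would create a self-loop.
        if x1 == y2 or x2 == y1:
            continue
        edge_set -= {(x1, y1), (x2, y2)}
        edge_set |= {new1, new2}
        edge_list[a], edge_list[b] = new1, new2
        done += 1
    return edge_list


real = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (1, 3)]
print(switch_randomize(real, n_switches=20, seed=1))
```

How many switches count as "well randomized" is not stated in the slides; a common rule of thumb is several switches per edge.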

19 Generation of randomized network Algorithm B Each network is represented as a connectivity matrix $M$, such that $M_{ij} = 1$ if there is a connection directed from node i to node j, and 0 otherwise. The goal is to create a randomized connectivity matrix $M_{\mathrm{rand}}$ that has the same number of nonzero elements in each row and column as the corresponding row and column of the real connectivity matrix.

20 Generation of randomized network The row and column sums are preserved: $R_i = \sum_j M_{\mathrm{rand},ij} = \sum_j M_{ij}$ and $C_j = \sum_i M_{\mathrm{rand},ij} = \sum_i M_{ij}$. To generate a randomized network, start with an empty matrix $M_{\mathrm{rand}}$. Repeatedly choose a row $n$ at random according to the weights $p_i = R_i / \sum_i R_i$ and a column $m$ according to the weights $q_j = C_j / \sum_j C_j$. If $M_{\mathrm{rand},nm} = 0$, set $M_{\mathrm{rand},nm} = 1$, then set $R_n = R_n - 1$ and $C_m = C_m - 1$. If the entry $(n, m)$ was already entered into the randomized matrix, that is, if $M_{\mathrm{rand},nm} = 1$, or if $m = n$, choose a new pair $(n, m)$. Repeat until all $R_i = 0$ and $C_j = 0$.

21 Network motif detection For each nonzero element (i, j), loop through all connected elements $M_{ik} = 1$, $M_{ki} = 1$, $M_{jk} = 1$ and $M_{kj} = 1$. This is repeated recursively with elements (i, k), (k, i), (j, k) and (k, j) until an n-node subgraph is obtained. A table is formed that counts the number of appearances of each type of subgraph in the network, correcting for the fact that multiple submatrices of M can correspond to one isomorphic architecture owing to symmetries.
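
For a single 3-node pattern the counting step can be written directly. The sketch below counts feed-forward loops (X => Y, Y => Z, X => Z) in an edge list; it is a simplified illustration with a hypothetical function name, not the general n-node enumeration described above. It counts every occurrence of the pattern, even when the three nodes carry additional edges among themselves.

```python
def count_feedforward_loops(edges):
    """Count triples (x, y, z) with edges x->y, y->z and x->z."""
    successors = {}
    for src, dst in edges:
        if src != dst:  # ignore self-loops (autoregulation)
            successors.setdefault(src, set()).add(dst)
    count = 0
    for x, x_targets in successors.items():
        for y in x_targets:
            for z in successors.get(y, ()):
                # z must be a third node that x also regulates directly.
                if z != x and z in x_targets:
                    count += 1
    return count


network = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("X", "W"), ("W", "Z")]
print(count_feedforward_loops(network))
```

Because the roles of x, y and z in a feed-forward loop are all different, each loop is found exactly once and no symmetry correction is needed for this particular pattern.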

22 Network motif detection This process is repeated for each of the randomized networks. The number of appearances of each type of subgraph in the random ensemble is recorded, to assess its statistical significance. The present concepts and algorithms are easily generalized to nondirected or directed graphs with several “colors” of edges and nodes, multipartite graphs, and so forth.

23 Criteria for Network Motif Selection A subgraph is selected as a motif if the probability that it appears in a randomized network an equal or greater number of times than in the real network is smaller than P = 0.01. Reminder: the p-value is the probability of getting a result at least as extreme as the observed one when the tested subject is not affected by the experiment (the null hypothesis). If the p-value is below 0.01, then the subject is considered to be affected (the null hypothesis is rejected).
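
The selection criterion amounts to an empirical p-value over the random ensemble. A small sketch, with hypothetical function names and made-up counts used purely as an example:

```python
def empirical_p_value(real_count, random_counts):
    """Fraction of randomized networks with at least real_count occurrences."""
    hits = sum(1 for c in random_counts if c >= real_count)
    return hits / len(random_counts)


def is_motif(real_count, random_counts, threshold=0.01):
    """Apply the P < 0.01 criterion to one subgraph type."""
    return empirical_p_value(real_count, random_counts) < threshold


# Illustrative numbers only: 10 randomized networks is far too few in practice.
random_counts = [3, 5, 4, 6, 2, 7, 5, 4, 3, 6]
print(empirical_p_value(40, random_counts), is_motif(40, random_counts))
```

With an ensemble of N randomized networks the smallest nonzero p-value that can be resolved is 1/N, so the ensemble size limits how strong a claim can be made.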

24 Run time complexity The performance of this algorithm scales with the total number of n-node subgraphs in the network. The number of subgraphs and the algorithm runtime also increase dramatically for subgraphs with n ≥ 5.

25 Sampling method for subgraph counting Kashtan et al.: “Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs”; Bioinformatics, 2004. This algorithm samples subgraphs in order to estimate their relative frequency. The runtime of the algorithm asymptotically does not depend on the network size. Surprisingly, few samples are needed to detect network motifs reliably.

26 Subgraph sampling Procedure description: pick a random edge from the network and then expand the subgraph iteratively by picking random neighboring edges until the subgraph reaches n nodes. For each random choice of an edge, in order to pick an edge that will expand the subgraph size by one, prepare a list of all such candidate edges and then randomly choose an edge from the list.

27 Subgraph sampling Finally, the sampled subgraph is defined by the set of n nodes and all the edges that connect these nodes in the original network. Finding n-node subgraphs for n ≥ 5 is much easier now.
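
The sampling procedure can be sketched as follows. This is a simplified illustration with a hypothetical function name: it draws one subgraph and returns its nodes and induced edges. The published method also weights each sample to correct for the fact that different subgraphs are reached with different probabilities; that correction is omitted here.

```python
import random


def sample_subgraph(edges, n, rng):
    """Sample one connected n-node subgraph by random edge expansion.

    Returns (nodes, induced_edges), or None if the walk cannot reach n nodes.
    """
    edge_list = sorted(set(edges))
    nodes = set(rng.choice(edge_list))  # start from a random edge
    while len(nodes) < n:
        # Candidate edges have exactly one endpoint inside the current
        # node set, so each choice expands the subgraph by one node.
        candidates = [e for e in edge_list if (e[0] in nodes) != (e[1] in nodes)]
        if not candidates:
            return None
        nodes.update(rng.choice(candidates))
    # The sampled subgraph includes all original edges among the chosen nodes.
    induced = {e for e in edge_list if e[0] in nodes and e[1] in nodes}
    return frozenset(nodes), frozenset(induced)


network = [(0, 1), (1, 2), (2, 3), (0, 2), (3, 4)]
print(sample_subgraph(network, 3, random.Random(0)))
```

The cost of one sample depends on the local neighbourhood of the chosen nodes rather than on the size of the whole network, which is the source of the speed-up.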

28 Comparing sampling method results with exhaustive enumeration

29 Transcriptional Regulation Network of Escherichia coli Operon – a group of contiguous genes that are transcribed into a single mRNA molecule. The transcriptional network is represented as a directed graph: each operon represents a node and edges represent direct transcriptional interactions.

30 Application to E. coli Shen-Orr S. et al.: “Network motifs in the transcriptional regulation network of Escherichia coli”; Nature Genetics, 2002. Database: RegulonDB contains interactions between transcription factors (TFs) and the operons they regulate. It contains 577 interactions, 424 operons and 116 TFs; 35 more TFs were added from the literature. The previously described algorithm was run on this data (1000 random networks).

31 Significant motifs Feedforward loop found in 22 different systems, 10 TFs and 40 operons P-Val=0.001

32 Concentration of FFL

33 Same in the yeast regulatory network Young et al.: “Transcriptional Regulatory Networks in Saccharomyces cerevisiae”; Science, 2002.

34 Can you think of a possible role for this motif?

35 Dynamics for the FFL

36 Mangan et al., “Structure and function of the feed-forward loop”; PNAS, 2003. Consider Sx and Sy as input signals: small molecules that activate or inhibit the activity of X and Y.

37 Coherency of FFLs The FFL is ‘coherent’ if the direct effect of the general TF on the effector has the same sign as its net indirect effect through the specific TF. 85% of the FFLs found were coherent.
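
Coherence is a sign check. In the sketch below, activation is encoded as +1 and repression as -1 (an encoding assumed here for illustration); X is the general TF, Y the specific TF and Z the effector. The indirect effect of X on Z through Y is the product of the two signs along that path.

```python
def is_coherent(sign_xy, sign_yz, sign_xz):
    """True if the direct X->Z sign equals the indirect X->Y->Z sign.

    Each sign is +1 for activation and -1 for repression.
    """
    indirect = sign_xy * sign_yz
    return sign_xz == indirect


# All-activating loop: direct (+) and indirect (+ * + = +) agree.
print(is_coherent(+1, +1, +1))
# X activates Y, Y represses Z, X activates Z: direct (+) vs indirect (-).
print(is_coherent(+1, -1, +1))
```

Of the eight possible sign combinations, four are coherent and four are incoherent under this rule.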

38 Significant motif Single Input Motif (SIM): a single transcription factor controls a set of operons. All operons in a SIM are regulated with the same sign. It appeared in 24 different systems.

39 Dynamics for the SIM

40 Significant motif Dense Overlapping Regulon (DOR) - a layer of overlapping interactions between operons and a group of TFs, much denser than this structure would appear in an Erdos-Renyi random graph

41 E. Coli network

42 DOR detection Briefly: define a (nonmetric) distance measure between operons k and j. The operons were clustered. DORs corresponded to clusters with more than C = 10 connections and a ratio of connections to TFs greater than R = 2.

43 mFinder A software tool for estimating subgraph concentrations and detecting network motifs. www.weizmann.ac.il/mcb/UriAlon/

44 Discussion The concept of homology between genes based on sequence motifs has been crucial for understanding the function of uncharacterized genes. Likewise, the notion of similarity between connectivity patterns in networks, based on network motifs, may be helpful in gaining insight into the dynamic behavior of newly identified gene circuits.

45 Discussion Until now we considered only transcription interactions specifically manifested by transcription factors that bind regulatory sites. This transcriptional network can be thought of as ‘slow’ part of the cellular regulation network (time scale of minutes).

46 Discussion An additional layer of faster interactions, which include interaction between proteins (often subsecond timescale), contributes to the full regulatory behavior.

47 Finding active subnetworks Ideker T. et al.: “Discovering regulatory and signaling circuits in molecular interaction networks”; Bioinformatics, 2002. Integrates protein-protein and protein-DNA interactions with mRNA expression data, with the goal of better understanding the molecular mechanisms behind the observed gene expression. Uses a method of searching the network to find ‘active subnetworks’, i.e., connected sets of genes with unexpectedly high levels of differential expression under one or more perturbations.

48 Methodology Using a molecular interaction network to analyze changes in expression over 20 perturbations to the yeast galactose utilization (GAL) pathway. Determining which conditions significantly affected the gene expression in each active subnetwork.

49 The means Combining a rigorous statistical measure for scoring subnetworks with a search algorithm for identifying subnetworks with high score.

50 Basic z-score calculation To rate the biological activity of a particular subnetwork, begin by assessing the significance of differential expression for each gene. The error model is provided by the VERA (Variability and ERror Assessment) program. VERA estimates the parameters of a statistical model using the method of maximum likelihood. Output: p-values ($p_i$), representing the significance of the expression change.

51 Each $p_i$ is converted to a z-score: $z_i = \Phi^{-1}(1 - p_i)$, where $\Phi^{-1}$ is the inverse normal CDF (cumulative distribution function). Smaller p-values correspond to larger z-scores. The z-score quantifies how different from normal the given value is.
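
The conversion can be tried out with the Python standard library, whose `statistics.NormalDist` class provides the inverse normal CDF as `inv_cdf`. The function name `p_to_z` is a hypothetical helper; the p-values below are arbitrary examples.

```python
from statistics import NormalDist


def p_to_z(p):
    """Convert a p-value (0 < p < 1) to a z-score via the inverse normal CDF."""
    return NormalDist().inv_cdf(1.0 - p)


for p in (0.5, 0.05, 0.01, 0.001):
    print(p, round(p_to_z(p), 3))
```

A p-value of 0.5 maps to a z-score of 0, and the z-score grows as the p-value shrinks.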

52 Scoring of Subnetworks Aggregate z-score for an entire subnetwork A of k genes: $z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i$. Notice: $z_A$ will also be distributed according to the standard normal (because the variables are independent). Subnetworks of all sizes are comparable under this scoring system, independent of k. A high $z_A$ indicates a biologically active subnetwork.

53 Calibrating z against background distribution Randomly sample gene sets of size k using a Monte Carlo approach, compute their scores $z_A$, and calculate the mean $\mu_k$ and standard deviation $\sigma_k$ for each k. The corrected subnetwork score $S_A$ is: $S_A = (z_A - \mu_k) / \sigma_k$.
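
A sketch of the scoring and calibration steps. It assumes the aggregate score is the sum of the gene z-scores divided by the square root of k, and that the corrected score standardizes this value by the Monte Carlo mean and standard deviation for gene sets of the same size. All function names and the toy z-scores are hypothetical.

```python
import math
import random
import statistics


def aggregate_z(z_scores):
    """Aggregate z-score of a gene set: sum of z-scores over sqrt(k)."""
    return sum(z_scores) / math.sqrt(len(z_scores))


def background_params(all_z, k, n_samples, rng):
    """Monte Carlo mean and standard deviation of aggregate_z for size k."""
    samples = [aggregate_z(rng.sample(all_z, k)) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


def corrected_score(z_scores, all_z, n_samples=1000, seed=0):
    """Standardize the aggregate score against random gene sets of equal size."""
    rng = random.Random(seed)
    mu_k, sigma_k = background_params(all_z, len(z_scores), n_samples, rng)
    return (aggregate_z(z_scores) - mu_k) / sigma_k


# Toy background: z-scores of all genes in the network (made-up values).
all_z = [x / 10 for x in range(-20, 21)]
print(corrected_score([2.0, 1.8, 1.9], all_z))
```

Dividing by the square root of k keeps the variance of the aggregate at 1 for independent standard normal inputs, which is what makes subnetworks of different sizes comparable.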

54 Scoring an example subnetwork [Figure: gene z-scores $z_a$, $z_b$, $z_c$, $z_d$ combined into the aggregate score $z_A$ and the corrected score $S_A$]

55 Scoring over multiple conditions Start with a matrix of p-values (genes vs. conditions) and the corresponding z-scores. Produce m different aggregate scores, one for each condition, and sort them. Find the probability that at least j of the m conditions had scores above $z_{A(j)}$. A Monte Carlo technique is used to estimate the mean and the standard deviation from random gene sets of size k.

56 Scoring over multiple conditions

57 Finding the maximal scoring Problem: Finding the maximal scoring connected subgraph is NP-hard.

58 The Difficulty in Searching Global Optima [Figure: subnetwork significance score landscape with global and local maxima]

59 Rugged landscapes and local maxima problem

60 Monte Carlo random search Also known as the ‘Metropolis algorithm’. A simulation technique for conformational sampling and optimization based on a random search for energetically favourable conformations. Finding a global (or at least “good” local) maximum by a biased random walk may take some luck.

61 [Figure: subnetwork significance score landscape with global and local maxima]

62 Climbing mountains easier: simulated annealing [Figure: subnetwork significance score landscape with global and local maxima] In order to get out of a local maximum one needs to allow for locally unfavorable moves.

63 Introduction to simulated annealing Simulated annealing (Kirkpatrick et al., 1983): a mathematical method developed together with Monte Carlo techniques to avoid false maxima. The method simulates the slow cooling of a solidifying solution to form a single crystal. Origin: the annealing process of heated solids. Intuition: by allowing occasional descent in the search process, we might be able to escape the trap of local maxima. In our context: allow nodes to be removed from the subsets, even if the resulting subnetwork’s score is a (little) lower.

64 What can be an adverse effect of this method?

65 Consequences of the Occasional Descents Desired effect: they help in escaping local optima. Adverse effect: the search might pass the global optimum after reaching it, so the result is not guaranteed to be optimal. But here we don’t care: any high-scoring subnetwork is suspected to be biologically significant.

66 Climbing mountains easier: simulated annealing Define a “temperature” function. Increasing the effective “temperature” means a higher probability of accepting moves that increase the energy (in our context, moves that lower the score). Thus, the likelihood of escaping from a local maximum may be tuned.

67 Control of Annealing Process Acceptance of a search step (Metropolis criterion): assume the performance change in the search direction is $\Delta$. Accept a descending step ($\Delta < 0$) only if it passes a random test, i.e. with probability $p = e^{\Delta / T}$. Always accept an ascending step, i.e. $p = 1$ for $\Delta \ge 0$.
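
The acceptance test can be sketched as below, assuming the standard form of the Metropolis criterion for a maximization problem: a step that changes the score by delta is always accepted when delta is not negative, and otherwise accepted with probability exp(delta / T). The function name and parameters are hypothetical.

```python
import math
import random


def accept_move(delta, temperature, rng=random):
    """Metropolis criterion for maximization.

    delta: score change of the proposed step (new score minus old score).
    temperature: current annealing temperature T, must be positive.
    """
    if delta >= 0:
        return True  # ascending steps are always accepted
    # Descending steps pass a random test with probability exp(delta / T).
    return rng.random() < math.exp(delta / temperature)


rng = random.Random(0)
accepted = sum(accept_move(-1.0, 1.0, rng) for _ in range(10000))
print(accepted / 10000)  # close to exp(-1), about 0.37
```

At high temperature almost any descending step is accepted; as the temperature approaches 0 the search becomes a pure hill climb.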

68 Control of Annealing Process Cooling schedule: T, the annealing temperature, is the parameter that controls the frequency of acceptance of descending steps. We gradually reduce the temperature T(k) from 1 toward 0. The probability of accepting descending steps falls as T falls.
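
One common way to lower the temperature is a geometric schedule, in which each step multiplies T by a fixed ratio. The sketch below is an assumption about how such a schedule could be built, not the schedule used in the study. A geometric schedule can never reach exactly 0, so the end temperature here is a small positive number chosen for illustration.

```python
def geometric_schedule(t_start, t_end, n_steps):
    """Temperatures from t_start down to t_end with a constant ratio.

    Requires t_start > t_end > 0 and n_steps >= 2.
    """
    ratio = (t_end / t_start) ** (1.0 / (n_steps - 1))
    return [t_start * ratio**i for i in range(n_steps)]


schedule = geometric_schedule(1.0, 0.01, 5)
print([round(t, 4) for t in schedule])
```

The ratio is chosen so that the last temperature equals t_end up to floating-point rounding.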

69 In our context Input: a graph G = (V, E) of molecular interactions; N, the number of iterations; $T_i$, a temperature function that decreases from $T_{start}$ to $T_{end}$. Output: $G_w$, a subgraph of G. Initialize $G_w$ by setting each node to an ‘active’ or ‘inactive’ state randomly (with p = ½).

70 Simulated Annealing Algorithm For i = 1 to N DO: Randomly pick a node v from V and toggle its state. Compute the score $s_i$ for the working subgraph $G_w$. IF $s_i > s_{i-1}$, keep v toggled; ELSE keep v toggled with probability $p = e^{(s_i - s_{i-1}) / T_i}$.
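
The loop can be sketched end to end on a toy graph. Several choices below are assumptions made for illustration and are not confirmed by the slides: the score of the working subgraph is taken to be the aggregate z-score of its best connected component, the acceptance probability is exp((new score - old score) / T), cooling is geometric from `t_start` to a small positive `t_end`, and the background calibration step is left out. All names are hypothetical.

```python
import math
import random


def top_component_score(adjacency, z, active):
    """Highest aggregate z-score over connected components of active nodes."""
    best = float("-inf")
    seen = set()
    for start in active:
        if start in seen:
            continue
        # Depth-first search restricted to active nodes.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in adjacency.get(node, ()):
                if nb in active and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        score = sum(z[v] for v in component) / math.sqrt(len(component))
        best = max(best, score)
    return best


def anneal(adjacency, z, n_iter, t_start=1.0, t_end=0.01, seed=0):
    """Toggle nodes at random, keeping changes by the Metropolis rule."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    # Initialize each node as active with probability 1/2.
    active = {v for v in nodes if rng.random() < 0.5}
    score = top_component_score(adjacency, z, active)
    ratio = (t_end / t_start) ** (1.0 / max(n_iter - 1, 1))
    temperature = t_start
    for _ in range(n_iter):
        v = rng.choice(nodes)
        active ^= {v}  # toggle the state of v
        new_score = top_component_score(adjacency, z, active)
        delta = new_score - score
        if delta > 0 or rng.random() < math.exp(delta / temperature):
            score = new_score  # keep v toggled
        else:
            active ^= {v}  # undo the toggle
        temperature *= ratio
    return active, score


# Toy undirected path a-b-c-d-e, given as a symmetric adjacency dict.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
z = {"a": 3.0, "b": 3.0, "c": -3.0, "d": 0.1, "e": -2.0}
print(anneal(adjacency, z, n_iter=500))
```

The function returns the final state rather than the best state seen during the run, which matches the remark that the result is not guaranteed to be optimal.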

71 Heuristics for improved annealing Look for M active subnetworks simultaneously, where M is a user-defined variable. Maintaining multiple components can improve the efficiency of annealing. This can be done by multiple annealing runs, or by extending the annealing approach to maintain a graph state vector of the top M component scores.

72 Galactose metabolic flow

73 Results: Experiment #1 A small network of 362 interactions. 2 conditions of the expression data: gal80 deletion vs. WT. 5 significant subnetworks were found, including 41 out of 77 significant genes.

74 Score and temperature vs. number of iterations Temperature cooling is geometric from 1 to 0. N = By the end of the run, each of the 5 subnetworks reaches a (local) maximum.

75 Evaluation of the subnetworks Z-score distribution with real data; z-score distribution with random data (scrambled node z-scores); z-score distribution of the top 5 active networks.

76 Experiment #2 The network consists of all known interactions: 7145 protein-protein interactions from BIND and 317 regulation interactions from TRANSFAC. Expression data includes 20 perturbations to genes in the galactose pathway. Results: 7 active subnetworks were found; the biggest consists of 340 genes. Repeating the annealing within that network generated 5 significant sub-subnetworks. All results were evaluated with methods similar to those we have seen.

77

78 Discussion

79 Cytoscape www.cytoscape.org

80 Summary: Theory of network motifs (definition, algorithm); application to the E. coli transcription network; the dynamic behavior of the motifs; finding active subnetworks (simulated annealing, 2 experiments).

81 References S. Shen-Orr, R. Milo, S. Mangan and U. Alon. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31:64-68 (2002). R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii and U. Alon. Network motifs: simple building blocks of complex networks. Science 298:824-827 (2002). T. Ideker, O. Ozier, B. Schwikowski and A. Siegel. Discovering regulatory and signaling circuits in molecular interaction networks. Bioinformatics 18:S233 (2002).

82 S. Mangan and U. Alon. Structure and function of the feed-forward loop network motif. PNAS 100:11980-11985 (2003). N. Kashtan, S. Itzkovitz, R. Milo and U. Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20:1746-175 (2004). S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi. Optimization by simulated annealing. Science 220:671-680 (1983).

83 Thank you

