1 – V2: Network Topologies (2. Lecture WS 2008/09, Bioinformatics III)
Content of today's lecture:
1 Some definitions on mathematical graphs
2 Dijkstra's algorithm
3 Albert-Barabási algorithm to construct the scale-free model
4 Analysis of domain connectivities
5 Models for network growth
- Random graphs: a classical field in graph theory, well studied analytically and numerically. Literature (heavy): Béla Bollobás, Modern Graph Theory; Random Graphs.
- Scale-free networks: quite new. Their properties have mostly been studied numerically and heuristically (so far).

2 – 1 Definitions on Mathematical Graphs
A graph G is an ordered pair of disjoint sets (V, E) such that E is a subset of the set V^(2) of unordered pairs of V. The set V is the set of vertices; E is the set of edges. V and E are always assumed finite here. A weighted graph has a real-valued weight assigned to each edge. A subgraph of a graph G is a graph whose vertex and edge sets are subsets of those of G.
(Left) An undirected graph consisting of 4 vertices (A, B, C, and D) and 5 edges (connections). This example could represent the results of a yeast two-hybrid experiment probing binary protein-protein interactions that gave positive results for the 5 interactions A-B, A-D, A-C, B-C, and C-D.
(Right) Almost the same system, but this time as a directed graph with arrows (arcs) instead of edges. This could, for example, visualize a gene regulatory network where a transcription factor A controls the expression of genes B, C, D, etc. Here, A, B, C, and D are the 4 vertices of the graph, and the five arcs are the directed edges of the graph.
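The two example graphs can be stored as adjacency structures; a minimal sketch in Python (the undirected edge list is the one given above; the directed arc set is an assumption for illustration, since the slide's figure is not reproduced here):

```python
# Undirected graph from the slide: vertices A-D, edges A-B, A-D, A-C, B-C, C-D.
undirected = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

# Directed illustration: transcription factor A regulating B, C, and D
# (hypothetical arc set; the slide's figure shows five arcs).
directed = {"A": {"B", "C", "D"}, "B": set(), "C": set(), "D": set()}

# In an undirected adjacency dict every edge appears twice, once per endpoint.
num_edges = sum(len(nbrs) for nbrs in undirected.values()) // 2
num_arcs = sum(len(nbrs) for nbrs in directed.values())
```

The dict-of-sets representation makes neighborhood queries O(1) on average and is the form used by the later sketches.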

3 – 1 Definitions
A path in a graph is a sequence of vertices such that from each of its vertices there is an edge to the successor vertex. The first vertex is called the start vertex and the last vertex is called the end vertex. Both of them are called end or terminal vertices of the path; the other vertices in the path are internal vertices. Two paths are independent (alternatively called internally vertex-disjoint) if they do not have any internal vertex in common.
In the figure, vertices A and D are connected by five paths (A → B → D, A → B → E → D, A → B → E → C → D, A → E → D, A → E → C → D). Only two of these paths are independent: A → B → D and either A → E → D or A → E → C → D.

4 – 1 Definitions
Given an undirected graph, two vertices u and v are called connected if there exists a path from u to v; otherwise they are called disconnected. The graph is called a connected graph if every pair of vertices in the graph is connected. A connected component is a maximal connected subgraph; maximal means here that it cannot be enlarged by adding further vertices or edges while remaining connected. The giant component is a network-theory term referring to a connected subgraph that contains a majority of the entire graph's vertices.
A walk is an alternating sequence of vertices and edges, beginning and ending with a vertex. The length l of a walk is the number of edges that it uses. A trail is a walk in which all the edges are distinct. A cycle denotes here a closed path with no repeated vertices other than the starting and ending vertex.
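Connected components follow directly from the definition: repeatedly breadth-first search from an unvisited vertex. A small sketch (the example graph is illustrative, not from the slides):

```python
from collections import deque

def connected_components(adj):
    """Return the connected components (as sets of vertices) of an
    undirected graph given as an adjacency dict {node: set(neighbors)}."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])   # BFS from a fresh vertex
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(comp)
    return components

# Two components: a triangle {1,2,3} and an isolated edge {4,5}.
g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
comps = connected_components(g)
```

The largest entry of the returned list is the candidate giant component.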

5 – 1 Shortest path problem
The shortest path problem is the problem of finding a path between two vertices such that the sum of the weights of its constituent edges is minimized. More formally, given a weighted graph (V, E) and two elements n, n' ∈ V, find a path P from n to n' so that the sum of the edge weights, Σ_{e ∈ P} w(e), is minimal among all paths connecting n to n'. The all-pairs shortest path problem is a similar problem in which such paths have to be found for every pair of vertices n, n'.
Finally, a tree is a graph in which any two vertices are connected by exactly one path. Alternatively, a tree may be defined as a connected graph with no cycles. A labeled tree is a tree in which each vertex is given a unique label.

6 – 2 Dijkstra's algorithm
Dijkstra's algorithm solves the shortest path problem for a directed graph with non-negative edge weights.
Input: a weighted directed graph G = (V, E) and a source vertex s in G. Each edge of the graph is an ordered pair of vertices (u, v) representing a connection from vertex u to vertex v. The weight w(u, v) is the non-negative cost of moving from vertex u to vertex v. The cost of an edge can be thought of as the distance between those two vertices; the cost of a path between two vertices is the sum of the costs of the edges in that path. For a given pair of vertices s and t in V, the algorithm finds the path from s to t with lowest cost (i.e., the shortest path).

7 – 2 Description of the algorithm
The algorithm works by keeping for each vertex v the cost d[v] of the shortest path found so far between s and v. Initially, this value is 0 for the source vertex s (d[s] = 0) and infinity for all other vertices (d[v] = ∞ for all v in V except s), representing the fact that we do not know any path leading to those vertices. When the algorithm finishes, d[v] will be the cost of the shortest path from s to v, or infinity if no such path exists.
The basic operation of Dijkstra's algorithm is edge relaxation: if there is an edge from u to v, then the shortest known path from s to u (of length d[u]) can be extended to a path from s to v by adding edge (u, v) at the end. This path will have length d[u] + w(u, v). If this is less than the current d[v], we can replace the current value of d[v] with the new value.

8 – 2 Description of the algorithm
Edge relaxation is applied until all values d[v] represent the cost of the shortest path from s to v. The algorithm is organized so that each edge (u, v) is relaxed only once, when d[u] has reached its final value.
The algorithm maintains two sets of vertices, S and Q. Set S contains all vertices for which we know that the value d[v] is already the cost of the shortest path; set Q contains all other vertices. Set S starts empty, and in each step one vertex is moved from Q to S: the vertex with the lowest value of d[u]. When a vertex u is moved to S, the algorithm relaxes every outgoing edge (u, v).

9 – 2 Pseudocode
In the following algorithm, u := Extract-Min(Q) searches for the vertex u in the vertex set Q that has the smallest d[u] value. That vertex is removed from the set Q and returned to the user. Q := Update(Q) updates the weight field of the current vertex in the vertex set Q.

 1  function Dijkstra(G, w, s)
 2     for each vertex v in V[G]              // Initialization
 3        do d[v] := infinity
 4           previous[v] := undefined
 5     d[s] := 0
 6     S := empty set
 7     Q := set of all vertices
 8     while Q is not an empty set
 9        do u := Extract-Min(Q)
10           S := S ∪ {u}
11           for each edge (u,v) outgoing from u
12              do if d[v] > d[u] + w(u,v)    // Relax (u,v)
13                 then d[v] := d[u] + w(u,v)
14                      previous[v] := u
15                      Q := Update(Q)

If we are only interested in a shortest path between vertices s and t, we can terminate the search at line 9 if u = t.

10 – 2 Pseudocode
Now we can read the shortest path from s to t by iteration:

1  S := empty sequence
2  u := t
3  while u is defined
4     do insert u at the beginning of S
5        u := previous[u]

Now sequence S is the list of vertices on the shortest path from s to t.
(Figure: first iteration, intermediate step, final iteration.)
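The two pseudocode fragments above translate directly into Python; a minimal sketch using the standard-library heap as the priority queue Q (the example graph and its weights are illustrative, not from the slides):

```python
import heapq

def dijkstra(graph, s, t):
    """Shortest path from s to t in a directed graph given as
    {u: {v: w(u, v), ...}, ...} with non-negative weights.
    Returns (cost, path); cost is float('inf') if t is unreachable."""
    d = {v: float("inf") for v in graph}
    previous = {v: None for v in graph}
    d[s] = 0
    pq = [(0, s)]                        # the heap plays the role of Q
    while pq:
        du, u = heapq.heappop(pq)        # u := Extract-Min(Q)
        if u == t:                       # early termination at the target
            break
        if du > d[u]:                    # stale heap entry, u already settled
            continue
        for v, w in graph[u].items():    # relax every outgoing edge (u, v)
            if d[v] > d[u] + w:
                d[v] = d[u] + w
                previous[v] = u
                heapq.heappush(pq, (d[v], v))
    if d[t] == float("inf"):
        return float("inf"), []
    path, u = [], t                      # read the path back via previous[]
    while u is not None:
        path.insert(0, u)
        u = previous[u]
    return d[t], path

# Hypothetical example graph: best route s -> a -> b -> t with cost 1+2+3 = 6.
g = {"s": {"a": 1, "b": 4}, "a": {"b": 2, "t": 6}, "b": {"t": 3}, "t": {}}
cost, path = dijkstra(g, "s", "t")
```

Lazy deletion (skipping stale heap entries) replaces the explicit Update(Q) of the pseudocode.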

11 – 2 Running time
The simplest implementation of Dijkstra's algorithm stores the n vertices of set Q in an ordinary linked list or array, and the operation Extract-Min(Q) is simply a linear search through all vertices in Q. In this case, the running time is O(n²). For sparse graphs, that is, graphs with far fewer than n² edges, Dijkstra's algorithm can be implemented more efficiently, e.g. with a binary-heap priority queue, giving O((n + m) log n) for m edges.
Description of Dijkstra's algorithm taken from www.wikipedia.org

12 – 3 Erdős–Rényi model of a random graph
n nodes (vertices) are joined by edges that have been chosen and placed between pairs of nodes uniformly at random. In G_{n,p}, each possible edge in the graph on n nodes is present with probability p and absent with probability 1 − p. Average number of edges in G_{n,p}: ⟨m⟩ = p · n(n − 1)/2. Each edge connects two vertices → average degree of a vertex: ⟨k⟩ = 2⟨m⟩/n = p(n − 1) ≈ pn.
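The G_{n,p} model and its expected edge count are easy to check empirically; a sketch (n and p are chosen arbitrarily for illustration):

```python
import random

def gnp(n, p, seed=0):
    """Sample an Erdős-Rényi graph G(n,p) as an edge list: each of the
    n(n-1)/2 possible edges is present independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

n, p = 200, 0.05
edges = gnp(n, p)
expected_m = p * n * (n - 1) / 2   # <m> = p * n(n-1)/2 = 995 here
avg_degree = 2 * len(edges) / n    # <k> = 2<m>/n = p(n-1), roughly pn
```

The sampled edge count fluctuates around ⟨m⟩ with standard deviation of order sqrt(⟨m⟩), so for these parameters it stays close to 995.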

13 – 3 Erdős–Rényi model: components
Erdős and Rényi studied how the expected topology of a random graph with n nodes changes as a function of the number of edges m. When m is small, the graph is likely fragmented into many small connected components with vertex sets of size at most O(log n). As m increases, the components grow, at first by linking to isolated nodes and later by fusing with other components. A transition happens at m = n/2, when many clusters cross-link spontaneously to form a unique largest component, called the giant component, whose vertex set is much larger than that of any other component: it contains O(n) nodes, while the second largest component contains O(log n) nodes. In statistical physics, this phenomenon is called percolation.
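The percolation transition at m = n/2 can be observed numerically; a sketch measuring the largest component of a graph with m uniformly random edges, using a simple union-find (the sizes n and m below are illustrative):

```python
import random
from collections import Counter

def largest_component(n, m, seed=0):
    """Size of the largest component of a graph with n nodes and m
    uniformly random edges, computed with union-find."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):                      # root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(m):                # insert m random edges
        u, v = rng.randrange(n), rng.randrange(n)
        parent[find(u)] = find(v)

    return max(Counter(find(x) for x in range(n)).values())

n = 2000
below = largest_component(n, n // 4)  # m << n/2: only small components
above = largest_component(n, n)       # m = n > n/2: a giant component
```

Below the transition the largest component stays logarithmically small; above it, it spans a finite fraction of all n nodes.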

14 – 3 Erdős–Rényi model: shortest path length
The shortest path length between any pair of nodes in the giant component grows like log n. Therefore, these graphs are called "small worlds". The properties of random graphs have been studied very extensively. Literature: B. Bollobás, Random Graphs, Academic, London, 1985, 2004.
However, random graphs are not adequate models for real-world networks, because
(1) real networks appear to have a power-law degree distribution (while random graphs have a Poisson distribution), and
(2) real networks show strong clustering, while the clustering coefficient of a random graph is C = p, independent of whether two vertices have a common neighbor.

15 – 3 Generalized Random Graphs
Aim: allow a power-law degree distribution in a graph while leaving all other aspects as in the random graph model. → Given a degree sequence (e.g. a power-law distribution), one can generate a random graph by assigning to each vertex i a degree k_i from the given degree sequence. Then choose pairs of vertices uniformly at random to make edges so that the assigned degrees are preserved. When all degrees have been used up to make edges, the resulting graph is a random member of the set of graphs with the desired degree distribution.
Problem: the method does not allow one to specify the clustering coefficient. On the other hand, this property makes it possible to determine many properties of these graphs exactly in the limit of large n. E.g., almost all random graphs with a fixed degree distribution and no nodes of degree smaller than 2 have a unique giant component.
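The construction described above is the configuration model; a sketch (the degree sequence is illustrative, and self-loops and multi-edges are kept for simplicity):

```python
import random
from collections import Counter

def configuration_model(degrees, seed=0):
    """Random (multi)graph with a prescribed degree sequence: list each
    vertex i as a 'stub' k_i times, shuffle, and pair consecutive stubs."""
    if sum(degrees) % 2:
        raise ValueError("degree sum must be even")
    rng = random.Random(seed)
    stubs = [i for i, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    return list(zip(stubs[::2], stubs[1::2]))

degrees = [3, 3, 2, 2, 1, 1]
edges = configuration_model(degrees)
realized = Counter()
for u, v in edges:        # a self-loop contributes 2 to its vertex's degree
    realized[u] += 1
    realized[v] += 1
```

Every stub is used exactly once, so the realized degrees match the prescribed sequence by construction; only the wiring between stubs is random.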

16 – 3 Barabási's construction algorithm for the scale-free model
Input: (n₀, m, t), where n₀ is the initial number of vertices, m (m ≤ n₀) is the number of edges added every time one new vertex is added to the graph, and t is the number of iterations.
Algorithm:
a) Start with n₀ isolated nodes.
b) Every time we add one new node v, m edges will be linked from v to the existing nodes with the preferential attachment probability Π(k_i) = k_i / Σ_j k_j, where k_i is the number of links (degree) of the i-th node.
Eventually, the graph will have (n₀ + t) nodes and (mt) edges.
Problem of "pure" mathematicians with this algorithm: how to start from n₀ = 0?
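A sketch of this growth process in Python. The start-up issue mentioned above is side-stepped here by attaching the very first edges uniformly at random (an assumption, since the initial nodes have degree zero and Π is undefined for them); parameters are illustrative:

```python
import random

def barabasi_albert(n0, m, t, seed=0):
    """Grow a graph by preferential attachment: start from n0 isolated
    nodes, add t nodes, each connecting m edges to distinct existing
    nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    edges = []
    # One list entry per edge endpoint: a uniform draw from this list is
    # a draw proportional to degree, i.e. P(i) = k_i / sum_j k_j.
    endpoints = []
    for step in range(t):
        v = n0 + step
        targets = set()
        while len(targets) < m:
            if endpoints:
                targets.add(rng.choice(endpoints))
            else:
                targets.add(rng.randrange(n0))  # no edges yet: uniform
        for u in targets:
            edges.append((v, u))
            endpoints += [v, u]
    return edges

edges = barabasi_albert(n0=3, m=2, t=100)
```

The resulting graph has n₀ + t nodes and mt edges, matching the count stated above; early nodes accumulate high degree and become hubs.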

17 – 3 Properties of the Barabási–Albert scale-free model
P(k) ∝ k^(−γ) with γ = 3. Real networks often show γ ≈ 2.1 – 2.4.
Observation: if either growth or preferential attachment is eliminated, the resulting network does not exhibit scale-free properties.
The average path length in the BA model is proportional to ln n / ln ln n, which is shorter than in random graphs → scale-free networks are ultrasmall worlds.
Observation: non-trivial correlations (clustering) between the degrees of connected nodes. Numerical result for the BA model: C ∼ n^(−0.75). No analytical predictions of C so far.

18 – 3 Properties of scale-free models
Scale-free networks are resistant to random failures ("robustness") because a few high-degree hubs dominate their topology: a randomly failing node probably has a small degree and thus does not severely affect the rest of the network. However, scale-free networks are quite vulnerable to deliberate attacks on the hubs. See the example from the last lecture about the lethality of gene deletions in yeast. These properties have been confirmed numerically and analytically by studying the average path length and the size of the giant component.

19 – 3 Properties of the Barabási–Albert scale-free model
The BA model is a minimal model that captures the mechanisms responsible for the power-law degree distribution observed in real networks. A discrepancy is the fixed exponent of the predicted power-law distribution (γ = 3). → Does the BA model describe the "true" biological evolution of networks?
Recent efforts:
- study variants with cleaner mathematical properties (Bollobás, LCD model);
- include the effects of adding or re-wiring edges, allow nodes to age so that they can no longer accept new edges, or vary the form of preferential attachment.
These models also predict exponential and truncated power-law degree distributions in some parameter regimes.

20 – 4 Scale-free behavior in protein domain networks
'Domains' are fundamental units of protein structure. Most proteins contain only a single domain, but some sequences appear as multidomain proteins: on average these have 2-3 domains, but they can have up to 130 domains! Most new sequences show homologies to parts of known protein sequences → most proteins may have descended from relatively few ancestral types. Sequences of large proteins often seem to have evolved by joining preexisting domains in new combinations ("domain shuffling"): domain duplication or domain insertion.
Wuchty, Mol. Biol. Evol. 18, 1694 (2001)

21 – 4 Protein domain databases
Prosite (http://expasy.proteome.org.au/prosite/) contains 1360 biologically significant motifs and profiles.
(Figure: distribution P(number of links to other domains) versus number of links to other domains.)
Wuchty, Mol. Biol. Evol. 18, 1694 (2001)

22 – 4 Which are the highly connected domains?
The majority of highly connected InterPro domains appear in signalling pathways. The table lists the 10 best-linked domains in various species. From left to right, the number of links increases: the number of signalling domains (PH, SH3), their ligands (proline-rich extensions), and receptors (GPCR/rhodopsin) increases. → The evolutionary trend toward compartmentalization of the cell and multicellularity demands a higher degree of organization.
Wuchty, Mol. Biol. Evol. 18, 1694 (2001)

23 – 4 Evolutionary aspects
The BA model of scale-free networks is constructed by preferential attachment of newly added vertices to already well-connected ones. → Fell and Wagner (2000) argued that vertices with many connections in the metabolic network were metabolites originating very early in the course of evolution, where they shaped a core metabolism. → Analogously, highly connected domains could also have originated very early. Is this true?
No. The majority of highly connected domains in Methanococcus and in E. coli are concerned with the maintenance of metabolism. None of the highly connected domains of higher organisms is found here. On the other hand, helicase C has roughly similar degrees of connection in all organisms.
Wuchty, Mol. Biol. Evol. 18, 1694 (2001)

24 – 5 Network growth mechanisms
How can we know the "true" growth mechanism of real biological networks?
Question 1: Is it important to know this? Yes.
Question 2: What measure do we use to distinguish networks produced by different growth mechanisms? → Look at the fine structure (motifs) of biological networks.

25 – 5 Analysis of the Drosophila melanogaster protein interaction network
Data set: the protein-protein interaction map for Drosophila by Giot et al. Problem: the data set is subject to numerous false positives. Giot et al. assign a confidence score p ∈ [0,1] to each interaction, measuring how likely the interaction is to occur in vivo. What threshold p* should be used? Measure the size of the components for all possible values of p*. Observation: for p* = 0.65, the two largest components become connected → use this value as the threshold. Edges in the graph correspond to interactions with p > p*. Removing self-interactions and isolated vertices leaves 3359 (4625) nodes with 2795 (4683) edges for p* = 0.65 (0.5).
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

26 – 5 Network evolution models considered
Duplication-mutation-complementation (DMC) algorithm: based on the model which proposes that most of the duplicate genes observed today have been preserved by functional complementation: if either the gene or its copy loses one of its functions (edges), the other becomes essential in assuring the organism's survival.
Algorithm: a duplication step is followed by mutations that preserve functional complementarity. At every time step, choose a node v at random. A twin vertex v_twin is introduced, copying all of v's edges. For each edge of v, delete with probability q_del either the original edge or the corresponding edge of v_twin. Conjoin the twins themselves with independent probability q_con, representing an interaction of a protein with its own copy. No edges are created by mutations → the DMC algorithm assumes that the probability of creating new advantageous functions by random mutations is negligible.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)
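A sketch of one DMC growth step as described above. The seed graph and parameter values are illustrative, and the choice between deleting the original edge or the twin's copy is assumed to be a fair coin flip (the slide leaves this detail open):

```python
import random

def dmc_step(adj, q_del, q_con, rng):
    """One duplication-mutation-complementation step on an undirected
    graph stored as {node: set(neighbors)} with integer node labels."""
    v = rng.choice(sorted(adj))          # pick a node to duplicate
    twin = max(adj) + 1
    adj[twin] = set(adj[v])              # twin copies all of v's edges
    for u in adj[v]:
        adj[u].add(twin)
    for u in list(adj[v]):               # complementation: per edge of v,
        if rng.random() < q_del:         # delete one of the two copies
            if rng.random() < 0.5:       # assumed fair coin (see lead-in)
                adj[v].discard(u); adj[u].discard(v)
            else:
                adj[twin].discard(u); adj[u].discard(twin)
    if rng.random() < q_con:             # conjoin v and its twin
        adj[v].add(twin); adj[twin].add(v)
    return adj

rng = random.Random(0)
adj = {0: {1}, 1: {0}}                   # seed graph: a single edge
for _ in range(50):
    dmc_step(adj, q_del=0.4, q_con=0.1, rng=rng)
n_nodes = len(adj)
```

Each step adds exactly one node and never creates an edge by mutation, matching the assumption stated above.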

27 – 5 Network evolution models considered
Variant of DMC, the duplication-random-mutation (DMR) algorithm: possible interactions between twins are neglected. Instead, edges between v_twin and the neighbors of v can be removed with probability q_del, and new edges can be created at random between v_twin and any other vertices with probability q_new/N, where N is the current total number of vertices. DMR emphasizes the creation of new advantageous functions by mutation.
Other models:
- linear preferential attachment (LPA) (Barabási);
- random static networks (RDS) (Erdős–Rényi);
- random growing networks (RDG): growing graphs where new edges are created randomly between existing nodes;
- aging vertex networks (AGV): growing graphs modeling citation networks, where the probability of new edges decreases with the age of the vertex;
- small-world networks (SMW): interpolation between regular ring lattices and randomly connected graphs.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)
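The DMR variant replaces complementation with random edge removal and creation; a sketch in the same graph representation as the DMC description above (seed graph and parameters are illustrative):

```python
import random

def dmr_step(adj, q_del, q_new, rng):
    """One duplication-random-mutation step on an undirected graph
    stored as {node: set(neighbors)} with integer node labels."""
    v = rng.choice(sorted(adj))
    twin = max(adj) + 1
    adj[twin] = set(adj[v])              # twin copies all of v's edges
    for u in adj[v]:
        adj[u].add(twin)
    for u in list(adj[twin]):            # remove inherited twin edges
        if rng.random() < q_del:
            adj[twin].discard(u); adj[u].discard(twin)
    n = len(adj)                         # current number of vertices N
    for u in list(adj):                  # new random edges from the twin
        if u != twin and rng.random() < q_new / n:
            adj[twin].add(u); adj[u].add(twin)
    return adj

rng = random.Random(0)
adj = {0: {1}, 1: {0}}                   # seed graph: a single edge
for _ in range(50):
    dmr_step(adj, q_del=0.4, q_new=0.2, rng=rng)
n_nodes = len(adj)
```

Unlike DMC, this mechanism can create genuinely new edges, which is why it rarely produces the abundant 3-cycles characteristic of DMC graphs.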

28 – 5 Training set
Create 1000 graphs as training data for each of the seven different models. Every graph is generated with the same number of edges and nodes as measured in Drosophila. Quantify the topology of a network by counting all possible subgraphs up to a given cut-off, which could be the number of nodes, the number of edges, or the length of a given walk. Here: count all subgraphs that can be constructed by a walk of length 8 (148 non-isomorphic subgraphs) or length 7 (130 non-isomorphic subgraphs). Use these counts as input features for the classifier. Note that the average shortest path between two nodes of the Drosophila network's giant component is 11.6 (9.4) for p* = 0.65 (0.5). → Walks of length 8 can traverse large parts of the network.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

29 – 5 Visualization of subgraphs
A qualitative and more intuitive way of interpreting the classification result is to visualize the subgraph profiles. A representative subset of 50 of the 148 subgraphs is shown (the subgraphs associated with Figures 3 and 1).
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

30 – 5 Learning algorithm: Alternating Decision Tree
Rectangles: decision nodes. A given network's subgraph counts determine paths in the tree, dictated by the inequalities specified at the decision nodes. For each class, the Alternating Decision Tree outputs a real-valued prediction score, which is the sum of the weights over all paths. The class with the highest score wins.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

31 – 5 Performance on the training set
Can the decision tree separate the graphs generated by the different growth mechanisms? The confusion matrix shows truth and prediction for the test sets. 5 out of the 7 mechanisms have nearly perfect prediction accuracy. AGV is constructed as an interpolation between LPA and a ring lattice → the AGV, LPA, and SMW mechanisms are equivalent in specific parameter regimes and show a non-negligible overlap.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

32 – 5 Task: discriminate different growth mechanisms
Ten graphs each from two different mechanisms exhibit similar average geodesic lengths and almost identical degree distributions and clustering coefficients.
(a) Cumulative degree distribution p(k > k₀), average clustering coefficient, and average geodesic length, all quantities averaged over a set of 10 graphs. → Global topology descriptors cannot separate the growth mechanisms.
(b) Prediction score for all ten graphs and all five cross-validated ADTs. The two sets of graphs can now be perfectly separated by the classifier.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

33 – 5 Learning algorithm: Alternating Decision Tree
The figure shows the first few decision nodes (out of 120) of a resulting ADT. The prediction scores reveal that a high count of 3-cycles suggests a DMC network. The DMC mechanism indeed facilitates the creation of many 3-cycles by allowing the two copies to attach to each other, thus creating 3-cycles with their common neighbors. A low count of 3-cycles but a high count of 8-edge linear chains is a good predictor for LPA and DMR networks.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)
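The 3-cycle (triangle) count used by the decision node above is one of the simplest motif features; a sketch for an undirected adjacency dict (the example graph is illustrative):

```python
def count_triangles(adj):
    """Count 3-cycles in an undirected graph {node: set(neighbors)}
    with comparable node labels. Each unordered triangle counts once."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue                 # consider each edge u < v once
            # any common neighbor w > v closes a triangle u-v-w
            count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count

# Two triangles sharing the edge 1-2: {0,1,2} and {1,2,3}.
g = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
tri = count_triangles(g)
```

The ordering trick (u < v < w) ensures every triangle is counted exactly once rather than six times, once per ordered traversal.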

34 – 5 Subgraph profiles
The average subgraph count of the training data for every mechanism is shown for the 50 representative subgraphs S1-S50. Black lines indicate the model closest to Drosophila, based on the absolute difference between the subgraph counts. For 60% of the subgraphs (S1-S30), the counts for Drosophila are closest to the DMC model. All of these subgraphs contain one or more cycles, including highly connected subgraphs (S1) and long linear chains ending in cycles (S16, S18, S22, S23, S25). The DMC algorithm is the only mechanism that produces such cycles with a high occurrence.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

35 – 5 Robustness against noise
Edges in the Drosophila network are randomly replaced and the network is reclassified. Plotted are the prediction scores for each of the 7 classes as more and more edges are replaced; every point is an average over 200 independent random replacements. For high noise levels (beyond 80%), the network is classified as an Erdős–Rényi (RDS) graph. For low noise (< 30%), the confidence in the classification as a DMC network is even higher than the confidence in the classification as an RDS network at high noise. The prediction score y(c) for class c maps monotonically onto the estimated probability p(c) that the tested network belongs to class c.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

36 – Conclusions
A very nice (!) method that allows one to infer growth mechanisms of real networks. The method is robust against noise and data subsampling, and no prior assumptions about network features/topology are required. The learning algorithm does not assume any relationships between features (e.g. orthogonality); therefore the input space can be augmented with various features in addition to subgraph counts. The protein interaction network of Drosophila is confidently classified as a DMC network. However, further growth mechanisms need to be explored in the future; input from evolutionary biology is needed. Here, we mostly concentrated on the technique of characterizing the resulting network topologies.
Middendorf et al., arXiv:q-bio.QM/0408010 (2004)

