2. Lecture WS 2006/07, Bioinformatics III. V2 – network topologies


1 V2 – network topologies
- Random graphs: a classical field in graph theory, well studied analytically and numerically. Literature (heavy): Béla Bollobás, Modern Graph Theory; Random Graphs.
- Scale-free networks: quite new. Their properties have so far been studied mostly numerically and heuristically.
- Evolution of domain linkage networks.
- Classification of network topologies.

2 Definitions
A graph G is an ordered pair of disjoint sets (V, E) such that E is a subset of the set $V^{(2)}$ of unordered pairs of V. V and E are always assumed to be finite. The set V is the set of vertices; E is the set of edges.
A weighted graph has a real-valued weight assigned to each edge.
A subgraph of a graph G is a graph whose vertex and edge sets are subsets of those of G.
Given an undirected graph, two vertices u and v are called connected if there exists a path from u to v. Otherwise they are called disconnected. The graph is called a connected graph if every pair of vertices in the graph is connected.
The giant component is a network theory term referring to a connected subgraph that contains a majority of the entire graph's vertices.

3 Definitions
A path in a graph is a sequence of vertices such that from each of its vertices there is an edge to the successor vertex. The first vertex is called the start vertex and the last vertex is called the end vertex. Both of them are called end or terminal vertices of the path. The other vertices in the path are internal vertices.
Two paths are independent (alternatively, internally vertex-disjoint) if they do not have any internal vertex in common.

4 Shortest path problem
The shortest path problem is the problem of finding a path between two vertices such that the sum of the weights of its constituent edges is minimized. More formally, given a weighted graph (V, E) with edge weights w(e) and two elements n, n' ∈ V, find a path P from n to n' so that $\sum_{e \in P} w(e)$ is minimal among all paths connecting n to n'.
The all-pairs shortest path problem is a similar problem, in which we have to find such paths for every pair of vertices n and n'.

5 Dijkstra's algorithm
Dijkstra's algorithm solves the shortest path problem for a directed graph with non-negative edge weights.
Input: a weighted directed graph G = (V, E) and a source vertex s in G. Each edge of the graph is an ordered pair of vertices (u,v) representing a connection from vertex u to vertex v. The weight w(u,v) is the non-negative cost of moving from vertex u to vertex v. The cost of an edge can be thought of as (a generalization of) the distance between those two vertices. The cost of a path between two vertices is the sum of the costs of the edges in that path.
For a given pair of vertices s and t in V, the algorithm finds the path from s to t with the lowest cost (i.e. the shortest path). It can also be used for finding the costs of the shortest paths from a single vertex s to all other vertices in the graph.

6 Description of the algorithm
The algorithm works by keeping for each vertex v the cost d[v] of the shortest path found so far between s and v. Initially, this value is 0 for the source vertex s (d[s] = 0) and infinity for all other vertices, representing the fact that we do not know any path leading to those vertices (d[v] = ∞ for all v in V except s). When the algorithm finishes, d[v] will be the cost of the shortest path from s to v, or infinity if no such path exists.
The basic operation of Dijkstra's algorithm is edge relaxation: if there is an edge from u to v, then the shortest known path from s to u (with cost d[u]) can be extended to a path from s to v by adding edge (u,v) at the end. This path will have length d[u] + w(u,v). If this is less than the current d[v], we can replace the current value of d[v] with the new value.

7 Description of the algorithm
Edge relaxation is applied until all values d[v] represent the cost of the shortest path from s to v. The algorithm is organized so that each edge (u,v) is relaxed only once, when d[u] has reached its final value.
The algorithm maintains two sets of vertices, S and Q. Set S contains all vertices for which we know that the value d[v] is already the cost of the shortest path, and set Q contains all other vertices. Set S starts empty, and in each step one vertex is moved from Q to S. This vertex is chosen as the vertex in Q with the lowest value of d[u]. When a vertex u is moved to S, the algorithm relaxes every outgoing edge (u,v).

8 Pseudocode
In the following algorithm, u := Extract-Min(Q) searches for the vertex u in the vertex set Q that has the smallest d[u] value. That vertex is removed from the set Q and returned. Q := Update(Q) updates the weight field of the current vertex in the vertex set Q.
 1 function Dijkstra(G, w, s)
 2   for each vertex v in V[G]              // Initialization
 3     do d[v] := infinity
 4        previous[v] := undefined
 5   d[s] := 0
 6   S := empty set
 7   Q := set of all vertices
 8   while Q is not an empty set
 9     do u := Extract-Min(Q)
10        S := S U {u}
11        for each edge (u,v) outgoing from u
12          do if d[v] > d[u] + w(u,v)       // Relax (u,v)
13               then d[v] := d[u] + w(u,v)
14                    previous[v] := u
15                    Q := Update(Q)
If we are only interested in a shortest path between vertices s and t, we can terminate the search at line 9 if u = t.

9 Pseudocode
Now we can read the shortest path from s to t by iteration:
  S := empty sequence
  u := t
  while defined u
    do insert u to the beginning of S
       u := previous[u]
Now sequence S is the list of vertices on the shortest path from s to t.
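The two pseudocode fragments above can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code: the dictionary-of-dictionaries graph layout, the function names and the small example graph are assumptions made here, and Extract-Min is done by a plain linear search as in the simplest implementation.

```python
import math

def dijkstra(graph, source):
    """graph: dict mapping each vertex to a dict {neighbor: non-negative weight}.
    Every vertex, including pure sinks, must appear as a key (assumed layout)."""
    d = {v: math.inf for v in graph}
    previous = {v: None for v in graph}
    d[source] = 0
    Q = set(graph)                       # vertices whose distance is not final yet
    while Q:
        u = min(Q, key=lambda v: d[v])   # Extract-Min by linear search
        Q.remove(u)
        for v, w in graph[u].items():    # relax every outgoing edge (u, v)
            if d[v] > d[u] + w:
                d[v] = d[u] + w
                previous[v] = u
    return d, previous

def shortest_path(previous, source, target):
    """Follow previous[] back from target; returns [] if target is unreachable."""
    path = []
    u = target
    while u is not None:
        path.insert(0, u)
        u = previous[u]
    return path if path and path[0] == source else []

# Hypothetical example graph, not taken from the lecture.
example = {
    "s": {"a": 1, "b": 4},
    "a": {"b": 2, "t": 6},
    "b": {"t": 3},
    "t": {},
    "x": {},                             # unreachable vertex
}
dist, prev = dijkstra(example, "s")
path_to_t = shortest_path(prev, "s", "t")
```

In this sketch an unreachable vertex simply keeps the distance infinity, matching the description of d[v] above.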

10 Running time
The simplest implementation of Dijkstra's algorithm stores the n vertices of set Q in an ordinary linked list or array, and the operation Extract-Min(Q) is simply a linear search through all vertices in Q. In this case, the running time is $O(n^2)$.
For sparse graphs, that is, graphs with far fewer than $n^2$ edges, Dijkstra's algorithm can be implemented more efficiently.
Description of Dijkstra's algorithm taken from www.wikipedia.org
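One common way to get the more efficient behavior on sparse graphs is to keep the candidate vertices in a binary heap, which gives a running time of O((n + m) log n) for n vertices and m edges. The sketch below uses Python's standard `heapq` module; the graph layout (dict of dicts) and the function name are assumptions for illustration, and instead of an Update(Q) operation it pushes a fresh entry and skips stale ones.

```python
import heapq
import math

def dijkstra_heap(graph, source):
    """graph: dict mapping each vertex to a dict {neighbor: non-negative weight}."""
    d = {v: math.inf for v in graph}
    d[source] = 0
    heap = [(0, source)]                 # entries are (distance, vertex)
    done = set()                         # vertices whose distance is final
    while heap:
        dist_u, u = heapq.heappop(heap)
        if u in done:
            continue                     # stale entry left over from an earlier push
        done.add(u)
        for v, w in graph[u].items():
            if dist_u + w < d[v]:
                d[v] = dist_u + w
                heapq.heappush(heap, (d[v], v))
    return d

# Hypothetical example graph.
example = {"s": {"a": 1, "b": 4}, "a": {"b": 2, "t": 6}, "b": {"t": 3}, "t": {}}
distances = dijkstra_heap(example, "s")
```

Because heap entries are tuples, vertices with equal distance are compared directly, so the vertex labels in this sketch need to be mutually comparable (all strings or all integers).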

11 Erdős-Rényi model of a random graph
n nodes (vertices) are joined by edges that have been chosen and placed between pairs of nodes uniformly at random.
$G_{n,p}$: each possible edge in the graph on n nodes is present with probability p and absent with probability 1 - p.
Average number of edges in $G_{n,p}$: $\frac{n(n-1)}{2}\,p$
Each edge connects two vertices → average degree of a vertex: $\frac{n(n-1)\,p}{n} = (n-1)\,p \approx n p$
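As a quick numerical check of the two averages, a $G_{n,p}$ graph can be sampled by flipping one biased coin per possible edge. This is only a sketch; the function name `gnp` and the chosen values of n, p and the seed are assumptions for illustration.

```python
import itertools
import random

def gnp(n, p, seed=None):
    """Sample G(n, p): each of the n(n-1)/2 possible edges is present with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in itertools.combinations(range(n), 2)
            if rng.random() < p]

n, p = 200, 0.1
edges = gnp(n, p, seed=1)
mean_degree = 2 * len(edges) / n          # every edge contributes to two degrees
expected_edges = p * n * (n - 1) / 2      # 1990 for these values
expected_degree = p * (n - 1)             # 19.9 for these values
```

For a single sample the measured values fluctuate around the expected ones; averaging over several seeds tightens the agreement.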

12 Erdős-Rényi model: components
Erdős and Rényi studied how the expected topology of a random graph with n nodes changes as a function of the number of edges m.
When m is small, the graph is likely fragmented into many small connected components having vertex sets of size at most O(log n). As m increases, the components grow at first by linking to isolated nodes, and later by fusing with other components.
A transition happens at m = n/2, when many clusters cross-link spontaneously to form a unique largest component called the giant component. Its vertex set size is much larger than the vertex set sizes of any other components. It contains O(n) nodes, while the second largest component contains O(log n) nodes.
In statistical physics, this phenomenon is called percolation.
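The transition around m = n/2 can be observed with a small simulation that adds m random edges to n isolated nodes and tracks component sizes with a union-find structure. This is a sketch under simplifying assumptions: edges are drawn with replacement, so repeated edges and self-loops are possible, and the function name and the values of n and m are chosen here for illustration.

```python
import random

def largest_component(n, m, seed=None):
    """Size of the largest component after adding m random edges to n isolated nodes."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    for _ in range(m):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:                            # merge the smaller tree into the larger
            if size[a] < size[b]:
                a, b = b, a
            parent[b] = a
            size[a] += size[b]
    return max(size[find(x)] for x in range(n))

n = 2000
# Fraction of nodes in the largest component below, at, and above m = n/2.
fractions = {m: largest_component(n, m, seed=0) / n
             for m in (n // 4, n // 2, n, 3 * n)}
```

Well below n/2 the largest component stays a tiny fraction of the graph, while well above it the largest component covers most of the nodes.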

13 Erdős-Rényi model: shortest path length
The shortest path length between any pair of nodes in the giant component grows like log n. Therefore, these graphs are called "small worlds".
The properties of random graphs have been studied very extensively. Literature: B. Bollobás, Random Graphs. Academic, London, 1985, 2004.
However, random graphs are not adequate models for real-world networks because
(1) real networks appear to have a power-law degree distribution (while random graphs have a Poisson distribution), and
(2) real networks show strong clustering, while the clustering coefficient of a random graph is only C = p, because two vertices are linked with probability p regardless of whether they have a common neighbor.
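For reference, the Poisson statement in point (1) follows from the $G_{n,p}$ definition: each vertex has n - 1 possible neighbors, each present independently with probability p, so the degree is binomially distributed and approaches a Poisson distribution for large n at fixed mean degree. The symbol $\lambda$ for the mean degree is notation introduced here, not in the slides.

```latex
P(k) \;=\; \binom{n-1}{k}\, p^{k} (1-p)^{\,n-1-k}
\;\approx\; \frac{\lambda^{k} e^{-\lambda}}{k!},
\qquad \lambda = p\,(n-1)
```

With the mean degree held fixed, the clustering coefficient $C = p = \lambda/(n-1)$ shrinks toward zero as n grows, which is the contrast with real networks drawn in point (2).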

14 Generalized Random Graphs
Aim: allow a power-law degree distribution in a graph while leaving all other aspects as in the random graph model.
→ Given a degree sequence (e.g. drawn from a power-law distribution), one can generate a random graph by assigning to each vertex i a degree $k_i$ from the given degree sequence. Then choose pairs of vertices uniformly at random to make edges so that the assigned degrees are preserved. When all degrees have been used up to make edges, the resulting graph is a random member of the set of graphs with the desired degree distribution.
Problem: the method does not allow the clustering coefficient to be specified. On the other hand, this property makes it possible to determine many properties of these graphs exactly in the limit of large n. E.g. almost all random graphs with a fixed degree distribution and no nodes of degree smaller than 2 have a unique giant component.
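One standard way to realize this construction is stub matching: give vertex i exactly $k_i$ half-edges ("stubs"), shuffle all stubs, and pair them up. The sketch below is an assumption about how the procedure could be coded, not the lecture's implementation; it allows self-loops and repeated edges, which plain stub matching can produce, and the example degree sequence is hypothetical.

```python
import random

def configuration_model(degrees, seed=None):
    """Pair up stubs uniformly at random; returns a list of (u, v) edges.
    Self-loops and multi-edges may occur in this simple variant."""
    if sum(degrees) % 2:
        raise ValueError("the degree sum must be even")
    rng = random.Random(seed)
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))

degrees = [3, 3, 2, 2, 1, 1]              # hypothetical degree sequence
edges = configuration_model(degrees, seed=42)

# Check that every vertex ends up with its assigned degree
# (a self-loop counts twice for its vertex).
realized = [0] * len(degrees)
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
```

If a simple graph is required, a frequent workaround is to reject and resample any pairing that contains a self-loop or a duplicate edge.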

15 Barabási's construction algorithm for the scale-free model
Input: ($n_0$, m, t), where $n_0$ is the initial number of vertices, m ($m \le n_0$) is the number of edges added every time one new vertex is added to the graph, and t is the number of iterations.
Algorithm:
a) Start with $n_0$ isolated nodes.
b) Every time we add one new node v, m edges are linked from v to the existing nodes with the preferential attachment probability $\Pi(k_i) = k_i / \sum_j k_j$, where $k_i$ is the number of links (degree) of the i-th node.
Eventually, the graph will have ($n_0$ + t) nodes and (m t) edges.
Problem of "pure" mathematicians with this algorithm: how to start from $n_0$ = 0?
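A minimal sketch of this construction is given below. One detail has to be assumed because the slide leaves it open: with isolated initial nodes all degrees are zero, so the attachment probability is undefined in the first step; here the first new node simply picks its m targets uniformly among the initial nodes. The function name and the endpoint-list trick for degree-proportional sampling are also choices made for this illustration.

```python
import random

def barabasi_albert(n0, m, t, seed=None):
    """Grow a graph for t steps; each new node attaches to m distinct existing nodes."""
    if not 1 <= m <= n0:
        raise ValueError("need 1 <= m <= n0")
    rng = random.Random(seed)
    edges = []
    endpoints = []        # each node appears once per incident edge, so a uniform
                          # draw from this list is a draw proportional to degree
    for step in range(t):
        new = n0 + step
        targets = set()
        while len(targets) < m:
            if endpoints:
                targets.add(rng.choice(endpoints))
            else:                          # assumption: uniform choice while all degrees are 0
                targets.add(rng.randrange(n0))
        for v in targets:
            edges.append((new, v))
            endpoints.extend((new, v))
    return edges

edges = barabasi_albert(n0=3, m=2, t=50, seed=0)
```

Under pure preferential attachment, an initial node that is not chosen in the first step keeps degree zero forever, which is one reason variants of the starting condition are used in practice.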

16 Properties of the Barabási-Albert scale-free model
$P(k) \sim k^{-\gamma}$ with $\gamma = 3$. Real networks often show $\gamma \approx 2.1$–$2.4$.
Observation: if either growth or preferential attachment is eliminated, the resulting network does not exhibit scale-free properties.
The average path length in the BA-model is proportional to ln n / ln ln n, which is shorter than in random graphs → scale-free networks are ultrasmall worlds.
Observation: non-trivial correlations (= clustering) between the degrees of connected nodes.
Numerical result for the BA-model: $C \sim n^{-0.75}$. No analytical predictions of C so far.

17 Properties of scale-free models
Scale-free networks are resistant to random failures ("robustness") because a few high-degree hubs dominate their topology; a randomly failing node probably has a small degree and thus does not severely affect the rest of the network.
However, scale-free networks are quite vulnerable to attacks on the hubs. See the example from the last lecture about the lethality of gene deletions in yeast.
These properties have been confirmed numerically and analytically by studying the average path length and the size of the giant component.
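The difference between random failure and targeted attack can be illustrated by removing nodes and measuring the largest remaining component. The sketch below is a toy illustration under stated assumptions: the helper names are invented here, and a star graph stands in as an extreme caricature of a hub-dominated topology rather than a real scale-free network.

```python
from collections import deque

def largest_component_after_removal(adj, removed):
    """adj: dict node -> set of neighbors. Returns the size of the largest
    connected component once the nodes in `removed` are deleted."""
    seen = set(removed)                  # removed nodes are never visited
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:                     # breadth-first search over surviving nodes
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best

def hubs_first(adj):
    """Attack order: nodes sorted by decreasing degree."""
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)

# Toy example: one hub (node 0) connected to ten leaves.
star = {0: set(range(1, 11))}
star.update({leaf: {0} for leaf in range(1, 11)})

after_leaf_failure = largest_component_after_removal(star, [5])
after_hub_attack = largest_component_after_removal(star, hubs_first(star)[:1])
```

Losing a leaf barely changes the largest component, while removing the hub shatters the graph into isolated nodes.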

18 Properties of the Barabási-Albert scale-free model
The BA-model is a minimal model that captures the mechanisms responsible for the power-law degree distribution observed in real networks.
A discrepancy is the fixed exponent of the predicted power-law distribution ($\gamma = 3$).
→ Does the BA-model describe the "true" biological evolution of networks?
Recent efforts:
- study variants with cleaner mathematical properties (Bollobás, LCD-model)
- include effects of adding or re-wiring edges, allow nodes to age so that they can no longer accept new edges, or vary the forms of preferential attachment.
These models also predict exponential and truncated power-law degree distributions in some parameter regimes.

19 2 Scale-free behavior in protein domain networks
"Domains" are fundamental units of protein structure. Most proteins contain only a single domain. Some sequences appear as multidomain proteins. On average, they have 2-3 domains, but they can have up to 130 domains!
Most new sequences show homologies to parts of known protein sequences → most proteins may have descended from relatively few ancestral types.
Sequences of large proteins often seem to have evolved by joining preexisting domains in new combinations ("domain shuffling"): domain duplication or domain insertion.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)

20 Protein domain database SMART
http://smart.embl-heidelberg.de/ contains (in 2001):
- 153 signalling domains
- 176 nuclear domains, e.g. HLH domains
- 225 extracellular domains
- 115 "other" domains
Wuchty Mol. Biol. Evol. 18, 1694 (2001)

21 Protein Domain databases
Prosite (http://expasy.proteome.org.au/prosite/) contains 1400 biologically significant motifs and profiles.
Pfam (http://www.sanger.ac.uk/Software/Pfam/index.shtml): collection of multiple-sequence alignments of protein families and profile HMMs. Curated documentation on 2500 families.
ProDom (http://www.toulouse.inra.fr/prodom.html): contains all 160,000 protein domain families that can be automatically generated from the SwissProt and TrEMBL databases. Here, only families with ≥ 10 members are considered → 6000 ProDom families.
InterPro Proteome Analysis of 41 nonredundant proteomes of genomes of archaea, bacteria, and eukaryotes (http://www.ebi.ac.uk/proteome) yields domains which appear along with other domains in a protein sequence → domains are vertices, and co-appearance in a protein sequence means an edge.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)
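The graph construction described for the InterPro data (domains as vertices, co-appearance in one protein sequence as an edge) can be sketched as follows. The protein identifiers and domain assignments in the example are hypothetical placeholders, and the function name is an assumption made for this illustration.

```python
from collections import defaultdict
from itertools import combinations

def domain_network(proteins):
    """proteins: dict protein_id -> list of domain names found in that sequence.
    Returns dict domain -> set of domains it co-occurs with."""
    adj = defaultdict(set)
    for domains in proteins.values():
        # every pair of distinct domains in the same protein gets an edge
        for a, b in combinations(sorted(set(domains)), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Hypothetical input data, for illustration only.
proteins = {
    "protein_1": ["SH3", "SH2", "kinase"],
    "protein_2": ["PH", "SH3"],
    "protein_3": ["kinase"],            # single-domain protein: contributes no edge
}
network = domain_network(proteins)
degree = {domain: len(neighbors) for domain, neighbors in network.items()}
```

The resulting degree of a domain is the "number of links to other domains" whose distribution is examined in the analysis.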

22 Protein Domain databases
Prosite (http://expasy.proteome.org.au/prosite/) contains 1360 biologically significant motifs and profiles.
[Figure: P(number of links to other domains) plotted against the number of links to other domains.]
Wuchty Mol. Biol. Evol. 18, 1694 (2001)

23 Which ones are highly connected domains?
The majority of highly connected InterPro domains appear in signalling pathways.
List of the 10 best linked domains in various species.
From left to right: the number of links increases. The number of signalling domains (PH, SH3), their ligands (proline-rich extensions), and receptors (GPCR/RHODOPSIN) increases.
→ The evolutionary trend toward compartmentalization of the cell and multicellularity demands a higher degree of organization.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)

24 Evolutionary Aspects
The BA-model of scale-free networks is constructed by preferential attachment of newly added vertices to already well-connected ones.
→ Fell and Wagner (2000) argued that vertices with many connections in metabolic networks were metabolites originating very early in the course of evolution, where they shaped a core metabolism.
→ Analogously, highly connected domains could also have originated very early. Is this true?
No. The majority of highly connected domains in Methanococcus and in E. coli are concerned with the maintenance of metabolism. None of the highly connected domains of higher organisms is found here. On the other hand, helicase C has roughly similar degrees of connection in all organisms.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)

25 Conclusions
Expansion of protein families in multicellular vertebrates coincides with higher connectivity of the respective domains.
Extensive shuffling of domains to increase combinatorial diversity might provide protein sets which are sufficient to preserve cellular procedures without dramatically expanding the absolute size of the protein complement.
→ The greater proteome complexity of higher eukaryotes is not simply a consequence of genome size, but must also be a consequence of innovations in domain arrangements.
→ Highly linked domains represent functional centers in various different cellular aspects. They could be treated as "evolutionary hubs" which help to organize the domain space.
Wuchty Mol. Biol. Evol. 18, 1694 (2001)

26 Network growth mechanism
How can we know what the "true" growth mechanism of real biological networks is?
Question 1: Is it important to know this? Yes.
Question 2: What measure do we use to distinguish networks produced by different growth mechanisms?
→ Look at the fine structure (motifs) of biological networks.

27 Analysis of Drosophila melanogaster protein interaction network
Data set: protein-protein interaction map for Drosophila by Giot et al.
Problem: the data set is subject to numerous false positives. Giot et al. assign a confidence score p ∈ [0,1] to each interaction, measuring how likely the interaction is to occur in vivo.
What threshold p* should be used? Measure the size of the components for all possible values of p*.
Observe: for p* = 0.65, the two largest components are connected → use this value as threshold.
Edges in the graph correspond to interactions for which p > p*.
Remove self-interactions and isolated vertices → 3359 (4625) nodes with 2795 (4683) edges for p* = 0.65 (0.5).
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
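The thresholding procedure can be sketched as below: keep interactions with p > p*, drop self-interactions, and list the component sizes so that different thresholds can be compared. The scored interaction list is a made-up toy example (the Giot et al. data are not reproduced here), and the function names are assumptions.

```python
def threshold_graph(scored_edges, p_star):
    """Keep edges with confidence p > p_star, dropping self-interactions.
    Vertices left without any edge never enter the graph."""
    adj = {}
    for u, v, p in scored_edges:
        if p > p_star and u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def component_sizes(adj):
    """Sizes of all connected components, largest first."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:                     # depth-first search
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical scored interactions: (protein, protein, confidence).
scored = [("A", "B", 0.9), ("B", "C", 0.7), ("C", "D", 0.6),
          ("D", "E", 0.8), ("E", "E", 0.95), ("F", "G", 0.3)]
sizes_by_threshold = {p_star: component_sizes(threshold_graph(scored, p_star))
                      for p_star in (0.5, 0.65)}
```

In this toy data the lower threshold keeps the C-D link that joins the two largest components, which is the kind of effect one would scan for when choosing p*.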

28 Network evolution models considered
Duplication-mutation-complementation (DMC) algorithm: based on a model proposing that most of the duplicate genes observed today have been preserved by functional complementation. If either the gene or its copy loses one of its functions (edges), the other becomes essential in assuring the organism's survival.
Algorithm: a duplication step is followed by mutations that preserve functional complementarity.
- At every time step, choose a node v at random.
- A twin vertex $v_{twin}$ is introduced, copying all of v's edges.
- For each edge of v, delete with probability $q_{del}$ either the original edge or the corresponding edge of $v_{twin}$.
- Conjoin the twins themselves with independent probability $q_{con}$, representing an interaction of a protein with its own copy.
No edges are created by mutations → the DMC algorithm assumes that the probability of creating new advantageous functions by random mutations is negligible.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
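The DMC steps can be sketched in Python as follows. Two details are assumptions made here because the slide does not fix them: the growth starts from a seed graph of two linked nodes, and when an edge pair is hit by a deletion, the original edge and the twin's edge are equally likely to be the one removed.

```python
import random

def dmc(n, q_del, q_con, seed=None):
    """Grow a graph to n nodes by duplication-mutation-complementation.
    Returns dict node -> set of neighbors."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}                    # assumed seed graph
    while len(adj) < n:
        v = rng.choice(list(adj))
        twin = len(adj)
        adj[twin] = set()
        for w in list(adj[v]):                # duplication: copy all of v's edges
            adj[twin].add(w)
            adj[w].add(twin)
        for w in list(adj[twin]):             # complementation: lose one edge of a pair
            if rng.random() < q_del:
                loser = v if rng.random() < 0.5 else twin
                adj[loser].discard(w)
                adj[w].discard(loser)
        if rng.random() < q_con:              # link the protein to its own copy
            adj[v].add(twin)
            adj[twin].add(v)
    return adj

graph = dmc(n=50, q_del=0.5, q_con=0.3, seed=1)
```

Because each shared neighbor keeps at least one of the two edges, every function of the original node stays covered by the pair, which is the complementation idea; individual nodes can still end up isolated.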

29 Network evolution models considered
Variant of DMC: Duplication-random mutations (DMR) algorithm. Possible interactions between twins are neglected. Instead, edges between $v_{twin}$ and the neighbors of v can be removed with probability $q_{del}$, and new edges can be created at random between $v_{twin}$ and any other vertices with probability $q_{new}/N$, where N is the current total number of vertices. DMR emphasizes the creation of new advantageous functions by mutation.
Other models:
- linear preferential attachment (LPA) (Barabási)
- random static networks (RDS) (Erdős-Rényi)
- random growing networks (RDG: growing graphs where new edges are created randomly between existing nodes)
- aging vertex networks (AGV: growing graphs modeling citation networks, where the probability for new edges decreases with the age of the vertex)
- small-world networks (SMW: interpolation between regular ring lattices and randomly connected graphs).
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
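For comparison with DMC, a DMR step can be sketched as below. The seed graph of two linked nodes, the exclusion of v itself from the candidates for new random edges (following the remark that twin interactions are neglected), and the function name are all assumptions of this illustration rather than details given in the slides.

```python
import random

def dmr(n, q_del, q_new, seed=None):
    """Grow a graph to n nodes by duplication with random mutations.
    Returns dict node -> set of neighbors."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}                    # assumed seed graph
    while len(adj) < n:
        N = len(adj)                          # current number of vertices
        v = rng.choice(list(adj))
        twin = N
        # copied edges survive with probability 1 - q_del
        kept = {w for w in adj[v] if rng.random() >= q_del}
        # new random edges to other vertices, each with probability q_new / N
        extra = {u for u in adj
                 if u != v and u not in adj[v] and rng.random() < q_new / N}
        adj[twin] = kept | extra
        for w in adj[twin]:
            adj[w].add(twin)
    return adj

graph = dmr(n=50, q_del=0.4, q_new=0.5, seed=1)
```

Unlike the DMC sketch, only the twin's copied edges are subject to deletion here, and the twin can gain links that the original never had.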

30 Training set
Create 1000 graphs as training data for each of the seven different models. Every graph is generated with the same number of edges and nodes as measured in Drosophila.
Quantify the topology of a network by counting all possible subgraphs up to a given cut-off, which could be the number of nodes, the number of edges, or the length of a given walk.
Here: count all subgraphs that can be constructed by a walk of length 8 (148 non-isomorphic subgraphs) or length 7 (130 non-isomorphic subgraphs). Use these counts as input features for the classifier.
Note that the average shortest path between two nodes of the giant component of the Drosophila network is 11.6 (9.4) for p* = 0.65 (0.5). → Walks of length 8 can traverse large parts of the network.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15

31 Visualization of subgraphs
A qualitative and more intuitive way of interpreting the classification result is visualizing the subgraph profiles.
Subgraphs associated with Figures 3 and 1. A representative subset of 50 subgraphs out of 148 is shown.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15

32 Learning algorithm: Alternating Decision Tree
Rectangles: decision nodes. A given network's subgraph counts determine paths in the tree dictated by inequalities specified by the decision nodes.
For each class, the Alternating Decision Tree outputs a real-valued prediction score, which is the sum of all weights over all paths. The class with the highest score wins.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
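The scoring rule can be illustrated with a generic alternating-decision-tree evaluator, in which every decision node is written as a rule (precondition, condition, weight if true, weight if false) and the score is the sum of the weights of all rules whose precondition holds. This is a sketch of the general ADT idea only: the rules, thresholds, weights and feature names below are invented for illustration and are not the trees learned by Middendorf et al.

```python
def adt_score(rules, features, root_weight=0.0):
    """Sum the prediction weights along all paths that the features reach."""
    score = root_weight
    for precondition, condition, w_true, w_false in rules:
        if precondition(features):            # is this decision node reached?
            score += w_true if condition(features) else w_false
    return score

# Hypothetical decision nodes on two subgraph-count features.
def always(f):
    return True

def many_3cycles(f):
    return f["3-cycles"] > 100

def few_3cycles(f):
    return not many_3cycles(f)

def many_chains(f):
    return f["8-chains"] > 500

rules_by_class = {
    "DMC": [(always, many_3cycles, 1.2, -0.8)],
    "LPA": [(always, many_3cycles, -1.0, 0.4),
            (few_3cycles, many_chains, 0.9, -0.3)],   # nested under the "false" branch
}

def classify(features):
    scores = {c: adt_score(rules, features) for c, rules in rules_by_class.items()}
    return max(scores, key=scores.get), scores

winner, scores = classify({"3-cycles": 250, "8-chains": 40})
```

A network with few 3-cycles and many long chains would follow the other branches and, with these made-up weights, score highest for the LPA class.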

33 Performance on training set
Can the Decision Tree separate the graphs generated by the different growth mechanisms?
The confusion matrix shows truth and prediction for the test sets. 5 out of 7 have nearly perfect prediction accuracy.
AGV is constructed as an interpolation between LPA and a ring lattice → the AGV, LPA and SMW mechanisms are equivalent in specific parameter regimes and show a non-negligible overlap.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15

34 Task: discriminate different growth mechanisms
Ten graphs of two different mechanisms exhibit similar average geodesic lengths and almost identical degree distributions and clustering coefficients.
(a) Cumulative degree distribution $p(k > k_0)$, average clustering coefficient and average geodesic length, all quantities averaged over a set of 10 graphs. → Global topology descriptors cannot separate the growth mechanisms.
(b) Prediction score for all ten graphs and all five cross-validated ADTs. The two sets of graphs can now be perfectly separated by the classifier.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15

35 Learning algorithm: Alternating Decision Tree
The figure shows the first few decision nodes (out of 120) of a resulting ADT.
The prediction scores reveal that a high count of 3-cycles suggests a DMC network. The DMC mechanism indeed facilitates the creation of many 3-cycles by allowing 2 copies to attach to each other, thus creating 3-cycles with their common neighbors.
A low count of 3-cycles but a high count of 8-edge linear chains is a good predictor for LPA and DMR networks.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15
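As a small illustration of one such feature, the number of 3-cycles in a graph can be counted directly from an adjacency structure. This sketch covers only the 3-cycle count, not the full walk-based enumeration of subgraphs used in the study; the function name and the example graph are assumptions.

```python
def count_3cycles(adj):
    """adj: dict node -> set of neighbors (undirected, comparable node labels).
    Counts each triangle exactly once by enforcing u < v < w."""
    count = 0
    for u in adj:
        for v in adj[u]:
            if v > u:
                for w in adj[v]:
                    if w > v and w in adj[u]:
                        count += 1
    return count

# Complete graph on 4 nodes: every choice of 3 nodes forms a 3-cycle.
k4 = {i: set(range(4)) - {i} for i in range(4)}
triangles_in_k4 = count_3cycles(k4)
```

Applied to graphs grown with and without a twin-to-twin link, such a count gives a feel for why 3-cycles separate the DMC mechanism from the others.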

36 Prediction for Drosophila melanogaster network
Now use this classifier (ADT), which has good prediction accuracy, to determine the network mechanism that best reproduces the Drosophila network (or any network of the same size).
Prediction scores for the Drosophila protein network for different confidence thresholds p* and different cut-offs in subgraph size.
Drosophila is consistently classified as a DMC network, with an especially strong prediction for a confidence threshold of p* = 0.65 and independently of the cut-off in subgraph size.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15

37 Subgraph profiles
The average subgraph count of the training data for every mechanism is shown for the 50 representative subgraphs S1-S50. Black lines indicate that this model is closest to Drosophila based on the absolute difference between the subgraph counts.
For 60% of the subgraphs (S1-S30), the counts for Drosophila are closest to the DMC model. All of these subgraphs contain one or more cycles, including highly connected subgraphs (S1) and long linear chains ending in cycles (S16, S18, S22, S23, S25).
The DMC algorithm is the only mechanism that produces such cycles with a high occurrence.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15

38 Robustness against noise
Edges in the Drosophila network are randomly replaced and the network is classified. Plotted are the prediction scores for each of the 7 classes as more and more edges are replaced. Every point is an average over 200 independent random replacements.
For high noise levels (beyond 80%), the network is classified as an Erdős-Rényi (RDS) graph. For low noise (< 30%), the confidence in the classification as a DMC network is even higher than in the classification as an RDS network for high noise.
The prediction score y(c) for class c is related to the estimated probability p(c) for the tested network to be in class c by [equation not preserved in this transcript].
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15

39 Conclusions
Very nice (!) method that allows growth mechanisms to be inferred for real networks.
The method is robust against noise and data subsampling; no prior assumption about network features/topology is required.
The learning algorithm does not assume any relationships between features (e.g. orthogonality). Therefore the input space can be augmented with various features in addition to subgraph counts.
The protein interaction network of Drosophila is confidently classified as a DMC network.
However, further growth mechanisms need to be explored in the future. Input from evolutionary biology is needed. Here, we mostly concentrated on the technique of characterizing the resulting network topologies.
Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15

