341: Introduction to Bioinformatics


1 341: Introduction to Bioinformatics
Dr. Nataša Pržulj, Department of Computing, Imperial College London

2 Topics
Introduction to biology (cell, DNA, RNA, genes, proteins)
Sequencing and genomics (sequencing technology, sequence alignment algorithms)
Functional genomics and microarray analysis (array technology, statistics, clustering and classification)
Introduction to biological networks
Introduction to graph theory
Network properties
Network/node centralities
Network motifs
Network models
Network/node clustering
Network comparison/alignment
Protein 3D structure / Network data integration
Software tools for network analysis
Interplay between topology and biology

3 Network properties: summary of last class
Network comparisons: comparing large networks is computationally hard because the underlying subgraph isomorphism problem is NP-complete: given two graphs G and H as input, determine whether G contains a subgraph that is isomorphic to H.
Network comparisons therefore rely on easily computable heuristics (approximate solutions), called “network properties”.
Network properties can roughly, and historically, be divided into two categories:
Global network properties give an overall view of the network, but might not be detailed enough to capture complex topological characteristics of large networks.
Local network properties are more detailed network descriptors that usually impose a larger number of constraints, thus reducing the “degrees of freedom” in which the networks being compared can vary.

4 Network properties: summary of last class
1. Global Network Properties
Readings: Chapter 3 of “Analysis of Biological Networks” by Junker and Schreiber.
Some global network properties:
Degree distribution
Average clustering coefficient
Clustering spectrum
Average diameter
Spectrum of shortest path lengths
Centralities

5 Network properties: summary of last class
2. Local Network Properties
Readings: Chapter 5 of “Analysis of Biological Networks” by Junker and Schreiber.
Network motifs
Graphlets
Two network comparison measures based on graphlets:
2.1) Relative Graphlet Frequency Distance between two networks
2.2) Graphlet Degree Distribution Agreement between two networks

6 What is a network (graph) model?

7 Does the model network fit the data?
Use network properties, both local and global.
Why?
“Hardness” of graph-theoretic problems, e.g., NP-completeness of subgraph isomorphism
Networks cannot be compared/aligned exactly, so use heuristics (approximate solutions)
Exact comparison is inappropriate in biology:
Due to biological variation
Noise
→ Revise models as data sets evolve

8 Why model networks?
Understand laws → reproduction/predictions
Network models have already been used in biological applications:
Network motifs (Shen-Orr et al., Nature Genetics 2002; Milo et al., Science 2002)
De-noising of PPI network data (Kuchaiev et al., PLoS Comp. Biology, 2009)
Guiding biological experiments (Lappe and Holm, Nature Biotechnology, 2004)
Development of computationally easy algorithms on PPI networks for problems that are computationally intensive on graphs in general (Przulj et al., Bioinformatics, 2006)

9 Network models
We will cover the following network models:
Erdos–Renyi random graphs
Generalized random graphs (with the same degree distribution as the data networks)
Small-world networks
Scale-free networks
Hierarchical model
Geometric random graphs
Stickiness index-based network model

10 Erdos–Renyi random graphs (ER)
Model a data network G(V,E) with |V| = n and |E| = m.
An ER graph that models G is constructed as follows:
It has n nodes
Edges are added between pairs of nodes uniformly at random, each with the same probability p
Two (equivalent) methods for constructing ER graphs:
$G_{n,p}$: pick p so that the resulting model network has m edges in expectation
$G_{n,m}$: pick m pairs of nodes uniformly at random and add edges between them with probability 1
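The two constructions can be sketched in a few lines of Python. This is a minimal illustration rather than the course's reference code: the function names and the edge representation (a set of pairs (u, v) with u < v) are assumptions made here.

```python
import itertools
import random

def er_gnp(n, p, rng):
    """G(n,p): include each of the n(n-1)/2 possible edges independently with probability p."""
    return {(u, v) for u, v in itertools.combinations(range(n), 2)
            if rng.random() < p}

def er_gnm(n, m, rng):
    """G(n,m): choose exactly m distinct node pairs uniformly at random."""
    all_pairs = list(itertools.combinations(range(n), 2))
    return set(rng.sample(all_pairs, m))

def p_for_expected_edges(n, m):
    """Edge probability p for which G(n,p) has m edges in expectation."""
    return 2 * m / (n * (n - 1))

# Example: model a data network with n = 50 nodes and m = 100 edges.
rng = random.Random(0)
model_gnm = er_gnm(50, 100, rng)                          # exactly 100 edges
model_gnp = er_gnp(50, p_for_expected_edges(50, 100), rng)  # about 100 edges
```

Note that G(n,m) fixes the number of edges exactly, whereas G(n,p) only matches it on average.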

12 Erdos–Renyi random graphs (ER)
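The expected number of edges, |E| = m, in $G_{n,p}$ is: $m = p\binom{n}{2} = \frac{p\,n(n-1)}{2}$
The average degree is: $z = \frac{2m}{n} = p(n-1) \approx pn$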

13 Erdos–Renyi random graphs (ER)
Many properties of ER graphs can be proven theoretically (see: Bollobas, "Random Graphs," 2002).
Example: when m = n/2, the giant component suddenly emerges, i.e.:
One connected component of the network has O(n) nodes
The next largest connected component has O(log(n)) nodes
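The emergence of the giant component can be observed empirically. The sketch below is an illustration only; the function name and the union-find bookkeeping are choices made here. It builds a G(n,m) graph and reports the sizes of its two largest connected components as fractions of n.

```python
import random
from collections import Counter

def two_largest_component_fractions(n, m, rng):
    """Build a G(n,m) graph and return the sizes of its two largest
    connected components, each divided by n."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)      # two distinct nodes
        edge = (min(u, v), max(u, v))
        if edge not in edges:               # no multiple edges
            edges.add(edge)
            parent[find(u)] = find(v)       # merge the two components

    sizes = sorted(Counter(find(x) for x in range(n)).values(), reverse=True)
    sizes.append(0)                         # in case there is a single component
    return sizes[0] / n, sizes[1] / n

rng = random.Random(1)
for m in (500, 1000, 2000):                 # m = n/4, n/2, n for n = 2000
    print(m, two_largest_component_fractions(2000, m, rng))
```

Well below m = n/2 the largest component covers only a small fraction of the nodes; well above it, one component dominates while the second largest stays small.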

14 Erdos–Renyi random graphs (ER)
The degree distribution is binomial: $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
For large n, this can be approximated with a Poisson distribution: $P(k) = \frac{z^k e^{-z}}{k!}$, where z is the average degree (compute it!)
However, currently available biological networks have power-law degree distributions
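As a quick numerical check of the approximation, the sketch below compares the binomial degree distribution of an ER graph, Binomial(n-1, p), with a Poisson distribution of mean z = p(n-1). The function names and the example values n = 10000, z = 4 are illustrative assumptions.

```python
import math

def binomial_degree_pmf(k, n, p):
    """P(degree = k) for a node in G(n,p): each of the other n-1 nodes
    is a neighbour independently with probability p."""
    return math.comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k)

def poisson_degree_pmf(k, z):
    """Poisson approximation with mean z (the average degree)."""
    return z**k * math.exp(-z) / math.factorial(k)

n, z = 10000, 4.0
p = z / (n - 1)
for k in range(10):
    print(k, binomial_degree_pmf(k, n, p), poisson_degree_pmf(k, z))
```

For large n and fixed z the two columns agree closely, and both decay much faster in k than a power law would.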

15 Erdos–Renyi random graphs (ER)
The clustering coefficient, C, of ER graphs is low (for low p)
C = p, since the probability p of connecting any two nodes in an ER graph is the same, regardless of whether the two nodes are neighbors of the same node
However, biological networks have high clustering coefficients
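The claim C = p can be checked by simulation. The sketch below is illustrative: the names and the convention that nodes of degree below 2 contribute 0 are assumptions made here. It generates a G(n,p) graph and averages the local clustering coefficients.

```python
import itertools
import random

def er_adjacency(n, p, rng):
    """G(n,p) as an adjacency dictionary {node: set of neighbours}."""
    adj = {v: set() for v in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def average_clustering(adj):
    """Mean over all nodes of (edges among neighbours) / (possible edges among neighbours).
    Nodes with fewer than 2 neighbours contribute 0 (a convention chosen here)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in itertools.combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

adj = er_adjacency(200, 0.1, random.Random(7))
print(average_clustering(adj))   # expected to be close to p = 0.1
```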

16 Erdos–Renyi random graphs (ER)
The average diameter of ER graphs is small
It is approximately $\frac{\ln n}{\ln z}$, where z = p(n-1) is the average degree
Biological networks also have small average diameters
Summary

17 Generalized random graphs (ER-DD)
Preserve the degree distribution of the data (“ER-DD”)
Constructed as follows:
An ER-DD network has n nodes (so does the data)
Edges are added between pairs of nodes using the “stubs method”

18 Generalized random graphs (ER-DD)
The “stubs method” for constructing ER-DD graphs:
The number of “stubs” (to be filled by edges) is assigned to each node in the model network according to the degree distribution of the real network to be modeled
Edges are created between pairs of nodes with “available” stubs, picked at random
After an edge is created, the number of stubs left available at each of its “end nodes” is decreased by one
Multiple edges between the same pair of nodes are not allowed
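The steps of the stubs method can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the exclusion of self-loops, and the `max_tries` cut-off (used to stop when the remaining stubs cannot be paired without creating a multiple edge) are choices made here, not details given on the slide.

```python
import random

def stubs_method(degrees, rng, max_tries=1000):
    """degrees[v] is the number of stubs assigned to node v.
    Returns (edges, stubs_left); edges are pairs (u, v) with u < v."""
    stubs = list(degrees)
    edges = set()
    tries = 0
    while tries < max_tries:
        # Nodes that still have at least one available stub.
        available = [v for v, s in enumerate(stubs) if s > 0]
        if len(available) < 2:
            break
        u, v = rng.sample(available, 2)     # two distinct nodes, so no self-loops
        edge = (min(u, v), max(u, v))
        if edge in edges:                   # multiple edges are not allowed
            tries += 1
            continue
        edges.add(edge)
        stubs[u] -= 1                       # one stub used at each end node
        stubs[v] -= 1
    return edges, stubs

edges, left = stubs_method([3, 2, 2, 2, 1], random.Random(3))
print(sorted(edges), left)
```

Any stubs that remain unfilled are returned in `left`, so the caller can see how closely the target degree sequence was realized.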

19 Generalized random graphs (ER-DD)
Summary: 2 global network properties are matched by ER-DD
How about local network properties (graphlet frequencies)?
Low-density (sparse) graphlets are frequent in ER and ER-DD
However, data networks have many dense graphlets, since data networks have high clustering coefficients

20 Small-world networks (SW)
Watts and Strogatz, 1998
Created from regular ring lattices by random rewiring of a small percentage of their edges
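A minimal sketch of this construction follows. The parameter names (`k` for the number of lattice neighbours per node, `beta` for the rewiring probability) and the rule of keeping one endpoint fixed while moving the other are assumptions made for illustration.

```python
import random

def ring_lattice(n, k):
    """Regular ring lattice: each node is joined to its k nearest
    neighbours (k even, k < n), k/2 on each side."""
    edges = set()
    for u in range(n):
        for step in range(1, k // 2 + 1):
            v = (u + step) % n
            edges.add((min(u, v), max(u, v)))
    return edges

def rewire(edges, n, beta, rng):
    """With probability beta, replace edge (u, v) by (u, w) for a uniformly
    chosen node w, avoiding self-loops and duplicate edges."""
    result = set(edges)
    for u, v in sorted(edges):
        if rng.random() < beta:
            candidates = [w for w in range(n)
                          if w != u and (min(u, w), max(u, w)) not in result]
            if candidates:
                w = rng.choice(candidates)
                result.remove((u, v))
                result.add((min(u, w), max(u, w)))
    return result

lattice = ring_lattice(20, 4)
small_world = rewire(lattice, 20, 0.1, random.Random(5))
print(len(lattice), len(small_world))   # rewiring preserves the number of edges
```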

21 Small-world networks (SW)
SW networks have:
High clustering coefficients, introduced by “ring regularity”
Small average diameters: the large average diameters of regular lattices are made small by randomly rewiring a small percentage of edges
Summary

22 Scale-free networks (SF)
Power-law degree distributions: $P(k) \sim k^{-\gamma}$, with $\gamma > 0$; typically $2 < \gamma < 3$

24 Scale-free networks (SF)
Different models exist, e.g.:
Preferential Attachment Model (SF-BA) (Barabasi-Albert, 1999)
Gene Duplication and Mutation Model (SF-GD) (Vazquez et al., 2003)

25 Scale-free networks (SF)
Preferential Attachment Model (SF-BA)
“Growth” model: nodes are added to an existing network
New nodes preferentially attach to existing nodes with probability proportional to the degrees of the existing nodes
This is repeated until the size of the SF network matches the size of the data
“Rich getting richer”
The starting network strongly influences the properties of the resulting network (F. Hormozdiari, et al., PLoS Computational Biology, 3(7):e118, July)
SF-BA: particularly effective at describing the Internet
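The growth process can be sketched as below. The choice of starting network (a complete graph on m + 1 nodes) and the number m of edges added per new node are assumptions made here; as noted above, the starting network strongly influences the result, so a different seed graph may be appropriate.

```python
import random

def preferential_attachment(n, m, rng):
    """Grow a network to n nodes; each new node attaches to m distinct
    existing nodes chosen with probability proportional to their degree."""
    edges = set()
    endpoints = []   # node v appears here once per incident edge, i.e. deg(v) times

    # Assumed starting network: complete graph on nodes 0..m.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            edges.add((u, v))
            endpoints += [u, v]

    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Uniform choice from `endpoints` is degree-proportional choice of a node.
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.add((t, new))
            endpoints += [t, new]
    return edges

edges = preferential_attachment(200, 2, random.Random(11))
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(edges), max(degree.values()))   # a few early nodes become hubs
```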

26 Scale-free networks (SF)
Gene Duplication and Mutation Model (SF-GD)
Biologically motivated
Attempts to mimic gene duplication and mutation processes

27 Scale-free networks (SF)
Gene Duplication and Mutation Model (SF-GD)
At each time step, a node is added to the network as follows:

28 Scale-free networks (SF)
Summary

29 Hierarchical model
Preserves network “modularity” via a fractal-like generation of the network

30 Hierarchical model
These graphs do not match any biological data and are highly unlikely to be found in data sets

31 Geometric random graphs
“Uniform” geometric random graphs (GEO): N. Przulj lab
Geometric gene duplication and mutation model (GEO-GD): N. Przulj et al., PSB 2010

32 Geometric random graphs
“Uniform” geometric random graphs (GEO)
Take any metric space and place nodes within the space according to a uniform random distribution
Any two nodes within distance r of each other (measured with any chosen distance norm for the space) are connected
Choose r so that the size of the GEO network matches that of the data
There are many possible metric spaces (e.g., Euclidean space)
There are many possible distance norms (e.g., the Euclidean distance, the Chessboard distance, and the Manhattan/Taxi Driver distance)
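A minimal sketch of the GEO construction in the unit cube follows. The function name, the use of the unit cube as the space, and the `norm` argument are assumptions made for illustration.

```python
import itertools
import math
import random

def geometric_random_graph(n, r, dim, rng, norm="euclidean"):
    """Place n points uniformly at random in the unit cube [0,1]^dim and
    connect every pair of points whose distance is at most r."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]

    def dist(a, b):
        diffs = [abs(x - y) for x, y in zip(a, b)]
        if norm == "euclidean":
            return math.sqrt(sum(d * d for d in diffs))
        if norm == "manhattan":
            return sum(diffs)
        if norm == "chessboard":
            return max(diffs)
        raise ValueError("unknown norm: " + norm)

    edges = {(u, v) for u, v in itertools.combinations(range(n), 2)
             if dist(points[u], points[v]) <= r}
    return points, edges

points, edges = geometric_random_graph(60, 0.3, 3, random.Random(2))
print(len(edges))
```

To match the number of edges in a data network, r can be increased or decreased until the edge count is close to the target; for the same points and radius, the Manhattan norm yields the fewest edges and the Chessboard norm the most.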

42 Geometric random graphs
“Uniform” geometric random graphs (GEO)
Summary

43 Geometric random graphs
Geometric gene duplication and mutation model (GEO-GD)
Gene duplications and mutations can be used to guide the growth process in a geometric graph

49 Geometric random graphs
Geometric gene duplication and mutation model (GEO-GD)
This variant also reproduces the graphlet properties of the empirical data
Also, these networks have power-law degree distributions

50 Stickiness index-based network model
(N. Przulj and D. Higham, Journal of the Royal Society Interface, vol 3, num 10, pp , 2006.)
Based on the stickiness index:
A number based on a protein’s normalized degree in a PPI network
Used to summarize the abundance and popularity of binding domains of a protein
Assumption: a high-degree protein has many binding domains
However, remember “date” vs. “party” hubs
A pair of proteins is more likely to interact under this model if both proteins have high stickiness indices
The probability of an edge between two nodes is the product of their stickiness indices
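The model can be sketched as follows. The slide only says the index is a "normalized degree"; the specific normalization used here, deg(i) divided by the square root of the sum of all degrees, is an assumption, as are the function names and the capping of probabilities at 1.

```python
import itertools
import math
import random

def stickiness_indices(degrees):
    """Assumed normalization: theta_i = deg_i / sqrt(sum of all degrees)."""
    total = sum(degrees)
    if total == 0:
        return [0.0 for _ in degrees]
    return [d / math.sqrt(total) for d in degrees]

def sticky_network(degrees, rng):
    """Connect nodes u and v with probability theta_u * theta_v
    (capped at 1 in case the product exceeds 1)."""
    theta = stickiness_indices(degrees)
    return {(u, v) for u, v in itertools.combinations(range(len(degrees)), 2)
            if rng.random() < min(1.0, theta[u] * theta[v])}

data_degrees = [5, 3, 3, 2, 2, 1, 1, 1]   # hypothetical degree sequence of a data network
print(stickiness_indices(data_degrees))
print(sorted(sticky_network(data_degrees, random.Random(4))))
```

With this normalization, the products theta_u * theta_v summed over all v equal deg(u), which is one way to see why the model's expected degrees follow those of the data.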

51 Stickiness index-based network model
“Sticky networks” have the expected degree distribution of the data
Also, they mimic well the clustering coefficients and the diameters of real-world networks
Summary

52 Software that implements many of these network models and evaluates their fit to data networks with respect to a variety of network properties (but there are others): GraphCrunch:
