Functional Genomics and Gene Network Analysis


1 Functional Genomics and Gene Network Analysis
Alexandra Maertens Oct. 5, 2004

2 Functional Genomics
The fundamental strategy of functional genomics is to expand the scope of biological investigation from studying single genes or proteins to studying all genes or proteins at once in a systematic fashion. Functional genomics seeks to narrow the gap between sequence and function and to yield new insights into the behavior of biological systems. paraphrased from
40% of predicted genes in newly sequenced genomes cannot be assigned a function based on sequence similarity

3 How to determine functionally related genes?
40% of predicted genes in newly sequenced genomes cannot be assigned a function based on sequence similarity.
Other techniques include comparing data across phylogenetic profiles, looking for gene fusion events, etc., but all of these techniques are inexact.

4 Does co-regulation imply functional similarity?
“Guilt by association”: genes sharing a common pattern of expression in many different experiments are likely to be involved in similar processes.
This technique was used by Tavazoie et al. (1999) to identify biologically significant DNA motifs in the promoter regions of genes clustered based on cell-cycle expression patterns in yeast.
Limited applicability for determining functions of genes in a single species.

5 Co-Expression Can Be Caused By:
Gene A regulates Gene B, or vice versa.
Or they are both regulated by a third gene, C.
It can just represent a common response to the environment of the cell.
And lastly: it can be an accident.

6 Determining Co-Regulation in Large Scale Data Sets
Euclidean Distance, Pearson, Spearman

7 Euclidean Distance
Commonly used.
Not practical for analyzing a large data set with a diverse set of microarrays.
Microarrays do not really measure an absolute amount of mRNA; they should be considered as measuring the relative amount of mRNA.
There will be large differences in the range of values for two microarrays done at different times.

8 The Pearson Correlation treats the vectors as if they were the same (unit) length, and is thus insensitive to the amplitude of changes that may be seen in the expression profiles. Since Euclidean distance measures the absolute distance between points in space, the Euclidean distance thus takes into account both the direction and the magnitude of the vectors. From Stanford Microarray Database,
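The contrast between the two measures can be sketched on a toy example. The two profiles below are hypothetical (they are not from the presentation): they have the same shape but differ tenfold in amplitude, so the Euclidean distance is large while the Pearson correlation is essentially 1.

```python
import math

def euclidean(x, y):
    """Absolute distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: compares the shape of two profiles, not their amplitude."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical profiles: gene_b follows gene_a exactly, at 10x the amplitude.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]

d = euclidean(gene_a, gene_b)  # large: the magnitudes differ
r = pearson(gene_a, gene_b)    # close to 1: the direction of change is identical
print(f"Euclidean distance = {d:.2f}, Pearson r = {r:.3f}")
```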

9 Pearson’s
How well does a linear function describe the relationship between two variables?
Values range from -1 (perfect negative correlation) through 0 (no correlation) to 1 (perfect positive correlation)

10 Pearson (courtesy of Hyperstat.com)

11 R = 0.778184
               Lead Detoxification Gene   Lead Uptake Gene
No Lead        1.5                        1.4
10 uM Lead     2.2                        3.2
20 uM Lead     3.5                        2
50 uM Lead     4.6                        4.5
100 uM Lead    4.7                        4
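As a check on the value of R quoted on this slide, here is a minimal sketch that computes the Pearson correlation directly from the five pairs of values. The variable names are illustrative only.

```python
import math

# Expression values from the slide, in the order:
# No Lead, 10 uM, 20 uM, 50 uM, 100 uM Lead.
detox = [1.5, 2.2, 3.5, 4.6, 4.7]    # Lead Detoxification Gene
uptake = [1.4, 3.2, 2.0, 4.5, 4.0]   # Lead Uptake Gene

n = len(detox)
mean_x = sum(detox) / n
mean_y = sum(uptake) / n

# Sums of products of deviations from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(detox, uptake))
sxx = sum((x - mean_x) ** 2 for x in detox)
syy = sum((y - mean_y) ** 2 for y in uptake)

r = sxy / math.sqrt(sxx * syy)
print(f"r = {r:.6f}")
```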

12 Spearman from

13 Convert each expression value to a value according to rank in each column (from lowest to highest)
               Lead Detoxification Gene   Lead Uptake Gene
No Lead        1.5                        1.4
10 uM Lead     2.2                        3.2
20 uM Lead     3.5                        2
50 uM Lead     4.6                        4.5
100 uM Lead    4.7                        4

               Rank Values of Lead Detox Gene   Rank Values of Lead Excretion Gene
No Lead        1                                1
10 uM Lead     2                                3
20 uM Lead     3                                2
50 uM Lead     4                                5
100 uM Lead    5                                4

14 Calculate the difference between the ranks
               Rank Values of Lead Detox Gene   Rank Values of Lead Excretion Gene   Difference in Ranks   Difference Squared
No Lead        1                                1                                    0                     0
10 uM Lead     2                                3                                    -1                    1
20 uM Lead     3                                2                                    1                     1
50 uM Lead     4                                5                                    -1                    1
100 uM Lead    5                                4                                    1                     1
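The ranking and differencing steps can be sketched as follows, using the expression values from the earlier slide. The final step uses the standard Spearman formula for untied ranks, rho = 1 - 6 * sum(d^2) / (n(n^2 - 1)), which is an assumption about where the slides are heading since the transcript stops at the squared differences.

```python
def ranks(values):
    """Rank values from lowest (1) to highest; this sketch assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

# Expression values from the slide (No Lead, 10, 20, 50, 100 uM Lead).
detox = [1.5, 2.2, 3.5, 4.6, 4.7]
second_gene = [1.4, 3.2, 2.0, 4.5, 4.0]

rank_detox = ranks(detox)
rank_second = ranks(second_gene)

# Difference in ranks and its square, one entry per condition.
diffs = [a - b for a, b in zip(rank_detox, rank_second)]
squared = [d * d for d in diffs]

n = len(detox)
rho = 1 - 6 * sum(squared) / (n * (n ** 2 - 1))
print(rank_detox, rank_second, diffs, squared, rho)
```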

15 Going From Distance Measurements to Networks
Why do biologists care about networks?

16 From: Patrik D’haeseleer Harvard University http:/genetics. med

17 Yeast Protein Interaction Network
Uetz, Schwikowski, Fields and co-workers; Ito and co-workers

18 Image credit: U.S. Department of Energy Genomics:GTL Program, http://doegenomestolife.org

19 A network is simply
A collection of nodes (vertices)
Connected by edges (links)
From Barabási, Nature Reviews Genetics, Vol. 5, Feb. 2004

20 From the layout of the network, you can calculate a lot of interesting properties.
Degree = the connectivity, k, which is the number of links a node has to its neighbours.
Degree distribution = P(k), which gives the probability that a node will have a given number k of links (obtained by counting the number of nodes with each value of k and dividing by the total number of nodes).
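A minimal sketch of both quantities, computed from an edge list. The five-node network below is hypothetical and only serves to illustrate the counting.

```python
from collections import Counter

# Hypothetical undirected network given as a list of edges (links).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

# Degree k of each node: the number of links it has to its neighbours.
# Note: a node with no links at all would not appear in an edge list.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution P(k): number of nodes with degree k,
# divided by the total number of nodes.
n_nodes = len(degree)
nodes_per_k = Counter(degree.values())
p_k = {k: count / n_nodes for k, count in sorted(nodes_per_k.items())}

print(dict(degree))
print(p_k)
```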

21 From this, you can calculate whether the network is “scale-free”
This means that P(k) is proportional to k^(-γ).
Usually γ is around 2.
This means that the network is organized into a “hub and spoke” system, with a small number of nodes having a large number of links.
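One rough way to check for this behaviour is to note that a power law is a straight line on a log-log plot: log P(k) = -γ log k + constant. The sketch below fits that slope by ordinary least squares on a synthetic distribution built with an exponent of exactly 2. Both the synthetic data and the choice of a simple least-squares fit are assumptions made for illustration; real degree distributions are noisy and a straight-line fit is only a first look.

```python
import math

def fit_gamma(p_k):
    """Least-squares slope of log P(k) against log k; returns gamma = -slope."""
    pts = [(math.log(k), math.log(p)) for k, p in p_k.items() if k > 0 and p > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    return -slope

# Synthetic degree distribution with P(k) proportional to k^-2, for k = 1..50.
raw = {k: k ** -2.0 for k in range(1, 51)}
total = sum(raw.values())
p_k = {k: v / total for k, v in raw.items()}

gamma = fit_gamma(p_k)
print(f"estimated gamma = {gamma:.3f}")
```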

22 If there are N nodes in a network, the number of possible connections is on the order of N^2
Assuming a modest 6,000 genes, that is roughly 18 million gene pairs (N(N-1)/2), still more than we can hope to determine experimentally.
However, using correlation between two genes (or orthologues) across several different experiments as a stand-in for links, we can use microarrays to develop a basic network that can then be refined.
The more experiments, and the more types of experiments, in which we find the genes co-expressed, the more probable it is that they are truly linked.
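The idea of using correlation as a stand-in for links can be sketched as follows. The four expression profiles and the cutoff of 0.9 are hypothetical choices for illustration, and only positive correlation is treated as a link here.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression profiles across five experiments.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],
    "geneC": [5.0, 4.1, 2.9, 2.2, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}
threshold = 0.9  # assumed cutoff for calling two genes "linked"

# Draw a link between every pair whose correlation reaches the cutoff.
edges = []
for g1, g2 in combinations(sorted(expression), 2):
    r = pearson(expression[g1], expression[g2])
    if r >= threshold:
        edges.append((g1, g2, round(r, 3)))

# Number of distinct gene pairs for 6,000 genes: N(N-1)/2.
n_genes = 6000
n_pairs = n_genes * (n_genes - 1) // 2

print(edges)
print(n_pairs)
```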

23 Hierarchical clustering vs. non-hierarchical clustering
Hierarchical clustering (various agglomerative and divisive techniques) treats the data as if we have no idea how many clusters there should be.

24 K-means clustering
The user decides in advance how many clusters there will be.
This is a good method if you know a priori how many clusters you want (for example, with three time points, you would use three clusters).
Alternatively, if you visually inspect the data, you can see whether there appears to be a certain number of clusters.
If all else fails, there are computer programs that calculate an ideal k out of many possible values for k.
The goal is to divide the objects into K clusters such that some distance metric relative to the centroids of the clusters is minimized.

25
1) Initial reference vectors are assigned randomly or according to previous knowledge
2) Assign each object to one of k clusters randomly
3) Calculate average expression vectors for each cluster (as reference vectors) and the distance between clusters
4) Iteratively move objects between clusters; an object stays in the new cluster when it is closer to the new cluster than to the old cluster
5) Repeat steps 3-4 until convergence, i.e. until moving any more objects would increase intra-cluster distances

26 For every cluster k, sum the distances between every data point Xn in the cluster and the centroid of the cluster (the mean of its points).

27
               Array 1   Array 2
METAGENE 1     22        21
METAGENE 2     19        20
METAGENE 3     18        22
METAGENE 4     1         3
METAGENE 5     4         2

28 Finding the Center
So the new cluster centre for A is [19.666, 21].
This process is repeated until successive iterations cause no overall change in the sum of distances of each point in each cluster from the center. (adapted from
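The worked example can be reproduced with a short sketch. Several things here are assumptions rather than the presenter's method: the loop is the common assign-to-nearest-centre and recompute-centre variant of k-means (not the random initial assignment described earlier), the starting centres are arbitrarily taken to be METAGENE 1 and METAGENE 4, and the Array 2 value for METAGENE 3 is taken as 22, the value consistent with the stated centre [19.666, 21].

```python
import math

# Values from the metagene table; METAGENE 3's Array 2 value (22) is an assumption.
points = {
    "METAGENE 1": (22.0, 21.0),
    "METAGENE 2": (19.0, 20.0),
    "METAGENE 3": (18.0, 22.0),
    "METAGENE 4": (1.0, 3.0),
    "METAGENE 5": (4.0, 2.0),
}

def centroid(members):
    """Coordinate-wise mean of a list of points."""
    n = len(members)
    return tuple(sum(p[i] for p in members) / n for i in range(len(members[0])))

def kmeans(points, centres, max_iter=100):
    """Assign each point to its nearest centre, recompute centres, repeat until stable."""
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            name: min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break  # no object changed cluster: converged
        assignment = new_assignment
        for c in range(len(centres)):
            members = [points[name] for name, a in assignment.items() if a == c]
            if members:
                centres[c] = centroid(members)
    return assignment, centres

# Arbitrary starting centres (an assumption): one point from each apparent group.
initial = [points["METAGENE 1"], points["METAGENE 4"]]
assignment, centres = kmeans(points, list(initial))

print(assignment)
print(centres)  # centres[0] is cluster A
```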

29

30

31

32 Remember: Clustering always works. There is no guarantee it is the optimal partition.

33 Analyze coexpression relationships of homologous sets of genes in human, fly, worm, and yeast to identify conserved genetic modules
3182 microarrays over multiple groups of homologous genes (metagenes)

34 Metagenes
Orthologous counterparts in different organisms
Best reciprocal BLAST hit
Each gene assigned to at least one metagene

35 BLAST stands for Basic Local Alignment Search Tool
Performs a pairwise alignment of two strings (either nucleotides or amino acids).
Amino acids are scored according to their similarity: chemical similarity and observed mutation frequencies.
The probability that this match could have occurred by chance is given as an E value.

36 A Metagene

37 Global View of the Data Set

38 They then identified pairs of genes by “relabeling” each gene in each species with its metagene, and then comparing expression levels across each different array. This provided a Pearson correlation.

39 Are the data sufficient?
Divided the data set randomly into two halves.
Used each half independently to generate a network and compared this to the network generated from the total data, to see how many interactions were maintained at the same level of statistical significance.
Only 40% of the interactions observed were statistically significant in both halves.

40 You can interpret this number two ways:
1) The approach is sensitive to the amount of data used
2) However, even with only half of the data, there was still some signal amidst the noise

41 How robust is the constructed network to noise?
They added increasing amounts of Gaussian noise and found that the network was robust to the realistic levels of noise seen in microarray experiments.

42 Permuting your data preserves the same distribution of your data, but without the labels
They claim that they “randomly permutated gene expression data.” This is a bit ambiguous.
It probably means that they shuffled the expression data within each species column, so the consistency of the metagenes would cease to hold throughout the row.
They repeated this shuffling 10 times.
The point of shuffling is to get a representative sample of the ways you could rearrange your objects; 10 is a conservative number for this instance.
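The interpretation suggested above (shuffling within each species column) can be sketched like this. The table values, the column layout, and the fixed random seed are all hypothetical; the authors' actual procedure is not specified beyond the quoted phrase.

```python
import random

# Hypothetical table: one list per species, where position i in every list
# belongs to the same metagene (i.e. the same row).
table = {
    "human": [0.9, 0.1, 0.5, 0.7],
    "fly":   [0.8, 0.2, 0.4, 0.6],
    "worm":  [0.7, 0.3, 0.6, 0.9],
    "yeast": [0.9, 0.0, 0.5, 0.8],
}

def permute_columns(table, rng):
    """Shuffle each species column independently.

    Every column keeps exactly the same values (same distribution),
    but a row no longer corresponds to one consistent metagene.
    """
    return {species: rng.sample(values, len(values))
            for species, values in table.items()}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
permuted_sets = [permute_columns(table, rng) for _ in range(10)]

for species, values in permuted_sets[0].items():
    print(species, values)
```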

43

44 Using Several Different Species
By using more species, are you just providing more data, or is there a benefit to looking at conservation of coexpression across several distantly related species?

45 Using Several Different Species

46 How to go from this data to a network?
For each metagene, m1, they ranked every other metagene in each species by how well its expression correlates with m1, and then compared these ranks across species, producing a rank ratio in each species for each directed pair.
They calculated a joint probability distribution based on order statistics to determine whether these rankings were statistically significant.

47 They then used this p-value to do two things:
If the p-value was below a cutoff (corrected for multiple tests), this defined a link.
The strength of the p-value then provided a distance metric that was used to visualize the grouping of the data.
From visual inspection of this data, they decided upon 12 clusters, and then performed k-means clustering.

48

49 Do these clusters effectively group functionally related genes?
The Gene Ontology (or GO) terminology provides a controlled vocabulary to describe protein and gene function, cellular location and biological process.
You can calculate whether you have an overrepresentation of certain GO terms in any given cluster to characterize the cluster.
The P-value is derived from a hypergeometric distribution (sampling without replacement), as the probability of x or more out of n genes having a given annotation, given that G of N have that annotation in the genome in general. (Probably done by GOFinder)
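The hypergeometric P-value described above can be sketched directly from its definition. The function name and the example numbers (a genome of 20 genes, 5 of them annotated, a cluster of 4 genes containing 2 annotated ones) are hypothetical and chosen small enough to check by hand.

```python
from math import comb

def enrichment_p(x, n, G, N):
    """P(X >= x): probability of drawing x or more annotated genes in a cluster
    of n genes, when G of the N genes in the genome carry the annotation
    (sampling without replacement)."""
    upper = min(n, G)
    favourable = sum(comb(G, i) * comb(N - G, n - i) for i in range(x, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: N = 20 genes, G = 5 annotated, cluster of n = 4, x = 2 annotated.
p = enrichment_p(x=2, n=4, G=5, N=20)
print(f"P(X >= 2) = {p:.4f}")
```

By hand, the three terms are C(5,2)C(15,2) + C(5,3)C(15,1) + C(5,4)C(15,0) = 1050 + 150 + 5 = 1205, out of C(20,4) = 4845 possible clusters.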

50 Do these clusters effectively group functionally related genes?

51 However, this is a bit suspicious.
If you look in their supplemental material, you find “metagenes that were further than d units away from their closest center were excluded from membership in any of the 12 components. We chose d to be 10% of the diameter of the entire landscape.”
How many were excluded?

52 But are these predictions biologically valid?
Finding these metagenes in cancer data does imply that they are connected to cell proliferation, but it is weak proof that they are actually cell proliferation genes

53 But are these predictions biologically valid?
RNAi involves feeding small double-stranded oligonucleotides into cells, which “tricks” the cells into degrading their own mRNA.
It is a new technique for quickly knocking out or knocking down gene expression.

54 Worm gonads stained for DNA; wild-type worms show fewer nuclei

55 Not a properly done RNAi experiment
How much does one experiment tell you? (Did they have to do 10 RNAi experiments based on their predictions before one worked?)

56 Does this network look like a biological network?
Count the number of links of each metagene (with links being defined by the number of other metagenes with which it is coexpressed at a given P-value).
See if the distribution of links is non-random and different from a network generated from random data.

57 Does this network look like a biological network?

58 Does this likely reflect the actual distribution of links, or does this data by definition select highly linked genes?

59 What are the limitations?
It has been estimated that in humans 95% of mRNA transcripts are expressed at <5 copies/cell. Velculescu et al. (1997), Cell 88:

