Analysis of Multiple Experiments TIGR Multiple Experiment Viewer (MeV)


1 Analysis of Multiple Experiments TIGR Multiple Experiment Viewer (MeV)

2 Advanced Course Coverage
Introduction
- fundamental concepts, expression vectors and distance metrics
- fundamental statistical concepts encountered in MeV analysis modules
Algorithm Coverage
- Lecture / Hands-on Exercises (refer to algorithm handout for order…)

3 TIGR Microarray Data Flow
[Diagram: TIGR microarray data flow. Labeled components: Microarray Printers (IAS-1, IAS-2, MD Lucidea, MD3, others); Microarray Scanners (Axon-1, Axon-2, ScanArray, others); Scheduler (Machine Scheduling); SliTrack (Machine Control); Exp Designer; MABCOS (Barcode System); PCR Score; Data Entry Pages; MADAM (Data Manager); Spotfinder (Image Analysis); Miner (.tav File Creator); MIDAS (Normalization); GenePix Converter; Query Window; Reports; MAGE-ML; MUSAGE; MeV (Data Analysis); MAD; Database. Files shown: .tiff image file, raw .tav file, normalized .tav file.]
THE INSTITUTE FOR GENOMIC RESEARCH (TIGR)

4 The Expression Matrix
The Expression Matrix is a representation of data from multiple microarray experiments. Each element is a log ratio (usually log2(Cy5 / Cy3)).
[Figure: expression matrix heat map; rows Gene 1 to Gene 6, columns Exp 1 to Exp 6.]
- Black indicates a log ratio of zero, i.e., Cy5 and Cy3 are very close in value
- Green indicates a negative log ratio, i.e., Cy5 < Cy3
- Red indicates a positive log ratio, i.e., Cy5 > Cy3
- Gray indicates missing data

5 Expression Vectors
- Gene Expression Vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
Example vector of log2(Cy5/Cy3) values: -0.8 0.8 1.5 1.8 0.5 -1.3 -0.4

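6 Expression Vectors As Points in ‘Expression Space’
Gene | Log ratios over three experiments
G1 | -0.8, -0.3, -0.7
G2 | -0.4, -0.8, -0.7
G3 | -0.6, -0.8, -0.4
G4 | 0.9, 1.2, 1.3
G5 | 1.3, 0.9, -0.6
G1, G2 and G3 show similar expression.
[Figure: the genes plotted as points on the axes Experiment 1, Experiment 2 and Experiment 3.]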

7 Distance and Similarity
- the ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms
- distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression
- selection of a distance metric defines the concept of distance

8 Distance: a measure of similarity between genes.
Gene A: x1A x2A x3A x4A x5A x6A (values in Exp 1 to Exp 6)
Gene B: x1B x2B x3B x4B x5B x6B (values in Exp 1 to Exp 6)
[Figure: two points, p0 and p1, illustrating the distance between them.]
Some distances (MeV provides 11 metrics):
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation
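The three metrics named on this slide can be sketched in a few lines. This is a minimal illustration, not MeV's code: the function names are made up here, and turning Pearson's r into a distance as 1 - r is an assumed convention (other conventions exist).

```python
import math

def euclidean(a, b):
    # Square root of the summed squared differences across experiments.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute differences across experiments.
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_r(a, b):
    # Pearson correlation; assumes neither vector is constant.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pearson_distance(a, b):
    # Assumed convention: 0 for perfectly correlated, 2 for anti-correlated.
    return 1.0 - pearson_r(a, b)

gene_a = [-0.8, -0.3, -0.7]
gene_b = [-0.4, -0.8, -0.7]
print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b), pearson_distance(gene_a, gene_b))
```

Two genes with the same shape but different magnitude are far apart by Euclidean distance yet close by Pearson distance, which is why the choice of metric matters.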

9 Distance is Defined by a Metric
Distance Metric | Euclidean | Pearson (r × -1)
D (first example) | 1.4 | -0.90
D (second example) | 4.2 | -1.00

10 Statistical Concepts

11 Probability distributions
The probability of an event is the likelihood of its occurring. It is sometimes computed as a relative frequency (rf), where
$rf = \frac{\text{the number of “favorable” outcomes for an event}}{\text{the total number of possible outcomes for that event}}$
The probability of an event can sometimes be inferred from a theoretical probability distribution, such as a normal distribution.

12 Normal distribution
[Figure: normal distribution curve centered at X = μ (mean of the distribution); σ = std. deviation of the distribution.]

13 [Figure: two distributions, Population 1 (Mean 1) and Population 2 (Mean 2), with the sample mean “s” marked.]
Less than a 5% chance that the sample with mean s came from population 1, i.e., s is significantly different from “mean 1” at the p < 0.05 significance level. But we cannot reject the hypothesis that the sample came from population 2.

14 Many biological variables, such as height and weight, can
reasonably be assumed to approximate the normal distribution. But expression measurements? Probably not. Fortunately, many statistical tests are considered to be fairly robust to violations of the normality assumption, and other assumptions used in these tests. Randomization / resampling based tests can be used to get around the violation of the normality assumption. Even when parametric statistical tests (the ones that make use of normal and other distributions) are valid, randomization tests are still useful.

15 Outline of a randomization test - 1
1. Compute the value of interest (i.e., the test-statistic s) from your data set.
2. Make “fake” data sets from your original data, by taking a random sub-sample of the data, or by re-arranging the data in a random fashion.
3. Re-compute s from the “fake” data set.
[Figure: the original data set yields s; each randomized data set yields a “fake” s.]

16 Outline of a randomization test - 2
4. Repeat steps 2 and 3 many times (often several hundred to several thousand times). Keep a record of the “fake” s values from step 3.
5. Draw inferences about the significance of your original s value by comparing it with the distribution of the randomized (“fake”) s values.
[Figure: range of randomized s values, with the original s value marked. The original s value could be significant as it exceeds most of the randomized s values.]
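Steps 1 to 5 of the randomization test can be sketched as follows. This is only an illustration under stated assumptions: the test statistic s is taken to be the absolute difference between two group means, the “fake” data sets are made by reshuffling group membership, and the function names and default iteration count are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def randomization_test(group_a, group_b, n_iter=2000, seed=0):
    """Return (s, p): the original statistic and the share of fake s values >= s."""
    rng = random.Random(seed)
    # Step 1: the statistic from the original data.
    s = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    as_extreme = 0
    for _ in range(n_iter):           # Step 4: repeat many times.
        rng.shuffle(pooled)           # Step 2: a "fake" data set.
        fake_s = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))  # Step 3.
        if fake_s >= s:
            as_extreme += 1
    # Step 5: compare s with the distribution of fake s values.
    return s, as_extreme / n_iter

s, p = randomization_test([5.1, 4.9, 5.3, 5.0, 5.2], [1.0, 1.2, 0.9, 1.1, 0.8])
print(s, p)
```

If the original s exceeds nearly all of the “fake” values, the returned proportion is small, which matches the picture of s lying beyond the range of randomized values.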

17 Outline of a randomization test - 3
Rationale
Ideally, we want to know the “behavior” of the larger population from which the sample is drawn, in order to make statistical inferences. Here, we don’t know that the larger population “behaves” like a normal distribution, or some other idealized distribution. All we have to work with are the data in hand. Our “fake” data sets are our best guess about this behavior (i.e., if we had been pulling data at random from an infinitely large population, we might expect to get a distribution similar to what we get by pulling random sub-samples, or by reshuffling the order of the data in our sample).

18 The problem of multiple testing
(adapted from presentation by Anja von Heydebreck, Max–Planck–Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany)
Let’s imagine there are 10,000 genes on a chip, AND none of them is differentially expressed.
Suppose we use a statistical test for differential expression, where we consider a gene to be differentially expressed if it meets the criterion at a p-value of p < 0.05.

19 The problem of multiple testing – 2
Let’s say that applying this test to gene “G1” yields a p-value of p = 0.01. Remember that a p-value of 0.01 means that, if the gene were not differentially expressed, there would be only a 1% chance of obtaining a result at least this extreme, i.e., even though we conclude that the gene is differentially expressed (because p < 0.05), we accept a small risk that our conclusion is wrong. We might be willing to live with such a low probability of being wrong BUT .....

20 The problem of multiple testing – 3
We are testing 10,000 genes, not just one!!! Even though none of the genes is differentially expressed, about 5% of the genes (i.e., 500 genes) will be erroneously concluded to be differentially expressed, because we have decided to “live with” a p-value of 0.05 If only one gene were being studied, a 5% margin of error might not be a big deal, but 500 false conclusions in one study? That doesn’t sound too good.
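The 500-gene figure can be checked with a quick simulation. A minimal sketch, assuming that the p-values of genes with no differential expression are uniformly distributed between 0 and 1 (which holds for a well-behaved test); the function name and the seed are illustrative only.

```python
import random

def count_false_positives(n_genes=10_000, alpha=0.05, seed=1):
    """Count null genes that pass the cutoff purely by chance."""
    rng = random.Random(seed)
    # Under the null hypothesis, each gene's p-value is modeled as uniform on [0, 1).
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)

# Roughly 5% of 10,000 genes are called "significant" although none is.
print(count_false_positives())
# Dividing the cutoff by the number of genes leaves very few false calls, usually none.
print(count_false_positives(alpha=0.05 / 10_000))
```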

21 The problem of multiple testing - 4
There are “tricks” we can use to reduce the severity of this problem. They all involve “slashing” the p-value for each test (i.e., gene), so that while the critical p-value for the entire data set might still equal 0.05, each gene will be evaluated at a lower p-value. We’ll go into some of these techniques later.

22 Don’t get too hung up on p-values.
Ultimately, what matters is biological relevance. P-values should help you evaluate the strength of the evidence, rather than being used as an absolute yardstick of significance. Statistical significance is not necessarily the same as biological significance.

23 i.e., you don’t want to belong to “that group of people whose aim in life is to be wrong 5% of the time”!!! *
*Kempthorne, O., and T.E. Doerfler 1969 The behaviour of some significance tests under experimental randomization. Biometrika 56: , as cited in Manly, B.F.J Randomization, bootstrap and Monte Carlo methods in biology: pg. 1. Chapman and Hall / CRC

24 Pearson correlation coefficient – r
Indicates the degree to which a linear relationship can be approximated between two variables. Can range from (–1.0) to (+1.0).
Positive r between two variables X and Y: as X increases, so does Y on the whole.
Negative r: as X increases, Y generally decreases.
[Figures: scatter plots of Y against X for a positive and a negative r.]
The higher the magnitude of r (in the positive or negative direction), the more linear the relationship.

25 Pearson correlation - 2
Sometimes, a p-value is associated with the correlation coefficient r. This p-value is computed from a theoretical distribution of the correlation coefficient, similar to the normal distribution.
[Figure: distribution of the correlation coefficient with mean = 0 (population correlation coefficient = 0). The sample correlation coefficient r falls in the p < 0.05 rejection range, i.e., we reject the null hypothesis that the variables are not correlated.]
• This is the p-value for the null hypothesis that the X and Y data for our sample come from a population in which their correlation is zero, i.e., the null hypothesis is that there is no linear relationship between X and Y.
• If p is sufficiently small (often p < 0.05), we can reject the null hypothesis, i.e., we conclude that there is indeed a linear relationship between X and Y.

26 Pearson correlation - 3
The square of the Pearson correlation, r², also known as the coefficient of determination, is a measure of the “strength” of the linear relationship between X and Y. It is the proportion of the total variation in one variable that is explained by its linear relationship with the other.
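The “proportion of variation explained” reading of r² can be verified numerically. The sketch below fits a least-squares line of Y on X and compares 1 - (residual variation / total variation) with the squared correlation; the data values and function names are invented for illustration.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def explained_variation(x, y):
    """Share of the variation in y accounted for by the least-squares line on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_total = sum((b - my) ** 2 for b in y)
    ss_residual = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_residual / ss_total

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
# The two numbers agree up to floating-point rounding.
print(pearson_r(x, y) ** 2, explained_variation(x, y))
```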

27 Algorithms…

28 Hierarchical Clustering (HCL)
HCL is an agglomerative clustering method which joins similar genes into groups. The iterative process continues with the joining of resulting groups based on their similarity until all groups are connected in a hierarchical tree. (HCL-1)

29 Hierarchical Clustering
- g1 is most like g8. Leaf order: g7, g1, g8, g2, g3, g4, g5, g6
- g4 is most like {g1, g8}. Leaf order: g7, g1, g8, g4, g2, g3, g5, g6
(HCL-2)

30 Hierarchical Clustering
- g5 is most like g7. Leaf order: g6, g1, g8, g4, g2, g3, g5, g7
- {g5, g7} is most like {g1, g4, g8}. Leaf order: g6, g1, g8, g4, g5, g7, g2, g3
(HCL-3)

31 Hierarchical Tree
[Figure: the completed hierarchical tree; leaf order g6, g1, g8, g4, g5, g7, g2, g3.] (HCL-4)

32 Hierarchical Clustering
During construction of the hierarchy, decisions must be made to determine which clusters should be joined. The distance or similarity between clusters must be calculated. The rules that govern this calculation are linkage methods. (HCL-5)

33 Agglomerative Linkage Methods
Linkage methods are rules or metrics that return a value that can be used to determine which elements (clusters) should be linked. Three linkage methods that are commonly used are:
- Single Linkage
- Average Linkage
- Complete Linkage
(HCL-6)

34 Single Linkage
Cluster-to-cluster distance is defined as the minimum distance between members of one cluster and members of another cluster. Single linkage tends to create ‘elongated’ clusters with individual genes chained onto clusters.
$D_{AB} = \min\, d(u_i, v_j)$, where $u \in A$ and $v \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-7)

35 Average Linkage
Cluster-to-cluster distance is defined as the average distance between all members of one cluster and all members of another cluster. Average linkage has a slight tendency to produce clusters of similar variance.
$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$, where $u \in A$ and $v \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-8)

36 Complete Linkage
Cluster-to-cluster distance is defined as the maximum distance between members of one cluster and members of another cluster. Complete linkage tends to create clusters of similar size and variability.
$D_{AB} = \max\, d(u_i, v_j)$, where $u \in A$ and $v \in B$, for all $i = 1$ to $N_A$ and $j = 1$ to $N_B$
(HCL-9)
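The three linkage rules differ only in how they summarize the member-to-member distances between two clusters. A minimal sketch, assuming Euclidean distance for d(u, v); the function names are illustrative and not taken from MeV.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise(cluster_a, cluster_b, dist=euclidean):
    # All N_A * N_B distances between members of A and members of B.
    return [dist(u, v) for u in cluster_a for v in cluster_b]

def single_linkage(cluster_a, cluster_b, dist=euclidean):
    return min(pairwise(cluster_a, cluster_b, dist))

def average_linkage(cluster_a, cluster_b, dist=euclidean):
    d = pairwise(cluster_a, cluster_b, dist)
    return sum(d) / len(d)

def complete_linkage(cluster_a, cluster_b, dist=euclidean):
    return max(pairwise(cluster_a, cluster_b, dist))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(A, B), average_linkage(A, B), complete_linkage(A, B))
```

At each joining step of HCL, the pair of clusters with the smallest value under the chosen rule is the pair that gets merged.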

37 Comparison of Linkage Methods
[Figure: example trees for Single, Ave. and Complete linkage.] (HCL-10)

38 Bootstrapping (ST)
Bootstrapping – resampling with replacement
Original expression matrix: rows Gene 1 to Gene 6, columns Exp 1, Exp 2, Exp 3, Exp 4, Exp 5, Exp 6.
Various bootstrapped matrices (by experiments): same rows, with experiments drawn with replacement, e.g., columns Exp 1, Exp 1, Exp 3, Exp 5, Exp 5, Exp 6.

39 Jackknifing (ST)
Jackknifing – resampling without replacement
Original expression matrix: rows Gene 1 to Gene 6, columns Exp 1, Exp 2, Exp 3, Exp 4, Exp 5, Exp 6.
Various jackknifed matrices (by experiments): same rows, with one experiment left out, e.g., columns Exp 1, Exp 3, Exp 4, Exp 5, Exp 6, or columns Exp 1, Exp 2, Exp 3, Exp 4, Exp 6.

40 Analysis of Bootstrapped and Jackknifed Support Trees
Bootstrapped or jackknifed expression matrices are created many times by randomly resampling the original expression matrix, using either the bootstrap or jackknife procedure. Each time, hierarchical trees are created from the resampled matrices. The trees are compared to the tree obtained from the original data set. The more frequently a given cluster from the original tree is found in the resampled trees, the stronger the support for the cluster. As each resampled matrix lacks some of the original data, high support for a cluster means that the clustering is not biased by a small subset of the data.
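The two resampling schemes, applied to experiments (columns), can be sketched as below. This is an illustration only: the matrix is a list of gene rows, the function names are hypothetical, and the jackknife is assumed to leave out a single randomly chosen experiment per resample.

```python
import random

def bootstrap_experiments(matrix, rng):
    """Draw as many experiment columns as the original, with replacement."""
    n_exp = len(matrix[0])
    cols = [rng.randrange(n_exp) for _ in range(n_exp)]
    return cols, [[row[c] for c in cols] for row in matrix]

def jackknife_experiments(matrix, rng):
    """Leave one randomly chosen experiment column out (no column is repeated)."""
    n_exp = len(matrix[0])
    left_out = rng.randrange(n_exp)
    cols = [c for c in range(n_exp) if c != left_out]
    return cols, [[row[c] for c in cols] for row in matrix]

# Two genes measured in four experiments (made-up log ratios).
matrix = [[0.5, -1.3, 2.4, 1.2],
          [-0.2, 0.8, 0.1, -0.9]]
rng = random.Random(42)
print(bootstrap_experiments(matrix, rng))
print(jackknife_experiments(matrix, rng))
```

A support tree would repeat either call many times, build a tree from each resampled matrix, and count how often each cluster of the original tree reappears.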

41 K-Means / K-Medians Clustering (KMC)– 1
1. Specify number of clusters, e.g., 5. 2. Randomly assign genes to clusters. G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13

42 K-Means Clustering – 2
3. Calculate mean / median expression profile of each cluster.
4. Shuffle genes among clusters such that each gene is now in the cluster whose mean / median expression profile (calculated in step 3) is the closest to that gene’s expression profile.
5. Repeat steps 3 and 4 until genes cannot be shuffled around any more, OR a user-specified number of iterations has been reached.
K-Means / K-Medians is most useful when the user has an a-priori hypothesis about the number of clusters the genes should group into.
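Steps 1 to 5 of K-Means can be sketched as follows. This is a minimal version under stated assumptions: squared Euclidean distance is used for “closest”, an empty cluster is re-seeded with a randomly chosen gene, and the optional `initial_labels` argument is a hypothetical addition that makes a run reproducible.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, initial_labels=None, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: random assignment of genes to clusters (unless labels are supplied).
    if initial_labels is not None:
        labels = list(initial_labels)
    else:
        labels = [rng.randrange(k) for _ in vectors]
    for _ in range(max_iter):  # Step 5: stop at the iteration limit ...
        # Step 3: mean expression profile of each cluster.
        centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if not members:
                members = [rng.choice(vectors)]  # assumption: re-seed an empty cluster
            centroids.append([sum(col) / len(members) for col in zip(*members)])
        # Step 4: move each gene to the cluster with the closest mean profile.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # ... or when no gene changes cluster.
            break
        labels = new_labels
    return labels

genes = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
         [10.0, 10.0], [10.2, 10.0], [10.0, 10.2]]
print(kmeans(genes, 2, initial_labels=[0, 1, 0, 1, 0, 1]))
```

K-Medians would replace the column mean in step 3 with the column median.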

43 Principal Components (PCAG and PCAE) – 1
1. PCA simplifies the “views” of the data.
2. Suppose we have measurements for each gene on multiple experiments.
3. Suppose some of the experiments are correlated.
4. PCA will ignore the redundant experiments, and will take a weighted average of some of the experiments, thus possibly making the trends in the data more interpretable.
5. The components can be thought of as axes in n-dimensional space, where n is the number of components. Each axis represents a different trend in the data.

44 PCAG and PCAE - 2
[Figure: “cloud” of data points (e.g., genes) in 3-dimensional space, with axes x, y and z. Data points resolved along 3 principal component axes.]
In this example:
- x-axis could mean a continuum from over- to under-expression (“blue” and “green” genes over-expressed, “yellow” genes under-expressed)
- y-axis could mean that “gray” genes are over-expressed in the first five expts and under-expressed in the remaining expts, while “brown” genes are under-expressed in the first five expts, and over-expressed in the remaining expts
- z-axis might represent different cyclic patterns, e.g., “red” genes might be over-expressed in odd-numbered expts and under-expressed in even-numbered ones, whereas the opposite is true for “purple” genes
Interpretation of components is somewhat subjective.

45 Cluster Affinity Search Technique (CAST)
- uses an iterative approach to segregate elements with ‘high affinity’ into a cluster
- the process iterates through two phases:
  - addition of high affinity elements to the cluster being created
  - removal or clean-up of low affinity elements from the cluster being created

46 Cluster Affinity Search Technique (CAST) - 1
Affinity = a measure of similarity between a gene, and all the genes in a cluster.
Threshold affinity = user-specified criterion for retaining a gene in a cluster, defined as a percentage of the maximum affinity at that point.
1. Create a new empty cluster C1.
2. Set initial affinity of all genes to zero.
3. Move the two most similar genes into the new cluster.
4. Update the affinities of all the genes (new affinity of a gene = its previous affinity + its similarity to the gene(s) newly added to the cluster C1).
ADD GENES:
5. While there exists an unassigned gene whose affinity to the cluster C1 exceeds the user-specified threshold affinity, pick the unassigned gene whose affinity is the highest, and add it to cluster C1. Update the affinities of all the genes accordingly.
[Figure: empty cluster C1 next to the pool of unassigned genes G1 to G15.]

47 CAST – 2
REMOVE GENES:
6. When there are no more unassigned high-affinity genes, check to see if cluster C1 contains any elements whose affinity is lower than the current threshold. If so, remove the lowest-affinity gene from C1. Update the affinities of all genes by subtracting from each gene’s affinity, its similarity to the removed gene.
7. Repeat step 6 while C1 contains a low-affinity gene.
8. Repeat steps 5-7 as long as changes occur to the cluster C1.
9. Form a new cluster with the genes that were not assigned to cluster C1, repeating steps 1-8.
10. Keep forming new clusters following steps 1-9, until all genes have been assigned to a cluster.
[Figure: current cluster C1 (G3, G13, G8) next to the remaining unassigned genes.]

48 QT-Clust (from Heyer et al. 1999) (HJC) - 1
1. Compute a jackknifed distance between all pairs of genes. (Jackknifed distance: The data from one experiment are excluded from both genes, and the distance is calculated. Each experiment is thus excluded in turn, and the maximum distance between the two genes (over all exclusions) is the jackknifed distance. This is a conservative estimate of distance that accounts for bias that might be introduced by single outlier experiments.)
2. Choose a gene as the seed for a new cluster. Add the gene which increases cluster diameter the least. Continue adding genes until additional genes will exceed the specified cluster diameter limit.
3. Repeat step 2 for every gene, so that each gene has the chance to be the seed of a new cluster. All clusters are provisional at this point.
[Figure: a “seed” gene, the current cluster growing around it, and the currently unassigned genes.]
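The jackknifed distance described in step 1 can be sketched as follows. The slide does not say which base distance is used, so this example assumes 1 - Pearson r; the function names and data are illustrative, and the vectors are assumed not to become constant when one experiment is dropped.

```python
import math

def pearson_distance(a, b):
    # Assumed base distance: 1 - Pearson correlation.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(va * vb)

def jackknife_distance(a, b, dist=pearson_distance):
    """Largest distance obtained when each experiment is excluded in turn."""
    return max(
        dist(a[:i] + a[i + 1:], b[:i] + b[i + 1:])
        for i in range(len(a))
    )

# The two genes look alike only because of one outlier experiment (the last one).
gene_a = [0.1, 0.2, 0.1, 5.0]
gene_b = [0.2, 0.1, 0.2, 5.0]
print(pearson_distance(gene_a, gene_b))    # small: dominated by the outlier
print(jackknife_distance(gene_a, gene_b))  # large: the outlier's influence is exposed
```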

49 QT-Clust – 2
4. Choose the largest cluster obtained from steps 2 and 3. In case of a tie, pick one of the largest clusters at random.
[Figure: provisional clusters grown from different “seed” genes; the largest one is picked.]
5. All genes that are not in the cluster selected above are treated as currently unassigned. Repeat steps 2-4 on these unassigned genes.
6. Stop when the last cluster thus formed has fewer genes than a user-specified number. All genes that are not in a cluster at this point are treated as unassigned.

50 SOTA - 1
Self Organizing Tree Algorithm
Dopazo, J., J.M. Carazo, Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J. Mol. Evol. 44: , 1997.
Herrero, J., A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17(2): , 2001.

51 SOTA - 2
SOTA Characteristics
- Divisive clustering, allowing high level hierarchical structure to be revealed without having to completely partition the data set down to single gene vectors
- Data set is reduced to clusters arranged in a binary tree topology
- The number of resulting clusters is not fixed before clustering
- Neural network approach which has advantages similar to SOMs such as handling large data sets that have large amounts of ‘noise’
Divisive: don’t have to completely partition the data set. Can stop at a higher level before full partitioning.

52 SOTA Topology
[Figure: SOTA topology. A parent node (a_p) with two offspring cells, the winning cell (a_w) and the sister cell (a_s); nodes carry a centroid vector and members.]
a* = migration factor (a_s < a_p < a_w)

53 SOTA - 4
Adaptation Overview
- each gene vector associated with the parent is compared to the centroid vector of its offspring cells.
- the most similar cell’s centroid and its neighboring cells are adapted using the appropriate migration weights.

54 SOTA - 5
- following the presentation of all genes to the system, a measure of system diversity is used to determine if training has found an optimal position for the offspring.
- if the system diversity improves (decreases), then another training epoch is started; otherwise training ends and a new cycle starts with a cell division.

55 SOTA - 6 The most ‘diverse’ cell is selected for division at the start of the next training cycle.

56 SOTA - 7 Growth Termination Expansion stops when the most diverse cell’s diversity falls below a threshold.

57 SOTA - 8
Each training cycle ends when the overall tree diversity ‘stabilizes’. This triggers a cell division and possibly a new training cycle.

58 Self-organizing maps (SOMs) – 1
1. Specify the number of nodes (clusters) desired, and also specify a 2-D geometry for the nodes, e.g., rectangular or hexagonal
[Figure: genes G1 to G29 (G = Genes) scattered in expression space, with nodes N1 to N6 (N = Nodes) laid out among them.]

59 SOMs – 2
2. Choose a random gene, e.g., G9
3. Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The further away the node is from N2, the less it is moved.
[Figure: genes G1 to G29 and nodes N1 to N6, with the nodes shifted toward G9.]

60 SOM Neighborhood Options
- Bubble Neighborhood: some nodes move (those within the radius); alpha is constant.
- Gaussian Neighborhood: all nodes move; alpha is scaled.
[Figure: genes G7 to G11 and nodes N1 to N6 under each neighborhood option, with the bubble radius marked.]

61 SOMs – 3
4. Steps 2 and 3 (i.e., choosing a random gene and moving the nodes towards it) are repeated many (usually several thousand) times. However, with each iteration, the amount that the nodes are allowed to move is decreased.
5. Finally, each node will “nestle” among a cluster of genes, and a gene will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.
[Figure: nodes N1 to N6, each settled among its own cluster of genes.]
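A single pass of steps 2 and 3 (pick a gene, move the nodes toward it) can be sketched as below. This is an assumption-laden illustration rather than MeV's implementation: nodes are lists of weights, `grid` holds each node's position in the 2-D geometry, and the two neighborhood options are modeled as a constant alpha inside a radius (bubble) or an alpha scaled by a Gaussian of grid distance.

```python
import math

def som_step(nodes, grid, gene, alpha, radius, neighborhood="bubble"):
    """Move nodes toward one gene vector in place; return the index of the closest node."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # The node closest to the gene in expression space is moved the most.
    winner = min(range(len(nodes)), key=lambda k: sq_dist(nodes[k], gene))
    for k, node in enumerate(nodes):
        grid_d = math.sqrt(sq_dist(grid[k], grid[winner]))
        if neighborhood == "bubble":
            # Only nodes within the radius move, all by the same alpha.
            rate = alpha if grid_d <= radius else 0.0
        else:
            # Gaussian: every node moves, by an alpha scaled down with grid distance.
            rate = alpha * math.exp(-grid_d ** 2 / (2 * radius ** 2))
        nodes[k] = [w + rate * (g - w) for w, g in zip(node, gene)]
    return winner

nodes = [[0.0, 0.0], [10.0, 10.0]]
grid = [(0, 0), (0, 1)]
som_step(nodes, grid, [1.0, 1.0], alpha=0.5, radius=0.5, neighborhood="bubble")
print(nodes)
```

A full run would call this several thousand times with randomly chosen genes while gradually lowering `alpha`, then assign each gene to its nearest node.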

62 Template Matching
- template matching allows one to find expression vectors which match a provided template
- a template can be derived from:
  - a gene known to be central to the area of study
  - a sample or set of samples of a particular type
  - a cluster with a mean pattern of interest
  - a pattern constructed to reveal trends based on knowledge of the experimental design

63 PTM-2
- Sometimes it is useful to identify elements that have complementary patterns, by choosing to match on the absolute value of r.

64 K-Means / K-Medians Support (KMS)
Because of the random initialization of K-Means / K-Medians, clustering results may vary somewhat between successive runs on the same dataset. KMS helps us validate the clustering results obtained from K-Means / K-Medians.
1. Run K-Means / K-Medians multiple times.
2. The KMS module generates clusters in which the member genes frequently group together in the same clusters (“consensus clusters”) across multiple runs of K-Means / K-Medians.
3. The consensus clusters consist of genes that clustered together in at least x% of the K-Means / K-Medians runs, where x is the threshold percentage input by the user.

65 Gene Shaving
1. Compute first principal component of expression matrix
2. Shave off a% (default 10%) of genes with lowest values of dot product with 1st principal component
3. Repeat until only one gene remains
4. Results in a series of nested clusters
5. Choose cluster of appropriate size as determined by gap statistic calculation
6. Orthogonalize expression matrix with respect to the average gene in the cluster and repeat shaving procedure

66 Gene Shaving
Gap statistic calculation (choosing cluster size)
Quality measure for clusters: $R^2 = \frac{\text{between variance}}{\text{within variance}}$
- between variance: variance of the mean gene across experiments
- within variance: variance of each gene about the cluster average
Large R2 implies a tight cluster of coherent genes.
Create random permutations of the expression matrix and calculate R2 for each.
Compare R2 of each cluster to that of the entire expression matrix.
Choose the cluster whose R2 is furthest from the average R2 of the permuted expression matrices.
The final cluster contains a set of genes that are greatly affected by the experimental conditions in a similar way.

67 Relevance Networks
Set of genes whose expression profiles are predictive of one another.
Can be used to identify negative correlations between genes.
Genes with low entropy (least variable across experiments) are excluded from analysis.
$H = -\sum_{x=1}^{10} p(x) \log_2 p(x)$
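The entropy filter can be sketched as follows. The slide gives only the formula with x running from 1 to 10, so the binning used here is an assumption: each gene's values are sorted into 10 equal-width bins spanning that gene's own range, and p(x) is the fraction of experiments falling in bin x. The function name is illustrative.

```python
import math

def expression_entropy(values, n_bins=10):
    """Entropy (in bits) of one gene's values, histogrammed into n_bins bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant gene carries no information
    counts = [0] * n_bins
    for v in values:
        # Equal-width bins over the gene's own range; the maximum goes in the last bin.
        idx = min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

print(expression_entropy([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))  # spread evenly over all bins
print(expression_entropy([0.0, 0.0, 0.0, 0.0, 1.0]))       # concentrated in two bins
```

Genes whose entropy falls below a chosen cutoff would be dropped before the pairwise comparisons are made.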

68 Relevance Networks
The expression pattern of each gene is compared to that of every other gene.
The ability of each gene to predict the expression of each other gene is assigned a correlation coefficient.
Correlation coefficients outside the boundaries defined by the minimum and maximum thresholds are eliminated.
The remaining relationships between genes define the subnets.
[Figure: genes A to E linked by correlation coefficients (.28, .75, .15, .37, .40, .02, .51, .11, .63, .92), shown before and after thresholding with Tmin = 0.50 and Tmax = 0.90.]

69 T-Tests (TTEST) – Between subjects (or unpaired) - 1
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
[Figure: expression matrix of Gene 1 to Gene 6 over Exp 1 to Exp 6, with the experiments labeled Group A or Group B.]
2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

70 TTEST – Between subjects - 2
3. Calculate t-statistic for each gene
4. Calculate probability value of the t-statistic for each gene either from:
A. Theoretical t-distribution
OR
B. Permutation tests.

71 TTEST - Between subjects - 3
Permutation tests
i) For each gene, compute t-statistic
ii) Randomly shuffle the values of the gene between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B.
[Figure: Gene 1 across Exp 1 to Exp 6, split into Group A and Group B, shown in its original grouping and in a randomized grouping with the experiments in the order Exp 1, Exp 4, Exp 5, Exp 2, Exp 3, Exp 6.]

72 TTEST - Between subjects - 4
Permutation tests - continued
iii) Compute t-statistic for the randomized gene
iv) Repeat steps i-iii n times (where n is specified by the user).
v) Let x = the number of times the absolute value of the original t-statistic exceeds the absolute values of the randomized t-statistic over n randomizations.
vi) Then, the p-value associated with the gene = 1 – (x/n)
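Steps i to vi for a single gene can be sketched as below. The slides do not give the t formula, so the unequal-variance (Welch) form is an assumption here, as are the function names; the data are assumed varied enough that no reshuffled group has zero variance.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def sample_var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def t_statistic(a, b):
    # Assumed form: Welch's t (group variances are not pooled).
    return (mean(a) - mean(b)) / math.sqrt(sample_var(a) / len(a) + sample_var(b) / len(b))

def permutation_p_value(group_a, group_b, n=1000, seed=0):
    rng = random.Random(seed)
    t_original = abs(t_statistic(group_a, group_b))        # step i
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    x = 0
    for _ in range(n):                                      # step iv
        rng.shuffle(pooled)                                 # step ii: group sizes preserved
        t_random = abs(t_statistic(pooled[:n_a], pooled[n_a:]))  # step iii
        if t_original > t_random:                           # step v
            x += 1
    return 1 - x / n                                        # step vi

print(permutation_p_value([2.1, 1.9, 2.4, 2.0, 2.2], [-0.1, 0.2, -0.3, 0.1, 0.0]))
```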

73 TTEST - Between subjects - 5
5. Determine whether a gene’s expression levels are significantly different between the two groups by one of three methods:
A) Just alpha: If the calculated p-value for a gene is less than or equal to the user-input alpha (critical p-value), the gene is considered significant.
OR use Bonferroni corrections to reduce the probability of erroneously classifying non-significant genes as significant.
B) Standard Bonferroni correction: The user-input alpha is divided by the total number of genes to give a critical p-value that is used as above.

74 TTEST - Between subjects – 6
5C) Adjusted Bonferroni:
i) The t-values for all the genes are ranked in descending order.
ii) For the gene with the highest t-value, the critical p-value becomes alpha / N, where N is the total number of genes; for the gene with the second-highest t-value, the critical p-value will be alpha / (N - 1), and so on.
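The standard and adjusted Bonferroni cutoffs can be compared side by side. In this sketch the genes are ranked by ascending p-value, which is assumed to give the same order as ranking by descending t-value; each gene is simply compared with its own cutoff, as the slide describes. Function names are illustrative.

```python
def standard_bonferroni(p_values, alpha=0.05):
    """Every gene is tested against alpha / N."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]

def adjusted_bonferroni(p_values, alpha=0.05):
    """Gene ranked r (0 = most extreme) is tested against alpha / (N - r)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, i in enumerate(order):
        significant[i] = p_values[i] <= alpha / (n - rank)
    return significant

p_values = [0.001, 0.012, 0.015, 0.03, 0.2]
print(standard_bonferroni(p_values))  # cutoff 0.01 for every gene
print(adjusted_bonferroni(p_values))  # cutoffs 0.01, 0.0125, 0.0167, 0.025, 0.05
```

The adjusted version is less strict for lower-ranked genes, so it calls at least as many genes significant as the standard correction.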

75 TTEST – 1-class (or One-sample t-test) - 1
1. Used to test if the mean expression of a gene over all experiments is different from a hypothesized mean.
[Figure: expression matrix over Exp 1 to Exp 6 in which each row is one gene vector (Vector 1 = Gene 1, Vector 2 = Gene 2, Vector 3 = Gene 3).]
2. Question: Is the mean of the values of a given gene vector significantly different from a hypothesized mean?

76 TTEST- 1 Class - 2
3. Often, the hypothesized mean in gene expression studies is zero, meaning that we are looking for genes whose mean log2 ratio across all experiments is significantly different from zero.
4. Using 1-sample t-tests, we can select genes which, on average, show differential expression across all experiments (since genes with no differential expression should have a mean log2 ratio of zero across all expts).
5. Calculate t-value, where
$t = \frac{\text{Observed mean of gene vector} - \text{Hypothesized mean of gene vector}}{\text{Standard error of the mean of the gene vector}}$

77 TTEST – 1 class - 3
6. Calculate p-value from a theoretical t-distribution, OR
7. By permutation:
7a. Randomly pick some elements of the gene vector, and change their values, such that the new value of the changed element is [original value – 2 x (original value - hypothesized mean)] (i.e., “flip” the element’s deviation around the hypothesized mean)
Thus, if the original gene values are:
0.5 -1.3 2.4 1.2 -0.2 0.8
and the hypothesized mean is zero, then the randomized gene values could be:
-0.5 -1.3 2.4 -1.2 0.2 -0.8
Here the first, fourth, fifth and sixth elements were randomly chosen and flipped around zero, the hypothesized mean.

78 TTEST – 1 class - 4
7b. Calculate t-value from the randomized gene
7c. Repeat 7a and 7b as many times as desired. If all permutations are chosen, then every possible combination of elements in the gene vector is chosen for flipping.
7d. The p-value = 1 – (the proportion of times that the original absolute t-value exceeds the randomized absolute t-value over all the permutations conducted).
8. If a gene’s p-value is less than or equal to the user-specified critical p-value, the gene’s mean expression over all experiments is significantly different from the hypothesized mean.
9. Bonferroni and adjusted Bonferroni corrections may be applied just as in the two-sample t-test.
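Steps 5 to 7d of the one-class test can be sketched as follows. The flipping rule is the one given in step 7a; choosing each element for flipping with probability 0.5, the function names and the default number of permutations are assumptions for illustration. The gene vector is assumed not to become constant after flipping.

```python
import math
import random

def one_sample_t(values, hypothesized_mean=0.0):
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))
    # (observed mean - hypothesized mean) / standard error of the mean
    return (m - hypothesized_mean) / (sd / math.sqrt(n))

def flip(value, hypothesized_mean=0.0):
    # Step 7a: original value - 2 x (original value - hypothesized mean).
    return value - 2 * (value - hypothesized_mean)

def one_class_permutation_p(values, hypothesized_mean=0.0, n_perm=1000, seed=0):
    rng = random.Random(seed)
    t_original = abs(one_sample_t(values, hypothesized_mean))
    exceeds = 0
    for _ in range(n_perm):                                   # step 7c
        randomized = [
            flip(v, hypothesized_mean) if rng.random() < 0.5 else v
            for v in values
        ]
        if t_original > abs(one_sample_t(randomized, hypothesized_mean)):  # step 7b
            exceeds += 1
    return 1 - exceeds / n_perm                               # step 7d

print(one_class_permutation_p([1.2, 0.9, 1.5, 1.1, 1.3, 0.8, 1.4, 1.0]))
```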

79 One Way Analysis of Variance (ANOVA)
1. Assign experiments to > 2 groups
[Figure: expression matrix of Gene 1 to Gene 7 over Ex 1 to Ex 9, with the experiments regrouped as Group 1 (Ex 2, Ex 1, Ex 7), Group 2 (Ex 4, Ex 5, Ex 9) and Group 3 (Ex 3, Ex 6, Ex 8).]
2. Question: Is mean expression level of a gene the same across all groups?

80 ANOVA - 2
3. Calculate an F-ratio for each gene, where
$F = \frac{\text{Mean square (groups)}}{\text{Mean square (error)}}$, which is a measure of $\frac{\text{Between groups variability}}{\text{Within groups variability}}$
The larger the value of F, the greater the difference among the group means relative to the sampling error variability (which is the within groups variability), i.e., the larger the value of F, the more likely it is that the differences among the group means reflect “real” differences among the means of the populations they are drawn from, rather than being due to random sampling error.
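The F-ratio for one gene can be computed directly from its values in each group. A minimal sketch of the standard one-way ANOVA calculation; the function name and the example values are illustrative.

```python
def f_ratio(groups):
    """One-way ANOVA F = mean square (groups) / mean square (error) for one gene."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups variability: group means around the grand mean.
    ss_groups = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-groups variability: values around their own group mean.
    ss_error = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    ms_groups = ss_groups / (k - 1)
    ms_error = ss_error / (n_total - k)
    return ms_groups / ms_error

# Three groups of three experiments each; the group means are 2, 3 and 4.
print(f_ratio([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]))
```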

81 ANOVA - 3 4. The p-value associated with an F-value is the probability that an F-value that large would be obtained if there were no differences among group means (i.e., given the null hypothesis). Therefore, the smaller the p-value, the less likely it is that the null hypothesis is valid, i.e., the differences among group means are more likely to reflect real population differences as p-values decrease in magnitude.

82 ANOVA - 4 5. P-values can be obtained for the F-values from a theoretical F-distribution, assuming that the populations from which the data are obtained are normally distributed, and have homogeneous variances. The test is considered robust to violations of these assumptions, provided sample sizes are relatively large and similar across groups.

83 ANOVA – 5 6. P-values can be obtained from permutation tests (just like in t-tests), if one does not want to rely on the assumptions needed for using the F-distribution. P-values can also be corrected for multiple comparisons (using Bonferroni or other procedures). These features will soon be implemented in MeV.

84 Two-factor ANOVA (TFA)
Can be used to find genes whose expression is significantly different over two factors (e.g., sex and strain), as well as to look for genes with a significant interaction for these two factors.
[Table layout: rows Male and Female; columns Strain A, Strain B and Strain C]

85 TFA - 2
[Figure: gene expression plotted against strain (1, 2, 3) for males and females, in two panels: "No interaction" and "Interaction"]

86 TFA - 3
Ideally, the design should be balanced, i.e., equal numbers of samples in each factor A - factor B combination.
If unbalanced, the analysis can still be conducted, but F-tests will be somewhat biased. It may be necessary to use smaller p-values.
It is possible to have balanced designs with no replication (see below). In this case, interaction cannot be tested.
[Table layout: rows Male and Female; columns Strain A, Strain B and Strain C]

87 Significance analysis of microarrays (SAM)
SAM can be used to pick out significant genes based on differential expression between sets of samples. Currently implemented for the following designs:
- two-class unpaired
- two-class paired
- multi-class
- censored survival
- one-class

88 SAM -2 SAM gives estimates of the False Discovery Rate (FDR), which is the proportion of genes likely to have been wrongly identified by chance as being significant. It is a very interactive algorithm – allows users to dynamically change thresholds for significance (through the tuning parameter delta) after looking at the distribution of the test statistic. The ability to dynamically alter the input parameters based on immediate visual feedback, even before completing the analysis, should make the data-mining process more sensitive.

89 SAM designs
Two-class unpaired: to pick out genes whose mean expression level is significantly different between two groups of samples (analogous to between subjects t-test).
Two-class paired: samples are split into two groups, and there is a 1-to-1 correspondence between a sample in group A and one in group B (analogous to paired t-test).

90 SAM designs - 2 Multi-class: picks up genes whose mean expression is different across > 2 groups of samples (analogous to one-way ANOVA) Censored survival: picks up genes whose expression levels are correlated with duration of survival. One-class: picks up genes whose mean expression across experiments is different from a user-specified mean.

91 SAM Two-Class Unpaired
1. Assign experiments to two groups, e.g., in the expression matrix below, assign Experiments 1, 2 and 5 to group A, and experiments 3, 4 and 6 to group B.
[Figure: expression matrix of Gene 1 to Gene 6 across Exp 1 to Exp 6, shown before and after assigning the experiments to Group A and Group B]
2. Question: Is mean expression level of a gene in group A significantly different from mean expression level in group B?

92 SAM Two-Class Unpaired– 2
Permutation tests
i) For each gene, compute d-value (analogous to t-statistic). This is the observed d-value for that gene.
ii) Rank the genes in ascending order of their d-values.
iii) Randomly shuffle the values of the genes between groups A and B, such that the reshuffled groups A and B respectively have the same number of elements as the original groups A and B. Compute the d-value for each randomized gene.
[Figure: Gene 1 across Exp 1 to Exp 6 under the original grouping and under a randomized grouping into Group A and Group B]

93 SAM Two-Class Unpaired - 3
iv) Rank the permuted d-values of the genes in ascending order.
v) Repeat steps iii) and iv) many times, so that each gene has many randomized d-values corresponding to its rank from the observed (unpermuted) d-value. Take the average of the randomized d-values for each gene. This is the expected d-value of that gene.
vi) Plot the observed d-values vs. the expected d-values.
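Steps i) to v) can be sketched as below. The slides do not give the d-value formula, so the sketch assumes a t-like statistic with a small constant `s0` added to the denominator; `d_value`, `observed_and_expected`, `s0` and the example matrix are all illustrative assumptions rather than MeV's code.

```python
import random
from statistics import mean, variance


def d_value(a, b, s0=0.1):
    """Assumed t-like statistic: (mean(b) - mean(a)) / (s + s0)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    s = (pooled * (1 / na + 1 / nb)) ** 0.5
    return (mean(b) - mean(a)) / (s + s0)


def observed_and_expected(matrix, group_a, group_b, n_perm=200, seed=0):
    """Return ranked observed d-values and their expected counterparts.

    matrix: list of gene rows; group_a, group_b: lists of column indices.
    """
    rng = random.Random(seed)

    def ranked(cols_a, cols_b):
        # d-value of every gene for this grouping, in ascending order
        return sorted(
            d_value([row[j] for j in cols_a], [row[j] for j in cols_b])
            for row in matrix
        )

    observed = ranked(group_a, group_b)
    cols = list(group_a) + list(group_b)
    sums = [0.0] * len(matrix)
    for _ in range(n_perm):
        rng.shuffle(cols)  # reshuffle experiments, keeping group sizes
        permuted = ranked(cols[:len(group_a)], cols[len(group_a):])
        sums = [s + d for s, d in zip(sums, permuted)]
    expected = [s / n_perm for s in sums]  # average d-value at each rank
    return observed, expected


matrix = [
    [0.1, 0.2, 2.0, 2.1, 0.0, 1.9],    # higher in group B
    [1.5, 1.4, -0.2, 0.0, 1.6, 0.1],   # higher in group A
    [0.3, -0.1, 0.2, 0.0, 0.1, -0.2],  # no clear difference
]
observed, expected = observed_and_expected(matrix, [0, 1, 4], [2, 3, 5])
```

Plotting `observed` against `expected` gives the kind of plot described in step vi); genes that sit far from the diagonal are the candidates for significance.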

94 SAM Two-Class Unpaired– 4
[Figure: SAM plot of observed vs. expected d-values, annotated as follows]
-Significant positive genes (i.e., mean expression of group B > mean expression of group A) in red
-Significant negative genes (i.e., mean expression of group A > mean expression of group B) in green
-“Observed d = expected d” line
-Tuning parameter “delta” limits, which can be dynamically changed by using the slider bar or entering a value in the text field
The more a gene deviates from the “observed = expected” line, the more likely it is to be significant. Any gene beyond the first gene in the +ve or -ve direction on the x-axis (including the first gene), whose observed d-value differs from its expected d-value by at least delta, is considered significant.

95 SAM Two-Class Unpaired – 5
For each permutation of the data, compute the number of positive and negative significant genes for a given delta as explained in the previous slide. The median number of significant genes over these permutations is the median number of falsely significant genes; expressed as a proportion of the genes called significant in the original data, it gives the median False Discovery Rate. The rationale behind this is that any genes designated as significant from the randomized data are being picked up purely by chance (i.e., “falsely” discovered). Therefore, the median number picked up over many randomizations is a good estimate of the number of false discoveries.

96 SAM Two-Class Paired
Samples fall into two groups, A and B.
Each member of group A is associated with a member of group B in a 1-to-1 relationship.
[Figure: samples in groups A and B, linked as A-B pairs]

97 SAM Two-Class Paired - 2
e.g., groups A and B could respectively represent “before” and “after” a drug treatment, and each A-B pair of samples could come from the same patient before and after the treatment.
Or, groups A and B could represent two strains for which samples were collected at several time points over a time course study. A sample collected from each of strain A and B at the same time point could form an A-B pair.
The rest of the analysis is similar to two-class unpaired SAM. Positive significant genes are those for which Mean(Group B) is significantly larger than Mean(Group A), and the reverse is true for negative significant genes.

98 SAM Multi-Class
-Extension of SAM two-class unpaired to more than 2 groups
-Experiments belong to one of at least three groups
-Analogous to one-way between subjects ANOVA
[Figure: expression matrix of Gene 1 to Gene 7 across Ex 1 to Ex 9, with the experiments split into Group 1, Group 2 and Group 3]

99 SAM Multi-Class - 2
This analysis yields only positive significant genes. These are genes whose means are significantly different across some combination of the groups of experiments.

100 SAM Censored Survival
Each experiment (sample) is associated with an observation time, and a state at the time of observation. The state is either “dead” or “censored”.
“Censored” means that the subject survived beyond the time point at which the sample was taken.
A positive score means that a higher expression level for that gene implies shorter survival (i.e., higher risk), whereas a negative score means that higher expression implies longer survival.

101 SAM One-Class
-used to pick up genes whose mean expression across experiments is different from a user-specified mean
-analogous to the one-sample t-test
-positive genes are those whose means are greater than the specified mean, while negative genes have means smaller than the specified mean

102 Support Vector Machines (SVM)
-supervised learning technique
-uses supplied information, such as presumptive biological relationships between a set of elements, together with the expression profiles of the elements, to produce a binary classification of elements

103 Supervised Learning
-begins with the definition of a class which specifies in advance which elements should cluster together.
-e.g., genes for enzymes in a common pathway or part of a regulatory system, or samples from a common tissue type or a particular strain.
-this information is used to train the SVM to discriminate members from non-members

104 SVM Process Overview
[Figure: flow diagram. The data and an initial classification go into SVM training, which produces weights; the data and weights go into SVM classification, which yields the elements in the classification and the elements out of the classification.]

105 SVM Classification
SVM attempts to find an optimal separating hyperplane between members of the two initial classifications.
[Figure: two groups of elements divided by a separating hyperplane]

106 Separation Problem
-an optimal hyperplane partitions the initial classification correctly and maximizes distance from the plane to elements on either ‘side’, positive and negative examples.
-when the training examples (initial classification) consist of very diverse expression patterns, finding an optimal hyperplane can be impossible…

107 SVM Kernel Construction
The expression data can be transformed to a higher dimensional space (feature space) by applying a kernel function. This transformation can have the effect of allowing a separating hyperplane to be found.
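The effect of moving to a feature space can be illustrated with a toy case. The sketch below uses an explicit mapping x -> (x, x^2) instead of a kernel function proper (a degree-2 polynomial kernel works with quadratic features of this kind implicitly); the data, labels and function names are made up for the illustration.

```python
def feature_map(x):
    """Map a 1-D value into a 2-D feature space (x, x^2)."""
    return (x, x * x)


def separable_by_threshold(values, labels):
    """True if a single cut along one axis splits the two classes."""
    ordered = [label for _, label in sorted(zip(values, labels))]
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1


# Positive examples lie at both extremes, negatives in the middle
points = [-2.0, -1.5, -0.5, 0.0, 0.4, 1.6, 2.2]
labels = [+1, +1, -1, -1, -1, +1, +1]

# No single threshold on the original axis separates the classes ...
print(separable_by_threshold(points, labels))   # False

# ... but the x^2 coordinate of the feature space does (cut at x^2 = 1)
squared = [feature_map(x)[1] for x in points]
print(separable_by_threshold(squared, labels))  # True
predicted = [+1 if s > 1.0 else -1 for s in squared]
```

In the feature space the horizontal line x^2 = 1 plays the role of the separating hyperplane.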

108 Practical SVM Issues Results depend heavily on the input parameters.
Using a high degree kernel function risks artificial separation of the data. An iterative approach to increasing the kernel power is advisable.

109 SVM Results Two classes are produced
Positive Class: contains elements with expression patterns similar to those in the positive examples in the training set.
Negative Class: contains all other members of the input set.
Each of these classes has elements that fall in two groups:
-Those initially in the class (true positives and true negatives)
-Those recruited into the class (false positives and false negatives)

110 K-Nearest Neighbor Classification – KNNC - 1
-supervised classification scheme
-user specifies the number of expected classes
-a training set of vectors is provided as input
-user specifies classes of training vectors
-training set should contain examples of each class

111 KNNC – 2 – pre-classification filters
Prior to classification, variance filtering can optionally be applied to all vectors (training set + vectors to be classified). This will filter out genes with low variance across experiments. Note that this might filter out some genes in the training set as well.
Correlation filtering can also be applied to the vectors to be classified. This filters out those vectors in the set to be classified that are not significantly correlated with any gene in the training set. Significance for correlation filtering is determined by a permutation test.

112 KNNC – 3 - correlation filtering randomization test
1. The Pearson correlation coefficient r is computed between a given vector to be classified, and each member of the training set.
2. The maximum such r is called the rmax for that vector.
3. The vector is randomized a user-specified number of times, and each time, an rmax is calculated using the randomized vector (call it rmax*), just as in steps 1 and 2.
4. The proportion of times rmax* exceeds rmax over all randomizations is the p-value for that vector.
5. If the p-value for a vector < the user-specified p-value, that vector is retained for further analysis.
6. Steps 1-5 are repeated for every vector in the set to be classified.
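The randomization test above can be sketched for a single vector as follows. The slides do not say how the vector is randomized; the sketch assumes its elements are shuffled. The names `pearson` and `correlation_p_value` and the example vectors are illustrative.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length vectors."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def correlation_p_value(vector, training_set, n_rand=1000, seed=0):
    """Proportion of randomizations in which rmax* exceeds rmax (steps 1-4)."""
    rng = random.Random(seed)
    r_max = max(pearson(vector, t) for t in training_set)
    shuffled = list(vector)
    exceed = 0
    for _ in range(n_rand):
        rng.shuffle(shuffled)  # assumed randomization: permute the elements
        r_max_star = max(pearson(shuffled, t) for t in training_set)
        if r_max_star > r_max:
            exceed += 1
    return exceed / n_rand


training_set = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
candidate = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]   # perfectly correlated
p = correlation_p_value(candidate, training_set)
keep = p < 0.05                                # step 5, with a critical p of 0.05
```

Because no shuffled version of `candidate` can correlate with the training vector better than the original does, its p-value is 0 and the vector is retained.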

113 KNNC – 4 - Classification parameters
Let v be a vector that needs to be classified, and T = {t1, t2, …, t10} be the set of training vectors. The user specifies the classes of each element of T. Say, there are 4 classes. The user also specifies the number of neighbors k. Say, k = 5.
[Figure: vector v and the training set T, whose vectors t1 to t10 are assigned to Class 1, Class 2, Class 3 and Class 4]

114 KNNC –5 - Classification
[Figure: vector v and the training set T, whose vectors are assigned to Class 1, Class 2, Class 3 and Class 4]
Suppose v’s 5 nearest neighbors in set T (by Euclidean distance) are t1, t4, t8, t2, and t5. Since class 1 is most frequently represented among v’s nearest neighbors, v is assigned to class 1. If there is a tie in frequency of classes represented among nearest neighbors, the vector remains unassigned.
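The classification rule, including the "unassigned on a tie" case, can be sketched as below. This is a minimal illustration under the assumptions stated in the comments; `knn_classify` and the example vectors are hypothetical, and `None` stands in for "unassigned".

```python
from collections import Counter
from math import dist


def knn_classify(v, training, k):
    """Assign v to the most frequent class among its k nearest neighbors.

    training: list of (vector, class_label) pairs.
    Returns the class label, or None if the top classes are tied.
    """
    # k nearest training vectors by Euclidean distance
    neighbors = sorted(training, key=lambda item: dist(v, item[0]))[:k]
    counts = Counter(label for _, label in neighbors).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the vector remains unassigned
    return counts[0][0]


training = [
    ((0.0, 0.0), 1), ((0.1, 0.0), 1), ((0.0, 0.2), 1),
    ((1.0, 1.0), 2), ((1.1, 0.9), 2),
    ((5.0, 5.0), 3), ((5.0, 5.2), 4),
]
print(knn_classify((0.1, 0.1), training, k=5))              # 1
print(knn_classify((0.9,), [((0.0,), 1), ((2.0,), 2)], 2))  # None (tie)
```

In the first call the five nearest neighbors are three class 1 vectors and two class 2 vectors, so class 1 wins; in the second, each class has one neighbor, so the vector stays unassigned.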

115 EASE (Expression Analysis Systematic Explorer)
EASE analysis identifies prevalent biological themes within gene clusters. The significance of each identified theme is determined by its prevalence in the cluster and in the population of genes from which the cluster was created.

116 Diverse Biological Roles
Consider a population of genes representing a diverse set of biological roles or themes shown below as different colors.

117 Many algorithms can be applied to expression data to partition genes based on expression profiles over multiple conditions. Many of these techniques work solely on expression data and disregard biological information.

118 Consider a particular cluster…
-What are some of the predominant biological themes represented in the cluster, and how should significance be assigned to a discovered biological theme?

119 Example: Population Size: 40 genes Cluster size: 12 genes 10 genes, shown in green, have a common biological theme and 8 occur within the cluster.

120 Consider the Outcome
The frequency of the theme in the population is 10/40 = 25%.
The frequency of the theme within the cluster is 8/12 = 67%.
AND 80% of the genes related to the theme in the population ended up within the relatively small cluster.

121 Contingency Matrix
A 2x2 contingency matrix is typically used to capture the relationship between cluster membership and membership in a biological theme.

122 Contingency Matrix
             Cluster in   Cluster out
Theme in          8            2
Theme out         4           26

123 Assigning Significance to the Findings
The Fisher’s Exact Test permits us to determine if there are non-random associations between the two variables, expression-based cluster membership and membership in a particular biological theme.
(2x2 contingency matrix)
             Cluster in   Cluster out
Theme in          8            2
Theme out         4           26
p ≈ .0002

124 Hypergeometric Distribution
The probability of any particular matrix occurring by random selection, given no association between the two variables, is given by the hypergeometric rule.
             in      out     total
in           a       b       a+b
out          c       d       c+d
total        a+c     b+d

125 Probability Computation
For our matrix, [8 2; 4 26] (rows separated by semicolons), we are interested not only in the probability of getting exactly 8 annotation hits in the cluster but rather in the probability of having 8 or more hits. In this case the probabilities of each of the possible matrices are summed: [8 2; 4 26], [9 1; 3 27] and [10 0; 2 28].
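The summation can be checked with a short sketch. It assumes the standard hypergeometric probability for a 2x2 matrix with fixed margins, written here with binomial coefficients; the function names are illustrative and this is not the EASE code itself.

```python
from math import comb


def hypergeom_prob(a, b, c, d):
    """Probability of the matrix [a b; c d] given its row and column totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def one_sided_p(a, b, c, d):
    """Sum the probabilities of all matrices with a or more hits in the cluster."""
    p = 0.0
    while b >= 0 and c >= 0:
        p += hypergeom_prob(a, b, c, d)
        # move one theme gene into the cluster, keeping all margins fixed
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return p


# 8 theme genes in the cluster, 2 outside; 4 non-theme genes in the cluster, 26 outside
p = one_sided_p(8, 2, 4, 26)
print(p)
```

For the example matrix the sum covers the three matrices with 8, 9 and 10 hits and comes to roughly 0.00023, consistent with the p ≈ .0002 quoted for this example.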

126 EASE Results
Consider all of the Results: EASE reports all themes represented in a cluster, and although some themes may not meet statistical significance, it may still be important to note that particular biological roles or pathways are represented in the cluster.
Independently Verify Roles: Once found, biological themes should be independently verified using annotation resources.

127 Basic EASE Requirements
Annotation keys; identifiers for each gene must be loaded with the data into MeV. EASE file system; EASE uses a file system to link annotation keys to biological themes.

128 EASE File System

129 EASE (Expression Analysis Systematic Explorer)
Hosack et al. Identifying biological themes within lists of genes with EASE. Genome Biol., 4:R70-R70.8, 2003. NIAID graciously provided the foundation Java classes upon which the MeV version was built.

130 Coming Attractions Algorithm scripting Discriminant analysis
Chromosome Viewers etc.

