Statistical tests for differential expression in cDNA microarray experiments (2): ANOVA Xiangqin Cui and Gary A. Churchill Genome Biology 2003, 4:210 Presented by M. Carme Ruíz de Villa and Alex Sánchez Departament d’Estadística U.B.
Remember … We want to measure how gene expression changes under different conditions. With only two conditions and an adequate number of replicates: t-tests and extensions. With more than two conditions or more than one factor there are several approaches: Analysis of Variance (ANOVA) (Churchill et al.) and Linear Models (Smyth, Speed, …).
Sources of variation (1) We want to determine when the variation due to gene expression is significant, but… There are multiple sources of variation in measurements besides just gene expression. We want to know when the variation in measurements is caused by varying levels of gene expression versus other factors.
Sources of variation (2) Some sources of variation in the measurements in microarray experiments are: array effects, dye effects, variety effects, gene effects, and combinations of these.
Relative expression values If there are more than two conditions, we cannot simply compute ratios. ANOVA modelling yields estimates of the relative expression for each gene in each sample. The ANOVA model is not based on log ratios; rather, it is applied directly to intensity data. However, the difference between two relative expression values can be interpreted as the mean log ratio for comparing two samples.
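The last point can be illustrated with a small sketch. It assumes a deliberately simplified setting with no array or dye effects, so that the relative expression of a gene in a sample reduces to its mean log2 intensity; the intensity values are hypothetical.

```python
from math import log2
from statistics import mean

# Hypothetical intensities of one gene in two samples, paired by array.
sample_a = [1200.0, 1500.0, 900.0]
sample_b = [600.0, 750.0, 450.0]

# Simplified "relative expression": mean log2 intensity per sample.
rel_a = mean(log2(x) for x in sample_a)
rel_b = mean(log2(x) for x in sample_b)

# Mean log ratio computed directly from the paired intensities.
mean_log_ratio = mean(log2(a / b) for a, b in zip(sample_a, sample_b))

# The difference of relative expression values equals the mean log ratio.
difference = rel_a - rel_b
```

In the full ANOVA model the relative expression values are estimated jointly with the array and dye effects, so this equality only illustrates the interpretation and is not the estimation procedure itself.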
Technical & biological replicates If inference is being made on the basis of biological replicates and there is also technical replication, the technical replicates should be averaged to yield a single value for each independent biological unit.
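A minimal sketch of this averaging step, assuming measurements arrive as (biological unit, value) pairs; the function name and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def average_technical_replicates(measurements):
    """Collapse technical replicates to one value per biological unit.

    `measurements` is a list of (biological_unit, value) pairs; repeated
    units are treated as technical replicates of the same sample.
    """
    by_unit = defaultdict(list)
    for unit, value in measurements:
        by_unit[unit].append(value)
    return {unit: mean(values) for unit, values in by_unit.items()}

# Hypothetical data: two mice, each measured twice (technical replicates).
data = [("mouse1", 8.0), ("mouse1", 8.4), ("mouse2", 9.1), ("mouse2", 9.3)]
averaged = average_technical_replicates(data)
```

After this step each biological unit contributes exactly one value, so the error term in later tests reflects biological rather than technical variation.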
Derived data sets The set of estimated relative expression values, one for each gene in each RNA sample, is a derived data set that may be subject to a second level of analysis. The derived data can be analyzed on a gene by gene basis using standard ANOVA methods to test for differences among conditions. (Oleksiak et al. )
Review of ANOVA models
One way ANOVA Suppose you have a model for each measurement in your experiment: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$, where $y_{ij}$ is the $j$th measurement for the $i$th group; $\mu$ is the overall mean effect (constant); $\alpha_i$ is the $i$th group effect (constant); $\varepsilon_{ij}$ is the experimental error term, $\varepsilon_{ij} \sim N(0, \sigma^2)$. Therefore, observations from group $i$ are distributed with mean $\mu + \alpha_i$ and variance $\sigma^2$.
Hypothesis Testing Overall variability = within group variability + between group variability. Intuition: if the between group variability is large compared to the within group variability, then the differences between means are significant.
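Sum of Squares Total sum of squares: $SS_T = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{..})^2$. Within Sum of Squares: $SS_W = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i.})^2$. Between Sum of Squares: $SS_B = \sum_{i=1}^{k} n_i(\bar{y}_{i.}-\bar{y}_{..})^2$. Here $\bar{y}_{i.}$ is the mean of group $i$, $\bar{y}_{..}$ is the overall mean, and $SS_T = SS_W + SS_B$.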
Mean Sum of Squares Between MS = Between SS/(k-1). Within MS = Within SS/(n-k). F = Between MS / Within MS, where k is the number of groups and n is the total number of observations. It is summarized in the ANOVA table. Example 1
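The computation can be sketched in a few lines of plain Python. This is an illustrative sketch with hypothetical data, not the code used for the slides; the F ratio is the between mean square divided by the within mean square.

```python
def one_way_anova(groups):
    """Return (between_ss, within_ss, f_statistic) for a list of groups."""
    all_values = [y for g in groups for y in g]
    n = len(all_values)          # total number of observations
    k = len(groups)              # number of groups
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Between SS: squared deviations of group means from the grand mean,
    # weighted by group size.
    between_ss = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within SS: squared deviations of observations from their group mean.
    within_ss = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)

    between_ms = between_ss / (k - 1)
    within_ms = within_ss / (n - k)
    return between_ss, within_ss, between_ms / within_ms

# Hypothetical measurements for three groups of three observations.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
between_ss, within_ss, f_stat = one_way_anova(groups)
```

For these data the between and within sums of squares are 26 and 6, giving F = 13 on (2, 6) degrees of freedom; a p-value would then come from the F distribution, which is not computed here.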
Multiple Factor ANOVA The model can be extended by adding: more factors ($\alpha$, $\beta$, …); interactions between them ($\alpha\beta$, …); other terms. This is used to model the different sources of variation appearing in microarray experiments.
Experiment 1: Latin Square [Figure: Latin square design with liver and muscle samples]
Random effects models If the k factor levels can be considered a random sample from a population of factor levels, we have a random effects ANOVA model: $Y_{ij} = \mu + A_i + e_{ij}$, where $\mu$ is the overall mean, $A_i$ is a random variable instead of a constant, and $e_{ij}$ is the experimental error. $E(A_i) = 0$, $E(e_{ij}) = 0$, $\mathrm{var}(A_i) = \sigma_A^2$, $\mathrm{var}(e_{ij}) = \sigma^2$, with $A_i$ and $e_{ij}$ independent, so that $\mathrm{var}(Y_{ij}) = \sigma_A^2 + \sigma^2$.
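One common way to estimate the two variance components is the method of moments, which is not spelled out on the slide and is shown here as an assumption: in a balanced design with n observations per level, the expected between mean square is $\sigma^2 + n\sigma_A^2$ and the expected within mean square is $\sigma^2$. A sketch with hypothetical data:

```python
def variance_components(groups):
    """Method-of-moments estimates (sigma_A^2, sigma^2), balanced design."""
    k = len(groups)              # number of factor levels
    n = len(groups[0])           # observations per level
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes a balanced design")

    group_means = [sum(g) / n for g in groups]
    grand_mean = sum(group_means) / k
    between_ms = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    within_ms = sum((y - m) ** 2
                    for g, m in zip(groups, group_means)
                    for y in g) / (k * (n - 1))

    # E(between MS) = sigma^2 + n * sigma_A^2 and E(within MS) = sigma^2;
    # negative estimates are truncated at zero.
    sigma2_a = max(0.0, (between_ms - within_ms) / n)
    return sigma2_a, within_ms

groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
sigma2_a, sigma2 = variance_components(groups)
total_variance = sigma2_a + sigma2   # estimate of var(Y_ij)
```

Truncating a negative estimate at zero is a simple convention; likelihood-based estimators (for example REML) are an alternative.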
Where to find more… Draghici, S. (2003). ANOVA chapter (7), Data analysis tools for microarrays, Wiley. Pavlidis, P. (2003). Using ANOVA for gene selection from microarray studies of the nervous system. avlidis/ doc/reprints/anova-methods.pdf
ANOVA Models for Microarray Data
Kerr & Churchill’s model $y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \varepsilon_{ijkg}$, where $y_{ijkg}$ is the expression measurement from the $i$th array, $j$th dye, $k$th variety, and $g$th gene; $\mu$ is the average expression over all spots; $A_i$ is the effect of the $i$th array; $D_j$ is the effect of the $j$th dye; $V_k$ is the effect of the $k$th variety (= treatment, sample, …); $G_g$ is the effect of the $g$th gene; $(AG)_{ig}$ is the effect of the $i$th array and $g$th gene; $(VG)_{kg}$ is the effect of the $k$th variety and $g$th gene; $\varepsilon_{ijkg}$ are independent and identically distributed error terms.
Interpreting main effects A: differences in fluorescent signal from array to array (e.g. if arrays are probed under inconsistent conditions that increase or reduce hybridization of labeled cDNA) D: differences between two dye fluorescent labels (one dye may consistently be brighter than the other) G: differences in fluorescence for equally expressed genes. V: differences of expression level between different varieties (samples, tumour types,..).
Interpreting interactions DV: if, for a particular variety, labelling is produced in separate runs of the process, differences in the runs can produce pools of cDNA of varying concentrations or quality. AG (spot effect): spots for a given gene on the different arrays vary in the amount of cDNA available for hybridization. DG: arises if there are differences in the dyes that are gene-specific. VG: reflects differences in expression for particular variety and gene combinations that are not explained by the average effects of these varieties and genes. THIS IS THE QUANTITY OF INTEREST !!!
Normalization The A, D, V terms effectively normalize the data; thus the normalization process is integrated with the data analysis. This approach has several benefits (?): the normalization is based on a clearly stated set of assumptions; it systematically estimates normalization parameters based on all the data; the model can be generalized to the situation where genes are spotted multiple times on each array rather …
Statistically Significant Effects Array, Dye, Variety & Gene effects: the goal is to estimate their value; there is no need to assess their significance; sometimes they do not appear (gene-level model). Array x Gene and Variety x Gene effects: they may or may not be present; the goal is to assess their significance, that is, to test whether the mean effect = 0 if fixed, or whether the effect variance = 0 if random.
Test statistics: The 3 F’s Hypothesis testing involves the comparison of two models. In this setting we consider a null model of no differential expression (all VG =0) and an alternative model with differential expression among the conditions (some VG are not equal to zero). F statistics are computed on a gene-by-gene basis based on the residual sums of squares from fitting each of these models.
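The comparison of two nested models through their residual sums of squares can be sketched as follows. This shows only the usual nested-model F statistic; the specific variants referred to as "the 3 F's" are not detailed in this transcript, and the function name and numbers are hypothetical.

```python
def f_from_rss(rss_null, df_null, rss_alt, df_alt):
    """F statistic comparing a null model with a larger alternative model.

    rss_* are residual sums of squares and df_* the residual degrees of
    freedom of each fitted model (df_null > df_alt).
    """
    numerator = (rss_null - rss_alt) / (df_null - df_alt)
    denominator = rss_alt / df_alt
    return numerator / denominator

# Hypothetical gene with 9 observations in 3 conditions:
# null model (single mean): RSS = 32 on 8 df;
# alternative model (one mean per condition): RSS = 6 on 6 df.
f_gene = f_from_rss(rss_null=32.0, df_null=8, rss_alt=6.0, df_alt=6)
```

In a gene-by-gene analysis this function would be applied once per gene, using the residual sums of squares obtained from fitting the null and alternative models to that gene's data.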
Example 1 A gene which is believed to be related to ovarian cancer is investigated. The cancer is sub-classified in 3 categories (stages): I, II, III-IV. 15 samples, 3 per stage, are available. They are labelled with 3 colors and hybridized on a 4 channel cDNA array (1 channel empty). (A seemingly more reasonable procedure: double dye-swap reference design.)
Example 1. Normalized Data
Example 1: ANOVA table (1) If arrays are homogeneous, the appropriate model is 1 factor ANOVA.
Example (1): Blocking If arrays are not homogeneous, the appropriate model is 2 factor ANOVA (1 new block factor for arrays).
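A sketch of the blocked analysis, assuming one observation per treatment and block (a randomized block layout without replication). The data are hypothetical and are not the values of Example 1, which are not reproduced in this transcript.

```python
def blocked_anova(table):
    """Two-factor ANOVA without replication.

    table[i][j] is the value for treatment i (e.g. stage) in block j
    (e.g. array). Returns (ss_treat, ss_block, ss_error, f_treat).
    """
    a = len(table)       # number of treatments
    b = len(table[0])    # number of blocks
    grand = sum(sum(row) for row in table) / (a * b)
    treat_means = [sum(row) / b for row in table]
    block_means = [sum(table[i][j] for i in range(a)) / a for j in range(b)]

    ss_treat = b * sum((m - grand) ** 2 for m in treat_means)
    ss_block = a * sum((m - grand) ** 2 for m in block_means)
    ss_total = sum((table[i][j] - grand) ** 2
                   for i in range(a) for j in range(b))
    # What remains after removing treatment and block effects is error.
    ss_error = ss_total - ss_treat - ss_block

    df_error = (a - 1) * (b - 1)
    f_treat = (ss_treat / (a - 1)) / (ss_error / df_error)
    return ss_treat, ss_block, ss_error, f_treat

# Hypothetical values: rows are stages, columns are arrays.
table = [[1.0, 3.0, 2.0],
         [2.0, 4.0, 6.0],
         [5.0, 6.0, 7.0]]
ss_treat, ss_block, ss_error, f_treat = blocked_anova(table)
```

Removing the block (array) sum of squares from the error term is what makes the treatment test more sensitive when arrays differ systematically.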
Example 2: CAMDA kidney data ftp://ftp.camda.duke.edu/CAMDA02_DATASETS/papers/README_normal.html 6 mouse kidney samples (suppose 6 different treatments), compared to a common reference in a double reference design. Dye swap. Replicate arrays 2
2.1. The ANOVA model Work only at the gene level: no main effects (A, D, V, G) as defined. $Y_{ijk} = DG_i + AG_j + VG_k + \varepsilon_{ijk}$, with $i = 1, 2$ (dyes); $j = 1, 2$ (arrays); $k = 1, \ldots, 6$ (samples).
Example 3: A 2 factor design Diet X Strain
3.3. The ANOVA model $Y_{ijklm} = DG_i + AG_j + \mathrm{Strain}_l + \mathrm{Diet}_m + (\mathrm{Strain{:}Diet})_{lm} + VG_k + \varepsilon_{ijklm}$, with $i = 1, 2$ (dyes); $j = 1, 2$ (arrays); $k = 1, \ldots, 12$ (samples); $l = 1, \ldots, 3$ (strains); $m = 1, 2$ (diets).