Presentation on theme: "Lecture 9 Microarray experiments MA plots"— Presentation transcript:
1 Lecture 9: Microarray experiments. MA plots. Normalization of microarray data. Tests for differential expression of genes. Multiple testing and FDR.
2 DNA Microarray (figure: typical microarray chip). Though most cells in an organism contain the same genes, not all of the genes are used in each cell. Some genes are turned on, or "expressed", when needed in particular types of cells. Microarray technology allows us to look at many genes at once and determine which are expressed in a particular cell type.
3 DNA Microarray (figure: typical microarray chip). DNA molecules representing many genes are placed in discrete spots, called probes, on a microscope slide. Messenger RNA, the working copy of a gene within a cell, is purified from cells of a particular type. The RNA molecules are then "labeled" by attaching a fluorescent dye that allows us to see them under a microscope, and added to the DNA spots on the microarray. Due to a phenomenon termed base-pairing, RNA will stick to the probe corresponding to the gene it came from.
4 DNA Microarray. Usually a gene is interrogated by 11 to 20 probes, and each probe is usually a 25-mer sequence. The probes are typically spaced widely along the sequence. Sometimes probes are chosen closer to the 3' end of the sequence. A probe that is exactly complementary to the sequence is called a perfect match (PM) probe. A mismatch (MM) probe is identical to its PM probe except at the central position, where it is not complementary. In theory, MM probes can be used to quantify and remove non-specific hybridization. Source: PhD thesis by Benjamin Milo Bolstad, 2004, University of California, Berkeley.
5 Sample preparation and hybridization. Source: PhD thesis by Benjamin Milo Bolstad, 2004, University of California, Berkeley.
6 Sample preparation and hybridization. During the hybridization process, cRNA binds to the array. Earlier chips had all the probes of a probe set located contiguously on the array. This layout may fall prey to spatial defects. Newer chips have the probes of a probe set spread out across the array. A PM and MM probe pair are always adjacent on the array. Source: PhD thesis by Benjamin Milo Bolstad, 2004, University of California, Berkeley.
7 Growth curve of bacteria. Samples can be taken at different stages of the growth curve. One of them is considered the control and the others are considered targets. Samples can be taken before and after application of drugs. Samples can also be taken under different experimental conditions, e.g. starvation of some metabolite. Which types of samples should be used depends on the goal of the experiment at hand.
8 DNA Microarray (figure: typical microarray chip). After washing away all of the unbound RNA, the microarray can be observed under a microscope to determine which RNA remains stuck to the DNA spots. Microarray technology can be used to learn which genes are expressed differently in a target sample compared to a control sample (e.g. diseased versus healthy tissue). However, background correction and normalization are necessary before drawing useful conclusions.
9 MA plots. MA plots are typically used to compare two color channels, two arrays or two groups of arrays. The vertical axis is the difference between the logarithms of the signals (the log ratio) and the horizontal axis is the average of the logarithms of the signals. The M stands for minus and the A stands for add. MA is also a mnemonic for microarray. For gene i on arrays j and k: $M_i = \log(X_{ij}) - \log(X_{ik}) = \log(X_{ij}/X_{ik})$ (log ratio); $A_i = [\log(X_{ij}) + \log(X_{ik})]/2$ (average log intensity).
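The M and A definitions can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function name `ma_values` and the sample intensities are hypothetical, and base-2 logarithms are an assumption, since the slide does not state the base.

```python
import math

def ma_values(x_j, x_k):
    """Return (M, A) for two equal-length lists of positive intensities.

    M_i = log2(x_ij) - log2(x_ik)        (log ratio)
    A_i = (log2(x_ij) + log2(x_ik)) / 2  (average log intensity)
    Base 2 is an assumption; the slide only writes "log".
    """
    if len(x_j) != len(x_k):
        raise ValueError("both arrays must measure the same genes")
    m_vals, a_vals = [], []
    for xj, xk in zip(x_j, x_k):
        lj, lk = math.log2(xj), math.log2(xk)
        m_vals.append(lj - lk)
        a_vals.append((lj + lk) / 2)
    return m_vals, a_vals

# Hypothetical intensities for three genes on two arrays.
array_j = [8.0, 16.0, 4.0]
array_k = [2.0, 16.0, 16.0]
M, A = ma_values(array_j, array_k)
print(M, A)
```

A gene with equal intensity on both arrays gets M = 0, so systematic departures of the point cloud from the horizontal line M = 0 are what normalization tries to remove.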
10 A typical MA plot. From the first plot we can see differences between the two arrays, but the non-linear trend is not apparent. This is because there are many more points at low intensities than at high intensities. The MA plot allows us to assess the behavior across all intensities.
11 Normalization of microarray data. Normalization is the process of removing unwanted non-biological variation that might exist between chips in microarray experiments. By removing the non-biological variation, normalization makes the biological variation more apparent.
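12 Typical microarray data

          Array 1   Array 2   ...   Array j   ...   Array m
Gene 1    X11       X12       ...   X1j       ...   X1m
Gene 2    X21       X22       ...   X2j       ...   X2m
...
Gene i    Xi1       Xi2       ...   Xij       ...   Xim
...
Gene n    Xn1       Xn2       ...   Xnj       ...   Xnm
Mean      X̄1        X̄2        ...   X̄j        ...   X̄m
SD        σ1        σ2        ...   σj        ...   σm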
14 Effect of scaling and centering normalization. [Figure panels: original data; after centering (mean = 0); after scaling (mean = 0, standard deviation = 1).]
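The effect of centering and scaling on one array (one column of the data matrix) can be sketched as follows. The slide that defined the two operations is not in the transcript, so this assumes the usual definitions: centering subtracts the array mean, and scaling additionally divides by the array standard deviation. The function names and the sample values are hypothetical.

```python
from statistics import mean, stdev

def center(column):
    """Subtract the column mean, so the result has mean 0 (assumed definition)."""
    mu = mean(column)
    return [v - mu for v in column]

def scale(column):
    """Subtract the mean and divide by the sample standard deviation,
    so the result has mean 0 and standard deviation 1 (assumed definition)."""
    mu, sd = mean(column), stdev(column)
    return [(v - mu) / sd for v in column]

# Hypothetical log intensities of five genes on one array.
array = [7.1, 8.4, 6.9, 10.2, 9.4]
centered = center(array)
scaled = scale(array)
print(mean(centered), mean(scaled), stdev(scaled))
```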
15 Normalization between a pair of arrays: Loess (Lowess) normalization. Lowess normalization is applied separately to each two-dye experiment. This method can be used to normalize the Cy5 and Cy3 channel intensities (usually one of them is the control and the other is the target) using MA plots.
16 Normalization between a pair of arrays: Loess (Lowess) normalization. Two-channel data: each gene i has a control intensity $C_i$ and a target intensity $T_i$. $M_i = \log(T_i/C_i)$ (log ratio); $A_i = [\log(T_i) + \log(C_i)]/2$ (average log intensity). [Figure: MA plot with $A_i$ on the horizontal axis and $M_i$ on the vertical axis; each point corresponds to a single gene.]
17 Normalization between a pair of arrays: Loess (Lowess) normalization. $M_i = \log(T_i/C_i)$ (log ratio); $A_i = [\log(T_i) + \log(C_i)]/2$ (average log intensity). [Figure: MA plot with a typical regression line; each point corresponds to a single gene.] The MA plot shows some bias.
18 Normalization between a pair of arrays: Loess (Lowess) normalization. $M_i = \log(T_i/C_i)$ (log ratio); $A_i = [\log(T_i) + \log(C_i)]/2$ (average log intensity). [Figure: MA plot; each point corresponds to a single gene.] The MA plot shows some bias. Usually several regression lines or polynomials are fitted to different sections of the plot. The final result is a smooth curve providing a model for the data. This model is then used to remove the bias from the data points.
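The idea of fitting local regression lines to sections of the MA plot and subtracting the resulting curve can be sketched in plain Python. This is a simplified single-pass local linear smoother with tricube weights, not the full robust lowess procedure. The function names, the `frac` parameter and the test data are assumptions made for illustration.

```python
import math

def local_linear_fit(a, m, frac=0.5):
    """Fitted M at each A from a tricube-weighted line through the
    nearest `frac` of the points (simplified lowess, no robustness steps)."""
    n = len(a)
    k = min(n, max(3, math.ceil(frac * n)))
    fitted = []
    for a0 in a:
        # Indices of the k points closest to a0 along the A axis.
        nearest = sorted(range(n), key=lambda i: abs(a[i] - a0))[:k]
        h = max(abs(a[i] - a0) for i in nearest) or 1.0
        w = [(1 - (abs(a[i] - a0) / h) ** 3) ** 3 for i in nearest]
        sw = sum(w)
        a_bar = sum(wi * a[i] for wi, i in zip(w, nearest)) / sw
        m_bar = sum(wi * m[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (a[i] - a_bar) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (a[i] - a_bar) * (m[i] - m_bar)
                  for wi, i in zip(w, nearest))
        # Fall back to the weighted mean if the local slope is undefined.
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(m_bar + slope * (a0 - a_bar))
    return fitted

def loess_normalize(a, m, frac=0.5):
    """Subtract the fitted curve from M, leaving A unchanged."""
    return [mi - fi for mi, fi in zip(m, local_linear_fit(a, m, frac))]

# Hypothetical MA data with an intensity-dependent bias M = 0.3*A - 1.
A = [6.0 + i for i in range(10)]
M = [0.3 * ai - 1.0 for ai in A]
M_normalized = loess_normalize(A, M)
print(M_normalized)
```

In this toy case the bias is exactly linear, so the normalized log ratios come out at zero up to rounding. On real data the curve bends with intensity, which is why a local fit is used instead of one global line.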
19 Normalization between a pair of arrays: Loess (Lowess) normalization. [Figure: bias reduction by lowess normalization.]
20 Normalization between a pair of arrays: Loess (Lowess) normalization. [Figure panels: unnormalized fold changes; fold changes after Loess normalization.]
21 Normalization across arrays. Here we discuss the following two normalization procedures applicable to a number of arrays: quantile normalization and baseline scaling normalization.
22 Normalization across arrays: Quantile normalization. The goal of quantile normalization is to give the same empirical distribution to the intensities of each array. If two data sets have the same distribution, then their quantile-quantile plot will be a straight diagonal line with slope 1 and intercept 0. Conversely, projecting the data points of the quantile-quantile plot onto the 45-degree line gives a transformation that makes the distributions the same. The quantile-quantile plot thus motivates the quantile normalization algorithm.
23 Normalization across arrays: Quantile normalization algorithm. Source: PhD thesis by Benjamin Milo Bolstad, 2004, University of California, Berkeley.
24 Normalization across arrays: Quantile normalization.
1. Sort each column of X (values) to get $X_{sort}$.
2. Take the means across rows of $X_{sort}$.
3. Assign this mean to each element in the row to get $X'_{sort}$.
4. Get $X_{normalized}$ by rearranging each column of $X'_{sort}$ to have the same ordering as the original X.
[Worked example tables: original data for Exp. 1 and Exp. 2, the sorted columns with their row means, and the rearranged result.]
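The four steps can be sketched in plain Python as follows. The function name `quantile_normalize` and the small data set are hypothetical (they are not the values from the slide), and ties within a column are not given any special treatment here.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length columns (one per array).

    Step 1: sort each column; step 2: average across rows of the sorted
    columns; step 3: give every element in a row that average; step 4: put
    the averages back in each column's original order.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all arrays must contain the same number of genes")
    sorted_cols = [sorted(col) for col in columns]                 # step 1
    row_means = [sum(col[r] for col in sorted_cols) / len(columns)
                 for r in range(n)]                                # steps 2-3
    normalized = []
    for col in columns:                                            # step 4
        order = sorted(range(n), key=lambda i: col[i])  # smallest first
        out = [0.0] * n
        for rank, gene_index in enumerate(order):
            out[gene_index] = row_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities: three genes measured in two experiments.
exp1 = [2.0, 4.0, 1.0]
exp2 = [5.0, 3.0, 9.0]
norm1, norm2 = quantile_normalize([exp1, exp2])
print(norm1, norm2)
```

After normalization both columns contain exactly the same set of values (the row means), while each gene keeps its rank within its own array.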
25 Normalization across arrays. [Figure panels: raw data; after quantile normalization.]
26 Normalization across arrays: Baseline scaling method. In this method a baseline array is chosen and all the arrays are scaled to have the same mean intensity as the chosen array. This is equivalent to selecting a baseline array and then fitting a linear regression line without intercept between the chosen array and every other array.
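A minimal sketch of the mean-matching step, assuming each array is a list of intensities and that the plain arithmetic mean is used; the function name `baseline_scale`, the `baseline_index` parameter and the data are hypothetical.

```python
from statistics import mean

def baseline_scale(arrays, baseline_index=0):
    """Scale every array so its mean intensity equals that of the baseline."""
    target = mean(arrays[baseline_index])
    scaled = []
    for arr in arrays:
        factor = target / mean(arr)   # one multiplicative factor per array
        scaled.append([v * factor for v in arr])
    return scaled

# Hypothetical intensities for three arrays; the first is the baseline.
raw = [[2.0, 4.0, 6.0], [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
scaled = baseline_scale(raw, baseline_index=0)
print([mean(arr) for arr in scaled])
```

Because a single factor is applied to each array, this method corrects overall brightness differences but cannot remove intensity-dependent (non-linear) bias.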
27 Normalization across arrays: Baseline scaling method.
28 Normalization across arrays. [Figure panels: raw data; after baseline scaling normalization.]
29 Tests for differential expression of genes. Let $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ be independent measurements of the same probe/gene under two conditions. Whether the gene is differentially expressed between the two conditions can be determined using statistical tests.
30 Tests for differential expression of genes. Important issues for a test procedure are: whether the distributional assumptions are valid; whether the replicates are independent of each other; whether the number of replicates is sufficient; whether outliers have been removed from the sample. Replicates from different experiments should not be mixed, since they have different characteristics and cannot be treated as independent replicates.
31 Tests for differential expression of genes. The most commonly used statistical tests are: (a) Student's t-test, (b) Welch's test, (c) Wilcoxon's rank sum test, (d) permutation tests. The first two tests assume that the samples are drawn from Gaussian distributions, and the p-values are calculated from a probability distribution function. The latter two are nonparametric, and the p-values are calculated using combinatorial arguments.
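32 Student's t-test. Assumptions: both samples are drawn from Gaussian distributions that have equal variances. With sample means $\bar{x}$, $\bar{y}$ and sample variances $s_x^2$, $s_y^2$, the statistic is $t = (\bar{x} - \bar{y}) / (s_p \sqrt{1/n + 1/m})$, where $s_p^2 = [(n-1)s_x^2 + (m-1)s_y^2]/(m+n-2)$ is the pooled variance. Degrees of freedom: m+n-2. Welch's test is a variant of the t-test where t is calculated as $t = (\bar{x} - \bar{y}) / \sqrt{s_x^2/n + s_y^2/m}$. Welch's test does not assume equal population variances.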
33 Student's t-test. The value of t is supposed to follow a t-distribution. After calculating the value of t, we can determine the p-value from the t-distribution with the corresponding degrees of freedom.
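The two t statistics can be sketched with the Python standard library. This is an illustration under stated assumptions: the pooled-variance form is used for Student's test, and the Welch-Satterthwaite formula for Welch's degrees of freedom is the standard textbook choice rather than something stated on the slides. Function names and data are hypothetical, and the p-value lookup from the t-distribution is left out.

```python
from math import sqrt
from statistics import mean, variance

def student_t(x, y):
    """Pooled-variance t statistic and its degrees of freedom (n + m - 2)."""
    n, m = len(x), len(y)
    pooled = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / n + 1 / m))
    return t, n + m - 2

def welch_t(x, y):
    """Unequal-variance t statistic with Welch-Satterthwaite degrees of
    freedom (the df formula is an assumption, not taken from the slides)."""
    n, m = len(x), len(y)
    vx, vy = variance(x) / n, variance(y) / m
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df

# Hypothetical expression values of one gene under two conditions.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
t_student, df_student = student_t(x, y)
t_welch, df_welch = welch_t(x, y)
print(t_student, df_student, t_welch, df_welch)
```

With equal sample sizes the two statistics coincide; only the degrees of freedom differ, which makes Welch's test more conservative when the variances are unequal.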
34 Wilcoxon's rank sum test. Let $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ be independent measurements of the same probe/gene under two conditions. Consider the combined set $x_1, \ldots, x_n, y_1, \ldots, y_m$. The test statistic of the Wilcoxon test is $T = \sum_{i=1}^{n} R(x_i)$, where $R(x_i)$ is the rank of $x_i$ in the combined series. The minimum possible value of T is $n(n+1)/2$. The maximum possible value of T is $n(n+1)/2 + mn$. With rank 1 assigned to the largest value, the minimum and maximum values of T occur if all X data are greater or smaller than the Y data, respectively, i.e. if they are sampled from quite different distributions.
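35 The expected value and variance of T under the null hypothesis are $E_{H_0}(T) = n(m+n+1)/2$ and $\mathrm{Var}_{H_0}(T) = mn(m+n+1)/12$. Unusually low or high values of T compared to the expected value indicate that the null hypothesis should be rejected, i.e. the samples are not from the same population. For larger samples, i.e. m+n > 25, we have the following approximation: $(T - E_{H_0}(T)) / \sqrt{\mathrm{Var}_{H_0}(T)}$ approximately follows a standard normal distribution.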
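36 Wilcoxon's rank sum test (example)
X data: x1 = 7, x2 = 8, x3 = 5, x4 = 9, x5 = 7
Y data: y1 = 5, y2 = 6, y3 = 8, y4 = 4
Combined X and Y data, ranked from the largest value:
Sample   Data   Rank
x4       9      1
x2       8      2
y3       8      3
x5       7      4
x1       7      5
y2       6      6
y1       5      7
x3       5      8
y4       4      9
n = 5, m = 4
T = R(x1) + R(x2) + R(x3) + R(x4) + R(x5) = 5 + 2 + 8 + 1 + 4 = 20
E_H0(T) = n(m+n+1)/2 = 5(4+5+1)/2 = 25
Var_H0(T) = mn(m+n+1)/12 = 5*4(4+5+1)/12 = 50/3 = 16.67
P-value = (from chart)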
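The example can be checked with a short Python sketch. The data are the values read from the slide's tables, with x5 = 7 taken from the combined ranking. Two choices here are assumptions: rank 1 goes to the largest value, as in the example, and tied values share the average of their ranks, which leaves T unchanged for this data. The function names are hypothetical, and the z value is only illustrative because the normal approximation is meant for m+n > 25.

```python
from math import sqrt

def rank_sum(x, y):
    """Sum of the ranks of the x values in the combined sample.

    Rank 1 is given to the largest value; ties get the average rank.
    """
    labelled = sorted([(v, True) for v in x] + [(v, False) for v in y],
                      key=lambda pair: pair[0], reverse=True)
    total, i = 0.0, 0
    while i < len(labelled):
        j = i
        while j < len(labelled) and labelled[j][0] == labelled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2            # average of ranks i+1 .. j
        total += midrank * sum(1 for _, is_x in labelled[i:j] if is_x)
        i = j
    return total

def null_moments(n, m):
    """Expected value and variance of T under the null hypothesis."""
    return n * (m + n + 1) / 2, m * n * (m + n + 1) / 12

x = [7, 8, 5, 9, 7]
y = [5, 6, 8, 4]
T = rank_sum(x, y)
expected, var = null_moments(len(x), len(y))
z = (T - expected) / sqrt(var)   # illustrative only: m + n = 9 is small
print(T, expected, var, z)
```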
38 Multiple testing and FDR. Single-gene analysis using statistical tests has a drawback. It arises from the fact that, while analyzing microarray data, we conduct thousands of tests in parallel. Suppose we select genes at a significance level α = 0.05, i.e. a false positive rate of 5%. This means that we expect 500 of the individual tests to be false positives (5% of 10,000 tests, for example), which is not acceptable. Therefore corrections for multiple testing are applied while analyzing microarray data.
39 Multiple testing and FDR. Let $\alpha_g$ be the global significance level and $\alpha_s$ the significance level at the single-gene level. For a single gene, the probability of making a correct decision is $1 - \alpha_s$. Therefore the probability of making the correct decision for all n genes (i.e. at the global level), assuming independent tests, is $(1 - \alpha_s)^n$. The probability of drawing the wrong conclusion in at least one of the n tests is then $\alpha_g = 1 - (1 - \alpha_s)^n$. For example, if we have 100 different genes and $\alpha_s = 0.05$, the probability that we make at least one error is $1 - 0.95^{100} \approx 0.994$. This is very high; this probability is called the family-wise error rate (FWER).
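The example with 100 genes can be reproduced numerically. This sketch assumes the tests are independent, so that the probability of no error in n tests is the product of the single-test probabilities. The function name is hypothetical.

```python
def family_wise_error_rate(alpha_s, n):
    """Probability of at least one false positive in n independent tests,
    each carried out at single-test significance level alpha_s."""
    return 1 - (1 - alpha_s) ** n

fwer_100 = family_wise_error_rate(0.05, 100)
print(round(fwer_100, 3))
```

For 100 genes tested at a single-gene level of 0.05, the result is about 0.994, so at least one false positive is almost certain.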
40 Multiple testing and FDR. Using the binomial expansion we can write $(1 - \alpha_s)^n \approx 1 - n\alpha_s$ for small $\alpha_s$. Thus $\alpha_g \approx n\alpha_s$, i.e. $\alpha_s \approx \alpha_g / n$. Therefore the Bonferroni correction of the single-gene level is the global level divided by the number of tests. Therefore, for a FWER of … with n = … genes, the p-value at the single-gene level should be $10^{-6}$. Usually very few genes can meet this requirement. Therefore we need to adjust the threshold p-value for the single-gene case.
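The Bonferroni threshold and the exact single-test level it approximates can be compared in a short sketch. The exact level comes from solving the global level = 1 - (1 - single level)^n for the single level, under the independence assumption. The function names and the example values (a global level of 0.05 and 100 tests) are hypothetical and are not the values lost from the slide.

```python
def bonferroni_threshold(alpha_g, n):
    """Single-test level under the Bonferroni correction: alpha_g / n."""
    return alpha_g / n

def exact_single_test_level(alpha_g, n):
    """Single-test level that gives exactly alpha_g at the global level
    when the n tests are independent."""
    return 1 - (1 - alpha_g) ** (1 / n)

# Hypothetical example: global level 0.05 spread over 100 tests.
bonf = bonferroni_threshold(0.05, 100)
exact = exact_single_test_level(0.05, 100)
print(bonf, exact)
```

The Bonferroni value is slightly smaller than the exact one, so the correction is marginally conservative, and the two agree closely whenever the global level is small.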
41 Multiple testing and FDR. A method for adjusting p-values is given in the following book: Westfall P. H. and Young S. S. (1993), Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, Wiley, New York.
42 Multiple testing and FDR. An alternative to controlling the FWER is the computation of the false discovery rate (FDR). The following papers discuss the FDR: Storey J. D. and Tibshirani R. (2003), Statistical significance for genomewide studies, PNAS 100; Benjamini Y. and Hochberg Y. (1995), Controlling the false discovery rate: a practical and powerful approach to multiple testing, J Royal Statist Soc B 57. Still, the practical use of multiple testing is not entirely clear. However, it is clear that we need to adjust the p-value at the single-gene level while testing many genes together.
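As one concrete way of adjusting single-gene p-values, the step-up procedure from the Benjamini and Hochberg (1995) paper cited here can be sketched as follows. The lecture does not spell out the procedure, so this is an illustration of the standard published method rather than the lecturer's own recipe. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    For the i-th smallest p-value p_(i) out of n tests, the adjusted value is
    the minimum over j >= i of p_(j) * n / j, capped at 1.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])   # smallest first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping a running minimum.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for four genes.
p = [0.01, 0.04, 0.03, 0.005]
q = benjamini_hochberg(p)
significant = [qi <= 0.05 for qi in q]
print(q, significant)
```

Genes whose adjusted value is at most the chosen FDR level (0.05 in this sketch) are called significant. Unlike the Bonferroni correction, the threshold adapts to how many small p-values are present.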