1 Data Analysis for Gene Chip Data Part I: One-gene-at-a-time methods Min-Te Chao 2002/10/28.


1 1 Data Analysis for Gene Chip Data Part I: One-gene-at-a-time methods Min-Te Chao 2002/10/28

2 2 Outline: Simple description of gene chip data; Earlier works; Multiple t-test and SAM; Lee’s ANOVA; Wong’s factor models; Efron’s empirical Bayes

3 3 Remarks: Most of the work is statistical analysis, not really of the machine-learning type. The training samples are very small, not to mention the test samples. Medical research needs scientific rigor wherever possible.

4 4 Arthritis and Rheumatism Guidelines for the submission and reviews of reports involving microarray technology v.46, no. 4, 859-861

5 5 Reproducibility: Should document the accuracy and precision of the data, including the run-to-run variability of each gene. No arbitrary setting of thresholds (e.g., 2-fold). Careful evaluation of the false discovery rate.

6 6 Statistical Analysis: Statistical analysis is absolutely necessary to support claims of an increase or decrease of gene expression. Such rigor requires multiple experiments and analysis with standard statistical instruments.

7 7 Sample Heterogeneity … Strongly recommends that investigators focus studies on homogeneous cell populations until other methodological and data analysis problems can be resolved.

8 8 Independent Confirmation: It is important that the findings be confirmed using an independent method, preferably with separate samples rather than retesting of the original mRNA.

9 9 Microarray Other terms: DNA array DNA chips biochips Gene chips

10 10 The underlying principle is the same for all microarrays, no matter how they are made Gene function is the key element researchers want to extract from the sequence DNA array is one of the most important tools (Nature, v.416, April 2002 885-891)

11 11 2 types of microarray cDNA Oligonucleotides DIY type

12 12 Microarray allows the researchers to determine which genes are being expressed in a given cell type at a particular time and under particular condition Gene-expression

13 13 Basic data form: On each array, there are p “spots” (p>1000, sometimes 20000). Each spot has k probes (k=20 or so). There are usually 2k measurements (expressions) per spot, and the k differences, or the differences of logs, are used. Sometimes only a summary statistic per spot is given, e.g. the median or the mean.
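The reduction from 2k measurements to k differences and then to one summary per spot can be sketched as follows. This is only an illustration: the names `pm` and `mm` for the two measurements per probe are borrowed from the PM-MM notation used later in these slides, and the function name `summarize_spots` is hypothetical.

```python
import numpy as np

def summarize_spots(pm, mm, use_logs=False):
    """Reduce 2k measurements per spot to k differences and per-spot summaries.

    pm, mm: arrays of shape (p, k), the two measurements for each of the
            k probes on each of the p spots (names are assumptions).
    Returns the (p, k) differences plus the per-spot median and mean.
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    # Either plain differences or differences of logs, as on the slide.
    diffs = np.log(pm) - np.log(mm) if use_logs else pm - mm
    return diffs, np.median(diffs, axis=1), diffs.mean(axis=1)
```

Which summary is preferable depends on the noise in the probes; the median is less sensitive to a single aberrant probe than the mean.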

14 14 Each spot corresponds to a “gene”. For each study, we can arrange the chips so that the i-th spot represents the i-th gene (genes close in index may not be close physically at all). This means that when we read the i-th spot of all chips in one study, we know we get different measurements of the same i-th gene.

15 15 The data from one chip form a row (Y; X_1, X_2, …, X_p), and the n chips of a study stack into a matrix, just as in a regression setup. But in practice n (the number of chips used) is small compared with p. Y is the response: cell type, experimental condition, survival time, …

16 16 For a spot with 20 probes, see Efron et al. (2001, JASA, p.1153).

17 17 Earlier works Cluster analysis Fold methods Multiple t with Bonferroni correction

18 18 Multiple t with Bonferroni correction: It is too conservative. Family-wise error rate: among G tests, the probability of at least one false rejection. Without correction it goes to 1 at an exponential rate in G.

19 19 Sidak’s single-step adjusted p-value: p’=1-(1-p)^G. Bonferroni’s single-step adjusted p-value: p’=min{Gp,1}. Both are very conservative.
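The two single-step adjustments can be written down directly. The sketch below also computes the uncorrected family-wise error rate 1-(1-alpha)^G, which holds under the assumption of G independent tests; the function names are illustrative only.

```python
import numpy as np

def fwer_uncorrected(alpha, G):
    """P(at least one false rejection) for G independent level-alpha tests."""
    return 1.0 - (1.0 - alpha) ** G

def bonferroni_adjust(p, G=None):
    """Bonferroni single-step adjusted p-values: min(G * p, 1)."""
    p = np.asarray(p, dtype=float)
    G = p.size if G is None else G
    return np.minimum(G * p, 1.0)

def sidak_adjust(p, G=None):
    """Sidak single-step adjusted p-values: 1 - (1 - p)^G."""
    p = np.asarray(p, dtype=float)
    G = p.size if G is None else G
    return 1.0 - (1.0 - p) ** G
```

For any p the Sidak value is no larger than the Bonferroni value, but with G in the thousands both push almost every adjusted p-value toward 1.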

20 20 FDR – false discovery rate. Roughly: among all rejected cases, what fraction are rejected wrongly? (Benjamini and Hochberg 1995 JRSSB, 289-300) “Sequential p-method”

21 21 Sequential p-method: Using the observed data, it estimates the rejection region so that the FDR < alpha. Order all p-values from small to large and obtain a k so that the first k hypotheses (those with the k smallest p-values) are rejected.
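A minimal sketch of the ordering-and-cutoff step, using the step-up rule of Benjamini and Hochberg (1995): k is the largest index i with p_(i) <= alpha * i / G. The function name is illustrative, and the rule controls the FDR at level alpha under independence of the tests.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return (k, reject): the number of rejections and a boolean mask."""
    p = np.asarray(pvals, dtype=float)
    G = p.size
    order = np.argsort(p)                    # indices from smallest p to largest
    sorted_p = p[order]
    thresholds = alpha * np.arange(1, G + 1) / G
    below = np.nonzero(sorted_p <= thresholds)[0]
    # Step-up: take the LARGEST i that passes, even if smaller i failed.
    k = 0 if below.size == 0 else int(below[-1]) + 1
    reject = np.zeros(G, dtype=bool)
    reject[order[:k]] = True
    return k, reject
```

For example, with p-values 0.02, 0.02, 0.03, 0.04 and alpha = 0.05, the smallest p-value fails its own threshold (0.0125) but all four hypotheses are still rejected, because the fourth passes 0.05.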

22 22 Since we have a different definition for error to control, it will increase the “power” For modifications, see Storey (2002, JRSSB, 479-498) These are criteria specifically designed to handle risk assessment when G is large

23 23 Role of permutation: For tests (multiple or not), it is important to use a null distribution. It is generated by a well-designed permutation of the columns of the data matrix; a column refers to an observation, not a gene.

24 24 One simple example: Let us say we look at the first gene, with n_1 arrays for treatment and n_2 arrays for control. We use a t-statistic, t_1, say. What is the p-value corresponding to this observed t_1?

25 25 Permute the n=n_1+n_2 columns of the data matrix. Look at the first row (it corresponds to the first gene). Treat the first n_1 numbers as a fake “treatment” and the last n_2 numbers as a fake “control”, and compute a t-value; say we get s_1.

26 26 Permute again, do the same thing, and we get s_2. Do it B times and get s_1, s_2, …, s_B. Treat these s’s as a (bootstrap) sample from the null distribution of the t_1 statistic. The p-value of the earlier t_1 is found from the ecdf of the s_j, j=1,2,…,B, i.e. as the fraction of the s_j at least as extreme as t_1.
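The permutation recipe for a single gene can be sketched as below. The slides do not say which t-statistic is used or whether the test is one- or two-sided, so the unequal-variance form and the two-sided p-value here are assumptions, as are the function names.

```python
import numpy as np

def t_stat(x, n1):
    """Two-sample t-statistic: first n1 values vs. the rest (unequal variances)."""
    a, b = x[:n1], x[n1:]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

def permutation_pvalue(row, n1, B=1000, seed=0):
    """Two-sided permutation p-value for one gene (one row of the data matrix)."""
    rng = np.random.default_rng(seed)
    row = np.asarray(row, dtype=float)
    t1 = t_stat(row, n1)
    # Each permutation relabels the columns: a fake treatment and a fake control.
    s = np.array([t_stat(rng.permutation(row), n1) for _ in range(B)])
    # Fraction of permuted statistics at least as extreme as the observed one.
    p = float(np.mean(np.abs(s) >= abs(t1)))
    return t1, p
```

With small n_1 and n_2 the number of distinct relabelings is limited, so the attainable p-values are coarse; for 5 versus 5 arrays there are only 252 distinct splits.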

27 27 Permutation plays a major role --- finding a reference measure of variation in various situations For a well designed experiment with microarray, DOE techniques will play an important role in determining how to do proper permutations.

28 28 SAM – Significance Analysis of Microarrays. A standard method of microarray analysis, taught many times in Stanford short courses on data mining. Modified multiple t-tests. Uses permutations of certain data columns to evaluate the variation of the data in each gene.

29 29 The original paper is hard to read: (Tusher, Tibshirani and Chu, PNAS 2001, v.98, no.9, 5116-5121). But the SAM manual is a lot easier to read for statisticians: (free software for academic use)

30 30 D(i) = (X_treatment(i) – X_control(i)) / (s(i) + s_0), i=1,2,…,G, where the X’s are the group means for gene i. Order them: D(1)<D(2)<… In SAM, s_0 is a carefully determined constant >0.

31 31 The D(i)* are computed from a certain group of permutations of the columns; the D(i)* are also ordered. Plot D vs. D*: points that deviate from the 45-degree line by more than a threshold Delta are signals of significant expression change. Vary Delta to get different FDRs.
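A simplified sketch of the D versus D* comparison follows. It is not the SAM software: s_0 is passed in by hand rather than determined from the data, s(i) is taken to be the pooled standard error, the expected D* values are plain averages of the ordered permuted statistics, and every gene further than Delta from the 45-degree line is flagged. All function names are illustrative.

```python
import numpy as np

def sam_d(X, n1, s0):
    """D(i) for each row of X (genes x arrays); first n1 columns are treatment."""
    a, b = X[:, :n1], X[:, n1:]
    n2 = b.shape[1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    ss = ((a - a.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) \
       + ((b - b.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s = np.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))  # pooled std. error
    return diff / (s + s0)

def sam_sketch(X, n1, s0, delta, B=200, seed=0):
    """Return D, the expected ordered D*, and a mask of flagged genes."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    d = sam_d(X, n1, s0)
    order = np.argsort(d)
    d_sorted = d[order]
    d_star = np.zeros_like(d_sorted)
    for _ in range(B):
        perm = rng.permutation(X.shape[1])        # permute columns (arrays)
        d_star += np.sort(sam_d(X[:, perm], n1, s0))
    d_star /= B                                   # average ordered statistics
    flagged = np.zeros(d.size, dtype=bool)
    # Distance from the 45-degree line in the plot of ordered D vs. D*.
    flagged[order] = np.abs(d_sorted - d_star) > delta
    return d, d_star, flagged
```

Counting how many genes the same Delta flags in the permuted data sets, relative to the number flagged in the real data, is one way to attach an estimated FDR to each Delta.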

32 32 Other model-based methods: Wong’s model, PM-MM = \theta \phi + \epsilon; outlier detection; model validation. Li and Wong (2001, PNAS v.98, no.1, 31-36)
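Read with indices, the model says that the PM-MM difference for array i and probe j is theta_i * phi_j plus noise, a rank-one structure. A least-squares fit of that structure can be sketched with an SVD. This is an illustration, not Li and Wong's fitting procedure (which also excludes outliers); the scaling constraint sum_j phi_j^2 = J and the function name are assumptions made here to fix the scale.

```python
import numpy as np

def fit_rank_one(Y):
    """Least-squares fit of Y[i, j] ~ theta[i] * phi[j].

    Y: (arrays x probes) matrix of PM-MM differences for ONE gene.
    Scale is fixed by sum(phi**2) == J (an assumption of this sketch).
    """
    Y = np.asarray(Y, dtype=float)
    J = Y.shape[1]
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    phi = Vt[0] * np.sqrt(J)             # probe effects, sum of squares = J
    theta = U[:, 0] * S[0] / np.sqrt(J)  # per-array expression index
    if phi.sum() < 0:                    # resolve the sign ambiguity of the SVD
        phi, theta = -phi, -theta
    resid = Y - np.outer(theta, phi)     # large residuals hint at outliers
    return theta, phi, resid
```

Unusually large entries of `resid`, or whole rows or columns with large residuals, are natural candidates for the outlier detection and model validation steps mentioned on the slide.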

33 33 Lee’s work ANOVA based May do unbalanced data – e.g., 7 microarray chips (Lee et al. 2000, PNAS, v.97, 9834-9839)

34 34 Empirical Bayes (Efron et al. (2001) JASA, v.96, 1151-1160). Use a mixture model f(z)=p_0 f_0(z)+p_1 f_1(z), with f_0, f_1 estimated from the data. p_1=prior prob that a gene expression is affected (by a treatment)

35 35 A key idea is to use permuted (columns) data to estimate f_0. Use a tricky logistic regression method. The end result is p_1(Z), the a posteriori probability that a gene at expression level Z is affected.
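The posterior probability follows from the mixture model by Bayes' rule. The short derivation below assumes only that the two mixture weights sum to one; it shows why estimating the ratio f_0/f is what matters.

```latex
\begin{align*}
f(z) &= p_0 f_0(z) + p_1 f_1(z), \qquad p_0 + p_1 = 1,\\
p_1(z) &= \frac{p_1 f_1(z)}{f(z)}
        = \frac{f(z) - p_0 f_0(z)}{f(z)}
        = 1 - p_0\,\frac{f_0(z)}{f(z)}.
\end{align*}
```

Since p_1(z) cannot be negative, the same identity gives the bound p_0 <= min_z f(z)/f_0(z), so the data limit how large the prior probability of "unaffected" can be.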

36 36 Part I conclusion: Earlier methods are relatively easy to understand, but getting familiar with the bio-language takes time. More powerful data analytic methods will continue to develop. It is important to first understand the basic problems of biologists before we jump in with fancy stat methods.

37 37 We may do the wrong problem … But if the problem is relevant, even simple methods can get good recognition. All methods so far are “first moment only”, i.e., not too different from multiple t tests; or, they all are one-gene-at-a-time methods.

38 38 We did not address issues about data cleaning, outlier detection, normalization, etc. Microarray data are highly noisy, and these problems are by no means trivial. As the cost per chip goes down, the number of chips per problem may grow. But well-designed experiments, e.g., fractional factorial designs, still have room to play in this game.

39 39 Statistical methods, as compared with machine-learning-based methods, will play a more important role for this type of data since, with a model, parametric or not, one can attach a measure of confidence to the claimed result. This is crucial for scientific development.

40 40 Quote: The statistical literature for microarrays, still in its infancy and with much of it unpublished, has tended to focus on frequentist data-analytical devices, such as cluster analysis, bootstrapping and linear models. (Efron, B. 2001)

