High-dimensional data analysis: Microarrays and multiple testing
Mark van de Wiel (1,2)
(1) Dep. of Mathematics, VU University Amsterdam
(2) Dep. of Biostatistics & Dep. of Pathology, VU University medical center, Amsterdam

Genomics: a short history (1)
Some history
1. Watson & Crick: double helix structure of DNA (1953)
Source: http://ghr.nlm.nih.gov/handbook/illustrations/

Genomics: a short history (2)
2. Human Genome Project: identification of all 20,000-25,000 human genes (1990-2003)
THE WHITE HOUSE, Office of the Press Secretary. For Immediate Release, June 25, 2000.
PRESIDENT CLINTON ANNOUNCES THE COMPLETION OF THE FIRST SURVEY OF THE ENTIRE HUMAN GENOME. Hails Public and Private Efforts Leading to This Historic Achievement.
June 26, 2000. Today, at a historic White House event with British Prime Minister Tony Blair, President Clinton announced that the international Human Genome Project and Celera Genomics Corporation have both completed an initial sequencing of the human genome -- the genetic blueprint for human beings.

Genomics: a short history (3)
3a. 1961 DNA hybridisation discovered
3b. 1994 Introduction of robotics (Hoheisel et al.)
3c. 1995 First microarray publication (Schena et al.)
3d. 1997 First whole genome microarray experiments (DeRisi et al.)
3e. 1999 First publication on microarrays for cancer classification (Golub et al.): Leukemia / Affymetrix arrays

Central dogma
1. DNA is the same in each cell (tumours are an exception)
2. Function of the cell is determined by proteins
3. The path from DNA to proteins goes via messenger RNA (mRNA)
4. DNA is transcribed to mRNA according to the needs of that cell
5. mRNA contains the instructions for what proteins to build
Microarrays measure the amount of mRNA
DNA → mRNA → protein

Microarrays (1)
Source: http://www.cottongenomics.org/
Source: http://research.yale.edu/ysm/

Microarrays (2)
1. Isolation of mRNA (single-stranded DNA; genes)
2. Labeling with a color molecule
3. Chip contains probes which uniquely correspond to genes
4. Hybridization to the chip
5. Laser to read labeled molecules
6. Image analysis converts colors to numbers (intensities)
7. Result: data matrix with 2 intensities for each array
Microarray Movie

The result
The number of rows (e.g. 44,000) is determined by the number of probes (> number of genes).
More genes than samples: high-dimensional setting.

Statistical issues before data analysis
1. Design of the experiment (not discussed)
2. Quality control (not discussed)
3. Normalization
Data are visualized by an MA plot. Use of different dyes (colours) may lead to a non-linear dye bias. This needs to be removed since it is artificial.
M = log2(R/G) = log2(R) - log2(G)
A = log2(R*G) = log2(R) + log2(G)
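The M and A definitions above can be sketched in a few lines of Python. This is an illustration only: the function name `ma_values` and the toy intensities are assumptions, not part of the slides.

```python
import numpy as np

def ma_values(red, green):
    """Return (M, A) using the slide's definitions:
    M = log2(R) - log2(G), A = log2(R) + log2(G)."""
    log_r = np.log2(np.asarray(red, dtype=float))
    log_g = np.log2(np.asarray(green, dtype=float))
    return log_r - log_g, log_r + log_g

# Toy intensities (hypothetical): three spots on one two-colour array.
red = np.array([1024.0, 512.0, 256.0])
green = np.array([256.0, 512.0, 1024.0])
M, A = ma_values(red, green)
# M is 2, 0, -2: the first spot is four times brighter in red, the last in green.
# A is 18 for all three spots, because R*G is the same for each of them.
print(M, A)
```

Note that many texts define A as the average log-intensity, 0.5 * log2(R*G); the sketch follows the slide's unscaled version.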

Normalization
Purpose: remove artificial dye effects to obtain unbiased M values. Most popular method: Loess. Assumption: mean M value equals 0 for all intensity ranges.
Algorithm
1. Sort the A values: A'_1, ..., A'_p, with M'_i the M value belonging to A'_i.
2. For each A'_i, define the window W_i = [A'_i - L, A'_i + L].
3. Within each W_i, fit the linear regression M = a_i + b_i A + ε.
4. Predict M'_i(pred) = a_i + b_i A'_i.
5. Subtract M'_i(pred) from M'_i.
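The five steps above can be sketched as follows. This is a minimal, unweighted version of the windowed regression described on the slide, not a full loess (which typically adds distance weights and a nearest-neighbour span). The function name `local_linear_normalize`, the half-width `L = 1.0` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def local_linear_normalize(M, A, L):
    """Subtract from each M_i the prediction of a straight line fitted
    to all spots whose A value lies in [A_i - L, A_i + L]."""
    M = np.asarray(M, dtype=float)
    A = np.asarray(A, dtype=float)
    M_norm = np.empty_like(M)
    for i in range(len(A)):
        in_window = np.abs(A - A[i]) <= L          # step 2: window W_i
        a_w, m_w = A[in_window], M[in_window]
        if a_w.max() - a_w.min() == 0:
            # Degenerate window (all A values equal): use the window mean.
            pred = m_w.mean()
        else:
            slope, intercept = np.polyfit(a_w, m_w, 1)   # step 3: M = a + bA
            pred = intercept + slope * A[i]              # step 4: prediction
        M_norm[i] = M[i] - pred                          # step 5: subtract
    return M_norm

# Synthetic example (hypothetical): an intensity-dependent dye bias plus noise.
rng = np.random.default_rng(0)
A = rng.uniform(6.0, 16.0, size=500)
M = 0.02 * (A - 11.0) ** 2 + rng.normal(0.0, 0.2, size=500)
M_norm = local_linear_normalize(M, A, L=1.0)
print(M.mean(), M_norm.mean())
```

Explicit sorting (step 1) is not needed here because each window is selected directly by A value; sorting mainly helps when the windows are computed by sliding over ordered data.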

Loess: before and after normalization (figure)

After normalization Log2-ratios for further analysis. Ratios: cancel out experimental spot effect, log to obtain symmetric scale. However, nowadays log-intensities (both dyes) are used more and more often.

Data: type of response
Nominal, e.g. tumor type: R = {benign, malignant}
Ordinal, e.g. stage of a tumor: R = {1, 2, 3, 4}
Continuous, e.g. disease severity score: R = ℝ+
Censored, e.g. survival: R = ℝ+ × {0, 1}

Typical data analyses for microarrays (1)
Multivariate
Unsupervised: clustering, principal component analysis
Classification (statistical learning, discriminant analysis, supervised clustering)
Multivariate regression with penalty for overfitting (e.g. Lasso / Ridge regression)
Prognostic multivariate survival models

Typical data analyses for microarrays (2)
Univariate
Inference (hypothesis testing). Expression of each gene is related to clinical response using, for example,
– ANOVA
– Linear regression
– Cox regression (survival)
– Permutation (nonparametric) tests
Hybrid
Inference for sets of genes that are functionally related

Two-step ANOVA (1)
Model (1) is the normalization model; gene effects enter it only through the residual u. That is, the residual u contains all gene-specific factors.
Model (2) is the differential expression model.
Indices: a: array; c: condition; d: dye; g: gene

Two-step ANOVA (2) Use of the two-step ANOVA: first fit (1) on all data, then estimate residuals u for each gene, then fit (2) for each gene separately. Main advantage with respect to one-level model: computational. One-level model would require fitting many parameters simultaneously in one ANOVA. Computation of raw p-values is the same as for usual ANOVA.

Multiple Testing: motivation
Histogram of 20,000 p-values generated under H_0.
Even when all 20,000 null hypotheses are true, we expect 20,000 × 0.05 = 1,000 p-values smaller than α = 0.05!
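The arithmetic above can be checked with a small simulation. The sketch assumes that p-values are uniformly distributed on [0, 1] under the null hypothesis (which holds for continuous test statistics); the seed and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_tests, alpha = 20_000, 0.05

# Under H_0 every p-value is a draw from Uniform(0, 1).
p_values = rng.uniform(0.0, 1.0, size=n_tests)

n_false_positives = int(np.sum(p_values < alpha))
expected = n_tests * alpha  # 20,000 * 0.05 = 1,000

print(f"observed: {n_false_positives}, expected: {expected:.0f}")
```

The observed count fluctuates around 1,000 from run to run, but every one of these "discoveries" is a false positive, which is why unadjusted per-gene testing at α = 0.05 is not usable in this setting.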

Multiple Testing. Illustration of Benjamini-Hochberg procedure
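As a complement to the illustration, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in its standard textbook form: sort the m p-values, find the largest rank k with p_(k) <= k*q/m, and reject the k smallest. The function name `benjamini_hochberg`, the level `q` and the example p-values are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking the hypotheses rejected by the
    Benjamini-Hochberg procedure at false discovery rate level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                       # indices of p sorted ascending
    thresholds = q * np.arange(1, m + 1) / m    # k * q / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])        # largest rank meeting its threshold
        reject[order[:k + 1]] = True            # reject everything up to that rank
    return reject

# Hypothetical p-values; thresholds for m = 4, q = 0.05 are 0.0125, 0.025, 0.0375, 0.05.
p_example = [0.5, 0.036, 0.03, 0.035]
rejected = benjamini_hochberg(p_example, q=0.05)
print(rejected)
```

In this example only the third-smallest p-value (0.036) falls below its own threshold (0.0375), yet the two smaller p-values are rejected as well. That is the step-up character of the procedure: the decision is driven by the largest rank that passes, not by each p-value individually.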

Multiple Testing M
