Summer Inst. Of Epidemiology and Biostatistics, 2008: Gene Expression Data Analysis 8:30am-12:30pm in Room W2017 Carlo Colantuoni –

Class Outline
Basic Biology & Gene Expression Analysis Technology
Data Preprocessing, Normalization, & QC
Measures of Differential Expression
Multiple Comparison Problem
Clustering and Classification
The R Statistical Language and Bioconductor
GRADES – independent project with Affymetrix data.

Class Outline - Detailed
Basic Biology & Gene Expression Analysis Technology
–The Biology of Our Genome & Transcriptome
–Genome and Transcriptome Structure & Databases
–Gene Expression & Microarray Technology
Data Preprocessing, Normalization, & QC
–Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
–Background correction (PM-MM, RMA, GCRMA)
–Global Mean Normalization
–Loess Normalization
–Quantile Normalization (RMA & GCRMA)
–Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
–Quality Control: PCA and MDS for dimension reduction
Measures of Differential Expression
–Basic Statistical Concepts
–T-tests and Associated Problems
–Significance analysis in microarrays (SAM) [ & Empirical Bayes]
–Complex ANOVA’s (limma package in R)
Multiple Comparison Problem
–Bonferroni
–False Discovery Rate Analysis (FDR)
Differential Expression of Functional Gene Groups
–Functional Annotation of the Genome
–Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
–Gene Set Enrichment Analysis (GSEA)
–Parametric Analysis of Gene Set Enrichment (PAGE)
–geneSetTest
–Notes on Experimental Design
Clustering and Classification
–Hierarchical clustering
–K-means
–Classification: LDA (PAM), kNN, Random Forests
–Cross-Validation
Additional Topics
–The R Statistical Language
–Bioconductor
–Affymetrix data processing example!

DAY #2:
Intensity Comparison & Ratio vs. Intensity Plots
Log transformation
Background correction (Affymetrix, 2-color, other)
Normalization: global and local mean centering
Normalization: quantile normalization
Batches, plates, pins, hybs, washes, and other artifacts
QC: PCA and MDS for dimension reduction

Microarray Data Quantification (figure axis: Log Intensity)

Microarray Data Quantification (figure axes: Log Intensity, Log Ratio)

Logarithmic Transformation: if $\log_z(x) = y$ then $z^y = x$
Logarithm math refresher:
$\log(x) + \log(y) = \log(x \cdot y)$
$\log(x) - \log(y) = \log(x / y)$
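The second identity is the reason log ratios are convenient for expression data: a fold change becomes a difference of logs, and equal-sized increases and decreases sit symmetrically around 0. A minimal sketch, assuming base-2 logs and made-up intensities of 200 and 400 units (the function name is illustrative, not from the slides):

```python
import math

def log2_ratio(x, y):
    """Log2 fold change of x over y, computed as a difference of logs."""
    return math.log2(x) - math.log2(y)

# A 2-fold increase and a 2-fold decrease are mirror images on the log scale,
# whereas on the linear scale the ratios would be 2.0 and 0.5.
up = log2_ratio(400.0, 200.0)
down = log2_ratio(200.0, 400.0)
print(up, down)
```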

Intensity vs. Intensity: LINEAR Intensity Distribution: LINEAR

Intensity vs. Intensity: LOG Intensity Distribution: LOG

Intensity vs. Intensity: LINEAR

Intensity vs. Intensity: LOG

Microarray Data Quantification
Int vs. Int: LINEAR
Int vs. Int: LOG
Ratio vs. Int: LOG

Background Subtraction

Before Hybridization Array 1 Array 2 Sample 1 Sample 2

After Hybridization Array 1 Array 2

More Realistic - Before Array 1 Array 2 Sample 1 Sample 2

Array 1 Array 2 More Realistic - After

poly C No label

Intensity distributions for the no-label and Yeast DNA

Why Adjust for Background?
The presence of background noise is clear from the fact that the minimum PM intensity is not 0 and that the geometric mean of the probesets with no spike-in is around 200 units.

Local slope decreases as nominal concentration decreases!
When expression is small relative to background, $(E_1 + B) \approx B$, so $(E_1 + B) / (E_2 + B) \approx 1$.
When expression is large relative to background, $(E_1 + B) \approx E_1$, so $(E_1 + B) / (E_2 + B) \approx E_1 / E_2$.
By using the log-scale transformation before analyzing microarray data, investigators have, implicitly or explicitly, assumed a multiplicative measurement error model (Dudoit et al., 2002; Newton et al., 2001; Kerr et al., 200; Wolfinger et al., 2001). The fact, seen in Figure 2, that observed intensities increase linearly with concentration in the original scale but not in the log-scale suggests that background noise is additive with non-zero mean. Durbin et al. (2002), Huber et al. (2002), Cui, Kerr, and Churchill (2003), and Irizarry et al. (2003a) have proposed additive-background-multiplicative-measurement-error models for intensities read from microarray scanners.
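The compression of ratios by additive background can be checked numerically. The sketch below assumes a constant background of 200 units (roughly the level quoted for probesets with no spike-in) and a true 2-fold change at every expression level; the function name and the expression levels are illustrative only.

```python
def observed_ratio(e1, e2, b):
    """Ratio actually measured when the same additive background b hits both values."""
    return (e1 + b) / (e2 + b)

B = 200.0  # assumed constant background, in intensity units

# True ratio is always 2; the observed ratio approaches 1 at low expression
# and approaches the true ratio at high expression.
for e2 in (10.0, 100.0, 1000.0, 10000.0):
    e1 = 2.0 * e2
    print(e2, round(observed_ratio(e1, e2, B), 3))
```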

Affymetrix GeneChip Design
Reference sequence (5’ to 3’): …TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…
Perfect match (PM), NSB & SB: GTACTACCCAGTCTTCCGGAGGCTA
Mismatch (MM), NSB: GTACTACCCAGTGTTCCGGAGGCTA

Why not subtract MM?

Background: Solutions

Affymetrix GeneChip Design
Reference sequence (5’ to 3’): …TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…
Perfect match (PM), NSB & SB: GTACTACCCAGTCTTCCGGAGGCTA
Mismatch (MM), NSB: GTACTACCCAGTGTTCCGGAGGCTA

Motivation: PM - MM
The hope is that:
PM = B + S
MM = B
PM – MM = S
But this is not correct!

Simulation
We create some feature level data for two replicate arrays
Then compute Y = log(PM - k·MM) for each array
We make an MA plot using the Ys for each array
We make an observed concentration versus known concentration plot
We do this for various values of k. The following “movie” shows k moving from 0 to 1.
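The simulation can be sketched roughly as follows. The slides do not give the generating model, so everything here is an assumption: signal carries multiplicative error, PM and MM each receive independent additive background with mean 200 and SD 20, MM is treated as pure background, and features where PM - k·MM is not positive are dropped because their log is undefined.

```python
import math
import random

def simulate_array(signals, k, rng, bg_mean=200.0, bg_sd=20.0, noise_sd=0.1):
    """Return Y = log2(PM - k*MM) per feature; None where PM - k*MM <= 0."""
    ys = []
    for s in signals:
        # Multiplicative error on the signal, additive background on both probes.
        pm = s * math.exp(rng.gauss(0.0, noise_sd)) + max(rng.gauss(bg_mean, bg_sd), 1.0)
        mm = max(rng.gauss(bg_mean, bg_sd), 1.0)
        diff = pm - k * mm
        ys.append(math.log2(diff) if diff > 0 else None)
    return ys

def ma_values(y1, y2):
    """(M, A) pairs: M is the log ratio, A the mean log level; undefined features skipped."""
    return [(a - b, (a + b) / 2.0) for a, b in zip(y1, y2)
            if a is not None and b is not None]

rng = random.Random(0)
signals = [2.0 ** rng.uniform(0, 12) for _ in range(2000)]  # known levels
for k in (0.0, 0.25, 0.5, 0.75, 1.0):
    ma = ma_values(simulate_array(signals, k, rng), simulate_array(signals, k, rng))
    spread = math.sqrt(sum(m * m for m, _ in ma) / len(ma))
    print(f"k={k:.2f}  features kept={len(ma)}  RMS of M={spread:.3f}")
```

Because the two arrays are replicates, every M should be 0; the RMS of M is therefore a simple stand-in for the width of the MA plot at each k.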

[Figure sequence for k = 0, 1/4, 1/2, 3/4, 1. Each frame shows two panels: Observed level (log2) vs. Known level (log2), and Log2(Ratio) vs. Log2(Intensity).]

Real Data: MAS 5.0, RMA

RMA: The Basic Idea
PM = B + S
Observed: PM
Of interest: S
Pose a statistical model and use it to predict S from the observed PM

The Basic Idea
PM = B + S
A mathematically convenient, useful model
–$B \sim \text{Normal}(\mu, \sigma)$, $S \sim \text{Exponential}(\lambda)$
–No MM
–Borrowing strength across probes
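Under this model the prediction of S from an observed PM has a closed form: given PM, S follows a normal distribution with mean PM - μ - σ²·rate and standard deviation σ, truncated to S > 0. The sketch below evaluates the mean of that truncated normal. It assumes σ is a standard deviation, the exponential is parameterized by its rate, and all three parameters are already known; how RMA estimates them from data is not covered here, and the function names are illustrative.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def expected_signal(pm, mu, sigma, rate):
    """E[S | PM = pm] for PM = B + S, B ~ Normal(mu, sd=sigma), S ~ Exponential(rate)."""
    a = pm - mu - sigma * sigma * rate  # mean of the untruncated posterior
    z = a / sigma
    # Mean of a normal(a, sigma) truncated to positive values.
    return a + sigma * norm_pdf(z) / norm_cdf(z)

# Hypothetical parameters: background ~ Normal(200, 20), mean signal 1 / 0.01 = 100.
for pm in (150.0, 200.0, 300.0, 5000.0):
    print(pm, round(expected_signal(pm, 200.0, 20.0, 0.01), 2))
```

The corrected value is always positive, even when PM falls below the background mean, which is what lets the method take logs without losing probes. For PM far below the background mean the normal CDF underflows in floating point, so a production version would need a guard there.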

MAS 5.0

RMA: Notice improved precision but worse accuracy

Problem
Global background correction ignores probe-specific NSB
MM have problems
Another possibility: Use probe sequence

Probe-specific Background

G-C content effect in PM’s
Boxplots of log intensities from the array hybridized to Yeast DNA for strata of probes defined by their G-C content. Probes with 6 or fewer G-C are grouped together. Probes with 20 or more are grouped together as well. Smooth density plots are shown for the strata with G-C contents of 6, 10, 14, and 18.
Any given probe will have some propensity to non-specific binding. As described in Section 2.3 and demonstrated in Figure 3, this tends to be directly related to its G-C content. We propose a statistical model that describes the relationship between the PM, MM, and probes of the same G-C content.

General Model (GCRMA)
[Model equation not preserved in the transcript; its terms were labeled NSB and SB.]
We can calculate: [expression not preserved in the transcript]
Due to the variance associated with the measured MM intensities, we argue that one data point is not enough to obtain a useful adjustment. In this paper we propose using probe sequence information to select other probes that can serve the same purpose as the MM pair. We do this by defining subsets of the existing MM probes with similar hybridization properties.

The MA plot shows log fold change as a function of mean log expression level. A set of 14 arrays representing a single experiment from the Affymetrix spike-in data are used for this plot. A total of 13 sets of fold changes are generated by comparing the first array in the set to each of the others. Genes are symbolized by numbers representing the nominal log2 fold change for the gene. Non-differentially expressed genes with observed fold changes larger than 2 are plotted in red. All other probesets are represented with black dots. The smooth lines are 3SDs away with SD depending on log expression.

Another sequence effect in PM’s and MM’s
Naef & Magnasco (2003), PHYSICAL REVIEW E 68, , 2003

We show in Fig. 2 joint probability distributions of PMs and MMs, obtained from all probe pairs in a large set of experiments. Actually, two separate probability distributions are superimposed: in red, the distribution for all probe pairs whose 13th letter is a purine, and in cyan those whose 13th letter is a pyrimidine. The plot clearly shows two distinct branches in two colors, corresponding to the basic distinction between the shapes of the bases: purines are large, double ringed nucleotides while pyrimidines have smaller single rings. This underscores that by replacing the middle letter of the PM with its complementary base, the situation on the MM probe is that the middle letter always faces itself, leading to two quite distinct outcomes according to the size of the nucleotide. If the letter is a purine, there is no room within an undistorted backbone for two large bases, so this mismatch distorts the geometry of the double helix, incurring a large steric and stacking cost. But if the letter is a pyrimidine, there is room to spare, and the bases just dangle. The only energy lost is that of the hydrogen bonds.
Naef & Magnasco (2003), PHYSICAL REVIEW E 68, , 2003

C and T are pyrimidines (and small), A and G are purines (and large).

Why not subtract MM?

Another sequence effect in PM’s
Naef & Magnasco (2003), PHYSICAL REVIEW E 68, , 2003
The asymmetry of (A,T) and (G,C) affinities in Fig. 3 can be explained because only A-U and G-C bonds carry labels (the pyrimidines U and C on the mRNA are labeled). Notice the nearly equal magnitudes of the reduction in both types of bonds. (Remember also that G-C pairs have 3 and A-T pairs have 2 hydrogen bonds!)

Two color platforms (Agilent, cDNA)
Common to have just one feature per gene
60 vs. 25 NT?
Optical noise still a concern
After spots are identified, a measure of local background is obtained from the area around the spot (this is also applicable to some spotted one-channel data)

Local background
–GenePix
–QuantArray
–ScanAnalyze

Two color feature level data
Red and Green foreground and background obtained from each feature
We have $Rf_{gij}$, $Gf_{gij}$, $Rb_{gij}$, $Gb_{gij}$ (g is gene, i is array and j is replicate)
A default summary statistic is the log-ratio: $\log_2[(Rf - Rb) / (Gf - Gb)]$
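The default log-ratio can be computed per feature as below. This is a minimal sketch with made-up intensities. Returning `None` when a background-corrected channel is not positive is one possible convention, chosen here because the log is undefined in that case; it is not something the slides prescribe.

```python
import math

def log_ratio(rf, gf, rb=0.0, gb=0.0):
    """log2[(Rf - Rb) / (Gf - Gb)]; None if either corrected channel is <= 0.

    Leaving rb and gb at 0 gives the no-background-subtraction log ratio.
    """
    r = rf - rb
    g = gf - gb
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)

# Same feature with and without local background subtraction.
print(log_ratio(500.0, 300.0, rb=100.0, gb=100.0))  # background subtracted
print(log_ratio(500.0, 300.0))                      # no background subtraction
```

With a shared background of 100 in both channels, subtraction moves the ratio from 500/300 to 400/200, which is the accuracy gain. The cost is that low-intensity features can become undefined or very noisy.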

Background subtraction No background subtraction

Diagnostics: images of Rb, Gb, scatterplot of log2 (Rf/Gf) vs. log2(Rb/Gb)

Correlation may be spatially dependent

Two color platforms
Again, we can assess the tradeoff of accuracy and precision via simulation
Simulation uses a self versus self (SVS) hybridization experiment: no differential expression should occur.
Mean squared error (MSE) = bias² + variance.
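In a self versus self experiment the true log ratio of every feature is 0, so bias, variance, and MSE can be read directly off the observed log ratios. A small sketch with made-up values; it uses the population variance (divide by n), for which the decomposition holds exactly.

```python
def mse_decomposition(log_ratios, truth=0.0):
    """Return (bias, variance, mse) of observed log ratios against a known truth."""
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    bias = mean - truth
    variance = sum((m - mean) ** 2 for m in log_ratios) / n  # population variance
    mse = sum((m - truth) ** 2 for m in log_ratios) / n
    return bias, variance, mse

# Hypothetical self-vs-self log ratios for one feature across replicate slides.
bias, variance, mse = mse_decomposition([0.1, -0.2, 0.3, 0.2])
print(bias, variance, mse, bias ** 2 + variance)
```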

Lower MSE with no background subtraction (NBS) if correlation < 0.2

Background Subtraction: Conclusions
A procedure that subtracts local background as a function of the correlation of fg and bg ratios may be a nice compromise between background subtraction and no background subtraction.
For references, see the background subtraction paper by C. Kooperberg, J Computational Biol.
The limma package in R has many useful functions for background subtraction.
Following the decision to background subtract, we need to consider a normalization algorithm.

Normalization

Normalization is needed to ensure that differences in intensities are indeed due to differential expression, and not some printing, hybridization, or scanning artifact. Normalization is necessary before any analysis that involves within- or between-slide comparisons of intensities, e.g., clustering, testing. Somewhat different approaches are used in two-color and one-color technologies.

Varying distributions of intensities from each microarray.

Distributions of intensities after global mean normalization.
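Global mean normalization can be sketched as a single shift per array. The version below assumes the values are already log intensities and moves every array to the average of the array means; other targets (a reference array, the median) are equally common, so treat the choice here as an assumption.

```python
def global_mean_normalize(arrays):
    """Shift each array of log intensities so that all arrays share one mean."""
    means = [sum(a) / len(a) for a in arrays]
    target = sum(means) / len(means)
    return [[v - m + target for v in a] for a, m in zip(arrays, means)]

# Two toy arrays with different overall brightness and different spread.
normalized = global_mean_normalize([[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]])
print(normalized)
```

The shift aligns the centers only: the second array is still twice as spread out as the first. That leftover difference in shape is the kind of thing a single global adjustment cannot remove.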

What does this normalization mean in Int vs. Int, or Ratio vs. Int space?

Distributions of intensities after global mean normalization: global mean normalization is not enough …
Possible solutions:
Local Mean Normalization
Quantile Normalization

Local Mean Normalization (loess): Adjusts for intensity-dependent bias in ratios. Requires Comparison!

Loess

Quantile Normalization

Quantile normalization
All these non-linear methods perform similarly
Quantile normalization is commonly used because it's fast and conceptually simple
Basic idea:
–Order the values in each array
–Take the average across arrays at each rank
–Substitute each probe intensity with the average for its rank
–Put the values back in their original order
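The four steps of the basic idea can be sketched in a few lines. This version assumes every array has the same number of probes and does nothing special for ties, which production implementations typically handle more carefully.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length arrays (one list per array)."""
    n = len(arrays[0])
    # 1. Order the values in each array, remembering their original positions.
    orders = [sorted(range(n), key=lambda i, a=a: a[i]) for a in arrays]
    # 2. Average across arrays at each rank.
    rank_means = [sum(a[o[r]] for a, o in zip(arrays, orders)) / len(arrays)
                  for r in range(n)]
    # 3. Substitute each value with the mean for its rank,
    # 4. writing it back at the original position.
    out = []
    for o in orders:
        normalized = [0.0] * n
        for rank, position in enumerate(o):
            normalized[position] = rank_means[rank]
        out.append(normalized)
    return out

result = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(result)
```

After normalization every array contains exactly the same set of values, only in a different order, so all intensity distributions coincide.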

Example of quantile normalization: Original, Ordered, Averaged, Re-ordered

Before Quantile Normalization

After Quantile Normalization
A worry is that it overcorrects

QC

Print-tip Effect

Print-tip Loess

Plate effect

Bad Plate Effect

Print Order Effect

Microarray Pseudo Images: Intensity

Microarray Pseudo Images: Ratios

Images of probe level data: This is the raw data

Images of probe level data: Residuals (or weights) from probe level model fits show the problem clearly

Hybridization Artifacts

PCA, MDS, and Clustering: Dimension Reduction to Detect Experimental Artifacts and Biological Effects

Principal Components Analysis (PCA) and Multi-Dimensional Scaling (MDS)

PCA

MDS

Uncorrected Intensities: MDS Colored by Batch

Removing The Batch Effect: Much Like Red:Green Analysis
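The slides do not spell out how the batch effect is removed. One simple reading of the red:green analogy, where each measurement is expressed relative to a reference, is to subtract from every log intensity the mean of that gene within its own batch. The sketch below implements that assumption for a single gene; the function name and the data are hypothetical.

```python
def subtract_batch_means(values, batches):
    """Center one gene's log intensities within each batch.

    values:  log intensities for one gene, one per array
    batches: batch label for each array, in the same order
    """
    sums, counts = {}, {}
    for v, b in zip(values, batches):
        sums[b] = sums.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    return [v - sums[b] / counts[b] for v, b in zip(values, batches)]

# One gene measured on four arrays from two batches with different baselines.
centered = subtract_batch_means([1.0, 3.0, 10.0, 14.0], ["a", "a", "b", "b"])
print(centered)
```

If a biological factor is confounded with batch, this centering removes that biology along with the artifact, so it is only safe when conditions are balanced across batches.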

Uncorrected Intensities: MDS Colored by Batch

Batch Subtracted Measures: MDS Colored by Batch

MDS of All Array Experiments: Subject Replicates

AGE ?

RNA Quality

AGE Batch

Biological Effects: Tissue Types and Growth Factor Treatments

Illumina 24K