Preprocessing Methods for Two-Color Microarray Data

Slides:



Advertisements
Similar presentations
Lecture 9 Microarray experiments MA plots
Advertisements

© 2008 McGraw-Hill Higher Education The Statistical Imagination Chapter 4. Measuring Averages.
Chap 10: Summarizing Data 10.1: INTRO: Univariate/multivariate data (random samples or batches) can be described using procedures to reveal their structures.
Regression Analysis Using Excel. Econometrics Econometrics is simply the statistical analysis of economic phenomena Here, we just summarize some of the.
Pre-processing in DNA microarray experiments Sandrine Dudoit PH 296, Section 33 13/09/2001.
Gene Expression Index Stat Outline Gene expression index –MAS4, average –MAS5, Tukey Biweight –dChip, model based, multi-array –RMA, model.
Microarray Normalization
Filtering and Normalization of Microarray Gene Expression Data Waclaw Kusnierczyk Norwegian University of Science and Technology Trondheim, Norway.
Microarray technology and analysis of gene expression data Hillevi Lindroos.
Getting the numbers comparable
DNA Microarray Bioinformatics - #27612 Normalization and Statistical Analysis.
Jan. 29 “Statistics” for one quantitative variable… Mean and standard deviation (last week!) “Robust” measures of location (median and its friends) Quartiles,
Low-Level Analysis and QC Regional Biases Mark Reimers, NCI.
Differentially expressed genes
Jan Shapes of distributions… “Statistics” for one quantitative variable… Mean and median Percentiles Standard deviations Transforming data… Rescale:
Identification of spatial biases in Affymetrix oligonucleotide microarrays Jose Manuel Arteaga-Salas, Graham J. G. Upton, William B. Langdon and Andrew.
1 Preprocessing for Affymetrix GeneChip Data 1/18/2011 Copyright © 2011 Dan Nettleton.
Normalization of 2 color arrays Alex Sánchez. Dept. Estadística Universitat de Barcelona.
ViaLogy Lien Chung Jim Breaux, Ph.D. SoCalBSI 2004 “ Improvements to Microarray Analytical Methods and Development of Differential Expression Toolkit ”
Statistical Treatment of Data Significant Figures : number of digits know with certainty + the first in doubt. Rounding off: use the same number of significant.
SW388R7 Data Analysis & Computers II Slide 1 Computing Transformations Transforming variables Transformations for normality Transformations for linearity.
Lecture II-2: Probability Review
Relationships Among Variables
Analysis of microarray data
1 Normalization Methods for Two-Color Microarray Data 1/13/2009 Copyright © 2009 Dan Nettleton.
(4) Within-Array Normalization PNAS, vol. 101, no. 5, Feb Jianqing Fan, Paul Tam, George Vande Woude, and Yi Ren.
Constant process Separate signal & noise Smooth the data: Backward smoother: At any give T, replace the observation yt by a combination of observations.
1 1.1 © 2012 Pearson Education, Inc. Linear Equations in Linear Algebra SYSTEMS OF LINEAR EQUATIONS.
Inference for regression - Simple linear regression
Census A survey to collect data on the entire population.   Data The facts and figures collected, analyzed, and summarized for presentation and.
Numerical Descriptive Techniques
DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4.
Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.
Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.
Measures of Variability In addition to knowing where the center of the distribution is, it is often helpful to know the degree to which individual values.
Microarray - Leukemia vs. normal GeneChip System.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
Lecture Topic 5 Pre-processing AFFY data. Probe Level Analysis The Purpose –Calculate an expression value for each probe set (gene) from the PM.
Summarization of Oligonucleotide Expression Arrays BIOS Winter 2010.
Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments Presented by Nan Lin 13 October 2002.
11/23/2015Slide 1 Using a combination of tables and plots from SPSS plus spreadsheets from Excel, we will show the linkage between correlation and linear.
Relative Values. Statistical Terms n Mean:  the average of the data  sensitive to outlying data n Median:  the middle of the data  not sensitive to.
Statistics for Differential Expression Naomi Altman Oct. 06.
Henrik Bengtsson Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden Plate Effects in cDNA Microarray Data.
Measures of variability: understanding the complexity of natural phenomena.
Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.
Microarray Data Pre-Processing
Microarray hybridization Usually comparative – Ratio between two samples Examples – Tumor vs. normal tissue – Drug treatment vs. no treatment – Embryo.
CSIRO Insert presentation title, do not remove CSIRO from start of footer Experimental Design Why design? removal of technical variance Optimizing your.
(1) Normalization of cDNA microarray data Methods, Vol. 31, no. 4, December 2003 Gordon K. Smyth and Terry Speed.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Sampling Design and Analysis MTH 494 Lecture-21 Ossam Chohan Assistant Professor CIIT Abbottabad.
Variability & Statistical Analysis of Microarray Data GCAT – Georgetown July 2004 Jo Hardin Pomona College
The microarray data analysis Ana Deckmann Carla Judice Jorge Lepikson Jorge Mondego Leandra Scarpari Marcelo Falsarella Carazzolle Michelle Servais Tais.
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
The Normal Approximation for Data. History The normal curve was discovered by Abraham de Moivre around Around 1870, the Belgian mathematician Adolph.
Describing Data: Summary Measures. Identifying the Scale of Measurement Before you analyze the data, identify the measurement scale for each variable.
Microarray Data Analysis Xuming He Department of Statistics University of Illinois at Urbana-Champaign.
Lecture 2 – Pre-processing and Normalization José Luis Mosquera Computational Lab on Microarrays Data Analysis Special Topics in Computer Science Institute.
Linear Algebra Review.
Data Mining: Concepts and Techniques
Normalization Methods for Two-Color Microarray Data
Significance Analysis of Microarrays (SAM)
The Basics of Microarray Image Processing
Significance Analysis of Microarrays (SAM)
Getting the numbers comparable
Normalization for cDNA Microarray Data
Pre-processing AFFY data
Presentation transcript:

Preprocessing Methods for Two-Color Microarray Data 1/15/2011 Copyright © 2011 Dan Nettleton

Preprocessing Steps Background correction Transformation Normalization Summarization

What is background correction? Background correction involves an attempt to remove any portion of a raw fluorescence intensity measurement that is not attributable to fluorescence from target nucleic acid molecules hybridized to their complementary probe. Example sources of fluorescence other than hybridized target nucleic acid molecules include fluorescence in the microarray slide itself, fluorescence from neighboring probe spots, or fluorescence from unbound labeled nucleic acid sequences or other stray particles not washed from the slide.

What is transformation? Transformation refers to transforming the gene expression measures (usually after background correction). The most commonly used transformation is the log transformation. The base is irrelevant, but log base 2 is popular for microarray data. More complex transformations have been proposed that are linear for low values and logarithmic for high values.

What is Normalization? Normalization describes the process of removing (or minimizing) non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected. Normalization does not necessarily have anything to do with the normal distribution that plays a prominent role in statistics.

Sources of Non-Biological Variation Variation across replicate microarray slides resulting from the manufacturing process Variation in the preparation of target samples Differences in the number of dyed target molecules hybridized for each target sample Dye variation: differences in heat and light sensitivity of dyes and differences in the efficiency of dye incorporation

Sources of Non-Biological Variation (continued) Variation across various steps in the measurement process, such as hybridization, washing, and microarray image acquisition Variation in laboratory conditions from day to day Variation among technicians doing the lab work etc.

What is summarization? If a gene is represented by multiple probes on a microarray, it may be desirable to combine the measures from multiple probes to obtain a single measure of the gene's expression level. Simply computing the mean or median is often reasonable. We will discuss more complex strategies in the context of Affymetrix GeneChip data.

Background Correction Recall that Spot signal or simply signal is fluorescence intensity due to target molecules hybridized to probe sequences contained in a spot (what we would like to measure) plus background fluorescence (what we would rather not measure). Background is fluorescence that may contribute to spot pixel intensities but is not due to fluorescence from target molecules hybridized to spot probe sequences. The idea is to remove background fluorescence from the spot signal fluorescence because the spot signal is believed to be a sum of fluorescence due to background and fluorescence due to hybridized target molecules.

A Simple Background Correction Method Subtract local background from the signal, e.g., signal mean – background mean or signal mean – background median

Drawbacks of the Simple Method Signal minus background may be more variable than the signal itself. Subtracting the background may produce a negative value that cannot be logarithmically transformed.

The Normal-Exponential Convolution Method Silver, J. D., Ritchie, M. E., Smyth, G. K. (2009). Microarray background correction: maximum likelihood estimation for the normal – exponential convolution. Biostatistics 2, 352–363. The authors build upon an idea originally proposed for Affymetrix data by Irizarray et al. (2003) Biostatistics 4, 249-264.

The Normal-Exponential Convolution Method Silver, J. D., Ritchie, M. E., Smyth, G. K. (2009). Microarray background correction: maximum likelihood estimation for the normal – exponential convolution. Biostatistics 2, 352–363. The authors build upon an idea originally proposed for Affymetrix data by Irizarray et al. (2003) Biostatistics 4, 249-264.

The Normal-Exponential Convolution Method Suppose there are n spots on a slide. For spot i=1,...n on any particular slide and for either dye, let Di = spot signali – spot backgroundi. Suppose Di = Xi + Yi, where X1,...,Xn iid Exponential(λ) independent of Y1,...,Yn iid N(μ,σ2).

The Normal-Exponential Convolution Method Xi represents the true signal. Yi represents error and background that is present after subtraction of local background. It can be shown that E( Xi | Di ) is Di – μ – λσ2 + σ2φ(0; Di – μ – λσ2, σ2 ) 1 - Φ(0; Di – μ – λσ2, σ2 ) where φ( ; mean, variance) and Φ( ; mean, variance) denote a normal density and cumulative density, respectively.

The Normal-Exponential Convolution Method Using D1,...,Dn as the observed data, find maximum likelihood estimates of μ, σ2, and λ. Substitute these MLEs into E( Xi | Di ) to get a background corrected signal. This entire process is repeated separately for each slide and dye combination.

Transformation of Background-Corrected Signals Silver, Ritchie, and Smyth (2009) recommend the transformation log2(BCS+50), where BCS denotes the background-corrected signal resulting from the normal- exponential convolution method. The addition of 50 prior to the log transformation reduces the variance of log ratios for genes with low signal intensities, i.e., rather than log2(R/G) use log2{(R+50)/(G+50)}=log2(R+50)-log2(G+50), where R and G are red and green BCS, respectively.

Normalization Following background correction and transformation, the next preprocessing step is normalization. Recall that the goal of normalization is to reduce the impact of non-biological variation in measured signal intensity levels so that biological differences in gene expression can be appropriately detected.

Normalization The next several slides show normalization of data from an actual microarray experiment. Although the normalization steps depicted would typically be carried out with log2(BCS) or log2(BCS+50) data, natural log signal means prior to background correction were used in most cases. The figures would look very similar if log2(BCS) or log2(BCS+50) data had been used instead of log signal means.

Side-by-side boxplots show variation across channels. Here channel refers to a slide / dye combination.

maximum Slide 2 Cy3 Cy5 Slide 1 Cy3 Cy5 Q3=75th percentile median Q1=25th percentile minimum

Interquartile range (IQR) is Q3-Q1. Points more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1 are displayed individually. maximum Q3=75th percentile median Q1=25th percentile minimum

One of the simplest normalization strategies is to align the log signals so that all channels have the same median. The value of the common median is not important for subsequent analyses. A convenient choice is zero so that positive or negative values reflect signals above or below the median for a particular channel. If negative normalized signal values seem confusing, any positive constant may be added to all values after normalization to zero medians.

Log Mean Signal Centered at 0

Note that medians match but variation seems to differ greatly across channels. Log Mean Signal Centered at 0

Yang, et al. (2002. Nucliec Acids Research, 30, 4 e15) recommend scale normalization.* Consider a matrix X with i=1,...,I rows and j=1,...,J columns. Let xij denote the entry in row i and column j. We will apply scale normalization to the matrix of log signal mean values that have already been median centered (each row corresponds to a gene and each column corresponds to a channel). For each column j, let mj=median(x1j, x2j, ..., xIj). For each column j, let MADj=median(|x1j-mj|,|x2j-mj|,...,|xIj-mj|). To scale normalize the columns of X to a constant value C, multiply all the entries in the jth column by C/MADj for all j=1,...,J. A common choice for C is the geometric mean of MAD1,...,MADJ = The choice of C will not effect subsequent tests or p-values but will affect fold change calculations. *Yang et al. recommended scale normalization for log R/G values.

Data after Median Centering and Scale Normalizing Log Mean Signal (centered and scaled)

A Simple Example Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 2 7 15 3 3 6 5 8 4 1 5 2 9 5 9 13 6 11

Determine Channel Medians Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 2 7 15 3 3 6 5 8 4 1 5 2 9 5 9 13 6 11 medians 7 6 6 11

Subtract Channel Medians Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 1 9 3 2 2 0 -4 1 4 3 -4 0 -1 -3 4 -6 -1 -4 -2 5 2 7 0 0 This is the data after median centering.

Find Median Absolute Deviations Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 1 9 3 2 2 0 -4 1 4 3 -4 0 -1 -3 4 -6 -1 -4 -2 5 2 7 0 0 MAD 2 4 1 2

Find Scaling Constant Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 1 9 3 2 2 0 -4 1 4 3 -4 0 -1 -3 4 -6 -1 -4 -2 5 2 7 0 0 MAD 2 4 1 2 C = (2*4*1*2)1/4 = 2

Find Scaling Factors Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 1 9 3 2 2 0 -4 1 4 3 -4 0 -1 -3 4 -6 -1 -4 -2 5 2 7 0 0 Scaling 2 2 2 2 Factors 2 4 1 2

Scale Normalize the Median Centered Data Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 1 4.5 6 2 2 0 -2.0 2 4 3 -4 0.0 -2 -3 4 -6 -0.5 -8 -2 5 2 3.5 0 0 This is the data after median centering and scale normalizing.

Slide 1 Log Signal Means after Median Centering and Scaling All Channels Evidence of intensity-dependent dye bias Log Red Log Green

M vs. A Plot of the Logged, Centered, and Scaled Slide 1 Data M = Log Red - Log Green A = (Log Green + Log Red) / 2

“lowess” stands for LOcally WEighted polynomial regreSSion. To handle intensity-dependent dye bias, Yang, et al. (2002. Nucliec Acids Research, 30, 4 e15) recommend “lowess” normalization prior to median centering and scale normalizing. “lowess” stands for LOcally WEighted polynomial regreSSion. The original reference for lowess is Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. JASA 74 829-836.

Slide 1 Log Signal Means Log Red Log Green

M vs. A Plot for Slide 1 Log Signal Means M = Log Red - Log Green A = (Log Green + Log Red) / 2

M vs. A Plot for Slide 1 Log Signal Means with lowess fit (f=0.40) M = Log Red - Log Green A = (Log Green + Log Red) / 2

Adjust M Values M = Log Red - Log Green A = (Log Green + Log Red) / 2

M vs. A Plot after Adjustment M = Adjusted Log Red – Adjusted Log Green A = (Adjusted Log Green + Adjusted Log Red) / 2

M vs. A Plot for Slide 1 Log Signal Means adjusted log red = log red – adj/2 adjusted log green=log green + adj/2 where adj = lowess fitted value Adjusted Log Red Adjusted Log Green

M vs. A Plot for Slide 1 Log Signal Means For spots with A=7, the lowess fitted value is 0.883. Thus the value of adj discussed on the previous slide is 0.883 for spots with A=7. The M value for such spots would be moved down by 0.883. The log red value would be decreased by 0.883/2 and the log green value would be increased by 0.883/2 to obtain adjusted log red and adjusted log green values, respectively. M vs. A Plot for Slide 1 Log Signal Means with lowess fit (f=0.40) M = Log Red - Log Green 0.883 A = (Log Green + Log Red) / 2

After a separate lowess normalization for each slide, the adjusted values can be median centered and (if deemed necessary) scale normalized across all channels using the lowess-normalized data for each channel.

Boxplots of Mean Signal after Logging, Lowess Normalization, Median Centering, and Scaling Normalized Signal

After a separate lowess normalization for each slide, the adjusted values can be median centered and scale normalized across all channels using the lowess-normalized data for each channel. A sector represents the set of points spotted by a single pin on a single slide. The entire normalization process described above can be carried out separately for each sector on each channel. It may be necessary to normalize by sector/channel combinations if spatial variability is apparent.

Data from 3 Sectors on a Single Slide Log Red Log Green

Bolstad, et al. (2003, Bioinformatics 19 2:185-193) propose quantile normalization for microarray data Quantile normalization is most commonly used in normalization of Affymetrix data It can be used for two-color data as well. Quantile normalization can force each channel to have the same quantiles. xq (for q between 0 and 1) is the q quantile of a data set if the fraction of the data points less than or equal to xq is at least q, and the fraction of the data points greater than or equal to xq at least 1-q. median=x0.5 Q1=x0.25 Q3=x0.75

Boxplots of Log Signal Means after Quantile Normalization

Original Slide 1 Log Signal Means Log Red Log Green

Comparison of Slide 1 Log Signal Means after Quantile Normalization Log Red Log Green

Details of Quantile Normalization Find the smallest log signal on each channel. Average the values from step 1. Replace each value in step 1 with the average computed in step 2. Repeat steps 1 through 3 for the second smallest values, third smallest values,..., largest values.

A Simple Example Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 2 7 15 3 3 6 5 8 4 1 5 2 9 5 9 13 6 11

Find the Smallest Value for Each Channel Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 2 7 15 3 3 6 5 8 4 1 5 2 9 5 9 13 6 11

Average These Values (1+2+2+8)/4=3.25 Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 2 7 15 3 3 6 5 8 4 1 5 2 9 5 9 13 6 11 (1+2+2+8)/4=3.25

Replace Each Value by the Average Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 3.25 7 15 3 3 6 5 3.25 4 3.25 5 3.25 9 5 9 13 6 11 (1+2+2+8)/4=3.25

Find the Next Smallest Values Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 3.25 7 15 3 3 6 5 3.25 4 3.25 5 3.25 9 5 9 13 6 11

Average These Values (3+5+5+9)/4=5.5 Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 3.25 7 15 3 3 6 5 3.25 4 3.25 5 3.25 9 5 9 13 6 11 (3+5+5+9)/4=5.5

Replace Each Value by the Average Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 3.25 7 15 3 5.50 6 5.50 3.25 4 3.25 5.50 3.25 5.50 5 9 13 6 11

Find the Average of the Next Smallest Values Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7 3.25 7 15 3 5.50 6 5.50 3.25 4 3.25 5.50 3.25 5.50 5 9 13 6 11 (7+6+6+11)/4=7.5

Replace Each Value by the Average Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7.50 3.25 7 15 3 5.50 7.50 5.50 3.25 4 3.25 5.50 3.25 5.50 5 9 13 7.50 7.50

Find the Average of the Next Smallest Values Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 8 15 9 13 2 7.50 3.25 7 15 3 5.50 7.50 5.50 3.25 4 3.25 5.50 3.25 5.50 5 9 13 7.50 7.50 (8+13+7+13)/4=10.25

Replace Each Value by the Average Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 10.25 15 9 10.25 2 7.50 3.25 10.25 15 3 5.50 7.50 5.50 3.25 4 3.25 5.50 3.25 5.50 5 9 10.25 7.50 7.50

Find the Average of the Next Smallest Values Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 10.25 15 9 10.25 2 7.50 3.25 10.25 15 3 5.50 7.50 5.50 3.25 4 3.25 5.50 3.25 5.50 5 9 10.25 7.50 7.50 (9+15+9+15)/4=12.00

Replace Each Value by the Average Gene Slide1Cy3 Slide1Cy5 Slide2Cy3 Slide2Cy5 1 10.25 12.00 12.00 10.25 2 7.50 3.25 10.25 12.00 3 5.50 7.50 5.50 3.25 4 3.25 5.50 3.25 5.50 5 12.00 10.25 7.50 7.50 This is the data matrix after quantile normalization.

Miscellaneous Comments on Preprocessing We have only scratched the surface in terms of preprocessing methods. There are many variations on the techniques that we have described as well as other approaches that we won’t discuss. Preprocessing affects the final results, but it is often not clear what strategies are best. It would be good to integrate preprocessing and statistical analysis, but it is difficult to do so. The most common approach is to preprocess data and then perform statistical analysis of the resulting data as a separate step in the microarray analysis process.