Microarray Data Pre-Processing

Microarray Data Pre-Processing 4/25/2017

Copyright notice
Many of the images in this PowerPoint presentation are the work of other people. The copyright belongs to the original authors. Thanks!

Microarray data analysis: preprocessing
The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.

Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
- different labeling efficiencies of Cy3 and Cy5
- uneven spotting of DNA onto an array surface
- variations in RNA purity or quantity
- variations in washing efficiency
- variations in scanning efficiency

Microarray data analysis: preprocessing
- Image analysis
- Background correction
- Normalization
- Summarization

Image analysis
The raw data from a cDNA microarray experiment consist of pairs of image files (16-bit TIFFs), one for each of the dyes. Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

Steps in Image Processing
1. Addressing: locate the spot centers.
2. Segmentation: classify pixels as either signal or background (e.g., using seeded region growing).
3. Information extraction: for each spot on the array, calculate signal intensity pairs, background, and quality measures.

Addressing
This is the process of assigning coordinates to each of the spots. Automating this part of the procedure permits high-throughput analysis. Example layout: 4 by 4 grids, 19 by 21 spots per grid.

Addressing
[Figure: registration]

Problems in automatic addressing
- Misregistration of the red and green channels
- Rotation of the array in the image
- Skew in the array

Segmentation methods
- Fixed circle
- Adaptive circle
- Adaptive shape: edge detection; seeded region growing (R. Adams and L. Bischof, 1994), in which regions grow outwards from the seed points preferentially according to the difference between a pixel's value and the running mean of values in an adjoining region
- Histogram methods: adaptive threshold

Examples of algorithms and software implementation

Limitation of fixed circle method
[Figure: SRG vs. fixed circle]

Limitation of circular segmentation
[Figure: results from SRG for a small spot and for a spot that is not circular]

Information Extraction
- Spot intensities: mean of pixel intensities; median of pixel intensities
- Pixel variation: IQR of log(pixel intensities)
- Background values: local; morphological opening; constant (global); none
- Quality information
[Figure labels: signal, background]

Background Correction
Recall that spot signal, or simply signal, is the fluorescence intensity due to target molecules hybridized to probe sequences contained in a spot (what we would like to measure) plus background fluorescence (what we would rather not measure). Background is fluorescence that may contribute to spot pixel intensities but is not due to fluorescence from target molecules hybridized to spot probe sequences. Because the spot signal is believed to be the sum of fluorescence due to background and fluorescence due to hybridized target cDNA, the idea is to remove the background fluorescence from the spot signal.

Local background
- Focuses on small regions surrounding the spot mask.
- The background estimate is the median of pixel values in this region.
- Most software packages implement such an approach: ScanAlyze, ImaGene, Spot, GenePix.
- By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure.

Global background
- A global method subtracts a constant background for all spots.
- Some findings suggest that the binding of fluorescent dyes to "negative control spots" is lower than the binding to the glass slide.
- It is therefore more meaningful to estimate background based on a set of negative control spots.
- If there are no negative control spots, approximate the average background by the third percentile of all the spot foreground values.

Background Correction Strategies (applied prior to logging signal intensity)
1. Subtract local background, e.g., signal mean - background mean or signal mean - background median. This can increase variation in measurements, especially for low-expressing genes. Some believe that local background will overestimate the background contribution to spot fluorescence, because background fluorescence where cDNA has been spotted may be different from background where no cDNA has been spotted.

Background Correction Strategies (applied prior to logging signal intensity)
2. For each spot, find the local background of the spot as well as the local backgrounds of all neighboring spots. Compute the median or mean of these local backgrounds. Subtract that summary of local backgrounds from the spot's signal. This is similar to option 1 but can reduce some variation in background estimation.

Background Correction Strategies (applied prior to logging signal intensity)
3. Find the median or mean of local backgrounds in a sector. Subtract the sector summary of local backgrounds from each signal in the sector.
4. Subtract the median or mean of blank spot signals or negative control signals in a sector from all other signals in the sector.
5. Estimate the background for each spot by fitting a row and column model to the local background values in a sector. (See next slide.)

Modeling local backgrounds within each sector (Kafadar and Phang (2003). CSDA 44, 313-338)
$b_{ij} = m + r_i + c_j + e_{ij}$
where
- $b_{ij}$ is the background for the spot in the ith row and jth column of the sector,
- $m$ is the baseline background for the sector,
- $r_i$ is the row effect for the sector,
- $c_j$ is the column effect for the sector,
- $e_{ij}$ is the residual.
An estimated background $\hat{b}_{ij}$ for each spot is obtained via median polish.
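The row and column model above can be fit with a few sweeps of medians. The following is a minimal sketch of median polish, assuming the local backgrounds of one sector are held in a NumPy array; the function name, the iteration cap, and the example numbers are illustrative and not taken from the slides or from Kafadar and Phang.

```python
import numpy as np

def median_polish(b, n_iter=10, tol=1e-8):
    """Fit b[i, j] = m + r[i] + c[j] + e[i, j] by repeatedly sweeping out medians."""
    resid = np.array(b, dtype=float)
    m = 0.0
    r = np.zeros(resid.shape[0])
    c = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        # Sweep row medians out of the residuals and into the row effects.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        r += row_med
        # Move the median of the column effects into the baseline.
        delta = np.median(c)
        c -= delta
        m += delta
        # Sweep column medians out of the residuals and into the column effects.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        c += col_med
        # Move the median of the row effects into the baseline.
        delta = np.median(r)
        r -= delta
        m += delta
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    fitted = m + r[:, None] + c[None, :]   # estimated background for each spot
    return m, r, c, fitted

# Made-up local backgrounds for a 3 x 3 sector (exactly additive, so residuals are 0).
local_bg = np.array([[10.0, 15.0, 5.0],
                     [11.0, 16.0, 6.0],
                     [12.0, 17.0, 7.0]])
m, r, c, fitted = median_polish(local_bg)
```

The fitted values, rather than the raw local backgrounds, would then be subtracted from the spot signals.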

Comments on Background Correction
Subtracting background may result in negative or zero adjusted signal values. Such values cannot be logged. One simple approach is to replace all negative values by zero, add one to all values (whether zero or not), and log the resulting values.
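As a minimal sketch of that simple approach, assuming per-spot signal and background values in NumPy arrays (the function name and the choice of base-2 logs are illustrative assumptions):

```python
import numpy as np

def background_correct_and_log(signal, background):
    """Subtract background, replace negatives by zero, add one, then take log2."""
    adjusted = np.asarray(signal, dtype=float) - np.asarray(background, dtype=float)
    adjusted = np.maximum(adjusted, 0.0)   # negative values become zero
    return np.log2(adjusted + 1.0)         # adding one makes zeros loggable

# Made-up values: the second and third spots are at or below background.
signal = np.array([120.0, 45.0, 30.0])
background = np.array([40.0, 45.0, 50.0])
logged = background_correct_and_log(signal, background)
```

Spots at or below background all map to a logged value of zero, so this approach discards any ordering among them.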

Data Normalization
- Large sets of experiments involve dozens to hundreds of arrays.
- To make the arrays comparable, the data need to be normalized.
- Because equal amounts of mRNA are used in all arrays, the spot intensities of an array should sum to a fixed number.

What is Normalization?
Normalization describes the process of removing (or minimizing) non-biological variation in the measured gene expression levels of hybridized mRNA so that biological differences can be more easily detected. Typically normalization is attempting to remove global effects, i.e., effects that can be seen by examining plots that show all the data for a slide or slides. Normalization does not necessarily have anything to do with the normal distribution that plays a prominent role in statistics.

Sources of Non-Biological Variation
- Dye bias: differences in heat and light sensitivity, efficiency of dye incorporation
- Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (channel refers to a combination of a dye and a slide)
- Variation across replicate slides
- Variation across hybridization conditions
- Variation in scanning conditions
- Variation among technicians doing the lab work
- etc.

Normalization Methods for Two-Color Microarray Data

Side-by-side boxplots show examples of variation across channels.

[Figure: side-by-side boxplots for Slide 1 (Cy3, Cy5) and Slide 2 (Cy3, Cy5), annotated with minimum, Q1 = 25th percentile, median, Q3 = 75th percentile, and maximum]

Interquartile range (IQR) is Q3-Q1. Points more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1 are displayed individually.
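The 1.5*IQR rule can be sketched as follows. This assumes NumPy's default (linearly interpolated) percentiles, which may differ slightly from the quartile convention of a given plotting package; the function name and data are made up for illustration.

```python
import numpy as np

def boxplot_outliers(values):
    """Return Q1, Q3, and the points a boxplot would display individually."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the quartiles are flagged.
    mask = (values > q3 + 1.5 * iqr) | (values < q1 - 1.5 * iqr)
    return q1, q3, values[mask]

q1, q3, outliers = boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```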

One of the simplest normalization strategies is to align the log signals so that all channels have the same median. The value of the common median is not important for subsequent analyses. A convenient choice is zero so that positive or negative values reflect signals above or below the median for a particular channel. If negative normalized signal values seem confusing, any positive constant may be added to all values after normalization to zero medians.

[Figure: boxplots of log mean signal, centered at 0]

Note that medians match but variation seems to differ greatly across channels.

Scale normalization (Yang et al. 2002. Nucleic Acids Research, 30, 4 e15)
Consider a matrix X with i=1,...,I rows and j=1,...,J columns. Let $x_{ij}$ denote the entry in row i and column j. We will apply scale normalization to the matrix of log signal mean values that have already been median centered (each row corresponds to a gene and each column corresponds to a channel).
- For each column j, let $m_j = \mathrm{median}(x_{1j}, x_{2j}, \ldots, x_{Ij})$.
- For each column j, let $\mathrm{MAD}_j = \mathrm{median}(|x_{1j}-m_j|, |x_{2j}-m_j|, \ldots, |x_{Ij}-m_j|)$, the median absolute deviation.
- To scale normalize the columns of X to a constant value C, multiply all the entries in the jth column by $C/\mathrm{MAD}_j$ for all j=1,...,J.
- A common choice for C is the geometric mean of $\mathrm{MAD}_1, \ldots, \mathrm{MAD}_J$, that is, $C = \left(\prod_{j=1}^{J} \mathrm{MAD}_j\right)^{1/J}$.
- The choice of C will not affect subsequent tests or p-values but will affect fold change calculations.
Note: Yang et al. recommended scale normalization for log R/G values.

Data after Median Centering and Scale Normalizing
[Figure: boxplots of log mean signal, centered and scaled]

A Simple Example

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          2          7          15
3     3          6          5          8
4     1          5          2          9
5     9          13         6          11

Determine Channel Medians

         Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
medians  7          6          6          11

Subtract Channel Medians

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     1          9          3          2
2     0          -4         1          4
3     -4         0          -1         -3
4     -6         -1         -4         -2
5     2          7          0          0

This is the data after median centering.

Find Median Absolute Deviations

      Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
MAD   2          4          1          2

Find Scaling Constant
$C = (2 \cdot 4 \cdot 1 \cdot 2)^{1/4} = 2$

Find Scaling Factors

                 Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
Scaling factors  2/2        2/4        2/1        2/2

Scale Normalize the Median Centered Data

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     1          4.5        6          2
2     0          -2.0       2          4
3     -4         0.0        -2         -3
4     -6         -0.5       -8         -2
5     2          3.5        0          0

This is the data after median centering and scale normalizing.
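The worked example above can be reproduced with a short sketch, assuming genes in rows and channels in columns of a NumPy array. The function names are illustrative. The code assumes every column has a nonzero MAD.

```python
import numpy as np

def median_center(x):
    """Subtract each column's median so that every channel has median 0."""
    return x - np.median(x, axis=0)

def scale_normalize(x):
    """Rescale each column so that all columns share a common MAD."""
    m = np.median(x, axis=0)
    mad = np.median(np.abs(x - m), axis=0)
    c = np.exp(np.mean(np.log(mad)))   # geometric mean of the MADs
    return x * (c / mad)

x = np.array([[8, 15, 9, 13],
              [7, 2, 7, 15],
              [3, 6, 5, 8],
              [1, 5, 2, 9],
              [9, 13, 6, 11]], dtype=float)
centered = median_center(x)
normalized = scale_normalize(centered)
```

Here the MADs are 2, 4, 1, and 2, so C = 2 and the columns are multiplied by 1, 0.5, 2, and 1.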

Slide 1 Log Signal Means after Median Centering and Scaling All Channels
[Figure: log red vs. log green, showing evidence of intensity-dependent dye bias]

M vs. A Plot of the Logged, Centered, and Scaled Slide 1 Data
[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2]

To handle intensity-dependent dye bias, Yang et al. (2002. Nucleic Acids Research, 30, 4 e15) recommend "lowess" normalization prior to median centering and scale normalizing. "lowess" stands for LOcally WEighted polynomial regreSSion. The original reference for lowess is Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. JASA 74, 829-836.

LOESS
At each point in the data set a low-degree polynomial is fit to a subset of the data, with explanatory variable values near the point whose response is being estimated. The polynomial is fit using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The value of the regression function for the point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point. The LOESS fit is complete after regression function values have been computed for each of the n data points. (From Wikipedia, the free encyclopedia)

Slide 1 Log Signal Means
[Figure: log red vs. log green]

M vs. A Plot for Slide 1 Log Signal Means
[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2]

M vs. A Plot for Slide 1 Log Signal Means with lowess fit (f=0.40)
[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2]

Adjust M Values
[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2]

M vs. A Plot after Adjustment
[Figure: M = Adjusted Log Red - Adjusted Log Green plotted against A = (Adjusted Log Green + Adjusted Log Red) / 2]

adjusted log red = log red - adj/2
adjusted log green = log green + adj/2
where adj = lowess fitted value.
[Figure: adjusted log red vs. adjusted log green]

M vs. A Plot for Slide 1 Log Signal Means with lowess fit (f=0.40)
For spots with A=7, the lowess fitted value is 0.883. Thus the value of adj discussed on the previous slide is 0.883 for spots with A=7. The M value for such spots would be moved down by 0.883. The log red value would be decreased by 0.883/2 and the log green value would be increased by 0.883/2 to obtain adjusted log red and adjusted log green values, respectively.
[Figure: M = Log Red - Log Green plotted against A = (Log Green + Log Red) / 2, with the fitted value 0.883 marked at A=7]
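A small sketch of the channel adjustment, using the fitted value 0.883 from the A=7 example. The spot's log red and log green values are made up (chosen so that A = 7), and the function name is illustrative.

```python
import numpy as np

def adjust_channels(log_red, log_green, adj):
    """Split the lowess fitted value adj evenly between the two channels."""
    adj_red = log_red - adj / 2.0
    adj_green = log_green + adj / 2.0
    return adj_red, adj_green

# Hypothetical spot with A = (6.25 + 7.75) / 2 = 7 and M = 7.75 - 6.25 = 1.5.
log_red = np.array([7.75])
log_green = np.array([6.25])
adj = np.array([0.883])          # lowess fitted value at A = 7

adj_red, adj_green = adjust_channels(log_red, log_green, adj)
m_new = adj_red - adj_green              # M is moved down by adj
a_new = (adj_red + adj_green) / 2.0      # A is unchanged
```

Splitting the adjustment evenly between the channels moves M down by adj and leaves A unchanged.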

How is the lowess curve determined? Weight function
Suppose we have data points $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$. Let $0 < f \le 1$ denote a fraction that will determine the smoothness of the curve. Let r = n*f rounded to the nearest integer. Consider the tricube weight function defined as
$T(t) = (1 - |t|^3)^3$ for $|t| < 1$, and $T(t) = 0$ for $|t| \ge 1$.
For i=1,...,n, let $h_i$ be the rth smallest number among $|x_i-x_1|, |x_i-x_2|, \ldots, |x_i-x_n|$. For k=1,2,...,n, let $w_k(x_i) = T\big((x_k - x_i)/h_i\big)$.
[Figure: tricube weight function T(t) plotted against t]

An Example

i    1  2  3  4  5   6   7   8   9   10
x_i  1  2  5  7  12  13  15  25  27  30
y_i  1  8  4  5  3   9   16  15  23  29

Suppose a lowess curve will be fit to this data with f=0.4.
[Figure: scatterplot of y against x]

Table Containing |xi-xj| Values

     x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
x1   0   1   4   6   11  12  14  24  26  29
x2   1   0   3   5   10  11  13  23  25  28
x3   4   3   0   2   7   8   10  20  22  25
x4   6   5   2   0   5   6   8   18  20  23
x5   11  10  7   5   0   1   3   13  15  18
x6   12  11  8   6   1   0   2   12  14  17
x7   14  13  10  8   3   2   0   10  12  15
x8   24  23  20  18  13  12  10  0   2   5
x9   26  25  22  20  15  14  12  2   0   3
x10  29  28  25  23  18  17  15  5   3   0

Calculation of hi from |xi-xj| Values
n=10 and f=0.4, so r=4. Each hi is the 4th smallest value in row i of the table above.

i    1  2  3  4  5  6  7  8   9   10
h_i  6  5  4  5  5  6  8  10  12  15

Weights wk(xi) Rounded to Nearest 0.001 (rows indexed by i, columns by k)

i \ k  1      2      3      4      5      6      7      8      9      10
1      1.000  0.986  0.348  0.000  0.000  0.000  0.000  0.000  0.000  0.000
2      0.976  1.000  0.482  0.000  0.000  0.000  0.000  0.000  0.000  0.000
3      0.000  0.193  1.000  0.670  0.000  0.000  0.000  0.000  0.000  0.000
4      0.000  0.000  0.820  1.000  0.000  0.000  0.000  0.000  0.000  0.000
5      0.000  0.000  0.000  0.000  1.000  0.976  0.482  0.000  0.000  0.000
6      0.000  0.000  0.000  0.000  0.986  1.000  0.893  0.000  0.000  0.000
7      0.000  0.000  0.000  0.000  0.850  0.954  1.000  0.000  0.000  0.000
8      0.000  0.000  0.000  0.000  0.000  0.000  0.000  1.000  0.976  0.670
9      0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.986  1.000  0.954
10     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.893  0.976  1.000

For example, $w_6(x_5) = \big(1 - (|x_6 - x_5| / h_5)^3\big)^3 = \big(1 - (|13 - 12| / 5)^3\big)^3 = (1 - 1/125)^3 \approx 0.976$.
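The bandwidths and tricube weights in this example can be checked with a short sketch. The function names are illustrative. Python's round() rounds halves to the nearest even integer, which only matters when n*f falls exactly on a half.

```python
import numpy as np

def tricube(t):
    """Tricube weight: (1 - |t|^3)^3 for |t| < 1, and 0 otherwise."""
    t = np.abs(t)
    return np.where(t < 1.0, (1.0 - t**3)**3, 0.0)

def lowess_weights(x, f):
    """Return the bandwidths h[i] and the weights w[i, k] = w_k(x_i)."""
    x = np.asarray(x, dtype=float)
    r = int(round(len(x) * f))
    dist = np.abs(x[:, None] - x[None, :])   # dist[i, k] = |x_i - x_k|
    h = np.sort(dist, axis=1)[:, r - 1]      # r-th smallest distance in row i
    w = tricube(dist / h[:, None])
    return h, w

x = [1, 2, 5, 7, 12, 13, 15, 25, 27, 30]
h, w = lowess_weights(x, f=0.4)
# w[4, 5] is w_6(x_5) in the 1-based notation of the example.
```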

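How is the lowess curve determined? Regression
For each i=1, 2, ..., n, let $\hat{\beta}_0(x_i)$ and $\hat{\beta}_1(x_i)$ denote the values of $\beta_0$ and $\beta_1$ that minimize $\sum_{k=1}^{n} w_k(x_i)\,(y_k - \beta_0 - \beta_1 x_k)^2$. For i=1, 2, ..., n, let $\hat{y}_i = \hat{\beta}_0(x_i) + \hat{\beta}_1(x_i)\,x_i$ and $e_i = y_i - \hat{y}_i$.
Consider the bisquare weight function defined as
$B(t) = (1 - t^2)^2$ for $|t| < 1$, and $B(t) = 0$ for $|t| \ge 1$.
For k=1,2,...,n, let $\delta_k = B\big(e_k/(6s)\big)$, where s is the median of $|e_1|, |e_2|, \ldots, |e_n|$.
[Figure: bisquare weight function B(t) plotted against t]

How is the lowess curve determined?
For each i=1, 2, ..., n, let $\hat{\beta}_0(x_i)$ and $\hat{\beta}_1(x_i)$ denote the values of $\beta_0$ and $\beta_1$ that minimize $\sum_{k=1}^{n} \delta_k\, w_k(x_i)\,(y_k - \beta_0 - \beta_1 x_k)^2$. For i=1, 2, ..., n, let $\hat{y}_i = \hat{\beta}_0(x_i) + \hat{\beta}_1(x_i)\,x_i$. Now use the new fitted values to compute new $\delta_k$ as on the previous slide. Substitute the new $\delta_k$ for the old in the expression above and repeat the minimization described above to obtain new $\hat{y}_i$ values. These resulting values are the lowess fitted values. Plot these values versus $x_1, x_2, \ldots, x_n$ and connect with straight lines to obtain the lowess curve.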
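The whole procedure (tricube-weighted local fits followed by bisquare robustness reweighting) can be sketched as below. This assumes local straight-line fits and two robustness passes. The function names, the n_robust parameter, and the 1e-12 guard for an exact fit are illustrative choices, not part of the slides.

```python
import numpy as np

def tricube(t):
    t = np.abs(t)
    return np.where(t < 1.0, (1.0 - t**3)**3, 0.0)

def bisquare(t):
    t = np.abs(t)
    return np.where(t < 1.0, (1.0 - t**2)**2, 0.0)

def weighted_line_fit(x, y, w, x0):
    """Weighted least-squares straight line, evaluated at x0."""
    sw = np.sqrt(w)
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return beta[0] + beta[1] * x0

def lowess(x, y, f=0.4, n_robust=2):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r = int(round(n * f))
    dist = np.abs(x[:, None] - x[None, :])
    h = np.sort(dist, axis=1)[:, r - 1]
    w = tricube(dist / h[:, None])           # w[i, k] = w_k(x_i)
    delta = np.ones(n)                       # robustness weights, initially 1
    for _ in range(n_robust + 1):
        fitted = np.array([weighted_line_fit(x, y, delta * w[i], x[i])
                           for i in range(n)])
        resid = y - fitted
        s = np.median(np.abs(resid))
        if s < 1e-12:                        # fit is numerically exact; stop
            break
        delta = bisquare(resid / (6.0 * s))
    return fitted

x = [1, 2, 5, 7, 12, 13, 15, 25, 27, 30]
y = [1, 8, 4, 5, 3, 9, 16, 15, 23, 29]
fitted = lowess(x, y, f=0.4)
```

Plotting `fitted` against `x` and connecting the points with straight lines gives the curve. With so few points per neighborhood, some local fits rest on only two or three observations.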

Plot Showing All 10 Lines and Predicted Values after One More Iteration

The Lowess Curve

After a separate lowess normalization for each slide, the adjusted values can be median centered and scale normalized across all channels using the lowess-normalized data for each channel. A sector represents the set of points spotted by a single pin on a single slide. The entire normalization process described above can be carried out separately for each sector on each channel. It may be necessary to normalize by sector/channel combinations if spatial variability is apparent.

Boxplots of Mean Signal after Logging, Lowess Normalization, Median Centering, and Scaling
[Figure: boxplots of normalized signal]

Bolstad et al. (2003, Bioinformatics 19(2):185-193) propose quantile normalization for microarray data.
- Quantile normalization is most commonly used in normalization of Affymetrix data, but it can be used for two-color data as well.
- Quantile normalization can force each channel to have the same quantiles.
- $x_q$ (for q between 0 and 1) is the q quantile of a data set if the fraction of the data points less than or equal to $x_q$ is at least q, and the fraction of the data points greater than or equal to $x_q$ is at least 1-q.
- median = $x_{0.5}$, Q1 = $x_{0.25}$, Q3 = $x_{0.75}$

Boxplots of Log Signal Means after Quantile Normalization

Original Slide 1 Log Signal Means
[Figure: log red vs. log green]

Comparison of Slide 1 Log Signal Means after Quantile Normalization
[Figure: log red vs. log green]

Details of Quantile Normalization
1. Find the smallest log signal on each channel.
2. Average the values from step 1.
3. Replace each value in step 1 with the average computed in step 2.
4. Repeat steps 1 through 3 for the second smallest values, third smallest values, ..., largest values.

A Simple Example

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          2          7          15
3     3          6          5          8
4     1          5          2          9
5     9          13         6          11

Find the Smallest Value for Each Channel
The smallest values are 1 (Slide1Cy3), 2 (Slide1Cy5), 2 (Slide2Cy3), and 8 (Slide2Cy5).

Average These Values
(1+2+2+8)/4=3.25

Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     3          6          5          3.25
4     3.25       5          3.25       9
5     9          13         6          11

Find the Next Smallest Values
The next smallest values are 3, 5, 5, and 9.

Average These Values
(3+5+5+9)/4=5.5

Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7          3.25       7          15
3     5.50       6          5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          13         6          11

Find the Average of the Next Smallest Values
(7+6+6+11)/4=7.5

Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     8          15         9          13
2     7.50       3.25       7          15
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          13         7.50       7.50

Find the Average of the Next Smallest Values
(8+13+7+13)/4=10.25

Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     10.25      15         9          10.25
2     7.50       3.25       10.25      15
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     9          10.25      7.50       7.50

Find the Average of the Next Smallest Values
(9+15+9+15)/4=12.00

Replace Each Value by the Average

Gene  Slide1Cy3  Slide1Cy5  Slide2Cy3  Slide2Cy5
1     10.25      12.00      12.00      10.25
2     7.50       3.25       10.25      12.00
3     5.50       7.50       5.50       3.25
4     3.25       5.50       3.25       5.50
5     12.00      10.25      7.50       7.50

This is the data matrix after quantile normalization.
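The step-by-step example above is equivalent to sorting each column, averaging across columns at each rank, and writing the averages back in each column's original order. A minimal sketch, assuming a genes-by-channels NumPy array with no ties within a column (ties here would be broken arbitrarily by the sort order); the function name is illustrative:

```python
import numpy as np

def quantile_normalize(x):
    """Force every column to share the same set of values (the rank averages)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)                   # sorting indices per column
    rank_means = np.sort(x, axis=0).mean(axis=1)    # average of the k-th smallest values
    out = np.empty_like(x)
    for j in range(x.shape[1]):
        out[order[:, j], j] = rank_means            # k-th smallest gets k-th average
    return out

x = np.array([[8, 15, 9, 13],
              [7, 2, 7, 15],
              [3, 6, 5, 8],
              [1, 5, 2, 9],
              [9, 13, 6, 11]], dtype=float)
normalized = quantile_normalize(x)
```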

Background Correction and Normalization of Affymetrix GeneChip Data

Affymetrix .CEL Files
- A .CEL file contains one number representing signal intensity for each probe cell on a single GeneChip.
- .CEL files can be read with Affymetrix software or in R using the Bioconductor package affy.
- We will discuss two methods for normalizing and obtaining expression measures using data from Affymetrix .CEL files.

Methods
- Microarray Analysis Suite (MAS) 5.0 Signal, proposed by Affymetrix: Statistical Algorithms Description Document (2002), Affymetrix Inc.
- Robust Multi-array Average (RMA), proposed by Irizarry et al. (2003), Biostatistics 4, 249-264.
These are perhaps the two most popular of many methods for normalizing and computing expression measures using Affymetrix data. Currently over 50 methods are described and compared at http://affycomp.biostat.jhsph.edu/.

MAS 5.0 Signal: Background Adjustment
- Each chip is divided into 16 rectangular zones.
- The lowest 2% of intensities in each zone are averaged to form a zone-specific background value denoted $b_{Z_k}$ for zones k=1, 2, ..., 16.
- The standard deviation of the lowest 2% of intensities in each zone is calculated and denoted $n_{Z_k}$ for zones k=1, 2, ..., 16.
- Let $d_k(x,y)$ denote the distance from the center of zone k to a probe cell located at coordinates (x,y) on the chip.

GeneChip Divided into 16 Zones
[Figure: chip divided into zones numbered 1 to 16, with a probe cell at coordinates (x,y)]

16 Distances to Zone Centers for Each Probe Cell
[Figure: distances $d_1(x,y)$, $d_4(x,y)$, ..., $d_{16}(x,y)$ from a probe cell to the zone centers]

MAS 5.0 Signal: Background Adjustment (continued)
Let $w_k(x,y) = 1 / \big(d_k^2(x,y) + 100\big)$.
Denote the background for the cell located at coordinates (x,y) by
$b(x,y) = \sum_{k=1}^{16} w_k(x,y)\, b_{Z_k} \Big/ \sum_{k=1}^{16} w_k(x,y)$.
Denote the "noise" for the cell located at coordinates (x,y) by
$n(x,y) = \sum_{k=1}^{16} w_k(x,y)\, n_{Z_k} \Big/ \sum_{k=1}^{16} w_k(x,y)$.
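Both b(x,y) and n(x,y) are the same distance-weighted average applied to different zone values, so one helper covers both. The sketch below assumes the weight uses the squared distance plus 100. The chip size, zone centers, and zone background values are made up for illustration.

```python
import numpy as np

def smoothed_zone_value(x, y, zone_centers, zone_values, smooth=100.0):
    """Distance-weighted average of the zone values for a cell at (x, y)."""
    centers = np.asarray(zone_centers, dtype=float)            # shape (16, 2)
    d2 = (centers[:, 0] - x)**2 + (centers[:, 1] - y)**2       # squared distances
    w = 1.0 / (d2 + smooth)
    return np.sum(w * np.asarray(zone_values, dtype=float)) / np.sum(w)

# Hypothetical 640 x 640 chip split into a 4 x 4 grid of zones.
centers_1d = np.array([80.0, 240.0, 400.0, 560.0])
zone_centers = np.array([(cx, cy) for cy in centers_1d for cx in centers_1d])
zone_background = np.linspace(40.0, 70.0, 16)   # made-up zone background values

b = smoothed_zone_value(100.0, 90.0, zone_centers, zone_background)
```

Calling the same function with the zone noise values in place of `zone_background` would give n(x,y).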

MAS 5.0 Signal: Background Adjustment (continued)
Let I(x,y) denote the original intensity of the cell located at coordinates (x,y) on the chip (the 75th percentile of 36 pixel intensities in the center of the cell). Let I'(x,y) = max( I(x,y), 0.5 ). Define the background-adjusted intensity for the cell at coordinates (x,y) by
A(x,y) = max{ I'(x,y) - b(x,y), 0.5 n(x,y) }.
Henceforth these background-adjusted intensities will be referred to as either PM or MM for perfect match or mismatch cells, respectively.

MAS 5.0 Signal: Ideal Mismatch Computation
MM values are supposed to provide measures of cross-hybridization and stray signal intensity that inflate the value of PM. In the simplest case, a PM value would be corrected simply by subtracting its corresponding MM value. However, some MM values are bigger than their corresponding PM values so that PM-MM would become negative. Because negative values do not make sense and would pose problems with subsequent steps in analysis, Affymetrix determines an Ideal Mismatch (IM) value for each probe pair that is guaranteed to be less than PM.

MAS 5.0 Signal: Ideal Mismatch Computation (continued)
For a given probe set containing n probe pairs, let $PM_j$ and $MM_j$ denote the perfect match and mismatch values of the jth probe pair. The IM value from the jth probe pair ($IM_j$) is determined as follows:
- If $PM_j > MM_j$, then $IM_j = MM_j$ and no further computation is needed.
- If $PM_j \le MM_j$, compute $M = \mathrm{TBW}\{\log_2(PM_1/MM_1), \ldots, \log_2(PM_n/MM_n)\}$, where TBW denotes a one-step Tukey BiWeight (a special weighted average described later).

MAS 5.0 Signal: Ideal Mismatch Computation (continued)
- If $M > 0.03$, then $IM_j = PM_j / 2^{M}$.
- If $M \le 0.03$, then compute $P = \dfrac{0.03}{1 + (0.03 - M)/10}$ and let $IM_j = PM_j / 2^{P}$.
Note that at M = 0.03, $IM_j = PM_j / 1.021012$, so that $PM_j$ will be slightly larger than $IM_j$. As M gets larger, $IM_j$ decreases. As M decreases toward 0, $IM_j$ increases toward $PM_j / 1.020949$.
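The case analysis for the ideal mismatch can be sketched as a small function. It takes the probe-set value M as an input rather than computing the Tukey biweight itself. The function and parameter names (`contrast_tau`, `scale_tau`) are illustrative labels for the constants 0.03 and 10.

```python
def ideal_mismatch(pm, mm, m, contrast_tau=0.03, scale_tau=10.0):
    """Ideal mismatch for one probe pair, given the probe-set biweight value m."""
    if pm > mm:
        return mm                       # the mismatch is usable as it is
    if m > contrast_tau:
        return pm / 2.0**m
    p = contrast_tau / (1.0 + (contrast_tau - m) / scale_tau)
    return pm / 2.0**p

# Made-up probe pair with PM <= MM, evaluated at M = 0.03 and M = 0.
im_at_threshold = ideal_mismatch(100.0, 120.0, 0.03)
im_at_zero = ideal_mismatch(100.0, 120.0, 0.0)
ratio_at_threshold = 100.0 / im_at_threshold   # about 1.021012
ratio_at_zero = 100.0 / im_at_zero             # about 1.020949
```

In every branch the returned value is strictly below PM, which is what keeps PM - IM positive.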

MAS 5.0 Signal: Signal Log Value Computation
Let $V_j = \max(PM_j - IM_j,\ 2^{-20})$. Define the probe value for the jth probe pair by $PV_j = \log_2(V_j)$. The signal log value for a given probe set is defined by
$SLV = \mathrm{TBW}(PV_1, PV_2, \ldots, PV_n)$,
where TBW denotes a one-step Tukey BiWeight (a special weighted average to be discussed later).

MAS 5.0 Signal: Scaling and Signal Calculation
Let $SLV_i$ denote the signal log value for the ith probe set on a single chip. Let I denote the number of probe sets on the chip. Let
$SF = 500 \,/\, \mathrm{TrimMean}(2^{SLV_1}, 2^{SLV_2}, \ldots, 2^{SLV_I};\ 0.02, 0.98)$,
where TrimMean is the average of the values in parentheses that are strictly between the 0.02 and 0.98 quantiles of the values in parentheses. MAS 5.0 Signal for the ith probe set is
$\mathrm{Signal}_i = SF \cdot 2^{SLV_i}$.
All computations are done separately for each chip to obtain a Signal value for each chip and probe set.
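A sketch of the scaling step, assuming the signal log values of one chip are in a NumPy array. The quantiles are computed with NumPy's default interpolation, which is an assumption and may not match the Affymetrix implementation exactly. The function name and data are made up.

```python
import numpy as np

def mas5_scale(slv, target=500.0, lower=0.02, upper=0.98):
    """Scale 2**SLV so that the trimmed mean of the chip equals the target."""
    values = 2.0 ** np.asarray(slv, dtype=float)
    lo, hi = np.quantile(values, [lower, upper])
    trimmed = values[(values > lo) & (values < hi)]   # strictly between the quantiles
    sf = target / trimmed.mean()
    return sf, sf * values

slv = np.linspace(4.0, 12.0, 100)      # made-up signal log values for one chip
sf, signal = mas5_scale(slv)
```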

The One-Step Tukey BiWeight Estimator Used by Affymetrix
Let $x_1, x_2, \ldots, x_n$ denote observations.
Let $m = \mathrm{median}(x_1, x_2, \ldots, x_n)$.
Let $MAD = \mathrm{median}(|x_1 - m|, |x_2 - m|, \ldots, |x_n - m|)$.
For each $i = 1, 2, \ldots, n$, let
$t_i = \dfrac{x_i - m}{5 \cdot MAD + 0.0001}$,
where 0.0001 is the factor Affymetrix uses to avoid division by 0.

The One-Step Tukey BiWeight Estimator Used by Affymetrix (ctd.)
Recall the bisquare weight function, defined as
$B(t) = (1 - t^2)^2$ for $|t| < 1$, and $B(t) = 0$ for $|t| \ge 1$.
[Figure: Bisquare Weight Function, $B(t)$ plotted against $t$.]
$\mathrm{TBW}(x_1, x_2, \ldots, x_n) = \dfrac{\sum_{i=1}^{n} B(t_i)\, x_i}{\sum_{i=1}^{n} B(t_i)}$

An Example
Compute TBW(1, 7, 13, 15, 28, 1075). Ignore the 0.0001 factor to make calculations easier.
$m = (13 + 15)/2 = 14$.
$MAD = \mathrm{median}(|1-14|, |7-14|, |13-14|, |15-14|, |28-14|, |1075-14|) = \mathrm{median}(13, 7, 1, 1, 14, 1061) = \mathrm{median}(1, 1, 7, 13, 14, 1061) = (7 + 13)/2 = 10$.
$t_1 = -13/50$, $t_2 = -7/50$, $t_3 = -1/50$, $t_4 = 1/50$, $t_5 = 14/50$, $t_6 = 1061/50$

An Example (continued)
$t_1 = -13/50$, $t_2 = -7/50$, $t_3 = -1/50$, $t_4 = 1/50$, $t_5 = 14/50$, $t_6 = 1061/50$
$B(t_1) = B(-0.26) = (1 - 0.26^2)^2 = 0.8693698$
$B(t_2) = B(-0.14) = (1 - 0.14^2)^2 = 0.9611842$
$B(t_3) = B(-0.02) = (1 - 0.02^2)^2 = 0.9992002$
$B(t_4) = B(0.02) = (1 - 0.02^2)^2 = 0.9992002$
$B(t_5) = B(0.28) = (1 - 0.28^2)^2 = 0.8493466$
$B(t_6) = 0$ because $|t_6| \ge 1$
$\mathrm{TBW} = \dfrac{0.8693698 \cdot 1 + 0.9611842 \cdot 7 + 0.9992002 \cdot 13 + 0.9992002 \cdot 15 + 0.8493466 \cdot 28 + 0 \cdot 1075}{0.8693698 + 0.9611842 + 0.9992002 + 0.9992002 + 0.8493466 + 0} = 12.68772$
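The one-step Tukey biweight and the worked example can be checked with a short Python sketch. The function names `bisquare` and `tukey_biweight` and the parameter names `c` and `eps` are choices made for this illustration; only the constants 5 and 0.0001 come from the slides.

```python
from statistics import median


def bisquare(t):
    """Bisquare weight: (1 - t^2)^2 inside (-1, 1), zero outside."""
    return (1.0 - t * t) ** 2 if abs(t) < 1.0 else 0.0


def tukey_biweight(xs, c=5.0, eps=0.0001):
    """One-step Tukey biweight: a weighted mean that downweights outliers.

    c   : multiplier applied to the MAD (5 in the slides)
    eps : small constant that avoids division by zero when MAD is 0
    """
    xs = list(xs)
    m = median(xs)
    mad = median([abs(x - m) for x in xs])
    weights = [bisquare((x - m) / (c * mad + eps)) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


# Worked example, ignoring the 0.0001 factor as in the slides.
example_value = tukey_biweight([1, 7, 13, 15, 28, 1075], eps=0.0)
```

Here `example_value` agrees with the hand calculation (about 12.68772). The outlier 1075 receives weight 0, so it has no influence on the result, whereas the ordinary mean of the six values would be dominated by it.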

Obtaining MAS5.0 Signal Values from Affymetrix .CEL Files
MAS5.0 Signal values can be obtained from Affymetrix software.
Approximate MAS5.0 Signal values can be computed with the mas5 function that is part of the Bioconductor package affy.

Robust Multi-array Average (RMA)
1. Background adjust PM values from .CEL files.
2. Take the base-2 log of each background-adjusted PM intensity.
3. Quantile normalize values from step 2 across all GeneChips.
4. Perform median polish separately for each probe set with rows indexed by GeneChip and columns indexed by probe.
5. For each row, find the average of the fitted values from step 4 to use as probe-set-specific expression measures for each GeneChip.

RMA: Background Adjustment
Assume $PM = S + B$, where signal $S \sim \mathrm{Exp}(\lambda)$ is independent of background $B \sim N^{+}(\mu, \sigma^2)$.
$N^{+}(\mu, \sigma^2)$ denotes $N(\mu, \sigma^2)$ truncated on the left at 0.

The Probability Density Function of the Exponential Distribution with Mean $1/\lambda = 10000$
[Figure: density $\lambda e^{-\lambda s}$ plotted against $s$.]

The Probability Density Function of the Normal Distribution with Mean $\mu = 1000$ and Variance $\sigma^2 = 300^2$
[Figure: density $\dfrac{e^{-(b-\mu)^2/(2\sigma^2)}}{(2\pi\sigma^2)^{0.5}}$ plotted against $b$.]

The Probability Density Function of $s + b$ where $s \sim \mathrm{Exp}(\lambda = 1/10000)$ and $b \sim N^{+}(\mu = 1000, \sigma^2 = 300^2)$
[Figure: density of $s + b$ plotted against $s + b$.]

RMA: Background Adjustment (continued)
[Formula for $E(S \mid PM)$, written in terms of the N(0,1) density function and the N(0,1) distribution function.]
Separately for each chip, estimate μ, σ, and λ from the observed PM distribution. Plug those estimates into the formula above to obtain an estimate of $E(S \mid PM)$ for each PM value. These serve as background-adjusted PM values.
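As a sketch of how such a conditional mean can be computed under the model PM = S + B with S ~ Exp(λ) and B ~ N+(μ, σ²): given PM = x, S follows a normal distribution with mean a = x - μ - λσ² and standard deviation σ, truncated to the interval [0, x], so that E(S | PM = x) = a + σ[φ(a/σ) - φ((x - a)/σ)] / [Φ(a/σ) + Φ((x - a)/σ) - 1], where φ and Φ are the N(0,1) density and distribution functions. This form is derived here from the stated model and may differ in notation from the formula on the slide; the function names below are hypothetical.

```python
import math


def std_normal_pdf(z):
    """N(0,1) density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    """N(0,1) distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_signal(pm, mu, sigma, lam):
    """E(S | PM = pm) when S ~ Exp(lam) and B ~ N(mu, sigma^2) truncated at 0.

    Mean of a normal(a, sigma^2) distribution truncated to [0, pm],
    with a = pm - mu - lam * sigma^2.
    """
    a = pm - mu - lam * sigma ** 2
    numerator = std_normal_pdf(a / sigma) - std_normal_pdf((pm - a) / sigma)
    denominator = std_normal_cdf(a / sigma) + std_normal_cdf((pm - a) / sigma) - 1.0
    return a + sigma * numerator / denominator
```

With μ = 1000, σ = 300, and λ = 1/10000 (the values used in the density plots), a large PM value such as 5000 is adjusted to roughly PM - μ - λσ², while a PM value below μ is still mapped to a positive adjusted value. That is the practical advantage over simple background subtraction.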

RMA: Background Adjustment (continued)
Obtaining Estimates of μ, σ, and λ (unpublished description of the procedure)
1. Estimate the mode of the PM distribution using a kernel density estimate of the PM density.
2. Estimate the density of the PM values less than the mode. The mode of this distribution serves as an estimate of μ.
3. Assume the data to the left of the estimate of μ are the background observations that fell below their mean. Use those observations to estimate σ.
4. Subtract the estimate of μ from all observations larger than the estimate. The mode of this distribution estimates 1/λ.

PM Density Estimate Based on Simulated Data
[Figure: density estimate of the simulated PM data.]
Data below the estimated mode is used to estimate background parameters μ and σ.

Density Estimate of PM Data below the Estimated Mode of the PM Distribution
[Figure: density estimate. Estimate of μ = 1612.]
This data is used to estimate σ as 642.3.

Estimate of σ
According to the RMA R code, σ is estimated as follows:
[Formula for the estimate of σ.]
The purpose of the factor of 2 in the numerator is not clear.

Density Estimate of PM - $\hat{\mu}$ Values Greater than Zero
[Figure: density estimate. Estimate of 1/λ = 2019.]
The mean of these values would be a much better estimate of 1/λ in this case. (Mean is 9848 and 1/λ = 10000.)

RMA: Quantile Normalization
1. After background adjustment, find the smallest log2(PM) on each chip.
2. Average the values from step 1.
3. Replace each value in step 1 with the average computed in step 2.
4. Repeat steps 1 through 3 for the second smallest values, third smallest values, ..., largest values.
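The quantile normalization steps can be sketched in Python as follows. The function name `quantile_normalize` is hypothetical, every chip is assumed to have the same number of values, and ties are handled in the simplest way (tied values receive consecutive ranks), which is an assumption of this sketch rather than a documented rule.

```python
def quantile_normalize(chips):
    """Quantile normalize a list of chips (each a list of log2(PM) values).

    The r-th smallest value on every chip is replaced by the average of
    the r-th smallest values across all chips.
    """
    n = len(chips[0])
    # For each chip, the positions of its values from smallest to largest.
    orders = [sorted(range(n), key=chip.__getitem__) for chip in chips]
    # Average across chips of the values holding each rank.
    rank_means = [
        sum(chip[order[r]] for chip, order in zip(chips, orders)) / len(chips)
        for r in range(n)
    ]
    # Write each rank average back to the position that held that rank.
    normalized = [[0.0] * n for _ in chips]
    for c, order in enumerate(orders):
        for r, position in enumerate(order):
            normalized[c][position] = rank_means[r]
    return normalized
```

For two chips with values [5, 2, 3] and [4, 1, 6], the rank averages are 1.5, 3.5, and 5.5, so the normalized chips are [5.5, 1.5, 3.5] and [3.5, 1.5, 5.5]. After normalization every chip has exactly the same set of values, only in a different order.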

RMA: Median Polish
For a given probe set with $J$ probe pairs, let $y_{ij}$ denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip $i$ and probe $j$.
Assume $y_{ij} = \mu_i + \alpha_j + e_{ij}$, where $\alpha_1 + \alpha_2 + \cdots + \alpha_J = 0$.
Here $\mu_i$ is the gene expression of the probe set on GeneChip $i$, $\alpha_j$ is the probe affinity effect for the $j$th probe in the probe set, and $e_{ij}$ is the residual for the $j$th probe on the $i$th GeneChip.
Perform Tukey's Median Polish on the matrix of $y_{ij}$ values with $y_{ij}$ in the $i$th row and $j$th column.

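RMA: Median Polish (continued)
Let $\hat{y}_{ij}$ denote the fitted value for $y_{ij}$ that results from the median polish procedure.
Let $\hat{\alpha}_j = \hat{y}_{\cdot j} - \hat{y}_{\cdot\cdot}$, where $\hat{y}_{\cdot j} = \dfrac{\sum_{i=1}^{I} \hat{y}_{ij}}{I}$ and $\hat{y}_{\cdot\cdot} = \dfrac{\sum_{i=1}^{I} \sum_{j=1}^{J} \hat{y}_{ij}}{IJ}$, and $I$ denotes the number of GeneChips.
Let $\hat{\mu}_i = \hat{y}_{i\cdot} = \dfrac{\sum_{j=1}^{J} \hat{y}_{ij}}{J}$.
$\hat{\mu}_i$ is the probe-set-specific measure of expression for GeneChip $i$.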

An Example
Suppose the following are background-adjusted, log2-transformed, quantile-normalized PM intensities for a single probe set. Determine the final RMA expression measures for this probe set.
              Probe 1   Probe 2   Probe 3   Probe 4   Probe 5
GeneChip 1       4         3         6         4         7
GeneChip 2       8         1        10         5        11
GeneChip 3       6         2         7         8         8
GeneChip 4       9         4        12         9        12
GeneChip 5       7         5         9         6        10

An Example (continued)
Original matrix, with row medians at right:
 4   3   6   4   7   |  4
 8   1  10   5  11   |  8
 6   2   7   8   8   |  7
 9   4  12   9  12   |  9
 7   5   9   6  10   |  7
Matrix after removing row medians:
 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3

An Example (continued)
Matrix after removing row medians:
 0  -1   2   0   3
 0  -7   2  -3   3
-1  -5   0   1   1
 0  -5   3   0   3
 0  -2   2  -1   3
Column medians: 0  -5  2  0  3
Matrix after subtracting column medians:
 0   4   0   0   0
 0  -2   0  -3   0
-1   0  -2   1  -2
 0   0   1   0   0
 0   3   0  -1   0

An Example (continued)
Current matrix, with row medians at right:
 0   4   0   0   0   |  0
 0  -2   0  -3   0   |  0
-1   0  -2   1  -2   | -1
 0   0   1   0   0   |  0
 0   3   0  -1   0   |  0
Matrix after removing row medians:
 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0

An Example (continued)
Matrix after removing row medians:
 0   4   0   0   0
 0  -2   0  -3   0
 0   1  -1   2  -1
 0   0   1   0   0
 0   3   0  -1   0
Column medians: 0  1  0  0  0
Matrix after subtracting column medians:
 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0

An Example (continued)
 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0
All row medians and column medians are 0. Thus the median polish procedure has converged. The above is the residual matrix that we will subtract from the original matrix to obtain the fitted values.

An Example (continued)
Original matrix:
 4   3   6   4   7
 8   1  10   5  11
 6   2   7   8   8
 9   4  12   9  12
 7   5   9   6  10
Residuals from median polish:
 0   3   0   0   0
 0  -3   0  -3   0
 0   0  -1   2  -1
 0  -1   1   0   0
 0   2   0  -1   0
Matrix of fitted values (original minus residuals), with row means at right:
 4   0   6   4   7   |  $\hat{\mu}_1$ = 4.2
 8   4  10   8  11   |  $\hat{\mu}_2$ = 8.2
 6   2   8   6   9   |  $\hat{\mu}_3$ = 6.2
 9   5  11   9  12   |  $\hat{\mu}_4$ = 9.2
 7   3   9   7  10   |  $\hat{\mu}_5$ = 7.2
The row means are the RMA expression measures for the 5 GeneChips.
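The median polish example can be reproduced with a short Python sketch. The function names are hypothetical, and the stopping rule (stop once a full row-then-column sweep finds only zero medians, or after `max_iter` sweeps) is an assumption of this sketch. Other implementations may use a different convergence criterion.

```python
from statistics import median


def median_polish_residuals(matrix, max_iter=20, tol=1e-9):
    """Alternately subtract row medians and column medians until all are zero."""
    resid = [list(row) for row in matrix]
    for _ in range(max_iter):
        row_medians = [median(row) for row in resid]
        resid = [[v - m for v in row] for row, m in zip(resid, row_medians)]
        col_medians = [median(col) for col in zip(*resid)]
        resid = [[v - m for v, m in zip(row, col_medians)] for row in resid]
        # Converged when the sweep found nothing left to remove.
        if all(abs(m) <= tol for m in row_medians + col_medians):
            break
    return resid


def rma_expression(matrix):
    """Row means of the fitted values (original minus residuals)."""
    resid = median_polish_residuals(matrix)
    fitted = [[y - e for y, e in zip(row, res)] for row, res in zip(matrix, resid)]
    return [sum(row) / len(row) for row in fitted]


# Rows are GeneChips, columns are probes, as in the example.
EXAMPLE = [
    [4, 3, 6, 4, 7],
    [8, 1, 10, 5, 11],
    [6, 2, 7, 8, 8],
    [9, 4, 12, 9, 12],
    [7, 5, 9, 6, 10],
]
```

Calling `rma_expression(EXAMPLE)` gives the five expression measures from the example, 4.2, 8.2, 6.2, 9.2, and 7.2 (up to floating-point rounding), and `median_polish_residuals(EXAMPLE)` gives the same residual matrix as the hand calculation.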

Miscellaneous Comments on Normalization
We have only scratched the surface in terms of normalization methods. There are many variations on the techniques that were described previously as well as other approaches that we won't discuss at this point in the course.
Normalization affects the final results, but it is often not clear what normalization strategy is best.
It would be good to integrate normalization and statistical analysis, but it is difficult to do so. The most common approach is to normalize data and then perform statistical analysis of the normalized data as a separate step in the microarray analysis process.