Microarray Data Pre-Processing


1 Microarray Data Pre-Processing
4/25/2017

2 Copyright notice Many of the images in this PowerPoint presentation are the work of other people. The copyright belongs to the original authors. Thanks!

3 Microarray data analysis: preprocessing
The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.

4 Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as: different labeling efficiencies of Cy3 and Cy5; uneven spotting of DNA onto the array surface; variations in RNA purity or quantity; variations in washing efficiency; variations in scanning efficiency.

5 Microarray data analysis: preprocessing
1. Image analysis 2. Background correction 3. Normalization 4. Summarization

6 Image analysis The raw data from a cDNA microarray experiment consist of pairs of image files, 16-bit TIFFs, one for each of the dyes. Image analysis is required to extract measures of the red and green fluorescence intensities for each spot on the array.

7 Steps in Image Processing
1. Addressing: locate spot centers. 2. Segmentation: classify pixels as either signal or background (e.g., using seeded region growing). 3. Information extraction: for each spot on the array, calculate signal intensity pairs, background, and quality measures.

8 Addressing This is the process of assigning coordinates to each of the spots. Automating this part of the procedure permits high-throughput analysis. Example layout: 4 by 4 grids, 19 by 21 spots per grid.

9 Addressing [Figure: registration of the array image.]

10 Problems in automatic addressing
Misregistration of the red and green channels; rotation of the array in the image; skew in the array.

11 Segmentation methods
Fixed circles; adaptive circles; adaptive shape (edge detection, seeded region growing); histogram methods (adaptive threshold). In seeded region growing (R. Adams and L. Bischof, 1994), regions grow outwards from the seed points preferentially according to the difference between a pixel's value and the running mean of values in an adjoining region.

12 Examples of algorithms and software implementation

13 Limitation of fixed circle method
[Figure: segmentation by SRG compared with a fixed circle.]

14 Limitation of circular segmentation
[Figure: a small spot and a non-circular spot, with results from SRG.]

15 Information Extraction
Spot intensities: mean of pixel intensities; median of pixel intensities; pixel variation (IQR of log pixel intensities). Background values: local; morphological opening; constant (global); none. Quality information: signal; background.

16 Background Correction
Recall that spot signal, or simply signal, is fluorescence intensity due to target molecules hybridized to probe sequences contained in a spot (what we would like to measure) plus background fluorescence (what we would rather not measure). Background is fluorescence that may contribute to spot pixel intensities but is not due to fluorescence from target molecules hybridized to spot probe sequences. The idea is to remove background fluorescence from the spot signal fluorescence, because the spot signal is believed to be the sum of fluorescence due to background and fluorescence due to hybridized target cDNA.

17 Local background
Focus on small regions surrounding the spot mask and take the median of pixel values in this region. Most software packages implement such an approach (ScanAlyze, ImaGene, Spot, GenePix). By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure.

18 Global background A global method subtracts a constant background from all spots. Some findings suggest that the binding of fluorescent dyes to 'negative control spots' is lower than the binding to the glass slide, so it is more meaningful to estimate background based on a set of negative control spots. If there are no negative control spots, the average background can be approximated by the third percentile of all the spot foreground values.

19 Background Correction Strategies (applied prior to logging signal intensity)
1. Subtract local background, e.g., signal mean - background mean or signal mean - background median. This can increase variation in measurements, especially for low-expressing genes. Some believe that local background will overestimate the background contribution to spot fluorescence. Background fluorescence where cDNA has been spotted may be different from background where no cDNA has been spotted.

20 Background Correction Strategies (applied prior to logging signal intensity)
2. For each spot, find the local background of the spot as well as the local backgrounds of all neighboring spots. Compute the median or mean of these local backgrounds. Subtract that summary of local backgrounds from the spot's signal. This is similar to option 1 but can reduce some variation in background estimation.

21 Background Correction Strategies (applied prior to logging signal intensity)
3. Find the median or mean of local backgrounds in a sector. Subtract the sector summary of local backgrounds from each signal in the sector. 4. Subtract the median or mean of blank spot signals or negative control signals in a sector from all other signals in the sector. 5. Estimate the background for each spot by fitting a row and column model to the local background values in a sector. (See next slide.)

22 Modeling local backgrounds within each sector (Kafadar and Phang (2003), CSDA)
$b_{ij} = m + r_i + c_j + e_{ij}$, where $b_{ij}$ is the background for the spot in the $i$th row and $j$th column of the sector, $m$ is the baseline background for the sector, $r_i$ is the row effect for the sector, $c_j$ is the column effect for the sector, and $e_{ij}$ is the residual. An estimated background $\hat{b}_{ij}$ for each spot is obtained via median polish.

23 Comments on Background Correction
Subtracting background may result in negative or zero adjusted signal values. Such values cannot be logged. One simple approach is to replace all negative values by zero, add one to all values (whether zero or not), and log the resulting values.
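The simple approach described on this slide can be sketched in a few lines. This is only an illustration: the function name, the example intensities, and the choice of base-2 logs are assumptions, since the slide does not specify a log base.

```python
import math

def background_corrected_log2(signal, background):
    """Subtract background, floor negatives at zero, add one, then log.

    Base 2 is an assumption; the slide only says "log".
    """
    adjusted = max(signal - background, 0.0)  # negative values become zero
    return math.log2(adjusted + 1.0)          # +1 so that zero can be logged

# Hypothetical (signal, background) pairs for three spots.
spots = [(1200.0, 200.0), (150.0, 180.0), (300.0, 300.0)]
logged = [background_corrected_log2(s, b) for s, b in spots]
```

Spots whose background is at least as large as their signal all map to 0 on the log scale, which is one reason this approach compresses information for low-expressing genes.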

24 Data Normalization Large sets of experiments involve dozens to hundreds of arrays. To make the arrays comparable, the data need to be normalized. Because equal amounts of mRNA are used in all arrays, the spot intensities of an array should sum to a fixed number.

25 What is Normalization? Normalization describes the process of removing (or minimizing) non-biological variation in the measured gene expression levels of hybridized mRNA so that biological differences can be more easily detected. Typically normalization is attempting to remove global effects, i.e., effects that can be seen by examining plots that show all the data for a slide or slides. Normalization does not necessarily have anything to do with the normal distribution that plays a prominent role in statistics.

26 Sources of Non-Biological Variation
Dye bias: differences in heat and light sensitivity and in the efficiency of dye incorporation. Differences in the amount of labeled cDNA hybridized to each channel in a microarray experiment (channel refers to a combination of a dye and a slide). Variation across replicate slides. Variation across hybridization conditions. Variation in scanning conditions. Variation among technicians doing the lab work. Etc.

27 Normalization Methods for Two-Color Microarray Data

28 Side-by-side boxplots show examples of variation across channels.

29 [Figure: side-by-side boxplots for Slide 1 Cy3, Slide 1 Cy5, Slide 2 Cy3, and Slide 2 Cy5, with the maximum, Q3 (75th percentile), median, and minimum labeled.]

30 The interquartile range (IQR) is Q3 - Q1. Points more than 1.5*IQR above Q3 or more than 1.5*IQR below Q1 are displayed individually.
[Figure: boxplot with the maximum, Q3 (75th percentile), median, Q1 (25th percentile), and minimum labeled.]

31 One of the simplest normalization strategies is to align the log signals so that all channels have the same median. The value of the common median is not important for subsequent analyses. A convenient choice is zero so that positive or negative values reflect signals above or below the median for a particular channel. If negative normalized signal values seem confusing, any positive constant may be added to all values after normalization to zero medians.

32 [Figure: boxplots of log mean signal centered at 0.]

33 Note that medians match but variation seems to differ greatly across channels.
[Figure: boxplots of log mean signal centered at 0.]

34 Scale normalization (Yang, et al. 2002. Nucleic Acids Research, 30, 4 e15)
Consider a matrix X with rows i=1,...,I and columns j=1,...,J. Let $x_{ij}$ denote the entry in row i and column j. We will apply scale normalization to the matrix of log signal mean values that have already been median centered (each row corresponds to a gene and each column corresponds to a channel). For each column j, let $m_j = \mathrm{median}(x_{1j}, x_{2j}, \ldots, x_{Ij})$. For each column j, let $\mathrm{MAD}_j = \mathrm{median}(|x_{1j}-m_j|, |x_{2j}-m_j|, \ldots, |x_{Ij}-m_j|)$ (MAD: median absolute deviation). To scale normalize the columns of X to a constant value C, multiply all the entries in the jth column by $C/\mathrm{MAD}_j$ for all j=1,...,J. A common choice for C is the geometric mean of $\mathrm{MAD}_1, \ldots, \mathrm{MAD}_J$, that is, $C = \left(\prod_{j=1}^{J} \mathrm{MAD}_j\right)^{1/J}$. The choice of C will not affect subsequent tests or p-values but will affect fold change calculations. *Yang et al. recommended scale normalization for log R/G values.
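Median centering followed by scale normalization can be sketched as follows. The four channels below are hypothetical values chosen so that their MADs are 2, 4, 1, and 2; the function names are illustrative, and the sketch assumes no channel has a MAD of zero.

```python
import math
from statistics import median

def median_center(col):
    """Shift one channel (column) so that its median is zero."""
    m = median(col)
    return [x - m for x in col]

def mad(col):
    """Median absolute deviation of one channel."""
    m = median(col)
    return median(abs(x - m) for x in col)

def scale_normalize(columns):
    """Median center each channel, then rescale every channel to MAD = C,
    where C is the geometric mean of the channel MADs."""
    centered = [median_center(col) for col in columns]
    mads = [mad(col) for col in centered]
    c = math.exp(sum(math.log(v) for v in mads) / len(mads))
    scaled = [[x * c / s for x in col] for col, s in zip(centered, mads)]
    return scaled, mads, c

# Hypothetical log signals: one list per channel, one entry per gene.
channels = [
    [3, 5, 7, 9, 11],
    [-1, 3, 7, 11, 15],
    [5, 6, 7, 8, 9],
    [4, 5, 7, 9, 12],
]
scaled, mads, c = scale_normalize(channels)
```

After this step every channel has median 0 and MAD equal to C, so the channels differ only in the shape of their distributions.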

35 Data after Median Centering and Scale Normalizing
[Figure: boxplots of log mean signal (centered and scaled).]

36 A Simple Example
Table columns: Gene, Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5.

37 Determine Channel Medians

38 Subtract Channel Medians
This is the data after median centering.

39 Find Median Absolute Deviations

40 Find Scaling Constant
The four channel MADs are 2, 4, 1, and 2, so $C = (2 \cdot 4 \cdot 1 \cdot 2)^{1/4} = 2$.

41 Find Scaling Factors

42 Scale Normalize the Median Centered Data
This is the data after median centering and scale normalizing.

43 Evidence of intensity-dependent dye bias
[Figure: Slide 1 log signal means after median centering and scaling all channels; log red versus log green.]

44 M vs. A Plot of the Logged, Centered, and Scaled Slide 1 Data
M = Log Red - Log Green; A = (Log Green + Log Red) / 2

45 Lowess normalization
To handle intensity-dependent dye bias, Yang, et al. (2002. Nucleic Acids Research, 30, 4 e15) recommend "lowess" normalization prior to median centering and scale normalizing. "lowess" stands for LOcally WEighted polynomial regreSSion. The original reference for lowess is Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. JASA.

46 LOESS At each point in the data set a low-degree polynomial is fit to a subset of the data, with explanatory variable values near the point whose response is being estimated. The polynomial is fit using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The value of the regression function for the point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point. The LOESS fit is complete after regression function values have been computed for each of the n data points. (From Wikipedia, the free encyclopedia.)

47 Slide 1 Log Signal Means
[Figure: log red versus log green.]

48 M vs. A Plot for Slide 1 Log Signal Means
M = Log Red - Log Green; A = (Log Green + Log Red) / 2

49 M vs. A Plot for Slide 1 Log Signal Means with lowess fit (f=0.40)

50 Adjust M Values
[Figure: M vs. A plot.]

51 M vs. A Plot after Adjustment
M = Adjusted Log Red - Adjusted Log Green; A = (Adjusted Log Green + Adjusted Log Red) / 2

52 Adjusting the Log Red and Log Green Values
adjusted log red = log red - adj/2 and adjusted log green = log green + adj/2, where adj = lowess fitted value.
[Figure: adjusted log red versus adjusted log green.]

53 M vs. A Plot for Slide 1 Log Signal Means with lowess fit (f=0.40)
For spots with A=7, the lowess fitted value is 0.883. Thus the value of adj discussed on the previous slide is 0.883 for spots with A=7. The M value for such spots would be moved down by 0.883. The log red value would be decreased by 0.883/2 and the log green value would be increased by 0.883/2 to obtain adjusted log red and adjusted log green values, respectively.
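The adjustment for a single spot can be sketched as below. The log red and log green values are hypothetical (chosen so that A = 7), and the function names are illustrative; only the 0.883 fitted value comes from the slide.

```python
def to_ma(log_red, log_green):
    """Return (M, A) for one spot."""
    return log_red - log_green, (log_green + log_red) / 2.0

def adjust_channels(log_red, log_green, adj):
    """Split the lowess fitted value adj evenly between the two channels."""
    return log_red - adj / 2.0, log_green + adj / 2.0

log_red, log_green = 7.5, 6.5            # hypothetical spot with A = 7
m, a = to_ma(log_red, log_green)
adj_red, adj_green = adjust_channels(log_red, log_green, 0.883)
m_adj, a_adj = to_ma(adj_red, adj_green)
```

Splitting the adjustment evenly moves M down by adj while leaving A unchanged, which is why the adjusted M vs. A plot is centered on M = 0 at every intensity.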

54 How is the lowess curve determined? Weight function
Suppose we have data points $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$. Let $0 < f \le 1$ denote a fraction that will determine the smoothness of the curve. Let r = n*f rounded to the nearest integer. Consider the tricube weight function defined as $T(t) = (1 - |t|^3)^3$ for $|t| < 1$ and $T(t) = 0$ for $|t| \ge 1$. For i=1, ..., n, let $h_i$ be the rth smallest number among $|x_i-x_1|, |x_i-x_2|, \ldots, |x_i-x_n|$. For k=1, 2, ..., n, let $w_k(x_i) = T((x_k - x_i)/h_i)$.

55 An Example
Suppose a lowess curve will be fit to this data with f=0.4.
[Table of i, xi, yi and a scatterplot of y versus x.]

56 Table Containing |xi-xj| Values

57 Calculation of hi from |xi-xj| Values
n=10 and f=0.4, so r=4. h1=6, h2=5, h3=4, h4=5, h5=5, h6=6, h7=8, h8=10, h9=12, h10=15.

58 Weights wk(xi) Rounded to Nearest 0.001
$w_6(x_5) = \left(1 - \left(|x_6 - x_5| / h_5\right)^3\right)^3 = \left(1 - \left(|13 - 12| / 5\right)^3\right)^3 = (1 - 1/125)^3 \approx 0.976$
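The bandwidths and tricube weights can be computed with a short sketch. The x values below are an assumption: the original table is not in the transcript, so these were chosen because they reproduce x5 = 12, x6 = 13, and the h values listed on slide 57. Function names are illustrative.

```python
def tricube(t):
    """Tricube weight: (1 - |t|^3)^3 inside (-1, 1), zero elsewhere."""
    return (1.0 - abs(t) ** 3) ** 3 if abs(t) < 1.0 else 0.0

def bandwidth(xs, i, f):
    """h_i: the r-th smallest distance from xs[i], with r = n*f rounded."""
    r = int(len(xs) * f + 0.5)
    return sorted(abs(xs[i] - xk) for xk in xs)[r - 1]

def local_weights(xs, i, f):
    """w_k(x_i) for k = 1..n."""
    h = bandwidth(xs, i, f)
    return [tricube((xk - xs[i]) / h) for xk in xs]

# Assumed x values, consistent with the h values on the slide.
xs = [1, 2, 5, 7, 12, 13, 15, 25, 27, 30]
hs = [bandwidth(xs, i, 0.4) for i in range(len(xs))]
w5 = local_weights(xs, 4, 0.4)   # weights around x5 (index 4)
```

Because the tricube function is zero at |t| = 1, the point lying exactly at distance h_i gets weight zero, so only r - 1 points (including x_i itself) contribute to each local fit when the distances are distinct.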

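59 How is the lowess curve determined? Regression
For each i=1, 2, ..., n, let $\hat{\beta}_0(x_i)$ and $\hat{\beta}_1(x_i)$ denote the values of $\beta_0$ and $\beta_1$ that minimize $\sum_{k=1}^{n} w_k(x_i)\,(y_k - \beta_0 - \beta_1 x_k)^2$. For i=1, 2, ..., n, let $\hat{y}_i = \hat{\beta}_0(x_i) + \hat{\beta}_1(x_i)\,x_i$ and $e_i = y_i - \hat{y}_i$. Consider the bisquare weight function defined as $B(t) = (1 - t^2)^2$ for $|t| < 1$ and $B(t) = 0$ for $|t| \ge 1$. For k=1, 2, ..., n, let $\delta_k = B(e_k / (6s))$ where s is the median of $|e_1|, |e_2|, \ldots, |e_n|$.

60 How is the lowess curve determined?
For each i=1, 2, ..., n, let $\hat{\beta}_0(x_i)$ and $\hat{\beta}_1(x_i)$ denote the values of $\beta_0$ and $\beta_1$ that minimize $\sum_{k=1}^{n} \delta_k\, w_k(x_i)\,(y_k - \beta_0 - \beta_1 x_k)^2$. For i=1, 2, ..., n, let $\hat{y}_i = \hat{\beta}_0(x_i) + \hat{\beta}_1(x_i)\,x_i$. Now use the new fitted values to compute new $\delta_k$ as on the previous slide. Substitute the new $\delta_k$ for the old in the expression above and repeat the minimization described above to obtain new $\hat{y}_i$ values. These resulting values are the lowess fitted values. Plot these values versus $x_1, x_2, \ldots, x_n$ and connect with straight lines to obtain the lowess curve.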
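The whole procedure (tricube-weighted local lines, then two rounds of bisquare robustness weights) can be sketched in plain Python. This is a minimal illustration, not the reference lowess implementation: it assumes local linear fits, the 6s scaling of residuals from Cleveland's method, and x values with few enough ties that every h_i is positive. The x values are assumed and the y values are hypothetical.

```python
from statistics import median

def tricube(t):
    return (1.0 - abs(t) ** 3) ** 3 if abs(t) < 1.0 else 0.0

def bisquare(t):
    return (1.0 - t * t) ** 2 if abs(t) < 1.0 else 0.0

def weighted_line_value(xs, ys, ws, x0, fallback):
    """Fit a weighted least-squares line and evaluate it at x0."""
    sw = sum(ws)
    if sw == 0.0:
        return fallback  # every neighbour was down-weighted to zero
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return my + slope * (x0 - mx)

def lowess(xs, ys, f=0.4, robust_iters=2):
    """Return lowess fitted values at each x (initial fit + robust refits)."""
    n = len(xs)
    r = int(n * f + 0.5)
    hs = [sorted(abs(xi - xk) for xk in xs)[r - 1] for xi in xs]
    delta = [1.0] * n            # robustness weights, all 1 for the first fit
    fitted = list(ys)
    for _ in range(robust_iters + 1):
        new_fit = []
        for i in range(n):
            ws = [d * tricube((xk - xs[i]) / hs[i]) for d, xk in zip(delta, xs)]
            new_fit.append(weighted_line_value(xs, ys, ws, xs[i], fitted[i]))
        fitted = new_fit
        resid = [y - yh for y, yh in zip(ys, fitted)]
        s = median(abs(e) for e in resid)
        if s == 0.0:
            break                # perfect fit, nothing left to down-weight
        delta = [bisquare(e / (6.0 * s)) for e in resid]
    return fitted

xs = [1, 2, 5, 7, 12, 13, 15, 25, 27, 30]                        # assumed x values
ys = [3.1, 4.9, 11.2, 14.8, 40.0, 27.1, 30.9, 51.2, 54.8, 61.0]  # hypothetical y values
curve = lowess(xs, ys)
```

For M vs. A normalization, xs would hold the A values and ys the M values, and the returned fitted values play the role of adj for each spot.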

71 Plot Showing All 10 Lines and Predicted Values after One More Iteration

72 The Lowess Curve

73 After a separate lowess normalization for each slide, the adjusted values can be median centered and scale normalized across all channels using the lowess-normalized data for each channel. A sector represents the set of points spotted by a single pin on a single slide. The entire normalization process described above can be carried out separately for each sector on each channel. It may be necessary to normalize by sector/channel combinations if spatial variability is apparent.

74 Boxplots of Mean Signal after Logging, Lowess Normalization, Median Centering, and Scaling
[Figure: boxplots of normalized signal.]

75 Bolstad, et al. (2003, Bioinformatics 19(2)) propose quantile normalization for microarray data. Quantile normalization is most commonly used in normalization of Affymetrix data, but it can be used for two-color data as well. Quantile normalization forces each channel to have the same quantiles. $x_q$ (for q between 0 and 1) is the q quantile of a data set if the fraction of the data points less than or equal to $x_q$ is at least q, and the fraction of the data points greater than or equal to $x_q$ is at least 1-q. For example, median = $x_{0.5}$, Q1 = $x_{0.25}$, and Q3 = $x_{0.75}$.

76 Boxplots of Log Signal Means after Quantile Normalization

77 Original Slide 1 Log Signal Means
[Figure: log red versus log green.]

78 Comparison of Slide 1 Log Signal Means after Quantile Normalization
[Figure: log red versus log green.]

79 Details of Quantile Normalization
1. Find the smallest log signal on each channel. 2. Average the values from step 1. 3. Replace each value in step 1 with the average computed in step 2. 4. Repeat steps 1 through 3 for the second smallest values, third smallest values, ..., largest values.
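The four steps can be sketched as a single function. The toy data uses the smallest values (1, 2, 2, 8) and second smallest values (3, 5, 5, 9) that appear in the worked example later in the deck, but the assignment of those values to genes is hypothetical, and ties are simply broken by sort order.

```python
def quantile_normalize(columns):
    """Quantile normalize channels given as a list of columns.

    Each column holds one log signal per gene. The k-th smallest value in
    every column is replaced by the average of the k-th smallest values
    across all columns.
    """
    n = len(columns[0])
    # For each column, the gene indices ordered from smallest to largest value.
    orders = [sorted(range(n), key=lambda g, col=col: col[g]) for col in columns]
    rank_means = [
        sum(col[order[k]] for col, order in zip(columns, orders)) / len(columns)
        for k in range(n)
    ]
    result = [[0.0] * n for _ in columns]
    for j, order in enumerate(orders):
        for k, g in enumerate(order):
            result[j][g] = rank_means[k]
    return result

# Hypothetical two-gene layout: Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5.
channels = [[3, 1], [2, 5], [5, 2], [8, 9]]
normalized = quantile_normalize(channels)
```

After normalization every channel contains exactly the same set of values (here 3.25 and 5.5); only the gene each value is attached to differs between channels.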

80 A Simple Example
Table columns: Gene, Slide1Cy3, Slide1Cy5, Slide2Cy3, Slide2Cy5.

81 Find the Smallest Value for Each Channel

82 Average These Values
(1+2+2+8)/4=3.25

83 Replace Each Value by the Average

84 Find the Next Smallest Values

85 Average These Values
(3+5+5+9)/4=5.5

86 Replace Each Value by the Average

87 Find the Average of the Next Smallest Values
The average is 7.5.

88 Replace Each Value by the Average

89 Find the Average of the Next Smallest Values
The average is 10.25.

90 Replace Each Value by the Average

91 Find the Average of the Next Smallest Values
The average is 12.00.

92 Replace Each Value by the Average
This is the data matrix after quantile normalization.

93 Background Correction and Normalization of Affymetrix GeneChip Data

94 Affymetrix .CEL Files A .CEL file contains one number representing signal intensity for each probe cell on a single GeneChip. .CEL files can be read with Affymetrix software or in R using the Bioconductor package affy. We will discuss two methods for normalizing and obtaining expression measures using data from Affymetrix .CEL files.

95 Methods 1. Microarray Analysis Suite (MAS) 5.0 Signal, proposed by Affymetrix: Statistical Algorithms Description Document (2002), Affymetrix Inc. 2. Robust Multi-array Average (RMA), proposed by Irizarry et al. (2003), Biostatistics 4. These are perhaps the two most popular of many methods for normalizing and computing expression measures using Affymetrix data. Over 50 methods have been described and compared.

96 MAS 5.0 Signal: Background Adjustment
Each chip is divided into 16 rectangular zones. The lowest 2% of intensities in each zone are averaged to form a zone-specific background value denoted $bZ_k$ for zones k=1, 2, ..., 16. The standard deviation of the lowest 2% of intensities in each zone is calculated and denoted $nZ_k$ for zones k=1, 2, ..., 16. Let $d_k(x,y)$ denote the distance from the center of zone k to a probe cell located at coordinates (x,y) on the chip.

97 GeneChip Divided into 16 Zones
[Figure: a GeneChip divided into 16 numbered zones, with a probe cell at coordinates (x,y).]

98 16 Distances to Zone Centers for Each Probe Cell
[Figure: distances d1(x,y), d4(x,y), ..., d16(x,y) from a probe cell to the zone centers.]

99 MAS 5.0 Signal: Background Adjustment (continued)
Let $w_k(x,y) = 1/(d_k^2(x,y) + 100)$. Denote the background for the cell located at coordinates (x,y) by $b(x,y) = \sum_{k=1}^{16} w_k(x,y)\, bZ_k \,/\, \sum_{k=1}^{16} w_k(x,y)$. Denote the "noise" for the cell located at coordinates (x,y) by $n(x,y) = \sum_{k=1}^{16} w_k(x,y)\, nZ_k \,/\, \sum_{k=1}^{16} w_k(x,y)$.

100 MAS 5.0 Signal: Background Adjustment (continued)
Let I(x,y) denote the original intensity of the cell located at coordinates (x,y) on the chip (the 75th percentile of 36 pixel intensities in the center of the cell). Let $I'(x,y) = \max(I(x,y),\ 0.5)$. Define the background-adjusted intensity for the cell at coordinates (x,y) by $A(x,y) = \max\{I'(x,y) - b(x,y),\ 0.5\,n(x,y)\}$. Henceforth these background-adjusted intensities will be referred to as either PM or MM for perfect match or mismatch cells, respectively.
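The zone-weighted background and the floor on the adjusted intensity can be sketched as follows. This assumes the weight uses the squared distance to each zone center plus 100 (the reading of the weight formula adopted here), and the zone centers, zone backgrounds, and intensities are hypothetical; function and parameter names are illustrative.

```python
def mas5_background(x, y, zone_centers, bz, nz, smooth=100.0):
    """Weighted averages b(x,y) and n(x,y) of the zone values bZ_k and nZ_k.

    Weights are 1 / (squared distance to the zone center + smooth).
    """
    weights = [1.0 / ((x - cx) ** 2 + (y - cy) ** 2 + smooth)
               for cx, cy in zone_centers]
    total = sum(weights)
    b = sum(w * v for w, v in zip(weights, bz)) / total
    n = sum(w * v for w, v in zip(weights, nz)) / total
    return b, n

def adjust_intensity(intensity, b, n, noise_frac=0.5):
    """A(x,y) = max(I'(x,y) - b(x,y), 0.5 * n(x,y)) with I' = max(I, 0.5)."""
    floored = max(intensity, 0.5)
    return max(floored - b, noise_frac * n)

# Hypothetical 4 x 4 grid of zone centers on a 400 x 400 chip.
centers = [(50 + 100 * col, 50 + 100 * row) for row in range(4) for col in range(4)]
bz = [50.0] * 16   # hypothetical zone backgrounds
nz = [4.0] * 16    # hypothetical zone noise values
b, n = mas5_background(120, 310, centers, bz, nz)
```

The floor of 0.5 * n(x,y) guarantees that every background-adjusted intensity is positive, so later log transforms are always defined.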

101 MAS 5.0 Signal: Ideal Mismatch Computation
MM values are supposed to provide measures of cross-hybridization and stray signal intensity that inflate the value of PM. In the simplest case, a PM value would be corrected simply by subtracting its corresponding MM value. However, some MM values are bigger than their corresponding PM values so that PM-MM would become negative. Because negative values do not make sense and would pose problems with subsequent steps in analysis, Affymetrix determines an Ideal Mismatch (IM) value for each probe pair that is guaranteed to be less than PM.

102 MAS 5.0 Signal: Ideal Mismatch Computation (continued)
For a given probe set containing n probe pairs, let $PM_j$ and $MM_j$ denote the perfect match and mismatch values of the jth probe pair. The IM value from the jth probe pair ($IM_j$) is determined as follows: If $PM_j > MM_j$, then $IM_j = MM_j$ and no further computation is needed. If $PM_j \le MM_j$, compute $M = \mathrm{TBW}\{\log_2(PM_1/MM_1), \ldots, \log_2(PM_n/MM_n)\}$ where TBW denotes a one-step Tukey BiWeight (a special weighted average described later).

103 MAS 5.0 Signal: Ideal Mismatch Computation (continued)
If $M > 0.03$, then $IM_j = PM_j / 2^{M}$. If $M \le 0.03$, then compute $P = \dfrac{0.03}{1 + (0.03 - M)/10}$ and let $IM_j = PM_j / 2^{P}$. Note that at M = 0.03, $IM_j = PM_j / 2^{0.03}$ so that $PM_j$ will be slightly larger than $IM_j$. As M gets larger, $IM_j$ decreases. As M gets smaller, $IM_j$ increases towards $PM_j$.
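The three cases of the Ideal Mismatch rule can be sketched as one function. The biweight value M is passed in as an argument here rather than computed, and the parameter names `contrast_tau` and `scale_tau` are illustrative labels for the constants 0.03 and 10.

```python
def ideal_mismatch(pm, mm, m, contrast_tau=0.03, scale_tau=10.0):
    """Ideal Mismatch for one probe pair.

    pm, mm: background-adjusted perfect match and mismatch values.
    m: the probe set's TBW of log2(PM/MM), computed elsewhere.
    """
    if pm > mm:
        return mm                           # MM is usable as it is
    if m > contrast_tau:
        return pm / 2.0 ** m                # shrink PM by the typical log ratio
    p = contrast_tau / (1.0 + (contrast_tau - m) / scale_tau)
    return pm / 2.0 ** p                    # stays slightly below PM

im_usable = ideal_mismatch(100.0, 60.0, 0.5)     # PM > MM
im_large_m = ideal_mismatch(100.0, 120.0, 1.0)   # PM <= MM, M > 0.03
im_small_m = ideal_mismatch(100.0, 120.0, -5.0)  # PM <= MM, M <= 0.03
```

In every branch the returned value is strictly less than PM, so PM - IM is always positive.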

104 MAS 5.0 Signal: Signal Log Value Computation
Let $V_j = \max(PM_j - IM_j,\ 2^{-20})$. Define the probe value for the jth probe pair by $PV_j = \log_2(V_j)$. The signal log value for a given probe set is defined by $SLV = \mathrm{TBW}(PV_1, PV_2, \ldots, PV_n)$ where TBW denotes a one-step Tukey BiWeight (a special weighted average to be discussed later).

105 MAS 5.0 Signal: Scaling and Signal Calculation
Let $SLV_i$ denote the signal log value for the ith probe set on a single chip. Let I denote the number of probe sets on the chip. Let $SF = 500 / \mathrm{TrimMean}(2^{SLV_1}, 2^{SLV_2}, \ldots, 2^{SLV_I};\ 0.02, 0.98)$, where TrimMean is the average of the values in parentheses that are strictly between the 0.02 and 0.98 quantiles of those values. MAS 5.0 Signal for the ith probe set is $\mathrm{Signal}_i = SF \cdot 2^{SLV_i}$. All computations are done separately for each chip to obtain a Signal value for each chip and probe set.

106 The One-Step Tukey BiWeight Estimator Used by Affymetrix
Let $x_1, x_2, \ldots, x_n$ denote observations. Let $m = \mathrm{median}(x_1, x_2, \ldots, x_n)$. Let $\mathrm{MAD} = \mathrm{median}(|x_1 - m|, |x_2 - m|, \ldots, |x_n - m|)$. For each i = 1, 2, ..., n, let $t_i = \dfrac{x_i - m}{5 \cdot \mathrm{MAD} + \varepsilon}$, where $\varepsilon$ is a small factor Affymetrix uses to avoid division by 0.

107 The One-Step Tukey BiWeight Estimator Used by Affymetrix (ctd.)
Recall the bisquare weight function defined as $B(t) = (1 - t^2)^2$ for $|t| < 1$ and $B(t) = 0$ for $|t| \ge 1$. Then $\mathrm{TBW}(x_1, x_2, \ldots, x_n) = \dfrac{\sum_{i=1}^{n} B(t_i)\, x_i}{\sum_{i=1}^{n} B(t_i)}$.

108 An Example
Compute TBW(1, 7, 13, 15, 28, 1075). Ignore the factor $\varepsilon$ to make calculations easier. m = (13 + 15) / 2 = 14. MAD = median(|1-14|, |7-14|, |13-14|, |15-14|, |28-14|, |1075-14|) = median(13, 7, 1, 1, 14, 1061) = median(1, 1, 7, 13, 14, 1061) = (7 + 13) / 2 = 10. t1 = -13/50, t2 = -7/50, t3 = -1/50, t4 = 1/50, t5 = 14/50, t6 = 1061/50.

109 An Example (continued)
Because B depends only on |t|: B(t1) = B(0.26) = (1 - 0.26^2)^2 ≈ 0.8694, B(t2) = B(0.14) = (1 - 0.14^2)^2 ≈ 0.9612, B(t3) = B(0.02) = (1 - 0.02^2)^2 ≈ 0.9992, B(t4) = B(0.02) ≈ 0.9992, B(t5) = B(0.28) = (1 - 0.28^2)^2 ≈ 0.8493, and B(t6) = 0. TBW = (0.8694*1 + 0.9612*7 + 0.9992*13 + 0.9992*15 + 0.8493*28 + 0*1075) / (0.8694 + 0.9612 + 0.9992 + 0.9992 + 0.8493 + 0) ≈ 12.69.
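The one-step Tukey biweight can be checked numerically with a short sketch. The `eps` parameter stands in for the small factor Affymetrix adds to avoid division by zero; it defaults to 0 here to match the worked example, which is an assumption made for illustration (with `eps=0` the function fails if the MAD is zero).

```python
from statistics import median

def bisquare(t):
    return (1.0 - t * t) ** 2 if abs(t) < 1.0 else 0.0

def tukey_biweight(values, c=5.0, eps=0.0):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    m = median(values)
    mad = median(abs(x - m) for x in values)
    weights = [bisquare((x - m) / (c * mad + eps)) for x in values]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

tbw = tukey_biweight([1, 7, 13, 15, 28, 1075])
```

The observation 1075 receives weight zero, so the estimate stays close to the bulk of the data (about 12.69), whereas the ordinary mean of the six values is about 190.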

110 Obtaining MAS5.0 Signal Values from Affymetrix .CEL Files
MAS5.0 Signal values can be obtained from Affymetrix software. Approximate MAS5.0 Signal values can be computed with the mas5 function that is part of the Bioconductor package affy.

111 Robust Multi-array Average (RMA)
1. Background adjust PM values from .CEL files. 2. Take the base-2 log of each background-adjusted PM intensity. 3. Quantile normalize values from step 2 across all GeneChips. 4. Perform median polish separately for each probe set with rows indexed by GeneChip and columns indexed by probe. 5. For each row, find the average of the fitted values from step 4 to use as probe-set-specific expression measures for each GeneChip.

112 RMA: Background Adjustment
Assume PM = S + B where signal $S \sim \mathrm{Exp}(\lambda)$ is independent of background $B \sim N^{+}(\mu, \sigma^2)$. $N^{+}(\mu, \sigma^2)$ denotes $N(\mu, \sigma^2)$ truncated on the left at 0.

113 The Probability Density Function of the Exponential Distribution with Mean 1/λ = 10000
[Figure: the density $\lambda e^{-\lambda s}$ plotted against s.]

114 The Probability Density Function of the Normal Distribution with Mean μ = 1000 and Variance σ² = 300²
[Figure: the density $e^{-(b-\mu)^2/(2\sigma^2)} / (2\pi\sigma^2)^{0.5}$ plotted against b.]

115 The Probability Density Function of s + b where s ~ Exp(λ = 1/10000) and b ~ N+(μ = 1000, σ² = 300²)
[Figure: density of s+b plotted against s+b.]

116 RMA: Background Adjustment (continued)
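$E(S \mid PM) = a + \sigma\, \dfrac{\phi(a/\sigma) - \phi((PM - a)/\sigma)}{\Phi(a/\sigma) + \Phi((PM - a)/\sigma) - 1}$, where $a = PM - \mu - \sigma^2 \lambda$, $\phi$ is the N(0,1) density function, and $\Phi$ is the N(0,1) distribution function. Separately for each chip, estimate μ, σ, and λ from the observed PM distribution. Plug those estimates into the formula above to obtain an estimate of E(S|PM) for each PM value. These serve as background-adjusted PM values.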
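As a numerical illustration, the sketch below evaluates the closed form for E(S|PM) as published for RMA (Irizarry et al. 2003): $a + \sigma[\phi(a/\sigma) - \phi((PM-a)/\sigma)] / [\Phi(a/\sigma) + \Phi((PM-a)/\sigma) - 1]$ with $a = PM - \mu - \sigma^2\lambda$. Treat the formula as an assumption of this sketch, since the slide's own formula did not survive in the transcript. The parameter values are the ones used for the density plots (μ = 1000, σ = 300, λ = 1/10000); in practice they are estimated per chip.

```python
import math

def norm_pdf(z):
    """N(0,1) density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """N(0,1) distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rma_background_adjust(pm, mu, sigma, lam):
    """Estimate E(S | PM) under S ~ Exp(lam), B ~ truncated N(mu, sigma^2)."""
    a = pm - mu - sigma ** 2 * lam
    num = norm_pdf(a / sigma) - norm_pdf((pm - a) / sigma)
    den = norm_cdf(a / sigma) + norm_cdf((pm - a) / sigma) - 1.0
    return a + sigma * num / den

high = rma_background_adjust(5000.0, mu=1000.0, sigma=300.0, lam=1.0 / 10000.0)
low = rma_background_adjust(500.0, mu=1000.0, sigma=300.0, lam=1.0 / 10000.0)
```

For a bright probe the adjustment is close to subtracting μ + σ²λ, while for a probe below the background mean the estimate shrinks toward zero but remains positive, so it can still be logged.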

117 RMA: Background Adjustment (continued)
Obtaining estimates of μ, σ, and λ (unpublished description of the procedure): 1. Estimate the mode of the PM distribution using a kernel density estimate of the PM density. 2. Estimate the density of the PM values less than the mode. The mode of this distribution serves as an estimate of μ. 3. Assume the data to the left of the estimate of μ are the background observations that fell below their mean. Use those observations to estimate σ. 4. Subtract the estimate of μ from all observations larger than the estimate. The mode of this distribution estimates 1/λ.

118 PM Density Estimate Based on Simulated Data
Data below the estimated mode is used to estimate background parameters μ and σ.
[Figure: density estimate of the PM values.]

119 Density Estimate of PM Data below the Estimated Mode of the PM Distribution
Estimate of μ = 1612. This data is used to estimate σ.
[Figure: density estimate.]

120 Estimate of σ
According to the RMA R code, σ is estimated with a formula whose numerator includes a factor of 2. The purpose of that factor is not clear.

121 Density Estimate of PM - $\hat{\mu}$ Values Greater than Zero
Estimate of 1/λ = 2019. The mean of these values would be a much better estimate of 1/λ in this case. (Mean is 9848 and 1/λ = 10000.)
[Figure: density estimate.]

122 RMA: Quantile Normalization
1. After background adjustment, find the smallest log2(PM) on each chip. 2. Average the values from step 1. 3. Replace each value in step 1 with the average computed in step 2. 4. Repeat steps 1 through 3 for the second smallest values, third smallest values, ..., largest values.

123 RMA: Median Polish For a given probe set with J probe pairs, let $y_{ij}$ denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j. Assume $y_{ij} = \mu_i + \alpha_j + e_{ij}$ where $\alpha_1 + \alpha_2 + \cdots + \alpha_J = 0$. Here $\mu_i$ is the gene expression of the probe set on GeneChip i, $\alpha_j$ is the probe affinity effect for the jth probe in the probe set, and $e_{ij}$ is the residual for the jth probe on the ith GeneChip. Perform Tukey's Median Polish on the matrix of $y_{ij}$ values with $y_{ij}$ in the ith row and jth column.

124 RMA: Median Polish (continued)
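Let $\hat{y}_{ij}$ denote the fitted value for $y_{ij}$ that results from the median polish procedure. Let $\hat{\alpha}_j = \bar{\hat{y}}_{\cdot j} - \bar{\hat{y}}_{\cdot\cdot}$ where $\bar{\hat{y}}_{\cdot j} = \sum_{i=1}^{I} \hat{y}_{ij} / I$ and $\bar{\hat{y}}_{\cdot\cdot} = \sum_{i=1}^{I}\sum_{j=1}^{J} \hat{y}_{ij} / (IJ)$, and I denotes the number of GeneChips. Let $\hat{\mu}_i = \bar{\hat{y}}_{i\cdot} = \sum_{j=1}^{J} \hat{y}_{ij} / J$. $\hat{\mu}_i$ is the probe-set-specific measure of expression for GeneChip i.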
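The median polish and the row means of its fitted values can be sketched as below. The probe set matrix is hypothetical (five GeneChips by three probes, built as an additive pattern plus one outlier), and the function names, iteration cap, and tolerance are illustrative choices.

```python
from statistics import median

def median_polish(matrix, max_iter=10, tol=1e-9):
    """Alternately sweep out row and column medians; return the residuals."""
    resid = [row[:] for row in matrix]
    for _ in range(max_iter):
        biggest = 0.0
        for row in resid:                       # remove row medians
            m = median(row)
            biggest = max(biggest, abs(m))
            for j in range(len(row)):
                row[j] -= m
        for j in range(len(resid[0])):          # remove column medians
            m = median(row[j] for row in resid)
            biggest = max(biggest, abs(m))
            for row in resid:
                row[j] -= m
        if biggest <= tol:                      # all medians are zero
            break
    return resid

def rma_summarize(matrix):
    """Row means of the fitted values (original minus residuals)."""
    resid = median_polish(matrix)
    fitted = [[y - e for y, e in zip(row, rrow)]
              for row, rrow in zip(matrix, resid)]
    return [sum(row) / len(row) for row in fitted]

# Hypothetical probe set: rows are GeneChips, columns are probes.
mu = [4.0, 8.0, 6.0, 9.0, 7.0]
alpha = [-1.0, 0.0, 1.0]
probe_set = [[m + a for a in alpha] for m in mu]
probe_set[0][2] += 50.0                         # one outlying probe value
expression = rma_summarize(probe_set)
```

Because medians rather than means are swept out, the single outlying value ends up entirely in the residual matrix and does not distort the expression measure for its GeneChip.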

125 An Example
Suppose the following are background-adjusted, log2-transformed, quantile-normalized PM intensities for a single probe set. Determine the final RMA expression measures for this probe set.
[Table: rows are GeneChips, columns are probes.]

126 An Example (continued)
Find the row medians. [Table: matrix after removing row medians.]

127 An Example (continued)
Find the column medians. [Table: matrix after subtracting column medians.]

128 An Example (continued)
Find the row medians again. [Table: matrix after removing row medians.]

129 An Example (continued)
Find the column medians again. [Table: matrix after subtracting column medians.]

130 An Example (continued)
All row medians and column medians are 0. Thus the median polish procedure has converged. The resulting matrix is the residual matrix that we will subtract from the original matrix to obtain the fitted values.

131 An Example (continued)
The matrix of fitted values is the original matrix minus the residuals from median polish. Its row means are the RMA expression measures for the 5 GeneChips: $\hat{\mu}_1 = 4.2$, $\hat{\mu}_2 = 8.2$, $\hat{\mu}_3 = 6.2$, $\hat{\mu}_4 = 9.2$, $\hat{\mu}_5 = 7.2$.

132 Miscellaneous Comments on Normalization
We have only scratched the surface in terms of normalization methods. There are many variations on the techniques that were described previously as well as other approaches that we won't discuss at this point in the course. Normalization affects the final results, but it is often not clear what normalization strategy is best. It would be good to integrate normalization and statistical analysis, but it is difficult to do so. The most common approach is to normalize data and then perform statistical analysis of the normalized data as a separate step in the microarray analysis process.

