1 Models and methods for summarizing GeneChip probe set data.


1 Models and methods for summarizing GeneChip probe set data

2 Some Gene Expression Analysis Tasks Detection of gene expression – presence calls. Differential expression detection – comparative calls. Measurement of gene expression.

3 Objective: To compute probe set summaries which are good indicators of gene expression from background-corrected, normalized, perfect match probe intensities for a set of arrays: $PM^*_{ijk}$, $i=1,\dots,I$, $j=1,\dots,J$, $k=1,\dots,K$, where $i$ indexes probes within a probe set, $j$ indexes arrays, and $k$ indexes probe sets.

4 Affymetrix spike-in data set used for illustration - 14 genes spiked in at different concentrations into a common pool of pancreas cRNA

5 Affy comment on non-responding probes: Affymetrix: “Certain probe pairs for 407_at and 36889_at do not work well. It is recommended that these two probe sets be excluded for final statistical tally.”

6 To log or not to log? In addition to providing good expression values, we would like the model to be easy to understand and analyse. We would like to fit a standard linear model, which calls for: homogeneity of variance; additivity; normality.

7 Homogeneity of variance Look at the association between the variance and the mean of the intensities: plot the IQR of $PM^*$ across 59 replicates against the median of $PM^*$ across 59 replicates for probe sets spanning the range of intensities. Repeat for $\log_2(PM^*)$.
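The spread-versus-level check described on this slide can be sketched in a few lines of Python. This is only an illustration: the intensities below are made up (not the spike-in data), and `spread_vs_level` and `iqr` are hypothetical helper names.

```python
import math
import statistics


def iqr(values):
    """Interquartile range, using the 'inclusive' quantile method."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def spread_vs_level(replicates_by_probe, log_scale=False):
    """Return one (median, IQR) pair per probe, computed across replicates."""
    pairs = []
    for values in replicates_by_probe:
        if log_scale:
            values = [math.log2(v) for v in values]
        pairs.append((statistics.median(values), iqr(values)))
    return pairs


# Made-up intensities: three probes, five replicates each, with a spread
# that grows in proportion to the level (multiplicative noise).
pm = [
    [90.0, 95.0, 100.0, 105.0, 110.0],
    [900.0, 950.0, 1000.0, 1050.0, 1100.0],
    [9000.0, 9500.0, 10000.0, 10500.0, 11000.0],
]
raw_pairs = spread_vs_level(pm)                  # IQR grows with the median
log_pairs = spread_vs_level(pm, log_scale=True)  # IQR is roughly constant
```

On data like these, the IQR rises tenfold with each tenfold rise in the median on the raw scale, and stays the same on the log 2 scale, which is the pattern that would favour the log transform.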

8 Intensity scale – ALL

9 Log Intensity scale – ALL

10 Additivity Look at log-log plots of $PM^*$ vs concentrations for 14 spike-in fragments

11 PM.bgc.norm vs Conc log-log plot grp 1

12 PM.bgc.norm vs Conc log-log plot grp 2

13 Suggested additive model Log-log plots of $PM^*$ vs concentrations suggest the following model: $\log_2(PM^*_{ij}) = p_i + c_j + \varepsilon_{ij}$ (1) with $p_i$ a probe affinity effect, $c_j$ the $\log_2$ scale expression level for chip $j$, and $\varepsilon_{ij}$ an iid error term. For identifiability we fit with the constraint $\sum_i p_i = 0$.

14 Normality We can examine residuals from a least squares fit to the model specified in (1) to verify the adequacy of the model in terms of additivity of effects and stability of variance. The shape of the distribution of the residuals can also be compared with a Gaussian distribution to see how far off we are from this ideal.
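As a rough sketch of this step, assuming a complete probes-by-chips table of log 2 intensities for one probe set: with the probe effects constrained to sum to zero, the least squares estimates of model (1) reduce to row and column means. The function names and the toy table are hypothetical.

```python
from statistics import NormalDist


def fit_additive_ls(y):
    """Least squares fit of y[i][j] = p[i] + c[j] + error with sum(p) == 0.

    For a complete table, c[j] is the column mean and p[i] is the row mean
    minus the grand mean.
    """
    n_probes, n_chips = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (n_probes * n_chips)
    p = [sum(row) / n_chips - grand for row in y]
    c = [sum(y[i][j] for i in range(n_probes)) / n_probes
         for j in range(n_chips)]
    resid = [[y[i][j] - p[i] - c[j] for j in range(n_chips)]
             for i in range(n_probes)]
    return p, c, resid


def normal_qq_points(resid):
    """Pair sorted residuals with standard normal quantiles for a Q-Q plot."""
    flat = sorted(v for row in resid for v in row)
    n = len(flat)
    return [(NormalDist().inv_cdf((k + 0.5) / n), r)
            for k, r in enumerate(flat)]


# Toy table that is exactly additive, so all residuals are (numerically) zero.
p_true = [-1.0, 0.0, 1.0]
c_true = [8.0, 9.0, 10.0, 12.0]
y = [[pi + cj for cj in c_true] for pi in p_true]
p_hat, c_hat, resid = fit_additive_ls(y)
qq = normal_qq_points(resid)
```

Plotting `resid` against `c_hat` and plotting the `qq` pairs would correspond to the residual and qqnorm figures that follow.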

15 Res vs chip effects - grp 1

16 Res vs chip effects - grp 2

17 Figure – qqnorm residuals from additive fit

18 Res qqnorm - grp 1

19 Res qqnorm - grp 2

20 Analyzing the untransformed $PM^*$ values On the untransformed scale, one can fit a multiplicative model (Li-Wong): $PM^*_{ij} = \phi_i \cdot \theta_j + \varepsilon_{ij}$ (2) The model is fitted by least squares by iteratively fitting the $\phi$s and the $\theta$s, regarding the other set as known. Fitting steps are interleaved with diagnostic checks used to exclude points from subsequent fits.
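The alternating fit can be sketched as follows. This is a simplified illustration, not the Li-Wong implementation: it leaves out the diagnostic checks and outlier exclusion, and it assumes the probe effects are scaled so that their squares sum to the number of probes (a normalisation chosen here for identifiability). The function name is hypothetical.

```python
import math


def fit_multiplicative(y, n_iter=50):
    """Alternating least squares for y[i][j] ~ phi[i] * theta[j].

    Each half-step is an ordinary least squares fit of one set of
    parameters with the other set held fixed.
    """
    n_probes, n_chips = len(y), len(y[0])
    phi = [1.0] * n_probes
    for _ in range(n_iter):
        # Chip effects given probe effects.
        denom = sum(f * f for f in phi)
        theta = [sum(phi[i] * y[i][j] for i in range(n_probes)) / denom
                 for j in range(n_chips)]
        # Probe effects given chip effects.
        denom = sum(t * t for t in theta)
        phi = [sum(theta[j] * y[i][j] for j in range(n_chips)) / denom
               for i in range(n_probes)]
        # Assumed normalisation: sum(phi^2) == number of probes.
        scale = math.sqrt(sum(f * f for f in phi) / n_probes)
        phi = [f / scale for f in phi]
    # Final chip effects, consistent with the normalised probe effects.
    denom = sum(f * f for f in phi)
    theta = [sum(phi[i] * y[i][j] for i in range(n_probes)) / denom
             for j in range(n_chips)]
    return phi, theta


# Toy rank-one table (made-up values): the fit should reproduce it.
phi_true = [0.5, 1.0, 1.5]
theta_true = [100.0, 200.0, 400.0]
y = [[f * t for t in theta_true] for f in phi_true]
phi_hat, theta_hat = fit_multiplicative(y)
```

Only the products are identified, so `phi_hat` and `theta_hat` match the true values up to a common rescaling.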

21 PM vs logConc - grp 1

22 Res vs chip effects - grp 1

23 qq - grp 1

24 Why robust? Bad probes – probe outliers. Bad chips – chip outliers. Image artifacts – individual outliers. We would like a fitting procedure which yields good estimates in the presence of various types of outliers – individual points, probes, and chips.

25 Robust estimation Huber, Hampel, Rousseeuw. Gross errors, round-off errors, wrong model. Distinction between the approach based on identification and exclusion of outliers and the modeling approach.

26 M estimators A general class of robust estimators is obtained as solutions to: $\min_\beta \sum_i \rho(Y_i - X_i \beta)$, where $\rho$ is a symmetric function. Or by solving the following system: $\sum_i \psi(Y_i - X_i \beta) \cdot X_i = 0$, where $\psi = \rho'$.

27 Robust fit by IRLS for each probe set Starting with a robust fit, at each iteration: $S = \mathrm{mad}(r_{ij}) \cdot c$ – robust estimate of the scale of $\varepsilon$; $u_{ij} = r_{ij}/S$ – rescaled residuals; $w_{ij} = \psi(|u_{ij}|)/|u_{ij}|$ – weights used in the next LS fit. Theoretical considerations can lead to a specification of $\psi$. In practice, one selects a $\psi$ function with desirable characteristics.
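A minimal sketch of these iterations for the additive model on a single probe set is given below. Several choices are assumptions made for illustration, since the slides do not specify them: the Huber weight function with tuning constant 1.345, the constant c = 1.4826, a crude median-based starting fit, and backfitting for the weighted least squares step. The data are made up.

```python
import statistics

HUBER_K = 1.345     # assumed tuning constant for the Huber psi
MAD_CONST = 1.4826  # assumed value of the constant c


def huber_weight(u, k=HUBER_K):
    """psi(|u|)/|u| for the Huber psi function."""
    a = abs(u)
    return 1.0 if a <= k else k / a


def weighted_additive_fit(y, w, p, n_sweeps=100):
    """Weighted LS fit of y[i][j] = p[i] + c[j] by backfitting, sum(p) == 0."""
    n_probes, n_chips = len(y), len(y[0])
    p = list(p)
    c = [0.0] * n_chips
    for _ in range(n_sweeps):
        c = [sum(w[i][j] * (y[i][j] - p[i]) for i in range(n_probes))
             / sum(w[i][j] for i in range(n_probes)) for j in range(n_chips)]
        p = [sum(w[i][j] * (y[i][j] - c[j]) for j in range(n_chips))
             / sum(w[i]) for i in range(n_probes)]
        shift = sum(p) / n_probes  # move the mean probe effect into c
        p = [v - shift for v in p]
        c = [v + shift for v in c]
    return p, c


def irls_additive(y, n_iter=20):
    """IRLS for the additive model, following the steps on the slide."""
    n_probes, n_chips = len(y), len(y[0])
    # Crude robust start: column medians, then row medians of what is left.
    c = [statistics.median(y[i][j] for i in range(n_probes))
         for j in range(n_chips)]
    p = [statistics.median(y[i][j] - c[j] for j in range(n_chips))
         for i in range(n_probes)]
    w = [[1.0] * n_chips for _ in range(n_probes)]
    for _ in range(n_iter):
        r = [[y[i][j] - p[i] - c[j] for j in range(n_chips)]
             for i in range(n_probes)]
        flat = [v for row in r for v in row]
        med = statistics.median(flat)
        scale = MAD_CONST * statistics.median(abs(v - med) for v in flat)
        if scale == 0.0:
            break  # perfect fit for at least half the cells; nothing to rescale
        w = [[huber_weight(v / scale) for v in row] for row in r]
        p, c = weighted_additive_fit(y, w, p)
    return p, c, w


# Made-up table: additive structure, small noise, one gross outlier.
p_true = [-1.0, 0.0, 1.0, 0.5, -0.5]
c_true = [8.0, 9.0, 10.0, 11.0]
noise = [
    [0.00, 0.10, -0.10, 0.05],
    [0.05, -0.05, 0.10, -0.10],
    [-0.10, 0.05, 0.00, 0.10],
    [0.10, -0.10, -0.05, 0.00],
    [-0.05, 0.00, 0.05, -0.05],
]
y = [[p_true[i] + c_true[j] + noise[i][j] for j in range(4)]
     for i in range(5)]
y[0][0] += 5.0  # the outlier
p_hat, c_hat, w_hat = irls_additive(y)
```

In this toy case the outlying cell ends up with a small weight, so the chip effect for the first chip stays close to 8, whereas a plain column mean would be pulled up by about 1.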

28 Example $\psi$ functions
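The figure for this slide is not part of the transcript. As an illustration only, two widely used psi functions (Huber and Tukey's bisquare, with their conventional tuning constants) and the IRLS weight derived from them can be written as below; these are not necessarily the functions shown on the slide.

```python
def huber_psi(u, k=1.345):
    """Huber psi: identity in the middle, clipped at +/- k."""
    return max(-k, min(k, u))


def bisquare_psi(u, c=4.685):
    """Tukey bisquare psi: redescends to zero beyond +/- c."""
    if abs(u) > c:
        return 0.0
    return u * (1.0 - (u / c) ** 2) ** 2


def weight(psi, u):
    """IRLS weight psi(|u|)/|u|, taken as 1 at u == 0."""
    a = abs(u)
    return 1.0 if a == 0.0 else psi(a) / a


# Weights for a small, a moderate and a large rescaled residual.
huber_weights = [weight(huber_psi, u) for u in (0.5, 2.69, 10.0)]
bisquare_weights = [weight(bisquare_psi, u) for u in (0.5, 2.69, 10.0)]
```

The Huber weights decay like k/|u| and never reach zero, while the bisquare weights are exactly zero beyond the cut-off, so very large residuals are ignored altogether.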

29 Options for fitting models to probe sets Recall the model $\log_2(PM^*_{ij}) = p_i + c_j + \varepsilon_{ij}$ (1). We can fit by: least squares; least absolute deviation ($\rho(x) = |x|$); IRLS using various $\psi$ functions. We can also get a single-chip robust probe set summary.

30 Robust fit example A

31 Actual vs fitted

32 Starting weights

33 Ending weights

34 Robust analysis of multi-way tables Tukey & co – median polish. Tukey – one degree of freedom test – additivity or not – no partial judgement. Gentleman and Wilk – effect of one or two outliers on residuals. C. Daniel – estimate which cells are affected by interactions, and estimate the interactions in a set of cells.
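For reference, median polish on a two-way table can be sketched as below: row and column medians are swept out alternately until the sweeps stop changing anything. The function name, the stopping rule and the toy table are illustrative choices.

```python
import statistics


def median_polish(y, n_iter=10, tol=1e-8):
    """Tukey's median polish: y[i][j] ~ overall + row[i] + col[j] + resid."""
    n_rows, n_cols = len(y), len(y[0])
    r = [list(row) for row in y]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep row medians out of the residuals.
        row_med = [statistics.median(row) for row in r]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            r[i] = [v - row_med[i] for v in r[i]]
        m = statistics.median(col_eff)
        overall += m
        col_eff = [v - m for v in col_eff]
        # Sweep column medians out of the residuals.
        col_med = [statistics.median(r[i][j] for i in range(n_rows))
                   for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                r[i][j] -= col_med[j]
        m = statistics.median(row_eff)
        overall += m
        row_eff = [v - m for v in row_eff]
        if max(abs(v) for v in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, r


# Toy table: 3 probes x 5 chips, exactly additive apart from one outlier.
p_true = [-1.0, 0.0, 1.0]
c_true = [8.0, 9.0, 10.0, 12.0, 11.0]
y = [[pi + cj for cj in c_true] for pi in p_true]
y[0][0] += 5.0
overall, row_eff, col_eff, resid = median_polish(y)
chip_effects = [overall + v for v in col_eff]  # [8.0, 9.0, 10.0, 12.0, 11.0]
```

Here the outlier is left entirely in the residual for cell (0, 0), and the probe and chip effects are recovered exactly.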

35 Modified weights Standard IRLS procedures determine weights from each cell of the two-way table individually. We can also look at residuals across the cells in a row (column) to determine a weighting adjustment for the entire row (column): $rw_i = \psi(|u|_i)/|u|_i$, $cw_j = \psi(|u|_j)/|u|_j$, and get a composite weight for each cell: $ww_{ij} = rw_i \cdot w_{ij}$, $www_{ij} = cw_j \cdot rw_i \cdot w_{ij}$.

36 Heuristic derivation of weights Consider the model with interactions: $\log_2(PM^*_{ij}) = p_i + c_j + \gamma_{ij} + \varepsilon_{ij}$. We can think of $T_{ij} = |r_{ij}|/S$ as a test statistic for $H_0: \gamma_{ij} = 0$ vs $H_1: \gamma_{ij} \neq 0$, and of $w_{ij} = \psi(T_{ij})/T_{ij}$ as a transformation of this test statistic into a weight. Similarly, one could use $T_i = |r_i|/S = \mathrm{mad}_i/\mathrm{mad}$ to test $H_0: \gamma_{ij} = 0$, $j=1,\dots,J$ vs $H_1: \gamma_{ij} \neq 0$ for some $j$, and map this statistic into a weight.
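One possible reading of the row, column and composite weights is sketched below. It rests on assumptions the slides leave open: the Huber weight function, the row (column) statistic taken as the median absolute residual in that row (column) divided by the overall median absolute residual, and residuals treated as centred at zero. The residual table is made up.

```python
import statistics

HUBER_K = 1.345  # assumed Huber tuning constant


def huber_weight(t, k=HUBER_K):
    """psi(t)/t for the Huber psi, with t >= 0."""
    return 1.0 if t <= k else k / t


def mad(values):
    """Median absolute value (residuals are taken as centred at zero)."""
    return statistics.median(abs(v) for v in values)


def composite_weights(r, mad_const=1.4826):
    """Return row weights, column weights and composite cell weights."""
    n_probes, n_chips = len(r), len(r[0])
    overall = mad(v for row in r for v in row)  # assumed to be non-zero
    scale = mad_const * overall
    w = [[huber_weight(abs(v) / scale) for v in row] for row in r]
    rw = [huber_weight(mad(row) / overall) for row in r]
    cw = [huber_weight(mad(r[i][j] for i in range(n_probes)) / overall)
          for j in range(n_chips)]
    www = [[cw[j] * rw[i] * w[i][j] for j in range(n_chips)]
           for i in range(n_probes)]
    return rw, cw, www


# Made-up residuals: 4 probes x 5 chips, with probe 2 fitting badly throughout.
resid = [
    [0.10, -0.10, 0.05, -0.05, 0.00],
    [-0.05, 0.10, -0.10, 0.00, 0.05],
    [1.00, -1.20, 0.90, -1.10, 1.00],
    [0.00, 0.05, -0.05, 0.10, -0.10],
]
rw, cw, www = composite_weights(resid)
```

In this example the badly fitting probe receives a row weight well below 1 while every chip keeps a column weight of 1, so each of its cells is downweighted twice: once individually and once through the row.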

37 See how it does Look at the initial (individual) weights and fit vs the adjusted weights and fit, and then iterate to convergence. Also look at probe weights in all spike-in probe sets, and at column weights.

38 Starting weights

39 Ending weights

40 Robust fit – composite weights

41 Probe Weights

42 Low-weight Probes - 1

43 Low-weight Probes - 2

44 Low-weight Probes - 3

45 Chip Weights

46 Note on multi-chip context Note that the residual variance in the model without probe effects, the single-chip analysis set-up, is about 6 times the residual variance in the model with probe effects, i.e. $\log_2(PM^*_{ij}) = p_i + c_j + \varepsilon_{ij}$ (with probe effects) vs. $\log_2(PM^*_{ij}) = c_j + \varepsilon_{ij}$ (without probe effects).

47 Compare fits on sample probe sets

48 Affy Probe Data Analysis – X hybridizing probe sets

49 X-Hybe probe 3

50 X-Hybe probe 3 – PM vs Phi, Theta

51 X-Hybe probe 4

52 X-Hybe probe 4 - PM vs Phi, Theta

53 X-Hybe probe 5

54 X-Hybe probe 5 – PM vs Phi, Theta

55 Affy Probe Data Analysis – Spike-in probe sets

56 Fit to spike 12

57 Fit to spike 7

58 Fit to spike 1

59 Fit to spike 2

60 Fit to spike 3

61 Fit to spike 4

62 Fit to spike 5

63 What have we gained? Look at residuals from fit across large number of probe sets to show benefits of IRLS over median polish.

64 boxplot Residuals from 1000 probe sets

65 boxplot estimated chip effects for 1000 probe sets

66 IQR chip effects for 1000 probe sets

67 What have we gained? In order to have a high enough breakdown point to ignore 6 out of 16 probes, and improve on the median polish estimate, we pay a high price in variability. Q. Can we find a better weighting scheme? Or show MP is optimal?

68 References 1. Irizarry, R. et al. (2003) Summaries of Affymetrix GeneChip probe level data, Nucleic Acids Research, 2003, Vol. 31, No. 4, e15. 2. Irizarry, R. et al. (2003) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, in press. 3. C. Li and W. H. Wong, Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection, Proceedings of the National Academy of Sciences USA, 2001, Vol. 98, pp

69 References – Robustness 1. P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, John Wiley & Sons. 2. P. J. Huber, Robust Statistics, John Wiley & Sons. 3. F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, W. A. Stahel, Robust Statistics: The approach based on influence functions, John Wiley & Sons, 1986.

70 References – Robustness in multiway tables 4. J. D. Emerson, D. C. Hoaglin, Analysis of two-way tables by medians, in Understanding robust and exploratory data analysis, John Wiley & Sons, Inc., edited by D. C. Hoaglin, F. Mosteller and J. W. Tukey. 5. N. Cook, Three-way analyses, in Exploring data tables, trends, and shapes, ed. D. C. Hoaglin, F. Mosteller and J. W. Tukey. 6. J. W. Tukey, One degree of freedom for non-additivity. Biometrics, 1949, 5,

71 References – Robustness in multiway tables 7. C. Daniel, Patterns in residuals in the two-way layout, Technometrics, 1978, 20(4). 8. J. F. Gentleman and M. B. Wilk, Detecting outliers in a two-way table: I. Statistical behavior of residuals, Technometrics, 1975, 17(1). 9. J. F. Gentleman and M. B. Wilk, Detecting outliers: II. Supplementing the direct analysis of residuals, Biometrics, 1975, 31,