Normalization of microarray data

Slides:

Advertisements

Similar presentations

Computational Statistics. Basic ideas  Predict values that are hard to measure irl, by using co-variables (other properties from the same measurement.

Advertisements

1 Chapter 4 Experiments with Blocking Factors The Randomized Complete Block Design Nuisance factor: a design factor that probably has an effect.

M. Kathleen Kerr “Design Considerations for Efficient and Effective Microarray Studies” Biometrics 59, ; December 2003 Biostatistics Article Oncology.

Statistical tests for differential expression in cDNA microarray experiments (2): ANOVA Xiangqin Cui and Gary A. Churchill Genome Biology 2003, 4:210 Presented.

ECS 289A Presentation Jimin Ding Problem & Motivation Two-component Model Estimation for Parameters in above model Define low and high level gene expression.

Ch11 Curve Fitting Dr. Deshi Ye

Microarray Normalization

Filtering and Normalization of Microarray Gene Expression Data Waclaw Kusnierczyk Norwegian University of Science and Technology Trondheim, Norway.

First analysis steps o quality control and optimization o calibration and error modeling o data transformations Wolfgang Huber Dep. of Molecular Genome.

Microarray technology and analysis of gene expression data Hillevi Lindroos.

Getting the numbers comparable

DNA Microarray Bioinformatics - #27612 Normalization and Statistical Analysis.

Preprocessing Methods for Two-Color Microarray Data

Low-Level Analysis and QC Regional Biases Mark Reimers, NCI.

1 Chapter 3 Multiple Linear Regression Ray-Bing Chen Institute of Statistics National University of Kaohsiung.

Lecture 19: Tues., Nov. 11th R-squared (8.6.1) Review

Gene Expression Data Analyses (2)

Normalization of 2 color arrays Alex Sánchez. Dept. Estadística Universitat de Barcelona.

Making Sense of Complicated Microarray Data

Linear and generalised linear models

Correlation and Regression Analysis

Filtering and Normalization of Microarray Gene Expression Data Waclaw Kusnierczyk Norwegian University of Science and Technology Trondheim, Norway.

1 Normalization Methods for Two-Color Microarray Data 1/13/2009 Copyright © 2009 Dan Nettleton.

(4) Within-Array Normalization PNAS, vol. 101, no. 5, Feb Jianqing Fan, Paul Tam, George Vande Woude, and Yi Ren.

Objectives of Multiple Regression

Regression and Correlation Methods Judy Zhong Ph.D.

CDNA Microarrays Neil Lawrence. Schedule Today: Introduction and Background 18 th AprilIntroduction and Background 25 th AprilcDNA Mircoarrays 2 nd MayNo.

Multiple testing in high- throughput biology Petter Mostad.

(2) Ratio statistics of gene expression levels and applications to microarray data analysis Bioinformatics, Vol. 18, no. 9, 2002 Yidong Chen, Vishnu Kamat,

Practical Issues in Microarray Data Analysis Mark Reimers National Cancer Institute Bethesda Maryland.

DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4.

CDNA Microarrays MB206.

© 1998, Geoff Kuenning Linear Regression Models What is a (good) model? Estimating model parameters Allocating variation Confidence intervals for regressions.

Panu Somervuo, March 19, cDNA microarrays.

Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.

The Examination of Residuals. Examination of Residuals The fitting of models to data is done using an iterative approach. The first step is to fit a simple.

A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.

Transformations… what for? which one? Wolfgang Huber Div.Molecular Genome Analysis DKFZ Heidelberg.

Inference for Regression Simple Linear Regression IPS Chapter 10.1 © 2009 W.H. Freeman and Company.

Multiple Regression Petter Mostad Review: Simple linear regression We define a model where are independent (normally distributed) with equal.

1 Pre-processing - Normalization Databases Statistics for Microarray Data Analysis – Lecture 2 The Fields Institute for Research in Mathematical Sciences.

Lecture Topic 5 Pre-processing AFFY data. Probe Level Analysis The Purpose –Calculate an expression value for each probe set (gene) from the PM.

ELEC 303 – Random Signals Lecture 18 – Classical Statistical Inference, Dr. Farinaz Koushanfar ECE Dept., Rice University Nov 4, 2010.

Summarization of Oligonucleotide Expression Arrays BIOS Winter 2010.

Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments Presented by Nan Lin 13 October 2002.

Copyright ©2011 Brooks/Cole, Cengage Learning Inference about Simple Regression Chapter 14 1.

Statistics for Differential Expression Naomi Altman Oct. 06.

Henrik Bengtsson Mathematical Statistics Centre for Mathematical Sciences Lund University, Sweden Plate Effects in cDNA Microarray Data.

Microarray hybridization Usually comparative – Ratio between two samples Examples – Tumor vs. normal tissue – Drug treatment vs. no treatment – Embryo.

(1) Normalization of cDNA microarray data Methods, Vol. 31, no. 4, December 2003 Gordon K. Smyth and Terry Speed.

For a specific gene x ij = i th measurement under condition j, i=1,…,6; j=1,2 Is a Specific Gene Differentially Expressed Differential expression.

Analyzing Expression Data: Clustering and Stats Chapter 16.

1 Estimation of Gene-Specific Variance 2/17/2011 Copyright © 2011 Dan Nettleton.

KNN Ch. 3 Diagnostics and Remedial Measures Applied Regression Analysis BUSI 6220.

Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae Vu, T. T.,

CGH Data BIOS Chromosome Re-arrangements.

Henrik Bengtsson Mathematical Statistics Centre for Mathematical Sciences Lund University Plate Effects in cDNA Microarray Data.

The microarray data analysis Ana Deckmann Carla Judice Jorge Lepikson Jorge Mondego Leandra Scarpari Marcelo Falsarella Carazzolle Michelle Servais Tais.

Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.

Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.

Microarray Data Analysis Xuming He Department of Statistics University of Illinois at Urbana-Champaign.

Bias-Variance Analysis in Regression  True function is y = f(x) +  where  is normally distributed with zero mean and standard deviation .  Given a.

Estimating standard error using bootstrap

Estimation of Gene-Specific Variance

Normalization Methods for Two-Color Microarray Data

Statistical Methods For Engineers

Getting the numbers comparable

Normalization for cDNA Microarray Data

Diagnostics and Remedial Measures

Linear Regression and Correlation

Presentation transcript:

Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Systematic differences between arrays The boxplots show distributions of log-ratios from 4 red-green 8448-clone cDNA arrays hybridised with zebrafish samples. Some are not centered at 0, and they are different from each other.

Experimental variation amount of RNA in the sample efficiencies of -RNA extraction -reverse transcription -labeling -photodetection Normalization: Correction of systematic effects arising from variations in the experimental process Systematic o similar effect on many measurements o corrections can be estimated from data Normalization

Ad-hoc normalization procedures 2-color cDNA-arrays: multiply all intensities of one channel with a constant such that the median of log-ratios is 0 (equivalent: shift log-ratios). Underlying assumption: equally many up- and downregulated genes. One-color arrays (Affy, radioactive): multiply intensities from each array k with a constant ck, such that some measure of location of the intensity distributions is the same for all arrays (e.g. the trimmed mean (Affy global scaling)).

log-log plot of intensities from the two channels of a microarray comparison of kidney cancer with normal kidney tissue, cDNA microarray with 8704 spots red line: median normalization blue lines: two-fold change

Assumptions for normalization When we normalize based on the observed data, we assume that the majority of genes are unchanged, or that there is symmetry between up- and downregulation. In some cases, this may not be true. Alternative: use (spiked) controls and base normalization on them.

1. Loess normalization M-A plot (minus vs. add): log(R) – log(G) =log(R/G) vs. log (R) + log(G)=log(RG) With 2-color-cDNA arrays, often “banana-shaped” scatterplots on the log-scale are observed.

Loess normalization trends are modeled by a regression curve, zebrafish data Intensity-dependent trends are modeled by a regression curve, M = f(A) + e. The normalized log-ratios are computed as the residuals e of the loess regression.

Loess regression Locally weighted regression. For each value xi of X, a linear or polynomial regression function fi for Y is fitted based on the data points close to xi. They are weighted according to their distance to xi. Local model: Y = fi(X) + e. Fit: Minimize the weighted sum of squares S wj (xj)(yj - fi(xj))2 Then, compute the overall regression as: Y = f(X) + e, where f(xi) = fi(xi).

Loess regression regression lines for each data point The user-defined width c of the weight function determines the degree of smoothing. x0 tricubic weight function

Print-tip normalization With spotted arrays, distributions of intensities or log-ratios may be different for spots spotted with different pins, or from different PCR plates. Normalize the data from each (e.g. print-tip) group separately.

Print-tips correspond to localization of spots Slide: 25x75 mm 4x4 or 8x4 sectors 17...38 rows and columns per sector ca. 4600…46000 probes/array Spot-to-spot: ca. 150-350 mm sector: corresponds to one print-tip

Print-tip loess normalization

2. Error models, variance stabilization and robust normalization

Sources of variation Systematic Stochastic Normalization Error model amount of RNA in the sample efficiencies of -RNA extraction -reverse transcription -labeling -photodetection PCR yield DNA quality spotting efficiency, spot size cross-/unspecific hybridization stray signal Systematic o similar effect on many measurements o corrections can be estimated from data Stochastic o too random to be ex-plicitely accounted for o “noise” Normalization Error model

A model for measurement error Rocke and Durbin (J. Comput. Biol. 2001): Yk: measured intensity of gene k bk: true expression level of gene k a: offset ,n:multiplicative/additive error terms, independent normal For large expression level bk, the multiplicative error is dominant. For bk near zero, the additive error is dominant.

A parametric form for the variance-mean dependence The model of Durbin and Rocke yields: Thus we obtain a quadratic dependence

Quadratic variance-vs-mean dependence data (cDNA slide) For each spot k, the variance (Rk – Gk)² is plotted against the mean (Rk + Gk)/2.

The two-component model raw scale log scale

The two-component model “additive” noise “multiplicative” noise raw scale log scale

Variance stabilizing transformations Let Xu be a family of random variables with EXu=u, VarXu=v(u). Define a transformation  Var h(Xu )  independent of u … = terms involving products of higher derivatives of f, and higher moments of X

Derivation of the variance-stabilizing transformation Let Xu be a family of random variables with EXu=u, VarXu= v(u), and h a transformation applied to Xu. Then, by linear approximation of h, Thus, if h’(u)2 = v(u)-1 ,Var(h(Xu)) is approx. independent of u.

Variance stabilizing transformations f(x) x

Variance stabilizing transformations 1.) constant CV (‘multiplicative’) 2.) offset 3.) additive and multiplicative

The “generalized log” transformation - - - f(x) = log(x) ——— hs(x) = arsinh(x/s) W. Huber et al., ISMB 2002 D. Rocke & B. Durbin, ISMB 2002

A model for measurement error Now we consider data from different arrays or color channels i. We assume they are related through an affine-linear transformation on the raw scale: Yki:measured intensity of gene k in array/color channel i bki: true expression level of gene k ai, gi : additive/multiplicative effects of array/color channel i ,n: multiplicative/additive error terms, independent normal with mean 0

A statistical model Assume an affine-linear transformation for normalization between arrays, and, after that, common parameters for the variance stabilizing transformation. The composite transformation for array/color channel i is given by ai and bi. The model is assumed to hold for genes that are unchanged; differentially expressed genes act as outliers. After calibration, data are on a common scale and have a common distribution. Hence, can apply variance stabilization -> variance independent of k

Robust parameter estimation Assume that the majority of genes is not differentially expressed. Use robust variant of maximum likelihood estimation: Alternate between maximum likelihood estimation (= least squares fit) for a fixed set K of genes and selection of K as the subset of (e.g. 50%) genes with smallest residuals. After calibration, data are on a common scale and have a common distribution. Hence, can apply variance stabilization -> variance independent of k

Robust normalization assumption: majority of genes unchanged location estimators: mean median least trimmed sum of squares (generalized) log-ratio

Normalized & transformed data generalized log scale log scale

Validation: standard deviation versus rank-mean plots After calibration, data are on a common scale and have a common distribution. Hence, can apply variance stabilization -> variance independent of k difference red-green rank(average)

Which normalization method should one use? How can one assess the performance of different methods? Diagnostic plots (e.g. scatterplots) Performance measures: The variance between replicate measurements should be low. Low bias: Changes in expression should be accurately measured. How to assess this (in most cases, the truth is unknown)?

Evaluation: sensitivity / specificity in quantifying differential expression o Data: paired tumor/normal tissue from 19 kidney cancers, hybridized in duplicate on 38 cDNA slides à 4000 genes. o Apply 6 different strategies for normalization and quantification of differential expression o Apply permutation test to each gene o Compare numbers of genes detected as differentially expressed, at a certain significance level, between the different normalization methods After calibration, data are on a common scale and have a common distribution. Hence, can apply variance stabilization -> variance independent of k

Comparison of methods Die Verwendung des Varianzmodells fuer die Microarraydaten fuehrt zu hoeherer Power bei nachfolgenden Inferenzverfahren, z.B. Test auf differentielle Expression. Mit den gleichen Daten kann man bei gleichem Fehler 1. Art mehr Hypothesenablehnungen "Resultate" bekommen bzw. die gleichen Resultate mit weniger Experimenten  Number of significant genes vs. significance level of permutation test

Parametric vs. non-parametric normalization Loess is non-parametric: it makes no assumptions which sort of transformation is appropriate. Disadvantage: Degree of smoothing is chosen in an arbitrary way. vsn uses a parametric model: affine-linear normalization. Disadvantage: the model assumptions may not always hold. Advantage: If the model assumptions do hold (at least approximately), the method should perform better.

vsn may also correct “banana shape” M-A plot of vsn- normalized zebrafish data, loess fit Different additive offsets may lead to non-linear scatter plots on the log scale.

References Software: R package modreg (loess), Bioconductor packages marrayNorm (loess normalization), vsn (variance stabilization) W.Huber, A.v.Heydebreck, H.Sültmann, A.Poustka, M.Vingron (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18(S1), 96-104. Y.H.Yang, S.Dudoit at al. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30(4):e15.