Presentation is loading. Please wait.

Presentation is loading. Please wait.

Transformations… what for? which one? Wolfgang Huber Div.Molecular Genome Analysis DKFZ Heidelberg.

Similar presentations


Presentation on theme: "Transformations… what for? which one? Wolfgang Huber Div.Molecular Genome Analysis DKFZ Heidelberg."— Presentation transcript:

1 Transformations… what for? which one? Wolfgang Huber Div.Molecular Genome Analysis DKFZ Heidelberg

2 Log-ratio with/without background correction Shrunken log-ratio (BHM) Variance stabilized log-ratio (=generalized log-ratio, “glog”)  Microarray intensities x 1,…,x n  How do you like to think about (interprete) it?  How do you estimate the parameters?  What comes out (the “bottom line”?)

3  ratios and fold changes Fold changes are useful to describe continuous changes in expression 1000 1500 3000 x3 x1.5 ABC 0 200 3000 ? ? ABC But what if the gene is “off” (below detection limit) in one condition?

4  ratios and fold changes Many interesting genes will be off in some of the conditions of interest 1.If you want expression measure (“net normalized spot intensity”) to be an unbiased estimator of abundance  many values  0  need something more than (log)ratio 2. If you let expression measure be biased (always>0)  can keep ratios.  how do you choose the bias?

5  ratios and fold changes Ratios are scale-free: But there is (at least) one absolute scale in the data: Can we use this to construct useful functions

6  How to compare microarray intensities with each other?  How to incorporate measurement uncertainty (“variance”)?  How to simultaneously and consistently deal with calibration (“normalization”)?  In the following:

7  Sources of variation amount of RNA in the biopsy efficiencies of -RNA extraction -reverse transcription -labeling -fluorescent detection probe purity and length distribution spotting efficiency, spot size cross-/unspecific hybridization stray signal CalibrationError model Systematic o similar effect on many measurements o corrections can be estimated from data Stochastic o too random to be ex- plicitely accounted for o remain as “noise”

8 a i per-sample offset  ik ~ N(0, b i 2 s 1 2 ) “additive noise” b i per-sample normalization factor b k sequence-wise probe efficiency  ik ~ N(0,s 2 2 ) “multiplicative noise”  modeling ansatz measured intensity = offset + gain  true abundance

9  The two-component model raw scalelog scale “additive” noise “multiplicative” noise B. Durbin, D. Rocke, JCB 2001

10  variance stabilizing transformations X u a family of random variables with EX u =u, VarX u =v(u). Define  var f(X u )  independent of u derivation: linear approximation

11  variance stabilizing transformations f(x) x

12  variance stabilizing transformations 1.) constant variance (‘additive’) 2.) constant CV (‘multiplicative’) 4.) additive and multiplicative 3.) offset

13  the “glog” transformation - - - f(x) = log(x) ——— h s (x) = asinh(x/s) P. Munson, 2001 D. Rocke & B. Durbin, ISMB 2002

14 parameter estimation o maximum likelihood estimator: straightforward – but sensitive to deviations from normality o model holds for genes that are unchanged; differentially transcribed genes act as outliers. o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression. o works as long as <50% of genes are differentially transcribed

15 Least trimmed sum of squares regression minimize - least sum of squares - least trimmed sum of squares P. Rousseeuw, 1980s

16 evaluation: effects of different data transformations difference red-green rank(average)

17  Normality: QQ-plot

18 evaluation: sensitivity / specificity in detecting differential abundance o Data: paired tumor/normal tissue from 19 kidney cancers, in color flip duplicates on 38 cDNA slides à 4000 genes. o 6 different strategies for normalization and quantification of differential abundance o Calculate for each gene & each method: t-statistics, permutation-p o For threshold , compare the number of genes the different methods find, #{p i | p i  }

19 evaluation: comparison of methods more accurate quantification of differential expression  higher sensitivity / specificity one-sided test for upone-sided test for down

20 evaluation: a benchmark for Affymetrix genechip expression measures o Data: Spike-in series: from Affymetrix 59 x HGU95A, 16 genes, 14 concentrations, complex background Dilution series: from GeneLogic 60 x HGU95Av2, liver & CNS cRNA in different proportions and amounts  o Benchmark: 15 quality measures regarding -reproducibility -sensitivity -specificity Put together by Rafael Irizarry (Johns Hopkins) http://affycomp.biostat.jhsph.edu

21  affycomp results (28 Sep 2003) good bad

22  ROC curves

23  Stratification i collaboration with R. Irizarry wiwi position- and sequence-specific effects w i (s): Naef et al., Phys Rev E 68 (2003)

24  glog versus "sliding z-score" sliding z-score

25  Availability o implementation in R o open source package vsn on www.bioconductor.org o Bioconductor is an international collaboration on open source software for bioinformatics and statistical omics

26  What to do with the gene lists: the functional genomics pipeline @ DKFZ High- throughput transcriptome sequencing: clones with unannotated full length ORFs functional characterization Neoplastic diseases association of mRNA profiles with - genetic aberrations - histopathology - clinical behavior

27  HT functional assays (S. Wiemann, D. Arlt) Library of "unknown" transcripts expression clone BrdU incorporation GFP-ORF- protein DAPI: identification CFP: expression BrdU: proliferation Image segmentation and quantification proliferation +activator -inhibitor automated microscope Rainer Pepperkok, EMBL

28 DAPIORF-YFPAnti-BrdU/Cy5 Detection of modulators of cell proliferation overlay YFP – Cy5 Dorit Arlt YFP channelCy5 channel 72.0 761.0 71.0 684.1 119.7 779.0 87.3 820.2 149.5 645.6 70.2 536.1 84.7 799.5 103.1 912.8 81.0 916.7 2621.8 267.6 74.1 766.2 156.8 866.6 169.0 819.8 105.5 757.7 156.0 367.8 76.5 746.2 135.2 731.2 86.2 567.3 77.7 896.3 92.6 1095.4 104.6 633.3 481.2 567.7 539.0 663.9 95.0 726.2 156.7 842.1 Measurement of fluorescence intensities 68.5, 231.6, 80.9, -4.8 by automated image analysis SMP Cell

29  activation inhibition   Statistical analysis of cellular assay data transfected cells control cells detect transfection effect: inh (p=10 -8) Plate summary plot

30  Cellular assays: challenges for statisticians oImage analysis: pattern recognition, classification oLow-level analysis what are good models for calibration, “normalization”, data transformation? oHigh-level analysis models for the dependence of cellular processes on over-/underexpression of genes connect results from different assays, microarray data

31  Summary olog-ratio  log: what about genes that are not expressed in some of the conditions of interest? o generalized log-ratio  h: a useful extrapolation - interpretability - sensitivity - specificity - computational convenience o what to do with the gene lists? systematic (high throughput) functional assays

32  Acknowledgements DKFZ Heidelberg Molecular Genome Analysis Annemarie Poustka Holger Sültmann Andreas Buneß Markus Ruschhaupt Katharina Finis Jörg Schneider Klaus Steiner Stefan Wiemann Dorit Arlt MPI Molekulare Genetik Anja von Heydebreck Martin Vingron Uni Heidelberg Günther Sawitzki DFCI Harvard Robert Gentleman UMC Leiden Judith Boer RZPD Anke Schroth Bernd Korn EMBL Urban Liebel...and many more!

33  Models are never correct, but some are useful True relationship: Model: linear dependence Model: quadratic dependence

34 raw scalelogglog difference log-ratio generalized log-ratio constant part variance: proportional part  variance stabilization

35  ratio compression Yue et al., (Incyte Genomics) NAR (2001) 29 e41


Download ppt "Transformations… what for? which one? Wolfgang Huber Div.Molecular Genome Analysis DKFZ Heidelberg."

Similar presentations


Ads by Google