1 Statistical Analyses of High Density Oligonucleotide Arrays. Rafael A. Irizarry, Department of Biostatistics, JHU (joint work with Bridget Hobbs and Terry Speed, Walter & Eliza Hall Institute of Medical Research, and Francois Collin, Gene Logic) http://biosun01.biostat.jhsph.edu/~ririzarr

2 Summary: Review of technology; data exploration; probe level summaries (expression measures); normalization; evaluate and compare 4 expression measures through bias, variance and model fit; use Gene Logic spike-in and dilution study; conclusion/future work.

3 Probe Arrays [Figure, compliments of D. Gerhold. Labels: GeneChip Probe Array (1.28 cm); >200,000 different complementary probes; Hybridized Probe Cell (24 µm); millions of copies of a specific oligonucleotide probe; single stranded, labeled RNA target; oligonucleotide probe; image of hybridized probe array.]

4 PM MM

5 Data and Notation: PM_ijn, MM_ijn = intensity for perfect match/mismatch probe cell j, in chip i, in gene n; i = 1,…, I (ranging from 1 to hundreds); j = 1,…, J (usually 16 or 20); n = 1,…, N (between 8,000 and 12,000).

6 The Big Picture: Summarize 20 PM, MM pairs (probe level data) into one number for each gene. We call this number an expression measure. Affymetrix GeneChip’s software uses AvDiff as expression measure. Does it work? Can it be improved?

7 What is the evidence? Lockhart et al., Nature Biotechnology 14 (1996)

8 Competing Measures of Expression: GeneChip® software uses Avg.diff, the average of PM - MM over A, with A a set of “suitable” pairs chosen by the software. A log ratio version is also used. For differential expression, Avg.diffs are compared between chips.
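
A minimal sketch of the Avg.diff idea in Python, assuming it is the plain average of PM - MM over the pairs in A. The rule the software uses to pick A is not given on the slide, so the `suitable` argument (a hypothetical name) just takes the pair indices from the caller.

```python
def avdiff(pm, mm, suitable=None):
    """Average of PM - MM over the probe pairs indexed by `suitable`.

    pm, mm   : intensities for the J probe pairs of one gene on one chip
    suitable : indices of the pairs in A (hypothetical argument);
               None means "use every pair", which is an assumption
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must have the same length")
    indices = range(len(pm)) if suitable is None else suitable
    diffs = [pm[j] - mm[j] for j in indices]
    if not diffs:
        raise ValueError("the set of suitable pairs is empty")
    return sum(diffs) / len(diffs)


# Illustrative intensities only, not data from the study.
pm_example = [120.0, 200.0, 90.0]
mm_example = [100.0, 150.0, 95.0]
all_pairs = avdiff(pm_example, mm_example)
first_two = avdiff(pm_example, mm_example, suitable=[0, 1])
```

Note that a pair with MM larger than PM contributes a negative difference, so the average itself can be negative.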

9 Competing Measures of Expression: The new version of the GeneChip® software uses something else, with MM* a version of MM that is never bigger than PM.

10 Competing Measures of Expression: Li and Wong fit a model, PM_ij - MM_ij = θ_i φ_j + ε_ij, and consider θ_i the expression in chip i. Efron et al. consider log PM - 0.5 log MM. Another is the second largest PM.
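
The two simplest measures on this slide can be sketched in Python as follows. The log base (2) is an assumption, and so is returning the Efron et al. quantity per probe pair rather than summarized, since the slide does not say how the pairs are combined. Function names are hypothetical.

```python
import math


def efron_pair_values(pm, mm):
    """Per-pair values log2(PM) - 0.5 * log2(MM).

    Base 2 is an assumption; the slide only says "log".
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must have the same length")
    return [math.log2(p) - 0.5 * math.log2(m) for p, m in zip(pm, mm)]


def second_largest_pm(pm):
    """Second largest PM intensity in a probe set (ties are kept)."""
    if len(pm) < 2:
        raise ValueError("need at least two PM values")
    return sorted(pm, reverse=True)[1]


# Illustrative intensities only, not data from the study.
pair_values = efron_pair_values([16.0, 64.0], [4.0, 16.0])
runner_up = second_largest_pm([5.0, 9.0, 7.0])
```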

11 Competing Measures of Expression Why not stick to what has worked for cDNA? with A a set of “suitable” pairs.

12 Features of Probe Level Data

13 SD vs. Avg of Defective Probes

14 ANOVA: Strong probe effect 5 times bigger than gene effect

15 Histograms of log2(PM/MM), stratified by log2(PM×MM)/2, for mouse chip, for defective and normal probes

16 Normalization at Probe Level

17 Spike-In Experiments: Set A: 11 control cRNAs were spiked in, all at the same concentration, which varied across chips. Set B: 11 control cRNAs were spiked in, all at different concentrations, which varied across chips. The concentrations were arranged in a 12x12 cyclic Latin square (with 3 replicates).
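
One way to picture a cyclic Latin square, sketched in Python under the assumption that each row is the previous row shifted by one position. The level labels 0 to 11 are placeholders, not the study's concentrations, and which axis stands for chips and which for transcripts is likewise left open.

```python
def cyclic_latin_square(levels):
    """n x n square in which row r is `levels` rotated left by r positions."""
    n = len(levels)
    return [[levels[(row + col) % n] for col in range(n)] for row in range(n)]


# Placeholder labels for 12 concentration levels (not the study's values).
square = cyclic_latin_square(list(range(12)))

# Every level appears exactly once in each row and in each column.
rows_ok = all(sorted(row) == list(range(12)) for row in square)
cols_ok = all(sorted(col) == list(range(12)) for col in zip(*square))
```

The point of such a layout is that every transcript is seen at every level, and every level appears on every chip.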

18 Set A: Probe Level Data (12 chips)

19 What Did We Learn? Don’t subtract or divide by MM. Probe effect is additive on log scale. Take logs.

20 Why Remove Background?

21 Background Distribution

22 Average Log2(PM-BG): Normalize probe level data. Compute BG = background mean, by estimating the mode of the MM distribution. Subtract BG from each PM. If PM-BG < 0, use the minimum of the positives divided by 2. Take the average of log2(PM-BG).
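
The steps on this slide can be sketched in Python as below. Several pieces are assumptions: the input is taken to be already normalized, the mode is estimated with a crude fixed-width histogram (the slide does not name an estimator), and the "minimum of the positives" is taken within the probe set. All names are hypothetical.

```python
import math
from collections import Counter


def estimate_mode(values, bin_width=10.0):
    """Crude mode estimate: centre of the fullest fixed-width histogram bin.

    Ties go to the lowest bin. This stands in for whatever density
    estimator was actually used, which the slide does not specify.
    """
    counts = Counter(math.floor(v / bin_width) for v in values)
    fullest_bin = max(counts, key=lambda b: (counts[b], -b))
    return (fullest_bin + 0.5) * bin_width


def avlog_pm_bg(pm, mm_all, bin_width=10.0):
    """Average of log2(PM - BG) for one probe set on one chip.

    pm     : normalized PM intensities of the probe set
    mm_all : MM intensities used to estimate the background BG
    """
    bg = estimate_mode(mm_all, bin_width)
    diffs = [p - bg for p in pm]
    positives = [d for d in diffs if d > 0]
    if not positives:
        raise ValueError("no PM value exceeds the background estimate")
    # Non-positive differences (zero included, since log2(0) is undefined)
    # are replaced by half the smallest positive difference.
    floor_value = min(positives) / 2.0
    adjusted = [d if d > 0 else floor_value for d in diffs]
    return sum(math.log2(d) for d in adjusted) / len(adjusted)


# Illustrative intensities only, not data from the study.
mm_example = [95.0, 101.0, 102.0, 108.0, 104.0, 250.0, 400.0]
pm_example = [109.0, 113.0, 137.0, 100.0]
expression = avlog_pm_bg(pm_example, mm_example)
```

In this toy example BG comes out as 105, the differences are 4, 8, 32 and -5, the negative one is replaced by 2, and the average of the log2 values is 2.75.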

23 Expression after Normalization

24 Expression Level Comparison

25 Spike-In B
Probe Set   Conc 1   Conc 2   Rank
BioB-5      100      0.5      1
BioB-3      0.5      25.0     2
BioC-5      2.0      75.0     4
BioB-M      1.0      37.5     4
BioDn-3     1.5      50.0     5
DapX-3      35.7     3.0      6
CreX-3      50.0     5.0      7
CreX-5      12.5     2.0      8
BioC-3      25.0     100      9
DapX-5      5.0      1.5      10
DapX-M      3.0      1.0      11
Later we consider 23 different combinations of concentrations.

26 Differential Expression

27

28

29

30 Observed Ranks
Gene AvDiff MAS 5.0 Li&Wong AvLog(PM-BG)
BioB-5 6211
BioB-3 16132
BioC-5 74625
BioB-M 30373
BioDn-3 44564
DapX-3 23924 7
CreX-3 33373369
CreX-5 32763331288
BioC-3 270985726816431
DapX-5 27091021220310
DapX-M 16519136
Top 15 1515610

31 Observed vs True Ratio

32 Dilution Experiment: cRNA hybridized to human chip (HGU95) in a range of proportions and dilutions. Dilution series begins at 1.25 µg cRNA per GeneChip array, and rises through 2.5, 5.0, 7.5, 10.0, to 20.0 µg per array. 5 replicate chips were used at each dilution. Normalize just within each set of 5 replicates. For each probe set, compute expression, average and SD over replicates, and fit a line to log expression vs. log concentration. Regression line should have slope 1 and high R².
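
A hedged Python sketch of the line fit described on this slide: ordinary least squares of log2 expression on log2 concentration, returning slope, intercept and R². The log base and the function name are assumptions, and expression is taken to be already on the log2 scale (for example a replicate average of a log-scale measure).

```python
import math


def loglog_fit(concentrations, log2_expression):
    """Least-squares line of log2 expression against log2 concentration.

    Returns (slope, intercept, r_squared).
    """
    if len(concentrations) != len(log2_expression):
        raise ValueError("inputs must have the same length")
    if len(concentrations) < 2:
        raise ValueError("need at least two points to fit a line")
    x = [math.log2(c) for c in concentrations]
    y = list(log2_expression)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("concentrations must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, R^2 is the squared correlation of x and y.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r_squared


# Illustrative values only: expression that doubles when concentration doubles.
slope, intercept, r_squared = loglog_fit([1.0, 2.0, 4.0, 8.0], [5.0, 6.0, 7.0, 8.0])
```

An expression measure that tracks concentration proportionally gives slope 1 here, which is the benchmark the slide names.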

33 Dilution Experiment Data

34 Expression and SD

35 Slope Estimates and R²

36 Model check: Compute observed SD of 5 replicate expression estimates. Compute RMS of 5 nominal SDs. Compare by taking the log ratio. Closeness of observed and nominal SD is taken as a measure of goodness of fit of the model.
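
The comparison on this slide can be sketched in Python as follows, for a single probe set. The log base (2), the use of the sample SD (n - 1 denominator) and the function name are assumptions; the nominal SDs are whatever standard errors the model reports for each replicate.

```python
import math
import statistics


def sd_log_ratio(expression_estimates, nominal_sds):
    """log2 of (observed SD of replicate estimates / RMS of nominal SDs).

    A value near 0 means the model's stated uncertainty matches the
    spread actually seen across replicates.
    """
    if len(expression_estimates) < 2:
        raise ValueError("need at least two replicate estimates")
    if not nominal_sds:
        raise ValueError("need at least one nominal SD")
    observed_sd = statistics.stdev(expression_estimates)
    rms_nominal = math.sqrt(sum(s * s for s in nominal_sds) / len(nominal_sds))
    if observed_sd <= 0 or rms_nominal <= 0:
        raise ValueError("both SDs must be positive to take a log ratio")
    return math.log2(observed_sd / rms_nominal)


# Illustrative values only: 5 replicate estimates with sample SD 2.
replicates = [8.0, 8.0, 10.0, 12.0, 12.0]
underestimated = sd_log_ratio(replicates, [1.0] * 5)  # model SDs too small
well_matched = sd_log_ratio(replicates, [2.0] * 5)    # model SDs match
```

A positive log ratio means the model understates the replicate variability, and a negative one means it overstates it.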

37 Observed vs. Model SE

38

39 Conclusion: Take logs. PMs need to be normalized. Using global background improves on use of probe-specific MM. Gene Logic spike-in and dilution study show all four expression measures performed very well. AvLog(PM-BG) is arguably the best in terms of bias, variance and model fit. Future: better BG; robust/resistant summaries.

40 Acknowledgements: Gene Brown’s group at Wyeth/Genetics Institute, and Uwe Scherf’s Genomics Research & Development Group at Gene Logic, for generating the spike-in and dilution data; Gene Logic for permission to use these data; Ben Bolstad (UC Berkeley); Magnus Åstrand (Astra Zeneca Mölndal); Skip Garcia, Tom Cappola, and Joshua Hare (JHU).

