Low-Level Analysis of Affymetrix Data. Mark Reimers, National Cancer Institute, Bethesda, Maryland.

Overview
The Affymetrix technology
Normalization
Relationships among probes
Combining probe information
Quality control

Affymetrix GeneChip® Probe Arrays
Each probe cell or feature contains millions of copies of a specific oligonucleotide probe
Over 400,000 different probes complementary to genetic information of interest
[Figure: GeneChip probe array and image of a hybridized probe array, with a hybridized probe cell. Labels: single-stranded, fluorescently labeled DNA target; oligonucleotide probe; scale marks 20 µm and 1.28 cm.]

Affymetrix Probe Design
Multiple (11-20) 25-base oligonucleotide probes
PM (Perfect Match) is exactly complementary to the published sequence
MM (Mismatch) is changed at the 13th base
[Figure: published gene sequence (5´ to 3´) with Perfect Match and Mismatch probes.]

Affymetrix Image Reading
About 100 pixels per probe cell
Selects brightest contiguous pixels
Takes average of selected pixels
Variability in best pixels ~5-20%
Image courtesy of Affymetrix

Normalization Approaches
Simple: find the average of each chip; divide all values by the chip average
MAS5: fit a regression line relative to a reference chip
Invariant set: find a subset of probes in almost the same rank order as in a reference chip
Quantile normalization: fit to average quantiles across the experiment
Others: local loess, local regression
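
The simple approach can be sketched in a few lines. This is only an illustration: the function name, the `target` parameter, and the toy data are assumptions, not part of the slides.

```python
import numpy as np

def scale_normalize(intensities, target=None):
    """Divide each chip (column) by its own average intensity.

    intensities: 2-D array, rows = probes, columns = chips.
    target: hypothetical optional common level to rescale to;
            defaults to the mean of the chip averages.
    """
    x = np.asarray(intensities, dtype=float)
    chip_means = x.mean(axis=0)           # one average per chip
    if target is None:
        target = chip_means.mean()
    return x / chip_means * target        # every chip now averages `target`

# Toy data: the second chip is roughly twice as bright as the first.
chips = np.array([[100.0, 210.0],
                  [200.0, 390.0],
                  [300.0, 600.0]])
normalized = scale_normalize(chips)
```

After scaling, all chips share the same average, but differences in the shape of their intensity distributions remain.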

Comparing Probes on Different Chips: plots of two Affymetrix chips against the experiment means

MAS 5.0 Normalization
Plot probes from each chip against a common baseline chip
Fit a regression line to the middle 98% of probes
This method fits the ends well, but seems to miss an important trend between 1500 and 4000
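
A minimal sketch of the regression step as this slide describes it, not Affymetrix's implementation. It assumes the "middle 98%" is chosen by the chip's own intensities (1st to 99th percentile); the function name and `trim` parameter are hypothetical.

```python
import numpy as np

def regress_to_baseline(chip, baseline, trim=0.01):
    """Fit baseline ~ slope * chip + intercept on the middle 98% of probes,
    then map every probe of the chip onto the baseline scale."""
    chip = np.asarray(chip, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # Assumption: trim the lowest and highest 1% of the chip's intensities.
    lo, hi = np.quantile(chip, [trim, 1.0 - trim])
    middle = (chip >= lo) & (chip <= hi)
    slope, intercept = np.polyfit(chip[middle], baseline[middle], deg=1)
    return slope * chip + intercept

rng = np.random.default_rng(0)
baseline = rng.lognormal(mean=6.0, sigma=1.0, size=1000)
chip = 2.0 * baseline + 50.0            # purely linear distortion of the baseline
normalized = regress_to_baseline(chip, baseline)
```

A single straight line can only undo a linear distortion, which is why a curved trend in the middle of the intensity range is missed.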

Invariant Set (Li-Wong) Method
Select baseline chip X
For each other chip Y:
–Select probes $p_1, \ldots, p_K$ ($K \approx 10000$) such that $p_1 < p_2 < \cdots < p_K$ in both chips X and Y
–Fit a running median through the points $\{(x_{p_1}, y_{p_1}), \ldots, (x_{p_K}, y_{p_K})\}$
–Subtract the fitted value along the running median from each y value
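
A simplified sketch of the idea, not dChip's algorithm. Two deliberate departures, both assumptions: the invariant set is approximated by probes whose rank changes by at most a small fraction between the chips, and the running-median curve is used to map Y onto the baseline scale by interpolation instead of being subtracted. The function name, `max_rank_shift`, and `window` are hypothetical.

```python
import numpy as np

def invariant_set_normalize(y, x, max_rank_shift=0.02, window=51):
    """Normalize chip y against baseline chip x using near rank-invariant probes."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    # Approximate invariant set: probes whose rank barely moves between chips.
    invariant = np.abs(rank_x - rank_y) <= max_rank_shift * n
    order = np.argsort(y[invariant])
    ys = y[invariant][order]
    xs = x[invariant][order]
    # Running median of baseline values along the sorted chip values.
    half = window // 2
    curve = np.array([np.median(xs[max(0, i - half): i + half + 1])
                      for i in range(xs.size)])
    # Map every probe through the curve; values outside the invariant
    # range are clamped to the ends of the curve by np.interp.
    return np.interp(y, ys, curve)

rng = np.random.default_rng(1)
baseline = rng.lognormal(mean=6.0, sigma=1.0, size=500)
chip = 1.5 * baseline ** 0.9            # same rank order, non-linear intensity scale
normalized = invariant_set_normalize(chip, baseline)
```

Unlike a single regression line, the running median follows a non-linear relationship between the two chips.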

Quantile Method (part of RMA)
Distributions of probe intensities vary substantially among replicate chips
This cannot be even approximately resolved by any linear transformation
Apply a non-linear transform, based on the idea that comparable quantiles of the probe distribution should have comparable values
This doesn’t wipe out individual gene differences, although it compresses variation at the high end

Probe Intensities in 23 Replicates

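Quantile Normalization
Formula: $x_{\mathrm{norm}} = F_2^{-1}(F_1(x))$, where $F_1$ is the cumulative distribution function of the chip intensities and $F_2$ is that of the reference distribution
Assumes: gene distribution changes little
[Figure: density function and cumulative distribution function panels, showing the distribution of chip intensities and the reference distribution; x is mapped to y through $F_1(x)$ and $F_2(x)$.]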
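
A small sketch of the formula using empirical distributions. It takes the reference distribution to be the average of the sorted intensities across chips, and it ignores ties; the function name and toy data are assumptions.

```python
import numpy as np

def quantile_normalize(intensities):
    """Give every chip (column) the same empirical distribution.

    F_1(x) is represented by the rank of x within its own chip;
    F_2^{-1} is a lookup into the reference (average) quantiles.
    """
    x = np.asarray(intensities, dtype=float)
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each probe within its chip
    reference = np.sort(x, axis=0).mean(axis=1)  # average quantiles across chips
    return reference[ranks]

rng = np.random.default_rng(0)
chips = rng.lognormal(mean=6.0, sigma=1.0, size=(1000, 3)) * np.array([1.0, 1.8, 0.6])
normalized = quantile_normalize(chips)
```

Each probe keeps its rank within its chip; only the intensity assigned to that rank changes.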

After Normalization vs Before: intensity scale

Ratio-Intensity: Before

Ratio-Intensity: After

Quantile normalization vs. normalization by scaling: quantile normalization works

Methods for computing expression
Affymetrix Microarray Suite: v.4, 5
–robust average of probes on one chip
Linear Model (multi-chip) methods
–dChip: Li and Wong
–Bioconductor affy package (RMA): Bolstad, Irizarry, Speed, et al.
Many others published
–Some based on thermodynamic considerations

Probe Variation
Probes vary by two orders of magnitude on each chip
Signal from 16 probes for the GAPDH gene on one chip
Individual probes don’t agree on fold changes across chips
–Bright probes are more often, but not always, more reliable

Probe Variation - II
Typical probes are two orders of magnitude different!
GC content is the most important factor
RNA target folding also affects hybridization
[Figure: axis from 0 to $3 \times 10^4$]

Principles of MAS 5 method
First estimate background
–bg = MM (if physically possible)
–log(bg) = log(PM) - log(non-specific proportion) (if impossible)
–Non-specific proportion = max(SB, τ)
–SB = Tukeybiweight(log(PM) - log(MM))
Signal = Tukeybiweight(log(Adjusted PM))
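
The Tukey biweight used above is a robust average. A minimal one-step sketch follows; the tuning constants `c = 5` and `eps = 0.0001` are assumed defaults, and the function name and toy values are illustrative only.

```python
import numpy as np

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that ignores far outliers."""
    x = np.asarray(values, dtype=float)
    center = np.median(x)
    mad = np.median(np.abs(x - center))        # median absolute deviation
    u = (x - center) / (c * mad + eps)         # scaled distance from the median
    # Weights fall smoothly to zero at |u| = 1; points beyond get no weight.
    weights = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
    return np.sum(weights * x) / np.sum(weights)

# Hypothetical log(PM) - log(MM) values for one probe set, with one outlier.
log_ratios = np.array([1.0, 1.1, 0.9, 1.0, 10.0])
sb = tukey_biweight(log_ratios)
```

The outlying value receives zero weight, so the result stays near the bulk of the probes, where an ordinary mean would be pulled upward.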

Critique of MAS 5 principle
‘Average’ of different probes isn’t really meaningful, since probes have intrinsically different hybridization characteristics
The MAS5 method doesn’t ‘learn’ based on cross-chip performance of individual probes

Motivation for multi-chip models: raw data from a single probe set in a spike-in study. Each color represents a different probe in the probe set. Note the parallel trend across chips of all probes, although some probe signals depart from the pattern. [Figure axes: log(PM), log(concentration). Courtesy of Terry Speed.]

Linear Models
Extension of linear regression
Essential features:
–Measurement errors independent of each other (‘random noise’); needs normalization to eliminate systematic variation
–Noise levels comparable at different levels of signal
–Small number of factors combine in linear function or simple algebraic form to give predicted levels

Model for Probe Signal
Each probe signal is proportional to
–i) the amount of target sample: $\theta_i$
–ii) the affinity of the specific probe sequence to the target: $\phi_j$
NB: High affinity is not the same as specificity
–A probe can give high signal to its intended target and also to other transcripts
[Figure: probes on chip 1 and chip 2]

Multiplicative Model
Each gene has a set of probes $p_1, \ldots, p_k$
Each probe $p_j$ binds the gene with efficiency (‘avidity’) $\phi_j$
In each sample there is an amount $\theta_i$ of the target transcript
In principle, the intensity of probe j on chip i, $PM_{ij}$, should be proportional to $\phi_j \times \theta_i$
Always some noise, and some outliers!

Robust Statistics
Outlier: a measure that is far beyond the typical random variation
–common in biological measures
–10-15% in Affy probe sets
Robust methods try to fit the majority of data points
–Issue is to identify which points to down-weight or ignore
–Iteratively re-weighted least squares
–Median polish

Li & Wong (dChip)
Model: $PM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$
–Original model (dChip 1.0) used $PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$, by analogy with Affy MAS 4
Outlier removal:
–Identify extreme residuals
–Remove
–Re-fit
–Iterate until convergence
[Figure: fitting probes in one set on one chip. Dark blue: PM values; red: fitted values; light blue: probe SD.]
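
A minimal sketch of fitting the multiplicative model for one probe set by alternating least squares. It is not the dChip code and omits the outlier-removal loop. The names `theta` (per-chip amount) and `phi` (per-probe affinity) and the scaling constraint that the squared affinities sum to the number of probes are assumptions made for this example.

```python
import numpy as np

def fit_li_wong(pm, n_iter=50):
    """Fit pm[i, j] ~ theta[i] * phi[j]; rows = chips, columns = probes."""
    pm = np.asarray(pm, dtype=float)
    n_chips, n_probes = pm.shape
    phi = np.ones(n_probes)
    for _ in range(n_iter):
        theta = pm @ phi / (phi @ phi)            # least squares for theta, phi fixed
        phi = theta @ pm / (theta @ theta)        # least squares for phi, theta fixed
        phi *= np.sqrt(n_probes / (phi @ phi))    # assumed constraint: sum(phi**2) = n_probes
    theta = pm @ phi / (phi @ phi)
    return theta, phi

# Noise-free toy probe set: 4 chips, 6 probes.
theta_true = np.array([100.0, 400.0, 250.0, 800.0])
phi_true = np.array([0.5, 1.2, 0.8, 1.5, 1.0, 0.7])
phi_true *= np.sqrt(phi_true.size / (phi_true @ phi_true))
pm = np.outer(theta_true, phi_true)
theta, phi = fit_li_wong(pm)
```

In a fuller version, cells with extreme residuals `pm - np.outer(theta, phi)` would be excluded and the fit repeated until the set of excluded cells stops changing.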

Critique of Li-Wong model
Model assumes that noise for all probes has the same magnitude
All biological measurements exhibit intensity-dependent noise

Bolstad, Irizarry & Speed (RMA)
For each probe set, take the log transform of $PM_{ij} = \theta_i \phi_j$, i.e. fit the model:
$\mathrm{nlog}(PM_{ij}) = \log(\theta_i) + \log(\phi_j) + \varepsilon_{ij}$
where nlog() stands for logarithm after normalization
Fit this additive model by iteratively re-weighted least squares or median polish
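
Median polish, one of the two fitting options named here, can be sketched as follows. This is a generic Tukey median polish on a chips-by-probes table of log intensities, not the affy package's code; the function name, iteration count, and toy values are assumptions.

```python
import numpy as np

def median_polish(log_pm, n_iter=10):
    """Decompose log_pm[i, j] = overall + chip[i] + probe[j] + residual[i, j]."""
    resid = np.array(log_pm, dtype=float)
    n_chips, n_probes = resid.shape
    overall = 0.0
    chip = np.zeros(n_chips)
    probe = np.zeros(n_probes)
    for _ in range(n_iter):
        row_med = np.median(resid, axis=1)     # sweep chip (row) medians
        resid -= row_med[:, None]
        chip += row_med
        shift = np.median(probe)               # keep probe effects centered
        probe -= shift
        overall += shift
        col_med = np.median(resid, axis=0)     # sweep probe (column) medians
        resid -= col_med[None, :]
        probe += col_med
        shift = np.median(chip)                # keep chip effects centered
        chip -= shift
        overall += shift
    return overall, chip, probe, resid

# Toy probe set: 4 chips, 5 probes, additive on the log scale, one outlier.
chip_level = np.array([8.0, 9.0, 10.0, 11.0])
probe_effect = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
log_pm = chip_level[:, None] + probe_effect[None, :]
log_pm[0, 2] += 3.0                            # a single aberrant probe value
overall, chip, probe, resid = median_polish(log_pm)
expression = overall + chip                    # per-chip expression estimate
```

The aberrant value ends up in the residuals instead of distorting the chip estimates, which is the point of using a robust fit.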

Critique of RMA
Assumes probe noise is homoscedastic (comparable variances) on log scale
In fact noise for low-signal probes appears to be much greater
Depends on normalization & bg compensation
Variance-stabilizing transform seems better in principle; so far not a great deal of improvement in practice

Comparing Expression Measures
Compare gene abundance estimates based on identical samples (these were non-spike-in genes in the spike-in experiment)
Better performance means variation of estimates should be smaller
The figure shows standard deviations of expression estimates across arrays, arranged in four groups of genes by increasing mean expression level
Green: MAS5.0; Black: Li-Wong; Blue, Red: RMA
Courtesy of Terry Speed

Comparison Summary
Affymetrix Suite gets better every year
–Affymetrix is developing their own multi-chip model
MAS P & A calls are reasonable proxies for confidence (not gene abundance)
–based on probe-by-probe comparison of PM & MM
MAS 5.0 estimation does a reasonable job on abundant genes
dChip and RMA do better on genes that are less abundant
–Signalling proteins, transcription factors, etc.

Model-based QC for Affy Chips
Outliers from the fitted model may show a spatial pattern
[Figure: portion of an Affy chip; image made with dChip.] Pink pixels represent probes that do not fit the consensus pattern of relative probe intensities
These probes will be down-weighted or ignored by a robust multi-chip model
If non-conforming probes are numerous and widespread, then suspect such a chip

Current Work: Improving the Model
How to use the MM information profitably
–Combine estimates from PM and MM probes?
Assessments of probe quality
Accurate estimates of probe background
Normalization method based on 2-d loess to correct spatial inhomogeneity

Relation Between PM and MM Across One Experiment Set
Colored symbols are one probe
[Figure axes: MM, PM]

Probe Specific Background
Horizontal lines represent probes; colored symbols correspond to arrays
After subtracting individual backgrounds for each probe, the ratios among corresponding arrays are more consistent between probes
[Figure panels: Fitted Data; Probe BG subtracted]

Software for Affymetrix
MAS provided by Affymetrix
–Current version 6 in beta testing
dChip from
RMA from
–affy package
–Regularly updated
–Version with probe background in September from my website: reimers.cgb.ki.se