Preprocessing of cDNA microarray data Lecture 19, Statistics 246, April 1, 2004.

Was the experiment a success? What analysis tools should be used? Are there any specific problems? Begin by looking at the data

Red/Green overlay images. Good: low background (bg), lots of differential expression (d.e.). Bad: high bg, ghost spots, little d.e. Co-registration and overlay offer a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches.

Always log, always rotate. Plot $\log_2 R$ vs $\log_2 G$, then rotate to $M = \log_2(R/G)$ vs $A = \log_2\sqrt{RG}$.
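The log-and-rotate step can be sketched in a few lines. This is a minimal illustration only: the function name `ma_transform` and the intensities are made up, and both channels are assumed to be strictly positive.

```python
import math

def ma_transform(red, green):
    """Return (M, A) lists for paired red/green spot intensities.

    M = log2(R/G) is the log ratio and A = log2(sqrt(R*G)) is the
    average log intensity. Intensities are assumed to be positive.
    """
    m_values = [math.log2(r / g) for r, g in zip(red, green)]
    a_values = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m_values, a_values

# Hypothetical intensities for three spots (illustration only).
red = [1024.0, 512.0, 256.0]
green = [256.0, 512.0, 1024.0]
m_values, a_values = ma_transform(red, green)
```

All three hypothetical spots have the same average intensity A but different log ratios M, which is the kind of separation the rotated plot makes visible.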

Histograms. Signal/Noise = $\log_2(\text{spot intensity}/\text{background intensity})$.

Boxplots of $\log_2(R/G)$. Liver samples from 16 mice: 8 WT, 8 ApoAI KO.

Spatial plots: background from the two slides

Highlighting extreme log ratios Top (black) and bottom (green) 5% of log ratios
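Selecting the spots to highlight could look roughly like this. The function name `extreme_spot_indices` and the rounding rule for the cutoff are assumptions made for illustration; the slide only specifies the top and bottom 5%.

```python
def extreme_spot_indices(m_values, fraction=0.05):
    """Return (bottom, top) index lists for the most extreme log ratios.

    `fraction` is the share of spots taken from each tail; at least one
    spot is always returned per tail.
    """
    n = len(m_values)
    k = max(1, int(round(fraction * n)))
    order = sorted(range(n), key=lambda i: m_values[i])
    return order[:k], order[-k:]

# Hypothetical log ratios for 40 spots (illustration only).
m_values = [float(i - 20) for i in range(40)]
bottom, top = extreme_spot_indices(m_values)
```

The returned indices can then be mapped back to spot coordinates to draw the spatial plot.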

Boxplots and highlighting. Clear example of spatial bias (here high is red, low green). [Boxplots: log-ratios by print-tip group; x-axis: pin group #]

Pin group (sub-array) effects. Boxplots of log ratios by pin group; lowess lines through points from pin groups.

Plate effects

KO #8 Probes: ~6,000 cDNAs, including 200 related to lipid metabolism. Arranged in a 4x4 array of 19x21 sub-arrays.

Time of printing effects. Green channel intensities ($\log_2 G$) plotted against spot number. Printing over 4.5 days. The previous slide depicts a slide from this print run.

Normalization. Why? To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples. How do we know it is necessary? By examining self-self hybridizations, where no true differential expression is occurring. We find dye biases which vary with overall spot intensity, location on the array, plate origin, pins, scanning parameters, and so on.

Self-self hybridizations. Panels: false color overlay; boxplots within pin-groups; scatter (MA-)plots.

From the NCI60 data set (Stanford web site) A series of non self-self hybridizations

Early Ngai lab, UC Berkeley

Early Goodman lab, UC Berkeley

From the Ernest Gallo Clinic & Research Center

Early PMCRI, Melbourne Australia

Normalization: methods
a) Normalization based on a global adjustment: $\log_2(R/G) \to \log_2(R/G) - c = \log_2(R/(kG))$. Choices for $k$ or $c = \log_2 k$ are: $c$ = median or mean of the log ratios for a particular gene set (e.g. housekeeping genes); or total intensity normalization, where $k = \sum_i R_i / \sum_i G_i$.
b) Intensity-dependent normalization. Here we run a line through the middle of the MA plot, shifting the $M$ value of the pair $(A, M)$ by $c = c(A)$, i.e. $\log_2(R/G) \to \log_2(R/G) - c(A) = \log_2(R/(k(A)G))$. One estimate of $c(A)$ is made using the LOWESS function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing.
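Methods (a) and (b) can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: the curve estimate is a single-pass tricube-weighted local line, a simplified stand-in for Cleveland's LOWESS without its robustness iterations, and the function names, the `frac` window parameter, and the data are all hypothetical.

```python
import math
import statistics

def global_median_normalize(m_values):
    """Method (a): subtract one constant c, here the median of M."""
    c = statistics.median(m_values)
    return [m - c for m in m_values]

def tricube(u):
    """Tricube kernel used for the local weights."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_curve(a_values, m_values, frac=0.5):
    """Method (b): estimate c(A) at every spot with a weighted local line.

    Simplified stand-in for LOWESS: one pass, no robustness iterations.
    `frac` is the fraction of spots that defines each local window.
    """
    n = len(a_values)
    k = min(n, max(3, math.ceil(frac * n)))
    fitted = []
    for a0 in a_values:
        # Bandwidth: distance to the k-th nearest spot on the A axis.
        h = max(sorted(abs(a - a0) for a in a_values)[k - 1], 1e-12)
        sw = swx = swy = swxx = swxy = 0.0
        for a, m in zip(a_values, m_values):
            w = tricube((a - a0) / h)
            x = a - a0  # centre on a0 so the intercept is the fitted value
            sw += w
            swx += w * x
            swy += w * m
            swxx += w * x * x
            swxy += w * x * m
        denom = sw * swxx - swx * swx
        if denom > 1e-12:
            slope = (sw * swxy - swx * swy) / denom
            fitted.append((swy - slope * swx) / sw)
        else:
            fitted.append(swy / sw)  # degenerate window: weighted mean
    return fitted

# Hypothetical data: a dye bias that grows linearly with intensity A.
a_values = [6.0 + 0.5 * i for i in range(20)]
m_values = [0.3 * a - 2.0 for a in a_values]
curve = local_linear_curve(a_values, m_values)
normalized = [m - c for m, c in zip(m_values, curve)]
shifted = global_median_normalize([1.0, 2.0, 3.0, 4.0, 10.0])
```

In this made-up example the bias is exactly linear in A, so the local fit removes it completely, whereas a single global constant would only recentre the log ratios.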

Normalization: methods
c) Within print-tip group normalization. In addition to intensity-dependent variation in log ratios, spatial bias can also be a significant source of systematic error. Most normalization methods do not correct for spatial effects produced by hybridization artifacts or print-tip or plate effects during the construction of the microarrays. It is possible to correct for both print-tip and intensity-dependent bias by performing LOWESS fits to the data within print-tip groups, i.e. $\log_2(R/G) \to \log_2(R/G) - c_i(A) = \log_2(R/(k_i(A)G))$, where $c_i(A)$ is the LOWESS fit to the MA-plot for the $i$th grid only.
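The grouping logic of method (c) can be sketched independently of the smoother. In this illustration the curve-fitting function is pluggable, and the default is an ordinary least-squares line used purely as a stand-in for a LOWESS fit; the names `fit_line` and `normalize_by_print_tip` and the data are hypothetical.

```python
from collections import defaultdict

def fit_line(a_values, m_values):
    """Least-squares line through (A, M); a stand-in for a LOWESS fit."""
    n = len(a_values)
    mean_a = sum(a_values) / n
    mean_m = sum(m_values) / n
    sxx = sum((a - mean_a) ** 2 for a in a_values)
    sxy = sum((a - mean_a) * (m - mean_m) for a, m in zip(a_values, m_values))
    slope = sxy / sxx if sxx > 0 else 0.0
    return [mean_m + slope * (a - mean_a) for a in a_values]

def normalize_by_print_tip(a_values, m_values, tip_ids, fit=fit_line):
    """Subtract c_i(A), fitted separately within each print-tip group i."""
    groups = defaultdict(list)
    for index, tip in enumerate(tip_ids):
        groups[tip].append(index)
    normalized = [0.0] * len(m_values)
    for indices in groups.values():
        a_grp = [a_values[i] for i in indices]
        m_grp = [m_values[i] for i in indices]
        for i, c in zip(indices, fit(a_grp, m_grp)):
            normalized[i] = m_values[i] - c
    return normalized

# Hypothetical data: two print-tip groups with different dye-bias curves.
a_values = [8.0, 9.0, 10.0, 11.0, 8.0, 9.0, 10.0, 11.0]
tip_ids = [1, 1, 1, 1, 2, 2, 2, 2]
m_values = [0.2 * a - 1.0 for a in a_values[:4]] + [-0.1 * a + 2.0 for a in a_values[4:]]
normalized = normalize_by_print_tip(a_values, m_values, tip_ids)
```

A single global curve could not remove both biases here, because the two groups have slopes of opposite sign; fitting within each group does.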

Which spots to use for normalization? The LOWESS lines can be run through many different sets of points, and each strategy has its own implicit set of assumptions justifying its applicability. For example, we can justify the use of a global LOWESS approach by supposing that, when stratified by mRNA abundance, a) only a minority of genes are expected to be differentially expressed, or b) any differential expression is as likely to be up-regulation as down-regulation. Pin-group LOWESS requires stronger assumptions: that one of the above applies within each pin-group. The use of other sets of genes, e.g. control or housekeeping genes, involves similar assumptions.

Use of control spots. $M = \log(R/G) = \log R - \log G$; $A = (\log R + \log G)/2$. Plot legend: positive controls (spotted in varying concentrations); negative controls; blanks; lowess curve.

Global scale, global lowess, pin-group lowess; spatial plot after, smooth histograms of M after

MSP titration series (Microarray Sample Pool). Control set to aid intensity-dependent normalization. Different concentrations. Spotted evenly spread across the slide. Pool the whole library.

MSP normalization compared to other methods. Yellow: GAPDH, tubulin. Light blue: MSP pool / titration. Orange: Schadt-Wong rank invariant set. Red line: lowess smooth.

Composite normalization. Before and after composite normalization. Curves: MSP lowess curve; global lowess curve; composite lowess curve (other colours: control spots). $c_i(A) = \alpha_A\, g(A) + (1 - \alpha_A)\, f_i(A)$
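The composite curve is a weighted blend of two fitted curves, which can be sketched as below. Several things here are assumptions rather than content of the slide: the weight is taken to be the proportion of spots with intensity below A (the definition used in Yang et al. 2002), g is taken to be the MSP-based fit and f_i the within-group fit, and the function name and numbers are made up.

```python
from bisect import bisect_left

def composite_correction(a_values, g_fit, f_fit):
    """Blend two fitted curves: c(A) = alpha_A * g(A) + (1 - alpha_A) * f(A).

    alpha_A is assumed to be the fraction of spots with intensity below A,
    so the g curve dominates at high intensity and f at low intensity.
    """
    sorted_a = sorted(a_values)
    n = len(a_values)
    corrections = []
    for a, g, f in zip(a_values, g_fit, f_fit):
        alpha = bisect_left(sorted_a, a) / n  # share of spots with A below a
        corrections.append(alpha * g + (1.0 - alpha) * f)
    return corrections

# Hypothetical fitted values at four spots (illustration only).
a_values = [8.0, 9.0, 10.0, 11.0]
g_fit = [1.0, 1.0, 1.0, 1.0]   # e.g. curve fitted to the MSP titration spots
f_fit = [0.0, 0.0, 0.0, 0.0]   # e.g. curve fitted within one print-tip group
corrections = composite_correction(a_values, g_fit, f_fit)
```

With these made-up constant curves the blend moves from the within-group fit at the lowest intensity toward the control-based fit as intensity increases.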

Comparison of Normalization Schemes (courtesy of Jason Goncalves). No consensus on best segmentation or normalization method. Scheme was applied to assess the common normalization methods. Based on reciprocal labeling experiment data for a series of 140 replicate experiments on two different arrays, each with 19,200 spots.

DESIGN OF RECIPROCAL LABELING EXPERIMENT. Replicate experiment in which we assess the same mRNA pools but invert the fluors used. The replicates are independent experiments and are scanned, quantified and normalized as usual.

The following relationship would be observed for reciprocal microarray experiments in which the slides are free of defects and the normalization scheme performed ideally. [Equation shown on the original slide] We can measure using real data sets how well each microarray normalization scheme approaches this ideal.

Deviation metric to assess normalization schemes We now use the mean array average deviation to compare the normalization methods. Note that this comparison addresses only variance (precision) and not bias (accuracy) aspects of normalization.

***

Scale normalization: between slides. Boxplots of log ratios from 3 replicate self-self hybridizations. Left panel: before normalization. Middle panel: after within print-tip group normalization. Right panel: after a further between-slide scale normalization.

The “NCI 60” experiments (no bg) Some scale normalization seems desirable

Scale normalization: another data set. [Boxplots of log-ratios] Only small differences in spread apparent. No action required.

One way of taking scale into account. Assumption: all slides have the same spread in $M$. The true log ratio is $\mu_{ij}$, where $i$ indexes slides and $j$ indexes spots. Observed is $M_{ij}$, where $M_{ij} = a_i \mu_{ij}$. A robust estimate of $a_i$ is $\mathrm{MAD}_i = \mathrm{median}_j \{ |M_{ij} - \mathrm{median}_j(M_{ij})| \}$.
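A between-slide scale adjustment based on the MAD might be sketched as follows. Dividing each MAD by the geometric mean of all MADs, so that the scale factors multiply to 1, is an assumption added here for illustration; the slide itself only states the MAD as the robust estimate. Function names and data are hypothetical, and every slide is assumed to have a non-zero MAD.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def scale_normalize(slides):
    """Rescale each slide's log ratios so that all slides share one spread.

    Scale factor for slide i: MAD_i divided by the geometric mean of all
    MADs (an assumed convention; it keeps the overall scale unchanged).
    """
    mads = [mad(m_values) for m_values in slides]
    geo_mean = math.exp(sum(math.log(s) for s in mads) / len(mads))
    factors = [s / geo_mean for s in mads]
    rescaled = [[m / a for m in m_values] for m_values, a in zip(slides, factors)]
    return rescaled, factors

# Hypothetical log ratios from two slides with different spread.
slides = [[-2.0, -1.0, 0.0, 1.0, 2.0], [-4.0, -2.0, 0.0, 2.0, 4.0]]
rescaled, factors = scale_normalize(slides)
rescaled_mads = [mad(m_values) for m_values in rescaled]
```

After rescaling, both made-up slides have the same MAD, which is the stated aim of the adjustment.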

A slightly harder normalization problem Global lowess doesn’t do the trick here.

Print-tip-group normalization helps

But not completely There is still a lot of scatter in the middle in a WT vs KO comparison.

Effects of previous normalisation. Panels: before normalisation; after print-tip-group normalization.

Within print-tip-group box plots of M after print-tip-group normalization

Taking scale into account, cont. Assumption: all print-tip-groups have the same spread in $M$. The true log ratio is $\mu_{ij}$, where $i$ indexes print-tip-groups and $j$ indexes spots. Observed is $M_{ij}$, where $M_{ij} = a_i \mu_{ij}$. A robust estimate of $a_i$ is $\mathrm{MAD}_i = \mathrm{median}_j \{ |M_{ij} - \mathrm{median}_j(M_{ij})| \}$.

Effect of location & scale normalization Clearly care is needed in making decisions like this one.

A comparison of three MA-plots: unnormalized; print-tip normalization; print-tip & scale normalization.

The same idea on another data set. After print-tip location and scale normalization. [Boxplots: log-ratios by print-tip group]

Follow-up experiment On each slide, half the spots (  8) are differentially expressed, the other half are not.

Paired slides: dye-swap
Slide 1: $M = \log_2(R/G) - c$
Slide 2: $M' = \log_2(R'/G') - c'$
Combine by subtracting the normalized log-ratios:
$[(\log_2(R/G) - c) - (\log_2(R'/G') - c')]/2 = [\log_2(R/G) + \log_2(G'/R')]/2 = [\log_2(RG'/(GR'))]/2$, provided $c = c'$.
Assumption: the normalization functions are the same for the two slides.
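The cancellation of the dye bias in a dye-swap pair can be checked with a small numerical sketch. The function name and the intensities are hypothetical, and the example assumes the same multiplicative bias on the red channel of both slides, which is the slide's assumption c = c'.

```python
import math

def dye_swap_log_ratio(r, g, r_swap, g_swap):
    """Self-normalized log ratio from a dye-swap pair: (M - M') / 2.

    (r, g) come from slide 1 and (r_swap, g_swap) from the swapped slide.
    No explicit normalization constant is needed if it is the same on
    both slides, because it cancels in the difference.
    """
    m = math.log2(r / g)
    m_swap = math.log2(r_swap / g_swap)
    return 0.5 * (m - m_swap)

# Hypothetical spot: true ratio 4 (log2 = 2), red channel inflated 2-fold.
# Slide 1: sample in red    -> R/G   = 4 * 2     = 8
# Slide 2: sample in green  -> R'/G' = (1/4) * 2 = 0.5
estimate = dye_swap_log_ratio(800.0, 100.0, 50.0, 100.0)
```

In this made-up case each slide on its own is off by one log2 unit, while the combined value recovers the true log ratio of 2.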

Checking the assumption MA plot for slides 1 and 2: it isn’t always like this.

Result of self-normalization (M - M’)/2 vs. (A + A’)/2

Summary of normalization
- Reduces systematic (not random) effects
- Makes it possible to compare several arrays
- Use log ratios (MA-plots)
- Lowess normalization (dye bias)
- MSP titration series: composite normalization
- Pin-group location normalization
- Pin-group scale normalization
- Between-slide scale normalization
- More? Use controls!
- Normalization introduces more variability
- Outliers (bad spots) are handled with replication

What is missing? Principally, a discussion of data quality issues. Most image analysis programs collect a wide range of measurements associated with each spot: morphological measures such as area and perimeter (in pixels); uniformity measures such as the SD of foreground and background intensities in each channel, and of ratios of intensities (with and without background) across the pixels in a spot; and spot brightness indicators such as the ratio of spot foreground to spot background, and the fraction of pixels in the foreground with intensity greater than background intensity (or a given multiple thereof). From these, further derived measures can be calculated, such as coefficients of variation. How should we make use of the various quality indicators? Most programs include procedures for flagging spots on the basis of one or more indicators, and users typically omit flagged spots from their primary analyses. “Data filtering” of this kind clearly improves the appearance of the data, but can we do more? That is a longer story, for another time.

Acknowledgments Jean Yee Hwa Yang (UCB) Sandrine Dudoit (UCB) Natalie Thorne (WEHI) Ingrid Lönnstedt (Uppsala) Henrik Bengtsson (Lund) Jason Goncalves (Iobion) Matt Callow (LLNL) Percy Luu (UCB) John Ngai (UCB) Vivian Peng (UCB) Dave Lin (Cornell)

Reference: Yang et al. (2002) Nucleic Acids Research 30, e15.
Some web sites: technical reports, talks, software etc.; statistical software R (“GNU’s S”).
Packages within R environment: SMA (statistics for microarray analysis); Spot.