Analysis of Affymetrix GeneChip Data
EPP 245/298 Statistical Analysis of Laboratory Data



November 10, 2005, EPP 245 Statistical Analysis of Laboratory Data

Basic Design of Expression Arrays
For each gene that is a target for the array, we have a known DNA sequence.
mRNA is reverse transcribed to DNA, and if a complementary sequence is on the chip, the DNA will be more likely to stick.
The DNA is labeled with a dye that fluoresces and generates a signal that is monotonic in the amount in the sample.

Figure: double-stranded DNA with exon, intron, and probe sequence regions marked.
TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC

cDNA arrays use variable-length probes.
Long oligo arrays use 60-70mers.
Affymetrix GeneChips use multiple 25-mers:
– For each exon, 8-20 distinct probes
– May overlap
– May cover more than one exon
Affymetrix chips also use mismatch (MM) probes, which have the same sequence as the perfect match (PM) probes except for the middle base, which is changed to inhibit binding. The MM probe is supposed to act as a control, but it often binds to another mRNA species instead, so many analysts do not use MM probes.
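The relation between a PM probe and its MM partner can be sketched in a few lines of base R. This is only an illustration: the helper name make_mismatch is made up here, and the code assumes that the middle base is swapped for its complement, which the slide does not state explicitly. The PM probe is taken as the first 25 bases of the example sequence above.

```r
# Build a mismatch (MM) probe from a perfect match (PM) 25-mer.
# Assumption: the middle base is replaced by its complement.
make_mismatch <- function(pm_probe) {
  bases <- strsplit(pm_probe, "")[[1]]
  mid <- (length(bases) + 1) / 2            # position 13 of a 25-mer
  complement <- c(A = "T", T = "A", C = "G", G = "C")
  bases[mid] <- complement[[bases[mid]]]
  paste(bases, collapse = "")
}

# First 25 bases of the example sequence
pm_probe <- substr("TAAATCGATACGCATTAGTTCGACCTATCGAAGACC", 1, 25)
mm_probe <- make_mismatch(pm_probe)

print(pm_probe)
print(mm_probe)
```

The two strings differ only at position 13, so an MM probe should bind its intended target less strongly than the PM probe does.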

Expression Indices
A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).
A large number of methods have been suggested.
Generally, the worst ones are those from Affymetrix, by a long way; "worse" here means less able to detect real differences.

Usable Methods
Li and Wong's dChip and follow-on work is demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA.
The RMA method of Irizarry et al. is available in Bioconductor.
The GLA method (Durbin, Rocke, Zhou) is also available in Bioconductor.

How to Install R and Bioconductor
Download the R install file from the website and run the installation.
Download Bioconductor as given below:
> source("
> biocLite()
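Current Bioconductor releases no longer use the biocLite() script; packages are installed through the BiocManager package from CRAN instead. A minimal sketch, assuming the affy and Biobase packages are the ones wanted; the interactive() guard is a choice made for this example so that nothing is downloaded when the code runs in a batch job.

```r
# Packages used in these slides (an assumption; adjust as needed).
pkgs <- c("affy", "Biobase")

# Which of them are not yet installed?
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install only in an interactive session, so batch runs never hit the network.
if (length(missing_pkgs) > 0 && interactive()) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(missing_pkgs)
}
```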

Bioconductor Documentation
> library(affy)
Welcome to Bioconductor
  Vignettes contain introductory material. To view, simply type: openVignette()
  For details on reading vignettes, see the openVignette help page.
> openVignette()
Please select (by number) a vignette
1: affy primer
2: affy: Built-in Processing Methods
3: affy: Custom Processing Methods (HowTo)
4: affy: Automatic downloading of cdfenvs (HowTo):
5: affy: Import Methods (HowTo)
6: Biobase Primer
7: Howto Bioconductor
8: HowTo HowTo
9: esApply Introduction
Selection: 6
[1] TRUE

The exprSet class
An object of class exprSet has several slots:
– exprs: a matrix of expression levels with chips as columns and "genes" as rows.
– se.exprs: a matrix of standard errors for exprs, if available.
– phenoData: an object of class phenoData containing phenotype or experimental data.

The exprSet class (continued)
– description: a description of the experiment, an object of class MIAME.
– annotation: a character string indicating the base name for the associated annotation.
– notes: notes describing features or aspects of the data, experiment, etc.
The phenoData class has slots:
– pData: a data frame where rows are cases and columns are variables.
– varLabels: a list of labels and descriptions for the variables.
The number of rows in pData is the same as the number of columns in exprs.
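The constraint that pData has one row per column of exprs can be illustrated with a small S4 class. This is a toy stand-in, not the Biobase exprSet: the class name ToyExprSet and its two slots are hypothetical, chosen only to show how a validity check ties the phenotype table to the expression matrix.

```r
library(methods)

# Hypothetical toy class (NOT the Biobase exprSet): two slots only.
setClass("ToyExprSet",
  slots = c(exprs = "matrix", pData = "data.frame"),
  validity = function(object) {
    if (nrow(object@pData) != ncol(object@exprs)) {
      return("rows of pData must match columns of exprs")
    }
    TRUE
  }
)

# 4 "genes" (rows) measured on 3 chips (columns)
exprs_mat <- matrix(rnorm(12), nrow = 4, ncol = 3,
                    dimnames = list(paste0("gene", 1:4), paste0("chip", 1:3)))
pheno <- data.frame(group = factor(c("a", "a", "b")),
                    row.names = colnames(exprs_mat))
toy <- new("ToyExprSet", exprs = exprs_mat, pData = pheno)

# A phenotype table with the wrong number of rows is rejected.
bad <- tryCatch(
  new("ToyExprSet", exprs = exprs_mat, pData = pheno[1:2, , drop = FALSE]),
  error = function(e) "rejected"
)
```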

> library(affy)
> data(geneData)
> class(geneData)
[1] "matrix"
> dim(geneData)
[1]
> data(geneCov)
> dim(geneCov)
[1] 26 3
> covN <- list(cov1 = "Covariate 1; 2 levels", cov2 = "Covariate 2; 2 levels",
+              cov3 = "Covariate 3; 3 levels")
> pD <- new("phenoData", pData=geneCov, varLabels=covN)
> eSet <- new("exprSet", exprs=geneData, phenoData=pD)
>
[1]

Reading Affy Data into R
The CEL files contain the data from an array.
We will look at data from an older type of array, the U95A, which contains 12,625 probe sets and 409,600 probes.
The CDF file contains information relating probe pair sets to locations on the array. These are built into the affy package for standard array types.

The ReadAffy function
ReadAffy() reads all of the CEL files in the current working directory into an object of class AffyBatch.
ReadAffy(widget=T) does so in a GUI that allows entry of other characteristics of the dataset.
You can also specify filenames, phenotype or experimental data, and MIAME information.

> getClass("exprSet")
Slots:
Name:  exprs      se.exprs   phenoData  description       annotation  notes
Class: exprMatrix exprMatrix phenoData  characterORMIAME  character   character

> getClass("AffyBatch")
Slots:
Name:  cdfName    nrow     ncol     exprs      se.exprs   phenoData  description       annotation  notes
Class: character  numeric  numeric  exprMatrix exprMatrix phenoData  characterORMIAME  character   character
Extends: "exprSet"

A Sample Experiment
The experiment consists of 10 samples, one from each of 10 cell lines.
The cell lines are of 5 types, with 2 cell lines of each type.
We have 10 Affymetrix U95A arrays, one per sample.

group <- data.frame(as.factor(rep(1:5,each=2)))
covN <- list(group = "Type of Cell Line")
pD <- new("phenoData",pData=group,varLabels=covN)
myMIAME <- new("MIAME",name="John Post, PhD",
               lab="Jane Q. Investigator, MD",
               contact=" ",
               title="Sample Experiment for EPP 298")
Data <- ReadAffy(filenames=c("LN0A.CEL","LN0B.CEL","LN1A.CEL","LN1B.CEL",
                             "LN2A.CEL","LN2B.CEL","LN3A.CEL","LN3B.CEL",
                             "LN4A.CEL","LN4B.CEL"),
                 phenoData=pD,description=myMIAME)
>
[1]
>
     LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL LN3B.CEL LN4A.CEL LN4B.CEL
[1,]
[2,]
[3,]
[4,]
[5,]

>
phenoData object with 1 variables and 10 cases
  varLabels
    group: Type of Cell Line
>
  as.factor.rep.1.5..each
>
Experimenter name: John Post, PhD
Laboratory: Jane Q. Investigator, MD
Contact information:
Title: Sample Experiment for EPP 298
URL:
No abstract available.
Information is available on: preprocessing

Expression Indices
Each of the 409,600 rows of the expression matrix in the AffyBatch object Data corresponds to a probe (a 25-mer).
Ordinarily, to use these data we need to combine the probe-level data for each probe set into a single expression number.
Conceptually, this has several steps.

Steps in Expression Index Construction
Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.
We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.

Data transformation is the process of changing the scale of the data so that variability is more comparable from high to low intensities. Common transformations are the logarithm and the generalized logarithm.
Normalization is the process of adjusting for systematic differences from one array to another.
Normalization may be done before or after transformation, and before or after probe set summarization.

One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes.
There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (a total of 400 numbers) into 10 expression index numbers.

The RMA Method
– Background correction that does not make 0 signal correspond to 0 amount
– Quantile normalization
– Log 2 transform
– Median polish summary of PM probes
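The last three RMA steps can be sketched in base R on made-up data. This is not the affy implementation: the background correction step is left out, the intensities are simulated, and the helper names quantile_normalize and summarize_set are hypothetical. The toy matrix holds 3 probe sets of 4 PM probes each on 4 arrays.

```r
set.seed(1)

# Simulated PM intensities: 12 probes (rows) on 4 arrays (columns).
probe_set <- rep(c("setA", "setB", "setC"), each = 4)
pm_toy <- matrix(rexp(48, rate = 1 / 200) + 50, nrow = 12, ncol = 4,
                 dimnames = list(NULL, paste0("array", 1:4)))

# Quantile normalization: replace the k-th smallest value on each array
# by the mean of the k-th smallest values across all arrays.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

normalized <- quantile_normalize(pm_toy)
logged <- log2(normalized)

# Median polish fits log2 intensity = overall + probe effect + array effect;
# overall + array effect is the expression index for each array.
summarize_set <- function(rows) {
  fit <- medpolish(rows, trace.iter = FALSE)
  fit$overall + fit$col
}

expression_index <- t(sapply(
  split(seq_len(nrow(logged)), probe_set),
  function(idx) summarize_set(logged[idx, , drop = FALSE])
))
print(round(expression_index, 3))
```

After quantile normalization every array has exactly the same set of values, only in a different order, which is what makes the arrays comparable before summarization.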

> eidata <- rma(Data)
Background correcting
Normalizing
Calculating Expression
> class(eidata)
[1] "exprSet"
attr(,"package")
[1] "Biobase"
>
[1]

Output: per-array summary (Min., 1st Qu., Median, Mean, 3rd Qu., Max.) of the RMA expression indices for LN0A.CEL through LN4B.CEL. The numeric values were lost in extraction, except the maximum for LN4B.CEL, 12.153.

The GLA Method
The Glog Average (GLA) method is simpler than the RMA method, though it can require estimation of a parameter.
– Background correction is intended to make a measured value of zero correspond to a zero quantity in the sample.
– Transformation uses the glog, which is approximately ln (plus a constant) for large values.
– Normalization is via lowess.
– The summary is a simple average of PM probes.

> source("gla.r")
> emat.gla <- gla(Data)
> class(emat.gla)
[1] "matrix"
> dim(emat.gla)
[1]

> summary(emat.gla)
            X1     X2     X3     X4     X5     X6     X7     X8     X9    X10
Min.     3.859  3.801  3.811     --     --     --  3.792     --  3.907     --
1st Qu.     --     --     --  4.942     --     --     --  4.936     --  4.937
Median   5.456  5.449  5.446  5.453     --     --  5.458  5.441  5.452  5.453
Mean     5.569  5.573  5.569     --     --     --  5.597     --  5.569     --
3rd Qu.     --     --     --  6.034     --     --     --  6.036     --  6.053
Max.     8.794  8.724  8.733  8.731     --     --  9.948  9.976  8.772  8.732
(-- marks a value lost in extraction)

glog <- function(y,lambda) {
  # generalized log: close to log(2*y) for large y, and defined for y <= 0 when lambda > 0
  yt <- log(y+sqrt(y^2+lambda))
  return(yt)
}
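A quick numerical check shows how the glog behaves at the two ends of the scale. The input values are arbitrary, and lambda = 1000 is simply the default that gla() uses. For large y the glog differs from ln(y) by almost exactly ln(2), while for zero or negative background-corrected values it stays finite where the ordinary log does not.

```r
glog <- function(y, lambda) log(y + sqrt(y^2 + lambda))

lambda <- 1000

# Large values: glog(y) is very close to log(2 * y) = log(y) + log(2).
y_large <- c(1e4, 1e5, 1e6)
gap_large <- glog(y_large, lambda) - log(2 * y_large)
print(gap_large)

# Small or negative values: glog stays finite, the ordinary log does not.
y_small <- c(-20, 0, 20)
glog_small <- glog(y_small, lambda)
log_small <- suppressWarnings(log(y_small))
print(glog_small)
print(log_small)
```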

lnorm <- function(mat1,span=.1) {
  print("Start Lowess Normalization")
  mat2 <- as.matrix(mat1)
  p <- dim(mat2)[1]
  n <- dim(mat2)[2]
  rmeans <- apply(mat2,1,mean)
  r <- rnorm(p,0,1e-7)                              # tiny jitter to break ties in the row means
  rmeans <- r+rmeans
  rranks <- rank(rmeans)
  matsort <- mat2[order(rranks),]                   # rows sorted by mean
  r0 <- 1:p
  lcol <- function(x) {
    lx <- lowess(r0,x,f=span)$y                     # returns vector of length p
  }
  lmeans <- apply(matsort,2,lcol)                   # column lowess
  lgrand <- apply(lmeans,1,mean)                    # grand lowess
  lgrand <- matrix(rep(lgrand,n),byrow=F,ncol=n)    # grand lowess
  matnorm0 <- matsort-lmeans+lgrand                 # normalized matrix
  matnorm1 <- matnorm0[rranks,]                     # returns row order to original
  print("End Lowess Normalization")
  return(matnorm1)
}

gla <- function(AB,lambda=1000,alpha=50) {
  gn <- geneNames(AB)
  I <- length(gn)
  #
  # ind will be a vector with i repeated as many
  # times as gn[i] has PM probes
  # This shows which gene number is associated with each probe
  #
  ind <- vector()
  for (i in 1:I) {
    ind <- c(ind, rep(i, dim(pm(AB,gn[i]))[1]))
  }
  mat <- pm(AB)
  EI <- matrix(0, I, dim(mat)[2])
  # subtract the background alpha, glog transform, then lowess normalize
  mat.glog <- lnorm(glog(mat-alpha,lambda))
  for (i in 1:I) {
    # drop=FALSE keeps a matrix even if a probe set has a single PM probe
    tmp <- apply(mat.glog[ind==i,,drop=FALSE],2,mean)
    EI[i,] <- t(tmp)
  }
  return(EI)
}
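The final averaging loop in gla() can also be written without a loop. The sketch below uses made-up data rather than an AffyBatch: ind and mat_glog are toy stand-ins for the probe-to-probe-set index and the normalized glog matrix. It checks that base R's rowsum() gives the same probe set averages as the loop.

```r
set.seed(2)

# Toy stand-ins: 3 probe sets with 4, 5, and 3 PM probes on 2 arrays.
ind <- rep(1:3, times = c(4, 5, 3))
mat_glog <- matrix(rnorm(length(ind) * 2, mean = 5), ncol = 2)

# Loop version, following gla().
EI_loop <- matrix(0, 3, ncol(mat_glog))
for (i in 1:3) {
  EI_loop[i, ] <- colMeans(mat_glog[ind == i, , drop = FALSE])
}

# Vectorized version: per-probe-set column sums divided by probe counts.
EI_fast <- rowsum(mat_glog, ind) / as.vector(table(ind))

print(all.equal(EI_loop, EI_fast, check.attributes = FALSE))
```

With thousands of probe sets the vectorized form avoids subsetting the full matrix once per probe set, which is the slow part of the loop.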

Probe Sets not Genes
It is hard to avoid referring to a probe set as measuring a "gene", but doing so can be deceptive.
The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism.
Only relatively few probe sets correspond to genes with known function and known structure in the organism being studied.

Exercise (Optional)
Download the ten arrays and the gla.r file from the web site.
Load the arrays into R using ReadAffy and construct the RMA expression indices.
Source the gla.r functions and construct the GLA expression indices.