1 Example Analysis of an Affymetrix Dataset Using AFFY and LIMMA 4/4/2011 Copyright © 2011 Dan Nettleton.

Slides:



Advertisements
Similar presentations
Linear Models for Microarray Data
Advertisements

Exercise 1: Importing Illumina data  Using the Import tool File / Import folder. Select the folder IlluminaTeratospermiaHuman6v1_BS1 In the Import files.
Overview of Bioconductor
An Introduction to Bioconductor Bethany Wolf Statistical Computing I April 4, 2013.
P. J. Munson, National Institutes of Health, Nov. 2001Page 1 A "Consistency" Test for Determining the Significance of Gene Expression Changes on Replicate.
Differential Gene Expression with the limma package
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
1 Introduction to Experimental Design 1/26/2009 Copyright © 2009 Dan Nettleton.
Genomic Profiles of Brain Tissue in Humans and Chimpanzees II Naomi Altman Oct 06.
Gene Expression Index Stat Outline Gene expression index –MAS4, average –MAS5, Tukey Biweight –dChip, model based, multi-array –RMA, model.
Microarray Normalization
Getting the numbers comparable
1 Analysis of Affymetrix GeneChip Data EPP 245/298 Statistical Analysis of Laboratory Data.
Public data - available for projects 6 data sets: –Human Tissues –Leukemia –Spike-in –FARO compendium – Yeast Cell Cycle –Yeast Rosetta Find one yourself.
Preprocessing Methods for Two-Color Microarray Data
Basics of ANOVA Why ANOVA Assumptions used in ANOVA
1 Preprocessing for Affymetrix GeneChip Data 1/18/2011 Copyright © 2011 Dan Nettleton.
ViaLogy Lien Chung Jim Breaux, Ph.D. SoCalBSI 2004 “ Improvements to Microarray Analytical Methods and Development of Differential Expression Toolkit ”
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.
Genomic Profiles of Brain Tissue in Humans and Chimpanzees.
Microarray Preprocessing
1 Normalization Methods for Two-Color Microarray Data 1/13/2009 Copyright © 2009 Dan Nettleton.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Affymetrix GeneChips Oligonucleotide.
Copyright © 2014, 2011 Pearson Education, Inc. 1 Chapter 25 Categorical Explanatory Variables.
STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Lab 4 R and Bioconductor II Feb 15, 2012 Alejandro Quiroz and Daniel Fernandez
Montecarlo Simulation LAB NOV ECON Montecarlo Simulations Monte Carlo simulation is a method of analysis based on artificially recreating.
Bioconductor Packages for Pre-processing DNA Microarray Data affy and marray Sandrine Dudoit, Robert Gentleman, Rafael Irizarry, and Yee Hwa Yang Bioconductor.
Randomization issues Two-sample t-test vs paired t-test I made a mistake in creating the dataset, so previous analyses will not be comparable.
Probe-Level Data Normalisation: RMA and GC-RMA Sam Robson Images courtesy of Neil Ward, European Application Engineer, Agilent Technologies.
Differential Expression II Adding power by modeling all the genes Oct 06.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
Review of Building Multiple Regression Models Generalization of univariate linear regression models. One unit of data with a value of dependent variable.
Lecture Topic 5 Pre-processing AFFY data. Probe Level Analysis The Purpose –Calculate an expression value for each probe set (gene) from the PM.
Model-based analysis of oligonucleotide arrays, dChip software Statistics and Genomics – Lecture 4 Department of Biostatistics Harvard School of Public.
SPH 247 Statistical Analysis of Laboratory Data 1April 16, 2013SPH 247 Statistical Analysis of Laboratory Data.
STA 286 week 131 Inference for the Regression Coefficient Recall, b 0 and b 1 are the estimates of the slope β 1 and intercept β 0 of population regression.
1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:
Statistics for Differential Expression Naomi Altman Oct. 06.
1 Introduction to Mixed Linear Models in Microarray Experiments 2/1/2011 Copyright © 2011 Dan Nettleton.
1 Example Analysis of a Two-Color Array Experiment Using LIMMA 3/30/2011 Copyright © 2011 Dan Nettleton.
Chapter 22: Building Multiple Regression Models Generalization of univariate linear regression models. One unit of data with a value of dependent variable.
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Lecture 6 Design Matrices and ANOVA and how this is done in LIMMA.
Paper Review on Cross- species Microarray Comparison Hong Lu
1 Estimation of Gene-Specific Variance 2/17/2011 Copyright © 2011 Dan Nettleton.
The Broad Institute of MIT and Harvard Differential Analysis.
Empirical Bayes Analysis of Variance Component Models for Microarray Data S. Feng, 1 R.Wolfinger, 2 T.Chu, 2 G.Gibson, 3 L.McGraw 4 1. Department of Statistics,
Oigonucleotide (Affyx) Array Basics Joseph Nevins Holly Dressman Mike West Duke University.
Consistency Assessment among Redundant Probe Sets Interrogating the Same Gene on the Affymetrix MOE430 GeneChip Department of Biostatistics, Section on.
> FDRpvalue FDRTPvalue plot(LimmaFit$coefficients[complete.data],-log(FDRTPvalue[complete.data],base=10),type="p",main="Two-Sample.
MATH Section 4.4.
Microarray Data Analysis Xuming He Department of Statistics University of Illinois at Urbana-Champaign.
Lab 5 Unsupervised and supervised clustering Feb 22 th 2012 Daniel Fernandez Alejandro Quiroz.
Estimation of Gene-Specific Variance
Using ArrayStar with a public dataset
Differential Gene Expression
CDNA-Project cDNA project Julia Brettschneider (UCB Statistics)
Significance Analysis of Microarrays (SAM)
Analysis of Affymetrix GeneChip Data
Significance Analysis of Microarrays (SAM)
Getting the numbers comparable
Introduction to Mixed Linear Models in Microarray Experiments
Volume 3, Issue 1, Pages (July 2016)
Introduction to Experimental Design
Affymetrix and BioConductor
Pre-processing AFFY data
Presentation transcript:

1 Example Analysis of an Affymetrix Dataset Using AFFY and LIMMA 4/4/2011 Copyright © 2011 Dan Nettleton

2 Example Dataset Available at Gene Expression Omnibus (GEO) Website as GSE 6297 Somel M, Creely H, Franz H, Mueller U et al. (2008). Human and chimpanzee gene expression differences replicated in mice fed different diets. PLoS One 30;3(1):e1504. PMID:

3 Experiment Description from GEO We fed four groups of six genetically identical 8-week-old female NMR1 mice one of four diets ad libidum: (1) a diet consisting of vegetables, fruit and yogurt identical to the diet fed to chimpanzees in our ape facility ('Chimpanzee'); (2) a diet consisting exclusively of McDonalds fast food ('FastFood'); (3) a diet consisting of cooked food eaten by our staff in the Institute cafeteria ('HumanCafe'); (4) the mouse pellet diet on which they were raised ('Pellet').

4 Experiment Description from GEO (continued) At the end of a 2-week period, mice were euthanized by cervical dislocation and both liver and brain (right cerebral hemisphere) tissue were dissected. RNA was extracted from the 24 liver and brain samples as per established lab protocols and processed in two batches (containing equal numbers of individuals from all groups).

5 Example Analysis of Liver Samples Batch 1 Batch 2

6 #Load affy package. library(affy) #Examine documentation. vignette("affy") #Set the path to a directory containing the CEL files. setwd("C:\\z\\GEOdata") #Read the CEL files. D=ReadAffy()

7 D AffyBatch object size of arrays=1002x1002 features (15 kb) cdf=Mouse430_2 (45101 affyids) number of samples=24 number of genes=45101 annotation=mouse4302 notes= Warning message: package 'mouse4302cdf' was built under R version attributes(D) $.__classVersion__ R Biobase eSet AffyBatch "2.12.0" "2.10.0" "1.3.0" "1.2.0" $cdfName [1] "Mouse430_2"

8 $nrow Rows 1002 $ncol Cols 1002 $assayData $phenoData An object of class "AnnotatedDataFrame" sampleNames: GSM CEL GSM CEL... GSM CEL (24 total) varLabels: sample varMetadata: labelDescription $featureData An object of class "AnnotatedDataFrame": none

9 $featureData An object of class "AnnotatedDataFrame": none $experimentData Experiment data Experimenter name: Laboratory: Contact information: Title: URL: PMIDs: No abstract available. Information is available on: preprocessing notes: : $annotation [1] "mouse4302"

10 $protocolData An object of class "AnnotatedDataFrame" sampleNames: GSM CEL GSM CEL... GSM CEL (24 total) varLabels: ScanDate varMetadata: labelDescription $class [1] "AffyBatch" attr(,"package") [1] "affy"

11 sampleNames(D) [1] "GSM CEL" "GSM CEL" "GSM CEL" "GSM CEL" [5] "GSM CEL" "GSM CEL" "GSM CEL" "GSM CEL" [9] "GSM CEL" "GSM CEL" "GSM CEL" "GSM CEL" [13] "GSM CEL" "GSM CEL" "GSM CEL" "GSM CEL" [17] "GSM CEL" "GSM CEL" "GSM CEL" "GSM CEL" [21] "GSM CEL" "GSM CEL" "GSM CEL" "GSM CEL“ #Sometimes phenoData(D) will contain information about #how to match the CEL files to treatment information. #In this case, that information must be taken from #the GEO database. diet=rep(1:4,each=6) batch=rep(rep(1:2,each=3),4) rbind(diet,batch) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] diet batch [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] diet batch

Batch 1 Batch 2 1st GeneChip 6th GeneChip 19th GeneChip

13 #We can examine PM densities for all GeneChips. hist(D) #We can examine PM densities for subsets of GeneChips. hist(D[,batch==1],main="PM Densities for Batch 1") hist(D[,batch==2], main="PM Densities for Batch 2") #Make side-by-side boxplots of log2 PM #color-coded by batch (red=1,blue=2) boxplot(D,axes=F,xlab="GeneChip",ylab="log2 PM", col=2*batch) axis(1,labels= paste(diet,batch,substr(sampleNames(D),8,9),sep=""), at=1:24,las=3) axis(2) box()

14

15

16

17

18 #See an image of GeneChip 18. image(D[,18])

19 #Store probeset IDs and examine first 10. gn=geneNames(D) gn[1:10] [1] " _at" " _at" " _at" " _at" " _a_at" [6] " _at" " _a_at" " _at" " _at" " _at" #Shorten sample names. sampleNames(D)= paste("GC",substr(sampleNames(D),8,9),sep="") sampleNames(D) [1] "GC07" "GC08" "GC09" "GC10" "GC11" "GC12" [7] "GC13" "GC14" "GC15" "GC16" "GC17" "GC18" [13] "GC19" "GC20" "GC21" "GC22" "GC23" "GC24" [19] "GC25" "GC26" "GC27" "GC28" "GC29" "GC30"

20 #Examine PM data for 10th probeset and a subset of chips. pm(D, gn[10])[,11:20] GC17 GC18 GC19 GC20 GC21 GC22 GC23 GC24 GC25 GC _at _at _at _at _at _at _at _at _at _at _at

21 #Examine PM and MM data for first GeneChip. cbind(pm(D, gn[10])[,1],mm(D, gn[10])[,1]) [,1] [,2] _at _at _at _at _at _at _at _at _at _at _at plot(rep(1:11,2),c(pm(D, gn[10])[,1],mm(D, gn[10])[,1]), pch=rep(c("p","m"),each=11),col=(rep(c(4,2),each=11)), ylim=c(0,1000),xlab="Probe",ylab="Intensity")

22

23 #Find the proportion of MM that exceed PM. mean(pm(D)<mm(D)) [1] #Plot PM difference (M) vs. PM average (A) #for some GeneChip pairs. library(geneplotter) MAplot(D[,c(1,2,10)], pairs = TRUE, plot.method = "smoothScatter")

24

25 #Perform RMA normalization. r=rma(D) Background correcting Normalizing Calculating Expression r ExpressionSet (storageMode: lockedEnvironment) assayData: features, 24 samples element names: exprs protocolData sampleNames: GC07 GC08... GC30 (24 total) varLabels: ScanDate varMetadata: labelDescription phenoData sampleNames: GC07 GC08... GC30 (24 total) varLabels: sample varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation: mouse4302

26 #Create a data matrix with a row for each probeset #and a column for each GeneChip. d=exprs(r) dim(d) [1]

27 head(d) GC07 GC08 GC09 GC10 GC11 GC12 GC _at _at _at _at _a_at _at GC14 GC15 GC16 GC17 GC18 GC19 GC _at _at _at _at _a_at _at GC21 GC22 GC23 GC24 GC25 GC26 GC _at _at _at _at _a_at _at GC28 GC29 GC _at _at _at _at _a_at _at

28 boxplot(as.data.frame(d),las=3,col=2*batch)

29 #To keep the example simple, we will go through #an analysis for the Batch 1 data. d1=d[,batch==1] diet=factor(diet[batch==1]) diet [1] Levels:

30 library(limma) #R uses "set first to zero" constraint by default. design=model.matrix(~diet) design (Intercept) diet2 diet3 diet

31 #In this case it may be more convenient to #use a treatment mean parameterization. design=model.matrix(~0+diet) design diet1 diet2 diet3 diet

32 #Name the columns of the design matrix. #This is equivalent to naming the parameters #in the regression vector. colnames(design)=c("u1","u2","u3","u4") #Fit linear model for each gene. fit=lmFit(d1,design) #Set up contrasts of interest. #In this case, we consider all pairwise #comparisons between treatment means. contr.mat=makeContrasts(u1-u2,u1-u3,u1-u4, u2-u3,u2-u4,u3-u4, levels=design)

33 #Estimate contrasts. fit2=contrasts.fit(fit,contr.mat) #Conduct analysis based on Smyth's #empirical Bayes approach. fit3=eBayes(fit2) #Estimates of parameters in the prior on #the gene-specific error variances. fit3$df.prior [1] fit3$s2.prior [1]

34 #Compare variance estimates before and after shrinkage. plot(log(fit3$sigma),log(sqrt(fit3$s2.post)), xlab="Log Sqrt Original Variance Estimate", ylab="Log Sqrt Empirical Bayes Variance Estimate") lines(c(-99,99),c(-99,99),col=2) lsrs20=log(sqrt(fit3$s2.prior)) abline(h=lsrs20,col=4) abline(v=lsrs20,col=4)

35

36 #Examine p-values for F-test whose null says #that all contrasts in contr.mat are #simultaneously zero. In this case, that is #equivalent to H_0:u1=u2=u3=u4. hist(fit3$F.p.value,col=4) box()

37

38 #Examine p-values for pairwise comparisons. p=fit3$p.value dim(p) [1] hist(p[,1],col=4,cex.lab=1.5, xlab="p-value", ylab="Number of Probesets", main="Chimp Diet Mean vs. Fast Food Diet Mean") box(). hist(p[,6],col=4,cex.lab=1.5, xlab="p-value", ylab="Number of Probesets", main="Cafe Diet Mean vs. Pellet Diet Mean") box()

39

40

41

42

43

44

45 #Load some multiple testing functions. source(" dnett/microarray/multtest.txt") #Estimate proportion of true nulls #for each pairwise comparison. apply(p,2,estimate.m0)/nrow(d1) u1 - u2 u1 - u3 u1 - u4 u2 - u3 u2 - u4 u3 - u #Separately for each column of p, #convert p-values to q-values. qvalues=apply(p,2,jabes.q)

46 #For each pairwise comparison, find #number of genes declared significant #at an estimated FDR level of apply(qvalues<=0.05,2,sum) u1 - u2 u1 - u3 u1 - u4 u2 - u3 u2 - u4 u3 - u

47 #Find probeset IDs of some significant genes. row.names(d1)[qvalues[,2]<=0.05] [1] " _at" " _s_at" " _s_at" [4] " _at" " _at" " _at" [7] " _at" " _at" " _a_at" [10] " _a_at" " _a_at" " _at" [13] " _s_at" " _a_at" " _a_at" [16] " _at" " _at" " _s_at" [19] " _x_at" " _a_at" " _a_at" [22] " _at" " _at" " _a_at" [25] " _at" " _at" " _x_at" [28] " _x_at" " _s_at" " _x_at" [31] " _x_at" " _at" " _at" [34] " _at" " _at" " _at" [37] " _at" " _at" " _at" [40] " _at" " _at" " _s_at" [43] " _x_at" " _a_at"