Is a Specific Gene Differentially Expressed?

For a specific gene, $x_{ij}$ = the $i$th measurement under condition $j$, $i = 1, \ldots, 6$; $j = 1, 2$.
Differential expression: the condition means $\mu_1$ and $\mu_2$ differ.
Statistical model of the observed data.
Estimate the model parameters based on the data.
Calculate the t-statistic $t^*$ (two-sided: $-t^*$ and $t^*$).
Calculate the p-value based on the "null distribution" of the t-statistic, assuming $\mu_1 = \mu_2$.
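The steps on this slide can be written out explicitly. The following is a sketch that assumes the equal-variance (pooled) two-sample t-test, which is what the R code later in these slides computes; the symbols $n_j$, $\bar{x}_j$, $s_j^2$, $s_p^2$ and $T_\nu$ are notation introduced here, with $n_1 = n_2 = 6$ when no measurements are missing.

```latex
\[
\bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ij},
\qquad
s_j^2 = \frac{1}{n_j - 1}\sum_{i=1}^{n_j}\left(x_{ij} - \bar{x}_j\right)^2,
\qquad j = 1, 2
\]
\[
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
t^* = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
\]
\[
p = 2\,P\!\left(T_{n_1 + n_2 - 2} \ge |t^*|\right)
\quad\text{under the null hypothesis } \mu_1 = \mu_2
\]
```

Here $T_\nu$ denotes a random variable with a t-distribution on $\nu$ degrees of freedom, so the p-value is the two-sided tail area beyond $\pm|t^*|$.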
Genome-wide analysis

How do we perform t-tests for 30,000 genes at once?
How do we handle results, and present data and results?
What is significant?
How do we compare different approaches to normalization of the data and to the statistical analysis of results?
Ideally, we would like to maximize our ability to identify truly differentially expressed genes and minimize the number of falsely implicated genes.
Doing it by hand (in R) first.
Using Bioconductor.
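Calculating t-test for 30,000 genes at a time

Data import:

    source("
    > SimpleData<-read.table(file="
    + header=TRUE,quote="",sep="\t",comment.char="")
    > SimpleData[1,]
      Name ID W1 C1 W2 C2 W3 C3 C4 W4 C5 W5 C6 W6
    1 no name Rn
    > # column positions used for the W and C measurements
    > # note: with the header shown above, positions 9, 11, 13 are C4, C5, C6
    > # and positions 10, 12, 14 are W4, W5, W6
    > W<-c(3,5,7,9,11,13)
    > C<-c(4,6,8,10,12,14)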
Calculating t-test for 30,000 genes at a time

Transforming data:

    source("
    > NoZerosData<-SimpleData[,3:14]
    > NoZerosData[33525,]
      W1 C1 W2 C2 W3 C3 C4 W4 C5 W5 C6 W6
    > NoZerosData[NoZerosData==0]<-NA
    > NoZerosData[33525,]
      W1 C1 W2 C2 W3 C3 C4 W4 C5 W5 C6 W6
      NA NA
    > LSimpleData<-SimpleData
    > LSimpleData[,3:14]<-log(NoZerosData,base=2)

Why zeros are replaced with NA before taking logs:
log(0) = -Inf
log(-1) = NaN (with a warning)
function(-Inf) = -Inf or Inf or NaN
Missing values can be skipped in later summaries with na.rm=TRUE.
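The effect of a single zero intensity on a log-scale summary can be seen in a small sketch. The vector `x` below is made-up illustration data, not part of `SimpleData`.

```r
# Hypothetical intensities for one gene; the middle value is a zero reading
x <- c(8, 0, 32)

# On the log scale the zero becomes -Inf and dominates the mean
RawMean <- mean(log2(x))              # -Inf

# A negative value gives NaN (R also issues a warning, suppressed here)
NegLog <- suppressWarnings(log2(-1))  # NaN

# Replacing zeros by NA lets summaries skip them via na.rm=TRUE
x[x == 0] <- NA
CleanMean <- mean(log2(x), na.rm = TRUE)  # mean of log2(8)=3 and log2(32)=5
print(c(RawMean, CleanMean))
```

This is the same replacement the slide applies to the whole data frame with `NoZerosData[NoZerosData==0]<-NA`.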
Calculating t-test for 30,000 genes at a time

Calculating t-tests:

    source("
    # per-gene means and variances of the log2 data in each condition
    MW<-apply(t(LSimpleData[,W]),2,mean,na.rm=TRUE)
    VW<-apply(t(LSimpleData[,W]),2,var,na.rm=TRUE)
    MC<-apply(t(LSimpleData[,C]),2,mean,na.rm=TRUE)
    VC<-apply(t(LSimpleData[,C]),2,var,na.rm=TRUE)
    # per-gene number of non-missing measurements in each condition
    NW<-apply(t(!is.na(LSimpleData[,W])),2,sum,na.rm=TRUE)
    NC<-apply(t(!is.na(LSimpleData[,C])),2,sum,na.rm=TRUE)
    # pooled variance and degrees of freedom
    VWC<-(((NW-1)*VW)+((NC-1)*VC))/(NC+NW-2)
    DF<-NW+NC-2
    # absolute t-statistic and two-sided p-value
    TStat<-abs(MW-MC)/((VWC*((1/NW)+(1/NC)))^0.5)
    TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
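One way to gain confidence in the vectorized calculation is to compare it, on a small example, with R's built-in `t.test` run gene by gene. The sketch below uses a simulated matrix `Toy` (hypothetical data, not the lecture data set) and swaps in `rowMeans`/`rowSums` for the `apply(t(...), 2, ...)` calls, which should give the same per-row results.

```r
set.seed(1)
# Hypothetical toy data: 5 genes x 12 arrays of log2 intensities
Toy <- matrix(rnorm(60, mean = 8), nrow = 5)
Toy[2, 3] <- NA                      # one missing measurement
W <- c(1, 3, 5, 7, 9, 11)            # assumed column positions, condition W
C <- c(2, 4, 6, 8, 10, 12)           # assumed column positions, condition C

# Vectorized pooled-variance t-test, one result per row (gene)
MW <- rowMeans(Toy[, W], na.rm = TRUE)
MC <- rowMeans(Toy[, C], na.rm = TRUE)
VW <- apply(Toy[, W], 1, var, na.rm = TRUE)
VC <- apply(Toy[, C], 1, var, na.rm = TRUE)
NW <- rowSums(!is.na(Toy[, W]))
NC <- rowSums(!is.na(Toy[, C]))
VWC <- ((NW - 1) * VW + (NC - 1) * VC) / (NW + NC - 2)
DF <- NW + NC - 2
TStat <- abs(MW - MC) / sqrt(VWC * (1 / NW + 1 / NC))
TPvalue <- 2 * pt(TStat, DF, lower.tail = FALSE)

# Reference: equal-variance t.test applied to each gene in turn
Ref <- sapply(seq_len(nrow(Toy)), function(g)
  t.test(Toy[g, W], Toy[g, C], var.equal = TRUE)$p.value)

print(all.equal(unname(TPvalue), Ref))
```

The loop over `t.test` is fine for five genes but slow for 30,000, which is the motivation for computing the statistic with whole-vector arithmetic.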
Displaying results – Scatter Plots

    source("
Displaying results – Histograms

    source("
Expression Data on Individual Microarrays

    source("
Microarray-Specific Normalization of Expression Data

Normalization is the process of removing systematic biases prior to statistical analysis.
Intensity-dependent trends are considered a systematic bias because it is extremely unlikely that they result from an underlying biological mechanism of interest.
This particular bias is effectively removed by estimating the intensity-dependent "trend" with local regression and subtracting it from the observed ratios.
We will generally assume that normalization procedures do not affect the independence of experimental replicates, since they are performed separately for each microarray.
Some biases cannot be factored out without introducing a certain level of correlation between replicates. Such biases will be factored out within the statistical model, which then accounts for the introduced correlation (through a multi-way analysis of variance model).
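The local-regression step described above can be sketched for a single array as follows. This is a minimal illustration on simulated data, not the script used in the lecture: the variables `A` (average log2 intensity) and `M` (log2 ratio), the simulated trend, and the `span` and `degree` settings passed to `loess` are all assumptions made for the example.

```r
set.seed(2)
# Hypothetical single array: A = average log2 intensity, M = log2 ratio
n <- 500
A <- runif(n, 6, 14)
M <- 1.5 - 0.15 * A + rnorm(n, sd = 0.3)   # artificial intensity-dependent bias

# Estimate the intensity-dependent trend with local regression
Fit <- loess(M ~ A, span = 0.3, degree = 1)

# Subtract the fitted trend from the observed log ratios
MNorm <- M - predict(Fit)

# Before normalization M depends on A; afterwards the dependence is removed
print(c(raw = cor(A, M), normalized = cor(A, MNorm)))
```

Because the fit uses only the data from one array, repeating it array by array keeps the normalization separate for each microarray, as the slide requires.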
Local Regression Normalization

    source("
Normalized Data

    source("
Normalized Data: Displaying results – Scatter Plots

    source(
Comparing Normalized and Raw Data Results

    source("

Box plot annotations: median, 25th percentile, 75th percentile, 1.5 × IQR.
Comparing Normalized and Raw Data Results

    source("