Lecture 23: Cross validation


1 Lecture 23: Cross validation
Statistical Genomics
Zhiwu Zhang, Washington State University

2 Administration
Homework 5, due April 13, Wednesday, 3:10PM
Final exam: May 3, 120 minutes (3:10-5:10PM), 50

3 Course evaluation and response
Genomic selection methods with packages in R GS by GWAS rrBLUP gBLUP cBLUP sBLUP Bayesian LASSO

4 Outline
GS by GWAS
Over fitting
Cross validation
K-fold validation
Jack knife
Re-sampling
Two ways of calculating accuracy
Bias and correction

5 Setup GAPIT
#source("http://www.bioconductor.org/biocLite.R")
#biocLite("multtest")
#install.packages("gplots")
#install.packages("scatterplot3d")
#The downloaded link at:
library('MASS') # required for ginv
library(multtest)
library(gplots)
library(compiler) #required for cmpfun
library("scatterplot3d")
source("
source("

6 Import data and simulate phenotype
myGD=read.table(file="
myGM=read.table(file="
myCV=read.table(file="
#Simulate 10 QTN on the first half of the chromosomes (1 to 5)
X=myGD[,-1]
index1to5=myGM[,2]<6
X1to5 = X[,index1to5]
taxa=myGD[,1]
set.seed(99164)
GD.candidate=cbind(taxa,X1to5)
source("~/Dropbox/GAPIT/Functions/GAPIT.Phenotype.Simulation.R")
mySim=GAPIT.Phenotype.Simulation(GD=GD.candidate,GM=myGM[index1to5,],h2=.5,NQTN=10, effectunit =.95,QTNDist="normal",CV=myCV,cveff=c(.51,.51))
setwd("~/Desktop/temp")

7 Prediction with PC and ENV
myGAPIT <- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
PCA.total=3,
CV=myCV,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
#SNP.test=FALSE,
memo="GLM")
#Squared correlation of the prediction with the phenotype and with the true genetic value
ry2=cor(myGAPIT$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)

8 Choosing the top ten SNPs
ntop=10
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])

9 Prediction with top ten SNPs
myGAPIT2<- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
#PCA.total=3,
CV=myQTN,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
SNP.test=FALSE,
memo="GLM+QTN")
ry2=cor(myGAPIT2$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT2$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT2$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT2$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)
Plot annotations: Improved, Improved

10 Prediction with top 200 SNPs
ntop=200
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])
myGAPIT2<- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
#PCA.total=3,
CV=myQTN,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
SNP.test=FALSE,
memo="GLM+QTN")
Plot annotations: Improved, No improvement
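One reading of the contrast between the two annotations is over fitting: adding many SNPs as covariates keeps improving the fit to the observed phenotype, while the fit to the true genetic value stops improving or gets worse. The sketch below is a toy illustration in base R, not the GAPIT analysis from the slides. The simulated genotypes, the hypothetical helper `fit_r2`, and the shortcut of taking the first k markers instead of GWAS-ranked ones are all assumptions made for the example.

```r
# Toy illustration of over fitting with base R only (no GAPIT).
# All names and settings are assumptions made for this sketch.
set.seed(99164)
n <- 200                                   # individuals
m <- 150                                   # candidate markers
X <- matrix(rbinom(n * m, 2, 0.5), n, m)   # genotypes coded 0/1/2
qtn <- 1:5                                 # the first five markers are the true QTNs
u <- as.vector(X[, qtn] %*% rep(1, 5))     # true genetic value
e <- rnorm(n, sd = sd(u))                  # residual, heritability of about 0.5
y <- u + e                                 # observed phenotype

# Hypothetical helper: fit y on the first k markers, then compare the
# fitted values with the phenotype (ry2) and the true genetic value (ru2).
fit_r2 <- function(k) {
  fit <- lm(y ~ X[, 1:k])
  pred <- fitted(fit)
  c(ry2 = cor(pred, y)^2, ru2 = cor(pred, u)^2)
}

small <- fit_r2(5)     # only the true QTNs
large <- fit_r2(150)   # true QTNs plus 145 noise markers
print(rbind(small, large))
```

Because the fitted values are compared with the same phenotypes used for fitting, `ry2` can only go up as markers are added. The comparison with `u` is possible only because the data are simulated.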

11 Validation
Diagram labels: All individuals, Training, Testing, Phenotype, Genotype, Phenotype, Accuracy, SNP effect, Prediction

12 Cross validation
Diagram labels: All individuals, Testing, Training, Phenotype, Genotype, Phenotype, Accuracy, Prediction, SNP effect

13 Five fold Cross validation
Diagram labels: Inference, Reference
By Yao Zhou
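The fold logic can be sketched in a few lines of base R. This is a minimal sketch under stated assumptions: the data are simulated, and `lm()` stands in for whatever genomic prediction model is actually used, so none of it comes from the slide itself.

```r
# Minimal K-fold cross validation in base R.
# lm() is a stand-in for the genomic prediction model (assumption),
# and the data are simulated for illustration.
set.seed(1)
n <- 100
K <- 5
x <- rnorm(n)
y <- 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Randomly assign each individual to one of K folds of equal size.
fold <- sample(rep(1:K, length.out = n))

pred <- rep(NA_real_, n)
for (k in 1:K) {
  testing <- fold == k                        # inference (testing)
  fit <- lm(y ~ x, data = dat[!testing, ])    # reference (training)
  pred[testing] <- predict(fit, newdata = dat[testing, ])
}

# Every individual has now been predicted exactly once,
# using a model that never saw its phenotype.
accuracy <- cor(pred, dat$y)
print(accuracy)
```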

14 Jack Knife
Repeat until every individual has been predicted
Diagram labels: Inference, Inference

15 Jack Knife: extreme case of K=N
N: number of individuals
K: number of folds
Leave-one-out cross-validation
Inference (testing) contains only one individual
Not possible to calculate the correlation between observed and predicted within inference
Evaluation of accuracy must be held until every individual has received a prediction
Resampling is not available
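A leave-one-out loop makes the point about holding the evaluation concrete. This is a minimal sketch with simulated data, again using `lm()` as an assumed stand-in for the prediction model.

```r
# Leave-one-out (jack knife) cross validation in base R.
# Simulated data; lm() is an assumed stand-in for the prediction model.
set.seed(2)
n <- 50
x <- rnorm(n)
y <- x + rnorm(n)
dat <- data.frame(x, y)

pred <- numeric(n)
for (i in seq_len(n)) {
  fit <- lm(y ~ x, data = dat[-i, ])               # reference: everyone except i
  pred[i] <- predict(fit, newdata = dat[i, ])      # inference: individual i only
}

# A single observed/predicted pair has no variance, so no correlation can be
# computed inside the loop. Accuracy is evaluated once, after all n predictions.
hold_accuracy <- cor(pred, dat$y)
print(hold_accuracy)
```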

16 Re-sampling
Sample part of the population, e.g., 20%, as inference (testing), and leave the rest as reference (training)
Instantly evaluate accuracy of inference
Repeat multiple times
Average accuracy across replicates
Some individuals may never be in the testing set
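The re-sampling scheme can be sketched as follows. The number of replicates, the simulated data, and the use of `lm()` as the prediction model are assumptions chosen for illustration.

```r
# Re-sampling: repeated random 20% testing sets, accuracy evaluated instantly.
# Simulated data; lm() is an assumed stand-in for the prediction model.
set.seed(3)
n <- 100
x <- rnorm(n)
y <- x + rnorm(n)
dat <- data.frame(x, y)

n_rep <- 10                      # number of replicates (assumption)
n_test <- round(0.2 * n)         # 20% of the population as inference
acc <- numeric(n_rep)
times_tested <- integer(n)       # how often each individual was in testing

for (r in seq_len(n_rep)) {
  testing <- sample(n, n_test)                     # inference (testing)
  fit <- lm(y ~ x, data = dat[-testing, ])         # reference (training)
  pred <- predict(fit, newdata = dat[testing, ])
  acc[r] <- cor(pred, dat$y[testing])              # instant accuracy
  times_tested[testing] <- times_tested[testing] + 1L
}

mean_accuracy <- mean(acc)                 # average across replicates
never_tested <- sum(times_tested == 0)     # individuals never in testing
print(c(mean_accuracy = mean_accuracy, never_tested = never_tested))
```

Tracking `times_tested` shows how uneven the coverage is: unlike K-fold, nothing forces each individual into a testing set.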

17 Negative prediction accuracy
Massman JM, Gordillo A, Lorenzana RE, Bernardo R. Genomewide predictions from maize single-cross data. Theor Appl Genet Jan;126(1):13-22.

18 Two ways of calculating correlation
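Going by the Jack Knife and Re-sampling slides, accuracy can either be evaluated instantly within each testing set and then averaged, or held until every individual has a prediction and then computed once. The sketch below computes both, assuming these are the two ways this slide title refers to. The helper `two_accuracies` and the simulated values are hypothetical.

```r
# Hypothetical helper: both accuracies from cross validation output.
two_accuracies <- function(observed, predicted, fold) {
  # Instant: correlation within each testing fold, then averaged over folds.
  per_fold <- sapply(split(seq_along(observed), fold), function(idx) {
    cor(observed[idx], predicted[idx])
  })
  # Hold: one correlation across all individuals, after every fold is done.
  c(instant = mean(per_fold), hold = cor(observed, predicted))
}

set.seed(4)
observed <- rnorm(60)
predicted <- 0.5 * observed + rnorm(60)   # stand-in for cross-validated predictions
fold <- rep(1:5, each = 12)               # five folds of 12 individuals
acc <- two_accuracies(observed, predicted, fold)
print(acc)
```

The two numbers generally differ, because the hold version also picks up differences in mean prediction between folds.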

19 Artifactual negative hold accuracy

20 Hold bias relates to number of folds
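The slide content did not survive here, so the following is only an illustration of one known mechanism, assuming it is the kind of artifact these slides refer to. With a null model that predicts every testing individual by the mean of the training phenotypes, the hold accuracy is negative even though the model carries no information, and at K=N it is exactly -1. The helper `null_hold_accuracy` is hypothetical.

```r
# Hold accuracy of a null model (training mean as prediction) for several K.
# Hypothetical helper, simulated phenotypes with no predictors at all.
null_hold_accuracy <- function(y, K) {
  n <- length(y)
  fold <- sample(rep(1:K, length.out = n))
  pred <- numeric(n)
  for (k in 1:K) {
    testing <- fold == k
    # The training mean drops whenever the testing phenotypes are high,
    # which induces a negative covariance between predicted and observed.
    pred[testing] <- mean(y[!testing])
  }
  cor(pred, y)   # hold accuracy across all individuals
}

set.seed(5)
y <- rnorm(100)
Ks <- c(2, 5, 10, 100)            # 100 folds is leave-one-out here
acc <- sapply(Ks, function(K) null_hold_accuracy(y, K))
names(acc) <- paste0("K=", Ks)
print(acc)
```

In the leave-one-out case each prediction is a decreasing linear function of the individual's own phenotype, which is why the correlation reaches -1.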

21 Problem of instant accuracy

22 Small sample causes bias

23 Correction of instant accuracy

24 Highlight
GS by GWAS
Over fitting
Cross validation
K-fold validation
Jack knife
Re-sampling
Two ways of calculating accuracy
Bias and correction

