1
Statistical Genomics
Lecture 23: Cross validation
Zhiwu Zhang, Washington State University
2
Administration
Homework 5: due Wednesday, April 13, 3:10 PM
Final exam: May 3, 120 minutes (3:10-5:10 PM), 50
3
Course evaluation and response
Genomic selection methods with packages in R: GS by GWAS, rrBLUP, gBLUP, cBLUP, sBLUP, Bayesian LASSO
4
Outline
GS by GWAS
Over fitting
Cross validation
K-fold validation
Jack knife
Re-sampling
Two ways of calculating accuracy
Bias and correction
5
Setup GAPIT
#source("http://www.bioconductor.org/biocLite.R")
#biocLite("multtest")
#install.packages("gplots")
#install.packages("scatterplot3d")
#The downloaded link at:
library('MASS') # required for ginv
library(multtest)
library(gplots)
library(compiler) #required for cmpfun
library("scatterplot3d")
source("
source("
6
Import data and simulate phenotype
myGD=read.table(file="
myGM=read.table(file="
myCV=read.table(file="
#Simulate 10 QTN on the first half chromosomes
X=myGD[,-1]
index1to5=myGM[,2]<6
X1to5 = X[,index1to5]
taxa=myGD[,1]
set.seed(99164)
GD.candidate=cbind(taxa,X1to5)
source("~/Dropbox/GAPIT/Functions/GAPIT.Phenotype.Simulation.R")
mySim=GAPIT.Phenotype.Simulation(GD=GD.candidate,GM=myGM[index1to5,],h2=.5,NQTN=10,
  effectunit =.95,QTNDist="normal",CV=myCV,cveff=c(.51,.51))
setwd("~/Desktop/temp")
7
Prediction with PC and ENV
myGAPIT <- GAPIT(
  Y=mySim$Y,
  GD=myGD,
  GM=myGM,
  PCA.total=3,
  CV=myCV,
  group.from=1,
  group.to=1,
  group.by=10,
  QTN.position=mySim$QTN.position,
  #SNP.test=FALSE,
  memo="GLM")
#R square of prediction against the phenotype (ry2) and the true genetic effect (ru2)
ry2=cor(myGAPIT$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)
8
Choosing the top ten SNPs
ntop=10
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])
9
Prediction with top ten SNPs
myGAPIT2 <- GAPIT(
  Y=mySim$Y,
  GD=myGD,
  GM=myGM,
  #PCA.total=3,
  CV=myQTN,
  group.from=1,
  group.to=1,
  group.by=10,
  QTN.position=mySim$QTN.position,
  SNP.test=FALSE,
  memo="GLM+QTN")
ry2=cor(myGAPIT2$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT2$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT2$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT2$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)
Plot annotations: Improved, Improved
10
Prediction with top 200 SNPs
ntop=200
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])
myGAPIT2 <- GAPIT(
  Y=mySim$Y,
  GD=myGD,
  GM=myGM,
  #PCA.total=3,
  CV=myQTN,
  group.from=1,
  group.to=1,
  group.by=10,
  QTN.position=mySim$QTN.position,
  SNP.test=FALSE,
  memo="GLM+QTN")
Plot annotations: Improved, No improvement
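The contrast between the top-ten and top-200 runs can be reproduced in miniature without GAPIT. The sketch below is only an illustration on simulated data: the sample sizes, the number of QTNs, the helper name `fit_r2`, and the use of plain `lm` are all assumptions, not the lecture's setup. It selects the top markers by single-marker p-value, fits them jointly, and compares R square within the fitted individuals against R square in individuals that were held out.

```r
# Over fitting sketch on simulated data (hypothetical sizes, not the GAPIT demo data)
set.seed(1)
n <- 200; m <- 150                          # individuals, candidate markers
X <- matrix(rbinom(n * m, 2, 0.5), n, m)    # genotypes coded 0/1/2
u <- as.vector(X[, 1:5] %*% rep(1, 5))      # true genetic value from 5 QTNs
y <- u + rnorm(n, sd = sd(u))               # phenotype, heritability near 0.5

train <- 1:100                              # individuals used to fit
test  <- 101:200                            # individuals held out

# fit_r2 is a hypothetical helper: pick the k best markers, fit them jointly
fit_r2 <- function(k) {
  # single-marker p-values computed on the training individuals only
  p <- apply(X[train, ], 2, function(x)
    summary(lm(y[train] ~ x))$coefficients[2, 4])
  top <- order(p)[1:k]
  d <- data.frame(y = y, X[, top, drop = FALSE])
  fit <- lm(y ~ ., data = d[train, ])
  pred <- predict(fit, newdata = d)
  c(train = cor(pred[train], y[train])^2,
    test  = cor(pred[test],  y[test])^2)
}

r2_10 <- fit_r2(10)
r2_80 <- fit_r2(80)
print(rbind(top10 = r2_10, top80 = r2_80))
```

Because the top-80 model contains the top-10 markers, its R square in the fitted individuals cannot be lower; whether it helps in the held-out individuals is a separate question, which is what the validation slides address.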
11
Validation
[Diagram labels: All individuals (training, testing); Phenotype; Genotype; SNP effect; Prediction; Phenotype; Accuracy]
12
Cross validation
[Diagram labels: All individuals (testing, training); Phenotype; Genotype; SNP effect; Prediction; Phenotype; Accuracy]
13
Five-fold cross validation
[Diagram labels: Inference, Reference. Figure by Yao Zhou]
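A minimal base-R sketch of five-fold cross validation follows. It assumes simulated genotypes and an ordinary `lm` fit in place of a genomic selection model; the variable names are illustrative. Each individual is assigned to one fold, serves as inference (testing) exactly once, and is predicted from a model fitted on the remaining folds (reference).

```r
# Five-fold cross validation sketch (simulated data, lm as a stand-in model)
set.seed(2)
n <- 100; m <- 20; K <- 5
X <- matrix(rbinom(n * m, 2, 0.5), n, m)            # genotypes coded 0/1/2
y <- as.vector(X[, 1:3] %*% c(1, 1, 1)) + rnorm(n)  # 3 causal markers plus noise
d <- data.frame(y = y, X)

fold <- sample(rep(1:K, length.out = n))   # random folds of equal size
pred <- rep(NA_real_, n)

for (k in 1:K) {
  inference <- which(fold == k)                 # testing individuals
  fit <- lm(y ~ ., data = d[-inference, ])      # fit on the reference only
  pred[inference] <- predict(fit, newdata = d[inference, ])
}

accuracy <- cor(pred, y)   # evaluated after every individual is predicted
print(accuracy)
```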
14
Jack knife
Repeat until every individual has been predicted
[Diagram labels: Inference, Inference]
15
Jack Knife: extreme case of K=N
N: number of individuals
K: number of folds
Leave-one-out cross-validation
Inference (testing) contains only one individual
Not possible to calculate the correlation between observed and predicted values within inference
Evaluation of accuracy must be held until every individual has received a prediction
Re-sampling is not available
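The leave-one-out case can be sketched in a few lines of base R. The example assumes a single simulated marker and an `lm` fit purely for illustration. Each individual is predicted from a model fitted on the other N-1, and the correlation is computed only once, after the loop.

```r
# Jack knife (leave-one-out) sketch with one simulated marker
set.seed(3)
n <- 60
x <- rbinom(n, 2, 0.5)          # genotype coded 0/1/2
y <- 1.5 * x + rnorm(n)         # phenotype

pred <- numeric(n)
for (i in 1:n) {
  # reference (training): everyone except individual i
  fit <- lm(y ~ x, data = data.frame(x = x[-i], y = y[-i]))
  # inference (testing): individual i alone
  pred[i] <- predict(fit, newdata = data.frame(x = x[i]))
}

# A single value has no variance, so no correlation exists within one fold
within_fold_sd <- sd(y[1])      # NA

# Accuracy is held until all N predictions are available
hold_accuracy <- cor(pred, y)
print(hold_accuracy)
```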
16
Re-sampling
Sample part of the population, e.g., 20%, as inference (testing), and leave the rest as reference (training)
Instantly evaluate accuracy of inference
Repeat multiple times
Average accuracy across replicates
Some individuals may never be in the testing set
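The re-sampling scheme can be sketched as follows. The number of replicates (10), the single simulated marker, and the `lm` fit are assumptions chosen only to keep the example small. Each replicate draws a fresh 20% as inference, evaluates accuracy immediately, and the replicates are averaged; the sketch also counts how many individuals were never drawn for testing.

```r
# Re-sampling sketch: repeated random 20% inference sets
set.seed(4)
n <- 100; reps <- 10
x <- rbinom(n, 2, 0.5)
y <- 1.5 * x + rnorm(n)
d <- data.frame(x = x, y = y)

acc <- numeric(reps)
tested <- logical(n)            # tracks who has ever been in the testing set

for (r in 1:reps) {
  inference <- sample(n, size = 0.2 * n)          # 20% as testing
  fit <- lm(y ~ x, data = d[-inference, ])        # remaining 80% as training
  pred <- predict(fit, newdata = d[inference, ])
  acc[r] <- cor(pred, y[inference])               # instant accuracy
  tested[inference] <- TRUE
}

mean_accuracy <- mean(acc)      # average across replicates
never_tested <- sum(!tested)    # individuals never used for testing
print(c(mean_accuracy = mean_accuracy, never_tested = never_tested))
```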
17
Negative prediction accuracy
Massman JM, Gordillo A, Lorenzana RE, Bernardo R. Genomewide predictions from maize single-cross data. Theor Appl Genet Jan;126(1):13-22.
18
Two ways of calculating correlation
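Reading the earlier slides, the two ways appear to be "instant" accuracy (correlate observed and predicted within each fold, then average) and "hold" accuracy (wait until every individual is predicted, then correlate once across all). The sketch below computes both from the same five-fold run; the simulated marker and `lm` fit are assumptions for illustration.

```r
# Instant versus hold accuracy from one five-fold run (simulated data)
set.seed(5)
n <- 100; K <- 5
x <- rbinom(n, 2, 0.5)
y <- 1.5 * x + rnorm(n)
d <- data.frame(x = x, y = y)

fold <- sample(rep(1:K, length.out = n))
pred <- numeric(n)
instant <- numeric(K)

for (k in 1:K) {
  inference <- fold == k
  fit <- lm(y ~ x, data = d[!inference, ])
  pred[inference] <- predict(fit, newdata = d[inference, ])
  instant[k] <- cor(pred[inference], y[inference])   # within-fold correlation
}

instant_accuracy <- mean(instant)   # average of the K within-fold correlations
hold_accuracy <- cor(pred, y)       # one correlation over all N predictions
print(c(instant = instant_accuracy, hold = hold_accuracy))
```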
19
Artifactual negative hold accuracy
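One well-known way a pooled (hold) correlation becomes negative without any real signal is the shifting training mean, shown here under leave-one-out with an intercept-only model. This is an illustrative assumption about the mechanism, not necessarily the case plotted on the slide. Removing a high phenotype lowers the mean of the remaining individuals, so each prediction is a decreasing linear function of the observation it is compared against.

```r
# Artifactual negative hold accuracy: intercept-only model, leave-one-out
set.seed(6)
n <- 50
y <- rnorm(n)                                  # phenotype with no genetic signal

# Prediction for individual i is the mean of the other n-1 phenotypes:
# pred[i] = (sum(y) - y[i]) / (n - 1), which decreases as y[i] increases
pred <- sapply(1:n, function(i) mean(y[-i]))

hold_accuracy <- cor(pred, y)
print(hold_accuracy)
```

The pooled correlation here equals -1 up to floating-point error, even though the model carries no information in either direction.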
20
Hold bias relates to the number of folds
21
Problem of instant accuracy
22
Small sample causes bias
23
Correction of instant accuracy
24
Highlight
GS by GWAS
Over fitting
Cross validation
K-fold validation
Jack knife
Re-sampling
Two ways of calculating accuracy
Bias and correction