Presentation on theme: "Aaron Lorenz Department of Agronomy and Horticulture"— Presentation transcript:
1 Genomic selection. Aaron Lorenz, Department of Agronomy and Horticulture
2 Role of markers in crop improvement: varies by objective, germplasm, and trait genetic architecture (Bernardo, 2008).
3 Genomic selection
[Diagram: training population (calibration set) with DNA marker data and phenotypic data; model training; predict and select among selection candidates. No QTL mapping. No testing for significant markers.]
I'm not going to give much background on genomic prediction because I think it has become well enough known as a method. I'll only introduce some terminology. The basic idea is that you genotype a large population of individuals and phenotype them well (this is called a training population, or sometimes a calibration set), combine all marker and phenotypic data into a single statistical model (termed model training), and use the model to predict the genetic value of individuals that have been genotyped but not phenotyped (the selection candidates). You get your predictions and you select on them just as you would on phenotypes. Note that there is no QTL mapping and no declaration of significant markers. We're using all markers.
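The workflow described on this slide can be sketched in a few lines. This is a minimal illustration on simulated data, not the analysis behind the talk: the population sizes, the effect variance, the 0/1/2 marker coding, and the use of ridge regression as the prediction model are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and effect variance (assumptions, not from the talk).
n_train, n_cand, p = 400, 100, 500
effect_sd = 0.1

# Simulate marker scores (0/1/2) and phenotypes with heritability near 0.5.
X = rng.integers(0, 3, size=(n_train + n_cand, p)).astype(float)
genetic_value = X @ rng.normal(0.0, effect_sd, size=p)
noise_sd = genetic_value.std()
phenotype = genetic_value + rng.normal(0.0, noise_sd, size=n_train + n_cand)

# Training population: genotyped and phenotyped.
X_train, y_train = X[:n_train], phenotype[:n_train]
# Selection candidates: genotyped only.
X_cand = X[n_train:]

# Model training: all markers enter one ridge regression model.
marker_means = X_train.mean(axis=0)
Z_train = X_train - marker_means
mu = y_train.mean()
lam = noise_sd**2 / effect_sd**2  # known here only because the data are simulated
beta_hat = np.linalg.solve(Z_train.T @ Z_train + lam * np.eye(p),
                           Z_train.T @ (y_train - mu))

# Predict and select: rank candidates by predicted value, keep the top 10.
predicted = mu + (X_cand - marker_means) @ beta_hat
selected = np.argsort(predicted)[::-1][:10]

# Possible only in simulation: correlate predictions with true genetic values.
accuracy = float(np.corrcoef(predicted, genetic_value[n_train:])[0, 1])
```

No marker is tested for significance at any point; every marker receives an estimated effect, and the candidates are ranked on the sum of those effects.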
4 A genome-wide approach typically provides better predictions
[Figures: comparisons of MAS and GS, with axes labeled "Genomic rA" and "MAS rA". Sources: Lorenzana and Bernardo (2009); Lorenz (2013).]
One very nice thing about taking a genome-wide approach is that it typically works better than a QTL mapping/MAS approach. This study from Rex Bernardo's lab looked at 36 population-trait combinations, and in nearly every case GS does substantially better than MAS. The figure on the right is from a simulation study of my own showing the advantage in prediction accuracy of GS compared to MAS.
5 Whittaker et al. (2000)
- When doing MAS, one cannot include all the markers, so a subset of markers must be selected to fit.
- No entirely satisfactory way of doing this exists.
- Objective is to evaluate ridge regression.
- Superior to subset selection when the objective is to make predictions.
6 Whittaker et al. (2000)
- Find subset of markers Q. Interested in [equation missing].
- Cannot include all markers in Q:
  - Increases variance of β.
  - If the number of markers is really large, there are not enough d.f.
7 Whittaker et al. (2000)
- Ridge regression: include all variables, but replace the normal least-squares estimators with β̂ = (XᵀX + λI)⁻¹Xᵀy.
- Normal estimates are shrunk toward 0.
- Degree of shrinkage is determined by λ.
- Choose λ to minimize model error.
- Addition of the λI term reduces collinearity and prevents the matrix XᵀX from becoming singular.
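A small numerical sketch of the two points on this slide: with more markers than individuals XᵀX is singular, and adding λI both makes the system solvable and shrinks the estimates toward zero. The data are simulated, and the function name `ridge_estimates` and the λ values are illustrative assumptions.

```python
import numpy as np

def ridge_estimates(X, y, lam):
    """Ridge estimator (X'X + lam*I)^-1 X'y for centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 30, 100                      # more markers than individuals
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)
y -= y.mean()

# X'X is p x p but has rank at most n - 1 after centering, so it is singular
# and ordinary least squares has no unique solution.
rank_xtx = int(np.linalg.matrix_rank(X.T @ X))

# Larger lambda shrinks the whole vector of estimates further toward zero.
norms = {lam: float(np.linalg.norm(ridge_estimates(X, y, lam)))
         for lam in (1.0, 10.0, 100.0)}
```

One common way to "choose λ to minimize model error" is cross-validation over a grid of λ values; that step is left out here to keep the sketch short.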
9 MHG 2001
Objective: “Compare statistical methods for their accuracy in predicting total breeding value of individuals in a situation where a limited number of recorded individuals are genotyped for many markers.”
- Computer simulation: [number missing] individuals
- Need to estimate 50,000 haplotype effects
The whole story starts with a simulation study in 2001. The authors set out to see how accurately BVs could be predicted assuming very dense marker data were available, much denser than what was available at the time, but they were looking forward. The authors noted the arbitrariness of setting a marker effect to its full value or to zero depending on whether it surpassed some predetermined, and arbitrary, threshold.
11 Genomic selection models
- Shrinkage models: RR-BLUP, G-BLUP
- Dimension reduction methods: partial least squares, principal component regression
- Variable selection models: BayesB, BayesCπ, BayesDπ
- Kernel and machine learning methods: support vector machine regression
Training population (LARGE p, smaller n; marker scores mostly not preserved in the transcript):
Line    Yield  Mrk 1  Mrk 2  …  Mrk p
Line 1  76     1
Line 2  56
Line 3  45
Line 4  67
…       …
Line n  22
…and in the day of high-density markers, this means we probably have many more markers than observations, resulting in the well-known large p, small n problem. This means ordinary least squares cannot be used for estimation, but a variety of other, more sophisticated models can be used. The most popular is RR-BLUP, where markers are treated as random effects sampled from a common distribution. That's all I'll say about that.
12 Baseline model
-- More predictors than observations.
-- Solution: fit predictors as random effects.
-- Constrain possible effects.
-- What distribution is β being sampled from?
16 Marker effect estimates
[Figure: marker effect estimates from RR-BLUP and BayesCπ; panels "Large-effect QTL simulated" and "Many small-effect QTL simulated".]
I didn't think that example was illustrative enough, so I simulated some data. Here, we have a large-effect QTL present. You can see RR-BLUP shrinks this thing way down, whereas BayesCπ, the variable selection method, allows it to have an effect probably closer to reality.
18 G-BLUP
- Similar to traditional BLUP with pedigrees.
- Calculate the genomic relationship matrix.
- Use the genomic relationships in a mixed linear model to predict breeding values of relatives.
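The slide does not say which formula is used for the genomic relationship matrix. As one widely used possibility, the sketch below assumes markers coded 0/1/2, centers each marker by twice its allele frequency, and scales by 2Σp(1−p) (the form usually attributed to VanRaden). The function name and the simulated genotypes are illustrative.

```python
import numpy as np

def genomic_relationship_matrix(M):
    """Genomic relationship matrix from an n x p matrix of 0/1/2 marker scores.

    Assumed form: G = ZZ' / (2 * sum(p_j * (1 - p_j))), where Z is M with each
    column centered by twice its observed allele frequency p_j.
    """
    freq = M.mean(axis=0) / 2.0
    Z = M - 2.0 * freq
    scale = 2.0 * np.sum(freq * (1.0 - freq))
    return Z @ Z.T / scale

rng = np.random.default_rng(2)
M = rng.binomial(2, 0.3, size=(20, 500)).astype(float)  # 20 unrelated individuals
G = genomic_relationship_matrix(M)
```

G then takes the place of the pedigree-based relationship matrix in the mixed model. With unrelated simulated individuals the diagonal is close to 1 and the off-diagonals are close to 0; real breeding populations show much more structure.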
19 [Diagram with the labels "Training Pop." and "Selection candidates", each appearing twice.] Relationships between the TP and the selection candidates are leveraged for prediction.
20 Equivalency between RR-BLUP and G-BLUP
From MVN distribution properties: [equations not preserved in the transcript]
Only valid with the normal prior!
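The equations from this slide are missing, so the following is a numerical check of the standard result rather than a reproduction of the slide. It assumes marker effects β ~ N(0, Iσ²β), so genetic values g = Zβ are multivariate normal with covariance ZZᵀσ²β, and it treats the mean as known by centering y. Predicting candidates by summing estimated marker effects (RR-BLUP) then gives the same values as predicting them from relationships K = ZZᵀ (G-BLUP) with the same λ. All sizes and the λ value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_cand, p = 40, 10, 300

M = rng.binomial(2, 0.5, size=(n_train + n_cand, p)).astype(float)
Z = M - M.mean(axis=0)                 # centered marker matrix, all individuals
Z_train, Z_cand = Z[:n_train], Z[n_train:]

y = Z_train @ rng.normal(0.0, 0.1, size=p) + rng.normal(size=n_train)
y_centered = y - y.mean()
lam = 50.0                             # ratio of residual to marker-effect variance

# RR-BLUP: estimate p marker effects, then sum them for each candidate.
beta_hat = np.linalg.solve(Z_train.T @ Z_train + lam * np.eye(p),
                           Z_train.T @ y_centered)
pred_rrblup = Z_cand @ beta_hat

# G-BLUP: work with n x n relationships instead of p marker effects.
K = Z @ Z.T
K_tt = K[:n_train, :n_train]           # training x training
K_ct = K[n_train:, :n_train]           # candidates x training
pred_gblup = K_ct @ np.linalg.solve(K_tt + lam * np.eye(n_train), y_centered)

max_diff = float(np.max(np.abs(pred_rrblup - pred_gblup)))
```

If K is rescaled by a constant, as relationship matrices usually are, λ has to be rescaled by the same constant for the two sets of predictions to match. The check relies on the common normal prior for all marker effects; it says nothing about BayesB-type models.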
21 Predicting prediction accuracy (Daetwyler et al., 2008; Lian et al., 2014)
- N = training population size
- h² = trait heritability
- Me = effective number of loci
- r² = LD between marker and QTL (see Lian ref)
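The equation itself did not survive in the transcript. A commonly cited form of the Daetwyler et al. (2008) expectation is r = sqrt(N h² / (N h² + Me)), and the sketch below assumes that form. The LD adjustment with r² from Lian et al. (2014) is deliberately left out because its exact form cannot be recovered from the slide. The function name and the example values are illustrative.

```python
import math

def expected_accuracy(n_train: int, h2: float, m_e: float) -> float:
    """Expected prediction accuracy, assuming r = sqrt(N*h2 / (N*h2 + Me))."""
    if n_train <= 0 or not 0.0 < h2 <= 1.0 or m_e <= 0.0:
        raise ValueError("need n_train > 0, 0 < h2 <= 1, and m_e > 0")
    signal = n_train * h2
    return math.sqrt(signal / (signal + m_e))

# How accuracy grows with training population size for h2 = 0.5 and Me = 500.
accuracies = {n: expected_accuracy(n, 0.5, 500.0) for n in (100, 500, 1000, 5000)}
```

Under this form, accuracy rises with N and h² and falls with Me, with diminishing returns as N h² becomes large relative to Me.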
22 Factors affecting prediction accuracy
- Training population size
- Trait heritability
  - Influence of G x E, precision of measurements
- Marker density
- Effective population size of breeding population
  - i.e., genetic diversity of breeding population
- Genetic relationship between training population and selection candidates
- Statistical model
23 Effect of relationships: Predicting across populations
[Figure: PCA plot (PC 1 vs. PC 2) from 1180 polymorphic markers, showing Subpop 1 and Subpop 2, training sets and validation sets; breeding programs labeled BuschAg, University of MN, and NDSU 6-row.]
Here is a typical example. Here we have a PCA plot from marker data of barley lines from three different breeding programs.
24 Effect of relationships: Presence of relatives in TP
[Figure: prediction accuracy versus mean relationship of top ten relatives. Source: Clark et al. (2012).]
25 Models typically similar in accuracy
[Figure: accuracy of RR-BLUP, BayesCπ, and Bayesian LASSO.]
Models also equivalent in:
- Bernardo and Yu (2007) [Maize]
- Lorenzana and Bernardo (2009) [Several plant species]
- VanRaden et al. (2009) [Holstein]
- Hayes (2009) [Holstein]
Despite the different assumptions about genetic architecture made by the different models, and the fact that QTL effects are not of equal size and traits do have different genetic architectures, including epistasis, the simplest model, RR-BLUP, which assumes all QTL effects have the same variance, often does just as well as the more "realistic" models, especially in empirical studies. The reason for this is probably that LD within domesticated species is extensive, so several markers absorb the effect of a large-effect QTL, making it seem that many markers control a trait, as RR-BLUP assumes.
26 Why? Extensive LD in plant and animal breeding programs
- Perfect situation for G-BLUP.
- Long stretches of genome that are identical by descent mean that relationships calculated with markers are good indicators of relationships at causal polymorphisms.
- Extensive LD also means it's hard for variable selection models to zero in on markers in tight LD with causal polymorphisms.
- Expect variable selection models will be superior when:
  - Individuals are unrelated
  - Very large TP (millions?)
  - Very high marker density, so that markers are in LD with causal polymorphisms
27 Resources and packages
- rrBLUP package: cran.r-project.org/web/packages/rrBLUP/rrBLUP.pdf
  - Endelman, J.B. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4.
  - Endelman, J.B., and J-L. Jannink. Shrinkage estimation of the realized relationship matrix. G3 2:1045.
- BLR (Bayesian Linear Regression) package
  - Perez et al. Genomic-enabled prediction based on molecular markers and pedigree using the Bayesian linear regression package in R. Plant Genome 3.
28 References
- Bernardo, R. 2008. Molecular markers and selection for complex traits in plants: Learning from the last 20 years. Crop Sci. 48.
- Clark, S.A., J.M. Hickey, H.D. Daetwyler and J.H.J. van der Werf. 2012. The importance of information on relatives for the prediction of genomic breeding values and the implications for the makeup of reference data sets in livestock breeding schemes. Genet. Sel. Evol. 44.
- Daetwyler, H.D., B. Villanueva and J.A. Woolliams. 2008. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS One 3.
- de los Campos, G., J.M. Hickey, R. Pong-Wong, H.D. Daetwyler and M.P.L. Calus. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327.
- Lian, L., A. Jacobson, S. Zhong and R. Bernardo. 2014. Genomewide prediction accuracy within 969 maize biparental populations. Crop Sci.
- Lorenz, A.J. 2013. Resource allocation for maximizing prediction accuracy and genetic gain of genomic selection in plant breeding: A simulation experiment. G3: Genes Genomes Genetics 3.
- Lorenzana, R.E. and R. Bernardo. 2009. Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theor. Appl. Genet. 120.
- Meuwissen, T.H., B.J. Hayes and M.E. Goddard. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157.
- Whittaker, J.C., R. Thompson and M.C. Denham. 2000. Marker-assisted selection using ridge regression. Genet. Res. 75.