
1 I. Statistical Methods for Genome-Enabled Prediction of Complex Traits
OUTLINE
 THE CHALLENGES OF PREDICTING COMPLEX TRAITS
 ORDINARY LEAST SQUARES (OLS) ESTIMATES
 CURSE OF DIMENSIONALITY (biased regression: ridge regression)

2 Genomic Prediction
 Many important traits and diseases are moderately to highly heritable, suggesting that these traits could be predicted from knowledge of an individual's genotype.
 Modern genotyping and sequencing methods provide a very detailed description of genomes, even at the sequence level.
 In principle, we should therefore be able to use genotypic information for accurate prediction of complex traits and diseases.

3 The Task
Genotypes + Phenotypes → Prediction model → Predictions → Decisions
Phenotype = Genetic Value + Model residual

4 Confronting Complexity
 How many markers?
 Which markers?
 What type of interactions?
 Dominance
 Epistasis (type, order)

5 THE BASIC GENETIC MODEL
The standard linear genetic model considers that the phenotypic response of the i-th individual (y_i) is explained by a factor common to all individuals (μ), a genetic factor specific to that individual (g_i), and a residual (ε_i) comprising all other non-genetic factors, among others the environmental effects (temporal or spatial) and the effects described by the experimental design. The linear genetic model for n genotypes (i = 1, 2, …, n) is then represented as

y_i = μ + g_i + ε_i

In this standard linear genetic model, the genetic factor can be described by using a summation of molecular marker effects or by using pedigree information. Meuwissen et al. (2001) were the first to propose an explicit regression of phenotypes on the marker genotypes using the simple parametric regression model

g_i = Σ_j x_ij β_j (j = 1, 2, …, p),

where β_j is the regression of y_i on the j-th marker covariate and x_ij is the number of copies of the bi-allelic marker (coded 0, 1, 2 or −1, 0, 1). In matrix notation, this can be represented as

y = μ1 + Xβ + ε

6 Ordinary Least Squares (OLS) Estimates
Consider the following model:

y_i = μ + Σ_j x_ij β_j + ε_i

where y_i is the phenotype of the i-th individual, μ is an effect common to all individuals (an "intercept"), x_i1, …, x_ip are covariates (e.g., marker genotypes), β_j is the effect of the j-th covariate, and ε_i is a model residual. In matrix notation the model is expressed as

y = μ1 + Xβ + ε

7 OLS Estimates
The ordinary least squares estimate of β is the solution to the following optimization problem:

β̂_OLS = argmin_{μ, β} Σ_i (y_i − μ − Σ_j x_ij β_j)²

The argument to be minimized is the residual sum of squares. The solution to the above optimization problem is given by

β̂_OLS = (X'X)^{-1} X'y
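
As a minimal sketch (not from the slides), the closed-form OLS solution above can be computed with NumPy via the normal equations; the data, sizes, and variable names here are simulated and purely illustrative:

```python
# Sketch: OLS estimates via the normal equations (X'X) b = X'y.
# All data are simulated; n, p, beta_true are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5                                  # n individuals, p covariates (p < n, so X'X is invertible)
X = np.column_stack([np.ones(n),               # first column = intercept
                     rng.integers(0, 3, size=(n, p))])  # marker-like codes 0/1/2
beta_true = rng.normal(size=p + 1)
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# beta_hat = (X'X)^{-1} X'y, computed with a linear solve (never an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

Using `np.linalg.solve` on the normal equations (rather than forming the inverse) is the standard numerically sensible way to evaluate this closed form when X'X is well conditioned.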

8 MULTIPLE REGRESSION
 The residuals have common variance and are statistically independent (zero covariance).
 The response vector y is a function of the random error plus constants; thus Var(y) = Var(ε).
 The predictors are linearly independent (no column of X is a linear combination of the others).

9 The Collinearity Problem
Singularities arise when some linear function of the independent variables is very close to zero, that is, when some linear combinations of the columns of X are zero or close to zero. In that case a unique solution of the normal equations does not exist. When X is only nearly singular, a solution exists but it is very unstable, and the variances of the regression coefficients become very large. Interdependent "independent" variables that are closely linked in the system being studied (e.g., molecular markers) cause near singularities in X.
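
The instability can be illustrated with a small simulation (an illustrative sketch, not from the slides): a near-duplicate column inflates the condition number of X'X, and the coefficients may swing wildly under a tiny perturbation of y even though the fitted values barely move:

```python
# Sketch of the collinearity problem: a near-duplicate column makes X'X
# nearly singular. All data and names here are simulated/illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-6, size=n)   # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

cond = np.linalg.cond(X.T @ X)             # huge condition number -> near singularity
beta_a = np.linalg.lstsq(X, y, rcond=None)[0]

# A tiny perturbation of y may produce very different coefficients,
# while the fitted values X @ beta stay essentially unchanged.
beta_b = np.linalg.lstsq(X, y + rng.normal(scale=1e-4, size=n), rcond=None)[0]
```

This is exactly the situation with dense marker panels: adjacent markers in high linkage disequilibrium act like the near-duplicate columns above.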

10 Penalized Regression
The least squares estimators of the regression coefficients are BLUE (best linear unbiased estimators), i.e., they have minimum variance among all linear unbiased estimators. Under collinearity this minimum variance may be unacceptably large. The remedy is to relax the condition of an UNBIASED estimator. A measure of the average closeness of an estimator to the parameter being estimated is the MEAN SQUARED ERROR of the estimator: MSE = variance + (bias)². A small increase in bias can buy a large reduction in variance.

11 The Challenges of Highly Dimensional Marker Data
Two different approaches can be used to confront the challenges posed by p >> n:
1. Subset selection: design an algorithm to select k out of p (k < p) predictors; the final model will include only these k predictors.
2. Shrinkage estimation: use all available predictors and confront the challenges posed by regressions with p > n by using shrinkage (penalized) estimation methods.

12 II. Statistical Methods for Genome-Enabled Prediction of Complex Traits
OUTLINE
SHRINKAGE (PENALIZED) LINEAR REGRESSION
 RIDGE REGRESSION (RR)
 GBLUP
BAYESIAN VERSION OF RR AND GBLUP

13 Shrinkage (Penalized) Estimates
An approach used to solve the problems emerging in large-p, small-n regressions is to use penalized estimates; these estimates are obtained as the solution to an optimization problem that balances two components: how well the model fits the data and how complex the model is.

14 General Form of the Optimization Problem

β̂ = argmin_β { l(β; data) + λ J(β) }

where l(β; data) is a loss function that measures the fitness (lack of fit) of the model to the data, J(β) measures model complexity (degrees of freedom), and λ ≥ 0 is a regularization parameter controlling the trade-off between fitness and model complexity.

15 Ridge Regression (RR)
The lack of fit of the model is measured by the residual sum of squares, Σ_i (y_i − μ − Σ_j x_ij β_j)². Model complexity is measured by Σ_{j∈S} β_j², where S defines the set of coefficients to be penalized. Then

β̂_RR = argmin_{μ, β} { Σ_i (y_i − μ − Σ_j x_ij β_j)² + λ Σ_{j∈S} β_j² }

In matrix notation,

β̂_RR = argmin_β { (y − Xβ)'(y − Xβ) + λ β'Dβ }

where D is a diagonal matrix with ones for the penalized coefficients and zeros elsewhere.

16 Ridge Regression (RR)
The first-order conditions of the above optimization problem are satisfied by the following system of linear equations:

[X'X + λD] β̂ = X'y

In penalized estimation, the regularization parameter (λ) controls the trade-off between model complexity and model goodness of fit. This affects the parameter estimates (their values and their statistical properties), the model's goodness of fit to the training dataset, and the ability of the model to predict unobserved phenotypes.

17 Ridge Regression (RR)
Singular square matrices can be made non-singular by adding a constant to the diagonal of the matrix: if X'X is singular, then X'X + kI is non-singular, where k is a small positive constant. This small quantity makes the off-diagonal entries appear relatively less important, and thus it suppresses the near-singularity. Ridge regression works with the CENTERED and SCALED independent variables (Z).


19 RIDGE REGRESSION (RR) -- SUMMARY
The OLS estimates of the regression coefficients are the solution to the following system of equations:

[X'X] β̂ = X'y

The RR estimates have a very similar form; simply add a constant to the diagonal of the matrix of coefficients:

[X'X + λD] β̂ = X'y

where λ is a constant and D is a diagonal matrix with a zero in its first diagonal entry (d_1 = 0, to avoid shrinking the estimate of the intercept) and ones in the remaining diagonal entries. When D or λ equals zero, the solution to the above problem is OLS. Adding the constant λ to the diagonal entries of the coefficient matrix makes it non-singular and shrinks the estimates of the regression coefficients other than the intercept towards zero. This induces bias but reduces the variance of the estimates; in large-p, small-n problems this may reduce the MSE of the estimates and may yield more accurate predictions.
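
A minimal sketch of the RR system above, with D chosen so that the intercept is not shrunk (simulated data; all names and sizes are illustrative):

```python
# Sketch of the RR system [X'X + lam*D] beta = X'y, with d_1 = 0 so the
# intercept is left unpenalized. Data are simulated; lam is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 10
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p covariates
y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

lam = 5.0
D = np.eye(p + 1)
D[0, 0] = 0.0                        # do not shrink the intercept (d_1 = 0)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_rr = np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# With lam = 0 the RR system reduces to the OLS normal equations
beta_rr0 = np.linalg.solve(X.T @ X + 0.0 * D, X.T @ y)
```

The penalized slope coefficients are pulled toward zero relative to OLS, which is exactly the bias/variance trade-off the summary describes.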

20 RIDGE REGRESSION (RR-BLUP)
One of the first penalized regression methods used in genomic prediction is Ridge Regression (RR), which is equivalent to the mixed model that yields the Best Linear Unbiased Predictor (BLUP). This marker-based model (also called RR-BLUP) is expressed as

y = μ1 + ZXβ + ε

where Z is the design matrix that relates individuals to phenotypic observations (with as many rows for the i-th individual as the number of observations collected for it), and X is the genotype matrix for the bi-allelic markers.

21 RIDGE REGRESSION (RR-BLUP)
The solution to the optimization problem for RR can be written as

β̂ = [X'X + λI]^{-1} X'y

Here the ridge parameter λ = σ²_ε / σ²_β is the ratio between the residual and the marker variance, and it induces shrinkage of the marker effects toward zero. Since β̂ is the vector of marker effects, the genomic estimated breeding value (GEBV) is

ĝ = X β̂
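
A sketch of RR-BLUP in the p >> n setting (simulated marker data; the variance components σ²_ε and σ²_β are assumed known here rather than estimated, which is a simplification):

```python
# Sketch of RR-BLUP: beta_hat = (X'X + lam*I)^{-1} X'y with lam = s2_e / s2_b,
# then GEBV = X beta_hat. Simulated data; s2_e and s2_b are treated as known.
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 500                              # many more markers than individuals
X = rng.integers(0, 3, size=(n, p)).astype(float)
X -= X.mean(axis=0)                         # center the marker codes
beta_true = rng.normal(scale=0.05, size=p)
y = X @ beta_true + rng.normal(size=n)

s2_e, s2_b = 1.0, 0.05**2
lam = s2_e / s2_b                           # ridge parameter = variance ratio

beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
gebv = X @ beta_hat                         # genomic estimated breeding values

# A larger ratio (e.g., smaller marker variance) shrinks the effects harder
beta_more = np.linalg.solve(X.T @ X + 10 * lam * np.eye(p), X.T @ y)
```

Note that even though X'X is singular here (p > n), adding λI makes the system solvable, which is the whole point of the ridge penalty in this setting.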

22 RIDGE REGRESSION (RR-BLUP) -- AN EQUIVALENT MODEL (GBLUP)
Let g = Xβ; then the model is y = μ1 + g + ε with g ~ N(0, G σ²_g), where G is proportional to XX' and σ²_ε is the unknown residual variance parameter estimated from the data. When g ~ N(0, G σ²_g), the model is named GBLUP; the G matrix is the genomic-derived relationship matrix, and (with y centered)

ĝ = G [G + (σ²_ε / σ²_g) I]^{-1} y

These two models are equivalent. The GBLUP formulation does not provide the marker effects, but it is computationally simpler than the previous one because it requires solving an n × n system instead of a p × p system.
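
The equivalence can be checked numerically. The sketch below (simulated data, y treated as centered so the intercept drops out, G taken as XX' without further scaling) computes the predictions both ways:

```python
# Sketch checking the RR-BLUP / GBLUP equivalence: with G = XX' and a common
# ridge parameter lam, X(X'X + lam*I)^{-1}X'y equals G(G + lam*I)^{-1}y by the
# push-through identity. Simulated data; the unscaled G = XX' is an assumption.
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 300
X = rng.integers(0, 3, size=(n, p)).astype(float)
X -= X.mean(axis=0)                        # center marker codes
y = X @ rng.normal(scale=0.05, size=p) + rng.normal(size=n)
y -= y.mean()                              # center y (drops the intercept)

lam = 2.0

# Marker-effects route (p x p system): RR-BLUP
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
gebv_rr = X @ beta_hat

# Genomic-relationship route (n x n system): GBLUP
G = X @ X.T
gebv_gblup = G @ np.linalg.solve(G + lam * np.eye(n), y)
```

With p = 300 and n = 30, the GBLUP route solves a 30 × 30 system instead of a 300 × 300 one, which is why it is preferred when p >> n.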

23 BAYESIAN VERSION OF RR
From a Bayesian perspective, β̂_RR can be viewed as the conditional posterior mode in a model with a Gaussian likelihood and IID (independent and identically distributed) Gaussian marker effects, that is,

y | μ, β, σ²_ε ~ N(μ1 + Xβ, I σ²_ε), with β_j ~ N(0, σ²_β) for j = 1, …, p

24 BAYESIAN VERSION OF RR
Here σ²_β is the prior variance of the marker effects. The posterior mean and mode of β in the above model are equal to the RR estimate with λ = σ²_ε / σ²_β; thus β̂ is the BLUP of the marker effects that is used to obtain the predicted genetic values of the individuals (their GEBV).

25 BAYESIAN VERSION OF RR
The posterior distribution of β in the above model is multivariate normal, with a mean (co-variance matrix) equal to the solution (inverse of the coefficient matrix) of the following system:

[X'X + (σ²_ε / σ²_β) I] β̂ = X'y

This is just the RR system of equations, and β̂ is also the Best Linear Unbiased Predictor (BLUP) of β given y. Recall that the ratio σ²_ε / σ²_β is equivalent to λ in RR. In a fully Bayesian model we assign priors to each of these variance parameters; this allows inferring these unknowns from the same training data that is used to estimate the marker effects.

26 BAYESIAN VERSION OF RR
Therefore, the predicted genetic values (GEBV) using the BLUP of the marker effects are

ĝ = X β̂ = X [X'X + λI]^{-1} X'y

How is the GBLUP obtained? Change the variable to g = Xβ and get the equivalent model y = μ1 + g + ε with g ~ N(0, G σ²_g).

27 BAYESIAN VERSION OF GBLUP
Alternatively, using properties of the multivariate normal distribution (with y centered),

ĝ = E[g | y] = G σ²_g [G σ²_g + I σ²_ε]^{-1} y = G [G + (σ²_ε / σ²_g) I]^{-1} y

With p >> n, the last expression is computationally more convenient. However, it does not yield estimates of the marker effects.

28 SUMMARY: BAYESIAN VERSION OF RR AND GBLUP
The three formulations (RR, its Bayesian version, and GBLUP) are equivalent. If genotypes are centered, then the G matrix can be calculated as

G = XX' / c

where c is a scaling constant; a common choice (VanRaden) is c = 2 Σ_j p_j (1 − p_j), with p_j the allele frequency of the j-th marker.
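
A sketch of building G from centered genotypes (simulated 0/1/2 codes; the VanRaden scaling used below is one common choice and an assumption here, since the slides do not show which scaling they use):

```python
# Sketch: genomic relationship matrix from centered genotypes, using the
# VanRaden-style scaling G = MM' / (2 * sum_j p_j * (1 - p_j)).
# Simulated genotypes; the particular scaling is an assumption.
import numpy as np

rng = np.random.default_rng(5)
n, p = 25, 200
W = rng.integers(0, 3, size=(n, p)).astype(float)   # genotypes coded 0/1/2

pj = W.mean(axis=0) / 2.0            # allele frequency of the j-th marker
M = W - 2.0 * pj                     # center each marker column at 2*p_j
G = (M @ M.T) / (2.0 * np.sum(pj * (1.0 - pj)))
```

By construction G is symmetric and positive semi-definite, as a covariance (relationship) matrix must be.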

29 HOW TO MAP FROM ĝ TO β̂
The genetic values are g = Xβ; then

β̂ = X'(XX')^{-1} ĝ

is an estimate of β. Let p_j be the allele frequency of the j-th marker. Then the contribution of each marker to the genetic variance is 2 p_j (1 − p_j) β̂_j², and the breeding value at each marker is x_ij β̂_j.
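
The back-mapping can be sketched numerically (simulated data; G is taken as the unscaled XX', which is invertible here because n < p and X has full row rank):

```python
# Sketch: recover marker effects from GBLUP genetic values via
# beta_hat = X'(XX')^{-1} g_hat, which satisfies X beta_hat = g_hat.
# Simulated data; the unscaled G = XX' is an assumption.
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 150
X = rng.normal(size=(n, p))          # stand-in for centered marker codes
y = X @ rng.normal(scale=0.1, size=p) + rng.normal(size=n)
lam = 3.0

G = X @ X.T                          # n x n, invertible when n < p (full row rank)
g_hat = G @ np.linalg.solve(G + lam * np.eye(n), y)   # GBLUP genetic values

# Map the n genetic values back to the p marker effects
beta_hat = X.T @ np.linalg.solve(G, g_hat)
```

Because ĝ lies in the column space of X, this mapping reproduces ĝ exactly (X β̂ = ĝ) and agrees with the marker effects one would get from the equivalent RR-BLUP fit.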

