3. Linear Models for Regression (Part 2). The University of Tokyo, Graduate School of Interdisciplinary Information Studies, Nakagawa Laboratory (中川研究室), Ayako Hoshino (星野 綾子).


1 3. Linear Models for Regression (Part 2). The University of Tokyo, Graduate School of Interdisciplinary Information Studies, Nakagawa Laboratory (中川研究室), Ayako Hoshino (星野 綾子)

2 Contents
3.3 Bayesian Linear Regression
 3.3.1 Parameter distribution
 3.3.2 Predictive distribution
 3.3.3 Equivalent kernel
3.4 Bayesian Model Comparison
3.5 The Evidence Approximation
 3.5.1 Evaluation of the evidence function
 3.5.2 Maximizing the evidence function
 3.5.3 Effective number of parameters
3.6 Limitations of Fixed Basis Functions

3 3.3 Bayesian Linear Regression. How do we decide model complexity (i.e., the number of basis functions)? A regularization coefficient controls the effective model complexity, but we still need to choose the number and form of the basis functions. Selecting complexity with a held-out data set is computationally expensive and wasteful of valuable data; Bayesian linear regression can determine model complexity automatically, using the training data alone.

4 3.3.1 Parameter distribution Prior probability of w (eq. 3.48) Posterior (eq. 3.49, 3.50, 3.51)
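For reference, here are the cited equations written out as in the textbook (Bishop's Pattern Recognition and Machine Learning), since the slide shows them only by number; m_0 and S_0 are the mean and covariance of a general Gaussian prior:

p(w) = \mathcal{N}(w \mid m_0, S_0)   (3.48)
p(w \mid \mathbf{t}) = \mathcal{N}(w \mid m_N, S_N)   (3.49)
m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T \mathbf{t})   (3.50)
S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi   (3.51)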

5 Consider a zero-mean isotropic Gaussian prior (eq. 3.52). Posterior (eqs. 3.49, 3.53, 3.54). Log of the posterior: maximizing the posterior distribution w.r.t. w is equivalent to minimizing the regularized sum-of-squares error.
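With this prior, the posterior parameters and the log posterior (same notation) become:

p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)   (3.52)
m_N = \beta S_N \Phi^T \mathbf{t}   (3.53)
S_N^{-1} = \alpha I + \beta \Phi^T \Phi   (3.54)
\ln p(w \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 - \frac{\alpha}{2} w^T w + \text{const}   (3.55)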

6 Straight-line fitting. Original (data-generating) function: f(x, a) = a_0 + a_1 x with a_0 = -0.3, a_1 = 0.5; observations t_n = f(x_n, a) + noise. The panels show the likelihood, the prior/posterior (prior × likelihood = posterior), and samples from the posterior.

7 Straight-line fitting (continued): prior × likelihood = posterior after observing further data points, with samples from the posterior.

8 Straight-line fitting (continued): prior × likelihood = posterior after observing still more data points, with samples from the posterior.
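A minimal sketch of this computation in Python: the prior precision alpha = 2.0, the noise precision beta = 25.0, and the sample size are illustrative choices, not values given on the slides. It builds the design matrix for phi(x) = (1, x), forms the posterior of eqs. 3.53 and 3.54, and draws a few lines from that posterior, as in the right-hand panels above.

import numpy as np

a0_true, a1_true = -0.3, 0.5          # true line from the slide
alpha, beta = 2.0, 25.0               # illustrative prior/noise precisions (not from the slide)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)
t = a0_true + a1_true * x + rng.normal(scale=1.0 / np.sqrt(beta), size=x.shape)

Phi = np.column_stack([np.ones_like(x), x])     # design matrix for phi(x) = (1, x)

# Posterior over w = (a0, a1) under the zero-mean isotropic prior (eqs. 3.53, 3.54)
S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t

print("posterior mean of (a0, a1):", m_N)

# A few straight lines sampled from the posterior, as shown on the slides
w_samples = rng.multivariate_normal(m_N, S_N, size=6)
print(w_samples)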

9 3.3.2 Predictive distribution. Our real interest is the probability of t for new input values (eq. 3.57). Using the result from Section 8.1.4 gives eqs. 3.58 and 3.59; the predictive variance is the sum of the noise on the data and the uncertainty associated with w.
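In equations, the predictive distribution and its variance read:

p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta)\, p(w \mid \mathbf{t}, \alpha, \beta)\, dw   (3.57)
p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid m_N^T \phi(x), \sigma_N^2(x))   (3.58)
\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)   (3.59)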

10 Predictive distribution with the sinusoidal data set: mean of the Gaussian predictive distribution and the predictive uncertainty, shown for N = 1, 2, 4, and 25 data points.

11 Samples from the posterior

12 3.3.3 Equivalent kernel. The predictive mean (eq. 3.60) is the model function evaluated with the posterior mean of w. The mean of the predictive distribution at a point x can be written as a linear combination of the training targets t_n (eqs. 3.61, 3.62); the weighting function k(x, x_n) is known as the "smoother matrix" or "equivalent kernel".
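Written out, the predictive mean and the equivalent kernel are:

y(x, m_N) = m_N^T \phi(x) = \beta\, \phi(x)^T S_N \Phi^T \mathbf{t}   (3.60)
y(x, m_N) = \sum_{n=1}^{N} k(x, x_n)\, t_n   (3.61)
k(x, x') = \beta\, \phi(x)^T S_N \phi(x')   (3.62)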

13 Examples of equivalent kernels for polynomial and for sigmoidal basis functions: k(x, x') plotted as a function of x' for x = 0.

14 Covariance between y(x) and y(x'): cov[y(x), y(x')] = β^{-1} k(x, x') (eq. 3.63). The equivalent kernel k(x, x_n) sums to one over n (eq. 3.64). It also shares an important property of general kernel functions: it can be expressed as an inner product of nonlinear feature vectors (eq. 3.65).
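The three properties cited above, in equation form:

\mathrm{cov}[y(x), y(x')] = \phi(x)^T S_N \phi(x') = \beta^{-1} k(x, x')   (3.63)
\sum_{n=1}^{N} k(x, x_n) = 1   (3.64)
k(x, z) = \psi(x)^T \psi(z), \quad \text{where } \psi(x) = \beta^{1/2} S_N^{1/2} \phi(x)   (3.65)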

15 3.4 Bayesian Model Comparison: the problem of model selection from a Bayesian perspective. The over-fitting associated with maximum likelihood can be avoided by marginalizing over the model parameters instead of making point estimates of their values. Marginalization also allows multiple complexity parameters to be determined simultaneously as part of the training process. The Bayesian view of model comparison simply involves the use of probabilities to represent uncertainty in the choice of model.

16 Compare a set of L models {M_i}, i = 1, ..., L. The posterior probability of a model (eq. 3.66) is proportional to its prior probability (through which we can express a preference for different models) times the model evidence (marginal likelihood). To compare two models, we take the ratio of their model evidences, called the Bayes factor.
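In symbols, with D denoting the observed data:

p(\mathcal{M}_i \mid D) \propto p(\mathcal{M}_i)\, p(D \mid \mathcal{M}_i)   (3.66)
\text{Bayes factor:} \quad p(D \mid \mathcal{M}_i) / p(D \mid \mathcal{M}_j)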

17 Predictive distribution (from the sum and product rules; eq. 3.67). Model evidence (again from the sum and product rules; eq. 3.68). For a rough evaluation (dropping M_i for simplicity), assume that the posterior distribution is sharply peaked around the most probable value w_MAP with width Δw_posterior, and that the prior is flat with width Δw_prior; this gives eq. 3.70.
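The corresponding expressions are:

p(t \mid x, D) = \sum_{i=1}^{L} p(t \mid x, \mathcal{M}_i, D)\, p(\mathcal{M}_i \mid D)   (3.67)
p(D \mid \mathcal{M}_i) = \int p(D \mid w, \mathcal{M}_i)\, p(w \mid \mathcal{M}_i)\, dw   (3.68)
p(D) \simeq p(D \mid w_{\mathrm{MAP}})\, \frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}   (3.70)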

18 Further insight. Take the log (eq. 3.71): the second term is always negative and acts as a penalty that grows as the parameters are finely tuned in the posterior. For a model with M parameters, assuming all parameters have the same ratio Δw_posterior / Δw_prior, the penalty is multiplied by M (eq. 3.72). The slide's figure compares the resulting evidence for simple, intermediate, and complex models.
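Taking the log of eq. 3.70, and then assuming M parameters with a common width ratio, gives:

\ln p(D) \simeq \ln p(D \mid w_{\mathrm{MAP}}) + \ln\left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)   (3.71)
\ln p(D) \simeq \ln p(D \mid w_{\mathrm{MAP}}) + M \ln\left(\frac{\Delta w_{\text{posterior}}}{\Delta w_{\text{prior}}}\right)   (3.72)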

19 Implicit assumption: the true distribution from which the data are generated is contained within the set of models under consideration. Provided this holds, Bayesian model comparison will on average favour the correct model: the expected log Bayes factor is a Kullback-Leibler divergence, which is zero if the two distributions are equal and positive otherwise.
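The average in question is taken over data sets drawn from the correct model M_1:

\int p(D \mid \mathcal{M}_1)\, \ln \frac{p(D \mid \mathcal{M}_1)}{p(D \mid \mathcal{M}_2)}\, dD   (3.73)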

20 3.5 The Evidence Approximation. A fully Bayesian treatment of the linear basis function model would involve:
– Prior distributions over the hyperparameters α and β.
– A predictive distribution that marginalizes over the hyperparameters as well as over w (eq. 3.74).
– If the posterior distribution p(α, β | t) is sharply peaked around values α̂ and β̂, the predictive distribution is obtained simply by marginalizing over w with α and β fixed to α̂ and β̂ (eq. 3.75).
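The two levels of marginalization are:

p(t \mid \mathbf{t}) = \iiint p(t \mid w, \beta)\, p(w \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, dw\, d\alpha\, d\beta   (3.74)
p(t \mid \mathbf{t}) \simeq p(t \mid \mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t \mid w, \hat{\beta})\, p(w \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\, dw   (3.75)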

21 From Bayes' theorem (eq. 3.76), the posterior over the hyperparameters is proportional to the marginal likelihood times the prior. If the prior is relatively flat, α̂ and β̂ are obtained by maximizing the marginal likelihood p(t | α, β). With this method the hyperparameters can be determined from the training data alone, without recourse to cross-validation. Recall that the ratio α/β is analogous to a regularization parameter. Two alternative ways to maximize the evidence:
– Set the evidence function's derivatives to zero and iterate the resulting re-estimation equations for α and β.
– Use the expectation maximization (EM) algorithm.
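Concretely, Bayes' theorem for the hyperparameters reads:

p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta)   (3.76)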

22 3.5.1 Evaluation of the evidence function. The marginal likelihood is obtained by integrating out w (eqs. 3.77, 3.78, 3.79); the integral is evaluated by completing the square over w, which involves the Hessian matrix of E(w).
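The evidence function and the quantities that enter it are:

p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid w, \beta)\, p(w \mid \alpha)\, dw   (3.77)
p(\mathbf{t} \mid \alpha, \beta) = \left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \int \exp\{-E(w)\}\, dw   (3.78)
E(w) = \beta E_D(w) + \alpha E_W(w) = \frac{\beta}{2}\, \| \mathbf{t} - \Phi w \|^2 + \frac{\alpha}{2}\, w^T w   (3.79)
A = \alpha I + \beta \Phi^T \Phi \quad \text{(the Hessian of } E(w)\text{, eq. 3.81)}
\ln p(\mathbf{t} \mid \alpha, \beta) = \frac{M}{2} \ln \alpha + \frac{N}{2} \ln \beta - E(m_N) - \frac{1}{2} \ln |A| - \frac{N}{2} \ln 2\pi   (3.86)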

23 Example with the sinusoidal data set: plot of the model evidence against the order of the model, which favours M = 3.

24 3.5.2 Maximizing the evidence function. Maximize p(t | α, β) by setting the derivatives w.r.t. α and β to zero. The condition w.r.t. α involves the eigenvalues λ_i and eigenvectors u_i of βΦ^TΦ (eqs. 3.87, 3.91, 3.92); the condition w.r.t. β gives eq. 3.95. Both are implicit solutions, so α and β are found by iterative re-estimation.
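The re-estimation equations referenced on the slide are:

(\beta \Phi^T \Phi)\, u_i = \lambda_i u_i   (3.87)
\gamma = \sum_i \frac{\lambda_i}{\alpha + \lambda_i}   (3.91)
\alpha = \frac{\gamma}{m_N^T m_N}   (3.92)
\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \{ t_n - m_N^T \phi(x_n) \}^2   (3.95)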

25 3.5.3 Effective number of parameters. The quantity γ (eq. 3.91) measures the effective total number of well-determined parameters. The figure shows contours of the likelihood function and of the prior: the eigenvalues λ_i measure the curvature of the likelihood, so a direction with α << λ_i is well determined by the data, while a direction with α >> λ_i is set largely by the prior (in the figure, λ_1 is large and λ_2 is small).

26 Compare the variance (inverse precision) estimates in the two approaches: the maximum likelihood result (eq. 3.21) divides the sum of squared errors by N, whereas the evidence approximation (eq. 3.95) divides by N − γ, which corrects the bias of the maximum likelihood result.
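Side by side, the two estimators are:

\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - w_{\mathrm{ML}}^T \phi(x_n) \}^2   (3.21)
\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \{ t_n - m_N^T \phi(x_n) \}^2   (3.95)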

27 Results with the sinusoidal data, with β set to its true value. Fig. 3.16 shows how the optimal α is determined in the evidence framework, plotting the log evidence and the test error against α. Fig. 3.17 shows how α, ranging from 0 to ∞, controls the magnitude of the parameters {w_i}.

28 If N >> M, all of the parameters will be well determined by the data, so γ = M. In this limit the re-estimation equations simplify to the easy-to-compute approximations (3.98) and (3.99).
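The simplified re-estimation formulas are:

\alpha = \frac{M}{2 E_W(m_N)}   (3.98)
\beta = \frac{N}{2 E_D(m_N)}   (3.99)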

29 3.6 Limitations of Fixed Basis Functions. Models comprising a linear combination of fixed, nonlinear basis functions:
– have closed-form solutions to the least-squares problem;
– have a tractable Bayesian treatment.
The difficulty:
– The basis functions are fixed before the training data set is observed, so their number typically needs to grow rapidly with the dimensionality of the input space, a manifestation of the curse of dimensionality.
Properties of real data sets that alleviate this problem:
– The data vectors {x_n} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space.
– The target variables may have significant dependence on only a small number of possible directions within the data manifold.

30 The End

