# Bayesian Learning & Estimation Theory

## Presentation on theme: "Bayesian Learning & Estimation Theory"— Presentation transcript:


Maximum likelihood estimation
Example: For Gaussian likelihood, $P(x|\theta) = \mathcal{N}(x|\mu,\sigma^2)$. Objective of regression: minimize the error $E(w) = \tfrac{1}{2}\sum_n \bigl(t_n - y(x_n,w)\bigr)^2$.

A probabilistic view of linear regression
Precision: $\beta = 1/\sigma^2$. Compare to the error function $E(w) = \tfrac{1}{2}\sum_n \bigl(t_n - y(x_n,w)\bigr)^2$. Since $\arg\min_w E(w) = \arg\max_w p(t|x,w)$, regression is equivalent to ML estimation of w.
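The equivalence can be made explicit by writing out the log likelihood. The following is a sketch that assumes independent Gaussian noise on each target, i.e. $p(t|x,w,\beta) = \prod_n \mathcal{N}(t_n \mid y(x_n,w), \beta^{-1})$, with N denoting the number of training pairs (a symbol introduced here for illustration):

```latex
\ln p(t \mid x, w, \beta)
  = -\frac{\beta}{2} \sum_{n=1}^{N} \bigl(t_n - y(x_n, w)\bigr)^2
    + \frac{N}{2} \ln \beta - \frac{N}{2} \ln 2\pi
  = -\beta\, E(w) + \text{const.}
```

Only the first term depends on w, and β > 0, so maximizing the likelihood over w is the same as minimizing E(w).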

Bayesian learning
View the data D and parameter θ as random variables (for regression, D = (x, t) and θ = w). The data induces a distribution over the parameter: $P(\theta|D) = P(D,\theta)/P(D) \propto P(D,\theta)$. Substituting $P(D,\theta) = P(D|\theta)\,P(\theta)$, we obtain Bayes' theorem: $P(\theta|D) \propto P(D|\theta)\,P(\theta)$, that is, Posterior ∝ Likelihood × Prior.

Bayesian prediction
Predictions (e.g., predict t from x using data D) are mediated through the parameter: $P(\text{prediction}|D) = \int_\theta P(\text{prediction}|\theta)\,P(\theta|D)\,d\theta$. Maximum a posteriori (MAP) estimation: $\theta_{MAP} = \arg\max_\theta P(\theta|D)$, giving $P(\text{prediction}|D) \approx P(\text{prediction}|\theta_{MAP})$. This is accurate when $P(\theta|D)$ is concentrated on $\theta_{MAP}$.
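To see how the full predictive integral differs from the MAP plug-in, here is a minimal sketch on a grid. Everything specific in it is an assumption made for illustration and is not from the slides: a scalar parameter θ, a model x ~ N(θ, 1), a prior θ ~ N(0, 4), and three made-up observations.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(x | mean, var), evaluated elementwise."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical data and model: x ~ N(theta, 1), prior theta ~ N(0, 4).
data = np.array([1.2, 0.7, 1.9])
theta = np.linspace(-6.0, 6.0, 1201)
d_theta = theta[1] - theta[0]

# Posterior on the grid: prior times likelihood, then normalize.
log_post = np.log(gaussian_pdf(theta, 0.0, 4.0))
for x_n in data:
    log_post += np.log(gaussian_pdf(x_n, theta, 1.0))
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum() * d_theta

# MAP estimate: the grid point with the highest posterior density.
theta_map = theta[np.argmax(posterior)]

# Full predictive: integrate P(x_new | theta) against the posterior.
x_new = np.linspace(-8.0, 10.0, 1801)
d_x = x_new[1] - x_new[0]
likelihood_grid = gaussian_pdf(x_new[:, None], theta[None, :], 1.0)
full_predictive = (likelihood_grid * posterior[None, :]).sum(axis=1) * d_theta

# Plug-in approximation: condition on theta_map only.
plug_in = gaussian_pdf(x_new, theta_map, 1.0)

def variance(density):
    """Variance of a density tabulated on the x_new grid."""
    mean = np.sum(x_new * density) * d_x
    return np.sum((x_new - mean) ** 2 * density) * d_x

print("theta_MAP:", theta_map)
print("predictive variance (full):", variance(full_predictive))
print("predictive variance (plug-in):", variance(plug_in))
```

With only three observations the posterior is not sharply concentrated, so the full predictive is wider than the plug-in one; the gap shrinks as more data arrive.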

A probabilistic view of regularized regression
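$E(w) = \tfrac{1}{2}\sum_n \bigl(t_n - y(x_n,w)\bigr)^2 + \tfrac{\lambda}{2}\sum_m w_m^2$. The first term corresponds to $-\ln p(t|x,w)$ and the second to $-\ln p(w)$, up to additive constants (taking the noise precision as 1). Prior: the w's are IID Gaussian, $p(w) = \prod_m \frac{1}{\sqrt{2\pi\lambda^{-1}}} \exp\{-\lambda w_m^2/2\}$. Since $\arg\min_w E(w) = \arg\max_w p(t|x,w)\,p(w)$, regularized regression is equivalent to MAP estimation of w.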
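The correspondence can be checked numerically. The sketch below assumes a model that is linear in w (a polynomial, with the design matrix named `Phi` here), unit noise precision, and made-up data; none of these specifics come from the slides. It computes the regularized least squares solution in closed form and confirms that the negative log posterior differs from E(w) only by a constant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy samples of sin(2*pi*x).
N, M, lam = 10, 3, 1e-2
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix with columns 1, x, ..., x^M, so that y(x, w) = Phi @ w.
Phi = np.vander(x, M + 1, increasing=True)

def E(w):
    """Regularized sum-of-squares error."""
    return 0.5 * np.sum((t - Phi @ w) ** 2) + 0.5 * lam * np.sum(w ** 2)

def neg_log_posterior(w):
    """-ln p(t|x,w) - ln p(w), with unit noise precision and prior precision lam."""
    neg_log_lik = 0.5 * np.sum((t - Phi @ w) ** 2) + 0.5 * N * np.log(2.0 * np.pi)
    neg_log_prior = 0.5 * lam * np.sum(w ** 2) + 0.5 * (M + 1) * np.log(2.0 * np.pi / lam)
    return neg_log_lik + neg_log_prior

# Minimizer of E(w): solve (Phi^T Phi + lam I) w = Phi^T t.
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

# The gradient of E vanishes at w_map.
grad = -Phi.T @ (t - Phi @ w_map) + lam * w_map

# The two objectives differ by a constant that does not depend on w.
offset = neg_log_posterior(w_map) - E(w_map)
w_other = rng.normal(size=M + 1)
print(np.allclose(neg_log_posterior(w_other) - E(w_other), offset))
```

Because the offset is independent of w, both objectives share the same minimizer, which is what the slide's argmin/argmax statement says.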

Bayesian linear regression
Likelihood: β specifies the precision of the data noise. Prior: $p(w) = \prod_{m=0}^{M} \mathcal{N}(w_m \mid 0, \alpha^{-1})$, where α specifies the precision of the weights. Posterior: this is an (M+1)-dimensional Gaussian density. Prediction: computed using linear algebra (see textbook).
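The linear algebra is short enough to sketch. The code below uses the standard textbook result for a zero-mean isotropic Gaussian prior and Gaussian noise: posterior covariance $S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}$, posterior mean $m_N = \beta S_N \Phi^T t$, and predictive variance $1/\beta + \phi(x)^T S_N \phi(x)$. The function names, the straight-line basis, and the numeric values are assumptions chosen for illustration.

```python
import numpy as np

def fit_posterior(Phi, t, alpha, beta):
    """Mean m_N and covariance S_N of the Gaussian posterior over the weights."""
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predict(Phi_new, m_N, S_N, beta):
    """Mean and variance of the Gaussian predictive distribution at new inputs."""
    mean = Phi_new @ m_N
    # Noise variance plus the variance due to uncertainty in w (row-wise phi^T S_N phi).
    var = 1.0 / beta + np.sum((Phi_new @ S_N) * Phi_new, axis=1)
    return mean, var

# Hypothetical example: y(x) = w0 + w1 * x with made-up precisions and data.
rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1.0, 1.0, size=20)
t = -0.3 + 0.5 * x + rng.normal(scale=beta ** -0.5, size=20)
Phi = np.column_stack([np.ones_like(x), x])

m_N, S_N = fit_posterior(Phi, t, alpha, beta)

# Predict inside the data range (x = 0) and far outside it (x = 3).
x_new = np.array([0.0, 3.0])
Phi_new = np.column_stack([np.ones_like(x_new), x_new])
mean, var = predict(Phi_new, m_N, S_N, beta)
print("posterior mean of w:", m_N)
print("predictive std devs:", np.sqrt(var))
```

The predictive variance never drops below the noise level 1/β, and it grows away from the training inputs, which is the kind of varying uncertainty a single fitted curve cannot express.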

Example: $y(x) = w_0 + w_1 x$
[Figure: likelihood, prior/posterior, and data with y(x) sampled from the posterior, shown for no data, the 1st point, the 2nd point, ..., and the 20th point.]
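The progression from no data to the 20th point can be imitated by updating the posterior one observation at a time, with each posterior serving as the prior for the next point. This is a sketch under assumed values: the precisions `alpha` and `beta`, the generating line `w_true`, and the random inputs are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0            # hypothetical prior and noise precisions
w_true = np.array([-0.3, 0.5])     # hypothetical line used to generate data

x = rng.uniform(-1.0, 1.0, size=20)
t = w_true[0] + w_true[1] * x + rng.normal(scale=beta ** -0.5, size=20)

# Start from the prior N(w | 0, I / alpha), i.e. the "no data" panel.
m = np.zeros(2)
S = np.eye(2) / alpha
spread = [np.trace(S)]             # total posterior variance after each step

for x_n, t_n in zip(x, t):
    phi = np.array([1.0, x_n])     # basis vector for y(x) = w0 + w1 * x
    S_inv_old = np.linalg.inv(S)
    # Gaussian update: precisions add, means combine weighted by precision.
    S = np.linalg.inv(S_inv_old + beta * np.outer(phi, phi))
    m = S @ (S_inv_old @ m + beta * phi * t_n)
    spread.append(np.trace(S))

# Draw a few weight vectors from the final posterior and evaluate the lines.
w_samples = rng.multivariate_normal(m, S, size=6)
x_plot = np.linspace(-1.0, 1.0, 5)
y_samples = w_samples[:, [0]] + w_samples[:, [1]] * x_plot   # shape (6, 5)

print("posterior mean after 20 points:", m)
print("total variance, prior vs. final:", spread[0], spread[-1])
```

Each row of `y_samples` is one sampled function y(x); as more points are absorbed the sampled lines bunch together, mirroring the shrinking posterior.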

Example: $y(x) = w_0 + w_1 x + \dots + w_M x^M$
M = 9. $\alpha = 5\times10^{-3}$: gives a reasonable range of functions. $\beta = 11.1$: known precision of noise. [Figure: mean and one standard deviation of the predictive distribution.]

Example: $y(x) = w_0 + w_1\phi_1(x) + \dots + w_M\phi_M(x)$
Gaussian basis functions.

How are we doing on the pass sequence?
Least squares regression. [Figure: fitted red line; vertical axis: hand-labeled horizontal coordinate, t.] Choosing a particular M and w seems wrong; we should hedge our bets. Cross validation reduced the training data, so the red line isn't as accurate as it should be. The red line doesn't reveal different levels of uncertainty in predictions.

How are we doing on the pass sequence?
Bayesian regression. [Figure: vertical axis: hand-labeled horizontal coordinate, t.] The red line doesn't reveal different levels of uncertainty in predictions. Cross validation reduced the training data, so the red line isn't as accurate as it should be. Choosing a particular M and w seems wrong; we should hedge our bets.

Estimation theory
Provided with a predictive distribution p(t|x), how do we estimate a single value for t? Example: in the pass sequence, Cupid must aim at and hit the man in the white shirt, without hitting the man in the striped shirt. Define $L(t,t^*)$ as the loss incurred by estimating $t^*$ when the true value is t. Assuming p(t|x) is correct, the expected loss is $E[L] = \int_t L(t,t^*)\,p(t|x)\,dt$. The minimum loss estimate is found by minimizing E[L] with respect to $t^*$.

Squared loss E[L] = t ( t - t* )2 p(t|x) dt
A common choice: L(t,t*) = ( t - t* )2 E[L] = t ( t - t* )2 p(t|x) dt Not appropriate for Cupid’s problem To minimize E[L] , set its derivative to zero: dE[L]/dt* = -2t ( t - t* ) p(t|x) dt = 0 -2t t p(t|x)dt + t* = 0 Minimum mean squared error (MMSE) estimate: t* = E[t|x] = t t p(t|x)dt For regression: t* = y(x,w)
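The MMSE result can be checked numerically by brute force. The sketch below tabulates a skewed predictive density on a grid and scans candidate estimates; the two-component Gaussian mixture is a made-up stand-in for p(t|x), not something taken from the slides.

```python
import numpy as np

def normal_pdf(t, mu, sd):
    """Density of a Gaussian with mean mu and standard deviation sd."""
    return np.exp(-0.5 * ((t - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Hypothetical skewed predictive density p(t|x), tabulated on a grid.
t = np.linspace(-5.0, 15.0, 4001)
dt = t[1] - t[0]
p = 0.7 * normal_pdf(t, 1.0, 1.0) + 0.3 * normal_pdf(t, 6.0, 2.0)
p /= p.sum() * dt                      # renormalize on the truncated grid

def expected_squared_loss(t_star):
    """E[L] for squared loss, approximated by a Riemann sum."""
    return np.sum((t - t_star) ** 2 * p) * dt

# Scan candidate estimates and keep the one with the smallest expected loss.
candidates = np.linspace(0.0, 6.0, 601)
losses = np.array([expected_squared_loss(c) for c in candidates])
t_best = candidates[np.argmin(losses)]

# Mean of the density, E[t|x].
t_mean = np.sum(t * p) * dt

print("minimizer found by scanning:", t_best)
print("mean of the density:", t_mean)
```

The scanned minimizer agrees with the mean up to the candidate grid spacing, even though the density is skewed and its mode lies elsewhere.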

Other loss functions
[Figure: absolute loss and squared loss.]

Absolute loss
[Figure: points $t_1, \dots, t_7$ on the t axis, with the estimate $t^*$ lying between $t_6$ and $t_7$ and a small shift ε marked.]
$L = \sum_{i=1}^{7} |t^* - t_i|$. Consider moving $t^*$ to the left by ε: L decreases by 6ε and increases by ε. Changes in L are balanced when $t^* = t_4$. The median of t under p(t|x) minimizes absolute loss. Important: the median is invariant to monotonic transformations of t. [Figure: mean and median.]
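Both claims, that the middle point minimizes the summed absolute loss and that the median commutes with monotonic transformations, are easy to verify on a toy sample. The seven values below are made up, and the logarithm is just one convenient monotonic transformation.

```python
import numpy as np

# Seven hypothetical sample values t_1 < ... < t_7 (t_4 = 4.0 is the middle one).
t = np.array([1.0, 2.0, 2.5, 4.0, 7.0, 8.0, 20.0])

def absolute_loss(t_star, samples):
    """L = sum_i |t_star - t_i|."""
    return np.sum(np.abs(t_star - samples))

# Scan candidate estimates and keep the one with the smallest loss.
candidates = np.linspace(0.0, 25.0, 2501)
losses = np.array([absolute_loss(c, t) for c in candidates])
t_best = candidates[np.argmin(losses)]

median_t = np.median(t)    # 4.0, the middle sample
mean_t = np.mean(t)        # pulled upward by the outlying value 20.0

# Monotonic transformation: the median commutes with log, the mean does not.
median_of_log = np.median(np.log(t))
log_of_median = np.log(median_t)
mean_of_log = np.mean(np.log(t))
log_of_mean = np.log(mean_t)

print("loss minimizer:", t_best, "median:", median_t, "mean:", mean_t)
print("median invariant:", np.isclose(median_of_log, log_of_median))
print("mean invariant:", np.isclose(mean_of_log, log_of_mean))
```

The single large value shifts the mean but leaves the median untouched, which is why the two estimates can differ substantially for skewed distributions.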

D-dimensional estimation
Suppose t is D-dimensional, $t = (t_1,\dots,t_D)$. Example: 2-dimensional tracking. Approach 1: minimum marginal loss estimation. Find $t_d^*$ that minimizes $\int_{t_d} L(t_d,t_d^*)\,p(t_d|x)\,dt_d$. Approach 2: minimum joint loss estimation. Define a joint loss $L(t,t^*)$ and find $t^*$ that minimizes $\int_t L(t,t^*)\,p(t|x)\,dt$.
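Whether the two approaches agree depends on the loss. The sketch below uses a made-up discrete joint distribution over a 2-dimensional t and two loss choices that are assumptions for illustration: a 0-1 loss (whose minimizer is the mode) and a squared loss (whose minimizer is the mean). With the 0-1 loss the marginal and joint estimates differ; with squared loss they coincide.

```python
import numpy as np

# Hypothetical joint distribution: joint[i, j] = p(t1 = values[i], t2 = values[j] | x).
values = np.array([0.0, 1.0])
joint = np.array([[0.4, 0.0],
                  [0.3, 0.3]])

# Marginals p(t1|x) and p(t2|x).
p_t1 = joint.sum(axis=1)
p_t2 = joint.sum(axis=0)

# Approach 1 with 0-1 loss per dimension: take the mode of each marginal.
marginal_estimate = (values[np.argmax(p_t1)], values[np.argmax(p_t2)])

# Approach 2 with a joint 0-1 loss: take the mode of the joint distribution.
i, j = np.unravel_index(np.argmax(joint), joint.shape)
joint_estimate = (values[i], values[j])

# With squared loss, both approaches return the same vector of means.
marginal_means = (values @ p_t1, values @ p_t2)
joint_mean = (np.sum(values[:, None] * joint), np.sum(values[None, :] * joint))

print("0-1 loss, marginal:", marginal_estimate)   # (1.0, 0.0)
print("0-1 loss, joint:   ", joint_estimate)      # (0.0, 0.0)
print("squared loss agrees:", np.allclose(marginal_means, joint_mean))
```

Here the marginal estimate (1, 0) has joint probability 0.3, lower than the 0.4 of the joint mode (0, 0), so estimating each coordinate separately can land on a jointly less plausible configuration.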

Questions?

How are we doing on the pass sequence?
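Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data. Can we track the man in the white shirt? Feature, x: compute the 1st moment of the fraction of pixels in each column with intensity > 0.9 (in the frame shown, x = 224). Target: the hand-labeled horizontal coordinate, t (in the frame shown, t = 290). The man in the white shirt is occluded.
[Figure: fraction of pixels in column with intensity > 0.9 versus horizontal location; hand-labeled horizontal coordinate, t, versus feature, x.]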

How are we doing on the pass sequence?
Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data. Can we track the man in the white shirt? Not very well. Regression fails to identify that there really are two classes of solution.
[Figure: hand-labeled horizontal coordinate, t, versus feature, x.]
