Download presentation

Presentation is loading. Please wait.

Published byDorthy Gray Modified over 2 years ago

1
1 Regression and the Bias-Variance Decomposition William Cohen 10-601 April 2008 Readings: Bishop 3.1,3.2

2
2 Regression Technically: learning a function f(x)=y where y is real-valued, rather than discrete. –Replace livesInSquirrelHill(x 1,x 2,…,x n ) with averageCommuteDistanceInMiles(x 1,x 2,…,x n ) –Replace userLikesMovie(u,m) with usersRatingForMovie(u,m) –…

3
3 Example: univariate linear regression Example: predict age from number of publications

4
4 Linear regression Model: y i = ax i + b + ε i where ε i ~ N(0,σ) Training Data: (x 1,y 1 ),….(x n,y n ) Goal: estimate a,b with w=(a,b) ^ ^ assume MLE

5
5 Linear regression Model: y i = ax i + b + ε i where ε i ~ N(0,σ) Training Data: (x 1,y 1 ),….(x n,y n ) Goal: estimate a,b with w=(a,b) Ways to estimate parameters –Find derivative wrt parameters a,b –Set to zero and solve Or use gradient ascent to solve Or …. ^ ^

6
6 Linear regression d2d2 d1d1 x1x1 x2x2 y2y2 y1y1 d3d3 How to estimate the slope? n*var(X) n*cov(X,Y)

7
7 Linear regression How to estimate the intercept? d2d2 d1d1 x1x1 x2x2 y2y2 y1y1 d3d3

8
8 Bias/Variance Decomposition of Error

9
9 Return to the simple regression problem f:X Y y = f(x) + What is the expected error for a learned h? noise N(0, ) deterministic Bias – Variance decomposition of error

10
10 Bias – Variance decomposition of error learned from D true fctdataset Experiment (the error of which I’d like to predict): 1. Draw size n sample D = (x 1,y 1 ),….(x n,y n ) 2. Train linear function h D using D 3. Draw a test example (x,f(x)+ε) 4. Measure squared error of h D on that example

11
11 Bias – Variance decomposition of error (2) learned from D true fctdataset Fix x, then do this experiment: 1. Draw size n sample D = (x 1,y 1 ),….(x n,y n ) 2. Train linear function h D using D 3. Draw the test example (x,f(x)+ε) 4. Measure squared error of h D on that example

12
12 Bias – Variance decomposition of error t f y ^ why not? really y D ^

13
13 Bias – Variance decomposition of error Depends on how well learner approximates f Intrinsic noise

14
14 Bias – Variance decomposition of error Squared difference btwn our long- term expectation for the learners performance, E D [h D (x)], and what we expect in a representative run on a dataset D (hat y) Squared difference between best possible prediction for x, f(x), and our “long-term” expectation for what the learner will do if we averaged over many datasets D, E D [h D (x)] BIAS 2 VARIANCE

15
15 Bias-variance decomposition How can you reduce bias of a learner? How can you reduce variance of a learner? Make the long-term average better approximate the true function f(x ) Make the learner less sensitive to variations in the data

16
16 A generalization of bias-variance decomposition to other loss functions “Arbitrary” real-valued loss L(t,y) But L(y,y’)=L(y’,y), L(y,y)=0, and L(y,y’)!=0 if y!=y’ Define “optimal prediction”: y* = argmin y’ L(t,y’) Define “main prediction of learner” y m =y m,D = argmin y’ E D {L(t,y’)} Define “bias of learner”: B(x)=L(y*,y m ) Define “variance of learner” V(x)=E D [L(y m,y)] Define “noise for x”: N(x) = E t [L(t,y*)] Claim: E D,t [L(t,y) ] = c 1 N(x)+B(x)+c 2 V(x) where c 1 =Pr D [ y=y*] - 1 c 2 =1 if y m =y*, -1 else m=n=|D |

17
17 Other regression methods

18
18 Example: univariate linear regression Example: predict age from number of publications Paul Erdős Hungarian mathematician, 1913-1996 x ~ 1500 age about 240

19
19 Linear regression d2d2 d1d1 x1x1 x2x2 y2y2 y1y1 d3d3 Summary: To simplify: assume zero-centered data, as we did for PCA let x=(x 1,…,x n ) and y =(y 1,…,y n ) then…

20
20 Onward: multivariate linear regression Univariate Multivariate row is example col is feature

21
21 Onward: multivariate linear regression regularized ^

22
22 Onward: multivariate linear regression Multivariate, multiple outputs

23
23 Onward: multivariate linear regression regularized ^ What does increasing λ do?

24
24 Onward: multivariate linear regression regularized ^ w=(w 1,w 2) What does fixing w 2 =0 do (if λ=0)?

25
25 Regression trees - summary Growing tree: –Split to optimize information gain At each leaf node –Predict the majority class Pruning tree: –Prune to reduce error on holdout Prediction: –Trace path to a leaf and predict associated majority class build a linear model, then greedily remove features estimates are adjusted by (n+k)/(n-k): n=#cases, k=#features estimated error on training data using to a linear interpolation of every prediction made by every node on the path [Quinlan’s M5]

26
26 Regression trees: example - 1

27
27 Regression trees – example 2 What does pruning do to bias and variance?

28
28 Kernel regression aka locally weighted regression, locally linear regression, LOESS, …

29
29 Kernel regression aka locally weighted regression, locally linear regression, … Close approximation to kernel regression: –Pick a few values z 1,…,z k up front –Preprocess: for each example (x,y), replace x with x= where K(x,z) = exp( -(x-z) 2 / 2σ 2 ) –Use multivariate regression on x,y pairs

30
30 Kernel regression aka locally weighted regression, locally linear regression, LOESS, … What does making the kernel wider do to bias and variance?

31
31 Additional readings P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.A Unified Bias-Variance Decomposition and its Applications J. R. Quinlan, Learning with Continuous Classes, 5th Australian Joint Conference on Artificial Intelligence, 1992.Learning with Continuous Classes Y. Wang & I. Witten, Inducing model trees for continuous classes, 9th European Conference on Machine Learning, 1997Inducing model trees for continuous classes D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models, JAIR, 1996. Active Learning with Statistical Models

Similar presentations

OK

Bias and Variance of the Estimator PRML 3.2 Ethem Chp. 4.

Bias and Variance of the Estimator PRML 3.2 Ethem Chp. 4.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Class-9th physics ppt on motion Ppt on indian administrative services What does appt only mean one thing Ppt on industrial all risk policy Ppt on polarization of light Ppt on carl friedrich gauss formulas Ppt on going places by a.r.barton Free ppt on mobile number portability delhi Ppt on supply chain management of hyundai motors Free ppt on autism