
1 Regression and the Bias-Variance Decomposition
William Cohen, 10-601, April 2008
Readings: Bishop 3.1, 3.2

2 Regression
Technically: learning a function f(x)=y where y is real-valued, rather than discrete.
- Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
- Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
- …

3 Example: univariate linear regression
Example: predict age from number of publications

4 Linear regression
Model: yi = a·xi + b + εi, where εi ~ N(0, σ²)
Training data: (x1,y1),…,(xn,yn)
Goal: estimate a,b with ŵ = (â, b̂)   (assume MLE)

5 Linear regression
Model: yi = a·xi + b + εi, where εi ~ N(0, σ²)
Training data: (x1,y1),…,(xn,yn)
Goal: estimate a,b with ŵ = (â, b̂)
Ways to estimate the parameters:
- Find the derivative with respect to the parameters a,b, set it to zero, and solve
- Or use gradient ascent to solve
- Or …

6 Linear regression
[figure: data points (x1,y1), (x2,y2), … with residuals d1, d2, d3 about the fitted line]
How to estimate the slope? Minimizing the summed squared residuals gives
â = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² = n·cov(X,Y) / (n·var(X))

7 Linear regression
How to estimate the intercept? With the slope â in hand,
b̂ = ȳ − â·x̄
[figure: the same data points and residuals d1, d2, d3]
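The two closed-form estimates above fit in a few lines of plain Python; a minimal sketch (the function name and example data are illustrative, not from the slides):

```python
# Closed-form univariate least squares:
#   slope a = n*cov(X,Y) / (n*var(X)), intercept b = ybar - a*xbar.
def fit_line(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # n * cov(X, Y)
    var = sum((x - xbar) ** 2 for x in xs)                      # n * var(X)
    a = cov / var
    b = ybar - a * xbar
    return a, b

# Noise-free data on the line y = 2x + 1 recovers the parameters exactly.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 2.0 1.0
```

With noisy data the same formulas return the maximum-likelihood estimates under the Gaussian-noise model of slide 4.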

8 Bias/Variance Decomposition of Error

9 Bias – Variance decomposition of error
Return to the simple regression problem f: X → Y,  y = f(x) + ε
where f is deterministic and the noise is ε ~ N(0, σ²).
What is the expected error for a learned h?

10 Bias – Variance decomposition of error
Experiment (the error of which I’d like to predict):
1. Draw a size-n sample D = (x1,y1),…,(xn,yn) from the true f
2. Train a linear function hD using D
3. Draw a test example (x, f(x)+ε)
4. Measure the squared error of hD on that example

11 Bias – Variance decomposition of error (2)
Fix x, then do this experiment:
1. Draw a size-n sample D = (x1,y1),…,(xn,yn) from the true f
2. Train a linear function hD using D
3. Draw the test example (x, f(x)+ε)
4. Measure the squared error of hD on that example
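The fixed-x experiment can be simulated directly. A sketch, with an assumed linear true function f, uniform inputs, and Gaussian noise (all illustrative choices, not from the slides):

```python
import random

def f(x):
    return 2.0 * x + 1.0  # assumed true function

def fit_line(xs, ys):
    # Ordinary least squares, as on the earlier slides.
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    a = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
         / sum((xi - xbar) ** 2 for xi in xs))
    return a, ybar - a * xbar

def experiment(x=0.5, n=20, sigma=0.3, trials=2000, seed=0):
    rng = random.Random(seed)
    preds = []
    for _ in range(trials):
        xs = [rng.uniform(0, 1) for _ in range(n)]         # 1. draw a size-n sample D
        ys = [f(xi) + rng.gauss(0, sigma) for xi in xs]
        a, b = fit_line(xs, ys)                            # 2. train h_D on D
        preds.append(a * x + b)                            # h_D(x) at the fixed x
    mean_pred = sum(preds) / trials
    bias2 = (f(x) - mean_pred) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias2, variance, sigma ** 2   # expected sq. error ≈ bias² + variance + σ²
```

Because the model class here contains the true f, the estimated bias² comes out near zero and the error at x is essentially noise plus variance, previewing the decomposition on the next slides.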

12 Bias – Variance decomposition of error
Notation: t = f(x) + ε is the observed target, and ŷ = hD(x) is the learner’s prediction. (Why not write just ŷ? Because the prediction depends on the draw of D, it is really ŷD.)

13 Bias – Variance decomposition of error
E_{D,ε}[(t − hD(x))²] = E[ε²] + E_D[(f(x) − hD(x))²]
The first term is the intrinsic noise σ²; the second depends on how well the learner approximates f.

14 Bias – Variance decomposition of error
BIAS²: the squared difference between the best possible prediction for x, f(x), and our “long-term” expectation of what the learner will do if we averaged over many datasets D, E_D[hD(x)].
VARIANCE: the expected squared difference between our long-term expectation of the learner’s performance, E_D[hD(x)], and what we expect in a representative run on a dataset D (ŷ).
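In symbols, the pieces above combine into the full decomposition (the standard result, matching the Bishop 3.2 reading):

```latex
\mathbb{E}_{D,\epsilon}\!\left[(t - h_D(x))^2\right]
  = \underbrace{\sigma^2}_{\text{noise}}
  + \underbrace{\bigl(f(x) - \mathbb{E}_D[h_D(x)]\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\bigl(h_D(x) - \mathbb{E}_D[h_D(x)]\bigr)^2\right]}_{\text{variance}}
```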

15 Bias-variance decomposition
How can you reduce the bias of a learner? Make the long-term average E_D[hD(x)] better approximate the true function f(x).
How can you reduce the variance of a learner? Make the learner less sensitive to variations in the data.

16 A generalization of the bias-variance decomposition to other loss functions
- Take an “arbitrary” real-valued loss L(t,y), with L(y,y′) = L(y′,y), L(y,y) = 0, and L(y,y′) ≠ 0 if y ≠ y′
- Define the optimal prediction: y* = argmin_{y′} E_t[L(t,y′)]
- Define the main prediction of the learner: ym = ym,D = argmin_{y′} E_D[L(y,y′)]
- Define the bias of the learner: B(x) = L(y*,ym)
- Define the variance of the learner: V(x) = E_D[L(ym,y)]
- Define the noise for x: N(x) = E_t[L(t,y*)]
Claim: E_{D,t}[L(t,y)] = c1·N(x) + B(x) + c2·V(x), where (for two-class 0/1 loss) c1 = 2·Pr_D[y = y*] − 1 and c2 = 1 if ym = y*, −1 otherwise.

17 Other regression methods

18 Example: univariate linear regression
Example: predict age from number of publications
Paul Erdős, Hungarian mathematician, 1913-1996: with x ≈ 1500 publications, the predicted age is about 240.

19 Linear regression
[figure: data points with residuals d1, d2, d3]
Summary. To simplify: assume zero-centered data, as we did for PCA; let x = (x1,…,xn) and y = (y1,…,yn); then â = (xᵀy)/(xᵀx).

20 Onward: multivariate linear regression
Univariate: â = (xᵀy)/(xᵀx)
Multivariate: ŵ = (XᵀX)⁻¹ Xᵀ y, where each row of X is an example and each column is a feature
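The multivariate closed form ŵ = (XᵀX)⁻¹Xᵀy can be sketched in pure Python by forming the normal equations and solving them with Gaussian elimination (helper names are illustrative):

```python
# Multivariate least squares via the normal equations (X^T X) w = X^T y.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    # Gaussian elimination with partial pivoting on the augmented matrix [A | b].
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w

def fit(X, y):
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]
    return solve(XtX, Xty)
```

Adding a constant-1 column to X recovers the univariate intercept: `fit([[1,0],[1,1],[1,2],[1,3]], [1,3,5,7])` gives weights close to [1.0, 2.0], matching slide 7’s b̂ and â.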

21 Onward: multivariate linear regression
Regularized (ridge) version: ŵ = (XᵀX + λI)⁻¹ Xᵀ y

22 Onward: multivariate linear regression
Multivariate, multiple outputs: stack the output vectors into a matrix Y; then Ŵ = (XᵀX)⁻¹ Xᵀ Y.

23 Onward: multivariate linear regression
Regularized (ridge): ŵ = (XᵀX + λI)⁻¹ Xᵀ y
What does increasing λ do?
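The effect of λ is easiest to see in one dimension, where the ridge estimate reduces to â = (xᵀy)/(xᵀx + λ); a sketch with illustrative data:

```python
# Univariate ridge regression on zero-centered data:
#   a_hat = (x . y) / (x . x + lam).
# Increasing lam shrinks the estimate toward zero.

def ridge_slope(xs, ys, lam):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [-2, -1, 1, 2]   # zero-centered inputs
ys = [-4, -2, 2, 4]   # y = 2x exactly
print([round(ridge_slope(xs, ys, lam), 3) for lam in (0, 10, 100)])
# → [2.0, 1.0, 0.182]
```

So increasing λ pulls the weights toward zero: more bias, less variance, i.e. it moves the learner along the bias-variance trade-off of slide 15.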

24 Onward: multivariate linear regression
Regularized: ŵ = (XᵀX + λI)⁻¹ Xᵀ y, with w = (w1,w2)
What does fixing w2 = 0 do (if λ = 0)?

25 Regression trees [Quinlan’s M5] – summary
Growing the tree:
- Split to optimize information gain
At each leaf node:
- Predict the majority class [M5: build a linear model, then greedily remove features]
Pruning the tree:
- Prune to reduce error on holdout [M5: use estimated error on the training data, with estimates adjusted by (n+k)/(n−k), n = #cases, k = #features]
Prediction:
- Trace a path to a leaf and predict the associated majority class [M5: smooth using a linear interpolation of every prediction made by every node on the path]
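A minimal regression “stump” makes the growing step concrete; this sketch is not Quinlan’s full M5 (no linear leaf models, pruning, or smoothing), just the core split criterion with mean-value leaves:

```python
# Choose the threshold that most reduces squared error, predicting the
# mean target value on each side (the regression analogue of a
# majority-class leaf).

def sse(ys):
    # Sum of squared errors around the mean.
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds between distinct values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]  # (threshold, left-leaf mean, right-leaf mean)

print(best_split([1, 2, 8, 9], [1.0, 1.0, 5.0, 5.0]))  # → (8, 1.0, 5.0)
```

Recursing on each side of the chosen split grows a full tree; M5 then replaces the leaf means with pruned linear models.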

26 Regression trees: example 1

27 Regression trees: example 2
What does pruning do to bias and variance?

28 Kernel regression
aka locally weighted regression, locally linear regression, LOESS, …

29 Kernel regression
aka locally weighted regression, locally linear regression, …
A close approximation to kernel regression:
- Pick a few values z1,…,zk up front
- Preprocess: for each example (x,y), replace x with x′ = ⟨K(x,z1),…,K(x,zk)⟩, where K(x,z) = exp(−(x−z)²/(2σ²))
- Use multivariate regression on the (x′,y) pairs
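The preprocessing trick above can be sketched end to end; the centers, width, and target below are illustrative choices (for k=2 features, the normal equations are solved inline with Cramer’s rule rather than a general solver):

```python
import math

def rbf_features(x, centers, sigma):
    # The slide's preprocessing: x -> <K(x,z_1), ..., K(x,z_k)>.
    return [math.exp(-(x - z) ** 2 / (2 * sigma ** 2)) for z in centers]

def fit_two_features(phis, ys):
    # Least squares (P^T P) w = P^T y for exactly k=2 features, via Cramer's rule.
    a = sum(p[0] * p[0] for p in phis)
    b = sum(p[0] * p[1] for p in phis)
    c = sum(p[1] * p[1] for p in phis)
    u = sum(p[0] * y for p, y in zip(phis, ys))
    v = sum(p[1] * y for p, y in zip(phis, ys))
    det = a * c - b * b
    return [(u * c - v * b) / det, (a * v - b * u) / det]

centers, sigma = [0.0, 1.0], 0.5   # assumed centers z_1, z_2 and kernel width
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
phis = [rbf_features(x, centers, sigma) for x in xs]
ys = [2.0 * p[0] + 3.0 * p[1] for p in phis]   # target built from the basis
w = fit_two_features(phis, ys)
print([round(wi, 6) for wi in w])  # → [2.0, 3.0]

def pred(x):
    return sum(wi * fi for wi, fi in zip(w, rbf_features(x, centers, sigma)))
```

Since the target here is itself a combination of the two basis functions, the fit recovers the weights exactly; on real data the same pipeline gives a smooth approximation whose bias and variance depend on k and σ.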

30 Kernel regression
aka locally weighted regression, locally linear regression, LOESS, …
What does making the kernel wider do to bias and variance?

31 Additional readings
- P. Domingos. A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.
- J. R. Quinlan. Learning with Continuous Classes. 5th Australian Joint Conference on Artificial Intelligence, 1992.
- Y. Wang & I. Witten. Inducing Model Trees for Continuous Classes. 9th European Conference on Machine Learning, 1997.
- D. A. Cohn, Z. Ghahramani, & M. Jordan. Active Learning with Statistical Models. JAIR, 1996.
