1 Lecture 3 – Overview of Supervised Learning, Rice ELEC 697, Farinaz Koushanfar, Fall 2006

2 Summary
– Variable types and terminology
– Two simple approaches to prediction
  – Linear model and least squares
  – Nearest-neighbor methods
– Statistical decision theory
– Curse of dimensionality
– Structured regression models
– Classes of restricted estimators
– Reading: Chapter 2, ESL

3 Variable Types and Terminology
– Input/output variables:
  – Quantitative
  – Qualitative (categorical, discrete, factors)
  – Ordered categorical
– Regression: quantitative output
– Classification: qualitative output (numeric code)
– Terminology:
  – X: input, Y: regression output, G: classification output
  – x_i: the i-th value of X (either scalar or vector)
  – Ŷ: prediction of Y, Ĝ: prediction of G

4 Two Approaches to Prediction
(1) Linear model (OLS)
– Given a vector X = (X_1, ..., X_p), predict Y via Ŷ = β̂_0 + Σ_{j=1}^{p} X_j β̂_j; with the intercept absorbed into X, Ŷ = X^T β̂
– Least-squares method: choose β to minimize RSS(β) = Σ_{i=1}^{N} (y_i − x_i^T β)^2
– Differentiate w.r.t. β: X^T (y − Xβ) = 0
– For X^T X nonsingular: β̂ = (X^T X)^{-1} X^T y
(2) Nearest neighbors (NN)
– The k-NN prediction for Ŷ is Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
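
A minimal NumPy sketch of the two predictors on this slide. The function names and the choice of Euclidean distance are mine, not from the slides; the formulas follow the slide directly.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients: beta_hat = (X^T X)^{-1} X^T y.
    Assumes X already contains a leading column of ones for the intercept."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ols_predict(X, beta_hat):
    """Linear-model prediction: Y_hat = X^T beta_hat."""
    return X @ beta_hat

def knn_predict(x0, X_train, y_train, k=15):
    """k-NN prediction: average the responses of the k training points
    closest (in Euclidean distance) to the query point x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    neighbors = np.argsort(dists)[:k]
    return y_train[neighbors].mean()
```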

5 Example

6 Example – Linear Model
– Example: the output G is either GREEN or RED
– The two classes are separated by a linear decision boundary
– Possible data scenarios:
  1. Gaussian, uncorrelated, same variance, different means
  2. Each class is a mixture of 10 different Gaussians

7 Example – 15-Nearest Neighbors
– The k-NN fit for Ŷ is Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
– N_k(x) is the neighborhood of x containing the k closest training points
– The classification rule is a majority vote among the points in N_k(x)
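
A short sketch of the majority-vote rule for the two-class example, assuming the classes are coded 0/1 as on the terminology slide; the function name and the 0.5 threshold convention are my additions.

```python
import numpy as np

def knn_classify(x0, X_train, g_train, k=15):
    """Classify x0 by majority vote among its k nearest training points.
    Y_hat is the fraction of class-1 points in N_k(x0); threshold at 1/2."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    neighbors = np.argsort(dists)[:k]
    y_hat = g_train[neighbors].mean()   # proportion of class 1 in the neighborhood
    return 1 if y_hat > 0.5 else 0
```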

8 Example – 1-Nearest Neighbor
– With 1-NN classification, no training points are misclassified
– OLS had 3 parameters; does k-NN have only 1 (namely k)?
– In fact, k-NN uses roughly N/k effective parameters

9 Example – Data Scenario
– Data scenario in the previous example:
  – Density for each class: a mixture of 10 Gaussians
  – GREEN points: 10 component means drawn from N((1,0)^T, I)
  – RED points: 10 component means drawn from N((0,1)^T, I)
  – The variance was 0.2 for both classes
– See the book website for the actual data
– The Bayes error is the best possible performance
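
A sketch of one way to generate data matching this scenario. The slide gives the mean distributions and the 0.2 variance; picking one of the 10 components uniformly per observation is an assumption (it is the scheme used in ESL), and the function name and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mixture_class(center, n_obs=100, n_components=10, var=0.2):
    """Draw 10 component means from N(center, I), then generate each
    observation from a uniformly chosen component with covariance var*I."""
    means = rng.multivariate_normal(center, np.eye(2), size=n_components)
    picks = rng.integers(n_components, size=n_obs)
    noise = rng.multivariate_normal([0.0, 0.0], var * np.eye(2), size=n_obs)
    return means[picks] + noise

X_green = make_mixture_class(np.array([1.0, 0.0]))  # GREEN class
X_red   = make_mixture_class(np.array([0.0, 1.0]))  # RED class
```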

10 From OLS to NN…
– Many modern modeling procedures are variants of OLS or k-NN:
  – Kernel smoothers
  – Local linear regression
  – Local basis expansions
  – Projection pursuit and neural networks

11 Statistical Decision Theory
Case 1 – quantitative output Y
– X ∈ ℝ^p: a real-valued random input vector
– L(Y, f(X)): loss function penalizing the prediction error
– The most common choice is squared-error loss: L(Y, f(X)) = (Y − f(X))^2
– Criterion for choosing f: EPE(f) = E(Y − f(X))^2
– The solution is f(x) = E(Y | X = x)
– This is also known as the regression function
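
A brief sketch of the argument behind the last two bullets, following the standard pointwise-minimization step from ESL Chapter 2 (notation mine):

```latex
% Condition EPE on X, then minimize pointwise at each x:
%   EPE(f) = E_X E_{Y|X}[ (Y - f(X))^2 | X ]
\[
  f(x) \;=\; \operatorname*{arg\,min}_{c}\;
             \mathbb{E}\!\left[(Y-c)^2 \,\middle|\, X=x\right]
        \;=\; \mathbb{E}(Y \mid X = x).
\]
```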

12 Statistical Decision Theory (Cont’d)
Case 2 – qualitative output G
– The prediction rule is Ĝ(X), where G and Ĝ(X) take values in the set 𝒢, with |𝒢| = K
– L(k, l): loss incurred for classifying an observation from class G_k as G_l
– Unit cost for any single misclassification: the 0–1 loss
– The expected prediction error is EPE = E[L(G, Ĝ(X))]
– Conditioning on X, the solution is Ĝ(x) = argmin_{g ∈ 𝒢} Σ_{k=1}^{K} L(G_k, g) Pr(G_k | X = x)

13 Statistical Decision Theory (Cont’d)
Case 2 – qualitative output G (cont’d)
– With the 0–1 loss function, the solution is Ĝ(x) = argmin_{g ∈ 𝒢} [1 − Pr(g | X = x)]
– Or, simply, Ĝ(x) = G_k if Pr(G_k | X = x) = max_{g ∈ 𝒢} Pr(g | X = x)
– This is the Bayes classifier: pick the class with the maximum posterior probability at the point x

14 Further Discussion
– k-NN approximates the conditional expectation directly by
  – Replacing expectations with simple averages
  – Relaxing conditioning at a point to conditioning on a region around the point
– As N, k → ∞ such that k/N → 0, the k-NN estimate f̂(x) → E(Y | X = x), and is therefore consistent
– OLS assumes a linear structural form f(X) = X^T β and minimizes the sample version of EPE directly
– As the sample size grows, the coefficient estimate converges to the optimal linear coefficient β_opt = [E(XX^T)]^{-1} E(XY)
– The model is limited by the linearity assumption
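
A short sketch (my notation, following the population version of the argument in ESL) of where the β_opt expression on this slide comes from:

```latex
% Plug the linear form f(x) = x^T beta into EPE and set the gradient to zero:
\[
  \mathrm{EPE}(\beta) = \mathbb{E}\,(Y - X^{T}\beta)^{2},
  \qquad
  \frac{\partial\,\mathrm{EPE}}{\partial\beta}
    = -2\,\mathbb{E}\!\left[X\,(Y - X^{T}\beta)\right] = 0
  \;\Longrightarrow\;
  \beta_{\mathrm{opt}} = \left[\mathbb{E}(XX^{T})\right]^{-1}\mathbb{E}(XY).
\]
```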

15 Example – Bayes Classifier
– Question: how did we build the Bayes classifier for our simulation example?
– Since the generating densities for each class are known, the class posteriors can be computed exactly (see the sketch below)
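
A sketch of the exact Bayes rule for the mixture scenario of slide 9, assuming equal class priors (as in the simulation) and using the fixed component means, e.g. the ones produced by the earlier make_mixture_class sketch. The function names are my additions.

```python
import numpy as np

def mixture_density(x, means, var=0.2):
    """Equal-weight mixture of N(m_k, var*I) densities evaluated at a 2-D point x."""
    d2 = np.sum((means - x) ** 2, axis=1)                # squared distances to each component mean
    comp = np.exp(-d2 / (2 * var)) / (2 * np.pi * var)   # bivariate normal densities
    return comp.mean()

def bayes_classify(x, means_green, means_red):
    """Bayes rule with equal priors: pick the class with the larger
    class-conditional density (equivalently, the larger posterior)."""
    return "GREEN" if mixture_density(x, means_green) >= mixture_density(x, means_red) else "RED"
```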

16 Curse of Dimensionality
– k-NN becomes difficult in higher dimensions:
  – It becomes hard to gather k points close to x_0
  – Neighborhoods become spatially large, so the estimates are biased
  – Reducing the spatial size of the neighborhood means reducing k, and the variance of the estimate increases

17 Example 1 – Curse of Dimensionality
– The sampling density is proportional to N^{1/p}
– If 100 points suffice to estimate a function in ℝ^1, then 100^10 points are needed for the same accuracy in ℝ^10
– Example 1: 1000 training points x_i generated uniformly on [−1,1]^p (no measurement error)
– Given the training set T, use 1-NN to predict y_0 at the point x_0 = 0
– The mean squared error (MSE) for estimating f(0) is MSE(x_0) = E_T[f(x_0) − ŷ_0]^2
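
A small simulation sketch of this example: estimate MSE(0) for the 1-NN prediction as the dimension p grows. The test function f(X) = exp(−8||X||^2) is the one used in ESL's version of this example; the repeat count and function names are my choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Test function from ESL's version of this example: f(X) = exp(-8 ||X||^2)."""
    return np.exp(-8.0 * np.sum(x ** 2, axis=-1))

def mse_at_origin_1nn(p, n_train=1000, n_repeats=200):
    """Monte Carlo estimate of MSE(x_0 = 0) for the 1-NN prediction of f(0),
    with training inputs drawn uniformly on [-1, 1]^p and no measurement error."""
    errs = []
    for _ in range(n_repeats):
        X = rng.uniform(-1.0, 1.0, size=(n_train, p))
        nearest = np.argmin(np.sum(X ** 2, axis=1))   # training point closest to the origin
        y_hat = f(X[nearest])                         # 1-NN prediction at x_0 = 0
        errs.append((f(np.zeros(p)) - y_hat) ** 2)
    return np.mean(errs)

for p in (1, 2, 5, 10):
    print(p, mse_at_origin_1nn(p))                    # MSE grows sharply with p
```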

18 Example 1 – Curse of Dimensionality
– Bias–variance decomposition:
  MSE(x_0) = E_T[f(x_0) − ŷ_0]^2 = E_T[ŷ_0 − E_T(ŷ_0)]^2 + [E_T(ŷ_0) − f(x_0)]^2 = Var_T(ŷ_0) + Bias^2(ŷ_0)

19 Example 2 – Curse of Dimensionality
– If the linear model is correct, or almost correct, NN will do much worse than OLS
– Assuming we know this is the case, simple OLS is essentially unaffected by the dimension

20 Statistical Models
– Y = f(X) + ε (X and ε independent)
– ε is a random additive error with E(ε) = 0
– Pr(Y|X) depends on X only through the conditional mean f(x) = E(Y | X = x)
– This is an approximation to the truth; all unmeasured variables are absorbed into ε
– N realizations: y_i = f(x_i) + ε_i, with the ε_i independent of each other
– The truth is generally more complicated, e.g. Var(Y|X) = σ^2(X)
– Additive error models are not used with qualitative responses
– E.g. for binary trials, E(Y|X=x) = p(x) and Var(Y|X=x) = p(x)[1 − p(x)]
– For qualitative outputs, directly model the conditional probabilities Pr(G | X = x)
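
A tiny sketch of simulating from the additive-error model on this slide. The particular f, error distribution, and σ below are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

f = lambda x: np.sin(4 * x)           # assumed "true" regression function (illustrative)
N, sigma = 100, 0.3                   # sample size and error SD (illustrative)
x = rng.uniform(0.0, 1.0, size=N)
eps = rng.normal(0.0, sigma, size=N)  # independent errors with E(eps) = 0
y = f(x) + eps                        # realizations y_i = f(x_i) + eps_i
```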

21 Function Approximation
– The approximation f_θ(x) has a set of parameters θ
– E.g. the linear model f_θ(x) = x^T β (θ = β), or a linear basis expansion f_θ(x) = Σ_k h_k(x) θ_k
– Estimate θ by minimizing RSS(θ) = Σ_i (y_i − f_θ(x_i))^2
– This assumes a parametric form for f and a loss function
– A more general principle: maximum likelihood (ML)
– E.g. for a random sample y_i, i = 1,…,N from a density Pr_θ(y), the log-probability of the sample is L(θ) = Σ_{i=1}^{N} log Pr_θ(y_i)
– E.g. the multinomial likelihood for a qualitative output G: L(θ) = Σ_{i=1}^{N} log Pr(G = g_i | X = x_i; θ)
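
A sketch of fitting a linear basis expansion by least squares, as in the RSS criterion above. The polynomial basis h_k(x) = x^k and the function names are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def fit_basis_expansion(x, y, K=5):
    """Least-squares fit of f_theta(x) = sum_k h_k(x) * theta_k,
    using polynomial basis functions h_k(x) = x^k (an illustrative choice)."""
    H = np.vander(x, N=K, increasing=True)         # columns h_0(x), ..., h_{K-1}(x)
    theta, *_ = np.linalg.lstsq(H, y, rcond=None)  # minimizes RSS(theta)
    return theta

def predict_basis_expansion(x, theta):
    H = np.vander(x, N=len(theta), increasing=True)
    return H @ theta
```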

22 Example – Least Square Function Approximation

23 Structured Regression Models
– Any function passing through every (x_i, y_i) has RSS = 0
– We therefore need to restrict the class of functions considered
– The restrictions usually impose regular behavior in local neighborhoods
– Any method that attempts to approximate functions varying locally in every direction is "cursed" by dimensionality
– Any method that overcomes the curse has an implicit metric that does not allow the neighborhood to be simultaneously small in all directions

24 Classes of Restricted Estimators
Some of the classes of restricted methods that we cover:
– Roughness penalty and Bayesian methods
– Kernel methods and local regression
  – E.g. for k-NN, K_k(x, x_0) = I(||x − x_0|| ≤ ||x_(k) − x_0||), where x_(k) is the k-th closest training point to x_0
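
A sketch of the k-NN metric kernel from the last bullet, plugged into a kernel-weighted average; with this indicator kernel the weighted average reduces to the ordinary k-NN fit. The function names are mine.

```python
import numpy as np

def knn_kernel(x, x0, X_train, k):
    """K_k(x, x0) = I(||x - x0|| <= ||x_(k) - x0||),
    where x_(k) is the k-th closest training point to x0."""
    radius = np.sort(np.linalg.norm(X_train - x0, axis=1))[k - 1]
    return 1.0 if np.linalg.norm(x - x0) <= radius else 0.0

def kernel_weighted_average(x0, X_train, y_train, k):
    """Kernel-weighted average of the responses at x0."""
    w = np.array([knn_kernel(xi, x0, X_train, k) for xi in X_train])
    return np.sum(w * y_train) / np.sum(w)
```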

25 Model Selection and Bias–Variance Trade-offs
– Many of the flexible methods have a smoothing or complexity parameter:
  – The multiplier of the penalty term
  – The width of the kernel
  – Or the number of basis functions
– We cannot use training RSS to determine this parameter – we would always pick a model that interpolates the training data
– Instead, use the prediction error on unseen test cases to guide the choice (see the sketch below)
– Generally, as model complexity increases, the variance increases and the squared bias decreases (and vice versa)
– Choose the model complexity that trades off bias against variance so as to minimize the test error
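
A minimal sketch of choosing the k-NN complexity parameter by held-out prediction error rather than training RSS. The simple train/validation split, the candidate grid of k values, and the function names are illustrative assumptions.

```python
import numpy as np

def choose_k_by_validation(X, y, ks=(1, 3, 5, 9, 15, 25), val_frac=0.3, seed=0):
    """Pick k for k-NN regression by squared error on a held-out validation set
    (training RSS would always favor k = 1, which interpolates the data)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_frac * len(y))
    val, tr = idx[:n_val], idx[n_val:]

    def knn_predict(x0, k):
        d = np.linalg.norm(X[tr] - x0, axis=1)
        return y[tr][np.argsort(d)[:k]].mean()

    val_err = {k: np.mean([(y[i] - knn_predict(X[i], k)) ** 2 for i in val]) for k in ks}
    return min(val_err, key=val_err.get), val_err
```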

26 Example – Bias–Variance Trade-offs
– k-NN on data from the model Y = f(X) + ε, with E(ε) = 0 and Var(ε) = σ^2
– For nonrandom (fixed) sample points x_i, the test (generalization) error at x_0 is
  EPE_k(x_0) = σ^2 + [f(x_0) − (1/k) Σ_{l=1}^{k} f(x_(l))]^2 + σ^2/k
  i.e. the irreducible error, plus the squared bias, plus the variance of the fit

27 Bias–Variance Trade-offs
– More generally, as the model complexity of our procedure increases, the variance tends to increase and the squared bias tends to decrease
– The opposite behavior occurs as the model complexity is decreased
– In k-NN, the model complexity is controlled by k (smaller k means higher complexity, with N/k effective parameters)
– Choose your model complexity to trade off variance against bias

