Lecture 3 – Overview of Supervised Learning, Rice ELEC 697, Farinaz Koushanfar, Fall 2006

Summary
Variable types and terminology
Two simple approaches to prediction
–Linear models and least squares
–Nearest-neighbor methods
Statistical decision theory
Curse of dimensionality
Structured regression models
Classes of restricted estimators
Reading: Ch. 2, The Elements of Statistical Learning (ESL)

Variable Types and Terminology
Input/output variables
–Quantitative
–Qualitative (categorical, discrete, factors)
–Ordered categorical
Regression: quantitative output
Classification: qualitative output (coded numerically)
Terminology:
–X: input, Y: regression output, G: classification output
–x_i: the i-th observed value of X (scalar or vector)
–Ŷ: prediction of Y, Ĝ: prediction of G

Two Approaches to Prediction
(1) Linear model (OLS)
Given an input vector X = (X_1, ..., X_p), predict Y via Ŷ = β̂_0 + Σ_{j=1}^p X_j β̂_j
With the intercept absorbed into X (constant variable 1 included): Ŷ = X^T β̂
Least squares method: choose β to minimize RSS(β) = Σ_{i=1}^N (y_i − x_i^T β)²
Differentiate w.r.t. β: X^T(y − Xβ) = 0
For X^T X nonsingular: β̂ = (X^T X)^{-1} X^T y
(2) Nearest Neighbors (NN)
The k-NN prediction for Ŷ is Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
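A minimal numpy sketch of the OLS fit above (the data, shapes, and coefficient values are illustrative assumptions, not from the lecture); it solves the least squares problem β̂ = (X^T X)^{-1} X^T y numerically:

```python
import numpy as np

# Hypothetical toy data: N observations, p inputs (names and shapes are illustrative only).
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 2.0 + X @ beta_true + rng.normal(scale=0.3, size=N)

# Absorb the intercept by prepending a constant column of ones.
Xb = np.hstack([np.ones((N, 1)), X])

# Least squares: beta_hat = (X^T X)^{-1} X^T y; lstsq solves the same normal
# equations but is numerically safer than forming the explicit inverse.
beta_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Predict at a new point x0: y_hat = x0^T beta_hat.
x0 = np.array([1.0, 0.2, -0.1, 0.4])   # leading 1 is the intercept term
y_hat = x0 @ beta_hat
print(beta_hat, y_hat)
```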

Example

Example – Linear Model
Example: output G is either GREEN or RED
The two classes are separated by a linear decision boundary
Possible data scenarios:
1- Gaussian, uncorrelated, same variance, different means
2- Each class is a mixture of 10 different Gaussians

Example – 15-Nearest Neighbors
The k-NN prediction for Ŷ is Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
N_k(x) is the neighborhood of x containing the k closest training points
The classification rule is a majority vote among the neighbors in N_k(x)
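A sketch of the majority-vote rule (the two-class toy data, labels, and seed are assumptions for illustration):

```python
import numpy as np

def knn_classify(X_train, g_train, x0, k=15):
    """k-NN majority vote at a query point x0, as described above."""
    # Euclidean distances from x0 to every training point.
    d = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the k nearest neighbors, i.e. N_k(x0).
    nbrs = np.argsort(d)[:k]
    # Majority vote among the neighbors' labels.
    labels, counts = np.unique(g_train[nbrs], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical two-class data (0 = GREEN, 1 = RED is an illustrative coding).
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal([1, 0], 1.0, size=(100, 2)),
                     rng.normal([0, 1], 1.0, size=(100, 2))])
g_train = np.repeat([0, 1], 100)
print(knn_classify(X_train, g_train, np.array([0.5, 0.5]), k=15))
```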

Example – 1-Nearest Neighbor
1-NN classification: no training points are misclassified
OLS had 3 parameters; does k-NN have only 1 (i.e., k)?
In fact, k-NN uses approximately N/k effective parameters

Example – Data Scenario
Data scenario in the previous example:
–Density for each class: a mixture of 10 Gaussians
–GREEN points: 10 means drawn from N((1,0)^T, I)
–RED points: 10 means drawn from N((0,1)^T, I)
–Variance was 0.2 around each mean, for both classes
See the book website for the actual data
The Bayes error is the best possible performance
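A sketch of how such data could be simulated, following the mixture recipe on this slide (the seed, sample sizes, and isotropic 0.2·I observation covariance are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2006)

def sample_class(class_mean, n_obs, n_centers=10, obs_var=0.2):
    """Mixture-of-10-Gaussians class: draw 10 means from N(class_mean, I),
    then draw each observation from N(m_k, obs_var * I) with m_k picked uniformly."""
    centers = rng.multivariate_normal(class_mean, np.eye(2), size=n_centers)
    picks = rng.integers(n_centers, size=n_obs)
    return centers[picks] + rng.normal(scale=np.sqrt(obs_var), size=(n_obs, 2))

X_green = sample_class(np.array([1.0, 0.0]), n_obs=100)   # GREEN class
X_red   = sample_class(np.array([0.0, 1.0]), n_obs=100)   # RED class
X = np.vstack([X_green, X_red])
g = np.repeat([0, 1], 100)                                 # 0 = GREEN, 1 = RED
```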

From OLS to NN…
Many modern modeling procedures are variants of OLS or k-NN:
–Kernel smoothers
–Local linear regression
–Local basis expansions
–Projection pursuit and neural networks

Statistical Decision Theory
Case 1 – quantitative output Y
X ∈ ℝ^p: a real-valued random input vector
L(Y, f(X)): loss function for penalizing prediction errors
Most common choice is the squared-error loss: L(Y, f(X)) = (Y − f(X))²
Criterion for choosing f: EPE(f) = E(Y − f(X))²
The solution is f(x) = E(Y|X = x)
This is also known as the regression function
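A short worked derivation of why the conditional mean solves this criterion (the standard argument: condition on X, then minimize pointwise):

```latex
\mathrm{EPE}(f) = \mathbb{E}\,(Y - f(X))^2
              = \mathbb{E}_X\,\mathbb{E}_{Y\mid X}\!\big[(Y - f(X))^2 \mid X\big],
\qquad\text{so minimize pointwise in } c = f(x):
\qquad
\frac{\partial}{\partial c}\,\mathbb{E}\big[(Y-c)^2 \mid X=x\big]
 = -2\big(\mathbb{E}[Y\mid X=x] - c\big) = 0
\;\Rightarrow\; f(x) = \mathbb{E}[Y \mid X = x].
```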

Statistical Decision Theory (Cont'd)
Case 2 – qualitative output G
Prediction rule is Ĝ(X), where G and Ĝ(X) take values in the set 𝒢 with |𝒢| = K
L(k, l): the loss of classifying an observation of class G_k as G_l
Unit cost for each misclassification: the 0-1 loss
The expected prediction error is EPE = E[L(G, Ĝ(X))]
Conditioning on X, the solution is Ĝ(x) = argmin_{g ∈ 𝒢} Σ_{k=1}^K L(G_k, g) Pr(G_k | X = x)

Statistical Decision Theory (Cont'd)
Case 2 – qualitative output G (cont'd)
With the 0-1 loss function, the solution is Ĝ(x) = argmin_{g ∈ 𝒢} [1 − Pr(g | X = x)]
Or, simply, Ĝ(x) = G_k if Pr(G_k | X = x) = max_{g ∈ 𝒢} Pr(g | X = x)
This is the Bayes classifier: pick the class having the maximum conditional probability at the point x

Further Discussions
k-NN approximates conditional expectations directly by
–Approximating expectations by sample averages
–Relaxing conditioning at a point to conditioning on a region around the point
As N, k → ∞ such that k/N → 0, the k-NN estimate f̂(x) → E(Y|X=x), and is therefore consistent!
OLS assumes a linear structural form f(X) = X^T β, and minimizes the sample version of EPE directly
As the sample size grows, the coefficient estimate converges to the optimal linear fit: β_opt = [E(X X^T)]^{-1} E(X Y)
The model is limited by the linearity assumption

Example - Bayes Classifier Question: how did we build the classifier for our simulation example?
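One possible answer, sketched in code: in the simulation the generating densities are known, so the posterior Pr(G | X = x) — and hence the Bayes rule — can be evaluated exactly from the mixture components. The centers below are placeholders standing in for the 10 centers actually drawn for each class, and equal class priors are assumed:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical known mixture components for each class (placeholders for the
# centers actually drawn in the simulation).
rng = np.random.default_rng(2006)
centers_green = rng.multivariate_normal([1.0, 0.0], np.eye(2), size=10)
centers_red   = rng.multivariate_normal([0.0, 1.0], np.eye(2), size=10)
obs_cov = 0.2 * np.eye(2)

def class_density(x, centers):
    """Equal-weight mixture of N(m_k, 0.2 I) over the 10 centers."""
    return np.mean([multivariate_normal.pdf(x, mean=m, cov=obs_cov) for m in centers])

def bayes_classify(x):
    """Bayes rule with equal class priors: pick the class with the larger density."""
    return "GREEN" if class_density(x, centers_green) >= class_density(x, centers_red) else "RED"

print(bayes_classify(np.array([0.5, 0.5])))
```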

Curse of Dimensionality
k-NN becomes problematic in higher dimensions:
–It becomes difficult to gather k points close to x_0
–Nearest neighborhoods become spatially large, and the estimates are biased
–Reducing the spatial size of the neighborhood means reducing k ⇒ the variance of the estimate increases
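A small simulation (illustrative sample sizes, not from the lecture) of how the distance from a query point to its nearest neighbor grows with the dimension p, for points drawn uniformly on [-1,1]^p:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # training points per trial (assumed, matching the next example)

for p in (1, 2, 5, 10, 20):
    # Median (over trials) distance from x0 = 0 to its nearest neighbor.
    dists = []
    for _ in range(50):
        X = rng.uniform(-1, 1, size=(N, p))
        dists.append(np.linalg.norm(X, axis=1).min())
    print(f"p={p:2d}  median 1-NN distance to origin: {np.median(dists):.3f}")
```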

Example 1 – Curse of Dimensionality
Sampling density is proportional to N^(1/p)
If 100 points suffice to estimate a function in ℝ¹, then 100^10 points are needed for the same sampling density in ℝ¹⁰
Example 1: 1000 training points x_i generated uniformly on [-1,1]^p (no measurement error)
Training set T; use 1-NN to predict y_0 at the point x_0 = 0
The mean squared error (MSE) for estimating f(0) is:
MSE(x_0) = E_T[f(x_0) − ŷ_0]²
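A sketch of this experiment (the target f below is an assumption, a smooth bump in the spirit of the textbook example; everything else follows the setup on the slide): draw the training set T many times, record the 1-NN prediction ŷ_0 at x_0 = 0 each time, and split the Monte Carlo MSE into variance plus squared bias.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(X):
    # Assumed smooth target function (illustrative choice).
    return np.exp(-8.0 * np.sum(X**2, axis=-1))

def one_nn_at_origin(p, N=1000, trials=200):
    """Monte Carlo MSE of the 1-NN estimate of f(0), with bias-variance split."""
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(-1, 1, size=(N, p))          # training inputs
        y = f(X)                                      # no measurement error
        nearest = np.argmin(np.linalg.norm(X, axis=1))
        preds[t] = y[nearest]                         # 1-NN prediction at x0 = 0
    bias = preds.mean() - f(np.zeros((1, p)))[0]
    var = preds.var()
    return var + bias**2, var, bias**2

for p in (1, 2, 5, 10):
    mse, var, bias2 = one_nn_at_origin(p)
    print(f"p={p:2d}  MSE={mse:.3f}  Var={var:.3f}  Bias^2={bias2:.3f}")
```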

Example 1 – Curse of Dimensionality
Bias-variance decomposition:
MSE(x_0) = E_T[f(x_0) − ŷ_0]² = E_T[ŷ_0 − E_T(ŷ_0)]² + [E_T(ŷ_0) − f(x_0)]² = Var_T(ŷ_0) + Bias²(ŷ_0)

Example 2 – Curse of Dimensionality
If the linear model is correct, or almost correct, NN will do much worse than OLS
By exploiting this (assumed) structure, simple OLS is essentially unaffected by the dimension

Statistical Models
Y = f(X) + ε (X and ε independent)
Random additive error ε, with E(ε) = 0
Pr(Y|X) depends on X only through the conditional mean f(x) = E(Y|X=x)
An approximation to the truth; all unmeasured variables are absorbed into ε
N realizations: y_i = f(x_i) + ε_i (ε_i and ε_j independent)
Generally more complicated, e.g. Var(Y|X) = σ²(X)
Additive errors are not used with a qualitative response
E.g. for binary trials, E(Y|X=x) = p(x) and Var(Y|X=x) = p(x)[1 − p(x)]
For qualitative outputs, model the conditional distribution Pr(G|X) directly

Function Approximation
The approximation f_θ has a set of parameters θ
E.g. f_θ(x) = x^T β (θ = β), or a linear basis expansion f_θ(x) = Σ_k h_k(x) θ_k
Estimate θ by minimizing RSS(θ) = Σ_i (y_i − f_θ(x_i))²
Assumes a parametric form for f and a loss function
More general principle: Maximum Likelihood (ML)
E.g. a random sample y_i, i = 1, ..., N, from a density Pr_θ(y)
The log-probability (log-likelihood) of the sample is L(θ) = Σ_{i=1}^N log Pr_θ(y_i)
E.g. the multinomial likelihood for a qualitative output: L(θ) = Σ_{i=1}^N log Pr(G = g_i | X = x_i; θ)
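One worked connection between the two principles (standard result, not stated on the slide): under the additive-error model with Gaussian noise, maximizing the likelihood is equivalent to minimizing RSS.

```latex
% Assume Y = f_\theta(X) + \varepsilon with \varepsilon \sim N(0, \sigma^2):
L(\theta) = \sum_{i=1}^{N} \log \Pr_\theta(y_i \mid x_i)
          = -\frac{N}{2}\log(2\pi\sigma^2)
            - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\bigl(y_i - f_\theta(x_i)\bigr)^2
```

Up to terms that do not involve θ, maximizing L(θ) is therefore the same as minimizing RSS(θ).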

Example – Least Square Function Approximation
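A minimal sketch of least-squares function approximation with a basis expansion (the target, the cubic polynomial basis, and the sample size are assumptions for illustration, not the lecture's figure):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical 1-D data from an additive-error model y = f(x) + eps.
x = rng.uniform(0, 1, size=60)
y = np.sin(4 * x) + rng.normal(scale=0.3, size=x.size)

# Basis expansion f_theta(x) = sum_k h_k(x) theta_k with h_k chosen as 1, x, x^2, x^3.
H = np.vander(x, N=4, increasing=True)          # columns: 1, x, x^2, x^3

# Least squares estimate: minimize RSS(theta) = sum_i (y_i - H_i theta)^2.
theta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)

def f_hat(x_new):
    """Evaluate the fitted basis expansion at new points."""
    return np.vander(np.atleast_1d(x_new), N=4, increasing=True) @ theta_hat

print(theta_hat, f_hat(0.5))
```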

Structured Regression Models
Any function passing through all the (x_i, y_i) pairs has RSS = 0, so we need to restrict the class of functions
Usually the restrictions impose some form of local behavior
Any method that attempts to approximate a locally varying function in high dimensions is "cursed"
Any method that overcomes the curse assumes an implicit metric that does not allow the neighborhood to be simultaneously small in all directions

Classes of Restricted Estimators
Some of the classes of restricted methods that we cover:
Roughness penalty and Bayesian methods
Kernel methods and local regression
–E.g. for k-NN, K_k(x, x_0) = I(||x − x_0|| ≤ ||x_(k) − x_0||), where x_(k) is the k-th closest training point to x_0
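A sketch of a kernel-weighted local average using the k-NN indicator kernel above; with this kernel the fit reduces exactly to the k-NN average (the 1-D toy data are assumptions):

```python
import numpy as np

def knn_kernel_fit(x0, x_train, y_train, k=10):
    """Kernel-weighted average with K_k(x, x0) = I(|x - x0| <= |x_(k) - x0|)."""
    d = np.abs(x_train - x0)
    radius = np.sort(d)[k - 1]            # distance to the k-th nearest point, |x_(k) - x0|
    w = (d <= radius).astype(float)       # indicator kernel weights
    return np.sum(w * y_train) / np.sum(w)

# Hypothetical 1-D training data.
rng = np.random.default_rng(5)
x_train = rng.uniform(0, 1, size=100)
y_train = np.sin(4 * x_train) + rng.normal(scale=0.3, size=100)
print(knn_kernel_fit(0.5, x_train, y_train, k=10))
```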

Model Selection and Bias-Variance Trade-offs
Many of the flexible methods have a smoothing or complexity parameter:
–The multiplier of the penalty term
–The width of the kernel
–Or the number of basis functions
Training RSS cannot be used to choose this parameter: it would always pick a fit that interpolates the training data
Instead, prediction error on unseen test cases should guide the choice (see the sketch below)
Generally, as model complexity increases, the variance increases and the squared bias decreases (and vice versa)
Choose the model complexity to trade off bias against variance so as to minimize the test error
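A minimal sketch of this idea, choosing the k-NN complexity parameter k by prediction error on a held-out set (the data-generating function, sample sizes, and candidate k values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

def make_data(n):
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(4 * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y

def knn_predict(x0, X, y, k):
    nbrs = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
    return y[nbrs].mean()

X_tr, y_tr = make_data(200)     # training set
X_te, y_te = make_data(200)     # held-out test set

for k in (1, 3, 7, 15, 31, 63):
    preds = np.array([knn_predict(x0, X_tr, y_tr, k) for x0 in X_te])
    print(f"k={k:3d}  test MSE = {np.mean((preds - y_te) ** 2):.3f}")
```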

Example – Bias-Variance Trade-offs
k-NN on data with Y = f(X) + ε, E(ε) = 0, Var(ε) = σ²
For nonrandom (fixed) samples x_i, the test (generalization) error at a point x_0 is:
EPE_k(x_0) = E[(Y − f̂_k(x_0))² | X = x_0]
           = σ² + [f(x_0) − (1/k) Σ_{l=1}^k f(x_(l))]² + σ²/k
           = irreducible error + squared bias + variance
where x_(l) denotes the l-th nearest neighbor of x_0
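A Monte Carlo check of this decomposition on a fixed 1-D design (the target f, the noise level σ, and the design points are assumptions): the simulated EPE_k(x_0) is compared against σ² + bias² + σ²/k.

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 0.3

def f(x):
    return np.sin(4 * x)   # assumed target function

def epe_k_at(x0, k, N=100, trials=2000):
    """Monte Carlo EPE_k(x0) for k-NN regression on fixed design points x_i."""
    x = np.linspace(0, 1, N)                       # nonrandom samples x_i
    nbrs = np.argsort(np.abs(x - x0))[:k]          # k nearest design points to x0
    preds = np.empty(trials)
    for t in range(trials):
        y = f(x) + rng.normal(scale=sigma, size=N) # fresh noise for each training draw
        preds[t] = y[nbrs].mean()                  # k-NN estimate at x0
    y0 = f(x0) + rng.normal(scale=sigma, size=trials)   # independent test responses at x0
    mc_epe = np.mean((y0 - preds) ** 2)
    theory = sigma**2 + (f(x0) - f(x[nbrs]).mean())**2 + sigma**2 / k
    return mc_epe, theory

for k in (1, 5, 20, 50):
    mc, th = epe_k_at(0.5, k)
    print(f"k={k:2d}  Monte Carlo EPE={mc:.4f}  sigma^2 + bias^2 + sigma^2/k = {th:.4f}")
```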

Bias-Variance Trade-offs
More generally, as the model complexity of our procedure increases, the variance tends to increase and the squared bias tends to decrease
The opposite behavior occurs as the model complexity is decreased
In k-NN, the model complexity is controlled by k (small k means high complexity)
Choose the model complexity to trade off variance against bias so as to minimize the test error