Last lecture summary

Basic terminology
tasks
– classification
– regression
learner, algorithm
– each has one or several parameters influencing its behavior
model
– one concrete combination of learner and parameters
– tune the parameters using the training set
– the generalization is assessed using the test set (previously unseen data)

learning (training)
– supervised: a target vector t is known; the parameters are tuned to achieve the best match between the prediction and the target vector
– unsupervised: the training data consist of a set of input vectors x without any corresponding target value; e.g. clustering, visualization

for most applications, the original input variables must be preprocessed
– feature selection
– feature extraction
[Figure: an input vector with 784 variables (x1 … x784) is reduced to a few variables either by selection (keeping a subset of the original features) or by extraction (computing new derived features x*1, x*2, …).]

feature selection/extraction = dimensionality reduction
– generally a good thing (curse of dimensionality)
example:
– learner: regression (polynomial, y = w0 + w1x + w2x^2 + w3x^3 + …)
– parameters: the weights (coefficients) w and the order of the polynomial
– the weights are adjusted so that the sum of squared errors SSE (the error function) is as small as possible: SSE = Σ_n (y(x_n, w) − t_n)^2, where y(x_n, w) is the predicted value and t_n is the known target
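A minimal sketch of this setup in Python with NumPy (the target function, noise level and sample size here are invented for illustration): the order M of the polynomial is the learner's parameter, and np.polyfit adjusts the weights w so that the sum of squared errors is minimized.

```python
import numpy as np

# toy data: noisy samples of an assumed target function (illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

M = 3                               # order of the polynomial (learner's parameter)
w = np.polyfit(x, t, deg=M)         # weights minimizing the sum of squared errors
y = np.polyval(w, x)                # predictions of the fitted model

sse = np.sum((y - t) ** 2)          # error function: sum of squared errors
print(f"SSE for M={M}: {sse:.4f}")
```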

New stuff

Model selection and overfitting

RMS – root mean squared error
MSE – mean squared error
To compare errors for data sets of different size, use the root mean squared error (RMS).

Summary of errors
– sum of squared errors: SSE = Σ_n (y(x_n, w) − t_n)^2
– mean squared error: MSE = SSE / N
– root mean squared error: RMS = √MSE
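A short sketch of the three error measures in NumPy (predictions y and targets t are placeholder values):

```python
import numpy as np

def error_summary(y, t):
    """Return SSE, MSE and RMS for predictions y and targets t."""
    residuals = np.asarray(y) - np.asarray(t)
    sse = np.sum(residuals ** 2)      # sum of squared errors
    mse = sse / len(residuals)        # mean squared error
    rms = np.sqrt(mse)                # root mean squared error
    return sse, mse, rms

sse, mse, rms = error_summary([1.1, 1.9, 3.2], [1.0, 2.0, 3.0])
print(sse, mse, rms)
```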

[Figure: RMS error on the training set and on the test set as a function of the polynomial order M.]

The bad result for M = 9 may seem paradoxical, because a polynomial of a given order contains all lower-order polynomials as special cases (the M = 9 polynomial should be at least as good as the M = 3 polynomial). OK, let's examine the values of the coefficients w* for polynomials of various orders.

[Table: values of the coefficients w0*, …, w9* for polynomials of order M = 0, 1, 3 and 9; the individual values are not reproduced here, but the magnitudes of the coefficients grow dramatically as M increases.]

For a given model complexity, the overfitting problem becomes less severe as the size of the data set increases. [Figure: the M = 9 polynomial fitted to N = 15 and to N = 100 data points.] In other words, the larger the data set is, the more complex (flexible) a model can be fitted.

Overfitting in classification

Bias-variance tradeoff
Low-flexibility models (low degree of the polynomial) have large bias and low variance.
– large bias means a large quadratic error of the model
– low variance means that the predictions of the model depend only a little on the particular sample that was used for building the model, i.e. there is little change in the model if the training data set is changed, and thus little change between the predictions for a given x across different models

High-flexibility models have low bias and large variance.
– A large degree makes the polynomial very sensitive to the details of the sample.
– Thus the polynomial changes dramatically upon a change of the data set.
– However, bias is low, as the quadratic error is low.

A polynomial with too few parameters (too low degree) will make large errors because of a large bias. A polynomial with too many parameters (too high degree) will make large errors because of a large variance. The degree of the "best" polynomial must be somewhere "in between" – the bias-variance tradeoff: MSE = variance + bias^2

This phenomenon is not specific to polynomial regression! In fact, it shows up in any kind of model. Generally, the bias-variance tradeoff principle can be stated as follows:
– Models with too few parameters are inaccurate because they are not flexible enough (large bias, large error of the model).
– Models with too many parameters are inaccurate because they overfit the data (large variance, too much sensitivity to the data).
– Identifying the best model requires identifying the proper "model complexity" (number of parameters).
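A small simulation sketch of the tradeoff, using polynomial regression on repeatedly resampled training sets (the "true" function, the noise level and the sample size N = 15 are assumptions for illustration): the low-degree fit barely changes between samples (low variance, high bias), while the M = 9 fit changes dramatically (high variance, low bias).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 15)
x_test = np.linspace(0, 1, 50)
f = lambda z: np.sin(2 * np.pi * z)      # assumed "true" function (illustration only)

for M in (1, 3, 9):                      # model complexity (degree of polynomial)
    preds = []
    for _ in range(200):                 # repeatedly resample the training set
        t = f(x) + rng.normal(scale=0.3, size=x.shape)
        w = np.polyfit(x, t, deg=M)
        preds.append(np.polyval(w, x_test))
    preds = np.array(preds)              # shape: (200 resamples, 50 test points)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"M={M}: bias^2={bias2:.4f}  variance={variance:.4f}")
```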

Test Data and Cross-Validation

[Figure: a data table – each row is an object (instance, sample), each column is an attribute (input/independent variable, feature), and one column holds the class.]

Attribute types
discrete – has only a finite or countably infinite set of values
– nominal (also categorical): the values are just different labels (e.g. ID number, eye color); central tendency given by the mode (median and mean are not defined)
– ordinal: the values reflect an order (e.g. ranking, height in {tall, medium, short}); central tendency given by the median or mode (mean is not defined)
– binary attributes: a special case of discrete attributes
continuous (also quantitative) – has real numbers as attribute values; central tendency given by the mean, plus standard deviation, …
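A hedged illustration with pandas (the toy data frame and column names are invented): the mode is available for any attribute type, the median needs an order, and the mean only makes sense for continuous (quantitative) attributes.

```python
import pandas as pd

# toy data set (illustrative values only)
df = pd.DataFrame({
    "eye_color": ["blue", "brown", "brown", "green"],            # nominal
    "height_cat": pd.Categorical(
        ["short", "medium", "tall", "medium"],
        categories=["short", "medium", "tall"], ordered=True),   # ordinal
    "weight_kg": [61.5, 72.0, 80.3, 68.1],                        # continuous
})

print(df["eye_color"].mode())            # nominal: mode only

# ordinal: the order is meaningful, so the median category can be read off the codes
codes = df["height_cat"].cat.codes
print(df["height_cat"].cat.categories[int(codes.median())])

print(df["weight_kg"].mean(), df["weight_kg"].std())  # continuous: mean, stdev
```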

A regression problem: y = f(x) + noise. Can we learn from this data? Consider three methods. [Figure: scatter plot of the data; taken from the Cross Validation tutorial by Andrew Moore]

Linear regression. What will the regression model look like? y = ax + b, univariate linear regression with a constant term. [Figure: a line fitted to the data; taken from the Cross Validation tutorial by Andrew Moore]

Quadratic regression. What will the regression model look like? y = ax^2 + bx + c. [Figure: a quadratic curve fitted to the data; taken from the Cross Validation tutorial by Andrew Moore]

Join-the-dots. Also known as piecewise linear nonparametric regression, if that makes you feel better. [Figure: the data points connected by line segments; taken from the Cross Validation tutorial by Andrew Moore]

Which is best? Why not choose the method with the best fit to the data? (taken from the Cross Validation tutorial by Andrew Moore)

What do we really want? Why not choose the method with the best fit to the data? How well are you going to predict future data? (taken from the Cross Validation tutorial by Andrew Moore)

The test set method
1. Randomly choose 30% of the data to be in the test set.
2. The remainder is the training set.
3. Perform regression on the training set.
4. Estimate future performance with the test set.
Linear regression: MSE = 2.4 (taken from the Cross Validation tutorial by Andrew Moore)

The test set method (same steps as above). Quadratic regression: MSE = 0.9 (taken from the Cross Validation tutorial by Andrew Moore)

The test set method (same steps as above). Join-the-dots: MSE = 2.2 (taken from the Cross Validation tutorial by Andrew Moore)

Test set method
good news
– very simple
– model selection: choose the method with the best score
bad news
– wastes data (we got an estimate of the best method by using 30% less data)
– if you don't have enough data, the test set may be just lucky/unlucky; the test set estimator of performance has high variance
(taken from the Cross Validation tutorial by Andrew Moore)
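A sketch of the test set method with scikit-learn (the toy data and the 30% split mirror the description above; the join-the-dots model is omitted for brevity, and the resulting numbers will not match the slide's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(0, 3, size=(60, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=60)   # toy data, illustration only

# 1.-2. randomly put 30 % of the data into the test set, the rest is the training set
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

models = {
    "linear":    make_pipeline(PolynomialFeatures(degree=1), LinearRegression()),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}
for name, model in models.items():
    model.fit(x_tr, y_tr)                                 # 3. regression on the training set
    mse = mean_squared_error(y_te, model.predict(x_te))   # 4. estimate future performance
    print(f"{name}: test MSE = {mse:.3f}")
```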

[Figure: training error and testing error as a function of model complexity.] The above examples were for different algorithms; this one is about the model complexity (for a given algorithm).

Stratified division – the same proportion of each class is kept in the training and test sets.

LOOCV (leave-one-out cross-validation)
1. Choose one data point.
2. Remove it from the set.
3. Fit the remaining data points.
4. Note your error.
Repeat these steps for all points. When you are done, report the mean squared error. (taken from the Cross Validation tutorial by Andrew Moore)
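A minimal LOOCV sketch with scikit-learn (toy data again; the MSE values below the figures come from the tutorial, not from this code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import LeaveOneOut

def loocv_mse(model, x, y):
    """Fit the model on all points but one, note the squared error on the
    left-out point, repeat for every point, and report the mean."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(x):
        model.fit(x[train_idx], y[train_idx])
        pred = model.predict(x[test_idx])
        errors.append((pred[0] - y[test_idx][0]) ** 2)
    return np.mean(errors)

# example usage with plain linear regression on toy data
rng = np.random.default_rng(3)
x = rng.uniform(0, 3, size=(20, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=20)
print(loocv_mse(LinearRegression(), x, y))
```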

MSE_LOOCV = 2.12 (taken from the Cross Validation tutorial by Andrew Moore)

MSE LOOCV = taken from Cross Validation tutorial by Andrew Moore

MSE_LOOCV = 3.33 (taken from the Cross Validation tutorial by Andrew Moore)

Which kind of cross-validation? Can we get the best of both worlds? (taken from the Cross Validation tutorial by Andrew Moore)

k-fold cross-validation
Randomly break the data set into k partitions (in our case k = 3).
– Red partition: train on all points not in the red partition; find the test-set sum of errors on the red points.
– Blue partition: train on all points not in the blue partition; find the test-set sum of errors on the blue points.
– Green partition: train on all points not in the green partition; find the test-set sum of errors on the green points.
Then report the mean error. Linear regression: MSE_3-fold = 2.05 (taken from the Cross Validation tutorial by Andrew Moore)
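The same idea sketched with scikit-learn's cross_val_score (the red/blue/green partitions correspond to the k = 3 folds; the data is again a toy stand-in):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(4)
x = rng.uniform(0, 3, size=(30, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=30)

# break the data set into k = 3 partitions; each fold in turn plays the test set
kf = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), x, y,
                         cv=kf, scoring="neg_mean_squared_error")
print("MSE per fold:", -scores)
print("mean 3-fold MSE:", -scores.mean())
```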

Results of 3-fold cross-validation (taken from the Cross Validation tutorial by Andrew Moore)

Which kind of Cross Validation?

Model selection via CV
We are trying to decide which model to use. For polynomial regression, decide on the degree of the polynomial. Train each machine and make a table with columns: degree, MSE_train, MSE_10-fold, choice. Whichever model gave the best CV score: train it with all the data. That's the predictive model you'll use. (taken from the Cross Validation tutorial by Andrew Moore)
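A sketch of filling in such a table: for each candidate degree compute the training MSE and the 10-fold CV MSE, then pick the degree with the best CV score and retrain on all the data (toy data again, so the numbers won't match the slide's):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
x = rng.uniform(0, 3, size=(50, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=50)

results = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse_train = mean_squared_error(y, model.fit(x, y).predict(x))
    mse_cv = -cross_val_score(model, x, y, cv=10,
                              scoring="neg_mean_squared_error").mean()
    results[degree] = (mse_train, mse_cv)
    print(f"degree {degree}: MSE_train={mse_train:.3f}  MSE_10-fold={mse_cv:.3f}")

best = min(results, key=lambda d: results[d][1])   # best CV score wins
final_model = make_pipeline(PolynomialFeatures(best), LinearRegression()).fit(x, y)
print("chosen degree:", best)
```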

Selection and testing
The complete procedure for algorithm selection and estimation of its quality:
1. Divide the data into Train and Test.
2. By cross-validation on the Train set, choose the algorithm.
3. Use this algorithm to construct a classifier using the Train set.
4. Estimate its quality on the Test set.
[Figure: the data split into Train and Test; within Train, further splits into Train and Validation parts.]
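A hedged sketch of this complete procedure (outer train/test split, model chosen by CV on the training part only, final quality estimated on the held-out test set; data and degree range are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(6)
x = rng.uniform(0, 3, size=(80, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)

# 1. divide the data into Train and Test
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=0)

def make_model(d):
    return make_pipeline(PolynomialFeatures(d), LinearRegression())

# 2. choose the algorithm (here: the polynomial degree) by CV on Train only
best_degree = min(
    range(1, 7),
    key=lambda d: -cross_val_score(make_model(d), x_tr, y_tr, cv=5,
                                   scoring="neg_mean_squared_error").mean())

# 3. construct the final model using the whole Train set
final_model = make_model(best_degree).fit(x_tr, y_tr)

# 4. estimate its quality on the untouched Test set
print("chosen degree:", best_degree)
print("test MSE:", mean_squared_error(y_te, final_model.predict(x_te)))
```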

Training error cannot be used as an indicator of the model's performance due to overfitting.
– Training set: train a range of models, or a given model with a range of values for its parameters.
– Compare them on independent data – the validation set.
– If the model design is iterated many times, some overfitting to the validation data can occur, so it may be necessary to keep aside a third set, the test set, on which the performance of the selected model is finally evaluated.

Finally comes our first machine learning algorithm.

Which class (blue or orange) would you predict for this point? And why? [Figure: two classes of points (blue and orange) in the x-y plane, a classification boundary, and an unlabelled query point marked "?".]

And now? The classification boundary is quadratic. [Figure: the same data with a quadratic classification boundary and the query point "?".]

And now? And why? [Figure: the same data and the query point "?".]

Nearest Neighbors Classification

[Figure: the stored training instances.]

But what does "similar" mean? [Figure: panels A, B, C and D; source: Kardi Teknomo's Tutorials]
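"Similar" is usually defined by a distance metric. A minimal k-nearest-neighbor sketch with scikit-learn, using Euclidean distance (the two-class toy points and the query point are invented):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy two-class data set (blue = 0, orange = 1); values are illustrative only
X = np.array([[1.0, 1.2], [1.5, 0.8], [0.9, 1.9],     # blue
              [3.0, 3.2], [3.4, 2.8], [2.9, 3.6]])    # orange
y = np.array([0, 0, 0, 1, 1, 1])

# "similar" = small Euclidean distance; the query point gets the majority
# class among its k nearest training instances
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)
print(knn.predict([[2.0, 2.0]]))   # which class would you predict for this point?
```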