1 Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

2 Chapter 9: Algorithm-Independent Machine Learning (Sections 1-7)
1. Introduction
2. Lack of Inherent Superiority of Any Classifier
3. Bias and Variance
4. Resampling for Estimating Statistics
5. Resampling for Classifier Design
6. Estimating and Comparing Classifiers
7. Combining Classifiers

3 1. Introduction
Algorithm-independent machine learning means:
- Mathematical foundations that do not depend on a particular classifier or learning algorithm
- Techniques that can be used with different classifiers to provide guidance in their use

4 2. Lack of Inherent Superiority of Any Classifier
2.1 No Free Lunch Theorem
There are no a priori reasons to favor one learning or classification method over another.

7 Part 1 – averaged over all target functions, the expected off-training-set error is the same for all learning algorithms
Part 2 – even if we know the training set, no learning algorithm yields an off-training-set error superior to any other
Parts 3 & 4 – similar to Parts 1 & 2 for nonuniform target function distributions

11 2.3 Minimum Description Length (MDL)
Sometimes claimed to justify preferring one classifier over another, "simpler" over "complex" classifiers.
The algorithmic (Kolmogorov-Chaitin) complexity of a binary string x, defined by analogy to entropy, is the length of the shortest program y that, when run on a universal Turing machine U, computes the string x and halts.
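In symbols, a standard way to write this definition (the |y| notation for program length is assumed here rather than taken from the slide):

$$
K(x) \;=\; \min_{\{\,y \,:\, U(y) = x\,\}} |y|
$$

where U is a universal Turing machine and |y| is the length of the program y in bits.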

12 Examples:
- x = a binary string of n 1s: K(x) requires only about log2 n bits, enough to specify the condition for halting
- x = the first n binary digits of the constant π: K(x) does not grow with n
- x = a "truly" random string of n binary digits: K(x) grows with n, the length of x

13 2.4 Minimum Description Length Principle
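The principle is commonly stated as choosing the hypothesis (model) h that minimizes the sum of the model's algorithmic complexity and the description length of the training data D encoded with the help of that model; a sketch in the notation of the complexity definition above:

$$
h^{*} \;=\; \arg\min_{h}\ \bigl[\, K(h) + K(D \text{ using } h) \,\bigr]
$$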

15 2.5 Overfitting Avoidance and Occam's Razor
Although we have mentioned the need to avoid overfitting via regularization, pruning, etc., the No Free Lunch result brings such techniques into question.
Nevertheless, although we cannot prove that they help, these techniques have been found useful in practice.

16 3. Bias and Variance
Bias and variance are two ways to measure the "match" or "alignment" of the learning algorithm to the classification problem:
- Bias measures the accuracy of the match: high bias => poor match
- Variance measures the precision of the match: high variance => weak match
Bias and variance are not independent.

17 3.1 Bias and Variance for Regression
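For reference, the standard bias-variance decomposition of the mean-squared error of a regression estimate g(x; D), trained on a data set D, of a target function F(x) (the slide's own equations are not reproduced in this transcript; the decomposition below is the standard one):

$$
\mathcal{E}_D\!\left[(g(x;D) - F(x))^2\right]
= \underbrace{\left(\mathcal{E}_D[g(x;D)] - F(x)\right)^2}_{\text{bias}^2}
+ \underbrace{\mathcal{E}_D\!\left[\left(g(x;D) - \mathcal{E}_D[g(x;D)]\right)^2\right]}_{\text{variance}}
$$

Low bias means the estimate is accurate on average over data sets; low variance means the estimate changes little from one data set to another.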

18 Regression (figure)

19 3.2 Bias and Variance for Classification
We will skip the mathematics here because it is complex, and just review the figure on the next slide.

20 Classification (figure)

24 4. Resampling for Estimating Statistics
4.1 Jackknife
We have standard methods for estimating the mean and variance of a sample, but not for estimating other statistics such as the median, mode, and bias.
The jackknife and bootstrap methods are two of the most popular and theoretically grounded resampling techniques for extending estimates to arbitrary statistics.
The jackknife method is essentially a leave-one-out procedure for estimating various statistics.
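A minimal sketch of the leave-one-out idea in Python (the function and variable names are hypothetical; the bias and variance formulas are the standard jackknife estimators rather than anything quoted on the slide):

```python
import numpy as np

def jackknife(data, statistic):
    """Leave-one-out (jackknife) estimate of an arbitrary statistic,
    together with the standard jackknife estimates of its bias and variance."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)                          # statistic on the full sample
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    theta_dot = loo.mean()                               # jackknife estimate of the statistic
    bias = (n - 1) * (theta_dot - theta_hat)             # jackknife estimate of the bias
    var = (n - 1) / n * np.sum((loo - theta_dot) ** 2)   # jackknife estimate of the variance
    return theta_dot, bias, var

# Example: jackknife the median, a statistic with no simple closed-form variance
sample = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=30)
est, bias, var = jackknife(sample, np.median)
print(f"median ~ {est:.3f}, bias ~ {bias:.3f}, variance ~ {var:.3f}")
```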

25 4.2 Bootstrap
This method randomly selects B "bootstrap" data sets by repeatedly selecting n points from the training set, with replacement.
Note: "with replacement" means samples can be repeated.
The bootstrap estimate of a statistic is then merely the mean of the B estimates.
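A minimal sketch of the same idea in Python (the helper name `bootstrap_estimate` and the reported standard error are illustrative additions, not part of the slide):

```python
import numpy as np

def bootstrap_estimate(data, statistic, B=1000, rng=None):
    """Bootstrap estimate of a statistic: the mean of the statistic computed
    over B data sets of size n drawn with replacement from `data`."""
    rng = rng or np.random.default_rng()
    data = np.asarray(data)
    n = len(data)
    boots = np.array([statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    return boots.mean(), boots.std(ddof=1)   # estimate and its bootstrap standard error

sample = np.random.default_rng(1).exponential(scale=3.0, size=50)
est, se = bootstrap_estimate(sample, np.median, B=2000)
print(f"bootstrap median ~ {est:.3f} (standard error ~ {se:.3f})")
```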

26 5. Resampling for Classifier Design
The generic term arcing – adaptive reweighting and combining – refers to reusing or selecting data in order to improve classification.
5.1 Bagging (from "bootstrap aggregation")
Uses multiple versions of the training set, each created by drawing n' < n samples from the training set with replacement.
Each bootstrap data set is used to train a different component classifier, and the final decision is based on a vote of the component classifiers.
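A minimal bagging sketch in Python, assuming scikit-learn decision trees as the component classifiers and non-negative integer class labels (both are illustrative choices, not prescribed by the slide):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, sample_frac=0.8, rng=None):
    """Train one component classifier per bootstrap subset of n' < n samples."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    n_prime = int(sample_frac * n)                 # n' < n samples per bootstrap set
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n_prime)     # draw with replacement
        models.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Final decision by majority vote of the component classifiers
    (assumes non-negative integer class labels)."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Usage: models = bagging_fit(X_train, y_train); y_hat = bagging_predict(models, X_test)
```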

27 5.2 Boosting
Classification accuracy is "boosted" by adding component classifiers to form an ensemble with high accuracy on the training set.
The subsets of training data chosen at each step are "most informative" given the current set of component classifiers.
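A minimal sketch of the AdaBoost variant of boosting (scikit-learn decision stumps and labels in {-1, +1} are assumptions made for the example, not requirements stated on the slide):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k_max=50):
    """AdaBoost-style boosting for labels y in {-1, +1}: each round reweights the
    training points so the next component classifier focuses on the currently
    hardest (most informative) examples."""
    n = len(X)
    w = np.full(n, 1.0 / n)                      # start with uniform weights
    models, alphas = [], []
    for _ in range(k_max):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err >= 0.5:                           # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)           # up-weight misclassified points
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Final decision: weighted vote of the component classifiers."""
    return np.sign(sum(a * m.predict(X) for m, a in zip(models, alphas)))
```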

28 Boosting (figure)

29 5.3 Learning with Queries
This is a special case of resampling, also called active learning or interactive learning.
It uses a human "oracle" to label "valuable" patterns.
Two methods of selecting informative patterns:
- Confidence-based: select the pattern for which the two largest discriminant functions have nearly the same value
- Voting-based: for multiclassifier systems, select the pattern yielding the greatest disagreement among the k component classifiers
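A minimal sketch of the first (confidence-based) selection rule in Python; `select_query` and the toy discriminant values are illustrative:

```python
import numpy as np

def select_query(discriminants):
    """Given a (n_patterns, n_classes) array of discriminant values for unlabeled
    patterns, pick the pattern whose two largest discriminants are closest,
    i.e. the most ambiguous pattern to send to the oracle."""
    g = np.asarray(discriminants)
    top2 = np.sort(g, axis=1)[:, -2:]        # two largest discriminants per pattern
    margin = top2[:, 1] - top2[:, 0]         # small margin => informative pattern
    return int(np.argmin(margin))

# Toy example: three unlabeled patterns, three classes
g = np.array([[0.90, 0.05, 0.05],    # confident
              [0.45, 0.44, 0.11],    # ambiguous between the top two classes
              [0.70, 0.20, 0.10]])
print("query the oracle about pattern", select_query(g))   # -> 1
```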

31 5.4 Arcing, Learning with Queries, Bias and Variance
Resampling in general, and learning with queries in particular, seems to violate the assumption of i.i.d. training data, so how can we do better with these techniques?
In learning with queries we are not fitting parameters in a model, but instead seeking decision boundaries more directly.
As the number of component classifiers is increased, techniques like boosting effectively broaden the class of implementable functions.

32 6. Estimating and Comparing Classifiers
There are two main reasons for determining the generalization rate of a classifier:
- To see if the classifier is good enough to be useful
- To compare its performance with competing designs
6.1 Parametric Models
Computing the generalization rate from the assumed parametric model is dangerous:
- Estimates are often overly optimistic – the unrepresentativeness of the training samples is not revealed
- The validity of the model should always be suspect
- It is difficult to compute the error rate for complex distributions

33 6.2 Cross-Validation
Simple validation – split the training set into two parts:
- Training – typically 90% of the data
- Validation – the remaining 10%, used for estimating the generalization error
Stop training when the error on the validation set reaches a minimum (see the figure on the next slide).

34 (figure)

35 6.2 Cross-Validation (continued)
m-fold cross-validation – divide the training set into m disjoint parts of equal size n/m.
The classifier is trained m times, each time with a different part held out for validation, and the estimated performance is the mean of the m tests.
In the limit where m = n, the method is in effect the leave-one-out approach.
The validation error gives an estimate of the classifier's accuracy.
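A minimal m-fold cross-validation sketch in Python (`make_classifier` is a hypothetical factory, e.g. a function returning a fresh scikit-learn estimator; setting m = n gives leave-one-out):

```python
import numpy as np

def m_fold_cv(X, y, make_classifier, m=10, rng=None):
    """Train m times, each time holding out a different fold of size ~n/m for
    validation, and return the mean validation accuracy."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    folds = np.array_split(rng.permutation(n), m)    # m parts of (nearly) equal size
    scores = []
    for k in range(m):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(m) if j != k])
        clf = make_classifier().fit(X[train], y[train])
        scores.append(np.mean(clf.predict(X[val]) == y[val]))
    return float(np.mean(scores))                    # mean of the m validation accuracies
```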

36 For example, if no errors are made on 50 test samples, then with probability 0.95 the true error rate is between zero and 8%.

38 6.3 Jackknife and Bootstrap Estimation of Classification Accuracy
The jackknife approach trains the classifier n times, each time leaving out one training sample, and the estimated classifier accuracy is simply the mean of the n leave-one-out accuracies.
The bootstrap approach trains B classifiers, each on a different bootstrap data set, and estimates the accuracy as the mean of the B bootstrap accuracies.
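A minimal sketch of the jackknife (leave-one-out) accuracy estimate in Python; `make_classifier` is again a hypothetical factory for the classifier being evaluated:

```python
import numpy as np

def jackknife_accuracy(X, y, make_classifier):
    """Train n times, each time leaving out one sample, test on that sample,
    and return the mean of the n single-sample results."""
    n = len(X)
    hits = []
    for i in range(n):
        keep = np.arange(n) != i
        clf = make_classifier().fit(X[keep], y[keep])
        hits.append(clf.predict(X[i:i + 1])[0] == y[i])
    return float(np.mean(hits))

# Usage: acc = jackknife_accuracy(X, y, lambda: SomeClassifier(...))
```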

39 Sections 6.4 Maximum-Likelihood Model Comparison, 6.5 Bayesian Model Comparison, and 6.6 The Problem-Average Error Rate are skipped.

40 6.7 Predicting Final Performance from Learning Curves
Training on large data sets can be computationally intensive, so we would like to use a classifier's performance on a relatively small training set to predict its performance on a large one.
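One common parametric form for such learning curves is a power-law decay toward an asymptotic error; a sketch of this assumption (the symbols below are illustrative, not copied from the slides) is to fit

$$
E_{\text{test}}(n) \;\approx\; E_{\infty} + \frac{b}{n^{\alpha}}, \qquad b, \alpha > 0,
$$

on the small-n results and read off E_∞ as the predicted error for a very large training set.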

43 6.8 The Capacity of a Separating Plane
Of the 2^n possible dichotomies of n points in d dimensions, the fraction f(n, d) that is linearly separable is given by the formula shown below; for n = 4 and d = 1, f(n, d) = 0.5.
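The fraction of linearly separable dichotomies (Cover's function counting result, which is the formula the slide refers to) is

$$
f(n, d) =
\begin{cases}
1 & n \le d + 1,\\[4pt]
\dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i} & n > d + 1.
\end{cases}
$$

As a check, $f(4, 1) = \frac{2}{2^{4}}\left[\binom{3}{0} + \binom{3}{1}\right] = \frac{2 \cdot 4}{16} = 0.5$, the value quoted on the slide.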

44 6.8 The Capacity of a Separating Plane (continued)
For four patterns in one dimension, f(n, d) = 0.5.
The table on the original slide shows all 16 equally likely labelings of four patterns along a line (1D); exactly half of them are linearly separable.

45 7. Combining Classifiers
Combining classifiers works well if each component classifier is an "expert" in a different region of the pattern space.
7.1 Component Classifiers with Discriminant Functions
This approach provides a mixture distribution; the basic architecture is shown on the next slide.

46 (figure: basic architecture for combining component classifiers with discriminant functions)

47 7.2 Component Classifiers without Discriminant Functions
For example, we might combine three classifiers with different kinds of output:
- Neural network => analog values
- k-nearest neighbors => rank order of the classes
- Decision tree => a single output (the chosen class)
Convert the outputs to discriminant values g_i that sum to 1.
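One plausible conversion scheme in Python (an illustration of the idea, not necessarily the book's exact table: analog outputs go through a softmax, rank orders become linearly decreasing scores, and a single class choice becomes a one-hot vector):

```python
import numpy as np

def from_analog(a):
    """Analog outputs (e.g., neural-net activations) -> softmax discriminants."""
    e = np.exp(np.asarray(a) - np.max(a))     # subtract max for numerical stability
    return e / e.sum()

def from_rank_order(ranks):
    """Rank order of c classes (1 = best) -> scores proportional to c - rank + 1."""
    ranks = np.asarray(ranks, dtype=float)
    scores = len(ranks) - ranks + 1.0
    return scores / scores.sum()

def from_single_output(chosen, c):
    """A single predicted class -> a one-hot vector of discriminants."""
    g = np.zeros(c)
    g[chosen] = 1.0
    return g

# Combine three heterogeneous classifiers over c = 3 classes by averaging
g_net  = from_analog([2.0, 0.5, -1.0])   # neural network
g_knn  = from_rank_order([2, 1, 3])      # kNN rank order
g_tree = from_single_output(0, 3)        # decision tree picks class 0
g = (g_net + g_knn + g_tree) / 3.0       # combined discriminants, still sum to 1
print(g, g.sum())
```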

