Algorithm-Independent Machine Learning
Anna Egorova-Förster, University of Lugano
Pattern Classification Reading Group, January 2007
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Algorithm-Independent Machine Learning

So far, different classifiers and methods have been presented. BUT:
- Is some classifier better than all others?
- How can classifiers be compared? Is a comparison possible at all?
- Is at least some classifier always better than random guessing?
AND
- Do techniques exist that boost the performance of any classifier?

No Free Lunch Theorem

For any two learning algorithms $P_1(h|D)$ and $P_2(h|D)$, the following are true, independent of the sampling distribution $P(x)$ and the number $n$ of training points:
1. Uniformly averaged over all target functions $F$, $\mathcal{E}_1(E|F,n) - \mathcal{E}_2(E|F,n) = 0$.
2. For any fixed training set $D$, uniformly averaged over $F$, $\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D) = 0$.
3. Uniformly averaged over all priors $P(F)$, $\mathcal{E}_1(E|n) - \mathcal{E}_2(E|n) = 0$.
4. For any fixed training set $D$, uniformly averaged over $P(F)$, $\mathcal{E}_1(E|D) - \mathcal{E}_2(E|D) = 0$.

No Free Lunch Theorem (illustration)

1. Uniformly averaged over all target functions $F$, $\mathcal{E}_1(E|F,n) - \mathcal{E}_2(E|F,n) = 0$: averaged over all possible target functions, the off-training-set error is the same for every classifier. Possible target functions: all labelings of the off-training-set points that are consistent with the training set $D$.
2. For any fixed training set $D$, uniformly averaged over $F$, $\mathcal{E}_1(E|F,D) - \mathcal{E}_2(E|F,D) = 0$: even if we know the training set $D$, the off-training-set errors are the same.

[Table: patterns x with target values F and the predictions of two hypotheses h1 and h2, split into the training set D and the off-training set.]

A small numerical check of part 1 is sketched below.
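The following minimal sketch (not from the slides; the hypotheses h1 and h2 are made up for illustration) enumerates every possible labeling of a few off-training-set points and averages the off-training-set error of two fixed classifiers. Both averages come out identical, as the theorem predicts.

```python
import itertools

off_train_x = [0, 1, 2, 3]           # hypothetical off-training-set points
h1 = {0: 0, 1: 0, 2: 0, 3: 0}        # classifier 1: always predicts 0
h2 = {0: 1, 1: 0, 2: 1, 3: 0}        # classifier 2: some other fixed rule

def off_training_error(h, target):
    """Fraction of off-training-set points on which hypothesis h disagrees with the target."""
    return sum(h[x] != target[x] for x in off_train_x) / len(off_train_x)

errors_h1, errors_h2 = [], []
# Enumerate every possible target labeling F of the off-training-set points.
for labels in itertools.product([0, 1], repeat=len(off_train_x)):
    F = dict(zip(off_train_x, labels))
    errors_h1.append(off_training_error(h1, F))
    errors_h2.append(off_training_error(h2, F))

print(sum(errors_h1) / len(errors_h1))   # 0.5
print(sum(errors_h2) / len(errors_h2))   # 0.5, identical for any fixed classifier
```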

Consequences of the No Free Lunch Theorem

If no prior information about the target function F(x) is available:
- No classifier is better than any other in the general case.
- No classifier is even better than random guessing in the general case.

Ugly Duckling Theorem: Features and Predicates

- Binary features f_i; patterns (predicates) are built from them: f_1 AND f_2, f_1 OR f_2, etc.
- Rank r of a predicate: the number of simplest (rank-1) patterns it contains.
- Rank 1: x_1 = f_1 AND NOT f_2, x_2 = f_1 AND f_2, x_3 = f_2 AND NOT f_1
- Rank 2: x_1 OR x_2 = f_1
- Rank 3: x_1 OR x_2 OR x_3 = f_1 OR f_2

[Venn diagram of f_1 and f_2 with the regions x_1, x_2, x_3.]

Features with prior information

[Figure]

Features Comparison

To compare two patterns, should we simply count the number of features they share?
- Blind_left = {0, 1}, Blind_right = {0, 1}: is (0, 1) more similar to (1, 0) or to (1, 1)?
- A different representation is also possible: Blind_right = {0, 1}, Both_eyes_same = {0, 1}.

With no prior information about the features, it is impossible to prefer one representation over another.

Ugly Duckling Theorem

Given that we use a finite set of predicates that enables us to distinguish any two patterns under consideration, the number of predicates shared by any two such patterns is constant and independent of the choice of those patterns. Furthermore, if pattern similarity is based on the total number of predicates shared by two patterns, then any two patterns are "equally similar".

An ugly duckling is as similar to beautiful swan 1 as beautiful swan 2 is to beautiful swan 1.

Ugly Duckling Theorem: Counting Shared Predicates

Compare patterns by the number of predicates they share. For two distinct patterns x_i and x_j (out of d distinct patterns in total):
- No shared predicate of rank 1.
- Exactly one shared predicate of rank 2: x_i OR x_j.
- In the general case, the number of shared predicates of rank r is $\binom{d-2}{r-2}$, so the total number of shared predicates is $\sum_{r=2}^{d} \binom{d-2}{r-2} = 2^{d-2}$.

The result is independent of the choice of x_i and x_j!
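A brute-force enumeration can back up this counting argument. The sketch below (illustrative, not from the slides) treats a predicate as a non-empty disjunction, i.e. a non-empty subset, of the d distinguishable patterns and counts how many predicates contain both members of each pair.

```python
from itertools import combinations

d = 5
patterns = range(d)

def shared_predicates(i, j):
    """Count the predicates (non-empty subsets of patterns) containing both x_i and x_j."""
    count = 0
    for r in range(1, d + 1):                      # r = rank = subset size
        for subset in combinations(patterns, r):
            if i in subset and j in subset:
                count += 1
    return count

for i, j in combinations(patterns, 2):
    print(i, j, shared_predicates(i, j))           # every pair prints 8 == 2**(d-2)
```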

Bias and Variance

- Bias measures accuracy: low bias means that, given a training set D, the estimate of F obtained from D is accurate on average.
- Variance measures precision: low variance means that different training sets D lead to only small differences between the estimates of F.
- Low bias usually comes with high variance, and high bias with low variance (the bias-variance trade-off).
- Best case: low bias and low variance, which is only achievable with as much prior information about F(x) as possible.
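The trade-off can be made concrete with a small simulation. This hedged sketch (the target function, noise level, and polynomial degrees are illustrative assumptions, not from the slides) fits a rigid and a flexible model on many random training sets and compares squared bias and variance of their predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
F = np.sin                                # hypothetical target function
x_test = np.linspace(0, np.pi, 50)
n_sets, n_points, noise = 200, 10, 0.1

def bias2_and_variance(degree):
    """Fit a degree-`degree` polynomial on many noisy training sets; return (bias^2, variance)."""
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(0, np.pi, n_points)
        y = F(x) + rng.normal(0, noise, n_points)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - F(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

print("rigid model    (deg 0):", bias2_and_variance(0))   # high bias, low variance
print("flexible model (deg 5):", bias2_and_variance(5))   # low bias, higher variance
```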

Bias and variance

[Figure]

Resampling for Estimating Statistics: Jackknife

- Remove one point from the training set.
- Calculate the statistic of interest on the reduced training set.
- Repeat for every point.
- Combine the leave-one-out values into the jackknife estimate of the statistic (see the sketch below).
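A minimal jackknife sketch (the data and function names are illustrative, not from the slides): leave one point out at a time, recompute the statistic, and use the leave-one-out values to estimate the statistic and its variance.

```python
import numpy as np

def jackknife(data, statistic):
    """Jackknife estimate and variance of `statistic` computed on 1-D `data`."""
    n = len(data)
    loo_values = np.array([
        statistic(np.delete(data, i)) for i in range(n)   # statistic without point i
    ])
    estimate = loo_values.mean()
    variance = (n - 1) / n * np.sum((loo_values - estimate) ** 2)
    return estimate, variance

rng = np.random.default_rng(1)
sample = rng.normal(loc=5.0, scale=2.0, size=30)
print(jackknife(sample, np.mean))    # jackknife estimate and variance of the sample mean
```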

Bagging

- Repeatedly draw n' < n training points (a bootstrap sample) and train a different component classifier on each sample.
- Combine the classifiers' votes into the final result (a sketch follows this list).
- The component classifiers are all of the same type: all neural networks, all decision trees, etc.
- Bagging helps most for unstable classifiers, where small changes in the training set lead to significantly different classifiers and/or results.
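A hedged bagging sketch (not the book's reference code): one copy of a weak classifier is trained per bootstrap sample and the copies are combined by majority vote. The decision stump below is a hypothetical stand-in for any unstable base learner.

```python
import numpy as np
from collections import Counter

class DecisionStump:
    """Threshold a single feature; a deliberately simple, unstable learner."""
    def fit(self, X, y):
        best_err = np.inf
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                pred = (X[:, f] > t).astype(int)
                err = np.mean(pred != y)
                if err < best_err:
                    best_err, self.f, self.t = err, f, t
        return self

    def predict(self, X):
        return (X[:, self.f] > self.t).astype(int)

def bagging_fit(X, y, n_classifiers=15, seed=0):
    """Train one stump per bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample (with replacement)
        models.append(DecisionStump().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the component classifiers."""
    votes = np.array([m.predict(X) for m in models])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

With an unstable base learner, the majority vote is typically more stable than any single classifier trained on the full data set.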

Boosting

- Goal: improve the performance of any type of classifier.
- Weak learner: a classifier whose accuracy is only slightly better than random guessing.
- Example: three component classifiers for a two-class problem. Draw three different training sets D_1, D_2 and D_3 and train three different classifiers C_1, C_2 and C_3 (weak learners).

Boosting: Building the Three Training Sets

- D_1: randomly draw n_1 < n training points from D; train C_1 on D_1.
- D_2: the "most informative" set with respect to C_1, i.e. half of its points are classified correctly by C_1 and half are not. Flip a coin: if heads, find the next pattern in D \ D_1 that C_1 misclassifies; if tails, find the next pattern that C_1 classifies correctly. Continue as long as such patterns exist, then train C_2 on D_2.
- D_3: the most informative set with respect to C_1 and C_2. Randomly select patterns from D \ (D_1 ∪ D_2); if C_1 and C_2 disagree on a pattern, add it to D_3. Train C_3 on D_3.
- Classification of a new pattern: if C_1 and C_2 agree, use their common label; otherwise let C_3 decide (see the sketch below).
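A hedged sketch of this three-classifier scheme (illustrative, not the book's reference implementation; `learner_factory` is any callable returning an object with fit/predict, e.g. the DecisionStump from the bagging sketch above).

```python
import numpy as np

def boost_three(X, y, learner_factory, n1, seed=0):
    """Train three component classifiers on D1, D2, D3 and return a combined predictor."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    d1, rest = idx[:n1], list(idx[n1:])

    c1 = learner_factory().fit(X[d1], y[d1])

    # D2: coin flips pick alternately misclassified / correctly classified patterns,
    # so D2 ends up roughly half right and half wrong for C1.
    correct = [i for i in rest if c1.predict(X[i:i+1])[0] == y[i]]
    wrong   = [i for i in rest if c1.predict(X[i:i+1])[0] != y[i]]
    d2 = []
    while correct or wrong:
        want_wrong = rng.random() < 0.5              # heads: take a misclassified pattern
        pool = wrong if (want_wrong and wrong) or not correct else correct
        d2.append(pool.pop(0))
    c2 = learner_factory().fit(X[d2], y[d2])

    # D3: remaining patterns on which C1 and C2 disagree.
    used = set(int(i) for i in d1) | set(d2)
    d3 = [i for i in range(len(X)) if i not in used
          and c1.predict(X[i:i+1])[0] != c2.predict(X[i:i+1])[0]]
    c3 = learner_factory().fit(X[d3], y[d3]) if d3 else c1

    def predict(X_new):
        p1, p2, p3 = c1.predict(X_new), c2.predict(X_new), c3.predict(X_new)
        return np.where(p1 == p2, p1, p3)            # agree -> common label, else C3 decides
    return predict
```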

Boosting

[Figure]