Intelligent Systems On the efficacy of Occam's Razor as a model selection criterion for classification learning Geoff Webb Monash University

2 Intelligent Systems Occam's razor
- Principle of parsimony: Non sunt multiplicanda entia praeter necessitatem (entities are not to be multiplied beyond necessity).
- Modern interpretation: of multiple explanations that are equal in all other respects, prefer the least complex.
- Pervasive in Western thought.
- Frequently invoked in machine learning.

3 Intelligent Systems Some observations
- Not propositional!
- Complex can mean many things (Bunge, 1963):
  - syntactic: number of words or other syntactic elements required to express the theory
  - semantic: complexity of the meaning of the theory / number of presuppositions it requires
  - epistemological: number of transcendent terms required by the theory
  - pragmatic: complexity of applying the theory

4 Intelligent Systems The Occam Thesis
- Blumer, Ehrenfeucht, Haussler and Warmuth (1987): to wield Occam's razor is to adopt the goal of discovering "the simplest hypothesis that is consistent with the sample data" in the expectation that the simplest hypothesis will "perform well on further observations taken from the same source".
- Quinlan (1986): "Given a choice between two decision trees, each of which is correct on the training set, it seems preferable to prefer the simpler one on the grounds that it is more likely to capture structure inherent in the problem. The simpler tree would therefore be expected to classify correctly more objects outside the training set."

5 Intelligent Systems My personal de-Occamization
- In the early nineties I lost faith in the Occam Thesis:
  - developed a rule learner that found substantially simpler rule sets but did not improve accuracy
  - worked with specific-to-general search, and hence was open to finding complex variants of rules
  - worked with disjunctive rules

6 Intelligent Systems Objections to the Occam Thesis
- There is no theoretical relationship between syntactic complexity and classifier accuracy.
- Equivalent classifiers expressed in different languages will have different levels of complexity.
- It is only possible to judge a selection criterion in the context of a performance objective.
- Conservation law of generalisation performance: there are no universal learning biases.

7 Intelligent Systems How to convince the community?
- Logic didn't work.
- Murphy and Pazzani (1994): for a number of classification learning tasks, the simplest consistent decision trees have lower predictive accuracy than slightly more complex consistent trees.
  - but the most accurate trees were close to the simplest
  - and had the same complexity as the 'true' class!
- Boosting and bagging
- Bayesian averaging of many simple models

8 Intelligent Systems What about …
- A systematic process for adding complexity to the dominant model (decision trees) while improving accuracy without changing resubstitution performance!
- Decision tree grafting

9 Intelligent Systems Outline
- Take decision tree formed by conventional learning
- Look for regions of instance space that are not occupied by training examples
- Look for evidence supporting a change in class
- Graft tests and leaves that reclassify the regions appropriately
- To maximize likelihood of improving performance, select only the best such graft for each leaf

10 Intelligent Systems Example Instance Space

11 Intelligent Systems Guess the Class

12 Intelligent Systems C4.5's Partitions

13 Intelligent Systems Evidence supporting a change of class?
- During learning there will often be multiple potential cuts, of which one is selected on a (fairly) arbitrary basis.
- Look at how such a cut would have projected across the empty region, and at the evidence it would have provided for a different classification.

14 Intelligent Systems Alternative cuts at root

15 Intelligent Systems Evidence for alternative classifications
- Use the Laplace accuracy estimate for the alternative leaves that project through the empty region.
- Laplace = (correct + 1) / (total + 2)
- (4+1)/(5+2) ≈ 0.71 vs (9+1)/(9+2) ≈ 0.91

16 Intelligent Systems Algorithm (shown as a figure in the original slides)
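The algorithm itself appears only as a figure, so the following is a deliberately simplified, runnable Python sketch of the graft-selection step it describes, for a single leaf in a two-attribute (A, B) space. The data, the leaf region and every helper name are illustrative assumptions, not Webb's implementation; the counts are chosen to reproduce the Laplace comparison of slide 15, where the projected evidence (9+1)/(9+2) beats the leaf's own (4+1)/(5+2).

def laplace(correct, total):
    # Laplace accuracy estimate from slide 15
    return (correct + 1) / (total + 2)

# (A, B, class) training instances; the leaf under consideration covers A > 5
data = [
    # inside the leaf (A > 5): four "x" and one "o", all with B <= 3
    (6, 1, "x"), (7, 2, "x"), (8, 3, "x"), (9, 2, "x"), (6.5, 1, "o"),
    # outside the leaf (A <= 5): nine "o", all with B > 3
    (1, 4, "o"), (2, 5, "o"), (3, 6, "o"), (4, 7, "o"), (5, 8, "o"),
    (1, 9, "o"), (2, 8, "o"), (3, 7, "o"), (4, 6, "o"),
]
in_leaf = lambda p: p[0] > 5
leaf_label = "x"
leaf_points = [p for p in data if in_leaf(p)]
leaf_score = laplace(sum(p[2] == leaf_label for p in leaf_points), len(leaf_points))  # (4+1)/(5+2)

best = None
# Candidate cuts on B that project across a region of the leaf containing no training examples
for threshold in sorted({p[1] for p in data}):
    if any(p[1] > threshold for p in leaf_points):    # region beyond the cut must be empty in the leaf
        continue
    evidence = [p for p in data if p[1] > threshold]  # evidence projected from outside the leaf
    if not evidence:
        continue
    for label in {"o", "x"} - {leaf_label}:
        score = laplace(sum(p[2] == label for p in evidence), len(evidence))
        if best is None or score > best[0]:
            best = (score, threshold, label)

if best and best[0] > leaf_score:
    score, threshold, label = best
    print(f"graft: within A > 5, relabel B > {threshold} as '{label}' "
          f"(Laplace {score:.2f} beats the leaf's {leaf_score:.2f})")

Because the grafted region contains no training instances, resubstitution performance is unchanged while the tree gains the new, more complex branch.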

17 Intelligent Systems Visit each leaf in turn

18 Intelligent Systems Consider each ancestor

19 Intelligent Systems Consider each cut that projects across empty regions of the leaf (candidate cuts shown in the figure: A≤3 … A≤7, A>2 … A>6, B≤1 … B≤10, B>0 … B>4)

20 Intelligent Systems Consider each cut that projects across empty regions of the leaf

21 Intelligent Systems Next ancestor

22 Intelligent Systems Root

23 Intelligent Systems A stronger cut, which is selected in preference to the weaker

24 Intelligent Systems Final tree

25 Intelligent Systems Features
- All new partitions define regions with volume > zero containing no objects from the training set.
- New cuts are not simple duplications of existing cuts at ancestor nodes.
- Every modification adds non-redundant complexity to the tree.

26 Intelligent Systems Experiments
- 100 x 80% / 20% holdout evaluation
- All 11 locally held UCI datasets containing continuous attributes
- 2 variants of hypothyroid subsequently added to examine why its results differed from the rest
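As a rough, hedged illustration of this protocol (not the original setup), a 100-repetition 80%/20% holdout loop with scikit-learn, using a built-in dataset and a decision tree as stand-ins for the locally held UCI data and C4.5:

# 100 repetitions of an 80%/20% holdout evaluation (slide 26), sketched with
# scikit-learn; the dataset and classifier are stand-ins, not the original setup.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
accuracies = []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

print(f"{100 * np.mean(accuracies):.1f} ± {100 * np.std(accuracies):.1f}")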

27 Intelligent Systems UCI data sets used for experimentation
[Table: for each data set, the number of attributes, % continuous attributes, % missing values, number of objects, default accuracy (%) and number of classes. Data sets: breast cancer Wisconsin, Cleveland heart disease, credit rating, discordant results, echocardiogram, glass type, hepatitis, Hungarian heart disease, hypothyroid, iris, new thyroid, Pima Indians diabetes, sick euthyroid; the numeric values did not survive transcription.]

28 Intelligent Systems Percentage predictive accuracy for unpruned decision trees
[Table: C4.5 vs C4.5X with t and p values; only the C4.5 means survived transcription: breast cancer Wisconsin 94.1, Cleveland heart disease 72.8, credit rating 82.2, discordant results 98.6, echocardiogram 72.0, glass type 74.0, hepatitis 79.6, Hungarian heart disease 77.0, hypothyroid 99.5, iris 95.4, new thyroid 89.9, Pima Indians diabetes 70.2, sick euthyroid 98.7.]

29 Intelligent Systems Percentage accuracy for pruned decision trees
[Table: C4.5 vs C4.5X with t and p values; only the C4.5 means survived transcription: breast cancer Wisconsin 95.1, Cleveland heart disease 74.1, credit rating 84.1, discordant results 98.8, echocardiogram 74.2, glass type 74.4, hepatitis 79.9, Hungarian heart disease 79.2, hypothyroid 99.5, iris 95.4, new thyroid 89.6, Pima Indians diabetes 72.2, sick euthyroid 98.7.]

30 Intelligent Systems Size of pruned trees
[Table: node counts for C4.5 vs C4.5X; only the C4.5 values survived transcription: breast cancer Wisconsin 19.2, Cleveland heart disease 44.6, credit rating 51.2, discordant results 24.9, echocardiogram 10.4, glass type 36.6, hepatitis 13.7, Hungarian heart disease 26.8, hypothyroid 23.6, iris 8.2, new thyroid 14.1, Pima Indians diabetes 112.0, sick euthyroid 46.5.]

31 Intelligent Systems Vindication!
- Substantial increases in complexity
- No change in performance on training data
- Accuracy increased significantly more often than not

32 Intelligent Systems Hey, this might actually be useful! (IJCAI-97)
- Allow grafts to correct misclassifications
- Also graft discrete-valued attributes
- Add all grafts that pass a significance test
- Graft onto empty nodes by treating them as if occupied by the items at the parent
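The slide does not say which significance test is applied, so the sketch below is only one plausible illustration: a one-sided binomial test asking whether the evidence counts for the alternative class are unlikely under the class distribution already seen at the leaf. The function names, the null probability and the 0.05 threshold are all assumptions, not the IJCAI-97 procedure.

# Illustrative significance filter for grafts (the actual test used in the
# IJCAI-97 work is not given on the slide).
from math import comb

def binomial_tail(successes, n, p):
    # P(X >= successes) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(successes, n + 1))

def graft_is_significant(support, total, null_p, alpha=0.05):
    # Keep a graft only if its evidence is unlikely under the leaf's own distribution
    return binomial_tail(support, total, null_p) < alpha

# Example: 9 of 9 evidence instances favour the alternative class, whose Laplace
# estimate within the leaf is only (1+1)/(5+2) = 2/7
print(graft_is_significant(9, 9, 2 / 7))   # True: (2/7)**9 is far below 0.05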

33 Intelligent Systems Example

34 Intelligent Systems Summary of results
- Substantial increase in complexity
- Small increase in accuracy
- Prune+graft is more effective than graft alone

35 Intelligent Systems All-tests-but-one-partition (ATBOP)
- The original approach is computationally expensive: it must consider every value of every attribute for every ancestor of every leaf.
- Instead, form a single partition and test grafts within it.
- The partition contains all training instances that fail no more than one test on the path to the leaf.
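A tiny runnable sketch of building such a partition; each path test is represented here as a simple predicate over an instance, which is an assumption for illustration rather than the original data structures:

# ATBOP (slide 35): the partition for a leaf contains all training instances
# that fail no more than one test on the path from the root to that leaf.
def atbop_partition(instances, path_tests):
    partition = []
    for x in instances:
        failures = sum(1 for test in path_tests if not test(x))
        if failures <= 1:
            partition.append(x)
    return partition

# Example: leaf reached by the tests A > 5 and B <= 3, instances are (A, B, class)
path = [lambda p: p[0] > 5, lambda p: p[1] <= 3]
data = [(6, 2, "x"), (7, 9, "o"), (2, 1, "o"), (1, 8, "o")]
print(atbop_partition(data, path))   # drops only (1, 8, "o"), which fails both tests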

36 Intelligent Systems All-tests-but-one-partition

37 Intelligent Systems Resulting tree

38 Intelligent Systems Data Sets

39 Intelligent Systems Experimental treatments
- C4.5: C4.5 release 8 pruned trees.
- C4.5X: C4.5 with grafting.
- C4.5A: C4.5 with grafting from ATBOP.
- Grafting improves both pruned and unpruned trees.
- Prune & graft provides the highest average accuracy.

40 Intelligent Systems Experimental design
- 10 unstratified 3-fold cross-validation experiments for each data set.
- Allows estimation of Kohavi-Wolpert bias and variance, using a similar technique.
- All training objects are used 20 times for training and 10 times for testing.
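A hedged sketch of this design with scikit-learn standing in for C4.5: ten runs of unstratified 3-fold cross-validation give ten out-of-training predictions per instance, from which Kohavi-Wolpert bias and variance terms can be estimated (the dataset, classifier and variable names are assumptions for illustration).

# 10 runs of unstratified 3-fold cross-validation (slide 40): every instance is
# classified 10 times by models trained without it, and used 20 times for training.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
preds = [[] for _ in range(len(y))]          # the 10 test-fold predictions per instance

for run in range(10):
    kf = KFold(n_splits=3, shuffle=True, random_state=run)   # unstratified 3-fold CV
    for train_idx, test_idx in kf.split(X):
        model = DecisionTreeClassifier(random_state=run).fit(X[train_idx], y[train_idx])
        for i, p in zip(test_idx, model.predict(X[test_idx])):
            preds[i].append(p)

labels = np.unique(y)
bias2, variance, error = [], [], []
for i, votes in enumerate(preds):
    probs = np.array([votes.count(c) / len(votes) for c in labels])
    true = (labels == y[i]).astype(float)
    bias2.append(0.5 * np.sum((true - probs) ** 2))      # Kohavi-Wolpert squared bias
    variance.append(0.5 * (1.0 - np.sum(probs ** 2)))    # Kohavi-Wolpert variance
    error.append(np.mean([p != y[i] for p in votes]))

print(f"error={np.mean(error):.3f}  bias^2={np.mean(bias2):.3f}  variance={np.mean(variance):.3f}")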

41 Intelligent Systems ATBOP Error

42 Intelligent Systems ATBOP Bias

43 Intelligent Systems ATBOP Variance

44 Intelligent Systems Compare Bagging t=10 (Error)

45 Intelligent Systems Compare Bag t=10 (nodes)

46 Intelligent Systems Conclusions
- Grafting provides strong evidence against the Occam Thesis.
- Grafting achieves bagging-like variance reduction without forming a committee.
- Grafting forms less complex classifiers than bagging:
  - fewer nodes
  - a single, directly interpretable structure

47 Intelligent Systems Complexity
Merriam-Webster:
Main Entry: 2com·plex
Pronunciation: käm-ˈpleks, kəm-ˈ, ˈkäm-ˌ
Function: adjective
Etymology: Latin complexus, past participle of complecti to embrace, comprise (a multitude of objects), from com- + plectere to braid -- more at PLY
1 a : composed of two or more parts : COMPOSITE
  b (1) of a word : having a bound form as one or more of its immediate constituents
    (2) of a sentence : consisting of a main clause and one or more subordinate clauses
2 : hard to separate, analyze, or solve
3 : of, concerned with, being, or containing complex numbers