On the efficacy of Occam's Razor as a model selection criterion for classification learning
Geoff Webb
Monash University
Intelligent Systems
http://www.csse.monash.edu.au/~webb

Occam's razor
- Principle of parsimony
- "Non sunt multiplicanda entia praeter necessitatem": entities are not to be multiplied beyond necessity.
- Modern interpretation: of multiple explanations that are equal in all other respects, prefer the least complex
- Pervasive in Western thought
- Frequently invoked in machine learning

Some observations
- Not propositional!
- "Complex" can mean many things (Bunge, 1963):
  - syntactic: number of words or other syntactic elements required to express the theory
  - semantic: complexity of the meaning of the theory / number of presuppositions it requires
  - epistemological: number of transcendent terms required by the theory
  - pragmatic: complexity of applying the theory

The Occam Thesis
- Blumer, Ehrenfeucht, Haussler and Warmuth (1987): to wield Occam’s razor is to adopt the goal of discovering “the simplest hypothesis that is consistent with the sample data” in the expectation that the simplest hypothesis will “perform well on further observations taken from the same source”.
- Quinlan (1986): “Given a choice between two decision trees, each of which is correct on the training set, it seems preferable to prefer the simpler one on the grounds that it is more likely to capture structure inherent in the problem. The simpler tree would therefore be expected to classify correctly more objects outside the training set.”

My personal de-Occamization
- In the early nineties I lost faith in the Occam Thesis
  - developed a rule learner that found substantially simpler rule sets but did not improve accuracy
  - worked with specific-to-general search, and hence was open to finding complex variants of rules
  - worked with disjunctive rules

Objections to the Occam Thesis
- There is no theoretical relationship between syntactic complexity and classifier accuracy.
- Equivalent classifiers expressed in different languages will have different levels of complexity.
- It is only possible to judge a selection criterion in the context of a performance objective.
- Conservation law of generalisation performance: there are no universal learning biases.
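To illustrate the second objection, here is a small sketch showing one classifier written in two ways. The two expressions and the use of Python's AST node count as a stand-in for "syntactic complexity" are illustrative assumptions, not something taken from the slides.

```python
import ast
from itertools import product

def syntactic_size(expr):
    """Count AST nodes in an expression (an assumed proxy for syntactic complexity)."""
    return sum(1 for _ in ast.walk(ast.parse(expr, mode="eval")))

# Two hypothetical descriptions of the same Boolean classifier (exclusive or).
compact = "a != b"
verbose = "(a and not b) or (not a and b)"

# Check that both expressions agree on every possible input.
same_classifier = all(
    eval(compact, {}, {"a": a, "b": b}) == eval(verbose, {}, {"a": a, "b": b})
    for a, b in product([False, True], repeat=2)
)

print(same_classifier, syntactic_size(compact), syntactic_size(verbose))
```

Both expressions make identical predictions, yet their syntactic size differs, so a preference based on syntactic size depends on the language used to write the classifier down.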

How to convince the community?
- Logic didn’t work
- Murphy and Pazzani (1994): for a number of classification learning tasks, the simplest consistent decision trees have lower predictive accuracy than slightly more complex consistent trees.
  - but most accurate were close to the simplest
  - had same complexity as the ‘true’ class!
- Boosting and bagging
- Bayesian averaging of many simple models

What about …
- A systematic process for adding complexity to the dominant model (decision trees) while improving accuracy without changing resubstitution performance!
- Decision tree grafting

Outline
- Take decision tree formed by conventional learning
- Look for regions of instance space that are not occupied by training examples
- Look for evidence supporting a change in class
- Graft tests and leaves that reclassify the regions appropriately
- To maximize likelihood of improving performance, select only the best such graft for each leaf
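The outline above can be sketched roughly as follows. This is a simplified illustration under assumptions that are not in the slides: only a single ancestor is examined, candidate thresholds are taken at observed attribute values, and the function name `best_graft`, the data layout, and the toy data are all hypothetical.

```python
from collections import Counter

def laplace(correct, total):
    """Laplace accuracy estimate: (correct + 1) / (total + 2)."""
    return (correct + 1) / (total + 2)

def best_graft(leaf_bounds, leaf_class, leaf_points, ancestor_points):
    """Pick the single best graft for one leaf, or None.

    leaf_bounds: {attr: (lo, hi)}, the leaf covers lo < value <= hi.
    leaf_points / ancestor_points: lists of (attribute dict, label).
    """
    correct = sum(1 for _, y in leaf_points if y == leaf_class)
    best, best_score = None, laplace(correct, len(leaf_points))
    for attr, (lo, hi) in leaf_bounds.items():
        occupied = [x[attr] for x, _ in leaf_points]
        min_occ = min(occupied, default=hi)
        max_occ = max(occupied, default=lo)
        for t in sorted({x[attr] for x, _ in ancestor_points}):
            for op in ("<=", ">"):
                if op == "<=":
                    # The cut must leave a non-empty-volume region of the
                    # leaf (lo < value <= t) holding no leaf training points.
                    if not (lo < t < min_occ):
                        continue
                    side = [y for x, y in ancestor_points if x[attr] <= t]
                else:
                    # Mirror case: region t < value <= hi must be unoccupied.
                    if not (max_occ <= t < hi):
                        continue
                    side = [y for x, y in ancestor_points if x[attr] > t]
                if not side:
                    continue
                cls, count = Counter(side).most_common(1)[0]
                score = laplace(count, len(side))
                # Graft only if the evidence favours a different class
                # more strongly than the leaf supports its own class.
                if cls != leaf_class and score > best_score:
                    best, best_score = (attr, op, t, cls, score), score
    return best

inf = float("inf")
# Toy data (hypothetical): the leaf covers A > 5 and holds two "x" points.
leaf_pts = [({"A": 6, "B": 1}, "x"), ({"A": 7, "B": 2}, "x")]
root_pts = leaf_pts + [({"A": 1, "B": 8}, "o"), ({"A": 2, "B": 9}, "o"),
                       ({"A": 3, "B": 7}, "o"), ({"A": 4, "B": 8}, "o")]
graft = best_graft({"A": (5, inf), "B": (-inf, inf)}, "x", leaf_pts, root_pts)
print(graft)
```

In this toy case the cut `B > 2` projects across the unoccupied part of the leaf, and the instances at the root on that side all belong to the other class, so it is the graft selected.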

Example Instance Space [figure]

Guess the Class [figure]

C4.5's Partitions [figure]

Evidence supporting a change of class?
- During learning there will often be multiple potential cuts, of which one is selected on a (fairly) arbitrary basis
- Look for how such a cut would have projected across the empty region and the evidence it would have provided for a different classification

Alternative cuts at root [figure]

Evidence for alternative classifications
- Use Laplace accuracy estimate for the alternative leaves that project through the empty region
- Laplace = (correct + 1) / (total + 2)
- (4+1)/(5+2) = 5/7 ≈ 0.71 vs (9+1)/(9+2) = 10/11 ≈ 0.91
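The arithmetic on this slide can be checked with a few lines of code. The function name `laplace` and the variable names are illustrative; the formula and the two pairs of counts are the ones given on the slide.

```python
def laplace(correct, total):
    """Laplace accuracy estimate: (correct + 1) / (total + 2)."""
    return (correct + 1) / (total + 2)

weaker = laplace(4, 5)    # 4 of 5 correct: (4+1)/(5+2)
stronger = laplace(9, 9)  # 9 of 9 correct: (9+1)/(9+2)

print(round(weaker, 3), round(stronger, 3))
```

The leaf with 9 of 9 correct gets the higher estimate, so it provides the stronger evidence.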

Algorithm PDF

Visit each leaf in turn [figure]

Consider each ancestor [figure]

Consider each cut that projects across empty regions of the leaf
- Cuts on A: A≤7, A≤6, A≤5, A≤4, A≤3, A>6, A>5, A>4, A>3, A>2
- Cuts on B: B≤10, B≤9, B≤8, B≤7, B≤6, B≤5, B≤4, B≤3, B≤2, B≤1, B>0, B>1, B>2, B>3, B>4

Consider each cut that projects across empty regions of the leaf [figure]

Next ancestor [figure]

Root [figure]

A stronger cut, which is selected in preference to the weaker [figure]

Final tree [figure]

Features
- All new partitions define regions with volume > zero containing no objects from training set.
- New cuts are not simple duplications of existing cuts at ancestor nodes.
- Every modification adds non-redundant complexity to the tree.

Experiments
- 100 x 80% / 20% holdout evaluation
- All 11 locally held UCI datasets containing continuous attributes
- 2 variants of hypothyroid subsequently added to examine why its results differed from rest
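A repeated holdout evaluation of this shape might be set up as in the sketch below. The function name `repeated_holdout`, the seed, and the use of simple unstratified shuffling are assumptions; the slides give only the number of trials and the split proportions.

```python
import random

def repeated_holdout(n_objects, n_trials=100, train_fraction=0.8, seed=0):
    """Yield (train_indices, test_indices) for each random holdout trial."""
    rng = random.Random(seed)
    n_train = int(round(n_objects * train_fraction))
    for _ in range(n_trials):
        indices = list(range(n_objects))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]

# Example with 150 objects (the size of the iris data set).
splits = list(repeated_holdout(150))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each trial trains on 80% of the objects and tests on the remaining 20%, and the reported figures would then be averages over the 100 trials.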

UCI data sets used for experimentation

Name                      No. of attrs.  % contin.  % missing  No. of objects  Default acc. %  No. of classes
breast cancer Wisconsin               9        100         <1             699              66               2
Cleveland heart disease              13         46         <1             303              54               2
credit rating                        15         40          1             690              56               2
discordant results                   29         24          6            3772              98               2
echocardiogram                        6         83          3              74              68               2
glass type                            9        100          0             214              40               3
hepatitis                            19         32          6             155              79               2
Hungarian heart disease              13         46         20             295              64               2
hypothyroid                          29         24          6            3772              92               4
iris                                  4        100          0             150              33               3
new thyroid                           5        100          0             215              70               3
Pima indians diabetes                 8        100          0             768              65               2
sick euthyroid                       29         24          6            3772              94               2

Percentage predictive accuracy for unpruned decision trees

Data                      C4.5        C4.5X       t      p
breast cancer Wisconsin   94.1±1.8    94.4±1.7    -3.2   0.002
Cleveland heart disease   72.8±5.0    74.4±4.8    -6.1   0.000
credit rating             82.2±3.4    83.0±3.3    -7.6   0.000
discordant results        98.6±0.5    98.6±0.5    -5.4   0.000
echocardiogram            72.0±9.8    73.5±10.2   -2.8   0.007
glass type                74.0±7.0    75.3±7.2    -4.2   0.000
hepatitis                 79.6±7.1    80.8±6.9    -3.3   0.001
Hungarian heart disease   77.0±5.3    77.4±5.2    -1.8   0.082
hypothyroid               99.5±0.2    99.5±0.2     4.4   0.000
iris                      95.4±3.4    95.7±3.5    -2.2   0.028
new thyroid               89.9±4.2    90.1±4.3    -1.0   0.302
Pima indians diabetes     70.2±3.5    71.3±3.6    -8.1   0.000
sick euthyroid            98.7±0.5    98.7±0.5    -0.0   0.963

Percentage accuracy for pruned decision trees

Data                      C4.5        C4.5X       t      p
breast cancer Wisconsin   95.1±1.7    95.2±1.7    -2.0   0.051
Cleveland heart disease   74.1±5.3    74.8±5.3    -3.7   0.000
credit rating             84.1±3.2    84.6±3.2    -5.3   0.000
discordant results        98.8±0.4    98.8±0.4    -2.6   0.010
echocardiogram            74.2±9.3    75.1±9.8    -1.6   0.118
glass type                74.4±6.9    75.4±6.9    -3.3   0.001
hepatitis                 79.9±6.2    80.7±6.2    -3.0   0.003
Hungarian heart disease   79.2±4.9    79.4±4.8    -1.0   0.310
hypothyroid               99.5±0.2    99.5±0.2     5.4   0.000
iris                      95.4±3.6    95.7±3.7    -1.6   0.109
new thyroid               89.6±4.2    89.8±4.2    -0.8   0.451
Pima indians diabetes     72.2±3.5    72.8±3.5    -5.9   0.000
sick euthyroid            98.7±0.4    98.7±0.4    -0.7   0.480

Size of pruned trees

Data                      C4.5         C4.5X        t       p
breast cancer Wisconsin   19.2±5.0     33.1±8.6     -34.9   0.000
Cleveland heart disease   44.6±8.3     68.3±12.8    -43.6   0.000
credit rating             51.2±14.8    78.4±24.2    -25.8   0.000
discordant results        24.9±5.6     32.5±8.8     -21.1   0.000
echocardiogram            10.4±3.0     14.8±4.8     -21.0   0.000
glass type                36.6±5.5     61.0±9.5     -48.5   0.000
hepatitis                 13.7±4.8     19.8±6.6     -30.7   0.000
Hungarian heart disease   26.8±11.4    41.2±17.3    -22.1   0.000
hypothyroid               23.6±2.9     37.1±5.6     -46.7   0.000
iris                      8.2±1.9      14.8±3.9     -30.3   0.000
new thyroid               14.1±2.7     22.5±4.3     -36.9   0.000
Pima indians diabetes     112.0±16.4   163.9±24.0   -62.5   0.000
sick euthyroid            46.5±5.8     72.6±8.7     -76.7   0.000
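The slides do not say which test produced the t and p columns. Assuming a matched-pairs t-test over the per-trial results, with differences taken as C4.5 minus C4.5X (which would make t negative when C4.5X scores higher), the statistic could be computed as sketched below. The function name `paired_t` and the four accuracy values are made up purely for illustration and are not experimental results.

```python
import math
import statistics

def paired_t(a, b):
    """Matched-pairs t statistic for paired samples a and b (differences a - b)."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Hypothetical per-trial accuracies, NOT taken from the experiments.
c45_acc = [70.0, 72.0, 71.0, 69.0]
c45x_acc = [71.0, 74.0, 72.0, 71.0]

t_stat = paired_t(c45_acc, c45x_acc)
print(round(t_stat, 3))
```

A p value would then come from the t distribution with one fewer degrees of freedom than the number of trials.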

Vindication!
- Substantial increases in complexity
- No change in performance on training data
- Accuracy increased significantly more often than not

Hey, this might actually be useful!
- IJCAI-97
- Allow grafts to correct misclassifications
- Also graft discrete valued attributes
- Add all grafts that pass a significance test
- Graft onto empty nodes by treating them as if occupied by items at parent

Example [figure]

Summary of results
- Substantial increase in complexity
- Small increase in accuracy
- Prune+graft is more effective than graft alone

All-tests-but-one-partition (ATBOP)
- Original approach is computationally expensive
  - must consider every value of every attribute for every ancestor of every leaf
- Instead form a single partition and test grafts within it
- Partition contains all training instances that fail no more than one test on the path to the leaf
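The partition definition above translates fairly directly into code. In this sketch the function name `atbop`, the representation of path tests as Python predicates, and the toy instances are assumptions made for illustration.

```python
def atbop(instances, path_tests):
    """Return the instances that fail at most one test on the path to a leaf."""
    kept = []
    for x in instances:
        failures = sum(1 for test in path_tests if not test(x))
        if failures <= 1:
            kept.append(x)
    return kept

# Hypothetical path to a leaf: A <= 5, then B > 2.
path = [lambda x: x["A"] <= 5, lambda x: x["B"] > 2]
data = [
    {"A": 3, "B": 4},  # passes both tests (reaches the leaf)
    {"A": 7, "B": 4},  # fails only the test on A
    {"A": 3, "B": 1},  # fails only the test on B
    {"A": 7, "B": 1},  # fails both tests, so it is excluded
]
partition = atbop(data, path)
print(len(partition))
```

Candidate grafts for the leaf would then be evaluated against this one partition rather than against every ancestor in turn.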

All-tests-but-one-partition [figure]

Resulting tree [figure]

Data Sets

Experimental treatments
- Grafting improves both pruned and unpruned trees.
- Prune & graft provides highest average accuracy.
- C4.5: C4.5 release 8 pruned trees.
- C4.5x: C4.5 with grafting.
- C4.5a: C4.5 with grafting from ATBOP.

Experimental design
- 10 unstratified 3-fold cross validation experiments for each data set.
- Allows estimation of Kohavi-Wolpert bias and variance
  - using a similar technique
  - all training objects used 20 times for training and 10 times for testing.
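The usage counts on this slide follow from the design: in each 3-fold run every object is tested once and trained on twice, and there are 10 runs. The sketch below confirms this by counting. The function name `repeated_cv_counts`, the seed, and the fold assignment by slicing a shuffled index list are assumptions.

```python
import random
from collections import Counter

def repeated_cv_counts(n_objects, repeats=10, folds=3, seed=0):
    """Count how often each object is used for training and for testing."""
    rng = random.Random(seed)
    train_uses, test_uses = Counter(), Counter()
    for _ in range(repeats):
        indices = list(range(n_objects))
        rng.shuffle(indices)  # unstratified: class labels are ignored
        parts = [indices[k::folds] for k in range(folds)]
        for k, test_fold in enumerate(parts):
            test_uses.update(test_fold)
            for j, other_fold in enumerate(parts):
                if j != k:
                    train_uses.update(other_fold)
    return train_uses, test_uses

train_uses, test_uses = repeated_cv_counts(150)
print(set(train_uses.values()), set(test_uses.values()))
```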

ATBOP Error [figure]

ATBOP Bias [figure]

ATBOP Variance [figure]
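For readers unfamiliar with the decomposition behind the bias and variance results, here is a minimal sketch of the Kohavi-Wolpert terms for zero-one loss. It assumes noise-free labels and plain frequency estimates with no small-sample correction; the slides say only that "a similar technique" was used, so this is not necessarily the exact estimator behind the reported results. The function name and the toy inputs are hypothetical.

```python
from collections import Counter

def kohavi_wolpert(true_labels, predictions):
    """Return (bias squared, variance), averaged over test objects.

    true_labels: the true class of each object.
    predictions: for each object, the list of classes predicted for it
                 across the runs in which it was tested.
    """
    bias_total = 0.0
    variance_total = 0.0
    for y, preds in zip(true_labels, predictions):
        freq = Counter(preds)
        classes = set(freq) | {y}
        p_hat = {c: freq[c] / len(preds) for c in classes}
        # bias^2 = 1/2 * sum over classes of (P(true = c) - P(predicted = c))^2
        bias_total += 0.5 * sum(
            ((1.0 if c == y else 0.0) - p_hat[c]) ** 2 for c in classes
        )
        # variance = 1/2 * (1 - sum over classes of P(predicted = c)^2)
        variance_total += 0.5 * (1.0 - sum(p * p for p in p_hat.values()))
    n = len(true_labels)
    return bias_total / n, variance_total / n

# Toy example: one object of class "a", predicted "a" in half of the runs.
bias_sq, variance = kohavi_wolpert(["a"], [["a", "a", "b", "b"]])
print(bias_sq, variance)
```

Under these assumptions the two terms sum to the error rate, which is 0.5 in the toy example.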

Compare Bagging t=10 (Error) [figure]

Compare Bag t=10 (nodes) [figure]

Conclusions
- Grafting provides strong evidence against the Occam Thesis
- Grafting achieves bagging-like variance reduction without forming a committee.
- Grafting forms less complex classifiers than bagging
  - fewer nodes
  - single directly interpretable structure

Complexity
Merriam-Webster:
Main Entry: 2 com·plex
Pronunciation: käm-'pleks, k&m-', 'käm-"
Function: adjective
Etymology: Latin complexus, past participle of complecti to embrace, comprise (a multitude of objects), from com- + plectere to braid -- more at PLY
1 a : composed of two or more parts : COMPOSITE b (1) of a word : having a bound form as one or more of its immediate constituents (2) of a sentence : consisting of a main clause and one or more subordinate clauses
2 : hard to separate, analyze, or solve
3 : of, concerned with, being, or containing complex numbers
