
1 Intelligent Systems: On the efficacy of Occam's Razor as a model selection criterion for classification learning. Geoff Webb, Monash University

2 Occam's razor
Principle of parsimony: "Non sunt multiplicanda entia praeter necessitatem" (entities are not to be multiplied beyond necessity).
Modern interpretation: of multiple explanations that are equal in all other respects, prefer the least complex.
Pervasive in Western thought.
Frequently invoked in machine learning.

3 Some observations
Not propositional!
"Complex" can mean many things (Bunge, 1963):
syntactic: number of words or other syntactic elements required to express the theory
semantic: complexity of the meaning of the theory / number of presuppositions it requires
epistemological: number of transcendent terms required by the theory
pragmatic: complexity of applying the theory

4 The Occam Thesis
Blumer, Ehrenfeucht, Haussler and Warmuth (1987): to wield Occam's razor is to adopt the goal of discovering "the simplest hypothesis that is consistent with the sample data" in the expectation that the simplest hypothesis will "perform well on further observations taken from the same source".
Quinlan (1986): "Given a choice between two decision trees, each of which is correct on the training set, it seems preferable to prefer the simpler one on the grounds that it is more likely to capture structure inherent in the problem. The simpler tree would therefore be expected to classify correctly more objects outside the training set."

5 My personal de-Occamization
In the early nineties I lost faith in the Occam Thesis:
developed a rule learner that found substantially simpler rule sets but did not improve accuracy
worked with specific-to-general search, and hence was open to finding complex variants of rules
worked with disjunctive rules

6 Objections to the Occam Thesis
There is no theoretical relationship between syntactic complexity and classifier accuracy.
Equivalent classifiers expressed in different languages will have different levels of complexity.
It is only possible to judge a selection criterion in the context of a performance objective.
Conservation law of generalisation performance: there are no universal learning biases.

7 How to convince the community? Logic didn't work.
Murphy and Pazzani (1994): for a number of classification learning tasks, the simplest consistent decision trees have lower predictive accuracy than slightly more complex consistent trees.
But the most accurate trees were close to the simplest, and had the same complexity as the 'true' class!
Boosting and bagging.
Bayesian averaging of many simple models.
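The bagging idea mentioned here (learn many models on bootstrap samples of the training data, then combine them by majority vote) can be sketched with a toy one-dimensional "stump" learner. The learner, data, and names below are illustrative, not from the talk:

```python
import random
from collections import Counter

def bagged_predict(train, x, learn, t=10, rng=None):
    """Bagging: learn t classifiers on bootstrap samples and
    predict by majority vote."""
    rng = rng or random.Random(0)
    votes = []
    for _ in range(t):
        sample = [rng.choice(train) for _ in train]  # bootstrap sample
        votes.append(learn(sample)(x))
    return Counter(votes).most_common(1)[0][0]

def learn_stump(sample):
    """Hypothetical learner: threshold at the sample mean, predicting
    the majority class on each side of the cut."""
    cut = sum(v for v, _ in sample) / len(sample)
    left = Counter(c for v, c in sample if v <= cut)
    right = Counter(c for v, c in sample if v > cut)
    l = left.most_common(1)[0][0] if left else right.most_common(1)[0][0]
    r = right.most_common(1)[0][0] if right else l
    return lambda x: l if x <= cut else r

train = [(1, 'a'), (2, 'a'), (8, 'b'), (9, 'b')]
print(bagged_predict(train, 1.5, learn_stump))  # 'a'
```

The committee's vote smooths over the arbitrary cuts of individual stumps, which is the variance-reduction effect revisited in the conclusions.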

8 What about... a systematic process for adding complexity to the dominant model (decision trees) while improving accuracy, without changing resubstitution performance! Decision tree grafting.

9 Outline
Take the decision tree formed by conventional learning.
Look for regions of instance space that are not occupied by training examples.
Look for evidence supporting a change in class.
Graft tests and leaves that reclassify the regions appropriately.
To maximize the likelihood of improving performance, select only the best such graft for each leaf.
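A minimal sketch of the final selection step, assuming toy data structures (the real algorithm walks a C4.5 tree; the leaf and candidate representations here are invented for illustration, and the Laplace estimate is the one introduced on a later slide):

```python
def laplace(correct, total):
    # Laplace accuracy estimate: (correct + 1) / (total + 2)
    return (correct + 1) / (total + 2)

def select_grafts(leaves):
    """One graft per leaf: keep the best-supported candidate cut
    (by Laplace estimate), but only if it beats the leaf itself.
    `leaves` maps a leaf name to (leaf_correct, leaf_total, candidates),
    where each candidate is (cut, correct, total)."""
    grafts = {}
    for name, (c, t, candidates) in leaves.items():
        best = max(candidates, key=lambda cand: laplace(cand[1], cand[2]),
                   default=None)
        if best and laplace(best[1], best[2]) > laplace(c, t):
            grafts[name] = best[0]
    return grafts

leaves = {"leaf1": (4, 5, [("B > 2", 9, 9), ("A <= 5", 3, 6)])}
print(select_grafts(leaves))  # {'leaf1': 'B > 2'}
```

Choosing only the single best graft per leaf is the talk's hedge for maximizing the likelihood of improvement.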

10 Example Instance Space

11 Guess the Class

12 C4.5's Partitions

13 Evidence supporting a change of class?
During learning there will often be multiple potential cuts, of which one is selected on a (fairly) arbitrary basis.
Look at how such a cut would have projected across the empty region, and the evidence it would have provided for a different classification.

14 Alternative cuts at root

15 Evidence for alternative classifications
Use the Laplace accuracy estimate for the alternative leaves that project through the empty region:
Laplace = (correct + 1) / (total + 2)
(4+1)/(5+2) ≈ 0.71 vs (9+1)/(9+2) ≈ 0.91
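The estimate is straightforward to compute; a minimal sketch of the slide's comparison:

```python
def laplace_accuracy(correct: int, total: int) -> float:
    """Laplace accuracy estimate: (correct + 1) / (total + 2)."""
    return (correct + 1) / (total + 2)

# The slide's two alternative leaves projecting through the empty region:
print(laplace_accuracy(4, 5))  # 5/7  ≈ 0.714
print(laplace_accuracy(9, 9))  # 10/11 ≈ 0.909
```

The +1/+2 correction shrinks small-sample estimates toward 0.5, so a leaf supported by more instances wins even when both are error-free.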

16 Algorithm

17 Visit each leaf in turn

18 Consider each ancestor

19 Consider each cut that projects across empty regions of the leaf
(candidate cuts: A≤7, A≤6, A≤5, A≤4, A≤3; A>6, A>5, A>4, A>3, A>2; B≤10, B≤9, B≤8, B≤7, B≤6, B≤5, B≤4, B≤3, B≤2, B≤1; B>0, B>1, B>2, B>3, B>4)

20 Consider each cut that projects across empty regions of the leaf

21 Next ancestor

22 Root

23 A stronger cut, which is selected in preference to the weaker

24 Final tree

25 Features
All new partitions define regions with volume > zero containing no objects from the training set.
New cuts are not simple duplications of existing cuts at ancestor nodes.
Every modification adds non-redundant complexity to the tree.

26 Experiments
100 × 80%/20% holdout evaluations.
All 11 locally held UCI datasets containing continuous attributes.
2 variants of hypothyroid subsequently added to examine why its results differed from the rest.

27 UCI data sets used for experimentation

Name                      Attrs  % contin  % missing  Objects  Default acc %  Classes
breast cancer Wisconsin     9      100        <1        699         66           2
Cleveland heart disease    13       46        <1        303         54           2
credit rating              15       40         1        690         56           2
discordant results         29       24         6       3772         98           2
echocardiogram              6       83         3         74         68           2
glass type                  9      100         0        214         40           3
hepatitis                  19       32         6        155         79           2
Hungarian heart disease    13       46        20        295         64           2
hypothyroid                29       24         6       3772         92           4
iris                        4      100         0        150         33           3
new thyroid                 5      100         0        215         70           3
Pima indians diabetes       8      100         0        768         65           2
sick euthyroid             29       24         6       3772         94           2

28 Percentage predictive accuracy for unpruned decision trees

Data                       C4.5        C4.5X        t     p
breast cancer Wisconsin    94.1±1.8    94.4±1.7    -3.2  0.002
Cleveland heart disease    72.8±5.0    74.4±4.8    -6.1  0.000
credit rating              82.2±3.4    83.0±3.3    -7.6  0.000
discordant results         98.6±0.5    98.6±0.5    -5.4  0.000
echocardiogram             72.0±9.8    73.5±10.2   -2.8  0.007
glass type                 74.0±7.0    75.3±7.2    -4.2  0.000
hepatitis                  79.6±7.1    80.8±6.9    -3.3  0.001
Hungarian heart disease    77.0±5.3    77.4±5.2    -1.8  0.082
hypothyroid                99.5±0.2    99.5±
iris                       95.4±3.4    95.7±3.5    -2.2  0.028
new thyroid                89.9±4.2    90.1±4.3    -1.0  0.302
Pima indians diabetes      70.2±3.5    71.3±3.6    -8.1  0.000
sick euthyroid             98.7±0.5    98.7±0.5    -0.0  0.963

29 Percentage accuracy for pruned decision trees

Data                       C4.5        C4.5X       t     p
breast cancer Wisconsin    95.1±1.7    95.2±1.7   -2.0  0.051
Cleveland heart disease    74.1±5.3    74.8±5.3   -3.7  0.000
credit rating              84.1±3.2    84.6±3.2   -5.3  0.000
discordant results         98.8±0.4    98.8±0.4   -2.6  0.010
echocardiogram             74.2±9.3    75.1±9.8   -1.6  0.118
glass type                 74.4±6.9    75.4±6.9   -3.3  0.001
hepatitis                  79.9±6.2    80.7±6.2   -3.0  0.003
Hungarian heart disease    79.2±4.9    79.4±4.8   -1.0  0.310
hypothyroid                99.5±0.2    99.5±
iris                       95.4±3.6    95.7±3.7   -1.6  0.109
new thyroid                89.6±4.2    89.8±4.2   -0.8  0.451
Pima indians diabetes      72.2±3.5    72.8±3.5   -5.9  0.000
sick euthyroid             98.7±0.4    98.7±0.4   -0.7  0.480

30 Size of pruned trees

Data                       C4.5         C4.5X         t      p
breast cancer Wisconsin    19.2±5.0     33.1±8.6     -34.9  0.000
Cleveland heart disease    44.6±8.3     68.3±12.8    -43.6  0.000
credit rating              51.2±14.8    78.4±24.2    -25.8  0.000
discordant results         24.9±5.6     32.5±8.8     -21.1  0.000
echocardiogram             10.4±3.0     14.8±4.8     -21.0  0.000
glass type                 36.6±5.5     61.0±9.5     -48.5  0.000
hepatitis                  13.7±4.8     19.8±6.6     -30.7  0.000
Hungarian heart disease    26.8±11.4    41.2±17.3    -22.1  0.000
hypothyroid                23.6±2.9     37.1±5.6     -46.7  0.000
iris                        8.2±1.9     14.8±3.9     -30.3  0.000
new thyroid                14.1±2.7     22.5±4.3     -36.9  0.000
Pima indians diabetes     112.0±16.4   163.9±24.0   -62.5  0.000
sick euthyroid             46.5±5.8     72.6±8.7     -76.7  0.000

31 Vindication!
Substantial increases in complexity.
No change in performance on training data.
Accuracy increased significantly more often than not.

32 Hey, this might actually be useful! (IJCAI-97)
Allow grafts to correct misclassifications.
Also graft discrete-valued attributes.
Add all grafts that pass a significance test.
Graft onto empty nodes by treating them as if occupied by the items at the parent.

33 Example

34 Summary of results
Substantial increase in complexity.
Small increase in accuracy.
Prune+graft is more effective than graft alone.

35 All-tests-but-one-partition (ATBOP)
The original approach is computationally expensive: it must consider every value of every attribute for every ancestor of every leaf.
Instead, form a single partition and test grafts within it.
The partition contains all training instances that fail no more than one test on the path to the leaf.
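The membership test for the partition follows directly from the definition above; the attribute tests and data in this sketch are illustrative:

```python
def atbop_partition(instances, path_tests):
    """All-tests-but-one partition: keep the instances that fail at
    most one of the tests on the path from the root to the leaf."""
    return [x for x in instances
            if sum(not test(x) for test in path_tests) <= 1]

# Toy example: a leaf reached by two tests on attributes (a, b).
tests = [lambda x: x[0] <= 5, lambda x: x[1] > 2]
data = [(3, 4), (7, 4), (7, 1), (3, 1)]
print(atbop_partition(data, tests))  # [(3, 4), (7, 4), (3, 1)]
```

One pass over this single partition replaces the per-ancestor, per-attribute-value search of the original algorithm.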

36 All-tests-but-one-partition

37 Resulting tree

38 Data Sets

39 Experimental treatments
C4.5: C4.5 release 8 pruned trees.
C4.5x: C4.5 with grafting.
C4.5a: C4.5 with grafting from ATBOP.
Findings: grafting improves both pruned and unpruned trees; prune & graft provides the highest average accuracy.

40 Experimental design
10 unstratified 3-fold cross-validation experiments for each data set.
Allows estimation of Kohavi-Wolpert bias and variance using a similar technique.
All training objects are used 20 times for training and 10 times for testing.
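The counting claim can be checked with a small simulation: in each of the 10 repetitions of an unstratified 3-fold split, every object lands in the test fold exactly once (and in the training set twice), giving 10 test uses and 20 training uses. A sketch, with the fold-assignment scheme invented for illustration:

```python
import random

def three_fold_indices(n, rng):
    """One unstratified 3-fold split: shuffle the indices and deal
    them into three folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::3] for i in range(3)]

n = 30
rng = random.Random(0)
train_uses = [0] * n
test_uses = [0] * n
for _ in range(10):                       # 10 repetitions
    for test_fold in three_fold_indices(n, rng):
        for i in range(n):
            if i in test_fold:
                test_uses[i] += 1
            else:
                train_uses[i] += 1

print(set(train_uses), set(test_uses))    # {20} {10}
```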

41 ATBOP Error

42 ATBOP Bias

43 ATBOP Variance

44 Compare Bagging t=10 (Error)

45 Compare Bagging t=10 (nodes)

46 Conclusions
Grafting provides strong evidence against the Occam Thesis.
Grafting achieves bagging-like variance reduction without forming a committee.
Grafting forms less complex classifiers than bagging: fewer nodes; a single, directly interpretable structure.

47 Complexity
Merriam-Webster: com·plex. Pronunciation: käm-ˈpleks, kəm-ˈpleks, ˈkäm-ˌpleks. Function: adjective. Etymology: Latin complexus, past participle of complecti, to embrace, comprise (a multitude of objects), from com- + plectere, to braid.
1 a: composed of two or more parts: composite. b (1) of a word: having a bound form as one or more of its immediate constituents. (2) of a sentence: consisting of a main clause and one or more subordinate clauses.
2: hard to separate, analyze, or solve.
3: of, concerned with, being, or containing complex numbers.
