Published by Conor Stamp. Modified over 2 years ago.

1
L. Orseau, Induction of decision trees
Induction of Decision Trees
Laurent Orseau (laurent.orseau@agroparistech.fr), AgroParisTech
Based on slides by Antoine Cornuéjols

2
Task: learning a discrimination function for patterns of several classes
Protocol: supervised learning by greedy iterative approximation
Criterion of success: classification error rate
Inputs: attribute-value data (space with N dimensions)
Target functions: decision trees

3
1- Decision trees: example
Decision trees are classifiers for attribute/value instances.
A node of the tree tests an attribute.
There is a branch for each value of the tested attribute.
The leaves specify the categories (two or more).
[Figure: example decision tree]
pain?
  abdomen -> appendicitis
  throat -> fever?
    yes -> a cold
    no -> sore throat
  chest -> infarction
  none -> cough?
    yes -> fever?
      yes -> a cold
      no -> a chill
    no -> nothing

4
1- Decision trees: the problem
Each instance is described by an attribute/value vector.
Input: a set of instances with their class (given by an expert).
The learning algorithm must build a decision tree, e.g. a decision tree for diagnosis (a common application in Machine Learning).

        Cough  Fever  Weight  Pain
Marie   no     yes    normal  throat
Fred    no     yes    normal  abdomen
Julie   yes    yes    thin    none
Elvis   yes    no     obese   chest

        Cough  Fever  Weight  Pain     Diagnosis
Marie   no     yes    normal  throat   a cold
Fred    no     yes    normal  abdomen  appendicitis
.....

5
1- Decision trees: expressive power
The choice of the attributes is very important!
If a crucial attribute is not represented, it is not possible to induce a good decision tree.
If two instances have the same representation but belong to two different classes, the language of the instances (attributes) is said to be inadequate.

        Cough  Fever  Weight  Pain     Diagnosis
Marie   no     yes    normal  abdomen  a cold
Polo    no     yes    normal  abdomen  appendicitis
.....
(inadequate language)

6
1- Decision trees: expressive power
Any boolean function can be represented with a decision tree.
–Note: with 6 boolean attributes, there are 2^(2^6) = 2^64, i.e. about 1.8*10^19, boolean functions…
Depending on the function to represent, the trees are more or less large.
E.g. the "parity" and "majority" functions: exponential growth.
Sometimes a single node is enough.
Limited to propositional logic (only attribute-value, no relations).
A tree can be represented by a disjunction of rules. For DT4:
(IF Feathers = no THEN Class = not-bird)
OR (IF Feathers = yes AND Color = brown THEN Class = not-bird)
OR (IF Feathers = yes AND Color = B&W THEN Class = bird)
OR (IF Feathers = yes AND Color = yellow THEN Class = bird)

7
2- Decision trees: choice of a tree

         Color   Wings  Feathers  Sonar  Concept
Falcon   yellow  yes    yes       no     bird
Pigeon   B&W     yes    yes       no     bird
Bat      brown   yes    no        yes    not bird

Four decision trees consistent with the data:
DT1: Feathers? yes -> bird; no -> not bird
DT2: Sonar? yes -> not bird; no -> bird
DT3: Color? brown -> not bird; yellow -> bird; B&W -> bird
DT4: Feathers? no -> not bird; yes -> Color? (brown -> not bird; yellow -> bird; B&W -> bird)

8
2- Decision trees: the choice of a tree
When the language is adequate, it is always possible to build a decision tree that correctly classifies all the training examples.
There are often many correct decision trees.
Enumerating all trees to pick the best one is not feasible (the problem is NP-complete for binary trees).
How to give a value to a tree?
Requires a constructive iterative method.

9
2- What model for generalization?
Among all possible coherent hypotheses, which one to choose for a good generalization?
Is the intuitive answer... confirmed by theory?
Some learnability theory [Vapnik,82,89,95]:
Consistency of empirical risk minimization (ERM)
Principle of structural risk minimization (SRM)
In short, trees must be short.
How? Methods of induction of decision trees.

10
3- Induction of decision trees: Example [Quinlan,86]

Attributes        Pif                 Temp            Humid         Wind         class
Possible values   sunny,cloudy,rain   hot,warm,cool   normal,high   true,false

11
3- Induction of decision trees
Strategy: top-down induction (TDIDT)
Best-first search, no backtracking, with an evaluation function.
Recursive choice of an attribute to test, until a stopping criterion is met.
Operation:
Choose the first attribute as the root of the tree: the most informative one.
Then iterate with the same operation on all sub-nodes.
-> recursive algorithm

12
3- Induction of decision trees: example
If we choose attribute Temp?...
Initial set: + J3,J4,J5,J7,J9,J10,J11,J12,J13   - J1,J2,J6,J8,J14
Temp = hot:  + J3,J13            - J1,J2
Temp = warm: + J4,J10,J11,J12    - J8,J14
Temp = cool: + J5,J7,J9          - J6

13
3- Induction of decision trees: TDIDT algorithm
PROCEDURE AAD(T,E)
IF all examples of E are in the same class Ci
THEN label the current node with Ci. END
ELSE select an attribute A with values v1...vn
     Partition E with v1...vn into E1,...,En
     For j=1 to n: AAD(Tj, Ej)
[Figure: node T tests attribute A = {v1...vn}; E = E1 ∪ ... ∪ En; branch vj leads to subtree Tj, built from Ej]
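The AAD procedure can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the data layout, the names `tdidt` and `classify`, the fallback to the majority class when no attribute is left, and the placeholder selection function are all illustrative choices, not part of the slides.

```python
from collections import Counter


def tdidt(examples, attributes, choose_attribute):
    """Recursive top-down induction, following the AAD(T, E) procedure.

    examples: list of (dict attribute -> value, class label)
    attributes: list of attribute names still available for testing
    choose_attribute: function (examples, attributes) -> attribute name
    """
    labels = [label for _, label in examples]
    # Stop when the node is pure, or when no attribute is left to test
    # (the second case is an added safeguard, not in the pseudocode).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attribute = choose_attribute(examples, attributes)
    remaining = [a for a in attributes if a != attribute]
    # Partition E into E_1..E_n according to the values v_1..v_n of A.
    partition = {}
    for features, label in examples:
        partition.setdefault(features[attribute], []).append((features, label))
    return (attribute, {value: tdidt(subset, remaining, choose_attribute)
                        for value, subset in partition.items()})


def classify(tree, features):
    """Follow the branches matching the instance until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[features[attribute]]
    return tree


# The three animals of the bird example; the selection function is a
# placeholder that simply takes the first available attribute.
data = [
    ({"Color": "yellow", "Feathers": "yes", "Sonar": "no"}, "bird"),
    ({"Color": "B&W", "Feathers": "yes", "Sonar": "no"}, "bird"),
    ({"Color": "brown", "Feathers": "no", "Sonar": "yes"}, "not bird"),
]
tree = tdidt(data, ["Feathers", "Color", "Sonar"], lambda ex, attrs: attrs[0])
print(tree)
```

Passing the selection function as a parameter keeps the recursion independent of the criterion used to rank attributes, so an entropy-based or Gini-based choice can be plugged in without changing `tdidt`.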

14
3- Induction of decision trees: selection of attribute
Initial set: + J3,J4,J5,J7,J9,J10,J11,J12,J13   - J1,J2,J6,J8,J14
Test on Wind?
  true:  + J7,J11,J12             - J2,J6,J14
  false: + J3,J4,J5,J9,J10,J13    - J1,J8
Test on Pif?
  cloudy: + J3,J7,J12,J13         - (none)
  rain:   + J4,J5,J10             - J6,J14
  sunny:  + J9,J11                - J1,J2,J8

15
3- Selection of a good test attribute
How to build a "simple" tree?
Simple tree: minimize the expected number of tests needed to classify a new object.
How to translate this global criterion into a local choice procedure?
Criteria to choose a node:
We don't know how to associate a local criterion with the global objective criterion.
Use of heuristics.
Notion of a measure of "impurity":
–Gini index
–Entropic criterion (ID3, C4.5, C5.0)
–...

16
3- Measure of impurity: the Gini index
Ideally:
Null measure if all populations are homogeneous
Maximal measure if the populations are maximally mixed
Gini index [Breiman et al.,84]
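The formula of the Gini index did not survive in this transcript. The sketch below assumes the usual definition, 1 - sum_i p_i^2 over the class probabilities p_i, and checks the two properties listed on the slide; the function name `gini` is an illustrative choice.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


print(gini(["+", "+", "+", "+"]))  # homogeneous population
print(gini(["+", "+", "-", "-"]))  # maximally mixed, two classes
```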

17
3- The entropic criterion (1/3)
Boltzmann's entropy... and Shannon's entropy
Shannon, 1949, proposed a measure of entropy for discrete probability distributions.
It expresses the quantity of information, i.e. the average number of bits needed to specify an outcome drawn from the distribution.
Information entropy: $I = -\sum_i p_i \log_2 p_i$
where $p_i$ is the probability of class $C_i$.

18
3- The entropic criterion (2/3)
Information entropy of S (in C classes): $I(S) = -\sum_{i=1}^{C} p(c_i) \log_2 p(c_i)$
$p(c_i)$: probability of the class $c_i$
Null when there is only one class.
The more equiprobable the classes are, the higher I(S).
$I(S) = \log_2(k)$ when the k classes are equiprobable.
Unit: the bit of information.
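The properties of I(S) listed above can be checked numerically. This is a minimal sketch, with class probabilities estimated by frequencies in a list of labels; the function name `entropy` is an illustrative choice.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the class distribution of `labels`."""
    total = len(labels)
    probabilities = [count / total for count in Counter(labels).values()]
    # Absent classes never appear in the Counter, so log2(0) is avoided.
    return sum(-p * math.log2(p) for p in probabilities)


print(entropy(["a"] * 8))                 # a single class
print(entropy(["a", "b", "c", "d"] * 2))  # k = 4 equiprobable classes
```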

19
3- The entropic criterion (3/3): case of two classes
For C=2: $I(S) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$
By hypothesis, $p_+ = p/(p+n)$ and $p_- = n/(p+n)$.
Thus $I(S) = -\frac{p}{p+n} \log_2\frac{p}{p+n} - \frac{n}{p+n} \log_2\frac{n}{p+n}$
and $I(S) = -P \log_2 P - (1-P) \log_2(1-P)$ with $P = p/(p+n)$.
[Figure: I(S) as a function of P; the maximum is reached at P = p/(p+n) = n/(n+p) = 0.5 (equiprobability)]

20
3- Entropic gain associated with an attribute
$Gain(S,A) = I(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} I(S_v)$
$|S_v|$: size of the sub-population in the branch v of A
How informative is the knowledge of the value of attribute A about the class of an example?

21
3- Example (1/4)
Entropy of the initial set of examples:
I(p,n) = - 9/14 log2(9/14) - 5/14 log2(5/14)
Entropy of the subtrees associated with the test on Pif?
p1 = 4, n1 = 0: I(p1,n1) = 0
p2 = 2, n2 = 3: I(p2,n2) = 0.971
p3 = 3, n3 = 2: I(p3,n3) = 0.971
Entropy of the subtrees associated with the test on Temp?
p1 = 2, n1 = 2: I(p1,n1) = 1
p2 = 4, n2 = 2: I(p2,n2) = 0.918
p3 = 3, n3 = 1: I(p3,n3) = 0.811

22
3- Example (2/4)
[Figure: attribute A splits N objects (n+p=N, entropy I(S)) into three branches]
val1: N1 objects, n1+p1=N1
val2: N2 objects, n2+p2=N2
val3: N3 objects, n3+p3=N3
N1+N2+N3=N
E(N,A) = N1/N x I(p1,n1) + N2/N x I(p2,n2) + N3/N x I(p3,n3)
Information gain of A: GAIN(A) = I(S) - E(N,A)

23
3- Example (3/4)
For the initial examples:
I(S) = - 9/14 log2(9/14) - 5/14 log2(5/14)
Entropy of the tree associated with the test on Pif?
E(Pif) = 4/14 I(p1,n1) + 5/14 I(p2,n2) + 5/14 I(p3,n3)
Gain(Pif) = 0.940 - 0.694 = 0.246 bits
Gain(Temp) = 0.029 bits
Gain(Humid) = 0.151 bits
Gain(Wind) = 0.048 bits
Choice of attribute Pif for the first test
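The gains above can be recomputed from the per-branch counts of positive and negative examples. The sketch below is only an illustration: the function names and the `splits` dictionary are assumptions, and the counts for Humid are left out because this transcript does not list them. Since the slide subtracts rounded intermediate values, an unrounded computation may differ from it in the last digit.

```python
import math


def I(p, n):
    """Entropy in bits of a set with p positive and n negative examples."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c > 0)


def gain(branches):
    """Information gain of a test whose branches hold (p_i, n_i) examples."""
    p = sum(pi for pi, _ in branches)
    n = sum(ni for _, ni in branches)
    expected = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in branches)
    return I(p, n) - expected


# (positive, negative) counts per branch, read off the slides.
splits = {
    "Pif": [(4, 0), (2, 3), (3, 2)],
    "Temp": [(2, 2), (4, 2), (3, 1)],
    "Wind": [(3, 3), (6, 2)],
}
for name, branches in splits.items():
    print(f"Gain({name}) = {gain(branches):.3f} bits")
```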

24
3- Example (4/4)
Final built tree:
Pif?
  cloudy -> play
  rain -> Wind?
    yes -> don't play
    no -> play
  sunny -> Humid?
    normal -> play
    high -> don't play

25
3- Some TDIDT systems
Input: vector of attribute-values associated with each example
Output: decision tree
CLS (Hunt, 1966) [data analysis]
ID3 (Quinlan 1979)
ACLS (Paterson & Niblett 1983)
ASSISTANT (Bratko 1984)
C4.5 (Quinlan 1986)
CART (Breiman, Friedman, Olshen, Stone, 1984)

26
4- Potential problems
1. Continuous value attributes
2. Attributes with different branching factors
3. Missing values
4. Overfitting
5. Greedy search
6. Choice of attributes
7. Variance of results: different trees from similar data

27
4.1. Discretization of continuous attribute values

Temp.       6°C  8°C  14°C | 18°C  20°C  28°C | 32°C
Play golf   No              | Yes              | No

Here, two possible thresholds: 16°C and 30°C.
The attribute Temp > 16°C is the most informative, and is kept.
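A threshold search of this kind can be sketched as below. The sketch assumes the slide is read as "no" for 6, 8 and 14°C, "yes" for 18, 20 and 28°C, and "no" for 32°C; candidate thresholds are taken as midpoints between consecutive values with different classes, which is one common convention and not necessarily the one used by the author. The names `entropy` and `best_threshold` are illustrative.

```python
import math


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    counts = {label: labels.count(label) for label in set(labels)}
    return sum(-c / total * math.log2(c / total) for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, gain) maximizing the gain of the test value > threshold.

    Candidate thresholds are midpoints between consecutive sorted values
    whose class labels differ.
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 == c2:
            continue
        threshold = (v1 + v2) / 2
        low = [c for v, c in pairs if v <= threshold]
        high = [c for v, c in pairs if v > threshold]
        expected = (len(low) * entropy(low) + len(high) * entropy(high)) / len(pairs)
        if base - expected > best[1]:
            best = (threshold, base - expected)
    return best


temps = [6, 8, 14, 18, 20, 28, 32]
play = ["no", "no", "no", "yes", "yes", "yes", "no"]
threshold, gain = best_threshold(temps, play)
print(threshold, round(gain, 3))
```

Only class boundaries are tried because a threshold placed between two examples of the same class cannot yield a better gain than the nearest boundary.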

28
4.2. Different branching factors
Problem: the entropic gain criterion favors attributes with a higher branching factor.
Two solutions:
Make all attributes binary
–But loss of legibility of the trees
Introduce a normalization factor:
$Gain\_norm(S,A) = \frac{Gain(S,A)}{-\sum_{i=1}^{\text{nb values of } A} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}}$
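The effect of the normalization can be illustrated on the golf example. The sketch below compares Pif with a hypothetical identifier attribute that takes a distinct value on each of the 14 examples; that attribute, the function names, and the choice of base-2 logarithms in the denominator are assumptions made for the illustration.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits of a distribution given by raw counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)


def gain(branches):
    """Plain information gain; `branches` lists per-class counts per branch."""
    class_totals = [sum(column) for column in zip(*branches)]
    total = sum(class_totals)
    return entropy_from_counts(class_totals) - sum(
        sum(branch) / total * entropy_from_counts(branch) for branch in branches)


def gain_norm(branches):
    """Gain divided by the entropy of the branch sizes |S_i| / |S|."""
    split_info = entropy_from_counts([sum(branch) for branch in branches])
    return gain(branches) / split_info if split_info > 0 else 0.0


pif = [(4, 0), (2, 3), (3, 2)]
identifier = [(1, 0)] * 9 + [(0, 1)] * 5  # hypothetical 14-valued attribute
for name, branches in (("Pif", pif), ("identifier", identifier)):
    print(name, round(gain(branches), 3), round(gain_norm(branches), 3))
```

On this data the normalization shrinks the advantage of the many-valued attribute considerably, although it does not remove it entirely.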

29
4.3. Processing missing values
Let (x, c(x)) be an example for which we don't know the value of attribute A.
How to compute gain(S,A)?
1. Take the most frequent value in the entire set S
2. Take the most frequent value at this node
3. Split the example into fictitious examples with the different possible values of A, weighted by their respective frequencies
E.g. if 6 examples at this node take the value A=a1 and 4 the value A=a2:
A(x) = a1 with prob=0.6 and A(x) = a2 with prob=0.4
For prediction, classify the example with the label of the most probable leaf.
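The third option, splitting an example into weighted fictitious examples, can be sketched as follows. The data layout (a weight attached to every example, `None` for a missing value) and the function name `split_with_missing` are assumptions for this illustration.

```python
from collections import Counter


def split_with_missing(examples, attribute):
    """Partition weighted examples on `attribute`, splitting unknown values.

    examples: list of (features dict, label, weight); a missing value is None.
    Returns {value: list of (features, label, weight)}.
    """
    # Frequency of each value of A among the examples where it is known.
    mass = Counter()
    for features, _, weight in examples:
        if features[attribute] is not None:
            mass[features[attribute]] += weight
    total = sum(mass.values())
    partition = {value: [] for value in mass}
    for features, label, weight in examples:
        value = features[attribute]
        if value is not None:
            partition[value].append((features, label, weight))
        else:
            # One fictitious example per value, weighted by its frequency.
            for v, m in mass.items():
                partition[v].append((features, label, weight * m / total))
    return partition


# 6 examples with A=a1, 4 with A=a2, and one example whose value is unknown.
node = ([({"A": "a1"}, "+", 1.0)] * 6 + [({"A": "a2"}, "-", 1.0)] * 4
        + [({"A": None}, "+", 1.0)])
parts = split_with_missing(node, "A")
print(parts["a1"][-1][2], parts["a2"][-1][2])
```

The total weight is preserved by the split, so entropies computed from weighted counts remain comparable between a node and its children.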

30
5- The generalization problem
Did we learn a good decision tree?
Training set. Test set.
Learning curve
Methods to evaluate generalization:
On a test set
Cross validation
–"Leave-one-out"
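Leave-one-out evaluation can be sketched independently of the learner. In the sketch below, `train` and `predict` are hypothetical callables standing in for any tree induction and classification routine; the majority-class learner exists only to make the example runnable.

```python
from collections import Counter


def leave_one_out_error(examples, train, predict):
    """Estimate the error rate by holding out each example in turn."""
    errors = 0
    for i, (features, label) in enumerate(examples):
        model = train(examples[:i] + examples[i + 1:])
        errors += predict(model, features) != label
    return errors / len(examples)


# Placeholder learner: always predicts the majority class of its training set.
def train_majority(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]


def predict_majority(model, features):
    return model


data = [({}, "+")] * 9 + [({}, "-")] * 5
error = leave_one_out_error(data, train_majority, predict_majority)
print(round(error, 3))
```

Leave-one-out trains as many models as there are examples, which is affordable for small data sets such as the 14-example one used earlier; k-fold cross validation follows the same pattern with larger held-out blocks.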

31
5.1. Overfitting: Effect of noise on induction
Types of noise:
Description errors
Classification errors
"Clashes"
Missing values
Effects:
Over-developed tree: too deep, too many leaves

32
5.1. Overfitting: The generalization problem
Low empirical risk. High real risk.
SRM (Structural Risk Minimization)
Justification [Vapnik,71,79,82,95]
–Notion of "capacity" of the hypothesis space
–Vapnik-Chervonenkis dimension
We must control the hypothesis space.

33
5.1. Control of space H: motivations & strategies
Motivations:
Improve generalization performance (SRM)
Build a legible model of the data (for experts)
Strategies:
1. Direct control of the size of the induced tree: pruning
2. Modify the state space (trees) in which to search
3. Modify the search algorithm
4. Restrict the database
5. Translate built trees into another representation

34
5.2. Overfitting: Controlling the size with pre-pruning
Idea: modify the termination criterion
Depth threshold (e.g. [Holte,93]: threshold = 1 or 2)
Chi2 test
Laplacian error
Low information gain
Low number of examples
Population of examples not statistically significant
Comparison between "static error" and "dynamic error"
Problem: often too short-sighted

35
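5.2. Example: Chi2 test
Let A be a binary attribute with branches g (left) and d (right).
The node holds $(n) = (n_1, n_2)$ examples of classes 1 and 2. A proportion P of them goes to branch g, with observed counts $(n_{g1}, n_{g2})$, and a proportion 1-P goes to branch d, with observed counts $(n_{d1}, n_{d2})$.
$n_1 = n_{g1} + n_{d1}$
$n_2 = n_{g2} + n_{d2}$
Null hypothesis: the expected counts are $(ne_{g1}, ne_{g2})$ in branch g and $(ne_{d1}, ne_{d2})$ in branch d, with
$ne_{g1} = P n_1$ ; $ne_{d1} = (1-P) n_1$
$ne_{g2} = P n_2$ ; $ne_{d2} = (1-P) n_2$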
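The statistic itself is not shown in this transcript. The sketch below assumes the usual Pearson form, the sum over the four cells of (observed - expected)^2 / expected, with the expected counts defined as on the slide. The function name, the example counts and the 5% significance level are illustrative assumptions; the code also assumes that both classes are present and that both branches are non-empty, so that no expected count is zero.

```python
def chi2_split(left, right):
    """Pearson chi-square statistic of a binary split.

    left = (n_g1, n_g2) and right = (n_d1, n_d2) are the observed per-class
    counts in the two branches of attribute A.
    """
    n1, n2 = left[0] + right[0], left[1] + right[1]
    P = sum(left) / (n1 + n2)  # proportion of examples sent to branch g
    expected_left = (P * n1, P * n2)
    expected_right = ((1 - P) * n1, (1 - P) * n2)
    cells = list(zip(left, expected_left)) + list(zip(right, expected_right))
    return sum((observed - expected) ** 2 / expected
               for observed, expected in cells)


CRITICAL_95 = 3.841  # chi-square quantile, 1 degree of freedom, 5% level

for left, right in [((10, 10), (10, 10)), ((18, 2), (3, 17))]:
    stat = chi2_split(left, right)
    decision = "split" if stat > CRITICAL_95 else "stop"
    print(left, right, round(stat, 2), decision)
```

Used as a pre-pruning rule, a node is only expanded when the statistic exceeds the critical value, i.e. when the split departs significantly from what independence between A and the class would produce.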

36
5.3. Overfitting: Controlling the size with post-pruning
Idea: prune after the construction of the whole tree, by replacing subtrees with a single node whenever this optimizes a pruning criterion.
Many methods. Still lots of research.
Minimal Cost-Complexity Pruning (MCCP) (Breiman et al.,84)
Reduced Error Pruning (REP) (Quinlan,87,93)
Minimum Error Pruning (MEP) (Niblett & Bratko,86)
Critical Value Pruning (CVP) (Mingers,87)
Pessimistic Error Pruning (PEP) (Quinlan,87)
Error-Based Pruning (EBP) (Quinlan,93) (used in C4.5)
...

37
5.3- Cost-Complexity pruning [Breiman et al.,84]
Cost-complexity for a tree:
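The cost-complexity formula is missing from this transcript. The sketch below assumes the standard definition from CART, R_alpha(T) = R(T) + alpha * (number of leaves of T), where R(T) is the training error rate of the tree. The tree encoding, the error counts and the values of alpha are hypothetical.

```python
def leaves(tree):
    """Yield the leaves of a tree; a leaf is a dict with an 'errors' count."""
    if "children" not in tree:
        yield tree
    else:
        for child in tree["children"]:
            yield from leaves(child)


def cost_complexity(tree, alpha, n_examples):
    """R_alpha(T) = R(T) + alpha * (number of leaves of T)."""
    tree_leaves = list(leaves(tree))
    error_rate = sum(leaf["errors"] for leaf in tree_leaves) / n_examples
    return error_rate + alpha * len(tree_leaves)


# Hypothetical subtree: 3 leaves making 2 training errors in total, versus
# the single leaf that would replace it, making 5 errors (100 examples).
subtree = {"children": [{"errors": 0}, {"errors": 1}, {"errors": 1}]}
collapsed = {"errors": 5}
for alpha in (0.0, 0.01, 0.02):
    keep = cost_complexity(subtree, alpha, 100)
    prune = cost_complexity(collapsed, alpha, 100)
    print(alpha, round(keep, 3), round(prune, 3), "prune" if prune <= keep else "keep")
```

As alpha grows, the penalty on the number of leaves eventually outweighs the extra training errors of the collapsed node, which is what makes the subtree a candidate for pruning.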

38
6. Forward search
Instead of a greedy search, search n nodes ahead:
"If I choose this node and then this node and then…"
But exponential growth of the number of computations

39
6. Modification of the search strategy
Idea: no more depth-first search
Methods that use a different measure:
Minimum Description Length principle
–Measure of the complexity of the tree
–Measure of the complexity of the examples not coded by the tree
–Keep the tree that minimizes the sum of these measures
Measures from weak learnability theory
Kolmogorov-Smirnov measure
Class separation measure
Mix of selection tests

40
7. Modification of the search space
Modification of the node tests
To solve the problems of an inadequate representation
Methods of constructive induction (e.g. multivariate tests)
E.g. oblique decision trees
Methods:
Numerical operators
–Perceptron trees
–Trees and Genetic Programming
Logical operators

41
7. Oblique trees
[Figure: partition of the (x1, x2) plane into regions of classes c1 and c2, and the corresponding tree with leaves c1 and c2 and node tests x1 < 0.70, x2 < 0.30, x2 < 0.88, 1.1 x1 + x2 < 0.2, x2 < 0.62, x1 < 0.17]

42
7. Induction of oblique trees
Another cause of leafy trees: an inadequate representation
Solutions:
Ask an expert (e.g. chess endgame [Quinlan,83])
Do a PCA beforehand
Use another attribute selection method
Apply constructive induction: induction of oblique trees

43
8. Translation into other representations
Idea: translate a complex tree into a representation where the result is simpler
Translation into decision graphs
Translation into rule sets

44
9. Conclusions
Appropriate for:
Classification of attribute-value examples
Attributes with discrete values
Resistance to noise
Strategy:
Search by incremental construction of a hypothesis
Local criterion (gradient) based on a statistical criterion
Generates:
Interpretable decision trees (e.g. production rules)
Requires a control of the size of the tree
