Oliver Schulte Machine Learning 726 Decision Tree Classifiers.




1 Oliver Schulte Machine Learning 726 Decision Tree Classifiers

2 Overview

  Child Node \ Parent Node   Discrete parent                         Continuous parent
  Discrete child             Maximum Likelihood Decision Trees       logit distribution (logistic regression)
  Continuous child           conditional Gaussian (not discussed)    linear Gaussian (linear regression)

3 Decision Tree Popular type of classifier; easy to visualize, especially for discrete values but also for continuous ones. Learning is based on information theory.

4 Decision Tree Example

5 Exercise Find a decision tree to represent: A OR B; A AND B; A XOR B; (A AND B) OR (C AND NOT D AND E).
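As a quick illustration of what such a tree looks like (my own sketch, not the posted solution), A XOR B can be represented by testing A at the root and B at each branch:

```python
def xor_tree(a: bool, b: bool) -> bool:
    """One possible decision tree for A XOR B: the root tests A, each branch tests B.
    XOR needs both attributes on every path from root to leaf."""
    if a:             # root node: split on A
        return not b  # A = true branch: leaf value depends on B
    else:
        return b      # A = false branch: leaf value depends on B
```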

6 Decision Tree Learning Basic loop:
1. A := the "best" decision attribute for the next node.
2. For each value of A, create a new descendant of the node.
3. Assign the training examples to the leaf nodes.
4. If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes.
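A minimal Python sketch of this loop, assuming examples are given as (attribute-dict, label) pairs and a best_attribute scorer (e.g., the information gain defined on the later slides) is supplied; all names here are illustrative, not from the slides:

```python
from collections import Counter

def learn_tree(examples, attributes, best_attribute):
    """Sketch of the basic decision-tree loop: pick the 'best' attribute,
    branch on each of its values, and recurse until a node is pure."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                 # training examples perfectly classified
        return labels[0]                      # leaf node
    if not attributes:                        # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    A = best_attribute(examples, attributes)  # step 1: choose the split attribute
    tree = {A: {}}
    for v in {x[A] for x, _ in examples}:     # step 2: one descendant per value of A
        subset = [(x, y) for x, y in examples if x[A] == v]   # step 3: assign examples
        tree[A][v] = learn_tree(subset,
                                [a for a in attributes if a != A],
                                best_attribute)               # step 4: iterate on new leaves
    return tree
```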

7 Entropy

8 Uncertainty and Probability The more "balanced" a probability distribution is, the less information it conveys (e.g., about the class label). How do we quantify this? Information theory: entropy measures balance. If S is a sample, p+ the proportion of positive examples, and p- the proportion of negative examples, then
Entropy(S) = -p+ log2(p+) - p- log2(p-)
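A small numeric check of the two-class entropy formula (a sketch; the specific proportions are just examples):

```python
import math

def binary_entropy(p_pos: float) -> float:
    """Entropy(S) = -p+ log2(p+) - p- log2(p-), with 0 * log(0) taken as 0."""
    p_neg = 1.0 - p_pos
    return -sum(p * math.log2(p) for p in (p_pos, p_neg) if p > 0)

print(binary_entropy(0.5))   # 1.0 bit: maximally balanced, least informative about the label
print(binary_entropy(0.9))   # about 0.469 bits: skewed, more informative
print(binary_entropy(1.0))   # 0.0 bits: pure sample
```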

9 Entropy: General Definition For a discrete random variable X with distribution p, H(X) = -sum_x p(x) log2 p(x). This is an important quantity in coding theory, statistical physics, and machine learning.

10 Intuition

11 Entropy

12 Coding Theory Coding theory: X is discrete with 8 possible states ("messages"); how many bits are needed to transmit the state of X? Shannon's information theorem: an optimal code assigns -log2 p(x) bits to each "message" X = x. If all states are equally likely, log2(8) = 3 bits are needed.
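A short sketch of this calculation, assuming the 8 equally likely states described above (the state names are made up):

```python
import math

# 8 equally likely states: each has p(x) = 1/8, so the optimal code
# assigns -log2(1/8) = 3 bits to each message.
uniform = {f"s{i}": 1 / 8 for i in range(8)}
lengths = {x: -math.log2(p) for x, p in uniform.items()}
print(lengths["s0"])                                     # 3.0 bits per message
print(sum(p * lengths[x] for x, p in uniform.items()))   # expected code length: 3.0 bits
```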

13 Another Coding Example

14 Zipf's Law General principle: frequent messages get shorter codes, e.g., abbreviations. This is the link between information and compression.

15 The Kullback-Leibler Divergence Measures an information-theoretic "distance" between two distributions p and q:
KL(p || q) = sum_x p(x) [-log2 q(x)] - sum_x p(x) [-log2 p(x)] = sum_x p(x) log2(p(x) / q(x))
The first term is the expected code length of x when coding with the wrong distribution q; the second is the expected code length under the true distribution p.
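A minimal sketch of this quantity, with made-up distributions p and q; it assumes q(x) > 0 wherever p(x) > 0:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)): the expected extra code length
    paid for coding with the wrong distribution q instead of the true p."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]        # true distribution
q = [1/3, 1/3, 1/3]          # wrong (uniform) model
print(kl_divergence(p, q))   # about 0.085 bits of wasted code length
print(kl_divergence(p, p))   # 0.0: no penalty when the model is exact
```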

16 Information Gain

17 Splitting Criterion Splitting on an attribute changes the entropy. Intuitively, we want to split on the attribute that gives the greatest reduction in entropy, averaged over its attribute values. Gain(S, A) = expected reduction in entropy due to splitting on A:
Gain(S, A) = Entropy(S) - sum over values v of A of (|S_v| / |S|) * Entropy(S_v)
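A minimal sketch of Gain(S, A) on a tiny made-up sample; the attribute name "Wind" and the data are illustrative only:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a labelled sample, computed from its class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, A):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v):
    the expected reduction in entropy from splitting S on attribute A."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for v in {x[A] for x, _ in examples}:
        subset = [y for x, y in examples if x[A] == v]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder

# Tiny made-up sample: splitting on 'Wind' separates some of the labels.
S = [({"Wind": "weak"}, "yes"), ({"Wind": "weak"}, "yes"),
     ({"Wind": "strong"}, "no"), ({"Wind": "strong"}, "yes")]
print(gain(S, "Wind"))   # about 0.31 bits
```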

18 Example

19 PlayTennis

