Decision trees (concept learning)


1 Decision trees (concept learning)
02/03/2016

2 These slides are based on
Tom M. Mitchell’s Machine Learning slides (2011)

3 Concept learning
Example concept: "the days where the weather is suitable for playing (outdoor) tennis."
Each day is described by the attributes Outlook (Sunny, Overcast, Rain), Temp. (Hot, Mild, Cool), Humidity (High, Normal) and Wind (Weak, Strong); the target attribute Tennis? takes the values Yes and No.

4 Hypothesis
A hypothesis h in "concept learning" is a conjunction of constraints on the feature values.
Each constraint is either a particular value, e.g. Wind=Strong, or any value, e.g. Wind=?.
A possible h over (Outlook, Temp., Humidity, Wind) is (Sunny, ?, ?, Strong).

5 Decision trees

6 Decision Tree for PlayTennis
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

7 Decision Tree for PlayTennis
The same tree illustrates the structure:
- Each internal node tests an attribute (e.g. Outlook, Humidity).
- Each branch corresponds to an attribute value (e.g. Sunny, High).
- Each leaf node assigns a classification (Yes or No).

8 Classifying a new example with the tree
The query (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Weak, PlayTennis=?) is sorted down the tree: Outlook=Sunny leads to the Humidity test, and Humidity=High leads to the leaf No, so PlayTennis=No.

9 Decision Tree for Conjunction
Outlook=Sunny  Wind=Weak Outlook Sunny Overcast Rain Wind No No Strong Weak No Yes

10 Decision Tree for Disjunction
Outlook=Sunny  Wind=Weak Outlook Sunny Overcast Rain Yes Wind Wind Strong Weak Strong Weak No Yes No Yes ICS320

11 Decision Tree
Decision trees represent disjunctions of conjunctions of constraints on the attribute values:
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)

12 Advantages of decision trees
- explicit relationships among features
- a human-interpretable model

13 Training a decision tree (algorithm ID3)
A  the “best” decision attribute for next node Assign A as decision attribute for node For each value of A create new descendant Sort training examples to leaf node according to the attribute value of the branch If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes.

14 Which Attribute is ”best”?
Two candidate splits of the same sample S = [29+, 35-]:
A1=? : True -> [21+, 5-], False -> [8+, 30-]
A2=? : True -> [18+, 33-], False -> [11+, 2-]

15 Entropy
S is a sample of training examples; p+ is the proportion of positive examples and p- the proportion of negative examples.
Entropy measures the impurity of S:
Entropy(S) = -p+ log2 p+ - p- log2 p-

16 Entropy
Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).
Why? Information theory: an optimal-length code assigns -log2 p bits to a message with probability p, so the expected number of bits to encode the class of a random member of S is
-p+ log2 p+ - p- log2 p-
Note the convention 0 log2 0 = 0.
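As a quick illustration, a small Python sketch of the formula (the function name and the (pos, neg) count interface are just conventions chosen here):

```python
import math

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                 # convention: 0 * log2(0) = 0
            result -= p * math.log2(p)
    return result

print(entropy(29, 35))            # ~0.99, the Entropy([29+, 35-]) used on the next slides
```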

17 Information Gain
Gain(S, A): the expected reduction in entropy due to sorting S on attribute A.
Gain(S, A) = Entropy(S) - Σ_{v ∈ values(A)} |Sv|/|S| * Entropy(Sv)
For the example above, Entropy([29+, 35-]) = -29/64 log2 29/64 - 35/64 log2 35/64 = 0.99, with the same two candidate splits:
A1=? : True -> [21+, 5-], False -> [8+, 30-]
A2=? : True -> [18+, 33-], False -> [11+, 2-]

18 Information Gain
For A1: Entropy([21+, 5-]) = 0.71 and Entropy([8+, 30-]) = 0.74, so
Gain(S, A1) = Entropy(S) - (26/64)*Entropy([21+, 5-]) - (38/64)*Entropy([8+, 30-]) = 0.27
For A2: Entropy([18+, 33-]) = 0.94 and Entropy([11+, 2-]) = 0.62, so
Gain(S, A2) = Entropy(S) - (51/64)*Entropy([18+, 33-]) - (13/64)*Entropy([11+, 2-]) = 0.12
A1 is therefore the better split.
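These gains can be reproduced with a short sketch (the entropy helper is repeated so the snippet runs on its own; rounding matches the two-decimal figures above):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def gain(parent, splits):
    """Information gain: `parent` and each split are (pos, neg) count pairs."""
    total = sum(p + n for p, n in splits)
    return entropy(*parent) - sum((p + n) / total * entropy(p, n)
                                  for p, n in splits)

print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # Gain(S, A1) = 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # Gain(S, A2) = 0.12
```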

19 Training Examples
Day  Outlook   Temp. Humidity Wind   PlayTennis
D1   Sunny     Hot   High     Weak   No
D2   Sunny     Hot   High     Strong No
D3   Overcast  Hot   High     Weak   Yes
D4   Rain      Mild  High     Weak   Yes
D5   Rain      Cool  Normal   Weak   Yes
D6   Rain      Cool  Normal   Strong No
D7   Overcast  Cool  Normal   Strong Yes
D8   Sunny     Mild  High     Weak   No
D9   Sunny     Cool  Normal   Weak   Yes
D10  Rain      Mild  Normal   Weak   Yes
D11  Sunny     Mild  Normal   Strong Yes
D12  Overcast  Mild  High     Strong Yes
D13  Overcast  Hot   Normal   Weak   Yes
D14  Rain      Mild  High     Strong No

20 Selecting the Next Attribute
Humidity: High -> [3+, 4-] (E=0.985), Normal -> [6+, 1-] (E=0.592)
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Wind: Weak -> [6+, 2-] (E=0.811), Strong -> [3+, 3-] (E=1.0)
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048

21 Selecting the Next Attribute
Outlook: Sunny -> [2+, 3-] (E=0.971), Overcast -> [4+, 0-] (E=0.0), Rain -> [3+, 2-] (E=0.971)
Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
Outlook has the highest gain, so it becomes the root.
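Reusing the entropy and gain helpers from the previous sketch, the PlayTennis gains follow directly from the class counts on each branch (a continuation, not a self-contained snippet):

```python
# Assumes the entropy/gain helpers defined in the earlier sketch.
s = (9, 5)                                 # the full sample: 9 positive, 5 negative
print(gain(s, [(3, 4), (6, 1)]))           # Humidity (High, Normal)         -> ~0.151
print(gain(s, [(6, 2), (3, 3)]))           # Wind (Weak, Strong)             -> ~0.048
print(gain(s, [(2, 3), (4, 0), (3, 2)]))   # Outlook (Sunny, Overcast, Rain) -> ~0.247
```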

22 ID3 Algorithm
The root test is Outlook, applied to [D1,D2,…,D14] = [9+, 5-]:
  Sunny -> Ssunny = [D1, D2, D8, D9, D11] = [2+, 3-] (to be expanded further)
  Overcast -> [D3, D7, D12, D13] = [4+, 0-] -> Yes
  Rain -> [D4, D5, D6, D10, D14] = [3+, 2-] (to be expanded further)
Choosing the attribute for the Sunny branch:
Gain(Ssunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
Humidity has the highest gain, so it is chosen for the Sunny branch.

23 Converting a Tree to Rules
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
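A small sketch of this conversion, assuming the tree is stored as a nested dict of the kind built by the earlier ID3 sketch (leaves are class labels; this representation is an assumption, not something fixed by the slides):

```python
def tree_to_rules(tree, conditions=()):
    """Walk a nested-dict decision tree and yield one rule per leaf."""
    if not isinstance(tree, dict):            # leaf: emit the accumulated rule
        body = " AND ".join(f"({a}={v})" for a, v in conditions) or "(true)"
        yield f"If {body} Then PlayTennis={tree}"
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

playtennis_tree = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}
for rule in tree_to_rules(playtennis_tree):
    print(rule)        # reproduces rules R1-R5 (possibly in a different order)
```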

24 ID3 and the space of hypotheses
(Figure: the search through partial decision trees, each step adding a new attribute test such as A1, A2, A3 or A4 above +/- leaves.)

25 ID3 and the space of hypotheses
- ID3 is a hill-climbing algorithm that searches from simple to complex hypotheses.
- The hypothesis space is complete (the optimal solution is present in it).
- ID3 outputs a single hypothesis.
- It does not backtrack (greedy), so it can get stuck in a local minimum.
- It prefers smaller trees, as attributes with higher information gain are placed close to the root.

26 ID3 → C4.5
C4.5 extends ID3 with:
- GainRatio() instead of Gain()
- continuous features
- missing feature values
- pruning

27 Attributes with many values
Gain() prefers attributes with many values.
GainRatio(S, A) = Gain(S, A) / Split(S, A), where
Split(S, A) = -Σ_{i=1..c} |Si|/|S| log2(|Si|/|S|)
and Si is the subset of S where A takes the value vi.
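A sketch of the correction, working only from the subset sizes |Si| (Split ignores the class labels and only measures how finely A partitions S); the Day-ID comparison at the end is an illustrative extreme case:

```python
import math

def split_information(sizes):
    """Split(S, A): entropy of the partition induced by attribute A,
    where `sizes` are the subset sizes |Si|."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s)

def gain_ratio(gain, sizes):
    """GainRatio(S, A) = Gain(S, A) / Split(S, A)."""
    return gain / split_information(sizes)

# An attribute with many values (e.g. a Day ID splitting 14 examples into
# 14 singletons) gets a large Split term, which shrinks its GainRatio:
print(split_information([1] * 14))  # 14 singleton branches -> log2(14) ~ 3.81
print(split_information([7, 7]))    # a balanced binary split -> 1.0
```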

28 Continuous attributes
Convert continuous attributes to discrete intervals, e.g. Temperature = 24°C and Temperature = 27°C become values of the binary attribute (Temperature > 20.0°C) ∈ {true, false}.
How do we recognise a good threshold? Sort the examples by the attribute, e.g. Temperature: 15°C, 18°C, 19°C, 22°C, 24°C, 27°C with their Tennis labels (No/Yes), and place candidate thresholds between neighbouring values whose labels differ.
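One common way to generate candidate thresholds is to take the midpoints between consecutive sorted values whose class labels differ. A sketch, with an illustrative label sequence that is not taken from the slide:

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values where the class changes."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps  = [15, 18, 19, 22, 24, 27]                  # degrees Celsius
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]   # illustrative labels only
print(candidate_thresholds(temps, labels))         # [18.5, 25.5]
```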

29 Missing attributes
During training, at node n with attribute A:
- use the most frequent value of A in n, or
- use the most frequent value of A in n among the instances with the same class label, or
- use the expected value of A estimated from n.
At prediction time:
- follow every possible value of A, and
- aggregate the predictions of the leaves reached (weighting them by their probability).

30 Overfitting

31 Pruning of decision trees
Two strategies to avoid overfitting:
- stop growing the tree when the improvement becomes minor, or
- grow the complete tree and prune it back.

32 Pruning of decision trees
Split your training data into training and validation sets, then repeat while the validation performance improves:
1. Evaluate each tree obtained by pruning one of the lowest-level branches (replacing it with a leaf).
2. Keep the pruning with the greatest improvement (greedy iteration).
Corollary: some leaves will be heterogeneous (contain examples of more than one class).
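A rough sketch of this greedy loop, assuming three helpers (prunable_nodes, prune and accuracy) that depend on how the tree and validation set are represented; none of them come from the slides:

```python
import copy

def reduced_error_pruning(tree, prunable_nodes, prune, accuracy):
    """Greedy pruning on a validation set.

    `prunable_nodes(tree)` lists the lowest-level internal nodes,
    `prune(tree, node)` returns a copy with that node replaced by a leaf,
    and `accuracy(tree)` scores the tree on the held-out validation set.
    All three are assumed helpers."""
    best_acc = accuracy(tree)
    while True:
        candidates = [prune(copy.deepcopy(tree), node)
                      for node in prunable_nodes(tree)]
        if not candidates:
            break
        challenger = max(candidates, key=accuracy)   # greatest improvement
        challenger_acc = accuracy(challenger)
        if challenger_acc < best_acc:                # no prune helps any more
            break
        tree, best_acc = challenger, challenger_acc
    return tree
```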

33

34 General questions of machine learning

35 Generalization ability
- overfitting
- the bias-variance dilemma

36

37 Overfitting modell hH overfits if there is a h’H
errortrain(h) < errortrain(h’) and errorX(h) > errorX(h’)

38 Occam's razor
Among competing hypotheses, the one with the fewest assumptions should be selected.
If a long hypothesis fits the data, it may do so only by chance; a shorter hypothesis that also fits the data is less likely to be a coincidence.

39 The bias-variance dilemma
Consider a regression problem for the target function F(x), where g(x; D) is the prediction of the model trained on dataset D, and imagine training on multiple datasets D.
bias = how far the average prediction is from the target F(x)
variance = how much the models learnt on the various D differ from each other

40 The bias-variance dilemma
Overfitting = low bias but high variance.

41 © Ethem Alpaydin: Introduction to Machine Learning. 2nd edition (2010)

42 © Ethem Alpaydin: Introduction to Machine Learning. 2nd edition (2010)

43 The "generalisation" metaparameter
Each machine learning approach has one (or a few) metaparameters that fine-tune the bias/variance trade-off:
- kNN: k
- Parzen windows: window size
- Naive Bayes: m-estimate
- decision tree: pruning

44 The curse of dimensionality

45 Feature space
Introducing new features (new information for the system) can help, BUT it can also easily lead to overfitting!

46 The curse of dimensionality
- With 100 binary features there are 2^100 possible feature vectors, so we would need an enormous number of training samples to cover the space.
- What does "similarity" mean at d = 1000 dimensions?
- In practice, performance can decrease when new features are introduced: the "curse of dimensionality".
- Parameter estimation becomes more complex and the model overfits more easily.

47 Performance measures for supervised machine learners

48 Estimating the performance of a supervised learner
Supervised learning: based on training examples, learn a model which works well on previously unseen examples.
Model selection involves:
- selecting the machine learning approach
- constructing the feature space
- optimising the metaparameters (e.g. the pruning of decision trees)

49 Leave-out technique
Split your dataset D = {(v1, y1), …, (vn, yn)} into a training set Dt and a test set Dv = D \ Dt.
Train on Dt, then test on D \ Dt.
This simulates the performance on unseen data.
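A minimal sketch of such a random hold-out split (the 70/30 proportion and the fixed seed are arbitrary choices, not from the slide):

```python
import random

def holdout_split(dataset, test_fraction=0.3, seed=0):
    """Shuffle the labelled examples and hold out a test set Dv = D \\ Dt."""
    examples = list(dataset)
    random.Random(seed).shuffle(examples)
    cut = int(len(examples) * (1 - test_fraction))
    return examples[:cut], examples[cut:]   # (Dt, Dv)

data = [((i,), i % 2) for i in range(10)]   # toy (v, y) pairs
train, test = holdout_split(data)
print(len(train), len(test))                # 7 3
```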

50 train-dev-test
Two common splits of the data:
train | test
train | dev | test (the dev set is used for model and metaparameter selection)

51 n-fold cross-validation
n-fold cross-validation splits the training data D into n disjoint folds D1, D2, …, Dn.
Train your model on D \ Di, then evaluate it on Di.
Repeat this n times (once per fold) and average the results.
(Figure: the n folds, with a different fold held out in each round.)
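A short sketch of the fold bookkeeping; the train_and_evaluate argument stands in for whatever model and metric are used and is not from the slides:

```python
def cross_validate(dataset, n, train_and_evaluate):
    """n-fold cross-validation: train on D \\ Di, evaluate on Di, average."""
    examples = list(dataset)
    folds = [examples[i::n] for i in range(n)]       # n disjoint folds
    scores = []
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(train_and_evaluate(train, held_out))
    return sum(scores) / n

# Toy usage: the "metric" just reports the held-out fraction of the data.
data = list(range(20))
print(cross_validate(data, 4, lambda tr, te: len(te) / (len(tr) + len(te))))  # 0.25
```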

52 Summary
Decision trees:
- concept learning
- entropy
- ID3 -> C4.5
General questions of machine learning:
- generalization ability
- the curse of dimensionality
- performance evaluation in supervised learning

