Download presentation
Presentation is loading. Please wait.
1
Decision trees (concept learnig)
02/03/2016
2
These slides are based on
Tom M. Mitchell’s Machine Learning slides (2011)
3
Concept learning example concept: ”the days where the weather is suitable for playing (outdoor) tennis.” Outlook Temp. Humidity Wind Tennis? Sunny Hot High Weak No Strong Overcast Cold Normal Yes Rain Mild
4
Hypothesis hypothesis h in „concept learning” is the disjunctive normal form of feature values feature values can be a particular value, e.g. Wind=Strong or any value, e.g. Wind=? a possible h is Outlook Temp. Humidity Wind Sunny ? ? Strong
5
Decision trees
6
Decision Tree for PlayTennis
Outlook Sunny Overcast Rain Humidity Yes Wind High Normal Strong Weak No Yes No Yes
7
Decision Tree for PlayTennis
Outlook Sunny Overcast Rain Humidity Each internal node tests an attribute High Normal Each branch corresponds to an attribute value node No Yes Each leaf node assigns a classification
8
Outlook Temperature Humidity Wind PlayTennis Sunny Hot High Weak ?
No Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes
9
Decision Tree for Conjunction
Outlook=Sunny Wind=Weak Outlook Sunny Overcast Rain Wind No No Strong Weak No Yes
10
Decision Tree for Disjunction
Outlook=Sunny Wind=Weak Outlook Sunny Overcast Rain Yes Wind Wind Strong Weak Strong Weak No Yes No Yes ICS320
11
Decision Tree decision trees represent disjunctions of conjunctions
Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes (Outlook=Sunny Humidity=Normal) (Outlook=Overcast) (Outlook=Rain Wind=Weak)
12
Advantages of decision trees
explicit relationship among features human interpretable modell
13
Training a decision tree (algorithm ID3)
A the “best” decision attribute for next node Assign A as decision attribute for node For each value of A create new descendant Sort training examples to leaf node according to the attribute value of the branch If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes.
14
Which Attribute is ”best”?
True False [21+, 5-] [8+, 30-] [29+,35-] A2=? True False [18+, 33-] [11+, 2-] [29+,35-]
15
Entropy Entropy(S) = -p+ log2 p+ - p- log2 p-
S is a sample of training examples p+ is the proportion of positive examples p- is the proportion of negative examples Entropy measures the impurity of S Entropy(S) = -p+ log2 p+ - p- log2 p-
16
Entropy -p+ log2 p+ - p- log2 p-
Entropy(S)= expected number of bits needed to encode class (+ or -) of randomly drawn members of S (under the optimal, shortest length-code) Why? Information theory optimal length code assign –log2 p bits to messages having probability p. So the expected number of bits to encode (+ or -) of random member of S: -p+ log2 p+ - p- log2 p- Note that: 0Log20 =0
17
Information Gain Gain(S,A): expected reduction in entropy due to sorting S on attribute A Gain(S,A)=Entropy(S) - vvalues(A) |Sv|/|S| Entropy(Sv) Entropy([29+,35-]) = -29/64 log2 29/64 – 35/64 log2 35/64 = 0.99 A1=? True False [21+, 5-] [8+, 30-] [29+,35-] A2=? True False [18+, 33-] [11+, 2-] [29+,35-]
18
Information Gain Entropy([21+,5-]) = 0.71 Entropy([8+,30-]) = 0.74
Gain(S,A1)=Entropy(S) -26/64*Entropy([21+,5-]) -38/64*Entropy([8+,30-]) =0.27 Entropy([18+,33-]) = 0.94 Entropy([8+,30-]) = 0.62 Gain(S,A2)=Entropy(S) -51/64*Entropy([18+,33-]) -13/64*Entropy([11+,2-]) =0.12 A1=? True False [21+, 5-] [8+, 30-] [29+,35-] A2=? True False [18+, 33-] [11+, 2-] [29+,35-]
19
ICS320-Foundations of Adaptive and Learning Systems
Training Examples No Strong High Mild Rain D14 Yes Weak Normal Hot Overcast D13 D12 Sunny D11 D10 Cool D9 D8 D7 D6 D5 D4 D3 D2 D1 Play Tennis Wind Humidity Temp. Outlook Day Part3 Decision Tree Learning
20
Selecting the Next Attribute
Humidity Wind High Normal Weak Strong [3+, 4-] [6+, 1-] [6+, 2-] [3+, 3-] E=0.592 E=0.811 E=1.0 E=0.985 Gain(S,Humidity) =0.940-(7/14)*0.985 – (7/14)*0.592 =0.151 Gain(S,Wind) =0.940-(8/14)*0.811 – (6/14)*1.0 =0.048
21
Selecting the Next Attribute
Outlook Over cast Rain Sunny [3+, 2-] [2+, 3-] [4+, 0] E=0.971 E=0.971 E=0.0 Gain(S,Outlook) =0.940-(5/14)*0.971 -(4/14)*0.0 – (5/14)*0.0971 =0.247
22
ID3 Algorithm [D1,D2,…,D14] [9+,5-] Outlook Sunny Overcast Rain
Ssunny=[D1,D2,D8,D9,D11] [2+,3-] [D3,D7,D12,D13] [4+,0-] [D4,D5,D6,D10,D14] [3+,2-] Yes ? ? Gain(Ssunny , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970 Gain(Ssunny , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570 Gain(Ssunny , Wind)=0.970= -(2/5)1.0 – 3/5(0.918) = 0.019
23
Converting a Tree to Rules
Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes R1: If (Outlook=Sunny) (Humidity=High) Then PlayTennis=No R2: If (Outlook=Sunny) (Humidity=Normal) Then PlayTennis=Yes R3: If (Outlook=Overcast) Then PlayTennis=Yes R4: If (Outlook=Rain) (Wind=Strong) Then PlayTennis=No R5: If (Outlook=Rain) (Wind=Weak) Then PlayTennis=Yes
24
ID3 and the space of hypotheses
A2 A1 A2 A2 - - A3 A4 + - - +
25
ID3 and the space of hypotheses
hill climbing algorithm from a simple to a complex hypothesis The hypothesis space is complete (the optimal solution is present) ID3 outputs one hypothesis no backtrack (greedy) → local minima It prefers smaller trees (as attributes with greater information gain are placed close to the root)
26
ID3 → C4.5 GainRatio() Continous features Missing feature values
Pruning
27
Attributes with many values
Gain() prefers attributes with many values GainRatio(S,A) = Gain(S,A) / Split(S,A), where Split(S,A) = -i=1..c |Si|/|S| log2 |Si|/|S| Si is a subset of S where A értéke takes vi
28
Continous attributes Convert continous attributes to disrete intervalls Temperature=240C, Temperature =270C (Temperature > 20.00C) = {true, false} How to recognise a good threshold? Temperature 150C 180C 190C 220C 240C 270C Tennis No Yes
29
Missing attributes During training: at node n and attribute A
use the most frequent value of A in n use the most frequent value of A in n among the instances with the same classlabel use the expected value of A estimated from n Prediction time evaluate each possible values aggregate the prediction of the leaves (calculate the probability)
30
Overfitting
31
Pruning of decision trees
to avoid overtraining Stop growing the tree if the improvement is minor prune back the complete tree
32
Pruning of decision trees
Split your training data into training and validation sets and repeat until improves: Evaluate each tree where the lowest level branches are pruned (replace it with a leaf) Prune the one with greatest improvement (greedy iteration) Corollary: there will be heterogenous leaves
34
General questions of machine learning
35
Generalization ability overfitting bias-variance dilemma
37
Overfitting modell hH overfits if there is a h’H
errortrain(h) < errortrain(h’) and errorX(h) > errorX(h’)
38
Occam’s razor Among competing hypotheses, the one with the fewest assumptions should be selected. If a long hypothesis fits to the data it can be due to randomness (imagine that a shorter also fits to the data)
39
The bias-variance dilemma
Regression problem of the function F(x) g(x;D) is the prediction of the modell trained on D We use multiple D bias=error on the training data variance=the difference among modells learnt on various D
40
The bias-variance dilemma
Overfitting=low bias but great variance
41
© Ethem Alpaydin: Introduction to Machine Learning. 2nd edition (2010)
42
© Ethem Alpaydin: Introduction to Machine Learning. 2nd edition (2010)
43
„generalisation” metaparameter
Each machine learning approach has one (or a few) metaparameters to finetune the bias/variance tradeoff kNN: k Parzen windows: window size Naive Bayes: m-estimate decision tree: pruning
44
The curse of dimensionality
45
Feature space Introducing new features (information to the system) helps: BUT it can easily can indicate overfitting!
46
The curse of dimensionality
if we ha 100 binary features we’d need training samples… What does „similarity” mean at d=1000? In practice the performance can decrease by introducing new features “Curse of dimensionality” parameter estimation becomes more complex easily overfits
47
Performance measure of supervised machine learners
48
Estimation of the performance of supervised learner
Supervised learning: Based on training examples, learn a modell which works fine on previously unseen examples. Selection among models: Selection of the machine learning approach Feature space construction Metaparameter optimisation (e.g. pruning of decision trees)
49
Leave-out technique Split your dataset into D = {(v1,y1),…,(vn,yn)} training (Dt) and test (Dv=D\Dt) sets Train Dt Test D\Dt Simulation of the performance on unseen data.
50
train-dev-test train test train dev test
51
n-fold cross validation
n-fold cross validation splits the training data D into n disjunct folds: D1,D2,…,Dk … Train your model on D\Di–n then evaluate on Di Repeat this n times and average the results achieved D1 D2 D3 Dk D1 D2 D3 D4 D1 D2 D3 D4 D1 D2 D3 D4 D1 D2 D3 D4
52
Summary Decision trees Concept learning Entropy ID3 -> C4.5
General questions of machine learning Generalization ability Curse of dimensionality Performance evaluation in supervised learning
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.