Decision trees (concept learnig)

Decision trees (concept learnig)
02/03/2016

These slides are based on
Tom M. Mitchell’s Machine Learning slides (2011)

Concept learning example concept: ”the days where the weather is suitable for playing (outdoor) tennis.” Outlook Temp. Humidity Wind Tennis? Sunny Hot High Weak No Strong Overcast Cold Normal Yes Rain Mild

Hypothesis hypothesis h in „concept learning” is the disjunctive normal form of feature values feature values can be a particular value, e.g. Wind=Strong or any value, e.g. Wind=? a possible h is Outlook Temp. Humidity Wind Sunny ? ? Strong

Decision trees

Decision Tree for PlayTennis
Outlook Sunny Overcast Rain Humidity Yes Wind High Normal Strong Weak No Yes No Yes

Decision Tree for PlayTennis
Outlook Sunny Overcast Rain Humidity Each internal node tests an attribute High Normal Each branch corresponds to an attribute value node No Yes Each leaf node assigns a classification

Outlook Temperature Humidity Wind PlayTennis Sunny Hot High Weak ?
No Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes

Decision Tree for Conjunction
Outlook=Sunny  Wind=Weak Outlook Sunny Overcast Rain Wind No No Strong Weak No Yes

Decision Tree for Disjunction
Outlook=Sunny  Wind=Weak Outlook Sunny Overcast Rain Yes Wind Wind Strong Weak Strong Weak No Yes No Yes ICS320

Decision Tree decision trees represent disjunctions of conjunctions
Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes (Outlook=Sunny  Humidity=Normal)  (Outlook=Overcast)  (Outlook=Rain  Wind=Weak)

Advantages of decision trees
explicit relationship among features human interpretable modell

Training a decision tree (algorithm ID3)
A  the “best” decision attribute for next node Assign A as decision attribute for node For each value of A create new descendant Sort training examples to leaf node according to the attribute value of the branch If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes.

Which Attribute is ”best”?
True False [21+, 5-] [8+, 30-] [29+,35-] A2=? True False [18+, 33-] [11+, 2-] [29+,35-]

Entropy Entropy(S) = -p+ log2 p+ - p- log2 p-
S is a sample of training examples p+ is the proportion of positive examples p- is the proportion of negative examples Entropy measures the impurity of S Entropy(S) = -p+ log2 p+ - p- log2 p-

Entropy -p+ log2 p+ - p- log2 p-
Entropy(S)= expected number of bits needed to encode class (+ or -) of randomly drawn members of S (under the optimal, shortest length-code) Why? Information theory optimal length code assign –log2 p bits to messages having probability p. So the expected number of bits to encode (+ or -) of random member of S: -p+ log2 p+ - p- log2 p- Note that: 0Log20 =0

Information Gain Gain(S,A): expected reduction in entropy due to sorting S on attribute A Gain(S,A)=Entropy(S) - vvalues(A) |Sv|/|S| Entropy(Sv) Entropy([29+,35-]) = -29/64 log2 29/64 – 35/64 log2 35/64 = 0.99 A1=? True False [21+, 5-] [8+, 30-] [29+,35-] A2=? True False [18+, 33-] [11+, 2-] [29+,35-]

Information Gain Entropy([21+,5-]) = 0.71 Entropy([8+,30-]) = 0.74
Gain(S,A1)=Entropy(S) -26/64*Entropy([21+,5-]) -38/64*Entropy([8+,30-]) =0.27 Entropy([18+,33-]) = 0.94 Entropy([8+,30-]) = 0.62 Gain(S,A2)=Entropy(S) -51/64*Entropy([18+,33-]) -13/64*Entropy([11+,2-]) =0.12 A1=? True False [21+, 5-] [8+, 30-] [29+,35-] A2=? True False [18+, 33-] [11+, 2-] [29+,35-]

ICS320-Foundations of Adaptive and Learning Systems
Training Examples No Strong High Mild Rain D14 Yes Weak Normal Hot Overcast D13 D12 Sunny D11 D10 Cool D9 D8 D7 D6 D5 D4 D3 D2 D1 Play Tennis Wind Humidity Temp. Outlook Day Part3 Decision Tree Learning

Selecting the Next Attribute
Humidity Wind High Normal Weak Strong [3+, 4-] [6+, 1-] [6+, 2-] [3+, 3-] E=0.592 E=0.811 E=1.0 E=0.985 Gain(S,Humidity) =0.940-(7/14)*0.985 – (7/14)*0.592 =0.151 Gain(S,Wind) =0.940-(8/14)*0.811 – (6/14)*1.0 =0.048

Selecting the Next Attribute
Outlook Over cast Rain Sunny [3+, 2-] [2+, 3-] [4+, 0] E=0.971 E=0.971 E=0.0 Gain(S,Outlook) =0.940-(5/14)*0.971 -(4/14)*0.0 – (5/14)*0.0971 =0.247

ID3 Algorithm [D1,D2,…,D14] [9+,5-] Outlook Sunny Overcast Rain
Ssunny=[D1,D2,D8,D9,D11] [2+,3-] [D3,D7,D12,D13] [4+,0-] [D4,D5,D6,D10,D14] [3+,2-] Yes ? ? Gain(Ssunny , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970 Gain(Ssunny , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570 Gain(Ssunny , Wind)=0.970= -(2/5)1.0 – 3/5(0.918) = 0.019

Converting a Tree to Rules
Outlook Sunny Overcast Rain Humidity High Normal Wind Strong Weak No Yes R1: If (Outlook=Sunny)  (Humidity=High) Then PlayTennis=No R2: If (Outlook=Sunny)  (Humidity=Normal) Then PlayTennis=Yes R3: If (Outlook=Overcast) Then PlayTennis=Yes R4: If (Outlook=Rain)  (Wind=Strong) Then PlayTennis=No R5: If (Outlook=Rain)  (Wind=Weak) Then PlayTennis=Yes

ID3 and the space of hypotheses
A2 A1 A2 A2 - - A3 A4 + - - +

ID3 and the space of hypotheses
hill climbing algorithm from a simple to a complex hypothesis The hypothesis space is complete (the optimal solution is present) ID3 outputs one hypothesis no backtrack (greedy) → local minima It prefers smaller trees (as attributes with greater information gain are placed close to the root)

ID3 → C4.5 GainRatio() Continous features Missing feature values
Pruning

Attributes with many values
Gain() prefers attributes with many values GainRatio(S,A) = Gain(S,A) / Split(S,A), where Split(S,A) = -i=1..c |Si|/|S| log2 |Si|/|S| Si is a subset of S where A értéke takes vi

Continous attributes Convert continous attributes to disrete intervalls Temperature=240C, Temperature =270C (Temperature > 20.00C) = {true, false} How to recognise a good threshold? Temperature 150C 180C 190C 220C 240C 270C Tennis No Yes

Missing attributes During training: at node n and attribute A
use the most frequent value of A in n use the most frequent value of A in n among the instances with the same classlabel use the expected value of A estimated from n Prediction time evaluate each possible values aggregate the prediction of the leaves (calculate the probability)

Overfitting

Pruning of decision trees
to avoid overtraining Stop growing the tree if the improvement is minor prune back the complete tree

Pruning of decision trees
Split your training data into training and validation sets and repeat until improves: Evaluate each tree where the lowest level branches are pruned (replace it with a leaf) Prune the one with greatest improvement (greedy iteration) Corollary: there will be heterogenous leaves

General questions of machine learning

Generalization ability overfitting bias-variance dilemma

Overfitting modell hH overfits if there is a h’H
errortrain(h) < errortrain(h’) and errorX(h) > errorX(h’)

Occam’s razor Among competing hypotheses, the one with the fewest assumptions should be selected. If a long hypothesis fits to the data it can be due to randomness (imagine that a shorter also fits to the data)

The bias-variance dilemma
Regression problem of the function F(x) g(x;D) is the prediction of the modell trained on D We use multiple D bias=error on the training data variance=the difference among modells learnt on various D

The bias-variance dilemma
Overfitting=low bias but great variance

„generalisation” metaparameter
Each machine learning approach has one (or a few) metaparameters to finetune the bias/variance tradeoff kNN: k Parzen windows: window size Naive Bayes: m-estimate decision tree: pruning

The curse of dimensionality

Feature space Introducing new features (information to the system) helps: BUT it can easily can indicate overfitting!

The curse of dimensionality
if we ha 100 binary features we’d need training samples… What does „similarity” mean at d=1000? In practice the performance can decrease by introducing new features  “Curse of dimensionality” parameter estimation becomes more complex easily overfits

Performance measure of supervised machine learners

Estimation of the performance of supervised learner
Supervised learning: Based on training examples, learn a modell which works fine on previously unseen examples. Selection among models: Selection of the machine learning approach Feature space construction Metaparameter optimisation (e.g. pruning of decision trees)

Leave-out technique Split your dataset into D = {(v1,y1),…,(vn,yn)} training (Dt) and test (Dv=D\Dt) sets Train Dt Test D\Dt Simulation of the performance on unseen data.

train-dev-test train test train dev test

n-fold cross validation
n-fold cross validation splits the training data D into n disjunct folds: D1,D2,…,Dk … Train your model on D\Di–n then evaluate on Di Repeat this n times and average the results achieved D1 D2 D3 Dk D1 D2 D3 D4 D1 D2 D3 D4 D1 D2 D3 D4 D1 D2 D3 D4

Summary Decision trees Concept learning Entropy ID3 -> C4.5
General questions of machine learning Generalization ability Curse of dimensionality Performance evaluation in supervised learning

Decision trees (concept learnig)

Similar presentations

Presentation on theme: "Decision trees (concept learnig)"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Decision trees (concept learnig)

Similar presentations

Presentation on theme: "Decision trees (concept learnig)"— Presentation transcript:

Similar presentations

About project

Feedback