Decision Trees. References: "Artificial Intelligence: A Modern Approach, 3rd ed." (Pearson), 18.3-18.4


1 Decision Trees
References:
– "Artificial Intelligence: A Modern Approach, 3rd ed." (Pearson)
– eng.utoronto.ca/~datamining/dmc/decision_tree_overfitting.htm

2 What are they?
A "flowchart" of logic. Example:
– If my health is low: run to cover
– Else, if an enemy is nearby: shoot it
– Else: scavenge for treasure
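The flowchart on this slide maps directly onto nested if/else statements. A minimal sketch in Python, where the function name and the two boolean parameters are hypothetical names chosen for illustration:

```python
def choose_action(health_is_low: bool, enemy_nearby: bool) -> str:
    """Walk the small decision tree from the slide and return an action."""
    if health_is_low:
        return "run to cover"
    elif enemy_nearby:
        return "shoot it"
    else:
        return "scavenge for treasure"


print(choose_action(health_is_low=False, enemy_nearby=True))
```

Each `if` is an internal node of the tree and each `return` is a leaf.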

3 Another Example
Goal: Decide if we'll wait for a table at a restaurant
Factors:
– Alternate: Is there another restaurant nearby?
– Bar: Does the restaurant have a bar?
– Fri / Sat: Is it a Friday or Saturday?
– Hungry: Are we hungry?
– Patrons: How many people are there? {None, Some, Full}
– Price: Price range {$, $$, $$$}
– Raining: Is it raining?
– Reservation: Do we have a reservation?
– Type: {French, Italian, Thai, Burger}
– Wait: Estimated wait in minutes {0-10, 10-30, 30-60, >60} (called "Est" on some later slides)

4 Possible decision tree
[Figure: a decision tree rooted at Patrons (None / Some / Full), with further tests on Wait, Alternate, Hungry, Reservation, Fri/Sat, Bar, and Raining, ending in Y / N leaves.]

5 Analysis
Pluses:
– Easy to traverse
– Naturally expressed as if/else statements
Negatives:
– How do we build an optimal tree?

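6 Sample Input
#   Alt  Bar  Fri  Hun  Pat  Pr   Ran  Res  Type  Wait   ??
1   Y    N    N    Y    S    $$$  N    Y    Fr    0-10   Y
2   Y    N    N    Y    F    $    N    N    Th    30-60  N
3   N    Y    N    N    S    $    N    N    Bu    0-10   Y
4   Y    N    Y    Y    F    $    Y    N    Th    10-30  Y
5   Y    N    Y    N    F    $$$  N    Y    Fr    >60    N
6   N    Y    N    Y    S    $$   Y    Y    It    0-10   Y
7   N    Y    N    N    N    $    Y    N    Bu    0-10   N
8   N    N    N    Y    S    $$   Y    Y    Th    0-10   Y
9   N    Y    Y    N    F    $    Y    N    Bu    >60    N
10  Y    Y    Y    Y    F    $$$  N    Y    It    10-30  N
11  N    N    N    N    N    $    N    N    Th    0-10   N
12  Y    Y    Y    Y    F    $    N    N    Bu    30-60  Y
Pat: N = None, S = Some, F = Full. The last column (??) is the result: Y if we waited, N if not.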

7 Sample Input, cont.
We can also think of these as "training data"
– For a decision tree we want to model
– In this context, the input:
    is that of "experts"
    exemplifies the thinking you want to encode
    is raw data we want to mine
    …
Note:
– Doesn't contain all possibilities
– There might be noise

8 Building a tree
So how do we build a decision tree from input?
A lot of possible trees:
– O(2^n)
– Some are good, some are bad:
    good == shallowest
    bad == deepest
– Intractable to find the best
Using a greedy algorithm, we can find a pretty good one…

9 ID3 algorithm
By Ross Quinlan (RuleQuest Research)
Basic idea:
– Choose the best attribute, i
– Create a tree with n children
    n is the number of values for attribute i
– Divide the training set into n subsets
    All items in a subset have the same value for attribute i.
    If all items in the subset have the same output value, make this a leaf node.
    If not, recursively create a new subtree
    – Only use those training examples in this subset
    – Don't consider attribute i any more.
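The recursion described on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not Quinlan's implementation: it assumes "best" means highest information gain (the measure the following slides introduce), uses only five of the ten attributes from the sample input to keep the listing short, and falls back to the majority label if attributes run out. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def information_gain(examples, attr, target):
    """Entropy before the split minus the size-weighted entropy after it."""
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder


def id3(examples, attrs, target):
    """Return a leaf label, or an (attribute, {value: subtree}) pair."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # pure subset: make a leaf
        return labels[0]
    if not attrs:                        # assumption: majority label as fallback
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attrs if a != best]   # don't consider `best` again
    children = {}
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        children[value] = id3(subset, remaining, target)
    return (best, children)


def classify(tree, case):
    """Follow the branches that match `case` until a leaf label is reached."""
    while not isinstance(tree, str):
        attr, children = tree
        tree = children[case[attr]]
    return tree


# Five attributes plus the result, copied from the 12 sample-input rows.
COLUMNS = ("Alt", "Fri", "Hun", "Pat", "Type", "Result")
ROWS = [
    "Y N Y S Fr Y", "Y N Y F Th N", "N N N S Bu Y", "Y Y Y F Th Y",
    "Y Y N F Fr N", "N N Y S It Y", "N N N N Bu N", "N N Y S Th Y",
    "N Y N F Bu N", "Y Y Y F It N", "N N N N Th N", "Y Y Y F Bu Y",
]
EXAMPLES = [dict(zip(COLUMNS, row.split())) for row in ROWS]

tree = id3(EXAMPLES, ["Alt", "Fri", "Hun", "Pat", "Type"], "Result")
print("root attribute:", tree[0])
```

Ties between equally good attributes are broken by list order here, so the subtree under Pat = F may differ from the one drawn on the slides while still fitting every training row.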

10 "Best" attribute
Entropy (in information theory)
– A measure of uncertainty.
– Gaining info == lowering entropy
Examples:
– A fair coin = 1 bit of entropy
– A loaded coin (always heads) = 0 bits of entropy (no uncertainty)
– A fair roll of a d4 = 2 bits of entropy
– A fair roll of a d8 = 3 bits of entropy

11 Entropy, cont.

12 Example:
– We have a loaded 4-sided die
– Its faces come up with probabilities {1: 10%, 2: 5%, 3: 25%, 4: 60%}
– Its entropy works out to about 1.49 bits
Recall: The entropy of a fair d4 is 2.0, so this die is more predictable.
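The entropy figures quoted on these slides can be checked with the standard Shannon formula, H = -Σ p·log2(p). The transcript does not preserve the slide's own formula, so the sketch below assumes that standard definition; the function name is hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping p == 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = entropy([0.5, 0.5])
loaded_coin = entropy([1.0, 0.0])
fair_d4 = entropy([0.25] * 4)
fair_d8 = entropy([1 / 8] * 8)
loaded_d4 = entropy([0.10, 0.05, 0.25, 0.60])

for name, bits in [("fair coin", fair_coin), ("loaded coin", loaded_coin),
                   ("fair d4", fair_d4), ("fair d8", fair_d8),
                   ("loaded d4", loaded_d4)]:
    print(f"{name}: {abs(bits):.3f} bits")
```

The loaded d4 comes out at roughly 1.49 bits, below the 2 bits of a fair d4.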

13 Information Gain
The reduction in entropy
In the ID3 algorithm,
– We want to split the training cases based on attribute i
– Where attribute i gives us the most information, i.e. lowers entropy the most

14 Information Gain, cont.

15 Original Example

16 Original Example, cont.
StepA2: Calculate H(E_wait)
– 4 possible values, so we'd end up with 4 branches
    "0-10": {1, 3, 6, 7, 8, 11}; 4 Yes, 2 No
    "10-30": {4, 10}; 1 Yes, 1 No
    "30-60": {2, 12}; 1 Yes, 1 No
    ">60": {5, 9}; 2 No
– Calculate the entropy of this split group

17 Original Example, cont.
StepA3: Calculate H(E_pat)
– 3 possible values, so we'd end up with 3 branches
    "Some": {1, 3, 6, 8}; 4 Yes
    "Full": {2, 4, 5, 9, 10, 12}; 2 Yes, 4 No
    "None": {7, 11}; 2 No
– Calculate the entropy of this split group
So…which is better: splitting on wait, or pat?
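The comparison between the two splits can be worked through from the Yes/No counts listed on these two slides. The sketch below assumes the usual definition of gain (parent entropy minus the size-weighted entropy of the branches); the helper names are hypothetical.

```python
import math


def entropy(yes, no):
    """Entropy in bits of a group with `yes` positive and `no` negative cases."""
    total = yes + no
    return -sum(c / total * math.log2(c / total) for c in (yes, no) if c)


def gain(parent, branches):
    """Parent entropy minus the size-weighted entropy of each branch."""
    total = sum(parent)
    remainder = sum((y + n) / total * entropy(y, n) for y, n in branches)
    return entropy(*parent) - remainder


ALL_CASES = (6, 6)                                # 6 Yes, 6 No overall
WAIT_BRANCHES = [(4, 2), (1, 1), (1, 1), (0, 2)]  # 0-10, 10-30, 30-60, >60
PAT_BRANCHES = [(4, 0), (2, 4), (0, 2)]           # Some, Full, None

gain_wait = gain(ALL_CASES, WAIT_BRANCHES)
gain_pat = gain(ALL_CASES, PAT_BRANCHES)
print(f"gain(wait) = {gain_wait:.3f}, gain(pat) = {gain_pat:.3f}")
```

Splitting on pat leaves only the "Full" branch mixed, which is why its gain is so much larger.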

18 Original Example, cont.
Pat is much better (0.541 gain vs. roughly 0.208 gain for wait)
Here is the tree so far:
Patrons
– Some: Y {1,3,6,8}
– None: N {7,11}
– Full: ? {2,4,5,9,10,12}
Now we need a subtree to handle the case where Patrons == Full
– Note: The training set is smaller now (6 vs. 12)

19 Original Example, cont.
Look at two alternatives: Alt and Type
Calculate the entropy of the remaining group:
– We actually already calculated it (H("Full") in StepA3)
– The value becomes H(E) for this recursive application of ID3.
– H(E) ≈ 0.918

20 Original Example, cont.
Calculate entropy if we split on Alt
– Two possible values: "Yes" and "No"
    "Yes" (Alt): {2, 4, 5, 10, 12}; 2 Yes, 3 No (Result)
    "No" (Alt): {9}; 1 No (Result)

21 Original Example, cont.
Calculate entropy if we split on Type
– 4 possible values: "French", "Thai", "Burger", and "Italian"
    "French": {5}; 1 No
    "Thai": {2, 4}; 1 Yes, 1 No
    "Burger": {9, 12}; 1 Yes, 1 No
    "Italian": {10}; 1 No
Which is better: alt or type?
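For the recursive step, the gain is computed over the six "Full" rows only. A sketch that groups those rows by attribute value rather than starting from pre-counted totals; the row tuples are copied from the sample input, and the helper names are hypothetical.

```python
import math
from collections import defaultdict


def entropy(results):
    """Entropy in bits of a list of "Y"/"N" results."""
    total = len(results)
    counts = (results.count("Y"), results.count("N"))
    return -sum(c / total * math.log2(c / total) for c in counts if c)


# (example number, Alt, Type, Result) for the rows where Patrons == Full
FULL_ROWS = [
    (2, "Yes", "Thai", "N"), (4, "Yes", "Thai", "Y"), (5, "Yes", "French", "N"),
    (9, "No", "Burger", "N"), (10, "Yes", "Italian", "N"), (12, "Yes", "Burger", "Y"),
]


def gain(rows, column):
    """Information gain from splitting `rows` on the value in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gain_alt = gain(FULL_ROWS, 1)
gain_type = gain(FULL_ROWS, 2)
print(f"gain(alt) = {gain_alt:.3f}, gain(type) = {gain_type:.3f}")
```

Both gains are measured against H(E) ≈ 0.918, the entropy of the "Full" group, not against the entropy of the whole training set.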

22 Original Example, cont.
Type is better (0.251 gain vs. roughly 0.109 gain for alt)
– Hungry, Price, Reservation, Est would give you the same gain.
Here is the tree so far:
Patrons
– Some: Y {1,3,6,8}
– None: N {7,11}
– Full: Type {2,4,5,9,10,12}
    French: N {5}
    Italian: N {10}
    Thai: ? {2,4}
    Burger: ? {9,12}
Recursively make two more subtrees…

23 Original Example, cont.
Here's one possibility (skipping the details):
Patrons
– Some: Y {1,3,6,8}
– None: N {7,11}
– Full: Type {2,4,5,9,10,12}
    French: N {5}
    Italian: N {10}
    Thai: Fri {2,4}
        Yes: Y {4}
        No: N {2}
    Burger: Alt {9,12}
        Yes: Y {12}
        No: N {9}

24 Using a decision tree
This algorithm will perfectly match all training cases.
The hope is that this will generalize to novel cases.
Let's take a new case (not found in training)
– Alt="No", Bar="Yes", Fri="No", Pat="Full"
– Hungry="Yes", Price="$$", Rain="Yes"
– Reservation="Yes", Type="Italian", Est="30-60"
Will we wait?

25 Original Example, cont.
Here's the decision process for the new case (Alt="No", Bar="Yes", Fri="No", Pat="Full", Hungry="Yes", Price="$$", Rain="Yes", Reservation="Yes", Type="Italian", Est="30-60"):
– Patrons: the case has Pat="Full", so follow the Full branch to the Type node
– Type: the case has Type="Italian", so follow the Italian branch to the leaf N
So…No, we won't wait.
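Traversing the finished tree is a short loop. The sketch below encodes the tree from the example as nested (attribute, branches) pairs, which is one possible representation and not something the slides prescribe; `TREE`, `classify`, and `NEW_CASE` are hypothetical names.

```python
# Internal node: (attribute name, {attribute value: subtree}); leaf: "Y" or "N".
TREE = ("Pat", {
    "Some": "Y",
    "None": "N",
    "Full": ("Type", {
        "French": "N",
        "Italian": "N",
        "Thai": ("Fri", {"Yes": "Y", "No": "N"}),
        "Burger": ("Alt", {"Yes": "Y", "No": "N"}),
    }),
})


def classify(node, case):
    """Follow the branch matching the case's value until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[case[attribute]]
    return node


NEW_CASE = {
    "Alt": "No", "Bar": "Yes", "Fri": "No", "Pat": "Full", "Hungry": "Yes",
    "Price": "$$", "Rain": "Yes", "Reservation": "Yes", "Type": "Italian",
    "Est": "30-60",
}

print(classify(TREE, NEW_CASE))  # prints N: we won't wait
```

Only Pat and Type are consulted for this case; the other eight attributes never affect the answer.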

26 Pruning
Sometimes an exact fit is not necessary
– The tree is too big (deep)
– The tree isn't generalizing well to new cases (overfitting)
– We don't have a lot of training cases
Example:
Attr
– v1: r1 {47}
– v2: r2 {98}
– v3: r1 {11, 41}
We would get close to the same results by removing the Attr node and labeling it as a leaf (r1)

27 Chi-Squared Test
The chi-squared test can be used to determine if a decision node is statistically significant.
Example 1: Is there a statistically significant relationship between hair color and eye color?
RAW DATA
                 Hair Color
Eye Color      Light   Dark
Brown          32      12
Green/Blue     14      22
Other          6       9

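28 Chi-Squared Test
Example 2: Is there a statistically significant relationship between console preference and passing ETGG1803?
RAW DATA
[Table: columns are Console Preference (PS3, PC, XBox360, Wii, None); rows are Pass ETGG1803? (Pass, Fail). The Pass row's values are missing from the transcript; the Fail row survives only as the run "42542".]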

29 Chi-Squared Test
Steps:
1) Calculate row, column, and overall totals
RAW
                 Hair Color
Eye Color      Light   Dark
Brown          32      12
Green/Blue     14      22
Other          6       9
WITH TOTALS
                 Hair Color
Eye Color      Light   Dark   Total
Brown          32      12     44
Green/Blue     14      22     36
Other          6       9      15
Total          52      43     95

30 Chi-Squared Test
2) Calculate expected values of each cell
– RowTotal * ColTotal / OverallTotal
– e.g. Brown/Light: 52*44/95 = 24.08; Green/Blue/Dark: 36*43/95 = 16.3
EXPECTED
                 Hair Color
Eye Color      Light   Dark
Brown          24.08   19.92
Green/Blue     19.71   16.3
Other          8.21    6.79

31 Chi-Squared Test
3) Calculate the chi-squared value of each cell: (Raw - Expected)^2 / Expected
– e.g. Brown/Light: (32 - 24.08)^2 / 24.08
– e.g. Green/Blue/Dark: (22 - 16.3)^2 / 16.3
χ² = sum of the values for all six cells = 10.71

32 Chi-Squared Test
4) Look up your chi-squared value in a table
– The degrees of freedom (dof) is (numRows - 1) * (numCols - 1)
– sq.html
– If the table entry (usually for 0.05) is less than your chi-squared value, the result is statistically significant.
– Or use scipy (www.scipy.org):
    import scipy.stats
    # 1 - CDF is the p-value: the chance of a chi-squared this large with no real association
    if 1.0 - scipy.stats.chi2.cdf(chiSquared, dof) > 0.05:
        pass  # Statistically insignificant

33 Chi-Squared Test
For the hair/eye color example
– χ² = 10.71
– dof = 2
– table entry for 5% probability (0.05) is 5.99
– 10.71 is bigger than 5.99, so this is statistically significant
For the console example
– χ² = 8.16
– dof = 4
– table entry for 5% probability is 9.49
– So…this isn't a statistically significant connection.
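The four steps (totals, expected values, per-cell chi-squared, table lookup) can be sketched in plain Python using the hair/eye color counts. The function name is hypothetical, and the small critical-value dictionary holds standard 0.05-level table entries (three of them are the ones quoted on the slides) rather than a full table.

```python
def chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    overall = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / overall
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Critical values at the 0.05 level, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49}

# Rows: Brown, Green/Blue, Other eyes; columns: Light, Dark hair.
HAIR_EYE = [[32, 12], [14, 22], [6, 9]]

statistic, dof = chi_squared(HAIR_EYE)
significant = statistic > CRITICAL_05[dof]
print(f"chi-squared = {statistic:.2f}, dof = {dof}, significant = {significant}")
```

This sketch assumes every row and column total is nonzero; a table with an empty row or column would need a guard against dividing by a zero expected value.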

34 Chi-Squared Pruning
Bottom-up
– Do a depth-first traversal
– Do your test after calling the function recursively on your children
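A bottom-up pruning pass can be sketched as a post-order traversal. The sketch below makes several assumptions that the slides do not state: nodes are dictionaries with hypothetical keys `counts`, `label`, and `children`; a node is only tested once all of its children are leaves; a pruned node takes the majority label, with ties going to "Y"; and the 0.05 critical values are hard-coded. The tree and its [Yes, No] counts are the restaurant example's.

```python
CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.81}   # 0.05-level table entries by dof


def chi_squared(child_counts):
    """Chi-squared for a split, given one (yes, no) pair per child."""
    total_yes = sum(y for y, _ in child_counts)
    total_no = sum(n for _, n in child_counts)
    total = total_yes + total_no
    statistic = 0.0
    for yes, no in child_counts:
        size = yes + no
        for observed, class_total in ((yes, total_yes), (no, total_no)):
            expected = size * class_total / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


def leaf(yes, no, label):
    return {"counts": (yes, no), "label": label}


def node(yes, no, children):
    return {"counts": (yes, no), "children": children}


def prune(tree):
    """Post-order: prune the children first, then test this node."""
    children = tree.get("children")
    if not children:
        return tree
    for value in children:
        children[value] = prune(children[value])
    if any("children" in child for child in children.values()):
        return tree                       # a subtree survived: keep this test
    counts = [child["counts"] for child in children.values()]
    dof = len(counts) - 1                 # (numRows - 1) * (2 - 1)
    if chi_squared(counts) < CRITICAL_05[dof]:
        yes, no = tree["counts"]
        return leaf(yes, no, "Y" if yes >= no else "N")
    return tree


TREE = node(6, 6, {
    "Some": leaf(4, 0, "Y"),
    "None": leaf(0, 2, "N"),
    "Full": node(2, 4, {
        "French": leaf(0, 1, "N"),
        "Italian": leaf(0, 1, "N"),
        "Thai": node(1, 1, {"Yes": leaf(1, 0, "Y"), "No": leaf(0, 1, "N")}),
        "Burger": node(1, 1, {"Yes": leaf(1, 0, "Y"), "No": leaf(0, 1, "N")}),
    }),
})

pruned = prune(TREE)
print({value: child.get("label") for value, child in pruned["children"].items()})
```

With these counts the Fri and Alt nodes are pruned first, then Type, while the Patrons split is kept.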

35 Original Example, cont.
Look at "Burger?" first
Patrons [6Y,6N]
– Some [4Y,0N]: Y
– None [0Y,2N]: N
– Full [2Y,4N]: Type
    French [0Y,1N]: N
    Italian [0Y,1N]: N
    Thai [1Y,1N]: Fri
        Yes [1Y,0N]: Y
        No [0Y,1N]: N
    Burger [1Y,1N]: Alt
        Yes [1Y,0N]: Y
        No [0Y,1N]: N

36 Original Example, cont.
Do a chi-squared test on the Alt node under Burger [1Y,1N]:
– Alt = Yes: [1Y,0N], leaf Y
– Alt = No: [0Y,1N], leaf N
Original
                 Alt=Yes  Alt=No
Yes: Wait        1        0
No: Don't        0        1
Totals
                 Alt=Yes  Alt=No  Total
Yes: Wait        1        0       1
No: Don't        0        1       1
Total            1        1       2
Expected
                 Alt=Yes  Alt=No
Yes: Wait        0.5      0.5
No: Don't        0.5      0.5
Chi's
                 Alt=Yes  Alt=No
Yes: Wait        0.5      0.5
No: Don't        0.5      0.5
χ² = 0.5 + 0.5 + 0.5 + 0.5 = 2.0
dof = (2-1)*(2-1) = 1
Table(0.05, 1) = 3.84
2.0 is less than 3.84, so…prune it!
Note: we'll have a similar case with Thai. So…prune it too!

37 Original Example, cont.
Here's one possibility. The tree with the Fri (Thai) and Alt (Burger) nodes pruned to leaves:
Patrons [6Y,6N]
– Some [4Y,0N]: Y
– None [0Y,2N]: N
– Full [2Y,4N]: Type
    French [0Y,1N]: N
    Thai [1Y,1N]: Y
    Italian [0Y,1N]: N
    Burger [1Y,1N]: Y

38 Original Example, cont.
Now test the Type node [2Y,4N]:
– French [0Y,1N]: N
– Thai [1Y,1N]: Y
– Italian [0Y,1N]: N
– Burger [1Y,1N]: Y
I got a chi-squared value of 1.52, dof=3…prune it!

39 Original Example, cont.
Here's one possibility. The tree with the Type node pruned to a leaf:
Patrons [6Y,6N]
– Some [4Y,0N]: Y
– None [0Y,2N]: N
– Full [2Y,4N]: N

40 Pruning Example, cont.
Finally, test the Patrons node [6Y,6N]:
– Some [4Y,0N]: Y
– None [0Y,2N]: N
– Full [2Y,4N]: N
I got a chi-squared value of 6.667, dof=2. So…keep it!
Note: if the evidence were stronger (more training cases) in the Burger and Thai branches, we wouldn't have pruned them.

41 Questions?

