
1  Decision Trees

2

Training data:

Outlook    Temp (°F)  Humidity (%)  Windy?  Class
sunny      75         70            true    play
sunny      80         90            true    don't play
sunny      85         85            false   don't play
sunny      72         95            false   don't play
sunny      69         70            false   play
overcast   72         90            true    play
overcast   83         78            false   play
overcast   64         65            true    play
overcast   81         75            false   play
rain       71         80            true    don't play
rain       65         70            true    don't play
rain       75         80            false   play
rain       68         80            false   play
rain       70         96            false   play

[Decision tree built from this data]
Root: 14 cases (9 play, 5 don't play), split on Outlook
– Outlook = sunny: 5 cases (2 play, 3 don't play), split on Humidity
  – Humidity <= 75%: 2 cases (2 play, 0 don't play)
  – Humidity > 75%: 3 cases (0 play, 3 don't play)
– Outlook = overcast: 4 cases (4 play, 0 don't play)
– Outlook = rain: 5 cases (3 play, 2 don't play), split on Windy
  – Windy = false: 3 cases (3 play, 0 don't play)
  – Windy = true: 2 cases (0 play, 2 don't play)

3  Basic Tree-Building Algorithm

MakeTree(Training Data D):
    Partition(D)

Partition(Data D):
    if all points in D are in the same class:
        return
    evaluate splits for each attribute A
    use the best split found to partition D into D1, D2, ..., Dn
    for each Di:
        Partition(Di)
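The skeleton above can be sketched in Python roughly as follows; `find_best_split` is a hypothetical hook standing in for whichever criterion a concrete algorithm (ID3, CART, ...) plugs in, not something from the slides:

```python
def make_tree(D, find_best_split):
    """MakeTree(D): kick off the recursive Partition step."""
    partition(D, find_best_split)

def partition(D, find_best_split):
    """Partition(D) as on the slide: stop on pure nodes, otherwise split and recurse.

    D is a list of (attributes, class_label) pairs.  find_best_split(D) is a
    hypothetical hook for the split criterion (information gain for ID3, the
    goodness measure for CART); it returns the sub-partitions D1, ..., Dn,
    or None when no useful split exists.
    """
    if len({label for _, label in D}) <= 1:      # all points in D are in the same class
        return
    parts = find_best_split(D)                   # evaluate splits, keep the best one
    if not parts:
        return
    for Di in parts:                             # recurse into each partition
        partition(Di, find_best_split)
```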

4  ID3 and CART

ID3
– Uses information gain to determine the best split:
  gain = H(D) - Σ_{i=1..n} P(D_i) · H(D_i)
– Entropy: H(p_1, p_2, ..., p_m) = - Σ_{i=1..m} p_i log p_i
– Like the 20-questions game: which attribute is better to ask about first, "Is it a living thing?" or "Is it a duster?"

CART
– Creates only two children for each node
– Goodness of a split (Φ):
  Φ = 2 P(D_1) P(D_2) Σ_{j=1..m} | P(C_j | D_1) - P(C_j | D_2) |
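As a rough illustration (not code from the slides), both criteria can be computed from class counts alone; the counts below come from the Outlook split and the Humidity histogram of the slide-2 data, and the function names are mine:

```python
from math import log2

def entropy(counts):
    """H(p1, ..., pm) = -sum pi log2 pi, computed from raw class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_gain(parent, children):
    """ID3: gain = H(D) - sum_i P(Di) H(Di)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# Outlook split from slide 2: 14 cases (9 play, 5 don't play) go to
# sunny (2, 3), overcast (4, 0) and rain (3, 2).
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))     # ~0.247 bits

def cart_goodness(d1, d2):
    """CART: phi = 2 P(D1) P(D2) sum_j |P(Cj|D1) - P(Cj|D2)| for a binary split."""
    n1, n2 = sum(d1), sum(d2)
    n = n1 + n2
    diff = sum(abs(a / n1 - b / n2) for a, b in zip(d1, d2))
    return 2 * (n1 / n) * (n2 / n) * diff

# Binary split "Humidity <= 80 / > 80" of the same 14 cases (counts as on slide 12).
print(cart_goodness([7, 2], [2, 3]))                   # ~0.347
```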

5  Shannon's Entropy

– An experiment has several possible outcomes
– In N experiments, suppose each outcome occurs M times
– This means there are N/M possible outcomes
– To represent each outcome, we need log (N/M) bits
  – This generalizes even when the outcomes are not all equally frequent
  – Reason: for an outcome j that occurs M times, there are N/M equi-probable events, among which only one corresponds to j
– Since p_i = M/N, the information content of an outcome is -log p_i
– So the expected information content is H = - Σ p_i log p_i
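A tiny numeric check of this argument, with made-up frequencies (not from the slides): an outcome seen M = 2 times in N = 16 trials carries log2(N/M) = 3 bits, and averaging -log2(p) over any distribution gives the same H formula:

```python
import math

# Made-up frequencies: an outcome seen M = 2 times in N = 16 trials has
# p = M/N = 1/8, so it takes log2(N/M) = 3 bits to identify.
N, M = 16, 2
print(math.log2(N / M))                        # 3.0 bits for that single outcome
print(-math.log2(M / N))                       # same number via -log2(p)

# The expectation over an uneven distribution uses the same per-outcome content.
dist = [0.5, 0.25, 0.125, 0.125]
H = -sum(p * math.log2(p) for p in dist)
print(H)                                       # 1.75 expected bits per outcome
```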

6  C4.5, C5.0

Handling missing data
– During tree building, missing data are ignored
– During classification, the value of a missing attribute is predicted from the attribute values of other records

Continuous data
– Continuous attributes are divided into ranges based on the attribute values found in the training data

Pruning
– Prepruning
– Postpruning: replace a subtree with a leaf if the error rate doesn't change much

Rules

Goodness of split: gain ratio
– Plain gain favours attributes with many values (which leads to over-fitting)
– gain-ratio = gain / H(P(D_1), P(D_2), ..., P(D_n))  (a small numeric sketch follows below)

C5.0
– Commercial version of C4.5
– Incorporates boosting and other proprietary techniques
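A small sketch of the gain-ratio correction, again using the Outlook split counts from slide 2 (the function names are assumptions for illustration):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain_ratio(parent, children):
    """gain-ratio = gain / H(P(D1), ..., P(Dn)); the denominator penalises
    splits that scatter the data over many small branches."""
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    split_info = entropy([sum(ch) for ch in children])   # H of the branch sizes
    return gain / split_info

# Outlook splits the 14 cases (9 play, 5 don't play) into branches of 5, 4 and 5.
print(gain_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))      # ~0.156 (plain gain ~0.247)
```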

7  SLIQ

Motivations:
– Previous algorithms consider only memory-resident data
– In determining the entropy of a split on a non-categorical attribute, the attribute values have to be sorted

8  SLIQ data structure: class list and attribute lists

(Training data as on slide 2.)

The class list:
Index  Class       Leaf
1      Play        N1
2      Don't Play  N1
3      Don't Play  N1
4      Don't Play  N1
5      Play        N1
6      Play        N1
7      Play        N1
8      Play        N1
9      Play        N1
10     Don't Play  N1
11     Don't Play  N1
12     Play        N1
13     Play        N1
14     Play        N1

The attribute list for Humidity (presorted by value):
Humidity (%)  Class List Index
65            8
70            1
70            5
70            11
75            9
78            7
80            10
80            12
80            13
85            3
90            2
90            6
95            4
96            14
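A minimal sketch of how these two structures could be represented in Python; the record layout and names are assumptions, not SLIQ's actual implementation, and only a few of the 14 records are included to keep it short:

```python
# Each training record: (index, outlook, temp, humidity, windy, class_label).
# Only a few of the slide-2 records, to illustrate the layout.
records = [
    (1, "sunny",    75, 70, True,  "Play"),
    (2, "sunny",    80, 90, True,  "Don't Play"),
    (5, "sunny",    69, 70, False, "Play"),
    (8, "overcast", 64, 65, True,  "Play"),
]

# Class list: class-list index -> (class label, current leaf); every record
# starts out assigned to the root node N1.
class_list = {idx: {"class": label, "leaf": "N1"} for idx, *_, label in records}

# Attribute list for Humidity: (value, class-list index) pairs, presorted by
# value once, before tree building begins.
humidity_list = sorted((hum, idx) for idx, _outlook, _temp, hum, _windy, _cls in records)

print(class_list[8])      # {'class': 'Play', 'leaf': 'N1'}
print(humidity_list)      # [(65, 8), (70, 1), (70, 5), (90, 2)]
```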

9  SLIQ data structure: class list and attribute lists (continued)

(Class list and Humidity attribute list as on slide 8.)

Scanning the Humidity attribute list in sorted order builds a class histogram for node N1. After the first entry (value 65, index 8, class Play):

Node N1
Value  Class  Frequency
<=65   Play   1

10  After the second entry (value 70, index 1, class Play) the histogram for N1 grows:

Node N1
Value  Class  Frequency
<=65   Play   1
<=70   Play   2

11  After the third entry (value 70, index 5, class Play) the count for "<=70" rises:

Node N1
Value  Class  Frequency
<=65   Play   1
<=70   Play   3

12  After the whole Humidity attribute list has been scanned, the histogram for N1 is complete:

Node N1
Value  Class       Frequency
<=65   Play        1
<=70   Play        3
<=70   Don't Play  1
...
<=96   Play        9
<=96   Don't Play  5

The entropies of the various split points can be calculated from these figures. The next attribute list is then scanned.
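The scan of slides 9-12 might look roughly like this in Python (names and structure are my own, not from the SLIQ paper): the presorted attribute list is walked once, the running class histogram for values at or below the current threshold is updated, and the entropy of each candidate split "Humidity <= v" falls out of that histogram plus the node totals:

```python
from math import log2

# Humidity attribute list from slide 8: (value, class-list index), presorted.
humidity_list = [(65, 8), (70, 1), (70, 5), (70, 11), (75, 9), (78, 7),
                 (80, 10), (80, 12), (80, 13), (85, 3), (90, 2), (90, 6),
                 (95, 4), (96, 14)]
# Class of each record, taken from the class list (P = Play, D = Don't Play).
record_class = {1: "P", 2: "D", 3: "D", 4: "D", 5: "P", 6: "P", 7: "P",
                8: "P", 9: "P", 10: "D", 11: "D", 12: "P", 13: "P", 14: "P"}

def entropy(hist):
    """-sum p log2 p over a {class: count} histogram (0 for an empty one)."""
    total = sum(hist.values())
    return -sum(c / total * log2(c / total) for c in hist.values() if c) if total else 0.0

totals = {"P": 9, "D": 5}                # class frequencies of node N1
below = {"P": 0, "D": 0}                 # the running histogram shown on the slides
n = sum(totals.values())

for i, (value, idx) in enumerate(humidity_list):
    below[record_class[idx]] += 1        # update the "<= value" frequencies
    # evaluate the candidate split only once all entries with this value are in
    if i + 1 < len(humidity_list) and humidity_list[i + 1][0] == value:
        continue
    above = {c: totals[c] - below[c] for c in totals}
    n_below = sum(below.values())
    weighted = (n_below / n) * entropy(below) + ((n - n_below) / n) * entropy(above)
    print(f"split Humidity <= {value}: weighted entropy {weighted:.3f} bits")
```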

13  SLIQ data structure: class list and attribute lists (continued)

(Class list and Humidity attribute list as on slide 8.)

Assume the split point "Humidity <= 80%" is the best one among all possible splits of all attributes. N1 is therefore split into N2 (Humidity <= 80%) and N3 (Humidity > 80%), and the Humidity attribute list is scanned again to update the class list (slides 14-16 step through this scan; a short code sketch follows slide 16).

14  SLIQ data structure: class list and attribute lists (continued)

(Humidity attribute list as on slide 8.)

The second scan starts at the first entry (value 65, index 8): 65 <= 80, so record 8's Leaf is changed from N1 to N2. All other records still point to N1.

15  The next entry (value 70, index 1) sends record 1 to N2 as well; the class list now maps records 1 and 8 to N2 and every other record still to N1.

16  SLIQ data structure: class list and attribute lists (continued)

(Humidity attribute list as on slide 8.)

After the full second scan, every record's Leaf entry has been updated for the split "Humidity <= 80%" (N2) / "> 80%" (N3):

Index  Class       Leaf
1      Play        N2
2      Don't Play  N3
3      Don't Play  N3
4      Don't Play  N3
5      Play        N2
6      Play        N3
7      Play        N2
8      Play        N2
9      Play        N2
10     Don't Play  N2
11     Don't Play  N2
12     Play        N2
13     Play        N2
14     Play        N3
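The second scan of slides 13-16 can be sketched as follows (a rough illustration with assumed names): each attribute-list entry looks up its record in the class list and rewrites the Leaf field according to the chosen "Humidity <= 80" split:

```python
# Humidity attribute list (value, class-list index) and the class list, with
# every record still assigned to the root N1, as on slide 13.
humidity_list = [(65, 8), (70, 1), (70, 5), (70, 11), (75, 9), (78, 7),
                 (80, 10), (80, 12), (80, 13), (85, 3), (90, 2), (90, 6),
                 (95, 4), (96, 14)]
classes = {1: "Play", 2: "Don't Play", 3: "Don't Play", 4: "Don't Play",
           5: "Play", 6: "Play", 7: "Play", 8: "Play", 9: "Play",
           10: "Don't Play", 11: "Don't Play", 12: "Play", 13: "Play", 14: "Play"}
class_list = {idx: {"class": label, "leaf": "N1"} for idx, label in classes.items()}

# Best split found on slide 13: N1 splits on "Humidity <= 80" into N2 / N3.
# One more scan of the Humidity attribute list updates the leaf of each record.
for value, idx in humidity_list:
    class_list[idx]["leaf"] = "N2" if value <= 80 else "N3"

print(class_list[8])    # record 8 (humidity 65) -> leaf N2, as on slide 14
print(class_list[3])    # record 3 (humidity 85) -> leaf N3, as on slide 16
```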

17  SLIQ data structure: class list and attribute lists (continued)

(Class list as on slide 16; Humidity attribute list as on slide 8.)

The next round of scans builds class histograms for the new leaves, with each attribute-list entry routed to its leaf via the class list:

Node N2
Value  Class       Frequency
<=65   Play        1
...
<=80   Play        7
<=80   Don't Play  2

Node N3
Value  Class       Frequency
<=85   Don't Play  1
...
<=96   Play        2
<=96   Don't Play  3

18  SLIQ

Motivations (review):
– Previous algorithms consider only memory-resident data
  – At any time, only the class list and one attribute list need to be in memory
  – A new layer of the tree (vs. the child nodes of a single node) is created by at most 2 scans of each attribute list
– In determining the entropy of a split on a non-categorical attribute, the attribute values have to be sorted
  – Presorting: each attribute list is sorted only once

19  SPRINT

Motivations:
– The class list in SLIQ has to reside in memory, which is still a bottleneck for scalability
– Can the decision-tree building process be carried out by multiple machines in parallel?
– Frequent lookups of the central class list would produce a lot of network communication in the parallel case

20  SPRINT: proposed solutions

Eliminate the class list:
1. Class labels are distributed to (stored in) every attribute list. This adds redundant data, but removes the memory-residence and network-communication bottlenecks.
2. Each tree node keeps its own set of attribute lists, so there is no need to look up node information in a central structure.

Parallel tree building:
– Each machine is assigned a contiguous partition of each attribute list; the machines are ordered so that the combined lists of non-categorical attributes remain sorted.
– Each machine produces its local histograms in parallel; the combined histograms are then used to find the best splits (a toy sketch follows below).
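A toy sketch of the SPRINT idea (the two-worker split and all names are assumptions for illustration): class labels travel with every attribute-list entry, each worker builds a local class histogram over its contiguous, sorted chunk, and the local histograms are simply added before split points are evaluated:

```python
from collections import Counter

# SPRINT-style attribute list: every entry carries the class label (and record
# id) with the value, so no central class list is needed.
humidity_entries = [(65, "P", 8), (70, "P", 1), (70, "P", 5), (70, "D", 11),
                    (75, "P", 9), (78, "P", 7), (80, "D", 10), (80, "P", 12),
                    (80, "P", 13), (85, "D", 3), (90, "D", 2), (90, "P", 6),
                    (95, "D", 4), (96, "P", 14)]

# Contiguous chunks of the presorted list go to two hypothetical workers, so
# the concatenation of their chunks is still sorted.
chunks = [humidity_entries[:7], humidity_entries[7:]]

def local_histogram(chunk):
    """Each worker counts the classes in its own chunk; only this tiny summary
    needs to be exchanged, not the records themselves."""
    return Counter(label for _value, label, _idx in chunk)

local = [local_histogram(c) for c in chunks]        # computed in parallel in SPRINT
combined = sum(local, Counter())                    # coordinator adds the histograms
print(local)      # [Counter({'P': 5, 'D': 2}), Counter({'P': 4, 'D': 3})]
print(combined)   # Counter({'P': 9, 'D': 5}): class totals for the node
```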

