
1 SLIQ and SPRINT for disk resident data

2 Shortcomings of ID3
Scalability: requires a lot of computation at every stage of decision tree construction
Scalability: needs all the training data to be in memory
It does not suggest any standard splitting index for range attributes

3 SLIQ
SLIQ is a decision tree classifier that can handle both numerical and categorical attributes
Builds compact and accurate trees
Uses a pre-sorting technique in the tree-growing phase
Suitable for classification of large disk-resident datasets

4 Issues
There are two major performance-critical issues in the tree-growth phase:
–How to find split points
–How to partition the data
The well-known decision tree classifiers:
–Grow trees depth-first
–Repeatedly sort the data at every node
SLIQ:
–Replaces this repeated sorting with a one-time sort
–Uses a new data structure called the class list
–The class list must remain memory-resident at all times

5 SLIQ Methodology
Start
Generate an attribute list for each attribute
Sort the attribute lists for NUMERIC attributes
Create the decision tree by partitioning records
End

6 SLIQ - Algorithm
This is the algorithm used for split evaluation (the pseudocode is given on slide 20).

7 SLIQ - Algorithm
Update class list (the pseudocode is given on slide 21).

8 Some Data
rid  age  salary  marital  car
1    30   60      single   sports
2    25   20      single   mini
3    40   80      married  van
4    45   100     single   luxury
5    60   150     married  luxury
6    35   120     single   sports
7    50   70      married  van
8    55   90      single   sports
9    65   30      married  mini
10   70   200     single   luxury

9 SLIQ - Attribute Lists
rid  age      rid  salary      rid  marital
1    30       1    60          1    single
2    25       2    20          2    single
3    40       3    80          3    married
4    45       4    100         4    single
5    60       5    150         5    married
6    35       6    120         6    single
7    50       7    70          7    married
8    55       8    90          8    single
9    65       9    30          9    married
10   70       10   200         10   single
These are projections on (rid, attribute).

10 SLIQ - Sort Numeric, Group Categorical
rid  age      rid  salary      rid  marital
2    25       2    20          3    married
1    30       9    30          5    married
6    35       1    60          7    married
3    40       7    70          9    married
4    45       3    80          1    single
7    50       8    90          2    single
8    55       4    100         4    single
5    60       6    120         6    single
9    65       5    150         8    single
10   70       10   200         10   single
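The projection and one-time sort from slides 9 and 10 can be sketched in Python roughly as follows. This is only an illustration on the ten example records: the names RECORDS, COLUMNS and build_attribute_lists are made up for this sketch, and everything is held in memory, whereas SLIQ keeps the attribute lists on disk.

```python
# Example records from slide 8: (rid, age, salary, marital, car).
RECORDS = [
    (1, 30, 60, "single", "sports"),
    (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"),
    (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"),
    (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"),
    (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"),
    (10, 70, 200, "single", "luxury"),
]

# Hypothetical mapping from attribute name to its column in RECORDS.
COLUMNS = {"age": 1, "salary": 2, "marital": 3}


def build_attribute_lists(records):
    """Project every attribute onto (rid, value) pairs and sort each list once."""
    lists = {}
    for name, col in COLUMNS.items():
        pairs = [(rec[0], rec[col]) for rec in records]
        # Numeric lists end up in ascending order. Sorting the categorical
        # list simply groups equal values together. list.sort is stable,
        # so records with equal values keep their rid order.
        pairs.sort(key=lambda pair: pair[1])
        lists[name] = pairs
    return lists


attribute_lists = build_attribute_lists(RECORDS)
for name, pairs in attribute_lists.items():
    print(name, pairs)
```

The sort happens exactly once, before tree growth starts. Every later split evaluation only scans these lists in order.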

11 SLIQ - Class List
rid  car     LEAF
1    sports  N1
2    mini    N1
3    van     N1
4    luxury  N1
5    luxury  N1
6    sports  N1
7    van     N1
8    sports  N1
9    mini    N1
10   luxury  N1

12 SLIQ - Histograms
Class list: every record points to leaf N1 (see slide 11).
Sorted age list (rid, age): (2, 25) (1, 30) (6, 35) (3, 40) (4, 45) (7, 50) (8, 55) (5, 60) (9, 65) (10, 70)
Class histogram at N1 before the scan:
              sports  mini  van  luxury
initial   L   0       0     0    0
          R   3       2     2    3
The histograms for the tests age ≤ 25 ? and age ≤ 30 ? are filled in as the age list is scanned.
Evaluate each split, using GINI or Entropy.

13 SLIQ - Histograms
Class list: every record points to leaf N1 (see slide 11).
Sorted age list (rid, age): (2, 25) (1, 30) (6, 35) (3, 40) (4, 45) (7, 50) (8, 55) (5, 60) (9, 65) (10, 70)
Class histograms at N1 while scanning the age list:
              sports  mini  van  luxury
initial   L   0       0     0    0
          R   3       2     2    3
age ≤ 25  L   0       1     0    0
          R   3       1     2    3
age ≤ 30  L   1       1     0    0
          R   2       1     2    3
...
Evaluate each split, using GINI or Entropy.
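As a worked illustration of "evaluate each split, using GINI", the sketch below scores the two candidate tests from the histograms above. It uses the standard Gini index, 1 minus the sum of squared class fractions, weighted by the size of each side. The function names are chosen for this sketch only.

```python
def gini(counts):
    """Gini index of one histogram row: 1 - sum of squared class fractions."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(left, right):
    """Gini index of a binary split, weighted by the number of records per side."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)


# Histogram rows in class order (sports, mini, van, luxury), from slide 13.
age_le_25 = gini_split([0, 1, 0, 0], [3, 1, 2, 3])
age_le_30 = gini_split([1, 1, 0, 0], [2, 1, 2, 3])
print(round(age_le_25, 4), round(age_le_30, 4))
```

A lower value means purer children, so the candidate with the smallest weighted Gini index is preferred.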

14 SLIQ - Histograms
Class list: every record points to leaf N1 (see slide 11).
Sorted salary list (rid, salary): (2, 20) (9, 30) (1, 60) (7, 70) (3, 80) (8, 90) (4, 100) (6, 120) (5, 150) (10, 200)
Class histograms at N1 while scanning the salary list:
                 sports  mini  van  luxury
initial      L   0       0     0    0
             R   3       2     2    3
salary ≤ 20  L   0       1     0    0
             R   3       1     2    3
salary ≤ 30  L   0       2     0    0
             R   3       0     2    3
...
Evaluate each split, using GINI or Entropy.

15 SLIQ - Histograms
Class list: every record points to leaf N1 (see slide 11).
Grouped marital list (rid, marital): (3, married) (5, married) (7, married) (9, married) (1, single) (2, single) (4, single) (6, single) (8, single) (10, single)
Class histograms at N1 for the categorical tests:
               sports  mini  van  luxury
Married  Yes   0       1     2    1
         No    3       1     0    2
Single   Yes   3       1     0    2
         No    0       1     2    1
Evaluate each split, using GINI or Entropy.
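For a categorical attribute the Yes/No histograms can be built in one pass over the grouped list, looking up each record's class by rid. The following is a rough sketch: CAR stands in for the class list, and categorical_split is a name invented here. It tests a single value rather than searching over subsets of values.

```python
from collections import Counter

CLASSES = ["sports", "mini", "van", "luxury"]

# Stand-in for the class list: rid -> class label (from slide 11).
CAR = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}

# Grouped marital attribute list from slide 10.
MARITAL_LIST = [(3, "married"), (5, "married"), (7, "married"), (9, "married"),
                (1, "single"), (2, "single"), (4, "single"), (6, "single"),
                (8, "single"), (10, "single")]


def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)


def categorical_split(attribute_list, class_of, value):
    """Histograms and weighted Gini index for the test (attribute == value)."""
    yes, no = Counter(), Counter()
    for rid, v in attribute_list:
        (yes if v == value else no)[class_of[rid]] += 1
    yes_counts = [yes[c] for c in CLASSES]
    no_counts = [no[c] for c in CLASSES]
    n_yes, n_no = sum(yes_counts), sum(no_counts)
    n = n_yes + n_no
    score = (n_yes / n) * gini(yes_counts) + (n_no / n) * gini(no_counts)
    return yes_counts, no_counts, score


yes_counts, no_counts, score = categorical_split(MARITAL_LIST, CAR, "married")
print(yes_counts, no_counts, round(score, 4))
```

With only two values, the tests "married" and "single" are mirror images of each other and receive the same score.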

16 SLIQ - Perform best split and Update Class List
Best split at N1: salary ≤ 60, creating the children N2 and N3.
Sorted salary list (rid, salary): (2, 20) (9, 30) (1, 60) (7, 70) (3, 80) (8, 90) (4, 100) (6, 120) (5, 150) (10, 200)
Class list before the update: every record still points to leaf N1 (see slide 11).

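17 SLIQ - Perform best split and Update Class List
Tree: N1 tests salary ≤ 60, with children N2 (salary ≤ 60) and N3 (salary > 60).
Sorted salary list (rid, salary): (2, 20) (9, 30) (1, 60) (7, 70) (3, 80) (8, 90) (4, 100) (6, 120) (5, 150) (10, 200)
Class list after the update:
rid  car     LEAF
1    sports  N2
2    mini    N2
3    van     N3
4    luxury  N3
5    luxury  N3
6    sports  N3
7    van     N3
8    sports  N3
9    mini    N2
10   luxury  N3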

18 SLIQ - Histograms
Tree: N1 tests salary ≤ 60, with children N2 and N3.
Class list: records 1, 2 and 9 point to N2; all other records point to N3 (see slide 17).
Sorted age list (rid, age): (2, 25) (1, 30) (6, 35) (3, 40) (4, 45) (7, 50) (8, 55) (5, 60) (9, 65) (10, 70)
Class histograms before the scan, one per leaf:
                 sports  mini  van  luxury
N2 initial   L   0       0     0    0
             R   1       2     0    0
N3 initial   L   0       0     0    0
             R   2       0     2    3
The histograms of N2 and N3 for the test age ≤ 25 ? are filled in during the same scan of the age list.
Evaluate each split, using GINI or Entropy.

19 SLIQ - Histograms
Tree: N1 tests salary ≤ 60, with children N2 and N3.
Class list: records 1, 2 and 9 point to N2; all other records point to N3 (see slide 17).
Sorted age list (rid, age): (2, 25) (1, 30) (6, 35) (3, 40) (4, 45) (7, 50) (8, 55) (5, 60) (9, 65) (10, 70)
Class histograms, one per leaf:
                  sports  mini  van  luxury
N2 initial    L   0       0     0    0
              R   1       2     0    0
N3 initial    L   0       0     0    0
              R   2       0     2    3
N2 age ≤ 25   L   0       1     0    0
              R   1       1     0    0
N3 age ≤ 25   L   0       0     0    0
              R   2       0     2    3
...
Evaluate each split, using GINI or Entropy.

20 SLIQ - Pseudocode
Split evaluation:
EvaluateSplits()
  for each numeric attribute A do
    for each value v in the attribute list do
      find the corresponding entry in the class list, and hence the corresponding class and the leaf node N_i
      update the class histogram in leaf N_i
      compute splitting score for test (A ≤ v) for N_i
  for each categorical attribute A do
    for each leaf of the tree do
      find subset of A with best split
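The numeric branch of EvaluateSplits might look roughly like this in Python. It is a sketch under several assumptions: the class list is a dict from rid to a (class, leaf) pair, the Gini index is the splitting score, attribute values are distinct (as in the example data), and a test that would leave one side empty is skipped. It returns every candidate score per leaf instead of only the best one, so the intermediate values can be inspected.

```python
CLASSES = ["sports", "mini", "van", "luxury"]

CAR = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}
SALARY_LIST = [(2, 20), (9, 30), (1, 60), (7, 70), (3, 80),
               (8, 90), (4, 100), (6, 120), (5, 150), (10, 200)]
AGE_LIST = [(2, 25), (1, 30), (6, 35), (3, 40), (4, 45),
            (7, 50), (8, 55), (5, 60), (9, 65), (10, 70)]


def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)


def evaluate_numeric_splits(attribute_list, class_list):
    """One scan of a sorted attribute list, scoring (A <= v) for every leaf.

    class_list maps rid -> (class label, leaf name).
    Returns leaf -> list of (v, weighted Gini index).
    """
    index = {c: i for i, c in enumerate(CLASSES)}
    left, right = {}, {}
    # Before the scan every record of a leaf sits in the R row.
    for label, leaf in class_list.values():
        left.setdefault(leaf, [0] * len(CLASSES))
        right.setdefault(leaf, [0] * len(CLASSES))[index[label]] += 1
    scores = {leaf: [] for leaf in left}
    for rid, value in attribute_list:
        label, leaf = class_list[rid]      # class-list lookup by rid
        left[leaf][index[label]] += 1      # record moves from R to L
        right[leaf][index[label]] -= 1
        n_left, n_right = sum(left[leaf]), sum(right[leaf])
        if n_right == 0:
            continue                       # nothing left on the right side
        n = n_left + n_right
        score = (n_left / n) * gini(left[leaf]) + (n_right / n) * gini(right[leaf])
        scores[leaf].append((value, score))
    return scores


# Root level: all records at N1.
root = {rid: (car, "N1") for rid, car in CAR.items()}
salary_scores = dict(evaluate_numeric_splits(SALARY_LIST, root)["N1"])

# Next level: records 1, 2, 9 at N2, the rest at N3 (slide 17).
level2 = {rid: (car, "N2" if rid in (1, 2, 9) else "N3") for rid, car in CAR.items()}
age_scores = evaluate_numeric_splits(AGE_LIST, level2)
print(age_scores["N2"])
```

The second call shows the point of the class list: a single pass over the age list updates the histograms of N2 and N3 at the same time, so all leaves of a level are evaluated together.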

21 SLIQ - Pseudocode
Updating the class list:
UpdateLabels()
  for each split leaf N_i do
    let A be the split attribute for N_i
    for each (rid, v) in the attribute list for A do
      find the corresponding entry e in the class list (using the rid)
      if the leaf referenced by e is N_i then
        find the new leaf N_j to which (rid, v) belongs (by applying the splitting test)
        update the leaf pointer for e to N_j
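A rough Python counterpart of UpdateLabels, applied to the split salary ≤ 60 at N1. The shape of the splits argument (leaf mapped to attribute, threshold and the two child names) is an assumption of this sketch, and only numeric tests of the form value ≤ threshold are handled.

```python
CAR = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}
ATTRIBUTE_LISTS = {
    "salary": [(2, 20), (9, 30), (1, 60), (7, 70), (3, 80),
               (8, 90), (4, 100), (6, 120), (5, 150), (10, 200)],
}


def update_labels(class_list, attribute_lists, splits):
    """Re-point class-list entries from each split leaf to its new child.

    splits maps leaf -> (attribute, threshold, left_child, right_child);
    a record goes to left_child when value <= threshold.
    """
    for leaf, (attr, threshold, left_child, right_child) in splits.items():
        for rid, value in attribute_lists[attr]:
            entry = class_list[rid]        # class-list lookup by rid
            if entry["leaf"] == leaf:
                entry["leaf"] = left_child if value <= threshold else right_child


class_list = {rid: {"car": car, "leaf": "N1"} for rid, car in CAR.items()}
update_labels(class_list, ATTRIBUTE_LISTS, {"N1": ("salary", 60, "N2", "N3")})
print({rid: entry["leaf"] for rid, entry in class_list.items()})
```

Only the leaf pointers in the class list change. The attribute lists themselves are never rewritten, which is why SLIQ has to keep the class list in memory.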

22 SLIQ - Bottleneck
The class list must remain memory-resident at all times!
–Although this is not a big problem with today's memory sizes, there may still be cases where it is a bottleneck.
So, what can we do when the class list doesn't fit in main memory?
–SPRINT is a solution...

23 SPRINT
The main data structures used in SPRINT are attribute lists and class histograms.
rid  age  car        rid  salary  car        rid  marital  car
2    25   mini       2    20      mini       3    married  van
1    30   sports     9    30      mini       5    married  luxury
6    35   sports     1    60      sports     7    married  van
3    40   van        7    70      van        9    married  mini
4    45   luxury     3    80      van        1    single   sports
7    50   van        8    90      sports     2    single   mini
8    55   sports     4    100     luxury     4    single   luxury
5    60   luxury     6    120     sports     6    single   sports
9    65   mini       5    150     luxury     8    single   sports
10   70   luxury     10   200     luxury     10   single   luxury

24 SPRINT - Histograms
Sorted age list (rid, age, car): (2, 25, mini) (1, 30, sports) (6, 35, sports) (3, 40, van) (4, 45, luxury) (7, 50, van) (8, 55, sports) (5, 60, luxury) (9, 65, mini) (10, 70, luxury)
Class histograms while scanning the age list:
              sports  mini  van  luxury
initial   L   0       0     0    0
          R   3       2     2    3
age ≤ 25  L   0       1     0    0
          R   3       1     2    3
age ≤ 30  L   1       1     0    0
          R   2       1     2    3
...
Evaluate each split, using GINI or Entropy.
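Because each SPRINT attribute-list record carries its own class label, the histogram scan needs no class-list lookup. A minimal sketch of that scan for a single node follows. The generator name scan_histograms is made up for illustration, and the list is held in memory here.

```python
CLASSES = ["sports", "mini", "van", "luxury"]

# SPRINT age list from slide 23: (rid, age, car), sorted by age.
AGE_LIST = [(2, 25, "mini"), (1, 30, "sports"), (6, 35, "sports"),
            (3, 40, "van"), (4, 45, "luxury"), (7, 50, "van"),
            (8, 55, "sports"), (5, 60, "luxury"), (9, 65, "mini"),
            (10, 70, "luxury")]


def scan_histograms(attribute_list):
    """Yield (value, L row, R row) after each record of a node's sorted list."""
    index = {c: i for i, c in enumerate(CLASSES)}
    left = [0] * len(CLASSES)
    right = [0] * len(CLASSES)
    # First pass: all records start in the R row.
    for _, _, label in attribute_list:
        right[index[label]] += 1
    # Second pass: move one record at a time from R to L.
    for _, value, label in attribute_list:
        left[index[label]] += 1
        right[index[label]] -= 1
        yield value, tuple(left), tuple(right)


steps = list(scan_histograms(AGE_LIST))
for value, l_row, r_row in steps[:2]:
    print("age <=", value, "L", l_row, "R", r_row)
```

The price of dropping the class list is that every attribute list now repeats the class label, and the lists must be physically partitioned at each split.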

25 SPRINT - Histograms
Sorted salary list (rid, salary, car): (2, 20, mini) (9, 30, mini) (1, 60, sports) (7, 70, van) (3, 80, van) (8, 90, sports) (4, 100, luxury) (6, 120, sports) (5, 150, luxury) (10, 200, luxury)
Class histograms while scanning the salary list:
                 sports  mini  van  luxury
initial      L   0       0     0    0
             R   3       2     2    3
salary ≤ 20  L   0       1     0    0
             R   3       1     2    3
salary ≤ 30  L   0       2     0    0
             R   3       0     2    3
...
Evaluate each split, using GINI or Entropy.

26 SPRINT - Histograms
Grouped marital list (rid, marital, car): (3, married, van) (5, married, luxury) (7, married, van) (9, married, mini) (1, single, sports) (2, single, mini) (4, single, luxury) (6, single, sports) (8, single, sports) (10, single, luxury)
Class histograms for the categorical tests:
               sports  mini  van  luxury
Married  Yes   0       1     2    1
         No    3       1     0    2
Single   Yes   3       1     0    2
         No    0       1     2    1
Evaluate each split, using GINI or Entropy.

27 SPRINT - Performing Best Split
Once the best split point has been found for a node, we execute the split by creating child nodes.
This requires splitting each of the node’s attribute lists into two.
Partitioning the attribute list of the winning attribute (salary) is easy.
–We scan the list, apply the split test, and move the records to two new attribute lists, one for each new child.

28 SPRINT - Performing Best Split
Unfortunately, for the remaining attribute lists of the node (age and marital), we have no test that we can apply to the attribute values to decide how to divide the records.
Solution: use the rids.
–As we partition the list of the splitting attribute (i.e. salary), we insert the rid of each record into a probe structure (hash table), noting to which child the record was moved.
Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record.
–The retrieved information tells us with which child to place the record.
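The rid-based partitioning described above can be sketched as follows, using a Python dict as the probe structure. The names build_lists and split_node are hypothetical, the split is the example test salary ≤ 60, and all lists are kept in memory for simplicity.

```python
# Example records from slide 8: (rid, age, salary, marital, car).
RECORDS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]


def build_lists(records):
    """SPRINT-style attribute lists: (rid, value, class), sorted by value."""
    columns = {"age": 1, "salary": 2, "marital": 3}
    return {
        name: sorted(((r[0], r[col], r[4]) for r in records), key=lambda e: e[1])
        for name, col in columns.items()
    }


def split_node(attribute_lists, split_attr, threshold):
    """Partition all attribute lists of a node on (split_attr <= threshold)."""
    left = {name: [] for name in attribute_lists}
    right = {name: [] for name in attribute_lists}
    probe = {}  # hash table: rid -> True if the record went to the left child
    # Step 1: split the winning attribute's list by applying the test.
    for rid, value, label in attribute_lists[split_attr]:
        goes_left = value <= threshold
        probe[rid] = goes_left
        (left if goes_left else right)[split_attr].append((rid, value, label))
    # Step 2: split the remaining lists by probing the hash table with each rid.
    for name, records in attribute_lists.items():
        if name == split_attr:
            continue
        for record in records:
            (left if probe[record[0]] else right)[name].append(record)
    return left, right


left, right = split_node(build_lists(RECORDS), "salary", 60)
print(left["age"])
print(left["marital"])
```

Because each list is scanned in order and records are appended as they are met, the child lists stay sorted and no re-sorting is needed after a split.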

29 SPRINT - Performing Best Split
If the hash table is too large for memory, splitting is done in more than one step.
–The attribute list for the splitting attribute is partitioned up to the attribute record for which the hash table will fit in memory;
–Portions of the attribute lists of the non-splitting attributes are partitioned;
–The process is repeated for the remainder of the attribute list of the splitting attribute.
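One way to picture the multi-step split is the sketch below, where max_probe_entries is a hypothetical limit on how many rids the hash table may hold at once. Each step partitions one portion of the salary list, then moves the matching records of the other lists. The slides do not say how the per-step outputs are combined, so the final heapq.merge, which restores sorted order across steps, is an assumption of this sketch.

```python
import heapq

# SPRINT lists from slide 23: (rid, value, car), sorted by value.
SALARY_LIST = [(2, 20, "mini"), (9, 30, "mini"), (1, 60, "sports"),
               (7, 70, "van"), (3, 80, "van"), (8, 90, "sports"),
               (4, 100, "luxury"), (6, 120, "sports"), (5, 150, "luxury"),
               (10, 200, "luxury")]
AGE_LIST = [(2, 25, "mini"), (1, 30, "sports"), (6, 35, "sports"),
            (3, 40, "van"), (4, 45, "luxury"), (7, 50, "van"),
            (8, 55, "sports"), (5, 60, "luxury"), (9, 65, "mini"),
            (10, 70, "luxury")]


def split_in_steps(split_list, other_lists, threshold, max_probe_entries):
    """Split a node when the probe table holds at most max_probe_entries rids."""
    split_left, split_right = [], []
    left_runs = {name: [] for name in other_lists}
    right_runs = {name: [] for name in other_lists}
    for start in range(0, len(split_list), max_probe_entries):
        # Partition only as much of the splitting list as the table can hold.
        probe = {}
        for rid, value, label in split_list[start:start + max_probe_entries]:
            goes_left = value <= threshold
            probe[rid] = goes_left
            (split_left if goes_left else split_right).append((rid, value, label))
        # Move the records of the other lists whose rids are in this table.
        for name, records in other_lists.items():
            run_left, run_right = [], []
            for record in records:
                if record[0] in probe:
                    (run_left if probe[record[0]] else run_right).append(record)
            left_runs[name].append(run_left)
            right_runs[name].append(run_right)

    def merge(runs):
        # Assumption of this sketch: merge the sorted per-step runs by value.
        return list(heapq.merge(*runs, key=lambda record: record[1]))

    left = {name: merge(runs) for name, runs in left_runs.items()}
    right = {name: merge(runs) for name, runs in right_runs.items()}
    return (split_left, left), (split_right, right)


(salary_left, left), (salary_right, right) = split_in_steps(
    SALARY_LIST, {"age": AGE_LIST}, threshold=60, max_probe_entries=4)
print(left["age"])
print(right["age"])
```

The trade-off is visible in the inner loop: every step rescans the non-splitting lists, so a smaller hash table means more passes over the data.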

