Download presentation

Presentation is loading. Please wait.

Published byVirginia Reddington Modified about 1 year ago

1
SLIQ and SPRINT for disk resident data

2
SLIQ SLIQ is a decision tree classifier that can handle both numerical and categorical attributes Builds compact and accurate trees Uses a pre-sorting technique in the tree growing phase Suitable for classification of large disk-resident datasets.

3
Issues There are two major, critical performance, issues in the tree- growth phase: –How to find split points –How to partition the data The well-known decision tree classifiers: –Grow trees depth-first –Repeatedly sort the data at every node SLIQ: –Replace this repeated sorting with one-time sort –Use new a data structure call class-list –Class-list must remain memory resident at all time

4
Some Data ridagesalarymaritalcar 13060singlesports 22520singlemini 34080marriedvan singleluxury marriedluxury singlesports 75070marriedvan 85590singlesports 96530marriedmini singleluxury

5
SLIQ - Attribute Lists ridage ridsalary ridmarital 1single 2 3married 4single 5married 6single 7married 8single 9married 10single These are projections on (rid, attribute).

6
SLIQ - Sort Numeric, Group Categorical ridage ridsalary ridmarital 3married single single

7
SLIQ - Class List ridcarLEAF 1sportsN1 2miniN1 3vanN1 4luxuryN1 5luxuryN1 6sportsN1 7vanN1 8sportsN1 9miniN1 10luxuryN1

8
SLIQ - Histograms ridcarLEAF 1sportsN1 2miniN1 3vanN1 4luxuryN1 5luxuryN1 6sportsN1 7vanN1 8sportsN1 9miniN1 10luxuryN1 ridage sportsminivanluxury L0000 R3223 sportsminivanluxury L R sportsminivanluxury L R... N1 Evaluate each split, using GINI or Entropy. age 25 ? age 30 ?

9
SLIQ - Histograms ridcarLEAF 1sportsN1 2miniN1 3vanN1 4luxuryN1 5luxuryN1 6sportsN1 7vanN1 8sportsN1 9miniN1 10luxuryN1 ridage sportsminivanluxury L0000 R3223 sportsminivanluxury L0100 R3123 sportsminivanluxury L1100 R N1 Evaluate each split, using GINI or Entropy. age 25 age 30

10
SLIQ - Histograms ridcarLEAF 1sportsN1 2miniN1 3vanN1 4luxuryN1 5luxuryN1 6sportsN1 7vanN1 8sportsN1 9miniN1 10luxuryN1 sportsminivanluxury L0000 R3223 sportsminivanluxury L0100 R ridsalary sportsminivanluxury L0200 R3023 N1 Evaluate each split, using GINI or Entropy. salary 20 salary 30

11
SLIQ - Histograms ridcarLEAF 1sportsN1 2miniN1 3vanN1 4luxuryN1 5luxuryN1 6sportsN1 7vanN1 8sportsN1 9miniN1 10luxuryN1 sportsminivanluxury Yes0121 No3102 sportsminivanluxury Yes3102 No0121 ridmarital 3married single single Married Single N1 Evaluate each split, using GINI or Entropy.

12
SLIQ - Perform best split and Update Class List ridcarLEAF 1sportsN1 2miniN1 3vanN1 4luxuryN1 5luxuryN1 6sportsN1 7vanN1 8sportsN1 9miniN1 10luxuryN1 salary 60 N2 N3 ridsalary

13
SLIQ - Perform best split and Update Class List ridcarLEAF 1sportsN2 2miniN2 3vanN3 4luxuryN3 5luxuryN3 6sportsN3 7vanN3 8sportsN3 9miniN2 10luxuryN3 N1 salary 60 N2 N3 ridsalary

14
SLIQ - Histograms ridage sportsminivanluxury L0000 R1110 sportsminivanluxury L0000 R ridcarLEAF 1sportsN2 2miniN2 3vanN3 4luxuryN3 5luxuryN3 6sportsN3 7vanN3 8sportsN3 9miniN2 10luxuryN3 N1 salary 60 N2 N3 sportsminivanluxury L R sportsminivanluxury L R N1 N2 N1 N2 Evaluate each split, using GINI or Entropy. age 25 ?

15
SLIQ - Histograms ridage sportsminivanluxury L0000 R1110 sportsminivanluxury L0000 R ridcarLEAF 1sportsN2 2miniN2 3vanN3 4luxuryN3 5luxuryN3 6sportsN3 7vanN3 8sportsN3 9miniN2 10luxuryN3 N1 salary 60 N2 N3 sportsminivanluxury L0100 R1010 sportsminivanluxury L0000 R2023 N1 N2 N1 N2 Evaluate each split, using GINI or Entropy. age 25

16
SLIQ - Pseudocode Split evaluation: EvaluateSplits() for each numeric attribute A do for each value v in the attribute list do find the corresponding entry in the class list, and hence the corresponding class and the leaf node N i update the class histogram in leaf N i compute splitting score for test (A ≤ v) for N i for each categorical attribute A do for each leaf of the tree do find subset of A with best split

17
SLIQ - Pseudocode Updating the class list UpdateLabels() for each split leaf N i do Let A be the split attribute for N i. for each (rid,v) in the attribute list for A do find the corresponding entry in the class list e (using the rid) if the leaf referenced by e is N i then find the new leaf N j to which (rid,v) belongs (by applying the splitting test) update the leaf pointer for e to N j

18
SLIQ - bottleneck Class-list must remain memory resident at all time! –Although not a big problem with today's memories, still there might be cases where this is a bottleneck. So, what can we do when the class-list doesn't fit in main memory? –SPRINT is a solution...

19
SPRINT ridagecar 225mini 130sports 635sports 340van 445luxury 750van 855sports 560luxury 965mini 1070luxury ridsalarycar 220mini 930mini 160sports 770van 380van 890sports 4100luxury 6120sports 5150luxury 10200luxury ridmaritalcar 3marriedvan 5marriedluxury 7marriedvan 9marriedmini 1singlesports 2singlemini 4singleluxury 6singlesports 8singlesports 10singleluxury The main data structures used in SPRINT are: Attribute lists and Class histograms

20
SPRINT - Histograms sportsminivanluxury L0000 R3223 sportsminivanluxury L0100 R3123 sportsminivanluxury L1100 R ridagecar 225mini 130sports 635sports 340van 445luxury 750van 855sports 560luxury 965mini 1070luxury age 25 age 30 Evaluate each split, using GINI or Entropy.

21
SPRINT - Histograms sportsminivanluxury L0000 R3223 sportsminivanluxury L0100 R sportsminivanluxury L0200 R3023 ridsalarycar 220mini 930mini 160sports 770van 380van 890sports 4100luxury 6120sports 5150luxury 10200luxury salary 20 salary 30 Evaluate each split, using GINI or Entropy.

22
SPRINT - Histograms sportsminivanluxury Yes0121 No3102 sportsminivanluxury Yes3102 No0121 Married Single ridmaritalcar 3marriedvan 5marriedluxury 7marriedvan 9marriedmini 1singlesports 2singlemini 4singleluxury 6singlesports 8singlesports 10singleluxury Evaluate each split, using GINI or Entropy.

23
SPRINT - Performing Best Split Once the best split point has been found for a node, we execute the split by creating child nodes. Requires splitting the node’s lists for every attribute into two. Partitioning the attribute list of the winning attribute (salary) is easy. –We scan the list, apply the split test, and move the records to two new attribute lists - one for each new child.

24
SPRINT - Performing Best Split Unfortunately, for the remaining attribute lists of the node (age and marital), we have no test that we can apply to the attribute values to decide how to divide the records. Solution: use the rids. –As we partition the list of the splitting attribute (i.e. salary), we insert the rids of each record into a probe structure (hash table), noting to which child the record was moved. Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record. –The retrieved information tells us with which child to place the record.

25
SPRINT - Performing Best Split If the hash-table is too large for the memory, splitting is done in more than one step. –The attribute list for the splitting attribute is partitioned up to the attribute record for which the hash table will fit in memory; –Portions of attribute lists of non-splitting attributes are partitioned; and the process is repeated for the remainder of the attribute list of the splitting attribute.

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google