Presentation on theme: "Martin Ralbovský KIZI FIS VŠE 6.12.2007. The GUHA method Provides a general mainframe for retrieving interesting information from data Strong foundations."— Presentation transcript:

1 Martin Ralbovský KIZI FIS VŠE 6.12.2007

2 The GUHA method
- Provides a general framework for retrieving interesting information from data
- Strong foundations in logic and statistics
- One of the main principles of the method is to provide "everything interesting" to the user

3 Decision trees
- One of the best-known classification methods
- Several well-known algorithms exist for constructing decision trees (ID3, C4.5, ...)
- Algorithm outline: iterate through the attributes; in each step choose the best attribute for branching and make a node from that attribute
- The best decision tree is the output
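To make the greedy step concrete, here is a minimal Python sketch of the ID3-style choice of a single best branching attribute via information gain. The function names and data layout are illustrative, not taken from the slides:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction achieved by splitting on attribute `attr`."""
    total = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in by_value.values())
    return total - remainder

def best_attribute(rows, labels, attrs):
    """Greedy ID3-style choice: the single attribute with the highest gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))
```

ETree generalizes exactly this step: instead of keeping only the maximizing attribute, it keeps the n best-scoring ones.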

4 Making decision trees with GUHA (Petr Berka)
- A decision tree can be viewed as a GUHA verification/hypothesis, but there is only one tree in the output
- Modification of the initial algorithm: the ETree procedure
- We branch not on the single best attribute, but on the n best attributes
- In each iteration, nodes suitable for branching are selected from the existing trees and branched
- Only sound decision trees go to the output

5 ETree parameters (Petr Berka)
- Criterion for attribute ordering: χ²
- Trees: maximal tree depth (parameter); allow only full-length trees; number of attributes for branching
- Branching: minimal node frequency; minimal node purity; criterion for stopping branching (frequency, purity, frequency OR purity)
- Sound trees: confusion matrix, F-measure + any 4ft-quantifier in Ferda
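ETree orders candidate attributes by the χ² statistic of each attribute against the class. As an illustration (not Ferda's actual code), Pearson's χ² can be computed from the observed contingency table:

```python
from collections import Counter

def chi_square(attr_values, class_values):
    """Pearson chi-square statistic of an attribute against the class,
    computed from the observed contingency table: sum of
    (observed - expected)^2 / expected over all table cells."""
    n = len(attr_values)
    observed = Counter(zip(attr_values, class_values))
    attr_totals = Counter(attr_values)
    class_totals = Counter(class_values)
    chi2 = 0.0
    for a in attr_totals:
        for c in class_totals:
            expected = attr_totals[a] * class_totals[c] / n
            chi2 += (observed[(a, c)] - expected) ** 2 / expected
    return chi2
```

A higher value means a stronger dependence between the attribute and the class, so the n attributes with the highest χ² are chosen for branching.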

6 How to branch I
Attribute1 = {A,B,C}, Attribute2 = {1,2}
[Slide diagrams: variants a) and b) of branching a node over the categories A, B, C]

7 How to branch II
[Slide diagrams: variants a) and b) of further branching the A/B/C nodes by Attribute2 values 1 and 2]

8 Pseudocode algorithm

Stack&lt;Tree&gt; stack = new Stack&lt;Tree&gt;(); // LIFO stack
stack.Push(MakeSeedTree());
while (stack.Count > 0)
{
    Tree processTree = stack.Pop();
    foreach (Node n in NodesForBranching(processTree))
    {
        stack.Push(CreateTree(processTree, n));
    }
    if (QualityTree(processTree))
    {
        PutToOutput(processTree);
    }
}
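The loop above can be written as a short runnable Python sketch, with the tree-specific operations passed in as stub functions (the argument names mirror the pseudocode, nothing here is Ferda's API):

```python
def etree_search(seed, nodes_for_branching, create_tree, quality_tree):
    """Depth-first (LIFO) enumeration of candidate trees, mirroring the
    stack-based loop on the slide. Every tree popped from the stack is
    both expanded (one new tree per branchable node) and, if it passes
    the quality test, emitted to the output."""
    output = []
    stack = [seed]
    while stack:
        tree = stack.pop()
        for node in nodes_for_branching(tree):
            stack.append(create_tree(tree, node))
        if quality_tree(tree):
            output.append(tree)
    return output
```

With stubs that model a tree as a tuple of branching choices, the search enumerates every combination up to the depth limit, matching ETree's "everything interesting" principle.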

9 Implementation in Ferda
- Instead of creating a new DM tool, the modularity of Ferda was used
- Data preparation boxes
- 4ft-quantifiers can be used to measure the quality of trees
- Uses the MiningProcessor (bit string generation engine)
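The MiningProcessor represents categories as bit strings: each category gets one bit per data row, so the frequency of a conjunction of conditions is just a bitwise AND followed by a popcount. A toy illustration of the idea (not Ferda's actual API):

```python
def bitstring(column, value):
    """Encode the condition 'column == value' as an integer bit mask,
    one bit per data row (bit i set iff row i satisfies the condition)."""
    mask = 0
    for i, v in enumerate(column):
        if v == value:
            mask |= 1 << i
    return mask

def frequency(masks):
    """Number of rows satisfying the conjunction of all conditions:
    AND the masks together, then count the set bits."""
    result = masks[0]
    for m in masks[1:]:
        result &= m
    return bin(result).count("1")
```

This is why plain 4ft verifications are cheap in Ferda: conjunctions reduce to word-level bit operations over cached bit strings.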

10 ETree task example
- Existing data preparation boxes
- 4ft-quantifiers
- ETree task box

11 Output + settings example …

12 Experiment 1 - Barbora
- Barbora bank, ca. 6,100 clients; classification of client status from: loan amount, client district, loan duration, client salary
- Number of attributes for branching = 4
- Minimal node purity = 0.8
- Minimal node frequency = 61 (1% of the data)

13 Results - Barbora

Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis
1 | 0.5 | 5 | 2 | 0.75
2 | 0.7 | 17 | 7 | 0.88
3 | 0.85 | 193 | 26 | 0.88
4 | 0.85 | 7910 | 222 | 0.90

Performance: 36 verifications/sec

14 Experiment 2: Forest tree cover
- UCI KDD dataset for classification (10K sample)
- Classification of tree cover based on characteristics: wilderness area, elevation, slope, horizontal + vertical distance to hydrology, horizontal distance to fire point
- Number of attributes for branching: 1, 3, 5
- Minimal node purity: 0.8
- Minimal node frequency: 100 (1% of the dataset)

15 Results - Forest tree cover

Attributes for branching: 1 (performance: 39 VPS)
Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis
1 | 0.5 | 2 | 1 | 0.50
2 | 0.7 | 2 | 1 | 0.71
3 |  | 10 | 8 | 0.71
4 | 0.72 | 59 | 21 | 0.72

Attributes for branching: 3 (performance: 86 VPS)
Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis
1 | 0.5 | 4 | 1 | 0.50
2 | 0.7 | 21 | 5 | 0.72
3 |  | 273 | 67 | 0.73
4 | 0.74 | 6673 | 73 | 0.74

Attributes for branching: 5 (performance: 71 VPS)
Tree depth | F-threshold | Verifications | Hypotheses | Best hypothesis
1 | 0.5 | 6 | 1 | 0.50
2 | 0.7 | 31 | 7 | 0.73
3 |  | 551 | 52 | 0.74
4 |  | 17396 | 183 | 0.75

16 Experiment 3: Forest tree cover
- Construction of trees for the whole dataset (ca. 600K rows)
- Does increasing the number of attributes for branching result in better trees?
- Tree length = 3; the other parameters the same as in experiment 2
- Number of attributes for branching = 1: best hypothesis 0.30, 6 VPS (bit strings in cache)
- Number of attributes for branching = 4: best hypothesis 0.52, 2 VPS (bit strings in cache)

17 Verifications: 4FT vs. ETree
- On tasks over data tables of similar length: 4FT (in Ferda) approx. 5000 VPS, ETree about 70 VPS
- ETree verification is far more complicated:
- In addition to computing the quantifier, it computes χ² for each node suitable for branching
- Hard operations (sums) instead of easy operations (conjunctions, ...)
- Not only verification of a tree, but also construction of the trees derived from it

18 Further work
- How new/known is the method?
- Boxes for attribute selection criteria
- Classification box
- Better result browsing + result reduction
- Optimization
- Elective classification: each tree has a vote (Petr Berka)
- Experiments with various data sources
- Decision trees from fuzzy attributes
- Better estimation of the relevant question count

