Demo: Classification Programs C4.5 CBA Minqing Hu CS594 Fall 2003 UIC.


1 Demo: Classification Programs C4.5 CBA Minqing Hu CS594 Fall 2003 UIC

2 C4.5
Classification using decision trees.
Where to find the program?
–C4.5 Release 8: by Ross Quinlan
–http://www.cse.unsw.edu.au/~quinlan/
Runs under Unix
Reference book: “C4.5: Programs for Machine Learning”, J. Ross Quinlan

3 C4.5 Files
Names file (filestem.names)
–Provides names for classes, attributes, and attribute values.
–Consists of a series of entries, each starting on a new line and ending with a period. The first entry gives the class names, separated by commas. The rest of the file consists of a single entry for each attribute.
–Each attribute entry begins with the attribute name followed by a colon, then a specification of the values that the attribute can take.
–Four specifications are possible:
»ignore: causes the value of the attribute to be disregarded
»continuous: the attribute has numeric values
»discrete N: N is a positive integer; specifies that the attribute has no more than N discrete values
»a list of names separated by commas: the attribute takes one of the listed values

4 Example: Golf.names
Play, Don't Play.    | class labels
outlook: sunny, overcast, rain.
temperature: continuous.
humidity: continuous.
windy: true, false.
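The names-file layout described above can be read with a few lines of Python. This is only a minimal sketch, not part of the C4.5 distribution: the function name `parse_names` is hypothetical, and it assumes one entry per line, "|" starting a comment, and no periods inside names.

```python
def parse_names(text):
    """Parse C4.5 .names text into (class names, attribute specs).

    Simplifying assumptions (not guaranteed by C4.5 itself): one entry
    per line, '|' starts a comment, names contain no periods.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()      # drop trailing comment
        if line:
            entries.append(line.rstrip(".").strip())

    # First entry: comma-separated class names.
    classes = [c.strip() for c in entries[0].split(",")]

    # Remaining entries: "attribute: specification".
    attributes = {}
    for entry in entries[1:]:
        name, spec = entry.split(":", 1)
        spec = spec.strip()
        if spec in ("continuous", "ignore") or spec.startswith("discrete"):
            attributes[name.strip()] = spec
        else:
            attributes[name.strip()] = [v.strip() for v in spec.split(",")]
    return classes, attributes


GOLF_NAMES = """\
Play, Don't Play. | class labels
outlook: sunny, overcast, rain.
temperature: continuous.
humidity: continuous.
windy: true, false.
"""

classes, attributes = parse_names(GOLF_NAMES)
print(classes)
print(attributes)
```

Keeping the keyword specifications as plain strings and the enumerated values as lists makes it easy to tell the two kinds of attribute apart later.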

5 C4.5 Files (cont)
Data file (filestem.data)
–Describes the training cases used to generate the decision tree and/or rules
–Each line describes one case, giving values for all the attributes and then the case’s class, separated by commas and terminated by a period
–Attribute values must appear in the same order as the attributes were given in the names file
–Use ? for missing or unknown values
Test file (filestem.test)
–Used to evaluate the classifier you have produced
–Has exactly the same format as the data file

6 Example: Golf.data
| outlook, temperature, humidity, windy, class label
sunny, 85, 85, false, Don't Play
sunny, 80, 90, true, Don't Play
overcast, 83, 78, false, Play
rain, 70, 96, ?, Play
rain, 68, ?, false, Play
rain, 65, 70, true, Don't Play
overcast, 64, 65, true, Play
sunny, 72, 95, false, Don't Play
sunny, 69, 70, false, Play
overcast, 72, 90, true, Play
overcast, 81, 75, false, Play
rain, 71, 80, true, Don't Play
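To illustrate the data-file rules (comma-separated values, class label last, ? for missing values), here is a small Python sketch. The helper `parse_case` is a hypothetical name, and representing missing values as `None` is a choice made here for illustration, not something C4.5 prescribes.

```python
def parse_case(line, n_attributes):
    """Split one .data line into (attribute values, class label).

    Missing values written as '?' are returned as None.
    """
    line = line.split("|", 1)[0].strip().rstrip(".")   # drop comment and final period
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != n_attributes + 1:
        raise ValueError(f"expected {n_attributes + 1} fields, got {len(fields)}")
    values = [None if f == "?" else f for f in fields[:-1]]
    return values, fields[-1]


# Three rows taken from the Golf.data example.
rows = [
    "sunny, 85, 85, false, Don't Play",
    "rain, 70, 96, ?, Play",
    "rain, 68, ?, false, Play",
]
cases = [parse_case(r, 4) for r in rows]
missing = sum(v is None for values, _ in cases for v in values)
print(cases[1])
print("missing values:", missing)
```

The field-count check catches lines whose values do not line up with the attributes declared in the names file.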

7 Running the programs
C4.5: decision tree generation
“c4.5 -f filestem [-u]”
  -f filestem (Default: DF): specifies the filestem of the task
  -u (Default: no test set): use this option when a test file has been prepared
Example:
  only training: “c4.5 -f ../Data/vote”
  training and testing: “c4.5 -f ../Data/vote -u”
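If the runs are scripted, the command line can be assembled programmatically. The sketch below uses only the two options named on this slide (-f and -u); the function names `c45_command` and `run_c45` are hypothetical, and it assumes an executable called c4.5 is on the PATH.

```python
import shutil
import subprocess


def c45_command(filestem, with_test=False, program="c4.5"):
    """Build the argument list for a c4.5 run (only -f and -u are used)."""
    argv = [program, "-f", filestem]
    if with_test:
        argv.append("-u")        # also evaluate on filestem.test
    return argv


def run_c45(filestem, with_test=False):
    """Run c4.5 and return its text output, or None if it is not installed."""
    argv = c45_command(filestem, with_test)
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True).stdout


print(c45_command("../Data/vote"))
print(c45_command("../Data/vote", with_test=True))
```

Passing the arguments as a list avoids shell quoting problems when the filestem contains spaces.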

8 c4.5 output
C4.5 [release 8] decision tree generator Fri Sep 12 12:02:
Options: File stem
Read 300 cases (16 attributes) from ../Data/vote.data
Decision Tree:
physician fee freeze = n:
| adoption of the budget resolution = y: democrat (151.0)
| adoption of the budget resolution = u: democrat (1.0)
| adoption of the budget resolution = n:
| | education spending = n: democrat (6.0)
| | education spending = y: democrat (9.0)
| | education spending = u: republican (1.0)
physician fee freeze = y:
| synfuels corporation cutback = n: republican (97.0/3.0)
| synfuels corporation cutback = u: republican (4.0)
| synfuels corporation cutback = y:
| | duty free exports = y: democrat (2.0)
| | duty free exports = u: republican (1.0)
| | duty free exports = n:
| | | education spending = n: democrat (5.0/2.0)
| | | education spending = y: republican (13.0/2.0)
| | | education spending = u: democrat (1.0)
physician fee freeze = u:
| water project cost sharing = n: democrat (0.0)
| water project cost sharing = y: democrat (4.0)
| water project cost sharing = u:
| | mx missile = n: republican (0.0)
| | mx missile = y: democrat (3.0/1.0)
| | mx missile = u: republican (2.0)
The numbers at the leaves have the form (N) or (N/E):
–N is the number of cases that reach the leaf
–E is the number of those cases that belong to a class other than the one nominated by the leaf
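The (N) and (N/E) leaf annotations can be checked against the rest of the output. The Python sketch below reads the tree printed on this slide, sums N and E over the leaves, and should give 300 cases and 8 errors, which agrees with the "Read 300 cases" line. The regular expression and the name `parse_leaf` are assumptions made for this illustration, not part of C4.5.

```python
import re

# A leaf line ends with "<class> (N)" or "<class> (N/E)".
LEAF = re.compile(r":\s*(\w+)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def parse_leaf(line):
    """Return (class, N, E) for a leaf line, or None for an internal node."""
    m = LEAF.search(line)
    if m is None:
        return None
    label, n, e = m.groups()
    return label, float(n), (float(e) if e else 0.0)


TREE = """\
physician fee freeze = n:
| adoption of the budget resolution = y: democrat (151.0)
| adoption of the budget resolution = u: democrat (1.0)
| adoption of the budget resolution = n:
| | education spending = n: democrat (6.0)
| | education spending = y: democrat (9.0)
| | education spending = u: republican (1.0)
physician fee freeze = y:
| synfuels corporation cutback = n: republican (97.0/3.0)
| synfuels corporation cutback = u: republican (4.0)
| synfuels corporation cutback = y:
| | duty free exports = y: democrat (2.0)
| | duty free exports = u: republican (1.0)
| | duty free exports = n:
| | | education spending = n: democrat (5.0/2.0)
| | | education spending = y: republican (13.0/2.0)
| | | education spending = u: democrat (1.0)
physician fee freeze = u:
| water project cost sharing = n: democrat (0.0)
| water project cost sharing = y: democrat (4.0)
| water project cost sharing = u:
| | mx missile = n: republican (0.0)
| | mx missile = y: democrat (3.0/1.0)
| | mx missile = u: republican (2.0)
"""

leaves = [leaf for leaf in map(parse_leaf, TREE.splitlines()) if leaf]
total_n = sum(n for _, n, _ in leaves)
total_e = sum(e for _, _, e in leaves)
print(f"{len(leaves)} leaves, {total_n:.0f} cases, {total_e:.0f} errors "
      f"({total_e / total_n:.1%})")
```

On the unpruned tree the E values are whole counts of misclassified training cases, which is why they can simply be added up.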

9 c4.5 output(cont)
Simplified Decision Tree:
physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
| mx missile = n: democrat (3.0/1.1)
| mx missile = y: democrat (4.0/2.2)
| mx missile = u: republican (2.0/1.0)

10 c4.5 output(cont)
Evaluation on training data (300 items):
Before Pruning    After Pruning
Size Errors    Size Errors Estimate
25 8( 2.7%)    7 13( 4.3%) ( 6.9%) <<
Evaluation on test data (135 items):
Before Pruning    After Pruning
Size Errors    Size Errors Estimate
25 7( 5.2%)    7 4( 3.0%) ( 6.9%) <<
(a) (b) <-classified as
(a): class democrat
1 51 (b): class republican

11 Running the programs (cont)
C4.5rules: rule induction
Should only be used after running the decision tree program c4.5, since it reads the .unpruned file containing the unpruned tree.
“c4.5rules -f filestem [-u]”
Example: “c4.5rules -f ../Data/vote”

12 c4.5rules output
C4.5 [release 8] rule generator Fri Sep 12 12:07:
Options: File stem
Read 300 cases (16 attributes) from ../Data/vote
Processing tree 0
Final rules from tree 0:
Rule 2:
  physician fee freeze = n
  -> class democrat [98.4%]
Rule 9:
  synfuels corporation cutback = y
  duty free exports = y
  -> class democrat [97.5%]
…
Rule 13:
  physician fee freeze = u
  mx missile = u
  -> class republican [50.0%]
Default class: democrat

13 c4.5rules output(cont)
Evaluation on training data (300 items):
Rule Size Error Used Wrong Advantage
% (0.6%) -1 (0|1) democrat
% 3 0 (0.0%) 0 (0|0) democrat
% 3 0 (0.0%) 0 (0|0) democrat
% 97 3 (3.1%) 21 (23|2) republican
% 15 2 (13.3%) 11 (13|2) republican
% 2 0 (0.0%) 2 (2|0) republican
% 2 0 (0.0%) 2 (2|0) republican
Drop rule 2
Rule Size Error Used Wrong Advantage
% 54 0 (0.0%) 0 (0|0) democrat
% 3 0 (0.0%) 0 (0|0) democrat
% 97 3 (3.1%) 21 (23|2) republican
% 15 2 (13.3%) 11 (13|2) republican
% 3 0 (0.0%) 3 (3|0) republican
% 2 0 (0.0%) 2 (2|0) republican
Tested 300, errors 9 (3.0%) <<
(a) (b) <-classified as
(a): class democrat
(b): class republican
Evaluation on test data (135 items):
Rule Size Error Used Wrong Advantage
% 24 2 (8.3%) 0 (0|0) democrat
% 1 0 (0.0%) 0 (0|0) democrat
% 41 0 (0.0%) 6 (6|0) republican
% 8 3 (37.5%) 2 (5|3) republican
% 2 0 (0.0%) 2 (2|0) republican
Tested 135, errors 7 (5.2%) <<
(a) (b) <-classified as
(a): class democrat
4 48 (b): class republican

14 Confusion matrix & error rate
                  Predicted class
Actual class      A      B
A                 80     3
B                 4      48
Error rate of this classifier: (4+3)/(83+52) = 5.2%
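The error-rate arithmetic on this slide can be reproduced in a few lines of Python. This is a sketch for illustration; the function name `error_rate` is hypothetical, and the matrix entries are read from the slide as 80, 3 (actual A) and 4, 48 (actual B), which is consistent with the row totals 83 and 52 in the formula.

```python
def error_rate(matrix):
    """Fraction of cases off the diagonal of a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


matrix = [
    [80, 3],   # actual A: 80 predicted A, 3 predicted B
    [4, 48],   # actual B: 4 predicted A, 48 predicted B
]
rate = error_rate(matrix)
print(f"error rate = {rate:.1%}")   # (4 + 3) / (83 + 52)
```

The same function works for more than two classes, since it only relies on the diagonal holding the correctly classified cases.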

15 CBA
Classification Based on Association
–Download at
–Uses the same file types as c4.5, i.e., *.names, *.data, and *.test
–Refer to the help topics
–Discretization function: the discretization program is sometimes not compatible with some systems. If errors occur, try the DOS version of the discretizer under the CBA directory: “discretize”

16 Data Repository online UCI machine learning repository ory.html

