1 CS 490 Sample Project Mining the Mushroom Data Set Kirk Scott.


1 1 CS 490 Sample Project Mining the Mushroom Data Set Kirk Scott

2 2

3 Yellow Morels 3

4 Black Morels 4

5 This set of overheads begins with the contents of the project check-off sheet. After that, an example project is given. 5

6 CS 490 Data Mining Project Check-Off Sheet Student's name: _______ 1. Meets requirements for formatting. (No pts.) [ ] 2. Oral presentation given. (No pts.) [ ] 3. Attendance at Other Students' Presentations. Partial points for partial attendance. 20 pts.____ 6

7 I. Background Information on the Problem Domain and the Data Set 7

8 Name of Data Set: _______ I.A. Random Information Drawn from the Online Data Files Posted with the Data Set. 3 pts.___ I.B. Contents of the Data File. 3 pts.___ I.C. Summary of Background Information. 3 pts.___ I.D. Screen Shot of Open File. 3 pts.___ 8

9 II. Applications of Data Mining Algorithms to the Data Set 9

10 II. Case 1. This Needs to Be a Classification Algorithm Name of Algorithm: _______ i. Output Results. 3 pts.___ ii. Explanation of Item. 2 pts.___ iii. Graphical or Other Special Purpose Additional Output. 2 pts.___ 10

11 II. Case 2. This Needs to Be a Clustering Algorithm Name of Algorithm: _______ i. Output Results. 3 pts.___ ii. Explanation of Item. 2 pts.___ iii. Graphical or Other Special Purpose Additional Output. 2 pts.___ 11

12 II. Case 3. This Needs to Be an Association Mining Algorithm Name of Algorithm: _______ i. Output Results. 3 pts.___ ii. Explanation of Item. 2 pts.___ iii. Graphical or Other Special Purpose Additional Output. 2 pts.___ 12

13 II. Case 4. Any Kind of Algorithm Name of Algorithm: _______ i. Output Results. 3 pts.___ ii. Explanation of Item. 2 pts.___ iii. Graphical or Other Special Purpose Additional Output. 2 pts.___ 13

14 II. Case 5. Any Kind of Algorithm Name of Algorithm: _______ i. Output Results. 3 pts.___ ii. Explanation of Item. 2 pts.___ iii. Graphical or Other Special Purpose Additional Output. 2 pts.___ 14

15 II. Case 6. Any Kind of Algorithm Name of Algorithm: _______ i. Output Results. 3 pts.___ ii. Explanation of Item. 2 pts.___ iii. Graphical or Other Special Purpose Additional Output. 2 pts.___ 15

16 II. Case 7. Any Kind of Algorithm Name of Algorithm: _______ i. Output Results. 3 pts.___ ii. Explanation of Item. 2 pts.___ iii. Graphical or Other Special Purpose Additional Output. 2 pts.___ 16

17 II. Case 8. Any Kind of Algorithm Name of Algorithm: _______ i. Output Results. 3 pts.___ ii. Explanation of Item. 2 pts.___ iii. Graphical or Other Special Purpose Additional Output. 2 pts.___ 17

18 III. Choosing the Best Algorithm Among the Results 18

19 III.A. Random Babbling. 6 pts.___ III.B. An Application of the Paired t-test. 6 pts.___ Total out of 100 points possible: _____ 19

20 Example Project The point of this sample project is to illustrate what you should produce for your project. In addition to the content of the project, information given in italics provides instructions or commentary or background information. 20

21 Needless to say, your project should simply contain all of the necessary content. You don't have to provide italicized commentary. 21

22 I. Background Information on the Problem Domain and the Data Set If you are working with your own data set you will have to produce this documentation entirely yourself. If you are working with a downloaded data set, you can use whatever information comes with the data set. You may paraphrase that information, rearrange it, do anything to it to help make your presentation clear. 22

23 You don't have to follow academic practice and try to document or footnote what you did when presenting the information. The goal is simply adaptation for clear and complete presentation. What I'm trying to say is this: There will be no penalty for "plagiarism". 23

24 What I would like you to avoid is simply copying and pasting, leading to a mass of information that is not relevant or helpful to the reader (the teacher—who will be making the grades) in understanding what you were doing. Reorganize and edit as necessary in order to make it clear. 24

25 Finally, include a screen shot of the explorer view of the data set after you've opened the file containing it. Already here you have a choice of what exactly to show and you need to write some text explaining what the screen shot displays. 25

26 I.A. Random Information Drawn from the Online Data Files Posted with the Data Set This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp ). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. 26

27 The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for Poisonous Oak and Ivy. 27

28 Number of Instances: 8124 Number of Attributes: 22 (all nominally valued) Attribute Information: (classes: edible=e, poisonous=p) 28

29 1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s 2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s 3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y 29

30 4. bruises?: bruises=t, no=f 5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s 6. gill-attachment: attached=a, descending=d, free=f, notched=n 7. gill-spacing: close=c, crowded=w, distant=d 30

31 8. gill-size: broad=b, narrow=n 9. gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y 10. stalk-shape: enlarging=e, tapering=t 11. stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=? 31

32 12. stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s 13. stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s 14. stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y 32

33 15. stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y 16. veil-type: partial=p, universal=u 17. veil-color: brown=n, orange=o, white=w, yellow=y 18. ring-number: none=n, one=o, two=t 33

34 19. ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z 20. spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y 34

35 21. population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y 22. habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d 35

36 Missing Attribute Values: 2480 of them (denoted by "?"), all for attribute #11. Class Distribution: -- edible: 4208 (51.8%) -- poisonous: 3916 (48.2%) -- total: 8124 instances 36
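
As a quick check on the figures above, the percentages follow directly from the class counts. The sketch below also reports the accuracy of always guessing the majority class, which is a floor that any later classifier should beat. The variable names are illustrative only.

```python
# Class counts as quoted on the slide above.
counts = {"edible": 4208, "poisonous": 3916}
total = sum(counts.values())

# Share of each class as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Accuracy (in percent) of always predicting the most frequent class.
baseline_label = max(counts, key=counts.get)
baseline_accuracy = shares[baseline_label]

print(total, shares, baseline_label, baseline_accuracy)
```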

37 Logical rules for the mushroom data sets. This is information derived by researchers who have already worked with the data set. Logical rules given below seem to be the simplest possible for the mushroom dataset and therefore should be treated as benchmark results. 37

38 Disjunctive rules for poisonous mushrooms, from most general to most specific: P_1) odor=NOT(almond.OR.anise.OR.none) 120 poisonous cases missed, 98.52% accuracy P_2) spore-print-color=green 48 cases missed, 99.41% accuracy 38

39 P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND.(stalk-color-above-ring=NOT.brown) 8 cases missed, 99.90% accuracy P_4) habitat=leaves.AND.cap-color=white 100% accuracy Rule P_4) may also be P_4') population=clustered.AND.cap_color=white 39

40 These rules involve 6 attributes (out of 22). Rules for edible mushrooms are obtained as negation of the rules given above, for example the rule: odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green gives 48 errors, or 99.41% accuracy on the whole dataset. 40

41 Several slightly more complex variations on these rules exist, involving other attributes, such as gill_size, gill_spacing, stalk_surface_above_ring, but the rules given above are the simplest we have found. 41
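
The benchmark rules P_1 to P_4 can be read as a single predicate: a mushroom is called poisonous if any one of the four rules fires. The following is a minimal sketch of that reading, assuming each record is a dict keyed by attribute name and holding the one-letter codes from the attribute list. The dict layout and the function name are my own assumptions, not part of the data set. The two sample records reuse the relevant values from the first two rows of the snippet on the Contents of the Data File slide.

```python
def is_poisonous(m):
    """Apply the disjunctive rules P_1 to P_4 to one mushroom record.

    `m` maps attribute names to one-letter codes (assumed layout).
    """
    p1 = m["odor"] not in {"a", "l", "n"}             # not almond, anise, or none
    p2 = m["spore-print-color"] == "r"                # green spore print
    p3 = (m["odor"] == "n"
          and m["stalk-surface-below-ring"] == "y"    # scaly
          and m["stalk-color-above-ring"] != "n")     # not brown
    p4 = m["habitat"] == "l" and m["cap-color"] == "w"  # leaves and white cap
    return p1 or p2 or p3 or p4


# Only the six attributes the rules use are filled in.
first = {"odor": "p", "spore-print-color": "k",
         "stalk-surface-below-ring": "s", "stalk-color-above-ring": "w",
         "habitat": "u", "cap-color": "n"}
second = {"odor": "a", "spore-print-color": "n",
          "stalk-surface-below-ring": "s", "stalk-color-above-ring": "w",
          "habitat": "g", "cap-color": "y"}

print(is_poisonous(first), is_poisonous(second))
```

The pungent odor of the first record triggers P_1, while the almond-scented second record triggers none of the rules, which agrees with the class labels those two rows carry in the file.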

42 I.B. Contents of the Data File Here is a snippet of five records from the data file: p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g 42
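
To make the compact records easier to read, each one can be paired with the attribute names listed earlier. This is a minimal sketch, assuming that the class label is the first field and that the remaining 22 fields follow the order of the attribute list. That ordering is inferred from the records themselves and is not stated on the slide.

```python
from collections import Counter

# Class label first, then the 22 attributes in the order listed earlier.
COLUMNS = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

# The five records from the slide above.
LINES = [
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g",
    "e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m",
    "p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u",
    "e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g",
]

# Odor codes from the attribute list.
ODOR = {"a": "almond", "l": "anise", "c": "creosote", "y": "fishy",
        "f": "foul", "m": "musty", "n": "none", "p": "pungent", "s": "spicy"}

# Pair every value with its column name.
records = [dict(zip(COLUMNS, line.split(","))) for line in LINES]

class_counts = Counter(r["class"] for r in records)
odors = [ODOR[r["odor"]] for r in records]

print(class_counts, odors)
```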

43 Incidentally, the data file contents also exist in expanded form. Here is a record from that file: EDIBLE,CONVEX,SMOOTH,WHITE,BRUISES,ALMOND,FREE,CROWDED,NARROW,WHITE,TAPERING,BULBOUS,SMOOTH,SMOOTH,WHITE,WHITE,PARTIAL,WHITE,ONE,PENDANT,PURPLE,SEVERAL,WOODS 43

44 Section I.C should be written by you. You should summarize the information given above, which is largely copy and paste, in a brief, well-organized paragraph that you write yourself and which conveys the basics in a concise way. 44

45 The idea is that a reader who really doesn't want or need to know the details could go to this paragraph and find out everything they needed to know in order to keep reading the rest of your write-up and have some idea of what is going on. 45

46 I.C. Summary of Background Information The problem domain is the classification of mushrooms as either poisonous/inedible or non-poisonous/edible. There are 8124 instances in the data set consisting of 22 nominal attributes apiece. Roughly half of the instances are poisonous and half are non-poisonous. 46

47 There are 2480 cases of missing attribute values, all on the same attribute. As is to be expected with non-original data sets, this set has already been extensively studied. Other researchers have provided sets of rules they have derived which would serve as benchmarks when considering the results of the application of further data mining algorithms to the data set. 47

48 I.D. Screen Shot of Open File ***What this shows: The cap-shape attribute is chosen out of the list on the left. Its different values are given in the table in the upper right. In the lower right, the Edible attribute is selected from a (hidden) drop down list. 48

49 The graph shows the proportion of edible and inedible mushrooms among the instances containing different values of cap-shape. 49

50 50

51 II. Applications of Data Mining Algorithms to the Data Set The overall requirement is that you use the Weka explorer and run up to 8 different data mining algorithms on your data set. Here is a preview of what is involved: 51

52 i. You will get full credit for all 8 cases if among the 8 there is at least one each of classification, clustering, and association rule mining. In order to make it clear that this has been done, the first case should be a classification, the second case should be a clustering, and the third case should be an application of association rule mining. 52

53 The grading check-off sheet will reflect this requirement. All remaining cases can be of your choice, given in any order you want to. 53

54 ii. You will have to either copy a screen shot or copy certain information out of the Weka explorer interface and paste it into your report. The stuff you need to do this for in the different kinds of cases is simply illustrated. I won't try and list it all out here. 54

55 At every point, ask yourself this question: "Was it immediately apparent to me what I was looking at and what it meant?" If the answer to that question was no, you should include explanatory remarks with whatever you chose to show from Weka. 55

56 For consistency's sake in these cases you can label your remarks "***What this shows:". 56

57 iii. The most obvious kind of results that you would reproduce would be the percent correct and percent incorrect classification for a classification scheme, for example. In addition to this, the Weka output would include things like a confusion matrix, the Kappa statistic, and so on. 57

58 For each case that you examine, you will be expected to highlight one aspect of the output and to provide your own brief, written explanation of it. Note that this is an "educational" aspect of this project. 58

59 On the job, the expectation would be that you as a user knew what it all meant. Here, as a student, the goal is to show that you know what it all meant. 59

60 iv. Finally, there is an additional aspect of Weka that you should use and illustrate. I will not try to describe it in detail here. You will see examples in the content below. 60

61 In short, for the different algorithms, if you right click on the results, you will be given options to create graphs, charts, and various other kinds of output. For each case that you cover you should take one of these options. Again, there is an educational, as opposed to practical aspect to this. 61

62 For the purposes of this project, just cycle through the different options that are available to show that you are familiar with them. For each one, provide a sentence or two making it clear that you know what this additional output means. 62

63 II. Case 1. This Needs to Be a Classification Algorithm Name of Algorithm: J48 63

64 i. Output Results ***What this shows: This shows the classifier tree generated by the J48 algorithm. 64

65 65

66 ***What this shows: This gives the analysis of the output of the algorithm. The most notable thing that should jump out at you is that this is a "perfect" tree. The output shows 100% correct classification and no misclassification. 66

67 67

68 ii. Explanation of Item There is no need to repeat the screen shot. For this item I have chosen the confusion matrix. It is very easy to understand. It shows 0 false positives and 0 false negatives. 68

69 It is interesting to note that you need to know the values for the attributes in the data file to be sure which number represents TP and which represents TN. Referring back to the earlier screen shot, the same is true for the bars. What do the blue and red parts of the bars represent, edible or inedible? 69
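
The point about TP and TN can be made concrete with a few lines of code. This is a sketch only: it assumes that rows are actual classes and columns are predicted classes, and it fills the matrix with the counts implied by a perfect result on the class distribution quoted earlier. Which cell counts as TP depends entirely on which class is declared positive.

```python
# Rows are actual classes, columns are predicted classes (assumed layout).
labels = ["edible", "poisonous"]
matrix = [
    [4208, 0],   # actually edible
    [0, 3916],   # actually poisonous
]


def cells(matrix, labels, positive):
    """Return (tp, fn, fp, tn) for a 2x2 matrix, given the positive class."""
    i = labels.index(positive)
    j = 1 - i
    return matrix[i][i], matrix[i][j], matrix[j][i], matrix[j][j]


tp, fn, fp, tn = cells(matrix, labels, positive="edible")
accuracy = (tp + tn) / (tp + fn + fp + tn)

# Declaring the other class positive swaps the roles of TP and TN.
tp_alt, _, _, tn_alt = cells(matrix, labels, positive="poisonous")

print(tp, fn, fp, tn, accuracy, tp_alt, tn_alt)
```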

70 iii. Graphical or Other Special Purpose Additional Output ***What this shows: Going back to the previous screen shot, if you right click on the item highlighted in blue—the results of running J48 on the data set, you get several options. One of them is "Visualize tree". This screen shot shows the result of taking that option. 70

71 71

72 II. Case 2. This Needs to Be a Clustering Algorithm Name of Algorithm: SimpleKMeans 72

73 i. Output Results ***What this shows: This shows the results of the SimpleKMeans clustering algorithm with the edible/inedible attribute ignored. The results compare the clusters/classifications with the ignored attribute. The algorithm finds 2 clusters based on the remaining attributes. 73

74 74

75 ii. Explanation of Item At the bottom of the screen shot there is an item, "Incorrectly clustered instances". 37.6% of the clustered instances don't fall into the desired edible/inedible category. The algorithm finds 2 clusters, but these 2 clusters don't agree with the 2 classifications of the attribute that was ignored. 75
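
A figure like "incorrectly clustered instances" comes from cross-tabulating clusters against the ignored class attribute and counting the instances that fall outside the best cluster-to-class pairing. The sketch below illustrates that idea. The counts are invented for illustration and are not taken from the Weka run, and the function is my own reading of the measure rather than Weka's code.

```python
from itertools import permutations

# Hypothetical cross-tabulation: rows are clusters, columns are classes
# (edible, poisonous). These counts are invented for illustration.
table = [
    [70, 20],   # cluster 0
    [30, 80],   # cluster 1
]


def incorrectly_clustered(table):
    """Share of instances outside the best one-to-one cluster/class pairing.

    Every assignment of clusters to distinct classes is tried, so this only
    works when there are no more clusters than classes.
    """
    total = sum(sum(row) for row in table)
    n_clusters, n_classes = len(table), len(table[0])
    best = max(
        sum(table[c][k] for c, k in enumerate(assignment))
        for assignment in permutations(range(n_classes), n_clusters)
    )
    return (total - best) / total


error = incorrectly_clustered(table)
print(error)
```

With these made-up counts, pairing cluster 0 with edible and cluster 1 with poisonous leaves 50 of 200 instances mismatched, so the error is 25%.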

76 iii. Graphical or Other Special Purpose Additional Output ***What this shows: Going back to the previous screen shot, if you right click on the item highlighted in blue—the results of running SimpleKMeans on the data set, you get several options. One of them is "Visualize cluster assignments". 76

77 This screen shot shows the result of taking that option. Since it isn't possible to visualize the clusters in n-dimensional space, the screen provides the option of picking which individual attribute to visualize. 77

78 This screen shows the instances in order by number along the x-axis. The y-axis shows the cluster placements for the different values for the cap-shape attribute. The drop down box allows you to change what the axes represent. 78

79 79

80 II. Case 3. This Needs to Be an Association Mining Algorithm Name of Algorithm: Apriori 80

81 i. Output Results ***What this shows: This shows the results of the Apriori association rule mining algorithm. 81

82 82

83 ii. Explanation of Item Various relevant parameters are shown on the screen shot. The system defaults to a minimum support level of .95 and a minimum confidence level of .9. The system lists the 10 best rules found. The first 9 have confidence levels of 1. On the one hand, this is good. 83

84 From a practical point of view, what this tends to suggest is that the data are effectively redundant. Just to take the first rule for example, if you know the color of the veil, you know the type of the veil. The 10th rule provides an interesting reverse insight into this. It tells you that if you know the type, you only know the color with .98 confidence. 84
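
The asymmetry between a rule and its reverse is easy to reproduce on toy data. In the sketch below the 50 records are invented so that every veil is partial and all but one is white. They are not the mushroom data. The helper names `support` and `confidence` are my own.

```python
# Toy data, invented for illustration: every veil is partial (p),
# 49 veils are white (w), and one is yellow (y).
records = [{"veil-type": "p", "veil-color": "w"}] * 49
records += [{"veil-type": "p", "veil-color": "y"}]


def support(items, records):
    """Fraction of records that contain every attribute=value pair in items."""
    hits = sum(all(r[k] == v for k, v in items.items()) for r in records)
    return hits / len(records)


def confidence(lhs, rhs, records):
    """support(lhs and rhs) / support(lhs), i.e. how often rhs holds given lhs."""
    return support({**lhs, **rhs}, records) / support(lhs, records)


color_to_type = confidence({"veil-color": "w"}, {"veil-type": "p"}, records)
type_to_color = confidence({"veil-type": "p"}, {"veil-color": "w"}, records)

print(color_to_type, type_to_color)
```

Knowing the color pins down the type completely because only one type occurs, while knowing the type leaves a small doubt about the color.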

85 iii. Graphical or Other Special Purpose Additional Output There don't appear to be any other output options for association rules. There is no standard visualization for them so nothing is included for this point. 85

86 II. Case 4. Any Kind of Algorithm Name of Algorithm: ADTree 86

87 i. Output Results ***What this shows: These are the results of running the ADTree classification algorithm. I haven't bothered to scroll up and show the ASCII representation of the tree. Instead, I've just shown the critical output at the bottom. 87

88 88

89 ii. Explanation of Item There are two items I'd like to highlight: a. Notice that this tree generation algorithm didn't get 100% classified correctly. If I'm reading the data correctly, there were 8 false positives on the attribute of interest, which is named Edible. This is not good. 89

90 False negatives deprive you of a tasty gustatory and culinary experience. False positives deprive you of your health or your life. I point this out in contrast to the J48 results given above. 90

91 b. Notice that the time taken to build the model was .73 seconds. This is about 10 times slower than J48, but I'm mainly interested in comparing with the following algorithm. 91

92 iii. Graphical or Other Special Purpose Additional Output ***What this shows: This is the visualization of the tree. There are other graphical options, but they are difficult to interpret for the mushroom data set, so this is given for comparison with the J48 tree. 92

93 93

94 II. Case 5. Any Kind of Algorithm Name of Algorithm: BFTree 94

95 i. Output Results ***What this shows: This shows the results of using the BFTree classification algorithm. 95

96 96

97 ii. Explanation of Item This algorithm also doesn't give a tree that classifies with 100% accuracy. It gives the same kind of error as the ADTree, although there are 3 fewer. 97

98 The additional item I'd like to highlight is that the time taken to build the model was seconds. As a matter of fact, that information came out first and then additional, significant amounts of time were taken to run through each fold of the data. This was quite time consuming compared to the other trees produced so far. 98

99 iii. Graphical or Other Special Purpose Additional Output ***What this shows: This screen shot is the result of taking the "Visualize classifier errors" option on the results of the algorithm. I believe what this screen illustrates is a decision point in the tree on the cap-surface attribute. 99

100 In one of the cases, symbolized by the blue rectangle, an incorrect classification is made on this basis while 7 other instances classify correctly based on this attribute. 100

101 101

102 II. Case 6. Any Kind of Algorithm Name of Algorithm: Naïve Bayes 102

103 i. Output Results ***What this shows: This screen shot shows the bottom of the output for the Naïve Bayes classification algorithm. The upper part of the output shows conditional probability counts for all of the attributes in the data. 103

104 If the cost of an error weren't so high, this algorithm by itself would do OK. Its time cost is only .03 seconds and it achieves 95.8% correct classification. 104

105 105

106 ii. Explanation of Item I'm running out of items to highlight which are particularly meaningful for the example in question. Notice that the output includes the Mean absolute error, the Root mean squared error, the Relative absolute error, and the Root relative squared error. 106

107 These differ in magnitude because of the way they're calculated, but they are all indicators of the same general thing. As pointed out in the book, when comparing two different data mining approaches, if you compare the same measure for both, you will tend to have a valid comparison regardless of which of the measures you use. 107
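
To show how the four measures relate, here is a sketch that computes them using the usual textbook definitions. The relative measures divide by the error of a baseline that always predicts the mean of the actual values. Weka's own baseline may be computed differently, so treat this as an illustration of the definitions, not a reproduction of Weka's output. The predictions are invented.

```python
import math


def error_measures(predicted, actual):
    """MAE, RMSE, RAE and RRSE under the usual textbook definitions."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Baseline errors: always predict the mean of the actual values.
    abs_base = sum(abs(a - mean_actual) for a in actual)
    sq_base = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / abs_base,
        "RRSE": math.sqrt(sq_err / sq_base),
    }


# Invented example: 1 means edible, 0 means poisonous.
actual = [1, 0, 1, 0]
model_a = error_measures([0.9, 0.2, 0.6, 0.1], actual)
model_b = error_measures([0.7, 0.4, 0.5, 0.3], actual)

# In this example every measure ranks the two models the same way.
same_ranking = all(model_b[k] > model_a[k] for k in model_a)
print(model_a, model_b, same_ranking)
```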

108 iii. Graphical or Other Special Purpose Additional Output ***What this shows: Two graphical output screen shots are given below. They show a cost-benefit analysis. Such an analysis is more appropriate to something like direct mailing, but it is possible to illustrate something by changing one of the parameters in the display. 108

109 Both screen shots show a threshold curve and a cost-benefit curve where the button to minimize cost/benefit has been clicked. In the first screen shot the costs of FP and FN are equal, at 1. In the second, the cost of a false positive has been raised to 1,000. Notice how the shape of the curve changes. 109

110 Roughly speaking, I would interpret the second screenshot to mean that you have effectively no costs as long as you are correctly predicting TP, but your cost rises linearly with the increasing probability of FP predictions later in the data set. 110
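
The effect of raising the false positive cost can also be seen with plain arithmetic. The sketch below compares two hypothetical classifiers, one that rarely calls a poisonous mushroom edible and one that does so more often. The error counts are invented, and "edible" is taken as the positive class, so a false positive is a poisonous mushroom predicted edible.

```python
def total_cost(fp, fn, fp_cost=1.0, fn_cost=1.0):
    """Total misclassification cost for given error counts and unit costs."""
    return fp * fp_cost + fn * fn_cost


# Invented error counts for two hypothetical classifiers.
cautious = {"fp": 2, "fn": 300}   # rejects many edible mushrooms
bold = {"fp": 40, "fn": 10}       # accepts some poisonous ones

# Equal costs of 1, as in the first screen shot.
equal = {name: total_cost(**c)
         for name, c in (("cautious", cautious), ("bold", bold))}

# False positive cost raised to 1,000, as in the second screen shot.
skewed = {name: total_cost(**c, fp_cost=1000.0)
          for name, c in (("cautious", cautious), ("bold", bold))}

best_equal = min(equal, key=equal.get)
best_skewed = min(skewed, key=skewed.get)
print(equal, skewed, best_equal, best_skewed)
```

With equal costs the bold classifier looks better, but once a false positive costs 1,000 the cautious one wins by a wide margin.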

111 111

112 112

113 II. Case 7. Any Kind of Algorithm Name of Algorithm: BayesNet 113

114 i. Output Results ***What this shows: This screen shot shows the results of the BayesNet classification algorithm. 114

115 115

116 ii. Explanation of Item This is not a new item to explain, but it is an observation related to the values and results previously obtained. The association rule mining algorithm seemed to suggest that there were heavy dependencies among some of the attributes in the data set. BayesNet is supposed to take these into account, while Naïve Bayes does not. 116

117 However, when you compare the rate of correctly classified instances, here you get 96.2% vs. 95.8% for Naïve Bayes. It seems fair to ask what difference it really made to include the dependencies in the analysis. 117

118 iii. Graphical or Other Special Purpose Additional Output ***What this shows: This shows the result of taking the "Visualize cost curve" option on the results of the data mining. Honestly, I've about reached the limit of what I understand without further research. I present this here without further explanation. 118

119 This is one of the reasons I advertise this sample project write-up as an example of a B, rather than an A effort. Everything that has been asked for is included, but in this point, for example, the explanation isn't complete. It sure is pretty though… 119

120 120

121 II. Case 8. Any Kind of Algorithm Name of Algorithm: RIDOR 121

122 i. Output Results ***What this shows: This screen shows the results of applying the RIDOR algorithm to the data set. RIDOR was the technique based on rules and exceptions. Look at the top of the output. Here you see clearly that the default classification is edible, with exceptions listed underneath. 122

123 Philosophically, this goes against my point of view on mushrooms. The logical default should be inedible, but there are more edible mushrooms in the data set than inedible. So it goes. 123

124 124

125 ii. Explanation of Item The last set of items that appears in these output screens are the Precision, Recall, F-Measure, and ROC values. This is probably not the best example for illustrating what they mean. It's apparent that things like recall would be better suited to document retrieval for example. 125

126 Maybe the best illustration that they don't really apply is that they are all 1 or .999. On the other hand, maybe that's realistic for a classification scheme that gives 99.95% correct results. 126
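
Values of 1 or .999 are what the standard definitions give when there are only a handful of errors. The sketch below assumes 4 errors out of 8124, which is roughly what 99.95% correct implies. How those errors split between false positives and false negatives is not given on the slides, so the split used here is hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from the counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical split: all 4 errors are poisonous mushrooms called edible.
precision, recall, f_measure = precision_recall_f(tp=4208, fp=4, fn=0)

print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```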

127 iii. Graphical or Other Special Purpose Additional Output Once again, the fact that this is a "B" example rather than an "A" example comes into play. I'm not showing a new bit of graphical output. I'm showing the cost curve, like for the previous data mining algorithm. 127

128 The main reason for choosing to show it again is that this picture looks so much like the simple picture in the text that they used to illustrate some of the cost concepts graphically. 128

129 129

130 III. Choosing the Best Algorithm Among the Results Depending on the problem domain and your level of ambition, you might compare algorithms on the basis of lift charts, cost curves, and so on. For simple classification, the tools will give results showing the percent classified correctly and the percent classified incorrectly. 130

131 It would be natural to simply choose the one with the highest percent classified correctly. However, this is not good enough for credit on this item. I have chosen to illustrate what you need to do with a simple basic example. 131

132 I consider the two classification algorithms that gave the highest percent classified correctly. I then apply the paired t-test to see whether or not there is actually a statistically significant difference between them. 132

133 If there is, that's the correct basis for preferring one over the other. For the purposes of illustration, I do this by hand and explain what I'm doing. You may find tools that allow you to make a valid comparison of results. That's OK, as long as you explain. 133
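
For anyone who prefers a tool to hand calculation, the usual form of the paired t-test works on per-fold results: take the difference in accuracy on each fold, then divide the mean difference by its estimated standard error. The sketch below assumes that per-fold accuracies are available for both algorithms. The ten values for each are invented for illustration, and 3.25 is the standard two-tailed 1% critical value for 9 degrees of freedom.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """t statistic for paired samples, e.g. accuracies on the same folds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean_d = statistics.mean(diffs)
    var_d = statistics.variance(diffs)   # sample variance, divisor k - 1
    return mean_d / math.sqrt(var_d / k)


# Invented per-fold accuracies (percent) from 10-fold cross-validation.
algo_a = [96.1, 96.4, 95.9, 96.3, 96.0, 96.5, 96.2, 96.1, 96.3, 96.2]
algo_b = [95.7, 96.0, 95.8, 95.9, 95.6, 96.0, 95.9, 95.7, 95.8, 95.6]

t = paired_t(algo_a, algo_b)
critical = 3.25                          # 9 degrees of freedom, 1% two-tailed
significant = abs(t) > critical
print(round(t, 2), significant)
```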

134 The point simply is that it's not sufficient to just list a bunch of percents and pick the highest one. Illustrate the use of some advanced technique, whether involving concepts like lift charts or cost curves or statistics. You may also have noticed that Weka tells you the run time for doing an analysis. 134

135 When making a decision about which algorithm is the best, at a minimum take into account an advanced comparison of the two apparent best, and you may want to make an observation about the apparent complexity or time cost of the algorithms. 135

136 III.A. Random Babbling The concept of "Cost of classification" seems relevant to this example. It takes a human expert to tell if a mushroom is poisonous. If you're not an expert, you can tell by eating a mushroom and seeing what happens. 136

137 The cost of finding out that the mushroom is poisonous is about as high as it gets. I guess if you're truly dedicated, you'd be willing to die for science. Directly related to this is the cost of a misclassification. It seems to be on the infinite side… 137

138 The J48 tree approach, given first, even though it's apparently been pruned, still classifies 100% correctly. This seems to be at odds with claims made at various points that you don't want a perfect classifier because it will tend to be overtrained. 138

139 On the other hand, since the cost of a misclassification is so high, maybe it would be best to bias the training. Lots of false "It's poisonous" results would be desirable. I remember learning this rule from my parents: Don't eat any wild mushrooms. 139

140 It's also interesting to compare with the commentary provided at the beginning. "Experts" who have examined the data wanted to get a minimal rule set. They apparently considered that a success. But they were willing to live with errors. I'm not sure living with errors is consistent with this data set. 140

141 III.B. An Application of the Paired t-test Pick any two of your results above, identify them and the success rate values they gave, and compare them using the paired t-test. Give a statistically valid statement that tells whether or not the two cases you're comparing are significantly different. 141

142 What is shown is my attempt to interpret and apply what the book says about the paired t-test. I do not claim that I have necessarily done this correctly. Students who have recently taken statistics may reach different conclusions about how this is done. 142

143 However, I have gone through the motions. To get credit for this section, you should do the same, whether following my example or following your own understanding. 143

144 I have chosen to compare the percent of correct classifications by Naïve Bayes (NB) and BayesNet (BN) given above. 144

145 Taken from Weka results: NB sample mean = % NB root mean squared error = .1757 Squaring the value above: NB mean squared error =

146 Taken from Weka results: BN sample mean = % BN root mean squared error = .1639 Squaring the value above: BN mean squared error =

147 This is my estimate of the standard deviation of the t statistic where the divisor is 10 because I opted for the default 10-fold cross-validation in Weka: Estimate of paired root mean squared error (EPRMSE) = square root((NB mean squared error / 10) + (BN mean squared error / 10)) =

148 t statistic = (NB sample mean – BN sample mean) / EPRMSE =

149 The book says this is a two-tailed test. For a 99% confidence interval I want to use a threshold of .5%. The book's table gives a value of

150 The computed value, is greater than the table value of This means you reject the null hypothesis that the means of the two distributions are the same. 150

151 In other words, you conclude that there is a statistically significant difference between the percent of correct classifications resulting from the Naïve Bayes and the Bayesian Network algorithms on the mushroom data. 151
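
The arithmetic in this section can be retraced in a few lines. The sketch follows the steps literally, with the same caveat as the text: it is one attempt at the calculation, not a vetted statistical procedure. The sample means are the rounded 95.8% and 96.2% quoted earlier in the write-up, so the result may differ slightly from one computed with unrounded values. The critical value of 3.25 (9 degrees of freedom, .5% in each tail) is my assumption of what the table lookup gives.

```python
import math

# Percent correct, using the rounded figures quoted earlier in the write-up.
nb_mean, bn_mean = 95.8, 96.2

# Root mean squared errors as given in this section.
nb_rmse, bn_rmse = 0.1757, 0.1639

folds = 10                               # default 10-fold cross-validation

# Square the root mean squared errors to get the mean squared errors.
nb_mse = nb_rmse ** 2
bn_mse = bn_rmse ** 2

# Estimate of paired root mean squared error, as defined in this section.
eprmse = math.sqrt(nb_mse / folds + bn_mse / folds)

t = (nb_mean - bn_mean) / eprmse

# Assumed table value; the test is two-tailed, so compare magnitudes.
critical = 3.25
reject_null = abs(t) > critical

print(round(nb_mse, 5), round(bn_mse, 5), round(eprmse, 5),
      round(t, 2), reject_null)
```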

152 The End 152

