Published by Desmond Simkin. Modified over 3 years ago.

1
Final Project: Mining Mushroom World

2
Agenda
- Motivation and Background
- Determine the Data Set (2)
- 10 DM Methodology steps (19)
- Conclusion

3
Motivation and Background
- To distinguish edible mushrooms from poisonous ones by how they look
- To know whether we can eat a mushroom, in order to survive in the wild
- To survive outside the computer world

4
Determine the Data Set (1/2)
- Source of data: UCI Machine Learning Repository, Mushrooms Database, drawn from the Audubon Society Field Guide
- Documentation: complete, but missing statistical information
- Records are described in terms of physical characteristics
- Classification: poisonous or edible
- All attributes are nominal-valued
* Large database: 8124 instances (2480 missing values for attribute #11, stalk-root)

5
Determine the Data Set (2/2)
1. Past Usage
- Schlimmer, J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19).
- Iba, W., Wogulis, J., & Langley, P. (1988). ICML, 73-79.
2. No other mushroom data

6
10 DM Methodology steps
Step 1. Translate the Business Problem into a Data Mining Problem
a. Data Mining Goal: separate edible mushrooms from poisonous ones
b. How will the Results be Used: to increase the survival rate
c. How will the Results be Delivered: Decision Tree, Naïve Bayes, Ripper, NeuralNet

7
10 DM Methodology steps
Step 2. Select Appropriate Data
a. Data Source
- The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
- Jeff Schlimmer donated these data on April 27th, 1987
b. Volumes of Data
- Total: 8124 instances
- 4208 (51.8%) edible; 3916 (48.2%) poisonous
- 2480 (30.5%) missing in attribute "stalk-root"

8
10 DM Methodology steps
Step 2. Select Appropriate Data
c. How Many Variables: 22 attributes
- cap-shape, cap-color, odor, population, habitat, and so on
d. How Much History is Required: none, there is no seasonality
* As long as we can eat them when we see them

9
10 DM Methodology steps
Step 3. Get to Know the Data
a. Examine Distributions: use Weka to visualize all 22 attributes with histograms
b. Class: edible=e, poisonous=p

10
Step 3. Get to Know the Data
1. Examine Distributions: there are 2 types of histograms
- First type: every possible value of the attribute appears in the data
- (Attribute 21) population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y

11
Step 3. Get to Know the Data
1. Examine Distributions: there are 2 types of histograms
- Second type: only some of the possible values appear in the data
- (Attribute 7) gill-spacing: close=c, crowded=w, distant=d
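The two histogram types can be told apart by counting how often each declared value occurs. The following is a minimal sketch, not the Weka workflow used in the slides; the function name `value_histogram` and the five-record sample are made up for illustration.

```python
from collections import Counter

def value_histogram(values, declared):
    """Count each declared value, reporting 0 for values that never occur."""
    counts = Counter(values)
    return {v: counts.get(v, 0) for v in declared}

# Declared codes for gill-spacing: close=c, crowded=w, distant=d
declared = ["c", "w", "d"]
# Hypothetical sample column, not taken from the real data set
sample = ["c", "c", "w", "c", "w"]

hist = value_histogram(sample, declared)
unused = [v for v, n in hist.items() if n == 0]
print(hist)    # {'c': 3, 'w': 2, 'd': 0}
print(unused)  # ['d'] -> "second type": only some values appear
```

An attribute with an empty `unused` list belongs to the first type; any other attribute belongs to the second.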

12
Step 3. Get to Know the Data
1. Examine Distributions: there are exceptions
- Exception 1: missing values in the attribute
- (Attribute 11) stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
- 2480 of the 8124 instances have a missing value for this attribute

13
Step 3. Get to Know the Data
1. Examine Distributions: there are exceptions
- Exception 2: an attribute that cannot distinguish the classes
- (Attribute 16) veil-type: partial=p, universal=u
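One way to spot such an attribute automatically is to look for columns that take a single value in every record. This sketch assumes that is what "undistinguishable" means here; the slides do not spell it out. The helper `constant_attributes` and the three sample rows are hypothetical.

```python
def constant_attributes(rows, names):
    """Return the names of attributes that have the same value in every row."""
    constant = []
    for i, name in enumerate(names):
        distinct = {row[i] for row in rows}
        if len(distinct) <= 1:
            constant.append(name)
    return constant

# Hypothetical mini-sample: (gill-spacing, veil-type)
names = ["gill-spacing", "veil-type"]
rows = [("c", "p"), ("w", "p"), ("c", "p")]

print(constant_attributes(rows, names))  # ['veil-type']
```

A constant column carries no information about the class, so it can be dropped before modelling.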

14
Step 3. Get to Know the Data
2. Compare Values with Descriptions
- No unexpected values except for the missing values

15
10 DM Methodology steps
Step 4. Create a Model Set
- Creating a Balanced Sample: 75% (6093) as training data, 25% (2031) as test data
- Rapid Miner's "cross-validation" function: k-1 folds as training data, 1 fold as test data
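Both ways of building the model set can be sketched with index lists. This is an illustration only, not Rapid Miner's implementation; `holdout_split`, `k_fold_indices`, the seed, and the default k=10 are assumptions.

```python
import random

def holdout_split(n_rows, train_fraction=0.75, seed=0):
    """Shuffle row indices and cut them into a training and a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) index lists: k-1 folds for training, 1 for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train, test = holdout_split(8124)
print(len(train), len(test))  # 6093 2031
```

With 8124 instances the 75/25 cut reproduces the 6093/2031 sizes on the slide. In the k-fold variant every instance is used for testing exactly once.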

16
10 DM Methodology steps
Step 5. Fix Problems with the Data
- Dealing with Missing Values: the attribute "stalk-root" has 2480 missing values
- Replace all missing values with the "average" stalk-root value; for a nominal attribute this is the most frequent value (the mode)
- We replaced '?' with that value, 'b'
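For a nominal attribute, replacing missing values by the "average" amounts to mode imputation. A minimal sketch follows, assuming '?' marks a missing value as in the data set; the function `impute_mode` and the seven sample values are made up.

```python
from collections import Counter

def impute_mode(values, missing="?"):
    """Replace the missing marker with the most frequent observed value."""
    observed = [v for v in values if v != missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in values], mode

# Hypothetical stalk-root column
stalk_root = ["b", "?", "e", "b", "c", "?", "b"]
filled, mode = impute_mode(stalk_root)
print(mode)    # b
print(filled)  # ['b', 'b', 'e', 'b', 'c', 'b', 'b']
```

Because about 30% of the instances are affected, this imputation noticeably inflates the share of 'b', which is worth keeping in mind when reading the model results.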

17
10 DM Methodology steps
Step 6. Transform Data to Bring Information to the Surface
- All attributes are nominal, so no numerical analysis is done in this step

18
10 DM Methodology steps
Step 7. Build Model
1. Decision Tree Performance
- Accuracy: 99.11%
- Lift: 189.81%

              True p    True e    Class precision
Pred. p          961         0    100%
Pred. e           18      1052    98.32%
Class recall  98.16%   100.00%
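The accuracy, precision, and recall figures follow directly from the four counts in the confusion matrix. The sketch below recomputes them for the Decision Tree; the function name `confusion_metrics` and its argument names are illustrative, and the counts are the ones reported on the slide.

```python
def confusion_metrics(pp_tp, pp_te, pe_tp, pe_te):
    """Metrics for a 2x2 matrix; pX_tY = count predicted X with true class Y."""
    total = pp_tp + pp_te + pe_tp + pe_te
    return {
        "accuracy": (pp_tp + pe_te) / total,
        "precision_p": pp_tp / (pp_tp + pp_te),
        "precision_e": pe_te / (pe_tp + pe_te),
        "recall_p": pp_tp / (pp_tp + pe_tp),
        "recall_e": pe_te / (pp_te + pe_te),
    }

# Decision Tree counts from the slide
m = confusion_metrics(pp_tp=961, pp_te=0, pe_tp=18, pe_te=1052)
for name, value in m.items():
    print(f"{name}: {value * 100:.2f}%")
# accuracy: 99.11%
# precision_p: 100.00%
# precision_e: 98.32%
# recall_p: 98.16%
# recall_e: 100.00%
```

The 18 errors are all poisonous mushrooms predicted as edible, which is the costly kind of mistake for this problem.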

19
10 DM Methodology steps
Step 7. Build Model
2. Naïve Bayes Performance
- Accuracy: 95.77%
- Lift: 179.79%

              True p    True e    Class precision
Pred. p          902         9    99.01%
Pred. e           77      1043    93.12%
Class recall  92.13%    99.14%

20
10 DM Methodology steps
Step 7. Build Model
3. Ripper Performance
- Accuracy: 100%
- Lift: 193.06%

              True p    True e    Class precision
Pred. p          979         0    100.00%
Pred. e            0      1052    100.00%
Class recall 100.00%   100.00%

21
10 DM Methodology steps
Step 7. Build Model
4. NeuralNet Performance
- Accuracy: 91.04%
- Lift: 179.35%

              True p    True e    Class precision
Pred. p          907       110    89.18%
Pred. e           72       942    92.90%
Class recall  92.65%    89.54%

22
10 DM Methodology steps
Step 8. Assess Models
- Accuracy: Ripper and Decision Tree perform better than Naïve Bayes and NeuralNet

23
10 DM Methodology steps
Step 8. Assess Models
- Lift (used to compare the performance of different classification models): Ripper and Decision Tree have higher lifts
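The slides do not state how lift was calculated. The reported values are consistent with dividing the precision for the edible class by the share of edible mushrooms in the test set, so the sketch below assumes that definition. The function `lift_edible` is illustrative; the counts come from the four confusion matrices in Step 7.

```python
# (pred p/true p, pred p/true e, pred e/true p, pred e/true e) per model
matrices = {
    "Decision Tree": (961, 0, 18, 1052),
    "Naive Bayes": (902, 9, 77, 1043),
    "Ripper": (979, 0, 0, 1052),
    "NeuralNet": (907, 110, 72, 942),
}

def lift_edible(pp_tp, pp_te, pe_tp, pe_te):
    """Assumed definition: precision for class e divided by the base rate of e."""
    total = pp_tp + pp_te + pe_tp + pe_te
    precision_e = pe_te / (pe_tp + pe_te)
    base_rate_e = (pp_te + pe_te) / total
    return precision_e / base_rate_e

lifts = {name: lift_edible(*counts) for name, counts in matrices.items()}
ranking = sorted(lifts, key=lifts.get, reverse=True)
for name in ranking:
    print(f"{name}: {lifts[name] * 100:.2f}%")
# Ripper: 193.06%
# Decision Tree: 189.81%
# Naive Bayes: 179.79%
# NeuralNet: 179.35%
```

Under this definition the maximum possible lift is 2031/1052, about 193.06%, which Ripper reaches because every mushroom it predicts as edible is edible.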

24
10 DM Methodology steps
Step 9. Deploy Models
- We haven't gone out and found real mushrooms yet
Step 10. Assess Results
Conclusion and questions
- Ripper and Decision Tree may be the better models for nominal data
- Open question: how does Rapid Miner separate training data from test data?
