Data Mining CSCI 307, Spring 2019 Lecture 8

1 Data Mining CSCI 307, Spring 2019 Lecture 8
WEKA

2 Classification
The predicted target must be categorical/nominal.
Implemented methods:
    Decision trees (J48, etc.)
    Rules (ZeroR, OneR, etc.)
    Naive Bayes
Evaluation methods:
    Test data set
    Cross-validation

3 Classification Algorithms
ZeroR: Ignores all attributes and relies only on the target class; it always predicts the majority value.
OneR: Makes one rule for each attribute (based on the frequency of outcomes for each value of that attribute), then chooses the rule/attribute that gives the smallest error.
Naive Bayes: A probabilistic classifier based on Bayes' theorem; it assumes all attributes are independent given the class.
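The same three algorithms can also be run outside the Explorer through WEKA's Java API. Below is a minimal sketch (not from the slides) that trains each one on a loaded dataset and prints the model it learns; the file name breast_cancer.arff is an assumption, and any ARFF file with a nominal class will do.

import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SimpleClassifiers {
    public static void main(String[] args) throws Exception {
        // Load the data and mark the last attribute as the class to predict.
        Instances data = new DataSource("breast_cancer.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new ZeroR(), new OneR(), new NaiveBayes() };
        for (Classifier model : models) {
            model.buildClassifier(data);   // learn from all instances
            System.out.println("=== " + model.getClass().getSimpleName() + " ===");
            System.out.println(model);     // majority class / single rule / probability tables
        }
    }
}

Printing each model makes the differences easy to see: ZeroR shows only the majority class, OneR shows its single chosen rule, and NaiveBayes shows its per-class probability tables.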

4 Evaluation Methods
Test data set:
    Train on all data and test on all data (not recommended).
    Split the data (e.g. 66% for training, 34% for testing).
    Use separate files, one with training instances, one with testing instances.
Cross-validation:
    Divide the data set into groups (e.g. 10 groups of instances).
    Choose one group for testing and use the rest for training.
    Repeat multiple times with a different group for testing each time (e.g. repeat 10 times, using one of the 10 original groups for testing each time and the rest for training).
    Average the results of all the test runs.
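As a companion to the list above, here is a minimal sketch (not from the slides) of 10-fold cross-validation using WEKA's Evaluation class; crossValidateModel does the splitting, repeated training/testing, and averaging internally. The file name and random seed are assumptions.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast_cancer.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Train and test 10 times, holding out a different tenth of the data
        // each time; the Evaluation object accumulates and averages the results.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString("=== 10-fold cross-validation ===", false));
    }
}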

5 WEKA Data Formats
Data can be imported from a file in various formats:
ARFF (Attribute-Relation File Format) has two sections:
    the Header section defines attribute names, types, and relations
    the Data section lists the data records (instances)
CSV: comma-separated values (text file)
C4.5: a format used by a decision tree induction algorithm; it requires two separate files:
    Names file: defines the names of the attributes
    Data file: lists the records (samples)
Binary
Data can also be read from a URL or from an SQL database (using JDBC; Java DataBase Connectivity is a Java API that defines how a client may access a database).
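A minimal sketch (not from the slides) of loading data in code: ConverterUtils.DataSource picks a converter based on the file extension, so the same call handles ARFF and CSV. The file names are assumptions.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        Instances fromArff = new DataSource("breast_cancer.arff").getDataSet();
        Instances fromCsv  = new DataSource("breast_cancer.csv").getDataSet();

        // The class attribute is not set automatically; by convention it is the last one.
        fromArff.setClassIndex(fromArff.numAttributes() - 1);
        fromCsv.setClassIndex(fromCsv.numAttributes() - 1);

        System.out.println(fromArff.numInstances() + " instances loaded from ARFF");
        System.out.println(fromCsv.numInstances() + " instances loaded from CSV");
    }
}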

6 Attribute-Relation File Format (ARFF)
ARFF files consist of two distinct sections:
The Header section defines attribute names, types, and relations; each declaration starts with a keyword:
    @relation <data-name>
    @attribute <attribute-name> <type> or {range of nominal values}
The Data section lists the data records; it starts with @data, followed by the list of data instances.
Comments: any line starting with %
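As a rough illustration of how the header and data sections map onto WEKA's API, here is a minimal sketch (not from the slides) that builds a tiny dataset in memory and writes it out with ArffSaver; the relation name, attribute names, values, and output file name are made up for the example.

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.converters.ArffSaver;

public class WriteArff {
    public static void main(String[] args) throws Exception {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("outlook", Arrays.asList("sunny", "overcast", "rainy"))); // nominal
        atts.add(new Attribute("temperature"));                                          // numeric
        atts.add(new Attribute("play", Arrays.asList("yes", "no")));                     // nominal class

        Instances data = new Instances("weather", atts, 0);          // becomes "@relation weather"
        data.add(new DenseInstance(1.0, new double[] { 0, 85, 1 })); // one @data record: sunny,85,no

        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("weather_sketch.arff"));              // writes header + data sections
        saver.writeBatch();
    }
}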

7 Breast Cancer data in ARFF
% Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence-events: 85)
% Part 1: Definitions of attribute names, types and relations
@relation breast-cancer
@attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
@attribute menopause {'lt40','ge40','premeno'}
@attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59'}
@attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-32','33-35','36-39'}
@attribute node-caps {'yes','no'}
@attribute deg-malig {'1','2','3'}
@attribute breast {'left','right'}
@attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
@attribute irradiat {'yes','no'}
@attribute Class {'no-recurrence-events','recurrence-events'}
% Part 2: Data Section
@data
'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
……
% source:
% NOTE: the single quotes are optional here; ARFF only requires quoting values that contain spaces or other special characters.

8 Interpreting the Output: The Confusion Matrix
The confusion matrix shows how many instances of each class value were assigned to each predicted class.

The Confusion Matrix:
   a   b   <-- classified as
  56   8 |  a = no-recurrence-events
  23  10 |  b = recurrence-events

56 no-recurrence events (a) were classified correctly as (a)
 8 no-recurrence events (a) were incorrectly classified as (b)
23 recurrence events (b) were incorrectly classified as (a)
10 recurrence events (b) were correctly classified as (b)
Items on the main diagonal are correct classifications.
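The same matrix is available programmatically: in the minimal sketch below (not from the slides), Evaluation.toMatrixString() prints it and confusionMatrix() returns it as a double[][] with actual classes as rows and predicted classes as columns. The file name, classifier, and use of cross-validation here are assumptions.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ConfusionMatrixDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast_cancer.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toMatrixString("=== Confusion Matrix ==="));
        double[][] cm = eval.confusionMatrix();          // rows = actual class, columns = predicted class
        double correct = cm[0][0] + cm[1][1];            // main diagonal = correct classifications
        System.out.printf("accuracy: %.1f%%%n", 100.0 * correct / data.numInstances());
    }
}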

9 Interpreting the Output
Text representation of a tree:

J48 pruned tree

node-caps = yes
|   deg-malig = 1: recurrence-events (1.01/0.4)
|   deg-malig = 2: no-recurrence-events (26.2/8.0)
|   deg-malig = 3: recurrence-events (30.4/7.4)
node-caps = no: no-recurrence-events (228.39/53.4)

Number of Leaves : 4
Size of the tree :

10 WEKA Explorer
Click Explorer on the WEKA GUI Chooser.
In the Explorer window, click "Open file..." to open a data file, e.g. the Breast Cancer data: breast_cancer.arff.
If you don't have this data set, use one from the data folder provided with the WEKA package, e.g. iris.arff or weather_nominal.arff.

11 WEKA Explorer: Open Data File
Open the Breast Cancer data. Click an attribute, e.g. age, and its distribution will be displayed as a histogram.

12 WEKA Explorer: Classifiers
After loading a data file, click the Classify tab.
Choose a classifier: under Classifier, click the Choose button; in the drop-down menu, open the trees folder and select J48, a decision tree algorithm.
Choose a test option: select the Percentage split radio button and use the default ratio of 66% for training and 34% for testing.
Click the Start button to train and test the classifier. The training and testing information will be displayed in the Classifier output window.
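For reference, here is a minimal sketch (not from the slides) of the same J48 percentage-split run done through the Java API instead of the Explorer; the 66%/34% ratio mirrors the Explorer default, while the file name and random seed are assumptions.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplit {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast_cancer.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));                     // shuffle before splitting

        int trainSize = (int) Math.round(data.numInstances() * 0.66);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();                              // default options (-C 0.25 -M 2)
        tree.buildClassifier(train);
        System.out.println(tree);                          // text form of the pruned tree

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toSummaryString());
    }
}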

13 WEKA Explorer: Results
97 cases were used in the test.
Correct: 66 (68%)
Wrong: 31 (32%)

14 Result and Model Options
Point to the Result list window and right-click (or Option-click). A menu will display the options available for the model.

15 Choose Visualize Tree

16 View Classifier Errors
[Scatter plot of classifier output: correctly predicted cases vs. wrongly predicted cases]

17 Save the Model and Results
Right-click (or Option-click) on the result in the Result list. Choose Save model to save the classifier and Save result buffer to save the results.
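The same save/load step can be done in code. Below is a minimal sketch (not from the slides) using weka.core.SerializationHelper; the file names are assumptions.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class SaveAndLoadModel {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("breast_cancer.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);
        SerializationHelper.write("j48_breast_cancer.model", tree);               // save the trained classifier

        J48 restored = (J48) SerializationHelper.read("j48_breast_cancer.model"); // load it back later
        System.out.println(restored);                                             // same pruned tree as before
    }
}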

18 Summary
WEKA is open-source data mining software that offers:
GUI interfaces: Explorer, Experimenter, Knowledge Flow
Functions and tools:
    Methods for classification: decision trees, rule learners, naive Bayes, etc.
    Methods for regression/prediction: linear regression, model tree generators, etc.
    Methods for clustering
    Methods for feature selection
    and more...

