Weka Just do it Free and Open Source ML Suite Ian Witten & Eibe Frank University of Waikato New Zealand.

1 Weka Just do it Free and Open Source ML Suite Ian Witten & Eibe Frank University of Waikato New Zealand

2 Overview
Classifiers, regressors, and clusterers
Multiple evaluation schemes
Bagging and boosting
Feature selection:
–Right features and data are key to successful learning
Experimenter
Visualizer
Text not up to date. They welcome additions.

3 Learning Tasks
Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples.
Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples.
Clustering: given a set of examples, partition the examples into “interesting” groups. What scientists want.

4 Data Format: IRIS
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Etc.
General form: @ATTRIBUTE attribute-name followed by REAL or a {list of values}
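To make the file layout concrete, here is a minimal sketch of a reader for the subset of the format shown on this slide. It is not Weka's own loader: the function name `parse_arff` is hypothetical, and the sketch assumes only REAL and nominal attributes, no quoting, and no missing values.

```python
def parse_arff(text):
    """Parse a tiny ARFF subset: @RELATION, @ATTRIBUTE (REAL or nominal), @DATA."""
    relation = None
    attributes = []  # (name, kind); kind is "REAL" or a list of nominal values
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # Convert REAL columns to float, keep nominal values as strings.
            rows.append([float(v) if kind == "REAL" else v
                         for (_, kind), v in zip(attributes, values)])
            continue
        keyword = line.split(None, 1)[0].upper()
        if keyword == "@RELATION":
            relation = line.split(None, 1)[1].strip()
        elif keyword == "@ATTRIBUTE":
            _, name, kind = line.split(None, 2)
            kind = kind.strip()
            if kind.startswith("{"):
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.upper()
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows


IRIS_SAMPLE = """@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(IRIS_SAMPLE)
```

Each row comes back as four floats followed by the class label, in the order the attributes were declared.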

5 J48 = Decision Tree
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Numbers in parentheses: (# under node / # wrong), e.g. (50.0) = 50 instances under the node, (48.0/1.0) = 48 under the node, 1 of them wrong.
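Read as a procedure, the tree above is just nested threshold tests. The sketch below hand-transcribes it into a function. The name `j48_iris` is hypothetical, and this is a transcription of the printed tree, not code generated by Weka.

```python
def j48_iris(petallength, petalwidth):
    """Classify one iris using the thresholds printed in the tree above."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petalwidth > 1.7
    return "Iris-virginica"
```

Note that the tree never tests sepallength or sepalwidth, so the function does not take them as arguments.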

6 Cross-validation
Correctly Classified Instances 143 95.33 %
Incorrectly Classified Instances 7 4.67 %
Default is 10-fold cross-validation, i.e.
–Split the data into 10 equal-sized pieces
–Train on 9 pieces and test on the remainder
–Repeat for each of the 10 choices of test piece and average
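The three steps above can be sketched in a few lines. This is an illustrative sketch, not Weka's implementation: the names `k_fold_indices` and `cross_validate` are hypothetical, the split is a plain shuffle with no stratification by class, and `train` is assumed to be any function that takes examples and labels and returns a callable model.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(examples, labels, train, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, average the accuracies."""
    folds = k_fold_indices(len(examples), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(examples) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train(train_x, train_y)  # model is a callable: example -> label
        correct = sum(model(examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

With 150 instances and k = 10, each fold holds 15 instances, so every instance is used for testing exactly once.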

7 J48 Confusion Matrix
Old data set from statistics: 50 of each class
  a  b  c   <-- classified as
 49  1  0 | a = Iris-setosa
  0 47  3 | b = Iris-versicolor
  0  3 47 | c = Iris-virginica

8 Precision, Recall, and Accuracy
Precision: probability of being correct given your decision.
–Precision for Iris-setosa is 49/49 = 100%
–Called positive predictive value in the medical literature (specificity is a different measure, the true-negative rate)
Recall: probability of correctly identifying a class.
–Recall for Iris-setosa is 49/50 = 98%
–Called sensitivity in the medical literature
Accuracy: # right / total = 143/150 ≈ 95.3%
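The arithmetic on this slide can be checked directly from the J48 confusion matrix (rows are the actual class, columns are the predicted class). The sketch below uses hypothetical helper names and hard-codes the matrix from the previous slide.

```python
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows: actual class. Columns: class assigned by J48.
CONFUSION = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix, c):
    """Of the instances predicted as class c, the fraction that really are c."""
    predicted_as_c = sum(row[c] for row in matrix)
    return matrix[c][c] / predicted_as_c


def recall(matrix, c):
    """Of the instances that really are class c, the fraction predicted as c."""
    return matrix[c][c] / sum(matrix[c])


def accuracy(matrix):
    """Diagonal (correct) count divided by the total number of instances."""
    right = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return right / total


setosa = LABELS.index("Iris-setosa")
print(precision(CONFUSION, setosa))  # 49/49
print(recall(CONFUSION, setosa))     # 49/50
print(accuracy(CONFUSION))           # 143/150
```

The same helpers give the per-class figures for the other two classes, for example a precision of 47/51 for Iris-versicolor.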

9 Other Evaluation Schemes
Leave-one-out cross-validation
–n-fold cross-validation where n = number of training instances
Specific train and test set
–Allows for exact replication
–OK if the train and test sets are large, e.g. in the 10,000 range.
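Both schemes come down to choosing index sets. The sketch below is illustrative only: `leave_one_out` and `fixed_split` are hypothetical names, and the fixed seed in `fixed_split` is one assumed way of making a train/test split exactly repeatable.

```python
import random


def leave_one_out(n):
    """Yield n (train, test) index pairs; each test set is a single instance."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def fixed_split(n, test_fraction=0.3, seed=42):
    """Return (train, test) index lists; the same seed always gives the same split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_fraction)
    return idx[n_test:], idx[:n_test]
```

Leave-one-out needs n training runs, which is why a single fixed split becomes attractive once the data set is large.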

10 Bootstrap sampling
Randomly select n instances with replacement from the n available.
Expect about 2/3 of the instances to be chosen for training
–Prob. of an instance not being chosen = (1-1/n)^n ≈ 1/e ≈ 0.37
Test on the remainder.
Repeat about 30 times and average.
Avoids partition bias.
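A short sketch of one bootstrap round, together with the probability from the slide. The function names are hypothetical, and the sampling uses Python's standard `random` module, not anything from Weka.

```python
import math
import random


def prob_not_chosen(n):
    """Probability that a given instance is missed by all n draws."""
    return (1 - 1 / n) ** n


def bootstrap_split(n, rng):
    """Draw n training indices with replacement; test on the indices never drawn."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


rng = random.Random(0)
train, test = bootstrap_split(150, rng)
print(len(set(train)) / 150)   # fraction of distinct instances used for training
print(prob_not_chosen(150))    # close to 1/e
```

Repeating `bootstrap_split` about 30 times and averaging the test results gives the estimate described on the slide.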

