Weka Just do it Free and Open Source ML Suite Ian Witten & Eibe Frank University of Waikato New Zealand
Overview
Classifiers, regressors, and clusterers
Multiple evaluation schemes
Bagging and boosting
Feature selection
–The right features and data are key to successful learning
Experimenter
Visualizer
The text is not up to date; the authors welcome additions.
Learning Tasks
Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples.
Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples.
Clustering: given a set of examples, partition them into “interesting” groups.
–What scientists want.
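Data Format
Attributes (in column order): sepallength, sepalwidth, petallength, petalwidth, class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Etc.
In general, each attribute is declared as attribute-name followed by REAL or a list of values.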
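As a rough illustration of the row layout above, the following sketch splits each comma-separated row into numeric features and a class label. The names `SAMPLE` and `parse_rows` are made up for this example and are not part of Weka; the sketch handles only the data rows, not attribute declarations.

```python
import csv
import io

# The three example rows from the slide (illustrative input only).
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""


def parse_rows(text):
    """Split each row into (list of float features, class label).

    Assumes every attribute except the last is numeric (REAL) and the
    last field is the class, as in the iris rows above.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        *values, label = fields
        rows.append(([float(v) for v in values], label))
    return rows


rows = parse_rows(SAMPLE)
print(len(rows), rows[0])
```

Keeping features and label separate makes it easy to hand the rows to any learner that expects (features, label) pairs.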
J48 = Decision Tree
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Leaf counts: the first number is the number of instances under the node; the number after the slash is the number wrong.
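One way to read the tree above is as nested if/else tests. The sketch below is a hand translation of the printed tree into Python; the function name `classify_iris` is hypothetical, and this is not code that Weka generates.

```python
def classify_iris(petallength: float, petalwidth: float) -> str:
    """Follow the J48 tree from the slide, top to bottom."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    # petalwidth > 0.6
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petalwidth > 1.7
    return "Iris-virginica"


# First example row from the data format slide: petallength 1.4, petalwidth 0.2.
print(classify_iris(petallength=1.4, petalwidth=0.2))  # prints Iris-setosa
```

Note that only two of the four attributes appear in the tree, so the other two are not needed to classify an example.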
Cross-validation
Correctly Classified Instances %
Incorrectly Classified Instances %
Default: 10-fold cross-validation, i.e.
–Split the data into 10 equal-sized pieces
–Train on 9 pieces and test on the remainder
–Repeat for all 10 choices of test piece and average
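The split, train, test, and average steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Weka's implementation (which may differ, for example by stratifying folds). All names here (`k_fold_indices`, `cross_validate`, `train_majority`, `fold_accuracy`) are hypothetical, the learner is a toy majority-class baseline, and the data is synthetic. The sketch assumes k is no larger than the number of examples.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(data, train_fn, accuracy_fn, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, average the k scores."""
    scores = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(accuracy_fn(train_fn(train), test))
    return sum(scores) / len(scores)


def train_majority(train):
    """Toy learner: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def fold_accuracy(model, test):
    """Fraction of test examples whose label equals the fixed prediction."""
    return sum(1 for _, label in test if label == model) / len(test)


# Synthetic data: 150 examples, 100 labelled "a" and 50 labelled "b".
data = [((i,), "a" if i % 3 else "b") for i in range(150)]
mean_accuracy = cross_validate(data, train_majority, fold_accuracy, k=10)
print(round(mean_accuracy, 3))
```

Setting k equal to the number of examples turns the same loop into leave-one-out cross-validation.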
J48 Confusion Matrix
Old data set from statistics: 50 of each class
a b c <-- classified as
| a = Iris-setosa
| b = Iris-versicolor
| c = Iris-virginica
Precision, Recall, and Accuracy
Precision: probability of being correct given your decision.
–Precision for Iris-setosa is 49/49 = 100%
–Called positive predictive value in the medical literature (specificity is the true-negative rate, a different quantity)
Recall: probability of correctly identifying a class.
–Recall for Iris-setosa is 49/50 = 98%
–Called sensitivity in the medical literature
Accuracy: number right / total = 143/150 ≈ 95%
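To make the three definitions concrete, the sketch below computes them from a confusion matrix whose rows are actual classes and whose columns are predicted classes. The matrix is hypothetical: it was chosen only to agree with the figures quoted above (49/49, 49/50, 143/150) and is not taken from Weka output. The function name `per_class_metrics` is likewise made up for this example.

```python
def per_class_metrics(matrix, labels):
    """Return ({label: (precision, recall)}, overall accuracy).

    matrix[i][j] = number of examples of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {}
    for i, label in enumerate(labels):
        predicted_as = sum(row[i] for row in matrix)  # column sum
        actually = sum(matrix[i])                     # row sum
        precision = matrix[i][i] / predicted_as if predicted_as else 0.0
        recall = matrix[i][i] / actually if actually else 0.0
        report[label] = (precision, recall)
    return report, correct / total


LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# HYPOTHETICAL confusion matrix, consistent with the slide's stated figures.
MATRIX = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]

report, overall_accuracy = per_class_metrics(MATRIX, LABELS)
for label, (precision, recall) in report.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f}")
print(f"accuracy={overall_accuracy:.3f}")
```

Precision divides by a column sum (everything predicted as the class), while recall divides by a row sum (everything that actually is the class).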
Other Evaluation Schemes
Leave-one-out cross-validation
–n-fold cross-validation where n = number of training instances
Specific train and test set
–Allows for exact replication
–OK if the train and test sets are large, e.g. in the 10,000 range
Bootstrap sampling
Randomly select n examples with replacement from the n available
Expect about 2/3 of the distinct examples to be chosen for training
–Probability that an example is not chosen = (1-1/n)^n ≈ 1/e, so about 1 - 1/e ≈ 63% are chosen
Test on the remainder
Repeat about 30 times and average
Avoids partition bias
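The "about 2/3" figure can be checked numerically. The sketch below compares (1-1/n)^n with 1/e and then draws 30 bootstrap samples to measure the fraction of distinct examples that land in training. The function names and the choice n = 150 are assumptions made for illustration; no learner is trained here.

```python
import math
import random


def prob_never_chosen(n):
    """Probability that one fixed example is missed by all n draws."""
    return (1 - 1 / n) ** n


def bootstrap_split(n, rng):
    """Draw n indices with replacement; the unchosen indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


n = 150                  # assumed dataset size, matching the iris example
rng = random.Random(0)   # fixed seed so the run is repeatable

fractions = []
for _ in range(30):      # repeat about 30 times, as on the slide
    train, test = bootstrap_split(n, rng)
    fractions.append(len(set(train)) / n)

mean_fraction = sum(fractions) / len(fractions)
print("P(not chosen):", prob_never_chosen(n), "vs 1/e:", 1 / math.e)
print("mean fraction of distinct examples in training:", mean_fraction)
```

In a full evaluation, each of the 30 rounds would train on `train`, score on `test`, and the 30 scores would be averaged.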