An Exercise in Machine Learning
http://www.cs.iastate.edu/~cs573x/bbsilab.html
- Machine Learning Software
- Preparing Data
- Building Classifiers
- Interpreting Results
- Test-driving WEKA
Machine Learning Software
- Suites (general purpose): WEKA (source: Java), MLC++ (source: C++), SIPINA, list from KDnuggets (various)
- Specific tools: classification (C4.5, SVMlight), association rule mining, Bayesian nets, ...
- Commercial vs. free vs. do-it-yourself programming
What does WEKA do?
- Implementations of state-of-the-art learning algorithms
- Main strength is classification; regression, association rule, and clustering algorithms are also included
- Extensible, so new learning schemes are easy to try
- Large variety of handy tools (transforming datasets, filters, visualization, etc.)
WEKA resources
- API documentation, tutorial, source code
- WEKA mailing list
- Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
- Weka-related projects: Weka-Parallel (parallel processing for Weka), RWeka (linking R and Weka), YALE (Yet Another Learning Environment), many others...
Getting Started
- Installation (Java runtime + WEKA)
- Setting up the environment (CLASSPATH)
- Reference book and online API documentation
- Preparing data sets
- Running WEKA
- Interpreting results
ARFF Data Format
- Attribute-Relation File Format
- Header: describes the attribute types
- Data: comma-separated list of instances (examples)
- Use the right data format: convert Filestem and CSV files to ARFF with C45Loader and CSVLoader
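A minimal example of the format is the weather dataset distributed with WEKA; the header declares each attribute's type, and each data row is one instance (first four of its 14 instances shown):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```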
Launching WEKA
Load Dataset into WEKA
Data Filters
- Useful support for data preprocessing: removing or adding attributes, resampling the dataset, removing examples, etc.
- Can create stratified cross-validation folds of the given dataset, so class distributions are approximately retained within each fold
- Typical split: 2/3 of the data for training and 1/3 for testing
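The stratified-fold idea can be sketched outside WEKA: group the instances by class, then deal each group round-robin into the k folds, so every fold keeps roughly the overall class distribution. This is an illustrative sketch under that assumption, not WEKA's actual implementation; `stratify` is a hypothetical helper name.

```java
import java.util.*;

public class StratifiedFolds {
    // Deal the indices of each class round-robin into k folds so the
    // class distribution is approximately retained within each fold.
    public static List<List<Integer>> stratify(String[] labels, int k) {
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++)
            byClass.computeIfAbsent(labels[i], c -> new ArrayList<>()).add(i);
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        int next = 0;                       // keeps dealing position across classes
        for (List<Integer> indices : byClass.values())
            for (int i : indices) folds.get(next++ % k).add(i);
        return folds;
    }

    public static void main(String[] args) {
        String[] y = {"yes", "yes", "yes", "yes", "no", "no"};
        System.out.println(stratify(y, 2)); // each fold gets 2 "yes" and 1 "no"
    }
}
```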
Building Classifiers
- A classifier is a model mapping the dataset attributes to the class (target) attribute; its creation and form differ by algorithm
- Decision tree and Naïve Bayes classifiers
- Which one is the best? No free lunch!
Building Classifier
(1) weka.classifiers.rules.ZeroR
- Builds and uses a 0-R classifier: predicts the mean (for a numeric class) or the mode (for a nominal class)
(2) weka.classifiers.bayes.NaiveBayes
- Class for building a Naïve Bayes classifier
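For a nominal class, ZeroR's behaviour amounts to counting the training labels and always returning the most frequent one, regardless of the attributes; a minimal sketch of that baseline (not WEKA's code, and `predict` is a hypothetical name):

```java
import java.util.*;

public class ZeroRSketch {
    // ZeroR ignores every attribute and always predicts the mode of
    // the (nominal) class column -- a useful baseline. Sketch only.
    public static String predict(String[] trainLabels) {
        Map<String, Integer> counts = new HashMap<>();
        String best = trainLabels[0];
        for (String label : trainLabels) {
            int c = counts.merge(label, 1, Integer::sum);
            if (c > counts.get(best)) best = label;
        }
        return best;
    }

    public static void main(String[] args) {
        // Weather data class column: 9 "yes", 5 "no".
        String[] y = {"no","no","yes","yes","yes","no","yes",
                      "no","yes","yes","yes","yes","yes","no"};
        System.out.println(predict(y)); // prints "yes"
    }
}
```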
(3) weka.classifiers.trees.J48 Class for generating an unpruned or a pruned C4.5 decision tree.
Test Options
- Percentage split (2/3 training; 1/3 testing)
- Cross-validation: estimates the generalization error by resampling when data are limited; error estimates are averaged over the folds
  - stratified 10-fold
  - leave-one-out (LOO)
  - 10-fold vs. LOO
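The only structural difference between 10-fold CV and leave-one-out is the number of folds: LOO sets k = n, so every fold holds a single instance. A sketch of how n instances divide into k near-equal folds (`foldSizes` is a hypothetical helper, not a WEKA API):

```java
import java.util.Arrays;

public class FoldSizes {
    // Split n instances into k folds whose sizes differ by at most one.
    // With k == n (leave-one-out), every fold has exactly one instance.
    public static int[] foldSizes(int n, int k) {
        int[] sizes = new int[k];
        for (int f = 0; f < k; f++)
            sizes[f] = n / k + (f < n % k ? 1 : 0); // spread the remainder
        return sizes;
    }

    public static void main(String[] args) {
        // The 14-instance weather data under 10-fold CV:
        System.out.println(Arrays.toString(foldSizes(14, 10)));
        // [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
    }
}
```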
Understanding Output
Decision Tree Output (1)

=== Error on training data ===
Correctly Classified Instances    14    100 %
Incorrectly Classified Instances   0      0 %
Kappa statistic                    1
Mean absolute error                0
Root mean squared error            0
Relative absolute error            0 %
Root relative squared error        0 %
Total Number of Instances         14

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0        1          1       1          yes
1        0        1          1       1          no

=== Confusion Matrix ===
 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8
Decision Tree Output (2)

=== Stratified cross-validation ===
Correctly Classified Instances     9    64.2857 %
Incorrectly Classified Instances   5    35.7143 %
Kappa statistic                    0.186
Mean absolute error                0.2857
Root mean squared error            0.4818
Relative absolute error           60 %
Root relative squared error       97.6586 %
Total Number of Instances         14

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.778    0.6      0.7        0.778   0.737      yes
0.4      0.222    0.5        0.4     0.444      no

=== Confusion Matrix ===
 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no
Performance Measures
- Accuracy and error rate
- Mean absolute error
- Root mean squared error (square root of the average quadratic loss)
- Confusion matrix (contingency table)
- True positive rate and false positive rate
- Precision and F-measure
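All of these class-wise measures fall straight out of the confusion matrix. Taking the cross-validated J48 matrix from the previous slide (7 2 / 3 2, with "yes" as the positive class), a sketch of the computations (illustrative helper names, not a WEKA API):

```java
public class Metrics {
    // Standard measures derived from a 2x2 confusion matrix.
    public static double accuracy(int tp, int fp, int fn, int tn) {
        return (double) (tp + tn) / (tp + fp + fn + tn);
    }
    public static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    public static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
    public static double fMeasure(double p, double r) { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        // 7 2 | a = yes   ->  TP = 7, FN = 2
        // 3 2 | b = no    ->  FP = 3, TN = 2
        double p = precision(7, 3), r = recall(7, 2);
        System.out.printf("acc=%.4f precision=%.3f recall=%.3f F=%.3f%n",
                accuracy(7, 3, 2, 2), p, r, fMeasure(p, r));
        // acc=0.6429 precision=0.700 recall=0.778 F=0.737
    }
}
```

These match the 64.2857 % accuracy and the "yes" row of the detailed-accuracy table in the output.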
Decision Tree Pruning
- Overcomes over-fitting
- Pre-pruning and post-pruning
- Reduced-error pruning
- Subtree raising at different confidence levels
- Compare tree size and accuracy
Subtree Replacement
- Bottom-up: a tree is considered for replacement once all its subtrees have been considered
Subtree Raising
- Deletes a node and redistributes its instances
- Slower than subtree replacement
Naïve Bayesian Classifier Output
- Conditional probability tables (CPTs) and the same set of performance measures
- By default, numeric attributes are modeled with a normal distribution
- A kernel density estimator (-K option) can improve performance when the normality assumption is incorrect
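The default per-attribute model is just the normal density, with the class-conditional mean and standard deviation estimated from the training data; the kernel estimator replaces this single Gaussian with a mixture of kernels. A sketch of the default density (the humidity parameters in the example are illustrative):

```java
public class GaussianLikelihood {
    // Normal density used per class and per numeric attribute by a
    // Naive Bayes classifier under the normality assumption.
    public static double density(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        // Likelihood of humidity = 85 if P(humidity | play = yes) is
        // modeled as Normal(mean 79.1, sd 10.2) -- illustrative numbers.
        System.out.println(density(85, 79.1, 10.2));
    }
}
```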
Data Sets to Work On
- Data sets were preprocessed into ARFF format
- Three data sets from the UCI repository
- Two data sets from computational biology: protein function prediction and surface residue prediction
Protein Function Prediction
- Build a decision tree classifier that assigns protein sequences to functional families based on characteristic motif compositions
- Each attribute (motif) has a Prosite accession number: PS####
- Class labels use Prosite documentation IDs: PDOC####
- 73 binary attributes and 10 classes (PDOC)
- Suggested method: use 10-fold CV and prune the tree with subtree raising
Surface Residue Prediction
- Predict whether the target residue is on the surface, based on the identity of the target residue and its 4 sequence neighbors (window X1 X2 X3 X4 X5; window size = 5)
- 5 attributes and a binary class
- Suggested method: use a Naïve Bayes classifier with no kernels
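With 5 nominal attributes, the Naïve Bayes decision reduces to comparing, for each class, the class prior times the five conditional probabilities P(Xi | class); the larger product wins. A sketch under that assumption (made-up probabilities, hypothetical helper name):

```java
public class NaiveBayesScore {
    // Unnormalized posterior: P(class) * product over i of P(x_i | class).
    public static double score(double prior, double[] conditionals) {
        double s = prior;
        for (double p : conditionals) s *= p;
        return s;
    }

    public static void main(String[] args) {
        // Made-up conditional probabilities for one 5-residue window:
        double surface = score(0.4, new double[]{0.2, 0.1, 0.3, 0.2, 0.25});
        double buried  = score(0.6, new double[]{0.1, 0.05, 0.2, 0.1, 0.15});
        System.out.println(surface > buried ? "surface" : "buried");
    }
}
```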
Your Turn to Test Drive!