1 An Exercise in Machine Learning
Machine Learning Software
Preparing Data
Building Classifiers
Interpreting Results
Test-driving WEKA

2 Machine Learning Software
Suites (general purpose):
  WEKA (source: Java)
  MLC++ (source: C++)
  SIPINA
  Lists from KDnuggets (various)
Specific tools:
  Classification: C4.5, SVMlight
  Association rule mining
  Bayesian networks
  ...
Commercial vs. free vs. programming your own

3 What does WEKA do?
Implementations of state-of-the-art learning algorithms
Main strength is classification; also regression, association-rule, and clustering algorithms
Extensible, making it easy to try new learning schemes
Large variety of handy tools (dataset transformations, filters, visualization, etc.)

4 WEKA resources
API documentation, tutorial, source code
WEKA mailing list
Book: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
WEKA-related projects:
  Weka-Parallel - parallel processing for WEKA
  RWeka - linking R and WEKA
  YALE - Yet Another Learning Environment
  Many others...

5 Getting Started
Installation (Java runtime + WEKA)
Setting up the environment (CLASSPATH)
Reference book and online API documentation
Preparing data sets
Running WEKA
Interpreting results

6 ARFF Data Format
Attribute-Relation File Format
Header - describes the attribute names and types
Data - instances (examples) as comma-separated lists
Use the right data format: Filestem, CSV -> ARFF
Use C45Loader and CSVLoader to convert (a minimal ARFF example is shown below)
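For concreteness, here is an abridged weather.arff, the small example file that ships with WEKA (the same dataset behind the decision tree output later in these slides). The header declares each attribute; the data section lists one instance per line:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
% ... 10 further instances omitted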

7 Launching WEKA
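As a minimal sketch, the WEKA GUI can be launched from a shell like this (the path to weka.jar depends on the installation):

export CLASSPATH=$CLASSPATH:/path/to/weka.jar   # CLASSPATH setup from slide 5
java weka.gui.GUIChooser                        # opens the WEKA GUI chooser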

8 Load Dataset into WEKA
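The same step can also be done programmatically. A minimal sketch that loads an ARFF file into memory (weather.arff is used as the example file name):

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Parse the ARFF file into an in-memory dataset
        Instances data = new Instances(
                new BufferedReader(new FileReader("weather.arff")));
        // ARFF does not mark the class attribute; by convention it is the last one
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}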

9 Data Filters
Useful support for data preprocessing:
Removing or adding attributes, resampling the dataset, removing examples, etc.
Can create stratified cross-validation folds of a dataset, with class distributions approximately retained within each fold
A typical split: 2/3 of the data for training, 1/3 for testing (see the sketch below)
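A sketch of the filter API using two real filter classes: Remove drops an attribute, and StratifiedRemoveFolds extracts stratified folds. It assumes data was loaded and its class index set as in the previous sketch, inside a method that throws Exception:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

Remove remove = new Remove();
remove.setAttributeIndices("1");          // drop the first attribute
remove.setInputFormat(data);
Instances reduced = Filter.useFilter(data, remove);

StratifiedRemoveFolds folds = new StratifiedRemoveFolds();
folds.setNumFolds(3);                     // 3 folds: roughly 2/3 train, 1/3 test
folds.setFold(1);
folds.setInvertSelection(true);           // keep everything except fold 1
folds.setInputFormat(reduced);
Instances train = Filter.useFilter(reduced, folds);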

10 Building Classifiers
A classifier is a model - a mapping from the dataset's attributes to the class (target) attribute. How the model is created and what form it takes differ from scheme to scheme.
Decision tree and Naïve Bayes classifiers (a build sketch follows below)
Which one is best? No free lunch!
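A minimal sketch of building the two classifiers named above, again assuming data is a loaded Instances object with its class index set:

import weka.classifiers.trees.J48;
import weka.classifiers.bayes.NaiveBayes;

J48 tree = new J48();                 // C4.5 decision tree learner
tree.buildClassifier(data);
System.out.println(tree);             // prints the learned tree

NaiveBayes nb = new NaiveBayes();
nb.buildClassifier(data);
System.out.println(nb);               // prints the estimated distributions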

11 Building a Classifier

12 (1) weka.classifiers.rules.ZeroR
Class for building and using a 0-R classifier: predicts the mean (for a numeric class) or the mode (for a nominal class).

(2) weka.classifiers.bayes.NaiveBayes
Class for building a Naive Bayes classifier.
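ZeroR makes a useful sanity-check baseline: any learned model should beat it. A sketch comparing it against Naive Bayes under 10-fold cross-validation (same data assumptions as above):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.bayes.NaiveBayes;

Evaluation base = new Evaluation(data);
base.crossValidateModel(new ZeroR(), data, 10, new Random(1));

Evaluation bayes = new Evaluation(data);
bayes.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

System.out.println("ZeroR accuracy:      " + base.pctCorrect() + " %");
System.out.println("NaiveBayes accuracy: " + bayes.pctCorrect() + " %");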

13 (3) weka.classifiers.trees.J48
Class for generating an unpruned or a pruned C4.5 decision tree.

14 Test Options
Percentage split (2/3 training; 1/3 testing)
Cross-validation: estimates the generalization error by resampling when data are limited; the error estimate is averaged over the folds
  Stratified 10-fold
  Leave-one-out (LOO)
  10-fold vs. LOO: LOO uses the most training data per fold but is costly and cannot be stratified
Both options are sketched below.
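A minimal sketch of both test options through the Evaluation API, assuming data is a loaded Instances object with its class index set, inside a method that throws Exception:

import java.util.Random;
import weka.core.Instances;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;

// Percentage split: 2/3 training, 1/3 testing
data.randomize(new Random(1));
int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
Instances train = new Instances(data, 0, trainSize);
Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

J48 tree = new J48();
tree.buildClassifier(train);
Evaluation holdout = new Evaluation(train);
holdout.evaluateModel(tree, test);
System.out.println(holdout.toSummaryString());

// Stratified 10-fold cross-validation; pass data.numInstances() as the
// number of folds to get leave-one-out instead
Evaluation cv = new Evaluation(data);
cv.crossValidateModel(new J48(), data, 10, new Random(1));
System.out.println(cv.toSummaryString());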

15 Understanding Output

16 Decision Tree Output (1)
=== Error on training data ===

Correctly Classified Instances       14      100      %
Incorrectly Classified Instances      0        0      %
Kappa statistic                       1
Mean absolute error                   0
Root mean squared error               0
Relative absolute error               0      %
Root relative squared error           0      %
Total Number of Instances            14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0        1          1       1          yes
1        0        1          1       1          no

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

J48 pruned tree
---------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8

17 Decision Tree Output (2)
=== Stratified cross-validation ===

Correctly Classified Instances        9       64.2857 %
Incorrectly Classified Instances      5       35.7143 %
Kappa statistic                       0.186
Mean absolute error                   0.2857
Root mean squared error               0.4818
Relative absolute error              60      %
Root relative squared error          97.6586 %
Total Number of Instances            14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.778    0.6      0.7        0.778   0.737      yes
0.4      0.222    0.5        0.4     0.444      no

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no

18 Performance Measures
Accuracy & error rate
Mean absolute error
Root mean-squared error (square root of the average quadratic loss)
Confusion matrix - a contingency table
True-positive rate & false-positive rate
Precision & F-measure
All of these are also available programmatically (see the sketch below).
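Continuing the cross-validation sketch from slide 14, where cv is an Evaluation object, each measure can be read directly (class index 0 is the first class value, e.g. yes):

System.out.println("Accuracy:  " + cv.pctCorrect() + " %");
System.out.println("MAE:       " + cv.meanAbsoluteError());
System.out.println("RMSE:      " + cv.rootMeanSquaredError());
System.out.println("Kappa:     " + cv.kappa());
System.out.println("TP rate:   " + cv.truePositiveRate(0));
System.out.println("FP rate:   " + cv.falsePositiveRate(0));
System.out.println("Precision: " + cv.precision(0));
System.out.println("F-measure: " + cv.fMeasure(0));
System.out.println(cv.toMatrixString());   // confusion matrix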

19 Decision Tree Pruning
Overcomes over-fitting
Pre-pruning and post-pruning
Reduced-error pruning
Subtree raising with different confidence factors
Compare the resulting tree size and accuracy (the options are sketched below)
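The pruning variants on this slide map onto J48 options. A sketch, with WEKA's default values shown (same data assumptions as earlier):

import weka.classifiers.trees.J48;

J48 tree = new J48();
tree.setUnpruned(false);              // enable pruning (the default)
tree.setConfidenceFactor(0.25f);      // smaller values prune more aggressively
tree.setSubtreeRaising(true);         // subtree raising is on by default
// Alternatively, replace confidence-based pruning with reduced-error pruning:
// tree.setReducedErrorPruning(true);
tree.buildClassifier(data);
System.out.println("Leaves: " + tree.measureNumLeaves()
        + ", tree size: " + tree.measureTreeSize());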

20 Subtree Replacement
Bottom-up: a tree is considered for replacement once all of its subtrees have been considered

21 Subtree Raising
Deletes a node and redistributes its instances
Slower than subtree replacement

22 Naïve Bayesian Classifier
Outputs the conditional probability tables and the same set of performance measures
By default, numeric attributes are modeled with a normal distribution
A kernel density estimator can improve performance if the normality assumption is wrong (-K option; see below)
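Enabling the kernel estimator programmatically, equivalent to the -K command-line option (same data assumptions as earlier):

import weka.classifiers.bayes.NaiveBayes;

NaiveBayes nb = new NaiveBayes();
nb.setUseKernelEstimator(true);   // kernel density estimate instead of a Gaussian
nb.buildClassifier(data);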

23 Data Sets to Work On
Data sets were preprocessed into ARFF format
Three data sets from the UCI repository
Two data sets from computational biology:
  Protein function prediction
  Surface residue prediction

24 Protein Function Prediction
Build a decision tree classifier that assigns protein sequences to functional families based on their characteristic motif compositions
Each attribute (motif) is identified by a PROSITE accession number: PS####
Class labels use PROSITE documentation IDs: PDOC####
73 binary attributes & 10 classes (PDOC)
Suggested method: use 10-fold CV and prune the tree with subtree raising (see the sketch below)
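A sketch of the suggested setup; the file name protein.arff is a placeholder for whatever the preprocessed data set is actually called:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.core.Instances;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;

Instances data = new Instances(
        new BufferedReader(new FileReader("protein.arff")));  // placeholder name
data.setClassIndex(data.numAttributes() - 1);

J48 tree = new J48();
tree.setSubtreeRaising(true);                 // prune with subtree raising
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(tree, data, 10, new Random(1));
System.out.println(eval.toSummaryString());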

25 Surface Residue Prediction
Prediction is based on the identity of the target residue and its 4 sequence neighbors (window size = 5):
  X1 X2 X3 X4 X5
Is the target residue on the surface or not?
5 attributes and a binary class
Suggested method: use a Naïve Bayes classifier with no kernel estimator (see the sketch below)
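For this task the sketch is shorter: load the data as in the previous slide (surface.arff is again a placeholder name) and use Naive Bayes with its default estimator, since the kernel option stays off unless explicitly enabled:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;

NaiveBayes nb = new NaiveBayes();     // kernel estimator is off by default
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(nb, data, 10, new Random(1));
System.out.println(eval.toSummaryString());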

26 Your Turn to Test Drive!

