An Exercise in Machine Learning

An Exercise in Machine Learning
http://www.cs.iastate.edu/~cs573x/bbsilab.html
- Machine Learning Software
- Preparing Data
- Building Classifiers
- Interpreting Results
- Test-driving WEKA

Machine Learning Software
- Suites (general purpose): WEKA (source in Java), MLC++ (source in C++), SIPINA; list from KDNuggets (various)
- Specific tools: classification (C4.5, SVMlight), association rule mining, Bayesian nets, ...
- Commercial vs. free vs. programming your own

What does WEKA do?
- Implementations of state-of-the-art learning algorithms
- Main strength is classification; regression, association-rule, and clustering algorithms are also included
- Extensible, so new learning schemes can be tried
- Large variety of handy tools (transforming datasets, filters, visualization, etc.)

WEKA resources
- API documentation, tutorial, source code
- WEKA mailing list
- Book: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
- Weka-related projects: Weka-Parallel (parallel processing for Weka), RWeka (linking R and Weka), YALE (Yet Another Learning Environment), and many others

Getting Started
- Installation (Java runtime + WEKA)
- Setting up the environment (CLASSPATH)
- Reference book and online API documentation
- Preparing data sets
- Running WEKA
- Interpreting results

ARFF Data Format
- Attribute-Relation File Format
- Header: describes the attribute types
- Data: instances (examples) as comma-separated lists
- Use the right data format: convert Filestem or CSV files to ARFF
- Use C45Loader and CSVLoader to do the conversion
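To make the header/data split concrete, here is an abridged ARFF file for the classic weather dataset that ships with WEKA (only the first few data rows are shown):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
```

Nominal attributes list their allowed values in braces; numeric attributes are declared with the keyword numeric; each @data line is one instance.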

Launching WEKA

Load Dataset into WEKA

Data Filters
- Useful support for data preprocessing
- Removing or adding attributes, resampling the dataset, removing examples, etc.
- Can create stratified cross-validation folds of the dataset, so that class distributions are approximately preserved within each fold
- A typical split puts 2/3 of the data in training and 1/3 in testing
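What stratification means can be shown in a few lines. The sketch below is an illustration of the idea, not Weka's implementation: it deals each class's instances round-robin across the folds so every fold keeps roughly the full set's class proportions.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign instance indices to k folds so that every fold gets
    approximately the same class proportions as the full dataset."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # deal this class's instances round-robin across the folds
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

# 9 'yes' and 5 'no' labels, as in the weather data
labels = ["yes"] * 9 + ["no"] * 5
for fold in stratified_folds(labels, 3):
    print(sorted(fold))
```

Each of the three folds ends up with exactly three 'yes' instances and one or two 'no' instances, mirroring the 9:5 overall ratio.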

Building Classifiers
- A classifier is a model mapping the dataset's attributes to the class (target) attribute
- Classifiers differ in how they are created and in the form of the model
- Here: decision tree and Naïve Bayes classifiers
- Which one is the best? No free lunch!

Building Classifiers

(1) weka.classifiers.rules.ZeroR
- Builds and uses a 0-R classifier
- Predicts the mean (for a numeric class) or the mode (for a nominal class)
(2) weka.classifiers.bayes.NaiveBayes
- Class for building a Naïve Bayes classifier
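ZeroR's entire behaviour fits in a few lines, which is why it serves as a baseline. This is a sketch of the idea in Python, not Weka's code:

```python
from statistics import mean, mode

def zero_r(class_values):
    """Return a constant predictor: the mean for a numeric class,
    the mode (most frequent value) for a nominal class."""
    if all(isinstance(v, (int, float)) for v in class_values):
        prediction = mean(class_values)
    else:
        prediction = mode(class_values)
    # ZeroR ignores the instance entirely
    return lambda instance=None: prediction

# Nominal class with 9 'yes' vs 5 'no': always predict 'yes'
predict = zero_r(["yes"] * 9 + ["no"] * 5)
print(predict())  # -> yes
```

Any learned model should beat this baseline; on the weather data ZeroR already gets 9/14 right by always saying 'yes'.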

(3) weka.classifiers.trees.J48
- Class for generating an unpruned or a pruned C4.5 decision tree
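C4.5 grows its tree by repeatedly choosing the attribute whose split best separates the classes, using an entropy-based criterion (gain ratio, a normalized form of information gain). A minimal sketch of plain information gain, evaluated on the weather data's outlook attribute:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, groups):
    """Entropy of the whole set minus the size-weighted entropy
    of the subsets produced by splitting on some attribute."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Splitting the 14 weather instances on 'outlook':
# sunny -> 2 yes / 3 no, overcast -> 4 yes, rainy -> 3 yes / 2 no
labels = ["yes"] * 9 + ["no"] * 5
groups = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(info_gain(labels, groups), 3))  # -> 0.247
```

Outlook's gain of 0.247 bits is the highest of the four weather attributes, which is why it ends up at the root of the J48 tree shown below in the output.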

Test Options
- Percentage split (2/3 training; 1/3 testing)
- Cross-validation: estimates the generalization error by resampling when data are limited; the error estimates are averaged across folds
- Stratified 10-fold
- Leave-one-out (LOO)
- 10-fold vs. LOO
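The two resampling schemes differ only in how the data are partitioned: leave-one-out is the extreme case where the number of folds equals the number of instances. A sketch of the index bookkeeping (a hypothetical helper for illustration, not a Weka API):

```python
def cv_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation over n instances; k == n gives leave-one-out."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        test_set = set(test)
        train = [i for i in range(n) if i not in test_set]
        yield train, test

n = 14  # size of the weather dataset
ten_fold = list(cv_splits(n, 10))
loo = list(cv_splits(n, n))
print(len(ten_fold), len(loo))           # -> 10 14
print(all(len(t) == 1 for _, t in loo))  # -> True
```

Every instance is tested exactly once in either scheme; LOO trains n models on n-1 instances each, so it is nearly unbiased but expensive, while 10-fold is the usual compromise.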

Understanding Output

Decision Tree Output (1)

J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8

=== Error on training data ===
Correctly Classified Instances    14    100 %
Incorrectly Classified Instances   0      0 %
Kappa statistic                    1
Mean absolute error                0
Root mean squared error            0
Relative absolute error            0 %
Root relative squared error        0 %
Total Number of Instances         14

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
  1        0         1         1        1       yes
  1        0         1         1        1       no

=== Confusion Matrix ===
 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

Decision Tree Output (2)

=== Stratified cross-validation ===
Correctly Classified Instances     9    64.2857 %
Incorrectly Classified Instances   5    35.7143 %
Kappa statistic                    0.186
Mean absolute error                0.2857
Root mean squared error            0.4818
Relative absolute error           60      %
Root relative squared error       97.6586 %
Total Number of Instances         14

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
 0.778    0.6       0.7      0.778    0.737     yes
 0.4      0.222     0.5      0.4      0.444     no

=== Confusion Matrix ===
 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no

Performance Measures
- Accuracy and error rate
- Mean absolute error
- Root mean squared error (square root of the average quadratic loss)
- Confusion matrix (contingency table)
- True positive rate and false positive rate
- Precision and F-measure
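All of these measures can be recomputed from the confusion matrix alone. The sketch below reproduces the cross-validated figures from the output above (7 2 / 3 2, with 'yes' as the positive class):

```python
def metrics(tp, fn, fp, tn):
    """Per-class measures for the positive class, plus overall
    accuracy and the kappa statistic, from a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    tpr = tp / (tp + fn)        # recall / true positive rate
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    # kappa: agreement beyond what class frequencies alone would give
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, tpr, fpr, precision, f_measure, kappa

acc, tpr, fpr, prec, f1, kappa = metrics(tp=7, fn=2, fp=3, tn=2)
print(round(acc, 4), round(tpr, 3), round(prec, 1), round(f1, 3),
      round(kappa, 3))  # -> 0.6429 0.778 0.7 0.737 0.186
```

The results match the WEKA output line by line: 64.29 % accuracy, TP rate 0.778, precision 0.7, F-measure 0.737, and kappa 0.186.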

Decision Tree Pruning
- Helps overcome overfitting
- Pre-pruning and post-pruning
- Reduced-error pruning
- Subtree raising with different confidence levels
- Compare tree size and accuracy

Subtree Replacement
- Bottom-up: a tree is considered for replacement once all its subtrees have been considered

Subtree Raising
- Deletes a node and redistributes its instances
- Slower than subtree replacement

Naïve Bayesian Classifier Output
- Conditional probability tables (CPTs) and the same set of performance measures
- By default, numeric attributes are modeled with a normal distribution
- A kernel density estimator can improve performance when the normality assumption is wrong (-K option)
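Why the kernel estimator can help is easy to see on a bimodal attribute: a single Gaussian puts its peak between the two clusters, where no data actually lie. This sketch contrasts the two density models (an illustration of the idea, not Weka's internals):

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Likelihood of x under a single normal model, as Naive Bayes
    uses for each numeric attribute by default."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def kernel_pdf(x, sample, bandwidth=1.0):
    """Kernel density estimate: the average of small Gaussians
    centred on the training values (the -K alternative)."""
    return mean(gaussian_pdf(x, v, bandwidth) for v in sample)

# Bimodal attribute values: two clusters around 61 and 91
sample = [60, 61, 62, 90, 91, 92]
mu, sigma = mean(sample), stdev(sample)

# The single Gaussian peaks at 76, between the clusters...
print(gaussian_pdf(76, mu, sigma) > gaussian_pdf(61, mu, sigma))  # -> True
# ...while the kernel estimate correctly prefers the cluster centre
print(kernel_pdf(61, sample) > kernel_pdf(76, sample))            # -> True
```

When the attribute really is roughly normal, the single Gaussian is the better (lower-variance) estimate, which is why -K is an option rather than the default.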

Data Sets to Work On
- Data sets were preprocessed into ARFF format
- Three data sets from the UCI repository
- Two data sets from computational biology: protein function prediction and surface residue prediction

Protein Function Prediction
- Build a decision tree classifier that assigns protein sequences to functional families based on their characteristic motif compositions
- Each attribute (motif) has a PROSITE accession number: PS####
- Class labels use PROSITE documentation IDs: PDOC####
- 73 binary attributes and 10 classes (PDOCs)
- Suggested method: 10-fold CV, pruning the tree with the subtree-raising method

Surface Residue Prediction
- Prediction is based on the identity of the target residue and its 4 sequence neighbors (X1 X2 X3 X4 X5)
- Window size = 5
- Is the target residue on the surface or not?
- 5 attributes and a binary class
- Suggested method: Naïve Bayes classifier with no kernels
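Turning a protein sequence into fixed-width instances is a sliding-window exercise. The sketch below shows the windowing step only; in the real task each centre residue would additionally carry a surface/buried label from structural annotation, which is not modeled here. The peptide string is hypothetical, for illustration:

```python
def window_instances(sequence, size=5):
    """Slide a window of `size` residues along the sequence; each
    window is one instance and its centre residue is the position
    being classified."""
    half = size // 2
    instances = []
    for i in range(half, len(sequence) - half):
        window = list(sequence[i - half:i + half + 1])
        instances.append((window, sequence[i]))  # (attributes, centre residue)
    return instances

for window, centre in window_instances("MKVLAT"):
    print(window, "-> centre:", centre)
```

Note that the first and last two residues of the sequence never appear as centres; in practice the ends are either skipped, as here, or padded with a dummy residue symbol.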

Your Turn to Test Drive!