Weka Project assignment 3


Weka Project assignment 3 by Jason Chang

Overview
- Assignment
- ID3 revisit
- Weka
- Development process
- Results
- Problems
- Conclusion
- Future development

Assignment requirements
1. Write a program to implement Quinlan's basic two-category ID3 algorithm. Test it on the weather data (weather.nominal.arff).
2. Implement a way to deal with missing values in the program. Test it on the breast-cancer data (breast-cancer.arff).
3. Extend the program to work with multiple categories. Run a series of tests on the soybean data (soybean.arff).

ID3 revisit
- J. Ross Quinlan originally developed ID3 at the University of Sydney.
- It is a decision tree classifier.
- It makes use of entropy, a measure of the "degree of doubt" (impurity) in a set of examples.
- It splits on the attribute A that gives the highest information gain of the class.
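The entropy and information gain used by ID3 can be sketched in plain Java, with no Weka dependency. The counts in `main` are the standard weather-data example (9 "yes" / 5 "no", split on outlook); the class and method names are my own:

```java
public class InfoGain {
    // Entropy of a class-count distribution: H(S) = -sum_i p_i * log2(p_i)
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;                     // 0 * log(0) is taken as 0
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v), one branch per value of A
    static double informationGain(int[] classCounts, int[][] splitCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double remainder = 0.0;
        for (int[] branch : splitCounts) {
            int n = 0;
            for (int c : branch) n += c;
            remainder += (double) n / total * entropy(branch);
        }
        return entropy(classCounts) - remainder;
    }

    public static void main(String[] args) {
        // Weather data: 9 yes / 5 no; outlook splits into
        // sunny {2 yes, 3 no}, overcast {4 yes, 0 no}, rainy {3 yes, 2 no}
        int[] classCounts = {9, 5};
        int[][] outlook = {{2, 3}, {4, 0}, {3, 2}};
        System.out.println("entropy = " + entropy(classCounts));                     // ~0.940
        System.out.println("gain(outlook) = " + informationGain(classCounts, outlook)); // ~0.247
    }
}
```

ID3 evaluates this gain for every remaining attribute and splits on the largest.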

ID3 revisit cont. A quick summary of the algorithm:
1. For each attribute, compute its entropy with respect to the conclusion class.
2. Compute the gains and select the attribute (say A) with the highest gain.
3. Divide the data into separate subsets so that within each subset, A has a fixed value.
4. Build a tree node with each branch representing one attribute value.
5. For each subtree, repeat this process from step 1.
At each iteration, one attribute is removed from consideration. The process stops when there are no attributes left to consider, or when all the data in a subtree have the same value for the conclusion class.
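The recursive structure of the steps above, in pseudocode:

```
buildTree(examples, attributes):
    if all examples share one class value: return leaf(that class)
    if attributes is empty:                return leaf(majority class)
    A = attribute in attributes with the highest information gain
    node = new internal node splitting on A
    for each value v of A:
        S_v = examples where A == v
        node.branch[v] = buildTree(S_v, attributes - {A})   # A removed from consideration
    return node
```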

ID3 implementation. I cheated!! Rather than start from scratch, I modified Weka's ID3 code, which first required understanding Weka's implementation.

Weka data objects
- Instances: the table
- Instance: a row
- Attribute: a column (e.g. Eatable, skin, color, size)
- Attribute value: a cell (e.g. yes, rough, Brown, large)
- Class: the attribute to predict (Eatable in this example)
Understanding these objects is essential if you are to implement an algorithm that makes use of Weka's data processing.

Weka's ID3. A nested structure: each Id3 object holds an array of child Id3 objects, where the array index represents an attribute value. An internal node stores its split attribute; a leaf stores the class value, the class attribute, and the class distribution.
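A minimal sketch of that nested node structure in plain Java. The field names here are my own, only loosely modeled on Weka's Id3 source; this is an illustration of the shape, not Weka's actual class:

```java
public class Id3Node {
    // Children of this node; index i corresponds to value i of the split attribute.
    // null means this node is a leaf.
    Id3Node[] successors;
    int splitAttribute = -1;     // attribute tested at this node (-1 at a leaf)
    double classValue = -1;      // class predicted at a leaf
    double[] classDistribution;  // class counts observed at a leaf

    boolean isLeaf() { return successors == null; }

    public static void main(String[] args) {
        Id3Node leafYes = new Id3Node();
        leafYes.classValue = 0;                       // e.g. class "yes"
        Id3Node root = new Id3Node();
        root.splitAttribute = 2;                      // e.g. split on attribute 2
        root.successors = new Id3Node[] { leafYes };  // one branch per attribute value
        System.out.println(root.isLeaf());            // prints false
        System.out.println(root.successors[0].isLeaf()); // prints true
    }
}
```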

Modification to Weka's ID3. Build a two-category ID3? Add a check to the buildClassifier method:

    if (data.classAttribute().numClasses() != 2) {
        throw new UnsupportedClassTypeException(
            "Id3: class with two category only, please.");
    }

Modification to Weka's ID3 cont. Add a missing-value handling feature? A naive approach!!
1. Find the row with the most similar attribute-value match.
2. Copy its value to replace the missing one.
3. If no row is found, or the match ratio is low, delete the row with the missing value.

Modification to Weka's ID3 cont. My algorithm reduces the dataset to only the rows with similar attribute values. It loops through all attributes and all rows.

Modification to Weka's ID3 cont. Part of the actual code (cleaned up: `enum` is a reserved word in Java 5+, so it is renamed, and the missing braces and count update are restored):

    Instances tempdata = new Instances(data);       // a copy of the original instances
    Instances tempInstance = new Instances(data, data.numInstances()); // an empty one
    int attrnum = therow.numAttributes();
    boolean noMatch = false;
    int count = 0;      // count how many attributes looked at produced a similar value
    tempdata.delete(instindex);                     // delete the row with the missing attribute

    // loop through all attributes
    for (int i = 0; i < attrnum; i++) {
        Attribute cur_attr = therow.attribute(i);
        Enumeration instEnum = tempdata.enumerateInstances();
        // loop through all rows, ignoring the attribute that is the missing one
        while (instEnum.hasMoreElements() && !cur_attr.equals(attr)) {
            Instance cur_inst = (Instance) instEnum.nextElement();
            // current row has the same attribute value as the row with the missing value
            if (!cur_inst.isMissing(cur_attr) && !cur_inst.isMissing(attr)
                    && cur_inst.value(cur_attr) == therow.value(i)) {
                tempInstance.add(cur_inst);         // add to the temp table
            }
        }
        count++;
        if (tempInstance.numInstances() == 1 && count >= attrnum / 3) {
            // only 1 left! must be it
            tempdata = tempInstance;
            break;
        }
    }

Result on breast-cancer data: bad!
Training data:
    Correctly Classified Instances    279    97.5524 %
    Incorrectly Classified Instances    7     2.4476 %
Cross-validation:
    Correctly Classified Instances    162    56.6434 %
    Incorrectly Classified Instances   92    32.1678 %
Overfitting!!

Result on breast-cancer data: how about simply ignoring missing values!? One line of code.
Training data:
    Correctly Classified Instances    275    96.1538 %
    Incorrectly Classified Instances   10     3.4965 %
Cross-validation:
    Correctly Classified Instances    166    58.042 %
    Incorrectly Classified Instances   87    30.4196 %
A little better, but still bad! 1 line of code (58%) vs. 30 lines of code (56%).
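The "one line" idea is listwise deletion: drop every row that contains a missing value (in Weka this can be done with methods on Instances such as deleteWithMissing). A self-contained plain-Java illustration of the same idea, with hypothetical data and names of my own:

```java
import java.util.ArrayList;
import java.util.List;

public class ListwiseDeletion {
    // Keep only the rows with no missing (null) cells.
    static List<String[]> dropRowsWithMissing(List<String[]> rows) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            boolean missing = false;
            for (String cell : row) {
                if (cell == null) { missing = true; break; }
            }
            if (!missing) kept.add(row);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"yes", "rough", "brown"});
        rows.add(new String[] {"no", null, "green"});   // has a missing value
        rows.add(new String[] {"yes", "smooth", "red"});
        System.out.println(dropRowsWithMissing(rows).size()); // prints 2
    }
}
```

The trade-off shown by the results above: this is trivial to implement, but it throws away whole rows, which hurts when missing values are frequent.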

Result on soybean data: fairly decent performance.
Training data:
    Correctly Classified Instances    679    99.4143 %
    Incorrectly Classified Instances    4     0.5857 %
Cross-validation:
    Correctly Classified Instances    606    88.7262 %
    Incorrectly Classified Instances   64     9.3704 %

Lessons learned
- A simpler approach might work better!!
- Ignoring missing values is not appropriate for a dataset with a large volume of missing values.
- Became familiar with Weka's implementation.
- Understand ID3 more clearly.

Problems!!
- It won't compile! Solution: javac Id3m.java -classpath weka.jar
- It won't run with the Weka Explorer! Run it in the Simple CLI instead (soybean in this sample): java weka.classifiers.trees.Id3m -t data/soybean.arff

Conclusion
- The breast-cancer dataset seems to perform universally badly with tree classifiers.
- High information gain is not always the way to go.
- ID3 handles multi-class data well.
- My algorithm is biased toward similar rows that show up early in the record.

Future development
- Further modify ID3 to satisfy the assignment requirements.
- Try something else to improve the results.
- An innovative way to compute a unique value per row; this would increase speed and eliminate the bias problem.

References
- Rule induction: Ross Quinlan's ID3 algorithm. http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
- Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/~ml/weka/index.html
- Weka javadoc
- Data Mining: Practical Machine Learning Tools and Techniques. http://web.archive.org/web/20011112215049/www.mkp.com/books_catalog/weka/teaching_material/Assignment3.html