Weka Project assignment 3

by Jason Chang

Overview
  Assignment
  ID3 revisit
  Weka
  Development process
  Result
  Problems
  Conclusion
  Future development

Assignment requirement
1. Write a program to implement Quinlan's basic two-category ID3 algorithm. Test it on the weather data (weather.nominal.arff).
2. Implement a way to deal with missing values in the program. Test it on the breast-cancer data (breast-cancer.arff).
3. Extend your program to work with multiple categories. Run a series of tests on the soybean data (soybean.arff).

ID3 revisit
J. Ross Quinlan originally developed ID3 at the University of Sydney.
It is a decision tree classifier.
It makes use of entropy, the “degree of doubt”.
It selects attributes by the information gain of the class from an attribute A.
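For reference, the standard textbook definitions of entropy and information gain can be written as below. The symbols are assumptions chosen for this note: S is the set of training rows, C the set of class values, p_c the fraction of rows in S with class c, and S_v the subset of S where attribute A takes value v.

```latex
H(S) = -\sum_{c \in C} p_c \log_2 p_c
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

A pure subset (all rows in one class) has H = 0, and a two-class subset split evenly has H = 1 bit.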

ID3 revisit cont.
Quick summary of the algorithm:
1. For each attribute, compute its entropy with respect to the conclusion class.
2. Compute the gain of each attribute and select the attribute (say A) with the highest gain.
3. Divide the data into separate sets so that within a set, A has a fixed value.
4. Build a tree in which each branch represents one value of A.
5. For each subtree, repeat this process from step 1.
At each iteration, one attribute is removed from consideration. The process stops when there are no attributes left to consider, or when all the data being considered in a subtree have the same value for the conclusion class.
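The entropy and gain computations in steps 1 and 2 can be sketched in plain Java, without Weka. This is a minimal illustration, not the code used in the project: the class name GainSketch, the method names, and the representation of a dataset as a String[][] with one column per attribute are all assumptions made for the example.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Weka.
class GainSketch {

    /** Entropy (in bits) of a class distribution given as raw counts. */
    static double entropy(Collection<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** Information gain of the class column from one attribute column. */
    static double gain(String[][] rows, int attr, int classIdx) {
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Map<String, Integer>> countsByValue = new HashMap<>();
        for (String[] row : rows) {
            classCounts.merge(row[classIdx], 1, Integer::sum);
            countsByValue.computeIfAbsent(row[attr], k -> new HashMap<>())
                         .merge(row[classIdx], 1, Integer::sum);
        }
        // Start from the entropy of the whole set, then subtract the
        // size-weighted entropy of each subset with a fixed attribute value.
        double g = entropy(classCounts.values());
        for (Map<String, Integer> subset : countsByValue.values()) {
            int size = 0;
            for (int c : subset.values()) size += c;
            g -= (double) size / rows.length * entropy(subset.values());
        }
        return g;
    }

    /** Index of the attribute with the highest gain (step 2). */
    static int bestAttribute(String[][] rows, int classIdx) {
        int best = -1;
        double bestGain = -1.0;
        for (int a = 0; a < rows[0].length; a++) {
            if (a == classIdx) continue;
            double g = gain(rows, a, classIdx);
            if (g > bestGain) {
                bestGain = g;
                best = a;
            }
        }
        return best;
    }
}
```

Calling GainSketch.bestAttribute(rows, classIdx) on the rows of a subtree gives the column to split on. Steps 3 to 5 would then partition the rows by that column and recurse.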

ID3 implementation
I cheated!! I modified Weka’s ID3 code.
Understand Weka’s implementation

Weka data objects
Instances: a table
Instance: a row
Attribute: a column
Attribute value: a single cell
Class: the attribute to be predicted
Example table: class Eatable (yes, no); attributes skin (rough, smooth), color (Brown, Green, Red) and size (large).
This is essential if you are to implement an algorithm that makes use of Weka's data processing.

Weka’s ID3
A nested Id3 object: each node holds its split attribute and an array of child Id3 objects.
The array index represents the attribute value.
A leaf holds the class value, the class attribute and the class distribution.
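The nested structure described above can be mirrored in a small standalone class. This is only a sketch of the idea. The class name Id3Node and its field names are hypothetical and are not Weka's own identifiers, and rows are assumed to be int arrays holding the index of each attribute's value.

```java
// Hypothetical stand-in for the nested tree; not Weka's Id3 class.
class Id3Node {
    int splitAttribute = -1;   // index of the attribute tested here; -1 marks a leaf
    Id3Node[] successors;      // child i handles value i of the split attribute
    int classValue;            // predicted class index (leaves only)
    double[] distribution;     // class distribution (leaves only)

    /** Builds a leaf that predicts the most frequent class in the distribution. */
    static Id3Node leaf(double[] distribution) {
        Id3Node n = new Id3Node();
        n.distribution = distribution;
        for (int i = 1; i < distribution.length; i++) {
            if (distribution[i] > distribution[n.classValue]) n.classValue = i;
        }
        return n;
    }

    /** Builds an inner node with one child per value of the split attribute. */
    static Id3Node split(int attribute, Id3Node... successors) {
        Id3Node n = new Id3Node();
        n.splitAttribute = attribute;
        n.successors = successors;
        return n;
    }

    /** Follows the branch matching each attribute value until a leaf is reached. */
    int classify(int[] row) {
        Id3Node node = this;
        while (node.splitAttribute >= 0) {
            node = node.successors[row[node.splitAttribute]];
        }
        return node.classValue;
    }
}
```

Using the attribute value as the array index means classification needs no searching: each level of the tree costs a single array lookup.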

Modification to Weka’s ID3
Build a two-category ID3? Add to the buildClassifier method:
    // numValues() on the class attribute gives the number of class categories
    if (data.classAttribute().numValues() != 2) {
        throw new UnsupportedClassTypeException("Id3: class with two category only, please.");
    }

Modification to Weka’s ID3 cont.
Add a missing value handling feature? A naive approach!!
1. Find the row with the most similar attribute values.
2. Copy its value to replace the missing value.
3. If no row is found or the match ratio is low, delete the row with the missing value.
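The three steps above can be sketched without Weka as follows. Note that this is a simplified variant written for illustration: it scores every candidate row by its number of matching attribute values and keeps the best one, rather than narrowing the dataset attribute by attribute. The class name, the "?" marker for a missing value (the ARFF convention) and the minRatio parameter are assumptions.

```java
// Hypothetical helper, not part of Weka or of the project code.
class NearestRowImputer {
    static final String MISSING = "?";

    /**
     * Returns the value to copy into rows[rowIdx][attrIdx], taken from the
     * most similar row, or null if no row matches well enough (in which
     * case the caller would delete the incomplete row).
     */
    static String impute(String[][] rows, int rowIdx, int attrIdx, double minRatio) {
        String[] target = rows[rowIdx];
        int comparable = target.length - 1; // every attribute except the missing one
        int bestMatches = -1;
        String bestValue = null;
        for (int r = 0; r < rows.length; r++) {
            // Skip the incomplete row itself and rows that lack the value too.
            if (r == rowIdx || MISSING.equals(rows[r][attrIdx])) continue;
            int matches = 0;
            for (int a = 0; a < target.length; a++) {
                if (a != attrIdx && !MISSING.equals(target[a])
                        && target[a].equals(rows[r][a])) {
                    matches++;
                }
            }
            // Strict comparison: on a tie the earliest row wins.
            if (matches > bestMatches) {
                bestMatches = matches;
                bestValue = rows[r][attrIdx];
            }
        }
        if (bestValue == null || comparable == 0
                || (double) bestMatches / comparable < minRatio) {
            return null;
        }
        return bestValue;
    }
}
```

Because ties go to the earliest row, this sketch favors rows that appear early in the data. Breaking ties by the most common value among the tied rows would be one way to reduce that effect.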

My algorithm
It reduces the dataset to rows with similar attribute values only.
It loops through all attributes and rows.

Part of actual code
    Instances tempdata = new Instances(data);                          // copy of the original instances
    Instances tempInstance = new Instances(data, data.numInstances()); // empty set with the same header
    int attrnum = therow.numAttributes();
    boolean noMatch = false;
    int count = 0; // how many attributes looked at so far produced a similar value
    tempdata.delete(instindex); // delete the row with the missing attribute

    // loop through all attributes
    for (int i = 0; i < attrnum; i++) {
        Attribute cur_attr = therow.attribute(i);
        // renamed from "enum", which is a reserved word since Java 5
        Enumeration rows = tempdata.enumerateInstances();
        // loop through all rows; skip the pass if attribute i is the missing one
        while (rows.hasMoreElements() && !cur_attr.equals(attr)) {
            Instance cur_inst = (Instance) rows.nextElement();
            // current row has the same attribute value as the row with the missing value
            if (!cur_inst.isMissing(cur_attr) && !cur_inst.isMissing(attr)
                    && cur_inst.value(cur_attr) == therow.value(i)) {
                tempInstance.add(cur_inst); // add to temp table
            }
        }
        if (tempInstance.numInstances() == 1 && count >= attrnum / 3) { // only 1 left! must be it
            tempdata = tempInstance;
            break;
        }
        // ... (the rest of the loop is not part of this excerpt)
    }

Result on breast-cancer data
Bad!
Training data
  Correctly Classified Instances %
  Incorrectly Classified Instances %
Cross-validation
  Correctly Classified Instances %
  Incorrectly Classified Instances %
Overfitting!!

Result on breast-cancer data
How about simply ignoring missing values!? 1 line of code.
Training data
  Correctly Classified Instances %
  Incorrectly Classified Instances %
Cross-validation
  Correctly Classified Instances %
  Incorrectly Classified Instances %
A little better but still bad! 1 line of code (58%) vs 30 lines of code (56%)

Result on soybean data
Fairly decent performance
Training data
  Correctly Classified Instances %
  Incorrectly Classified Instances %
Cross-validation
  Correctly Classified Instances %
  Incorrectly Classified Instances %

Lessons learned
A simpler approach might work better!!
Ignoring missing values is not appropriate in a dataset with a large volume of missing values.
I became familiar with Weka's implementation.
I understand ID3 more clearly.

Problems!!
It won’t compile! Solution:
    javac -classpath weka.jar Id3m.java
It won’t run with the Weka Explorer! Solution: run it in the Simple CLI (soybean in this sample):
    java weka.classifiers.trees.Id3m -t data/soybean.arff

Conclusion
The breast-cancer dataset seems to perform universally badly with tree classifiers.
High information gain is not always the way to go.
ID3 handles multi-class data well.
My algorithm is biased toward similar rows that show up early in the records.

Future development
Further modify ID3 to satisfy the assignment requirements.
Try something else to improve the results.
Find an innovative way to compute a unique value per row. This would increase speed and eliminate the bias problem.

