Weka Project assignment 3 by Jason Chang
Overview
- Assignment
- ID3 revisit
- Weka
- Development process
- Result
- Problems
- Conclusion
- Future development
Assignment requirement
- Write a program to implement Quinlan's basic two-category ID3 algorithm. Test it on the weather data (weather.nominal.arff).
- Implement a way to deal with missing values in the program. Test it on the breast-cancer data (breast-cancer.arff).
- Extend the program to work with multiple categories. Run a series of tests on the soybean data (soybean.arff).
ID3 revisit
- J. Ross Quinlan originally developed ID3 at the University of Sydney
- Decision tree classifier
- Makes use of entropy: “degree of doubt”
- Information gain of the class from an attribute A
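For reference, the standard definitions behind the last two bullets can be written as a short LaTeX block. The symbols are notation assumed here, not taken from the slides: S is the set of training rows, p_c is the fraction of rows in S with class c, and S_v is the subset of S where attribute A has value v.

```latex
\[
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
\]
```

For example, a two-category set with 9 rows of one class and 5 of the other has H(S) of about 0.940 bits.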
ID3 revisit cont.
Quick summary of the algorithm:
1. For each attribute, compute its entropy with respect to the conclusion class.
2. Compute the gain of each attribute and select the attribute (say A) with the highest gain.
3. Divide the data into separate sets so that within a set, A has a fixed value.
4. Build a tree in which each branch represents one value of A.
5. For each subtree, repeat this process from step 1.
At each iteration, one attribute is removed from consideration. The process stops when there are no attributes left to consider, or when all the data being considered in a subtree have the same value for the conclusion class.
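Steps 1 and 2 can be sketched in plain Java without Weka. This is only an illustration under stated assumptions: the class and method names (Id3GainSketch, entropy, gain, bestAttribute) are hypothetical, rows are plain string arrays with the class in the last column, and the data is the 14-row weather.nominal.arff set with its columns in the order outlook, temperature, humidity, windy, play.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Weka.
class Id3GainSketch {
    // Rows of weather.nominal.arff: outlook, temperature, humidity, windy, play (class).
    static final String[][] WEATHER = {
        {"sunny", "hot", "high", "FALSE", "no"},
        {"sunny", "hot", "high", "TRUE", "no"},
        {"overcast", "hot", "high", "FALSE", "yes"},
        {"rainy", "mild", "high", "FALSE", "yes"},
        {"rainy", "cool", "normal", "FALSE", "yes"},
        {"rainy", "cool", "normal", "TRUE", "no"},
        {"overcast", "cool", "normal", "TRUE", "yes"},
        {"sunny", "mild", "high", "FALSE", "no"},
        {"sunny", "cool", "normal", "FALSE", "yes"},
        {"rainy", "mild", "normal", "FALSE", "yes"},
        {"sunny", "mild", "normal", "TRUE", "yes"},
        {"overcast", "mild", "high", "TRUE", "yes"},
        {"overcast", "hot", "normal", "FALSE", "yes"},
        {"rainy", "mild", "high", "TRUE", "no"},
    };

    // Entropy, in bits, of the class column (last column) over the given rows.
    static double entropy(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) counts.merge(r[r.length - 1], 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain of the class from the attribute in column attr.
    static double gain(List<String[]> rows, int attr) {
        Map<String, List<String[]>> parts = new HashMap<>();
        for (String[] r : rows) parts.computeIfAbsent(r[attr], k -> new ArrayList<>()).add(r);
        double remainder = 0.0;
        for (List<String[]> part : parts.values())
            remainder += (double) part.size() / rows.size() * entropy(part);
        return entropy(rows) - remainder;
    }

    // Index of the non-class attribute with the highest gain (first one wins on ties).
    static int bestAttribute(List<String[]> rows) {
        int best = 0;
        int numAttrs = rows.get(0).length - 1;
        for (int a = 1; a < numAttrs; a++)
            if (gain(rows, a) > gain(rows, best)) best = a;
        return best;
    }
}
```

Calling `Id3GainSketch.bestAttribute(java.util.Arrays.asList(Id3GainSketch.WEATHER))` selects column 0 (outlook), whose gain is about 0.247 bits, so outlook becomes the root split on the weather data.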
ID3 implementation
I cheated!!
- Modify Weka’s ID3 code
- Understand Weka’s implementation
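Weka data objects
- Instances: the table
- Instance: a row
- Attribute: a column, with its attribute values
- Class: the attribute to be predicted
[Figure: example table with columns Eatable, skin, color, size; sample values include yes/no, rough/smooth, Brown/Green/Red, large]
This is essential if you are to implement an algorithm that makes use of Weka's data processing.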
Weka’s ID3
- A nested Id3 object that holds an array of child Id3 objects
- The index into the array represents the attribute value
- Fields: split attribute, class value, class attribute, class distribution
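The nested structure can be illustrated with a small stand-alone Java class. This is a sketch, not Weka's source: the class name TreeNodeSketch and all field and method names are hypothetical, and a row is assumed to be an int array holding the value index of each nominal attribute.

```java
// Hypothetical stand-in for a nested decision tree node; not Weka's Id3 class.
class TreeNodeSketch {
    int splitAttribute = -1;      // attribute tested at this node; -1 marks a leaf
    TreeNodeSketch[] successors;  // one child per value index of the split attribute
    int classValue;               // predicted class index (meaningful for a leaf)
    double[] distribution;        // class distribution at a leaf

    // Builds a leaf that predicts the most frequent class in the distribution.
    static TreeNodeSketch leaf(double[] distribution) {
        TreeNodeSketch n = new TreeNodeSketch();
        n.distribution = distribution;
        int best = 0;
        for (int i = 1; i < distribution.length; i++)
            if (distribution[i] > distribution[best]) best = i;
        n.classValue = best;
        return n;
    }

    // Builds an inner node; children[v] handles rows whose attribute value index is v.
    static TreeNodeSketch split(int attribute, TreeNodeSketch... children) {
        TreeNodeSketch n = new TreeNodeSketch();
        n.splitAttribute = attribute;
        n.successors = children;
        return n;
    }

    // Follows the child selected by each split attribute's value until a leaf is reached.
    int classify(int[] row) {
        TreeNodeSketch node = this;
        while (node.splitAttribute >= 0)
            node = node.successors[row[node.splitAttribute]];
        return node.classValue;
    }
}
```

Using the attribute value as an array index means classification needs no lookups or comparisons at a node, only one array access per level of the tree.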
Modification to Weka’s ID3
Build a two-category ID3? Add this check to the buildClassifier method (numClasses() is a method of Instances, so it is called on data directly):

    if (data.numClasses() != 2) {
        throw new UnsupportedClassTypeException("Id3: class with two category only, please.");
    }
Modification to Weka’s ID3 cont.
Add a missing-value handling feature? A naive approach!!
1. Find the row with the most matching attribute values.
2. Copy that row's value to replace the missing value.
3. If no row is found or the match ratio is low, delete the row with the missing value.
Modification to Weka’s ID3 cont.
- My algorithm reduces the dataset to rows with similar attribute values only
- It loops through all attributes and rows
Modification to Weka’s ID3 cont.
Part of actual code:

    Instances tempdata = new Instances(data); // create a copy of the original instances
    Instances tempInstance = new Instances(data, data.numInstances()); // create an empty one
    int attrnum = therow.numAttributes();
    boolean noMatch = false;
    int count; // count how many attributes looked at produced a similar value
    Enumeration en = tempdata.enumerateInstances(); // renamed from "enum", a reserved word since Java 5
    count = 0;
    tempdata.delete(instindex); // delete the row with the missing attribute
    // loop through all attributes
    for (int i = 0; i < attrnum; i++) {
        Attribute cur_attr = therow.attribute(i);
        en = tempdata.enumerateInstances();
        // loop through all rows; skip them if the attribute being looked at is the missing one
        while (en.hasMoreElements() && !cur_attr.equals(attr)) {
            Instance cur_inst = (Instance) en.nextElement();
            // current row has the same attribute value as the row with the missing value
            if (!cur_inst.isMissing(cur_attr) && !cur_inst.isMissing(attr)
                    && cur_inst.value(cur_attr) == therow.value(i))
                tempInstance.add(cur_inst); // add to temp table
        }
        if (tempInstance.numInstances() == 1 && count >= attrnum/3) { // only 1 left! must be it
            tempdata = tempInstance;
            break;
        }
        // ... rest of the loop is not shown
    }
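The three steps of the naive approach can also be sketched as a self-contained Java class that runs without Weka. This is a simplified variant, not the code above: it scores every candidate row by its number of matching attribute values instead of narrowing the candidate set one attribute at a time. The class name NearestRowImputeSketch, the method names, and the minRatio parameter are all hypothetical, and rows are assumed to be string arrays that use "?" for a missing value, as ARFF files do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of nearest-row imputation; not the Id3m code.
class NearestRowImputeSketch {
    static final String MISSING = "?"; // ARFF marker for a missing value

    // Counts columns other than skipCol where row a has a known value equal to row b's.
    static int matches(String[] a, String[] b, int skipCol) {
        int n = 0;
        for (int i = 0; i < a.length; i++)
            if (i != skipCol && !a[i].equals(MISSING) && a[i].equals(b[i])) n++;
        return n;
    }

    // Returns new rows in which each missing value is copied from the most similar row.
    // A row is dropped when no donor matches at least minRatio of its other columns.
    static List<String[]> impute(List<String[]> rows, double minRatio) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String[] fixed = row.clone();
            boolean keep = true;
            for (int col = 0; col < row.length && keep; col++) {
                if (!row[col].equals(MISSING)) continue;
                String[] donor = null;
                int best = -1;
                for (String[] other : rows) {
                    // A donor must be a different row with a known value in this column.
                    if (other == row || other[col].equals(MISSING)) continue;
                    int m = matches(row, other, col);
                    if (m > best) { best = m; donor = other; } // earliest row wins ties
                }
                int otherCols = row.length - 1;
                if (donor != null && best >= minRatio * otherCols) fixed[col] = donor[col];
                else keep = false; // no good donor: delete the row
            }
            if (keep) out.add(fixed);
        }
        return out;
    }
}
```

Because ties go to the earliest candidate, this variant also favors similar rows that appear early in the data, the same kind of ordering bias noted in the conclusion.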
Result on breast-cancer data
Bad!
Training data
    Correctly Classified Instances      279    97.5524 %
    Incorrectly Classified Instances      7     2.4476 %
Cross-validation
    Correctly Classified Instances      162    56.6434 %
    Incorrectly Classified Instances     92    32.1678 %
Overfitting!!
Result on breast-cancer data
How about simply ignoring missing values!? 1 line of code
Training data
    Correctly Classified Instances      275    96.1538 %
    Incorrectly Classified Instances     10     3.4965 %
Cross-validation
    Correctly Classified Instances      166    58.042 %
    Incorrectly Classified Instances     87    30.4196 %
A little better but still bad!
1 line of code (58%) vs 30 lines of code (56%)
Result on soybean data
Fairly decent performance
Training data
    Correctly Classified Instances      679    99.4143 %
    Incorrectly Classified Instances      4     0.5857 %
Cross-validation
    Correctly Classified Instances      606    88.7262 %
    Incorrectly Classified Instances     64     9.3704 %
Lessons learned
- A simpler approach might work better!!
- Ignoring missing values is not appropriate for a dataset with a large volume of missing values
- Became familiar with Weka's implementation
- Understand ID3 more clearly
Problems!!
- It won’t compile! Solution: put weka.jar on the classpath
    javac -classpath weka.jar Id3m.java
- It won’t run with the Weka Explorer! Run it in the Simple CLI instead (soybean in this sample)
    java weka.classifiers.trees.Id3m -t data/soybean.arff
Conclusion
- The breast-cancer dataset seems to perform universally badly with tree classifiers
- High information gain is not always the way to go
- ID3 handles multi-class data well
- My algorithm is biased toward similar rows that show up early in the records
Future development
- Further modify ID3 to satisfy the assignment requirements
- Try something else to improve the results
- An innovative way to compute a unique value per row. This would increase speed and eliminate the bias problem
References
- Rule induction: Ross Quinlan's ID3 algorithm, http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
- Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/~ml/weka/index.html
- Weka javadoc
- Data Mining: Practical Machine Learning Tools and Techniques, http://web.archive.org/web/20011112215049/www.mkp.com/books_catalog/weka/teaching_material/Assignment3.html