Weka Tutorial
WEKA:: Introduction
A collection of open source ML algorithms
– pre-processing
– classifiers
– clustering
– association rules
Created by researchers at the University of Waikato in New Zealand
Java based
WEKA:: Installation
Download software from
– If you are interested in modifying/extending Weka, there is a developer version that includes the source code
Set the Weka environment variables for Java
– setenv WEKAHOME /usr/local/weka/weka
– setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH
Download some ML data from
WEKA:: Introduction (contd.)
Routines are implemented as classes and logically arranged in packages
Comes with an extensive GUI
– Weka routines can also be used standalone via the command line
– E.g. java weka.classifiers.j48.J48 -t $WEKAHOME/data/iris.arff
WEKA:: Interface
WEKA:: Data format
Uses flat text files to describe the data
Can work with a wide variety of data files, including its own ".arff" format and C4.5 file formats
Data can be imported from a file in various formats:
– ARFF, CSV, C4.5, binary
Data can also be read from a URL or from an SQL database (using JDBC)
WEKA:: ARFF file
'enamelability' cholesterol shape { COIL, class
'?','C','A',0,60,'T','?','?',0,'?','?','G','?','?','?','?','M','?','?','?','?','?','?','?','?','?','?','?','?','?','?','COIL',2.801,385.1,0,'?','0','?','3'
'?','C','A',0,60,'T','?','?',0,'?','?','G','?','?','?','?','B','Y','?','?','?','Y','?','?','?','?','?','?','?','?','?','SHEET',0.801,255,269,'?','0','?','3'
'?','C','A',0,45,'?','S','?',0,'?','?','D','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','?','COIL',1.6,610,0,'?','0','?','3'...
A more thorough description is available here
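To make the row layout concrete, here is a minimal sketch of reading one data row in plain Java. It is not Weka's own loader: the class and method names are hypothetical, values are assumed to contain no embedded commas or escaped quotes, and the quoted '?' entries in the rows above are assumed to stand for missing values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Weka: reads one line of an ARFF data section.
class ArffRowSketch {
    // Splits a row on commas, trims whitespace and strips surrounding single quotes.
    // Assumption: no value contains an embedded comma or an escaped quote.
    static List<String> parseRow(String line) {
        List<String> values = new ArrayList<>();
        for (String raw : line.split(",")) {
            String v = raw.trim();
            if (v.length() >= 2 && v.startsWith("'") && v.endsWith("'")) {
                v = v.substring(1, v.length() - 1);
            }
            values.add(v);
        }
        return values;
    }

    // Counts the values marked as missing. Assumption: "?" marks a missing value,
    // whether or not it was quoted in the file.
    static int countMissing(List<String> values) {
        int missing = 0;
        for (String v : values) {
            if (v.equals("?")) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A shortened row in the style of the annealing rows above.
        List<String> row = parseRow("'?','C','A',0,60,'T','?','COIL',2.801");
        System.out.println(row.size() + " values, " + countMissing(row) + " missing");
    }
}
```

Counting the missing entries per row or per attribute is a quick way to see how much cleaning a dataset like this one needs.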
WEKA:: Explorer: Preprocessing
Pre-processing tools in WEKA are called "filters"
WEKA contains filters for:
– Discretization, normalization, resampling, attribute selection, transforming, combining attributes, etc.
Annealing dataset: Description
The Annealing dataset comes from the UCI repository of datasets. It describes steel being annealed and its various properties. The dataset has 38 attributes: 6 are continuous, 3 are integer valued, and the remaining 29 are nominal. It contains missing values and has 798 records in total, spread across 6 major classes. The notion of classes will be explained later, during classification.
Data Cleaning: Removing missing values
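The slide does not say how the missing values are handled. One common approach, and the one Weka's ReplaceMissingValues filter takes, is to fill numeric gaps with the attribute mean and nominal gaps with the most frequent value. The sketch below illustrates that idea in plain Java. Whether the tutorial used that filter is an assumption, and the class name, method names and the NaN/null encoding of missing values are hypothetical choices made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of mean/mode imputation; not Weka code.
class ImputeSketch {
    // Numeric column: NaN marks a missing value (an assumption of this sketch).
    // Returns a copy in which every NaN is replaced by the mean of the known values.
    static double[] fillWithMean(double[] column) {
        double sum = 0.0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double mean = count == 0 ? 0.0 : sum / count;
        double[] out = column.clone();
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }

    // Nominal column: null marks a missing value (an assumption of this sketch).
    // Returns a copy in which every null is replaced by the most frequent value.
    // On a tie, the value that first reaches the highest count wins.
    static String[] fillWithMode(String[] column) {
        Map<String, Integer> counts = new HashMap<>();
        String mode = null;
        int best = 0;
        for (String v : column) {
            if (v == null) {
                continue;
            }
            int c = counts.merge(v, 1, Integer::sum);
            if (c > best) {
                best = c;
                mode = v;
            }
        }
        String[] out = column.clone();
        for (int i = 0; i < out.length; i++) {
            if (out[i] == null) {
                out[i] = mode;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] filled = fillWithMean(new double[] {1.0, Double.NaN, 3.0});
        System.out.println(filled[1]);
        String[] shapes = fillWithMode(new String[] {"COIL", null, "SHEET", "COIL"});
        System.out.println(shapes[1]);
    }
}
```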
Data Cleaning: Removing useless attributes
Earlier 38 attributes, now 32
Data transformation: Discretizing the attributes
– The setting shown implies 15 bins
– "first-last" means all attributes
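As a rough illustration of what discretizing into 15 bins does to a numeric attribute, the sketch below applies equal-width binning in plain Java. Equal-width binning is an assumption here, since the slide names only the bin count, and the class and method names are hypothetical. The column is assumed to have no missing values left after the cleaning step.

```java
import java.util.Arrays;

// Hypothetical illustration of equal-width discretization; not Weka code.
class EqualWidthSketch {
    // Maps each value to a bin index in [0, bins - 1].
    // All bins have the same width: (max - min) / bins.
    static int[] discretize(double[] values, int bins) {
        if (bins <= 0) {
            throw new IllegalArgumentException("bins must be positive");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double width = (max - min) / bins;
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            if (width == 0) {
                out[i] = 0; // all values are identical, so one bin is enough
            } else {
                int bin = (int) ((values[i] - min) / width);
                out[i] = Math.min(bin, bins - 1); // the maximum falls into the last bin
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three values taken from the sample rows shown earlier, 15 bins as on the slide.
        double[] sample = {0.801, 1.6, 2.801};
        System.out.println(Arrays.toString(discretize(sample, 15)));
    }
}
```

After this step each numeric value is replaced by a bin label, so the attribute can be treated as nominal by later steps.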
Data reduction: Supervised attribute selection
Reduces the number of attributes from 32 to 10
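The slide does not name the selection method. "Supervised" means the class label is used to judge each attribute, and one widely used score of that kind is information gain. The sketch below computes it for a single nominal attribute in plain Java. Choosing information gain is an assumption for illustration, not necessarily the evaluator used in the tutorial, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of scoring an attribute by information gain; not Weka code.
class InfoGainSketch {
    // Shannon entropy (in bits) of a label distribution given as counts.
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) {
                continue;
            }
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain = H(class) - H(class | attribute).
    // attribute[i] and labels[i] belong to the same record.
    static double infoGain(String[] attribute, String[] labels) {
        int n = labels.length;
        if (n == 0) {
            return 0.0;
        }
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Map<String, Integer>> countsByValue = new HashMap<>();
        Map<String, Integer> valueTotals = new HashMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(labels[i], 1, Integer::sum);
            countsByValue.computeIfAbsent(attribute[i], k -> new HashMap<>())
                         .merge(labels[i], 1, Integer::sum);
            valueTotals.merge(attribute[i], 1, Integer::sum);
        }
        double conditional = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : countsByValue.entrySet()) {
            int size = valueTotals.get(e.getKey());
            conditional += ((double) size / n) * entropy(e.getValue(), size);
        }
        return entropy(classCounts, n) - conditional;
    }

    public static void main(String[] args) {
        String[] labels = {"x", "x", "y", "y"};
        // An attribute that separates the classes perfectly, and one that does not.
        System.out.println(infoGain(new String[] {"a", "a", "b", "b"}, labels));
        System.out.println(infoGain(new String[] {"a", "b", "a", "b"}, labels));
    }
}
```

Ranking all attributes by such a score and keeping the top 10 would be one way to arrive at a reduction like the one on the slide.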
Viewing and understanding the transformed data
This can be done using the ARFF viewer option in Weka. The viewer can also save files in other formats, such as CSV. There is also an option to convert ARFF to CSV and vice versa. After conversion, such files can easily be imported into MySQL and other databases.
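To show what the ARFF to CSV conversion amounts to, here is a minimal sketch in plain Java. It is not Weka's converter: the class and method names are hypothetical, the attribute names in the usage example are made up, and values are assumed to contain no embedded commas or escaped quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of an ARFF to CSV conversion; not Weka code.
class ArffToCsvSketch {
    // Strips one pair of surrounding single quotes, if present.
    static String unquote(String s) {
        String t = s.trim();
        if (t.length() >= 2 && t.startsWith("'") && t.endsWith("'")) {
            return t.substring(1, t.length() - 1);
        }
        return t;
    }

    // Extracts the attribute name from an "@attribute <name> <type>" line.
    static String attributeName(String line) {
        String rest = line.trim().substring("@attribute".length()).trim();
        if (rest.startsWith("'")) {
            int end = rest.indexOf('\'', 1);
            return end < 0 ? rest.substring(1) : rest.substring(1, end);
        }
        return rest.split("\\s+")[0];
    }

    // Returns CSV lines: one header built from the attribute names, then one line per data row.
    static List<String> convert(List<String> arffLines) {
        List<String> names = new ArrayList<>();
        List<String> csv = new ArrayList<>();
        boolean inData = false;
        for (String line : arffLines) {
            String t = line.trim();
            if (t.isEmpty() || t.startsWith("%")) {
                continue; // skip blank lines and ARFF comments
            }
            String lower = t.toLowerCase();
            if (!inData && lower.startsWith("@attribute")) {
                names.add(attributeName(t));
            } else if (!inData && lower.startsWith("@data")) {
                inData = true;
                csv.add(String.join(",", names));
            } else if (inData) {
                List<String> values = new ArrayList<>();
                for (String raw : t.split(",")) {
                    values.add(unquote(raw));
                }
                csv.add(String.join(",", values));
            }
        }
        return csv;
    }

    public static void main(String[] args) {
        // Made-up two-attribute file for illustration only.
        List<String> arff = Arrays.asList(
            "@relation demo",
            "@attribute shape {COIL,SHEET}",
            "@attribute 'thick' numeric",
            "@data",
            "'COIL',2.801",
            "'?',0.801");
        for (String row : convert(arff)) {
            System.out.println(row);
        }
    }
}
```

The sketch passes missing values through as "?". A database import would typically need those mapped to NULL or an empty field first.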
The data is now ready for data mining!