Final Project: Project 9 Part 1: Neural Networks Part 2: Overview of Classifiers Aparna S. Varde April 28, 2005 CS539: Machine Learning Course Instructor: Prof. Carolina Ruiz

Part 1: Neural Networks
Data: The data sets used in this project are as follows.
CPU Data Set:
–Attributes describe features of computer CPUs, such as vendor.
–Target attribute denotes CPU performance: real.
–Other attributes are a mixture of real and nominal.
–8 attributes, 209 instances.
Iris Data Set:
–Attributes describe features of Iris flowers, such as sepals and petals.
–Target attribute denotes the species of Iris: nominal.
–5 attributes, 150 instances.
Covtype Data Set:
–Attributes describe features of forests, such as soil type and elevation.
–Target attribute denotes the forest cover type: nominal.
–55 attributes, approximately 58,000 instances.
–The predictors encode 12 underlying features; some are expanded into multiple Boolean attributes (type 1, type 2, etc.), giving 54 predictors plus the target.

Preprocessing of Data
Resampling:
–WEKA's instance-based unsupervised filter was used as a preprocessing step on the Covtype data to select subsets of instances for running experiments.
–3 subsets were selected, with 5000, 3000, and 1000 instances respectively.
–This was done to observe the impact of the neural network on data sets of different sizes.
Supervised Discretization:
–Converts continuous attributes to ranges for the Iris data, to observe the impact on accuracy.
–Supervised discretization was done with the default MDL criterion, since the target class is nominal.
–Done using WEKA's attribute-based supervised preprocessing filter.
–Followed by nominal-to-binary conversion, which is done directly in the neural net; this is discussed in the experiments section.
Unsupervised Discretization:
–Done for the CPU data set only, to allow comparison with other classifiers such as decision trees and ZeroR.
–The target is initially not nominal, so simple discretization was done by binning; discretization was needed because the J4.8 classifier used for comparison works with nominal targets only.
–Discretization was done using WEKA's attribute-based unsupervised preprocessing filter.
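The unsupervised binning step can be sketched as follows. This is an illustrative equal-width binning in Python, not the exact WEKA filter; the sample values and bin count are hypothetical.

```python
# A minimal sketch of unsupervised discretization by equal-width binning,
# similar in spirit to WEKA's unsupervised Discretize filter.
# The bin count and data below are illustrative, not taken from the project.

def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant attributes
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

perf = [23, 46, 70, 117, 15, 64, 23, 29, 22, 124]  # hypothetical CPU performance values
print(equal_width_bins(perf, 3))  # -> [0, 0, 1, 2, 0, 1, 0, 0, 0, 2]
```

The resulting bin indices can then be treated as a nominal target, which is what J4.8 requires.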

Experiments with Covertype Data
Experiments conducted (4-fold cross-validation used for testing):
1. Data set size: 1000, 3000, 5000; other parameters default.
2. Learning Rate: 0.1, 0.2, 0.3 … 1.0, with best settings from above.
3. Momentum: 0.1, 0.2, 0.3 … 1.0, with best settings from above.
4. Number of epochs: 100, 200, 300 … 1000, with best settings from above.
5. Validation Set: 0%, 5%, 10%, 15% … 50% of the data set, with best settings from above.
6. Validation Threshold: 10, 20, 30 … 100 epochs, with best settings from above.
7. Number of Hidden Layer units: a, i, o, t, with best settings from above, where i = number of input values, o = number of output classes, t = i + o, a = (i + o)/2.
8. Two Hidden Layers: “x,a” … “x,t”, where “x” is the best setting from above.
9. Normalization: True/False, with both default and best settings.
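The overall procedure — a greedy, one-parameter-at-a-time sweep evaluated by 4-fold cross-validation — can be sketched as follows. The `evaluate` scoring function, the toy preference for particular values, and the grids are illustrative stand-ins, not the project's actual WEKA runs.

```python
# Skeleton of the one-at-a-time parameter sweep used in the experiments:
# each parameter is varied over its grid while the others are held at the
# best settings found so far. `evaluate` is a stand-in for training a
# network and measuring 4-fold cross-validation accuracy.

import random

def four_fold_indices(n, seed=0):
    """Split instance indices into 4 disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::4] for f in range(4)]

def sweep(evaluate, settings, grids):
    """Tune parameters greedily, one grid at a time."""
    best = dict(settings)
    for param, grid in grids.items():
        scores = {v: evaluate({**best, param: v}) for v in grid}
        best[param] = max(scores, key=scores.get)
    return best

print([len(f) for f in four_fold_indices(1000)])  # -> [250, 250, 250, 250]

# Toy evaluate: prefers learning rate 0.4 and momentum 0.2 (illustrative only).
toy = lambda s: -abs(s["learning_rate"] - 0.4) - abs(s["momentum"] - 0.2)
defaults = {"learning_rate": 0.3, "momentum": 0.2}
grids = {"learning_rate": [round(0.1 * k, 1) for k in range(1, 11)],
         "momentum": [round(0.1 * k, 1) for k in range(1, 11)]}
print(sweep(toy, defaults, grids))
```

Note this greedy search can miss parameter interactions (e.g. the best momentum may differ at another learning rate), which is a known trade-off of the one-at-a-time design.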

Experiment 1: Effect of Data Set Size
The highest accuracy is obtained for the data set with 1000 instances.
The 1000-instance data set also requires the least time to model.
The lowest accuracy is obtained for the 3000-instance data set.
The longest time to model was with the 5000-instance data set.
Based on this, the 1000-instance data set was selected for the remaining experiments.
Default Settings: Learning Rate = 0.3, Momentum = 0.2, Number of Epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalization = True.

Experiment 2: Effect of Learning Rate
The maximum time to model is observed for a learning rate of 0.1.
The minimum time to model is observed for a learning rate of 0.3.
The lowest accuracy, 86%, is for a learning rate of 1.0.
In general, as the learning rate increases, accuracy tends to decrease. Time to model also decreases, though the drop in time is not as steady as the drop in accuracy.
The highest accuracy, 87.85%, is obtained for learning rates of 0.2 and 0.4.
However, the time to model is less for a learning rate of 0.4 than for 0.2.
Thus a learning rate of 0.4 is selected as the setting for further experiments.
Settings: Data Size = 1000, Momentum = 0.2, Number of Epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalization = True.

Experiment 3: Effect of Momentum
The lowest accuracy, 34.28%, is obtained for a momentum of 0.9.
The highest accuracy, 87.85%, is obtained for a momentum of 0.2.
The longest time to model is observed for a momentum of 0.8; the shortest for a momentum of 1.0.
In general, accuracy drops after a momentum of 0.7 and gets very low at a momentum of 1.0.
The setting selected for further experiments is a momentum of 0.2, since it gives the highest accuracy of 87.85%.
Settings: Data Size = 1000, Learning Rate = 0.4, Number of Epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalization = True.
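The learning rate and momentum settings tuned in Experiments 2 and 3 correspond to the standard backpropagation weight update, where each step combines the current gradient with a fraction (the momentum) of the previous step. A minimal sketch of that rule on a toy quadratic, with hypothetical values:

```python
# Sketch of the weight update behind the learningRate / momentum options:
#   delta_t = -lr * gradient + momentum * delta_{t-1};  w_t = w_{t-1} + delta_t

def momentum_step(w, grad, prev_delta, learning_rate, momentum):
    delta = [-learning_rate * g + momentum * d for g, d in zip(grad, prev_delta)]
    return [wi + di for wi, di in zip(w, delta)], delta

# Minimize f(w) = w0^2 + w1^2 (gradient 2w) starting from (1, -1):
w, delta = [1.0, -1.0], [0.0, 0.0]
for _ in range(50):
    grad = [2.0 * wi for wi in w]
    w, delta = momentum_step(w, grad, delta, learning_rate=0.4, momentum=0.2)
print(w)  # converges toward [0.0, 0.0]
```

A moderate momentum smooths the trajectory, but a value near 1.0 lets old steps dominate and can destabilize training, consistent with the accuracy collapse seen at momentum 0.9–1.0.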

Experiment 4: Effect of Number of Epochs
Accuracy increases as the number of epochs increases.
The time to model naturally increases as the number of epochs increases.
The best accuracy, 88.17%, is obtained for 900 epochs.
The lowest accuracy, 84%, is obtained for 100 epochs.
The setting used for further experiments is Number of Epochs = 900.
Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalization = True.

Experiment 5: Effect of Validation Set Size
As the size of the validation set increases, the training time tends to go down in most cases.
Accuracy also tends to go down as the validation set size increases.
The best accuracy is actually obtained for a validation set size of 0%, i.e., no validation set. However, this model risks overfitting the training data. Hence the setting selected for further experiments is one likely to avoid overfitting, i.e., one with a validation set.
With a validation set of 50%, the accuracy is as low as 81%, likely because less data is available for training.
The setting selected is a validation set of 10%:
–First, this model gives very high accuracy.
–Second, this model is considerably faster than the others.
Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Number of Epochs = 900, Number of Hidden Layer Units = “a”, Normalization = True.

Experiment 6: Effect of Validation Threshold
Accuracy stays constant at and after a validation threshold of 50.
The time to model is also more or less the same after a validation threshold of 50.
Validation thresholds of 10 and 20 require distinctly less time to model than the others.
The setting selected for further experiments is the one that gives the highest accuracy, 87.79%, with a validation threshold of 20.
Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Validation Set = 10, Number of Epochs = 900, Number of Hidden Layer Units = “a”, Normalization = True.
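The validation set and validation threshold act together as early stopping. A minimal sketch, assuming the threshold counts consecutive epochs without improvement in validation error (the error sequence is made up):

```python
# Sketch of early stopping: training stops once validation error has failed
# to improve for `threshold` consecutive epochs. Illustrative only.

def early_stop_epoch(val_errors, threshold):
    """Return the epoch at which training stops (1-based), or the last epoch."""
    best, since_best = float("inf"), 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best:
            best, since_best = err, 0
        else:
            since_best += 1
            if since_best >= threshold:
                return epoch
    return len(val_errors)

# Error improves through epoch 5, then plateaus; stop 20 epochs later.
errors = [0.9, 0.5, 0.3, 0.25, 0.2] + [0.21] * 100
print(early_stop_epoch(errors, threshold=20))  # -> 25
```

This explains Experiment 6's pattern: small thresholds (10, 20) stop training early and save time, while large thresholds rarely trigger before the epoch limit, so accuracy and time flatten out.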

Experiment 7: Effect of Number of Units
The number of hidden units = t = i + o requires the longest time to model and gives the lowest accuracy.
The number of hidden units = a = (i + o)/2 requires the shortest time to model and gives the highest accuracy.
The setting selected for the next experiments is number of hidden units = a, which gives an accuracy of 87.79%.
Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Validation Set = 10, Number of Epochs = 900, Validation Threshold = 20, Normalization = True.
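The hidden-layer size letters can be computed directly from the input and output counts. The counts below are illustrative (roughly matching the covertype data's 54 predictors and 7 classes), and integer division for the “a” heuristic is an assumption:

```python
# The hidden-layer size heuristics denoted by the letters a, i, o, t.

def hidden_units(i, o):
    """i = number of inputs, o = number of output classes."""
    return {"i": i, "o": o, "t": i + o, "a": (i + o) // 2}

print(hidden_units(54, 7))  # -> {'i': 54, 'o': 7, 't': 61, 'a': 30}
```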

Experiment 8: Effect of 2 Hidden Layers
The longest time to model and also the lowest accuracy are obtained for “a,i”, which means “a” units in the first layer and “i” units in the second.
The fastest model is obtained with the “a,o” topology.
The highest accuracy is with the “a,a” topology. This also happens to be the highest accuracy in all the experiments so far. Hence this is considered the best overall setting and is used for the next experiment.
Settings: Data Size = 1000, Learning Rate = 0.4, Momentum = 0.2, Validation Set = 10, Number of Epochs = 900, Validation Threshold = 20, Normalization = True.

Experiment 9: Effect of Normalized Attributes
The settings without normalization give distinctly lower accuracy than those with normalization, implying that normalization favorably affects accuracy.
However, the settings with normalization require much more time to model, implying that normalization makes learning slower.
The best accuracy obtained in all the covertype experiments with neural nets so far is 89.14%, with the best settings from the previous experiments and with normalization.
Best Settings: Learning Rate = 0.4, Momentum = 0.2, Validation Set = 10, Number of Epochs = 900, Validation Threshold = 20, Hidden Units = “a,a”.
Default Settings: Learning Rate = 0.3, Momentum = 0.2, Number of Epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”.
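Normalization here refers to per-attribute scaling of the inputs to a common range. A minimal min-max sketch (the sample column is hypothetical):

```python
# Sketch of per-attribute min-max normalization to [0, 1], the kind of
# input scaling toggled by the Normalization setting in this experiment.

def min_max_normalize(column):
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # guard against constant attributes
    return [(v - lo) / span for v in column]

elevation = [1859, 2596, 2804, 3122]  # hypothetical elevation values
print(min_max_normalize(elevation))
```

Without such scaling, attributes with large ranges (like elevation) dominate the weighted sums relative to 0/1 Boolean attributes, which is one plausible reason the unnormalized runs score lower here.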

Experiments with CPU and Iris Data
CPU
–Exp 1: Learning Rate varied from 0.1 to 1.0, other parameters default.
–Exp 2: Momentum varied from 0.1 to 1.0, with best settings from above.
–Exp 3: Normalize Numeric Class, True/False, with default and best settings.
Iris
–Exp 1: Number of units in 1 hidden layer as “a”, “i”, “o”, “t”, with other parameters default.
–Exp 2: Number of units in 2 hidden layers, with the 1st layer having the best setting from above.
–Exp 3: Nominal-to-binary conversion, True/False, with default settings and best settings overall.

CPU Experiment 1: Effect of Learning Rate
The highest correlation coefficient is observed for a learning rate of 0.1.
The lowest correlation coefficient is for a learning rate of 1.0.
In general, the correlation coefficient decreases as the learning rate increases.
The time to model is almost the same across settings for this data set, and is very fast compared to the Covtype data set.
For the next experiment, the setting selected is the one that gives the best correlation, i.e., a learning rate of 0.1.
Settings: Momentum = 0.2, Number of Epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalize Numeric Class = True.
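Because the CPU target is numeric, these experiments report a correlation coefficient rather than accuracy. Assuming this is Pearson's r between predicted and actual values, a minimal computation (with hypothetical numbers):

```python
# Sketch of Pearson's correlation coefficient r between predicted and
# actual target values; r = 1 is perfect agreement, r < 0 means the
# predictions move opposite to the actual values.

import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

actual = [23.0, 46.0, 70.0, 117.0]     # hypothetical CPU performance values
predicted = [30.0, 50.0, 65.0, 110.0]  # hypothetical model outputs
print(round(pearson_r(actual, predicted), 3))
```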

CPU Experiment 2: Effect of Momentum
The highest correlation is achieved for a momentum of 0.1.
The lowest correlation is achieved for a momentum of 1.0.
In most cases, the correlation coefficient tends to decrease as the momentum increases.
The best setting is selected as the one that shows the highest correlation, i.e., a momentum of 0.1.
Settings: Learning Rate = 0.1, Number of Epochs = 500, Validation Set = 0, Number of Hidden Layer Units = “a”, Normalize Numeric Class = True.

CPU Experiment 3: Effect of Normalizing the Numeric Class
The default settings with no normalization give a negative correlation coefficient, implying that the predictions are poorly correlated with the actual values.
For both settings, the correlation coefficient increases with normalization.
The best overall setting for the CPU data set is the last one in the above table, i.e., learning rate = 0.1, momentum = 0.1, normalize numeric class = true, and other parameters default.
Best Settings: Learning Rate = 0.1, Momentum = 0.1.
Default Settings: Learning Rate = 0.3, Momentum = 0.2.

Iris Experiment 1: Effect of Units in One Hidden Layer
The highest accuracy, 98%, is observed for number of units = “i”.
The lowest accuracy and also the longest time to model are observed for number of units = “t”.
The shortest time to model is for number of units = “o”.
The best setting selected is number of units = “i”, because it gives the highest accuracy of 98%.
Settings: Learning Rate = 0.3, Momentum = 0.2, Number of Epochs = 500, Validation Set = 0, Nominal to Binary = True.

Iris Experiment 2: Effect of Units in Two Hidden Layers
In general, two hidden layers give lower accuracy than one hidden layer for this data set.
The best accuracy is obtained for the “i,a” and “i,o” settings; however, this is still lower than the best accuracy with 1 hidden layer.
The lowest accuracy is for the “i,i” and “i,t” topologies.
The time to model is longest with the “i,t” topology and shortest with the “i,a” topology.
Settings: Learning Rate = 0.3, Momentum = 0.2, Number of Epochs = 500, Validation Set = 0, Units in 1st Hidden Layer = “i”, Nominal to Binary = True.

Iris Experiment 3: Effect of Nominal-to-Binary Conversion
The best accuracy obtained is 96%, which is still lower than the best accuracy with the raw data set; this is achieved without nominal-to-binary conversion.
The lowest accuracy is obtained for the same settings with nominal-to-binary conversion.
Default Topology: Hidden Units = “a”.
Best Topology: Hidden Units = “i”.
Data: Discretized Data Set.
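Nominal-to-binary conversion replaces each nominal attribute with one 0/1 indicator per distinct value. A minimal one-hot sketch (the label set is illustrative):

```python
# Sketch of nominal-to-binary (one-hot) conversion, the transformation
# toggled in this experiment: one 0/1 column per distinct nominal value.

def nominal_to_binary(values):
    """Return the label order and one indicator row per instance."""
    labels = sorted(set(values))
    return labels, [[1 if v == label else 0 for label in labels] for v in values]

species = ["setosa", "versicolor", "virginica", "setosa"]
labels, encoded = nominal_to_binary(species)
print(labels)   # -> ['setosa', 'versicolor', 'virginica']
print(encoded)  # -> [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```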

Summary of Results
L.R. is Learning Rate, M is Momentum, H is hidden units, V is validation set size percent, and T is validation threshold.
Covertype has the longest time to model; Iris has the shortest.
Iris gives the highest accuracy.
Best Models Obtained

Summary (Contd.)
The CPU data set shows a negative correlation for ZeroR, while the best neural net model shows a very high positive correlation.
The best accuracy for Iris is 98% with neural networks, which is better than that with decision trees.
Covertype gives a very high accuracy with decision trees, but the best neural net model gives an accuracy of 89.14%, which is even higher.
Comparison with Other Classifiers

Part 2: Overview of Classifiers
Decision Trees
Neural Networks
Bayesian Classifiers
Genetic Algorithms
Instance-Based Learning
Classification Rules
Final Project: Neural Networks Improved

Conclusions
Machine Learning: a very good course.
Excellent professor, great classmates.
Very interactive; learned a lot.
Thank you…