
Gary M. Weiss and Alexander Battistin, Fordham University

Motivation  Classification performance related to amount of training data  Relationship visually represented by learning curve  Performance increases steeply at first  Slope begins to decrease with adequate training data  Slope approaches 0 as more data barely helps  Training data often costly  Cost of collecting or labeling  “Good” learning curves can help identify optimal amount of data 7/22/ DMIN 2014

Exploiting Learning Curves
- In practice we only have the learning curve up to the current number of examples when deciding whether to acquire more
- We need to predict performance for larger sizes
  - Can do this iteratively, acquiring data in batches
  - Can even use curve fitting (see the sketch below)
- This works best if the learning curves are well behaved: smooth and regular
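
Curve fitting here might, for example, fit an inverse power law acc(n) ≈ a − b·n^(−c), a family commonly used to model learning curves, and extrapolate it to sizes not yet acquired. The sketch below does this with a grid search over c and closed-form least squares for a and b; the observed points, the power-law family, the grid range, and the target size 16000 are all illustrative assumptions, not values from the paper.

```java
public class CurveFit {
    public static void main(String[] args) {
        // Hypothetical observed points: (training size, accuracy).
        double[] n   = {500, 1000, 2000, 4000, 8000};
        double[] acc = {0.780, 0.810, 0.830, 0.845, 0.852};

        double bestA = 0, bestB = 0, bestC = 0, bestErr = Double.MAX_VALUE;
        for (double c = 0.05; c <= 1.5; c += 0.01) {
            // With c fixed, acc = intercept + slope * x where x = n^(-c):
            // ordinary least squares gives slope and intercept in closed form.
            int m = n.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < m; i++) {
                double x = Math.pow(n[i], -c);
                sx += x; sy += acc[i]; sxx += x * x; sxy += x * acc[i];
            }
            double slope = (m * sxy - sx * sy) / (m * sxx - sx * sx);
            double intercept = (sy - slope * sx) / m;

            double err = 0;  // sum of squared residuals for this c
            for (int i = 0; i < m; i++) {
                double pred = intercept + slope * Math.pow(n[i], -c);
                err += (pred - acc[i]) * (pred - acc[i]);
            }
            if (err < bestErr) {
                bestErr = err; bestC = c; bestA = intercept; bestB = -slope;
            }
        }
        // Extrapolate to a larger, not-yet-acquired training size.
        double predicted = bestA - bestB * Math.pow(16000, -bestC);
        System.out.printf("a=%.3f  b=%.3f  c=%.2f  predicted acc @ n=16000: %.3f%n",
                          bestA, bestB, bestC, predicted);
    }
}
```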

Prior Work Using Learning Curves
- Provost, Jensen and Oates [1] evaluated progressive sampling schemes to identify the point where learning curves begin to plateau
- Weiss and Tian [2] examined how learning curves can be used to optimize learning when performance, acquisition costs, and CPU time are considered:
  "Because the analyses are all driven by the learning curves, any method for improving the quality of the learning curves (i.e., smoothness, monotonicity) would improve the quality of our results, especially the effectiveness of the progressive sampling strategies."

[1] Provost, F., Jensen, D., and Oates, T. (1999). Efficient progressive sampling. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
[2] Weiss, G. M., and Tian, Y. (2008). Maximizing classifier utility when there are data acquisition and modeling costs. Data Mining and Knowledge Discovery, 17(2).
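
To make the progressive sampling idea concrete, here is a toy sketch of a geometric sampling schedule with a simple plateau check. The doubling schedule, the synthetic accuracy function, and the 0.01 improvement threshold are illustrative assumptions, not the settings used by Provost et al.

```java
public class ProgressiveSampling {
    // Synthetic stand-in for "train and evaluate at size n"; in a real run this
    // would be a cross-validated WEKA evaluation (see the methodology slide).
    static double accuracyAt(int n) {
        return 0.90 - 2.0 * Math.pow(n, -0.5);  // assumed power-law learning curve
    }

    public static void main(String[] args) {
        int n = 100;            // initial sample size (assumed)
        int available = 32000;  // total examples available (assumed)
        double prev = Double.NEGATIVE_INFINITY;
        while (n <= available) {
            double acc = accuracyAt(n);
            System.out.printf("n=%6d  acc=%.4f%n", n, acc);
            if (acc - prev < 0.01) {   // improvement below threshold:
                System.out.println("plateau detected at n=" + n);
                break;                 // stop acquiring more data
            }
            prev = acc;
            n *= 2;                    // geometric schedule: 100, 200, 400, ...
        }
    }
}
```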

What We Do
- Generate learning curves for six data sets
  - Different classification algorithms
  - Random sampling and cross validation
- Evaluate the curves
  - Visually, for smoothness and monotonicity
  - Via the "variance" of the learning curve

The Data Sets

Name         # Examples   Classes   # Attributes
Adult        32,561       2         14
Coding       20,000       2         15
Blackjack    15,000       2         4
Boa1         11,000       2         —
Kr-vs-kp     3,196        2         36
Arrhythmia   452          —         —

Experiment Methodology
- Sampling strategies
  - 10-fold cross validation: 90% of the data available for training
  - Random sampling: 75% of the data available for training
- Training set sizes sampled at regular 2% intervals of the available data
- Classification algorithms (from WEKA)
  - J48 decision tree
  - Random Forest
  - Naïve Bayes
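
A minimal sketch of the cross-validation variant, assuming the WEKA 3.x Java API; the ARFF path, the random seed, and the small-sample guard are assumptions, and the paper's actual experimental harness may differ.

```java
import java.util.Random;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LearningCurveCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("adult.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        Classifier base = new J48();  // could instead be RandomForest or NaiveBayes

        // Sample training-set sizes at regular 2% intervals of the available data.
        for (int pct = 2; pct <= 100; pct += 2) {
            int n = (int) Math.round(data.numInstances() * pct / 100.0);
            if (n < 10) continue;  // need at least one instance per fold
            Instances subset = new Instances(data, 0, n);

            // 10-fold cross validation, so 90% of the subset is used for training.
            Evaluation eval = new Evaluation(subset);
            eval.crossValidateModel(AbstractClassifier.makeCopy(base),
                                    subset, 10, new Random(1));
            System.out.printf("%3d%%  n=%6d  accuracy=%.2f%n",
                              pct, n, eval.pctCorrect());
        }
    }
}
```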

Results: Accuracy

[Table: accuracy of J48, Random Forest, and Naïve Bayes on each of the six data sets, plus the average per algorithm; the numeric values were not preserved in the transcript.]

- Accuracy is not our focus, but a well-behaved learning curve for a method that produces poor results is not useful
- These results are for the largest training set size (no reduction)
- J48 and Random Forest are competitive, so we will focus on them

Results: Variances

[Table: learning-curve variances of J48, Random Forest, and Naïve Bayes on each of the six data sets; the numeric values were not preserved in the transcript.]

- The variance for a curve equals the average variance in performance over the evaluated training set sizes (formalized below)
- These results are for 10-fold cross validation
- Naïve Bayes is best, followed by J48; but Naïve Bayes had low accuracy (see the previous slide)
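
One plausible formalization of this curve-variance metric, with S the set of evaluated training-set sizes, R the number of runs per size (here, the 10 folds), a_{s,r} the accuracy of run r at size s, and ā_s the mean accuracy at size s; whether the sample (R−1) or population (R) normalization was used is an assumption:

```latex
\operatorname{Var}(\text{curve})
  = \frac{1}{|S|} \sum_{s \in S} \operatorname{Var}(a_s),
\qquad
\operatorname{Var}(a_s)
  = \frac{1}{R-1} \sum_{r=1}^{R} \left( a_{s,r} - \bar{a}_s \right)^2 .
```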

J48 Learning Curves (10-fold cross validation)

Random Forest Learning Curves

Naïve Bayes Learning Curves

A Closer Look at J48 and RF (Adult)

A Closer Look at J48 and RF (Kr-vs-kp)

A Closer Look at J48 and RF (Arrhythmia)

Now let's compare cross validation to random sampling, which we find generates less well-behaved curves. (The curves shown so far used cross validation.)
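
For comparison, here is a sketch of the random-sampling variant, again assuming the WEKA 3.x Java API: 75% of the shuffled data forms the training pool, the remaining 25% is the held-out test set, and each training size is drawn from the pool. The number of repetitions and the dataset path are assumptions.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LearningCurveRandomSampling {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("blackjack.arff");  // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        int runs = 10;  // assumed number of repetitions per training size
        for (int pct = 2; pct <= 100; pct += 2) {
            double sum = 0;
            for (int r = 0; r < runs; r++) {
                Instances copy = new Instances(data);
                copy.randomize(new Random(r));
                int poolSize = (int) (copy.numInstances() * 0.75);  // 75% for training
                Instances pool = new Instances(copy, 0, poolSize);
                Instances test = new Instances(copy, poolSize,
                                               copy.numInstances() - poolSize);

                int n = Math.max(10, (int) (poolSize * pct / 100.0));
                Instances train = new Instances(pool, 0, n);

                RandomForest rf = new RandomForest();
                rf.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(rf, test);  // score on the fixed held-out 25%
                sum += eval.pctCorrect();
            }
            System.out.printf("%3d%%  mean accuracy=%.2f%n", pct, sum / runs);
        }
    }
}
```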

J48 Learning Curves (Blackjack Data Set)

RF Learning Curves (Blackjack Data Set)

Conclusions  Introduced the notion of well-behaved learning curves and methods for evaluating this property  Naïve Bayes seemed to produce much smoother curves, but less accurate  Low variance may be because they consistently reach a plateau early  J48 and Random Forest seem reasonable  Need more data sets to determine which is best  Cross validation clearly generates better curves than random sampling (less randomness?) 7/22/2014DMIN

Future Work  Need more comprehensive evaluation  Many more data sets  Compare more algorithms  Additional metrics  Count number of drops in performance with greater size (i.e., “blips”). Need better summary metric.  Vary number of runs. More runs almost certainly yields smoother learning curves.  Evaluate in context  Ability to identify optimal learning point  Ability to identify plateau (based on some criterion) 7/22/2014DMIN

If Interested in This Area
- Provost, F., Jensen, D., and Oates, T. (1999). Efficient progressive sampling. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining.
- Weiss, G. M., and Tian, Y. (2008). Maximizing classifier utility when there are data acquisition and modeling costs. Data Mining and Knowledge Discovery, 17(2).
- Contact me if you want to work on expanding this paper