A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015.

Outline  Classification Examples  Classic Classifiers  Train the Classifier  Evaluation Method  Apply the Classifier

Classification Examples  Spam filtering  Fraud detection  Self-driving automobiles

The Classification Problem

Classic Classifiers  Naïve Bayes  Decision Tree: J48 (C4.5)  kNN  Random Forest  SVM: SMO, LibSVM  Neural Network  …

How to Choose the Classifier?  Observe your data: amount, features  Consider your application: precision vs. recall, explainability, incremental updates, complexity  Decision trees are easy to understand, but classification trees cannot predict numerical values and can be slow.  Naïve Bayes is fairly robust and easy to update incrementally.  Neural networks and SVMs are "black boxes". SVMs are fast at predicting yes or no.  When in doubt, try several of them.  Model selection with cross validation
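The last bullet, model selection with cross validation, can be sketched in a few lines. The example below is a toy illustration, not part of the original slides: it picks k for a 1-D nearest-neighbour classifier by leave-one-out cross validation, using made-up data.

```python
# Toy model selection: choose k for a 1-D kNN classifier by
# leave-one-out cross validation. Data and helper names are illustrative.
def knn_predict(train, x, k):
    # Take the k training points closest to x and let them vote.
    neighbours = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

def loo_accuracy(data, k):
    # Hold out each point in turn, predict it from the rest.
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)

data = [(0.0, "a"), (0.5, "a"), (1.0, "a"), (5.0, "b"), (5.5, "b"), (6.0, "b")]
best_k = max([1, 3, 5], key=lambda k: loo_accuracy(data, k))
print(best_k)  # → 1
```

The same loop works for comparing entirely different classifiers, not just one hyperparameter: evaluate each candidate with the same cross-validation splits and keep the winner.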

 Discussions:  machine-learning-classifier  to-try-first  kind-of-classifier-to-use-1.html  _based_on_the_data-set_provided

Train Your Classifier

Obtain Training Set  Instances should be labeled.  From running systems in practice  Annotation by multiple experts (inter-rater agreement)  Crowdsourcing (e.g., Google's CAPTCHA)  …
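Inter-rater agreement is commonly quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch, with made-up annotator labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators labeling 10 emails as spam/ham (illustrative data).
a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
b = ["spam", "spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham"]
print(round(cohens_kappa(a, b), 3))  # → 0.8
```

Values near 1 indicate strong agreement; values near 0 mean the annotators agree no more than chance, a warning sign for label quality.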

Obtain Training Set  Large Enough  More data can reduce noise.  With enough data, the gain from data volume can even dominate the choice of classification algorithm.  Redundant data helps little.  Selection strategies: nearest neighbors, ordered removals, random sampling, particle swarms, or evolutionary methods

Obtain Training Set  Unbalanced Training Instances across Classes  Evaluation: with simple measures such as accuracy or precision/recall, a classifier that only predicts the majority class (the class with many samples) still scores highly. (AUC is a better measure.)  The rare class may not provide enough information for the features to locate the class boundaries.
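The majority-class trap is easy to demonstrate. With an illustrative 95/5 split, a degenerate classifier that always predicts the majority class reaches 95% accuracy while never finding a single positive:

```python
from collections import Counter

# Imbalanced labels: 95 negatives, 5 positives (made-up data).
labels = ["neg"] * 95 + ["pos"] * 5
majority = Counter(labels).most_common(1)[0][0]
predictions = [majority] * len(labels)  # always predict the majority class

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall_pos = (sum(p == "pos" and y == "pos" for p, y in zip(predictions, labels))
              / labels.count("pos"))
print(accuracy, recall_pos)  # → 0.95 0.0
```

This is why per-class measures and AUC matter on imbalanced data: the headline accuracy hides a recall of zero on the class you usually care about.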

Obtain Training Set  Strategies:  Divide the majority class into L distinct clusters, train L predictors, and average them into the final one.  Generate synthetic data for the rare class (SMOTE).  Reduce the imbalance level: cut down the majority class.  …
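The core idea behind SMOTE can be sketched in a few lines: create synthetic rare-class points by interpolating between existing ones. This is a simplified illustration (real SMOTE interpolates toward each point's k nearest minority neighbours rather than random pairs):

```python
import random

def smote_like(minority, n_new, seed=0):
    """Generate synthetic minority samples by interpolating between
    randomly chosen pairs of real minority points (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Three rare-class points in 2-D feature space (illustrative values).
rare = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_points = smote_like(rare, n_new=5)
print(len(new_points))  # → 5
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the region the rare class already occupies instead of just duplicating existing rows.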

Obtain Training Set  More materials:  unbalanced-training-set  unbalanced-test-data-set-and-balanced-training-data-in-classification  He, Haibo, and Edwardo A. Garcia. "Learning from Imbalanced Data." IEEE Transactions on Knowledge and Data Engineering 21, no. 9 (2009).

Feature Selection  Why?  Unrelated features  noise, heavy computation  Interdependent features  redundant features  A better model  See Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (PDF)

Feature Selection  Feature Selection Methods  Filter methods: apply a statistical measure to assign a score to each feature, e.g., the chi-squared test, information gain, or correlation coefficient scores.  Wrapper methods: treat the selection of a set of features as a search problem.  Embedded methods: learn which features best contribute to the accuracy of the model while the model is being created, e.g., LASSO, Elastic Net, and Ridge Regression.
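A filter method can be illustrated with correlation scoring: rank each feature by the absolute value of its correlation with the label, then keep the top-scoring ones. A minimal sketch on made-up data, where feature 0 tracks the label and feature 1 is noise:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Rows: (feature_0, feature_1, label). Feature 0 separates the classes,
# feature 1 is uncorrelated noise (illustrative values).
data = [(0.1, 5.0, 0), (0.2, 1.0, 0), (0.9, 4.0, 1), (1.1, 2.0, 1)]
labels = [row[2] for row in data]
scores = [abs(pearson([row[i] for row in data], labels)) for i in range(2)]
ranked = sorted(range(2), key=lambda i: -scores[i])
print(ranked)  # → [0, 1]: feature indices, most label-correlated first
```

Filters like this are cheap and classifier-independent, but because each feature is scored in isolation they can miss interdependent features; that is what wrapper and embedded methods address.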

Evaluation Method  Basic Evaluation Measures  Precision and recall  Confusion matrix  Per-class accuracy  AUC (Area Under the Curve)  The ROC curve shows the sensitivity of the classifier by plotting the true positive rate against the false positive rate.
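These measures all derive from the four cells of the confusion matrix. A minimal sketch, using made-up spam/ham predictions:

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (TP, FP, FN, TN) for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham",  "ham", "spam", "spam", "ham"]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision = tp / (tp + fp)       # of everything predicted spam, how much was
recall    = tp / (tp + fn)       # of all real spam, how much was caught
accuracy  = (tp + tn) / len(y_true)
print(tp, fp, fn, tn, precision, recall, accuracy)
```

AUC extends this picture: instead of fixing one decision threshold, it sweeps the threshold over the classifier's scores and summarizes the resulting (false positive rate, true positive rate) pairs, which is the ROC curve described above.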

Evaluation Method  Cross Validation  Random Subsampling  K-fold Cross Validation  Leave-one-out Cross Validation

Cross Validation  Random Subsampling

Cross Validation  K-fold Cross Validation
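K-fold cross validation partitions the shuffled data into k disjoint folds; each fold serves once as the test set while the remaining k−1 folds form the training set. A minimal index-level sketch:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # shuffle once up front
    folds = [idx[i::k] for i in range(k)]     # k disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(len(train), len(test))  # → 8 2 for every fold
```

Every instance appears in exactly one test fold, so averaging the k test scores uses all the data for evaluation without ever testing on a training instance. Leave-one-out is the special case k = n.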

Cross Validation  Leave-one-out Cross Validation

Cross Validation  Three-way data splits

Apply the Classifier  Save the trained model  Keep the model dynamic: update it as new labeled data arrives
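Saving the model usually means serializing the trained object so the serving system can reload it without retraining. In Python this is typically done with pickle (the same approach works for most trained classifier objects); here the "model" is just a dictionary of illustrative learned parameters:

```python
import os
import pickle
import tempfile

# A trained "model" reduced to its learned parameters (illustrative values);
# pickle.dump/load work the same way for real classifier objects.
model = {"weights": [0.4, -1.2, 0.7], "bias": 0.1, "classes": ["ham", "spam"]}

path = os.path.join(tempfile.gettempdir(), "classifier.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)          # save once, after training

with open(path, "rb") as f:
    restored = pickle.load(f)      # reload in the application
print(restored == model)  # → True
```

Keeping the model dynamic then amounts to periodically retraining (or incrementally updating) on newly labeled data and overwriting the saved file, so the serving side always loads the freshest version.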

Thank you!